Live Online Training

Scale Your Python Processing with Dask

Crunch Big Data Easily in Python, From a Few Cores to a Few Thousand Machines

Adam Breindel

Python is a (maybe the) preeminent language for data science, and the SciPy ecosystem of tools enables hundreds of use cases, from astronomy to financial time-series analysis to natural language processing. But most Python tools assume your data fits in memory, and many do not support parallel execution. Today we have far more data and far more compute power, so we want to scale our open source Python tools to huge datasets and huge compute clusters.

The open-source Dask project scales the Python data ecosystem in a straightforward, understandable way, and works well from a single laptop to a thousand-machine cluster. Dask scales pandas DataFrames, scikit-learn machine learning, and NumPy tensor operations, and it also allows lower-level, custom task scheduling for more unusual algorithms. Dask plays nicely with all of the toys you want -- just a few examples include Kubernetes for scaling, GPUs for acceleration, Parquet for data ingestion, and Datashader for visualization.

What you'll learn-and how you can apply it

By the end of this live, hands-on, online course, you’ll understand:

  • What Dask is and why it exists
  • How Dask fits into the Python and big data landscape
  • How Dask can help you process more data faster, from a laptop up to a big cluster

And you’ll be able to:

  • Get started building systems with Dask
  • Add Dask and start migrating existing components incrementally
  • Analyze data and train ML models with Dask
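As a taste of the incremental-migration point above, here is a hedged sketch using `dask.delayed`, assuming core `dask` is installed. The `clean` and `combine` functions are invented stand-ins for existing components; they stay ordinary Python, and Dask wraps the calls into a task graph it can run in parallel.

```python
# Incrementally parallelizing existing Python functions with dask.delayed.
from dask import delayed

def clean(x):
    # Stand-in for an existing, unchanged processing step.
    return x * 2

def combine(values):
    # Stand-in for an existing aggregation step.
    return sum(values)

# Wrapping the calls (not the functions' code) makes them lazy.
cleaned = [delayed(clean)(i) for i in range(5)]
total = delayed(combine)(cleaned)

# compute() executes the whole graph, parallelizing the clean() calls.
print(total.compute())  # → 20
```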

This training course is for you because...

  • You are a data engineer, data scientist, or natural/social scientist
  • You work with Python and data
  • You want to become a practitioner or leader who focuses on pragmatic, effective solutions

Prerequisites

  • Python, basic to intermediate level
  • Python data science stack (Pandas, NumPy, scikit-learn) at a basic level

About your instructor

  • Adam Breindel consults and teaches widely on Apache Spark and other technologies. Adam's experience includes work with banks on neural-net fraud detection, streaming analytics, cluster management code, and web apps, as well as development at a variety of startup and established companies in the travel, productivity, and entertainment industries. He is excited by the way that Spark and other modern big-data tech remove so many old obstacles to system design and make it possible to explore new categories of interesting, fun, hard problems.

Schedule

The timeframes are estimates only and may vary according to how the class is progressing.

Introduction (55 minutes)

  • Presentation: About Dask - What it is, where it came from, what problems it solves
  • Discussion: Options for setting up and deploying Dask
  • Presentation: Pandas-style Analytics with Pandas and Dask DataFrame
  • Exercise: Try a Hands-on Analytics Example
  • Q&A
  • Break (5 minutes)

Dask Graphical User Interfaces (30 minutes)

  • Presentation: Monitoring Workers, Tasks, and Memory
  • Poll
  • Presentation: Using Dask’s Built-In Profiling to Understand Performance
  • Exercise: Analyze the Performance of Data Transformation
  • Q&A

Machine Learning (25 minutes)

  • Presentation: Scikit-Style Featurization with Dask
  • Discussion: Current Algorithm Support and Integration
  • Presentation: Modeling Task
  • Exercise: Try an Alternate Model
  • Q&A
  • Break (5 minutes)

Additional Data Structure Overview (25 minutes)

  • Presentation: Dask Array
  • Discussion: What Can We Do with Dask Array?
  • Presentation: Dask Bag
  • Exercise: Look at Lower-Level Task Graph Opportunities in the Docs
  • Q&A
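For a preview of the Dask Array material above, here is a small sketch assuming `dask[array]` and NumPy are installed; the array contents are arbitrary. `da.from_array` chunks a NumPy array, and reductions like `.mean()` build a task graph over the chunks that runs only on `.compute()`.

```python
# NumPy-style tensor operations on chunked data with Dask Array.
import numpy as np
import dask.array as da

x = np.arange(100, dtype="float64")

# Four 25-element chunks that Dask can reduce in parallel.
dx = da.from_array(x, chunks=25)

mean = dx.mean()          # lazy: just a task graph so far
print(mean.compute())     # → 49.5
```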

Best Practices and Extended Q&A (35 minutes)

  • Presentation: Managing Partitions and Tasks
  • Discussion: File Formats and Data Structures
  • Presentation: Caching

Q&A (at least 15 minutes reserved)