Live Online Training

Data Engineering for Data Scientists

Build resilient pipelines to support stronger models

Max Humber

Organizations big and small are investing heavily in data science. New data hires are told to model anything and everything so that the organization might find a competitive edge. The problem is that few companies invest enough in infrastructure or hire enough data engineers to support those modeling efforts. As a result, data scientists arrive on the scene and quickly burn out: they can’t build the models they want to build because the pipelines don’t exist.

In this course, Max Humber will teach you how to build resilient pipelines with industry-leading tools. Specifically, this course will introduce Airflow (the open source standard for automating data pipeline workflows), Python Fire (a library for automatically generating command line interfaces), and scikit-learn/pandas (used for data wrangling). All so that you can get back to actual data science!

What you'll learn and how you can apply it

By the end of this live, hands-on, online course, you’ll understand:

  • How to build data pipelines
  • How to monitor the performance of your models
  • How to validate data before passing it to your models

And you’ll be able to:

  • Productionize the outputs of your data models
  • Author and execute Airflow jobs
  • Write Airflow-compatible SQL and Python scripts

This training course is for you because...

  • You are a new Data Engineer or a Data Scientist on a small team
  • You work with machine learning models
  • You want your models to be supported by industry-leading tools

Prerequisites

  • Experience with pandas and scikit-learn, plus at least some experience with SQL databases.
  • Ownership of models running in production is helpful but not required.

Recommended preparation:

  • Install Airflow on your local machine before the course begins.

About your instructor

  • Max is a Lead Instructor at General Assembly and the author of Personal Finance with Python. He was the first Data Scientist at Borrowell and the second Data Engineer at Wealthsimple.

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Introduction (5 minutes)

  • Who am I and who are you?
  • Poll: Do you have machine learning models in production? Do you own them? How many data scientists and data engineers are on your team?
  • Introduce the “Data Hierarchy of Needs”
  • Learning agenda

Model Extending (50 minutes)

  • Migrate code from Jupyter notebooks to Python scripts
  • Exercise: Make models “command-line compatible” with Python Fire
  • Protect against invalid data with DataFrameMapper
  • Solve the “Hamburger Emoji” Problem
  • Add model performance logging with Rollbar
  • Q&A
  • Break (5 minutes)

Model Saving (20 minutes)

  • Move away from flat CSV files
  • Query SQL data with pandas
  • Introduce python-dotenv for managing secrets
  • Exercise: Write model results to SQL using pandas
  • Q&A
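
The read-score-write loop in this module can be sketched roughly as follows. This is an illustration under stated assumptions, not the course's code: it uses an in-memory SQLite database so it runs anywhere, the table names, column names, and data are made up, and the `MODEL_DB_PATH` environment variable is a hypothetical setting of the kind python-dotenv would load from a `.env` file.

```python
import os
import sqlite3

import pandas as pd

# Hypothetical setting: with python-dotenv, a value like this would be
# loaded from a .env file. Here we read the environment directly and
# fall back to an in-memory database.
DB_PATH = os.getenv("MODEL_DB_PATH", ":memory:")
conn = sqlite3.connect(DB_PATH)

# Seed a small, made-up feature table so the example is self-contained.
features = pd.DataFrame(
    {"customer_id": [1, 2, 3], "balance": [120.0, 80.5, 300.0]}
)
features.to_sql("features", conn, index=False, if_exists="replace")

# Query SQL data with pandas instead of reading a flat CSV file.
df = pd.read_sql(
    "SELECT customer_id, balance FROM features WHERE balance > 100", conn
)

# Stand-in for a model: scale each balance by the largest balance.
df["score"] = df["balance"] / df["balance"].max()

# Write the model results back to SQL.
df[["customer_id", "score"]].to_sql(
    "predictions", conn, index=False, if_exists="replace"
)

# Read the saved results back to confirm the round trip.
saved = pd.read_sql("SELECT * FROM predictions ORDER BY customer_id", conn)
print(saved)
conn.close()
```

Keeping the connection details in the environment rather than in the script means the same code can point at a local SQLite file during development and at a shared database in production.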

Model Scheduling (40 minutes)

  • Configure Airflow
  • Author and execute Airflow jobs
  • Exercise: Move SQL and Python modeling scripts over to an Airflow job
  • Monitor the schedule and job performance
  • Q&A