O'Reilly Live Online Training

Building Distributed Pipelines for Data Science Using Kafka, Spark, and Cassandra

Learn how to introduce a distributed data science pipeline in your organization

Andy Petrella

Building a distributed pipeline is a huge and complex undertaking. If you want to ensure yours is scalable, has fast in-memory processing, can handle real-time or streaming data feeds with high throughput and low latency, is well suited to ad hoc queries, can be spread across multiple data centers, is built to allocate resources efficiently, and is designed to allow for future changes, join Andy Petrella and François Bayart for this immensely practical hands-on course.

What you'll learn and how you can apply it

By the end of this course, you'll have a solid understanding of:

  • The most important technologies for a distributed pipeline, when they should be used—and how
  • How to integrate scalable technologies into your company’s existing data architecture
  • How to build a successful, scalable, elastic, distributed pipeline using a lean approach

This training course is for you because...

  • You’re a data scientist with experience in data modeling, business intelligence, or traditional data pipelines who needs to deal with bigger or faster data

  • You’re a software or data engineer with experience architecting solutions in Scala, Java, or Python who needs to integrate scalable technologies into your company’s architecture

Prerequisites


  • Intermediate knowledge of an object-oriented language and basic knowledge of a functional programming language, as well as basic experience with a JVM

  • Understanding of classic web architecture and service-oriented architecture

  • Basic understanding of ETL, streaming data, and distributed data architectures

  • Intermediate understanding of Docker and UNIX, as well as some basic knowledge about networks (IP, DNS, SSH, etc.)


For the online training class, we'll be using Docker as the simplest environment to run most of the pipeline. This environment will be available as a single Docker image. Please click the link below and follow the setup instructions.
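As a rough sketch of that setup (the image name and port below are placeholders; use the ones given in the setup instructions), pulling and running such a Docker image typically looks like:

```shell
# Placeholder image name -- substitute the one from the setup instructions.
docker pull example/pipeline-training

# Run interactively, publishing the notebook UI port so it is reachable
# from the browser (adjust the port to whatever the image exposes).
docker run -it -p 9000:9000 example/pipeline-training
```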


Recommended Preparation

Scala and the JVM as a big data platform: Lessons from Apache Spark

Architecture Patterns Part 1

Introduction to Big Data

Learning Docker

Learning DNS

About your instructor

  • Andy is an entrepreneur with a background in mathematics and geospatial data analysis, focused on unleashing unexploited business potential by leveraging new technologies in machine learning, artificial intelligence, and cognitive systems. In the open source community, Andy is known for his Spark Notebook project, which bridges the distributed data science gap with the Scala ecosystem.

    Andy is the CEO of Kensu Inc., an analytics and AI governance company, which created the Kensu Data Activity Manager (DAM), the first-of-its-kind GCP (Governance, Compliance & Performance) solution for data science. DAM automatically creates, in real time, the data mapping across tools and teams, making it a one-stop shop for DPOs and data managers for all aspects of GCP.


The timeframes are only estimates and may vary according to how the class is progressing.

Day 1

  • Introduction, Spark, Spark Notebook, and Kafka
  • Assignment #1

Day 2

  • Streaming: Spark, Kafka, and Cassandra
  • Data analysis and external libraries
  • Assignment #2

Day 3

  • Microservices, cluster management, job orchestration, and live demo of end-to-end distributed pipeline
  • Final discussion & wrap up