Live online training

Real-time data foundations: Spark

Apache Spark at speed

Ted Malaska

Join Ted Malaska to explore Apache Spark for streaming use cases. You'll learn the architectural differences between building Spark ETL or training jobs and building streaming applications as you walk through core concepts like windowing, state management, configuration, deployment, and performance.

Special note: This is the second course in a four-part series focused on building a foundation in near-real-time processing of IoT data. Although these courses are designed to be taken in any order, we suggest you take Real-time data foundations: Kafka first for best results.

  1. Real-time data foundations: Kafka
  2. Real-time data foundations: Spark
  3. Real-time data foundations: Flink
  4. Real-time data foundations: Time series architectures

What you'll learn and how you can apply it

By the end of this live, online course, you’ll understand:

  • The architecture design behind Spark Streaming
  • How to use Spark as a streaming engine

And you’ll be able to:

  • Work with core APIs
  • Use Spark for deployment, state management, and monitoring

This training course is for you because...

  • You're a data engineer who wants to bridge the gap from batch to streaming.
  • You're a product manager who wants to understand which use cases stream processing supports and what functionality it provides.

Prerequisites

  • A basic understanding of working with data
  • Familiarity with Java or Scala (useful but not required)

Materials or downloads needed in advance:

  • A machine with Docker and the IDE of your choice installed
  • A GitHub account

About your instructor

  • Ted Malaska is the director of enterprise architecture at Capital One. Previously, he was on the Battle.net team at Blizzard Entertainment, a principal solutions architect at Cloudera, where he helped clients succeed with Hadoop and the Hadoop ecosystem, and a lead architect at the Financial Industry Regulatory Authority (FINRA). He has contributed code to Apache Flume, Apache Avro, Apache YARN, Apache HDFS, Apache Spark, Apache Sqoop, and many more. Ted is the coauthor of Hadoop Application Architectures, a frequent conference speaker, and a blogger on data architectures.

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

  • Apache Spark architecture (component rundown) (25 minutes)
  • Window and state management (15 minutes)
  • Spark use cases (10 minutes)
  • Break (10 minutes)
  • Deployment (10 minutes)
  • Failure and recovery (10 minutes)
  • Configuration (10 minutes)