Live online training

Modern real-time streaming architectures

Arun Kejariwal
Karthik Ramasamy

Across diverse segments in industry, there has been a shift in focus from big data to fast data, stemming from both the deluge of high-velocity data streams and the need for instant data-driven insights. It's now critical to mine business insights from data streams in a robust and timely fashion, which requires a highly available, reliable, and performant end-to-end stream processing system.

Join Karthik Ramasamy and Arun Kejariwal to explore state-of-the-art streaming systems and their deployment in production at internet scale. Karthik and Arun walk you through the different types of analysis carried out on data streams and lead a deep dive into commonly used algorithms (also referred to as data sketches) for each. You'll discover the typical challenges in modern real-time big data platforms and learn how to address them. Along the way, Karthik and Arun explain how advances in technology might impact the streaming architectures and applications of the future, investigate the interplay between storage and stream processing, and speculate about future developments.

What you'll learn and how you can apply it

By the end of this live, online course, you’ll understand:

  • The various facets of a stream processing pipeline
  • The types of analysis that can be carried out on data streams
  • Commonly used algorithms for analyzing your data streams

And you’ll be able to:

  • Determine and use the stream processing framework best suited to your needs
  • Carry out analysis on data streams

This training course is for you because...

  • You're an analyst with a background in marketing who needs to mine inbound data streams to guide decision making.
  • You're an engineer who has been tasked with building a stream processing pipeline for inbound data streams.

Prerequisites

  • A basic understanding of working with data
  • Familiarity with Java or Scala (useful but not required)

Recommended preparation:

An Introduction to Time Series with Team Apache (video)

About your instructors

  • Arun Kejariwal is a statistical learning principal at Machine Zone (MZ), where he leads a team of top-tier researchers and works on the research and development of novel techniques for install and click fraud detection, assessing the efficacy of TV campaigns, and optimizing marketing campaigns. In addition, his team is building novel methods for bot detection, intrusion detection, and real-time anomaly detection. Previously, Arun worked at Twitter, where he developed and open-sourced techniques for anomaly detection and breakout detection. His research includes the development of practical and statistically rigorous techniques and methodologies to deliver high performance, availability, and scalability in large-scale distributed clusters. Some of the techniques he helped develop have been presented at international conferences and published in peer-reviewed journals.

  • Karthik Ramasamy is the engineering manager and technical lead for Real Time Analytics at Twitter. He has two decades of experience working in parallel databases, big data infrastructure, and networking. He cofounded Locomatix, a company specializing in real-time stream processing on Hadoop and Cassandra using SQL, which was acquired by Twitter. Before Locomatix, he had a brief stint at Greenplum, where he worked on parallel query scheduling. Greenplum was eventually acquired by EMC for more than $300M. Prior to Greenplum, Karthik was at Juniper Networks, where he designed and delivered platforms, protocols, databases, and high-availability solutions for network routers that are widely deployed in the Internet. Before joining Juniper, he worked extensively at the University of Wisconsin on parallel database systems, query processing, scale-out technologies, storage engines, and online analytical systems. Several of these research projects were spun out into a company that was later acquired by Teradata. He is the author of several publications and patents, as well as the best-selling book “Network Routing: Algorithms, Protocols and Architectures.” He has a Ph.D. in Computer Science from UW Madison with a focus on databases.

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

  • Overview of real-time streaming (10 minutes)
  • Messaging (20 minutes)
  • Operations (20 minutes)
  • Break (10 minutes)
  • Data sketches (50 minutes)
  • Break (10 minutes)
  • Storage (20 minutes)
  • Unification (20 minutes)
  • Wrap-up and Q&A (20 minutes)