Live Online Training

Apache Hadoop, Spark and Big Data Foundations

Learn the value proposition behind scalable data analytics tools

Douglas Eadline

This live training course covers the essential introductory aspects of Hadoop, Spark, and Big Data, with a concise overview of the Hadoop and Spark ecosystem. After completing the workshop, attendees will have a working understanding of the Hadoop/Spark value proposition for their organization and a clear background in Big Data technologies.

What you'll learn, and how you can apply it

  • An understanding of Hadoop as a data platform
  • How the "Data Lake" and Big Data are changing data analytics
  • A basic understanding of the differences and similarities of Hadoop tools
  • The ability to navigate a crowded market and understand how these technologies can work for your organization
  • A solid foundation on which developers can build as they learn to use the tools mentioned in the presentation

This training course is for you because...

  • You are a CIO or other manager who needs to "get up to speed" quickly on scalable big data technologies
  • You are a developer or administrator (DevOps) who wants to learn how all the key pieces of the Hadoop and Spark ecosystem fit together
  • You are a data scientist who does not have experience with scalable tools like Hadoop or Spark

Prerequisites

  • Basic understanding of data center operations (servers, storage, networks, database)

Recommended preparation:

Hadoop and Spark Fundamentals, Third Edition (video)

Hadoop 2 Quick-Start Guide (book)

About your instructor

  • Douglas Eadline, PhD, began his career as an analytical chemist with an interest in computer methods. Starting with the first Beowulf how-to document, Doug has written instructional documents covering many aspects of Linux HPC (High Performance Computing) and Hadoop computing. Currently, Doug serves as editor of the ClusterMonkey.net website; he was previously editor of ClusterWorld Magazine and senior HPC editor for Linux Magazine. He is also an active writer and consultant to the HPC/Analytics industry. His recent video tutorials and books include the Hadoop and Spark Fundamentals LiveLessons video (Addison Wesley), Hadoop 2 Quick-Start Guide (Addison Wesley), High Performance Computing for Dummies (Wiley), and Practical Data Science with Hadoop and Spark (co-author, Addison Wesley).

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Segment 1: Why is Hadoop Such a Big Deal? (50 mins)

  • A Brief History of Apache Hadoop
  • What is Big Data?
  • Hadoop as a Data Lake
  • Apache Hadoop V2 is a Platform
  • The Apache Hadoop Project Ecosystem
  • Hadoop Interfaces for New Users
  • Questions: 10 minutes

Break: 5 minutes

Segment 2: Hadoop Distributed File System (HDFS) Basics (25 mins)

  • How HDFS works
  • Questions: 10 minutes

Segment 3: Hadoop MapReduce Framework (25 mins)

  • The MapReduce Model
  • MapReduce Data Flow
  • Questions: 10 minutes

Break: 5 minutes

Segment 4: Making Life Easier: Spark (20 mins)

  • Spark Basics and Components
  • Spark RDDs and DataFrames
  • Spark vs MapReduce
  • Questions: 10 minutes

Segment 5: Real World Applications/Wrap-up (15 mins)

  • Questions: 10 minutes