Live Online Training

Hands-on Introduction to Apache Hadoop and Spark Programming

A quick-start introduction to the important facets of big data analytics

Douglas Eadline

This live training course provides the "first touch" hands-on experience needed to start using essential tools in the Apache Hadoop and Spark ecosystem. Tools presented include the Hadoop Distributed File System (HDFS), Apache Pig, Hive, Sqoop, Flume, Spark, and the Zeppelin web GUI. The topics are presented in a "soup-to-nuts" fashion with minimal assumptions about prior experience. As part of the course, students can download a small Hadoop/Spark virtual machine to run the course examples. After completing the course, attendees will have the skills needed to begin their own analytics projects.

What you'll learn and how you can apply it

  • Be able to navigate and use the Hadoop Distributed File System (HDFS).
  • Learn how to run, monitor, inspect, and stop applications in a Hadoop environment.
  • Learn how to start and run Apache Pig, Hive, and Spark applications from the command line.
  • Start and use the Zeppelin web GUI for Hive and Spark application development.
  • Use Flume and Sqoop to import/export log files and databases into HDFS.

This training course is for you because...

  • Beginning developers who want to quickly learn how to navigate the Hadoop and Spark development environment
  • Administrators tasked with providing and supporting a Hadoop/Spark environment for their organization
  • Data scientists who do not have experience with scalable tools like Hadoop and Spark

Prerequisites

  • Basic understanding of the Linux command line, including the bash shell and simple text editing, and some experience with Python.
  • If you want to run the examples, a functioning Hadoop/Spark environment (see below).

Setup Instructions:

To run the class examples, download the Hadoop Minimal Virtual Machine (3.3G) by clicking: https://tinyurl.com/ya69odu7. Installation notes are available here: https://www.clustermonkey.net/download/Hands-on_Hadoop_Spark/Linux-Hadoop-Minimal-Install.3.txt.

You may also consider the larger Hortonworks HDP Sandbox (12G) available at https://hortonworks.com/products/sandbox/.

If you wish to follow along, install and test a sandbox at least one day before the class.
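
For a quick sanity check after the sandbox boots, a few commands can confirm that HDFS and YARN are running. This is a minimal sketch; it assumes the Hadoop binaries are on your PATH, which may differ in your sandbox.

    # Confirm HDFS is reachable by listing the top-level directories
    hdfs dfs -ls /

    # Confirm YARN is running by listing the active cluster nodes
    yarn node -list

    # Report basic HDFS capacity and health information
    hdfs dfsadmin -report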

The following two resources offer other methods to install a Hadoop and/or Spark environment directly from the Apache website using a Linux desktop or laptop. They also explain how to install the Hortonworks HDP Sandbox using VirtualBox and provide step-by-step notes files to assist with installation.

Lessons 1, 3, 5, 6, 7, and 9 from Hadoop Fundamentals LiveLessons

Chapters 1, 3, 4, and 7 from Hadoop 2 Quick-Start Guide: Learn the Essentials of Big Data Computing in the Apache Hadoop 2 Ecosystem

Recommended Preparation:

Optional suggestions for additional reading or videos that provide context for the course. Ideally, these are available in Safari; some corporate firewalls block access to other sites.

Apache Hadoop, Spark and Big Data Foundations Online Live Training: This three-hour class is offered monthly before this longer two-day course and provides important, in-depth coverage of foundational topics. Completing the Foundations course first is recommended.

Hadoop and Spark Fundamentals, Third Edition (video)

Practical Data Science with Hadoop and Spark: Designing and Building Effective Analytics at Scale (book)

Hadoop 2 Quick-Start Guide: Learn the Essentials of Big Data Computing in the Apache Hadoop 2 Ecosystem (book)

About your instructor

  • Douglas Eadline, PhD, began his career as an analytical chemist with an interest in computer methods. Starting with the first Beowulf how-to document, Doug has written instructional documents covering many aspects of Linux HPC (High Performance Computing) and Hadoop computing. Currently, Doug serves as editor of the ClusterMonkey.net website; he was previously editor of ClusterWorld Magazine and senior HPC editor for Linux Magazine. He is also an active writer and consultant to the HPC/analytics industry. His recent video tutorials and books include the Hadoop and Spark Fundamentals LiveLessons video (Addison-Wesley), Hadoop 2 Quick-Start Guide (Addison-Wesley), High Performance Computing for Dummies (Wiley), and Practical Data Science with Hadoop and Spark (co-author, Addison-Wesley).

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Day 1

Note: All example commands are available in annotated notes files that can be used to run the same examples after the course is complete. Commands can be cut, pasted, and run from the notes files, allowing students to repeat (or modify) all course examples.

Segment 1: Introduction and Quick Overview of Hadoop and Spark (40 mins)

  • Instructor explains how the course will work (sit back and watch, try on your own later)
  • This section is all slides and provides background on Hadoop and Spark
  • There will be about 10 minutes for questions

Segment 2: Using the Hadoop Distributed File System (HDFS) (25 mins)

  • Instructor will provide background on HDFS and demonstrate how to use basic commands on a real cluster (typical commands are sketched below)
  • If needed, there will be a 10-minute question and answer period
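
The commands below sketch the kind of basic HDFS operations covered in this segment; the file and directory names are illustrative.

    # Create a directory under your HDFS home and copy a local file into it
    hdfs dfs -mkdir stuff
    hdfs dfs -put test.txt stuff

    # List the directory and view the file contents
    hdfs dfs -ls stuff
    hdfs dfs -cat stuff/test.txt

    # Copy the file back to the local file system, then clean up
    hdfs dfs -get stuff/test.txt test-copy.txt
    hdfs dfs -rm -r stuff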

Break: 10 minutes

Segment 3: Running and Monitoring Hadoop Applications (35 mins)

  • Instructor will demonstrate how to run Hadoop example applications and benchmarks (see the sketch below)
  • A live tour of the YARN web GUI will be presented for a running application
  • If needed, there will be a 10-minute question and answer period
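
As an illustration, the stock MapReduce examples jar can be used to run and monitor a job from the command line. The jar path below is an assumption that varies by installation and version.

    # Run the pi estimator from the MapReduce examples jar
    # (adjust the jar path/version for your installation)
    yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 8 100000

    # List running applications, then inspect one by its application ID
    yarn application -list
    yarn application -status <application-id>

    # Stop a running application if needed
    yarn application -kill <application-id>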

Segment 4: Using Apache Pig (20 mins)

  • Instructor will present a simple Apache Pig example
  • Starting Pig locally, on a cluster, and with Tez acceleration will be demonstrated (sketched below)
  • If needed, there will be a 5-minute question and answer period
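
For reference, the three launch modes look like the sketch below, assuming pig is on your PATH and (for the last mode) Tez is installed.

    # Start the Grunt shell in local mode (runs against the local file system)
    pig -x local

    # Start Pig on the cluster using the MapReduce engine
    pig -x mapreduce

    # Start Pig with Tez acceleration
    pig -x tez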

Break: 10 minutes

Segment 5: Using Apache Hive (30 mins)

  • Instructor will demonstrate a simple interactive Hive-SQL example using example data
  • Running the same example from a script will also be presented (see the sketch below)
  • If needed, there will be a 10-minute question and answer period.
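
A minimal sketch of the interactive and scripted usage shown in this segment; the query and file name are hypothetical.

    # Run a single Hive-SQL statement from the command line
    hive -e "SHOW TABLES;"

    # Run a query from a script file
    echo "SELECT COUNT(*) FROM mytable;" > query.sql
    hive -f query.sql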

Day 2

Segment 6: Running Apache Spark (pySpark) (35 mins)

  • The interactive pySpark word count example will be explained to illustrate RDDs, mapping, reducing, filtering, and lambda functions
  • A stand-alone pi estimator program will be demonstrated (see the sketch below)
  • If needed, there will be a 10-minute question and answer period
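
Both modes follow the pattern sketched below; the examples path assumes a standard Spark distribution and may differ in your sandbox.

    # Start an interactive pySpark session (used for the word count example)
    pyspark

    # Run the stand-alone pi estimator shipped with Spark as a batch job
    spark-submit $SPARK_HOME/examples/src/main/python/pi.py 10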

Break: 5 minutes

Segment 7: Running Apache Sqoop (30 mins)

  • A full example of taking data from MySQL to Hadoop/HDFS and back to MySQL will be demonstrated
  • Various Sqoop options will be demonstrated (a typical import/export pair is sketched below)
  • If needed, there will be a 10-minute question and answer period
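
A typical round trip has the shape sketched below; the connection string, table, and directory names are hypothetical placeholders.

    # Import a MySQL table into HDFS (one mapper for a small example table)
    sqoop import --connect jdbc:mysql://localhost/testdb \
      --username dbuser -P --table customers \
      --target-dir /user/hands-on/customers -m 1

    # Export the HDFS data back into a (pre-created) MySQL table
    sqoop export --connect jdbc:mysql://localhost/testdb \
      --username dbuser -P --table customers_copy \
      --export-dir /user/hands-on/customers -m 1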

Segment 8: Using Apache Flume (20 mins)

  • A Flume example will demonstrate how to move web log data into Hadoop/HDFS (the agent launch is sketched below)
  • If needed, there will be a 5-minute question and answer period
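
Starting a Flume agent from the command line follows the pattern below; the agent name and configuration file (which would define the web log source, channel, and HDFS sink) are hypothetical and are provided in the course notes.

    # Start a Flume agent named "a1" using a properties-style configuration
    flume-ng agent --conf conf --conf-file weblog.conf \
      --name a1 -Dflume.root.logger=INFO,console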

Break: 10 minutes

Segment 9: Example Analytics Application using Apache Zeppelin (40 mins)

  • The major features of the Zeppelin web notebook will be demonstrated
  • A simple banking application notebook will be demonstrated using Apache Zeppelin
  • The example includes CSV input, RDD/Dataframe usage, and interactive plotting
  • If needed, there will be a 10-minute question and answer period.
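
If Zeppelin is installed in your sandbox, it typically runs as a daemon and is used through a browser; the default port is 8080, although sandbox builds (such as HDP) may use a different one.

    # Start the Zeppelin daemon, then open the notebook GUI in a browser
    zeppelin-daemon.sh start

    # Check that the daemon is running
    zeppelin-daemon.sh status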

Break: 10 minutes

Segment 10: Wrap-up/Where to Go Next (20 mins)

  • A brief summary of course take-aways
  • The download URL for all course notes, data, and a DOS to Linux/HDFS cheat-sheet
  • Resources for installing Hadoop/Spark/Zeppelin on your hardware are provided
  • If needed, there will be a 5- to 10-minute final question and answer period.