Live Online Training

Practical Linux Command Line for Data Engineers and Analysts

Learn to navigate Linux systems and perform essential tasks for Hadoop and Spark analytics

Douglas Eadline

The advent of Linux-based analytics systems using Apache Hadoop and Spark provides scalable tools for insight and learning. As with any UNIX-based platform, all essential operations can be performed from the command line. Indeed, many operations are most efficiently performed using the Linux command line. Although "pointing and clicking" in a GUI is often preferred, these interfaces can be restrictive and limit functionality. A good working knowledge of the Linux command line allows many key operations to be streamlined and easily executed, and the commands and features it provides can improve the throughput of today's data analyst.

What you'll learn-and how you can apply it

  • Understand why the command line is still important
  • Learn how to access a Linux server using the command line from Windows and Mac computers
  • Understand the basic Linux filesystem layout and navigate its contents
  • Learn the essential commands and tools used in a modern scalable analytics environment
  • Understand the basic vi text editor commands so you can view and edit files
  • Learn about ways to move data to/from Linux and Hadoop/Spark systems
  • Learn how to run Hadoop and Spark applications from the command line
  • Learn how to create simple scripts to automate many processes

This training course is for you because...

  • You are interested in learning only the essential and useful aspects of the Linux command line.
  • You want to learn how to connect to and perform useful tasks on almost any Linux server.
  • You are especially interested in Hadoop/Spark clusters.
  • You want hands-on experience so you can try all of the commands and examples during and after the course -- a Linux Hadoop virtual machine (including a single-server instance of Hadoop/Spark and other tools) is provided.

Prerequisites

  • A basic understanding of computer/server operation (processors, memory, disks, networking)

Setup Instructions

About your instructor

  • Douglas Eadline, PhD, began his career as an analytical chemist with an interest in computer methods. Starting with the first Beowulf how-to document, Doug has written instructional documents covering many aspects of Linux HPC (High Performance Computing) and Hadoop computing. Currently, Doug serves as editor of the ClusterMonkey.net website; he was previously editor of ClusterWorld Magazine and senior HPC editor for Linux Magazine. He is also an active writer and consultant to the HPC/analytics industry. His recent video tutorials and books include the Hadoop and Spark Fundamentals LiveLessons video (Addison-Wesley), Hadoop 2 Quick Start Guide (Addison-Wesley), High Performance Computing for Dummies (Wiley), and Practical Data Science with Hadoop and Spark (co-author, Addison-Wesley).

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Segment 1: Introduction and Course Goals (10 mins)

  • How to get the most out of this course
  • It is 2019: why do we still need the Linux/Unix command line?
  • Advantages and disadvantages of the command line
  • Working with the command line in Windows, Mac, and Linux
  • Safe communication using Secure Shell (SSH)
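
For example, a typical remote connection of the kind covered in this segment looks like the following (the user and host names are placeholders, not course servers):

    $ ssh analyst@server.example.com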

Segment 2: The Linux Hadoop Minimal Virtual Machine and Text Terminal (15 mins)

  • Using Oracle VirtualBox
  • Starting the Virtual Machine
  • Connecting to the VM using SSH
  • The Linux filesystem layout
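
As a sketch of what this segment covers, connecting to a local virtual machine and inspecting the filesystem layout might look like this (the user name and port are examples only; use the values given in the course setup instructions):

    $ ssh -p 2222 student@127.0.0.1    # SSH to the VM via a forwarded local port
    $ ls /                             # view the top-level Linux filesystem layout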

Segment 3: Basic Linux Commands (35 mins)

  • What is a *nix shell?
  • Basic Linux commands
  • Basic shell commands
  • Input/Output and pipes
  • File permissions
  • Process management
  • Commands to access system information
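
A few representative commands of the kind covered in this segment (the file names and patterns are illustrative):

    $ ls -l /etc | head -5         # pipe a directory listing into another command
    $ chmod u+x myscript.sh        # change file permissions
    $ ps -ef | grep sshd           # inspect running processes
    $ cat /proc/cpuinfo | less     # view basic system information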

Questions (10 mins)

Break (5 mins)

Segment 4: Editing/Viewing Text Files: vi (Visual Editor) (20 mins)

  • Basic modes and navigation
  • Insert/delete, copy/paste
  • Search/Replace
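
As a quick preview of the vi commands covered here (the file name is illustrative):

    $ vi notes.txt    # open (or create) a text file in vi
    # inside vi: press i to insert text, Esc to return to command mode,
    # /word to search, :%s/old/new/g to search and replace, :wq to save and quit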

Segment 5: Moving Data to/from Your Local File System (15 mins)

  • Compressing and archiving using tar and zip
  • Secure copy (scp)
  • Web get (wget)
  • Data integrity
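
A sketch of the kinds of commands used in this segment (the file names, host, and URL are placeholders):

    $ tar czvf project.tar.gz project/              # create a compressed archive
    $ scp project.tar.gz user@server.example.com:   # securely copy it to a remote server
    $ wget https://example.com/data/sample.csv      # download a file from the web
    $ md5sum project.tar.gz                         # one common way to check data integrity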

Segment 6: Moving Data into Hadoop HDFS (15 mins)

  • What is Hadoop HDFS and why is it different?
  • Your local file-system is not Hadoop HDFS
  • Using HDFS wrapper commands
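
The HDFS wrapper commands have a familiar command-line feel; for example (the HDFS paths and file name are illustrative):

    $ hdfs dfs -mkdir -p /user/analyst/data          # create a directory in HDFS
    $ hdfs dfs -put sample.csv /user/analyst/data    # copy a local file into HDFS
    $ hdfs dfs -ls /user/analyst/data                # list files stored in HDFS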

Segment 7: Bash Scripting Basics (20 mins)

  • Creating a bash script using the following:
  • Bash variables
  • If-then tests
  • Control structures
  • Input and output
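
A minimal sketch of the kind of script built in this segment, combining variables, an if-then test, a loop, and output (the script itself is illustrative, not the course example):

    #!/bin/bash
    # count-lines.sh: report the number of lines in each file given as an argument
    for file in "$@"; do
        if [ -f "$file" ]; then
            lines=$(wc -l < "$file")
            echo "$file has $lines lines"
        else
            echo "skipping $file: not a regular file"
        fi
    done

It could be run as, for instance, "bash count-lines.sh *.txt".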

Questions (10 mins)

Break (5 mins)

Segment 8: Running Command Line Analytics Tools (20 mins)

  • Running/Observing a Hive job
  • Running/Observing a PySpark job
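
Both jobs can be launched directly from the shell; for example (the query, table, and script names are placeholders):

    $ hive -e 'SELECT COUNT(*) FROM web_logs;'       # run a Hive query from the command line
    $ spark-submit --master local[2] wordcount.py    # submit a PySpark application locally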

Segment 9: Course Wrap-up and Additional Resources (5 mins)

Remaining questions