Live Online training

# Foundational data science with R

## What you'll learn, and how you can apply it

At the end of this live, online training, you’ll understand:

• How to summarize data sets with key statistics
• Which statistics are optimal for large data sets
• The trade-offs between different summary measures
• The importance of color, transparency and shape in data visualisations
• Mathematical distributions, and how they relate to “real” data
• How key algorithms work

And you’ll be able to:

• Summarize data sets
• Graphically describe data
• Compare groups of data using principled statistical techniques
• Describe relationships among data sets with correlation and regression models
• Use insight to predict future values

## This training course is for you because...

You are a:

• Programmer interested in data science, but with little or no statistical or mathematical background.
• Manager who wants to summarize data sets.
• Someone who uses data, but doesn’t have the necessary training to analyze and summarize it.

## Prerequisites

No experience with R is necessary, but participants are expected to understand basic programming in another language, e.g. Python, MATLAB, C, or Java. The course will be taught using R, but the focus is on the methods rather than on programming.

• Colin Gillespie is a Senior Lecturer in Statistics at Newcastle University, UK, and the co-author of Efficient R Programming, published by O’Reilly. His research interests are high-performance statistical computing and Bayesian statistics. He is regularly employed as a consultant by Jumping Rivers and has been teaching R since 2005 at a variety of levels, ranging from beginning to advanced programming.

## Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

DAY 1

Introduction and course overview (20 minutes)

• Introduction
• Course overview

Condensing data with numerical summaries (90 minutes)

Measures of location

• Mean, median, mode
• Example
• Exercise / Q&A (25 minutes)

Measures of spread

• Variance, standard deviation, quartiles, range
• Example
• Exercise / Q&A (25 minutes)

Streaming data

• Mean vs median
• Variance vs quartiles
• Example
• Exercise / Q&A (10 minutes)
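
The "mean vs median" and "variance vs quartiles" contrast matters for streaming data because the mean and variance can be updated one value at a time, whereas an exact median or quartile needs the values kept and sorted. A minimal sketch of such an update, using Welford's method, is shown here. The function name `update_stats`, the `state` list, and the four stream values are made up for illustration and are not taken from the course material.

```r
# Running mean and variance, updated one value at a time (Welford's method).
# `update_stats` and `state` are hypothetical names used only in this sketch.
update_stats <- function(state, x) {
  n <- state$n + 1
  delta <- x - state$mean
  new_mean <- state$mean + delta / n
  m2 <- state$m2 + delta * (x - new_mean)  # running sum of squared deviations
  list(n = n, mean = new_mean, m2 = m2)
}

state <- list(n = 0, mean = 0, m2 = 0)
stream <- c(4, 7, 13, 16)  # made-up values standing in for a data stream

for (x in stream) {
  state <- update_stats(state, x)
}

running_mean <- state$mean
running_var <- state$m2 / (state$n - 1)  # sample variance, as var() computes it

# The streaming results agree with the batch functions
all.equal(running_mean, mean(stream))
all.equal(running_var, var(stream))
```

Only three numbers are stored however long the stream gets, which is the practical appeal of the mean and variance for large data sets.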

Break (5 min)

What, why and how of visualisation (90 minutes)

Scatter plot

• Colors
• Number of points: should you summarize?
• Transparency
• Log scales
• Examples
• Exercise/Q&A (25 minutes)

Histogram

• How to determine the number of bins
• Examples

Barplot

• Ordinal data
• Examples

Boxplot

• Great for comparison
• Examples
• Exercises/Q&A (25 minutes)
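
For the question of how to determine the number of bins, base R ships three common rules of thumb. The sketch below compares them on the built-in `faithful` data set, which is chosen here only for convenience and is not necessarily the data used in the course.

```r
x <- faithful$eruptions  # built-in data set: eruption durations in minutes

# Three common rules of thumb for the number of bins
bins_sturges <- nclass.Sturges(x)  # based on log2 of the sample size
bins_scott   <- nclass.scott(x)    # based on the standard deviation
bins_fd      <- nclass.FD(x)       # Freedman-Diaconis, based on the IQR

c(sturges = bins_sturges, scott = bins_scott, fd = bins_fd)

# hist() accepts a rule by name. It treats the result as a suggestion and
# rounds to "pretty" break points. Drop plot = FALSE to draw the histogram.
h <- hist(x, breaks = "FD", plot = FALSE)
length(h$counts)  # number of bins actually used
```

The rules can disagree noticeably, so it is usually worth trying more than one and checking whether the shape of the histogram changes.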

Wrap-up

DAY 2

The normal distribution: what’s the point? (30 minutes)

• Where does the normal distribution come from?
• Shape: the famous bell-shaped curve
• Key parameters
• The 2 standard deviations rule
• Scaling data
• (Data - mean)/sd
• Example
• Exercise/Q&A (10 minutes)

Break (5 min)
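
The scaling step from the normal distribution session, (Data - mean)/sd, and the 2 standard deviations rule can be sketched in a few lines of R. The data values are made up for illustration.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # made-up values

# Scale by hand: subtract the mean, divide by the standard deviation
z <- (x - mean(x)) / sd(x)

# scale() performs the same calculation (it returns a one-column matrix)
z_builtin <- as.numeric(scale(x))
all.equal(z, z_builtin)

# After scaling, the data have mean 0 and standard deviation 1
mean(z)
sd(z)

# The 2 standard deviations rule: share of a normal distribution
# lying within 2 sd of the mean (roughly 95%)
within_2sd <- pnorm(2) - pnorm(-2)
within_2sd
```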

How to compare groups (60 minutes)

• The t-test
• The t-distribution
• Assumptions: normality, independence
• Example:
• OK Cupid data. Are the heights of the “daters” different from the standard population?
• The central limit theorem (basically, don’t worry about normality too much if your data set is big enough)
• Confidence intervals:
• Standard errors vs standard deviation
• Example
• Exercise/Q&A
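
The pieces of this session (standard error, t-statistic, confidence interval) fit together as in the sketch below. It uses simulated heights rather than the OK Cupid data, and the reference mean `pop_mean` is a hypothetical value chosen for illustration.

```r
set.seed(1)
heights <- rnorm(50, mean = 178, sd = 7)  # simulated heights in cm, not real data
pop_mean <- 175                            # hypothetical reference value

n  <- length(heights)
se <- sd(heights) / sqrt(n)  # standard error: sd of the sample mean, not of the data

# One-sample t-statistic computed by hand
t_stat <- (mean(heights) - pop_mean) / se

# 95% confidence interval for the mean, from the t-distribution with n - 1 df
ci <- mean(heights) + c(-1, 1) * qt(0.975, df = n - 1) * se

# t.test() reports the same statistic and interval, plus a p-value
res <- t.test(heights, mu = pop_mean)
res
```

The standard deviation describes the spread of individual heights, whereas the standard error shrinks as `n` grows, which is why the interval narrows for larger samples.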

Break (5 min)

Capturing relationships with linear regression (90 minutes)

• Correlation: linear relationship between two variables
• Examples
• Exercise/Q&A
• Simple linear regression
• Assumptions
• Residuals: Observed - expected
• Examples
• Exercise/Q&A
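
As a minimal sketch of the correlation and regression topics above, the following uses R's built-in `cars` data set (speed and stopping distance), which is chosen here only for convenience. The speed value of 21 passed to `predict()` is an arbitrary illustration.

```r
# Correlation: strength of the linear relationship between two variables
r <- cor(cars$speed, cars$dist)

# Simple linear regression of stopping distance on speed
fit <- lm(dist ~ speed, data = cars)
coef(fit)  # intercept and slope

# Residuals: observed - expected (fitted) values
res <- cars$dist - fitted(fit)
all.equal(unname(res), unname(residuals(fit)))

# With a single predictor, R-squared equals the squared correlation
all.equal(summary(fit)$r.squared, r^2)

# Use the fitted line to predict a new value
predict(fit, newdata = data.frame(speed = 21))
```

Plotting `res` against `fitted(fit)` is a common first check of the regression assumptions: a visible pattern suggests the straight-line model is missing something.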

Wrap-up (5 min)