Live Online Training

Machine Learning with the Tidyverse in R

Jared Lander

R has many tools for machine learning, such as glmnet for penalized regression and xgboost for boosted trees. While each package has its own interface, people have long relied on caret for a consistent experience and for features such as preprocessing and cross-validation. In this workshop we look at the next generation of machine learning in R from the author of caret: tidymodels. We will use recipes for preprocessing, parsnip for model training, rsample for cross-validation and yardstick for evaluation. Participants should have experience with R, linear models and tree-based models, and should arrive with a recent version of R installed, along with RStudio and the tidymodels and coefplot packages.

What you'll learn and how you can apply it

  • Curse of Dimensionality
  • Building Sparse Design Matrices
  • Penalized L1 (lasso) and L2 (ridge) Regression with Elastic Net
  • glmnet
  • Decision Trees
  • Boosted Trees
  • Random Forests
  • XGBoost
  • caret

This training course is for you because...

  • You want to learn about modern machine learning tools
  • You want to train large, interpretable models
  • You want to train boosted trees
  • You want to make accurate predictions

Prerequisites

  • Working knowledge of the basics of R
  • Experience fitting linear models in R

Recommended Preparation:

From R for Everyone, Second Edition (https://sunburn.in/?page=library/view/r-for-everyone/9780134546988/):

  • Chapter 19
  • Chapter 20
  • Chapter 21
  • Section 22.1
  • Section 23.4
  • Section 23.5
  • Section 23.6
  • Chapter 26

Course Set-up:

  • Install R and RStudio
  • Please follow the instructions at https://github.com/jaredlander/LearningR to set up your environment and download the data we will be using

About your instructor

  • Jared P. Lander is the Chief Data Scientist of Lander Analytics, a data science and artificial intelligence consulting and training firm based in New York City; the organizer of the New York Open Statistical Programming Meetup (the world's largest R meetup) and the New York R Conference; the author of R for Everyone; and an adjunct professor at Columbia University. With an M.A. in statistics from Columbia University and a B.S. in mathematics from Muhlenberg College, he has experience in both academic research and industry. Very active in the data community, Jared is a frequent speaker at conferences, universities and meetups around the world. His writings on statistics can be found at jaredlander.com, and his work has been featured in publications such as Forbes and the Wall Street Journal.

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

  • Creating train and test sets with rsample
  • Preprocessing and feature engineering with recipes
    • Scale variables
    • Create dummy variables
    • Take logarithms
    • Impute missing data
    • Other transformations
  • Fitting penalized regression models with glmnet
    • lasso regression
    • ridge regression
    • elastic net
  • Fitting boosted trees for classification with xgboost
  • Experimenting with hyperparameters
  • Use parsnip as a unified modeling interface
  • Assess accuracy with yardstick
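
As a rough preview of how the steps in the schedule fit together, here is a minimal sketch of one pass through the workflow: split the data with rsample, preprocess with recipes, fit an elastic net through parsnip and glmnet, and score it with yardstick. It is not taken from the course materials. The built-in mtcars data, the penalty and mixture values, the seed and all object names are assumptions chosen purely for illustration, and the sketch assumes the tidymodels and glmnet packages are installed and that R is version 4.1 or later (for the `|>` pipe).

```r
library(tidymodels)

set.seed(1234)  # arbitrary seed so the random split is reproducible

# Illustrative data: treat the number of cylinders as a categorical predictor
cars <- mtcars |>
  mutate(cyl = factor(cyl))

# rsample: hold out 25% of the rows as a test set
car_split <- initial_split(cars, prop = 0.75)
car_train <- training(car_split)
car_test  <- testing(car_split)

# recipes: log-transform, scale, and create dummy variables
car_recipe <- recipe(mpg ~ ., data = car_train) |>
  step_log(hp) |>
  step_normalize(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors())

# Estimate the preprocessing on the training data, then apply it to both sets
car_prepped <- prep(car_recipe, training = car_train)
train_baked <- bake(car_prepped, new_data = NULL)
test_baked  <- bake(car_prepped, new_data = car_test)

# parsnip: elastic net (mixture = 1 is the lasso, mixture = 0 is ridge)
enet_spec <- linear_reg(penalty = 0.1, mixture = 0.5) |>
  set_engine("glmnet")

enet_fit <- fit(enet_spec, mpg ~ ., data = train_baked)

# yardstick: compare held-out predictions with the observed values
car_preds <- predict(enet_fit, new_data = test_baked) |>
  bind_cols(test_baked)

car_rmse <- rmse(car_preds, truth = mpg, estimate = .pred)
car_rmse
```

Swapping `linear_reg()` for `boost_tree()` with `set_engine("xgboost")` would leave the splitting, preprocessing and evaluation steps largely unchanged, which is the point of a unified modeling interface.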