Live online training

Inside Unsupervised Learning: Group Segmentation using Clustering

Build systems to segment users into distinct and homogeneous groups

Ankur Patel

Many industry experts consider unsupervised learning the next frontier in artificial intelligence, one that may hold the key to artificial general intelligence. Since the majority of the world's data is unlabeled, conventional supervised learning cannot be applied to it directly; unsupervised learning is necessary. Unsupervised learning can be applied to unlabeled datasets to discover meaningful patterns buried deep in the data, patterns that would otherwise be nearly impossible for humans to uncover.

In this 90-minute course, O’Reilly author Ankur Patel will explore one of the core concepts in unsupervised learning: clustering. Clustering segments entities (e.g., users) into distinct and homogeneous groups such that members of a group are very similar to one another but distinctly different from members of other groups. This group segmentation requires no labels whatsoever; it relies instead on separating entities based on behavior.

For example, via clustering, online shoppers could be grouped into budget-conscious shoppers, high-end shoppers, frequent shoppers, seasonal shoppers, technophiles, audiophiles, sneakerheads, back-to-school shoppers, young parents, senior citizens, and millennials. To perform clustering well, good feature engineering is required. In this course, we will explore retail shopping, perform feature engineering, and segment users based on their shopping behavior. We will also explore how clustering allows efficient labeling, turning unlabeled problems into labeled ones, opening up the realm of semi-supervised learning.
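
As a rough illustration of the feature engineering step mentioned above, the sketch below turns a handful of made-up purchase records into per-user features (order count, total spend, average order value) and standardizes them so that no single feature dominates a distance-based clustering step. The records, the feature choices, and the function names are hypothetical; they are not the course's dataset or code.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical purchase records: (user_id, order_total_in_dollars).
PURCHASES = [
    ("u1", 20.0), ("u1", 30.0), ("u1", 25.0), ("u1", 25.0),
    ("u2", 400.0), ("u2", 600.0),
    ("u3", 15.0),
]


def build_user_features(purchases):
    """Aggregate raw purchases into one feature row per user."""
    amounts_by_user = defaultdict(list)
    for user_id, amount in purchases:
        amounts_by_user[user_id].append(amount)
    return {
        user_id: {
            "order_count": float(len(amounts)),
            "total_spend": sum(amounts),
            "avg_order_value": mean(amounts),
        }
        for user_id, amounts in amounts_by_user.items()
    }


def standardize(features):
    """Rescale each feature to zero mean and unit variance (z-scores)."""
    names = sorted(next(iter(features.values())))
    scaled = {user_id: {} for user_id in features}
    for name in names:
        column = [row[name] for row in features.values()]
        mu, sigma = mean(column), pstdev(column)
        for user_id, row in features.items():
            # A constant column carries no information; map it to 0.0.
            scaled[user_id][name] = (row[name] - mu) / sigma if sigma else 0.0
    return scaled


features = build_user_features(PURCHASES)
scaled = standardize(features)
```

Standardizing matters here because total spend is measured in hundreds of dollars while order count is a small integer; without rescaling, distances between users would be driven almost entirely by spend.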

What you'll learn and how you can apply it

By the end of this live, hands-on, online course, you’ll understand:

  • How to perform good feature engineering
  • How to cluster users into distinct and homogeneous groups
  • How to efficiently label a dataset after clustering, turning an unsupervised problem into a semi-supervised one
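
The labeling idea in the last item can be sketched as follows, assuming a clustering step has already assigned every user to a cluster and a reviewer has hand-labeled only a small sample. Each cluster takes the majority label of its reviewed members, and that label is propagated to the rest of the cluster. The cluster assignments, label names, and the `propagate_labels` helper are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical cluster assignments produced by some clustering step.
CLUSTER_OF = {
    "u1": 0, "u2": 0, "u3": 0,
    "u4": 1, "u5": 1, "u6": 1,
    "u7": 2,
}

# Labels a human reviewer supplied for a small sample of users.
HAND_LABELS = {"u1": "budget", "u2": "budget", "u4": "high-end"}


def propagate_labels(cluster_of, hand_labels):
    """Give every user the majority hand label of its cluster, if any."""
    votes = defaultdict(Counter)
    for user_id, label in hand_labels.items():
        votes[cluster_of[user_id]][label] += 1
    cluster_label = {
        cluster: counts.most_common(1)[0][0] for cluster, counts in votes.items()
    }
    # Users in clusters with no hand labels stay unlabeled (None).
    return {
        user_id: cluster_label.get(cluster)
        for user_id, cluster in cluster_of.items()
    }


labels = propagate_labels(CLUSTER_OF, HAND_LABELS)
```

In this toy example three hand labels cover six of the seven users; the seventh sits in a cluster nobody reviewed and stays unlabeled.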

And you’ll be able to:

  • Segment online shoppers based on their shopping behavior
  • Cluster entities of your choice into distinct and homogeneous groups after performing good feature engineering

This training course is for you because...

  • You are a data scientist or engineer and want to work with unlabeled data
  • You want to perform clustering to solve a business use case

Prerequisites

  • Working knowledge of Python
  • Understanding of Machine Learning

Recommended preparation

Recommended follow-up

About your instructor

  • Ankur A. Patel is the Vice President of Data Science at 7Park Data, a Vista Equity Partners portfolio company. At 7Park Data, Ankur and his data science team use alternative data to build data products for hedge funds and corporations and develop machine learning as a service (MLaaS) for enterprise clients. MLaaS includes natural language processing (NLP), anomaly detection, clustering, and time series prediction. Prior to 7Park Data, Ankur led data science efforts in New York City for Israeli artificial intelligence firm ThetaRay, one of the world's pioneers in applied unsupervised learning.

    Ankur began his career as an analyst at J.P. Morgan, and then became the lead emerging markets sovereign credit trader for Bridgewater Associates, the world's largest global macro hedge fund, and later founded and managed R-Squared Macro, a machine learning-based hedge fund, for five years. A graduate of the Woodrow Wilson School at Princeton University, Ankur is the recipient of the Lieutenant John A. Larkin Memorial Prize.

    He currently resides in Tribeca in New York City but travels extensively internationally.

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Introduction to Unsupervised Learning - 15 minutes

  • How unsupervised learning fits into the machine learning ecosystem
  • Common problems in machine learning
  • Finding patterns without having labels
  • Efficiently labeling data

Motivation for Clustering - 15 minutes

  • Segment users into distinct and homogeneous groups such that users within a group are very similar to one another but very different from users in other groups
  • Introduction to online retail shopping and the rise of personalization
  • Efficiently label data

Q&A - 5 minutes

Break - 5 minutes

Data Preparation - 10 minutes

  • Explore data in Jupyter notebook
  • Prepare the online retail shopping dataset

Feature Engineering - 10 minutes

  • Introduce feature engineering
  • Perform feature engineering

Clustering - 20 minutes

  • Apply k-means and evaluate results
  • Apply hierarchical clustering and evaluate results
  • Apply DBSCAN and evaluate results
  • Apply hierarchical DBSCAN (HDBSCAN) and evaluate results
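
As a minimal sketch of the first item in the list above, the following pure-Python implementation of k-means (Lloyd's algorithm) clusters six made-up two-dimensional points and reports the within-cluster sum of squares as one simple way to evaluate the result. The data, the function names, and the first-k-points initialization are assumptions made for illustration; the course may well use a library implementation instead.

```python
# Hypothetical, already-scaled user features: (orders per month, avg order value).
POINTS = [
    (1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
    (8.0, 8.0), (9.0, 8.5), (8.5, 9.0),
]


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iter=100):
    """Lloyd's algorithm; returns (labels, centroids)."""
    # Simplification: seed centroids with the first k points. Real code
    # usually uses random or k-means++ initialization with several restarts.
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # Converged: no point changed cluster.
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # Leave the centroid in place if its cluster is empty.
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


def within_cluster_ss(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))


labels, centroids = k_means(POINTS, k=2)
inertia = within_cluster_ss(POINTS, labels, centroids)
```

On this toy data the first three points end up in one cluster and the last three in the other. A lower within-cluster sum of squares means tighter clusters, but it always decreases as k grows, so it is best compared across several values of k rather than read in isolation.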

Conclusion / Q&A - 10 minutes