Live Online Training

Intermediate Natural Language Processing (NLP)

Real World Applications of Word Embeddings

Maryam Jahanshahi

Free-form text comprises the vast majority of data on both our computer systems and the internet at large. Yet, in comparison to structured data, it remains woefully underanalyzed. Natural language processing is critical to generating structure out of text, and enables us to derive important insights from it.

Join Maryam Jahanshahi to create advanced natural language models. In this live training, we will focus on using word embeddings for analyzing text data. In contrast to more basic approaches such as term frequencies or one-hot encodings, word embeddings allow us to use the context of words to create powerful language models. Our focus in this class will be on practical implementation of both pretrained and custom word embeddings. We will cover all the considerations, from handling and preprocessing natural language data to building and tuning embedding models.

This live training will primarily use libraries in the Python ecosystem, including spaCy, gensim and scikit-learn. We will use a combination of Python scripts and notebooks. The focus of this course will be on tools for English language models, although many of the principles can be applied to other languages.
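
As a quick taste of this toolchain, here is a minimal sketch of loading pretrained vectors with spaCy and gensim. It assumes spaCy's en_core_web_md model is installed and that gensim's downloader can fetch data; the model names are illustrative, not part of the official course setup.

```python
# Minimal sketch of the course toolchain: spaCy for tokenization and
# per-token vectors, gensim for loading pretrained embeddings.
# Assumes en_core_web_md and gensim downloader data are available.
import spacy
import gensim.downloader as api

# spaCy: tokenize text and read vectors from a pretrained pipeline.
nlp = spacy.load("en_core_web_md")
doc = nlp("Word embeddings place similar words near each other.")
print(doc[0].text, doc[0].vector[:5])  # first 5 dimensions of "Word"

# gensim: load a small pretrained embedding and query nearest neighbors.
vectors = api.load("glove-wiki-gigaword-50")
print(vectors.most_similar("language", topn=3))
```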

What you'll learn, and how you can apply it

  • Preprocess a text corpus effectively for language modeling
  • Train and tune text word embeddings
  • Use word embeddings to classify texts or cluster documents

This training course is for you because...

  • You are a data scientist or NLP engineer who has a working understanding of the fundamentals of natural language processing (tokenization, part-of-speech tagging, topic modeling)
  • You want to be able to transform a corpus of natural language into vector space representations that can be used as inputs for machine learning models
  • You want to develop custom language models for natural language data

Prerequisites

  • Familiarity with the basics of text preprocessing, including tokenization and stemming/lemmatization
  • Familiarity with basic methods for representing text, including one-hot encoding and term frequencies (a short refresher follows this list)
  • Proficiency in Python 3, with some familiarity with interactive Python environments, including notebooks
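
For a quick self-check on the representation prerequisites above, here is a short refresher using scikit-learn. The toy corpus is invented for illustration.

```python
# Refresher on count-based text representations (prerequisite material).
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the cat sat on the mat", "the dog sat on the log"]

# Term frequencies: each document becomes a vector of raw word counts.
tf = CountVectorizer()
counts = tf.fit_transform(corpus)
print(tf.get_feature_names_out())
print(counts.toarray())

# One-hot (presence/absence) encoding: binary=True keeps only 0/1 values.
onehot = CountVectorizer(binary=True)
print(onehot.fit_transform(corpus).toarray())
```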

Course Set-up:

  • The Course GitHub Repo contains links to:
    • The Kaggle Kernel, which runs the notebooks in the cloud, and
    • Instructions on how to set up the environment locally.
  • You only need to use one of these setups (Kaggle or the local environment). Broadly speaking, we will be using Python 3.6 or greater, and the spaCy, gensim and scikit-learn libraries.

Recommended Preparation:

If you need to brush up on Python or NLP before class, see the following videos:

  • Python Programming Language LiveLessons (video)
  • Modern Python LiveLessons (video)
  • Natural Language Processing (NLP) from Scratch LiveLessons (video)

Recommended Follow-up:

Deep Learning for NLP using Python (video)

About your instructor

  • Maryam Jahanshahi is a Research Scientist at TapRecruit, a platform that uses AI and automation tools to bring efficiency and fairness to the recruiting process. She holds a PhD in Cancer Biology from the Icahn School of Medicine at Mount Sinai. Maryam’s long-term research goal is to reduce bias in decision making by using a combination of NLP, data science, and decision science. She lives in New York, NY.

Schedule

The timeframes are estimates only and may vary according to how the class is progressing.

Segment 1: Introduction to Language Models (Length: 30 min)

  • Complexity of natural language requires specific techniques
  • Language models are probability distributions over a sequence of words
  • Key uses in machine learning and unsupervised learning (e.g., search/IR and clustering/topic modeling)
  • The intuition behind vector space modeling (a toy sketch follows this list)
  • Similarities and differences between word embedding algorithms (word2vec, GloVe, PPMI)
  • Q&A / Break (Length: 10 min)
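
To make the vector space intuition concrete, here is a toy sketch: words become vectors, and relatedness becomes the angle between them. The vectors below are invented for illustration; real embeddings are learned from corpora.

```python
# Toy illustration of vector space modeling: similarity as an angle.
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hand-made 3-d "embeddings"; real ones have hundreds of learned dimensions.
cat = np.array([0.8, 0.1, 0.3])
dog = np.array([0.7, 0.2, 0.35])
car = np.array([0.1, 0.9, 0.6])

print(cosine_similarity(cat, dog))  # high: related words
print(cosine_similarity(cat, car))  # lower: unrelated words
```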

Segment 2: Using Pretrained Word Embeddings (demonstrated in a notebook) (Length: 30 min)

  • Ease of using pretrained embeddings
  • Design considerations in using pretrained models including: noise, sentiments, generalization
  • Specific examples across different source corpora (the word "occupy" in Twitter, Wikipedia, and Common Crawl models; a comparison sketch follows this list)
  • Limitations of using pretrained models
  • Inputs: Implications of design decisions made during preprocessing on casing / stopwords / frequently occurring phrases
  • Output: Goal to learn similarity (example of word similarity tests)
  • Q&A / Break (Length: 10 min)
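
As a sketch of the corpus-sensitivity point above, the snippet below compares a word's nearest neighbors in embeddings trained on different corpora. The gensim downloader model names are assumptions about what is available in your environment, and fetching them can take a while.

```python
# Sketch: the same word has different neighbors in different corpora.
import gensim.downloader as api

twitter = api.load("glove-twitter-25")      # GloVe trained on tweets
wiki = api.load("glove-wiki-gigaword-50")   # GloVe trained on Wikipedia + Gigaword

# "occupy" picks up protest-movement associations on Twitter that are
# weaker in the Wikipedia-based vectors.
print(twitter.most_similar("occupy", topn=5))
print(wiki.most_similar("occupy", topn=5))
```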

Segment 3: Training your own Word Embeddings (Length: 30 min)

  • Optimizing for different outputs (semantic relations vs semantic similarity)
  • Preprocessing for outputs
  • Testing word embedding models (visual inspection, similarity pairs)
  • Training a custom embedding model, using spaCy for preprocessing and the gensim and scikit-learn APIs to train models (a minimal training sketch follows this list)
  • Note: Training an embedding can take many hours, so this notebook will focus on how to do it, and participants can continue to train or experiment in their own time.
  • Q&A / Break (Length: 10 min)
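
Here is a minimal sketch of the training pipeline in this segment: spaCy for preprocessing and gensim's Word2Vec for training. The two-sentence corpus is a stand-in; a real corpus would be far larger, which is why training takes hours.

```python
# Sketch: preprocess with spaCy, train a custom Word2Vec model with gensim.
import spacy
from gensim.models import Word2Vec

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

raw_docs = [
    "Word embeddings capture the contexts in which words appear.",
    "Training embeddings on your own corpus captures domain-specific usage.",
]

# Preprocess: keep lowercased lemmas, drop punctuation and stopwords.
sentences = [
    [tok.lemma_.lower() for tok in nlp(doc) if tok.is_alpha and not tok.is_stop]
    for doc in raw_docs
]

# Skip-gram model (sg=1); min_count=1 only because this corpus is tiny.
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, sg=1, epochs=20)
print(model.wv.most_similar("embedding", topn=3))
```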

Segment 4: Applying Word Embeddings (Length: 40 min)

  • Using word embeddings as inputs to understand documents
  • Supervised machine learning, including document classification (a classification sketch follows this list)
  • Unsupervised models, including document clustering
  • Using word embeddings to extract insights from texts
  • Static vs. dynamic embeddings at a high level
  • Hacking dynamic embeddings for other types of ordinal structure (e.g., grouping by review stars)
  • Q&A / Break (Length: 10 min)
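
To illustrate the document classification idea above, here is a minimal sketch: average each document's word vectors, then fit a standard scikit-learn classifier. The toy reviews and labels are invented for illustration.

```python
# Sketch: document classification on averaged word embeddings.
import numpy as np
import gensim.downloader as api
from sklearn.linear_model import LogisticRegression

vectors = api.load("glove-wiki-gigaword-50")

def doc_vector(text):
    """Average the embeddings of in-vocabulary tokens; zeros if none match."""
    tokens = [t for t in text.lower().split() if t in vectors]
    if not tokens:
        return np.zeros(vectors.vector_size)
    return np.mean([vectors[t] for t in tokens], axis=0)

docs = ["great movie loved it", "terrible plot awful acting",
        "wonderful performance throughout", "boring and badly paced"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

X = np.vstack([doc_vector(d) for d in docs])
clf = LogisticRegression().fit(X, labels)
print(clf.predict([doc_vector("loved the performance")]))
```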