Live Online Training

Reinforcement Learning: Building Recommender Systems

Matt Kirk

Have you ever made a decision that seemed like a good idea at the time, only for it to turn out years later to be a complete mistake? Has it gone the other way, where a mistake you made turned into something good later on?

This thought-provoking idea is what led to the field of reinforcement learning (RL). In chess, for example, a decision that makes sense for the next move might cost the player the game.

How do we make decisions now that set us up for success over time and dynamically react to changes as they play out? That is what RL is all about!

If you work in an industry that is dynamic (finance, aerospace, cars, advertising, media, social media), RL can bring massive value to you. The ability to learn how to operate within an environment can lead to making better portfolio decisions, spending advertising dollars more effectively, driving autonomous vehicles, automating processes, and much more.

In this class we will delve into what you need to know about RL to get started, beginning with the Bellman equations and value iteration and working up to deep Q-networks. You will also get some resources for learning more after the class.

What you'll learn and how you can apply it

By the end of this live, hands-on, online course, you’ll understand:

  • How to balance exploitation and exploration within a dynamic environment. We will introduce the concept of the Gittins index as well as other ideas on how this works in practice.
  • The tradeoff between model-free and model-based reinforcement learning algorithms
  • The connection between reinforcement learning and supervised learning
  • Value Iteration (Bellman equations), plus Q-Learning and DQNs for model-free reinforcement learning. Q-Learning is an algorithm that learns the estimated long-term reward for taking an action in a given state. A DQN, or Deep Q-Network, uses a neural net to assign a value to each action in a given state (sometimes the state is an image).

And you’ll be able to:

  • Build a simple model using value iteration to traverse a maze
  • Build a simplistic stock trader using Q-Learning
  • Play the game of Breakout using a DQN
  • Apply Value Iteration, Q-Learning, and DQNs to dynamic updating problems you face at work, whether it’s trading stocks, choosing advertisements to serve up, or automating processes with an optimal policy
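
To give a feel for the first exercise, here is a minimal sketch of value iteration on a tiny grid maze. The maze layout, the reward of -1 per move, the discount factor, and all function names are assumptions made for illustration; the maze used in class may look quite different.

```python
# Hypothetical 3x4 maze: 'S' start, 'G' goal, '#' wall, '.' open cell.
MAZE = [
    "S..#",
    ".#..",
    "...G",
]
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
GAMMA = 0.9          # discount factor (assumed)
STEP_REWARD = -1.0   # cost charged for every move (assumed)


def step(state, action):
    """Deterministic transition: move if the target cell is open, else stay put."""
    r, c = state
    dr, dc = ACTIONS[action]
    nr, nc = r + dr, c + dc
    if 0 <= nr < len(MAZE) and 0 <= nc < len(MAZE[0]) and MAZE[nr][nc] != "#":
        return (nr, nc)
    return state


def value_iteration(tol=1e-6):
    """Sweep the Bellman optimality update until the values stop changing."""
    states = [(r, c) for r, row in enumerate(MAZE)
              for c, ch in enumerate(row) if ch != "#"]
    goal = next(s for s in states if MAZE[s[0]][s[1]] == "G")
    values = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if s == goal:
                continue  # the terminal state keeps a value of 0
            best = max(STEP_REWARD + GAMMA * values[step(s, a)] for a in ACTIONS)
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            break
    # Greedy policy: the step reward is constant, so compare next-state values.
    policy = {s: max(ACTIONS, key=lambda a: values[step(s, a)])
              for s in states if s != goal}
    return values, policy


def walk(policy, start=(0, 0), max_steps=20):
    """Follow the greedy policy from the start cell until the goal is reached."""
    path = [start]
    while path[-1] in policy and len(path) <= max_steps:
        path.append(step(path[-1], policy[path[-1]]))
    return path
```

Calling `value_iteration()` and then `walk(policy)` returns the list of cells visited on the way from the start to the goal. Note that this approach needs the transition function `step` up front, which is what separates it from the model-free methods later in the course.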

This training course is for you because...

  • You are a data scientist with a background in supervised and unsupervised learning who wants to learn reinforcement learning, and you are tired of only classifying things at a point in time instead of over time.
  • You are a software engineer who wants to optimize an automated system over time using machine learning.


Prerequisites

  • An introductory knowledge of supervised learning
  • An understanding of what classification and regression are
  • A basic knowledge of optimization theory, information theory, and algebra would be helpful
  • A background in some of the deep learning techniques applied to images is also useful

Recommended preparation:

About your instructor

  • Matt Kirk is a data architect, software engineer, and entrepreneur based out of Seattle, WA.

    For years, he struggled to piece together his quantitative finance background with his passion for building software.

    Then he discovered his affinity for solving problems with data.

    Now he helps multi-million-dollar companies with their data projects, from diamond recommendation engines to marketing automation tools. He loves educating engineering teams about methods to start their big data projects.

    To learn more about how you can get started with your big data project (beyond taking this class), check out matthewkirk.com for tips.


Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Introduction (10 minutes)

Reinforcement Learning (40 minutes)

  • Presentation: Why Reinforcement Learning?
      • Balancing exploration and exploitation
      • Learning over time instead of all at once
      • Learning the policy to use rather than just a value
      • A way to learn AI heuristics and plans
  • Presentation: Why now? Dota 2, AlphaGo, and other advancements
  • Presentation: What exactly is Reinforcement Learning?
      • Autoregressive supervised learning
      • Bellman equations
      • Markov Decision Processes
      • Model-free vs. model-based RL
  • Presentation: Who is currently using RL effectively?
      • Hedge funds
      • Self-driving cars
      • Games
      • OpenAI
  • Q&A
  • Quiz (5 minutes)
  • Break (5 minutes)
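
The balance between exploration and exploitation covered in this section can be sketched with an epsilon-greedy rule on a multi-armed bandit. This is a minimal illustration rather than course material: the function name, the Gaussian rewards, and every parameter value below are assumptions, and epsilon-greedy is a simpler stand-in for the Gittins index, not an implementation of it.

```python
import random


def epsilon_greedy_bandit(true_means, epsilon=0.1, steps=5000, seed=0):
    """Play a Gaussian multi-armed bandit with an epsilon-greedy rule.

    true_means holds the (hidden) mean reward of each arm; the agent only
    ever sees noisy samples drawn around those means.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms        # number of pulls per arm
    estimates = [0.0] * n_arms   # running mean reward per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                           # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        # Incremental update of the sample mean for the pulled arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts
```

For example, `epsilon_greedy_bandit([0.0, 0.5, 2.0])` returns the estimated mean and pull count for each of three hypothetical arms. A larger `epsilon` spends more pulls on exploring, while a smaller one commits sooner to whichever arm currently looks best.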

Discussion (5 minutes)

  • What would RL be suited for in your organization?
  • When would you want to use a model rather than be model-free?
  • When should you optimize the policy rather than the end reward?

Q-Learning (35 minutes)

  • Lecture: Value iteration with the Bellman equations
  • Lecture: Rearranging Value Iteration into Q-Learning, that is, learning the optimal action based on an expected Q value or on the value at the terminal state
  • Lecture: Walkthrough of multiple Q-Learning scenarios
      • What to pick as a reward?
      • What is a state?
      • What is an action?
      • Are the actions stochastic or deterministic?
  • Q&A
  • Quiz: Test recollection of Q-Learning (5 minutes)
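
The Q-Learning update discussed in this section can be sketched in tabular form on a toy corridor environment. Everything specific here (the five-state corridor, the reward of 1 at the far end, the hyperparameters, and the function names) is a hypothetical example, not the course's code.

```python
import random

N_STATES = 5            # hypothetical corridor: states 0..4, state 4 is terminal
ACTIONS = (-1, +1)      # move left or move right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2   # assumed hyperparameters


def env_step(state, action):
    """Deterministic toy dynamics: reward 1 only on reaching the last state."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def greedy(q, state, rng):
    """Pick the highest-valued action, breaking ties at random."""
    best = max(q[(state, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if q[(state, a)] == best])


def q_learning(episodes=500, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy behavior policy.
            if rng.random() < EPSILON:
                action = rng.choice(ACTIONS)
            else:
                action = greedy(q, state, rng)
            next_state, reward, done = env_step(state, action)
            # Target: reward plus the discounted best value of the next state.
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            target = reward + GAMMA * best_next
            q[(state, action)] += ALPHA * (target - q[(state, action)])
            state = next_state
    return q
```

The table returned by `q_learning()` maps each `(state, action)` pair to its learned value. Unlike value iteration, the agent never consults the transition rules directly; it only learns from the transitions it experiences.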

Q-Trader using straight Q-Learning (5-10 minutes)


  • Walk through hand-coded states
  • Walk through hand-coded actions
  • Define the reward as the Sharpe ratio
  • Show my results
  • Break (5 minutes)
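
As a rough illustration of using the Sharpe ratio as a reward, the sketch below computes it from a list of per-period returns. The annualization factor of 252 trading days, the zero risk-free rate, and returning 0 for a flat series are assumptions; the Q-Trader in class may define its reward differently.

```python
import math
import statistics


def sharpe_ratio(returns, risk_free_rate=0.0, periods_per_year=252):
    """Annualized Sharpe ratio of a series of per-period returns.

    risk_free_rate is expressed per period, matching the returns.
    Requires at least two returns so a sample deviation exists.
    """
    excess = [r - risk_free_rate for r in returns]
    volatility = statistics.stdev(excess)   # sample standard deviation
    if volatility == 0:
        return 0.0                          # assumed convention for a flat series
    return statistics.mean(excess) / volatility * math.sqrt(periods_per_year)
```

A trading agent could, for instance, call `sharpe_ratio(episode_returns)` at the end of each episode and use the result as that episode's reward, which favors steady gains over volatile ones.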

Lab: work on Q-Trader (20 minutes)

  • The goal is to fill in the dots and do better than I did.
  • Hints: try different learning rates, different ways of increasing how many episodes the agent sees, or other changes.

DQN (30-35 minutes)

  • Lecture: Instead of representing Q as a table, can we learn it using something else?
  • Lecture: Calculate Q using a neural net. Walk through the variations of DQNs, including the double DQN.
  • Lecture: What are neural nets good at? How can we roll that into DQNs?
      • Convolutions
      • Max pooling
      • Dropout
      • Recurrent layers
  • Q&A
  • Quiz: Knowledge check (5 minutes)
  • Break (5 minutes)
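
The difference between a standard DQN target and a double DQN target can be sketched without any deep learning library. The lists of Q-values below stand in for the outputs of an online network and a target network, and the function names and default discount are assumptions for illustration.

```python
def dqn_target(reward, next_q_target, gamma=0.99, done=False):
    """Standard DQN target: the target network both picks and scores the action."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Double DQN target: the online network picks the action,
    the target network scores it."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]
```

With hypothetical next-state values `next_q_online = [1.0, 3.0, 2.0]` and `next_q_target = [5.0, 0.5, 4.0]`, the standard target bootstraps from the target network's largest value, while the double DQN target uses the target network's value for the action the online network prefers. Decoupling selection from evaluation in this way is intended to reduce overestimation of Q-values.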

Does a DQN work better than Q-Learning? (10 minutes)


  • The state is now amorphous
  • The action is still hand-coded
  • The reward is still the same
  • Show my results

Lab (20 minutes)

  • Dropout
  • Recurrent layers
  • Max pooling
  • Convolutions
  • etc.

Wrap-up and Conclusion (10 minutes)

  • Recap of all the algorithms shown earlier
  • There is always more to learn: A2C, AlphaZero, TD(λ). DQN only scratches the surface.
  • Q&A