AAAI-10 Tutorial: Reinforcement Learning Algorithms for MDPs
Sunday, July 11, 2010, 9:00 AM – 1:00 PM
Location: Atlanta Ballroom F, Seventh Floor
Slides without notes: Part 1, Part 2, Part 3, Part 4
The "slides with notes" PDFs below come in a format where every page to be shown on the big screen is followed by a page to be shown on the computer's private screen (i.e., they are designed for dual-screen projection). We decided that the notes might be useful for the audience, so we are sharing them with everyone. Here they come:
Slides with notes: Part 1, Part 2, Part 3, Part 4
Abstract
Reinforcement learning is a popular and highly developed approach to artificial intelligence with a wide range of applications. By integrating ideas from dynamic programming, machine learning, and psychology, reinforcement learning methods have enabled much better solutions to large-scale sequential decision problems than had previously been possible. This tutorial will cover Markov decision processes and approximate value functions as the formulation of the reinforcement learning problem, and temporal-difference learning, function approximation, and Monte Carlo methods as the principal solution methods. The focus will be on the algorithms and their properties. Applications of reinforcement learning in robotics, game playing, the web, and other areas will be highlighted. The main goal of the tutorial is to orient AI researchers to the fundamentals and research topics in reinforcement learning, preparing them to evaluate possible applications and to access the literature efficiently.


Prerequisite knowledge: We will assume familiarity with basic mathematical concepts such as conditional probabilities, expected values, derivatives, vectors and matrices.
Reading material
The tutorial will closely follow the publication:
Attendees would thus benefit from reading through this paper.
Note that an extended, roughly 100-page version of this short paper is expected to appear in print by the time of the conference in the AI Lectures Series of Morgan & Claypool Publishers. Here is the website of the book, and you can buy it on Amazon here! If you want to check out the content before you buy, the manuscript is available from here.
Schedule
9:00–9:40 Part 0: Introduction
Contents:
 Defining and motivating the problems
 Classification of learning problems
 A quick glance at the coming attractions (what algorithms, what problems)


Main message: Sequential decision making problems are at the heart of AI and finding approximate solutions to them is a key challenge in AI
Concepts: MDPs, state, action, reward, transitions, open- and closed-loop control, the optimal control problem, state-space representation of the optimal control problem, stationary policies, value functions
Algorithms: Exhaustive enumeration of policies
Theory: Bellman equation for policies
Examples:
 OR-like: Power management of computers (a two-state MDP), queuing, inventory management
 AI-like: Robot control, car driving, walking, focus of attention, information retrieval, adaptive user interfaces, ...
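The Bellman equation for a fixed policy can be made concrete with the two-state power-management example above. The sketch below is illustrative only: the state names, transition probabilities, and rewards are made-up numbers, not from the tutorial. Under a fixed policy, the MDP reduces to a Markov reward process, and the policy's value function is the unique fixed point of its Bellman equation, which we find by simple fixed-point iteration.

```python
# Policy evaluation on a hypothetical two-state power-management MDP
# (states "on" and "sleep"; all numbers are illustrative).
# We solve the Bellman equation
#   V(s) = r(s) + gamma * sum_s' P(s'|s) V(s')
# by iterating the Bellman operator, which is a contraction for gamma < 1.

gamma = 0.9

# Under a fixed policy the MDP becomes a Markov reward process:
# P[s][s2] is the transition probability, r[s] the expected one-step reward.
P = {
    "on":    {"on": 0.8, "sleep": 0.2},
    "sleep": {"on": 0.3, "sleep": 0.7},
}
r = {"on": 1.0, "sleep": -0.2}  # e.g., throughput minus energy cost

V = {s: 0.0 for s in P}
for _ in range(1000):  # contraction => converges to the unique fixed point
    V = {s: r[s] + gamma * sum(p * V[s2] for s2, p in P[s].items())
         for s in P}
```

Because the operator is a gamma-contraction, the iterates converge geometrically; after the loop, V satisfies the Bellman equation up to floating-point precision.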
9:40–9:50 Break
9:50–10:30 Part I: Foundations of computing optimal decisions
Main issues studied:
 Planning under uncertainty
 Dynamic programming algorithms
 Using samples to avoid intractability (i.e., Monte Carlo value estimation)
Main message: Dynamic programming trades search time for memory, but exact calculations lead to intractability. Sample-based methods can avoid this pitfall.
Concepts: Bellman equations (optimality, policy evaluation)
Algorithms: Value iteration, policy iteration, Monte Carlo value estimation
Theory: Vectors as functions, contractions, fixed points, basic error bounds, sample-based planning
Examples:
 For DP algorithms: power management
 For Monte Carlo: predicting power consumption, predicting the size of inventories, taxi-out times at airports
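As a companion to the dynamic-programming algorithms listed above, here is a minimal value-iteration sketch. The two-state, two-action MDP and all its numbers are illustrative assumptions, not from the tutorial; the update applied is the Bellman optimality backup.

```python
# Value iteration on a small hypothetical MDP (all numbers illustrative).
# Bellman optimality update:
#   V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]

gamma = 0.9
# P[s][a] = list of (next_state, probability); R[s][a] = expected reward
P = {
    0: {"run": [(0, 0.9), (1, 0.1)], "sleep": [(1, 1.0)]},
    1: {"run": [(0, 0.6), (1, 0.4)], "sleep": [(1, 1.0)]},
}
R = {0: {"run": 2.0, "sleep": 0.0}, 1: {"run": 1.0, "sleep": 0.1}}

V = {s: 0.0 for s in P}
for _ in range(500):  # the optimality operator is also a gamma-contraction
    V = {s: max(R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a])
                for a in P[s])
         for s in P}

# The policy that acts greedily with respect to the optimal V is optimal:
pi = {s: max(P[s], key=lambda a: R[s][a]
             + gamma * sum(p * V[s2] for s2, p in P[s][a]))
      for s in P}
```

Monte Carlo value estimation, also listed above, replaces the exact expectation over next states with an average over sampled transitions, which is what makes large state spaces tractable.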
10:30–11:00 Official Break
11:00–11:50 Part II: Learning to predict values
Main issue studied: How to learn the value function of a policy over a large state space?
Lessons:
 Features and representation matter.
 Sample size matters: overfitting/underfitting can be a problem
 Incremental vs. non-incremental methods: use incremental methods when the feature space is large and observations are plentiful
Concepts: Parametric (linear) function approximation, nonlinear and nonparametric function approximation, incremental and non-incremental methods, on-policy and off-policy sampling, temporal differences
Algorithms: TD(lambda) for linear function approximation, GTD and variants, [R]LSTD(lambda), [R]LSPE(lambda)
Theory: Stochastic approximation, generalization error bounds
Examples: predicting power consumption, predicting the size of inventories, taxi-out times at airports
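To make the prediction setting concrete, below is a minimal sketch of TD(0), the lambda = 0 case of TD(lambda), with linear function approximation. The environment (a small random-walk chain), the one-hot feature map, and the step size are all illustrative assumptions, not from the tutorial.

```python
import random

# TD(0) with linear function approximation on a synthetic 5-state
# random-walk chain: states 0..4, moves left/right uniformly, episode
# ends off either side, reward 1 only for exiting on the right.
# Features are one-hot here for simplicity; any feature map phi(s)
# would plug in the same way.

random.seed(0)
n, gamma, alpha = 5, 1.0, 0.1

def phi(s):
    return [1.0 if i == s else 0.0 for i in range(n)]

w = [0.0] * n  # weight vector of the linear value estimate V(s) = w . phi(s)
for episode in range(2000):
    s = n // 2  # every episode starts in the middle
    while True:
        s2 = s + random.choice([-1, 1])
        done = s2 < 0 or s2 >= n
        reward = 1.0 if s2 >= n else 0.0
        v_s = sum(wi * xi for wi, xi in zip(w, phi(s)))
        v_s2 = 0.0 if done else sum(wi * xi for wi, xi in zip(w, phi(s2)))
        delta = reward + gamma * v_s2 - v_s  # temporal-difference error
        # gradient step on the weights, in the direction of the features
        w = [wi + alpha * delta * xi for wi, xi in zip(w, phi(s))]
        if done:
            break
        s = s2
```

On this chain the true values grow linearly from the left end to the right end, and the learned weights approximate them; adding eligibility traces (lambda > 0) changes only the credit assignment, not the structure of the update.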
11:50–12:00 Break
12:00–12:40 Part III: Learning to control
Main issue studied: How to learn a good policy?
Lessons:
 Controlling the trade-off between exploration and exploitation is crucial for success
 Many algorithms are in use; however, the case is far from closed!
Concepts: exploration-vs.-exploitation dilemma, parametric (linear) function approximation, nonlinear and nonparametric function approximation, incremental and non-incremental methods, on-policy and off-policy sampling, temporal differences
Algorithms:
 Direct methods: Q-learning, Greedy-GQ, Fitted Q-iteration
 Actor-Critic methods: Sarsa(lambda) + Natural Actor-Critic
Theory: Error propagation, error bounds, asymptotic theory
Examples: Tetris, inverted pendulum, learning to walk, inventory control, TD-Gammon
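As a concrete instance of the direct methods above, here is a minimal tabular Q-learning sketch with epsilon-greedy exploration. The toy chain environment, the use of random starting states, and all parameter values are illustrative assumptions, not from the tutorial.

```python
import random

# Tabular Q-learning with epsilon-greedy exploration on a toy
# deterministic chain: states 0..4, "right" moves toward the goal at
# state 4 (reward 1, episode ends), "left" moves back.

random.seed(1)
n, gamma, alpha, eps = 5, 0.9, 0.5, 0.2
actions = ["left", "right"]
Q = {(s, a): 0.0 for s in range(n) for a in actions}

def step(s, a):
    s2 = min(s + 1, n - 1) if a == "right" else max(s - 1, 0)
    return s2, (1.0 if s2 == n - 1 else 0.0), s2 == n - 1

for episode in range(500):
    s = random.randrange(n - 1)  # random starting states ease exploration
    while True:
        # epsilon-greedy: explore with probability eps, else act greedily
        if random.random() < eps:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda b: Q[(s, b)])
        s2, r, done = step(s, a)
        # off-policy target: bootstrap from the best action at s2
        target = r if done else r + gamma * max(Q[(s2, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        if done:
            break
        s = s2

policy = {s: max(actions, key=lambda a: Q[(s, a)]) for s in range(n)}
```

Because the max in the target is over actions rather than the action actually taken, Q-learning is off-policy: it learns about the greedy policy while behaving with the exploratory one, which is exactly the dilemma discussed above.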
12:40–13:00 Part IV: Summary
Contents:
 Revisiting the concepts, ideas, algorithms, issues
 Guide to the available software
 Guide to the literature (what was left out)
 Surprising facts about RL