Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search
Arthur Guez and David Silver and Peter Dayan

Supplementary material: Videos Description

General Description
---------------------

Animation of the BAMCP algorithm running on the Infinite 2D grid task of Section 5.2.

In the main window, on the left half of the video, a portion of the MDP is represented,
centered around the agent.

Green squares indicate that a reward is available, grey that a reward has been consumed, and
blue the absence of reward.

As in Figure S4, the parameters for the rows and columns are shown
as orange and blue circles; here they represent the posterior mean parameters given past
observed transitions.
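As a minimal sketch of how such a posterior mean is obtained (assuming, as in the videos, an independent Beta prior over each row/column Bernoulli parameter, updated by counts of observed rewards and empty squares):

```python
def beta_posterior_mean(alpha, beta, successes, failures):
    """Posterior mean of a Bernoulli parameter under a Beta(alpha, beta)
    prior, after observing the given success/failure counts."""
    return (alpha + successes) / (alpha + beta + successes + failures)

# e.g., a Beta(0.5, 0.5) prior after observing 3 rewarded and 1 empty square
print(beta_posterior_mean(0.5, 0.5, 3, 1))  # 3.5 / 5 = 0.7
```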

The observations are represented in the bottom right corner. 
A subset of the lazy samples is shown at each step in the top right corner.
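The idea behind these lazy samples can be sketched as follows: a row/column parameter is only drawn from its posterior the first time a simulation actually needs it, and is then cached for reuse. This is a hypothetical illustration, not the authors' implementation; the class and method names are invented for the example.

```python
import random

class LazyBetaSampler:
    """Lazily sample per-index Bernoulli parameters from Beta posteriors:
    draw a value only on first access, then cache it for reuse."""

    def __init__(self, alpha, beta):
        self.alpha, self.beta = alpha, beta
        self.cache = {}

    def get(self, index, successes=0, failures=0):
        # Draw from the Beta posterior for this index on first access only.
        if index not in self.cache:
            self.cache[index] = random.betavariate(
                self.alpha + successes, self.beta + failures)
        return self.cache[index]

rows = LazyBetaSampler(0.5, 0.5)
p = rows.get(3)            # sampled now, on first access
assert p == rows.get(3)    # cached value reused thereafter
```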


Specific Description
---------------------

* video1.avi and video2.avi

Agent's prior matches the generative model: Beta(0.5,0.5) for both row and column parameters.

* video3.avi 

Agent's prior matches the generative model: Beta(1,3) for rows, Beta(0.5,0.5) for columns.

* video4.avi

Prior mismatch:
Agent's prior is Beta(0.5,0.5) for rows, Beta(1,4) for columns.
Generative model is Beta(0.5,0.5) for both rows and columns.

