Using Free Energies to Represent Q-values in a Multiagent Reinforcement Learning Task

Part of Advances in Neural Information Processing Systems 13 (NIPS 2000)

Bibtex Metadata Paper


Brian Sallans, Geoffrey E. Hinton


The problem of reinforcement learning in large factored Markov decision processes is explored. The Q-value of a state-action pair is approximated by the free energy of a product of experts network. Network parameters are learned on-line using a modified SARSA algorithm which minimizes the inconsistency of the Q-values of consecutive state-action pairs. Ac(cid:173) tions are chosen based on the current value estimates by fixing the current state and sampling actions from the network using Gibbs sampling. The algorithm is tested on a co-operative multi-agent task. The product of experts model is found to perform comparably to table-based Q-Iearning for small instances of the task, and continues to perform well when the problem becomes too large for a table-based representation.