Time and Place: Room T4 at Aalto CS building + Zoom, at 10:15am on Thursday the 28th
Abstract: The standard multi-armed bandit problem is simply the single-agent version of normal-form games, and Markov decision processes are the single-agent version of stochastic games. In this talk I will present reinforcement learning algorithms for all four of these settings, bringing out the similarities, but also the difficulties of transferring from the single-agent to the multi-agent setting. All the methods build on foundations consisting of Q-learning and fictitious play. The key trick to proving convergence is a technique known as “two timescales”, introduced by Borkar, which is often observed informally in machine learning experiments but is somewhat hidden from view.
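As a rough illustration of the two-timescales idea mentioned above (my own sketch, not taken from the talk): two coupled noisy recursions are run with step sizes a_n >> b_n, both tending to zero. The fast variable then tracks its equilibrium as if the slow variable were frozen, and the slow variable evolves against an effectively converged fast one. Here, with a hypothetical target c, the fast x tracks the slow y while y is driven toward c, so both converge to c despite the noise.

```python
import random

random.seed(0)

c = 2.0          # hypothetical target for the slow recursion
x, y = 0.0, 0.0  # fast and slow variables

for n in range(1, 50001):
    a_n = n ** -0.6  # fast step size: a_n -> 0, sum diverges
    b_n = n ** -0.9  # slow step size: additionally b_n / a_n -> 0
    x += a_n * (y - x + random.gauss(0.0, 0.5))  # fast: tracks the current y
    y += b_n * (c - x + random.gauss(0.0, 0.5))  # slow: sees x ~= y, drifts to c

# After many iterations, both x and y sit near c = 2.0.
```

The separation condition b_n / a_n -> 0 is what lets the convergence proof treat the fast recursion as already equilibrated when analysing the slow one.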
Bio: David Leslie is a Professor of Statistical Learning and Director of Engagement in the Department of Mathematics and Statistics at Lancaster University. He researches statistical learning, decision-making, and game theory. His research on bandit algorithms is used by many of the world's largest companies to balance exploration and exploitation in real-time website optimisation. Prior to his position at Lancaster, he was a senior lecturer in the statistics group of the School of Mathematics, University of Bristol, where he was co-director of the EPSRC-funded cross-disciplinary decision-making research group.