(These are excerpts from my book "Intelligence is not Artificial")
Deep Reinforcement Learning: A brief History of Artificial Intelligence/ Part 7
The game of go/weichi had been a favorite field of research since the birth of deep learning.
In 2006 Remi Coulom in France applied Ulam's old Monte Carlo method to the problem of searching a tree data structure (one of the most common problems in computational math, but particularly difficult when the data are game moves), and thus obtained the Monte Carlo Tree Search algorithm, and then applied it to go/weichi ("Bandit Based MonteCarlo Planning", 2006). At the same time Levente Kocsis and Csaba Szepesvari developed the UCT algorithm (Upper Confidence Bounds for Trees algorithm), an application of the bandit algorithm to the Monte Carlo search, the bandit algorithm being a problem in probability theory first studied by Herbert Robbins in 1952 at the Institute for Advanced Study ("Some aspects of the sequential design of experiments", 1952).
These algorithms dramatically improved the chances by machines to beat go masters: in 2009 Fuego Go (developed at the University of Alberta) beat Zhou Junxun, in 2010 MogoTW (developed by a FrenchTaiwanese team) beat Catalin Taranu, in 2012 Tencho no Igo/ Zen (developed by Yoji Ojima) beat Takemiya Masaki, in 2013 Crazy Stone (by Remi Coulom) beat Yoshio Ishida, and in 2016 AlphaGo (developed by Google's DeepMind) beat Lee Sedol. DeepMind's victory was widely advertised. DeepMind used a slightly modified Monte Carlo algorithm but, more importantly, AlphaGo taught itself by playing against itself. AlphaGo's neural network was trained with 150,000 games played by go/weichi masters.
AlphaGo had played 200 million games by January 2017 when it briefly debuted online disguised under the moniker Master; and in May 2017 it beat the world's master Ke Jie.
DeepMind had previously combined convolutional networks with reinforcement learning to train a neural network to play videogames: in 2013 Volodymyr Mnih and others had trained convolutional networks with a variant of Qlearning, the "asynchronous actorcritic" algorithm or A3C, in order to improve the policy function ("Playing Atari with Deep Reinforcement Learning", 2013).
These Deep QNetwork (DQN) was the first in a progression of
deep reinforcement learning methods developed by DeepMind, leading to AlphaGo and then AlphaZero.
In 2016 the DeepMind team also adapted "deep Qlearning" to the more general domain of continuous action ("Continuous Control with Deep Reinforcement Learning", 2016).
Ironically, few people noticed that in September 2015 Matthew Lai unveiled an opensource chess engine called Giraffe that used deep reinforcement learning to teach itself how to play chess (at international master level) in 72 hours. It was designed by just one person and it ran on a the humble computer of his department at Imperial College London. (Lai was hired by Google DeepMind in January 2016, two months before AlphaGo's exploit against the Go master).
In 2016 Toyota demonstrated a selfteaching car, another application of deep reinforcement learning like AlphaGo: a number of cars are left to randomly roam the territory with the only rule that they have to avoid accidents. After a
while, the cars learn how to drive properly in the streets.
Poker was another game targeted by A.I. scientists, so much so that the University of Alberta even set up a Computer Poker Research Group. Here in 2007 Michael Bowling developed an algorithm called Counterfactual Regret Minimization or CFR ("Regret Minimization in Games with Incomplete Information", 2007), based on the "regret matching" algorithm invented in 2000 by Sergiu Hart and Andreu MasColell at the Einstein Institute of Mathematics in Israel ("A Simple Adaptive Procedure Leading to Correlated Equilibrium", 2000). These are techniques of selfplaying: given some rules describing a game, the algorithm plays against itself and develops its own strategy for playing the game better and better. It's yet another form of reinforcement learning, except that in this case reinforcement learning is used to devise the strategy from scratch, not to learn the strategy used by humans. The goal of CFR and its numerous variants is to approximate solutions for imperfect information games such as poker. CFR variants became the algorithms of choice for "poker bots" used in computer poker competitions. In 2015 Bowling's team developed Cepheus and Tuomas Sandholm at Carnegie Mellon University developed Claudico, that played the professionals at a Pittsburgh casino (Pennsylvania). Claudico lost but in 2017 Libratus, created by the same group, won. Libratus employed a new algorithm, called CFR+, introduced in 2014 by Finnish hacker Oskari Tammelin ("Solving Large Imperfect Information Games Using CFR+, 2014) that learns much faster compared with previous versions of CFR. However, the setting was absolutely unnatural, in particular to rule out card luck. It is safe to state that no human players had ever played poker in such a setting before. But it was telling that the machine started winning when the number of players was reduced and the duration of the tournament was extended: more players in a shorter time beat Claudico, but fewer players over a longer time lost to Libratus.
Deep reinforcement learning has three main problems: it requires a lot of data, which is unnatural (animals can learn something even if they saw it only once or twice), it fails all too easily in situations that differ just slightly from the situations for which the system has been trained, and its internal workings are largely inscrutable, which makes us uncomfortable to use them on missioncritical projects.
A different strand from DeepMind's A3C harkens back to the "relational Markov decision process", introduced by Carlos Guestrin ("Planning under Uncertainty in Complex Structured Environments", 2003); and to the ObjectOriented Markov Decision Process, introduced by Micheal Littman's student Carlos Diuk at Rutgers University ("An ObjectOriented Representation for Efficient Reinforcement Learning", 2008). The relational Markov decision process blends probabilistic and
relational representations with the express goal of planning action in the real world.
Marta Garnelo's "deep symbolic reinforcement" at Imperial College London ("Towards Deep Symbolic Reinforcement Learning",2016) and Dileep George's "schema networks" to transfer experience from one situation to other similar situations (Schema Networks", 2017) built upon these ideas, basically wedding firstorder logic representation with deep reinforcement learning.
Yet another route led to Trust Region Policy Optimization (TRPO), introduced in 2015 by Pieter Abbeel's student John Schulman at UC Berkeley. DQN, A3C and TRPO were just the tip of the iceberg: algorithms for optimization of reinforcement learning multipled and evolved rapidly after the success of these techniques in learning to play games.
Back to the Table of Contents
Purchase "Intelligence is not Artificial")
