In this second part on Deep Reinforcement Learning we are going to explore some of the combined methods (look into Part I if you don't know what I mean by combined methods). We will start with the so-called Advantage Actor-Critic algorithm, which combines several of the concepts we learned about in the first part. Then we dive into the very popular Proximal Policy Optimization algorithm. Finally, I will give you an overview of the Alpha(Go)Zero algorithm from DeepMind. In the third part of this series we are going to talk more about Imitation Learning and model-based RL.


Advantage Actor-Critic (A2C)

As mentioned before, Advantage Actor-Critic (A2C) algorithms combine ideas from policy gradient methods (e.g. REINFORCE) with a learned value function (e.g. DQN): here we reinforce a policy with a reinforcing signal generated by a learned value function. A2C algorithms therefore consist of two jointly learned components:

  • an actor which learns a parameterized policy and
  • a critic which learns a value function to evaluate state-action pairs (it provides a reinforcing signal to the actor)

The motivation for this is that a learned reinforcing signal can be much more informative for a policy than the rewards available from the environment. Instead of learning $Q^{\pi}(s,a)$ or $V^{\pi}(s)$, it is common to learn the so-called advantage function $A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s)$ as the reinforcing signal. The key idea behind this is that it is better to select an action based on how it performs relative to the other actions available in a particular state, instead of using its absolute value (hence the name advantage actor-critic). The actor learns a parameterized policy $\pi_{\theta}$ using the policy gradient. This is similar to the REINFORCE algorithm, but instead of the Monte-Carlo estimate $R_{t}(\tau)$, the advantage is used:

$$ \nabla_{\theta} J(\pi_{\theta}) = \mathbb{E}_{t} \Big[ A_{t}^{\pi} \nabla_{\theta} \log \pi_{\theta} (a_{t} | s_{t})\Big] $$
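
To make this concrete, here is a minimal sketch (not a full A2C implementation) of how such an advantage-weighted policy-gradient loss could look in PyTorch; the tensor names and shapes are assumptions for illustration:

```python
import torch
from torch.distributions import Categorical

def actor_loss(logits, actions, advantages):
    """Advantage-weighted policy-gradient loss for a discrete policy.

    logits:     (batch, num_actions) raw outputs of the policy network
    actions:    (batch,) actions that were actually taken
    advantages: (batch,) advantage estimates, treated as constants
    """
    dist = Categorical(logits=logits)
    log_probs = dist.log_prob(actions)
    # Minimizing this loss performs gradient ascent on E[A_t * log pi(a_t|s_t)].
    return -(advantages.detach() * log_probs).mean()
```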

The critic is responsible for learning how to evaluate state-action pairs and uses this to generate $A^{\pi}$. To estimate the advantage function, we will go over two possible methods: n-step returns and Generalized Advantage Estimation (GAE). In general the advantage function measures the extent to which an action is better or worse than the policy's average action in a particular state:

$$ A^{\pi}(s_{t},a_{t}) = Q^{\pi}(s_{t},a_{t}) - V^{\pi}(s_{t}) $$

One benefit of using the advantage instead of $Q^{\pi}$ or $V^{\pi}$ is that it avoids penalizing an action for the policy currently being in a particularly bad state, like in the following example:

$$ Q^{\pi}(s,a) = 110, \quad V^{\pi}(s) = 100, \quad A^{\pi}(s,a) = 10 $$

$$ Q^{\pi}(s,a) = -90, \quad V^{\pi}(s) = -100, \quad A^{\pi}(s,a) = 10 $$

The advantage function is better able to capture the long-term effects of an action because it considers all future time steps while ignoring the effects of all the actions to date.


n-step Returns

As seen before, to calculate $A^{\pi}$ we need estimates for both $Q^{\pi}(s,a)$ and $V^{\pi}(s)$. In the n-step Returns method we achieve this by learning $V^{\pi}(s)$ and estimating $Q^{\pi}(s,a)$ from it:

$$ Q^{\pi}(s,a) = \mathbb{E}_{\tau \sim \pi} \Big[ r_{t} + \gamma r_{t+1} + \gamma^{2} r_{t+2} + \ldots + \gamma^{n} r_{t+n} \Big] + \gamma^{n+1} \hat{V}^{\pi}(s_{t+n+1}) $$

The expectation part of the Q-value estimate shown above is calculated from an n-step return, which means that we use our collected trajectory data to look $n$ steps into the future and sum up the discounted rewards. This part of the equation is unbiased but has high variance because it comes from only one trajectory. $n$ is a hyperparameter that needs to be tuned: the bigger the value of $n$, the higher the variance of the estimate. The return after the n-th step is estimated by the critic network. It has lower variance since it reflects an expectation over all of the trajectories seen so far, but it is biased because it is calculated with a function approximator. From this we get a formula for estimating the advantage:

$$ A_{NSTEP}^{\pi}(s_{t}, a_{t}) = Q^{\pi}(s_{t}, a_{t}) - V^{\pi}(s_{t}) $$

$$ A_{NSTEP}^{\pi}(s_{t}, a_{t}) \approx r_{t} + \gamma r_{t+1} + \gamma^{2} r_{t+2} + … + \gamma^{n} r_{t+n} + \gamma^{n+1} \hat{V}^{\pi}(s_{t+n+1}) - \hat{V}^{\pi}(s_{t}) $$
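
As a small illustration of the formula above, the n-step advantage for a single time step could be computed like this (assuming `rewards` holds the $n+1$ rewards $r_{t}, \ldots, r_{t+n}$ from one trajectory and the value estimates come from the critic):

```python
def n_step_advantage(rewards, v_s, v_bootstrap, gamma):
    """n-step advantage: discounted reward sum plus bootstrap value minus V(s_t).

    rewards:     [r_t, r_{t+1}, ..., r_{t+n}] from a single trajectory
    v_s:         critic estimate V(s_t)
    v_bootstrap: critic estimate V(s_{t+n+1})
    gamma:       discount factor
    """
    n = len(rewards) - 1
    discounted_sum = sum(gamma ** k * r for k, r in enumerate(rewards))
    return discounted_sum + gamma ** (n + 1) * v_bootstrap - v_s
```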


Generalized Advantage Estimation (GAE)

Generalized Advantage Estimation was proposed as an improvement over the n-step returns estimate of the advantage function. It addresses the problem of having to explicitly choose the number of return steps $n$. The main idea is that, instead of picking one value of $n$, we mix multiple values by taking a weighted average of the individual advantages calculated with $n = 1, 2, 3, \ldots, k$. This significantly reduces the variance of the estimator while keeping the introduced bias as low as possible.

$$ A_{GAE}^{\pi}(s_{t}, a_{t}) = \sum^{\infty}_{l=0}(\gamma \lambda)^{l} \delta_{t+l} $$

$$ \text{where } \delta_{t} = r_{t} + \gamma V^{\pi}(s_{t+1}) - V^{\pi}(s_{t}) $$

GAE takes a weighted average over a number of advantage estimators with different bias and variance. It weights the high-bias, low-variance 1-step advantage the most, but also includes contributions from lower-bias, higher-variance estimators using $2, 3, \ldots, n$ steps. The contributions decay at an exponential rate as the number of steps increases, and the decay rate is controlled by the coefficient $\lambda$ (the larger $\lambda$, the higher the variance). $\lambda$ represents a softer choice than $n$: smaller values of $\lambda$ weight the value-function estimate more heavily, whilst larger values weight the actual rewards more.
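
In code, this weighted sum of TD errors can be computed with a single backward pass over the trajectory; a minimal sketch (assuming `values` contains one extra bootstrap entry $V(s_{T})$):

```python
def gae_advantages(rewards, values, gamma, lam):
    """Generalized Advantage Estimation over one trajectory.

    rewards: [r_0, ..., r_{T-1}]
    values:  [V(s_0), ..., V(s_T)]  (one more entry than rewards, for bootstrapping)
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error delta_t
        running = delta + gamma * lam * running                 # exponentially weighted sum
        advantages[t] = running
    return advantages
```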


Algorithm & Network Architecture

But how do we learn the value function $V^{\pi}$? There are different possibilities here; one of them is Temporal Difference learning, similar to DQN. First we parameterize $V^{\pi}$ with $\phi$, then we generate $V_{tar}^{\pi}$ for each of the experiences the agent gathers. Finally we minimize the distance between $\hat{V}^{\pi}(s; \phi)$ and $V^{\pi}_{tar}$ using a simple regression loss such as MSE. You can use different methods to generate $V_{tar}^{\pi}$:

  • n-step estimate (shown first for one step, then for $n$ steps):

$$ V_{tar}^{\pi}(s_{t}) = r_{t} + \gamma \hat{V}^{\pi}(s_{t+1}; \phi) $$

$$ V_{tar}^{\pi}(s_{t}) = r_{t} + \gamma r_{t+1} + \gamma^{2} r_{t+2} + \ldots + \gamma^{n} r_{t+n} + \gamma^{n+1} \hat{V}^{\pi}(s_{t+n+1}) $$

  • Monte-Carlo estimate:

$$ V_{tar}^{\pi}(s_{t}) = \sum_{t'=t}^{T}\gamma^{t'-t} r_{t'} $$

  • GAE:

$$ V_{tar}^{\pi}(s_{t}) = A_{GAE}^{\pi}(s_{t}, a_{t}) + \hat{V}^{\pi}(s_{t}) $$

The choice is often related to the method used to estimate the advantage. It is also possible to use a different, more sophisticated optimization procedure (a trust region method) when learning the value function $\hat{V}^{\pi}$, which you can find in the original GAE paper [6].
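
Whichever target is used, fitting the critic then reduces to plain regression; a minimal PyTorch-style sketch, under the assumption that `value_net` maps a batch of states to scalar value estimates:

```python
import torch.nn.functional as F

def critic_loss(value_net, states, v_target):
    """MSE between the critic's predictions and the chosen value targets V_tar."""
    v_pred = value_net(states).squeeze(-1)        # (batch,)
    return F.mse_loss(v_pred, v_target.detach())  # targets are treated as constants
```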

You can run the actor-critic algorithm online as well as batched. Below are the simplified steps of an online version:

  1. take action $\mathbf{a} \sim \pi_{\theta}(\mathbf{a}|\mathbf{s})$, get $(\mathbf{s}, \mathbf{a}, \mathbf{s'}, r)$
  2. update $\hat{V}_{\phi}^{\pi}$ using target $r + \gamma \hat{V}_{\phi}^{\pi}(\mathbf{s'})$
  3. evaluate $\hat{A}^{\pi}(\mathbf{s}, \mathbf{a}) = r(\mathbf{s}, \mathbf{a}) + \gamma \hat{V}_{\phi}^{\pi}(\mathbf{s'}) - \hat{V}_{\phi}^{\pi}(\mathbf{s})$
  4. $\nabla_{\theta} J(\theta) \approx \nabla_{\theta} \log \pi_{\theta}(\mathbf{a}|\mathbf{s})\hat{A}^{\pi}(\mathbf{s}, \mathbf{a})$
  5. $\theta \leftarrow \theta + \alpha \nabla_{\theta} J(\theta)$
  6. repeat
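
A rough sketch of one such online update is shown below, using the 1-step advantage from step 3; `policy_net`, `value_net`, the optimizers and the gym-style environment interface are all assumptions for illustration:

```python
import torch
from torch.distributions import Categorical

def online_actor_critic_step(env, state, policy_net, value_net,
                             actor_opt, critic_opt, gamma=0.99):
    """One online actor-critic update (steps 1-5 above) for discrete actions."""
    # 1. sample an action from pi_theta(a|s) and step the environment
    logits = policy_net(state)
    dist = Categorical(logits=logits)
    action = dist.sample()
    next_state, reward, done, _ = env.step(action.item())

    # 2. update the critic towards the 1-step target r + gamma * V(s')
    with torch.no_grad():
        v_next = torch.zeros(1) if done else value_net(next_state)
        target = reward + gamma * v_next
    critic_loss = (value_net(state) - target).pow(2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # 3.-5. compute the 1-step advantage and take a policy-gradient step
    with torch.no_grad():
        advantage = reward + gamma * v_next - value_net(state)
    actor_loss = -(dist.log_prob(action) * advantage).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    return next_state, done
```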

To learn the parameterized functions for the actor and the critic, you can use two separate neural networks. But it is also possible, and conceptually appealing, to use a network structure with shared parameters, because learning $\pi$ and $V^{\pi}(s)$ for the same task are related problems and they share the same input. Moreover, the total number of learnable parameters is reduced, which makes the approach more sample efficient overall. Sharing the lower layers can be interpreted as learning a common representation of the state space. The upper layers are kept separate because the two heads have different tasks, so their structure and size can differ (a minimal sketch of such a shared architecture follows the list below). But sharing parameters also has some downsides:

  • learning becomes more unstable because the two gradients can have different scales
  • we need to balance the two, which is typically accomplished by adding a scalar weight to one of the losses for scaling
  • this adds another hyperparameter to tune
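
Here is a minimal sketch of such a shared architecture in PyTorch (layer sizes are arbitrary assumptions); the two losses would then typically be combined as `actor_loss + w * critic_loss`, with `w` being the scalar weight mentioned above:

```python
import torch.nn as nn

class SharedActorCritic(nn.Module):
    """Shared lower layers ("body") with separate policy and value heads."""

    def __init__(self, obs_dim, num_actions, hidden=128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.policy_head = nn.Linear(hidden, num_actions)  # logits for pi(a|s)
        self.value_head = nn.Linear(hidden, 1)             # scalar V(s)

    def forward(self, obs):
        features = self.body(obs)                          # shared state representation
        return self.policy_head(features), self.value_head(features)
```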

Proximal Policy Optimization (PPO)

Two of the main problems with policy gradient algorithms like REINFORCE are:

  • susceptibility to performance collapse (the agent suddenly starts performing badly)
  • sample inefficiency (on-policy, so no data reuse)

The PPO paper addresses both of these issues by introducing a surrogate objective which discourages large, risky policy updates (and thereby helps to avoid performance collapse) and enables limited reuse of off-policy data. This often leads to more stable and sample-efficient training. PPO methods have some of the benefits of trust region policy optimization (TRPO) but are much simpler to implement, more general and have better sample efficiency. PPO can be used to extend e.g. REINFORCE or Actor-Critic methods by replacing their original objective $J(\pi_{\theta})$ with the surrogate objective. It is currently one of the most popular policy gradient algorithms. The paper presents two different variants of the objective function, and we will dive into both.

Adaptive KL Penalty

This PPO version incorporates the Kullback-Leibler (KL) divergence as a penalty into the surrogate objective. The KL divergence is a measure of the information lost when one distribution is used to approximate another; you can find a nice explanation in [3]. The objective function looks like the following:

$$ J^{KLPEN}(\theta) = \mathbb{E}_{t}\Big[r_{t}(\theta)A_{t}-\beta \, KL \big(\pi_{\theta_{old}}(\cdot|s_{t}) \parallel \pi_{\theta}(\cdot|s_{t})\big)\Big] $$

$$ \text{with } r_{t}(\theta) = \frac{\pi_{\theta}(a_{t}|s_{t})}{\pi_{\theta_{old}}(a_{t}|s_{t})} $$

$\beta$ is an adaptive coefficient which controls the size of the KL penalty (the larger $\beta$, the smaller the allowed difference between $\pi_{\theta}$ and $\pi_{\theta_{old}}$). Because a fixed $\beta$ would have to work for different problems and different stages of training, it is not kept constant. Instead, $\beta$ gets recalculated after every policy update based on a heuristic which allows it to adapt over time (a small sketch of this update rule follows the list):

  • set a target value $\delta_{tar}$ for the expectation of the KL divergence and initialize $\beta$ to some starting value
  • use multiple epochs of minibatch SGD, optimize the KL-penalized surrogate objective
  • compute $\delta = \mathbb{E}_{t}\Big[ KL \big(\pi_{\theta_{old}}(\cdot|s_{t}) \parallel \pi_{\theta}(\cdot|s_{t})\big) \Big]$
    • if $\delta < \delta_{tar}/1.5$ then $\beta \leftarrow \beta/2$
    • else if $\delta > \delta_{tar} \times 1.5$ then $\beta \leftarrow \beta \times 2$
    • else pass
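
A tiny sketch of this update heuristic (the constants 1.5 and 2 are the values used in the paper):

```python
def update_kl_penalty(beta, kl, kl_target):
    """Adapt the KL penalty coefficient beta based on the observed KL divergence."""
    if kl < kl_target / 1.5:
        beta = beta / 2.0   # policy changed too little -> weaken the penalty
    elif kl > kl_target * 1.5:
        beta = beta * 2.0   # policy changed too much -> strengthen the penalty
    return beta             # otherwise beta stays unchanged
```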

This approach is simple to implement, but you still have to choose the target value $\delta_{tar}$. In general it is computationally expensive to calculate the KL divergence, which is one reason why the second form of PPO is often preferred.

Clipped Surrogate Objective

The clipped surrogate objective is much simpler and omits the calculation of the KL entirely:

$$ J^{CLIP}(\theta) = \mathbb{E}_{t} \Big[ \min \big( r_{t}(\theta) A_{t}, \text{clip}(r_{t}(\theta), 1 - \epsilon, 1 + \epsilon)A_{t} \big) \Big] $$

$\epsilon$ is a hyperparameter which defines the clipping neighbourhood and can be decayed during training. This objective function prevents parameter updates which could cause large and risky changes to the policy $\pi_{\theta}$. As mentioned before, you can use both of the described objectives with e.g. Actor-Critic methods or REINFORCE. For details take a look at the paper itself.
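
A minimal sketch of this objective as a PyTorch loss (negated, so it can be minimized with a standard optimizer); the inputs are assumed to be per-time-step log-probabilities and advantage estimates:

```python
import torch

def clipped_surrogate_loss(log_probs, old_log_probs, advantages, eps=0.2):
    """PPO clipped surrogate objective J^CLIP, negated for minimization."""
    ratio = torch.exp(log_probs - old_log_probs)                    # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # elementwise minimum = pessimistic (lower) bound on the objective
    return -torch.min(unclipped, clipped).mean()
```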

PPO algorithm Actor-Critic style

Figure 1. Algorithm - PPO Actor-Critic [2]

Below you can find a few results from the paper on several MuJoCo environments after one million steps of training. At that time it outperformed the previous methods on almost all of the continuous control environments.

MuJoCo environment results

Figure 2. MuJoCo environment results [2]


Alpha(Go)Zero

AlphaGo Zero is the successor of DeepMind's AlphaGo, which was the first program to defeat a world champion in the game of Go. It differs from AlphaGo in several important aspects: it was trained solely by self-play without any expert data or human supervision, only knowing the rules of the game (model-based), it uses only a single neural network, and it defeated AlphaGo by 100 games to 0. With AlphaZero, DeepMind generalized and extended the method to the games of chess and shogi.

AlphaGo Zero combines a neural network and Monte Carlo Tree Search (MCTS) in an elegant policy iteration framework to achieve stable learning and rapid improvement. Let's start by taking a closer look at the neural network.

Neural Network

The authors used a deep Convolutional Neural Network $f_{\theta}$, parameterized by $\theta$, which takes the board state $\mathbf{s}$ as input and outputs both move probabilities and a value, $(\mathbf{p}, v) = f_{\theta}(s)$. It combines the roles of the policy network and the value network in a single architecture, similar to the shared network which can be used in Actor-Critic methods. The vector of move probabilities $\mathbf{p}$ represents the probability of selecting each move $a$, $\mathbf{p}_{a} = Pr(a|s)$. The value $v$ is a scalar evaluation, estimating the probability of the current player winning from position $\mathbf{s}$. The network is trained at the end of each game of self-play on the data of each time step $t$, which is stored in the form $(s_{t}, \pi_{t}, z_{t})$. $\pi_{t}$ represents the search probabilities generated by MCTS and $z_{t} \in \{-1, +1\}$ is the game winner from the perspective of the current player (it gets added to each time step at the end of the game). The loss function $l$ sums the mean-squared error and the cross-entropy loss, plus an L2 regularization term:

$$ l = (z - v)^{2} - \pi^{T} \log \mathbf{p} + c || \theta ||^{2} $$

where $c$ is a hyperparameter controlling the level of L2 weight regularization. Figure 3 below illustrates this self-play RL procedure in AlphaGo Zero.

AlphaGo Zero self-play and neural network training

Figure 3. AlphaGo Zero - Self-play and neural network training [7]
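
As an illustration, the combined loss above could be computed roughly as sketched below (tensor names and shapes are assumptions; in practice the L2 term is usually implemented as weight decay in the optimizer):

```python
import torch.nn.functional as F

def alphago_zero_loss(p_logits, v, pi_target, z, model, c=1e-4):
    """Value MSE + policy cross-entropy + L2 regularization.

    p_logits:  (batch, num_moves) raw policy outputs of f_theta
    v:         (batch,) predicted values
    pi_target: (batch, num_moves) MCTS search probabilities
    z:         (batch,) game outcomes in {-1, +1}
    """
    value_loss = F.mse_loss(v, z)
    policy_loss = -(pi_target * F.log_softmax(p_logits, dim=-1)).sum(dim=-1).mean()
    l2 = sum((w ** 2).sum() for w in model.parameters())
    return value_loss + policy_loss + c * l2
```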

The neural network we just talked about is trained from games of self-play by a novel reinforcement learning algorithm. In each position $s$, an MCTS search is executed, guided by the neural network $f_{\theta}$, and it outputs search probabilities $\pi$ for playing each move. Based on these you can usually select a much stronger move than based on the raw move probabilities $\mathbf{p}$ output by the neural network $f_{\theta}(s)$, which makes MCTS a kind of policy improvement operator. The search uses the neural network $f_{\theta}$ for its simulations. Each edge $(s,a)$ in the search tree stores a prior probability $P(s,a)$, a visit count $N(s,a)$ and an action value $Q(s,a)$. Each simulation starts from the root state and iteratively selects moves that maximize an upper confidence bound

$$ U(s,a) = Q(s,a) + c \cdot P(s,a) \cdot \frac{\sqrt{\sum_{b}N(s,b)}}{1 + N(s,a)} $$

where $c$ is a hyperparameter which controls the degree of exploration. We initialize an empty search tree with $\mathbf{s}$ as the root. A single simulation proceeds as follows. First we compute the action $a$ that maximizes the upper confidence bound $U(s,a)$. We play this action, and if the resulting state $\mathbf{s'}$ already exists in our tree, we recursively search from $\mathbf{s'}$. If it does not exist, it gets added to the tree, $Q(s',a)$ and $N(s',a)$ are initialized to 0 for all $a$, and the neural network is used to initialize $P(s', \cdot) = \mathbf{p}$ and $v(s') = v_{\theta}(s')$. Instead of performing a rollout, $v(s')$ gets propagated up along the path visited in the current simulation and is used to update all $Q(s,a)$ values along the way. If, on the other hand, a terminal state is encountered, the actual game outcome gets propagated instead.

AlphaGo Zero Monte Carlo Tree Search Process

Figure 4. AlphaGo Zero - MCTS Search Process [7]
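
To make the search procedure more tangible, below is a heavily simplified sketch of one simulation (selection, expansion with the network, and value backup). It assumes a two-player zero-sum game where values are always from the perspective of the player to move, and the `game`/`net` interfaces are invented for illustration:

```python
import math

class Node:
    """Per-action edge statistics for one state in the search tree."""
    def __init__(self, prior=1.0):
        self.P = prior       # prior probability P(s, a) from the network
        self.N = 0           # visit count N(s, a)
        self.W = 0.0         # accumulated value
        self.children = {}   # action -> Node

    def Q(self):
        return self.W / self.N if self.N > 0 else 0.0

def simulate(node, state, game, net, c_puct=1.0):
    """One MCTS simulation; returns the value from the parent's perspective."""
    if game.is_terminal(state):
        return -game.outcome(state)              # propagate the real game result

    if not node.children:                        # leaf: expand with priors and value from f_theta
        priors, value = net.evaluate(state)      # assumed: dict {action: p}, scalar v
        for action, p in priors.items():
            node.children[action] = Node(prior=p)
        return -value                            # no rollout, back up v directly

    # selection: maximize U(s,a) = Q(s,a) + c * P(s,a) * sqrt(sum_b N(s,b)) / (1 + N(s,a))
    total_n = sum(child.N for child in node.children.values())
    def u(a):
        child = node.children[a]
        return child.Q() + c_puct * child.P * math.sqrt(total_n) / (1 + child.N)
    action = max(node.children, key=u)

    child = node.children[action]
    value = simulate(child, game.next_state(state, action), game, net, c_puct)

    # backup: update the statistics of the traversed edge
    child.N += 1
    child.W += value
    return -value
```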

DeepMind trained the AlphaGo Zero system for approximately three days, generating 4.9 million games of self-play, using 1,600 simulations for each MCTS. The neural net they used consisted of 20 residual blocks (they also trained a bigger network for longer).


Summary

In this post we looked at three combined RL methods: A2C, PPO and Alpha(Go)Zero. While Alpha(Go)Zero is a more specialized method for games, PPO and A2C are more generally applicable. I hope I was able to give you an understandable insight into the three algorithms and their most important components. In the next post of this series I am going to dive a bit deeper into model-based RL and imitation learning. Until then, stay tuned and healthy!

References

[1] Mnih et al. “Asynchronous Methods for Deep Reinforcement Learning” (2016).

[2] Schulman et al. “Proximal Policy Optimization Algorithms” (2017).

[3] Kullback-Leibler Divergence Explained (2017).

[4] Schulman et al. “Trust Region Policy Optimization” (2015).

[5] Silver et al. “Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm” (2017).

[6] Schulman et al. “High-Dimensional Continuous Control Using Generalized Advantage Estimation” (2015).

[7] Silver et al. “Mastering the game of Go without human knowledge” (2017).