In the last installment in this series on self-learning AI agents, I introduced deep Q-learning as an algorithm that can be used to teach AI to behave and solve tasks in discrete action spaces. However, this approach has shortcomings that can lower the performance of the AI agent.
In what follows, I’ll introduce one common deep Q-learning problem and show you how the vanilla implementation can be extended to what we call double deep Q-learning, which generally leads to better AI agent performance.
Why Use Double Deep Q-Learning?
How Can I Practice Double Deep Q-Learning?
This example of OpenAI’s Gym CartPole problem was solved with the double Q-learning algorithm presented here, as well as some techniques from last time. The well-documented source code can be found in my GitHub repository. I have chosen CartPole as an example because the training time for this problem is low and you can reproduce it yourself very quickly. Clone the repository and execute run_training.py to start the algorithm.
An Action-Value Function Refresher
In the first and second part of this series, I introduced the action-value function Q(s,a) as the expected return G_t the AI agent would get by starting in state s, taking action a and then following a certain policy π.
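Written out, a form of this definition consistent with the description here is the following. The discount factor γ and the time-indexed symbols s_t and a_t are assumed notation:

```latex
Q^{\pi}(s,a)
  = \mathbb{E}_{\pi}\left[ G_t \mid s_t = s,\; a_t = a \right]
  = \mathbb{E}_{\pi}\left[ r + \gamma\, Q^{\pi}(s',a') \mid s_t = s,\; a_t = a \right]
```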
The right part of the equation is also called the temporal difference target (TD target). The TD target is the sum of the immediate reward r the agent received for the action a in state s and the discounted value Q(s’,a’) (a’ being the action the agent will take in the next state s’).
Q(s,a) tells the agent the value (or quality) of a possible action a in a particular state s.
Given a state s, the action-value function calculates the quality/value for each possible action a_i in this state as a scalar value. Higher quality means a better action with regard to the given objective. For an AI agent, a possible objective could be learning how to walk or how to play chess against humans.
Following a greedy policy w.r.t. Q(s,a), i.e. taking the actions a’ that result in the highest values of Q(s,a’), leads to the Bellman optimality equation, which gives a recursive definition for Q(s,a). We can also use the Bellman equation to recursively calculate all values Q(s,a) for any given action or state.
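Written out with a discount factor γ (an assumed symbol, since the text only speaks of a discounted value), the Bellman optimality equation takes the standard form:

```latex
Q^{*}(s,a) = \mathbb{E}\left[ r + \gamma \max_{a'} Q^{*}(s',a') \;\middle|\; s,\, a \right]
```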
In part two of this series, I introduced temporal difference learning as a better approach to estimate the values Q(s,a). The objective in temporal difference learning was to minimize the distance between the TD target and Q(s,a), which suggests a convergence of Q(s,a) towards its true values in the given environment. This is Q-learning.
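As a minimal sketch of that idea, here is one tabular Q-learning step in Python. The table shape, the learning rate `alpha`, the discount factor `gamma` and the `done` flag are illustrative assumptions, not values taken from this series.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, done=False):
    """Apply one tabular Q-learning step in place and return the TD error."""
    # TD target: immediate reward plus the discounted best value in the next state.
    # For a terminal transition there is no next state to bootstrap from.
    td_target = r if done else r + gamma * np.max(Q[s_next])
    # The TD error is the distance between the TD target and the current estimate.
    td_error = td_target - Q[s, a]
    # Move Q(s, a) a small step towards the TD target.
    Q[s, a] += alpha * td_error
    return td_error

# Hypothetical toy problem: 3 states, 2 actions, all values initialised to zero.
Q = np.zeros((3, 2))
err = q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
```

Repeating this step over many transitions shrinks the TD error on average, which is the convergence described above.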
Deep Q-Networks
We’ve seen that a neural network approach turns out to be a better way to estimate Q(s,a). Nevertheless, the main objective stays the same: the minimization of the distance between Q(s,a) and the TD target (or temporal difference of Q(s,a)). We can express this objective as the minimization of the error loss function:
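In the squared-error form commonly used for deep Q-networks, the loss and its target read as follows. The squared error and the discount factor γ are assumptions here; θ_i and θ_{i-1} are the Q-network and target-network parameters:

```latex
L_i(\theta_i) = \mathbb{E}\left[ \bigl( y_i - Q(s,a;\theta_i) \bigr)^2 \right],
\qquad
y_i = r + \gamma \max_{a'} Q(s',a';\theta_{i-1})
```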
In deep Q-learning, we estimate the TD target y_i and Q(s,a) separately by two different neural networks, often called the target network and the Q-network (figure 4). The parameters θ(i-1) (weights, biases) belong to the target network, while θ(i) belong to the Q-network.
The actions of the AI agents are selected according to the behavior policy µ(a|s). The greedy target policy π(a|s), on the other hand, selects only actions a’ that maximize Q(s’,a’) (used to calculate the TD target).
We can minimize the error loss function with the usual gradient descent algorithms used in deep learning.
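To make the two policies concrete, here is a small sketch that assumes the behavior policy is epsilon-greedy. This is a common choice, but the text does not specify it; the function names and the `epsilon` parameter are hypothetical.

```python
import numpy as np

def behavior_policy(q_values, epsilon, rng):
    """Epsilon-greedy behavior policy: explore with probability epsilon."""
    if rng.random() < epsilon:
        # Exploration: pick any action uniformly at random.
        return int(rng.integers(len(q_values)))
    # Exploitation: pick the action with the highest estimated quality.
    return int(np.argmax(q_values))

def greedy_target_value(q_values_next):
    """Greedy target policy: value of the best action in the next state."""
    return float(np.max(q_values_next))

rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.9, 0.3])
action = behavior_policy(q_values, epsilon=0.1, rng=rng)
```

The behavior policy decides what the agent actually does, while the greedy value is only used when building the TD target.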
The Problem With Deep Q-Learning . . .
Deep Q-learning is known to sometimes learn unrealistically high action values because it includes a maximization step over estimated action values, which tends to prefer overestimated to underestimated values. We can see this in the TD target y_i calculation.
It’s still more or less an open question whether overestimations negatively affect the performance of AI agents in practice. Overoptimistic value estimates are not necessarily a problem in and of themselves. If all values were uniformly higher, the relative action preferences would be preserved and we would not expect the resulting policy to be any worse.
If, however, the overestimations are not uniform and not concentrated at states about which we wish to learn more, then they might negatively affect the quality of the resulting policy.
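The bias introduced by the maximization step can be illustrated with a small simulation. This is an idealized sketch, not the networks themselves: every action is assumed to have a true value of 0, and the estimates are corrupted by independent Gaussian noise. The second estimate previews the idea of decoupling selection from evaluation, which the next section introduces.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_trials = 10, 10_000
true_q = np.zeros(n_actions)  # assumption: every action is truly worth 0

# Two independent noisy estimates of the same true action values.
est_a = true_q + rng.normal(0.0, 1.0, size=(n_trials, n_actions))
est_b = true_q + rng.normal(0.0, 1.0, size=(n_trials, n_actions))

# Coupled: the same noisy estimate selects AND evaluates the action.
single = est_a.max(axis=1).mean()

# Decoupled: est_a selects the action, est_b evaluates it.
best = est_a.argmax(axis=1)
double = est_b[np.arange(n_trials), best].mean()

print(f"coupled max estimate:   {single:.3f}")
print(f"decoupled estimate:     {double:.3f}")
```

The coupled estimate comes out clearly above the true value of 0, because the max picks whichever action happens to have the largest positive noise. The decoupled estimate stays close to 0.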
Double Deep Q-Learning
The idea of double Q-learning is to reduce overestimations by decomposing the max operation in the target into action selection and action evaluation.
In the vanilla implementation, the action selection and action evaluation are coupled. We use the target network to select the action and estimate the quality of the action at the same time. What does this mean?
The target network calculates Q(s, a_i) for each possible action a_i in state s. The greedy policy picks the highest value Q(s, a_i), which selects action a_i. This means the target network selects the action a_i and simultaneously evaluates its quality by calculating Q(s, a_i). Double Q-learning tries to decouple these procedures from one another.
In double Q-learning the TD target looks like this:
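With θ_i for the Q-network, θ_{i-1} for the target network and an assumed discount factor γ, a form of this target consistent with the description that follows is:

```latex
y_i = r + \gamma\, Q\!\left( s',\; \operatorname*{arg\,max}_{a'} Q(s',a';\theta_i);\; \theta_{i-1} \right)
```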
As you can see, the max operation in the target is gone. While the target network with parameters θ(i-1) evaluates the quality of the action, the Q-network with parameters θ(i) determines the action. This procedure is in contrast to the vanilla implementation of deep Q-learning where the target network was responsible for action selection and evaluation.
We can summarize the calculation of the new TD target y_i in the following steps:

1. The Q-network uses the next state s’ to calculate the qualities Q(s’,a) for each possible action a in state s’.
2. The argmax operation applied to Q(s’,a) chooses the action a* that belongs to the highest quality (action selection).
3. The quality Q(s’,a*) (determined by the target network) that belongs to the action a* (determined by the Q-network) is selected for the calculation of the target (action evaluation).
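The three steps can be sketched in NumPy as follows. The arrays `q_next_online` and `q_next_target` stand in for the outputs of the Q-network and the target network on a batch of next states; these names, the `dones` mask for terminal transitions and the value of `gamma` are assumptions for illustration. A vanilla target is included for comparison.

```python
import numpy as np

def double_dqn_targets(rewards, dones, q_next_online, q_next_target, gamma=0.99):
    """TD targets with action selection and evaluation decoupled."""
    # Steps 1 and 2: the Q-network scores every next action, argmax selects a*.
    best_actions = np.argmax(q_next_online, axis=1)
    # Step 3: the target network evaluates the selected action a*.
    batch = np.arange(len(rewards))
    evaluated = q_next_target[batch, best_actions]
    # Terminal transitions (dones == 1) contribute only the immediate reward.
    return rewards + gamma * (1.0 - dones) * evaluated

def vanilla_dqn_targets(rewards, dones, q_next_target, gamma=0.99):
    """TD targets where the target network both selects and evaluates."""
    return rewards + gamma * (1.0 - dones) * q_next_target.max(axis=1)

# Hypothetical batch of two transitions with two actions each.
rewards = np.array([1.0, 0.0])
dones = np.array([0.0, 1.0])
q_next_online = np.array([[2.0, 1.0], [0.5, 3.0]])
q_next_target = np.array([[0.5, 4.0], [1.0, 2.0]])

y_double = double_dqn_targets(rewards, dones, q_next_online, q_next_target, gamma=0.9)
y_vanilla = vanilla_dqn_targets(rewards, dones, q_next_target, gamma=0.9)
```

In the first transition the Q-network prefers action 0, so the double target uses the target network's value 0.5 for that action, while the vanilla target takes the larger value 4.0.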
We can visualize the process of double Q-learning like this:
An AI agent starts in state s. Based on some previous calculations, it knows the qualities Q(s, a_1) and Q(s, a_2) for two possible actions in that state. The agent decides to take action a_1 and ends up in state s’.
The Q-network calculates the qualities Q(s’, a_1’) and Q(s’, a_2’) for possible actions in this new state. Action a_1’ is picked because it results in the highest quality according to the Q-network.
We can now calculate the new action-value Q(s, a_1) for action a_1 in state s with the equation in figure 2, where Q(s’,a_1’), the evaluation of a_1’, is determined by the target network.
Empirical Results
In this 2015 study, researchers tested deep Q-networks (DQNs) and double deep Q-networks (double DQNs) on several Atari 2600 games. You can see the normalized scores achieved by AI agents of those two methods, as well as the comparable human performance, in figure 3.
You can clearly see that these two versions of double DQNs achieve better performance in this area than the vanilla implementation.
With deep Q-learning and double deep Q-learning we learned to control an AI in discrete action spaces, where the possible actions may be as simple as going left or right, up or down.
However, many real-world applications of reinforcement learning, such as the training of robots or self-driving cars, require an agent to select optimal actions from continuous spaces where there are (theoretically) an infinite number of possible actions. This is where stochastic policy gradients come into play, the topic of the next article of this series.