Part 1: Policy-Based Reinforcement Learning — A Detailed Study

NandaKishore Joshi
10 min read · Feb 26, 2022


In this article, let's try to get a detailed understanding of three questions: What are on-policy and off-policy reinforcement learning? What are the types of policy functions? And how do we implement a policy function as a neural network?

A good teacher always explains the most complicated topics in a simple and effective way. This article is inspired by the book by Brandon Brown and Alexander Zai.

We will start this article by comparing off-policy and on-policy algorithms. Then we will look at different types of policy functions. In the last section, we will see how to implement a policy function as a neural network.

What are Off-Policy and On-Policy algorithms in Reinforcement learning?

Well, we will answer this question with examples.

We will use the deep Q-network (DQN) to talk about off-policy algorithms. If you are not familiar with deep Q-networks, don't worry; we will give a basic introduction. If you want to enhance your learning on DQN along with an implementation, please go through this article.

A deep Q-network (DQN) is an off-policy algorithm that approximates the Q function with a neural network. The output of the Q-network is a Q value for each action in a given state. The Q value is the expected (weighted average) reward for taking action a in a particular state S.

Fig 1 : Illustration of a Deep Q-network

A Q-network takes a state and returns Q values (action values) for each action. We can use those action values to decide which action to take. Various strategies can be employed to select an action from the predicted Q values. The epsilon-greedy approach is one well-known strategy: we select an action at random with probability ε, and with probability 1 − ε we select the action associated with the highest Q value (the action the Q-network predicts is best, given its experience so far).
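To make this concrete, here is a minimal sketch of epsilon-greedy action selection on top of a Q-network, assuming PyTorch, a 64-dimensional flattened Gridworld state, and four actions; the architecture and layer sizes are illustrative assumptions, not the exact network from the DQN article linked above.

```python
import torch
import torch.nn as nn

# Illustrative Q-network: 64-dim state in, one Q value per action out.
q_net = nn.Sequential(
    nn.Linear(64, 150),
    nn.ReLU(),
    nn.Linear(150, 4),   # 4 actions: up, down, left, right
)

def epsilon_greedy(state, epsilon=0.1):
    """Pick a random action with probability epsilon, otherwise the argmax-Q action."""
    if torch.rand(1).item() < epsilon:
        return torch.randint(0, 4, (1,)).item()   # explore
    with torch.no_grad():
        q_values = q_net(state)                   # shape: (4,)
    return q_values.argmax().item()               # exploit

state = torch.randn(64)                           # stand-in for a flattened Gridworld state
action = epsilon_greedy(state, epsilon=0.1)
```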

What if, instead of returning Q values (action values), our neural network directly predicted the action? If we do that, our neural network is called a policy function or a policy network. A policy function accepts the state and returns the best action. More precisely, it returns a probability distribution over the actions, which can be used to pick the best action.

Fig 2 : On Policy Network

Unlike an off-policy method such as DQN, an on-policy network has no need for a separately designed action-selection algorithm (e.g., epsilon-greedy). The action corresponding to the highest probability can be chosen directly at each state.

Let's say the probability distribution P(A|S) is a jar filled with notes, each with an action written on it. In a game with four actions labelled 1-4, each action fills the jar with notes bearing its label. If the policy network predicts that action 2 will return the highest reward, the jar will be filled with lots of notes labelled 2 and only a few labelled 1, 3, and 4. Now, to choose an action, all we do is close our eyes and pick a note at random. We are most likely to pick a note labelled 2, but sometimes we will pick 1, 3, or 4, which leads to exploration. Using this analogy, every time the state of the environment changes, we give the new state to our policy network, and it uses that state to fill the jar with a new set of labelled notes in different proportions. Then we randomly pick from the jar.
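In code, the "jar" is just the softmax output of the policy network, and drawing a note is sampling from a categorical distribution. Below is a rough PyTorch sketch; the 64-dimensional state, the hidden layer size, and the example probabilities are assumptions for illustration only.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Illustrative policy network: state in, probability distribution over 4 actions out.
policy_net = nn.Sequential(
    nn.Linear(64, 150),
    nn.ReLU(),
    nn.Linear(150, 4),
    nn.Softmax(dim=-1),   # turns raw scores into action probabilities
)

state = torch.randn(64)                         # stand-in for the current Gridworld state
probs = policy_net(state)                       # the "jar", e.g. tensor([0.1, 0.7, 0.1, 0.1])
action = Categorical(probs).sample().item()     # draw one "note" at random
```

Because the draw is random, the high-probability action is chosen most of the time, but the other actions are still picked occasionally, which is exactly the built-in exploration described above.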

These kinds of algorithms are called policy gradient methods. There are some important differences between policy gradient and DQN methods. One, already mentioned, is how the best action is chosen: DQN needs an additional algorithm to select it, whereas a policy network directly outputs the highest probability for the best action. Another important difference is that DQN relies on techniques like experience replay and target networks to stabilize the learning process (read more about them here). A policy network does not have these complexities.

Types of Policy Gradient Algorithms

  1. Stochastic policy gradient

The policy method we described with the jar example above is the stochastic policy gradient. The output of our neural network is an action vector that represents a probability distribution.

Fig 3 : Stochastic policy gradient

A policy function accepts a state and returns a probability distribution over actions. It is stochastic because it returns a probability distribution over actions rather than a deterministic, single action. This means that if our agent ends up in the same state twice, it may not take the same action, since the other actions keep small probabilities. This allows for exploring new policies. In the figure above, we feed the agent the state (1, 2) and the output is a vector of probabilities over the four possible actions. We can see that the first action has the highest probability of 0.5, followed by the third and fourth, while the second action is not recommended at all.
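Assuming the distribution in figure 3 looks something like [0.5, 0.0, 0.3, 0.2] (the exact values for the third and fourth actions are an assumption read off the description), a quick NumPy sketch shows how the same state can lead to different actions on different visits:

```python
import numpy as np

# Assumed policy output for state (1, 2): action 1 likely, action 2 impossible.
probs = np.array([0.5, 0.0, 0.3, 0.2])

# Stochastic policy: sample an action index each time the state is visited.
samples = np.random.choice(4, size=10, p=probs)
print(samples)   # e.g. [0 2 0 3 0 0 2 0 0 3]; index 1 never appears, its probability is 0
```

A deterministic policy, by contrast, would simply return `np.argmax(probs)` every time.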

2. Deterministic policy gradient

If the environment is stationary (say, the states and rewards are fixed), we can use a deterministic strategy: we'd expect the probability distribution to eventually converge to a degenerate probability distribution, one that places all of its mass on a single action.

Fig 4 : Deterministic policy gradient

A deterministic policy function takes a state and returns a specific action to take, unlike a stochastic policy, which returns a probability distribution over actions.

How do policy gradient methods explore?

We use algorithms like epsilon-greedy with DQN (off-policy) to make it explore the action space. How do we ensure that on-policy methods do sufficient exploration of the action space?

For the stochastic policy gradient method, because the output is a probability distribution, there is always a small chance of exploring every action; only after sufficient exploration will the action distribution converge toward producing the single best action, a degenerate distribution. And if the environment itself has some randomness, the probability distribution will retain some probability mass on each action. When we initialize the model, the probability of the agent picking each action should be approximately equal (uniform), since the model has no information yet about which action is better.

Even with the deterministic policy gradient, we start with an approximately equal (or random) probability distribution. As training progresses, the agent learns the best action and eventually outputs only that action, driving the probabilities of the other actions to zero.

Implementing policy functions with neural networks

Neural networks need an objective function that is differentiable with respect to the network weights. A deep Q-network is trained to minimize the mean squared error (MSE) between the predicted Q values and the target Q values. We had a nice formula for calculating the target Q value based on the observed reward, since Q values are just averaged rewards (i.e., expectations), so this was not much different from how we would normally train a supervised deep learning model.
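For reference, the DQN objective just described might be sketched as follows (PyTorch is assumed; the one-step bootstrapped target and the `target_net` variable follow the standard DQN recipe and are not taken verbatim from any particular implementation):

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()
gamma = 0.9   # discount factor, illustrative value

def dqn_loss(q_net, target_net, state, action, reward, next_state, done):
    """MSE between the predicted Q(s, a) and the bootstrapped target Q value."""
    q_pred = q_net(state)[action]                     # predicted Q value of the action taken
    with torch.no_grad():
        q_next = target_net(next_state).max()         # best Q value in the next state
    q_target = reward + gamma * q_next * (1 - done)   # no bootstrapping on terminal states
    return mse(q_pred, q_target)
```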

Unlike DQN, a policy network predicts actions, and there is no way to come up with a target vector of the actions we should have taken instead, given only the rewards. All we know is whether an action led to positive or negative rewards. In fact, what the best action is secretly depends on a value function, but with a policy network we are trying to avoid computing these action values directly.

Then how do we train and optimize our policy network?

Let's understand this with an example: Gridworld.

Fig 5 : GridWorld Game

The Gridworld game depicted in figure 5 is a simple version of Gridworld. The agent (A) must navigate along the shortest path to the goal tile (+) while avoiding the pit (–). Our goal is to train a neural network to play Gridworld from scratch, and the agent will have access to what the board looks like. There are four possible actions: up, down, left, and right. A reward is received based on the action the agent takes: -1 for each move, -10 for any action that leads into the pit or into a wall, and +10 for successfully reaching the goal.
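A tiny sketch of that reward scheme is below; the helper flags (`reached_goal`, `fell_in_pit`, `hit_wall`) are hypothetical stand-ins for whatever the actual Gridworld environment exposes.

```python
def gridworld_reward(reached_goal: bool, fell_in_pit: bool, hit_wall: bool) -> int:
    """Reward scheme described above: +10 for the goal, -10 for pit or wall, -1 per move."""
    if reached_goal:
        return 10
    if fell_in_pit or hit_wall:
        return -10
    return -1
```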

Having understood the Gridworld game, let's define some notation for the policy network we are training. Our policy network is denoted π and is parameterized by a vector θ, which represents all of the parameters (weights) of the neural network. Whenever we run the policy network forward, the parameter vector θ is fixed; the variable is the data that gets fed into the policy network (i.e., the state). Hence, we denote the parameterized policy as π_θ (θ written as a subscript).

Fig 6 : Initial untrained policy network

Let's say we give our initially untrained policy network π_θ some initial game state for Gridworld, denoted s, and run it forward by computing π_θ(s). It returns a probability distribution over the four possible actions, such as [0.25, 0.25, 0.25, 0.25]. We sample from this distribution, and since it's a uniform distribution, we end up taking a random action. We continue to take actions by sampling from the produced action distribution until we reach the end of the episode.
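A rollout loop matching this description might look like the sketch below, reusing the `policy_net` sketched earlier; the `env` object with `reset()` and `step()` methods returning `(next_state, reward, done)` is a hypothetical interface, not the actual Gridworld code.

```python
import torch
from torch.distributions import Categorical

def collect_episode(env, policy_net):
    """Run the policy until the episode ends; return the (state, action, reward) transitions."""
    transitions = []
    state, done = env.reset(), False
    while not done:
        probs = policy_net(torch.as_tensor(state, dtype=torch.float32))
        action = Categorical(probs).sample().item()   # sample from the distribution, don't argmax
        next_state, reward, done = env.step(action)   # hypothetical Gridworld interface
        transitions.append((state, action, reward))
        state = next_state
    return transitions
```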

An episode is a sequence of states, actions, and rewards from an initial state to the terminal state where we win or lose the game. We denote this episode as

Fig 7 : Episode representation

Let's say that at the third step we reach our goal and end the episode with a reward of +10. Then our episode can be represented as

Fig 8 : Sample episode

From figure 8 we can see that we took actions 3, 1, and 3 at states S0, S1, and S2 respectively. Our episode ended at S2 with a reward of +10. Since action 3 gave us a good reward, we want to reinforce it, making smooth updates to our parameters that encourage the network to assign more probability to these winning actions in the future.

A naïve approach would be to set the probability of action 3 (element 4) in the probability distribution to 1, i.e., [0, 0, 0, 1]. This is what we would do in a supervised classification model. But in RL we want the update to be smooth and to keep a small chance of exploring possibly better actions. We want something like the image below.

Fig 9 : Reinforcing the action

The reward signal is used to reinforce the action that was taken, that is, it increases the probability of that action given the state if the reward is positive, or it decreases the probability if the reward is negative. Notice that we only received information about action 3 (element 4), but since the probabilities must sum to 1, we have to lower the probabilities of the other actions.
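As a purely illustrative example of what such a smooth update looks like (the numbers are made up), a single positive-reward update might nudge the distribution rather than jumping to a one-hot vector:

```python
before = [0.25, 0.25, 0.25, 0.25]   # untrained, uniform
after  = [0.23, 0.23, 0.23, 0.31]   # action 3 (element 4) reinforced a little; still sums to 1
```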

In order to reinforce action 3, we want to modify our policy network parameters θ such that we increase π_s(a_3|θ). Our objective function merely needs to maximize π_s(a_3|θ), where a_3 is action 3 in our example. Before training, π_s(a_3|θ) = 0.25, but we want to modify θ such that π_s(a_3|θ) > 0.25. Because all of the probabilities must sum to 1, maximizing π_s(a_3|θ) will automatically minimize the other action probabilities. And since deep learning optimizers are built to minimize an objective function, we recast the maximization as a minimization: we minimize 1 − π_s(a_3|θ). This loss approaches 0 as π_s(a_3|θ) nears 1, so the gradients push the network to maximize π_s(a_3|θ) for the action we took.

Let's make this formula more numerically stable. Because probabilities lie between 0 and 1, these values can become extremely tiny, which does not play well with optimizers. We can use the logarithm to overcome this problem, so our objective becomes minimizing the function

Fig 10 : Log objective function
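Read off the surrounding text (the figure itself only survives as a caption, so this is a reconstruction), the log objective for a single action a taken in state s is:

```latex
% Reconstruction of the log objective in Fig 10
\mathcal{L}(\theta) = -\log \pi_s(a \mid \theta)
```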

Now we know our objective function, but should we assign equal weight to all the actions in an episode? In a game like Gridworld the last action plays a very important role, while the earliest actions are usually the least responsible for the final outcome; the importance of an action grows with the time step. Keeping this in mind, our final objective function becomes

Fig 11 : Final objective function

Here γ_t is the discount factor, and the subscript t tells us that its value depends on the time step t, since we want to discount more distant (earlier) actions more than more recent ones. The parameter G_t is called the total return, or future return, at time step t. It is the return we expect to collect from time step t until the end of the episode, and it can be approximated by adding up the rewards from that time step until the end of the episode.

Fig 12 : Total return
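Written out in symbols (a reconstruction consistent with the description above, since only the captions of figures 11 and 12 survive), the final objective and the approximate total return are:

```latex
% Reconstruction of Fig 11 (final objective) and Fig 12 (total return)
\mathcal{L}(\theta) = -\,\gamma_t \, G_t \, \log \pi_s(a_t \mid \theta),
\qquad
G_t \approx \sum_{k=t}^{T} r_k
```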

With this objective in place, the sequence of discounted rewards from the start to the terminal state of a winning Gridworld game could look like [0.970, 0.980, 0.99, 1.0]: the last action is given more prominence than the initial ones.
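Those numbers can be reproduced with a short sketch, assuming γ = 0.99 and a four-step winning episode whose terminal step contributes a normalized value of 1.0 (both assumptions are inferred from the numbers, not stated in the figure):

```python
import numpy as np

gamma = 0.99
T = 4                                    # number of steps in the winning episode

# Each step t is discounted by gamma^(T-1-t), so later steps count for more.
discounts = gamma ** np.arange(T - 1, -1, -1)
print(discounts.round(3))                # [0.97  0.98  0.99  1.  ]
```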

A string diagram explaining the entire training process of a policy network is given below

Fig 13 : Training process

The policy network is a neural network, parameterized by θ (the weights), that accepts a 64-dimensional vector as the input state. It produces a discrete 4-dimensional probability distribution over the actions. The sample-action box samples an action from that distribution and produces an integer, which is given to the environment (to produce a new state and reward) and to the loss function so we can reinforce that action. The reward signal is also fed into the loss function, which we attempt to minimize with respect to the policy network parameters.
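Tying the string diagram together, a single training update might look like the sketch below, reusing the `policy_net` and `collect_episode` sketches from earlier. It previews the REINFORCE algorithm implemented in part 2, with the discount folded into the return computation (a common variant), so treat it as an outline rather than the article's final code.

```python
import numpy as np
import torch
from torch.distributions import Categorical

optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)
gamma = 0.99

def train_on_episode(transitions):
    """One policy-gradient update: weight each action's log-probability by its return."""
    states, actions, rewards = zip(*transitions)

    # Discounted returns G_t, computed backwards from the end of the episode.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns, dtype=torch.float32)

    states = torch.as_tensor(np.array(states), dtype=torch.float32)
    actions = torch.as_tensor(actions)
    probs = policy_net(states)                          # (T, 4) action probabilities
    log_probs = Categorical(probs).log_prob(actions)    # log pi_s(a_t | theta)

    loss = -(returns * log_probs).sum()                 # minimize -G_t * log pi
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```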

So far we have learnt:

  1. Difference between Off-policy and On-policy algorithms
  2. Types of policy functions
  3. Implementation of policy function using neural network

In the next part of this article we will implement a policy-based reinforcement learning algorithm called REINFORCE to solve OpenAI's CartPole problem.

Check out the part 2 of this article here:

https://nandakishorej8.medium.com/part-2-policy-based-reinforcement-learning-openais-cartpole-with-reinforce-algorithm-18de8cb5efa4
