Part 2: Policy Based Reinforcement Learning — OpenAI’s CartPole with the REINFORCE algorithm

NandaKishore Joshi
6 min read · Feb 26, 2022

In this article we will implement a policy-based reinforcement learning algorithm called REINFORCE to solve OpenAI’s CartPole problem.

In part 1 of this article we covered:

  1. Difference between Off-policy and On-policy algorithms
  2. Types of policy functions
  3. Implementation of policy function using neural network

The theoretical details of the REINFORCE algorithm were explained in the previous article using a GridWorld example. This article applies it to the CartPole problem.

What is the CartPole problem?

Fig 1 : Cartpole example

CartPole is a control environment provided by OpenAI’s gym framework. The objective is to keep the pole from falling over. CartPole has two actions, 0 and 1: 0 moves the cart left and 1 moves it right. The state is represented as a vector of length 4 containing the cart position, cart velocity, pole angle, and pole velocity. We receive a reward of +1 for every step the pole has not fallen over; the episode ends when the pole angle exceeds 12° from vertical or when the cart moves outside the window. Hence, the goal of CartPole is to maximize the length of the episode, since each step returns a reward of +1.

Let’s look into the code!!

First, we need to import gym and create the CartPole environment:

import gym

env = gym.make('CartPole-v0')                  # 1
state1 = env.reset()
action = env.action_space.sample()             # 2
state, reward, done, info = env.step(action)   # 3

1 — Creates an instance of the CartPole environment

2 — Samples a random action from the environment’s action space

3 — Passes the sampled action to the environment to get the next state and reward. done is False while the pole is still standing and True when the pole falls, indicating the end of the episode. info is a dictionary with diagnostic information that can help in debugging
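Putting these three calls together, a short random-agent loop (a minimal sketch, not part of the original listing) shows how done marks the end of an episode:

import gym

env = gym.make('CartPole-v0')
state = env.reset()
total_steps, done = 0, False
while not done:
    action = env.action_space.sample()            # random action: 0 (left) or 1 (right)
    state, reward, done, info = env.step(action)  # done flips to True once the pole falls
    total_steps += 1
print('Episode lasted', total_steps, 'steps')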

Build neural network!!

We now build and initialize a neural network that serves as the policy network. This network accepts the state vector as input and produces a discrete probability distribution over the possible actions. You can think of the agent as a thin wrapper around the policy network that samples from this distribution to take an action.

import gym
import numpy as np
import torch

l1 = 4     # 1
l2 = 150   # 2
l3 = 2     # 3

model = torch.nn.Sequential(
    torch.nn.Linear(l1, l2),
    torch.nn.LeakyReLU(),
    torch.nn.Linear(l2, l3),
    torch.nn.Softmax(dim=-1)   # 4 — softmax over the action dimension
)

learning_rate = 0.0009
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

1 — Input length of 4 for the state vector

2 — The middle layer produces a vector of length 150

3 — The third layer produces a vector of length 2, one element for each of the left and right actions

4 — The output is a softmax layer that produces the probability distribution over the actions

Let’s see how this network interacts with the environment:

pred = model(torch.from_numpy(state1).float())                    # 1
action = np.random.choice(np.array([0, 1]), p=pred.data.numpy())  # 2
state2, reward, done, info = env.step(action)                     # 3

1 — Calls the policy network model to produce predicted action probabilities

2 - Samples an action from the probability distribution produced by the policy network

3 - Takes the action and receives the new state and reward. The info variable is produced by the environment but is irrelevant.

The policy network might return a discrete probability distribution in the form of a vector, such as [0.25, 0.75] for our two possible actions in CartPole. This means the policy network predicts that action 0 is the best with 25% probability, and action 1 is the best with 75% probability (or confidence). We call this array pred.
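To see what sampling from such a distribution does in practice, here is a small illustration; the [0.25, 0.75] vector is just the example from the text, not a real network output:

import numpy as np

pred = np.array([0.25, 0.75])                        # example action probabilities
actions = np.random.choice(np.array([0, 1]), size=1000, p=pred)
print((actions == 1).mean())                         # roughly 0.75: action 1 gets picked ~75% of the time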

The environment responds to the action by producing a new state, s2, and a reward, r2. We store those into two arrays (a states array and an actions array) for when we need to update our model after the episode ends. We then feed the new state into the model to get new action probabilities, sample and take an action, store the results, and repeat until the episode ends (the pole falls over and the game is finished).

Let’s say that the current state is S5 and the policy network returns the probability distribution [0.25, 0.75]. Based on this we take action 1 (probability 0.75), the pole falls over, and the episode ends. The total duration is T = 5, and at each step we took an action based on the predicted probabilities and stored the probability of the action actually taken in an array, which might look like [0.5, 0.3, 0.25, 0.5, 0.75]. We multiply these with the discounted rewards.

Computing the discounted rewards!!

Our discounted-reward strategy is that the last action is discounted the most, since it is the one most responsible for the fall, and the first action the least. If the episode lasted 5 time steps, we create the return array [5, 4, 3, 2, 1]. To compute the discounted rewards, we make an array of γ^t by taking our γ parameter, which may be set to 0.99 for example, and exponentiating it according to the position of each step in the episode. For example, we start with gamma_t = [0.99, 0.99, 0.99, 0.99, 0.99], create an array of exponents exp = [0, 1, 2, 3, 4], and compute torch.pow(gamma_t, exp), which gives us approximately [1.0, 0.99, 0.98, 0.97, 0.96].

def discount_rewards(rewards, gamma=0.99):
    lenr = len(rewards)
    disc_return = torch.pow(gamma, torch.arange(lenr).float()) * rewards   # 1
    disc_return /= disc_return.max()                                       # 2
    return disc_return

1 — Computes exponentially decaying rewards

2 — Normalizes the rewards to be within the [0,1] interval to improve numerical stability
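As a quick check, applying the discount_rewards function defined above to the 5-step return array from the example gives the following (output values are approximate):

import torch

returns = torch.Tensor([5., 4., 3., 2., 1.])   # returns for a 5-step episode, first step first
print(discount_rewards(returns))
# tensor([1.0000, 0.7920, 0.5881, 0.3881, 0.1921]) -> the last action carries the smallest weight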

Define loss function!!

We make our loss function the negative log-probability of the action given the state, scaled by the reward returns. In PyTorch, this is defined as -1 * torch.sum(r * torch.log(preds)). We compute the loss with the data we’ve collected for the episode, and run the torch optimizer to minimize the loss. Let’s run through some actual code.

def loss_fn(preds, r):                              # 1
    return -1 * torch.sum(r * torch.log(preds))     # 2

1 - The loss function expects an array of action probabilities for the actions that were taken and the discounted rewards.

2 — It computes the log of the probabilities, multiplies by the discounted rewards, sums them all, and flips the sign.
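As a rough numeric illustration, using the example action-probability array from earlier and the discounted returns computed above (the numbers are only for intuition, not from a real run):

import torch

probs = torch.Tensor([0.5, 0.3, 0.25, 0.5, 0.75])     # probabilities of the actions actually taken
disc_r = torch.Tensor([1.0, 0.79, 0.59, 0.39, 0.19])  # discounted, normalized returns
print(loss_fn(probs, disc_r))                          # about 2.79; the loss shrinks as high-return actions get higher probability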

Having seen how to build the policy network, the discounted rewards, and the loss function, let’s put everything together and look at the entire training loop.

Let’s train!!

MAX_DUR = 200
MAX_EPISODES = 500
gamma = 0.99
score = []                                                                   # 1
for episode in range(MAX_EPISODES):
    curr_state = env.reset()
    done = False
    transitions = []                                                         # 2

    for t in range(MAX_DUR):                                                 # 3
        act_prob = model(torch.from_numpy(curr_state).float())               # 4
        action = np.random.choice(np.array([0,1]), p=act_prob.data.numpy())  # 5
        prev_state = curr_state
        curr_state, _, done, info = env.step(action)                         # 6
        transitions.append((prev_state, action, t+1))                        # 7
        if done:                                                             # 8
            break

    ep_len = len(transitions)                                                # 9
    score.append(ep_len)
    reward_batch = torch.Tensor([r for (s,a,r) in transitions]).flip(dims=(0,))  # 10
    disc_rewards = discount_rewards(reward_batch)                            # 11
    state_batch = torch.Tensor([s for (s,a,r) in transitions])               # 12
    action_batch = torch.Tensor([a for (s,a,r) in transitions])              # 13
    pred_batch = model(state_batch)                                          # 14
    prob_batch = pred_batch.gather(dim=1, index=action_batch.long().view(-1,1)).squeeze()  # 15
    loss = loss_fn(prob_batch, disc_rewards)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

1 — A list to keep track of the episode length over training time

2 — A list of (state, action, step-count) tuples for the episode; the environment’s reward is ignored and the elapsed time t+1 is stored in its place

3 — While in the episode

4 — Gets the action probabilities

5 — Selects an action stochastically

6 — Takes the action in the environment

7 — Stores this transition

8 — If the game is lost, breaks out of the loop

9 — Stores the episode length

10 — Collects all the rewards in the episode in a single tensor

11 — Computes the discounted version of the rewards

12 — Collects the states in the episode in a single tensor

13 — Collects the actions in the episode in a single tensor

14 — Recomputes the action probabilities for all the states in the episode

15 — Subsets the action-probabilities associated with the actions that were actually taken
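The gather call in step 15 can be a bit opaque, so here is a small standalone illustration with made-up numbers (not actual network outputs):

import torch

pred_batch = torch.Tensor([[0.3, 0.7],
                           [0.6, 0.4],
                           [0.2, 0.8]])   # action probabilities for 3 states
action_batch = torch.Tensor([1, 0, 1])    # actions actually taken in those states
prob_batch = pred_batch.gather(dim=1, index=action_batch.long().view(-1, 1)).squeeze()
print(prob_batch)                          # tensor([0.7000, 0.6000, 0.8000])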

After training the policy network for 500 episodes, we get episode durations (total rewards) as shown below.

Fig 2 : Total rewards during training

We get a plot that demonstrates the agent really is learning how to play CartPole. OpenAI’s documentation says the game is considered “solved” if the agent can play an episode beyond 200 time steps. Although the plot looks like it tops out at around 190, that is because it is a moving-average plot: many episodes reach 200, but the agent occasionally fails early at random, which pulls the average down a bit.
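For reference, a plot like Fig 2 can be produced from the score list with something along these lines (a sketch assuming matplotlib; the 50-episode moving-average window is an assumption, not taken from the original code):

import numpy as np
import matplotlib.pyplot as plt

def running_mean(x, window=50):
    # simple moving average over the recorded episode durations
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode='valid')

plt.plot(running_mean(np.array(score)))
plt.xlabel('Episode')
plt.ylabel('Episode duration (moving average)')
plt.show()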

End Note:
In part 1 of this article we covered in detail how policy-based reinforcement learning algorithms work. In this article we have seen how to implement one to solve a concrete problem.

Please feel free to reach out to me in case of any doubts.

Part 1 of this article:

https://nandakishorej8.medium.com/part-1-policy-based-reinforcement-learning-a-detailed-study-1d4e9b8b5239
