Part 2: Evolutionary Algorithms for Reinforcement Learning — Solving OpenAI’s CartPole

Solving the CartPole problem with a genetic algorithm

NandaKishore Joshi
Apr 29, 2022

In the first part of this article we discussed why evolutionary algorithms are a useful alternative for reinforcement learning, and we walked through the steps involved in implementing a genetic algorithm for it. In this article, let's implement a genetic algorithm to solve OpenAI's famous CartPole problem.

Figure 1 : Cartpole game

We will use the CartPole environment to test our agent. The agent is rewarded for keeping the pole upright, and it can move the cart left or right. We can represent an agent as a neural network that approximates the policy function — it accepts a state and outputs an action, or more typically a probability distribution over actions. The following listing shows an example of a three-layer network.

import numpy as np
import torch

def model(x, unpacked_params):
    l1, b1, l2, b2, l3, b3 = unpacked_params       # 1
    y = torch.nn.functional.linear(x, l1, b1)      # 2
    y = torch.relu(y)                              # 3
    y = torch.nn.functional.linear(y, l2, b2)
    y = torch.relu(y)
    y = torch.nn.functional.linear(y, l3, b3)
    y = torch.log_softmax(y, dim=0)                # 4
    return y
  • 1-Unpacks the parameter vector into individual layer matrices
  • 2-A simple linear layer with bias
  • 3-A rectified linear unit activation function
  • 4-The last layer will output log probabilities over actions.

The above function defines a three-layer neural network: the first two layers use ReLU activations, and the last applies log_softmax so that the output is log probabilities over the actions. Notice that this function expects an input state, x, and unpacked_params, a tuple of individual parameter matrices used in each layer.

In evolutionary algorithms, the fittest agents are selected, recombined, and slightly mutated to produce better children. To make recombination and mutation easier, we will create a population of parameter vectors (1-tensors) that we then "unpack," or decompose, into the individual parameter matrices used in each layer of the neural network.

def unpack_params(params, layers=[(25,4), (10,25), (2,10)]):   # 1
    unpacked_params = []                                       # 2
    e = 0
    for i, l in enumerate(layers):                             # 3
        s, e = e, e + np.prod(l)
        weights = params[s:e].view(l)                          # 4
        s, e = e, e + l[0]
        bias = params[s:e]
        unpacked_params.extend([weights, bias])                # 5
    return unpacked_params
  • 1-The layers parameter specifies the shape of each layer matrix.
  • 2-Stores each individual layer tensor
  • 3-Iterates through each layer
  • 4-Unpacks the individual layer into matrix form
  • 5-Adds the unpacked tensor to the list

The above function takes a flat parameter vector as params and the specification of the layers it contains as layers, and it unpacks the vector into individual layer matrices and bias vectors stored in a list. By default a three-layer neural network is specified, consisting of weight matrices of dimensions 25x4, 10x25, and 2x10 and bias vectors of lengths 25, 10, and 2. Hence, the total number of parameters in the flattened vector is 25 * 4 + 25 + 10 * 25 + 10 + 2 * 10 + 2 = 407. The only reason we're adding the complexity of flattened parameter vectors that must be unpacked before use is that we want to mutate and recombine the entire set of parameters at once, which ends up being simpler overall.
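To sanity-check that bookkeeping, we can unpack a random 407-element vector, confirm the layer shapes, and run a dummy 4-dimensional CartPole state through the model. This is just a quick verification sketch using the two functions defined above.

flat_params = torch.randn(407)
unpacked = unpack_params(flat_params)
print([tuple(t.shape) for t in unpacked])
# [(25, 4), (25,), (10, 25), (10,), (2, 10), (2,)]
state = torch.randn(4)                     # a dummy 4-dimensional CartPole state
log_probs = model(state, unpacked)
print(log_probs.shape)                     # torch.Size([2]): log probabilities for the 2 actions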

Next let’s add a function to create a population of agents.

def spawn_population(N=50, size=407):        # 1
    pop = []
    for i in range(N):
        vec = torch.randn(size) / 2.0        # 2
        fit = 0
        p = {'params': vec, 'fitness': fit}  # 3
        pop.append(p)
    return pop
  • 1-N is the number of individuals in the population; size is the length of the parameter vectors.
  • 2-Creates a randomly initialized parameter vector
  • 3-Creates a dictionary to store the parameter vector and its associated fitness score

Each agent is a simple Python dictionary that stores its parameter vector and its fitness score.

Next we implement the function that will recombine two parent agents to produce two new child agents.

def recombine(x1, x2):                        # 1
    x1 = x1['params']                         # 2
    x2 = x2['params']
    l = x1.shape[0]
    split_pt = np.random.randint(l)           # 3
    child1 = torch.zeros(l)
    child2 = torch.zeros(l)
    child1[0:split_pt] = x1[0:split_pt]       # 4
    child1[split_pt:] = x2[split_pt:]
    child2[0:split_pt] = x2[0:split_pt]
    child2[split_pt:] = x1[split_pt:]
    c1 = {'params': child1, 'fitness': 0.0}   # 5
    c2 = {'params': child2, 'fitness': 0.0}
    return c1, c2
  • 1-x1 and x2 are agents, which are dictionaries.
  • 2-Extracts just the parameter vector
  • 3-Randomly produces a split or crossover point
  • 4-The first child is produced by taking the first segment of parent 1 and the second segment of parent 2.
  • 5-Creates new child agents by packaging the new parameter vectors into dictionaries

The above function takes two parent agents and randomly produces a split, or crossover, point. The first piece of parent 1 is combined with the second piece of parent 2 to produce child 1, and the second piece of parent 1 is combined with the first piece of parent 2 to produce child 2.
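As a quick illustration (a toy example, not part of the training loop), we can recombine two small hand-made "agents" and see the crossover point in action:

p1 = {'params': torch.ones(6), 'fitness': 0.0}     # parent 1: all ones
p2 = {'params': torch.zeros(6), 'fitness': 0.0}    # parent 2: all zeros
c1, c2 = recombine(p1, p2)
print(c1['params'])   # e.g. tensor([1., 1., 1., 0., 0., 0.]) if the split point is 3
print(c2['params'])   # the complementary tensor([0., 0., 0., 1., 1., 1.])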

That was the first stage for populating the next generation; the second stage is to mutate the individuals with some fairly low probability. Mutation is the only source of new genetic information in each generation — recombination only shuffles around information that already exists.

Next we implement a function to mutate the parameter vectors.

def mutate(x, rate=0.01):                                                # 1
    x_ = x['params']
    num_to_change = int(rate * x_.shape[0])                              # 2
    idx = np.random.randint(low=0, high=x_.shape[0], size=(num_to_change,))
    x_[idx] = torch.randn(num_to_change) / 10.0                          # 3
    x['params'] = x_
    return x
  • 1-rate is the mutation rate, where 0.01 is a 1% mutation rate.
  • 2-Uses the mutation rate to decide how many elements in the parameter vector to mutate
  • 3-Randomly resets the selected elements in the parameter vector

We randomly change a few elements of the parameter vector; the number of elements to change is determined by the mutation rate. The mutation rate has to be controlled carefully to balance the creation of new information that can improve existing solutions against the destruction of old information.
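For a 407-element vector and the default 1% rate, int(0.01 * 407) = 4 elements are re-drawn per mutation. A quick check (again just an illustrative sketch):

agent = {'params': torch.zeros(407), 'fitness': 0.0}
mutated = mutate(agent, rate=0.01)
print((mutated['params'] != 0).sum().item())   # up to 4 elements re-drawn (fewer if sampled indices repeat)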

Next we need to assess the fitness of each agent by actually testing them on the environment (CartPole in our case).

import gym
env = gym.make("CartPole-v0")

def test_model(agent):
    done = False
    state = torch.from_numpy(env.reset()).float()
    score = 0
    while not done:                                                      # 1
        params = unpack_params(agent['params'])
        probs = model(state, params)                                     # 2
        # the model returns log probabilities, so pass them as logits
        action = torch.distributions.Categorical(logits=probs).sample() # 3
        state_, reward, done, info = env.step(action.item())
        state = torch.from_numpy(state_).float()
        score += 1                                                       # 4
    return score
  • 1-While game is not lost
  • 2-Gets the action probabilities from the model using the agent’s parameter vector
  • 3-Probabilistically selects an action by sampling from a categorical distribution
  • 4-Keeps track of the number of time steps the game is not lost as the score

The test_model function takes an agent (a dictionary holding a parameter vector and its fitness value), runs it in the CartPole environment until it loses the game, and returns the number of time steps it lasted as its score. We want to breed agents that last longer and longer in CartPole (and therefore achieve a higher score). We need to do this for every agent in the population.
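A fresh, random agent typically survives only a handful of steps; a quick way to see this baseline (an illustrative check, not part of the training loop):

sample_pop = spawn_population(N=1, size=407)
print(test_model(sample_pop[0]))   # a small score for an untrained agent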

Next we will create a population evaluation function.

def evaluate_population(pop):
    tot_fit = 0                       # 1
    lp = len(pop)
    for agent in pop:                 # 2
        score = test_model(agent)     # 3
        agent['fitness'] = score      # 4
        tot_fit += score
    avg_fit = tot_fit / lp
    return pop, avg_fit
  • 1-Total fitness for this population; used to later calculate the average fitness of the population
  • 2-Iterates through each agent in the population
  • 3-Runs the agent in the environment to assess its fitness
  • 4-Stores the fitness value

The evaluate_population function iterates through each agent in the population and runs test_model on them to assess their fitness.

The final main function we need is next_generation. We could choose parents probabilistically in proportion to fitness, the way actions were sampled in the policy gradient method, but for selecting parents in a genetic algorithm that often leads to overly rapid convergence. Genetic algorithms require more exploration than gradient-descent-based methods, so in this case we'll use a selection mechanism called tournament selection.

Figure 2 : Tournament selection

In tournament selection we evaluate the fitness of all the individuals in the population as usual, then choose a random subset of the full population (in figure 2, just two of four), take the top individuals (usually two) of that subset, mate them to produce offspring, and mutate them. We repeat this selection process until we fill up the next generation. We can change the tournament size (the size of the random subset) to control the degree to which we favor the best agents in the current generation, at the risk of losing genetic diversity.

In this example we set the tournament size as a percentage of the size of the population. Empirically, tournament sizes of about 20% seem to work fairly well.

def next_generation(pop, mut_rate=0.001, tournament_size=0.2):
    new_pop = []
    lp = len(pop)
    while len(new_pop) < len(pop):                                     # 1
        rids = np.random.randint(low=0, high=lp,
                                 size=(int(tournament_size * lp)))     # 2
        batch = np.array([[i, x['fitness']] for
                          (i, x) in enumerate(pop) if i in rids])      # 3
        scores = batch[batch[:, 1].argsort()]                          # 4
        i0, i1 = int(scores[-1][0]), int(scores[-2][0])                # 5
        parent0, parent1 = pop[i0], pop[i1]
        offspring_ = recombine(parent0, parent1)                       # 6
        child1 = mutate(offspring_[0], rate=mut_rate)                  # 7
        child2 = mutate(offspring_[1], rate=mut_rate)
        offspring = [child1, child2]
        new_pop.extend(offspring)
    return new_pop
  • 1-While the new population is not full
  • 2-Selects a percentage of the full population as a subset
  • 3-Subsets the population to get a batch of agents and matches each one with their index value in the original population
  • 4-Sorts this batch in increasing order of score
  • 5-The last agents in the sorted batch are the agents with the highest scores; selects the top 2 as parents.
  • 6-Recombines the parents to get offspring
  • 7-Mutates the children before putting them into the next generation

The next_generation function creates a list of random indices to index the population list and create a subset for a tournament batch. We use the enumerate function to keep track of the index positions of each agent in the subset so we can refer back to them in the main population. Then we sort the batch of fitness scores in ascending order and take the last two elements in the list as the top two individuals in that batch. We look up their indices and select the whole agent from the original population list.

Putting it all together, we can train a population of agents to play CartPole in just a handful of generations. You should experiment with the hyperparameters: mutation rate, population size, and number of generations.

The code below shows the training loop.

num_generations = 25                                                       # 1
population_size = 500                                                      # 2
mutation_rate = 0.01
pop_fit = []
pop = spawn_population(N=population_size, size=407)                        # 3
for i in range(num_generations):
    pop, avg_fit = evaluate_population(pop)                                # 4
    pop_fit.append(avg_fit)
    pop = next_generation(pop, mut_rate=mutation_rate, tournament_size=0.2)  # 5
  • 1-The number of generations to evolve
  • 2-The number of individuals in each generation
  • 3-Initializes a population
  • 4-Evaluates the fitness of each agent in the population
  • 5-Populates the next generation

The first generation begins with a population of random parameter vectors, but by chance some of these will be better than the others, and we preferentially select these to mate and produce offspring for the next generation. To maintain genetic diversity, we allow each individual to be mutated slightly. This process repeats until we have individuals who are exceptionally good at playing CartPole.
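Once training finishes, the fitness values stored in the final population belong to the children produced by the last call to next_generation (they start at 0), so we evaluate once more and keep the fittest agent. This is an illustrative sketch using only the functions defined above:

pop, avg_fit = evaluate_population(pop)                    # re-score the final generation
best_agent = max(pop, key=lambda agent: agent['fitness'])
print(best_agent['fitness'])                               # score of the best evolved agent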

We can see in figure 3 that the score steadily increases with each generation of evolution.

Figure 3 : Training performance
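A plot like figure 3 can be produced from the pop_fit list collected during training; a minimal sketch, assuming matplotlib is installed:

import matplotlib.pyplot as plt

plt.plot(pop_fit)                       # average fitness per generation
plt.xlabel("Generation")
plt.ylabel("Average score")
plt.title("Training performance")
plt.show()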

Pros and Cons of Evolutionary Algorithms

  • Evolutionary algorithms tend to be more data hungry (less sample-efficient) than gradient-based approaches; in some circumstances this may be acceptable, notably if you have a simulator.
  • Evolutionary algorithms can optimize over nondifferentiable or even discrete functions, which gradient-based methods cannot do.

Summary

In this series of articles we learned why and how to use evolutionary algorithms for reinforcement learning. They add another powerful tool to our toolkit. Mirroring biological evolution, we:

  • Produce individuals
  • Select the best from the current generation
  • Shuffle the genes around
  • Mutate them to introduce some variation
  • Mate them to create new individuals for the next generation

Link to part 1 of this article:

https://nandakishorej8.medium.com/part-1-evolutionary-algorithms-for-reinforcement-learning-an-introduction-e434d2a1d023
