Part 1 : Evolutionary Algorithms for Reinforcement Learning — An Introduction

An Introduction to Genetic Algorithms

NandaKishore Joshi
7 min read · Apr 29, 2022

Unlike conventional gradient-descent-based reinforcement learning techniques, there are alternative optimization techniques inspired by Charles Darwin’s theory of evolution. In this article we dive deep into how these algorithms work and, in Part 2, implement them to solve OpenAI’s CartPole problem.

Neural networks, the building blocks of modern AI, are inspired by the biological brain, and convolutional neural networks are inspired by biological vision. Nature, through the process of evolution, has solved many problems efficiently and elegantly, which is why so many advances in technology and engineering are inspired by biological organisms. Naturally, people have tried to borrow the process of evolution itself and implement it as a computer algorithm to solve problems. In this article we will see how the process of evolution can be harnessed to solve reinforcement learning problems.

Why a different approach to Reinforcement Learning?

With DQN and policy gradient approaches we create an agent whose policy depends on a neural network that approximates the Q function or the policy function. The agent interacts with the environment, collects experiences, and then backpropagates to improve the accuracy of the neural network and, hence, its policy. We need to carefully select various hyperparameters, such as the optimizer, mini-batch size, and learning rate, to make learning stable and successful. Because training of both DQN and policy gradient methods relies on stochastic gradient descent, there is no guarantee that the models will learn successfully in all cases. The training loop of the DQN and policy gradient approaches is shown in the figure below.

Figure 1 : Agent interaction with environment in DQN and Policy Gradient approaches

Finding the right combination of hyperparameters for creating an agent can be highly difficult when the environment and network get complicated. Moreover, gradient descent and backpropagation need a model that is differentiable. There are certainly interesting and useful models you could construct that would be impossible to train with gradient descent due to this lack of differentiability.

Instead of creating one agent and improving it, we can instead learn from Charles Darwin and use evolution by (un)natural selection. We could spawn multiple different agents with different parameters (weights), observe which ones did the best, and “breed” the best agents such that the descendants could inherit their parents’ desirable traits — just like in natural selection. We could emulate biological evolution using algorithms. We wouldn’t need to struggle to tune hyperparameters and wait for multiple epochs to see if an agent is learning “correctly.” We could just pick the agents that are already performing better. This approach is shown in figure 2.

Figure 2 : Working of Evolutionary Algorithm

This class of algorithms does not require an individual agent to learn and does not rely on gradient descent, so it is aptly called a gradient-free algorithm. That does not mean these agents have no objective and rely purely on chance: we select the fittest individuals from a population whose traits vary.

Reinforcement Learning with an Evolution Strategy

Let us now discuss how fitness plays a role in evolution strategies and cover the task of selecting the fittest agent. Later we will see how to combine those agents into new agents and what happens when we introduce mutations.

Fitness in Evolution Strategy:

We have often heard “survival of the fittest.” What does fittest mean here?

Fitness is relative to the environment. For example, a polar bear is extremely fit and well adapted to the polar ice caps, but the Amazon rainforest would not be its cup of tea. In biology, the fittest individuals are those with the greatest reproductive success, the ones that pass their genetic information on to later generations. Such a trait can be the beak design of a bird, which helps it fetch more food and thus increases its chances of surviving and reproducing, or the thickness of a yak’s fur, which allows it to survive in extremely cold regions. You can think of the environment as determining an objective or fitness function that assigns individuals a fitness score based on their performance within that environment; their performance is determined solely by their genetic information.

Mutations in Evolution Strategy:

In biology, mutations are subtle changes in the characteristics of an organism. These changes accumulate over many generations and help a species better adapt to and survive in its respective environment. In the case of a bird’s beak design, many random mutations would be introduced; only birds with a favorable beak design would survive and reproduce in greater numbers, and over generations the entire population would carry that trait. Here we can see nature selecting the fittest trait and reproducing it.

Evolutionary Reinforcement Learning:

In evolutionary reinforcement learning, we are selecting for traits that give our agents the highest reward in a given environment, and by traits we mean model parameters (e.g., the weights of a neural network) or entire model structures. An RL agent’s fitness can be determined by the expected reward it would receive if it were to perform in the environment.

Let’s say agent A played the Atari game Breakout and was able to achieve an average score of 500, while agent B was only able to obtain 300 points. We would say that agent A is more fit than agent B and that we want our optimal agent to be more similar to agent A than to B. The only reason agent A is more fit than agent B is that its model parameters are slightly better optimized for the environment.
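To make this concrete, an agent’s fitness can be estimated as its average total reward over a few evaluation episodes. The snippet below is only an illustrative sketch: it assumes a hypothetical agent.act(observation) policy method and the classic gym step API (obs, reward, done, info), neither of which has been defined in this article.

```python
import numpy as np

def evaluate_fitness(agent, env, n_episodes=5):
    """Estimate an agent's fitness as its average episode return."""
    returns = []
    for _ in range(n_episodes):
        obs = env.reset()               # classic gym API: reset() returns an observation
        done, total_reward = False, 0.0
        while not done:
            action = agent.act(obs)     # hypothetical policy method
            obs, reward, done, _ = env.step(action)
            total_reward += reward
        returns.append(total_reward)
    return np.mean(returns)             # higher average return means higher fitness
```

Comparing two agents then amounts to comparing their returned scores: the one with the higher average return, like agent A above, is the fitter of the two.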

The objective in evolutionary reinforcement learning is exactly the same as in backpropagation and gradient-descent-based training. The only difference is that we use this evolutionary process, often referred to as a genetic algorithm, to optimize the parameters of a model such as a neural network. The genetic algorithm for reinforcement learning is shown in figure 3.

Figure 3 : Genetic Algorithm in Reinforcement Learning

Let’s run through the steps of the genetic algorithm in more detail. Let’s say we have a neural network that we want to use as an agent to play Gridworld, and we want to train it using a genetic algorithm. Remember, training a neural network just means iteratively updating its parameters such that its performance improves. Also recall that given a fixed neural network architecture, the parameters completely determine its behavior, so to copy a neural network we just need to copy its parameters.
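As a quick aside, here is what “copying a network just means copying its parameters” can look like in code. This is a minimal sketch using PyTorch, which is an assumption on my part since the article has not fixed a library; the helper names get_params and set_params are made up for illustration.

```python
import torch
import torch.nn as nn

def get_params(model):
    """Flatten all weights and biases into a single 1-D parameter vector."""
    return torch.cat([p.detach().flatten() for p in model.parameters()])

def set_params(model, flat_params):
    """Copy values from a flat vector back into the model's parameters."""
    idx = 0
    with torch.no_grad():
        for p in model.parameters():
            n = p.numel()
            p.copy_(flat_params[idx:idx + n].view_as(p))
            idx += n

# Two networks with the same architecture behave identically
# once they share the same parameter vector.
net_a = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))
net_b = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))
set_params(net_b, get_params(net_a))    # net_b is now an exact copy of net_a
```

With each individual represented as such a flat parameter vector, the steps of the genetic algorithm are: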

  1. We generate an initial population of random parameter vectors. We refer to each parameter vector in the population as an individual. Let’s say this initial population has 100 individuals.
  2. We iterate through this population and assess the fitness of each individual by running the model in Gridworld with that parameter vector and recording the rewards. Each individual is assigned a fitness score based on the rewards it earns. Since the initial population is random, they will all likely perform very poorly, but there will be a few, just by chance, that will perform better than others.
  3. We randomly sample a pair of individuals (“parents”) from the population, weighted according to their relative fitness score (individuals with higher fitness have a higher probability of being selected) to create a “breeding population.”
  4. The individuals in the breeding population will then “mate” to produce “offspring” that will form a new, full population of 100 individuals. If the individuals are simply parameter vectors of real numbers, mating vector 1 with vector 2 involves taking a subset from vector 1 and combining it with a complementary subset of vector 2 to make a new offspring vector of the same dimensions. For example, suppose you have vector 1: [1 2 3] and vector 2: [4 5 6]. Vector 1 mates with vector 2 to produce [1 5 6] and [4 2 3]. We simply randomly pair up individuals from the breeding population and recombine them to produce two new offspring until we fill up a new population. This creates new “genetic” combinations from the best performers.
  5. We now have a new population with the top solutions from the last generation, along with new offspring solutions. At this point, we will iterate over our solutions and randomly mutate some of them to make sure we introduce new genetic diversity into every generation to prevent premature convergence on a local optimum. Mutation simply means adding a little random noise to the parameter vectors. If these were binary vectors, mutation would mean randomly flipping a few bits; otherwise we might add some Gaussian noise. The mutation rate needs to be fairly low, or we’ll risk ruining the already present good solutions.
  6. We now have a new population of mutated offspring from the previous generation. We repeat this process with the new population for N generations or until we reach convergence (which is when the population’s average fitness stops improving significantly). A minimal code sketch of this loop is shown after figure 4.
Figure 4: A genetic algorithm optimization of neural networks for reinforcement learning
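
Putting steps 1 through 6 together, below is a minimal, self-contained sketch of the whole loop. It operates on plain NumPy parameter vectors and takes an arbitrary fitness(vector) function (for example, one that loads the vector into a network and returns the average episode reward, as in the earlier sketch); the population size, mutation rate, and selection scheme here are illustrative assumptions, not the article’s final implementation.

```python
import numpy as np

def evolve(fitness, param_size, pop_size=100, generations=20,
           mutation_rate=0.01, mutation_scale=0.1):
    """Minimal genetic algorithm over real-valued parameter vectors."""
    rng = np.random.default_rng(0)
    # Step 1: initial population of random parameter vectors ("individuals").
    population = rng.normal(size=(pop_size, param_size))

    for gen in range(generations):
        # Step 2: assess the fitness of each individual.
        scores = np.array([fitness(ind) for ind in population])
        # Step 3: selection probabilities proportional to relative fitness.
        probs = scores - scores.min() + 1e-8
        probs = probs / probs.sum()

        next_pop = []
        while len(next_pop) < pop_size:
            # Sample a pair of parents, weighted by fitness.
            i, j = rng.choice(pop_size, size=2, replace=False, p=probs)
            p1, p2 = population[i], population[j]
            # Step 4: crossover - swap complementary slices at a random cut point.
            cut = rng.integers(1, param_size)
            next_pop.append(np.concatenate([p1[:cut], p2[cut:]]))
            next_pop.append(np.concatenate([p2[:cut], p1[cut:]]))
        population = np.array(next_pop[:pop_size])

        # Step 5: mutation - add a little Gaussian noise to a small fraction of entries.
        mask = rng.random(population.shape) < mutation_rate
        population = population + mask * rng.normal(scale=mutation_scale,
                                                    size=population.shape)
        # Step 6: repeat; track progress to check for convergence.
        print(f"generation {gen}: best fitness {scores.max():.2f}")

    final_scores = np.array([fitness(ind) for ind in population])
    return population[np.argmax(final_scores)]
```

Note that the cut-point crossover and Gaussian-noise mutation above directly mirror steps 4 and 5; swapping in a different selection or mutation scheme only changes those few lines.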

Summary

In this article we discussed why an alternative approach to gradient-descent-based reinforcement learning is needed, what the evolutionary mechanism means in biological terms, and how it can be used in reinforcement learning.

In the next part of this article, we will implement an evolutionary algorithm to solve OpenAI’s CartPole problem and discuss its pros and cons.

Link to part 2 of this article:

https://nandakishorej8.medium.com/part-2-evolutionary-algorithms-for-reinforcement-learning-solving-openais-cartpole-318aaef8a6eb
