Enter the world of Policy Gradient methods!

Video

First, let’s look at the resulting videos to understand what we are doing.

Episode 100 Trial

Episode 100 Trial video

Episode 700 Trial

Episode 700 Trial video

Episode 800 Trial

Episode 800 Trial video

Have you wondered why, after episode 100, the agent was not working at all, at 700 it seemed to have learned the behavior, and then at 800 it got stuck in the air? Let’s start!

Let’s look at the previous blogs

Previous blogs were about value-based methods, where the actions were discrete. How about continuous actions? How about updating the policy directly instead of deriving it from a value function approximation? Here comes the world of policy gradient methods!

Reinforce

In this blog, I am writing about a new type of Reinforcement Learning (referred to as RL) method: the REINFORCE method.

History

This method was first introduced by Ronald J. Williams in the Paper link. The paper is quite mathematical. Luckily, we have an intuition for it as well! More on that later.

Why Policy Gradient methods?

In value-based methods, the policy is not learned directly: the learning is driven by the value function, which is continuously updated, and actions are chosen from it. What if we don’t need a value function? What if we want to learn the policy directly? Good or bad, this is actually the reason. Value-based methods work well if the action space is discrete, because we can then tell which action to take from the value function. But if the action space is continuous, that is impossible. One can discretize the actions, but again, how many actions can one take?

We don’t know which policy is the best! Also, there is no way to explore different actions: if the actions range from -1.0 to 1.0, how can one explore an action of, let’s say, 0.666? Here come policy gradient methods!
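
To make this concrete (a sketch of mine, not code from the repository): if the policy outputs the parameters of a continuous distribution, any value in the range, including 0.666, can be sampled naturally. The mean and std values below are hypothetical policy outputs.

    import torch
    from torch.distributions import Normal

    mean, std = torch.tensor(0.5), torch.tensor(0.3)       # hypothetical policy outputs
    action = Normal(mean, std).sample().clamp(-1.0, 1.0)   # values like 0.666 come out for free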

Choice of actions. During exploration, value-based methods choose actions randomly, which makes the action choice erratic, at least in the exploration stage. Policy gradient methods do not have this problem: actions are sampled from a distribution that is learned by following gradients!
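
As a quick illustration (again a sketch, not repository code), here is epsilon-greedy selection on Q-values next to sampling from a learned policy distribution; q_values and action_probs are hypothetical network outputs:

    import torch
    from torch.distributions import Categorical

    q_values = torch.tensor([0.1, 0.5, 0.2, 0.2])       # hypothetical Q-value estimates
    action_probs = torch.tensor([0.1, 0.5, 0.2, 0.2])   # hypothetical softmax policy output

    # Value-based: epsilon-greedy -- random action with probability epsilon, greedy otherwise
    epsilon = 0.1
    if torch.rand(1).item() < epsilon:
        action = torch.randint(0, len(q_values), (1,)).item()
    else:
        action = q_values.argmax().item()

    # Policy gradient: sample from the learned distribution itself; exploration is built in
    action = Categorical(action_probs).sample().item()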

Let’s talk about intuition

I find this blog to have the best intuition on policy gradient methods. Please check it out before proceeding! I plan to write an intuition blog of my own in the future.

Now it’s time for the Code Walkthrough

How to run?

The repository is here

Assuming you have read the intuition blog, please go through the walkthrough.

Build Image

docker compose build

Run the container

docker compose up

Training, with a video saved every certain number of episodes

python3 /src/reinforce_discrete.py LunarLander-v2 --train --save_model_path </path/to/save/the/model> --gamma <gamma hyper-parameter> --epoch <num_of_epoch> --plot <to plot or not> --plot_fig_path </path/to/save/the/plot> --train_video_dir </path/to/train_video> --train_video_every <number_of_episodes_to_save_video>

Inference

  • To setup for inference, run from outside the docker container

      sudo xhost +SI:localuser:<username>
    
  • From another terminal run docker exec -it <container_name> bash (or your preferred shell), and inside the container run,

      python3 /src/reinforce_discrete.py LunarLander-v2 --infer --infer_weight /path/to/saved/weight --infer_render <to render the inference or not> --infer_render_fps <fps for render video> --infer_video </path/to/save/inference/rendered/video.>
    

What about the codebase?

Policy Script

The __init__ and forward methods are the same as in previous blogs: a simple neural network with 2 hidden layers, defined as an nn.Module.


import torch.nn as nn
import torch.nn.functional as F


class Policy(nn.Module):

    def __init__(self, s_size=4, fc1_size=150, fc2_size=120, a_size=2):
        super(Policy, self).__init__()
        self.fc1 = nn.Linear(s_size, fc1_size)
        self.fc2 = nn.Linear(fc1_size, fc2_size)
        self.final = nn.Linear(fc2_size, a_size)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.final(x)
        return F.softmax(x, dim=1)
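
As a quick sanity check (my own sketch, not part of the repository), the network can be instantiated for LunarLander-v2, which has an 8-dimensional observation and 4 discrete actions:

    import torch

    policy = Policy(s_size=8, fc1_size=150, fc2_size=120, a_size=4)
    dummy_state = torch.zeros(1, 8)   # a batch with one observation
    probs = policy(dummy_state)       # action probabilities, shape (1, 4)
    print(probs.sum().item())         # ~1.0, since forward ends with a softmax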

  • But the action selection is a bit different this time!

    def act(self, state):
        state = torch.from_numpy(state).float().unsqueeze(0).to(device)
        probs = self.forward(state).cpu()
        m = Categorical(probs)
        action = m.sample()
        return action.item(), m.log_prob(action)
  • So we see that the act method outputs the action itself (obviously needed) and the log probability of that action.

  • Why the log probability? Because we will need it for gradient ascent. Let’s look at the equation for gradient ascent,

    \[\nabla_\theta J(\theta)= \hat{A}(s,a)\nabla_\theta\log\pi_{\theta}(a|s)\] \[\theta = \theta + \alpha\nabla_\theta J(\theta)\]

    So it turns out we need the action to run the agent in the environment, and its log probability to compute the update.
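
    Note how this will be realized in code: since optimizers minimize, gradient ascent on \(J(\theta)\) is implemented as gradient descent on its negative. A minimal sketch, reusing the names log_prob, A, and optimizer from the walkthrough below:

        loss = -log_prob * A    # minimizing -J(theta) == maximizing J(theta)
        optimizer.zero_grad()
        loss.backward()         # autograd yields A * grad(log pi_theta(a|s))
        optimizer.step()        # theta updated in the direction of grad_theta J(theta)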

rl/rl/renforce.py script

  • Function definition

      def reinforce_discrete(
          env,
          policy,
          model_weights_path,
          n_episodes=1000,
          max_t=1000,
          gamma=1.0,
          print_every=100,
          learning_rate=1e-2
      ):
          scores = []
    
  • Make an optimizer

          optimizer = optim.Adam(policy.parameters(), lr=learning_rate)
    
  • Get the action and its log probability, step the environment to get the reward, and save the rewards along with the log probabilities

          for i_episode in range(1, n_episodes + 1):
              saved_log_probs = []
              rewards = []
              state = env.reset()
              states = [state]
              for t in range(max_t):
                  action, log_prob = policy.act(state)
                  saved_log_probs.append(log_prob)
                  state, reward, done, _ = env.step(action)
                  rewards.append(reward)
                  states.append(state)
                  if done:
                      break
              scores.append(sum(rewards))
    
  • We compute the discounted expected rewards using gamma. It is the application of the following equation,

    \[E_i = R_i + \gamma R_{i+1} + \gamma^2 R_{i+2} + \dots + \gamma^{n-1} R_{i+n-1}\]
              expected_rewards = get_expected_reward(rewards, gamma)
              state_values = get_state_values(rewards)
    
              policy_loss = []
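
The helpers get_expected_reward and get_state_values are defined elsewhere in the repository (not shown here). A minimal sketch of the discounted return-to-go that get_expected_reward should compute, per the equation above, might look like this (my own version, not necessarily the exact implementation):

    import numpy as np

    def get_expected_reward(rewards, gamma):
        """E_i = R_i + gamma*R_{i+1} + ..., computed backwards so each step reuses the next (a sketch)."""
        expected = np.zeros(len(rewards))
        running = 0.0
        for i in reversed(range(len(rewards))):
            running = rewards[i] + gamma * running
            expected[i] = running
        return expected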
                
    
  • Compute the Advantage according to the equation, get the policy loss, and backpropagate.

    The Advantage is defined as follows: \(A_i = E_i - \mathrm{mean}(E)\). There is a good explanation of using the Advantage here. But for now, it is enough to understand that the Advantage depends more on the action taken by the agent, i.e. it shows the effect of taking a particular action in a certain state, hence the name Advantage, maybe!

    The code is as follows

              for i, log_prob in enumerate(saved_log_probs):
                  A = expected_rewards[i] - np.mean(expected_rewards)
                  policy_loss.append((-log_prob * A).float())
              policy_loss = torch.cat(policy_loss).sum()
    
              optimizer.zero_grad()
              policy_loss.backward()
              optimizer.step()
    
              if i_episode % print_every == 0:
                  print(
                      "INFO: Episode {}\tAverage Score: {:.2f}".format(
                          i_episode, np.mean(scores[-print_every:])
                      )
                  )
              if np.mean(scores[-print_every:]) >= 195.0:
                  print(
                      "INFO: Environment solved in {:d} episodes!\tAverage Score: {:.2f}".format(
                          i_episode - 100, np.mean(scores[-print_every:])
                      )
                  )
                  break
          print(f"INFO: Saving the weights in {model_weights_path}")
          torch.save(policy.state_dict(), model_weights_path)
          return scores
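
To tie it together, here is a hedged usage sketch (not from the repository; it assumes the script’s imports and helpers are available, and the weight path and hyper-parameter values are illustrative):

    import gym
    import torch

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    env = gym.make("LunarLander-v2")   # old gym API: env.step returns 4 values, as in the loop above
    policy = Policy(s_size=8, fc1_size=150, fc2_size=120, a_size=4).to(device)

    scores = reinforce_discrete(
        env,
        policy,
        model_weights_path="reinforce_lunarlander.pth",  # illustrative path
        n_episodes=1000,
        gamma=0.99,            # illustrative value; the CLI exposes --gamma
        learning_rate=1e-2,
    )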
    

Score plot

The score plot for the entire training is as follows.

Score plot

Problem with this approach

If you check the plot above, and also the videos referred to at the beginning, you will notice that the agent is not stable enough. Having trained for 700 episodes, at the 800th episode it should perform the optimal behavior of settling down at the right place. But it was hovering, which can be attributed to two causes:

  • This method only considers the reward due to the actions and does not evaluate the state! So when hovering, the agent might be getting a higher reward than when descending. It is sub-optimal behavior driven only by actions!
  • Hence, the training process has higher variance, as we are training the agent based only on the rewards of the actions, not on the value of the state.
  • How to solve it? We will find out in the next blogs!

TODO

I was planning to compare the score plots! When I am able to do it, I will update this post!

  • Compare REINFORCE method plots to DQN-based methods.

Conclusion

To conclude, the REINFORCE method opens the pathway to policy gradient methods, making problems with continuous actions easier to solve. But it has a high-variance issue and needs to be improved, which will be the subject of my upcoming blogs. Although I mentioned continuous actions, this blog still works with discrete actions! In part 2 of REINFORCE, I will try to extend the actions to continuous ones!

If you have any questions or comments, please email me at sezan92[at]gmail[dot]com