Unlocking the Power of Policy Gradients, an RL Method
- vazquezgz
- Oct 8, 2023
- 8 min read
- Updated: Mar 4, 2024

Reinforcement Learning (RL) is a branch of machine learning where agents learn to make sequential decisions by interacting with an environment. The goal of RL is to maximize a cumulative reward signal received by the agent. In this context, Policy Gradient methods are a family of RL algorithms that directly parameterize a policy, which is a mapping from states to actions. Policy Gradients aim to find a policy that maximizes the expected cumulative reward.
Where are Policy Gradients Used?
Policy Gradients find applications in various domains, including robotics, game playing, natural language processing, and autonomous systems. They are particularly useful in scenarios where the optimal policy is unknown and needs to be learned through interaction with the environment. Some examples of applications include:
Game Playing: Policy Gradients have been used to train agents for playing video games, such as Atari games, using raw pixel inputs.
Robotics: In robotics, these algorithms help robots learn to perform complex tasks, like grasping objects or walking, by trial and error.
Autonomous Vehicles: Policy Gradients are employed in training self-driving cars to navigate safely and efficiently on real roads.
Healthcare: They can be used for personalized treatment recommendation, drug discovery, and optimizing patient treatment plans.
Now, let's delve into the details of how Policy Gradients work.
How Policy Gradients Work
Policy Representation
Policy Gradients begin with the parameterization of a policy, typically represented as a neural network or a function approximator. This policy maps states or observations to a probability distribution over actions. The policy is denoted as πθ, where θ represents the policy parameters. In practice, neural networks are commonly used to approximate complex policies, with θ being the weights and biases of the network.
The policy function πθ assigns a probability to each action in the action space given the current state. In a discrete action space, this can be represented as a probability vector, where each element corresponds to the probability of taking a specific action.
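To make this concrete, here is a minimal sketch of such a policy, assuming a small PyTorch network (the same library used in the examples later in this post); the state size of 4 and the two actions are arbitrary placeholders, not part of any particular environment:
import torch
import torch.nn as nn

# A tiny parameterized policy pi_theta: state vector in, action probabilities out.
class TinyPolicy(nn.Module):
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.fc = nn.Linear(state_dim, n_actions)

    def forward(self, state):
        logits = self.fc(state)
        return torch.softmax(logits, dim=-1)  # probability vector over the discrete actions

policy = TinyPolicy(state_dim=4, n_actions=2)
probs = policy(torch.zeros(4))               # e.g. roughly [0.5, 0.5] for an untrained network
action = torch.multinomial(probs, 1).item()  # sample an action according to pi_theta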
Collecting Trajectories
The next step involves the agent interacting with the environment using the current policy πθ and collecting trajectories. A trajectory is a sequence of state-action pairs and the corresponding rewards. These trajectories are generated by following the policy in the environment.
State: The state represents the current situation or configuration of the environment. It provides information about the agent's position, objects in the environment, and any other relevant information.
Action: Actions are the decisions made by the agent at each time step. The actions are selected based on the policy's probability distribution over the action space.
Reward: At each time step, the agent receives a reward from the environment. The reward signal provides feedback on the quality of the actions taken. The goal is to maximize the cumulative reward over time.
Trajectory: A trajectory is a sequence of state-action pairs (s1, a1), (s2, a2), ..., (st, at), along with the corresponding rewards (r1, r2, ..., rt).
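As a small, hypothetical sketch of this step, the helper below rolls out one episode in a Gym environment using whatever action-selection function it is given (here a random policy stands in for πθ); collect_trajectory and select_action are illustrative names, not part of any library:
import gym

def collect_trajectory(env, select_action, max_steps=500):
    # Roll out one episode and return the visited states, chosen actions, and rewards.
    states, actions, rewards = [], [], []
    state = env.reset()
    for _ in range(max_steps):
        action = select_action(state)                   # sample a_t from pi_theta(. | s_t)
        next_state, reward, done, _ = env.step(action)  # environment transition and reward r_t
        states.append(state)
        actions.append(action)
        rewards.append(reward)
        state = next_state
        if done:
            break
    return states, actions, rewards

env = gym.make('CartPole-v1')
states, actions, rewards = collect_trajectory(env, lambda s: env.action_space.sample())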
Objective Function
The objective in Policy Gradients is to find the policy parameters θ that maximize the expected cumulative reward. This is typically done by defining an objective function, often referred to as the "policy objective" or "expected return."
The policy objective function J(θ) quantifies the expected return of the policy under the current parameterization θ. It is defined as the expected value of the cumulative reward over trajectories generated by following the policy:
J(θ) = E[R(τ)], where the trajectories τ are sampled by running πθ
Here, R(τ) represents the cumulative reward for a trajectory τ, which is the sum of the rewards obtained during that trajectory: R(τ) = r1 + r2 + ... + rt.
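Since this expectation cannot be evaluated exactly, J(θ) is usually estimated by averaging the returns of trajectories sampled with the current policy. A minimal sketch, reusing the hypothetical collect_trajectory helper from the earlier snippet:
import numpy as np

def estimate_objective(env, select_action, n_trajectories=20):
    # Monte Carlo estimate of J(theta): the average return R(tau) over sampled trajectories.
    returns = []
    for _ in range(n_trajectories):
        _, _, rewards = collect_trajectory(env, select_action)
        returns.append(sum(rewards))  # R(tau) = r1 + r2 + ... + rt
    return np.mean(returns)           # the sample mean approximates E[R(tau)]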
Gradient Ascent
To find the policy parameters that maximize the expected return, Policy Gradients use gradient-based optimization methods, typically stochastic gradient ascent. The gradient of the policy objective with respect to the policy parameters, ∇J(θ), indicates how small changes in θ affect the expected return.
The update step in stochastic gradient ascent is given by:
θ ← θ + α ∇J(θ)
Where:
θ represents the current policy parameters.
α is the learning rate, which controls the step size of the update.
∇J(θ) is the gradient of the policy objective.
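As a minimal sketch of this update, assuming the parameters are stored in a NumPy array theta and that grad_J holds an estimate of ∇J(θ) (both hypothetical names with placeholder values), one ascent step looks like this:
import numpy as np

theta = np.zeros(4)   # current policy parameters (placeholder values)
grad_J = np.ones(4)   # an estimate of the gradient of J(theta) (placeholder values)
alpha = 0.01          # learning rate controlling the step size

theta = theta + alpha * grad_J  # move theta in the direction that increases expected return
In the PyTorch examples later in this post, the same step is expressed as gradient descent on the negative objective, so that optimizer.step() performs the ascent.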
Update Policy
The policy parameters θ are updated in the direction that increases the expected return. This means that if the policy is performing well in terms of accumulating rewards, the parameters will be adjusted to make similar decisions in the future. Conversely, if the policy is performing poorly, adjustments are made to improve its decision-making.
The key idea is to increase the probabilities of actions that have led to higher rewards and decrease the probabilities of actions that have resulted in lower rewards. This is done by scaling the gradient update by the rewards obtained in the trajectories, effectively reinforcing actions that have led to positive outcomes.
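In code, this reinforcement of high-reward actions is usually written as the REINFORCE loss: the sum of the log-probabilities of the actions taken, weighted by the trajectory's return, with a minus sign so that a standard optimizer performs ascent. A minimal sketch with placeholder data so it runs on its own; in practice log_probs, episode_return, and the policy come from rolling out an episode, as in the examples below:
import torch
import torch.nn as nn
import torch.optim as optim

# Placeholder policy and episode data, purely for illustration.
policy_net = nn.Linear(4, 2)
optimizer = optim.Adam(policy_net.parameters(), lr=0.01)
probs = torch.softmax(policy_net(torch.zeros(3, 4)), dim=-1)  # 3 fake time steps
log_probs = [torch.log(p[0]) for p in probs]                  # log pi_theta(a_t | s_t) of the chosen actions
episode_return = 1.0                                          # R(tau) for the episode

# REINFORCE update: maximizing sum_t log pi_theta(a_t | s_t) * R(tau) == minimizing its negative.
loss = -torch.stack(log_probs).sum() * episode_return
optimizer.zero_grad()
loss.backward()   # gradients flow back into the policy parameters
optimizer.step()  # one ascent step on the expected return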
Repeat
The process of collecting trajectories, computing the policy objective, and updating the policy parameters is iteratively repeated. The agent interacts with the environment, collects more trajectories, and updates the policy based on the collected experiences. This iterative process continues until the policy converges to an optimal or near-optimal policy.
It's important to note that Policy Gradients are typically used in the context of episodic tasks, where episodes have a natural termination point (e.g., reaching a goal, winning a game, or completing a task). The updates are often performed at the end of each episode, using the entire trajectory to estimate the policy gradient.
To better understand how Policy Gradients work in practice, let's walk through three examples in Python using freely available OpenAI Gym environments.
CartPole with OpenAI Gym
In this example, we'll use the OpenAI Gym environment, specifically CartPole. CartPole is a simple RL environment where a pole is balanced on a moving cart. The agent's task is to keep the pole upright by applying forces to the cart.
import gym
import numpy as np

# Create the CartPole environment
env = gym.make('CartPole-v1')

# Define a simple stochastic linear policy: the probability of pushing the
# cart to the right is a logistic function of a linear score of the observation
def policy(observation, theta):
    prob_right = 1.0 / (1.0 + np.exp(-np.dot(theta, observation)))
    action = 1 if np.random.rand() < prob_right else 0
    return action, prob_right

# Initialize policy parameters and learning rate
theta = np.random.rand(4)
alpha = 0.01

# Training loop
for episode in range(1000):
    observation = env.reset()
    episode_rewards = 0
    grad_log_probs = []  # gradients of log pi_theta(a_t | s_t) with respect to theta
    for t in range(200):
        # Render the environment for visualization (optional)
        env.render()
        # Sample an action from the policy
        action, prob_right = policy(observation, theta)
        # Gradient of the log-probability of the chosen action (logistic policy)
        grad_log_probs.append((action - prob_right) * observation)
        # Take the chosen action
        observation, reward, done, _ = env.step(action)
        # Update the total rewards
        episode_rewards += reward
        if done:
            break
    # REINFORCE update (Monte Carlo style): scale the log-probability
    # gradients of the chosen actions by the return of the episode
    theta += alpha * episode_rewards * np.sum(grad_log_probs, axis=0)
The script above sets up the CartPole environment with Gym and defines a simple stochastic linear policy: the probability of pushing the cart to the right is a logistic function of the observation. It initializes the policy parameters randomly and enters a training loop in which the agent acts according to the policy, accumulates rewards, and records the gradient of the log-probability of each chosen action. After each episode, the parameters are updated with a REINFORCE-style policy gradient step that scales those gradients by the episode's return, favoring actions that resulted in higher rewards. This iterative process repeats for 1000 episodes, allowing the agent to learn and refine its policy, ultimately aiming to balance the pole on the cart effectively.
LunarLander with OpenAI Gym
Next, we'll work with the LunarLander environment in OpenAI Gym, where the agent controls a spacecraft to land safely on the moon's surface.
import gym
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

# Create the LunarLander environment
env = gym.make('LunarLander-v2')

# Define a neural network policy
class PolicyNetwork(nn.Module):
    def __init__(self, input_size, output_size):
        super(PolicyNetwork, self).__init__()
        self.fc1 = nn.Linear(input_size, 128)
        self.fc2 = nn.Linear(128, output_size)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return self.softmax(x)

# Initialize policy network
policy_net = PolicyNetwork(env.observation_space.shape[0], env.action_space.n)

# Define the optimizer
optimizer = optim.Adam(policy_net.parameters(), lr=0.01)

# Training loop
for episode in range(1000):
    state = env.reset()
    episode_rewards = 0
    log_probs = []
    for t in range(1000):
        # Render the environment for visualization (optional)
        env.render()
        # Sample an action from the policy
        action_probs = policy_net(torch.FloatTensor(state))
        action = np.random.choice(env.action_space.n, p=action_probs.detach().numpy())
        # Calculate the log probability of the selected action
        log_prob = torch.log(action_probs[action])
        log_probs.append(log_prob)
        # Take the chosen action
        next_state, reward, done, _ = env.step(action)
        # Update the total rewards
        episode_rewards += reward
        state = next_state
        if done:
            break
    # Calculate the REINFORCE loss (negative log-probabilities weighted by the
    # episode return) and perform the policy gradient update
    loss = -torch.stack(log_probs).sum() * episode_rewards
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
This script employs a neural network policy to train an agent in the LunarLander-v2 environment using Policy Gradients. In the training loop, the agent interacts with the environment, selecting actions according to the probability distribution produced by the policy network. It accumulates rewards, records the log probabilities of the actions taken, and updates the policy using an estimate of the gradient of the expected return. This iterative process unfolds across 1000 episodes, allowing the agent to gradually learn a policy that lands the lunar lander while maximizing the cumulative reward.
Pong with OpenAI Gym
Pong is a classic Atari game where the agent learns to control a paddle and play against an opponent.
import gym
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

# Create the Pong environment
env = gym.make('Pong-v0')

# Preprocess a raw 210x160x3 Pong frame into a flat 80*80 vector
def preprocess(frame):
    frame = frame[35:195]                          # crop out the scoreboard
    frame = frame[::2, ::2, 0].astype(np.float32)  # downsample by 2 and keep one color channel
    frame[frame == 144] = 0                        # erase one background color
    frame[frame == 109] = 0                        # erase the other background color
    frame[frame != 0] = 1                          # paddles and ball become 1
    return frame.ravel()                           # flatten to an 80*80 = 6400 vector

# Define a neural network policy
class PolicyNetwork(nn.Module):
    def __init__(self, input_size, output_size):
        super(PolicyNetwork, self).__init__()
        self.fc1 = nn.Linear(input_size, 128)
        self.fc2 = nn.Linear(128, output_size)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return self.softmax(x)

# Initialize policy network (inputs are the flattened, preprocessed frames)
policy_net = PolicyNetwork(80 * 80, env.action_space.n)

# Define the optimizer
optimizer = optim.Adam(policy_net.parameters(), lr=0.001)

# Training loop
for episode in range(1000):
    state = env.reset()
    episode_rewards = 0
    log_probs = []
    for t in range(1000):
        # Render the environment for visualization (optional)
        env.render()
        # Preprocess the raw frame
        state = preprocess(state)
        # Sample an action from the policy
        action_probs = policy_net(torch.FloatTensor(state))
        action = np.random.choice(env.action_space.n, p=action_probs.detach().numpy())
        # Calculate the log probability of the selected action
        log_prob = torch.log(action_probs[action])
        log_probs.append(log_prob)
        # Take the chosen action
        next_state, reward, done, _ = env.step(action)
        # Update the total rewards
        episode_rewards += reward
        state = next_state
        if done:
            break
    # Calculate the REINFORCE loss (negative log-probabilities weighted by the
    # episode return) and perform the policy gradient update
    loss = -torch.stack(log_probs).sum() * episode_rewards
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
This script trains an agent to play Pong in the 'Pong-v0' environment from Gym using Policy Gradients. It first creates the Pong environment and defines a preprocessing step that crops, downsamples, and flattens each raw frame. The agent's policy is a neural network, PolicyNetwork, which takes the preprocessed frame as input and outputs a probability distribution over the available actions (e.g., move the paddle up or down) via a softmax activation. Inside the training loop, the agent interacts with the environment over multiple episodes. For each episode it resets the state and then, for at most 1000 time steps, preprocesses the frame, samples an action from the policy network, records the log probability of the selected action, executes the action, and accumulates the reward until the episode terminates. After each episode, the agent computes the policy gradient loss, which encourages actions taken in high-reward episodes and discourages actions taken in low-reward ones. This loss is backpropagated through the policy network, and the Adam optimizer (learning rate 0.001) adjusts the network's parameters. Repeating this for 1000 episodes lets the agent gradually improve its Pong-playing policy.
In summary, the script employs a neural network-based policy to train an agent in the 'Pong-v0' environment using Policy Gradients. Over 1000 episodes, the agent learns to play Pong effectively by maximizing its cumulative rewards while taking actions based on the learned policy and updating the policy's parameters through gradient ascent.
In this post, we provided an introduction to Policy Gradients, explained how they work in detail, and showcased three examples in Python using freely available OpenAI Gym environments. Policy Gradients are a powerful class of reinforcement learning algorithms that find applications in various domains, from game playing to robotics and healthcare. They directly parameterize policies to maximize expected cumulative rewards, making them versatile tools for training agents to make sequential decisions in complex environments. However, they also have limitations, such as high variance in training and sensitivity to hyperparameters, which require careful tuning. Policy Gradients are just one of many RL algorithms, and the choice of algorithm depends on the specific problem and the characteristics of the environment.