
Overview

The PPOConfig dataclass defines all hyperparameters for Proximal Policy Optimization (PPO) training. These parameters control the learning dynamics, advantage estimation, clipping behavior, and training batch configuration.

Core PPO Hyperparameters

Learning Rate

learning_rate
float
default:"3e-4"
Adam optimizer learning rate for all network components (encoder, decoder, value network). Controls the step size for gradient descent updates. Lower values provide more stable but slower learning.
config = PPOConfig(
    learning_rate=1e-4  # More conservative learning rate
)

Discount Factor (Gamma)

gamma
float
default:"0.99"
Discount factor applied to future rewards. Determines how much the agent values future rewards versus immediate ones. Values closer to 1.0 prioritize long-term planning.
In training_server.py, this is increased to 0.997 for longer-horizon planning in DOOM survival scenarios.
config = PPOConfig(
    gamma=0.997  # Increased for survival scenarios
)
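To make gamma's effect concrete, here is a minimal plain-Python sketch of the discounted return (illustrative only, not part of the library):

```python
def discounted_return(rewards, gamma):
    """Backward pass computing G_0 = sum over t of gamma**t * rewards[t]."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# A reward 100 steps in the future keeps roughly 37% of its value at
# gamma=0.99, but roughly 74% at gamma=0.997 -- which is why the survival
# scenario raises gamma.
```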

Generalized Advantage Estimation (GAE)

gae_lambda
float
default:"0.95"
Lambda parameter for GAE. Controls the bias-variance tradeoff in advantage estimation:
  • 1.0 = high variance, low bias (uses full Monte Carlo returns)
  • 0.0 = low variance, high bias (uses only 1-step TD)
  • 0.95 = recommended balanced setting
config = PPOConfig(
    gae_lambda=0.95  # Standard balanced setting
)
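The lambda settings above can be made concrete with a plain-Python GAE sketch (illustrative; the actual implementation operates on batched tensors):

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    values needs one extra bootstrap entry: len(values) == len(rewards) + 1.
    lam=0.0 reduces to 1-step TD errors; lam=1.0 (with value estimates of
    zero) reduces to full Monte Carlo returns.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # 1-step TD error
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```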

Clipping and Loss Coefficients

clip_epsilon
float
default:"0.2"
PPO policy clipping parameter. Limits the size of policy updates to prevent destructively large changes. The probability ratio is clipped to [1 - epsilon, 1 + epsilon].
value_loss_coef
float
default:"0.3"
Coefficient for the value function loss in the total loss calculation.
Total loss = policy_loss + value_loss_coef * value_loss + entropy_coef * entropy_loss
entropy_coef
float
default:"0.02"
Entropy bonus coefficient for policy exploration. Encourages exploration by penalizing overly deterministic policies. Higher values increase randomness in action selection.
config = PPOConfig(
    clip_epsilon=0.15,      # Tighter clipping for stability
    value_loss_coef=0.5,    # Stronger value learning
    entropy_coef=0.01       # Less exploration
)
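How these coefficients interact can be sketched in scalar form (illustrative only; real code works on batched tensors, and the policy term enters the loss negated since the surrogate objective is maximized):

```python
def clipped_objective(ratio, advantage, clip_epsilon=0.2):
    """PPO clipped surrogate objective for one sample.

    ratio = pi_new(a|s) / pi_old(a|s), clipped to [1 - eps, 1 + eps];
    the pessimistic minimum caps how far a single update can move the policy.
    """
    clipped = max(1.0 - clip_epsilon, min(1.0 + clip_epsilon, ratio))
    return min(ratio * advantage, clipped * advantage)

def total_loss(policy_loss, value_loss, entropy_loss,
               value_loss_coef=0.3, entropy_coef=0.02):
    """Combine the three terms exactly as in the formula above."""
    return policy_loss + value_loss_coef * value_loss + entropy_coef * entropy_loss
```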

Gradient Clipping

max_grad_norm
float
default:"3.0"
Maximum gradient norm for gradient clipping. Prevents exploding gradients by clipping the global norm of the gradients to this value.
A comment in the code notes: “This can probably be reduced to 1, 0.5 was clipping policy movements too much”
config = PPOConfig(
    max_grad_norm=1.0  # More aggressive clipping
)
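Global-norm clipping itself is simple; here is a scalar-gradient sketch (illustrative — in PyTorch, `torch.nn.utils.clip_grad_norm_` performs the same rescaling across all parameter tensors):

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale grads so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        return [g * scale for g in grads]
    return list(grads)
```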

Return Normalization

normalize_returns
bool
default:"True"
Whether to normalize advantage estimates and returns. Normalizes advantages to zero mean and unit variance, which can stabilize training.
In ppo_doom.py: “Leave this on for the most part, stabilizes the critic, maybe a running norm would be better?”
In training_server.py: Set to False as part of DOOM Initial Report tuning.
config = PPOConfig(
    normalize_returns=True  # Stabilizes critic learning
)
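A minimal sketch of the normalization applied when normalize_returns=True (illustrative; the implementation's exact epsilon and variance convention may differ):

```python
def normalize_advantages(advantages, eps=1e-8):
    """Shift to zero mean and scale to unit variance."""
    n = len(advantages)
    mean = sum(advantages) / n
    std = (sum((a - mean) ** 2 for a in advantages) / n) ** 0.5
    return [(a - mean) / (std + eps) for a in advantages]
```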

Training Configuration

Batch and Episode Settings

num_envs
int
default:"1"
Number of parallel environments for data collection. Currently, only a single environment is supported due to hardware constraints.
steps_per_update
int
default:"2048"
Number of environment steps collected before each PPO update. Total samples per update = num_envs × steps_per_update
batch_size
int
default:"256"
Minibatch size for SGD updates during the PPO epochs. The collected steps_per_update samples are divided into minibatches of this size for optimization.
num_epochs
int
default:"4"
Number of optimization epochs per PPO update. Controls how many times the collected batch of experience is iterated over. More epochs can improve sample efficiency but risk overfitting to stale data.
max_episodes
int
default:"2000"
Maximum number of training episodes before termination.
config = PPOConfig(
    steps_per_update=4096,  # Collect more samples
    batch_size=512,         # Larger batches
    num_epochs=8            # More optimization per batch
)
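The arithmetic connecting these settings, using the values above (illustrative):

```python
num_envs = 1
steps_per_update = 4096
batch_size = 512
num_epochs = 8

samples_per_update = num_envs * steps_per_update                # 4096
minibatches_per_epoch = samples_per_update // batch_size        # 8
gradient_steps_per_update = minibatches_per_epoch * num_epochs  # 64
```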

Example Configurations

Conservative Training

conservative_config = PPOConfig(
    learning_rate=1e-4,
    gamma=0.99,
    gae_lambda=0.95,
    clip_epsilon=0.1,
    value_loss_coef=0.5,
    entropy_coef=0.01,
    max_grad_norm=0.5,
    normalize_returns=True,
    steps_per_update=2048,
    batch_size=128,
    num_epochs=3
)

Aggressive Exploration

exploration_config = PPOConfig(
    learning_rate=5e-4,
    gamma=0.99,
    gae_lambda=0.90,
    clip_epsilon=0.3,
    value_loss_coef=0.2,
    entropy_coef=0.05,      # Higher entropy for more exploration
    max_grad_norm=5.0,
    normalize_returns=False,
    steps_per_update=4096,
    batch_size=512,
    num_epochs=4
)

Long-Horizon Survival (DOOM)

survival_config = PPOConfig(
    learning_rate=3e-4,
    gamma=0.997,            # Higher gamma for long-term planning
    gae_lambda=0.95,
    clip_epsilon=0.2,
    value_loss_coef=0.3,
    entropy_coef=0.02,
    max_grad_norm=3.0,
    normalize_returns=False,
    steps_per_update=2048,
    batch_size=256,
    num_epochs=4
)

Related Pages

Encoder/Decoder: Network architecture settings
Action Spaces: Action space configuration