Overview

The deadly corridor curriculum provides a structured progression from basic survival skills to complex combat scenarios. Training on levels 1-4 builds fundamental movement and targeting policies, while level 5 serves as the ultimate benchmark.
deadly_corridor_5.cfg is a significant jump from level 4. Habits learned in 1-4 (like straight running toward armor) may underperform on level 5. Adjust curriculum pacing accordingly.

Available Scenarios

Deadly Corridor Curriculum

All configs use deadly_corridor.wad:
| Config | Difficulty | Description |
|--------|------------|-------------|
| deadly_corridor_1.cfg | Beginner | Minimal enemies, ample resources |
| deadly_corridor_2.cfg | Easy | Slightly more enemies, tighter spacing |
| deadly_corridor_3.cfg | Medium | Balanced challenge, requires dodging |
| deadly_corridor_4.cfg | Hard | Dense enemy placement, ammo scarcity |
| deadly_corridor_5.cfg | Benchmark | Extreme difficulty, official test level |
Progress through levels 1-4 builds basic policies but can instill movement habits that underperform on level 5 (such as running straight toward the armor). Fine-tune on level 5 with a lower learning rate to adapt movement behavior.
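The staged progression and the level-5 learning-rate drop can be sketched as a simple schedule. PPOConfig is stood in by plain dicts here so the sketch is self-contained; the values mirror the stages described later in this page:

```python
# Hypothetical curriculum schedule mirroring the training stages below.
# PPOConfig fields are represented as plain dicts for illustration.
CURRICULUM = [
    {"doom_config": "deadly_corridor_1.cfg", "learning_rate": 3e-4, "max_episodes": 500},
    {"doom_config": "deadly_corridor_2.cfg", "learning_rate": 3e-4, "max_episodes": 1000},
    {"doom_config": "deadly_corridor_4.cfg", "learning_rate": 3e-4, "max_episodes": 2000},
    # Level 5 is the benchmark jump: fine-tune with a reduced learning rate.
    {"doom_config": "deadly_corridor_5.cfg", "learning_rate": 1e-4, "max_episodes": 3000},
]

def next_stage(current_index: int) -> dict:
    """Return the config for the following stage, clamping at the benchmark."""
    return CURRICULUM[min(current_index + 1, len(CURRICULUM) - 1)]
```

The only hyperparameter that changes across stages is the learning rate at level 5, which drops to avoid overwriting movement policies learned on levels 1-4.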

Other Scenarios

Progressive Deathmatch (Default)
config = PPOConfig(doom_config="progressive_deathmatch.cfg")
  • Similar to survival, but kills don’t reset ammo count
  • Encourages proper ammo management
  • Movement tweaks make training easier
  • Uses progressive_deathmatch.wad
Survival
config = PPOConfig(doom_config="survival.cfg")
  • Classic survival mode
  • Uses survival.wad

Curriculum Design Principles

From README.md (lines 22-23):
Files deadly_corridor_1.cfg to deadly_corridor_4.cfg ramp difficulty gradually, but deadly_corridor_5.cfg is a significant jump (and the actual benchmark). Progress through 1-4 builds basic policies yet may result in movement habits that underperform on 5 (straight running toward armor). Adjust curriculum pacing accordingly.

Why Progressive Training Matters

Starting directly on level 5 often results in:
  • Random exploration with minimal reward signal
  • High variance in policy gradients
  • Slow or failed convergence
  • Neurons receiving noisy, uninformative feedback
Curriculum learning provides:
  • Gradual skill acquisition (movement → targeting → tactics)
  • Stronger reward signals early in training
  • More stable policy updates
  • Conditioned neurons with meaningful stimulus-response mappings
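The "stronger reward signals early" point can be made concrete with a toy calculation: with a sparse terminal reward, a random policy that rarely finishes the corridor produces mostly zero-signal episodes on a hard level, while an easy level rewards most episodes. The success probabilities below are illustrative, not measured:

```python
import random

def fraction_with_signal(success_prob: float, episodes: int = 10_000, seed: int = 0) -> float:
    """Fraction of episodes in which a sparse terminal reward is ever observed."""
    rng = random.Random(seed)
    hits = sum(rng.random() < success_prob for _ in range(episodes))
    return hits / episodes

easy = fraction_with_signal(0.50)   # illustrative level-1 success rate
hard = fraction_with_signal(0.02)   # illustrative level-5 success rate
```

Roughly half the easy-level episodes carry signal versus a few percent on the hard level, which is why starting on level 5 leaves the policy gradient dominated by noise.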

Configuration for Deadly Corridor

Architecture & Feedback Tuning

The specific values below are tuned for the deadly corridor scenario (deadly_corridor_1.cfg through deadly_corridor_5.cfg). Treat them as a starting point only; other scenarios (progressive deathmatch, survival) will likely require different values for feedback scaling, reward shaping, ray-cast geometry, and curriculum pacing.
From README.md lines 25-41:
config = PPOConfig(
    # Reward feedback
    use_reward_feedback=True,  # Uses rewards to drive positive/negative feedback
    
    # Decoder configuration
    decoder_enforce_nonnegative=False,
    decoder_freeze_weights=False,  # Decoder stays free to mirror encoder intent
    decoder_zero_bias=True,  # Keeps bias at zero so decoded actions depend solely on encoder
    decoder_use_mlp=False,  # Linear decoder keeps hardware mapping transparent
    decoder_mlp_hidden=32,
    decoder_weight_l2_coef=0.0,
    decoder_bias_l2_coef=0.0,
    
    # Ray-cast features tuned for corridor geometry
    wall_ray_count=12,
    wall_ray_max_range=64,
    wall_depth_max_distance=18.0,
    
    # Encoder configuration
    encoder_trainable=True,
    encoder_entropy_coef=-0.10,  # Encourages confident (low-variance) stimulation
    encoder_use_cnn=True,
    encoder_cnn_channels=16,
    encoder_cnn_downsample=4,
    
    # Ablation testing
    decoder_ablation_mode='none',  # Swap to 'random' or 'zero' to test robustness
    
    # Episode feedback
    episode_positive_feedback_event=None,  # Default to global reward feedback
    episode_negative_feedback_event=None,
    
    # Distance normalization for corridor geometry
    enemy_distance_normalization=1312.0  # Leave untouched unless you change WAD geometry
)
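The ray-cast and distance parameters above imply a simple clamp-and-scale treatment of spatial features. A hypothetical version is sketched below; the real feature extractor is not shown in this doc, so the function names are invented and only the constants come from the config:

```python
def normalize_ray_depths(depths, max_distance=18.0):
    """Clamp ray-cast depths at wall_depth_max_distance and scale to [0, 1]."""
    return [min(d, max_distance) / max_distance for d in depths]

def normalize_enemy_distance(d, scale=1312.0):
    """Scale enemy distance by the corridor-length constant, clamped to 1.0."""
    return min(d / scale, 1.0)
```

This is why `enemy_distance_normalization=1312.0` should be left untouched unless the WAD geometry changes: the constant encodes the corridor's length, and changing it rescales the feature the encoder was trained against.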

PPO Hyperparameters

From README.md lines 19-20:
config = PPOConfig(
    learning_rate=3e-4,
    gamma=0.99,
    gae_lambda=0.95,  # High gamma/lambda needed for long-range dependencies
    clip_epsilon=0.2,
    entropy_coef=0.02,
    steps_per_update=2048,  # Higher for stability (slow without parallelization)
    batch_size=256,
    num_epochs=4
)
Many RL implementations use far lower gamma (0.95) and lambda (0.90) for GAE, but those values can severely hurt training on the lower levels: the agent takes less damage and therefore lives longer, so rewards depend on actions far in the past and credit must be assigned over long horizons.
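The horizon argument can be quantified: GAE weights the TD error k steps ahead by (gamma * lambda)^k, so the number of steps before that weight halves gives a rough credit-assignment horizon. A quick comparison of the two settings (pure arithmetic, no RL library assumed):

```python
import math

def gae_half_life(gamma: float, lam: float) -> float:
    """Steps until the (gamma * lambda)^k weight on a future TD error decays to 1/2."""
    return math.log(0.5) / math.log(gamma * lam)

long_horizon = gae_half_life(0.99, 0.95)   # the values used here: ~11 steps
short_horizon = gae_half_life(0.95, 0.90)  # common defaults elsewhere: ~4 steps
```

Roughly 11 steps versus 4: the higher gamma/lambda pair lets reward from the end of a long corridor run still influence decisions made much earlier in the episode.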

Training Progression

Stage 1: Basic Movement (Level 1)

config = PPOConfig(
    doom_config="deadly_corridor_1.cfg",
    learning_rate=3e-4,
    max_episodes=500
)
Train until:
  • Agent consistently moves forward
  • Picks up armor/health
  • Survival time > 30 seconds
python training_server.py \
    --mode train \
    --device cuda \
    --cl1-host 192.168.1.50 \
    --max-episodes 500

Stage 2: Targeting (Levels 2-3)

config = PPOConfig(
    doom_config="deadly_corridor_2.cfg",
    learning_rate=3e-4,
    max_episodes=1000
)
Train until:
  • Agent turns toward enemies
  • Kill count increasing
  • Dodges incoming fire
Load checkpoint from Stage 1:
python training_server.py \
    --mode train \
    --device cuda \
    --cl1-host 192.168.1.50 \
    --checkpoint checkpoints/l5_2048_rand/episode_500.pt \
    --max-episodes 1500

Stage 3: Tactics (Level 4)

config = PPOConfig(
    doom_config="deadly_corridor_4.cfg",
    learning_rate=3e-4,
    max_episodes=2000
)
Train until:
  • Strategic positioning
  • Ammo conservation
  • Multi-enemy engagement

Stage 4: Fine-Tuning (Level 5)

Level 5 is a significant difficulty spike. Use a lower learning rate to adapt existing policies without catastrophic forgetting.
config = PPOConfig(
    doom_config="deadly_corridor_5.cfg",
    learning_rate=1e-4,  # Reduced from 3e-4
    max_episodes=3000
)
From README.md line 54:
Consider fine-tuning on deadly_corridor_5.cfg with a lower learning rate to adapt movement behavior.
python training_server.py \
    --mode train \
    --device cuda \
    --cl1-host 192.168.1.50 \
    --checkpoint checkpoints/l5_2048_rand/episode_2000.pt \
    --max-episodes 5000

Monitoring Curriculum Progress

TensorBoard Metrics

tensorboard --logdir checkpoints/l5_2048_rand/logs --port 6006
Key metrics per curriculum stage:
| Metric | Level 1 Target | Level 2-3 Target | Level 4 Target | Level 5 Target |
|--------|----------------|------------------|----------------|----------------|
| Episode Reward | > 100 | > 300 | > 500 | > 800 |
| Kill Count | 1-2 | 3-5 | 5-8 | 8+ |
| Survival Time | 30s | 45s | 60s | 90s+ |
| Ammo Waste | High | Medium | Low | Minimal |

Transition Criteria

Move to the next level when:
  1. Reward plateau: No improvement for 100 episodes
  2. Consistency: 80% of episodes achieve target metrics
  3. Skill demonstration: Agent exhibits desired behaviors (recorded gameplay)
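The first two criteria (the third needs a human watching recorded gameplay) can be sketched as a simple check over the episode-reward history. Threshold names here are invented; only the 100-episode window and 80% consistency figure come from the criteria above:

```python
def ready_to_advance(rewards, target, window=100, consistency=0.8):
    """Advance when rewards have plateaued over `window` episodes AND
    at least `consistency` of the last `window` episodes hit `target`."""
    if len(rewards) < 2 * window:
        return False
    recent, previous = rewards[-window:], rewards[-2 * window:-window]
    plateaued = max(recent) <= max(previous)  # no new best in the last window
    consistent = sum(r >= target for r in recent) / window >= consistency
    return plateaued and consistent
```

A plateau without consistency (the agent stalled below target) correctly keeps training on the current level rather than advancing.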

Checkpoint Management

Saving Checkpoints

Checkpoints are automatically saved every 100 episodes (configurable):
config = PPOConfig(
    checkpoint_dir="checkpoints/l5_2048_rand",
    save_interval=100  # episodes
)
From USAGE.md lines 55-61:
# Working checkpoint directory
checkpoints/l5_2048_rand

# To save/load checkpoints, make a copy of this directory
cp -r checkpoints/l5_2048_rand checkpoints/level2_backup

Loading Between Stages

# Stage 1 → Stage 2
python training_server.py \
    --mode train \
    --device cuda \
    --cl1-host 192.168.1.50 \
    --checkpoint checkpoints/l5_2048_rand/episode_500.pt

# Update doom_config in code to deadly_corridor_2.cfg

Common Curriculum Issues

Symptom: Works well on levels 1-3, fails catastrophically on level 5.
Causes:
  • Over-optimization on easy levels (e.g., always running straight)
  • Insufficient exploration on harder levels
Solutions:
  • Reduce steps_per_update on level 5 for more frequent updates
  • Increase entropy_coef temporarily to encourage exploration
  • Lower learning rate to prevent catastrophic forgetting
config = PPOConfig(
    doom_config="deadly_corridor_5.cfg",
    learning_rate=1e-4,  # Lower
    entropy_coef=0.04,   # Higher
    steps_per_update=1024  # More frequent updates
)
Symptom: Reward stays flat even on the easiest level.
Causes:
  • Neurons not responding to stimulation
  • Feedback channels misconfigured
  • Ablation mode accidentally enabled
Solutions:
  • Check decoder_ablation_mode='none'
  • Verify spike counts > 0 (TensorBoard: Spikes/total_count)
  • Inspect feedback amplitude/frequency in logs
  • Test with --show_window to observe behavior
Symptom: Large reward variance when loading a checkpoint on a new level.
Causes:
  • Learning rate too high for new scenario
  • Value network hasn’t adapted to new reward distribution
Solutions:
  • Always reduce learning rate 2-3x when changing levels
  • Use normalize_returns=True for value stability
  • Run 50-100 episodes on new level before judging performance
config = PPOConfig(
    learning_rate=1e-4,  # 3x lower than 3e-4
    normalize_returns=True,
    max_grad_norm=1.0  # Reduce from 3.0 for stability
)

Action Space Considerations

Hybrid vs. Discrete Actions

From README.md line 21:
Hybrid action spaces are used (and greatly preferred) unless use_discrete_action_set=True. Realistically, you only flip this if all else fails to reduce entropy as it greatly reduces the movement fidelity of the agent and just doesn’t look as cool.
# Preferred (hybrid)
config = PPOConfig(
    use_discrete_action_set=False  # Default
)

# Fallback (simplified)
config = PPOConfig(
    use_discrete_action_set=True  # Only if hybrid fails
)
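For intuition, a hybrid action space pairs discrete button presses with continuous movement deltas, while a discrete set enumerates a handful of fixed combinations. The sketch below is a generic illustration of that trade-off, not this project's actual action encoding:

```python
from dataclasses import dataclass

@dataclass
class HybridAction:
    """Discrete button choice plus continuous movement deltas (illustrative)."""
    attack: bool
    move_forward: float  # e.g. throttle in [-1, 1]
    turn_delta: float    # e.g. degrees per tick

# A discrete action set collapses that fidelity into a few fixed choices,
# which is why it reduces entropy but makes movement look coarse.
DISCRETE_SET = [
    HybridAction(attack=False, move_forward=1.0, turn_delta=0.0),
    HybridAction(attack=False, move_forward=0.0, turn_delta=15.0),
    HybridAction(attack=False, move_forward=0.0, turn_delta=-15.0),
    HybridAction(attack=True,  move_forward=0.0, turn_delta=0.0),
]
```

With the hybrid space the policy outputs the continuous fields directly; with `use_discrete_action_set=True` it only picks an index into a list like `DISCRETE_SET`, losing fine-grained strafing and aim adjustments.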

Visualizing Training

From USAGE.md lines 35-39:
<!-- visualisation.html -->
<img id="img" width="640" src="http://127.0.0.1:12349/doom.mjpeg">
Run with visualization:
python training_server.py \
    --mode train \
    --device cuda \
    --cl1-host 192.168.1.50 \
    --show_window
Open visualisation.html in a browser and update the IP to your training server.