POPGym Suite: Memory & RL Benchmark
- POPGym Suite is a comprehensive RL benchmark set designed to test memory and partial observability using low-dimensional and pixel-based environments.
- Its dual components, Classic and Arcade, facilitate rapid prototyping and high-throughput simulation on both commodity GPUs and hardware accelerators.
- The suite integrates standardized APIs and analytical tools, such as memory saliency metrics, to rigorously evaluate recurrent policies and sample efficiency.
The Partially Observable Process Gym (POPGym) Suite is a collection of reinforcement learning (RL) benchmarks designed specifically to probe memory and partial observability. It consists of several environments and analytical tools enabling large-scale evaluation of sequence models under partially observable Markov decision processes (POMDPs) as well as fully observable Markov decision processes (MDPs). Its two flagship components, the original POPGym with low-dimensional tasks and the hardware-accelerated POPGym Arcade with pixel-based environments, jointly provide one of the most diverse and technically rigorous benchmarking platforms for memory-centric RL research to date (Wang et al., 3 Mar 2025, Morad et al., 2023).
1. Formal Definition and Environment Families
POPGym environments are uniformly cast as POMDPs of the form

$$\langle \mathcal{S}, \mathcal{A}, \mathcal{O}, T, \Omega, R, \gamma \rangle,$$

where $\mathcal{S}$ is the latent state space, $\mathcal{A}$ the action space, $\mathcal{O}$ the observation space, $T(s' \mid s, a)$ the state transition kernel, $\Omega(o \mid s', a)$ the observation model, $R(s, a)$ the reward function with bounded range, and $\gamma \in [0, 1)$ the discount factor. At each timestep, the agent applies a policy $\pi(a_t \mid o_{\le t})$ conditioned on internal memory, yielding sequenced transitions under partial observability.
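The role of each element of the tuple can be made concrete with a minimal sketch (names and dynamics are illustrative, not from the POPGym API): a two-state task whose observation function aliases the latent states after the first step, so only a policy with memory can act optimally.

```python
import random

GAMMA = 0.99  # discount factor gamma

class TinyPOMDP:
    """Toy POMDP: remember the first observation to act correctly later."""

    def reset(self):
        self.s = random.choice([0, 1])  # latent state drawn from S = {0, 1}
        self.t = 0
        return self.s                   # the first observation reveals s

    def step(self, action):
        # Observation model Omega: after t = 0, every latent state maps to
        # the same aliased observation (-1), hiding s from the agent.
        obs = -1
        # Reward R: +1 only if the action matches the hidden state.
        reward = 1.0 if action == self.s else 0.0
        self.t += 1
        done = self.t >= 2
        return obs, reward, done

env = TinyPOMDP()
first_obs = env.reset()
# A memoryless policy cannot recover s from obs = -1; a policy that
# stores first_obs in memory acts optimally.
obs, reward, done = env.step(first_obs)  # recall the initial observation
print(reward)  # 1.0
```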
POPGym (Classic)
- Comprises 15 low-dimensional environments spanning five tags: Diagnostic, Control, Noisy, Game, and Navigation.
- Each comes in three difficulty levels, supporting fast simulation and rapid convergence (hours on commodity GPUs).
- Observation spaces are vectors or small arrays, with modalities designed to stress various types of memory.
POPGym Arcade
- Introduces pixel-based environments cast as POMDPs/MDPs with:
  - an internal latent state $s_t \in \mathcal{S}$;
  - a discrete action space $\mathcal{A} = \{\text{up}, \text{down}, \text{left}, \text{right}, \text{fire}\}$;
  - observations $o_t \in \mathcal{O}$ given as RGB images.
- Each environment is paired with both a fully observable ("MDP twin") and a partially observable ("POMDP twin") variant.
- The MDP twin: the observation function $\Omega$ is injective with respect to the state, so each observation uniquely identifies $s_t$.
- The POMDP twin: $\Omega$ possibly adds masking or stochastic noise, making observations aliased with respect to $s_t$.
- Users can switch between observability settings mid-episode, allowing direct, counterfactual assessment of partial observability.
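The injective-versus-aliased distinction between the twins can be sketched as follows (a conceptual illustration in pure Python; the field names and rendering functions are hypothetical, not the POPGym Arcade API). The same latent state is passed through two observation functions: the MDP twin separates all states, while the POMDP twin masks part of the state so that distinct states collide.

```python
def render_mdp(state):
    # MDP twin: injective — every latent state yields a distinct observation.
    return ("full", tuple(sorted(state.items())))

def render_pomdp(state):
    # POMDP twin: mask a component of the state, so distinct latent
    # states can map to the same (aliased) observation.
    masked = {k: v for k, v in state.items() if k != "ball_velocity"}
    return ("masked", tuple(sorted(masked.items())))

# Two latent states that differ only in the masked component.
s1 = {"ball_pos": (3, 4), "ball_velocity": (1, -1)}
s2 = {"ball_pos": (3, 4), "ball_velocity": (-1, 1)}

print(render_mdp(s1) != render_mdp(s2))      # True: MDP twin separates them
print(render_pomdp(s1) == render_pomdp(s2))  # True: POMDP twin aliases them
```

Switching which renderer is applied mid-episode, as the suite allows, changes the observability of the very same underlying trajectory.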
2. Hardware Acceleration and Parallel Experimentation
POPGym Arcade is implemented using JAX and compiled via XLA, running entirely on hardware accelerators (GPU/TPU) for extreme simulation throughput.
- Vectorized parallelism: All environments are rolled out in batched form, with states and actions processed in a single kernel invocation.
- JIT-compilation: Both step and rendering logic are JIT-fused via jax.jit, attaining memory-bandwidth-limited frame rates (hundreds of thousands of pixel frames per second on an RTX 4090).
- End-to-end device execution: The PQN (Parallelised Q-Network) learning algorithm, which dispenses with replay buffers and target networks, executes fully on GPU, avoiding all CPU-device synchronization except during model update steps.
This architecture yields an order-of-magnitude higher FPS than CPU-based Atari/MinAtar settings, despite handling much larger image spaces (Wang et al., 3 Mar 2025).
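The vectorized-rollout pattern behind this throughput can be sketched in pure Python, with an explicit loop standing in for jax.vmap/jax.jit (the toy dynamics are invented for illustration): one "kernel invocation" advances every environment in the batch at once.

```python
def step_single(state, action):
    # Toy dynamics: integrate position, clip to a 10-cell screen,
    # reward reaching the rightmost cell.
    pos = max(0, min(9, state + action))
    reward = 1.0 if pos == 9 else 0.0
    return pos, reward

def step_batched(states, actions):
    # Stand-in for jax.vmap(step_single): map step_single over the
    # batch axis so all environments advance in one call.
    results = [step_single(s, a) for s, a in zip(states, actions)]
    next_states = [r[0] for r in results]
    rewards = [r[1] for r in results]
    return next_states, rewards

states = [0, 5, 9]
actions = [1, 1, 1]
next_states, rewards = step_batched(states, actions)
print(next_states)  # [1, 6, 9]
print(rewards)      # [0.0, 0.0, 1.0]
```

In the real suite the batched step is additionally JIT-compiled, so the entire rollout stays on the accelerator with no per-step Python overhead.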
3. Memory Model API and Baseline Implementations
The POPGym suite defines a memory-agnostic API, facilitating rapid prototyping and fair comparison among memory-augmented RL architectures:
- Abstract MemoryModel interface (Classic POPGym): requires two methods, initial_state(batch_size) and memory_forward(obs, hidden), and plugs into Ray RLlib for seamless integration with distributed RL trainers.
- Baseline models: thirteen sequence-processing architectures, including a classical MLP, Positional MLP (PosMLP), Elman RNN, LSTM, GRU, IndRNN, DNC, Fast Autoregressive Transformer (FART), Fast Weight Programmer (FWP), frame stacking, Temporal Convolutional Network (TCN), Legendre Memory Unit (LMU), and Diagonal State-Space Model (S4D). All are standardized in PyTorch (POPGym) or JAX (POPGym Arcade), supporting identical actor/critic head architectures and hyperparameters.
The overall framework enables plug-and-play experimentation: users can select any algorithm from RLlib (e.g., PPO, DQN, IMPALA), combine it with any memory architecture, scale up parallel rollouts, and log training progress using popular tools such as TensorBoard or Weights & Biases (Morad et al., 2023).
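The MemoryModel contract described above, initial_state plus memory_forward, can be illustrated with the simplest concrete memory in the baseline list, frame stacking (a pure-Python sketch; the real baselines are PyTorch/JAX modules with the same two-method shape).

```python
class FrameStackMemory:
    """Minimal MemoryModel-style sketch: memory is a sliding window of
    the last `stack_size` observations per batch element."""

    def __init__(self, stack_size=3):
        self.stack_size = stack_size

    def initial_state(self, batch_size):
        # One padding-filled stack per batch element.
        return [[0] * self.stack_size for _ in range(batch_size)]

    def memory_forward(self, obs, hidden):
        # Push the newest observation, drop the oldest; the stacked
        # window serves as both the policy input and the next hidden state.
        new_hidden = [h[1:] + [o] for o, h in zip(obs, hidden)]
        return new_hidden, new_hidden  # (features, next hidden state)

mem = FrameStackMemory(stack_size=3)
h = mem.initial_state(batch_size=2)
for obs in ([1, 7], [2, 8], [3, 9]):     # three timesteps, batch of two
    features, h = mem.memory_forward(obs, h)
print(features)  # [[1, 2, 3], [7, 8, 9]]
```

Swapping in an LSTM or S4D baseline changes only the internals of the two methods; the trainer-facing contract stays identical, which is what enables fair cross-architecture comparison.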
4. Analytical Tools for Policy Memory Usage
POPGym Arcade provides mathematical introspection utilities for dissecting memory usage in learned recurrent policies. Core to this is the memory saliency metric:
- For a recurrent policy with hidden state $h_t$ and latent $\hat{s}_t$ (fed to the Q-network), memory saliency quantifies the per-pixel influence of a historical observation $o_k$ on the downstream value via

$$S_{t,k}(u) = \left\lVert \frac{\partial Q(\hat{s}_t, \cdot)}{\partial o_k(u)} \right\rVert,$$

where $u$ specifies a pixel. This measure is efficiently computed in JAX via a single backward-through-time traversal and serves as a differentiable proxy for information retention and long-term credit assignment.
- Visualizing $S_{t,k}$ reveals both the aggregate memory horizon and the precise spatial focus across past percepts, supporting hypothesis-driven evaluation of learned credit assignment and memory manipulation (Wang et al., 3 Mar 2025).
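A minimal numerical analogue makes the quantity concrete: perturb one past input and measure the change in a downstream value computed through a recurrent state. (POPGym Arcade computes the exact gradient with JAX autodiff in one backward-through-time pass; this finite-difference sketch, with an invented scalar recurrence, only illustrates what is being measured.)

```python
def rollout_value(observations, w=0.5):
    # Toy recurrent value: h_{t+1} = w * h_t + o_t; value = final h.
    h = 0.0
    for o in observations:
        h = w * h + o
    return h

def saliency(observations, k, eps=1e-6):
    # d(value)/d(o_k), estimated by central differences: how much does
    # the k-th past observation still influence the current value?
    bumped_up = list(observations); bumped_up[k] += eps
    bumped_dn = list(observations); bumped_dn[k] -= eps
    return (rollout_value(bumped_up) - rollout_value(bumped_dn)) / (2 * eps)

obs = [1.0, 2.0, 3.0]
# Older observations decay geometrically: influence is w**(T - 1 - k).
print(round(saliency(obs, 0), 4))  # 0.25 -- oldest observation
print(round(saliency(obs, 2), 4))  # 1.0  -- most recent observation
```

In the pixel setting the same derivative is taken per pixel $u$ of $o_k$, yielding a saliency map over each past frame rather than a single scalar.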
5. Empirical Findings on Memory, Sample Efficiency, and Brittleness
Extensive experiments with both classic and Arcade versions of POPGym yield several key observations:
- Sample efficiency: In classic low-dimensional POPGym, RNNs (GRU, LSTM, Elman) provide robust sample efficiency, though at a 5–10× computational cost relative to feed-forward or convolutional models. In POPGym Arcade, both MinGRU and LRU (Linear Recurrent Unit/SSM) recurrent models are dramatically more sample-efficient in POMDP twins than MLPs; surprisingly, LRU can outperform MinGRU for online RL, in contrast to prior Decision Transformer findings (Wang et al., 3 Mar 2025).
- MDP vs. POMDP learning rates: Contrary to classical POMDP theory, recurrent models can learn faster on POMDPs than fully observable MDPs in this benchmark suite.
- Policy brittleness and out-of-distribution (OOD) sensitivity: Recurrent policies trained in both MDP and POMDP setups may over-attend to spurious or irrelevant historical frames, as indicated by memory saliency analyses—even when the present observation suffices. This can generate "hallucinated" dependencies and yield brittle, OOD-sensitive behavior under pixel or texture shift (Wang et al., 3 Mar 2025).
- Policy churn: Across all experiments, policy churn (the fraction of states where tiny weight perturbations flip the greedy action) shows negligible difference between MDP and POMDP twins.
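The churn measurement as defined above, the fraction of probe states whose greedy action flips under a tiny weight perturbation, can be sketched directly (a linear Q-function and hand-picked states stand in for a trained network; all names are illustrative).

```python
def greedy_action(weights, state):
    # Q(s, a) = weights[a] . state; return the argmax action.
    qs = [sum(w * x for w, x in zip(wa, state)) for wa in weights]
    return max(range(len(qs)), key=lambda a: qs[a])

def policy_churn(weights, perturbed, states):
    # Fraction of probe states where the greedy action flips.
    flips = sum(greedy_action(weights, s) != greedy_action(perturbed, s)
                for s in states)
    return flips / len(states)

weights = [[1.0, 0.0], [0.0, 1.0]]    # two actions over 2-D states
perturbed = [[1.0, 0.0], [0.0, 1.01]]  # tiny perturbation of action 1
states = [(1.0, 0.9), (1.0, 1.1), (0.2, 0.5), (1.0, 0.995)]
print(policy_churn(weights, perturbed, states))  # 0.25
```

States near the decision boundary (here, the last probe state) are exactly the ones that flip, which is why churn is sensitive to how sharply the Q-values separate actions rather than to observability per se.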
6. Implications for Sim-to-Real, Imitation, and Offline RL
The design and analytical features of POPGym inform several downstream research domains:
- Sim-to-real transfer: POPGym Arcade's pixel-based POMDPs naturally introduce noise, aliasing, and occlusions analogous to real-world robotics and vision sensors. Memory saliency provides actionable diagnostics pinpointing which features, if shifted in deployment, may precipitate behavioral failure.
- Imitation learning: Saliency maps enable alignment between teacher and student networks—trajectories generated under differing observability (via mid-episode toggling) yield data suited to counterfactual and inverse RL, facilitating GAIL-style reasoning.
- Offline RL benchmarking: MinGRU’s performance in offline Decision Transformer settings does not generalize to online RL within POPGym Arcade. Benchmarking protocols should thus consider both static and dynamic POMDPs as well as paired MDP/POMDP twins to capture algorithmic biases. The PQN algorithm, being memory- and throughput-efficient, is a promising candidate for large-batch, bufferless offline fine-tuning (Wang et al., 3 Mar 2025).
7. Benchmarking Paradigm and Influence
POPGym and POPGym Arcade define a new standard for memory evaluation in RL benchmarks:
- Diversity and difficulty: The suite contains both low-dimensional environments (allowing rapid multi-seed experimentation and large-scale ablation) and high-dimensional, pixel-based variants (for vision, credit assignment, and robustness study).
- Standard baselines: Direct comparison among 13 memory models exposes transfer limitations—architectures excelling in supervised sequence modeling (e.g., S4D, LMU, FWP) often perform poorly versus classical RNNs in RL. The importance of easily-overlooked baselines (e.g., positional encoding without recurrence) is repeatedly noted.
- Plug-and-play experimentation and extensibility: Tight integration with RLlib and unified APIs accelerates empirical studies and facilitates dissemination of robust, reproducible findings.
A plausible implication is that the POPGym suite will drive methodological rigor by focusing RL memory assessment on sample efficiency, robustness to partial observability, and real-world transferability—a shift away from mere benchmark "score attainment" (Morad et al., 2023, Wang et al., 3 Mar 2025).