HiccupRL/Generative-Policy-Toy

Toy Experiments on Multi-Modal Policy Learning

Toy Experiment Visualisation

Figure 1: Comparison of distribution modelling capabilities. The three red peaks represent the underlying expert action modes (the ground-truth Gaussian Mixture Model), while the blue scatter points denote the actions generated by the learned policies at different training steps. Our SMFP approach captures all three modes without mode collapse.

To isolate the generative capacity of the actor networks from the confounding factors of environment interaction and Critic training, we provide two toy experiment scripts to empirically validate and visualise their distribution modelling capabilities:

  • verify_gaussian_policy.py: Evaluates the standard Soft Actor-Critic (SAC) Gaussian policy.
  • verify_smfp_uat.py: Evaluates the Stochastic MeanFlow Policy (SMFP).

Installation

To set up the environment using Conda, please run the following commands:

# 1. Create the conda environment from the provided yml file
conda env create -f environment.yml

# 2. Activate the newly created environment
conda activate smfp

Running the Experiments

To execute the toy experiments, run the following commands. We highly recommend trying different random seeds (--seed), especially for the Gaussian policy, and, for the SMFP agent, varying the target standard deviation of the flow-matching boundary condition (--target_std).

# Verify SAC Gaussian policy
python verify_gaussian_policy.py --seed 0 --num_steps 50000

# Verify MeanFlow policy
python verify_smfp_uat.py --seed 0 --target_std 1e-3 --num_steps 40000
python verify_smfp_uat.py --seed 0 --target_std 1e-2 --num_steps 40000
python verify_smfp_uat.py --seed 0 --target_std 1e-1 --num_steps 40000
# Without entropy
python verify_smfp_uat.py --seed 0 --target_std 0.0 --num_steps 40000

Objective and Setting

Objective: These experiments assess the structural expressivity of each policy network in fitting a complex, multimodal target distribution (a 2D Gaussian Mixture Model featuring three distinct modes).

1. Target Distribution

The ground-truth target density is defined as a 2D Gaussian Mixture Model (GMM) on the domain [-1, 1] × [-1, 1]. The target probability density function, $p_{\text{target}}(\mathbf{x})$, is formulated as:

$$ p_{\text{target}}(\mathbf{x}) = \sum_{i=1}^{3} w_i \cdot \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_i, \mathbf{\Sigma}_i) $$

The specific parameters are configured as follows:

  • Means ($\boldsymbol{\mu}_i$): $(0.0, 0.5)$, $(-0.5, -0.5)$, $(0.5, -0.5)$
  • Covariances ($\mathbf{\Sigma}_i$): Isotropic covariances with a standard deviation of $\sigma_i = 0.02$ (i.e., $\mathbf{\Sigma}_i = \sigma_i^2 \mathbf{I}$)
  • Mixture Weights ($w_i$): $1.4$, $1.2$, $1.0$ (internally normalised to sum to $1$)
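The target defined above can be sketched in a few lines of standard-library Python. This is an illustrative reconstruction from the listed parameters, not the code used in the repository scripts; the names `gmm_log_prob` and `gmm_sample` are hypothetical, and clipping to the domain [-1, 1] × [-1, 1] is omitted on the assumption that the modes sit far from the boundary.

```python
import math
import random

# Parameters taken from the list above.
MEANS = [(0.0, 0.5), (-0.5, -0.5), (0.5, -0.5)]
SIGMA = 0.02
RAW_WEIGHTS = [1.4, 1.2, 1.0]
# Normalise the mixture weights so that they sum to 1.
WEIGHTS = [w / sum(RAW_WEIGHTS) for w in RAW_WEIGHTS]


def gmm_log_prob(x, y):
    """Log-density of the isotropic 2D GMM at the point (x, y)."""
    log_terms = []
    for w, (mx, my) in zip(WEIGHTS, MEANS):
        sq_dist = (x - mx) ** 2 + (y - my) ** 2
        # A 2D isotropic Gaussian has normaliser 1 / (2 * pi * sigma^2).
        log_norm = -math.log(2.0 * math.pi * SIGMA ** 2)
        log_terms.append(math.log(w) + log_norm - sq_dist / (2.0 * SIGMA ** 2))
    # Log-sum-exp: with sigma = 0.02 the individual terms underflow
    # quickly away from their own mode if exponentiated directly.
    top = max(log_terms)
    return top + math.log(sum(math.exp(t - top) for t in log_terms))


def gmm_sample(rng):
    """Draw one sample: pick a component by weight, then add Gaussian noise."""
    mx, my = rng.choices(MEANS, weights=WEIGHTS, k=1)[0]
    return rng.gauss(mx, SIGMA), rng.gauss(my, SIGMA)
```

After normalisation the weights are roughly 0.389, 0.333 and 0.278, so the three modes are deliberately unequal in mass.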

2. Optimisation Methodology

  • Because there is no RL environment and no trained Critic to estimate the action-value function Q(s, a), the standard RL policy evaluation steps are bypassed.
  • For the SAC Gaussian Policy, we utilise the analytical GMM density to compute the exact Reverse KL Divergence as a surrogate objective:

$$ \mathcal{J}_{\text{SAC}}(\phi) = \mathbb{E}_{\mathbf{a} \sim \pi_\phi} [ \log \pi_\phi(\mathbf{a}) - \log p_{\text{target}}(\mathbf{a}) ] $$

Minimising this objective with respect to the policy parameters $\phi$ reduces the divergence between the policy and the true target distribution.
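To make the reverse KL objective concrete, the following sketch estimates it by Monte Carlo for a fixed, state-independent isotropic Gaussian policy. It is a simplified illustration, not the repository's training code: there is no neural network, no gradient step, and no tanh squashing, and the helper names (`log_p_target`, `reverse_kl_estimate`) are hypothetical.

```python
import math
import random

MEANS = [(0.0, 0.5), (-0.5, -0.5), (0.5, -0.5)]
SIGMA = 0.02
WEIGHTS = [w / 3.6 for w in (1.4, 1.2, 1.0)]  # normalised mixture weights


def log_isotropic_gaussian(x, y, mean, std):
    """Log-density of a 2D Gaussian with covariance std^2 * I."""
    sq_dist = (x - mean[0]) ** 2 + (y - mean[1]) ** 2
    return -math.log(2.0 * math.pi * std ** 2) - sq_dist / (2.0 * std ** 2)


def log_p_target(x, y):
    """Log-density of the three-mode GMM target (log-sum-exp over components)."""
    terms = [math.log(w) + log_isotropic_gaussian(x, y, m, SIGMA)
             for w, m in zip(WEIGHTS, MEANS)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def reverse_kl_estimate(policy_mean, policy_std, num_samples=20000, seed=0):
    """Monte Carlo estimate of E_{a ~ pi}[log pi(a) - log p_target(a)]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        ax = rng.gauss(policy_mean[0], policy_std)
        ay = rng.gauss(policy_mean[1], policy_std)
        total += (log_isotropic_gaussian(ax, ay, policy_mean, policy_std)
                  - log_p_target(ax, ay))
    return total / num_samples


# A policy collapsed onto the first mode versus one spread over the whole domain.
collapsed = reverse_kl_estimate(MEANS[0], SIGMA)
broad = reverse_kl_estimate((0.0, 0.0), 0.5)
```

A Gaussian that sits exactly on the first mode scores log(3.6 / 1.4), about 0.94, whereas the broad Gaussian is penalised heavily for placing mass between the narrow peaks. This mode-seeking behaviour of the reverse KL is why a unimodal policy tends to settle on a single mode here.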

  • For the Stochastic MeanFlow Policy (SMFP), we optimise the flow-matching loss (typically MSE or Huber loss against the target vector field) to align the generated action trajectories with the empirical samples drawn from the target distribution.
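As a rough illustration of what regressing against a target vector field means, the sketch below builds the regression target of plain conditional flow matching on a straight noise-to-data path, together with MSE and Huber losses. It is not the MeanFlow average-velocity objective and does not model the --target_std boundary condition; the path convention (noise at t = 0, data at t = 1) and all function names are assumptions made for this example.

```python
def flow_matching_pair(noise, data, t):
    """Return the point x_t on the straight path and its constant velocity.

    Assumed convention: x_t = (1 - t) * noise + t * data, so the target
    velocity along the path is data - noise.
    """
    x_t = tuple((1.0 - t) * n + t * d for n, d in zip(noise, data))
    velocity = tuple(d - n for n, d in zip(noise, data))
    return x_t, velocity


def mse_loss(pred, target):
    """Mean squared error over the action dimensions."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


def huber_loss(pred, target, delta=1.0):
    """Huber loss: quadratic for small residuals, linear for large ones."""
    total = 0.0
    for p, q in zip(pred, target):
        r = abs(p - q)
        total += 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)
    return total / len(pred)


# One training pair: a noise sample, a target action, and a time in [0, 1].
x_t, v_target = flow_matching_pair(noise=(0.0, 0.0), data=(1.0, -1.0), t=0.5)
```

In a training loop, a network would predict a velocity from (x_t, t) and one of the two losses would be applied between that prediction and `v_target`, with data samples drawn from the GMM.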

3. Visualisation

Both scripts periodically generate plots within the vis_toy/ directory. The visual outputs comprise:

  • A red contour map illustrating the ground-truth GMM density.
  • Blue scatter points representing the actions generated by the currently learned policy.
  • Side-by-side comparative plots to qualitatively track convergence and fitting fidelity.
