GitHub - HiccupRL/Generative-Policy-Toy

Toy Experiments on Multi-Modal Policy Learning

Figure 1: Comparison of distribution modelling capabilities. The three red peaks represent the underlying expert action modes (ground-truth Gaussian Mixture Model), while the blue scatter points denote the actions generated by the learned policies at different training steps. Our SMFP approach accurately captures all three modes without mode collapse.

To isolate the generative capacity of the actor networks from the confounding factors of environment interaction and Critic training, we provide two toy experiment scripts to empirically validate and visualise their distribution modelling capabilities:

verify_gaussian_policy.py: Evaluates the standard Soft Actor-Critic (SAC) Gaussian policy.
verify_smfp_uat.py: Evaluates the Stochastic MeanFlow Policy (SMFP).

Installation

To set up the environment using Conda, please run the following commands:

# 1. Create the conda environment from the provided yml file
conda env create -f environment.yml

# 2. Activate the newly created environment
conda activate smfp

Running the Experiments

To execute the toy experiments, run the following commands. We highly recommend trying different random seeds (--seed), especially for the Gaussian policy, and, for the SMFP agent, adjusting the target standard deviation of the flow-matching boundary condition (--target_std).

# Verify SAC Gaussian policy
python verify_gaussian_policy.py --seed 0 --num_steps 50000

# Verify MeanFlow policy
python verify_smfp_uat.py --seed 0 --target_std 1e-3 --num_steps 40000
python verify_smfp_uat.py --seed 0 --target_std 1e-2 --num_steps 40000
python verify_smfp_uat.py --seed 0 --target_std 1e-1 --num_steps 40000
# Without entropy
python verify_smfp_uat.py --seed 0 --target_std 0.0 --num_steps 40000
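
The seed and --target_std sweep recommended above can also be scripted. The following is a minimal sketch, not part of the repository: build_commands and run_all are hypothetical helper names, and only the script names and flags shown above are used.

```python
import subprocess
import sys


def build_commands(seeds, target_stds, gaussian_steps=50000, smfp_steps=40000):
    """Return one argv list per run, using only the flags shown above."""
    commands = []
    for seed in seeds:
        # One Gaussian-policy run per seed.
        commands.append([sys.executable, "verify_gaussian_policy.py",
                         "--seed", str(seed),
                         "--num_steps", str(gaussian_steps)])
        # One SMFP run per (seed, target_std) pair.
        for std in target_stds:
            commands.append([sys.executable, "verify_smfp_uat.py",
                             "--seed", str(seed),
                             "--target_std", str(std),
                             "--num_steps", str(smfp_steps)])
    return commands


def run_all(commands):
    """Run the commands one after another, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)


commands = build_commands(seeds=[0, 1, 2], target_stds=[1e-3, 1e-2, 1e-1, 0.0])
```

Calling run_all(commands) from the repository root would launch the runs sequentially; it is left uncalled here so that the snippet only builds the list.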

Objective and Setting

Objective: These experiments assess the structural expressivity of each policy network in fitting a complex, multimodal target distribution (a 2D Gaussian Mixture Model featuring three distinct modes).

1. Target Distribution

The ground-truth target density is defined as a 2D Gaussian Mixture Model (GMM) on the domain [-1, 1] × [-1, 1]. The target probability density function, $p_{\text{target}}(\mathbf{x})$, is formulated as:

$$ p_{\text{target}}(\mathbf{x}) = \sum_{i=1}^{3} w_i \cdot \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_i, \mathbf{\Sigma}_i) $$

The specific parameters are configured as follows:

Means ($\boldsymbol{\mu}_i$): $(0.0, 0.5)$, $(-0.5, -0.5)$, $(0.5, -0.5)$
Covariances ($\mathbf{\Sigma}_i$): Isotropic covariances with a standard deviation of $\sigma_i = 0.02$ (i.e., $\mathbf{\Sigma}_i = \sigma_i^2 \mathbf{I}$)
Mixture Weights ($w_i$): $1.4$, $1.2$, $1.0$ (internally normalised to sum to $1$)
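
The parameters above are enough to reconstruct the target density. The following is a minimal standard-library sketch, not the repository's implementation; the names MEANS, SIGMA, WEIGHTS, log_p_target and sample_target are illustrative assumptions.

```python
import math
import random

# Parameters as listed above; the variable names are illustrative.
MEANS = [(0.0, 0.5), (-0.5, -0.5), (0.5, -0.5)]
SIGMA = 0.02
RAW_WEIGHTS = [1.4, 1.2, 1.0]
WEIGHTS = [w / sum(RAW_WEIGHTS) for w in RAW_WEIGHTS]  # normalised to sum to 1


def log_component(x, mean, sigma=SIGMA):
    """Log-density of an isotropic 2D Gaussian N(x | mean, sigma^2 I)."""
    sq_dist = (x[0] - mean[0]) ** 2 + (x[1] - mean[1]) ** 2
    return -sq_dist / (2.0 * sigma**2) - math.log(2.0 * math.pi * sigma**2)


def log_p_target(x):
    """Log of the mixture density, computed with log-sum-exp for stability."""
    terms = [math.log(w) + log_component(x, m) for w, m in zip(WEIGHTS, MEANS)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def sample_target(rng):
    """Draw one sample: pick a component by weight, then add Gaussian noise."""
    mean = rng.choices(MEANS, weights=WEIGHTS, k=1)[0]
    return (rng.gauss(mean[0], SIGMA), rng.gauss(mean[1], SIGMA))
```

Working in log space matters here because the modes are narrow: with a standard deviation of 0.02, the density a short distance from any mean underflows to zero if it is evaluated directly.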

2. Optimisation Methodology

Because there is no RL environment and no trained Critic to estimate the action-value function $Q(s, a)$, the standard RL policy evaluation step is bypassed.
For the SAC Gaussian Policy, we instead use the analytical GMM density to compute the exact Reverse KL Divergence as a surrogate objective:

$$ \mathcal{J}_{\text{SAC}}(\phi) = \mathbb{E}_{\mathbf{a} \sim \pi_\phi} [ \log \pi_\phi(\mathbf{a}) - \log p_{\text{target}}(\mathbf{a}) ] $$

Minimising this objective with respect to the policy parameters $\phi$ directly reduces the divergence between the policy and the true target distribution.
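
To make the surrogate objective concrete, the sketch below estimates it by Monte Carlo for a fixed isotropic Gaussian policy. It is an illustration under assumptions, not the code in verify_gaussian_policy.py: the policy has hand-set parameters instead of a trained network, and reverse_kl, log_gauss and the sample count are hypothetical choices.

```python
import math
import random

MEANS = [(0.0, 0.5), (-0.5, -0.5), (0.5, -0.5)]
SIGMA = 0.02
RAW_WEIGHTS = [1.4, 1.2, 1.0]
WEIGHTS = [w / sum(RAW_WEIGHTS) for w in RAW_WEIGHTS]


def log_gauss(x, mean, std):
    """Log-density of an isotropic 2D Gaussian with standard deviation std."""
    sq_dist = (x[0] - mean[0]) ** 2 + (x[1] - mean[1]) ** 2
    return -sq_dist / (2.0 * std**2) - math.log(2.0 * math.pi * std**2)


def log_p_target(x):
    """Log-density of the three-mode GMM target (log-sum-exp form)."""
    terms = [math.log(w) + log_gauss(x, m, SIGMA) for w, m in zip(WEIGHTS, MEANS)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def reverse_kl(policy_mean, policy_std, num_samples=5000, seed=0):
    """Monte Carlo estimate of E_{a ~ pi}[log pi(a) - log p_target(a)]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        # Sample an action from the (hand-set) Gaussian policy.
        a = (rng.gauss(policy_mean[0], policy_std),
             rng.gauss(policy_mean[1], policy_std))
        total += log_gauss(a, policy_mean, policy_std) - log_p_target(a)
    return total / num_samples


# A policy sitting exactly on the first mode vs. one spread over all three.
kl_on_mode = reverse_kl((0.0, 0.5), 0.02)
kl_broad = reverse_kl((0.0, 0.0), 0.5)
```

A Gaussian placed exactly on the first mode with the matching standard deviation yields roughly $-\log w_1 \approx 0.94$ (using the normalised weight), whereas a broad Gaussian spanning all three modes is penalised heavily for placing mass in low-density regions. This is the mode-seeking behaviour of the reverse KL that makes a unimodal policy settle on a single mode.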

For the Stochastic MeanFlow Policy (SMFP), we optimise the flow-matching loss (typically MSE or Huber loss against the target vector field) to align the generated action trajectories with the empirical samples drawn from the target distribution.
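
As a rough illustration of what a flow-matching regression target looks like, the sketch below builds the straight-line interpolant between a noise sample and a target sample and scores a prediction with MSE and Huber losses. It is plain conditional flow matching under an assumed convention (noise at t = 0, data at t = 1), not the MeanFlow objective used by verify_smfp_uat.py, and it does not model the --target_std setting; all function names are hypothetical.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t = 0) to data x1 (t = 1)."""
    return tuple((1.0 - t) * a + t * b for a, b in zip(x0, x1))


def target_velocity(x0, x1):
    """Velocity of the straight path, constant in t: x1 - x0."""
    return tuple(b - a for a, b in zip(x0, x1))


def mse(pred, target):
    """Mean squared error over the vector components."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


def huber(pred, target, delta=1.0):
    """Huber loss: quadratic for small residuals, linear beyond delta."""
    total = 0.0
    for p, q in zip(pred, target):
        r = abs(p - q)
        total += 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)
    return total / len(pred)


rng = random.Random(0)
x0 = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))  # noise sample
x1 = (0.0, 0.5)                                  # stand-in for a target sample
t = rng.random()                                 # time drawn uniformly from [0, 1)
x_t = interpolate(x0, x1, t)
v_star = target_velocity(x0, x1)

# A network v_theta(x_t, t) would be regressed onto v_star;
# a zero prediction stands in for the network output here.
loss_mse = mse((0.0, 0.0), v_star)
loss_huber = huber((0.0, 0.0), v_star)
```

The Huber variant grows only linearly for large residuals, which is one common reason to prefer it over MSE when the regression targets have heavy tails.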

3. Visualisation

Both scripts periodically generate plots within the vis_toy/ directory. The visual outputs comprise:

A red contour map illustrating the ground-truth GMM density.
Blue scatter points representing the actions generated by the currently learned policy.
Side-by-side comparative plots to qualitatively track convergence and fitting fidelity.
