Figure 1: Comparison of distribution modelling capabilities. The three red peaks represent the underlying expert action modes (ground-truth Gaussian Mixture Model), while the blue scatter points denote the actions generated by the learned policies at different training steps. Our SMFP approach captures all three modes without mode collapse.
To isolate the generative capacity of the actor networks from the confounding factors of environment interaction and Critic training, we provide two toy experiment scripts to empirically validate and visualise their distribution modelling capabilities:
- verify_gaussian_policy.py: Evaluates the standard Soft Actor-Critic (SAC) Gaussian policy.
- verify_smfp_uat.py: Evaluates the Stochastic MeanFlow (SMFP) policy.
To set up the environment using Conda, please run the following commands:
# 1. Create the conda environment from the provided yml file
conda env create -f environment.yml
# 2. Activate the newly created environment
conda activate smfp

To execute the toy experiments, run the following commands. We recommend trying several random seeds (--seed), especially for the Gaussian policy, and, for the SMFP agent, varying the target standard deviation of the flow-matching boundary condition (--target_std).
# Verify SAC Gaussian policy
python verify_gaussian_policy.py --seed 0 --num_steps 50000
# Verify MeanFlow policy
python verify_smfp_uat.py --seed 0 --target_std 1e-3 --num_steps 40000
python verify_smfp_uat.py --seed 0 --target_std 1e-2 --num_steps 40000
python verify_smfp_uat.py --seed 0 --target_std 1e-1 --num_steps 40000
# Without entropy
python verify_smfp_uat.py --seed 0 --target_std 0.0 --num_steps 40000

Objective: These experiments assess the structural expressivity of each policy network in fitting a complex, multimodal target distribution (a 2D Gaussian Mixture Model featuring three distinct modes).
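The ground-truth target density is a 2D Gaussian Mixture Model (GMM) on the domain $[-1, 1] \times [-1, 1]$. The target probability density function is

$$p(\mathbf{a}) = \sum_{i=1}^{3} w_i \, \mathcal{N}(\mathbf{a};\, \boldsymbol{\mu}_i, \mathbf{\Sigma}_i),$$

where the mixture weights $w_i$ are normalised to sum to $1$. The specific parameters are configured as follows: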
- Means ($\boldsymbol{\mu}_i$): $(0.0, 0.5)$, $(-0.5, -0.5)$, $(0.5, -0.5)$
- Covariances ($\mathbf{\Sigma}_i$): Isotropic covariances with a standard deviation of $\sigma_i = 0.02$ (i.e., $\mathbf{\Sigma}_i = \sigma_i^2 \mathbf{I}$)
- Mixture Weights ($w_i$): $1.4$, $1.2$, $1.0$ (internally normalised to sum to $1$)
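
The target density described above can be written down in a few lines of NumPy. The following is a minimal sketch, not the code used in the scripts: the names `MEANS`, `SIGMA`, `WEIGHTS`, `gmm_log_prob`, and `gmm_sample` are hypothetical, and the sketch assumes samples are not clipped to the domain (with $\sigma_i = 0.02$ essentially all mass lies well inside $[-1, 1] \times [-1, 1]$).

```python
import numpy as np

# Parameters taken from the list above; variable names are illustrative only.
MEANS = np.array([[0.0, 0.5], [-0.5, -0.5], [0.5, -0.5]])
SIGMA = 0.02
WEIGHTS = np.array([1.4, 1.2, 1.0])
WEIGHTS = WEIGHTS / WEIGHTS.sum()  # normalise so the weights sum to 1


def gmm_log_prob(a):
    """Log-density of the isotropic 2D GMM at points `a` of shape (N, 2)."""
    a = np.atleast_2d(a)
    diff = a[:, None, :] - MEANS[None, :, :]           # (N, 3, 2)
    sq_dist = np.sum(diff ** 2, axis=-1)               # (N, 3)
    # Log-density of each isotropic 2D Gaussian component.
    log_comp = -sq_dist / (2 * SIGMA ** 2) - np.log(2 * np.pi * SIGMA ** 2)
    z = log_comp + np.log(WEIGHTS)
    # Numerically stable log-sum-exp over the three components.
    m = z.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(z - m).sum(axis=1, keepdims=True)))[:, 0]


def gmm_sample(n, rng):
    """Draw `n` samples: pick a component, then add isotropic Gaussian noise."""
    comp = rng.choice(len(WEIGHTS), size=n, p=WEIGHTS)
    return MEANS[comp] + SIGMA * rng.standard_normal((n, 2))


rng = np.random.default_rng(0)
samples = gmm_sample(20000, rng)
print(samples.shape, gmm_log_prob(MEANS))
```

Working in log space matters here: with such a small standard deviation, the contribution of a distant component underflows to zero, so a direct sum of densities followed by a logarithm would be fragile away from the modes.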
- Because there is no RL environment and no trained Critic to estimate the action-value function $Q(s, a)$, the standard RL policy evaluation steps are bypassed.
- For the SAC Gaussian policy, we use the analytical GMM density $p(\mathbf{a})$ to compute the exact reverse KL divergence as a surrogate objective:

  $$D_{\mathrm{KL}}(\pi \,\|\, p) = \mathbb{E}_{\mathbf{a} \sim \pi}\left[\log \pi(\mathbf{a}) - \log p(\mathbf{a})\right]$$

  This objective updates only the policy parameters.
- For the Stochastic MeanFlow Policy (SMFP), we optimise the flow-matching loss (typically an MSE or Huber loss against the target vector field) to align the generated action trajectories with the empirical samples drawn from the target distribution.
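
To make the reverse-KL surrogate concrete, the sketch below estimates $\mathbb{E}_{\mathbf{a} \sim \pi}[\log \pi(\mathbf{a}) - \log p(\mathbf{a})]$ by Monte Carlo for a fixed diagonal Gaussian and compares two candidates: one sitting on a single mode and one spread broadly over all three. It is an illustration under stated assumptions rather than the training code: it uses a plain (unsquashed) Gaussian with hand-picked parameters instead of a learned SAC policy, and `gaussian_log_prob` and `reverse_kl` are hypothetical helper names.

```python
import numpy as np

# Target GMM parameters from the section above (names are illustrative only).
MEANS = np.array([[0.0, 0.5], [-0.5, -0.5], [0.5, -0.5]])
SIGMA = 0.02
WEIGHTS = np.array([1.4, 1.2, 1.0]) / 3.6  # normalised mixture weights


def gmm_log_prob(a):
    """Log-density of the isotropic 2D GMM at points `a` of shape (N, 2)."""
    sq_dist = np.sum((a[:, None, :] - MEANS[None]) ** 2, axis=-1)
    z = -sq_dist / (2 * SIGMA ** 2) - np.log(2 * np.pi * SIGMA ** 2) + np.log(WEIGHTS)
    m = z.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(z - m).sum(axis=1, keepdims=True)))[:, 0]


def gaussian_log_prob(a, mu, std):
    """Log-density of a diagonal Gaussian with mean `mu` and scalar `std`."""
    return np.sum(-0.5 * ((a - mu) / std) ** 2 - np.log(std)
                  - 0.5 * np.log(2 * np.pi), axis=-1)


def reverse_kl(mu, std, n=20000, seed=0):
    """Monte Carlo estimate of KL(pi || p) using samples drawn from pi."""
    rng = np.random.default_rng(seed)
    a = mu + std * rng.standard_normal((n, 2))  # reparameterised samples
    return np.mean(gaussian_log_prob(a, mu, std) - gmm_log_prob(a))


# Candidate 1: a narrow Gaussian placed on the heaviest mode.
kl_single_mode = reverse_kl(MEANS[0], SIGMA)
# Candidate 2: a broad Gaussian centred between the modes.
kl_broad = reverse_kl(MEANS.mean(axis=0), 0.45)
print(kl_single_mode, kl_broad)
```

Under these assumptions the single-mode Gaussian scores a far lower reverse KL than the broad one, because the broad candidate places most of its samples where the target density is nearly zero. This mode-seeking behaviour is one plausible reason a unimodal Gaussian policy settles on a single peak of the target.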
Both scripts periodically generate plots within the vis_toy/ directory. The visual outputs comprise:
- A red contour map illustrating the ground-truth GMM density.
- Blue scatter points representing the actions generated by the currently learned policy.
- Side-by-side comparative plots to qualitatively track convergence and fitting fidelity.