This module defines the Variational Autoencoder (VAE) architecture used to learn the latent representation of impulse responses. It is a specialized VAE that encodes spectral features into a low-dimensional latent space and decodes them into parameters for a chain of biquad filters.
The core of the VAE is the latent vector $z$.
During training, we want the model to learn a continuous space. We don't just encode an input to a single point; we encode it to a distribution described by two vectors:
- $\mu$ (Mean): The center of the distribution.
- $\sigma^2$ (Variance): The spread.
- Note: The network actually predicts `logvar` ($\log(\sigma^2)$). This is for numerical stability, as it allows the network to output any real number (negative to positive), which we then exponentiate to get a strictly positive variance.
The Sampling Step: We draw $z = \mu + \sigma \cdot \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$ (the reparameterization trick). This forces the model to learn that points near each other in the latent space decode to similar outputs, keeping the space smooth and continuous.
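As a rough sketch, the sampling step can be written in JAX as follows (the function name `sample_latent` and the shapes are illustrative, not the module's actual API):

```python
import jax
import jax.numpy as jnp

def sample_latent(mu, logvar, key):
    """Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I).

    Sampling from N(mu, sigma^2) directly is not differentiable w.r.t.
    mu and logvar; moving the randomness into eps makes it so.
    """
    sigma = jnp.exp(0.5 * logvar)           # sigma = sqrt(exp(logvar))
    eps = jax.random.normal(key, mu.shape)  # noise, independent of the network
    return mu + sigma * eps

mu = jnp.zeros((1, 4))
logvar = jnp.zeros((1, 4))  # logvar = 0  ->  sigma = 1
z = sample_latent(mu, logvar, jax.random.PRNGKey(0))
```

Note that as `logvar` goes very negative, `sigma` approaches 0 and the sample collapses onto `mu`, which is exactly the deterministic behavior the KL term pushes against.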
When you move the sliders in the web app, you are bypassing the Encoder and Sampling Layer.
- You are manually providing the $z$ vector: $z = [z_0, z_1, z_2, z_3]$.
- The Decoder takes these exact numbers and generates the 32x3 filter parameters.
- Because the model learned a smooth space during training, moving these knobs smoothly morphs the filter response.
During training, we add a KL-Divergence Loss that forces the latent distribution to approximate a Standard Normal Distribution ($\mathcal{N}(0, 1)$).
- In a Bell Curve (Normal Distribution), ~95% of all data points fall within 2 Standard Deviations of the mean.
- Therefore, the "valid" learned space is mostly between $-2$ and $+2$, where the model has learned useful mappings.
- Going beyond $\pm 2$ (e.g., to 10) enters "uncharted territory" where the model may produce undefined or extreme filters because it never saw data there.
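The closed-form KL term between $\mathcal{N}(\mu, \sigma^2)$ and $\mathcal{N}(0, 1)$ can be sketched as follows (this is the standard formulation; the exact reduction used in this codebase may differ):

```python
import jax.numpy as jnp

def kl_divergence(mu, logvar):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dims.

    Zero exactly when mu = 0 and logvar = 0, i.e. when the posterior
    already equals the standard normal prior.
    """
    return -0.5 * jnp.sum(1.0 + logvar - mu**2 - jnp.exp(logvar), axis=-1)

# At the prior itself the penalty vanishes; any deviation is penalized.
kl_prior = kl_divergence(jnp.zeros((1, 2)), jnp.zeros((1, 2)))
```

Minimizing this term is what pulls the encoded distributions toward $\mathcal{N}(0, 1)$, which is why the useful region of the sliders ends up roughly within $\pm 2$.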
The encoder compresses the input magnitude spectrum (513 bins) into a latent distribution $\mathcal{N}(\mu, \sigma^2)$.
- Input: `(Batch, 513)`
- Layers:
  - 3x 1D Convolutional Layers (16, 32, 64 filters) with ReLU activation and stride 2 for downsampling.
  - Flattening.
  - 2x Dense Layers (128, 64) with ReLU activation.
  - 2x Output Dense Layers for `mu` (mean) and `logvar` (log variance) of the latent distribution.
- Latent Dim: default `2` (for easy 2D visualization/control).
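Assuming SAME padding (an assumption; kernel sizes and padding are not specified here), the stride-2 convolutions halve the 513-bin input at each stage, which shows where the flattened size feeding the dense layers comes from:

```python
import math

def same_pad_out_len(n: int, stride: int) -> int:
    """Output length of a stride-`stride` convolution with SAME padding."""
    return math.ceil(n / stride)

# 513 bins through three stride-2 conv layers (16, 32, 64 filters):
lengths = [513]
for _ in range(3):
    lengths.append(same_pad_out_len(lengths[-1], stride=2))

# After the last conv (64 channels), flattening yields lengths[-1] * 64 values.
flattened = lengths[-1] * 64
```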
The decoder maps a latent vector z to the parameters of N parametric filters.
- Input: `(Batch, Latent_Dim)`
- Layers:
  - 2x Dense Layers (64, 128) with ReLU activation.
  - Output Dense Layer sized `num_filters * 3`.
- Output Mapping: The raw outputs are reshaped to `(Batch, Num_Filters, 3)` and passed through activation functions to ensure valid audio parameters:
  - Frequency: `Sigmoid * 0.99 + 0.01` (normalized 0.01 to 1.0 of Nyquist).
  - Q (Resonance): `Softplus + 0.1` (positive, > 0.1).
  - Gain: `Tanh * 20.0` (-20 dB to +20 dB).
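The output mapping above can be sketched as a standalone function (`map_raw_params` is a hypothetical name, and the freq/Q/gain column order is an assumption):

```python
import jax
import jax.numpy as jnp

def map_raw_params(raw):
    """Map raw decoder outputs (Batch, Num_Filters, 3) to valid filter params.

    Column 0 -> frequency, column 1 -> Q, column 2 -> gain (assumed order),
    using the activations described above.
    """
    freqs = jax.nn.sigmoid(raw[..., 0]) * 0.99 + 0.01  # in (0.01, 1.0) of Nyquist
    qs = jax.nn.softplus(raw[..., 1]) + 0.1            # strictly > 0.1
    gains = jnp.tanh(raw[..., 2]) * 20.0               # in (-20 dB, +20 dB)
    return freqs, qs, gains

raw = jnp.zeros((1, 32, 3))
freqs, qs, gains = map_raw_params(raw)
```

Because every activation is bounded or floored, the decoder can never emit an unstable filter (zero or negative Q, frequency above Nyquist), no matter how extreme the latent vector is.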
```python
from aether.models.vae import ParametricVAE
import jax
import jax.numpy as jnp

model = ParametricVAE(num_filters=5, latent_dim=2)
key = jax.random.PRNGKey(0)

# Initialize
variables = model.init(key, jnp.ones((1, 513)), key)

# Forward Pass (Training)
# Returns: freqs, qs, gains, mu, logvar
input_spectrum = jnp.ones((1, 513))  # placeholder batch of magnitude spectra
outputs = model.apply(variables, input_spectrum, key)

# Decode Only (Inference)
z = jnp.array([[0.5, -0.5]])
freqs, qs, gains = model.apply(variables, z, method=model.decode)
```

It is important to distinguish between what the VAE learns and how the Controller drives it.
The VAE learns a 4-dimensional map of all possible filters. It organizes this map to maximize "disentanglement" (separating independent features), but it does not know concepts like "Space" or "Timbre".
- `z0`: Might control low-mid gain.
- `z1`: Might control high-frequency rolloff.
- `z2`: Might control notch depth.
- `z3`: Might control overall resonance.
Feature Visualization:
Below is a sweep of each dimension (holding others at 0) to show its effect on the filter bank's frequency response.

The "meaning" of these dimensions in the simulation comes from how we move through them.
- Spatial (z0, z1): In `spiral` mode, we modulate these dimensions with a fast sine/cosine LFO. This creates rapid spectral motion, which the ear perceives as "spatial movement" or "phaser-like" texture.
- Timbre (z2, z3): We modulate these dimensions with a very slow LFO. This creates gradual evolution, which the ear perceives as "timbre morphing".
If we swapped the modulation rates, z2/z3 would become the "spatial" controls!
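A minimal sketch of such a latent trajectory (the function name, rates, and depth are illustrative values, not the controller's actual settings):

```python
import jax.numpy as jnp

def latent_trajectory(t, spatial_rate_hz=4.0, timbre_rate_hz=0.1, depth=1.5):
    """Illustrative z(t): fast sine/cosine LFO on z0/z1, slow LFO on z2/z3.

    The sine/cosine pair traces a circle in the (z0, z1) plane, while
    (z2, z3) drift slowly around the same shape of path.
    """
    fast = 2.0 * jnp.pi * spatial_rate_hz * t
    slow = 2.0 * jnp.pi * timbre_rate_hz * t
    return jnp.stack([
        depth * jnp.sin(fast),  # z0: rapid spectral motion
        depth * jnp.cos(fast),  # z1: 90 degrees out of phase with z0
        depth * jnp.sin(slow),  # z2: gradual timbre morph
        depth * jnp.cos(slow),  # z3
    ], axis=-1)

t = jnp.linspace(0.0, 1.0, 100)
z = latent_trajectory(t)  # shape (100, 4), one z vector per time step
```

Keeping `depth` below 2 keeps the trajectory inside the well-trained $\pm 2$ region of the latent space discussed earlier.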
You might ask: Why doesn't the model just output the audio spectrum directly?
- Direct Prediction is Static: A standard VAE would output a static "image" of a spectrum. We couldn't easily animate it or modulate it in real-time without artifacts.
- Parametric Prediction is Dynamic: By predicting 32 Filter Knobs (Freq, Q, Gain) instead, we get a mathematically defined "recipe" for the sound.
- Smoothness: Moving `z0` smoothly changes the frequency of a filter, e.g. from 100Hz to 200Hz.
- Differentiability: We can calculate the gradient of the audio with respect to these knobs, allowing us to train the whole system end-to-end.
- Synthesizer Control: The Decoder output is literally 96 numbers ($32 \text{ filters} \times 3 \text{ knobs}$) that drive a parallel filter bank synthesizer.
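To illustrate the differentiability point, here is a toy stand-in for the decoder-plus-filter-bank chain (the response function is invented for illustration; only the knob activations match the ones described above), showing that JAX can take gradients of a spectral loss with respect to $z$:

```python
import jax
import jax.numpy as jnp

def toy_response(z, w):
    """Hypothetical smooth map from a latent z to a magnitude response at
    normalized frequencies w: a Gaussian bump of height `gain` at `freq`."""
    freq = jax.nn.sigmoid(z[0]) * 0.99 + 0.01  # same freq activation as above
    gain = jnp.tanh(z[1]) * 20.0               # same gain activation as above
    return gain * jnp.exp(-((w - freq) ** 2) / 0.01)

def loss(z, w, target):
    """Mean squared error between the response and a target spectrum."""
    return jnp.mean((toy_response(z, w) - target) ** 2)

w = jnp.linspace(0.0, 1.0, 64)
target = jnp.zeros_like(w)  # target: silence, so any gain is penalized

# Gradient of the spectral loss with respect to the latent vector:
g = jax.grad(loss)(jnp.array([0.0, 0.5]), w, target)
```

Because every step (activation, filter evaluation, loss) is a smooth function, the gradient flows all the way from the audio-domain loss back to $z$, which is what makes end-to-end training possible.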
