
Models Module (aether.models)

This module defines the Variational Autoencoder (VAE) architecture used to learn the latent representation of impulse responses.

ParametricVAE (vae.py)

A specialized VAE that encodes spectral features into a low-dimensional latent space and decodes them into parameters for a chain of biquad filters.

Architecture

(Figure: Aether VAE Architecture)

The Latent Sampling Layer

The core of the VAE is the Latent Vector $z$. This 4-dimensional vector represents the "DNA" of a filter.

How it works during Training (The "Bottleneck")

During training, we want the model to learn a continuous space. We don't just encode an input to a single point $z$; we encode it to a probability distribution defined by:

  1. $\mu$ (Mean): The center of the distribution.
  2. $\sigma^2$ (Variance): The spread.
    • Note: The network actually predicts logvar ($\log(\sigma^2)$). This is for numerical stability, as it allows the network to output any real number (negative to positive), which we then exponentiate to get a strictly positive variance.

The Sampling Step: $$z = \mu + \epsilon \cdot \sigma$$ where $\epsilon$ is standard Gaussian noise, sampled fresh on every forward pass.

This forces the model to learn that points near $\mu$ should also decode to similar filters. This is what creates the "smoothness" that allows us to morph sounds.
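This sampling step is the "reparameterization trick". As a minimal NumPy illustration (not the repo's actual implementation), note how $\sigma$ is recovered from the predicted logvar:

```python
import numpy as np

def reparameterize(mu, logvar, rng):
    """Sample z = mu + eps * sigma, recovering sigma from logvar.

    Predicting logvar leaves the network output unconstrained (any real
    number); exponentiating half of it yields a strictly positive sigma.
    """
    sigma = np.exp(0.5 * logvar)          # sigma = sqrt(exp(logvar))
    eps = rng.standard_normal(mu.shape)   # standard Gaussian noise
    return mu + eps * sigma

rng = np.random.default_rng(0)
mu = np.zeros(4)
logvar = np.full(4, -2.0)   # small variance -> samples cluster near mu
z = reparameterize(mu, logvar, rng)
```

Because the noise enters only through a deterministic transform of $\mu$ and logvar, gradients can flow back through both during training.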

How it works during Simulation (The "Knobs")

When you move the sliders in the web app, you are bypassing the Encoder and Sampling Layer.

  • You are manually providing the $z$ vector: $z = [z_0, z_1, z_2, z_3]$.
  • The Decoder takes these exact numbers and generates the 32x3 filter parameters.
  • Because the model learned a smooth space during training, moving these knobs smoothly morphs the filter response.

Why the range [-2, 2]?

During training, we add a KL-Divergence Loss that forces the latent distribution to approximate a Standard Normal Distribution ($\mu=0, \sigma=1$).

  • In a Bell Curve (Normal Distribution), ~95% of all data points fall within 2 Standard Deviations of the mean.
  • Therefore, the "valid" learned space lies mostly between $-2$ and $+2$, where the model has learned useful mappings.
  • Going beyond $\pm 2$ (e.g., to 10) enters "uncharted territory" where the model may produce undefined or extreme filters because it never saw data there.
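The KL term for a diagonal Gaussian against the standard normal has a well-known closed form, $D_{KL} = -\tfrac{1}{2}\sum\left(1 + \log\sigma^2 - \mu^2 - \sigma^2\right)$. A sketch (the repo's loss may weight or reduce it differently), including a quick check of the 2-standard-deviation figure:

```python
import numpy as np
from math import erf, sqrt

def kl_divergence(mu, logvar):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions."""
    return -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar))

# At mu=0, logvar=0 (i.e. sigma=1) the latent matches the prior: KL = 0.
print(kl_divergence(np.zeros(4), np.zeros(4)))   # -> 0.0

# Fraction of a standard normal inside [-2, 2]:
print(erf(2 / sqrt(2)))                          # -> ~0.9545
```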

Encoder

The encoder compresses the input magnitude spectrum (513 bins) into a latent Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$.

  • Input: (Batch, 513)
  • Layers:
    • 3x 1D Convolutional Layers (16, 32, 64 filters) with ReLU activation and stride 2 for downsampling.
    • Flattening.
    • 2x Dense Layers (128, 64) with ReLU activation.
    • 2x Output Dense Layers for mu (mean) and logvar (log variance) of the latent distribution.
  • Latent Dim: default 2 (for easy 2D visualization/control); the simulation described below uses 4.
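The downsampling arithmetic can be checked by hand. Assuming 'SAME' padding (Flax's default for `nn.Conv`), each stride-2 layer halves the sequence length, rounding up:

```python
import math

def conv_out_len(n, stride=2):
    """Output length of a strided convolution with 'SAME' padding."""
    return math.ceil(n / stride)

n, sizes = 513, []
for _ in range(3):          # three stride-2 conv layers
    n = conv_out_len(n)
    sizes.append(n)
print(sizes)                # -> [257, 129, 65]
# Flattening the final (65, 64) feature map gives 65 * 64 = 4160 features,
# which the dense layers then reduce to 128 -> 64 -> (mu, logvar).
```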

Decoder

The decoder maps a latent vector z to the parameters of N parametric filters.

  • Input: (Batch, Latent_Dim)
  • Layers:
    • 2x Dense Layers (64, 128) with ReLU activation.
    • Output Dense Layer sized num_filters * 3.
  • Output Mapping: The raw outputs are reshaped to (Batch, Num_Filters, 3) and passed through activation functions to ensure valid audio parameters:
    • Frequency: Sigmoid * 0.99 + 0.01 (Normalized 0.01 to 1.0 of Nyquist).
    • Q (Resonance): Softplus + 0.1 (Positive, > 0.1).
    • Gain: Tanh * 20.0 (-20dB to +20dB).
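The output mapping above can be sketched directly; this is a NumPy illustration of those three activations, not the repo's exact code:

```python
import numpy as np

def map_outputs(raw, num_filters=32):
    """Map raw decoder outputs to valid filter parameters.

    raw: (batch, num_filters * 3) unconstrained network outputs.
    Returns (freqs, qs, gains) with the ranges described above.
    """
    raw = raw.reshape(-1, num_filters, 3)
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    softplus = lambda x: np.log1p(np.exp(x))
    freqs = sigmoid(raw[..., 0]) * 0.99 + 0.01   # (0.01, 1.0) of Nyquist
    qs = softplus(raw[..., 1]) + 0.1             # strictly > 0.1
    gains = np.tanh(raw[..., 2]) * 20.0          # (-20 dB, +20 dB)
    return freqs, qs, gains
```

Each activation guarantees its parameter stays physically valid no matter what the network outputs, which keeps the downstream filter bank stable.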

Usage

```python
from aether.models.vae import ParametricVAE
import jax
import jax.numpy as jnp

model = ParametricVAE(num_filters=5, latent_dim=2)
key = jax.random.PRNGKey(0)

# Initialize
variables = model.init(key, jnp.ones((1, 513)), key)

# Forward Pass (Training)
# input_spectrum: (Batch, 513) magnitude spectrum
# Returns: freqs, qs, gains, mu, logvar
outputs = model.apply(variables, input_spectrum, key)

# Decode Only (Inference)
z = jnp.array([[0.5, -0.5]])
freqs, qs, gains = model.apply(variables, z, method=model.decode)
```

The Latent Space: Model vs. Controller

It is important to distinguish between what the VAE learns and how the Controller drives it.

1. The Model (VAE)

The VAE learns a 4-dimensional map of all possible filters. It organizes this map to maximize "disentanglement" (separating independent features), but it does not know concepts like "Space" or "Timbre".

  • z0: Might control low-mid gain.
  • z1: Might control high-frequency rolloff.
  • z2: Might control notch depth.
  • z3: Might control overall resonance.

Feature Visualization: Below is a sweep of each dimension (holding the others at 0) to show its effect on the filter bank's frequency response. (Figure: Latent Sweeps)

2. The Controller (Simulation)

The "meaning" of these dimensions in the simulation comes from how we move through them.

  • Spatial (z0, z1): In spiral mode, we modulate these dimensions with a fast sine/cosine LFO. This creates rapid spectral motion, which the ear perceives as "spatial movement" or "phaser-like" texture.

  • Timbre (z2, z3): We modulate these dimensions with a very slow LFO. This creates gradual evolution, which the ear perceives as "timbre morphing".

If we swapped the modulation rates, z2/z3 would become the "spatial" controls!
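The controller's trajectory through the latent space can be sketched as two LFO pairs. The rates and depth below are invented for illustration (the actual controller values aren't shown here):

```python
import numpy as np

FAST_HZ, SLOW_HZ = 2.0, 0.05   # hypothetical "spatial" vs "timbre" rates

def latent_trajectory(t, depth=1.5):
    """Drive a 4-D latent vector with two LFO pairs, as in "spiral" mode.

    z0/z1 spin quickly (a sine/cosine pair traces a circle in that plane);
    z2/z3 drift slowly, gradually morphing timbre.
    """
    z0 = depth * np.sin(2 * np.pi * FAST_HZ * t)
    z1 = depth * np.cos(2 * np.pi * FAST_HZ * t)
    z2 = depth * np.sin(2 * np.pi * SLOW_HZ * t)
    z3 = depth * np.cos(2 * np.pi * SLOW_HZ * t)
    return np.stack([z0, z1, z2, z3], axis=-1)

t = np.linspace(0.0, 1.0, 100)
z = latent_trajectory(t)    # (100, 4), every value within [-1.5, 1.5]
```

Keeping `depth` below 2 keeps the trajectory inside the well-trained region of the latent space discussed earlier.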

Why a "Parametric" VAE?

You might ask: Why doesn't the model just output the audio spectrum directly?

  1. Direct Prediction is Static: A standard VAE would output a static "image" of a spectrum. We couldn't easily animate it or modulate it in real-time without artifacts.
  2. Parametric Prediction is Dynamic: By predicting knob settings (Freq, Q, Gain) for 32 filters instead, we get a mathematically defined "recipe" for the sound.
    • Smoothness: Moving z0 smoothly changes the frequency of a filter from 100Hz to 200Hz.
    • Differentiability: We can calculate the gradient of the audio with respect to these knobs, allowing us to train the whole system end-to-end.
    • Synthesizer Control: The Decoder output is literally 96 numbers ($32 \text{ filters} \times 3 \text{ knobs}$) that drive a parallel filter bank synthesizer.
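The repo's filter implementation isn't shown here, but as a sketch, one standard way to turn each (Freq, Q, Gain) triple into biquad coefficients is the RBJ Audio EQ Cookbook peaking filter:

```python
import numpy as np

def peaking_biquad(freq, q, gain_db):
    """RBJ cookbook peaking-EQ biquad.

    freq is a fraction of Nyquist (as the decoder outputs it), q is the
    resonance, gain_db the boost/cut. Returns (b, a) with a[0] == 1.
    """
    A = 10.0 ** (gain_db / 40.0)
    w0 = np.pi * freq              # fraction of Nyquist -> radians
    alpha = np.sin(w0) / (2.0 * q)
    b = np.array([1 + alpha * A, -2 * np.cos(w0), 1 - alpha * A])
    a = np.array([1 + alpha / A, -2 * np.cos(w0), 1 - alpha / A])
    return b / a[0], a / a[0]

# A gain of 0 dB collapses to a flat (identity) filter: b == a.
b, a = peaking_biquad(freq=0.25, q=1.0, gain_db=0.0)
```

Because every coefficient is a smooth function of the three knobs, small moves in $z$ produce small, artifact-free moves in the filter response.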

← Previous: Data | Next: Training →