---
title: Vox2Trumpet DDSP
emoji: 🎺
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 3.50.2
app_file: app.py
pinned: false
---
```
RAW AUDIO       ENCODER (CREPE)                  INPUTS (Log)    DECODER (Brain)         PARAMETERS
[Batch, Sample] [Pre-trained]                    [Batch, Time, 2] [Batch, Time, 512]     [Batch, Time, 166]

┌───────────┐   ┌─────────────────┐              ┌──────────┐    ┌───────────────────┐   ┌───────────────────┐
│           │   │ CREPE           │───(f0)──────▶│ log_f0   │───▶│ GRU (1 Layer)     │   │ MLP (Linear)      │
│  [ 🎤 ]    │──▶│ (Pitch/Loud)    │              └──────────┘    │ Hidden: 512       │──▶│ 166 Neurons Out   │──┐
│           │   │                 │───(L)───────▶┌──────────┐    │ BatchFirst=True   │   │ (Softmax/Sigmoid) │  │
└───────────┘   └─────────────────┘              │ loudness │───▶└───────────────────┘   └───────────────────┘  │
                                                 └──────────┘                                                   │
                                                                                                                ▼
  ┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
  │
  │           SYNTHESIZERS (Body)            AUDIO OUTPUT
  │           [Physics-Informed DSP]         [Batch, Sample]
  │
  │  (Amps)   ┌───────────────────────┐
  ├──────────▶│ Harmonic Oscillator   │──┐
  │           └───────────────────────┘  │    ┌───────────┐
  │                                      ├───▶│    SUM    │──▶ [ 🎺 ]
  │  (Mags)   ┌───────────────────────┐  │    └───────────┘
  └──────────▶│ Noise Filter Bank     │──┘
              └───────────────────────┘
```
The encoder uses a pre-trained CREPE model. Below is its internal convolutional structure:
```mermaid
graph TD
    Input["Audio Input<br/>1024 samples @ 16kHz"] --> C1["Conv1: 512 filters<br/>Stride 4"]
    C1 --> P1["Max Pool 2x"]
    P1 --> C2["Conv2: 64 filters<br/>Stride 1"]
    C2 --> P2["Max Pool 2x"]
    P2 --> C3["Conv3: 64 filters<br/>Stride 1"]
    C3 --> P3["Max Pool 2x"]
    P3 --> C4["Conv4: 64 filters<br/>Stride 1"]
    C4 --> P4["Max Pool 2x"]
    P4 --> C5["Conv5: 64 filters<br/>Stride 1"]
    C5 --> P5["Max Pool 2x"]
    P5 --> C6["Conv6: 64 filters<br/>Stride 1"]
    C6 --> P6["Max Pool 2x"]
    P6 --> Flatten["Flatten"]
    Flatten --> Dense["Dense Layer<br/>360 Units"]
    Dense --> Softmax["Softmax"]
    Softmax --> Output["Pitch Estimate<br/>(360 Cent Bins)"]
    style Input fill:#f9f,stroke:#333,stroke-width:2px
    style Output fill:#f9f,stroke:#333,stroke-width:2px
```
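To make the 360-bin output concrete, each bin index can be mapped back to a frequency. A minimal sketch, assuming bins are spaced 20 cents apart and anchored at 32.7 Hz (the exact reference offset in the official CREPE implementation may differ slightly):

```python
def bin_to_hz(bin_index, f_min=32.7, cents_per_bin=20.0):
    """Convert a CREPE pitch-bin index to frequency in Hz.

    Each bin is 20 cents wide; 1200 cents = 1 octave (a doubling in frequency).
    """
    return f_min * 2.0 ** (bin_index * cents_per_bin / 1200.0)

low = bin_to_hz(0)          # 32.7 Hz, the lowest bin
octave_up = bin_to_hz(60)   # 60 bins * 20 cents = 1200 cents = one octave up
```

In practice, CREPE refines this discrete grid by taking a weighted average of the cent values around the argmax bin, which is what gives it sub-bin precision.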
Model Details:
- Decoder: Single-layer GRU (Gated Recurrent Unit) to capture temporal dependencies (slurs, vibrato, and decay).
- Hidden Size: 512 units.
- Parameters: 166 total (101 harmonic amplitudes + 65 noise band magnitudes).
- Inductive Bias: The model predicts control signals rather than raw samples, ensuring stable pitch.
This project uses a local, isolated environment to guarantee reproducibility. All dependencies are installed into the ./venv directory within the project root.
We provide a script to set up a Python 3.9 virtual environment and install exact dependencies.
```bash
# Run the setup script (creates ./venv and installs packages)
chmod +x setup_env.sh
./setup_env.sh
```

Where are the dependencies? They are downloaded and installed locally in `venv/lib/python3.9/site-packages`. They do not affect your global system Python.
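As a quick sanity check that you are running the isolated interpreter rather than the system one, the following generic check works on any Python 3.3+ (it is not part of the project scripts):

```python
import sys

def in_virtualenv():
    """True when the running interpreter belongs to a venv, not the system Python."""
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)

# Run with ./venv/bin/python to see True; the system interpreter reports False.
print(in_virtualenv())
```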
Run the verification script to check that the Vox2Trumpet model and DSP components can be instantiated correctly.
```bash
# Verify the build
./venv/bin/python verify_project.py
```

Launch the interactive Gradio interface to perform timbre transfer on your own voice or uploaded recordings.

```bash
# Start the web server
./venv/bin/python app.py
```

Once started, open the local URL (typically http://127.0.0.1:7860) in your web browser.
We use the URMP dataset (specifically trumpet recordings) for training. Use our script to automate the download and extraction of the audio files.
```bash
./venv/bin/python download_trumpet_data.py
```

Extract high-precision pitch ($f_0$) and loudness features with the preprocessing script:

```bash
./venv/bin/python preprocess.py --input_dir data/raw/urmp/trumpet_only --output_dir data/processed/urmp
```

Before training, verify the extracted features. This tool saves a diagnostic plot to `data/visualization/`.

```bash
./venv/bin/python visualize_features.py --file data/processed/urmp/sample.pt
```

We use W&B for experiment tracking. It allows you to monitor loss and listen to audio samples in real time.
- Login: `./venv/bin/python -m wandb login` (paste your API key from wandb.ai).
- Train: `./venv/bin/python train.py --data_dir data/processed/urmp --batch_size 16 --epochs 100`
For faster training on cloud GPUs (A10G/T4), you can stream your data from the Hugging Face Hub.
- Set Token: Create a Hugging Face write token and export it: `export HF_TOKEN=your_token_here`
- Upload Data: Run the upload script to create a private dataset: `./venv/bin/python scripts/upload_to_hf.py --repo_id username/vox2trumpet-data`
- Train Remote: `./venv/bin/python train.py --config_name deep --hf_repo_id username/vox2trumpet-data`
To understand the training dynamics, here is how the data is handled under the hood:
- The Dataset: We use preprocessed `.pt` files from the URMP dataset. Each file contains a full recording.
- Random Cropping: Every time the model accesses a file, it picks a random 1-second segment (16,000 samples). This ensures that the model sees different parts of the performances in every epoch, significantly increasing data diversity.
- Batches: With a batch size of 16, the model processes 16 different 1-second segments simultaneously on your GPU.
- Epochs: One epoch means the model has "visited" every one of the training files exactly once. Because of the random crop, training for 100 epochs means the model effectively hears thousands of unique snippets of trumpet playing.
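The random-cropping step can be sketched like this (a list-based illustration; the actual `Vox2TrumpetDataset` in `data.py` works on tensors loaded from `.pt` files):

```python
import random

SAMPLE_RATE = 16_000   # 1-second crops, as used during training
CROP_LEN = SAMPLE_RATE

def random_crop(waveform, crop_len=CROP_LEN, rng=random):
    """Pick a random contiguous segment; pad short recordings with silence."""
    if len(waveform) <= crop_len:
        return waveform + [0.0] * (crop_len - len(waveform))
    start = rng.randrange(len(waveform) - crop_len)
    return waveform[start:start + crop_len]

# A 10-second "recording" yields a different 1-second window on each access.
recording = [0.1] * (10 * SAMPLE_RATE)
crop = random_crop(recording)
```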
The University of Rochester Multi-Modal Music Performance (URMP) dataset is a collection of high-quality ensemble performances where each instrument was recorded individually in a professional studio setting.
- Instrument Isolation: We focus specifically on the trumpet tracks. These isolated recordings are essential for timbre transfer because they provide "clean" examples of the target instrument's spectral characteristics without bleed from other instruments.
- Multi-Modal Source: While the dataset includes video and MIDI, this project primarily utilizes the 48kHz audio (resampled to 16kHz) as the ground truth for spectral analysis.
- Professional Performance: The tracks cover various musical styles and include sophisticated techniques like vibrato and slurs, which are modeled by the GRU decoder.
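The 48 kHz to 16 kHz resampling mentioned above is an integer-factor decimation by 3. A naive windowed-sinc sketch of the idea (illustrative only; the project's preprocessing presumably uses a library resampler such as `librosa` or `torchaudio`, and `num_taps` here is an arbitrary choice):

```python
import numpy as np

def resample_48k_to_16k(x, num_taps=121):
    """Low-pass at the new Nyquist (8 kHz), then keep every 3rd sample."""
    cutoff = 1.0 / 3.0                                   # fraction of original Nyquist
    n = np.arange(num_taps) - (num_taps - 1) / 2
    h = np.sinc(cutoff * n) * cutoff * np.hamming(num_taps)  # windowed-sinc FIR
    filtered = np.convolve(x, h, mode="same")            # anti-aliasing filter
    return filtered[::3]                                 # decimate by 3
```

The low-pass step matters: decimating without it would fold energy above 8 kHz back into the audible band as aliasing.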
Training a neural synthesizer is an iterative process. Here is how to keep an eye on your model's progress:
Once training starts, W&B provides a real-time dashboard at the URL printed in your terminal.
- Loss Curves: Monitor `train_loss`. A steady decrease indicates the model is learning the spectral features.
- Audio Samples: Every 5 epochs, the model uploads an audio reconstruction. Listen to these to hear the timbre evolve from noise to trumpet. Specifically, the model extracts the pitch and loudness "DNA" from the target audio and passes it through the neural network to see how well the synthesizer can mimic the original.
- Run Management: Each execution gets a random name (e.g., `trumpet-vibe-10`). You can delete failed/interrupted runs from the W&B project settings to keep your dashboard clean.
- Progress Bar (`tqdm`): Shows instantaneous loss and processing speed (iterations per second) in your terminal.
- Checkpoints: High-fidelity model states are saved to `checkpoints/model_epoch_N.pth`.
- Graceful Exit: Hit `CTRL+C` at any time to stop training. The script will safely close the W&B connection and save the current state.
You don't need to do anything special to resume. If you stop the script and run the training command again, it will:
- Detect `checkpoints/latest.pth`.
- Automatically load the latest weights and optimizer state.
- New: It will automatically detect if you have updated `learning_rate` or `mag_loss_weight` in `config.json` and apply the new values to the resumed optimizer.
- Continue training from the exact epoch where it was interrupted.
The model follows the Control-Synthesis paradigm, separating the "Brain" (Neural Network) from the "Body" (DSP Synthesizer).
- Encoder: Extracts pitch ($f_0$) using a pre-trained CREPE model and A-weighted loudness.
  - CREPE Architecture: This project utilizes CREPE (Convolutional REpresentation for Pitch Estimation), a deep convolutional neural network.
    - Structure: It consists of 6 convolutional layers followed by a fully connected output layer.
    - Frequency Resolution: The model outputs a probability distribution over 360 pitch bins, each 20 cents wide, covering 6 octaves (from 32.7 Hz to 1975.5 Hz).
    - Input: Operates on raw 16kHz audio waveforms using a 1024-sample window.
    - Capacity: We default to the `tiny` model for efficiency, though `full` can be used for maximum precision during offline preprocessing.
- Decoder: A GRU (Gated Recurrent Unit) maps these control signals to synthesizer parameters.
  - Why GRU?: We chose a GRU over a Transformer because it is significantly more efficient for real-time audio synthesis. GRUs have a strong "inductive bias" for sequences where the next frame depends heavily on the previous one (temporal persistence).
🧠 Decoder (Brain) Internal Structure
The decoder transforms control signals into high-dimensional synthesizer parameters:
```mermaid
graph TD
    InF0["Log F0 (Pitch)"] --> Concat["Concatenate [Batch, Time, 2]"]
    InLoud["Loudness (Volume)"] --> Concat
    Concat --> GRU["GRU Layer<br/>(Hidden: 512, 1 Layer)"]
    GRU --> MLP["MLP Projection<br/>(Linear Layer)"]
    MLP --> Split["Split (166 Units)"]
    Split --> Harm["Harmonic Branch<br/>(101 Units)"]
    Split --> Noise["Noise Branch<br/>(65 Units)"]
    Harm --> Softmax["Softmax Activation<br/>(Timbre Distribution)"]
    Softmax --> MulLoud["Multiply by Input Loudness"]
    MulLoud --> OutHarm["Harmonic Amplitudes"]
    Noise --> Sigmoid["Sigmoid Activation"]
    Sigmoid --> OutNoise["Noise Magnitudes"]
    style InF0 fill:#f9f,stroke:#333,stroke-width:2px
    style InLoud fill:#f9f,stroke:#333,stroke-width:2px
    style OutHarm fill:#f9f,stroke:#333,stroke-width:2px
    style OutNoise fill:#f9f,stroke:#333,stroke-width:2px
```

- Future-Proofing: While the GRU is our "lean and mean" baseline, the modular design allows us to swap in a Transformer-based decoder if we need to model more complex, long-range musical dependencies in the future.
- Synthesizer:
  - Harmonic Synthesizer: Additive synthesis (sum of sines) for the tonal trumpet vibration.
  - Filtered Noise: Subtractive synthesis (time-varying FIR filters) for air flux and rasp.
- Loss: Multi-Resolution STFT Loss (in `loss.py`) ensures the model learns spectral details across time and frequency.
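To make the additive branch concrete, here is a minimal numpy sketch of harmonic synthesis driven by per-sample $f_0$ and per-harmonic amplitudes (a simplification of `HarmonicSynthesizer` in `synth.py`, which is differentiable and upsamples frame-rate controls to audio rate):

```python
import numpy as np

def harmonic_synth(f0, amps, sample_rate=16_000):
    """Sum of sinusoids at integer multiples of f0.

    f0:   (T,) fundamental frequency in Hz, per sample
    amps: (T, K) amplitude of each of the K harmonics, per sample
    """
    phase = 2 * np.pi * np.cumsum(f0) / sample_rate        # (T,) running phase
    k = np.arange(1, amps.shape[1] + 1)                    # harmonic numbers 1..K
    harmonics = np.sin(phase[:, None] * k[None, :])        # (T, K)
    return np.sum(amps * harmonics, axis=1)                # (T,) mixed output

# 1 second of a 440 Hz tone with 3 harmonics of decaying amplitude
T = 16_000
f0 = np.full(T, 440.0)
amps = np.tile(np.array([0.5, 0.3, 0.2]), (T, 1))
audio = harmonic_synth(f0, amps)
```

Accumulating phase via `cumsum` (rather than computing `2*pi*f0*t` directly) is what lets the oscillator glide smoothly when `f0` varies over time, e.g. during vibrato or slurs.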
- Original Paper: "Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation" (Cho et al., 2014)
- Visual Guide: "Gated Recurrent Units (GRU)" in Dive into Deep Learning, an excellent interactive textbook chapter with clear diagrams and math.
Vox2Trumpet uses a Multi-Resolution Short-Time Fourier Transform (MR-STFT) loss to train the GRU decoder. Unlike a simple time-domain MSE, this loss evaluates the model across multiple time-frequency scales to capture both sharp transients and sustained harmonic timbre.

For each resolution $i$, we compare the STFT magnitudes $|S_i(y)|$ of the target $y$ against $|S_i(\hat{y})|$ of the prediction $\hat{y}$ using two complementary terms.

Spectral Convergence measures the overall "energy shape" of the spectrogram. It is defined as the Frobenius norm of the difference between the target and predicted magnitudes, normalized by the target's norm:

$$L_{sc}^{(i)} = \frac{\left\| \, |S_i(y)| - |S_i(\hat{y})| \, \right\|_F}{\left\| \, |S_i(y)| \, \right\|_F}$$

To better match human auditory perception (which is logarithmic), we also calculate a Log-Magnitude loss (the mean L1 distance over the $N$ spectrogram bins), weighted by `mag_loss_weight` (set in `config.json`) to prioritize tonal warmth:

$$L_{mag}^{(i)} = \frac{1}{N} \left\| \log|S_i(y)| - \log|S_i(\hat{y})| \right\|_1$$

To prevent the model from over-fitting to a single window size, we average the loss across a bank of $M$ resolutions (e.g., FFT sizes 512, 1024, 2048, and 4096):

$$L_{\text{MR-STFT}} = \frac{1}{M} \sum_{i=1}^{M} \left( L_{sc}^{(i)} + \lambda \, L_{mag}^{(i)} \right)$$

where $\lambda$ is `mag_loss_weight`. Using a high resolution like 4096 is crucial for capturing the distinct, tight harmonics of low notes, while the 512 resolution ensures the attack is preserved.
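A minimal numpy sketch of the MR-STFT loss (illustrative only; the differentiable version in `loss.py` presumably uses torch ops, and the hop size of one quarter of the window is an assumption):

```python
import numpy as np

def stft_mag(x, n_fft, hop, eps=1e-7):
    """Magnitude spectrogram via a Hann-windowed, framed real FFT."""
    win = np.hanning(n_fft)
    frames = np.stack([x[i:i + n_fft] * win
                       for i in range(0, len(x) - n_fft + 1, hop)])
    return np.abs(np.fft.rfft(frames, axis=-1)) + eps   # eps keeps log() finite

def mr_stft_loss(y, y_hat, fft_sizes=(512, 1024, 2048, 4096), mag_loss_weight=1.0):
    """Average spectral-convergence + weighted log-magnitude loss over resolutions."""
    total = 0.0
    for n_fft in fft_sizes:
        hop = n_fft // 4
        S, S_hat = stft_mag(y, n_fft, hop), stft_mag(y_hat, n_fft, hop)
        sc = np.linalg.norm(S - S_hat) / np.linalg.norm(S)   # spectral convergence
        mag = np.mean(np.abs(np.log(S) - np.log(S_hat)))     # log-magnitude (L1)
        total += sc + mag_loss_weight * mag
    return total / len(fft_sizes)
```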
- `model.py`: Main `Vox2Trumpet` `nn.Module`.
- `synth.py`: Differentiable DSP modules (`HarmonicSynthesizer`, `FilteredNoiseSynthesizer`).
- `loss.py`: Perceptual loss functions.
- `preprocess.py`: Data pipeline for extracting $(f_0, \text{Loudness})$ features.
- `visualize_features.py`: Diagnostic tool for feature inspection.
- `train.py`: Training loop with W&B integration and checkpoint resume.
- `data.py`: `Vox2TrumpetDataset` with random cropping.
- `download_trumpet_data.py`: Automated trumpet data downloader.
- `setup_env.sh`: Environment reproducibility script.
- `test_e2e.py`: Deterministic regression test suite.
- `config.json`: Centralized configuration for model architectures and hyperparameters.
- `scripts/upload_to_hf.py`: Utility for migrating training data to the cloud.
- `scripts/check_dataset_nans.py`: Integrity verification tool for processed features.