---
title: Vox2Trumpet DDSP
emoji: 🎺
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 3.50.2
app_file: app.py
pinned: false
---

Vox2Trumpet: DDSP Timbre Transfer

Deep learning to transform voice or humming recordings into trumpet sounds.

🧠 System Architecture

      RAW AUDIO             ENCODER (CREPE)               INPUTS (Log)           DECODER (Brain)                PARAMETERS
    [Batch, Sample]        [Pre-trained]                [Batch, Time, 2]      [Batch, Time, 512]           [Batch, Time, 166]

    ┌───────────┐         ┌─────────────────┐          ┌──────────┐          ┌───────────────────┐        ┌───────────────────┐
    │           │         │      CREPE      │──(f0)──▶ │  log_f0  │──┐       │   GRU (1 Layer)   │        │   MLP (Linear)    │
    │ [ 🎤 ]    │────▶────┤  (Pitch/Loud)   │          └──────────┘  │       │   Hidden: 512     │        │  166 Neurons Out  │
    │           │         │                 │──(L)───▶ ┌──────────┐  ├───▶───┤                   ├───▶───┤                   │──┐
    └───────────┘         └─────────────────┘          │ loudness │──┘       │   BatchFirst=True │        │  (Softmax/Sigmoid)│  │
                                                       └──────────┘          └───────────────────┘        └───────────────────┘  │
                                                                                                                                 │
                                                                                                                                 ▼
          ┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
          │
          │             SYNTHESIZERS (Body)               AUDIO OUTPUT
          │           [Physics-Informed DSP]             [Batch, Sample]
          │
          │     (Amps)  ┌───────────────────────┐
          ├──────────▶──│ Harmonic Oscillator   │──┐
          │             └───────────────────────┘  │      ┌───────────┐
          │                                        ├──▶───│   SUM     │──▶ [ 🎺 ]
          │     (Mags)  ┌───────────────────────┐  │      └───────────┘
          └──────────▶──│ Noise Filter Bank     │──┘
                        └───────────────────────┘

🎙️ CREPE Deepnet Architecture

The encoder uses a pre-trained CREPE model. Below is its internal convolutional structure:

graph TD
    Input["Audio Input<br/>1024 samples @ 16kHz"] --> C1["Conv1: 512 filters<br/>Stride 4"]
    C1 --> P1["Max Pool 2x"]
    P1 --> C2["Conv2: 64 filters<br/>Stride 1"]
    C2 --> P2["Max Pool 2x"]
    P2 --> C3["Conv3: 64 filters<br/>Stride 1"]
    C3 --> P3["Max Pool 2x"]
    P3 --> C4["Conv4: 64 filters<br/>Stride 1"]
    C4 --> P4["Max Pool 2x"]
    P4 --> C5["Conv5: 64 filters<br/>Stride 1"]
    C5 --> P5["Max Pool 2x"]
    P5 --> C6["Conv6: 64 filters<br/>Stride 1"]
    C6 --> P6["Max Pool 2x"]
    P6 --> Flatten["Flatten"]
    Flatten --> Dense["Dense Layer<br/>360 Units"]
    Dense --> Softmax["Softmax"]
    Softmax --> Output["Pitch Estimate<br/>(360 Cent Bins)"]

    style Input fill:#f9f,stroke:#333,stroke-width:2px
    style Output fill:#f9f,stroke:#333,stroke-width:2px

Model Details:

  • Decoder: Single-layer GRU (Gated Recurrent Unit) to capture temporal dependencies (slurs, vibrato, and decay); a minimal sketch follows this list.
  • Hidden Size: 512 units.
  • Parameters: 166 total (101 harmonic amplitudes + 65 noise band magnitudes).
  • Inductive Bias: The model predicts control signals rather than raw samples, ensuring stable pitch.
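
To make the decoder concrete, here is a minimal PyTorch sketch of the structure described above (the GRU, the linear projection to 166 parameters, and the softmax/sigmoid split). The class and argument names are illustrative, not the project's actual model.py API.

```python
import torch
import torch.nn as nn

class DecoderSketch(nn.Module):
    """Illustrative decoder: (log_f0, loudness) -> (harmonic amps, noise mags)."""

    def __init__(self, hidden_size=512, n_harmonics=101, n_noise_bands=65):
        super().__init__()
        self.gru = nn.GRU(input_size=2, hidden_size=hidden_size,
                          num_layers=1, batch_first=True)
        self.proj = nn.Linear(hidden_size, n_harmonics + n_noise_bands)
        self.n_harmonics = n_harmonics

    def forward(self, log_f0, loudness):
        # log_f0, loudness: [batch, time, 1] control signals from the encoder
        x = torch.cat([log_f0, loudness], dim=-1)          # [batch, time, 2]
        hidden, _ = self.gru(x)                            # [batch, time, 512]
        params = self.proj(hidden)                         # [batch, time, 166]
        harm, noise = params.split(
            [self.n_harmonics, params.shape[-1] - self.n_harmonics], dim=-1)
        # Softmax gives a timbre distribution; scaling by loudness follows the
        # diagram above (the real model may normalise loudness differently).
        harm_amps = torch.softmax(harm, dim=-1) * loudness
        noise_mags = torch.sigmoid(noise)
        return harm_amps, noise_mags
```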

🚀 Quick Start

This project uses a local, isolated environment to guarantee reproducibility. All dependencies are installed into the ./venv directory within the project root.

1. Setup Environment

We provide a script to set up a Python 3.9 virtual environment and install exact dependencies.

# Run the setup script (creates ./venv and installs packages)
chmod +x setup_env.sh
./setup_env.sh

Where are the dependencies? They are downloaded and installed locally in venv/lib/python3.9/site-packages. They do not affect your global system Python.

2. Verify Installation

Run the verification script to check that the Vox2Trumpet model and DSP components can be instantiated correctly.

# Verify the build
./venv/bin/python verify_project.py

3. Running the GUI Locally

Launch the interactive Gradio interface to perform timbre transfer on your own voice or uploaded recordings.

# Start the web server
./venv/bin/python app.py

Once started, open the local URL (typically http://127.0.0.1:7860) in your web browser.
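
For orientation, a minimal Gradio wrapper might look like the sketch below (assuming Gradio 3.x as pinned in the front matter). The function name is made up, and the real app.py runs the model where the placeholder comment sits.

```python
import gradio as gr

def timbre_transfer(audio):
    sr, wav = audio                 # gr.Audio(type="numpy") yields (sample_rate, samples)
    # ... extract f0/loudness, run the decoder, synthesize trumpet audio here ...
    return sr, wav                  # placeholder: echoes the input unchanged

demo = gr.Interface(
    fn=timbre_transfer,
    inputs=gr.Audio(source="microphone", type="numpy", label="Voice or humming"),
    outputs=gr.Audio(type="numpy", label="Trumpet"),
    title="Vox2Trumpet",
)
demo.launch()   # serves on http://127.0.0.1:7860 by default
```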

🛠️ Data & Training Pipeline

1. Download Dataset (URMP)

We use the URMP dataset (specifically trumpet recordings) for training. Use our script to automate the download and extraction of the audio files.

./venv/bin/python download_trumpet_data.py

2. Preprocessing

Extract high-precision pitch ($f_0$) via CREPE, along with A-weighted loudness. Our script includes resume capability: if interrupted, it skips files that have already been processed.

./venv/bin/python preprocess.py --input_dir data/raw/urmp/trumpet_only --output_dir data/processed/urmp
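
As a rough sketch of what this step computes, the snippet below estimates f0 with the torchcrepe package and an A-weighted loudness curve with librosa. The hop size, example path, and helper names are assumptions; preprocess.py may differ in detail.

```python
import librosa
import numpy as np
import torch
import torchcrepe

SR, HOP, N_FFT = 16000, 64, 2048   # assumed frame hop of 4 ms at 16 kHz

def extract_f0(audio: torch.Tensor) -> torch.Tensor:
    """CREPE pitch track for a (1, samples) float tensor in [-1, 1]."""
    return torchcrepe.predict(audio, SR, hop_length=HOP,
                              fmin=50.0, fmax=2000.0,
                              model="tiny", device="cpu")

def extract_loudness(audio: np.ndarray) -> np.ndarray:
    """Per-frame A-weighted log loudness."""
    mag_db = librosa.amplitude_to_db(np.abs(librosa.stft(audio, n_fft=N_FFT, hop_length=HOP)))
    freqs = librosa.fft_frequencies(sr=SR, n_fft=N_FFT)
    weighted = mag_db + librosa.A_weighting(freqs)[:, None]   # apply A-weighting per bin
    power = librosa.db_to_amplitude(weighted) ** 2
    return np.log(power.mean(axis=0) + 1e-7)

# Example usage on a single (hypothetical) file path:
wav, _ = librosa.load("data/raw/urmp/trumpet_only/example.wav", sr=SR, mono=True)
f0 = extract_f0(torch.from_numpy(wav)[None, :])
loudness = extract_loudness(wav)
```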

3. Quality Control (Visualization)

Before training, verify the extracted features. This tool saves a diagnostic plot to data/visualization/.

./venv/bin/python visualize_features.py --file data/processed/urmp/sample.pt
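
A diagnostic plot of this kind can be produced with a few lines of matplotlib. The snippet below is a hedged sketch assuming the .pt file stores 'f0' and 'loudness' tensors; visualize_features.py may use different keys and styling.

```python
import os
import matplotlib.pyplot as plt
import torch

item = torch.load("data/processed/urmp/sample.pt")   # assumed keys: 'f0', 'loudness'
os.makedirs("data/visualization", exist_ok=True)

fig, (ax_f0, ax_loud) = plt.subplots(2, 1, sharex=True)
ax_f0.plot(item["f0"].squeeze().numpy())
ax_f0.set_ylabel("f0 (Hz)")
ax_loud.plot(item["loudness"].squeeze().numpy())
ax_loud.set_ylabel("Loudness")
ax_loud.set_xlabel("Frame")
fig.savefig("data/visualization/sample_features.png")
```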

4. Training with Weights & Biases (W&B)

We use W&B for experiment tracking. It allows you to monitor loss and listen to audio samples in real-time.

  1. Login: ./venv/bin/python -m wandb login (Paste your API key from wandb.ai).
  2. Train:
    ./venv/bin/python train.py --data_dir data/processed/urmp --batch_size 16 --epochs 100

5. Cloud Training (Hugging Face)

For faster training on cloud GPUs (A10G/T4), you can stream your data from the Hugging Face Hub; a sketch of the upload step follows the list below.

  1. Set Token: Create a Hugging Face Write Token and export it:
    export HF_TOKEN=your_token_here
  2. Upload Data: Run the upload script to create a private dataset:
    ./venv/bin/python scripts/upload_to_hf.py --repo_id username/vox2trumpet-data
  3. Train Remote:
    ./venv/bin/python train.py --config_name deep --hf_repo_id username/vox2trumpet-data
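
The upload step boils down to creating a private dataset repo and pushing the processed folder. Here is a sketch using the huggingface_hub client; the repo id and folder path are placeholders, and scripts/upload_to_hf.py may add extra options.

```python
import os
from huggingface_hub import HfApi, create_repo

repo_id = "username/vox2trumpet-data"   # replace with your own namespace
token = os.environ["HF_TOKEN"]          # the write token exported above

create_repo(repo_id, repo_type="dataset", private=True, token=token, exist_ok=True)
HfApi(token=token).upload_folder(
    folder_path="data/processed/urmp",
    repo_id=repo_id,
    repo_type="dataset",
)
```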

🧬 Data Pipeline: Batches & Epochs

To understand the training dynamics, here is how the data is handled under the hood:

  • The Dataset: We use preprocessed .pt files from the URMP dataset. Each file contains a full recording.
  • Random Cropping: Every time the dataloader accesses a file, it picks a random 1-second segment (16,000 samples). This ensures that the model sees different parts of each performance in every epoch, significantly increasing data diversity (see the sketch after this list).
  • Batches: With a Batch Size of 16, the model processes 16 different 1-second segments simultaneously on your GPU.
  • Epochs: One epoch means the model has "visited" every one of the training files exactly once. Because of the random crop, training for 100 epochs means the model effectively hears thousands of unique snippets of trumpet playing.
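
A minimal sketch of the random-cropping idea, assuming each .pt file holds the full-length audio tensor (the real data.py also crops the aligned f0/loudness frames, and its keys and signature may differ):

```python
import random
import torch
from torch.utils.data import Dataset

class RandomCropSketch(Dataset):
    def __init__(self, files, crop_samples=16000):
        self.files = files            # preprocessed .pt paths, one full recording each
        self.crop = crop_samples      # 1 second at 16 kHz

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        item = torch.load(self.files[idx])        # assumed to contain an 'audio' tensor
        audio = item["audio"]
        start = random.randint(0, audio.shape[-1] - self.crop)   # new crop on every access
        return audio[..., start:start + self.crop]
```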

🎼 The URMP Dataset

The University of Rochester Multi-Modal Music Performance (URMP) dataset is a collection of high-quality ensemble performances where each instrument was recorded individually in a professional studio setting.

  • Instrument Isolation: We focus specifically on the trumpet tracks. These isolated recordings are essential for timbre transfer because they provide "clean" examples of the target instrument's spectral characteristics without bleed from other instruments.
  • Multi-Modal Source: While the dataset includes video and MIDI, this project primarily utilizes the 48kHz audio (resampled to 16kHz) as the ground truth for spectral analysis.
  • Professional Performance: The tracks cover various musical styles and include sophisticated techniques like vibrato and slurs, which are modeled by the GRU decoder.

📈 Monitoring & Debugging

Training a neural synthesizer is an iterative process. Here is how to keep an eye on your model's progress:

1. Weights & Biases (Remote)

Once training starts, W&B provides a real-time dashboard at the URL printed in your terminal.

  • Loss Curves: Monitor train_loss. A steady decrease indicates the model is learning the spectral features.
  • Audio Samples: Every 5 epochs, the model uploads an audio reconstruction; listen to these to hear the timbre evolve from noise to trumpet. Specifically, the model extracts the pitch and loudness "DNA" from the target audio and passes it through the network to test how closely the synthesizer can mimic the original (see the logging sketch after this list).
  • Run Management: Each execution gets a random name (e.g., trumpet-vibe-10). You can delete failed/interrupted runs from the W&B project settings to keep your dashboard clean.
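
The logging pattern behind this dashboard is roughly the following. This is a hedged sketch with dummy values; train.py's metric names and cadence may differ.

```python
import numpy as np
import wandb

run = wandb.init(project="vox2trumpet", config={"batch_size": 16, "epochs": 100})

for epoch in range(100):
    train_loss = 1.0 / (epoch + 1)                      # placeholder for the real epoch loss
    wandb.log({"epoch": epoch, "train_loss": train_loss})
    if epoch % 5 == 0:
        reconstruction = np.zeros(16000, dtype=np.float32)   # placeholder for model output
        wandb.log({"reconstruction": wandb.Audio(reconstruction, sample_rate=16000)})

run.finish()
```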

2. Local Monitoring

  • Progress Bar (tqdm): Shows instantaneous loss and processing speed (iterations per second) in your terminal.
  • Checkpoints: High-fidelity model states are saved to checkpoints/model_epoch_N.pth.
  • Graceful Exit: Hit CTRL+C at any time to stop training. The script will safely close the W&B connection and save the current state.

3. Auto-Resume

You don't need to do anything special to resume. If you stop the script and run the training command again, it will:

  1. Detect checkpoints/latest.pth.
  2. Automatically load the latest weights and optimizer state.
  3. New: It will automatically detect if you have updated the learning_rate or mag_loss_weight in config.json and apply the new values to the resumed optimizer.
  4. Continue training from the exact epoch where it was interrupted (a sketch of this logic follows).
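
In code, the resume path amounts to something like the sketch below. The checkpoint keys and config fields are assumptions; train.py is the authoritative implementation.

```python
import json
import os
import torch

def maybe_resume(model, optimizer, ckpt_path="checkpoints/latest.pth",
                 config_path="config.json"):
    start_epoch = 0
    if os.path.exists(ckpt_path):
        ckpt = torch.load(ckpt_path, map_location="cpu")
        model.load_state_dict(ckpt["model"])            # assumed key names
        optimizer.load_state_dict(ckpt["optimizer"])
        start_epoch = ckpt["epoch"] + 1
        # Re-apply hyperparameters that may have changed in config.json since the checkpoint.
        with open(config_path) as f:
            cfg = json.load(f)
        for group in optimizer.param_groups:
            group["lr"] = cfg["learning_rate"]
    return start_epoch
```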

🧠 Architecture

The model follows the Control-Synthesis paradigm, separating the "Brain" (Neural Network) from the "Body" (DSP Synthesizer).

  • Encoder: Extracts pitch ($f_0$) using a pre-trained CREPE model and A-weighted Loudness.

    • CREPE Architecture: This project utilizes CREPE (Convolutional REpresentation for Pitch Estimation), a deep convolutional neural network.
      • Structure: It consists of 6 convolutional layers followed by a fully connected output layer.
      • Frequency Resolution: The model outputs a probability distribution over 360 pitch bins, each 20 cents wide, covering 6 octaves (from 32.7 Hz to 1975.5 Hz).
      • Input: Operates on raw 16kHz audio waveforms using a 1024-sample window.
      • Capacity: We default to the tiny model for efficiency, though full can be used for maximum precision during offline preprocessing.
  • Decoder: A GRU (Gated Recurrent Unit) maps these control signals to synthesizer parameters.

    • Why GRU?: We chose a GRU over a Transformer because it is significantly more efficient for real-time audio synthesis. GRUs have a strong "inductive bias" for sequences where the next frame depends heavily on the previous one (temporal persistence).

    🧠 Decoder (Brain) Internal Structure

    The decoder transforms control signals into high-dimensional synthesizer parameters:

    graph TD
        InF0["Log F0 (Pitch)"] --> Concat["Concatenate [Batch, Time, 2]"]
        InLoud["Loudness (Volume)"] --> Concat
        Concat --> GRU["GRU Layer<br/>(Hidden: 512, 1 Layer)"]
        GRU --> MLP["MLP Projection<br/>(Linear Layer)"]
        MLP --> Split["Split (166 Units)"]
        Split --> Harm["Harmonic Branch<br/>(101 Units)"]
        Split --> Noise["Noise Branch<br/>(65 Units)"]
        Harm --> Softmax["Softmax Activation<br/>(Timbre Distribution)"]
        Softmax --> MulLoud["Multiply by Input Loudness"]
        MulLoud --> OutHarm["Harmonic Amplitudes"]
        Noise --> Sigmoid["Sigmoid Activation"]
        Sigmoid --> OutNoise["Noise Magnitudes"]
    
        style InF0 fill:#f9f,stroke:#333,stroke-width:2px
        style InLoud fill:#f9f,stroke:#333,stroke-width:2px
        style OutHarm fill:#f9f,stroke:#333,stroke-width:2px
        style OutNoise fill:#f9f,stroke:#333,stroke-width:2px
    
    • Future-Proofing: While the GRU is our "lean and mean" baseline, the modular design allows us to swap in a Transformer-based decoder if we need to model more complex, long-range musical dependencies in the future.
  • Synthesizer:

    • Harmonic Synthesizer: Additive synthesis (sum of sines) for the tonal trumpet vibration (sketched after this list).
    • Filtered Noise: Subtractive synthesis (time-varying FIR filters) for air flux and rasp.
  • Loss: Multi-Resolution STFT Loss (in loss.py) ensures the model learns spectral details across time and frequency.
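
To illustrate the additive-synthesis idea from the Synthesizer bullet above, here is a minimal harmonic-oscillator sketch. It assumes the frame-rate controls have already been upsampled to audio rate; the project's synth.py handles this and more.

```python
import math
import torch

def harmonic_synth(f0, harm_amps, sample_rate=16000):
    """
    f0:        [batch, samples, 1]  fundamental frequency in Hz (audio rate)
    harm_amps: [batch, samples, K]  per-harmonic amplitudes (audio rate)
    """
    n_harmonics = harm_amps.shape[-1]
    ratios = torch.arange(1, n_harmonics + 1, device=f0.device)   # 1, 2, ..., K
    freqs = f0 * ratios                                           # [batch, samples, K]
    harm_amps = harm_amps * (freqs < sample_rate / 2).float()     # mute harmonics above Nyquist
    # Integrate instantaneous frequency to get phase, then sum the sinusoids.
    phases = 2 * math.pi * torch.cumsum(freqs / sample_rate, dim=1)
    return (harm_amps * torch.sin(phases)).sum(dim=-1)            # [batch, samples]
```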

Further Reading on GRUs


Mathematical Loss Function: Multi-Resolution STFT

Vox2Trumpet uses a Multi-Resolution Short-Time Fourier Transform (MR-STFT) Loss to train the GRU decoder. Unlike a simple time-domain MSE, this loss evaluates the model across multiple time-frequency scales to capture both sharp transients and sustained harmonic timbre.

1. Single-Resolution STFT Loss

For each resolution $i$ (defined by FFT size $N_i$, hop size $H_i$, and window $W_i$), the loss $L_s^{(i)}$ is a combination of two components:

Spectral Convergence ($L_{sc}$)

This measures the overall "energy shape" of the spectrogram. It is defined as the Frobenius norm of the difference between the target and predicted magnitudes, normalized by the target's norm:

$$L_{sc}(x, y) = \frac{|| \ |STFT(y)| - |STFT(x)| \ ||_F}{|| \ |STFT(y)| \ ||_F + \epsilon}$$

Log-Magnitude Loss ($L_{mag}$)

To better match human auditory perception (which is logarithmic), we calculate the $L1$ distance between the log-spectrograms. We use a weighting factor $w$ (configured in config.json) to prioritize tonal warmth:

$$L_{mag}(x, y) = w \cdot \frac{1}{T \cdot F} \sum_{t,f} | \log(|STFT(y)_{t,f}| + \epsilon) - \log(|STFT(x)_{t,f}| + \epsilon) |$$

2. Multi-Resolution Aggregation

To prevent the model from over-fitting to a single window size, we average the loss across a bank of resolutions (e.g., 512, 1024, 2048, and 4096):

$$L_{total} = \frac{1}{M} \sum_{i=1}^{M} (L_{sc}^{(i)} + L_{mag}^{(i)})$$

Using a high resolution like 4096 is crucial for resolving the closely spaced harmonics of low notes, while the 512 resolution ensures that sharp attacks are preserved.
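
Putting the two components together, a compact sketch of the multi-resolution loss looks like this (resolution bank and weighting as described above; loss.py may organise it differently):

```python
import torch

def stft_mag(x, n_fft, hop):
    window = torch.hann_window(n_fft, device=x.device)
    return torch.stft(x, n_fft, hop_length=hop, window=window, return_complex=True).abs()

def mr_stft_loss(pred, target, resolutions=(512, 1024, 2048, 4096),
                 mag_weight=1.0, eps=1e-7):
    total = 0.0
    for n_fft in resolutions:
        P = stft_mag(pred, n_fft, n_fft // 4)
        T = stft_mag(target, n_fft, n_fft // 4)
        sc = torch.norm(T - P) / (torch.norm(T) + eps)    # spectral convergence term
        mag = mag_weight * torch.mean(torch.abs(torch.log(T + eps) - torch.log(P + eps)))
        total = total + sc + mag                          # L_sc + L_mag at this resolution
    return total / len(resolutions)                       # average over the resolution bank
```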


📂 File Structure

  • model.py: Main Vox2Trumpet nn.Module.
  • synth.py: Differentiable DSP modules (HarmonicSynthesizer, FilteredNoiseSynthesizer).
  • loss.py: Perceptual loss functions.
  • preprocess.py: Data pipeline for extracting $(f_0, Loudness)$ features.
  • visualize_features.py: Diagnostic tool for feature inspection.
  • train.py: Training loop with W&B integration and checkpoint resume.
  • data.py: Vox2TrumpetDataset with random cropping.
  • download_trumpet_data.py: Automated trumpet data downloader.
  • setup_env.sh: Environment reproducibility script.
  • test_e2e.py: Deterministic regression test suite.
  • config.json: Centralized configuration for model architectures and hyperparameters.
  • scripts/upload_to_hf.py: Utility for migrating training data to the cloud.
  • scripts/check_dataset_nans.py: Integrity verification tool for processed features.
