---
title: Vox2Trumpet DDSP
emoji: 🎺
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 3.50.2
app_file: app.py
pinned: false
---

Vox2Trumpet: DDSP Timbre Transfer

Deep learning to transform voice or humming recordings into trumpet sounds.

🧠 System Architecture

      RAW AUDIO             ENCODER (CREPE)               INPUTS (Log)           DECODER (Brain)                PARAMETERS
    [Batch, Sample]        [Pre-trained]                [Batch, Time, 2]      [Batch, Time, 512]           [Batch, Time, 166]

    ┌───────────┐         ┌─────────────────┐          ┌──────────┐          ┌───────────────────┐        ┌───────────────────┐
    │           │         │      CREPE      │──(f0)──▶ │  log_f0  │──┐       │   GRU (1 Layer)   │        │   MLP (Linear)    │
    │ [ 🎤 ]    │────▶────┤  (Pitch/Loud)   │          └──────────┘  │       │   Hidden: 512     │        │  166 Neurons Out  │
    │           │         │                 │──(L)───▶ ┌──────────┐  ├───▶───┤                   ├───▶───┤                   │──┐
    └───────────┘         └─────────────────┘          │ loudness │──┘       │   BatchFirst=True │        │  (Softmax/Sigmoid)│  │
                                                       └──────────┘          └───────────────────┘        └───────────────────┘  │
                                                                                                                                 │
                                                                                                                                 ▼
          ┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
          │
          │             SYNTHESIZERS (Body)               AUDIO OUTPUT
          │           [Physics-Informed DSP]             [Batch, Sample]
          │
          │     (Amps)  ┌───────────────────────┐
          ├──────────▶──│ Harmonic Oscillator   │──┐
          │             └───────────────────────┘  │      ┌───────────┐
          │                                        ├──▶───│   SUM     │──▶ [ 🎺 ]
          │     (Mags)  ┌───────────────────────┐  │      └───────────┘
          └──────────▶──│ Noise Filter Bank     │──┘
                        └───────────────────────┘

🎙️ CREPE Deepnet Architecture

The encoder uses a pre-trained CREPE model. Below is its internal convolutional structure:

graph TD
    Input["Audio Input<br/>1024 samples @ 16kHz"] --> C1["Conv1: 512 filters<br/>Stride 4"]
    C1 --> P1["Max Pool 2x"]
    P1 --> C2["Conv2: 64 filters<br/>Stride 1"]
    C2 --> P2["Max Pool 2x"]
    P2 --> C3["Conv3: 64 filters<br/>Stride 1"]
    C3 --> P3["Max Pool 2x"]
    P3 --> C4["Conv4: 64 filters<br/>Stride 1"]
    C4 --> P4["Max Pool 2x"]
    P4 --> C5["Conv5: 64 filters<br/>Stride 1"]
    C5 --> P5["Max Pool 2x"]
    P5 --> C6["Conv6: 64 filters<br/>Stride 1"]
    C6 --> P6["Max Pool 2x"]
    P6 --> Flatten["Flatten"]
    Flatten --> Dense["Dense Layer<br/>360 Units"]
    Dense --> Softmax["Softmax"]
    Softmax --> Output["Pitch Estimate<br/>(360 Cent Bins)"]

    style Input fill:#f9f,stroke:#333,stroke-width:2px
    style Output fill:#f9f,stroke:#333,stroke-width:2px

Model Details:

  • Decoder: Single-layer GRU (Gated Recurrent Unit) to capture temporal dependencies (slurs, vibrato, and decay); a minimal sketch follows this list.
  • Hidden Size: 512 units.
  • Parameters: 166 total (101 harmonic amplitudes + 65 noise band magnitudes).
  • Inductive Bias: The model predicts control signals rather than raw samples, ensuring stable pitch.
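
To make the decoder concrete, here is a minimal PyTorch sketch of the structure described above (the GRU, the linear projection to 166 parameters, and the softmax/sigmoid split). The class and argument names are illustrative, not the project's actual model.py API.

```python
import torch
import torch.nn as nn

class DecoderSketch(nn.Module):
    """Illustrative decoder: (log_f0, loudness) -> (harmonic amps, noise mags)."""

    def __init__(self, hidden_size=512, n_harmonics=101, n_noise_bands=65):
        super().__init__()
        self.gru = nn.GRU(input_size=2, hidden_size=hidden_size,
                          num_layers=1, batch_first=True)
        self.proj = nn.Linear(hidden_size, n_harmonics + n_noise_bands)
        self.n_harmonics = n_harmonics

    def forward(self, log_f0, loudness):
        # log_f0, loudness: [batch, time, 1] control signals from the encoder
        x = torch.cat([log_f0, loudness], dim=-1)          # [batch, time, 2]
        hidden, _ = self.gru(x)                            # [batch, time, 512]
        params = self.proj(hidden)                         # [batch, time, 166]
        harm, noise = params.split(
            [self.n_harmonics, params.shape[-1] - self.n_harmonics], dim=-1)
        # Softmax gives a timbre distribution; scaling by loudness follows the
        # diagram above (the real model may normalise loudness differently).
        harm_amps = torch.softmax(harm, dim=-1) * loudness
        noise_mags = torch.sigmoid(noise)
        return harm_amps, noise_mags
```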

🚀 Quick Start

This project uses a local, isolated environment to guarantee reproducibility. All dependencies are installed into the ./venv directory within the project root.

1. Setup Environment

We provide a script to set up a Python 3.9 virtual environment and install exact dependencies.

# Run the setup script (creates ./venv and installs packages)
chmod +x setup_env.sh
./setup_env.sh

Where are the dependencies? They are downloaded and installed locally in venv/lib/python3.9/site-packages. They do not affect your global system Python.

2. Verify Installation

Run the verification script to check that the Vox2Trumpet model and DSP components can be instantiated correctly.

# Verify the build
./venv/bin/python verify_project.py

3. Running the GUI Locally

Launch the interactive Gradio interface to perform timbre transfer on your own voice or uploaded recordings.

# Start the web server
./venv/bin/python app.py

Once started, open the local URL (typically http://127.0.0.1:7860) in your web browser.
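
For orientation, a minimal Gradio wrapper might look like the sketch below (assuming Gradio 3.x as pinned in the front matter). The function name is made up, and the real app.py runs the model where the placeholder comment sits.

```python
import gradio as gr

def timbre_transfer(audio):
    sr, wav = audio                 # gr.Audio(type="numpy") yields (sample_rate, samples)
    # ... extract f0/loudness, run the decoder, synthesize trumpet audio here ...
    return sr, wav                  # placeholder: echoes the input unchanged

demo = gr.Interface(
    fn=timbre_transfer,
    inputs=gr.Audio(source="microphone", type="numpy", label="Voice or humming"),
    outputs=gr.Audio(type="numpy", label="Trumpet"),
    title="Vox2Trumpet",
)
demo.launch()   # serves on http://127.0.0.1:7860 by default
```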

🛠️ Data & Training Pipeline

1. Download Dataset (URMP)

We use the URMP dataset (specifically trumpet recordings) for training. Use our script to automate the download and extraction of the audio files.

./venv/bin/python download_trumpet_data.py

2. Preprocessing

Extract high-precision pitch ($f_0$) via CREPE, along with A-weighted loudness. Our script includes resume capability: if interrupted, it skips files that have already been processed.

./venv/bin/python preprocess.py --input_dir data/raw/urmp/trumpet_only --output_dir data/processed/urmp
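
As a rough sketch of what this step computes, the snippet below estimates f0 with the torchcrepe package and an A-weighted loudness curve with librosa. The hop size, example path, and helper names are assumptions; preprocess.py may differ in detail.

```python
import librosa
import numpy as np
import torch
import torchcrepe

SR, HOP, N_FFT = 16000, 64, 2048   # assumed frame hop of 4 ms at 16 kHz

def extract_f0(audio: torch.Tensor) -> torch.Tensor:
    """CREPE pitch track for a (1, samples) float tensor in [-1, 1]."""
    return torchcrepe.predict(audio, SR, hop_length=HOP,
                              fmin=50.0, fmax=2000.0,
                              model="tiny", device="cpu")

def extract_loudness(audio: np.ndarray) -> np.ndarray:
    """Per-frame A-weighted log loudness."""
    mag_db = librosa.amplitude_to_db(np.abs(librosa.stft(audio, n_fft=N_FFT, hop_length=HOP)))
    freqs = librosa.fft_frequencies(sr=SR, n_fft=N_FFT)
    weighted = mag_db + librosa.A_weighting(freqs)[:, None]   # apply A-weighting per bin
    power = librosa.db_to_amplitude(weighted) ** 2
    return np.log(power.mean(axis=0) + 1e-7)

# Example usage on a single (hypothetical) file path:
wav, _ = librosa.load("data/raw/urmp/trumpet_only/example.wav", sr=SR, mono=True)
f0 = extract_f0(torch.from_numpy(wav)[None, :])
loudness = extract_loudness(wav)
```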

3. Quality Control (Visualization)

Before training, verify the extracted features. This tool saves a diagnostic plot to data/visualization/.

./venv/bin/python visualize_features.py --file data/processed/urmp/sample.pt
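
A diagnostic plot of this kind can be produced with a few lines of matplotlib. The snippet below is a hedged sketch assuming the .pt file stores 'f0' and 'loudness' tensors; visualize_features.py may use different keys and styling.

```python
import os
import matplotlib.pyplot as plt
import torch

item = torch.load("data/processed/urmp/sample.pt")   # assumed keys: 'f0', 'loudness'
os.makedirs("data/visualization", exist_ok=True)

fig, (ax_f0, ax_loud) = plt.subplots(2, 1, sharex=True)
ax_f0.plot(item["f0"].squeeze().numpy())
ax_f0.set_ylabel("f0 (Hz)")
ax_loud.plot(item["loudness"].squeeze().numpy())
ax_loud.set_ylabel("Loudness")
ax_loud.set_xlabel("Frame")
fig.savefig("data/visualization/sample_features.png")
```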

4. Training with Weights & Biases (W&B)

We use W&B for experiment tracking. It allows you to monitor loss and listen to audio samples in real-time.

  1. Login: ./venv/bin/python -m wandb login (Paste your API key from wandb.ai).
  2. Train:
    ./venv/bin/python train.py --data_dir data/processed/urmp --batch_size 16 --epochs 100

5. Cloud Training (Hugging Face)

For faster training on cloud GPUs (A10G/T4), you can stream your data from the Hugging Face Hub; a sketch of the upload step follows the list below.

  1. Set Token: Create a Hugging Face Write Token and export it:
    export HF_TOKEN=your_token_here
  2. Upload Data: Run the upload script to create a private dataset:
    ./venv/bin/python scripts/upload_to_hf.py --repo_id username/vox2trumpet-data
  3. Train Remote:
    ./venv/bin/python train.py --config_name deep --hf_repo_id username/vox2trumpet-data
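
The upload step boils down to creating a private dataset repo and pushing the processed folder. Here is a sketch using the huggingface_hub client; the repo id and folder path are placeholders, and scripts/upload_to_hf.py may add extra options.

```python
import os
from huggingface_hub import HfApi, create_repo

repo_id = "username/vox2trumpet-data"   # replace with your own namespace
token = os.environ["HF_TOKEN"]          # the write token exported above

create_repo(repo_id, repo_type="dataset", private=True, token=token, exist_ok=True)
HfApi(token=token).upload_folder(
    folder_path="data/processed/urmp",
    repo_id=repo_id,
    repo_type="dataset",
)
```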

🧬 Data Pipeline: Batches & Epochs

To understand the training dynamics, here is how the data is handled under the hood:

  • The Dataset: We use preprocessed .pt files from the URMP dataset. Each file contains a full recording.
  • Random Cropping: Every time the dataloader accesses a file, it picks a random 1-second segment (16,000 samples). This ensures that the model sees different parts of each performance in every epoch, significantly increasing data diversity (see the sketch after this list).
  • Batches: With a Batch Size of 16, the model processes 16 different 1-second segments simultaneously on your GPU.
  • Epochs: One epoch means the model has "visited" every one of the training files exactly once. Because of the random crop, training for 100 epochs means the model effectively hears thousands of unique snippets of trumpet playing.
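
A minimal sketch of the random-cropping idea, assuming each .pt file holds the full-length audio tensor (the real data.py also crops the aligned f0/loudness frames, and its keys and signature may differ):

```python
import random
import torch
from torch.utils.data import Dataset

class RandomCropSketch(Dataset):
    def __init__(self, files, crop_samples=16000):
        self.files = files            # preprocessed .pt paths, one full recording each
        self.crop = crop_samples      # 1 second at 16 kHz

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        item = torch.load(self.files[idx])        # assumed to contain an 'audio' tensor
        audio = item["audio"]
        start = random.randint(0, audio.shape[-1] - self.crop)   # new crop on every access
        return audio[..., start:start + self.crop]
```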

🎼 The URMP Dataset

The University of Rochester Multi-Modal Music Performance (URMP) dataset is a collection of high-quality ensemble performances where each instrument was recorded individually in a professional studio setting.

  • Instrument Isolation: We focus specifically on the trumpet tracks. These isolated recordings are essential for timbre transfer because they provide "clean" examples of the target instrument's spectral characteristics without bleed from other instruments.
  • Multi-Modal Source: While the dataset includes video and MIDI, this project primarily utilizes the 48kHz audio (resampled to 16kHz) as the ground truth for spectral analysis.
  • Professional Performance: The tracks cover various musical styles and include sophisticated techniques like vibrato and slurs, which are modeled by the GRU decoder.

📈 Monitoring & Debugging

Training a neural synthesizer is an iterative process. Here is how to keep an eye on your model's progress:

1. Weights & Biases (Remote)

Once training starts, W&B provides a real-time dashboard at the URL printed in your terminal.

  • Loss Curves: Monitor train_loss. A steady decrease indicates the model is learning the spectral features.
  • Audio Samples: Every 5 epochs, the model uploads an audio reconstruction; listen to these to hear the timbre evolve from noise to trumpet. Specifically, the model extracts the pitch and loudness "DNA" from the target audio and passes it through the network to test how closely the synthesizer can mimic the original (see the logging sketch after this list).
  • Run Management: Each execution gets a random name (e.g., trumpet-vibe-10). You can delete failed/interrupted runs from the W&B project settings to keep your dashboard clean.
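
The logging pattern behind this dashboard is roughly the following. This is a hedged sketch with dummy values; train.py's metric names and cadence may differ.

```python
import numpy as np
import wandb

run = wandb.init(project="vox2trumpet", config={"batch_size": 16, "epochs": 100})

for epoch in range(100):
    train_loss = 1.0 / (epoch + 1)                      # placeholder for the real epoch loss
    wandb.log({"epoch": epoch, "train_loss": train_loss})
    if epoch % 5 == 0:
        reconstruction = np.zeros(16000, dtype=np.float32)   # placeholder for model output
        wandb.log({"reconstruction": wandb.Audio(reconstruction, sample_rate=16000)})

run.finish()
```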

2. Local Monitoring

  • Progress Bar (tqdm): Shows instantaneous loss and processing speed (iterations per second) in your terminal.
  • Checkpoints: High-fidelity model states are saved to checkpoints/model_epoch_N.pth.
  • Graceful Exit: Hit CTRL+C at any time to stop training. The script will safely close the W&B connection and save the current state.

3. Auto-Resume

You don't need to do anything special to resume. If you stop the script and run the training command again, it will:

  1. Detect checkpoints/latest.pth.
  2. Automatically load the latest weights and optimizer state.
  3. New: It will automatically detect if you have updated the learning_rate or mag_loss_weight in config.json and apply the new values to the resumed optimizer.
  4. Continue training from the exact epoch where it was interrupted (a sketch of this logic follows).
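
In code, the resume path amounts to something like the sketch below. The checkpoint keys and config fields are assumptions; train.py is the authoritative implementation.

```python
import json
import os
import torch

def maybe_resume(model, optimizer, ckpt_path="checkpoints/latest.pth",
                 config_path="config.json"):
    start_epoch = 0
    if os.path.exists(ckpt_path):
        ckpt = torch.load(ckpt_path, map_location="cpu")
        model.load_state_dict(ckpt["model"])            # assumed key names
        optimizer.load_state_dict(ckpt["optimizer"])
        start_epoch = ckpt["epoch"] + 1
        # Re-apply hyperparameters that may have changed in config.json since the checkpoint.
        with open(config_path) as f:
            cfg = json.load(f)
        for group in optimizer.param_groups:
            group["lr"] = cfg["learning_rate"]
    return start_epoch
```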

🧠 Architecture

The model follows the Control-Synthesis paradigm, separating the "Brain" (Neural Network) from the "Body" (DSP Synthesizer).

  • Encoder: Extracts pitch ($f_0$) using a pre-trained CREPE model and A-weighted Loudness.

    • CREPE Architecture: This project utilizes CREPE (Convolutional REpresentation for Pitch Estimation), a deep convolutional neural network.
      • Structure: It consists of 6 convolutional layers followed by a fully connected output layer.
      • Frequency Resolution: The model outputs a probability distribution over 360 pitch bins, each 20 cents wide, covering 6 octaves (from 32.7 Hz to 1975.5 Hz).
      • Input: Operates on raw 16kHz audio waveforms using a 1024-sample window.
      • Capacity: We default to the tiny model for efficiency, though full can be used for maximum precision during offline preprocessing.
  • Decoder: A GRU (Gated Recurrent Unit) maps these control signals to synthesizer parameters.

    • Why GRU?: We chose a GRU over a Transformer because it is significantly more efficient for real-time audio synthesis. GRUs have a strong "inductive bias" for sequences where the next frame depends heavily on the previous one (temporal persistence).

    🧠 Decoder (Brain) Internal Structure

    The decoder transforms control signals into high-dimensional synthesizer parameters:

    graph TD
        InF0["Log F0 (Pitch)"] --> Concat["Concatenate [Batch, Time, 2]"]
        InLoud["Loudness (Volume)"] --> Concat
        Concat --> GRU["GRU Layer<br/>(Hidden: 512, 1 Layer)"]
        GRU --> MLP["MLP Projection<br/>(Linear Layer)"]
        MLP --> Split["Split (166 Units)"]
        Split --> Harm["Harmonic Branch<br/>(101 Units)"]
        Split --> Noise["Noise Branch<br/>(65 Units)"]
        Harm --> Softmax["Softmax Activation<br/>(Timbre Distribution)"]
        Softmax --> MulLoud["Multiply by Input Loudness"]
        MulLoud --> OutHarm["Harmonic Amplitudes"]
        Noise --> Sigmoid["Sigmoid Activation"]
        Sigmoid --> OutNoise["Noise Magnitudes"]
    
        style InF0 fill:#f9f,stroke:#333,stroke-width:2px
        style InLoud fill:#f9f,stroke:#333,stroke-width:2px
        style OutHarm fill:#f9f,stroke:#333,stroke-width:2px
        style OutNoise fill:#f9f,stroke:#333,stroke-width:2px
    
    • Future-Proofing: While the GRU is our "lean and mean" baseline, the modular design allows us to swap in a Transformer-based decoder if we need to model more complex, long-range musical dependencies in the future.
  • Synthesizer:

    • Harmonic Synthesizer: Additive synthesis (sum of sines) for the tonal trumpet vibration (sketched after this list).
    • Filtered Noise: Subtractive synthesis (time-varying FIR filters) for air flux and rasp.
  • Loss: Multi-Resolution STFT Loss (in loss.py) ensures the model learns spectral details across time and frequency.
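
To illustrate the additive-synthesis idea from the Synthesizer bullet above, here is a minimal harmonic-oscillator sketch. It assumes the frame-rate controls have already been upsampled to audio rate; the project's synth.py handles this and more.

```python
import math
import torch

def harmonic_synth(f0, harm_amps, sample_rate=16000):
    """
    f0:        [batch, samples, 1]  fundamental frequency in Hz (audio rate)
    harm_amps: [batch, samples, K]  per-harmonic amplitudes (audio rate)
    """
    n_harmonics = harm_amps.shape[-1]
    ratios = torch.arange(1, n_harmonics + 1, device=f0.device)   # 1, 2, ..., K
    freqs = f0 * ratios                                           # [batch, samples, K]
    harm_amps = harm_amps * (freqs < sample_rate / 2).float()     # mute harmonics above Nyquist
    # Integrate instantaneous frequency to get phase, then sum the sinusoids.
    phases = 2 * math.pi * torch.cumsum(freqs / sample_rate, dim=1)
    return (harm_amps * torch.sin(phases)).sum(dim=-1)            # [batch, samples]
```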

Further Reading on GRUs


Mathematical Loss Function: Multi-Resolution STFT

Vox2Trumpet uses a Multi-Resolution Short-Time Fourier Transform (MR-STFT) Loss to train the GRU decoder. Unlike a simple time-domain MSE, this loss evaluates the model across multiple time-frequency scales to capture both sharp transients and sustained harmonic timbre.

1. Single-Resolution STFT Loss

For each resolution $i$ (defined by FFT size $N_i$, hop size $H_i$, and window $W_i$), the loss $L_s^{(i)}$ is a combination of two components:

Spectral Convergence ($L_{sc}$)

This measures the overall "energy shape" of the spectrogram. It is defined as the Frobenius norm of the difference between the target and predicted magnitudes, normalized by the target's norm:

$$L_{sc}(x, y) = \frac{|| \ |STFT(y)| - |STFT(x)| \ ||_F}{|| \ |STFT(y)| \ ||_F + \epsilon}$$

Log-Magnitude Loss ($L_{mag}$)

To better match human auditory perception (which is logarithmic), we calculate the $L1$ distance between the log-spectrograms. We use a weighting factor $w$ (configured in config.json) to prioritize tonal warmth:

$$L_{mag}(x, y) = w \cdot \frac{1}{T \cdot F} \sum_{t,f} | \log(|STFT(y)_{t,f}| + \epsilon) - \log(|STFT(x)_{t,f}| + \epsilon) |$$

2. Multi-Resolution Aggregation

To prevent the model from over-fitting to a single window size, we average the loss across a bank of resolutions (e.g., 512, 1024, 2048, and 4096):

$$L_{total} = \frac{1}{M} \sum_{i=1}^{M} (L_{sc}^{(i)} + L_{mag}^{(i)})$$

Using a high resolution like 4096 is crucial for resolving the closely spaced harmonics of low notes, while the 512 resolution ensures that sharp attacks are preserved.
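
Putting the two components together, a compact sketch of the multi-resolution loss looks like this (resolution bank and weighting as described above; loss.py may organise it differently):

```python
import torch

def stft_mag(x, n_fft, hop):
    window = torch.hann_window(n_fft, device=x.device)
    return torch.stft(x, n_fft, hop_length=hop, window=window, return_complex=True).abs()

def mr_stft_loss(pred, target, resolutions=(512, 1024, 2048, 4096),
                 mag_weight=1.0, eps=1e-7):
    total = 0.0
    for n_fft in resolutions:
        P = stft_mag(pred, n_fft, n_fft // 4)
        T = stft_mag(target, n_fft, n_fft // 4)
        sc = torch.norm(T - P) / (torch.norm(T) + eps)    # spectral convergence term
        mag = mag_weight * torch.mean(torch.abs(torch.log(T + eps) - torch.log(P + eps)))
        total = total + sc + mag                          # L_sc + L_mag at this resolution
    return total / len(resolutions)                       # average over the resolution bank
```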


📂 File Structure

  • model.py: Main Vox2Trumpet nn.Module.
  • synth.py: Differentiable DSP modules (HarmonicSynthesizer, FilteredNoiseSynthesizer).
  • loss.py: Perceptual loss functions.
  • preprocess.py: Data pipeline for extracting $(f_0, Loudness)$ features.
  • visualize_features.py: Diagnostic tool for feature inspection.
  • train.py: Training loop with W&B integration and checkpoint resume.
  • data.py: Vox2TrumpetDataset with random cropping.
  • download_trumpet_data.py: Automated trumpet data downloader.
  • setup_env.sh: Environment reproducibility script.
  • test_e2e.py: Deterministic regression test suite.
  • config.json: Centralized configuration for model architectures and hyperparameters.
  • scripts/upload_to_hf.py: Utility for migrating training data to the cloud.
  • scripts/check_dataset_nans.py: Integrity verification tool for processed features.
