Audio Tokenizer Train

HF-compatible training and inference code for MOSS/CAT audio tokenizers.

CAT is the primary supported tokenizer workflow in this repository. DAC is retained as a model type under audio_tokenizer.models for reference and future extension, but the original Descript DAC download, encode, and decode CLI is not included in this repository.

Repository Layout

audio_tokenizer/models/: tokenizer models and shared codec utilities.
audio_tokenizer/nn/: layers, quantizers, losses, and transformer blocks.
audio_tokenizer/data/: CAT tokenizer training dataset.
audio_tokenizer/inference/: CAT tokenizer inference helpers and CLI entry logic.
audio_tokenizer/semantic/: semantic LLM head for CAT training.
audio_tokenizer/asr/: CAT-ASR and CAT-CTC downstream models and data code.
scripts/: CAT tokenizer training, CAT inference, pretokenization, CAT-ASR, and CAT-CTC entry points.
conf/cat/: CAT tokenizer training configs.
conf/asr/: CAT-ASR and CAT-CTC downstream configs.

Installation

Use Python 3.10 or newer in a dedicated virtual environment or conda environment.

Create and activate an environment:

conda create -n audio-tokenizer-train python=3.10 -y
conda activate audio-tokenizer-train

Install PyTorch first. Use the official PyTorch install selector to pick the wheel that matches your CUDA driver and runtime. For example, for a CUDA 12.1 environment:

pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu121

For CPU-only setup:

pip install torch torchaudio --index-url https://download.pytorch.org/whl/cpu

Install this repository in editable mode:

pip install -e .

For development and tests, install the optional dev dependencies:

pip install -e ".[dev]"

If your environment already provides PyTorch, skip the PyTorch install step, activate that environment, and install this repository into it so that the existing PyTorch build is kept.

After installation, verify that the package imports correctly:

python - <<'PY'
import audio_tokenizer
from audio_tokenizer.models.cat import CAT
from audio_tokenizer.models.dac import DAC

print(audio_tokenizer.__version__)
print(CAT.__name__, DAC.__name__)
PY

Full CAT training also requires local audio manifests, model/checkpoint paths, and GPU resources. Update the paths in conf/cat/ and conf/asr/ before running training commands.

CAT Training Data

CAT tokenizer training uses a UTF-8 manifest file with one sample per line.

Audio-only sample:

/path/to/audio.wav

Audio plus text sample:

/path/to/audio.wav<TAB>transcription text<TAB>asr

The third column is optional and defaults to asr. The dataset loader resamples the audio, downmixes it to mono, and crops or pads it to the configured segment duration.
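The manifest format and the crop-or-pad step described above can be sketched as follows. This is a minimal illustration, not the repository's dataset code: the names ManifestEntry, parse_manifest_line, and crop_or_pad are hypothetical, the cropping start offset is an assumed parameter (the source does not say how the crop position is chosen), and zero-padding at the end is likewise an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class ManifestEntry:
    # Hypothetical container; the repository's dataset class may differ.
    audio_path: str
    text: Optional[str] = None
    task: str = "asr"


def parse_manifest_line(line: str) -> Optional[ManifestEntry]:
    """Parse one tab-separated manifest line; return None for blank lines."""
    line = line.rstrip("\r\n")
    if not line.strip():
        return None
    fields = line.split("\t")
    if len(fields) > 3:
        raise ValueError(f"expected at most 3 columns, got {len(fields)}")
    audio_path = fields[0]
    # Audio-only lines have no transcription column.
    text = fields[1] if len(fields) >= 2 else None
    # The third column is optional and defaults to "asr".
    task = fields[2] if len(fields) == 3 and fields[2] else "asr"
    return ManifestEntry(audio_path, text, task)


def crop_or_pad(samples: Sequence[float], target_len: int, start: int = 0) -> List[float]:
    """Return exactly target_len samples: crop from `start`, or zero-pad the end.

    Assumption: padding uses zeros appended after the audio.
    """
    window = list(samples[start:start + target_len])
    return window + [0.0] * (target_len - len(window))
```

Running parse_manifest_line over every line of a manifest before launching training is a cheap way to catch stray extra tabs or empty lines early.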

Train CAT

Small debug config:

python scripts/train_cat.py \
  --args.load conf/cat/small.yml \
  --generator_config /path/to/MOSS-Audio-Tokenizer \
  --training_stage 1 \
  --save_path runs/cat_small

Full stage 1:

python scripts/train_cat.py \
  --args.load conf/cat/base.yml \
  --generator_config /path/to/MOSS-Audio-Tokenizer \
  --training_stage 1 \
  --save_path runs/cat_base_s1

Stage 2 adversarial fine-tuning from a stage 1 checkpoint:

python scripts/train_cat.py \
  --args.load conf/cat/base.yml \
  --training_stage 2 \
  --resume true \
  --resume_from runs/cat_base_s1 \
  --save_path runs/cat_base_s2

For distributed training, launch the same script with torchrun:

torchrun --nproc_per_node 8 scripts/train_cat.py \
  --args.load conf/cat/base.yml \
  --generator_config /path/to/MOSS-Audio-Tokenizer \
  --training_stage 1 \
  --save_path runs/cat_base_s1

--generator_config points to a Hugging Face-style MOSS-Audio-Tokenizer directory containing config.json and model*.safetensors.

Training checkpoints are saved in the same HF-compatible format: config.json, model.safetensors, optional remote-code files, and training_state/ for optimizer and tracker state.

Update manifest paths, model paths, batch sizes, and worker counts in the YAML configs for your machine before running full training.
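Because a wrong checkpoint path tends to surface only after model construction starts, it can help to check the directory layout up front. The sketch below tests only for the two items the text above names (config.json and at least one model*.safetensors file); the function name missing_checkpoint_files is hypothetical and is not part of this repository.

```python
from pathlib import Path
from typing import List


def missing_checkpoint_files(checkpoint_dir: str) -> List[str]:
    """Return a list of problems; an empty list means the layout looks usable."""
    root = Path(checkpoint_dir)
    if not root.is_dir():
        return [f"{root} is not a directory"]
    problems = []
    if not (root / "config.json").is_file():
        problems.append("config.json is missing")
    # Matches model.safetensors as well as sharded names such as model-00001-of-00002.safetensors.
    if not any(root.glob("model*.safetensors")):
        problems.append("no model*.safetensors file found")
    return problems
```

An empty return value only means the expected files exist; it does not validate their contents.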

CAT Inference

Use scripts/infer_cat.py for reconstruction or tokenization checks against an HF-style MOSS/CAT checkpoint:

python scripts/infer_cat.py --help

python scripts/infer_cat.py reconstruct \
  --checkpoint /path/to/MOSS-Audio-Tokenizer \
  --input /path/to/audio.wav \
  --output runs/recon.wav

CAT-ASR And CAT-CTC

Pretokenize audio with a CAT checkpoint before training downstream models:

python scripts/pretokenize_cat_asr.py --help

Train CAT-CTC:

python scripts/train_cat_ctc.py --config conf/asr/cat_ctc_small.yml

Train CAT-ASR:

python scripts/train_cat_asr.py --config conf/asr/qwen3_1.7b_full.yml

The ASR configs include example local paths. Replace them with your model, checkpoint, and manifest paths.

Tests

Run the retained smoke and downstream tests:

python -m pytest tests/test_imports.py tests/test_cat_ctc.py tests/test_cat_asr.py

Some training and inference commands require local audio manifests, model checkpoints, Hugging Face model paths, and GPU resources.

Checkpoint Compatibility

CAT checkpoints use the Hugging Face MOSS-Audio-Tokenizer layout by default. The generator weights are stored as safetensors with official parameter names, so official HF checkpoints can be loaded directly by training and inference code without an external key conversion step. Optimizer and tracker state are stored separately under training_state/.
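To compare the parameter names in two checkpoints, for example one saved by training and an official one, the tensor names can be read straight from the safetensors header without loading any weights. The sketch below uses only the standard library and relies on the safetensors file layout (an 8-byte little-endian header length followed by a JSON header); the function name list_safetensors_keys is hypothetical, and it handles a single file, not a sharded checkpoint.

```python
import json
import struct
from typing import List


def list_safetensors_keys(path: str) -> List[str]:
    """Return the sorted tensor names stored in one .safetensors file."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned 64-bit little-endian length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return sorted(name for name in header if name != "__metadata__")
```

Comparing set(list_safetensors_keys(a)) with set(list_safetensors_keys(b)) shows which names differ between two files.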

Acknowledgements

This repository is based on the original Descript Audio Codec (DAC) codebase. We thank the DAC authors and contributors for releasing the project and making it a strong foundation for this audio tokenizer training repository.
