HF-compatible training and inference code for MOSS/CAT audio tokenizers.
CAT is the primary supported tokenizer workflow in this repository. DAC is
retained as a model type under audio_tokenizer.models for reference and
future extension, but the original Descript DAC download/encode/decode CLI is
not included in this repository.
audio_tokenizer/models/: tokenizer models and shared codec utilities.
audio_tokenizer/nn/: layers, quantizers, losses, and transformer blocks.
audio_tokenizer/data/: CAT tokenizer training dataset.
audio_tokenizer/inference/: CAT tokenizer inference helpers and CLI entry logic.
audio_tokenizer/semantic/: semantic LLM head for CAT training.
audio_tokenizer/asr/: CAT-ASR and CAT-CTC downstream models and data code.
scripts/: CAT tokenizer training, CAT inference, pretokenization, CAT-ASR, and CAT-CTC entry points.
conf/cat/: CAT tokenizer training configs.
conf/asr/: CAT-ASR and CAT-CTC downstream configs.
Use Python 3.10 or newer in a dedicated virtual environment or conda environment.
Create and activate an environment:
conda create -n audio-tokenizer-train python=3.10 -y
conda activate audio-tokenizer-train
Install PyTorch first. Pick the wheel that matches your CUDA driver and runtime from the official PyTorch install selector. For example, for a CUDA 12.1 environment:
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu121
For CPU-only setup:
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cpu
Install this repository in editable mode:
pip install -e .
For development and tests, install the optional dev dependencies:
pip install -e ".[dev]"
If your environment already provides PyTorch, activate that environment, install this repository into it, and keep the existing PyTorch build. Verify the package import after installation:
python - <<'PY'
import audio_tokenizer
from audio_tokenizer.models.cat import CAT
from audio_tokenizer.models.dac import DAC
print(audio_tokenizer.__version__)
print(CAT.__name__, DAC.__name__)
PY
Full CAT training also requires local audio manifests, model/checkpoint paths,
and GPU resources. Update the paths in conf/cat/ and conf/asr/ before
running training commands.
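Before launching a GPU job, it can help to confirm which PyTorch build the environment actually picked up. The following is a minimal sketch, not part of this repository: the helper name describe_torch_environment is hypothetical, and it relies only on the standard torch.__version__, torch.cuda.is_available(), and torch.cuda.device_count() calls.

```python
import importlib.util


def describe_torch_environment():
    """Summarize the local PyTorch install (hypothetical helper, not a repo API)."""
    report = {
        "torch_installed": False,
        "torch_version": None,
        "cuda_available": False,
        "gpu_count": 0,
    }
    # find_spec lets us detect a missing PyTorch without raising ImportError.
    if importlib.util.find_spec("torch") is None:
        return report

    import torch

    report["torch_installed"] = True
    report["torch_version"] = str(torch.__version__)
    report["cuda_available"] = bool(torch.cuda.is_available())
    report["gpu_count"] = int(torch.cuda.device_count())
    return report


if __name__ == "__main__":
    for key, value in describe_torch_environment().items():
        print(f"{key}: {value}")
```

If cuda_available is reported as False on a GPU machine, the installed wheel probably does not match the local CUDA driver, and reinstalling PyTorch from the matching index URL is the usual fix.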
CAT tokenizer training uses a UTF-8 manifest file with one sample per line.
Audio-only sample:
/path/to/audio.wav
Audio plus text sample:
/path/to/audio.wav<TAB>transcription text<TAB>asr
The third column is optional and defaults to asr. The dataset loader resamples
audio, mono-mixes it, and crops or pads to the configured segment duration.
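The manifest format above can be read with a few lines of Python. This is an illustrative sketch rather than the repository's dataset loader: ManifestSample, parse_manifest_line, and read_manifest are hypothetical names, and returning None for the text and task of an audio-only line is an assumption.

```python
from typing import Iterator, NamedTuple, Optional


class ManifestSample(NamedTuple):
    audio_path: str
    text: Optional[str]
    task: Optional[str]


def parse_manifest_line(line: str) -> ManifestSample:
    """Parse one tab-separated manifest line (hypothetical helper)."""
    fields = line.rstrip("\r\n").split("\t")
    audio_path = fields[0].strip()
    if not audio_path:
        raise ValueError("manifest line has no audio path")
    if len(fields) == 1:
        # Audio-only sample: no transcription; leaving task unset is an assumption.
        return ManifestSample(audio_path, None, None)
    text = fields[1]
    # The third column is optional and defaults to "asr".
    has_task = len(fields) > 2 and fields[2].strip()
    task = fields[2].strip() if has_task else "asr"
    return ManifestSample(audio_path, text, task)


def read_manifest(path: str) -> Iterator[ManifestSample]:
    """Yield samples from a UTF-8 manifest file, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_manifest_line(line)
```

Running read_manifest over a manifest before training is a cheap way to catch lines that use spaces instead of tabs, since those parse as a single audio path containing the transcription.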
Small debug config:
python scripts/train_cat.py \
--args.load conf/cat/small.yml \
--generator_config /path/to/MOSS-Audio-Tokenizer \
--training_stage 1 \
--save_path runs/cat_small
Full stage 1:
python scripts/train_cat.py \
--args.load conf/cat/base.yml \
--generator_config /path/to/MOSS-Audio-Tokenizer \
--training_stage 1 \
--save_path runs/cat_base_s1
Stage 2 adversarial fine-tuning from a stage 1 checkpoint:
python scripts/train_cat.py \
--args.load conf/cat/base.yml \
--training_stage 2 \
--resume true \
--resume_from runs/cat_base_s1 \
--save_path runs/cat_base_s2
For distributed training, launch the same script with torchrun:
torchrun --nproc_per_node 8 scripts/train_cat.py \
--args.load conf/cat/base.yml \
--generator_config /path/to/MOSS-Audio-Tokenizer \
--training_stage 1 \
--save_path runs/cat_base_s1
generator_config points to a Hugging Face-style MOSS-Audio-Tokenizer
directory containing config.json and model*.safetensors. Training
checkpoints are saved in the same HF-compatible format: config.json,
model.safetensors, optional remote-code files, and training_state/ for
optimizer and tracker state. Update manifest paths, model paths, batch sizes,
and worker counts in the YAML configs for your machine before running full
training.
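A quick way to check that a directory matches the layout described above, before passing it as generator_config or resume_from, is sketched below. The function names are hypothetical and the check only looks at file names. It does not load or validate the weights.

```python
from pathlib import Path


def check_checkpoint_dir(path):
    """Report which expected checkpoint files are present (hypothetical helper)."""
    root = Path(path)
    return {
        # config.json and at least one model*.safetensors file are expected.
        "has_config": (root / "config.json").is_file(),
        "weight_files": sorted(p.name for p in root.glob("model*.safetensors")),
        # training_state/ is only present in checkpoints written by training.
        "has_training_state": (root / "training_state").is_dir(),
    }


def looks_loadable(report):
    """True when both the config and at least one weight file were found."""
    return report["has_config"] and bool(report["weight_files"])


if __name__ == "__main__":
    import sys

    result = check_checkpoint_dir(sys.argv[1] if len(sys.argv) > 1 else ".")
    print(result, "loadable:", looks_loadable(result))
```

A directory downloaded from Hugging Face would normally report has_training_state as False, while a stage 1 run directory used for stage 2 should report it as True.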
Use scripts/infer_cat.py for reconstruction or tokenization checks against an
HF-style MOSS/CAT checkpoint:
python scripts/infer_cat.py --help
python scripts/infer_cat.py reconstruct \
--checkpoint /path/to/MOSS-Audio-Tokenizer \
--input /path/to/audio.wav \
--output runs/recon.wav
Pretokenize audio with a CAT checkpoint before training downstream models:
python scripts/pretokenize_cat_asr.py --help
Train CAT-CTC:
python scripts/train_cat_ctc.py --config conf/asr/cat_ctc_small.yml
Train CAT-ASR:
python scripts/train_cat_asr.py --config conf/asr/qwen3_1.7b_full.yml
The ASR configs include example local paths. Replace them with your model, checkpoint, and manifest paths.
Run the retained smoke and downstream tests:
python -m pytest tests/test_imports.py tests/test_cat_ctc.py tests/test_cat_asr.py
Some training and inference commands require local audio manifests, model checkpoints, Hugging Face model paths, and GPU resources.
CAT checkpoints use the Hugging Face MOSS-Audio-Tokenizer layout by default.
The generator weights are stored as safetensors with official parameter names,
so official HF checkpoints can be loaded directly by training and inference
code without an external key conversion step. Optimizer and tracker state are
stored separately under training_state/.
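To see which parameter names a checkpoint stores without loading any tensors, the safetensors header can be read directly. A safetensors file starts with an 8-byte little-endian header length followed by a JSON header keyed by tensor name. The sketch below uses only the standard library, and list_safetensors_names is a hypothetical helper, not a function provided by this repository.

```python
import json
import struct


def list_safetensors_names(path):
    """Return the sorted tensor names stored in a .safetensors file."""
    with open(path, "rb") as handle:
        # First 8 bytes: unsigned little-endian 64-bit length of the JSON header.
        (header_len,) = struct.unpack("<Q", handle.read(8))
        header = json.loads(handle.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional free-form entry, not a tensor.
    header.pop("__metadata__", None)
    return sorted(header)


if __name__ == "__main__":
    import sys

    for name in list_safetensors_names(sys.argv[1]):
        print(name)
```

Comparing this list between an official checkpoint and one saved by training is a simple way to confirm that the parameter names still match.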
This repository is based on the original Descript Audio Codec (DAC) codebase. We thank the DAC authors and contributors for releasing the project and making it a strong foundation for this audio tokenizer training repository.