This project integrates FastSpeech2 and HiFi-GAN for on-device TTS (Text-to-Speech) on iOS. It currently supports both Japanese and English, with one voice per language.
Model export to Core ML and some challenging implementation parts were assisted by ChatGPT and Gemini.
- Japanese: espnet/kan-bayashi_jsut_fastspeech2
- English: espnet/kan-bayashi_ljspeech_fastspeech2
We use OpenJTalk to extract Japanese phonemes from the input text. For English, graphemes are converted to phonemes using the CMU Pronouncing Dictionary. In both cases, the resulting phoneme sequence is the input to the FastSpeech2 model.
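The English grapheme-to-phoneme step can be sketched as a dictionary lookup. This is a minimal illustration: the embedded two-word dictionary and the function name are hypothetical, and a real app would load the full cmudict file from the bundle and fall back to a G2P model for out-of-vocabulary words.

```swift
import Foundation

// A tiny illustrative subset of the CMU Pronouncing Dictionary.
// A real implementation loads the full cmudict file instead.
let miniCMUDict: [String: [String]] = [
    "HELLO": ["HH", "AH0", "L", "OW1"],
    "WORLD": ["W", "ER1", "L", "D"],
]

/// Convert an English sentence into a flat ARPAbet phoneme sequence.
/// Words missing from the dictionary are silently skipped here;
/// production code would need an out-of-vocabulary fallback.
func graphemesToPhonemes(_ text: String) -> [String] {
    text.uppercased()
        .components(separatedBy: CharacterSet.alphanumerics.inverted)
        .filter { !$0.isEmpty }
        .flatMap { miniCMUDict[$0] ?? [] }
}
```

The phoneme strings would then be mapped to the integer IDs expected by the exported FastSpeech2 model.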
To improve naturalness, the mel-spectrogram output from FastSpeech2 is passed to HiFi-GAN to synthesize waveform audio.
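The two-stage synthesis described above (FastSpeech2 producing a mel-spectrogram, HiFi-GAN turning it into audio) might look roughly like this with Core ML. All model file names, input/output feature names ("phonemes", "mel", "waveform"), and shapes below are placeholders; use the interfaces Xcode generates from your exported models.

```swift
import CoreML

// Hedged sketch of the two-stage pipeline. Model and feature names
// are assumptions, not the project's actual generated API.
func synthesize(phonemeIDs: [Int32]) throws -> [Float] {
    let acoustic = try MLModel(contentsOf: URL(fileURLWithPath: "FastSpeech2.mlmodelc"))
    let vocoder = try MLModel(contentsOf: URL(fileURLWithPath: "HiFiGAN.mlmodelc"))

    // Stage 1: phoneme IDs -> mel-spectrogram.
    let ids = try MLMultiArray(shape: [1, NSNumber(value: phonemeIDs.count)],
                               dataType: .int32)
    for (i, v) in phonemeIDs.enumerated() { ids[i] = NSNumber(value: v) }
    let melOut = try acoustic.prediction(
        from: try MLDictionaryFeatureProvider(dictionary: ["phonemes": ids]))
    let mel = melOut.featureValue(for: "mel")!.multiArrayValue!

    // Stage 2: mel-spectrogram -> raw waveform samples.
    let wavOut = try vocoder.prediction(
        from: try MLDictionaryFeatureProvider(dictionary: ["mel": mel]))
    let wav = wavOut.featureValue(for: "waveform")!.multiArrayValue!
    return (0..<wav.count).map { wav[$0].floatValue }
}
```

The returned samples would then be wrapped in an AVAudioPCMBuffer (or similar) for playback at the vocoder's sample rate.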
This project relies on the following external libraries and tools:
- OpenJTalkForiOS: used for extracting Japanese phonemes from input text. Follow the installation instructions in the repo to integrate it into your Xcode project.
- ❌ Xcode Simulator is not supported.
- ✅ Tested only on iPhone 15 Pro.
- Other devices are untested.
This project is provided as-is. Please use and test at your own risk — we do not provide support.
This project is licensed under the Apache License 2.0.
The models and code are based on the ESPnet pretrained models listed above. Modifications include conversion to Core ML format and integration with iOS.