😋 Text To Speech (TTS)

Check the CHANGELOG file to have a global overview of the latest modifications! 😋

Project structure

├── architectures               : utilities for model architectures
│   ├── layers                  : custom layer implementations
│   ├── transformers            : transformer architecture implementations
│   ├── tacotron2_arch.py       : Tacotron-2 synthesizer architecture (+ multi-speaker variant)
│   └── waveglow_arch.py        : WaveGlow vocoder architecture
├── custom_train_objects
│   ├── losses
│   │   └── tacotron_loss.py    : custom Tacotron2 loss
├── example_outputs             : some pre-computed audios (cf. the `text_to_speech` notebook)
├── loggers
├── models
│   ├── tts
│   │   ├── sv2tts_tacotron2.py : SV2TTS main class
│   │   ├── tacotron2.py        : Tacotron2 main class
│   │   └── waveglow.py         : WaveGlow main class (both PyTorch and TensorFlow)
│   └── weights_converter.py    : utilities to convert weights between different models
├── pretrained_models
├── tests                       : unit and integration tests for model validation
├── utils                       : utility functions for data processing and visualization
├── LICENCE                     : project license file
├── README.md                   : this file
├── requirements.txt            : required packages
└── text_to_speech.ipynb        : notebook demonstrating model creation + TTS features

Check the main project for more information about the unextended modules / structure / main classes.

* Check the encoders project for more information about the models/encoder module

Available features

Text-To-Speech (module `models.tts`):

Feature	Function / class	Description
Text-To-Speech	`tts`	perform TTS on the text of your choice with the model of your choice
stream	`stream`	perform TTS on the text you enter

The text_to_speech notebook provides a concrete demonstration of the `tts` function.

Available models

Model architectures

Available architectures:

Synthesizer:
- Tacotron2 with extensions for multi-speaker (by ID or SV2TTS)
- SV2TTS extension of the Tacotron2 architecture for multi-speaker based on speaker's embeddings*
Vocoder:
- WaveGlow

The SV2TTS models are fine-tuned from pretrained Tacotron2 models, by using the partial transfer learning procedure (see below for details), which speeds up the training significantly.

Model weights

Name	Language	Dataset	Synthesizer	Vocoder	Speaker Encoder	Trainer	Weights
pretrained_tacotron2	`en`	LJSpeech	`Tacotron2`	`WaveGlow`	/	NVIDIA	Google Drive
tacotron2_siwis	`fr`	SIWIS	`Tacotron2`	`WaveGlow`	/	me	Google Drive
sv2tts_tacotron2_256	`fr`	SIWIS, VoxForge, CommonVoice	`SV2TTSTacotron2`	`WaveGlow`	Google Drive	me	Google Drive
sv2tts_siwis	`fr`	SIWIS, VoxForge, CommonVoice	`SV2TTSTacotron2`	`WaveGlow`	Google Drive	me	Google Drive
sv2tts_tacotron2_256_v2	`fr`	SIWIS, VoxForge, CommonVoice	`SV2TTSTacotron2`	`WaveGlow`	Google Drive	me	Google Drive
sv2tts_siwis_v2	`fr`	SIWIS	`SV2TTSTacotron2`	`WaveGlow`	Google Drive	me	Google Drive

Models must be unzipped in the pretrained_models/ directory!
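
The extraction step can be scripted. The snippet below is only a sketch: the helper name `unzip_model` is hypothetical (it is not part of this project), and it assumes the downloaded archive is a regular `.zip` file whose top-level entry is the model directory.

```python
import zipfile
from pathlib import Path

def unzip_model(archive_path, target_dir="pretrained_models"):
    """Extract a downloaded model archive into `target_dir` (hypothetical helper).

    Returns the paths of the top-level entries found in the archive.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
        # Names inside a zip archive always use "/" as separator
        top_level = sorted({name.split("/")[0] for name in archive.namelist()})
    return [target / name for name in top_level]
```

For instance, `unzip_model("sv2tts_siwis.zip")` would create `pretrained_models/sv2tts_siwis/` if the archive is laid out as assumed above.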

Important Note: These links will be updated in a future version, and the converted keras weights of WaveGlow will also be added.

Installation and usage

1. Clone this repository: `git clone https://github.com/yui-mhcp/text_to_speech.git`
2. Go to the root of this repository: `cd text_to_speech`
3. Install requirements: `pip install -r requirements.txt`
4. Open the text_to_speech notebook and follow the instructions!

You may have to install ffmpeg for audio loading/saving.
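
A quick way to check whether `ffmpeg` is already installed is to look it up on the `PATH`. This is a small sketch using only the Python standard library; the helper name `ffmpeg_available` is an illustration, not a function of this project.

```python
import shutil

def ffmpeg_available():
    """Return True if an `ffmpeg` executable is found on the PATH."""
    return shutil.which("ffmpeg") is not None

if not ffmpeg_available():
    print("ffmpeg not found: install it if audio loading/saving fails")
```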

TO-DO list:

Multi-speaker Text-To-Speech

There are multiple ways to enable multi-speaker speech synthesis:

- Use a speaker ID that is embedded by a learnable Embedding layer. The speaker embedding is then learned during training.
- Use a Speaker Encoder (SE) to embed audio from the reference speaker. This is often referred to as zero-shot voice cloning, as it only requires a sample from the speaker (without training).
- More recently, a prompt-based strategy has been proposed, in which the speech is controlled through prompts.
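
The difference between the first two strategies can be sketched with NumPy. Everything below is illustrative: the dimensions are arbitrary, the table is randomly initialized instead of trained, and `embed_reference_audio` is a crude stand-in (frame averaging followed by a linear projection) for a trained speaker encoder, not the model used in this project.

```python
import numpy as np

rng = np.random.default_rng(0)
n_speakers, embedding_dim, n_mels = 4, 8, 80  # arbitrary sizes

# Strategy 1: one learnable row per speaker ID (random here, trained in practice)
speaker_table = rng.normal(size=(n_speakers, embedding_dim))

def embed_speaker_id(speaker_id):
    """Look up the embedding of a speaker seen during training."""
    return speaker_table[speaker_id]

# Strategy 2: embed a reference audio, so unseen speakers are supported
projection = rng.normal(size=(n_mels, embedding_dim))

def embed_reference_audio(mel):
    """Stand-in speaker encoder: average the frames, then project."""
    return mel.mean(axis=0) @ projection

print(embed_speaker_id(2).shape)                                   # (8,)
print(embed_reference_audio(rng.normal(size=(50, n_mels))).shape)  # (8,)
```

The lookup table only covers the IDs it was built with, whereas the encoder-based function accepts any audio sample, which is what makes zero-shot cloning possible.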

Automatic voice cloning with the `SV2TTS` architecture

Note: In the next paragraphs, encoder refers to the Tacotron Encoder part (that encodes the input text), while SE refers to a speaker encoder model (detailed below).

The basic intuition

The Speaker Encoder-based Text-To-Speech is inspired by the "From Speaker Verification To Text-To-Speech (SV2TTS)" paper. Its authors proposed an extension of the Tacotron-2 architecture that includes information about the speaker's voice.

Here is a short overview of the proposed procedure:

1. Train a model to identify speakers based on short audio samples: the speaker verification model. This model takes as input an audio sample (5-10 sec) from a speaker and encodes it into a d-dimensional vector, named the embedding. This embedding aims to capture relevant information about the speaker's voice (e.g., frequencies, rhythm, pitch, etc.).
2. This pre-trained Speaker Encoder (SE) is then used to encode the voice of the speaker to clone.
3. The produced embedding is then concatenated with the output of the Tacotron-2 encoder part, such that the Decoder has access to both the encoded text and the speaker embedding.

The objective is that the Decoder learns to use the speaker embedding to reproduce the speaker's prosody, intonation, etc., so that the text is read with the voice of this speaker.
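
The concatenation step of this procedure can be sketched with NumPy. The function name and the dimensions (512 for the encoder output, 256 for the speaker embedding) are assumptions chosen for illustration; this is not the project's actual implementation.

```python
import numpy as np

def add_speaker_embedding(encoder_output, speaker_embedding):
    """Concatenate the speaker embedding to every encoder time step.

    encoder_output:    (batch, text_length, encoder_dim)
    speaker_embedding: (batch, embedding_dim)
    returns:           (batch, text_length, encoder_dim + embedding_dim)
    """
    text_length = encoder_output.shape[1]
    # Repeat the embedding along the time axis so each step sees the speaker
    tiled = np.repeat(speaker_embedding[:, np.newaxis, :], text_length, axis=1)
    return np.concatenate([encoder_output, tiled], axis=-1)

encoder_output = np.zeros((2, 17, 512))   # dummy encoded text
speaker_embedding = np.ones((2, 256))     # dummy speaker embeddings
decoder_input = add_speaker_embedding(encoder_output, speaker_embedding)
print(decoder_input.shape)  # (2, 17, 768)
```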

Limitations and solutions

There are some limitations with the above approach:

- Perfect generalization to new speakers is very difficult, as it would require large datasets with many speakers.
- The training audio should be free of noise and artifacts, otherwise the synthesized audio will be noisy as well.
- The Speaker Encoder has to correctly separate speakers and encode their voice in a meaningful way for the synthesizer.

To tackle these limitations, the proposed solution is to perform a 2-step training:

1. First train a low-quality multi-speaker model on the CommonVoice database. This is one of the largest multilingual audio databases, at the cost of noisy audio of variable quality. It is therefore not suitable for training good quality models, although pre-processing still helps to obtain intelligible audio.
2. Once a multi-speaker model is trained, a single-speaker database can be used to fine-tune the model on a single speaker. This allows the model to learn faster, with only a limited amount of good quality data, and to produce really good quality audios!

The Speaker Encoder (SE)

The SE part should be able to differentiate speakers and embed (encode in a 1-D vector) them in a meaningful way.

The model used in the paper is a 3-layer LSTM model with a normalization layer trained with the GE2E loss. The major limitation is that training this model is really slow and took 2 weeks on 4 GPUs in CorentinJ's master thesis (cf. his GitHub).

This project proposes a simpler architecture based on Convolutional Neural Networks (CNN), which is much faster to train than LSTM networks. Furthermore, the Euclidean distance is used instead of the cosine metric, as it has shown faster convergence. Additionally, a custom cache-based generator is proposed to speed up audio processing. These modifications made it possible to train a 99% accuracy model within 2-3 hours on a single RTX 3090 GPU!
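
To make the difference between the two metrics concrete, here is a small NumPy sketch. The helper names and the decision threshold are hypothetical; in practice the threshold has to be tuned on validation data.

```python
import numpy as np

def euclidean_distance(a, b):
    """Distance sensitive to both the direction and the norm of the embeddings."""
    return float(np.linalg.norm(a - b))

def cosine_distance(a, b):
    """Distance that only depends on the angle between the embeddings."""
    return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_speaker(a, b, threshold=1.0):
    """Decide whether two embeddings belong to the same speaker (hypothetical threshold)."""
    return euclidean_distance(a, b) < threshold

a = np.array([1.0, 0.0])
b = np.array([2.0, 0.0])  # same direction as `a`, different norm
print(cosine_distance(a, b))     # 0.0
print(euclidean_distance(a, b))  # 1.0
```

The cosine metric ignores the norm of the embeddings, while the Euclidean distance also separates vectors that point in the same direction.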

The partial Transfer Learning procedure

In order to avoid training an SV2TTS model from scratch, which would be completely impossible on a single GPU, a new partial transfer learning procedure is proposed.

This procedure takes a pre-trained model with a slightly different architecture and transfers all the common weights (like in regular transfer learning). For the layers with different weight shapes, only the common part is transferred, while the remaining weights are initialized to zeros. This results in a new model with different weights that mimics the behavior of the original model.

In the SV2TTS architecture, the speaker embedding is passed to the recurrent layer of the Tacotron2 decoder. This changes the input shape of this layer, and therefore the shape of its weight matrix. The partial transfer learning allows us to initialize the model such that it replicates the behavior of the original single-speaker Tacotron2 model!
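
The idea can be illustrated on a single dense kernel with NumPy. This is a minimal sketch, not the project's code: the function name `partial_transfer` and the sizes are assumptions, and it supposes that the speaker embedding is appended after the original inputs, so that the new rows of the kernel are the ones initialized to zero.

```python
import numpy as np

def partial_transfer(pretrained, new_shape):
    """Copy the common part of `pretrained` into a zero-initialized array of `new_shape`."""
    new_weights = np.zeros(new_shape, dtype=pretrained.dtype)
    common = tuple(slice(0, min(p, n)) for p, n in zip(pretrained.shape, new_shape))
    new_weights[common] = pretrained[common]
    return new_weights

rng = np.random.default_rng(0)
input_dim, embedding_dim, units = 6, 3, 4  # arbitrary sizes

old_kernel = rng.normal(size=(input_dim, units))
new_kernel = partial_transfer(old_kernel, (input_dim + embedding_dim, units))

x = rng.normal(size=(1, input_dim))
speaker_embedding = rng.normal(size=(1, embedding_dim))
x_with_speaker = np.concatenate([x, speaker_embedding], axis=-1)

# The zero rows cancel the speaker embedding: both layers give the same output
print(np.allclose(x @ old_kernel, x_with_speaker @ new_kernel))  # True
```

Right after the transfer, the new layer ignores the speaker embedding; fine-tuning then moves the zero-initialized weights away from zero so that the embedding starts to influence the output.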

Notes and references

GitHub projects

The code for this project is a mixture of multiple GitHub projects, to have a fully modular Tacotron-2 implementation:

NVIDIA's repository (tacotron2 / waveglow): The base pretrained models are inspired by this repository.
The TFTTS project: Some inference optimizations are inspired by their dynamic decoder implementation, which has now been optimized and updated to be Keras 3 compatible.
CorentinJ's Real-Time Voice cloning project: The provided SV2TTS architecture is inspired by this repository, with small differences and optimizations.

Papers

Tacotron 2: The original Tacotron2 paper
Waveglow: The original WaveGlow paper
Transfer learning from Speaker Verification to Text-To-Speech: Original paper for SV2TTS variant
Generalized End-to-End loss for Speaker Verification: The GE2E Loss paper (used for speaker encoder in the SV2TTS architecture)

Contacts and licence

Contacts:

Mail: yui-mhcp@tutanota.com
Discord: yui0732

This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0). See the LICENCE file for details.

This license allows you to use, modify, and distribute the code, as long as you include the original copyright and license notice in any copy of the software/source. Additionally, if you modify the code and distribute it, or run it on a server as a service, you must make your modified version available under the same license.

For more information about the AGPL-3.0 license, please visit the official website.

Citation

If you find this project useful in your work, please add this citation to give it more visibility! 😋

@misc{yui-mhcp,
    author  = {yui},
    title   = {A Deep Learning projects centralization},
    year    = {2021},
    publisher   = {GitHub},
    howpublished    = {\url{https://github.com/yui-mhcp}}
}
