ECAPA-Based Speaker Verification of Virtual Assistants: A Transfer Learning Approach
Abstract
This work applies transfer learning with the ECAPA-TDNN model, pre-trained on the VoxCeleb2 dataset, to verify the synthetic voices of virtual assistants.
Intra-voice assistant comparisons: Achieved accuracies of 83.33% (iOS) and 66.67% (Alexa) for text-independent samples and 50% for text-dependent samples.
Inter-voice assistant comparisons (Alexa, Siri, Google Assistant, Cortana): 100% accuracy for text-independent, 80% for text-dependent.
Demonstrates the effectiveness of transfer learning and ECAPA-TDNN model for secure speaker verification across speech assistant versions.
Valuable insights for enhancing speaker verification in the context of speech assistants.
Introduction
Speaker verification relies on speech characteristics such as pitch, formants, the spectral envelope, MFCCs, and prosody.
"Voice prints" represent a speaker's unique vocal qualities.
Two types of speaker verification: text-dependent (TDSV), where a fixed phrase is spoken, and text-independent (TISV), where any utterance may be used.
Transfer learning employs pre-trained models to improve performance when labeled data is scarce.
The ECAPA-TDNN model from the SpeechBrain toolkit is used in this study for transfer learning on virtual assistants.
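As a concrete sketch, the pre-trained verifier can be loaded through SpeechBrain's `SpeakerRecognition` interface. The file paths and `savedir` below are placeholders; the snippet assumes `speechbrain` is installed, and the model weights (~80 MB) are fetched from HuggingFace on first use:

```python
def verify_pair(path_a: str, path_b: str):
    """Score two utterances with the VoxCeleb-trained ECAPA-TDNN model."""
    try:  # SpeechBrain >= 1.0
        from speechbrain.inference import SpeakerRecognition
    except ImportError:  # older SpeechBrain releases
        from speechbrain.pretrained import SpeakerRecognition
    model = SpeakerRecognition.from_hparams(
        source="speechbrain/spkrec-ecapa-voxceleb",
        savedir="pretrained_models/spkrec-ecapa-voxceleb",  # local cache dir
    )
    score, prediction = model.verify_files(path_a, path_b)
    return float(score), bool(prediction)
```

`verify_files` returns a cosine-similarity score between the two embeddings and a boolean decision against the model's default threshold.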
Methodology
Dataset
A custom audio dataset was created with a subset selected for analysis.
Organized into:
Intra-pair Comparisons:
Siri Versions (iOS 9 vs iOS 10 vs iOS 11)
Alexa Versions (3rd gen vs 4th gen vs 5th gen)
Inter-pair Comparisons:
Alexa
Siri
Google Assistant
Cortana
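The trial pairs implied by this grouping can be enumerated with `itertools`; the variable names below are illustrative, not taken from the project code:

```python
from itertools import combinations

siri_versions = ["iOS 9", "iOS 10", "iOS 11"]
alexa_versions = ["3rd gen", "4th gen", "5th gen"]
assistants = ["Alexa", "Siri", "Google Assistant", "Cortana"]

# Every unordered pair within a group (intra) or across assistants (inter).
trials = {
    "intra_siri": list(combinations(siri_versions, 2)),
    "intra_alexa": list(combinations(alexa_versions, 2)),
    "inter": list(combinations(assistants, 2)),
}
print(len(trials["inter"]))  # 6 unordered assistant pairs
```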
SpeechBrain
Features the ECAPA-TDNN model, a state-of-the-art speaker-recognition architecture that combines a TDNN backbone with multi-layer feature aggregation (MFA), Squeeze-and-Excitation (SE) blocks, and Res2Net-style residual blocks.
Hyperparameters are detailed in a YAML format.
Data Loading makes use of a PyTorch dataset interface.
Batching includes extracting speech features like spectrograms and MFCCs.
The `Brain` class simplifies the neural-model training loop.
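A SpeechBrain-style hyperparameter file might look like the following; the values are illustrative, not the project's exact recipe:

```yaml
# Illustrative SpeechBrain-style hyperparameters (not the exact recipe used here)
seed: 1986
number_of_epochs: 10
batch_size: 32
lr: 0.001
n_mels: 80        # 80-dimensional filterbank features
emb_dim: 192      # ECAPA-TDNN speaker-embedding size
```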
Pre-trained Model: ECAPA-TDNN
SpeechBrain provides ready-to-use pre-trained models such as ECAPA-TDNN; its training pipeline comprises:
Data preprocessing: Extract 80-dimensional filterbank features.
Model initialization: 5 TDNN layers, an attention mechanism, and an MLP classifier.
Hyperparameter setting: epochs, batch size, learning rate, etc.
Training: Trained on the VoxCeleb2 dataset.
Validation and Testing: Evaluate on a validation set.
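The 80-dimensional filterbank features from the preprocessing step can be sketched in plain NumPy. The frame and hop sizes below (25 ms and 10 ms at 16 kHz) are common defaults, not necessarily the exact values used:

```python
import numpy as np

def mel_filterbank(sr=16000, n_fft=400, n_mels=80):
    """Triangular mel filterbank matrix of shape (n_mels, n_fft//2 + 1)."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # n_mels + 2 equally spaced points on the mel scale -> FFT bin indices
    hz_pts = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):   # rising slope
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):  # falling slope
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

def log_mel_features(signal, sr=16000, n_fft=400, hop=160, n_mels=80):
    """Frame the signal, take magnitude spectra, apply mel filters, log-compress."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    return np.log(mag @ mel_filterbank(sr, n_fft, n_mels).T + 1e-10)

rng = np.random.default_rng(0)
feats = log_mel_features(rng.standard_normal(16000))  # one second of audio
print(feats.shape)  # (98, 80): 98 frames x 80 mel bands
```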
Implementation
- Normalize, denoise, and extract features from audio samples.
- Adjust the ECAPA-TDNN model's initial layer for TDSV and TISV.
- Use the model to verify speaker identities and obtain similarity scores.
- Store scores and predictions in arrays.
- Calculate accuracy, precision, recall, and F1 score for evaluation.
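The evaluation step can be sketched in plain Python; the labels below are illustrative trial pairs, with 1 for same-speaker and 0 for different-speaker:

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from binary trial labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

# Toy example: 6 trial pairs, one false reject and one false accept.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
print(binary_metrics(y_true, y_pred))  # all four equal 2/3 here
```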
Results
Output Snippets
Conclusion
Intra-pair TDSV analysis shows that the voices of all versions match one another, raising potential security concerns.
Inter-pair TDSV analysis found matches between Cortana and both Google Assistant and Alexa.
TISV achieves higher accuracy than TDSV because the model can discriminate speakers across differing texts.
For better performance, additional training on a broader dataset of synthetic voices is recommended.
The study emphasizes the potential of transfer learning and SpeechBrain for speaker verification, also acknowledging challenges with synthetic voices.
About
Speaker verification on speech assistants using the ECAPA-TDNN model, focusing on intra- and inter-voice-assistant variation and emphasizing the potential of transfer learning for secure speaker verification.