A comprehensive machine learning pipeline for early and cost-effective diagnosis of COVID-19 and other respiratory diseases through voice analysis, combining acoustic and biomechanical features.
- Overview
- Paper Reference
- Key Findings
- Features
- Project Structure
- Requirements
- Installation
- Configuration
- Usage
- Pipeline Stages
- Output Files
- Methodology
- Code, Data Availability, and Reproducibility
- Citation
- License
- Authors
This repository implements a robust methodology for respiratory disease diagnosis based on vocal features and machine learning techniques. In contrast to existing methodologies that rely solely on acoustic attributes of the voice (such as intensity or frequency), our approach represents a pioneering investigation that incorporates biomechanical aspects of vocal production, including:
- Muscle tension
- Coordination of articulatory movements
- Respiration patterns
- Glottal closure characteristics
- Vocal fold vibratory patterns
This code accompanies the research published in:
A. J. L. Rivero, C. C. Corbacho, T. R. Arias, M. Martín-Merino and P. Vaz, "Application of Machine Learning Techniques for the Characterization and Early Diagnosis of Respiratory Diseases Such as COVID-19," in IEEE Access, vol. 12, pp. 160516-160528, 2024.
This paper presents a robust methodology for the early and cost-effective diagnosis of COVID-19 based on vocal features and machine learning techniques. The proposed methodology addresses all challenges inherent to the prediction of COVID-19, including those related to feature extraction and selection, the imbalance problem, and predictor training. In contrast to existing methodologies that rely solely on acoustic attributes of the voice, such as intensity or frequency, our approach represents a pioneering investigation that incorporates biomechanical aspects of vocal production. These include muscle tension, the coordination of articulatory movements, and respiration.
The relationship between these characteristics and the presence of the virus is investigated rigorously using robust feature selection techniques. To this end, we have constructed an original dataset comprising patients with confirmed cases of COVID-19 infection and a control group, incorporating both acoustic and biomechanical features using Voice Clinical Software. The robustness and reproducibility of the experimental results have been enhanced through the rigorous comparison of several classifiers and feature selection algorithms, as well as the employment of resampling strategies.
The application of random forests for feature selection has revealed that a limited set of biomechanical markers are significantly associated with the presence of COVID-19 infection. Moreover, a random forest classifier based on a subset of biomechanical and acoustic features demonstrates high efficacy in predicting cases of COVID-19 infection, achieving a sensitivity of
S = (0.9212 ± 0.0775) while maintaining a specificity of
Sp = (0.9150 ± 0.0649).
Considering these findings, the proposed methodology can be regarded as a non-invasive and cost-effective alternative for the diagnosis of COVID-19 infection. Furthermore, it can be extended to the diagnosis of other respiratory diseases, provided that the vocal cords are affected.
Our best-performing models achieve:
| Model | Accuracy | Balanced Accuracy | Sensitivity | Specificity | F1-Score |
|---|---|---|---|---|---|
| SVM | 0.9370 ± 0.0581 | 0.9571 ± 0.0771 | 0.9216 ± 0.0804 | 0.9394 ± 0.0566 | 0.9351 ± 0.0604 |
| Random Forest | 0.9135 ± 0.0647 | 0.9089 ± 0.0960 | 0.9212 ± 0.0775 | 0.9150 ± 0.0649 | 0.9071 ± 0.0820 |
| KNN | 0.9165 ± 0.0614 | 0.9342 ± 0.0761 | 0.9029 ± 0.0959 | 0.9185 ± 0.0600 | 0.9132 ± 0.0689 |
The following biomechanical markers were identified as significantly associated with COVID-19:
| Marker | Description | Clinical Interpretation |
|---|---|---|
| Pr13 | Instability index | Voice instability in resisting vibration stress |
| Pr14 | Index variation in amplitude | Inability to maintain amplitude |
| Pr06 | Duration opening (%) | Challenges with vocal fold separation during opening |
| Pr07 | Duration closing (%) | Challenges with vocal fold separation during closing |
| Pr17 | Index OM closed phase | Presence of mucosal waves during closing phase |
- Acoustic Features: F0 (fundamental frequency), formants (F1, F2), jitter, shimmer, HNR, MFCCs with deltas and delta-deltas
- Biomechanical Markers: 22 markers from Voice Clinical Systems App Online Lab (see Table 1 in paper)
- RFE with Bootstrap: Recursive Feature Elimination with stability analysis (20 bootstrap iterations)
- Statistical Tests: Mann-Whitney U test with Bonferroni correction (α = 0.05)
- Mutual Information: Information-theoretic feature ranking
- LASSO: L1-regularized feature selection
- CFS: Correlation-based Feature Selection
- Random Forest Importance: Feature importance ranking
| Model | Abbreviation | Notes |
|---|---|---|
| AdaBoost | ADA | Ensemble boosting |
| K-Nearest Neighbors | KNN | Instance-based learning |
| Support Vector Machine | SVM | With RBF kernel, optimized via PSO |
| Bagging | BAG | Bootstrap aggregating |
| Random Forest | RF | Interpretable via decision rules |
| Gradient Boosting | GB | Sequential boosting |
| XGBoost | XGB | Optimized gradient boosting |
| LightGBM | LGB | Fast gradient boosting |
| Decision Tree | DT | Single tree classifier |
- Particle Swarm Optimization (PSO) for hyperparameter tuning
- SMOTE for class imbalance handling (generates synthetic samples)
- Stratified 5-Fold Cross-Validation
- Bootstrap Confidence Intervals (100 iterations)
- Leave-One-Out Evaluation
vocal-biomarkers-ml/
├── config/
│ └── config.yaml # Main configuration file
├── data/
│ ├── raw/
│ │ ├── control/ # Control group audio recordings (.wav)
│ │ ├── pathological/ # COVID-19 positive audio recordings (.wav)
│ │ └── biomechanical/ # Biomechanical marker CSV files
│ └── results/
│ ├── datasets/ # Processed datasets
│ ├── features/ # Feature selection results
│ ├── metrics/ # Model performance metrics
│ └── visualizations/ # Generated plots (PCA, ROC curves)
├── logs/
│ └── voice_analysis.log # Execution logs
├── src/
│ └── voice_analysis/
│ ├── __init__.py
│ ├── __main__.py # CLI entry point
│ ├── pipeline.py # Main pipeline orchestrator
│ ├── config/ # Configuration management (Pydantic)
│ ├── data/ # Data loading and preprocessing
│ ├── features/ # Feature extraction (acoustic, biomechanical)
│ ├── models/ # Classifiers and hyperparameter optimization
│ ├── metrics/ # Custom evaluation metrics
│ ├── preprocessing/ # Cleaning, resampling, standardization
│ ├── utils/ # I/O, logging, reproducibility
│ └── visualization/ # PCA plots, ROC curves
├── tests/ # Unit and integration tests
├── pyproject.toml # Project dependencies (Poetry)
└── README.md
- Python 3.10, 3.11, or 3.12
- 8GB RAM minimum (16GB recommended)
- Unix-like OS (Linux/macOS) or Windows 10+
| Library | Version | Purpose |
|---|---|---|
| scikit-learn | 1.3.2 | Machine learning algorithms |
| imbalanced-learn | 0.12.3 | SMOTE resampling |
| praat-parselmouth | ^0.4.3 | Acoustic feature extraction via Praat |
| librosa | ^0.10.0 | Audio analysis and MFCCs |
| xgboost | 2.0.3 | XGBoost classifier |
| lightgbm | 4.5.0 | LightGBM classifier |
| pyswarm | ^0.6 | Particle Swarm Optimization |
| pydantic | ^2.0.0 | Configuration validation |
# Clone the repository
git clone https://github.com/cchinchilla-dev/vocal-biomarkers-ml.git
cd vocal-biomarkers-ml
# Install Poetry if needed
curl -sSL https://install.python-poetry.org | python3 -
# Install dependencies
poetry install
# Activate virtual environment
poetry shellgit clone https://github.com/cchinchilla-dev/vocal-biomarkers-ml.git
cd vocal-biomarkers-ml
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -e .All parameters are configured via config/config.yaml:
# Key parameters (as used in the paper)
data_processing:
cleaning:
variance_threshold: 0.1 # Remove features with variance < 0.1
correlation_threshold: 0.9 # Remove features with correlation > 0.9
feature_selection:
rfe:
stability_threshold: 0.4 # Minimum stability index
n_bootstrap_iterations: 20 # Bootstrap iterations for RFE
statistics:
alpha: 0.05 # Significance level
correction_method: "bonferroni"
models:
test_size: 0.2 # 20% test split
cross_validation:
n_folds: 5 # 5-fold CV
stratified: true
hyperparameter_search:
method: "pso" # Particle Swarm Optimization
pso:
swarm_size: 10
max_iterations: 30
evaluation:
bootstrap:
n_iterations: 100 # Bootstrap iterations for metrics
confidence_level: 0.95# Run complete pipeline
voice-analysis
# Custom configuration
voice-analysis --config path/to/config.yaml
# Run specific stage
voice-analysis --stage extract # Feature extraction only
voice-analysis --stage features # Feature selection only
voice-analysis --stage train # Model training only
# Validate configuration
voice-analysis --dry-run
# Verbose output
voice-analysis --verbosefrom voice_analysis.config.settings import get_settings
from voice_analysis.pipeline import VoiceAnalysisPipeline
# Load configuration
settings = get_settings("config/config.yaml")
# Run pipeline
pipeline = VoiceAnalysisPipeline(settings)
results = pipeline.run()
# Access results
print(f"Selected features: {results['n_selected_features']}")
print(f"Best model: {results['metrics'].loc[results['metrics']['Accuracy'].idxmax(), 'Model']}")- Load audio recordings (Spanish vowel /a/, 4 seconds, 44.1 kHz)
- Load biomechanical markers from Voice Clinical Systems CSV exports
- Merge acoustic and biomechanical features
- Remove features with variance < 0.1
- Remove features with correlation > 0.9
- Statistical tests (Mann-Whitney U + Bonferroni)
- RFE with bootstrap stability analysis
- Mutual Information scoring
- LASSO regularization
- Correlation-based Feature Selection
- Ensemble combination
- Balance classes using SMOTE (k_neighbors=7)
- Validate distribution preservation via PCA and Kolmogorov-Smirnov tests
- Train all classifiers with PSO-optimized hyperparameters
- 5-fold stratified cross-validation
- Compute metrics with bootstrap confidence intervals
- Generate ROC curves
- Perform Leave-One-Out analysis
- Calculate per-patient misclassification probabilities
| Directory | File | Description |
|---|---|---|
datasets/ |
dataset.csv |
Raw combined dataset |
datasets/ |
dataset_clean.csv |
After cleaning |
datasets/ |
dataset_smote.csv |
After SMOTE |
features/ |
bonferroni.csv |
Statistical test results |
features/ |
rfe.csv |
RFE rankings and stability indices |
features/ |
features_to_use.csv |
Final selected features |
metrics/ |
classifier_metrics.csv |
All model metrics |
metrics/ |
prob_misclassification.csv |
Misclassification probabilities |
visualizations/ |
pca_*.svg |
PCA plots |
visualizations/ |
roc_*.svg |
ROC curves |
- 74 adult participants total
- 15 COVID-19 positive (7 women, 8 men, ages 23-55)
- 59 Control group (44 women, 15 men, ages 18-58)
- Voice samples collected June-September 2020 (pre-vaccination)
- Recording: Spanish vowel /a/ for 4 seconds using App Online Lab
| Category | Features | Functionals |
|---|---|---|
| Phonation | F0, F1, F2 | Mean, Std, Skew, Kurtosis, Δ1, Δ2 |
| Perturbation | Jitter (5 types), Shimmer (6 types) | Direct measures |
| Noise | HNR | Direct measure |
| Spectral | MFCC[1-13] | Mean, Std, Skew, Kurtosis, Δ1, Δ2 |
See Table 1 in the paper for complete descriptions. Categories include:
- Fundamental Frequency (Pr01)
- Harmony in vocal fold movement (Pr02)
- Phase characteristics (Pr03-Pr07)
- Muscle tension (Pr08-Pr09)
- Sufficiency of seal (Pr10-Pr12)
- Tension and instability (Pr13-Pr14)
- Edge spacing (Pr15-Pr16)
- Mucosal wave correlates (Pr17-Pr20)
- Mass correlates (Pr21-Pr22)
This repository contains the complete codebase developed throughout the research process, including both the code used to produce the results reported in the published article and additional experimental scripts and analyses that were not included in the final manuscript or its supplementary materials.
The additional code reflects intermediate iterations, alternative methodological approaches, and exploratory analyses conducted during model development. Consequently, some scripts may be experimental in nature and may include incomplete configurations, provisional implementations, or minor inconsistencies that were not part of the final validated pipeline.
Minor discrepancies between the published article and this codebase may exist as a result of the natural evolution of the code during the research process, including experimentation, refactoring, and optimization. These discrepancies do not affect the core methodology or the main conclusions of the study. The published article should be considered the authoritative reference for the final experimental design and results.
Due to data privacy and ethical constraints, the dataset used in this research cannot be made publicly available. Access to the data may be granted by the authors upon reasonable request, subject to applicable regulations and approvals.
To support methodological transparency and reproducibility, this repository provides all preprocessing, feature selection, and modeling code required to replicate the experimental pipeline. Full reproduction of the reported results requires access to the original dataset.
To match paper results exactly:
- Set
reproducibility.n_executions: 100in config.yaml - Set
evaluation.bootstrap.n_iterations: 100 - Use the same random seeds:
[42, 123, 1234, 12345, 0, 1]
If you use this code in your research, please cite:
@ARTICLE{10737357,
author={Rivero, Alfonso José López and Corbacho, Carlos Chinchilla and
Arias, Tatiana Romero and Martín-Merino, Manuel and Vaz, Paulo},
journal={IEEE Access},
title={Application of Machine Learning Techniques for the Characterization
and Early Diagnosis of Respiratory Diseases Such as COVID-19},
year={2024},
volume={12},
pages={160516-160528},
doi={10.1109/ACCESS.2024.3487773}
}This project is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0).
- Carlos Chinchilla Corbacho
Universidad Pontificia de Salamanca
📧 cchinchilla@usal.es
🔗 GitHub
Primary developer and maintainer of the repository.
- Alfonso José López Rivero - Principal Investigator - Universidad Pontificia de Salamanca
- Tatiana Romero Arias - Clinical Research & Data Collection - Universidad Europea de Canarias
- Manuel Martín-Merino - Machine Learning Advisor - Universidad Pontificia de Salamanca
- Paulo Vaz - Research Collaboration - Polytechnic Institute of Viseu
For questions, issues, or contributions, please open an issue on GitHub or contact the authors directly.