Implementation of deep learning models for identifying potential signs of depression from speech, comparing effectiveness across different models and different types of audio features.
This project was developed for the Intelligent Signal Analysis exam of Prof. Sabato Marco Siniscalchi, during the 2024/2025 academic year, as part of the Computer Engineering (LM-32, 2035) degree course at the Università degli Studi di Palermo.
Andrea Spinelli - Antonio Spedito - Davide Bonura
- Languages: Python
- Other: Git
To run this project, you will need Python and the required libraries. The setup involves cloning the repository, installing the dependencies, and running the provided scripts.
- Python 3.8 or higher.
- Required libraries: listed in the `requirements.txt` file.
- Git (to clone the repository).
Important: All commands must be executed from the project's root directory after installing the dependencies.
- **Clone the Repository**: open your terminal or command prompt and clone the repository to your local machine.

  `git clone https://github.com/A-rgonaut/LM-32_Speech_Analysis_for_Depression_Assessment.git`
- **Install the Libraries**: install the required dependencies.

  `pip install -r requirements.txt`
- **Experiment Configuration**: before running any script, edit the `config.yaml` file. The `common.active_model` key determines which model will be used (`'svm'`, `'cnn'`, or `'ssl'`). All other model-specific parameters are defined in their respective sections.
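  For orientation, here is a hedged sketch of how `config.yaml` might be organized; only `common.active_model` and the three section names are taken from this README, the rest is illustrative:

  ```yaml
  # Illustrative sketch; the real config.yaml defines many more parameters.
  common:
    active_model: 'cnn'   # one of 'svm', 'cnn', 'ssl'
  svm:
    # SVM-specific parameters go here
  cnn:
    # CNN-specific parameters go here
  ssl:
    # SSL-specific parameters go here
  ```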
- **Data Preprocessing (run once)**: this script processes the raw datasets and generates the unified E1-DAIC-WOZ dataset required for training.

  `python -m scripts.preprocessor`
- **Feature Extraction (Optional)**: this step is required only for the `ssl` model, and only if `use_preextracted_features` is set to `true` in `config.yaml`.

  `python -m scripts.feature_extraction`
- **Model Training**: the training script is unified. It reads `config.yaml`, identifies the `active_model`, and launches the corresponding training procedure.

  `python -m scripts.train`
- **Model Testing**: like training, the testing script is unified and relies on the configuration in `config.yaml`. It loads the K saved models and runs an evaluation on the test set, reporting the mean and standard deviation of each metric.

  `python -m scripts.test`
When `hyperparameter_search_mode` is set to `true` for `cnn` or `ssl`, the script performs a full grid search:

- For each combination of hyperparameters, a K-Fold Cross-Validation is performed.
- The models from each fold are saved temporarily.
- If the average F1-score of the current run exceeds the best score recorded so far, the temporary models are made permanent and the models from the previous best run are deleted.
- At the end of the search, the `saved_models/` directory will contain only the K models from the winning hyperparameter combination.
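In code, the keep-best logic amounts to something like the following hedged sketch (the `run_fold` helper and the directory layout are hypothetical, not the project's actual code):

```python
import itertools
import shutil
from pathlib import Path

# Hypothetical sketch of the keep-best grid search described above; run_fold and
# the directory layout are illustrative, not the project's actual code.
def grid_search(param_grid: dict, k_folds: int, run_fold) -> float:
    best_f1 = float("-inf")
    best_dir, tmp_dir = Path("saved_models/best"), Path("saved_models/tmp")
    for values in itertools.product(*param_grid.values()):
        params = dict(zip(param_grid.keys(), values))
        # K-Fold CV: run_fold trains one model per fold, saving its checkpoint to tmp_dir
        fold_f1s = [run_fold(params, fold, tmp_dir) for fold in range(k_folds)]
        mean_f1 = sum(fold_f1s) / k_folds
        if mean_f1 > best_f1:                             # new best combination
            best_f1 = mean_f1
            shutil.rmtree(best_dir, ignore_errors=True)   # delete previous best models
            shutil.move(str(tmp_dir), str(best_dir))      # make temporary models permanent
        else:
            shutil.rmtree(tmp_dir, ignore_errors=True)    # discard this run's models
    return best_f1
```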
The `ssl` model follows a structured, three-phase workflow: first find the best downstream architecture, and then systematically compare different pre-trained models and their layers.
**Phase 1: Feature Extraction**
- Goal: Pre-extract and save feature representations from all SSL models you wish to analyze.
- Action:
  - In `config.yaml`, populate the `ssl.ssl_model_names` list with all desired models (e.g., `'facebook/wav2vec2-base-960h'`, `'microsoft/wavlm-base'`); see the excerpt after this list.
  - Run the extraction script: `python -m scripts.feature_extraction`
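A hedged `config.yaml` excerpt for Phase 1 (the key name comes from this README; the exact nesting is an assumption):

```yaml
# Hypothetical excerpt; the real file structure may differ.
ssl:
  ssl_model_names:
    - 'facebook/wav2vec2-base-960h'
    - 'microsoft/wavlm-base'
```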
**Phase 2: Finding the Optimal Downstream Architecture**
- Goal: Find the best sequential model (e.g., Transformer, BiLSTM) and its hyperparameters for processing the SSL features. This is done once using a strong reference model.
- Action:
  - In `config.yaml`, set `active_model: 'ssl'`, `hyperparameter_search_mode: true`, and `run_layer_sweep: false`; see the excerpt after this list.
  - Specify your single reference model using the singular key `ssl_model_name: 'facebook/wav2vec2-base-960h'`. The `ssl_model_names` list is ignored in this mode.
  - Run the training script to start the grid search: `python -m scripts.train`
- Outcome: this process will save the best-performing architecture's parameters to `saved_models/ssl/best_params_ssl.json`.
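A hedged excerpt of the Phase 2 settings (key names come from this README; the nesting is an assumption):

```yaml
# Hypothetical excerpt; the real file structure may differ.
common:
  active_model: 'ssl'
ssl:
  hyperparameter_search_mode: true
  run_layer_sweep: false
  ssl_model_name: 'facebook/wav2vec2-base-960h'   # single reference model
```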
**Phase 3: Automated Sweep for Model and Layer Comparison**
- Goal: Fairly compare the performance of each layer from each SSL model, using the optimal architecture found in Phase 2.
- Action:
  - In `config.yaml`, set `hyperparameter_search_mode: false` and `run_layer_sweep: true`; see the excerpt after this list.
  - Ensure `ssl_model_names` contains the list of models for which you extracted features in Phase 1.
  - Populate `layers_to_use` with the layer indices you want to test (e.g., `[0, 1, ..., 12]`).
  - Run the training script: `python -m scripts.train`
  - Run the testing script: after training is complete, evaluate all models on the held-out dev and test sets with `python -m scripts.test`
  - Run the results plotting script to visualize performance across the different models and layers: `python -m scripts.plot_results`
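A hedged excerpt of the Phase 3 settings (key names come from this README; the nesting and the concrete layer indices are assumptions):

```yaml
# Hypothetical excerpt; the real file structure may differ.
ssl:
  hyperparameter_search_mode: false
  run_layer_sweep: true
  ssl_model_names:
    - 'facebook/wav2vec2-base-960h'
    - 'microsoft/wavlm-base'
  layers_to_use: [0, 1, 2, 12]   # illustrative layer indices
```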
The code is organized into a modular structure to ensure clarity, maintainability, and reusability.
- `datasets/`: should contain the raw DAIC-WOZ and E-DAIC-WOZ datasets. The preprocessed E1-DAIC-WOZ dataset will also be generated here.
- `features/`: directory where features extracted by SSL models (e.g., Wav2Vec2) are saved, if used.
- `scripts/`: contains executable scripts for preprocessing, feature extraction, training, and testing the models.
- `src/`: the core of the project, organized as a Python package.
  - `preprocessor.py`: handles the initial processing of audio and transcripts to create the unified E1-DAIC-WOZ dataset.
  - `cnn_module/`: contains modules related to the 1D-CNN model (data loader, architecture).
  - `svm_module/`: contains modules related to the SVM models (data loader, model logic).
  - `ssl_module/`: contains modules for the SSL-based model (chunk-based data loader, sequential architecture).
  - `trainer.py`: unified `Trainer` class for all PyTorch models.
  - `evaluator.py`: unified `Evaluator` class for all models.
  - `utils.py`: shared utility functions (e.g., metrics calculation, seeding).
- `saved_models/`: directory where trained models (`.pth`, `.pkl`) will be saved.
- `results/`: directory where evaluation results (`.csv`) will be saved.
- `config.yaml`: central configuration file to manage all experiment parameters.
This project presents a systematic and comparative study of three distinct machine learning paradigms for the automatic detection of depression from speech, using the DAIC-WOZ dataset. The complexity of the models progressively increases, providing a clear view of the evolution from classical methods to the current state-of-the-art.
- **Classical Baseline: Support Vector Machine (SVM)**
  - This approach establishes a robust baseline using a traditional two-stage pipeline: hand-crafted feature extraction followed by classification.
  - It relies on hand-crafted acoustic features capturing articulation, phonation (e.g., jitter, shimmer), and prosody (e.g., pitch, energy).
  - Its purpose is to represent the classical methods that have long dominated the field and to provide a solid, interpretable benchmark for comparison.
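  As a hedged illustration (not the project's actual pipeline), features of this kind can be extracted with `librosa` and classified with scikit-learn; jitter and shimmer would typically come from a tool such as Praat/parselmouth and are omitted here:

  ```python
  import numpy as np
  import librosa
  from sklearn.pipeline import make_pipeline
  from sklearn.preprocessing import StandardScaler
  from sklearn.svm import SVC

  # Illustrative sketch only, not the project's actual feature set or pipeline.
  def prosodic_features(path: str) -> np.ndarray:
      y, sr = librosa.load(path, sr=16000)           # mono audio at 16 kHz
      f0 = librosa.yin(y, fmin=65, fmax=400, sr=sr)  # frame-level pitch contour
      rms = librosa.feature.rms(y=y)[0]              # frame-level energy
      # Summarize frame-level contours into a fixed-size utterance-level vector
      return np.array([f0.mean(), f0.std(), rms.mean(), rms.std()])

  # X: one feature vector per recording, y: binary depression labels
  # X = np.stack([prosodic_features(p) for p in audio_paths])
  clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", class_weight="balanced"))
  # clf.fit(X_train, y_train); clf.predict(X_test)
  ```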
- **Supervised Deep Learning: 1-D Convolutional Neural Network (CNN)**
  - This model serves as an "evolutionary bridge" by moving to an end-to-end approach.
  - It operates directly on segmented raw audio waveforms, learning relevant features automatically instead of relying on manual engineering.
  - The architecture is designed to capture hierarchical patterns in speech, from local phonetic textures to longer-term prosodic features. Its goal is to outperform the SVM by learning richer, more discriminative representations from the data itself.
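  For intuition, a minimal PyTorch sketch of a 1-D CNN over raw waveform segments (illustrative only; the project's actual architecture lives in `src/cnn_module/`):

  ```python
  import torch
  import torch.nn as nn

  class RawAudioCNN(nn.Module):
      def __init__(self, num_classes: int = 2):
          super().__init__()
          self.features = nn.Sequential(
              # Wide first kernel learns filterbank-like local phonetic patterns
              nn.Conv1d(1, 16, kernel_size=80, stride=4), nn.ReLU(), nn.MaxPool1d(4),
              nn.Conv1d(16, 32, kernel_size=3), nn.ReLU(), nn.MaxPool1d(4),
              nn.Conv1d(32, 64, kernel_size=3), nn.ReLU(),
              nn.AdaptiveAvgPool1d(1),   # collapse the time axis to a fixed-size vector
          )
          self.classifier = nn.Linear(64, num_classes)

      def forward(self, x: torch.Tensor) -> torch.Tensor:
          # x: (batch, 1, samples), e.g., a few seconds of 16 kHz audio
          return self.classifier(self.features(x).squeeze(-1))

  model = RawAudioCNN()
  logits = model(torch.randn(8, 1, 4 * 16000))   # 8 four-second segments
  ```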
- **State-of-the-Art Hierarchical Model with Self-Supervised Learning (SSL)**
  - This is the project's most innovative contribution, designed to address the critical challenge of limited labeled data in clinical settings.
  - Feature Extractor: it leverages a large, pre-trained foundation model (such as `wav2vec 2.0` or `HuBERT`) to extract rich, high-level representations from audio. These models are pre-trained on vast amounts of unlabeled speech, allowing them to capture the fundamental structure of human speech.
  - Sequential Modeling: the features from the foundation model are fed into a sequential module (a Transformer Encoder or BiLSTM) to model long-term temporal dependencies across an entire interview.
  - Hierarchical Pooling: the model uses a Gated Attention Pooling mechanism to aggregate information over time, allowing it to focus on the speech segments most relevant to a final diagnosis.
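  For intuition, a minimal PyTorch sketch of a gated attention pooling layer of this kind (illustrative; the project's actual module lives in `src/ssl_module/`):

  ```python
  import torch
  import torch.nn as nn

  class GatedAttentionPooling(nn.Module):
      def __init__(self, dim: int, hidden: int = 128):
          super().__init__()
          self.V = nn.Linear(dim, hidden)   # content branch
          self.U = nn.Linear(dim, hidden)   # gate branch
          self.w = nn.Linear(hidden, 1)     # scalar score per time step

      def forward(self, h: torch.Tensor) -> torch.Tensor:
          # h: (batch, time, dim) sequence of frame/chunk embeddings
          scores = self.w(torch.tanh(self.V(h)) * torch.sigmoid(self.U(h)))  # (B, T, 1)
          alpha = torch.softmax(scores, dim=1)   # attention weights over time
          return (alpha * h).sum(dim=1)          # (B, dim) pooled embedding

  pool = GatedAttentionPooling(dim=768)          # e.g., wav2vec2-base feature size
  pooled = pool(torch.randn(2, 300, 768))        # 2 interviews, 300 chunks each
  ```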