Final Project: Multimodal Machine Learning (Pembelajaran Mesin Multimodal, IF25-40304)
Group 09 (Kelompok 09) - Institut Teknologi Sumatera
Odd Semester 2024/2025
| Name | NIM (Student ID) |
|---|---|
| Lois Novel E. Gurning | 122140098 |
| Sakti Mujahid Imani | 122140123 |
| Apridian Saputra | 122140143 |
| Joshia Fernandes Sectio Purba | 122140170 |
| Sikah Nubuahtul Ilmi | 122140208 |
A multimodal Music Emotion Recognition (MER) system that classifies songs into the 5 MIREX emotion clusters using three modalities:
- 🎵 Audio - audio signal analysis
- 📝 Lyrics - lyric text analysis
- 🎹 MIDI - symbolic music data analysis
Late Fusion - combines the output probabilities of each modality into a more accurate final prediction.
Baseline
- Audio: CRNN (Convolutional Recurrent Neural Network)
- Lyrics: BERT base (bert-base-uncased)
- MIDI: BiGRU + Attention
Improved
- Audio: PANN (Pre-trained Audio Neural Network - Cnn14)
- Lyrics: DeBERTa-v3-base (Enhanced attention mechanism)
- MIDI: BiGRU + SVM (robust on small datasets)
💡 Note: This project applies iterative improvement from the baselines to more advanced models, with complete documentation showing the progression and its justification.
Kyuubi-MML/
│
├── README.md                          # Main documentation (this file)
│
├── data/                              # Dataset & metadata
│   ├── master_tracks.csv              # Metadata for 903 songs
│   └── split_global.csv               # Train/val/test split
│
├── notebooks/                         # Jupyter notebooks
│   ├── 01_EDA/                        # Exploratory Data Analysis
│   │   ├── 01_EDA_Multimodal.ipynb
│   │   └── 02_Data_Splitting.ipynb
│   ├── 02_Preprocessing/              # Preprocessing pipelines
│   │   ├── 01_Audio_Preprocessing.ipynb
│   │   └── 02_Lyrics_Preprocessing.ipynb
│   ├── 03_Baseline/                   # Baseline models
│   │   ├── 01_Audio_CRNN.ipynb
│   │   ├── 02_Lyrics_BERT.ipynb
│   │   └── 03_MIDI_BiGRU_Attention.ipynb
│   ├── 04_Improved/                   # Improved models
│   │   ├── 01_Audio_PANN.ipynb
│   │   ├── 02_Lyrics_DeBERTa.ipynb
│   │   ├── 03_MIDI_BiGRU_SVM.ipynb
│   │   └── 03_MIDI_Complete_Pipeline.ipynb
│   └── 05_Fusion/                     # Multimodal fusion
│       ├── fusion.py
│       ├── smart_fusion.py
│       └── fusion_evaluation_finale.py
│
├── results/                           # Experiment results
│   ├── baseline/                      # Baseline results
│   ├── improved/                      # Improved results
│   └── fusion/                        # Fusion results
│
├── models/                            # Saved model checkpoints
├── reports/                           # Milestone reports
├── figures/                           # Visualizations & plots
├── docs/                              # Additional documentation
└── miditrainsvm/                      # MIDI training artifacts
- Framework: MIREX (Music Information Retrieval Evaluation eXchange)
- Reference: Panda et al. (2013)
| Modality | Number of Samples | Coverage |
|---|---|---|
| Audio | 903 | 100% |
| Lyrics | 764 | ~85% |
| MIDI | 193 | ~21% |
- Cluster 1: Passionate / Rousing / Confident / Boisterous / Rowdy
- Cluster 2: Cheerful / Fun / Sweet / Amiable
- Cluster 3: Poignant / Wistful / Brooding
- Cluster 4: Humorous / Quirky / Witty
- Cluster 5: Aggressive / Tense / Intense
- Train: ~70-80%
- Validation: ~10-15%
- Test: ~10-15%
- Strategy: Stratified split to preserve the class balance in every subset
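A stratified split of this kind can be sketched with the standard library alone. The 80/10/10 ratios, the seed, and the helper name `stratified_split` are illustrative assumptions, not the settings that produced `split_global.csv`.

```python
import random
from collections import defaultdict

def stratified_split(labels, ratios=(0.8, 0.1, 0.1), seed=42):
    """Return {"train": [...], "val": [...], "test": [...]} of sample indices.

    Each class is shuffled and cut separately, so every split keeps roughly
    the same class proportions as the full dataset.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    splits = {"train": [], "val": [], "test": []}
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = round(len(indices) * ratios[0])
        n_val = round(len(indices) * ratios[1])
        splits["train"] += indices[:n_train]
        splits["val"] += indices[n_train:n_train + n_val]
        splits["test"] += indices[n_train + n_val:]  # remainder goes to test
    return splits

# Toy labels: 5 clusters with 20 songs each (hypothetical data).
toy_labels = [cluster for cluster in range(1, 6) for _ in range(20)]
splits = stratified_split(toy_labels)
print({name: len(idx) for name, idx in splits.items()})
# {'train': 80, 'val': 10, 'test': 10}
```

Because the cut is made per class, each of the five toy clusters contributes 16 training, 2 validation, and 2 test songs.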
Intra-Modal Analysis - per modality
- Audio: Mel-spectrogram patterns, duration distribution
- Lyrics: Word frequency, text length, common words per cluster
- MIDI: Pitch/velocity distribution, duration patterns
Inter-Modal Analysis - across modalities
- Correlation analysis
- Audio-Lyrics-MIDI alignment
- Modality availability matrix
Target Analysis - against the labels
- Class imbalance detection
- Feature importance per cluster
t-SNE Visualization
- Feature embeddings visualization
- Cluster quality assessment
Audio:
- Sample rate: 32,000 Hz (PANN) / 22,050 Hz (CRNN)
- Duration: uniform 10-second clips
- Feature: Log-Mel Spectrogram (128 mel bands)
- Augmentation: Multi-crop strategy (start, middle, end)
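The start/middle/end multi-crop strategy comes down to choosing three offsets into the waveform. The sketch below only computes those offsets. The function name and the fallback for clips shorter than one crop are assumptions, since the notebooks' exact behaviour is not described here.

```python
def multi_crop_starts(n_samples, sample_rate=32000, crop_seconds=10):
    """Return start offsets (in samples) for the start, middle, and end crops."""
    crop_len = sample_rate * crop_seconds
    if n_samples <= crop_len:
        # Clip is no longer than one crop: use a single crop (to be zero-padded).
        return [0]
    last_start = n_samples - crop_len
    return [0, last_start // 2, last_start]

# A hypothetical 30-second track at the PANN sample rate.
starts = multi_crop_starts(30 * 32000)
print([s / 32000 for s in starts])  # [0.0, 10.0, 20.0]
```

One common way to use the crops at inference time is to average the class probabilities predicted for each crop. Whether this project averages or votes is not stated.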
Lyrics:
- Tokenization: BERT/DeBERTa tokenizer
- Max length: 256 tokens
- Padding & truncation
- Lowercase normalization
MIDI:
- Event extraction (pitch, velocity, duration)
- Embedding layer
- Sequence padding
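Sequence padding for the MIDI event streams can be illustrated with a small helper. This is a minimal sketch that assumes events have already been mapped to integer IDs and that ID 0 is reserved for padding. The helper name and the returned mask are illustrative, not taken from the project code.

```python
def pad_sequences(sequences, max_len, pad_value=0):
    """Truncate or right-pad each event-ID sequence to exactly max_len items."""
    padded = []
    mask = []
    for seq in sequences:
        kept = list(seq[:max_len])
        n_pad = max_len - len(kept)
        padded.append(kept + [pad_value] * n_pad)
        # 1 marks a real event, 0 marks padding (useful for attention masking).
        mask.append([1] * len(kept) + [0] * n_pad)
    return padded, mask

# Hypothetical event IDs for three short sequences.
batch, mask = pad_sequences([[60, 62, 64], [67], [60, 61, 62, 63, 64, 65]], max_len=4)
print(batch)  # [[60, 62, 64, 0], [67, 0, 0, 0], [60, 61, 62, 63]]
print(mask)   # [[1, 1, 1, 0], [1, 0, 0, 0], [1, 1, 1, 1]]
```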
Audio: CRNN
Input (Mel-Spec) → CNN layers → RNN layers → Dense → Softmax (5 classes)
- Performance: ~43% accuracy, ~0.38 macro F1
- Issue: underfitting; a pre-trained model is needed
Lyrics: BERT Base
Input (tokens) → BERT encoder → Pooler → Classifier → Softmax (5 classes)
- Performance: ~42% accuracy, ~0.40 macro F1
- Issue: Semantic similarity causing confusion
MIDI: BiGRU + Attention
Input (events) → Embedding → BiGRU → Attention → Dense → Softmax (5 classes)
- Performance: ~25% accuracy, ~0.20 macro F1
- Issue: the dataset is too small, which leads to overfitting
Audio: PANN (Cnn14)
Input → Pre-trained Cnn14 → Feature extractor → Fine-tuned classifier → Softmax
- Pre-trained on AudioSet
- Multi-crop inference strategy
- Expected: Better audio representation
Lyrics: DeBERTa-v3-base
Input → DeBERTa encoder (disentangled attention) → Pooler → Classifier → Softmax
- Enhanced mask decoder
- Layer freezing strategy (freeze encoder layers 0-7, fine-tune the upper layers 8-11)
- Expected: Better semantic understanding
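The layer freezing strategy listed above can be expressed as a rule on parameter names. This is a minimal sketch, assuming names follow the `encoder.layer.<i>.` pattern seen in Hugging Face checkpoints (worth confirming with `model.named_parameters()`). The helper `should_freeze` and the choice to also freeze the input embeddings are assumptions, not the project's documented settings.

```python
import re

LAYER_PATTERN = re.compile(r"encoder\.layer\.(\d+)\.")

def should_freeze(param_name, n_frozen_layers=8, freeze_embeddings=True):
    """Decide from a parameter name whether it stays frozen during fine-tuning."""
    match = LAYER_PATTERN.search(param_name)
    if match:
        # Encoder layers 0 .. n_frozen_layers-1 are frozen, the rest train.
        return int(match.group(1)) < n_frozen_layers
    if ".embeddings." in param_name:
        return freeze_embeddings  # assumption: input embeddings are frozen too
    # Everything else (pooler, classifier head, shared encoder parts) trains.
    return False

# Example names written in the assumed Hugging Face style.
names = [
    "deberta.embeddings.word_embeddings.weight",
    "deberta.encoder.layer.0.attention.self.query_proj.weight",
    "deberta.encoder.layer.7.output.dense.weight",
    "deberta.encoder.layer.8.attention.self.query_proj.weight",
    "classifier.weight",
]
for name in names:
    print(name, "frozen" if should_freeze(name) else "trainable")
```

With a loaded PyTorch model, the rule would typically be applied as `for name, p in model.named_parameters(): p.requires_grad = not should_freeze(name)`.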
MIDI: BiGRU + SVM
Input → BiGRU (frozen) → Feature extraction → SVM classifier (RBF kernel) → Softmax
- BiGRU as feature extractor
- SVM with balanced class weights
- Expected: robust on a small dataset, avoids overfitting
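For the balanced class weights, scikit-learn documents `class_weight="balanced"` as `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that formula in plain Python to show how rare clusters get larger weights. Whether the project used exactly this option is an assumption, and the label counts are made up.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in sorted(counts.items())}

# Hypothetical imbalanced label list (not the real MIDI label distribution).
toy_labels = [1] * 10 + [2] * 30 + [3] * 60
weights = balanced_class_weights(toy_labels)
print(weights)
```

The rarest class (10 samples) gets the largest weight, so errors on it cost more in the SVM objective.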
Simple Average Fusion
P_final = (P_audio + P_lyrics + P_midi) / 3
F1-Weighted Fusion
w_i = F1_i / (F1_audio + F1_lyrics + F1_midi)
P_final = w_audio * P_audio + w_lyrics * P_lyrics + w_midi * P_midi
Smart Fusion (Missing Modality Handling)
- Adaptive per-sample fusion
- Supports partial modality combinations
- Coverage: all 903 samples (every track has audio)
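The fusion rules above can be combined into one per-sample function. This is a sketch only: it assumes that missing modalities are handled by renormalizing the weights over whichever modalities are present, which this README does not confirm for `smart_fusion.py`. The function signature is hypothetical, and the F1 values in the example are the approximate baseline macro F1 scores reported in this document.

```python
def smart_fusion(probs, f1_scores=None):
    """Fuse per-modality class probabilities for one sample.

    probs maps modality name -> list of class probabilities, or None when the
    modality is missing. Weights are renormalized over available modalities.
    """
    available = {m: p for m, p in probs.items() if p is not None}
    if not available:
        raise ValueError("at least one modality is required")
    if f1_scores is None:
        raw = {m: 1.0 for m in available}           # simple average
    else:
        raw = {m: f1_scores[m] for m in available}  # F1-weighted
    total = sum(raw.values())
    n_classes = len(next(iter(available.values())))
    fused = [0.0] * n_classes
    for m, p in available.items():
        w = raw[m] / total
        for k in range(n_classes):
            fused[k] += w * p[k]
    return fused

f1 = {"audio": 0.38, "lyrics": 0.40, "midi": 0.20}
sample = {
    "audio": [0.10, 0.20, 0.50, 0.10, 0.10],
    "lyrics": [0.10, 0.10, 0.60, 0.10, 0.10],
    "midi": None,  # this track has no MIDI file
}
fused = smart_fusion(sample, f1)
predicted_cluster = fused.index(max(fused)) + 1
print(predicted_cluster)  # 3
```

Since the weights always sum to 1 over the available modalities, the fused vector remains a valid probability distribution even when only audio is present.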
| Model | Modality | Accuracy | Macro F1 | Notes |
|---|---|---|---|---|
| CRNN | Audio | ~43% | ~0.38 | Underfitting |
| BERT | Lyrics | ~42% | ~0.40 | Semantic confusion |
| BiGRU+Attn | MIDI | ~25% | ~0.20 | Small dataset |
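Macro F1 sits below accuracy in every row of the table, which is typical when classes are imbalanced: macro F1 averages the per-class F1 scores without weighting by class size. The sketch below computes both metrics on a made-up prediction list. The labels are illustrative and are not project outputs.

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy example: the majority class 1 absorbs some minority-class samples.
y_true = [1, 1, 1, 1, 2, 2, 3, 3]
y_pred = [1, 1, 1, 1, 1, 2, 1, 3]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(accuracy, 3), round(macro_f1(y_true, y_pred, classes=[1, 2, 3]), 3))
# 0.75 0.711
```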
| Model | Modality | Improvement | Expected Gain |
|---|---|---|---|
| PANN | Audio | Pre-trained | Better representation |
| DeBERTa | Lyrics | Enhanced attn | Better semantics |
| BiGRU+SVM | MIDI | SVM classifier | Avoid overfitting |
Ablation Study (on intersection samples)
| Combination | Strategy | N Samples | Performance |
|---|---|---|---|
| Audio only | - | 903 | Baseline unimodal |
| Lyrics only | - | 764 | Baseline unimodal |
| MIDI only | - | 193 | Baseline unimodal |
| Audio + Lyrics | Simple avg | 764 | Multimodal boost |
| Audio + MIDI | Simple avg | 193 | Multimodal boost |
| Lyrics + MIDI | Simple avg | 193 | Multimodal boost |
| All (Full) | Smart fusion | 903 | Best coverage |
Key Findings:
- ✅ Multimodal fusion > best unimodal
- ✅ Smart fusion gives the widest coverage
- ✅ F1-weighted fusion performs better than simple averaging
- ⚠️ The MIDI contribution is limited because the MIDI dataset is small
- Proposal document (5-7 pages)
- Presentation slides (10-15 minutes)
- Background, problem statement, and objectives
- Dataset description and planned methods
- Deliverable:
reports/Proposal.pdf
- Intra-modal analysis (Audio, Lyrics, MIDI)
- Inter-modal and target analysis
- t-SNE visualization
- Identification of data issues
- Deliverables:
notebooks/01_EDA/01_EDA_Multimodal.ipynb
reports/EDA Multimodal Kelompok 09.pdf
- Baseline models (CRNN, BERT, BiGRU+Attention)
- Experiment setup & hyperparameters
- Baseline results & learning curves
- Error analysis
- Optimization plan
- Deliverables:
notebooks/03_Baseline/*.ipynb
reports/Preliminary Experiment Kelompok 09.pdf
- Improved models (PANN, DeBERTa, BiGRU+SVM)
- Multimodal fusion experiments
- Evaluation & comparison
- Deliverables:
reports/Final Project.pdf
- MIREX Dataset
  - Panda et al. (2013) - Multi-modal Music Emotion Recognition
Deep Learning Frameworks:
- PyTorch 2.0+
- Transformers (Hugging Face)
- torchaudio
Audio Processing:
- librosa
- pretty_midi
- PANNs (audioset_tagging_cnn)
Machine Learning:
- scikit-learn (SVM, metrics)
- numpy, pandas
Visualization:
- matplotlib, seaborn
- t-SNE
Motivation for the Improvements:
- Audio (CRNN → PANN)
  - The CRNN underfits because there is too little training data
  - PANN is pre-trained on AudioSet (2M+ audio clips)
  - Transfer learning provides better feature extraction
- Lyrics (BERT → DeBERTa)
  - BERT struggles with semantically similar lyrics
  - DeBERTa has a disentangled attention mechanism
  - It is better at contextual understanding
- MIDI (BiGRU+Attn → BiGRU+SVM)
  - The MIDI dataset is very small (193 samples)
  - A neural network classifier tends to overfit
  - An SVM is more robust on small data
  - The BiGRU is kept as a feature extractor
Baseline Results:
- results/baseline/audio_prob.csv - CRNN probabilities
- results/baseline/lyric_prob.csv - BERT probabilities
- results/baseline/midi_prob.csv - BiGRU+Attention probabilities
Improved Results:
- results/improved/audio_prob_for_fusion.csv - PANN probabilities
- results/improved/lyrics_prob_for_fusion2.csv - DeBERTa probabilities
- results/improved/midi_prob_for_fusion.csv - BiGRU+SVM probabilities
Division of roles in the project:
- Joshia: EDA Audio, CRNN baseline, PANN improvement
- Apridian: EDA Lyrics, BERT baseline, DeBERTa improvement
- Sikah: EDA MIDI, BiGRU baseline, BiGRU+SVM improvement
- Lois: Fusion strategy, evaluation, comparison
- Sakti: Documentation, visualization, report writing
This project was created for academic purposes in the Multimodal Machine Learning course (Pembelajaran Mesin Multimodal, IF25-40304), Institut Teknologi Sumatera.
Thanks to:
- The lecturer of the Multimodal Machine Learning course, Mr. I Wayan Wiprayoga Wisesa, S.Kom., M.Kom.
- The providers of the MIREX dataset
🎵 Made with ❤️ by Kelompok 09 🎵