Skip to content

Sakti-122140123/Kyuubi-MML

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

24 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

🎡 Multimodal Music Emotion Recognition

Pengenalan Emosi Musik Berbasis Late Fusion pada Dataset MIREX

Tugas Besar Pembelajaran Mesin Multimodal (IF25-40304)
Kelompok 09 β€” Institut Teknologi Sumatera
Semester Ganjil 2024/2025


πŸ‘₯ Anggota Kelompok

Nama NIM
Lois Novel E. Gurning 122140098
Sakti Mujahid Imani 122140123
Apridian Saputra 122140143
Joshia Fernandes Sectio Purba 122140170
Sikah Nubuahtul Ilmi 122140208

πŸ“Œ Ringkasan Project

Sistem Multimodal Music Emotion Recognition (MER) untuk mengklasifikasikan lagu ke dalam 5 klaster emosi MIREX menggunakan tiga modalitas:

  • 🎡 Audio - Analisis sinyal audio
  • πŸ“ Lyrics - Analisis teks lirik
  • 🎹 MIDI - Analisis data musik symbolic

Strategi Pendekatan

Late Fusion - Menggabungkan output probability dari setiap modalitas untuk prediksi akhir yang lebih akurat.

Model yang Digunakan

Baseline

  • Audio: CRNN (Convolutional Recurrent Neural Network)
  • Lyrics: BERT base (bert-base-uncased)
  • MIDI: BiGRU + Attention

Improved

  • Audio: PANN (Pre-trained Audio Neural Network - Cnn14)
  • Lyrics: DeBERTa-v3-base (Enhanced attention mechanism)
  • MIDI: BiGRU + SVM (Robust untuk small dataset)

πŸ’‘ Catatan: Project ini menerapkan iterative improvement dari baseline ke model yang lebih advanced, dengan dokumentasi lengkap untuk menunjukkan progression dan justifikasi.


πŸ“‚ Struktur Repository

Kyuubi-MML/
β”‚
β”œβ”€β”€ πŸ“„ README.md                    # Dokumentasi utama (file ini)
β”‚
β”œβ”€β”€ πŸ“ data/                        # Dataset & metadata
β”‚   β”œβ”€β”€ master_tracks.csv           # Metadata 903 lagu
β”‚   └── split_global.csv            # Train/val/test split
β”‚
β”œβ”€β”€ πŸ“ notebooks/                   # Jupyter Notebooks
β”‚   β”œβ”€β”€ 01_EDA/                     # Exploratory Data Analysis
β”‚   β”‚   β”œβ”€β”€ 01_EDA_Multimodal.ipynb
β”‚   β”‚   └── 02_Data_Splitting.ipynb
β”‚   β”œβ”€β”€ 02_Preprocessing/           # Preprocessing pipelines
β”‚   β”‚   β”œβ”€β”€ 01_Audio_Preprocessing.ipynb
β”‚   β”‚   └── 02_Lyrics_Preprocessing.ipynb
β”‚   β”œβ”€β”€ 03_Baseline/                # Baseline models
β”‚   β”‚   β”œβ”€β”€ 01_Audio_CRNN.ipynb
β”‚   β”‚   β”œβ”€β”€ 02_Lyrics_BERT.ipynb
β”‚   β”‚   └── 03_MIDI_BiGRU_Attention.ipynb
β”‚   β”œβ”€β”€ 04_Improved/                # Improved models
β”‚   β”‚   β”œβ”€β”€ 01_Audio_PANN.ipynb
β”‚   β”‚   β”œβ”€β”€ 02_Lyrics_DeBERTa.ipynb
β”‚   β”‚   β”œβ”€β”€ 03_MIDI_BiGRU_SVM.ipynb
β”‚   β”‚   └── 03_MIDI_Complete_Pipeline.ipynb
β”‚   └── 05_Fusion/                  # Multimodal fusion
β”‚       β”œβ”€β”€ fusion.py
β”‚       β”œβ”€β”€ smart_fusion.py
β”‚       └── fusion_evaluation_finale.py
β”‚
β”œβ”€β”€ πŸ“ results/                     # Hasil eksperimen
β”‚   β”œβ”€β”€ baseline/                   # Baseline results
β”‚   β”œβ”€β”€ improved/                   # Improved results
β”‚   └── fusion/                     # Fusion results
β”‚
β”œβ”€β”€ πŸ“ models/                      # Saved model checkpoints
β”œβ”€β”€ πŸ“ reports/                     # Laporan milestone
β”œβ”€β”€ πŸ“ figures/                     # Visualisasi & plot
β”œβ”€β”€ πŸ“ docs/                        # Dokumentasi tambahan
└── πŸ“ miditrainsvm/               # MIDI training artifacts


πŸ“Š Dataset

Sumber

  • Framework: MIREX (Music Information Retrieval Evaluation eXchange)
  • Reference: Panda et al. (2013)

Ketersediaan Data

Modalitas Jumlah Sampel Coverage
Audio 903 100%
Lyrics 764 ~85%
MIDI 193 ~21%

Label Emosi (5 Cluster MIREX)

  1. Cluster 1: Passionate / Rousing / Confident / Boisterous / Rowdy
  2. Cluster 2: Cheerful / Fun / Sweet / Amiable
  3. Cluster 3: Poignant / Wistful / Brooding
  4. Cluster 4: Humorous / Quirky / Witty
  5. Cluster 5: Aggressive / Tense / Intense

Data Splitting

  • Train: ~70-80%
  • Validation: ~10-15%
  • Test: ~10-15%
  • Strategy: Stratified split untuk maintain class balance

πŸ”¬ Metodologi

1. Exploratory Data Analysis (EDA)

Analisis Intra-Modal - Per modalitas

  • Audio: Mel-spectrogram patterns, duration distribution
  • Lyrics: Word frequency, text length, common words per cluster
  • MIDI: Pitch/velocity distribution, duration patterns

Analisis Inter-Modal - Antar modalitas

  • Correlation analysis
  • Audio-Lyrics-MIDI alignment
  • Modality availability matrix

Analisis Target - Terhadap label

  • Class imbalance detection
  • Feature importance per cluster

Visualisasi t-SNE

  • Feature embeddings visualization
  • Cluster quality assessment

2. Preprocessing

Audio:

  • Sample rate: 32,000 Hz (PANN) / 22,050 Hz (CRNN)
  • Duration: 10 detik uniform
  • Feature: Log-Mel Spectrogram (128 mel bands)
  • Augmentation: Multi-crop strategy (start, middle, end)

Lyrics:

  • Tokenization: BERT/DeBERTa tokenizer
  • Max length: 256 tokens
  • Padding & truncation
  • Lowercase normalization

MIDI:

  • Event extraction (pitch, velocity, duration)
  • Embedding layer
  • Sequence padding

3. Model Architecture

Baseline Models (Milestone 3)

Audio: CRNN

Input (Mel-Spec) β†’ CNN layers β†’ RNN layers β†’ Dense β†’ Softmax (5 classes)
  • Performance: ~43% accuracy, ~0.38 macro F1
  • Issue: Underfitting, butuh pre-trained model

Lyrics: BERT Base

Input (tokens) β†’ BERT encoder β†’ Pooler β†’ Classifier β†’ Softmax (5 classes)
  • Performance: ~42% accuracy, ~0.40 macro F1
  • Issue: Semantic similarity causing confusion

MIDI: BiGRU + Attention

Input (events) β†’ Embedding β†’ BiGRU β†’ Attention β†’ Dense β†’ Softmax (5 classes)
  • Performance: ~25% accuracy, ~0.20 macro F1
  • Issue: Dataset terlalu kecil, overfitting

Improved Models (Milestone 4)

Audio: PANN (Cnn14)

Input β†’ Pre-trained Cnn14 β†’ Feature extractor β†’ Fine-tuned classifier β†’ Softmax
  • Pre-trained on AudioSet
  • Multi-crop inference strategy
  • Expected: Better audio representation

Lyrics: DeBERTa-v3-base

Input β†’ DeBERTa encoder (disentangled attention) β†’ Pooler β†’ Classifier β†’ Softmax
  • Enhanced mask decoder
  • Layer freezing strategy (freeze lower 0-7, fine-tune upper)
  • Expected: Better semantic understanding

MIDI: BiGRU + SVM

Input β†’ BiGRU (frozen) β†’ Feature extraction β†’ SVM classifier (RBF kernel) β†’ Softmax
  • BiGRU as feature extractor
  • SVM with balanced class weights
  • Expected: Robust untuk small dataset, avoid overfitting

4. Fusion Strategy

Simple Average Fusion

P_final = (P_audio + P_lyrics + P_midi) / 3

F1-Weighted Fusion

w_i = F1_i / (F1_audio + F1_lyrics + F1_midi)
P_final = w_audio * P_audio + w_lyrics * P_lyrics + w_midi * P_midi

Smart Fusion (Missing Modality Handling)

  • Adaptive per-sample fusion
  • Supports partial modality combinations
  • Coverage: 903 samples (semua audio)

πŸ“ˆ Hasil Eksperimen

Unimodal Performance

Baseline Results

Model Modality Accuracy Macro F1 Notes
CRNN Audio ~43% ~0.38 Underfitting
BERT Lyrics ~42% ~0.40 Semantic confusion
BiGRU+Attn MIDI ~25% ~0.20 Small dataset

Improved Results

Model Modality Improvement Expected Gain
PANN Audio Pre-trained Better representation
DeBERTa Lyrics Enhanced attn Better semantics
BiGRU+SVM MIDI SVM classifier Avoid overfitting

Multimodal Fusion

Ablation Study (pada intersection samples)

Combination Strategy N Samples Performance
Audio only - 903 Baseline unimodal
Lyrics only - 764 Baseline unimodal
MIDI only - 193 Baseline unimodal
Audio + Lyrics Simple avg 764 Multimodal boost
Audio + MIDI Simple avg 193 Multimodal boost
Lyrics + MIDI Simple avg 193 Multimodal boost
All (Full) Smart fusion 903 Best coverage

Key Findings:

  • βœ… Multimodal fusion > best unimodal
  • βœ… Smart fusion memberikan coverage terluas
  • βœ… F1-weighted lebih baik dari simple average
  • ⚠️ MIDI contribution terbatas karena dataset kecil

🎯 Milestone Progress

βœ… Milestone 1: Proposal

  • Dokumen proposal (5-7 halaman)
  • Slide presentasi (10-15 menit)
  • Latar belakang, rumusan masalah, tujuan
  • Deskripsi dataset & rencana metode
  • Deliverable: reports/Proposal.pdf

βœ… Milestone 2: EDA Multimodal

  • Analisis intra-modal (Audio, Lyrics, MIDI)
  • Analisis inter-modal & target
  • Visualisasi t-SNE
  • Identifikasi masalah data
  • Deliverables:
    • notebooks/01_EDA/01_EDA_Multimodal.ipynb
    • reports/EDA Multimodal Kelompok 09.pdf

βœ… Milestone 3: Preliminary Experiment

  • Baseline models (CRNN, BERT, BiGRU+Attention)
  • Setup eksperimen & hyperparameters
  • Hasil baseline & learning curves
  • Error analysis
  • Rencana optimalisasi
  • Deliverables:
    • notebooks/03_Baseline/*.ipynb
    • reports/Preliminary Experiment Kelompok 09.pdf

βœ… Milestone 4: Laporan Akhir

  • Improved models (PANN, DeBERTa, BiGRU+SVM)
  • Multimodal fusion experiments
  • Evaluation & comparison
  • Deliverables:
    • reports/Final Project.pdf

πŸ“š Referensi Utama

  1. MIREX Dataset

    • Panda et al. (2013) - Multi-modal Music Emotion Recognition

πŸ”§ Technical Stack

Deep Learning Frameworks:

  • PyTorch 2.0+
  • Transformers (Hugging Face)
  • torchaudio

Audio Processing:

  • librosa
  • pretty_midi
  • PANNs (audioset_tagging_cnn)

Machine Learning:

  • scikit-learn (SVM, metrics)
  • numpy, pandas

Visualization:

  • matplotlib, seaborn
  • t-SNE

πŸ“ Catatan Penting

Perubahan dari Baseline ke Improved

Motivasi Improvement:

  1. Audio (CRNN β†’ PANN)

    • CRNN underfitting karena kurang data training
    • PANN pre-trained pada AudioSet (2M+ audio clips)
    • Transfer learning memberikan better feature extraction
  2. Lyrics (BERT β†’ DeBERTa)

    • BERT kesulitan dengan semantic similarity
    • DeBERTa punya disentangled attention mechanism
    • Lebih baik dalam contextual understanding
  3. MIDI (BiGRU+Attn β†’ BiGRU+SVM)

    • Dataset MIDI sangat kecil (193 samples)
    • Neural network classifier cenderung overfit
    • SVM lebih robust untuk small data
    • BiGRU tetap digunakan sebagai feature extractor

File Naming Convention

Baseline Results:

  • results/baseline/audio_prob.csv - CRNN probabilities
  • results/baseline/lyric_prob.csv - BERT probabilities
  • results/baseline/midi_prob.csv - BiGRU+Attention probabilities

Improved Results:

  • results/improved/audio_prob_for_fusion.csv - PANN probabilities
  • results/improved/lyrics_prob_for_fusion2.csv - DeBERTa probabilities
  • results/improved/midi_prob_for_fusion.csv - BiGRU+SVM probabilities

🀝 Kontribusi Tim

Pembagian peran dalam project:

  • Joshia: EDA Audio, CRNN baseline, PANN improvement
  • Apridian: EDA Lyrics, BERT baseline, DeBERTa improvement
  • Sikah: EDA MIDI, BiGRU baseline, BiGRU+SVM improvement
  • Louis: Fusion strategy, evaluation, comparison
  • Sakti: Documentation, visualization, report writing

πŸ“„ Lisensi

Project ini dibuat untuk keperluan akademik dalam mata kuliah Pembelajaran Mesin Multimodal (IF25-40304), Institut Teknologi Sumatera.


πŸ™ Acknowledgments

Terima kasih kepada:

  • Dosen pengampu mata kuliah Pembelajaran Mesin Multimodal, Bapak I Wayan Wiprayoga Wisesa, S.Kom., M.Kom.
  • Penyedia dataset MIREX

🎡 Made with ❀️ by Kelompok 09 🎡

About

Sistem pengenalan emosi musik berbasis multimodal yang memanfaatkan audio, lirik, dan MIDI dengan pendekatan Late Fusion. Dikembangkan untuk Tugas Besar mata kuliah Pembelajaran Mesin Multimodal (IF25-40304).

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors