EmoBench-M

Dataset: Google Drive | License: Apache 2.0

🌸 About • 📰 News • 📦 Dataset • 🔥 Quick Start • 📜 Citation

🌸 About

This repository contains the official evaluation code and data for the paper "EmoBench-M: Benchmarking Emotional Intelligence for Multimodal Large Language Models". See more details in our paper.

Can Multimodal Large Language Models (MLLMs) understand human emotions in dynamic, multimodal settings? To address this question, we introduce EmoBench-M, a comprehensive benchmark grounded in psychological theories of Emotional Intelligence (EI), designed to evaluate the EI capabilities of MLLMs across video, audio, and text. EmoBench-M spans 13 diverse scenarios across three key dimensions of EI: Foundational Emotion Recognition, Conversational Emotion Understanding, and Socially Complex Emotion Analysis. It includes over 5000 carefully curated samples and both classification and generation tasks, covering a wide range of real-world affective contexts. Through extensive evaluations of state-of-the-art MLLMs, including open-source models like Qwen2.5-VL and InternVL2.5 and proprietary models such as Gemini 2.0 Flash, we find that (i) current MLLMs significantly lag behind human performance, especially in conversational and socially complex tasks; (ii) model size alone does not guarantee better emotional reasoning; and (iii) nuanced social emotions and intent understanding remain particularly challenging. We hope EmoBench-M provides a solid foundation for future research toward emotionally intelligent AI systems.

📰 News

🏆 Leaderboard

Method FER CEU SCEA Avg.
Human 62.0 84.4 72.7 73.0
🏅 Gemini-2.0-Flash 61.4 53.4 72.0 62.3
🥈 Gemini-1.5-Flash 59.7 55.6 68.6 61.3
🥉 Gemini-2.0-Flash-Thinking 57.7 54.2 70.0 60.6
Qwen2.5-VL-78B-Instruct 53.0 47.9 72.5 57.8
GLM-4V-PLUS 56.1 47.3 69.6 57.7
InternVL2.5-38B 57.6 48.9 56.6 54.4
Qwen2-Audio-7B-Instruct 59.9 43.3 55.7 53.0
InternVL2.5-78B 53.0 44.5 59.8 52.4
Video-LLaMA2.1-7B-16F 50.9 46.1 57.5 51.5
InternVideo2-Chat-8B 50.6 40.2 63.6 51.5
Video-LLaMA2-7B-16F 51.4 37.1 64.5 51.0
InternVL2.5-4B 54.5 49.3 49.0 50.9
InternVL2.5-8B 51.2 45.7 54.2 50.4
Video-LLaMA2.1-7B-AV 50.4 46.1 49.5 48.7
Video-LLaMA2-72B 50.7 37.3 61.8 49.9
Video-LLaMA2-7B 45.4 34.5 61.3 47.1
MiniCPM-V-2.6-8B 40.0 43.1 56.5 46.5
LongVA-DPO-7B 45.7 32.1 53.5 43.8
Emotion-LLaMA 36.9 30.7 54.1 40.6
👀 Random 23.1 19.8 33.3 25.4

📦 Dataset

To use this benchmark, please first download the original video files and corresponding annotation .json files from the link below:

Dataset Google Drive

Each JSON file contains conversation-style prompts and labels aligned with the corresponding video clips. The structure looks like:

[
  {
    "id": "0",
    "video": "videos/ch-simsv2s/aqgy4_0004/00023.mp4",
    "conversations": [
      {
        "from": "human",
        "value": "<video>\nThe person in video says: ... Determine the emotion conveyed..."
      },
      {
        "from": "gpt",
        "value": "negative"
      }
    ]
  }
]
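The annotation format above can be flattened into prompt/label pairs before running a model. The following is a minimal sketch, not part of the repository: the helper names `parse_samples` and `load_samples` are hypothetical, and it assumes each sample carries exactly one "human" turn (the prompt) and one "gpt" turn (the label), as in the example shown.

```python
import json


def parse_samples(records):
    """Flatten raw annotation records into dicts with id, video, prompt, label.

    Assumes one "human" turn and one "gpt" turn per record.
    """
    samples = []
    for rec in records:
        # Map speaker -> text, e.g. {"human": "<video>\n...", "gpt": "negative"}
        turns = {turn["from"]: turn["value"] for turn in rec["conversations"]}
        samples.append({
            "id": rec["id"],
            "video": rec["video"],
            "prompt": turns["human"],
            "label": turns["gpt"],
        })
    return samples


def load_samples(json_path):
    """Read one benchmark_json file from disk and flatten it."""
    with open(json_path, encoding="utf-8") as f:
        return parse_samples(json.load(f))


example = [{
    "id": "0",
    "video": "videos/ch-simsv2s/aqgy4_0004/00023.mp4",
    "conversations": [
        {"from": "human", "value": "<video>\nThe person in video says: ... Determine the emotion conveyed..."},
        {"from": "gpt", "value": "negative"},
    ],
}]
print(parse_samples(example)[0]["label"])  # prints: negative
```

In practice `load_samples` would be pointed at one of the downloaded annotation files, for example a path under benchmark_json/.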

πŸ“ Dataset Structure

EmoBench-M/
β”œβ”€β”€ benchmark_json/           # JSON files containing metadata and annotations for each dataset
β”‚   β”œβ”€β”€ FGMSA.json    # Test instructions for the FGMSA dataset
β”‚   β”œβ”€β”€ MC-EIU.json           # 500-sample test set for the MC-EIU dataset
β”‚   β”œβ”€β”€ MELD.json     # Test instructions for the MELD dataset
β”‚   β”œβ”€β”€ MOSEI.json            # 500-sample test set for the MOSEI dataset
β”‚   β”œβ”€β”€ MOSI.json             # 500-sample test set for the MOSI dataset
β”‚   β”œβ”€β”€ MUSTARD.json               # 500-sample test set for the MUSTARD dataset
β”‚   β”œβ”€β”€ RAVDSS_song.json           # 500-sample test set for the RAVDSS song subset
β”‚   β”œβ”€β”€ RAVDSS_speech.json         # 500-sample test set for the RAVDSS speech subset
β”‚   β”œβ”€β”€ SIMS.json             # 500-sample test set for the SIMS dataset
β”‚   β”œβ”€β”€ ch-simsv2s.json       # 500-sample test set for the Chinese SIMS v2s dataset
β”‚   β”œβ”€β”€ funny.json    # Test instructions for the UR-FUNNY dataset
β”‚   β”œβ”€β”€ mer2023.json # Test instructions for the MER2023 dataset
β”‚   └── smile.json           # Test data for the SMILE dataset
└── dataset/              # Corresponding video files for each dataset
    β”œβ”€β”€ FGMSA/
    β”‚   └── videos/
    β”‚       └── FGMSA/        # Video files for the FGMSA dataset
    β”œβ”€β”€ MC-EIU/
    β”‚   └── videos/
    β”‚       └── MC-EIU/       # Video files for the MC-EIU dataset
    β”œβ”€β”€ MELD/
    β”‚   └── videos/
    β”‚       └── MELD/         # Video files for the MELD dataset
    β”œβ”€β”€ MOSEI/
    β”‚   └── videos/
    β”‚       └── MOSEI/        # Video files for the MOSEI dataset
    β”œβ”€β”€ MOSI/
    β”‚   └── videos/
    β”‚       └── MOSI/         # Video files for the MOSI dataset
    β”œβ”€β”€ MUSTARD/
    β”‚   └── videos/
    β”‚       └── MUSTARD/      # Video files for the MUSTARD dataset
    β”œβ”€β”€ RAVDSS_song/
    β”‚   └── videos/
    β”‚       └── RAVDSS/       # Video files for the RAVDSS song subset
    β”œβ”€β”€ RAVDSS_speech/
    β”‚   └── videos/
    β”‚       └── RAVDSS/       # Video files for the RAVDSS speech subset
    β”œβ”€β”€ SIMS_test/
    β”‚   └── videos/
    β”‚       └── SIMS/         # Video files for the SIMS dataset
    β”œβ”€β”€ ch-simsv2s/
    β”‚   └── videos/
    β”‚       └── ch-simsv2s/   # Video files for the Chinese SIMS v2s dataset
    β”œβ”€β”€ funny/
    β”‚   └── videos/
    β”‚       └── UR-FUNNY/     # Video files for the UR-FUNNY dataset
    β”œβ”€β”€ mer2023/
    β”‚   └── videos/
    β”‚       └── MER2023/      # Video files for the MER2023 dataset
    └── smile/
        └── videos/
            └── SMILE/       # Video files for the SMILE dataset

📂 Data Structure Overview

  • benchmark_json/: Contains JSON files with metadata and annotations for each dataset, including test instructions and sample information.
  • dataset/: Corresponding video files for each dataset, organized into subdirectories named after each dataset.
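Given this layout, a sample's "video" field can be joined with its dataset folder to locate the clip on disk. The sketch below is illustrative only: `resolve_video` and `FOLDER_OVERRIDES` are hypothetical names, and it assumes that the "video" field is relative to dataset/<folder>/ and that the folder name matches the JSON file stem except where the tree shows otherwise (SIMS.json and SIMS_test/). Verify both assumptions against the downloaded data.

```python
from pathlib import Path

# JSON stems whose dataset folder is named differently in the tree
# (assumption drawn from the listing: SIMS.json pairs with SIMS_test/).
FOLDER_OVERRIDES = {"SIMS": "SIMS_test"}


def resolve_video(root, json_name, video_field):
    """Build the expected on-disk path of a clip.

    root        -- path to the EmoBench-M directory
    json_name   -- annotation file name, e.g. "ch-simsv2s.json"
    video_field -- the "video" value of one sample
    """
    stem = Path(json_name).stem
    folder = FOLDER_OVERRIDES.get(stem, stem)
    return Path(root) / "dataset" / folder / video_field


def find_missing(root, json_name, video_fields):
    """Return the video fields whose resolved file does not exist."""
    return [v for v in video_fields
            if not resolve_video(root, json_name, v).is_file()]


path = resolve_video("EmoBench-M", "ch-simsv2s.json",
                     "videos/ch-simsv2s/aqgy4_0004/00023.mp4")
print(path.as_posix())
```

Running `find_missing` over every sample before inference is a cheap way to catch an incomplete download or a wrong root directory.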

🔥 Quick Start

EmoBench-M encompasses three primary evaluation tasks: Classification, Joint Emotion + Intent, and Generation. Each dataset is associated with one of these tasks.

🧪 Evaluation Usage

Install Dependencies

pip install -r requirements.txt

1. Classification

  • Task: Classify videos into predefined emotional categories.
  • Command:
    python eval.py classification --json results.json --output classification.json
  • Input JSON (e.g. results.json) Format:
    [
      {"video": "sample1.mp4", "expected_value": "positive", "predicted_value": "positive"},
      {"video": "sample2.mp4", "expected_value": "neutral", "predicted_value": "negative"}
    ]
  • Output Format:
    {
      "accuracy": 0.85,
      "precision": 0.84,
      "recall": 0.83,
      "f1_score": 0.83
    }
  • Applicable Datasets: All datasets except MC-EIU.json and smile.json.
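To sanity-check a results.json file before passing it to eval.py, the four reported metrics can be approximated in plain Python. This is a sketch under stated assumptions, not the code in eval.py: the function name `classification_metrics` is hypothetical, and precision, recall, and F1 are macro-averaged over every label that appears, which may differ from the averaging eval.py uses.

```python
from collections import Counter


def classification_metrics(rows):
    """Accuracy plus macro-averaged precision, recall, and F1.

    Macro averaging is an assumption; eval.py may average differently.
    """
    if not rows:
        raise ValueError("no predictions to score")
    tp, fp, fn = Counter(), Counter(), Counter()
    for row in rows:
        gold, pred = row["expected_value"], row["predicted_value"]
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # this label was predicted wrongly
            fn[gold] += 1  # the true label was missed
    labels = sorted(set(tp) | set(fp) | set(fn))
    precision = recall = f1 = 0.0
    for label in labels:
        p = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        r = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        precision += p
        recall += r
        f1 += 2 * p * r / (p + r) if p + r else 0.0
    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(rows),
        "precision": precision / n,
        "recall": recall / n,
        "f1_score": f1 / n,
    }


rows = [
    {"video": "sample1.mp4", "expected_value": "positive", "predicted_value": "positive"},
    {"video": "sample2.mp4", "expected_value": "neutral", "predicted_value": "negative"},
]
print(classification_metrics(rows)["accuracy"])  # prints: 0.5
```

If the numbers from this sketch and from eval.py disagree on anything other than accuracy, the averaging scheme is the first thing to check.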

2. Joint Emotion + Intent

  • Task: Simultaneously predict the emotion and intent conveyed in a video.
  • Command:
    python eval.py joint --json emotions.json --output joint.json
  • Input JSON (e.g. emotions.json) Format:
    [
      {
        "modal_path": "sample1.mp4",
        "expected_emotion": "happy",
        "predicted_emotion": "happy",
        "expected_intent": "encouraging",
        "predicted_intent": "encouraging"
      }
    ]
  • Output Format:
    {
      "joint_accuracy": 0.80,
      "joint_precision": 0.79,
      "joint_recall": 0.78,
      "joint_f1": 0.78,
      "total": 100
    }
  • Applicable Dataset: MC-EIU.json.
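One plausible reading of joint accuracy is that a sample counts as correct only when both the emotion and the intent match. The sketch below illustrates that reading; it is not taken from eval.py. The names `joint_accuracy` and `_norm` are hypothetical, and the case and whitespace normalization is an added assumption.

```python
def _norm(text):
    """Normalize a label for comparison (assumed: case and spacing are ignored)."""
    return text.strip().lower()


def joint_accuracy(rows):
    """Fraction of samples where emotion AND intent are both predicted correctly."""
    total = len(rows)
    hits = sum(
        1 for row in rows
        if _norm(row["expected_emotion"]) == _norm(row["predicted_emotion"])
        and _norm(row["expected_intent"]) == _norm(row["predicted_intent"])
    )
    return {"joint_accuracy": hits / total if total else 0.0, "total": total}


rows = [
    {"modal_path": "sample1.mp4",
     "expected_emotion": "happy", "predicted_emotion": "happy",
     "expected_intent": "encouraging", "predicted_intent": "encouraging"},
    # Emotion is right but intent is wrong, so the pair counts as a miss.
    {"modal_path": "sample2.mp4",
     "expected_emotion": "happy", "predicted_emotion": "happy",
     "expected_intent": "encouraging", "predicted_intent": "questioning"},
]
print(joint_accuracy(rows))  # prints: {'joint_accuracy': 0.5, 'total': 2}
```

Under this definition, joint accuracy can never exceed the accuracy of either label scored on its own.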

3. Generation

  • Task: Generate a textual description of the video's content.
  • Command:
    python eval.py generation --json gen.json --output generation.json
  • Input JSON (e.g. gen.json) Format:
    [
      {"video": "sample1.mp4", "prediction": "I am very happy", "reference": "I feel happy"}
    ]
  • Output Format:
    {
      "avg_bleu": 0.35,
      "avg_rouge": 0.42,
      "avg_bert": 0.75,
      "total": 100
    }
  • Applicable Dataset: smile.json.
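The generation input file can be assembled by pairing each model output with its reference text. The following sketch shows one way to do that. `build_generation_records`, `save_records`, and the `generate` callable are hypothetical stand-ins for whatever inference wrapper a given model provides, and the "prompt" and "reference" keys on each sample are assumed to have been extracted from the annotation file beforehand.

```python
import json


def build_generation_records(samples, generate):
    """Create gen.json-style records.

    samples  -- iterable of dicts with "video", "prompt", "reference" (assumed keys)
    generate -- callable (video, prompt) -> str; a placeholder for the model call
    """
    records = []
    for sample in samples:
        prediction = generate(sample["video"], sample["prompt"])
        records.append({
            "video": sample["video"],
            "prediction": prediction.strip(),  # drop stray whitespace/newlines
            "reference": sample["reference"],
        })
    return records


def save_records(records, path):
    """Write records as UTF-8 JSON so non-English text stays readable."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)


# Stub model used only to show the record shape.
def fake_model(video, prompt):
    return " I am very happy \n"


samples = [{"video": "sample1.mp4", "prompt": "<video>\n...", "reference": "I feel happy"}]
records = build_generation_records(samples, fake_model)
print(records[0]["prediction"])  # prints: I am very happy
```

Calling `save_records(records, "gen.json")` then produces a file in the input format listed above.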

Important Notes for Researchers and Developers

  • Input JSON Preparation:

    Researchers and developers need to write scripts tailored to their trained or tested models to generate the aforementioned input JSON files (results.json, emotions.json, gen.json). This ensures that eval.py can correctly load and evaluate the data.

  • Evaluation Output:

    Evaluation results will be saved in the specified output JSON files, facilitating further analysis and comparison of different model performances.

  • All-in-One Evaluation:

    You can also use the all mode to run all three evaluations with a single command. For example:

    python eval.py all \
      --classification-json results.json \
      --joint-json emotions.json \
      --generation-json gen.json \
      --output-dir results/

    This will generate three files: results/classification.json, results/joint.json, and results/generation.json, corresponding to the evaluation metrics of each task.
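Once the three output files exist, they can be gathered into a single summary for comparison across models. This is a minimal sketch: `collect_results` and `format_summary` are hypothetical helpers, and only the file names results/classification.json, results/joint.json, and results/generation.json come from the description above.

```python
import json
from pathlib import Path

TASKS = ("classification", "joint", "generation")


def collect_results(output_dir):
    """Load whichever of the three metric files are present in output_dir."""
    results = {}
    for task in TASKS:
        path = Path(output_dir) / f"{task}.json"
        if path.is_file():
            with open(path, encoding="utf-8") as f:
                results[task] = json.load(f)
    return results


def format_summary(results):
    """Render metrics as 'task.metric = value' lines, one per metric."""
    lines = []
    for task, metrics in results.items():
        for name, value in metrics.items():
            shown = f"{value:.4f}" if isinstance(value, float) else str(value)
            lines.append(f"{task}.{name} = {shown}")
    return "\n".join(lines)


print(format_summary({"generation": {"avg_bleu": 0.35, "total": 100}}))
# prints:
# generation.avg_bleu = 0.3500
# generation.total = 100
```

A typical call would be `print(format_summary(collect_results("results/")))` after the all mode has finished.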


📜 Citation

@article{hu2025emobench,
  title={EmoBench-M: Benchmarking Emotional Intelligence for Multimodal Large Language Models},
  author={Hu, He and Zhou, Yucheng and You, Lianzhong and Xu, Hongbo and Wang, Qianning and Lian, Zheng and Yu, Fei Richard and Ma, Fei and Cui, Laizhong},
  journal={arXiv preprint arXiv:2502.04424},
  year={2025}
}
