About • News • Dataset • Quick Start • Citation
This repository contains the official evaluation code and data for the paper "EmoBench-M: Benchmarking Emotional Intelligence for Multimodal Large Language Models". See more details in our paper.
Can Multimodal Large Language Models (MLLMs) understand human emotions in dynamic, multimodal settings? To address this question, we introduce EmoBench-M, a comprehensive benchmark grounded in psychological theories of Emotional Intelligence (EI), designed to evaluate the EI capabilities of MLLMs across video, audio, and text. EmoBench-M spans 13 diverse scenarios across three key dimensions of EI: Foundational Emotion Recognition, Conversational Emotion Understanding, and Socially Complex Emotion Analysis. It includes over 5,000 carefully curated samples and both classification and generation tasks, covering a wide range of real-world affective contexts. Through extensive evaluations of state-of-the-art MLLMs, including open-source models such as Qwen2.5-VL and InternVL2.5 and proprietary models such as Gemini 2.0 Flash, we find that (i) current MLLMs significantly lag behind human performance, especially in conversational and socially complex tasks; (ii) model size alone does not guarantee better emotional reasoning; and (iii) nuanced social emotions and intent understanding remain particularly challenging. We hope EmoBench-M provides a solid foundation for future research toward emotionally intelligent AI systems.
- [2025-07-08] We open-sourced the code and dataset for EmoBench-M on GitHub!
- [2025-02-06] Paper submitted to arXiv: https://arxiv.org/abs/2502.04424.
- [2025-02-05] Created the official project website: https://emo-gml.github.io/.
| Method | FER | CEU | SCEA | Avg. |
|---|---|---|---|---|
| Human | 62.0 | 84.4 | 72.7 | 73.0 |
| Gemini-2.0-Flash | 61.4 | 53.4 | 72.0 | 62.3 |
| Gemini-1.5-Flash | 59.7 | 55.6 | 68.6 | 61.3 |
| Gemini-2.0-Flash-Thinking | 57.7 | 54.2 | 70.0 | 60.6 |
| Qwen2.5-VL-78B-Instruct | 53.0 | 47.9 | 72.5 | 57.8 |
| GLM-4V-PLUS | 56.1 | 47.3 | 69.6 | 57.7 |
| InternVL2.5-38B | 57.6 | 48.9 | 56.6 | 54.4 |
| Qwen2-Audio-7B-Instruct | 59.9 | 43.3 | 55.7 | 53.0 |
| InternVL2.5-78B | 53.0 | 44.5 | 59.8 | 52.4 |
| Video-LLaMA2.1-7B-16F | 50.9 | 46.1 | 57.5 | 51.5 |
| InternVideo2-Chat-8B | 50.6 | 40.2 | 63.6 | 51.5 |
| Video-LLaMA2-7B-16F | 51.4 | 37.1 | 64.5 | 51.0 |
| InternVL2.5-4B | 54.5 | 49.3 | 49.0 | 50.9 |
| InternVL2.5-8B | 51.2 | 45.7 | 54.2 | 50.4 |
| Video-LLaMA2.1-7B-AV | 50.4 | 46.1 | 49.5 | 48.7 |
| Video-LLaMA2-72B | 50.7 | 37.3 | 61.8 | 49.9 |
| Video-LLaMA2-7B | 45.4 | 34.5 | 61.3 | 47.1 |
| MiniCPM-V-2.6-8B | 40.0 | 43.1 | 56.5 | 46.5 |
| LongVA-DPO-7B | 45.7 | 32.1 | 53.5 | 43.8 |
| Emotion-LLaMA | 36.9 | 30.7 | 54.1 | 40.6 |
| Random | 23.1 | 19.8 | 33.3 | 25.4 |
To use this benchmark, please first download the original video files and corresponding annotation .json files from the link below:
Each JSON file contains conversation-style prompts and labels aligned with the corresponding video clips. The structure looks like:
[
  {
    "id": "0",
    "video": "videos/ch-simsv2s/aqgy4_0004/00023.mp4",
    "conversations": [
      {
        "from": "human",
        "value": "<video>\nThe person in video says: ... Determine the emotion conveyed..."
      },
      {
        "from": "gpt",
        "value": "negative"
      }
    ]
  }
]

EmoBench-M/
├── benchmark_json/            # JSON files containing metadata and annotations for each dataset
│   ├── FGMSA.json             # Test instructions for the FGMSA dataset
│   ├── MC-EIU.json            # 500-sample test set for the MC-EIU dataset
│   ├── MELD.json              # Test instructions for the MELD dataset
│   ├── MOSEI.json             # 500-sample test set for the MOSEI dataset
│   ├── MOSI.json              # 500-sample test set for the MOSI dataset
│   ├── MUSTARD.json           # 500-sample test set for the MUSTARD dataset
│   ├── RAVDSS_song.json       # 500-sample test set for the RAVDSS song subset
│   ├── RAVDSS_speech.json     # 500-sample test set for the RAVDSS speech subset
│   ├── SIMS.json              # 500-sample test set for the SIMS dataset
│   ├── ch-simsv2s.json        # 500-sample test set for the Chinese SIMS v2s dataset
│   ├── funny.json             # Test instructions for the UR-FUNNY dataset
│   ├── mer2023.json           # Test instructions for the MER2023 dataset
│   └── smile.json             # Test data for the SMILE dataset
└── dataset/                   # Corresponding video files for each dataset
    ├── FGMSA/
    │   └── videos/
    │       └── FGMSA/         # Video files for the FGMSA dataset
    ├── MC-EIU/
    │   └── videos/
    │       └── MC-EIU/        # Video files for the MC-EIU dataset
    ├── MELD/
    │   └── videos/
    │       └── MELD/          # Video files for the MELD dataset
    ├── MOSEI/
    │   └── videos/
    │       └── MOSEI/         # Video files for the MOSEI dataset
    ├── MOSI/
    │   └── videos/
    │       └── MOSI/          # Video files for the MOSI dataset
    ├── MUSTARD/
    │   └── videos/
    │       └── MUSTARD/       # Video files for the MUSTARD dataset
    ├── RAVDSS_song/
    │   └── videos/
    │       └── RAVDSS/        # Video files for the RAVDSS song subset
    ├── RAVDSS_speech/
    │   └── videos/
    │       └── RAVDSS/        # Video files for the RAVDSS speech subset
    ├── SIMS_test/
    │   └── videos/
    │       └── SIMS/          # Video files for the SIMS dataset
    ├── ch-simsv2s/
    │   └── videos/
    │       └── ch-simsv2s/    # Video files for the Chinese SIMS v2s dataset
    ├── funny/
    │   └── videos/
    │       └── UR-FUNNY/      # Video files for the UR-FUNNY dataset
    ├── mer2023/
    │   └── videos/
    │       └── MER2023/       # Video files for the MER2023 dataset
    └── smile/
        └── videos/
            └── SMILE/         # Video files for the SMILE dataset

Data Structure Overview
- benchmark_json/: Contains JSON files with metadata and annotations for each dataset, including test instructions and sample information.
- dataset/: Corresponding video files for each dataset, organized into subdirectories named after each dataset.
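As a minimal sketch of how the annotation format and directory layout fit together, the snippet below reads one entry and separates the prompt from the gold label. The helper names `parse_sample` and `load_benchmark` are hypothetical and not part of this repository, and the code assumes that the `video` field is relative to the matching dataset folder (for example `dataset/ch-simsv2s/`), which you should verify against your local copy.

```python
import json
from pathlib import Path


def parse_sample(sample: dict) -> dict:
    """Pull the prompt, gold label, and relative video path out of one entry."""
    # Map each speaker ("human" / "gpt") to the text of its turn.
    turns = {turn["from"]: turn["value"] for turn in sample["conversations"]}
    return {
        "id": sample["id"],
        "video": sample["video"],
        "prompt": turns["human"],  # instruction shown to the model
        "label": turns["gpt"],     # gold answer
    }


def load_benchmark(json_path: str, dataset_dir: str) -> list:
    """Load one annotation file and resolve each video path.

    Assumption: the "video" field is relative to that dataset's folder,
    e.g. dataset/ch-simsv2s/ for ch-simsv2s.json.
    """
    with open(json_path, encoding="utf-8") as f:
        samples = [parse_sample(entry) for entry in json.load(f)]
    for sample in samples:
        sample["video_path"] = str(Path(dataset_dir) / sample["video"])
    return samples


# In-memory example mirroring the entry shown in the structure above.
example = {
    "id": "0",
    "video": "videos/ch-simsv2s/aqgy4_0004/00023.mp4",
    "conversations": [
        {"from": "human", "value": "<video>\nThe person in video says: ..."},
        {"from": "gpt", "value": "negative"},
    ],
}
parsed = parse_sample(example)
print(parsed["label"])  # negative
```

With the files downloaded, a call such as `load_benchmark("benchmark_json/ch-simsv2s.json", "dataset/ch-simsv2s")` would return one dictionary per sample. The dataset folder is passed explicitly because folder names do not always match the JSON file names (for example `SIMS.json` and `SIMS_test/`).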
EmoBench-M encompasses three primary evaluation tasks: Classification, Joint Emotion + Intent, and Generation. Each dataset is associated with one of these tasks.
pip install -r requirements.txt

- Task: Classify videos into predefined emotional categories.
- Command:
python eval.py classification --json results.json --output classification.json
- Input JSON (e.g. results.json) Format:
[
  {"video": "sample1.mp4", "expected_value": "positive", "predicted_value": "positive"},
  {"video": "sample2.mp4", "expected_value": "neutral", "predicted_value": "negative"}
]
- Output Format:
{ "accuracy": 0.85, "precision": 0.84, "recall": 0.83, "f1_score": 0.83 }
- Applicable Datasets: All datasets except MC-EIU.json and smile.json.
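To make the four reported numbers concrete, here is a dependency-free sketch that computes accuracy and macro-averaged precision, recall, and F1 from records in the input format above. Macro averaging is an assumption made for illustration; `eval.py` may average differently (for example, weighted by class frequency), so treat the script's own output as authoritative. The function name `classification_metrics` is hypothetical.

```python
from collections import Counter


def safe_div(numerator: float, denominator: float) -> float:
    """Return 0.0 instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(records: list) -> dict:
    """Accuracy plus macro-averaged precision/recall/F1 (assumed averaging)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    correct = 0
    for r in records:
        gold, pred = r["expected_value"], r["predicted_value"]
        if gold == pred:
            tp[gold] += 1
            correct += 1
        else:
            fp[pred] += 1  # predicted this label, but it was wrong
            fn[gold] += 1  # missed the true label

    # Every label that appears as either a gold or a predicted value.
    labels = sorted(set(tp) | set(fp) | set(fn))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        p = safe_div(tp[label], tp[label] + fp[label])
        r = safe_div(tp[label], tp[label] + fn[label])
        precisions.append(p)
        recalls.append(r)
        f1s.append(safe_div(2 * p * r, p + r))

    n = len(labels)
    return {
        "accuracy": safe_div(correct, len(records)),
        "precision": safe_div(sum(precisions), n),
        "recall": safe_div(sum(recalls), n),
        "f1_score": safe_div(sum(f1s), n),
    }


records = [
    {"video": "sample1.mp4", "expected_value": "positive", "predicted_value": "positive"},
    {"video": "sample2.mp4", "expected_value": "neutral", "predicted_value": "negative"},
]
metrics = classification_metrics(records)
print(metrics["accuracy"])  # 0.5
```

On the two example records, one of two predictions is correct, so accuracy is 0.5. Only the `positive` class has a nonzero F1, so the macro F1 over the three observed labels is 1/3.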
- Task: Simultaneously predict the emotion and intent conveyed in a video.
- Command:
python eval.py joint --json emotions.json --output joint.json
- Input JSON (e.g. emotions.json) Format:
[
  {
    "modal_path": "sample1.mp4",
    "expected_emotion": "happy",
    "predicted_emotion": "happy",
    "expected_intent": "encouraging",
    "predicted_intent": "encouraging"
  }
]
- Output Format:
{ "joint_accuracy": 0.80, "joint_precision": 0.79, "joint_recall": 0.78, "joint_f1": 0.78, "total": 100 }
- Applicable Dataset: MC-EIU.json.
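One plausible reading of joint accuracy is that a sample counts as correct only when both the emotion and the intent match. The sketch below implements that reading on records in the input format above. It is an illustration, not the definition used by `eval.py`. The function names, the case-insensitive comparison, and the example labels are all assumptions.

```python
def normalize(label: str) -> str:
    """Compare labels case-insensitively and ignore surrounding whitespace (assumption)."""
    return label.strip().lower()


def joint_accuracy(records: list) -> dict:
    """Count a sample as jointly correct only if emotion AND intent both match."""
    total = len(records)
    emotion_hits = intent_hits = joint_hits = 0
    for r in records:
        emotion_ok = normalize(r["expected_emotion"]) == normalize(r["predicted_emotion"])
        intent_ok = normalize(r["expected_intent"]) == normalize(r["predicted_intent"])
        emotion_hits += int(emotion_ok)
        intent_hits += int(intent_ok)
        joint_hits += int(emotion_ok and intent_ok)
    return {
        "emotion_accuracy": emotion_hits / total if total else 0.0,
        "intent_accuracy": intent_hits / total if total else 0.0,
        "joint_accuracy": joint_hits / total if total else 0.0,
        "total": total,
    }


def make_record(path, gold_emotion, pred_emotion, gold_intent, pred_intent):
    """Build one record in the joint input format (illustrative labels only)."""
    return {
        "modal_path": path,
        "expected_emotion": gold_emotion,
        "predicted_emotion": pred_emotion,
        "expected_intent": gold_intent,
        "predicted_intent": pred_intent,
    }


records = [
    make_record("sample1.mp4", "happy", "happy", "encouraging", "encouraging"),  # both right
    make_record("sample2.mp4", "happy", "happy", "encouraging", "questioning"),  # emotion only
    make_record("sample3.mp4", "sad", "happy", "questioning", "questioning"),    # intent only
    make_record("sample4.mp4", "sad", "Sad", "questioning", "questioning"),      # both right
]
result = joint_accuracy(records)
print(result["joint_accuracy"])  # 0.5
```

In this example each task alone reaches 0.75, while the joint score is 0.5, which shows why a joint metric is stricter than either per-task accuracy.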
- Task: Generate a textual description of the video's content.
- Command:
python eval.py generation --json gen.json --output generation.json
- Input JSON (e.g. gen.json) Format:
[
  {"video": "sample1.mp4", "prediction": "I am very happy", "reference": "I feel happy"}
]
- Output Format:
{ "avg_bleu": 0.35, "avg_rouge": 0.42, "avg_bert": 0.75, "total": 100 }
- Applicable Dataset: smile.json.
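For intuition about overlap-based generation metrics, the following dependency-free sketch computes a ROUGE-1-style unigram F1 between each prediction and its reference. It is not the BLEU, ROUGE, or BERTScore implementation used by `eval.py`, whose exact variants and tokenization are not specified here. Whitespace tokenization, lowercasing, and the function names are assumptions made for illustration.

```python
from collections import Counter


def unigram_f1(prediction: str, reference: str) -> float:
    """ROUGE-1-style F1 over lowercased, whitespace-split tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection: each token counts at most as often as it
    # appears in both the prediction and the reference.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def average_unigram_f1(records: list) -> float:
    """Mean unigram F1 over records in the generation input format."""
    scores = [unigram_f1(r["prediction"], r["reference"]) for r in records]
    return sum(scores) / len(scores) if scores else 0.0


records = [
    {"video": "sample1.mp4", "prediction": "I am very happy", "reference": "I feel happy"},
]
score = average_unigram_f1(records)
print(round(score, 4))  # 0.5714
```

For the example record, two tokens overlap ("i" and "happy"), giving precision 2/4 and recall 2/3, hence F1 = 4/7.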
- Input JSON Preparation: Write a script tailored to your model that produces the input JSON files described above (results.json, emotions.json, gen.json), so that eval.py can load and evaluate them.
- Evaluation Output: Evaluation results are saved to the specified output JSON files for further analysis and comparison across models.
- All-in-One Evaluation: You can also use the all mode to run all three evaluations simultaneously. For example:
python eval.py all \
  --classification-json results.json \
  --joint-json emotions.json \
  --generation-json gen.json \
  --output-dir results/
This will generate three files: results/classification.json, results/joint.json, and results/generation.json, corresponding to the evaluation metrics of each task.
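As a rough sketch of the input-preparation step for the classification task, the snippet below turns benchmark annotations plus a model's raw answers into records in the `results.json` format. The `predict` callable stands in for your own model wrapper and is entirely hypothetical, as are the function names. Lowercasing and stripping the model output is an assumed normalization that may not suit every dataset, and reusing the annotation's `video` path as the `video` field is also an assumption.

```python
import json
from typing import Callable


def turn_value(sample: dict, speaker: str) -> str:
    """Return the text of the first turn spoken by `speaker` ("human" or "gpt")."""
    return next(t["value"] for t in sample["conversations"] if t["from"] == speaker)


def build_classification_records(samples: list, predict: Callable[[str, str], str]) -> list:
    """Pair each gold label with the model's (normalized) prediction."""
    records = []
    for sample in samples:
        raw_answer = predict(sample["video"], turn_value(sample, "human"))
        records.append({
            "video": sample["video"],
            "expected_value": turn_value(sample, "gpt"),
            # Assumed normalization: trim whitespace and lowercase.
            "predicted_value": raw_answer.strip().lower(),
        })
    return records


def dummy_predict(video_path: str, prompt: str) -> str:
    """Placeholder for a real model call; always answers 'Negative'."""
    return " Negative\n"


samples = [
    {
        "id": "0",
        "video": "videos/ch-simsv2s/aqgy4_0004/00023.mp4",
        "conversations": [
            {"from": "human", "value": "<video>\nThe person in video says: ..."},
            {"from": "gpt", "value": "negative"},
        ],
    }
]
records = build_classification_records(samples, dummy_predict)
print(json.dumps(records, indent=2))

# To write the file that eval.py reads:
# with open("results.json", "w", encoding="utf-8") as f:
#     json.dump(records, f, indent=2)
```

The same pattern should carry over to emotions.json and gen.json by swapping in the field names shown for those tasks.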
@article{hu2025emobench,
title={EmoBench-M: Benchmarking Emotional Intelligence for Multimodal Large Language Models},
author={Hu, He and Zhou, Yucheng and You, Lianzhong and Xu, Hongbo and Wang, Qianning and Lian, Zheng and Yu, Fei Richard and Ma, Fei and Cui, Laizhong},
journal={arXiv preprint arXiv:2502.04424},
year={2025}
}
