Multimodal Content Annotation for Identification of Comic Mischief

Overview

This repository contains the project report and associated materials for the Mini Project II titled "Multimodal Content Annotation for Identification of Comic Mischief", submitted as part of the B.Tech degree requirements at the Indian Institute of Information Technology Dharwad (IIIT Dharwad). The project was conducted by a team of fourth-year Computer Science and Engineering students under the guidance of Dr. Sunil Saumya, Assistant Professor, Dept. of CSE, IIIT Dharwad, from August 2021 to November 2021.

Name	Regd. No
Sumith Sai Budde	18BCS101
Shaik Fharook	18BCS091
Syed Sufyan Ahmed	18BCS103
Shubham Shinde	18BCS095

Project Description

The project focuses on the automated detection of comic mischief in videos, which refers to mild harm inflicted on video characters in a humorous manner. The content may span multiple modalities including video frames, soundtrack, and dialogue. The primary objective is to develop multimodal research techniques to label online video content, particularly to protect young viewers from inappropriate material.

Objectives

Develop binary and multi-label classification models to detect comic mischief.
Preprocess video and audio data using traditional and generator-based approaches.
Evaluate model performance using various architectures like CNN, VGG16, LSTM, and SVM.

Methodology

Dataset: Utilized the MOCHA Grand Challenge @ ICMI 2021 dataset, comprising 998 scenes from 347 YouTube and IMDB videos.
Tasks:
- Task 1: Binary detection of comic mischief (presence/absence).
- Task 2: Fine-grained detection with labels: mature humor, slapstick humor, gory humor, and sarcasm.
Pre-processing: Includes traditional video processing, generator-based video processing, and audio preprocessing using ffmpeg and moviepy.
Models: Experimented with 7 models including CNN + Dense, VGG16 + Dense, CNN + LSTM + Dense, VGG16 + LSTM + Dense, and an SVM Audio Classifier.

Results

Models were evaluated on training and validation accuracies, with results varying based on approach and model type.
Detailed analysis and comparisons are provided in the report, highlighting challenges with small datasets and potential improvements.

Conclusion & Future Scope

The project proposes various approaches for predicting comic mischief, with low accuracy due to limited data. Future work could involve using larger datasets, fine-tuning parameters, and exploring ensemble learning with audio and video modalities.

Note: Refer to the report Multimodal_Content_Annotation_for_Identification_of_Comic_Mischief.pdf for detailed methodology and model implementation steps.

References

A. Rehman and S. B. Belhaouari, “Deep Learning for Video Classification: A Review”. TechRxiv, 16-Aug-2021.
W. Samek et al., “Explaining Deep Neural Networks and Beyond: A Review of Methods and Applications,” Proc. IEEE, vol. 109, no. 3, pp. 247–278, 2021.
S. Kiranyaz et al., “1D convolutional neural networks and applications: A survey,” Mech. Syst. Signal Process., vol. 151, 2021.
N. Minallah et al., “On the performance of fusion based planet-scope and Sentinel-2 data for crop classification using inception inspired deep convolutional neural network,” PLoS One, vol. 15, no. 9 September, 2020.
A. U. Rehman and A. Bermak, “Averaging neural network ensembles model for quantification of volatile organic compound,” 2019 15th Int. Wirel. Commun. Mob. Comput. Conf. IWCMC 2019, pp. 848–852, 2019.
D. Brezeale, D. J. Cook, and S. Member, "Automatic video Classification: A survey of the literature," IEEE Transactions on Systems, Man, and Cybernetics, Part C.
Z. Wu et al., “Deep learning for video classification and captioning,” Front. Multimed. Res., pp. 3–29, 2017.
A. Karpathy et al., "Large-Scale Video Classification with Convolutional Neural Networks," 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1725-1732, doi: 10.1109/CVPR.2014.223.
Simonyan, Karen & Zisserman, Andrew. (2014). Very Deep Convolutional Networks for Large-Scale Image Recognition. ArXiv 1409.1556.
Sun, Chen et al. (2019). VideoBERT: A Joint Model for Video and Language Representation Learning. 7463-7472. 10.1109/ICCV.2019.00756.
A Flexible CNN Framework for Multi-Label Image Classification Yunchao Wei et al.

License

This project is for academic purposes only. Redistribution or modification of the report content requires permission from the authors.

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
Research Papers		Research Papers
dataset		dataset
notebooks		notebooks
.gitignore		.gitignore
Multimodal_Content_Annotation_for_Identification_of_Comic_Mischief.pdf		Multimodal_Content_Annotation_for_Identification_of_Comic_Mischief.pdf
Pre-Processing.ipynb		Pre-Processing.ipynb
README.md		README.md
Test.ipynb		Test.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Multimodal Content Annotation for Identification of Comic Mischief

Overview

Project Description

Objectives

Methodology

Results

Conclusion & Future Scope

References

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Multimodal Content Annotation for Identification of Comic Mischief

Overview

Project Description

Objectives

Methodology

Results

Conclusion & Future Scope

References

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages