This project focuses on creating a multi-modal dataset for hate speech detection, combining text (tweets) and images (associated with the tweets). The goal is to prepare a clean, structured dataset that can be used to train a deep learning model capable of detecting hate speech in multi-modal content.
The dataset preparation involves these key phases:
- Data Downloading and Image Loading.
- Text Cleaning and Preprocessing.
- Image Preprocessing.
- Multi-Modal Dataset Creation.
- DataLoader Preparation.
- Multi-Modal Embeddings Generation.
- Model Training on Generated Embeddings.
- Objective: Download the dataset and extract relevant information (text, image URLs, and labels).
- Process:
- The dataset is loaded from a JSON file (`MMHS150K_GT.json`).
- Each row contains:
  - `tweet_text`: The text content of the tweet.
  - `img_url`: The URL of the associated image.
  - `labels`: Multi-label classification for hate speech (e.g., `[4, 1, 3]` corresponds to `[Religion, Racist, Homophobe]`).
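A minimal sketch of this loading step, assuming `MMHS150K_GT.json` maps tweet IDs to records containing the three fields listed above (the function name and any schema details beyond those fields are assumptions):

```python
import json

import pandas as pd

def load_annotations(path="MMHS150K_GT.json"):
    """Load the ground-truth JSON into a DataFrame of text / image-URL / label rows."""
    with open(path) as f:
        raw = json.load(f)  # assumed: a dict keyed by tweet ID
    rows = [
        {"tweet_text": r["tweet_text"], "img_url": r["img_url"], "labels": r["labels"]}
        for r in raw.values()
    ]
    return pd.DataFrame(rows)
```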
- Objective: Download images from URLs and save them locally.
- Process:
- A function (`load_dataset`) is created to:
  - Iterate through the dataset rows.
  - Download images from the `img_url` column.
  - Save images locally in a folder (`dataset/images/`) with numeric filenames (e.g., `0.jpg`, `1.jpg`).
  - Skip rows where the image cannot be downloaded (e.g., broken links or network errors).
- Key Details:
- If an image fails to download, the entire row is skipped to ensure text-image matching.
- Numeric filenames are used to avoid issues with special characters or spaces in filenames.
- Broken Image URLs: Some image URLs were inaccessible, leading to skipped rows.
- Network Errors: The function skips problematic rows and ensures only valid text-image pairs are included in the final dataset.
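The download-and-skip logic above can be sketched as follows. The function name `load_dataset` comes from the text; the body (including the use of `requests`) is an assumption, not the project's exact code:

```python
import os

import requests

def load_dataset(df, out_dir="dataset/images"):
    """Download each row's image; keep only rows whose image downloads cleanly."""
    os.makedirs(out_dir, exist_ok=True)
    kept = []
    for i, row in enumerate(df.itertuples(index=False)):
        try:
            resp = requests.get(row.img_url, timeout=10)
            resp.raise_for_status()
        except requests.RequestException:
            continue  # broken link or network error: skip the whole row
        path = os.path.join(out_dir, f"{i}.jpg")  # numeric filename
        with open(path, "wb") as f:
            f.write(resp.content)
        kept.append({"tweet_text": row.tweet_text, "image_path": path, "labels": row.labels})
    return kept
```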
- Objective: Test the data loading and preprocessing pipeline on a small subset of the dataset.
- Process:
- The first 50 samples are loaded and processed.
- Images are downloaded, and text and labels are extracted.
- The dataset is saved as `dataset.csv`.
- Dataset Structure:
  - `tweet_text`: The text content of the tweet.
  - `image_path`: Local path to the downloaded image.
  - `labels`: Multi-label classification for hate speech.
- Inconsistent Data: Some rows had missing or invalid data (e.g., empty text or labels).
- Solution: Rows with missing or invalid data were skipped during preprocessing.
- Objective: Clean the text data to remove noise and prepare it for tokenization.
- Process:
- Lowercasing: Convert all text to lowercase.
- Remove URLs: Remove any URLs from the text.
- Remove User Mentions: Remove Twitter handles (e.g., `@username`).
- Remove Special Characters: Remove punctuation, emojis, and other special characters.
- Remove Extra Whitespace: Trim extra spaces and newlines.
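The cleaning steps above, sketched as one function (the exact regular expressions the project uses are assumptions):

```python
import re

def clean_text(text):
    """Apply the cleaning steps in order: lowercase, strip URLs/mentions/specials, trim."""
    text = text.lower()                             # lowercasing
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # remove URLs
    text = re.sub(r"@\w+", " ", text)               # remove user mentions
    text = re.sub(r"[^a-z0-9\s]", " ", text)        # remove punctuation, emojis, etc.
    return re.sub(r"\s+", " ", text).strip()        # collapse extra whitespace
```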
- Objective: Convert cleaned text into tokenized input for BERT.
- Process:
- Use the `BertTokenizer` from Hugging Face to tokenize the text.
- Generate:
  - `input_ids`: Tokenized text converted to numerical IDs.
  - `attention_mask`: Mask to indicate which tokens are actual words and which are padding.
- Tokenization Parameters:
  - `max_length=128`: Pad/truncate text to 128 tokens.
  - `padding="max_length"`: Pad shorter texts to the maximum length.
  - `truncation=True`: Truncate longer texts to the maximum length.
- Text Length Variability: Some tweets were longer than 128 tokens.
- Solution: Truncate longer texts to fit the model’s input size.
- Objective: Preprocess images for input into a ResNet-50 model.
- Process:
- Resize: Resize images to `224x224` pixels.
- Normalize: Normalize pixel values using ImageNet statistics (`mean = [0.485, 0.456, 0.406]`, `std = [0.229, 0.224, 0.225]`).
- Convert to Tensor: Convert images to PyTorch tensors.
- Objective: Save preprocessed images for later use.
- Process:
- Save transformed images as `.pt` files in a folder (`dataset/transformed_images/`).
- Update the dataset to include paths to the transformed images (`transformed_image_path`).
- Image Shape: Transformed images have the shape `[3, 224, 224]` (3 color channels, 224x224 resolution).
- Normalization: After transformation, pixel values are no longer in `[0, 1]` but fall approximately within `[-2.1, 2.6]`, as implied by the ImageNet mean/std values above.
- Image Loading Errors: Some images failed to load or transform.
- Solution: Use a black placeholder image for failed transformations.
The DataLoader is a PyTorch utility that handles batching, shuffling, and loading of the multi-modal dataset. It takes the `MultiModalDataset` and prepares it for training by organizing the data into batches and providing an iterator over the dataset.
- Objective: Combine text, images, and labels into a PyTorch `Dataset`.
- Process:
  - Create a custom `MultiModalDataset` class that:
    - Loads tokenized text (`input_ids`, `attention_mask`).
    - Loads transformed images from `.pt` files.
    - Loads labels as tensors.
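A sketch of such a `MultiModalDataset`; the field names follow the document, while the constructor signature and internals are assumptions:

```python
import ast

import torch
from torch.utils.data import Dataset

class MultiModalDataset(Dataset):
    """Pairs each tweet's tokenized text with its transformed image tensor and labels."""

    def __init__(self, df, tokenizer, max_length=128):
        self.df = df.reset_index(drop=True)
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        enc = self.tokenizer(
            row["tweet_text"],
            max_length=self.max_length,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )
        labels = row["labels"]
        if isinstance(labels, str):  # CSV stores lists as strings, e.g. "[4, 1, 3]"
            labels = ast.literal_eval(labels)
        return {
            "input_ids": enc["input_ids"].squeeze(0),
            "attention_mask": enc["attention_mask"].squeeze(0),
            "image": torch.load(row["transformed_image_path"]),
            "labels": torch.tensor(labels, dtype=torch.float),
        }
```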
- Objective: Create a `DataLoader` to handle batching and shuffling.
- Process:
  - Use the `DataLoader` class to:
    - Batch the dataset into groups of 32 samples.
    - Shuffle the data to improve training.
    - Provide an iterator for looping through the dataset.
- Batch Size: 32 samples per batch.
- Shuffling: Enabled to randomize the order of samples.
- Output Shapes:
  - `input_ids`: `[32, 128]` (32 samples, 128 tokens each).
  - `attention_mask`: `[32, 128]` (32 samples, 128 tokens each).
  - `images`: `[32, 3, 224, 224]` (32 images, 3 channels, 224x224 resolution).
  - `labels`: `[32, 3]` (32 samples, 3 labels each).
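The batching above can be demonstrated with a toy `TensorDataset` standing in for the multi-modal dataset, so the snippet runs on its own:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for the multi-modal dataset: 64 fake samples with the same shapes.
dataset = TensorDataset(
    torch.zeros(64, 128, dtype=torch.long),  # input_ids
    torch.ones(64, 128, dtype=torch.long),   # attention_mask
    torch.zeros(64, 3, 224, 224),            # images
    torch.zeros(64, 3),                      # labels
)
loader = DataLoader(dataset, batch_size=32, shuffle=True)
input_ids, attention_mask, images, labels = next(iter(loader))
```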
5.6 Final Multi-Modal Dataset Prepared
The DataLoader used is a runtime object that loads data from the dataset for model training.
- The dataset (text, images, and labels) is saved in:
- dataset_transformed.csv (text and labels).
- dataset/transformed_images/ (transformed images as .pt files).
- Description: The text content of the tweet.
- Preprocessing:
- Cleaned to remove noise (e.g., URLs, user mentions, special characters).
- Converted to lowercase.
- Purpose: Used as an input for the text-based model (e.g., BERT).
- Description: The local path to the original downloaded image (e.g., `dataset/images/0.jpg`).
- Preprocessing:
- Images are downloaded from URLs and saved locally.
- Numeric filenames are used to avoid issues with special characters or spaces.
- Purpose: Provides a reference to the original image file.
- Description: Multi-label classification for hate speech (e.g., `[4, 1, 3]` corresponds to `[Religion, Racist, Homophobe]`).
- Preprocessing:
  - Converted from string representation (e.g., `"[4, 1, 3]"`) to a list of integers using `ast.literal_eval`.
- Purpose: Used as the target variable for training the model.
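The string-to-list conversion is a single call:

```python
import ast

# CSV round-trips Python lists as strings; literal_eval safely parses them back.
labels = ast.literal_eval("[4, 1, 3]")  # -> the list [4, 1, 3]
```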
- Description: Tokenized text converted to numerical IDs for input into BERT.
- Preprocessing:
- Generated using the `BertTokenizer`.
- Padded/truncated to a fixed length of 128 tokens.
- Purpose: Represents the tokenized text for the text-based model.
- Description: Mask to indicate which tokens are actual words and which are padding.
- Preprocessing:
- Generated alongside `input_ids` using the `BertTokenizer`.
- Contains `1` for actual tokens and `0` for padding tokens.
- Purpose: Helps the model ignore padding tokens during training.
- Description: The local path to the preprocessed image (e.g., `dataset/transformed_images/0.pt`).
- Preprocessing:
  - Images are resized to `224x224` pixels.
  - Normalized using ImageNet statistics.
  - Converted to PyTorch tensors and saved as `.pt` files.
- Purpose: Provides a reference to the preprocessed image tensor for the image-based model (i.e., ResNet-50).
To improve multi-modal learning, we generate text, image, and multimodal embeddings. These embeddings represent the underlying semantic information in a numerical format, enabling deep learning models to detect patterns more effectively.
- Model Used: BERT-based transformer
- Process:
  - Tokenized tweet texts are passed through a pre-trained BERT model.
  - The `[CLS]` token representation is extracted as the text embedding.
  - The resulting vector size is `768`.
- Usage:
- Captures contextual meaning of tweets.
- Used as input to the multi-modal classifier.
- Model Used: ResNet-50 (pre-trained on ImageNet)
- Process:
  - Preprocessed images are fed into a ResNet-50 model.
  - The output of the final layer before classification is extracted as the image embedding.
  - The resulting vector size is `2048`.
- Usage:
- Captures high-level visual features from images.
- Helps in detecting hate symbols or offensive visual elements.
- Fusion Method: Concatenation of text + image embeddings
- Final Embedding Shape: `[768 + 2048 = 2816]`
- Purpose:
- Provides a unified representation of both text and images.
- Helps models jointly learn from both modalities.
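The fusion itself is a single concatenation:

```python
import torch

text_emb = torch.randn(1, 768)    # stand-in for a BERT [CLS] embedding
image_emb = torch.randn(1, 2048)  # stand-in for a ResNet-50 embedding
fused = torch.cat([text_emb, image_emb], dim=1)  # -> shape [1, 2816]
```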
- The embeddings are saved in `embeddings_final.pt` for efficient retrieval.
- Diagnostic Check:
  - Ensures no missing (`NaN`) or infinite values in embeddings.
  - Checks mean and standard deviation to identify anomalies.
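A sketch of such a check on a saved embedding tensor (the project's actual diagnostic code is not shown, so this is an assumption):

```python
import torch

def check_embeddings(emb):
    """Fail loudly on NaN/inf values; return (mean, std) for anomaly inspection."""
    assert not torch.isnan(emb).any(), "NaN values found in embeddings"
    assert not torch.isinf(emb).any(), "infinite values found in embeddings"
    return emb.mean().item(), emb.std().item()
```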
This embeddings-based approach strengthens the multi-modal hate speech detection system, improving its ability to understand complex relationships between text and images.
Creating a hate speech detection system using deep learning requires datasets encompassing a wide range of text, audio, and images (if applicable) to understand and detect hate speech across multiple modalities. Here are some common datasets often used for multi-modal hate speech detection:
MMHS150K (Multi-Modal Hate Speech 150K):
- Contains 150,000 tweets, each with an associated image and labels for offensive and hate speech.
- Focused on Twitter data, this dataset captures real-world examples of hate speech in a multi-modal format.
- Dataset link
This dataset is accompanied by pre-processing steps, especially for dealing with images and text separately, before feeding them into a deep learning model such as a BERT+ResNet setup for multi-modal classification tasks.
