
Arabic-Tweet-Classification-MARBERT-k5-fold

This repository contains a Jupyter Notebook for Arabic text classification using the MARBERT model. K-fold cross-validation is used to ensure robust performance evaluation. MARBERT is a state-of-the-art transformer-based language model developed for Arabic natural language processing (NLP) tasks.
You can check my paper for more details on the work and the results obtained.

Dataset

The dataset used to train the model can be obtained from the following link: https://www.sciencedirect.com/science/article/pii/S2352340923009472#bib0001. It was collected from Twitter and contains two classes, Spam and Ham.
I have also added the dataset to this repository for ease of use.

Prerequisites

Ensure you have the following dependencies installed before running the notebook:

Python 3.7+
Jupyter Notebook
Hugging Face Transformers
PyTorch
scikit-learn
pandas
numpy
matplotlib

Notes for execution

During execution, the notebook reads the dataset from Google Drive and saves the trained model back to Google Drive. To run the project correctly, update the path variable to match your own Google Drive directory.

For example: path = '/content/drive/MyDrive/Colab/AR/'

Replace '/content/drive/MyDrive/Colab/AR/' with the directory in your Google Drive where the dataset and the trained model will be stored.
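A minimal sketch of the mount-and-path setup described above, assuming a Google Colab runtime; the dataset filename here is hypothetical and stands in for whatever file the notebook actually loads:

```python
import os

# Update this to your own Google Drive folder (illustrative value).
path = '/content/drive/MyDrive/Colab/AR/'

# drive.mount only exists inside Colab; guard it so the cell also runs
# (as a no-op) outside Colab.
try:
    from google.colab import drive
    drive.mount('/content/drive')
except ImportError:
    pass

# Hypothetical filename -- replace with the actual dataset file name.
dataset_file = os.path.join(path, 'dataset.csv')
```

After mounting, everything read from or written to `path` persists in your Drive between Colab sessions.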

Model Evaluation

The performance of the trained MARBERT model was evaluated using 5-fold cross-validation to ensure robust and unbiased results. During cross-validation:
  • The dataset was split into 5 folds, with each fold used once as a validation set while the remaining folds were used for training.
  • Precision, recall, and F1-score were calculated for each fold.
  • At the end of the evaluation, the average results for both class 0 (Ham) and class 1 (Spam) were obtained.
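The evaluation loop above can be sketched with scikit-learn. This is not the notebook's MARBERT code: a generic `fit_predict` callback stands in for model fine-tuning and inference, and the function name and seed are illustrative. Labels follow the convention 0 = Ham, 1 = Spam.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import precision_recall_fscore_support

def evaluate_kfold(X, y, fit_predict, n_splits=5, seed=42):
    """Run stratified k-fold CV and return per-class precision, recall,
    and F1 averaged over folds, as an array of shape (3, 2)."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    per_fold = []
    for train_idx, val_idx in skf.split(X, y):
        # fit_predict trains on the fold's training split and returns
        # predictions for the validation split (here it would wrap the
        # MARBERT fine-tuning and inference steps).
        y_pred = fit_predict(X[train_idx], y[train_idx], X[val_idx])
        p, r, f1, _ = precision_recall_fscore_support(
            y[val_idx], y_pred, labels=[0, 1], zero_division=0)
        per_fold.append(np.stack([p, r, f1]))  # rows: P, R, F1; cols: Ham, Spam
    return np.mean(per_fold, axis=0)
```

Each fold serves as the validation set exactly once, and averaging the per-fold metrics yields the per-class results reported below.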

Final 5-Fold Cross-Validation Results

Class 0 (Ham):
  • Precision: 0.9943
  • Recall: 0.9950
  • F1-score: 0.9947
Class 1 (Spam):
  • Precision: 0.9963
  • Recall: 0.9957
  • F1-score: 0.9960

Overall Metrics
Confusion Matrix (rows: actual class, columns: predicted class; order Ham, Spam):
[[11189    56]
 [   64 14851]]

Overall Accuracy: 0.9954
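As a sanity check, the overall accuracy can be recomputed from the confusion matrix above: correct predictions sit on the diagonal, so accuracy is the trace divided by the total count.

```python
import numpy as np

# Confusion matrix from the 5-fold evaluation:
# rows = actual (Ham, Spam), columns = predicted (Ham, Spam).
cm = np.array([[11189,    56],
               [   64, 14851]])

accuracy = np.trace(cm) / cm.sum()  # (11189 + 14851) / 26160
print(round(accuracy, 4))  # → 0.9954
```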

These results demonstrate the model's excellent performance in classifying both Ham and Spam tweets, with near-perfect accuracy and strong F1-scores for both classes.
