This repository presents a comparative study of transformer-based models for fake news classification. The project fine-tunes and evaluates two lightweight BERT variants — DistilBERT and TinyBERT — to analyze their performance, efficiency, and suitability for real-world NLP classification tasks.
Fake news detection is a critical NLP problem that requires both accuracy and efficiency. In this project, two compact transformer models are fine-tuned on a labeled fake news dataset and systematically compared based on classification performance and computational efficiency.
The project demonstrates an end-to-end NLP pipeline using Hugging Face Transformers, covering data preprocessing, model training, evaluation, and comparison.
**Dataset**

- The dataset is sourced from GitHub
- Contains labeled news articles classified as real or fake
- Text-based binary classification problem
- Dataset is split into:
  - Training set
  - Validation set
  - Test set
- Stratified splitting is used to preserve label distribution across splits
**DistilBERT**

- A distilled version of BERT
- Retains most of BERT’s performance with fewer parameters
- Faster training and inference than BERT-base
**TinyBERT**

- A heavily compressed BERT variant
- Optimized for low-latency, low-resource environments
- Smaller model size with reduced computational cost
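As a sketch, both models can be loaded through the Transformers Auto classes. The checkpoint names below (`distilbert-base-uncased` and `huawei-noah/TinyBERT_General_4L_312D`) are common public choices and are assumptions — the project may pin different ones.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Checkpoint names are assumptions; the project may use different ones.
CHECKPOINTS = {
    "distilbert": "distilbert-base-uncased",
    "tinybert": "huawei-noah/TinyBERT_General_4L_312D",
}

def load_model_and_tokenizer(name: str, num_labels: int = 2):
    """Load a pretrained checkpoint with a fresh binary classification head."""
    checkpoint = CHECKPOINTS[name]
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=num_labels
    )
    return model, tokenizer
```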
**Environment Setup**
- PyTorch
- Hugging Face Transformers and Datasets
- Scikit-learn
- Pandas and NumPy
**Exploratory Data Analysis**
- Dataset inspection
- Class distribution analysis
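A minimal EDA sketch with pandas; the toy frame and the `text`/`label` column names are assumptions standing in for the real dataset.

```python
import pandas as pd

# Toy frame standing in for the real dataset; column names are assumptions.
df = pd.DataFrame({
    "text": [
        "Economy grows 2% in Q3",
        "Aliens endorse presidential candidate",
        "Rain expected across the region",
        "Miracle cure discovered overnight",
    ],
    "label": ["real", "fake", "real", "fake"],
})

print(df.shape)                                   # rows x columns
print(df["label"].value_counts())                 # absolute class counts
print(df["label"].value_counts(normalize=True))   # class proportions
```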
**Data Splitting**
- Stratified train, validation, and test split
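A stratified three-way split can be done with two `train_test_split` calls; the 80/10/10 ratio and the toy data below are assumptions, not necessarily the project's settings.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame; the 80/10/10 ratio and column names are assumptions.
df = pd.DataFrame({
    "text": [f"article {i}" for i in range(20)],
    "label": ["real"] * 10 + ["fake"] * 10,
})

# First carve out a 10% test set, then split the remaining 90% into 80/10.
train_val, test = train_test_split(
    df, test_size=0.1, stratify=df["label"], random_state=42
)
train, val = train_test_split(
    train_val, test_size=1 / 9, stratify=train_val["label"], random_state=42
)

# Stratification preserves the 50/50 label balance in every split.
print(len(train), len(val), len(test))  # 16 2 2
```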
**Dataset Conversion**
- Conversion from Pandas DataFrame to Hugging Face `Dataset` and `DatasetDict`
**Label Encoding**
- Creation of `label2id` and `id2label` mappings
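The two mappings are plain dictionaries; the label names and their ordering below are assumptions. Passing them to `from_pretrained` makes the saved model report readable class names instead of `LABEL_0`/`LABEL_1`.

```python
labels = ["fake", "real"]  # assumed label names and ordering

label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for label, i in label2id.items()}

print(label2id)  # {'fake': 0, 'real': 1}
print(id2label)  # {0: 'fake', 1: 'real'}

# These can be handed to the model so predictions carry readable names, e.g.:
# AutoModelForSequenceClassification.from_pretrained(
#     checkpoint, num_labels=2, id2label=id2label, label2id=label2id)
```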
**Tokenization**
- Tokenization using respective model tokenizers
- Padding and truncation applied
- Removal of unnecessary columns for efficient training
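A tokenization sketch using the DistilBERT tokenizer; the fixed `max_length=512` and the `padding="max_length"` strategy are assumptions, and the same function would be defined per model with its own tokenizer.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    # Pad/truncate to a fixed length; 512 is an assumption, not a project setting.
    return tokenizer(
        batch["text"], padding="max_length", truncation=True, max_length=512
    )

# Applied to a datasets.Dataset the function would be mapped in batches,
# dropping the raw text column afterwards for efficient training:
# dataset = dataset.map(tokenize, batched=True, remove_columns=["text"])
encoded = tokenize({"text": ["Breaking: markets rally on jobs report"]})
print(list(encoded.keys()))  # input_ids and attention_mask for DistilBERT
```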
**Model Fine-Tuning**
- DistilBERT fine-tuned for sequence classification
- TinyBERT fine-tuned for sequence classification
- Training handled using the Hugging Face `Trainer` API
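A minimal `Trainer` sketch covering both models by swapping the checkpoint name; it expects already-tokenized `Dataset` splits, and the hyperparameters are illustrative assumptions rather than the project's exact values.

```python
from transformers import (
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

def fine_tune(train_ds, val_ds, checkpoint="distilbert-base-uncased", num_labels=2):
    """Fine-tune a checkpoint on tokenized train/validation Datasets."""
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=num_labels
    )
    # Hyperparameters are illustrative assumptions, not the project's values.
    args = TrainingArguments(
        output_dir=f"{checkpoint.split('/')[-1]}-fakenews",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        learning_rate=2e-5,
    )
    trainer = Trainer(
        model=model, args=args, train_dataset=train_ds, eval_dataset=val_ds
    )
    trainer.train()
    return trainer
```

Calling `fine_tune(train_ds, val_ds, checkpoint="huawei-noah/TinyBERT_General_4L_312D")` would run the identical pipeline for TinyBERT.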
**Evaluation Metrics**
- Accuracy
- Precision
- Recall
- F1-score
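All four metrics come from scikit-learn; `average="binary"` is an assumption that fits the two-class setup, and a function of this shape can be wired into `Trainer` through its `compute_metrics` hook.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(y_true, y_pred):
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary"  # treat label 1 as the positive class
    )
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Toy check: 3 of 4 predictions correct, one positive missed.
metrics = compute_metrics([0, 1, 1, 0], [0, 1, 0, 0])
print(metrics)  # accuracy 0.75, precision 1.0, recall 0.5, f1 ~0.667
```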
**Model Comparison**
- Performance comparison between DistilBERT and TinyBERT
- Analysis of accuracy vs model efficiency trade-offs
**Model Saving**
- Fine-tuned models and tokenizers saved for inference or deployment
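Saving and reloading follow the standard `save_pretrained`/`from_pretrained` pattern; the directory path below is an assumption.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

SAVE_DIR = "models/distilbert-fakenews"  # path is an assumption

def save(model, tokenizer, path=SAVE_DIR):
    """Write model weights, config, and tokenizer files to one directory."""
    model.save_pretrained(path)
    tokenizer.save_pretrained(path)

def load(path=SAVE_DIR):
    """A saved directory reloads exactly like a Hub checkpoint name."""
    model = AutoModelForSequenceClassification.from_pretrained(path)
    tokenizer = AutoTokenizer.from_pretrained(path)
    return model, tokenizer
```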
**Results**

- Both DistilBERT and TinyBERT demonstrate strong performance on the fake news classification task
- DistilBERT achieves higher classification accuracy and F1-score
- TinyBERT offers faster inference and lower memory usage with a small performance trade-off
- The comparison highlights the balance between model size, speed, and predictive performance
**Tech Stack**

- Python
- PyTorch
- Hugging Face Transformers
- Hugging Face Datasets
- Scikit-learn
- Google Colab (GPU)