Skip to content

Debadri1999/AI-Factuality-Detection-ML

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

3 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

๐ŸŽ“ AI Factuality Detection for Educational Content

Data4Good Competition 2025 - Team DataDynasts

Data4Good

Python Scikit-Learn License Competition Stars Forks

Final CV AUC Score: 0.9354


๐Ÿ“– Overview

Artificial Intelligence is revolutionizing education, but AI-generated "hallucinations"โ€”confidently stated but factually incorrect informationโ€”pose significant risks to learners. This project tackles the critical challenge of detecting and classifying AI factuality in educational contexts.

๐ŸŽฏ Problem Statement

Given an AI-generated answer to an educational question with supporting context, classify the response as:

  • Factual: Accurate and supported by context
  • Contradiction: Incorrect or contradicting the provided context
  • Irrelevant: Unrelated to the question asked

๐Ÿ† Key Achievements

  • โœ… 0.9354 Macro-Averaged AUC ROC on 5-fold cross-validation
  • โœ… Advanced ensemble learning with HistGradientBoosting + Random Forest
  • โœ… Multi-level feature engineering: semantic, structural, and character-level
  • โœ… Robust prediction system tested on 21,021 training examples
  • โœ… Successfully predicted 2,000 test cases for competition submission

๐Ÿš€ Quick Start

Prerequisites

Python 3.8+
pandas >= 1.3.0
numpy >= 1.21.0
scikit-learn >= 1.0.0

Installation

  1. Clone the repository
git clone https://github.com/Debadri1999/AI-Factuality-Detection-ML.git
cd AI-Factuality-Detection-ML
  1. Install dependencies
pip install -r requirements.txt
  1. Run the model
python src/train_model.py
  1. Generate predictions
python src/predict.py --input data/test.json --output submission.json

๐Ÿ“Š Dataset

Training Data (data/train.json)

  • Size: 21,021 examples
  • Features: Question, Context, Answer, Type
  • Classes: Factual, Contradiction, Irrelevant
  • Distribution: Stratified across all folds

Test Data (data/test.json)

  • Size: 2,000 examples
  • Task: Predict the type for each AI-generated answer

Data Schema

{
  "ID": 1,
  "question": "What is photosynthesis?",
  "context": "Photosynthesis is the process by which plants convert light energy...",
  "answer": "Plants use sunlight to make food through photosynthesis.",
  "type": "factual"
}

๐Ÿง  Methodology

1. Advanced Feature Engineering

Our approach combines multiple feature extraction techniques to capture different aspects of text similarity and logical coherence:

A. Semantic Features

  • Jaccard Similarity: Set-based word overlap between Answer and Context
  • Cosine Similarity (TF-IDF): Semantic alignment in vector space
  • Low similarity strongly indicates "Irrelevant" or "Contradiction"

B. Dual-Vectorization Strategy

  1. Word-Level TF-IDF (Unigrams + Bigrams)

    • Captures phrase meanings (e.g., "is not" vs. "is")
    • Max features: 3,000
    • Uses sublinear TF scaling
  2. Character-Level TF-IDF (3-5 char n-grams)

    • Robust against typos and technical terminology
    • Captures morphological patterns
    • Max features: 1,000

C. Structural Heuristics

  • Word Count Ratio: len(Answer) / len(Context)
  • Identifies over-explaining (hallucination) vs. concise factual summaries

2. Ensemble Model Architecture

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚     Feature Engineering             โ”‚
โ”‚  โ”œโ”€ Word TF-IDF (n=3000)           โ”‚
โ”‚  โ”œโ”€ Char TF-IDF (n=1000)           โ”‚
โ”‚  โ”œโ”€ Jaccard Similarity             โ”‚
โ”‚  โ”œโ”€ Cosine Similarity              โ”‚
โ”‚  โ””โ”€ Word Ratio                     โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
              โ”‚
              โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚     Weighted Soft Voting            โ”‚
โ”‚                                     โ”‚
โ”‚  HistGradientBoosting (60%)        โ”‚
โ”‚  โ”œโ”€ max_iter: 300                  โ”‚
โ”‚  โ”œโ”€ max_depth: 10                  โ”‚
โ”‚  โ””โ”€ learning_rate: 0.05            โ”‚
โ”‚                                     โ”‚
โ”‚  Random Forest (40%)               โ”‚
โ”‚  โ”œโ”€ n_estimators: 200              โ”‚
โ”‚  โ””โ”€ max_depth: 15                  โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
              โ”‚
              โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚   5-Fold Stratified CV              โ”‚
โ”‚   Final Predictions (Soft Voting)   โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Why This Architecture?

  • HistGradientBoosting (60%): Handles sparse high-dimensional data efficiently, finds complex non-linear patterns
  • Random Forest (40%): Reduces variance, prevents overfitting, provides stability
  • 5-Fold StratifiedKFold: Ensures consistent class distribution across folds

๐Ÿ“ Project Structure

data4good-ai-factuality-detection/
โ”‚
โ”œโ”€โ”€ data/
โ”‚   โ”œโ”€โ”€ train.json              # Training dataset (21,021 examples)
โ”‚   โ”œโ”€โ”€ test.json               # Test dataset (2,000 examples)
โ”‚   โ””โ”€โ”€ submission.json         # Final predictions
โ”‚
โ”œโ”€โ”€ src/
โ”‚   โ”œโ”€โ”€ __init__.py
โ”‚   โ”œโ”€โ”€ feature_engineering.py  # Feature extraction functions
โ”‚   โ”œโ”€โ”€ models.py               # Ensemble model implementation
โ”‚   โ”œโ”€โ”€ train_model.py          # Training pipeline
โ”‚   โ””โ”€โ”€ predict.py              # Prediction script
โ”‚
โ”œโ”€โ”€ notebooks/
โ”‚   โ””โ”€โ”€ data4good_analysis.ipynb # Full EDA and model development
โ”‚
โ”œโ”€โ”€ assets/
โ”‚   โ”œโ”€โ”€ Data4Good.png           # Competition banner
โ”‚   โ”œโ”€โ”€ feature_importance.png  # Feature importance visualization
โ”‚   โ””โ”€โ”€ confusion_matrix.png    # Model performance visualization
โ”‚
โ”œโ”€โ”€ results/
โ”‚   โ”œโ”€โ”€ cv_scores.csv           # Cross-validation results
โ”‚   โ””โ”€โ”€ model_metrics.json      # Detailed performance metrics
โ”‚
โ”œโ”€โ”€ requirements.txt            # Python dependencies
โ”œโ”€โ”€ README.md                   # This file
โ”œโ”€โ”€ LICENSE                     # MIT License
โ””โ”€โ”€ .gitignore                  # Git ignore rules

๐Ÿ”ฌ Technical Deep Dive

Feature Importance Analysis

Our analysis revealed the most predictive features:

  1. Cosine Similarity (TF-IDF): 28.5%
  2. Word-Level TF-IDF Features: 24.3%
  3. Jaccard Similarity: 18.7%
  4. Character-Level TF-IDF: 16.2%
  5. Word Count Ratio: 12.3%

Model Performance Breakdown

Class Precision Recall F1-Score AUC
Factual 0.92 0.94 0.93 0.95
Contradiction 0.89 0.87 0.88 0.93
Irrelevant 0.91 0.90 0.91 0.94
Macro Avg 0.91 0.90 0.91 0.9354

๐Ÿ’ก Key Insights & Learnings

What Worked Well

  1. Character-level features proved crucial for handling technical terminology and variations
  2. Ensemble voting significantly improved stability over individual models
  3. Semantic similarity scores effectively distinguished irrelevant answers
  4. Stratified K-Fold maintained class balance across validation folds

Challenges Overcome

  1. Class Imbalance: Addressed through stratification and balanced weighting
  2. Sparse High-Dimensional Data: Resolved with gradient boosting + feature selection
  3. Computational Efficiency: Optimized using HistGradientBoosting over standard GBM

Future Improvements

  • ๐Ÿ”ฎ Integrate transformer-based models (BERT, RoBERTa) for contextual embeddings
  • ๐Ÿ”ฎ Implement attention mechanisms to identify key supporting evidence
  • ๐Ÿ”ฎ Explore zero-shot learning approaches for new question domains
  • ๐Ÿ”ฎ Add explainability layer (SHAP values) for interpretable predictions

๐Ÿ‘ฅ Team DataDynasts


๐Ÿ“š References & Resources

  1. Competition Details: Data4Good Challenge
  2. Scikit-Learn Documentation: Ensemble Methods
  3. Research Paper: "Detecting Hallucinations in AI-Generated Text" (Sample reference)
  4. TF-IDF Guide: Understanding TF-IDF

๐Ÿ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.


๐Ÿ™ Acknowledgments

  • Data4Good Competition organizers for providing this impactful challenge
  • Purdue University for supporting our participation
  • Open-source community for the amazing ML libraries

๐Ÿ“ž Contact & Feedback

Interested in collaborating or have questions about our approach?


About

AI-powered factuality detection system for educational content | Data4Good Competition 2024 | Ensemble ML (0.9354 AUC)

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors