ToxiScan – Sinhala/Singlish Profanity Detection


Lightweight, production-ready profanity detection for Sinhala, English, and Singlish. Hybrid design: a tiny AI word classifier (fast, robust to obfuscations) with optional regex rules and OCR (image → text).

toxi_scan/
  server/         # FastAPI inference service (AI + optional rules + optional OCR)
  toxi_scan_ai/   # Model training project (data build + training + eval + export)
  toxiscan-app/   # Next.js frontend (demo UI with per-token confidence & threshold)

✨ Highlights

  • Speedy: char n-gram TF-IDF + Logistic Regression (no GPU, ms/word)
  • Obfuscation-tolerant: works with Singlish & creative spellings (e.g., Hu—ttaaa)
  • Threshold control: pick operating point for precision/recall; UI slider included
  • Per-token confidence: frontend shows probability for each detected token
  • OCR (optional): Gemini extracts text from images, then runs the same detector

🧱 Tech Stack

  • Model: scikit-learn (TfidfVectorizer(char 2–5) + LogisticRegression)
  • Backend: FastAPI + Uvicorn, optional Gemini OCR (google.genai or google.generativeai)
  • Frontend: Next.js (App Router) + Tailwind + shadcn/ui
  • Artifacts: server/model/word_cuss_lr.joblib, server/model/threshold.txt

📦 Installation

1) Model training deps

cd toxi_scan/toxi_scan_ai
pip install pandas scikit-learn joblib regex
# (optional, if you experiment with transformers later)
pip install torch transformers datasets

2) Server deps

cd ../server
pip install fastapi uvicorn regex python-dotenv joblib scikit-learn
# OCR (optional – either one works; both is fine)
pip install google-genai google-generativeai

3) Frontend deps

cd ../toxiscan-app
npm install

📚 Data & Training (in toxi_scan_ai/)

Inputs

  • normal.csv with columns: Sinhala, English, Singlish (We extract words from Sinhala + Singlish; English is ignored for the word model.)

  • cuss.json with:

    { "patterns": [ { "label": "profanity-singlish", "pattern": "Hutta" }, ... ] }

Build dataset → data/words.csv

  • Unicode NFKC → lowercase
  • Strip zero-width chars [\u200B-\u200D\uFEFF]
  • Compress long repeats (aaaaa → aa)
  • Tokenization: \b[\p{L}\p{N}][\p{L}\p{N}\-._\u200B-\u200D]*\b

(Use the provided create_dataset.py from the training folder.)
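A minimal sketch of those normalization and tokenization steps (function names are illustrative; see create_dataset.py for the real implementation). It uses the third-party `regex` module installed above, which supports the `\p{...}` classes in the token pattern:

```python
import unicodedata
import regex  # third-party `regex` package (supports \p{L}, \p{N})

ZERO_WIDTH = regex.compile(r"[\u200B-\u200D\uFEFF]")   # zero-width chars
REPEATS = regex.compile(r"(.)\1{2,}")                  # runs of 3+ identical chars
TOKEN = regex.compile(r"\b[\p{L}\p{N}][\p{L}\p{N}\-._\u200B-\u200D]*\b")

def normalize(word: str) -> str:
    """NFKC -> lowercase -> strip zero-width -> compress long repeats."""
    word = unicodedata.normalize("NFKC", word).lower()
    word = ZERO_WIDTH.sub("", word)
    return REPEATS.sub(r"\1\1", word)  # "aaaaa" -> "aa"

def tokenize(text: str) -> list[str]:
    """Split raw text into candidate word tokens."""
    return TOKEN.findall(text)
```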

Train the model

  • Vectorizer: TfidfVectorizer(analyzer="char", ngram_range=(2,5))

  • Classifier: LogisticRegression(max_iter=2000, class_weight="balanced")

  • Stratified split, then threshold search to maximize F1 on "cuss"

  • Exports:

    toxi_scan_ai/models/word_cuss_lr.joblib
    toxi_scan_ai/models/threshold.txt
    

Copy these to the server:

toxi_scan/server/model/word_cuss_lr.joblib
toxi_scan/server/model/threshold.txt
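The training recipe above can be sketched as follows. This is a toy, self-contained version with an inline word list; the real run fits on data/words.csv with a stratified held-out split, and the word/label values here are illustrative only:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.pipeline import make_pipeline

# Tiny illustrative dataset: 1 = cuss, 0 = normal (real data comes from data/words.csv)
words = ["hutta", "huttaa", "pakaya", "hello", "kohomada", "machan", "suba", "ayubowan"]
labels = np.array([1, 1, 1, 0, 0, 0, 0, 0])

pipe = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 5)),
    LogisticRegression(max_iter=2000, class_weight="balanced"),
)
pipe.fit(words, labels)

# Threshold search: pick tau maximizing F1 on the positive ("cuss") class.
# (The real run scores a held-out validation split, not the training words.)
probs = pipe.predict_proba(words)[:, 1]
prec, rec, thr = precision_recall_curve(labels, probs)
f1 = 2 * prec * rec / (prec + rec + 1e-12)
tau = float(thr[f1[:-1].argmax()])  # thr is one shorter than prec/rec

# Export (paths as in the layout above):
# joblib.dump(pipe, "models/word_cuss_lr.joblib")
# open("models/threshold.txt", "w").write(str(tau))
```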

Example validation (from a real run)

  • Best operating threshold τ ≈ 0.746
  • AUPRC (cuss) ≈ 0.4767
  • At τ=0.5 (not recommended): cuss P=0.375, R=0.656, F1=0.477 (we deploy with the learned τ for higher precision)

🔌 Server (FastAPI) – toxi_scan/server

Env (.env)

HOST=0.0.0.0
PORT=8000
LOG_LEVEL=info
DETECTOR=ai             # ai | rules | hybrid

# OCR (optional) – either var name works
GEMINI_API_KEY=your_key   # or GOOGLE_API_KEY=your_key
GEMINI_MODEL=gemini-2.0-flash

Run

cd toxi_scan/server
python main.py
# or: uvicorn main:app --host 0.0.0.0 --port 8000

Endpoints

GET /health

Returns service status:

{
  "ok": true,
  "detector": "ai",
  "ai_loaded": true,
  "ai_threshold": 0.746,
  "patterns": 0,
  "ocr_enabled": true,
  "ocr_sdk": "google.genai",
  "gemini_model": "gemini-2.0-flash",
  "api_key_present": true
}

GET /is_cuss?word=...&threshold=0.8 (threshold is optional; defaults to the learned τ)

Quick word check:

{ "word": "Hutta", "normalized": "hutta", "is_cuss": true, "score": 0.966, "threshold": 0.746 }

POST /analyze/text

{ "text": "හෙࢽෝ Hutta πŸ˜… Huβ€”ttaaa ΰ·ƒΰ·”ΰΆ·!", "threshold": 0.75 }

Response:

{
  "source": "text",
  "raw_text": "…",
  "tagged_text": "හෙࢽෝ <cuss>Hutta</cuss> πŸ˜… <cuss>Huβ€”ttaaa</cuss> ΰ·ƒΰ·”ΰΆ·!",
  "matches": [
    { "label": "ai-cuss", "match": "Hutta", "normalized": "hutta", "score": 0.9659, "start": 6, "end": 11 },
    { "label": "ai-cuss", "match": "Huβ€”ttaaa", "normalized": "huβ€”ttaa", "score": 0.9642, "start": 15, "end": 23 }
  ]
}
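The tagged_text field can be reproduced from matches by splicing <cuss> tags around each [start, end) span. A rough sketch (function name is illustrative, not the server's actual helper):

```python
def tag_text(text: str, matches: list[dict]) -> str:
    """Wrap each matched [start, end) span in <cuss>...</cuss> tags."""
    out, pos = [], 0
    for m in sorted(matches, key=lambda m: m["start"]):
        out.append(text[pos:m["start"]])                       # text before the match
        out.append(f"<cuss>{text[m['start']:m['end']]}</cuss>")  # tagged span
        pos = m["end"]
    out.append(text[pos:])                                     # trailing text
    return "".join(out)
```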

POST /analyze/image (optional OCR)

Form-data: image=@file.jpg. Extracts text via Gemini, then runs the same token-level detection on the result.

(optional) Patterns endpoints

  • GET /patterns → returns current pattern list (if present)
  • POST /patterns → replace pattern list (used in rules / hybrid modes)

🖥 Frontend (Next.js) – toxiscan-app

Config

# toxiscan-app/.env.local
NEXT_PUBLIC_TOXISCAN_API=http://localhost:8000

Run

cd toxi_scan/toxiscan-app
npm run dev

What you get

  • Text/Image tabs (image uses server OCR if enabled)
  • Threshold slider (0.30–0.95) with default fetched from /health
  • Inline highlighting via <cuss>…</cuss>
  • Per-token confidence table and overall confidence (avg of token scores)
  • Copy/clear actions, shadcn styling

πŸ” Typical Flow

User Text/Image
   ↓
Frontend → POST /analyze/text or /analyze/image
   ↓
Server:
  - (image) OCR via Gemini → text
  - tokenize → normalize → predict_proba(word)
  - compare to threshold τ
  - merge spans → <cuss>…</cuss>
  - return matches + scores
   ↓
Frontend renders highlights + per-token & overall confidence

⚖ Operating Point & Accuracy

  • Model is trained on imbalanced data; we learn the best threshold on validation to favor high precision.
  • Example: τ ≈ 0.746, AUPRC ≈ 0.477 on the cuss class.
  • The UI exposes the threshold so teams can dial precision/recall to taste.

πŸ“ Artifacts & Versioning

  • Model file: server/model/word_cuss_lr.joblib
  • Threshold file: server/model/threshold.txt
  • /health exposes the active threshold; replace artifacts to upgrade the model.
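The upgrade path boils down to dumping and reloading the two artifact files. A self-contained round-trip sketch, using a stand-in model and a temp directory in place of server/model/:

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Stand-in model trained on a few illustrative words (1 = cuss, 0 = normal)
pipe = make_pipeline(TfidfVectorizer(analyzer="char", ngram_range=(2, 5)),
                     LogisticRegression(max_iter=2000))
pipe.fit(["hutta", "pakaya", "hello", "machan"], [1, 1, 0, 0])

model_dir = tempfile.mkdtemp()  # stands in for server/model/
joblib.dump(pipe, os.path.join(model_dir, "word_cuss_lr.joblib"))
with open(os.path.join(model_dir, "threshold.txt"), "w") as f:
    f.write("0.746")

# What the server does at startup: load model + threshold, then score words
model = joblib.load(os.path.join(model_dir, "word_cuss_lr.joblib"))
tau = float(open(os.path.join(model_dir, "threshold.txt")).read().strip())
score = float(model.predict_proba(["hutta"])[0, 1])
is_cuss = score >= tau
```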

👥 Team & Responsibilities

  • 21UG1056 – Data Engineering & Curation
  • 21UG1287 – Data Engineering & Curation
  • 21UG1073 – Data Engineering & Curation, Model Training, QA & Documentation
  • 21UG1376 – Data Engineering & Curation, Backend API & OCR, Frontend
  • 21UG1091 – Model Training
  • 21UG1092 – Model Training
  • 21UG1149 – Backend API & OCR
  • 21UG0460 – Backend API & OCR
  • 21UG951 – Frontend
  • 21UG1079 – QA & Documentation
  • 21UG1260 – QA & Documentation

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

⭐ Support

If you find this project helpful, please give it a star!
