Lightweight, production-ready profanity detection for Sinhala, English, and Singlish. Hybrid design: a tiny AI word classifier (fast, robust to obfuscation) with optional regex rules and OCR (image → text).
```
toxi_scan/
  server/        # FastAPI inference service (AI + optional rules + optional OCR)
  toxi_scan_ai/  # Model training project (data build + training + eval + export)
  toxiscan-app/  # Next.js frontend (demo UI with per-token confidence & threshold)
```
- Fast: char n-gram TF-IDF + Logistic Regression (no GPU, milliseconds per word)
- Obfuscation-tolerant: handles Singlish and creative spellings (e.g., `Huβttaaa`)
- Threshold control: pick an operating point for precision/recall; UI slider included
- Per-token confidence: frontend shows the probability for each detected token
- OCR (optional): Gemini extracts text from images, then the same detector runs on it

- Model: scikit-learn (`TfidfVectorizer` with char 2–5 n-grams + `LogisticRegression`)
- Backend: FastAPI + Uvicorn, optional Gemini OCR (`google.genai` or `google.generativeai`)
- Frontend: Next.js (App Router) + Tailwind + shadcn/ui
- Artifacts: `server/model/word_cuss_lr.joblib`, `server/model/threshold.txt`
```bash
cd toxi_scan/toxi_scan_ai
pip install pandas scikit-learn joblib regex
# (optional, if you experiment with transformers later)
pip install torch transformers datasets

cd ../server
pip install fastapi uvicorn regex python-dotenv joblib scikit-learn
# OCR (optional - either one works; both is fine)
pip install google-genai google-generativeai

cd ../toxiscan-app
npm install
```
- `normal.csv` with columns Sinhala, English, Singlish (we extract words from Sinhala + Singlish; English is ignored for the word model)
- `cuss.json` with:

  ```json
  { "patterns": [ { "label": "profanity-singlish", "pattern": "Hutta" }, ... ] }
  ```
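Rules mode consumes this pattern format. A minimal sketch of a matcher over it (the function names `load_patterns` and `match_rules` are assumptions for illustration, not the server's actual API):

```python
import json
import re

def load_patterns(raw_json: str):
    """Parse the cuss.json structure into (label, compiled_regex) pairs."""
    entries = json.loads(raw_json)["patterns"]
    # Patterns are treated as literal words, matched case-insensitively.
    return [(e["label"], re.compile(re.escape(e["pattern"]), re.IGNORECASE))
            for e in entries]

def match_rules(text: str, rules):
    """Return rule hits with labels and character offsets."""
    hits = []
    for label, rx in rules:
        for m in rx.finditer(text):
            hits.append({"label": label, "match": m.group(0),
                         "start": m.start(), "end": m.end()})
    return hits

raw = '{ "patterns": [ { "label": "profanity-singlish", "pattern": "Hutta" } ] }'
rules = load_patterns(raw)
hits = match_rules("oya Hutta!", rules)
```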
- Unicode NFKC → lowercase
- Strip zero-width chars `[\u200B-\u200D\uFEFF]`
- Compress long repeats (`aaaaa` → `aa`)
- Tokenization: `\b[\p{L}\p{N}][\p{L}\p{N}\-._\u200B-\u200D]*\b`

(Use the provided `create_dataset.py` from the training folder.)
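The steps above can be sketched as follows (an approximation of the normalization in `create_dataset.py`; the exact rules in the project may differ):

```python
import unicodedata
import regex  # third-party `regex` package, needed for \p{L}/\p{N} classes

ZERO_WIDTH = regex.compile(r"[\u200B-\u200D\uFEFF]")
REPEATS = regex.compile(r"(.)\1{2,}")  # 3+ repeats of the same char -> 2
TOKEN = regex.compile(r"\b[\p{L}\p{N}][\p{L}\p{N}\-._\u200B-\u200D]*\b")

def normalize(word: str) -> str:
    """NFKC -> lowercase -> strip zero-width -> compress long repeats."""
    w = unicodedata.normalize("NFKC", word).lower()
    w = ZERO_WIDTH.sub("", w)
    return REPEATS.sub(r"\1\1", w)

def tokenize(text: str):
    """Return (token, start, end) tuples using the Unicode-aware pattern."""
    return [(m.group(0), m.start(), m.end()) for m in TOKEN.finditer(text)]
```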
- Vectorizer: `TfidfVectorizer(analyzer="char", ngram_range=(2,5))`
- Classifier: `LogisticRegression(max_iter=2000, class_weight="balanced")`
- Stratified split, then threshold search to maximize F1 on the "cuss" class
- Exports: `toxi_scan_ai/models/word_cuss_lr.joblib`, `toxi_scan_ai/models/threshold.txt`

Copy these to the server:

```
toxi_scan/server/model/word_cuss_lr.joblib
toxi_scan/server/model/threshold.txt
```
- Best operating threshold τ ≈ 0.746
- AUPRC (cuss) ≈ 0.4767
- At τ = 0.5 (not recommended): cuss P = 0.375, R = 0.656, F1 = 0.477 (we deploy with the learned τ for higher precision)
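A sketch of the training recipe and the threshold search (the word lists below are invented toy stand-ins, not the real dataset, and the search here runs on training data for brevity; the real project searches on a held-out split):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.pipeline import make_pipeline

# Toy word lists standing in for normal.csv / cuss.json content.
cuss = ["hutta", "huttaa", "huttaaa", "pakaya", "pako"] * 4
normal = ["kohomada", "machan", "aiyo", "bath", "kade", "oyata", "hari"] * 8
words = cuss + normal
labels = np.array([1] * len(cuss) + [0] * len(normal))

pipe = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 5)),
    LogisticRegression(max_iter=2000, class_weight="balanced"),
)
pipe.fit(words, labels)

# Threshold search: pick the tau that maximizes F1 on the "cuss" class.
probs = pipe.predict_proba(words)[:, 1]
prec, rec, taus = precision_recall_curve(labels, probs)
f1 = 2 * prec * rec / np.clip(prec + rec, 1e-12, None)
tau = taus[np.argmax(f1[:-1])]  # the last prec/rec point has no threshold
```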
```
HOST=0.0.0.0
PORT=8000
LOG_LEVEL=info
DETECTOR=ai              # ai | rules | hybrid
# OCR (optional) - either var name works
GEMINI_API_KEY=your_key  # or GOOGLE_API_KEY=your_key
GEMINI_MODEL=gemini-2.0-flash
```

```bash
cd toxi_scan/server
python main.py
# or: uvicorn main:app --host 0.0.0.0 --port 8000
```

Returns service status:
```json
{
  "ok": true,
  "detector": "ai",
  "ai_loaded": true,
  "ai_threshold": 0.746,
  "patterns": 0,
  "ocr_enabled": true,
  "ocr_sdk": "google.genai",
  "gemini_model": "gemini-2.0-flash",
  "api_key_present": true
}
```

Quick word check:

```json
{ "word": "Hutta", "normalized": "hutta", "is_cuss": true, "score": 0.966, "threshold": 0.746 }
```

Request:

```json
{ "text": "ΰ·ΰ·ΰΆ½ΰ· Hutta π Huβttaaa ΰ·ΰ·ΰΆ·!", "threshold": 0.75 }
```

Response:
```json
{
  "source": "text",
  "raw_text": "…",
  "tagged_text": "ΰ·ΰ·ΰΆ½ΰ· <cuss>Hutta</cuss> π <cuss>Huβttaaa</cuss> ΰ·ΰ·ΰΆ·!",
  "matches": [
    { "label": "ai-cuss", "match": "Hutta", "normalized": "hutta", "score": 0.9659, "start": 6, "end": 11 },
    { "label": "ai-cuss", "match": "Huβttaaa", "normalized": "huβttaa", "score": 0.9642, "start": 15, "end": 23 }
  ]
}
```

Form-data: `image=@file.jpg`. Extracts text via Gemini, then runs the same token-level detection.

- `GET /patterns` - returns the current pattern list (if present)
- `POST /patterns` - replaces the pattern list (used in `rules`/`hybrid` modes)
```bash
# toxiscan-app/.env.local
NEXT_PUBLIC_TOXISCAN_API=http://localhost:8000
```

```bash
cd toxi_scan/toxiscan-app
npm run dev
```

- Text/Image tabs (image uses server OCR if enabled)
- Threshold slider (0.30–0.95) with the default fetched from `/health`
- Inline highlighting via `<cuss>…</cuss>`
- Per-token confidence table and overall confidence (average of token scores)
- Copy/clear actions, shadcn styling
```
User Text/Image
        ↓
Frontend → POST /analyze/text or /analyze/image
        ↓
Server:
  - (image) OCR via Gemini → text
  - tokenize → normalize → predict_proba(word)
  - compare to threshold τ
  - merge spans → <cuss>…</cuss>
  - return matches + scores
        ↓
Frontend renders highlights + per-token & overall confidence
```
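The server-side steps can be sketched as follows (a simplified stand-in: stdlib `re` instead of the Unicode-aware tokenizer, and a hypothetical `fake_score` in place of the model's `predict_proba`):

```python
import re

# Simplified token pattern; the server uses \p{L}/\p{N} via the `regex` package.
TOKEN = re.compile(r"\b\w[\w\-.]*\b")

def analyze_text(text, score_fn, tau):
    """tokenize -> normalize -> score each token -> tag spans scoring >= tau."""
    matches, pieces, last = [], [], 0
    for m in TOKEN.finditer(text):
        word = m.group(0)
        norm = word.lower()  # stand-in for the full normalizer
        score = score_fn(norm)
        if score >= tau:
            matches.append({"label": "ai-cuss", "match": word, "normalized": norm,
                            "score": round(score, 4),
                            "start": m.start(), "end": m.end()})
            pieces.append(text[last:m.start()] + f"<cuss>{word}</cuss>")
            last = m.end()
    pieces.append(text[last:])
    return {"source": "text", "raw_text": text,
            "tagged_text": "".join(pieces), "matches": matches}

# Hypothetical scorer standing in for model.predict_proba on a single word.
fake_score = lambda w: 0.95 if "hutta" in w else 0.05
res = analyze_text("oya Hutta neda", fake_score, tau=0.746)
```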
- The model is trained on imbalanced data; we learn the best threshold on validation to favor high precision.
- Example: τ ≈ 0.746, AUPRC ≈ 0.477 on the cuss class.
- The UI exposes the threshold so teams can dial precision/recall to taste.
- Model file: `server/model/word_cuss_lr.joblib`
- Threshold file: `server/model/threshold.txt`
- `/health` exposes the active threshold; replace the artifacts to upgrade the model.
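Swapping artifacts is a file-level round-trip; a sketch (the toy model and temp directory are illustrative only; the server reads from `server/model/`):

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Train a stand-in model, then persist/reload artifacts the way the server expects.
pipe = make_pipeline(TfidfVectorizer(analyzer="char", ngram_range=(2, 5)),
                     LogisticRegression(max_iter=2000))
pipe.fit(["hutta", "pakaya", "machan", "kohomada"], [1, 1, 0, 0])

model_dir = tempfile.mkdtemp()  # stand-in for server/model/
joblib.dump(pipe, os.path.join(model_dir, "word_cuss_lr.joblib"))
with open(os.path.join(model_dir, "threshold.txt"), "w") as f:
    f.write("0.746")

# What the server does at startup:
model = joblib.load(os.path.join(model_dir, "word_cuss_lr.joblib"))
tau = float(open(os.path.join(model_dir, "threshold.txt")).read().strip())
```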
- 21UG1056 – Data Engineering & Curation
- 21UG1287 – Data Engineering & Curation
- 21UG1073 – Data Engineering & Curation, Model Training, QA & Documentation
- 21UG1376 – Data Engineering & Curation, Backend API & OCR, Frontend
- 21UG1091 – Model Training
- 21UG1092 – Model Training
- 21UG1149 – Backend API & OCR
- 21UG0460 – Backend API & OCR
- 21UG951 – Frontend
- 21UG1079 – QA & Documentation
- 21UG1260 – QA & Documentation
This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Please feel free to submit a Pull Request.
If you find this project helpful, please give it a star!