An AI-powered retrieval-augmented generation (RAG) system that enables users to ask natural language questions about Sri Lankan tax regulations and receive accurate, citation-backed answers extracted strictly from official Inland Revenue Department (IRD) documents.
This system leverages Large Language Models (LLMs) and vector databases to provide intelligent tax guidance by:
- Ingesting official IRD tax documents (PDFs)
- Converting them into searchable embeddings
- Answering tax questions with precise document citations
- Preventing hallucinations through prompt engineering
- Maintaining full audit trails with source references
Document Ingestion - Upload and process IRD PDF documents
Multi-Document Retrieval - Search across multiple tax documents simultaneously
Citation-Backed Answers - All responses include document, page, and section references
Safety Controls - Responses restricted to source documents only
REST API - Easy integration with frontend applications
Vector Database - Fast semantic search using Chroma + HuggingFace embeddings
LLM Integration - Groq API for fast inference without GPU requirements
| Component | Technology |
|---|---|
| Framework | FastAPI (Python) |
| LLM | Groq (llama-3.1-8b-instant) |
| Embeddings | HuggingFace (all-MiniLM-L6-v2) |
| Vector Database | Chroma + ChromaDB |
| PDF Processing | PyPDF, PyMuPDF |
| Text Splitting | LangChain RecursiveCharacterTextSplitter |
| Server | Uvicorn (ASGI) |
| API Validation | Pydantic |
- Python 3.10+
- pip (Python package manager)
- Virtual environment (recommended)
cd d:\Tax_Aipython -m venv .venv
.venv\Scripts\activate # Windows
source .venv/bin/activate # macOS/Linuxpip install -r requirements.txtCreate a .env file in the root directory with:
# API Keys (Required)
OPENAI_API_KEY=your_openai_api_key_here
GROQ_API_KEY=your_groq_api_key_here
# Model Configuration
LLM_MODEL=gpt-4o
EMBEDDING_MODEL=text-embedding-3-small
TEMPERATURE=0.0
# Chunking Configuration
CHUNK_SIZE=1000
CHUNK_OVERLAP=200
# Vector Database
VECTOR_DB_TYPE=chroma
VECTOR_DB_PATH=data/vector_store
# Server Configuration
API_HOST=0.0.0.0
API_PORT=8000mkdir -p data/raw
mkdir -p data/vector_storeAdd your IRD PDF documents to data/raw/ folder.
python -m uvicorn app.main:app --reloadServer will run at: http://127.0.0.1:8000
GET /api/v1/health
Response:
{
"status": "healthy",
"documents": {
"status": "active",
"count": 3,
"files": ["Corporate_Income_Tax_Guide.pdf"]
},
"vector_store_initialized": true
}GET /api/v1/documents
Response:
{
"status": "active",
"count": 3,
"files": ["CIT_Guide_2022_2023.pdf", "PN_IT_2025_01.pdf", "SET_Guide_2025_26.pdf"],
"path": "data/raw"
}POST /api/v1/initialize
Loads all PDFs from data/raw/, creates embeddings, and builds vector store.
Response:
{
"status": "success",
"documents_loaded": 3,
"pages_loaded": 450,
"chunks_created": 1523,
"message": "System initialized and ready for queries"
}POST /api/v1/upload
Content-Type: multipart/form-data
file: <PDF_FILE>
Response:
{
"message": "Document uploaded and processed successfully",
"filename": "CIT_Assessment_Guide.pdf",
"pages_loaded": 156,
"chunks_created": 387
}POST /api/v1/query
Content-Type: application/json
{
"question": "What is the Corporate Income Tax rate for AY 2022/2023?",
"k": 3
}
Response:
{
"answer": "According to the Corporate Income Tax Assessment Guide (AY 2022/2023), the standard Corporate Income Tax rate is 18% for resident companies and 28% for non-resident companies. However, certain categories may qualify for concessional rates as outlined in the Income Tax Act...",
"sources": [
{
"source": "Corporate_Income_Tax_Guide.pdf",
"page": 15,
"content": "The standard rate of Corporate Income Tax for AY 2022/2023 is 18% on taxable income of resident companies..."
},
{
"source": "Corporate_Income_Tax_Guide.pdf",
"page": 47,
"content": "Non-resident companies are subject to Corporate Income Tax at the rate of 28%..."
}
],
"disclaimer": "This response is based solely on IRD-published documents and is not professional tax advice."
}curl -X POST "http://127.0.0.1:8000/api/v1/query" \
-H "Content-Type: application/json" \
-d '{
"question": "How is Self Employment Tax calculated for 2025/2026?",
"k": 5
}'curl -X POST "http://127.0.0.1:8000/api/v1/query" \
-H "Content-Type: application/json" \
-d '{
"question": "What changes were introduced in Public Notice PN_IT_2025-01?",
"k": 3
}'- Navigate to:
http://127.0.0.1:8000/docs - Click on "POST /api/v1/query"
- Click "Try it out"
- Enter your question in the request body
- Click "Execute"
d:\Tax_Ai\
├── app/
│ ├── __init__.py
│ ├── main.py # FastAPI application entry point
│ ├── config.py # Configuration & environment variables
│ ├── ingest.py # Document ingestion script (standalone)
│ ├── api/
│ │ ├── __init__.py
│ │ ├── routes.py # API endpoints
│ │ └── schemas.py # Pydantic request/response models
│ └── services/
│ ├── __init__.py
│ ├── loader.py # PDF document loader & orchestrator
│ ├── preprocessor.py # Text cleaning & normalization
│ ├── chunker.py # Text splitting into chunks
│ ├── embeddings.py # HuggingFace embedding service
│ ├── metadata_extractor.py # Document metadata enrichment
│ ├── vector_db.py # Chroma vector database management
│ └── rag_chain.py # RAG pipeline & LLM integration
├── data/
│ ├── raw/ # PDF documents (input)
│ └── vector_store/ # Chroma database (auto-created)
├── requirements.txt # Python dependencies
├── .env # Environment variables (create manually)
└── README.md # This file
1. INGESTION (app/services/loader.py)
↓
PDF → Load raw pages
2. PREPROCESSING (app/services/preprocessing.py)
↓
Raw text → Fix hyphenation, normalize spaces
3. CHUNKING (app/services/chunker.py)
↓
Clean text → Split into 1000-char chunks (200 overlap)
4. METADATA ENRICHMENT (app/services/metadata_extractor.py)
↓
Chunks → Add document name, page number, citation format
5. EMBEDDING (app/services/embeddings.py)
↓
Chunks → Convert to vector embeddings
6. VECTOR DB STORAGE (app/services/vector_db.py)
↓
Embeddings → Store in Chroma database
7. RETRIEVAL & ANSWERING (app/services/rag_chain.py)
↓
Question → Search similar chunks → LLM → Cite sources
- Prompt engineering restricts answers to source documents
- Strict retrieval only uses chunks found in vector database
- No external knowledge - LLM cannot invent tax rules
- Fallback message - "This information is not available in the provided IRD documents"
- Metadata tracking - Document name, page number, section stored with each chunk
- Source attribution - Every answer includes precise source references
- Audit trail - Citation format: "Document Name – Page X"
- Legal disclaimer - All responses include non-professional-advice notice
- Temperature = 0.0 - Deterministic responses, no creativity
- No speculative answers - System clearly states when info is unavailable
CHUNK_SIZE = 1000 # Characters per chunk
CHUNK_OVERLAP = 200 # Overlap between chunks (context preservation)Why? Tax documents often contain multi-line rules. Overlap prevents cutting sentences.
k = 3 (default) # Number of similar chunks retrievedWhy? Balances retrieval time vs answer completeness.
MODEL = "llama-3.1-8b-instant"
TEMPERATURE = 0.0 # Strict factual mode, no creativityWhy? Tax guidance requires 100% accuracy, no hallucinations.
Solution:
POST http://127.0.0.1:8000/api/v1/initializeSolution:
pip install langchain-community chromadbSolution:
- Add
GROQ_API_KEY=...to.envfile - Restart the server
Solution:
- Ensure PDFs are in
data/raw/folder - Check file extensions are
.pdf(lowercase)
The system is designed to ingest these documents:
-
Corporate Income Tax Assessment Guide (AY 2022/2023)
-
Public Notice – Income Tax (PN_IT_2025-01)
-
Self Employment Tax (SET) Detailed Guide (AY 2025/2026)
Optional Enhancements:
- Inland Revenue Act No. 24 of 2017
- VAT Act & VAT guides
- PAYE and WHT circulars
- Tax filing deadline notices
curl -X POST "http://127.0.0.1:8000/api/v1/upload" \
-F "file=@data/raw/CIT_Guide.pdf"curl -X POST "http://127.0.0.1:8000/api/v1/initialize"curl -X POST "http://127.0.0.1:8000/api/v1/query" \
-H "Content-Type: application/json" \
-d '{
"question": "What is the Corporate Income Tax rate?",
"k": 3
}'curl "http://127.0.0.1:8000/api/v1/health"- PDF Quality - Documents are text-based PDFs (not scanned images)
- Language - All documents are in English
- Chunking - 1000 characters is optimal for tax document structures
- Retrieval - Top-3 chunks provide sufficient context for accurate answers
- API Keys - Groq and OpenAI API keys are valid and have sufficient quota
- Disclaimer - Users understand this is not professional tax advice
- Source Authority - Only official IRD documents are used (no third-party sources)
This project is designed for educational and compliance assistance purposes.
For issues or feature requests:
- Check the Troubleshooting section above
- Verify all
.envvariables are correctly set - Ensure PDFs are in the correct folder
- Check server logs for detailed error messages
- Add IRD Documents - Download and place PDFs in
data/raw/ - Initialize System - Run
/api/v1/initialize - Test Queries - Use Swagger UI or curl commands
- Monitor Citations - Verify sources are accurate and helpful
Version: 1.0.0
Last Updated: January 27, 2026
Status: Production Ready ✅