KIT-108 is a structured digital Sanskrit corpus and transliteration toolkit
designed for Digital Humanities, Philology, Comparative Linguistics, and Dharmic Studies.
This project serves two parallel purposes:
- 🗂 Corpus Preservation – Structured archival packaging of major Dharmic texts
- 🔤 Multi-Script Rendering Engine – Machine-generated script conversion layer
The corpus is distributed via Internet Archive,
while this repository provides the processing engine used to normalize and transliterate textual layers.
Included textual families (depending on collection build):
- Bhagavad Gita
- Mahabharata
- Valmiki Ramayana
- Selected Vedic Texts
- GRETIL Archive Texts
- SARIT XML Corpus
- Muktabodha Digital Texts
- Supplementary research corpora
Machine-generated script layers may include:
- IAST (Romanized Sanskrit)
- Devanagari
- IPA
- Thai
- Lao
- Khmer
- Burmese
All transliteration layers are:
- ⚙ Machine-generated
- 📖 Textually faithful (no semantic modification)
- 🔍 Designed for research indexing & comparative study
This repository contains the lightweight transliteration engine used to process ZIP-based Sanskrit corpora.
- 📦 Batch processing from ZIP archives
- 📄 Supports TXT / XML / JSON
- ⚡ Multi-core processing
- 🔁 Cached transliteration for efficiency
- 🔤 Powered by the Aksharamukha engine
- 🧠 Designed for large-scale digital humanities workflows
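The batch, caching, and worker features listed above can be illustrated with a minimal sketch. It is not the code in aom.py or proc.py: `fake_transliterate`, `cached_transliterate`, `convert_member`, and `process_zip` are hypothetical names, and the stand-in converter simply uppercases text so the example runs without the real engine. It also uses a thread pool to stay short and portable; a genuinely multi-core run would more likely use a process pool, with one cache per process.

```python
import io
import zipfile
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

# File types the README lists as supported.
SUPPORTED = (".txt", ".xml", ".json")


def fake_transliterate(line: str) -> str:
    """Hypothetical stand-in for the real engine call: uppercases the text."""
    return line.upper()


@lru_cache(maxsize=100_000)
def cached_transliterate(line: str) -> str:
    # Identical lines (refrains, colophons, repeated formulae) are converted once.
    return fake_transliterate(line)


def convert_member(data: bytes) -> str:
    """Convert one archive member line by line, reusing cached results."""
    text = data.decode("utf-8")
    return "\n".join(cached_transliterate(line) for line in text.splitlines())


def process_zip(zip_bytes: bytes, workers: int = 4) -> dict[str, str]:
    """Convert every supported member of a ZIP archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        names = [n for n in zf.namelist() if n.lower().endswith(SUPPORTED)]
        payloads = [zf.read(n) for n in names]
    # Threads keep the sketch simple; swap in a process pool for CPU-bound work.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(convert_member, payloads))
    return dict(zip(names, results))
```

Members with other extensions are skipped, and `cached_transliterate.cache_info()` reports how often a repeated line was served from the cache.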
pip install -r requirements.txt
Dependencies:
- aksharamukha
- tqdm
Python 3.9+ recommended.
Basic example:
python aom.py sample.zip --source IAST --target Thai
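For readers who want to script around the tool, the flags used in this README's usage examples (`--source`, `--target`, `--out`, `--workers`) could be parsed along the following lines. This is a sketch under assumptions, not the actual contents of aom.py: the positional name `archive`, the default values, and the help strings are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags shown in the README's usage examples."""
    parser = argparse.ArgumentParser(
        prog="aom.py",
        description="Transliterate a ZIP-based Sanskrit corpus.",
    )
    # Positional name is an assumption; the README only shows a ZIP path here.
    parser.add_argument("archive", help="path to the corpus ZIP archive")
    parser.add_argument("--source", required=True, help="source script, e.g. IAST")
    parser.add_argument("--target", required=True, help="target script, e.g. Thai")
    # Defaults below are assumptions made for this sketch.
    parser.add_argument("--out", default="output", help="output folder")
    parser.add_argument("--workers", type=int, default=1, help="number of workers")
    return parser


# Parse the basic example from the README without touching sys.argv.
args = build_parser().parse_args(["sample.zip", "--source", "IAST", "--target", "Thai"])
print(args.archive, args.source, args.target, args.out, args.workers)
```

Passing an explicit argument list to `parse_args` keeps the sketch easy to test; a real entry point would call `parse_args()` with no arguments.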
Advanced example:
python aom.py gretil.zip \
--source Devanagari \
--target IAST \
--out output_folder \
--workers 8
Corpus ZIP
↓
Parse Text (TXT / XML / JSON)
↓
Normalize
↓
Transliteration Engine
↓
Multi-Script Output Layer
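The parse, normalize, and transliterate stages of the flow above can be sketched for a single archive member. This is an illustration only: `parse_text`, `normalize`, and `run_pipeline` are hypothetical names, the choice of Unicode NFC plus whitespace collapsing as the normalization step is an assumption, and the transliterator is passed in as a plain callable so the sketch runs without the engine installed.

```python
import json
import unicodedata
import xml.etree.ElementTree as ET
from typing import Callable, Iterator


def _json_strings(node) -> Iterator[str]:
    """Yield every string value found in a decoded JSON structure."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from _json_strings(value)
    elif isinstance(node, list):
        for value in node:
            yield from _json_strings(value)


def parse_text(name: str, raw: str) -> list[str]:
    """Extract non-empty text segments from a TXT, XML, or JSON member."""
    lower = name.lower()
    if lower.endswith(".xml"):
        # itertext() walks element text in document order, dropping the tags.
        return [t.strip() for t in ET.fromstring(raw).itertext() if t.strip()]
    if lower.endswith(".json"):
        return [s for s in _json_strings(json.loads(raw)) if s.strip()]
    return [line.strip() for line in raw.splitlines() if line.strip()]


def normalize(segment: str) -> str:
    """Compose combining marks (NFC) and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", segment).split())


def run_pipeline(name: str, raw: str, transliterate: Callable[[str], str]) -> list[str]:
    """Parse, normalize, then transliterate each segment of one member."""
    return [transliterate(normalize(s)) for s in parse_text(name, raw)]


# Identity callable used as a placeholder for the transliteration engine.
print(run_pipeline("verse.txt", "dharmakṣetre   kurukṣetre\n", lambda s: s))
```

In a real run the callable would presumably wrap Aksharamukha's `transliterate.process(source, target, text)`. Normalizing before conversion means that visually identical input with different combining-mark encodings maps to the same output.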
This toolkit supports:
- Sanskrit philological comparison
- Script evolution studies
- Southeast Asian Indic transmission research
- Digital text alignment
- Corpus indexing and search optimization
- Cross-script Dharmic textual analysis
Educational Purpose Only
This project supports:
- Non-commercial academic use
- Digital preservation
- Public domain archival
Historical materials included in distributed corpus builds are either:
- Public domain
- Orphan works
- Research-distributed texts
No commercial profit is derived.
May this knowledge sharing (Dharma-Dana) lead to peace and wisdom.
KIT-Universal-Transliteration/
│
├── aom.py
├── proc.py
├── requirements.txt
├── README.md
└── LICENSE
The Internet Archive hosts:
- 📦 Corpus Data Collections
This GitHub repository provides:
- 🔧 Processing & Transliteration Engine
Both are complementary components of KIT-108.
Planned future expansions:
- Script alignment matrix mode
- TEI-aware structural preservation
- Layer comparison export
- Corpus metadata auto-generation
- Linguistic tokenization layer
- Search-optimized normalized layer
KIT-108 Project
Digital Humanities Initiative
Southeast Asian Dharmic Preservation Framework
Year: 2026