kit119/KIT-108

🕉 KIT-108 — Dharmic Hinduism Multi-Script Sanskrit Corpus

Universal Sanskrit Transliteration & Digital Humanities Toolkit


📜 Project Overview

KIT-108 is a structured digital Sanskrit corpus and transliteration toolkit
designed for Digital Humanities, Philology, Comparative Linguistics, and Dharmic Studies.

This project serves two parallel purposes:

  1. 🗂 Corpus Preservation – Structured archival packaging of major Dharmic texts
  2. 🔤 Multi-Script Rendering Engine – Machine-generated script conversion layer

The corpus itself is distributed via the Internet Archive,
while this repository provides the processing engine used to normalize and transliterate its textual layers.


📚 Corpus Scope

Included textual families (depending on collection build):

  • Bhagavad Gita
  • Mahabharata
  • Valmiki Ramayana
  • Selected Vedic Texts
  • GRETIL Archive Texts
  • SARIT XML Corpus
  • Muktabodha Digital Texts
  • Supplementary research corpora

🔠 Supported Script Layers

Machine-generated script layers may include:

  • IAST (Romanized Sanskrit)
  • Devanagari
  • IPA
  • Thai
  • Lao
  • Khmer
  • Burmese

All transliteration layers are:

⚙ Machine-generated
📖 Textually faithful (no semantic modification)
🔍 Designed for research indexing & comparative study
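
A minimal sketch of how a script-layer name could be validated before conversion. The names are taken from the list above; `SCRIPT_LAYERS` and `resolve_script` are hypothetical helpers, and whether the engine uses exactly these identifiers is an assumption.

```python
# Script layers as listed in this README (assumed to match engine identifiers).
SCRIPT_LAYERS = ("IAST", "Devanagari", "IPA", "Thai", "Lao", "Khmer", "Burmese")

# Case-insensitive lookup table: "thai" -> "Thai", "iast" -> "IAST", ...
_LOOKUP = {name.casefold(): name for name in SCRIPT_LAYERS}


def resolve_script(name: str) -> str:
    """Return the canonical layer name, or raise ValueError if unsupported."""
    try:
        return _LOOKUP[name.strip().casefold()]
    except KeyError:
        raise ValueError(f"unsupported script layer: {name!r}") from None


print(resolve_script("thai"))
```

Rejecting unknown names early avoids discovering a typo only after a large batch has started.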


🏗 Repository Function

This repository contains the lightweight transliteration engine used to process ZIP-based Sanskrit corpora.
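
As an illustration of ZIP-based processing, the sketch below walks an archive and yields the text of each TXT, XML, or JSON member. It uses only the standard library; `iter_corpus_texts` and `build_demo_archive` are hypothetical names, and UTF-8 input is an assumption, not a description of how `aom.py` is written.

```python
import io
import zipfile
from pathlib import PurePosixPath

SUPPORTED_SUFFIXES = {".txt", ".xml", ".json"}  # formats named in this README


def iter_corpus_texts(archive):
    """Yield (member_name, text) for each supported file in a ZIP archive."""
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            suffix = PurePosixPath(info.filename).suffix.lower()
            if info.is_dir() or suffix not in SUPPORTED_SUFFIXES:
                continue
            # Assumption: members are UTF-8; undecodable bytes are replaced.
            yield info.filename, zf.read(info).decode("utf-8", errors="replace")


def build_demo_archive():
    """Build a tiny in-memory ZIP so the sketch runs without external files."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("gita/ch01.txt", "dharmakṣetre kurukṣetre")
        zf.writestr("gita/cover.png", b"not text")
    buffer.seek(0)
    return buffer


for name, text in iter_corpus_texts(build_demo_archive()):
    print(name, len(text))
```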

Core Capabilities

  • 📦 Batch processing from ZIP archives
  • 📄 Supports TXT / XML / JSON
  • ⚡ Multi-core processing
  • 🔁 Cached transliteration for efficiency
  • 🔤 Powered by the Aksharamukha engine
  • 🧠 Designed for large-scale digital humanities workflows
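
Multi-core processing and caching can be combined roughly as follows. This is a sketch under assumptions: `convert_line` is a placeholder (it only upper-cases) standing in for a real engine call, the chunk size is arbitrary, and each worker process keeps its own cache.

```python
from concurrent.futures import ProcessPoolExecutor
from functools import lru_cache


@lru_cache(maxsize=100_000)
def convert_line(line: str) -> str:
    """Placeholder converter; a real one would call the transliteration engine."""
    return line.upper()


def convert_chunk(lines):
    """Convert one chunk; repeated lines within a worker hit the cache."""
    return [convert_line(line) for line in lines]


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def convert_parallel(lines, workers=4, chunk_size=1000):
    """Convert lines across worker processes, preserving input order."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        converted = pool.map(convert_chunk, chunked(lines, chunk_size))
        return [line for chunk in converted for line in chunk]


if __name__ == "__main__":
    print(convert_parallel(["om", "tat", "sat"], workers=2, chunk_size=2))
```

Sending chunks rather than single lines keeps inter-process overhead low, and `pool.map` returns results in submission order, so the output lines stay aligned with the input.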

⚙ Installation

pip install -r requirements.txt

Dependencies:

  • aksharamukha
  • tqdm

Python 3.9+ recommended.


🚀 Usage

Basic example:

python aom.py sample.zip --source IAST --target Thai

Advanced example:

python aom.py gretil.zip \
  --source Devanagari \
  --target IAST \
  --out output_folder \
  --workers 8
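
For readers scripting around the tool, the command-line surface shown above could be modelled with `argparse` as below. The flag names come from the examples; the defaults, help strings, and the `archive` positional name are assumptions and may differ from the actual `aom.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the flags used in the usage examples."""
    parser = argparse.ArgumentParser(prog="aom.py")
    parser.add_argument("archive", help="input ZIP archive")
    parser.add_argument("--source", required=True, help="source script, e.g. IAST")
    parser.add_argument("--target", required=True, help="target script, e.g. Thai")
    parser.add_argument("--out", default="output", help="output folder (assumed default)")
    parser.add_argument("--workers", type=int, default=1, help="worker count (assumed default)")
    return parser


args = build_parser().parse_args(
    ["gretil.zip", "--source", "Devanagari", "--target", "IAST",
     "--out", "output_folder", "--workers", "8"]
)
print(args.archive, args.source, args.target, args.out, args.workers)
```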

📂 Example Workflow

Corpus ZIP
   ↓
Parse Text (TXT / XML / JSON)
   ↓
Normalize
   ↓
Transliteration Engine
   ↓
Multi-Script Output Layer
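
The stages above can be sketched as three small functions. This is an illustrative outline, not the repository's code: the helper names are hypothetical, normalization is assumed to mean Unicode NFC plus whitespace collapsing, and the transliteration step is passed in as a callable.

```python
import json
import unicodedata
import xml.etree.ElementTree as ET


def _json_strings(node):
    """Recursively yield every string value in a parsed JSON document."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from _json_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from _json_strings(item)


def parse_text(name: str, raw: str) -> list[str]:
    """Extract text segments from TXT, XML, or JSON content by file suffix."""
    lower = name.lower()
    if lower.endswith(".xml"):
        return list(ET.fromstring(raw).itertext())
    if lower.endswith(".json"):
        return list(_json_strings(json.loads(raw)))
    return raw.splitlines()


def normalize(segment: str) -> str:
    """Assumed normalization: NFC composition and collapsed whitespace."""
    return " ".join(unicodedata.normalize("NFC", segment).split())


def run_pipeline(name: str, raw: str, convert) -> list[str]:
    """Parse, normalize, then apply `convert` to each non-empty segment."""
    segments = (normalize(s) for s in parse_text(name, raw))
    return [convert(s) for s in segments if s]


xml_doc = "<text><l>dharma  kṣetre</l><l>kuru</l></text>"
print(run_pipeline("sample.xml", xml_doc, lambda s: f"[{s}]"))
```

With `aksharamukha` installed, `convert` could be something like `lambda s: transliterate.process("IAST", "Thai", s)` after `from aksharamukha import transliterate`. Note that flattening XML this way discards markup, which is why structural preservation is a separate concern.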

🧠 Research Applications

This toolkit supports:

  • Sanskrit philological comparison
  • Script evolution studies
  • Southeast Asian Indic transmission research
  • Digital text alignment
  • Corpus indexing and search optimization
  • Cross-script Dharmic textual analysis

⚖ Legal & Rights Notice

Educational Purpose Only

This project supports:

  • Non-commercial academic use
  • Digital preservation
  • Public domain archival

Historical materials included in distributed corpus builds
fall into one of the following categories:

  • Public domain
  • Orphan works
  • Research-distributed texts

No commercial profit is derived.

May this knowledge sharing (Dharma-Dana) lead to peace and wisdom.


🧭 Project Structure

KIT-Universal-Transliteration/
│
├── aom.py
├── proc.py
├── requirements.txt
├── README.md
└── LICENSE

🌏 Relationship to Internet Archive Release

The Internet Archive hosts:

📦 Corpus Data Collections

This GitHub repository provides:

🔧 Processing & Transliteration Engine

Both are complementary components of KIT-108.


🔮 Roadmap

Planned expansions:

  • Script alignment matrix mode
  • TEI-aware structural preservation
  • Layer comparison export
  • Corpus metadata auto-generation
  • Linguistic tokenization layer
  • Search-optimized normalized layer

👤 Project Identity

KIT-108 Project
Digital Humanities Initiative
Southeast Asian Dharmic Preservation Framework

Year: 2026


🕉 End of README