An engineering-focused project covering data acquisition, dataset curation, tokenization, training, and evaluation of a domain-specific AI/ML LLM.
The primary goal of this project is to train a domain-specific Large Language Model (LLM) from scratch, focused on Artificial Intelligence and Machine Learning knowledge.
This repository documents and implements the entire LLM pipeline end-to-end, with a strong emphasis on engineering clarity rather than speed or scale.
This is not about using pre-trained APIs. This is about understanding and building the system.
This repository covers:

Data acquisition:
- Automated and semi-automated collection of AI/ML-related books and resources
- Topic-driven data gathering (subdomains within AI/ML)
- Metadata generation

Dataset curation:
- Duplicate detection
- Content validation and filtering
- Text normalization

Tokenization:
- Vocabulary construction
- Token statistics and analysis

Training:
- Training a language model from scratch
- Domain-focused learning objectives

Evaluation:
- Domain-specific evaluation prompts
- Qualitative and quantitative assessment

Deployment:
- Model optimization and quantization
- Inference pipeline setup
- API endpoint design and implementation
- Performance monitoring and logging

Guiding principles:
- Engineering-first approach
- Automation where possible, manual control where necessary
- Slow is acceptable, correctness is not optional
- Reproducibility over convenience
- Production-ready code from the start
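As a taste of the tokenization stage, vocabulary construction and token statistics can be sketched in a few lines. This is purely illustrative: it uses a naive whitespace tokenizer and a hypothetical `build_vocab` helper, not the project's actual tokenizer.

```python
from collections import Counter

def build_vocab(texts, max_size=32000):
    """Count whitespace tokens and keep the most frequent ones.

    Returns a token -> id mapping plus the raw frequency counts,
    which is enough for basic token statistics and analysis.
    """
    counts = Counter(tok for text in texts for tok in text.lower().split())
    vocab = {tok: i for i, (tok, _) in enumerate(counts.most_common(max_size))}
    return vocab, counts

texts = ["gradient descent updates parameters", "stochastic gradient descent"]
vocab, counts = build_vocab(texts)
print(counts.most_common(2))  # → [('gradient', 2), ('descent', 2)]
```

A real pipeline would likely swap the whitespace split for a subword method (BPE or similar), but the frequency-analysis shape stays the same.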
The system is designed to reduce manual effort while keeping the learning process transparent and debuggable.
ai-ml-domain-llm/
│
├── data-collection/ # Agents, tools, metadata generators
├── data-checking/ # Deduplication, validation, statistics
├── datasets/ # Curated and processed datasets
├── tokenization/ # Tokenizer experiments and analysis
├── training/ # Model architecture and training scripts
├── evaluation/ # Evaluation prompts and metrics
├── deployment/ # Production deployment configurations
├── docs/ # Architecture and design notes
└── README.md
- Project initialization
- Data collection architecture design
- Tooling for metadata and duplicate checking
The repository will evolve incrementally as each stage of the pipeline is implemented.
This project is developed collaboratively by a small team, with shared responsibility across system design, data engineering, and model experimentation.
This project is intended for educational and research purposes only. All data handling is performed with respect to learning objectives and system design exploration.
Most people use LLMs. Few understand how they are built and deployed.
This project exists to gain real, hands-on experience with LLM engineering—from data collection through training to production deployment. It's an end-to-end journey through the complete lifecycle of building and shipping a language model.
From raw text → tokens → parameters → behavior → production.
Date: January 25, 2026
This branch contains comprehensive fixes for critical bugs and improvements to code quality, reliability, and production-readiness.
- **JSON File Corruption Prevention**
  - Fixed race condition in `save_to_json_file()` that could corrupt `resources.json`
  - Added proper `f.truncate()` after writing
  - Added recovery mechanism for corrupted JSON files
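The corruption fix can be sketched as follows. Note this is an illustrative alternative to the in-place `f.truncate()` approach described above: it writes to a temporary file and atomically renames it, which sidesteps truncation entirely. `save_to_json_file` mirrors the changelog's name; `load_json_with_recovery` is a hypothetical helper.

```python
import json
import os
import tempfile

def save_to_json_file(data, path):
    """Write JSON atomically: dump to a temp file, then replace the target.

    os.replace() is atomic on POSIX and Windows, so readers never see a
    half-written file even if the process dies mid-write.
    """
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f, indent=2)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)  # don't leave stray temp files behind
        raise

def load_json_with_recovery(path, default=None):
    """Fall back to a default instead of crashing on a corrupted file."""
    try:
        with open(path) as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return default if default is not None else {}
```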
- **Unified Memory System**
  - REMOVED SQLite local database (`shared_memory.db`)
  - Now using Supabase cloud exclusively for shared memory
  - Eliminated dual memory system that was causing sync issues
  - All team members now see the same duplicate detection data
- **Browser Resource Management**
  - Added proper cleanup in `finally` blocks to prevent zombie browser processes
  - Browsers now always close, even on errors
  - Configurable headless mode via `BROWSER_HEADLESS` environment variable
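The cleanup pattern looks roughly like this. The `launch_browser` factory is hypothetical; the real code drives an actual browser (any Playwright/Selenium-style object with `open` and `close` fits the shape):

```python
import os

# Headless mode is opt-in via the environment, defaulting to headed.
BROWSER_HEADLESS = os.getenv("BROWSER_HEADLESS", "false").lower() == "true"

def scrape(url, launch_browser):
    """Run a scrape and guarantee the browser is closed, even on errors.

    `launch_browser` is a hypothetical factory returning an object with
    .open(url) and .close().
    """
    browser = launch_browser(headless=BROWSER_HEADLESS)
    try:
        return browser.open(url)
    finally:
        browser.close()  # always runs -> no zombie browser processes
```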
- **Conditional Download Delays**
  - Download cooldown (40s) now only applies to successful downloads
  - Configurable via `DOWNLOAD_COOLDOWN` environment variable
  - Failures no longer waste time waiting
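The conditional cooldown boils down to sleeping only on success. A minimal sketch, where the `download` callable is a hypothetical stand-in for the real downloader and `sleep` is injectable for testing:

```python
import os
import time

DOWNLOAD_COOLDOWN = int(os.getenv("DOWNLOAD_COOLDOWN", "40"))

def download_all(items, download, sleep=time.sleep):
    """Download each item; rate-limit only after successes.

    `download` is a hypothetical callable returning True on success.
    Failures skip the cooldown entirely, so a bad item no longer
    costs 40 idle seconds.
    """
    results = []
    for item in items:
        ok = download(item)
        results.append(ok)
        if ok:
            sleep(DOWNLOAD_COOLDOWN)  # cooldown applies to successes only
    return results
```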
- **Environment Variable Validation**
  - Added startup checks for required variables (`OPENAI_API_KEY`, `SUPABASE_URL`, `SUPABASE_KEY`)
  - Clear error messages guide users to fix missing configuration
  - Prevents cryptic runtime errors
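Startup validation along these lines does the job (a sketch; the exact error wording is illustrative):

```python
import os
import sys

REQUIRED_VARS = ("OPENAI_API_KEY", "SUPABASE_URL", "SUPABASE_KEY")

def validate_env(required=REQUIRED_VARS, environ=os.environ):
    """Fail fast with a clear message instead of a cryptic runtime error."""
    missing = [name for name in required if not environ.get(name)]
    if missing:
        sys.exit(
            "Missing required environment variables: "
            + ", ".join(missing)
            + "\nCopy .env.example to .env and fill in the values."
        )
```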
- **LLM Timeout Protection**
  - All OpenAI API calls now have configurable timeouts (default: 30s)
  - Prevents indefinite hangs on network issues
  - Graceful fallbacks when LLM calls fail
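A sketch of the timeout-plus-fallback pattern. The `call` parameter is a hypothetical thin wrapper around the real client (the OpenAI Python SDK does accept a `timeout` option); the fallback text is illustrative:

```python
import os

LLM_TIMEOUT = float(os.getenv("LLM_TIMEOUT", "30"))

def ask_llm(call, prompt, fallback="(LLM unavailable)"):
    """Call an LLM with a timeout and degrade gracefully on failure.

    `call` is a hypothetical client function accepting (prompt, timeout=...)
    and raising on network errors or timeouts.
    """
    try:
        return call(prompt, timeout=LLM_TIMEOUT)
    except Exception:
        return fallback  # graceful fallback instead of an indefinite hang
```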
- **Conversation Memory Leak Fix**
  - Added automatic conversation history trimming (default: 20 messages)
  - Prevents token overflow in long sessions
  - Configurable via `MAX_CONVERSATION_HISTORY`
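History trimming can be as simple as keeping the system prompt plus the most recent N messages (a sketch; the real code may trim by token count instead):

```python
import os

MAX_CONVERSATION_HISTORY = int(os.getenv("MAX_CONVERSATION_HISTORY", "20"))

def trim_history(messages, limit=MAX_CONVERSATION_HISTORY):
    """Keep any system messages plus the last `limit` other messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-limit:]
```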
- **Retry Logic with Exponential Backoff**
  - LLM calls retry up to 3 times on failure
  - Exponential backoff (1s, 2s, 4s) prevents hammering failed endpoints
  - User-friendly error messages
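The retry schedule above (up to three retries with 1s/2s/4s waits) can be sketched as follows; `sleep` is injectable so tests don't actually wait:

```python
import time

def with_retries(fn, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn; on failure retry up to `retries` times with 1s, 2s, 4s waits."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```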
- **MongoDB Duplicate Key Handling**
  - Gracefully handles unique index violations in WebDataTracker
  - No more crashes on duplicate insertions
  - Proper error counting and reporting
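The pattern is a narrow try/except around each insert. In this sketch `DuplicateKeyError` is a local stand-in for `pymongo.errors.DuplicateKeyError`, and `insert_ignoring_duplicates` is a hypothetical helper:

```python
class DuplicateKeyError(Exception):
    """Stand-in for pymongo.errors.DuplicateKeyError."""

def insert_ignoring_duplicates(collection, docs):
    """Insert docs one by one, counting duplicates instead of crashing.

    `collection` is any pymongo-style collection with a unique index,
    exposing insert_one(doc).
    """
    inserted = duplicates = 0
    for doc in docs:
        try:
            collection.insert_one(doc)
            inserted += 1
        except DuplicateKeyError:
            duplicates += 1  # count and report, don't crash
    return inserted, duplicates
```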
- **Input Sanitization**
  - User names are now sanitized (alphanumeric + basic chars only)
  - Limited to 50 characters
  - Prevents potential injection issues
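A sanitizer matching that description might look like this (the exact character whitelist is an assumption):

```python
import re

def sanitize_username(name, max_len=50):
    """Keep alphanumerics plus a few safe characters; cap the length."""
    cleaned = re.sub(r"[^A-Za-z0-9 _.\-]", "", name)
    return cleaned[:max_len].strip()
```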
- Configuration Management: All magic numbers moved to environment variables
- Error Handling: Specific exception catching instead of broad `except Exception`
- Resource Cleanup: Proper context management and cleanup in error cases
- .env.example Files: Added template files for both DataCollector and WebDataTracker
- Type Safety: Added validation for loaded JSON configurations
New environment variables (see `.env.example` files):
- `BROWSER_HEADLESS` - Set to `true` for production servers
- `DOWNLOAD_COOLDOWN` - Adjust download rate limiting
- `LLM_TIMEOUT` - Prevent hanging on slow API responses
- `MAX_DOWNLOADS_PER_ACCOUNT` - Adjust for Z-Library limit changes
- `MAX_CONVERSATION_HISTORY` - Control memory usage in long sessions
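Reading these settings with safe defaults might look like the following (default values taken from the changelog; `env_int` is a hypothetical helper):

```python
import os

def env_int(name, default):
    """Read an integer setting from the environment, with a safe default."""
    try:
        return int(os.getenv(name, default))
    except ValueError:
        return default  # malformed value -> fall back rather than crash

DOWNLOAD_COOLDOWN = env_int("DOWNLOAD_COOLDOWN", 40)
LLM_TIMEOUT = env_int("LLM_TIMEOUT", 30)
MAX_CONVERSATION_HISTORY = env_int("MAX_CONVERSATION_HISTORY", 20)
BROWSER_HEADLESS = os.getenv("BROWSER_HEADLESS", "false").lower() == "true"
```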
- Removed SQLite dependency: `shared_memory.db` is no longer used
  - Action Required: Ensure Supabase credentials are in `.env`
  - Old SQLite data will not be migrated automatically
- Environment variables now required: the script will exit if any are missing
  - Action Required: Copy `.env.example` to `.env` and fill in values
Before merging to main:
- ✅ Test with missing environment variables
- ✅ Test download flow with multiple accounts
- ✅ Verify Supabase duplicate detection works across team members
- ✅ Test conversation trimming in long sessions
- ✅ Verify browser cleanup (check for zombie processes)
- ✅ Test WebDataTracker duplicate handling
- `DataCollectornValidatorAgent/agent.py` - Memory leak fix, validation, retries
- `DataCollectornValidatorAgent/mcp_server.py` - JSON fix, SQLite removal, browser cleanup
- `ManualDataValidator/WebDataTracker/WebDataTracker/app.py` - Duplicate key handling
- `ManualDataValidator/WebDataTracker/WebDataTracker/openai_service.py` - Timeout protection
- NEW: `DataCollectornValidatorAgent/.env.example`
- NEW: `ManualDataValidator/WebDataTracker/WebDataTracker/.env.example`
1. Backup your data (just in case):

        cp data/resources.json data/resources.json.backup

2. Set up environment variables:

        cd DataCollectornValidatorAgent
        cp .env.example .env
        # Edit .env with your actual credentials

3. Verify the Supabase table exists (run once):

        python memory.py
        # Follow instructions to create table in Supabase dashboard

4. Test the agent:

        python agent.py