Epics and Stories
Epic 1: Data Pipeline -- From Database to Training-Ready BIO Sequences
Goal: Transform 44,777 human-annotated database records into clean, windowed, BIO-tagged training data with stratified splits.
Story 1.1: Data Export & Exploration
- As a data engineer, I want to export all 44,777 annotated records (text + span offsets for title, author, translator) from the database into a structured format (JSONL or similar), so that I can analyze and prepare training data.
- Acceptance Criteria:
- Export script pulls text, span start/end, and label type for each annotation
- Output format includes text body, list of spans with {label, start, end}
- Basic statistics generated: label distribution, span length distribution, text length distribution
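The statistics in the last criterion can be sketched as below. This is a minimal illustration, assuming one JSON object per line with the `text` and `spans` fields described above; the function name `summarize` and the choice of summary values are hypothetical, not part of the plan.

```python
import json
from collections import Counter

def summarize(jsonl_lines):
    """Compute label counts and simple length statistics from exported records."""
    label_counts = Counter()
    span_lengths = []
    text_lengths = []
    for line in jsonl_lines:
        record = json.loads(line)
        text_lengths.append(len(record["text"]))
        for span in record["spans"]:
            label_counts[span["label"]] += 1
            # Span length in characters; offsets are assumed end-exclusive.
            span_lengths.append(span["end"] - span["start"])
    mean_span = sum(span_lengths) / len(span_lengths) if span_lengths else 0.0
    return {
        "label_counts": dict(label_counts),
        "mean_span_length": mean_span,
        "max_text_length": max(text_lengths, default=0),
    }
```

A fuller report would likely add histograms or percentiles per label; the sketch only shows the shape of the computation.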
Story 1.2: Windowed Text Extraction
- As a data engineer, I want to extract the first N and last N syllables from each text (splitting at tsheg/shad boundaries), so that the model focuses on regions where bibliographic metadata appears.
- Acceptance Criteria:
- Configurable window size (e.g., first 200 / last 200 syllables)
- Span offsets remapped correctly to windowed text positions
- Analysis report: what percentage of title/author/translator spans are fully captured at various window sizes
- Handle edge cases: texts shorter than 2*N syllables (use full text)
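One possible shape for the windowing and offset remapping is sketched below. It assumes a syllable is any run of characters ending in a tsheg (U+0F0B) or shad (U+0F0D), that offsets are end-exclusive character positions, and that spans crossing a window edge are dropped. The separator between head and tail, and the names `syllable_offsets` and `window_text`, are illustrative assumptions.

```python
import re

# Assumption: a syllable runs up to and including a tsheg or shad;
# trailing text without a delimiter counts as one final syllable.
SYLLABLE = re.compile("[^\u0F0B\u0F0D]*[\u0F0B\u0F0D]|[^\u0F0B\u0F0D]+")

def syllable_offsets(text):
    """Return (start, end) character offsets for each syllable."""
    return [(m.start(), m.end()) for m in SYLLABLE.finditer(text)]

def window_text(text, spans, n, sep=" "):
    """Keep the first n and last n syllables and remap span offsets."""
    if n < 1:
        raise ValueError("n must be at least 1")
    syllables = syllable_offsets(text)
    if len(syllables) <= 2 * n:
        # Short text: the two windows would cover everything, so use it in full.
        return text, [dict(s) for s in spans]
    head_end = syllables[n - 1][1]
    tail_start = syllables[-n][0]
    windowed = text[:head_end] + sep + text[tail_start:]
    shift = head_end + len(sep) - tail_start
    kept = []
    for span in spans:
        if span["end"] <= head_end:
            kept.append(dict(span))  # head offsets are unchanged
        elif span["start"] >= tail_start:
            kept.append({**span, "start": span["start"] + shift, "end": span["end"] + shift})
        # Spans that cross a window edge are dropped (an assumption).
    return windowed, kept
```

Running `window_text` over the export at several values of `n` and dividing the number of kept spans by the total gives the capture percentage the analysis report asks for.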
Story 1.3: BIO Tag Sequence Generation
- As a data engineer, I want to convert windowed text with span annotations into BIO-tagged syllable sequences, so that models can be trained on standard sequence labeling format.
- Acceptance Criteria:
- Each syllable tagged as B-TITLE, I-TITLE, B-AUTHOR, I-AUTHOR, B-TRANSLATOR, I-TRANSLATOR, or O
- Overlapping and adjacent spans handled by a documented rule (e.g., adjacent spans of the same label each start with their own B- tag)
- Validation: reconstruct original spans from BIO tags and verify match with source annotations
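The tagging and the round-trip validation could look roughly like this. The sketch assumes syllables arrive as (start, end) character offsets, that a span covers a syllable only when the syllable lies fully inside it, and that a span overlapping an already tagged syllable is skipped; that overlap policy is an assumption, not a decision recorded in this plan.

```python
def spans_to_bio(syllables, spans):
    """Tag each syllable with B-/I-<LABEL> or O."""
    tags = ["O"] * len(syllables)
    # Earlier and longer spans win when spans overlap (assumed policy).
    for span in sorted(spans, key=lambda s: (s["start"], -s["end"])):
        covered = [i for i, (start, end) in enumerate(syllables)
                   if start >= span["start"] and end <= span["end"]]
        if not covered or any(tags[i] != "O" for i in covered):
            continue
        tags[covered[0]] = "B-" + span["label"]
        for i in covered[1:]:
            tags[i] = "I-" + span["label"]
    return tags

def bio_to_spans(syllables, tags):
    """Rebuild character spans from a BIO tag sequence."""
    spans, current = [], None
    for (start, end), tag in zip(syllables, tags):
        label = tag[2:]
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and (current is None or current["label"] != label))
        if starts_new:
            if current:
                spans.append(current)
            current = {"label": label, "start": start, "end": end}
        elif tag.startswith("I-"):
            current["end"] = end  # extend the open span
        else:
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return spans
```

Validation then reduces to checking that `bio_to_spans(syllables, spans_to_bio(syllables, spans))` equals the source annotations, and logging every record where it does not.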
Story 1.4: Stratified Train/Val/Test Split
- As a data engineer, I want to create stratified 80/10/10 splits ensuring balanced representation of edge cases, so that evaluation is fair and representative.
- Acceptance Criteria:
- Stratification by: presence/absence of each label type, text length bucket, source type if available
- Split statistics report showing distribution balance across splits
- Reproducible split with fixed random seed
- Splits saved as separate files with consistent format
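A reproducible stratified split can be sketched with the standard library alone. The stratum here combines label presence with a text length bucket; the bucket edges (500 and 2000 characters), the lowercase label names, and the function names are placeholders chosen for illustration, and source type is left out.

```python
import random
from collections import defaultdict

LABELS = ("title", "author", "translator")

def stratum_key(record, length_edges=(500, 2000)):
    """Stratum = which labels are present plus a text length bucket."""
    present = tuple(any(s["label"] == label for s in record["spans"]) for label in LABELS)
    bucket = sum(len(record["text"]) >= edge for edge in length_edges)
    return present + (bucket,)

def stratified_split(records, seed=42, ratios=(0.8, 0.1, 0.1)):
    """Split each stratum 80/10/10 with a fixed seed; test takes the remainder."""
    groups = defaultdict(list)
    for record in records:
        groups[stratum_key(record)].append(record)
    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for key in sorted(groups):  # sorted so the result does not depend on dict order
        members = groups[key]
        rng.shuffle(members)
        n_train = int(len(members) * ratios[0])
        n_val = int(len(members) * ratios[1])
        splits["train"] += members[:n_train]
        splits["val"] += members[n_train:n_train + n_val]
        splits["test"] += members[n_train + n_val:]
    return splits
```

Very small strata cannot be split 80/10/10 exactly, so the split statistics report should flag any stratum with fewer than about ten records.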
Epic 2: Model Training -- Three-Architecture Experiment
Goal: Fine-tune BERT/ModernBERT, T5, and Gemma 4 on the prepared dataset to compare architectures empirically.
Story 2.1: BERT/ModernBERT Token Classification
- As an ML engineer, I want to fine-tune a Tibetan-adapted BERT or ModernBERT model with a BIO token classification head on the training data, so that I have a baseline encoder-only model for span extraction.
- Acceptance Criteria:
- Load pre-trained Tibetan BERT / ModernBERT from HuggingFace
- Add linear token classification head (7 classes: B/I for 3 labels + O)
- Hyperparameter tuning: learning rate, batch size, epochs, warmup
- Training logs with loss curves and validation metrics per epoch
- Best checkpoint saved based on validation F1
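The seven-class label set, and one common way of aligning syllable-level tags to subword tokens, can be sketched as follows. The sketch assumes the tokenizer exposes a word-index list per token (as the `word_ids()` method of fast tokenizers in Hugging Face Transformers does) and that only the first subword of each syllable carries a label; the value -100 is the default ignore index of PyTorch's cross-entropy loss. Whether this alignment fits the chosen Tibetan model is an open assumption.

```python
LABELS = ["O", "B-TITLE", "I-TITLE", "B-AUTHOR", "I-AUTHOR",
          "B-TRANSLATOR", "I-TRANSLATOR"]
LABEL2ID = {label: i for i, label in enumerate(LABELS)}
IGNORE_INDEX = -100  # positions with this label are skipped by the loss

def align_labels(word_ids, syllable_tags):
    """Map syllable-level BIO tags onto subword tokens.

    word_ids holds, per token, the index of the syllable it came from,
    or None for special tokens.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            # Special token or continuation subword: no label.
            aligned.append(IGNORE_INDEX)
        else:
            aligned.append(LABEL2ID[syllable_tags[word_id]])
        previous = word_id
    return aligned
```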
Story 2.2: T5 Sequence-to-Sequence Fine-Tuning
- As an ML engineer, I want to fine-tune the Tibetan T5 model to generate structured bibliographic output from windowed text input, so that I can compare seq2seq vs token classification approaches.
- Acceptance Criteria:
- Design input/output format (e.g., input: windowed text, output: "Title: ... | Author: ... | Translator: ...")
- Two-stage: continued pre-training on Tibetan corpus if needed, then task fine-tuning
- Span realignment logic: map generated text back to character offsets in source
- Training logs and best checkpoint saved
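Span realignment for the example output format could start from a verbatim search, as sketched here. It assumes the generated string follows the "Title: ... | Author: ... | Translator: ..." pattern, that values never contain "|" or ":", and that values not found verbatim in the source are dropped; a production version would probably need fuzzy matching. The function names are hypothetical.

```python
def parse_output(generated):
    """Split 'Title: ... | Author: ...' into a {label: value} mapping."""
    fields = {}
    for part in generated.split("|"):
        key, sep, value = part.partition(":")
        if sep and value.strip():
            fields[key.strip().lower()] = value.strip()
    return fields

def realign(source, fields):
    """Locate each generated value in the source and return character spans."""
    spans = []
    for label, value in fields.items():
        start = source.find(value)
        if start == -1:
            continue  # not found verbatim; dropped in this sketch
        spans.append({"label": label, "start": start, "end": start + len(value)})
    return spans
```

Taking the first occurrence is a simplification: a title that also appears in a colophon would need a rule for choosing between matches.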
Story 2.3: Gemma 4 Fine-Tuning
- As an ML engineer, I want to fine-tune Gemma 4 on the span extraction task, so that I can evaluate whether a larger generative model provides higher precision.
- Acceptance Criteria:
- LoRA or QLoRA fine-tuning to manage model size
- Prompt design for structured span extraction
- Two-stage training if beneficial
- Span realignment logic from generated output to source offsets
- Training logs and best checkpoint saved
Story 2.4: Training Infrastructure Setup
- As an ML engineer, I want to set up reproducible training infrastructure (GPU environment, dependency management, experiment tracking), so that all three experiments are comparable and reproducible.
- Acceptance Criteria:
- Consistent training environment (Docker or conda) with pinned dependencies
- Experiment tracking (W&B, MLflow, or Trackio) logging hyperparameters, metrics, artifacts
- GPU provisioning plan (HuggingFace Jobs, cloud, or local)
Epic 3: Evaluation Framework -- Standardized Benchmark & Model Selection
Goal: Build a comprehensive evaluation pipeline to compare all three models on precision, error types, and operational metrics, then select the winner.
Story 3.1: Entity-Level Metrics Implementation
- As an ML engineer, I want to compute per-entity Precision, Recall, and F1 using both exact span match and partial span match on the test set, so that I can measure model accuracy at the entity level.
- Acceptance Criteria:
- Exact match: predicted span must match gold span boundaries perfectly
- Partial match: credit for overlapping spans (e.g., IoU-based)
- Per-label breakdown: separate scores for Title, Author, Translator
- Micro and macro averages
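The criteria above can be sketched in one function where the IoU threshold selects the matching mode: a threshold of 1.0 is exact match, anything lower gives partial credit. Spans are assumed to carry a `doc` field identifying their text, and matching is greedy one-to-one; both are assumptions of this sketch.

```python
def iou(a, b):
    """Intersection over union of two character spans."""
    inter = max(0, min(a["end"], b["end"]) - max(a["start"], b["start"]))
    union = (a["end"] - a["start"]) + (b["end"] - b["start"]) - inter
    return inter / union if union else 0.0

def prf(tp, n_pred, n_gold):
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

def evaluate(gold, pred, min_iou=1.0):
    """Per-label, micro, and macro scores; min_iou=1.0 means exact match."""
    labels = sorted({s["label"] for s in gold + pred})
    report, total_tp = {}, 0
    for label in labels:
        g = [s for s in gold if s["label"] == label]
        p = [s for s in pred if s["label"] == label]
        used, tp = set(), 0
        for ps in p:
            for i, gs in enumerate(g):
                # Each gold span can be matched by at most one prediction.
                if i not in used and gs["doc"] == ps["doc"] and iou(gs, ps) >= min_iou:
                    used.add(i)
                    tp += 1
                    break
        total_tp += tp
        report[label] = prf(tp, len(p), len(g))
    macro_f1 = sum(r["f1"] for r in report.values()) / len(report) if report else 0.0
    report["micro"] = prf(total_tp, len(pred), len(gold))
    report["macro_f1"] = macro_f1
    return report
```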
Story 3.2: Error Taxonomy Classification
- As an ML engineer, I want to categorize every test-set error into types (missed entity, wrong entity type, wrong boundary, hallucinated entity), so that I understand each model's failure patterns.
- Acceptance Criteria:
- Automated error classifier that buckets each prediction error
- Error distribution report per model
- Confusion matrix: title vs author vs translator misclassifications
- Sample errors exported for manual inspection
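A rule-based version of the automated error classifier might look like the sketch below, applied to the gold and predicted spans of a single text. The precedence of the rules (exact match, then same-label overlap, then any overlap, then no overlap) is an assumption made for illustration.

```python
def overlaps(a, b):
    """True when two end-exclusive character spans share at least one position."""
    return a["start"] < b["end"] and b["start"] < a["end"]

def classify_errors(gold, pred):
    """Bucket every error for one text into the four taxonomy types."""
    errors = []
    for g in gold:
        hits = [p for p in pred if overlaps(g, p)]
        exact = any(p["label"] == g["label"] and p["start"] == g["start"]
                    and p["end"] == g["end"] for p in hits)
        if exact:
            continue  # correct prediction, not an error
        if any(p["label"] == g["label"] for p in hits):
            kind = "wrong_boundary"
        elif hits:
            kind = "wrong_type"
        else:
            kind = "missed"
        errors.append((kind, g))
    for p in pred:
        if not any(overlaps(g, p) for g in gold):
            errors.append(("hallucinated", p))
    return errors
```

Counting the first element of each tuple per model gives the error distribution report, and the `wrong_type` cases feed the title/author/translator confusion matrix.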
Story 3.3: Operational Metrics & Composite Scoring
- As an ML engineer, I want to measure inference latency (p50, p95), model size, and memory footprint for each model on CPU, then compute a composite score weighting precision against these operational metrics, so that I can make a data-driven architecture decision.
- Acceptance Criteria:
- Latency benchmark on standardized CPU hardware with 100+ test texts
- Model size (parameters, disk size) and peak RAM usage
- Composite score formula (user-defined weights, e.g., 70% precision / 30% operational)
- Final comparison table and recommendation
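The latency percentiles and composite score can be sketched as follows. Percentiles use the nearest-rank method; the operational term is reduced to p95 latency measured against a latency budget, which is an assumption of this sketch (the plan leaves the exact formula user-defined, and model size or RAM could be folded in the same way). `predict` stands for any callable that runs one model on one text.

```python
import time

def percentile(values, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # integer ceiling of p * n / 100
    return ordered[max(rank, 1) - 1]

def benchmark(predict, texts):
    """Time predict() on each text and report p50 and p95 in seconds."""
    latencies = []
    for text in texts:
        start = time.perf_counter()
        predict(text)
        latencies.append(time.perf_counter() - start)
    return {"p50": percentile(latencies, 50), "p95": percentile(latencies, 95)}

def composite_score(precision, p95_seconds, budget_seconds,
                    w_precision=0.7, w_operational=0.3):
    """Weighted sum of precision and a latency-based operational score in [0, 1]."""
    operational = max(0.0, 1.0 - p95_seconds / budget_seconds)
    return w_precision * precision + w_operational * operational
```

With the example 70/30 weights, a model with precision 0.9 and a p95 latency of 3 s against a 30 s budget scores 0.7 × 0.9 + 0.3 × 0.9 = 0.9.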
Story 3.4: Model Selection Report
- As a project lead, I want a written comparison report with the recommendation for which model to deploy, so that stakeholders understand the tradeoff and decision rationale.
- Acceptance Criteria:
- Side-by-side metrics table for all three architectures
- Error analysis summary highlighting each model's strengths/weaknesses
- Deployment feasibility assessment (CPU inference viability)
- Clear recommendation with justification
Epic 4: API Service -- Real-Time Prediction Endpoint
Goal: Deploy the selected model as a FastAPI REST service with confidence thresholding for annotator use.
Hosting: HuggingFace Inference Endpoints (GPU or CPU). Automatic scale-to-zero when not in use -- no charges during idle time.
Story 4.1: Model Serving Setup on HuggingFace
- As a backend engineer, I want to deploy the selected fine-tuned model as a HuggingFace Inference Endpoint with a custom FastAPI handler that accepts Tibetan text and returns predicted spans, so that annotators can get real-time predictions with zero idle-time cost.
- Acceptance Criteria:
- POST endpoint: accepts raw Tibetan text
- Response: list of {label, start, end, confidence, predicted_text} objects
- Hosted on HuggingFace Inference Endpoints (GPU or CPU tier based on model selection)
- Scale-to-zero enabled: endpoint auto-sleeps when not in use, no charges during idle
- Cold start latency documented (time to wake from zero)
- Response time under 30 seconds for typical text lengths (excluding cold start)
- Health check endpoint
Story 4.2: Confidence Thresholding
- As a backend engineer, I want to implement per-label confidence thresholds that filter out low-confidence predictions, so that annotators only see high-precision suggestions.
- Acceptance Criteria:
- Configurable threshold per label type (title, author, translator)
- Threshold values determined from validation set precision-recall curves
- API response indicates which predictions passed/failed threshold
- Threshold values adjustable via configuration without redeployment
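Per-label thresholding, and one way of deriving a threshold from validation data, can be sketched as below. `pick_threshold` returns the lowest confidence cut-off whose cumulative precision still meets a target, which is one reading of "determined from precision-recall curves"; the target value and the function names are assumptions.

```python
def pick_threshold(scored, target_precision):
    """Lowest confidence cut-off that still meets the target precision.

    scored: list of (confidence, is_correct) pairs from the validation set.
    Returns None when no cut-off reaches the target.
    """
    ordered = sorted(scored, key=lambda item: -item[0])
    best, correct = None, 0
    for count, (confidence, is_correct) in enumerate(ordered, start=1):
        correct += is_correct
        if correct / count >= target_precision:
            best = confidence
    return best

def apply_thresholds(predictions, thresholds):
    """Mark each prediction with whether it passed its label's threshold."""
    return [
        {**p, "passed_threshold": p["confidence"] >= thresholds[p["label"]]}
        for p in predictions
    ]
```

Reading `thresholds` from a configuration file or environment variable at request time, rather than at import time, is one way to meet the "adjustable without redeployment" criterion.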
Story 4.3: Windowed Text Processing in API
- As a backend engineer, I want the API to automatically apply windowed extraction (first/last N syllables) before model inference, so that callers can send full text without pre-processing.
- Acceptance Criteria:
- Automatic tsheg/shad tokenization and windowing
- Span offsets in response mapped back to original full-text positions
- Consistent with the windowing logic used during training
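Mapping offsets back to the full text is the inverse of the training-time remapping. The sketch below assumes the windowed input was built as head + separator + tail, and that the API keeps three numbers per request: where the head ends in the original text (`head_end`), where the tail starts (`tail_start`), and the separator length (`sep_len`). These parameter names are hypothetical.

```python
def to_full_text_offsets(span, head_end, tail_start, sep_len):
    """Map a span predicted on windowed text back to full-text offsets.

    Returns None for a span that straddles the head/tail junction,
    since it has no contiguous counterpart in the original text.
    """
    tail_begin_in_window = head_end + sep_len
    if span["end"] <= head_end:
        return dict(span)  # head offsets are identical in both texts
    if span["start"] >= tail_begin_in_window:
        shift = tail_start - tail_begin_in_window
        return {**span, "start": span["start"] + shift, "end": span["end"] + shift}
    return None
```

For texts short enough to be sent to the model in full, no mapping is needed and predicted offsets can be returned unchanged.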
Story 4.4: API Documentation & Integration Guide
- As a backend engineer, I want OpenAPI/Swagger documentation and an integration guide, so that the annotation tool team can integrate the API.
- Acceptance Criteria:
- Auto-generated OpenAPI spec from FastAPI
- Example requests/responses for each endpoint
- Error handling documentation
- Integration guide with code samples
Epic 5: Deployment & Feedback Loop
Goal: Deploy the API to production, integrate with the annotation tool, and establish a manual retraining workflow.
Story 5.1: Production Deployment on HuggingFace
- As a DevOps engineer, I want to configure the HuggingFace Inference Endpoint for production use with monitoring, so that annotators can use it reliably.
- Acceptance Criteria:
- HuggingFace Endpoint configured with appropriate instance type (GPU or CPU based on model selection)
- Scale-to-zero enabled with acceptable cold start time documented
- Basic monitoring via HuggingFace dashboard: uptime, request count, latency, error rate
- Logging of predictions for audit and retraining data collection
- Graceful error handling and auto-restart policy
Story 5.2: Annotation Tool Integration
- As a frontend engineer, I want to integrate the prediction API with the existing annotation tool, so that annotators see model suggestions inline while they work.
- Acceptance Criteria:
- API called when annotator opens a text for annotation
- Predicted spans displayed as suggestions that can be accepted, adjusted, or deleted
- Annotator corrections saved alongside original predictions for tracking
Story 5.3: Prediction vs. Correction Tracking
- As a data engineer, I want to log model predictions alongside annotator final decisions, so that correction data can be used for future retraining.
- Acceptance Criteria:
- Each annotation session records: model predictions, annotator final spans, diffs
- Export tool to extract correction data in training-ready format
- Basic dashboard or report: model accuracy vs. annotator corrections over time
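The per-session diff could be computed along these lines. The four buckets (accepted, adjusted, deleted, added) mirror the annotator actions listed in Story 5.2 plus spans the annotator created from scratch; treating a same-label overlap as an adjustment is an assumption of this sketch.

```python
def _key(span):
    return (span["label"], span["start"], span["end"])

def _overlap(a, b):
    return a["label"] == b["label"] and a["start"] < b["end"] and b["start"] < a["end"]

def diff_spans(predicted, final):
    """Compare model predictions with the annotator's final spans."""
    final_keys = {_key(f) for f in final}
    pred_keys = {_key(p) for p in predicted}
    diff = {"accepted": [], "adjusted": [], "deleted": [], "added": []}
    for p in predicted:
        if _key(p) in final_keys:
            diff["accepted"].append(_key(p))
            continue
        # An overlapping final span with the same label counts as an adjustment.
        partner = next((f for f in final
                        if _overlap(p, f) and _key(f) not in pred_keys), None)
        if partner is not None:
            diff["adjusted"].append((_key(p), _key(partner)))
        else:
            diff["deleted"].append(_key(p))
    adjusted_targets = {after for _, after in diff["adjusted"]}
    for f in final:
        if _key(f) not in pred_keys and _key(f) not in adjusted_targets:
            diff["added"].append(_key(f))
    return diff
```

The share of accepted spans per week is a simple accuracy-over-time signal for the dashboard, and the final spans are what the export tool would emit as training data.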
Story 5.4: Manual Retraining Runbook
- As an ML engineer, I want a documented runbook for retraining the model with accumulated corrected data, so that anyone on the team can trigger a model improvement cycle.
- Acceptance Criteria:
- Step-by-step guide: export corrections, merge with existing training data, retrain, evaluate, deploy
- Criteria for when to retrain (e.g., every 5K new corrections)
- Validation checklist: new model must beat current model on test set before deployment