Text title and author detection model #366

## Epics and Stories

### Epic 1: Data Pipeline -- From Database to Training-Ready BIO Sequences

**Goal:** Transform 44,777 human-annotated database records into clean, windowed, BIO-tagged training data with stratified splits.

**Story 1.1: Data Export & Exploration**
- As a data engineer, I want to export all 44,777 annotated records (text + span offsets for title, author, translator) from the database into a structured format (JSONL or similar), so that I can analyze and prepare training data.
- **Acceptance Criteria:**
  - Export script pulls text, span start/end, and label type for each annotation
  - Output format includes text body, list of spans with {label, start, end}
  - Basic statistics generated: label distribution, span length distribution, text length distribution
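
A minimal sketch of one possible export record and the statistics pass, assuming JSONL with one record per line. The field names `text`, `spans`, `label`, `start`, and `end` follow the acceptance criteria above; the function names and the choice of character offsets are assumptions, not a fixed schema.

```python
import json
from collections import Counter
from statistics import mean


def load_records(lines):
    """Parse JSONL lines into dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def basic_stats(records):
    """Summarize label counts, span lengths, and text lengths (in characters)."""
    label_counts = Counter()
    span_lengths = []
    text_lengths = []
    for rec in records:
        text_lengths.append(len(rec["text"]))
        for span in rec["spans"]:
            label_counts[span["label"]] += 1
            span_lengths.append(span["end"] - span["start"])
    return {
        "records": len(records),
        "label_counts": dict(label_counts),
        "mean_span_length": mean(span_lengths) if span_lengths else 0,
        "mean_text_length": mean(text_lengths) if text_lengths else 0,
    }
```

Full distributions (histograms or percentiles) could be built from the same `span_lengths` and `text_lengths` lists.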

**Story 1.2: Windowed Text Extraction**
- As a data engineer, I want to extract the first N and last N syllables from each text (splitting at tsheg/shad boundaries), so that the model focuses on regions where bibliographic metadata appears.
- **Acceptance Criteria:**
  - Configurable window size (e.g., first 200 / last 200 syllables)
  - Span offsets remapped correctly to windowed text positions
  - Analysis report: what percentage of title/author/translator spans are fully captured at various window sizes
  - Handle edge cases: texts shorter than 2*N syllables (use full text)
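
The windowing and offset remapping could look roughly like the sketch below. It assumes a syllable is a run of characters followed by any tsheg (U+0F0B) or shad (U+0F0D) marks, that spans are character offsets, and that spans crossing the cut between the two windows are dropped; all function names are hypothetical.

```python
import re

TSHEG = "\u0f0b"  # syllable delimiter
SHAD = "\u0f0d"   # clause delimiter
# A syllable: non-delimiter characters plus trailing delimiters, or a bare delimiter run.
_SYLLABLE = re.compile(f"[^{TSHEG}{SHAD}]+[{TSHEG}{SHAD}]*|[{TSHEG}{SHAD}]+")


def syllable_spans(text):
    """Return (start, end) character offsets for each syllable."""
    return [m.span() for m in _SYLLABLE.finditer(text)]


def window_text(text, spans, n=200):
    """Keep the first n and last n syllables and remap span offsets."""
    syls = syllable_spans(text)
    if len(syls) <= 2 * n:
        return text, [dict(s) for s in spans]  # short text: use the full text
    head_end = syls[n - 1][1]    # end of the first window
    tail_start = syls[-n][0]     # start of the last window
    shift = tail_start - head_end
    windowed = text[:head_end] + text[tail_start:]
    remapped = []
    for s in spans:
        if s["end"] <= head_end:
            remapped.append(dict(s))
        elif s["start"] >= tail_start:
            remapped.append({**s, "start": s["start"] - shift, "end": s["end"] - shift})
        # Spans that touch the removed middle are dropped (assumption).
    return windowed, remapped
```

Counting how many gold spans survive `window_text` at several values of `n` would give the capture-rate analysis listed in the acceptance criteria.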

**Story 1.3: BIO Tag Sequence Generation**
- As a data engineer, I want to convert windowed text with span annotations into BIO-tagged syllable sequences, so that models can be trained on standard sequence labeling format.
- **Acceptance Criteria:**
  - Each syllable tagged as B-TITLE, I-TITLE, B-AUTHOR, I-AUTHOR, B-TRANSLATOR, I-TRANSLATOR, or O
  - Overlapping and adjacent spans handled by a documented rule (e.g., adjacent same-label spans each begin with their own B- tag)
  - Validation: reconstruct original spans from BIO tags and verify match with source annotations
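
As an illustration of the tagging and the round-trip validation, here is a small sketch. It assumes the same tsheg/shad syllable rule as the windowing step, resolves overlaps by letting the earlier span keep a contested syllable, and reproduces gold offsets exactly only when they fall on syllable boundaries. Function names are hypothetical.

```python
import re

_SYLLABLE = re.compile("[^\u0f0b\u0f0d]+[\u0f0b\u0f0d]*|[\u0f0b\u0f0d]+")


def bio_tags(text, spans):
    """Tag each syllable with B-/I-<LABEL> or O; earlier spans win on overlap."""
    syls = [m.span() for m in _SYLLABLE.finditer(text)]
    tags = ["O"] * len(syls)
    for span in sorted(spans, key=lambda s: s["start"]):
        first = True
        for i, (start, end) in enumerate(syls):
            overlaps = start < span["end"] and end > span["start"]
            if overlaps and tags[i] == "O":
                tags[i] = ("B-" if first else "I-") + span["label"]
                first = False
    return syls, tags


def decode_spans(syls, tags):
    """Rebuild character spans from syllable offsets and BIO tags."""
    spans, current = [], None
    for (start, end), tag in zip(syls, tags):
        label = tag[2:]
        continues = tag.startswith("I-") and current and current["label"] == label
        if continues:
            current["end"] = end
            continue
        if current:
            spans.append(current)
        # B- tags (and orphan I- tags) open a new span; O closes the current one.
        current = None if tag == "O" else {"label": label, "start": start, "end": end}
    if current:
        spans.append(current)
    return spans
```

Comparing `decode_spans(*bio_tags(text, spans))` with the source annotations gives the validation check described above.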

**Story 1.4: Stratified Train/Val/Test Split**
- As a data engineer, I want to create stratified 80/10/10 splits ensuring balanced representation of edge cases, so that evaluation is fair and representative.
- **Acceptance Criteria:**
  - Stratification by: presence/absence of each label type, text length bucket, source type if available
  - Split statistics report showing distribution balance across splits
  - Reproducible split with fixed random seed
  - Splits saved as separate files with consistent format
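
One way to get a reproducible stratified split with only the standard library is sketched below. The stratum key (label presence plus a text-length bucket), the bucket edges, and the seed value are illustrative assumptions; very small strata may land entirely in the training split.

```python
import random
from collections import defaultdict

LABELS = ("TITLE", "AUTHOR", "TRANSLATOR")


def stratum_key(rec, length_edges=(500, 2000)):
    """Combine label presence flags with a text-length bucket index."""
    present = {s["label"] for s in rec["spans"]}
    flags = tuple(label in present for label in LABELS)
    bucket = sum(len(rec["text"]) >= edge for edge in length_edges)
    return flags + (bucket,)


def stratified_split(records, seed=42, ratios=(0.8, 0.1, 0.1)):
    """Shuffle each stratum with a fixed seed and cut it 80/10/10."""
    groups = defaultdict(list)
    for rec in records:
        groups[stratum_key(rec)].append(rec)
    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for key in sorted(groups):  # fixed iteration order keeps the split reproducible
        items = groups[key]
        rng.shuffle(items)
        n_train = round(len(items) * ratios[0])
        n_val = round(len(items) * ratios[1])
        splits["train"] += items[:n_train]
        splits["val"] += items[n_train:n_train + n_val]
        splits["test"] += items[n_train + n_val:]
    return splits
```

Counting `stratum_key` values per split would produce the balance report mentioned in the acceptance criteria.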

---

### Epic 2: Model Training -- Three-Architecture Experiment

**Goal:** Fine-tune BERT/ModernBERT, T5, and Gemma 4 on the prepared dataset to compare architectures empirically.

**Story 2.1: BERT/ModernBERT Token Classification**
- As an ML engineer, I want to fine-tune a Tibetan-adapted BERT or ModernBERT model with a BIO token classification head on the training data, so that I have a baseline encoder-only model for span extraction.
- **Acceptance Criteria:**
  - Load pre-trained Tibetan BERT / ModernBERT from HuggingFace
  - Add linear token classification head (7 classes: B/I for 3 labels + O)
  - Hyperparameter tuning: learning rate, batch size, epochs, warmup
  - Training logs with loss curves and validation metrics per epoch
  - Best checkpoint saved based on validation F1

**Story 2.2: T5 Sequence-to-Sequence Fine-Tuning**
- As an ML engineer, I want to fine-tune the Tibetan T5 model to generate structured bibliographic output from windowed text input, so that I can compare seq2seq vs token classification approaches.
- **Acceptance Criteria:**
  - Design input/output format (e.g., input: windowed text, output: "Title: ... | Author: ... | Translator: ...")
  - Two-stage: continued pre-training on Tibetan corpus if needed, then task fine-tuning
  - Span realignment logic: map generated text back to character offsets in source
  - Training logs and best checkpoint saved
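
A rough sketch of span realignment for the example output format above. It assumes the model emits `Label: value` fields separated by `|`, that the generated value appears verbatim in the source, and that the first occurrence is the intended one; values that cannot be found are left unaligned. Both function names are hypothetical.

```python
def parse_output(generated):
    """Split 'Title: ... | Author: ...' into {LABEL: value}, skipping empty fields."""
    fields = {}
    for part in generated.split("|"):
        key, sep, value = part.partition(":")
        if sep and value.strip():
            fields[key.strip().upper()] = value.strip()
    return fields


def realign(source, generated):
    """Map each generated value back to character offsets in the source text."""
    spans = []
    for label, value in parse_output(generated).items():
        start = source.find(value)
        if start == -1:
            continue  # not found verbatim: no offsets can be assigned
        spans.append({"label": label, "start": start, "end": start + len(value)})
    return spans
```

Exact substring search is the simplest option; fuzzy matching would be needed if the model paraphrases or normalizes the text, and that is not covered here.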

**Story 2.3: Gemma 4 Fine-Tuning**
- As an ML engineer, I want to fine-tune Gemma 4 on the span extraction task, so that I can evaluate whether a larger generative model provides higher precision.
- **Acceptance Criteria:**
  - LoRA or QLoRA fine-tuning to manage model size
  - Prompt design for structured span extraction
  - Two-stage training (continued pre-training, then task fine-tuning) if it improves validation metrics
  - Span realignment logic from generated output to source offsets
  - Training logs and best checkpoint saved

**Story 2.4: Training Infrastructure Setup**
- As an ML engineer, I want to set up reproducible training infrastructure (GPU environment, dependency management, experiment tracking), so that all three experiments are comparable and reproducible.
- **Acceptance Criteria:**
  - Consistent training environment (Docker or conda) with pinned dependencies
  - Experiment tracking (W&B, MLflow, or Trackio) logging hyperparameters, metrics, artifacts
  - GPU provisioning plan (HuggingFace Jobs, cloud, or local)

---

### Epic 3: Evaluation Framework -- Standardized Benchmark & Model Selection

**Goal:** Build a comprehensive evaluation pipeline to compare all three models on precision, error types, and operational metrics, then select the winner.

**Story 3.1: Entity-Level Metrics Implementation**
- As an ML engineer, I want to compute per-entity Precision, Recall, and F1 using both exact span match and partial span match on the test set, so that I can measure model accuracy at the entity level.
- **Acceptance Criteria:**
  - Exact match: predicted span must match gold span boundaries perfectly
  - Partial match: credit for overlapping spans (e.g., IoU-based)
  - Per-label breakdown: separate scores for Title, Author, Translator
  - Micro and macro averages
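
The exact and partial match criteria can be illustrated with one scoring function, where an IoU threshold of 1.0 means exact match. This is a sketch for a single text, assuming greedy one-to-one matching between same-label spans; the 0.5 partial-match threshold used in practice would be a project decision.

```python
def iou(a, b):
    """Intersection over union of two character spans."""
    inter = max(0, min(a["end"], b["end"]) - max(a["start"], b["start"]))
    union = (a["end"] - a["start"]) + (b["end"] - b["start"]) - inter
    return inter / union if union else 0.0


def span_prf(gold, pred, min_iou=1.0):
    """Precision, recall, F1; a prediction counts if IoU with a gold span >= min_iou."""
    matched = 0
    remaining = list(gold)
    for p in pred:
        candidates = [g for g in remaining if g["label"] == p["label"]]
        best = max(candidates, key=lambda g: iou(g, p), default=None)
        if best is not None and iou(best, p) >= min_iou:
            remaining.remove(best)  # each gold span can be matched only once
            matched += 1
    precision = matched / len(pred) if pred else 0.0
    recall = matched / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1
```

Per-label scores follow from filtering `gold` and `pred` by label before calling `span_prf`. Micro averages would sum matched, predicted, and gold counts over all texts, while macro averages would take the mean of the per-label scores.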

**Story 3.2: Error Taxonomy Classification**
- As an ML engineer, I want to categorize every test-set error into types (missed entity, wrong entity type, wrong boundary, hallucinated entity), so that I understand each model's failure patterns.
- **Acceptance Criteria:**
  - Automated error classifier that buckets each prediction error
  - Error distribution report per model
  - Confusion matrix: title vs author vs translator misclassifications
  - Sample errors exported for manual inspection
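
A possible shape for the automated error classifier is sketched below. The bucketing rules are assumptions: a prediction is compared with the first gold span it overlaps, a label mismatch takes priority over a boundary mismatch, and gold spans with no overlapping prediction count as missed.

```python
def classify_errors(gold, pred):
    """Return (error_type, span) pairs for one text."""
    def overlaps(a, b):
        return a["start"] < b["end"] and b["start"] < a["end"]

    errors = []
    touched = set()  # indices of gold spans overlapped by some prediction
    for p in pred:
        hit = next((i for i, g in enumerate(gold) if overlaps(g, p)), None)
        if hit is None:
            errors.append(("hallucinated_entity", p))
            continue
        touched.add(hit)
        g = gold[hit]
        if g["label"] != p["label"]:
            errors.append(("wrong_entity_type", p))
        elif (g["start"], g["end"]) != (p["start"], p["end"]):
            errors.append(("wrong_boundary", p))
        # Same label and same boundaries: correct prediction, nothing recorded.
    for i, g in enumerate(gold):
        if i not in touched:
            errors.append(("missed_entity", g))
    return errors
```

Counting the first element of each pair per model yields the error distribution report, and the `wrong_entity_type` pairs can feed the title/author/translator confusion matrix.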

**Story 3.3: Operational Metrics & Composite Scoring**
- As an ML engineer, I want to measure inference latency (p50, p95), model size, and memory footprint for each model on CPU, then compute a composite score weighting precision vs speed, so that I can make a data-driven architecture decision.
- **Acceptance Criteria:**
  - Latency benchmark on standardized CPU hardware with 100+ test texts
  - Model size (parameters, disk size) and peak RAM usage
  - Composite score formula (user-defined weights, e.g., 70% precision / 30% operational)
  - Final comparison table and recommendation
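
To make the latency and composite-score criteria concrete, here is a small sketch. It assumes `predict` is any callable that runs one inference, and that the operational score has already been normalized to the range 0 to 1 (how latency, size, and memory are normalized is left open); the 0.7 default mirrors the 70/30 example above.

```python
import statistics
import time


def latency_percentiles(predict, texts):
    """Time predict() on each text and return p50/p95 in seconds."""
    timings = []
    for text in texts:
        start = time.perf_counter()
        predict(text)
        timings.append(time.perf_counter() - start)
    # quantiles(n=100) returns 99 cut points; index 49 is p50, index 94 is p95.
    cuts = statistics.quantiles(timings, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94]}


def composite_score(precision, operational, w_precision=0.7):
    """Weighted sum of precision and a normalized operational score."""
    return w_precision * precision + (1 - w_precision) * operational
```

A warm-up call before timing is usually worth adding so that model loading does not distort the first measurement.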

**Story 3.4: Model Selection Report**
- As a project lead, I want a written comparison report with the recommendation for which model to deploy, so that stakeholders understand the tradeoff and decision rationale.
- **Acceptance Criteria:**
  - Side-by-side metrics table for all three architectures
  - Error analysis summary highlighting each model's strengths/weaknesses
  - Deployment feasibility assessment (CPU inference viability)
  - Clear recommendation with justification

---

### Epic 4: API Service -- Real-Time Prediction Endpoint

**Goal:** Deploy the selected model as a FastAPI REST service with confidence thresholding for annotator use.

**Hosting:** HuggingFace Inference Endpoints (GPU or CPU). Automatic scale-to-zero when not in use -- no charges during idle time.

**Story 4.1: Model Serving Setup on HuggingFace**
- As a backend engineer, I want to deploy the selected fine-tuned model as a HuggingFace Inference Endpoint with a custom FastAPI handler that accepts Tibetan text and returns predicted spans, so that annotators can get real-time predictions with zero idle-time cost.
- **Acceptance Criteria:**
  - POST endpoint: accepts raw Tibetan text
  - Response: list of {label, start, end, confidence, predicted_text} objects
  - Hosted on HuggingFace Inference Endpoints (GPU or CPU tier based on model selection)
  - Scale-to-zero enabled: endpoint auto-sleeps when not in use, no charges during idle
  - Cold start latency documented (time to wake from zero)
  - Response time under 30 seconds for typical text lengths (excluding cold start)
  - Health check endpoint

**Story 4.2: Confidence Thresholding**
- As a backend engineer, I want to implement per-label confidence thresholds that filter out low-confidence predictions, so that annotators only see high-precision suggestions.
- **Acceptance Criteria:**
  - Configurable threshold per label type (title, author, translator)
  - Threshold values determined from validation set precision-recall curves
  - API response indicates which predictions passed/failed threshold
  - Threshold values adjustable via configuration without redeployment
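
The thresholding logic might be sketched as follows. It assumes each prediction carries a `label` and `confidence` as in the Story 4.1 response, that validation predictions are available as `(confidence, is_correct)` pairs with distinct confidence values, and that the `passed_threshold` field name and the 0.9 precision target are illustrative choices.

```python
def pick_threshold(scored, target_precision=0.9):
    """Lowest confidence cutoff whose validation precision meets the target.

    scored: list of (confidence, is_correct) pairs for one label.
    Returns None if no cutoff reaches the target.
    """
    best = None
    correct = 0
    ranked = sorted(scored, reverse=True)  # highest confidence first
    for kept, (confidence, is_correct) in enumerate(ranked, start=1):
        correct += bool(is_correct)
        if correct / kept >= target_precision:
            best = confidence
    return best


def apply_thresholds(predictions, thresholds, default=0.5):
    """Annotate each prediction with whether it passed its label's threshold."""
    return [
        {**p, "passed_threshold": p["confidence"] >= thresholds.get(p["label"], default)}
        for p in predictions
    ]
```

Loading `thresholds` from a configuration file or environment variable at request time, rather than at import time, is one way to allow changes without redeployment.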

**Story 4.3: Windowed Text Processing in API**
- As a backend engineer, I want the API to automatically apply windowed extraction (first/last N syllables) before model inference, so that callers can send full text without pre-processing.
- **Acceptance Criteria:**
  - Automatic tsheg/shad tokenization and windowing
  - Span offsets in response mapped back to original full-text positions
  - Consistent with the windowing logic used during training
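
Mapping predicted offsets back to the full text is the inverse of the training-time remapping. The sketch below assumes the windowed text is the first window followed directly by the last window, and that the windowing step exposes the two cut positions `head_end` and `tail_start` in original-text coordinates; the function name is hypothetical.

```python
def to_original_offsets(span, head_end, tail_start):
    """Convert a span in windowed-text coordinates to full-text coordinates.

    Returns None for a span that straddles the join between the two windows,
    since it does not correspond to a contiguous region of the original text.
    """
    shift = tail_start - head_end  # length of the removed middle section
    if span["end"] <= head_end:
        return dict(span)  # inside the first window: offsets unchanged
    if span["start"] >= head_end:
        return {**span, "start": span["start"] + shift, "end": span["end"] + shift}
    return None
```

For texts short enough to be sent whole, setting both cut positions to the text length gives a shift of zero, so the same code path applies.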

**Story 4.4: API Documentation & Integration Guide**
- As a backend engineer, I want OpenAPI/Swagger documentation and an integration guide, so that the annotation tool team can integrate the API.
- **Acceptance Criteria:**
  - Auto-generated OpenAPI spec from FastAPI
  - Example requests/responses for each endpoint
  - Error handling documentation
  - Integration guide with code samples

---

### Epic 5: Deployment & Feedback Loop

**Goal:** Deploy the API to production, integrate with the annotation tool, and establish a manual retraining workflow.

**Story 5.1: Production Deployment on HuggingFace**
- As a DevOps engineer, I want to configure the HuggingFace Inference Endpoint for production use with monitoring, so that annotators can use it reliably.
- **Acceptance Criteria:**
  - HuggingFace Endpoint configured with appropriate instance type (GPU or CPU based on model selection)
  - Scale-to-zero enabled with acceptable cold start time documented
  - Basic monitoring via HuggingFace dashboard: uptime, request count, latency, error rate
  - Logging of predictions for audit and retraining data collection
  - Graceful error handling and auto-restart policy

**Story 5.2: Annotation Tool Integration**
- As a frontend engineer, I want to integrate the prediction API with the existing annotation tool, so that annotators see model suggestions inline while they work.
- **Acceptance Criteria:**
  - API called when annotator opens a text for annotation
  - Predicted spans displayed as suggestions that can be accepted, adjusted, or deleted
  - Annotator corrections saved alongside original predictions for tracking

**Story 5.3: Prediction vs. Correction Tracking**
- As a data engineer, I want to log model predictions alongside annotator final decisions, so that correction data can be used for future retraining.
- **Acceptance Criteria:**
  - Each annotation session records: model predictions, annotator final spans, diffs
  - Export tool to extract correction data in training-ready format
  - Basic dashboard or report: model accuracy vs. annotator corrections over time
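
The per-session diff could be computed along these lines. This is a sketch that compares spans by exact `(label, start, end)` identity, so an adjusted span shows up as one removal plus one addition; the category names are assumptions.

```python
def diff_spans(predicted, final):
    """Compare model predictions with the annotator's final spans."""
    def key(span):
        return (span["label"], span["start"], span["end"])

    pred_keys = {key(s) for s in predicted}
    final_keys = {key(s) for s in final}
    return {
        "accepted": sorted(pred_keys & final_keys),  # kept unchanged
        "removed": sorted(pred_keys - final_keys),   # deleted or adjusted away
        "added": sorted(final_keys - pred_keys),     # new or adjusted spans
    }
```

The share of `accepted` spans over time is one simple candidate for the accuracy-versus-corrections report.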

**Story 5.4: Manual Retraining Runbook**
- As an ML engineer, I want a documented runbook for retraining the model with accumulated corrected data, so that anyone on the team can trigger a model improvement cycle.
- **Acceptance Criteria:**
  - Step-by-step guide: export corrections, merge with existing training data, retrain, evaluate, deploy
  - Criteria for when to retrain (e.g., every 5K new corrections)
  - Validation checklist: new model must beat current model on test set before deployment

