text title and author detection model. #366

@gangagyatso4364

Epics and Stories

Epic 1: Data Pipeline -- From Database to Training-Ready BIO Sequences

Goal: Transform 44,777 human-annotated database records into clean, windowed, BIO-tagged training data with stratified splits.

Story 1.1: Data Export & Exploration

  • As a data engineer, I want to export all 44,777 annotated records (text + span offsets for title, author, translator) from the database into a structured format (JSONL or similar), so that I can analyze and prepare training data.
  • Acceptance Criteria:
    • Export script pulls text, span start/end, and label type for each annotation
    • Output format includes text body, list of spans with {label, start, end}
    • Basic statistics generated: label distribution, span length distribution, text length distribution
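
A minimal sketch of one possible export record and the basic statistics pass. The field names (`text`, `spans`, `label`, `start`, `end`) follow the acceptance criteria above; the function name `summarize`, the sample record, and the choice of character offsets with an exclusive `end` are assumptions made for illustration.

```python
import json
from collections import Counter

def summarize(jsonl_lines):
    """Collect label counts, span lengths, and text lengths from JSONL lines."""
    label_counts = Counter()
    span_lengths, text_lengths = [], []
    for line in jsonl_lines:
        record = json.loads(line)
        text_lengths.append(len(record["text"]))
        for span in record["spans"]:
            label_counts[span["label"]] += 1
            span_lengths.append(span["end"] - span["start"])
    return {
        "label_counts": dict(label_counts),
        "span_lengths": span_lengths,
        "text_lengths": text_lengths,
    }

# Hypothetical record; real text would be Tibetan.
sample_line = json.dumps({
    "text": "abcdefghij",
    "spans": [
        {"label": "title", "start": 0, "end": 4},
        {"label": "author", "start": 5, "end": 10},
    ],
})
stats = summarize([sample_line])
```

The raw length lists can then be bucketed into histograms for the distribution reports.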

Story 1.2: Windowed Text Extraction

  • As a data engineer, I want to extract the first N and last N syllables from each text (splitting at tsheg/shad boundaries), so that the model focuses on regions where bibliographic metadata appears.
  • Acceptance Criteria:
    • Configurable window size (e.g., first 200 / last 200 syllables)
    • Span offsets remapped correctly to windowed text positions
    • Analysis report: what percentage of title/author/translator spans are fully captured at various window sizes
    • Handle edge cases: texts shorter than 2*N syllables (use full text)
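
The windowing and offset remapping could look roughly like the sketch below. It assumes a syllable is a run of characters followed by any tsheg (U+0F0B), shad (U+0F0D), or whitespace, that offsets are character offsets with an exclusive end, and that spans crossing the cut are dropped. The names `window_text` and `remap_span` are hypothetical, and Latin letters stand in for Tibetan syllables.

```python
import re

TSHEG, SHAD = "\u0f0b", "\u0f0d"
# A syllable chunk: content characters plus any trailing delimiters.
SYLLABLE = re.compile(rf"[^{TSHEG}{SHAD}\s]+[{TSHEG}{SHAD}\s]*")

def window_text(text, n):
    """Keep the first and last n syllables; return (windowed, head_end, tail_start)."""
    syllables = [m.span() for m in SYLLABLE.finditer(text)]
    if len(syllables) <= 2 * n:
        # Short text: use the full text, so no offsets move.
        return text, len(text), len(text)
    head_end = syllables[n - 1][1]
    tail_start = syllables[-n][0]
    return text[:head_end] + text[tail_start:], head_end, tail_start

def remap_span(start, end, head_end, tail_start):
    """Map full-text offsets to windowed offsets; None if not fully captured."""
    if end <= head_end:
        return start, end
    if start >= tail_start:
        shift = tail_start - head_end
        return start - shift, end - shift
    return None

# Six placeholder syllables joined by tsheg and closed by a shad.
full = TSHEG.join("abcdef") + SHAD
windowed, head_end, tail_start = window_text(full, 2)
mapped = remap_span(8, 9, head_end, tail_start)  # the syllable "e"
```

Counting how often `remap_span` returns `None` per label at several values of `n` gives the capture-rate figures for the analysis report.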

Story 1.3: BIO Tag Sequence Generation

  • As a data engineer, I want to convert windowed text with span annotations into BIO-tagged syllable sequences, so that models can be trained on standard sequence labeling format.
  • Acceptance Criteria:
    • Each syllable tagged as B-TITLE, I-TITLE, B-AUTHOR, I-AUTHOR, B-TRANSLATOR, I-TRANSLATOR, or O
    • Handles overlapping or adjacent spans correctly
    • Validation: reconstruct original spans from BIO tags and verify match with source annotations
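
A minimal sketch of the tagging and the round-trip validation, under several assumptions: syllables are runs of characters between tsheg, shad, or whitespace; gold spans start and end on syllable boundaries (excluding the trailing delimiter); and when spans overlap, the span that starts first keeps the contested syllables. The function names are hypothetical and the sample uses Latin placeholders for Tibetan syllables.

```python
import re

TSHEG, SHAD = "\u0f0b", "\u0f0d"
SYLLABLE = re.compile("[^\u0f0b\u0f0d\\s]+")  # content only, no delimiters

def to_bio(text, spans):
    """Tag each syllable as B-X, I-X, or O. spans: (label, start, end) tuples."""
    syllables = [m.span() for m in SYLLABLE.finditer(text)]
    tags = ["O"] * len(syllables)
    for label, start, end in sorted(spans, key=lambda s: s[1]):
        # Untagged syllables that overlap this span.
        inside = [i for i, (s, e) in enumerate(syllables)
                  if s < end and e > start and tags[i] == "O"]
        for k, i in enumerate(inside):
            tags[i] = ("B-" if k == 0 else "I-") + label
    return syllables, tags

def from_bio(syllables, tags):
    """Rebuild (label, start, end) character spans from BIO tags."""
    spans, current = [], None
    for (s, e), tag in zip(syllables, tags):
        if tag.startswith("I-") and current and current[0] == tag[2:]:
            current[2] = e  # extend the open span
            continue
        if current:
            spans.append(tuple(current))
        current = [tag[2:], s, e] if tag != "O" else None
    if current:
        spans.append(tuple(current))
    return spans

text = "ka" + TSHEG + "kha" + TSHEG + "ga" + TSHEG + "nga" + SHAD
gold = [("TITLE", 0, 6), ("AUTHOR", 7, 9)]
syllables, tags = to_bio(text, gold)
roundtrip = from_bio(syllables, tags)
```

Because adjacent spans each begin with their own `B-` tag, two neighbouring entities of the same type stay separate after reconstruction. Records where `roundtrip` differs from the gold spans are candidates for manual review.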

Story 1.4: Stratified Train/Val/Test Split

  • As a data engineer, I want to create stratified 80/10/10 splits ensuring balanced representation of edge cases, so that evaluation is fair and representative.
  • Acceptance Criteria:
    • Stratification by: presence/absence of each label type, text length bucket, source type if available
    • Split statistics report showing distribution balance across splits
    • Reproducible split with fixed random seed
    • Splits saved as separate files with consistent format
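
One way to get a reproducible stratified split with only the standard library is sketched below. The stratum key (label presence plus a length bucket), the bucket edges, and the function names are assumptions; source type could be appended to the key if it is available.

```python
import random
from collections import defaultdict

def stratum_key(record, length_edges=(500, 2000)):
    """Presence of each label type plus a text-length bucket (hypothetical edges)."""
    labels = {span["label"] for span in record["spans"]}
    presence = tuple(name in labels for name in ("title", "author", "translator"))
    bucket = sum(len(record["text"]) >= edge for edge in length_edges)
    return presence + (bucket,)

def stratified_split(records, seed=42, ratios=(0.8, 0.1, 0.1)):
    """Split each stratum separately so every split sees every stratum."""
    groups = defaultdict(list)
    for record in records:
        groups[stratum_key(record)].append(record)
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    train, val, test = [], [], []
    for key in sorted(groups):
        items = groups[key]
        rng.shuffle(items)
        n_train = int(len(items) * ratios[0])
        n_val = int(len(items) * ratios[1])
        train += items[:n_train]
        val += items[n_train:n_train + n_val]
        test += items[n_train + n_val:]  # rounding leftovers land in test
    return train, val, test

# Two hypothetical strata of ten records each.
records = [{"id": i, "text": "x" * 10,
            "spans": [{"label": "title", "start": 0, "end": 5}]} for i in range(10)]
records += [{"id": i, "text": "y" * 10, "spans": []} for i in range(10, 20)]
train, val, test_set = stratified_split(records, seed=0)
```

Very small strata may end up entirely in one split because of rounding; merging rare strata before splitting is one option.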

Epic 2: Model Training -- Three-Architecture Experiment

Goal: Fine-tune BERT/ModernBERT, T5, and Gemma 4 on the prepared dataset to compare architectures empirically.

Story 2.1: BERT/ModernBERT Token Classification

  • As an ML engineer, I want to fine-tune a Tibetan-adapted BERT or ModernBERT model with a BIO token classification head on the training data, so that I have a baseline encoder-only model for span extraction.
  • Acceptance Criteria:
    • Load pre-trained Tibetan BERT / ModernBERT from HuggingFace
    • Add linear token classification head (7 classes: B/I for 3 labels + O)
    • Hyperparameter tuning: learning rate, batch size, epochs, warmup
    • Training logs with loss curves and validation metrics per epoch
    • Best checkpoint saved based on validation F1
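
The seven-class label set can be written down explicitly, as in this small sketch. The ordering of the labels is an assumption; any fixed ordering works as long as training and inference agree.

```python
ENTITY_TYPES = ["TITLE", "AUTHOR", "TRANSLATOR"]

# "O" plus a B- and I- tag for each entity type: 7 classes in total.
LABELS = ["O"] + [f"{prefix}-{name}" for name in ENTITY_TYPES for prefix in ("B", "I")]

id2label = dict(enumerate(LABELS))
label2id = {label: i for i, label in id2label.items()}
```

With the Hugging Face `transformers` library, mappings like these are typically passed as `num_labels`, `id2label`, and `label2id` when loading a model through `AutoModelForTokenClassification.from_pretrained`.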

Story 2.2: T5 Sequence-to-Sequence Fine-Tuning

  • As an ML engineer, I want to fine-tune the Tibetan T5 model to generate structured bibliographic output from windowed text input, so that I can compare seq2seq vs token classification approaches.
  • Acceptance Criteria:
    • Design input/output format (e.g., input: windowed text, output: "Title: ... | Author: ... | Translator: ...")
    • Two-stage: continued pre-training on Tibetan corpus if needed, then task fine-tuning
    • Span realignment logic: map generated text back to character offsets in source
    • Training logs and best checkpoint saved
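
A minimal sketch of the target format and the realignment step, using the example output format from the criteria above. It assumes field values never contain the separator `" | "`, that each label occurs at most once per text, and that realignment takes the first exact occurrence in the source; generated strings that do not appear verbatim are left unaligned. The function names are hypothetical.

```python
FIELDS = ("Title", "Author", "Translator")

def format_target(text, spans):
    """Serialize gold spans as 'Title: ... | Author: ... | Translator: ...'."""
    values = {name: "" for name in FIELDS}
    for span in spans:
        values[span["label"].capitalize()] = text[span["start"]:span["end"]]
    return " | ".join(f"{name}: {values[name]}" for name in FIELDS)

def realign(source, generated):
    """Map generated field values back to character offsets in the source."""
    spans = []
    for part in generated.split(" | "):
        label, _, value = part.partition(":")
        value = value.strip()
        if not value:
            continue  # empty field: nothing to align
        start = source.find(value)
        if start != -1:
            spans.append({"label": label.strip().lower(),
                          "start": start, "end": start + len(value)})
    return spans

source = "intro TITLE-TEXT by AUTHOR-NAME end"
gold = [{"label": "title", "start": 6, "end": 16},
        {"label": "author", "start": 20, "end": 31}]
target = format_target(source, gold)
recovered = realign(source, target)
```

The same realignment idea applies to any generative model whose output is free text rather than offsets; fuzzy matching would be needed to recover spans the model paraphrased or misspelled.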

Story 2.3: Gemma 4 Fine-Tuning

  • As an ML engineer, I want to fine-tune Gemma 4 on the span extraction task, so that I can evaluate whether a larger generative model provides higher precision.
  • Acceptance Criteria:
    • LoRA or QLoRA fine-tuning to manage model size
    • Prompt design for structured span extraction
    • Two-stage training if beneficial
    • Span realignment logic from generated output to source offsets
    • Training logs and best checkpoint saved

Story 2.4: Training Infrastructure Setup

  • As an ML engineer, I want to set up reproducible training infrastructure (GPU environment, dependency management, experiment tracking), so that all three experiments are comparable and reproducible.
  • Acceptance Criteria:
    • Consistent training environment (Docker or conda) with pinned dependencies
    • Experiment tracking (W&B, MLflow, or Trackio) logging hyperparameters, metrics, artifacts
    • GPU provisioning plan (HuggingFace Jobs, cloud, or local)

Epic 3: Evaluation Framework -- Standardized Benchmark & Model Selection

Goal: Build a comprehensive evaluation pipeline to compare all three models on precision, error types, and operational metrics, then select the winner.

Story 3.1: Entity-Level Metrics Implementation

  • As an ML engineer, I want to compute per-entity Precision, Recall, and F1 using both exact span match and partial span match on the test set, so that I can measure model accuracy at the entity level.
  • Acceptance Criteria:
    • Exact match: predicted span must match gold span boundaries perfectly
    • Partial match: credit for overlapping spans (e.g., IoU-based)
    • Per-label breakdown: separate scores for Title, Author, Translator
    • Micro and macro averages
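
Exact and partial matching can share one scoring routine, as in the sketch below. It assumes spans are `(label, start, end)` tuples with an exclusive end and that each gold span can be matched by at most one prediction; the IoU threshold of 0.5 in the example is a hypothetical choice.

```python
def iou(a, b):
    """Intersection over union of two (start, end) character ranges."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0

def prf(gold, pred, min_iou=1.0):
    """Precision, recall, F1. min_iou=1.0 is exact match; lower values give partial credit."""
    unmatched = list(gold)
    tp = 0
    for p in pred:
        for g in unmatched:
            if p[0] == g[0] and iou(p[1:], g[1:]) >= min_iou:
                tp += 1
                unmatched.remove(g)  # each gold span is matched once
                break
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [("title", 0, 10), ("author", 20, 30)]
pred = [("title", 0, 10), ("author", 22, 30)]
exact = prf(gold, pred)
partial = prf(gold, pred, min_iou=0.5)
```

Pooling all spans before calling `prf` gives micro-averaged scores. Filtering `gold` and `pred` to one label at a time gives the per-label breakdown, and the mean of those per-label scores is the macro average.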

Story 3.2: Error Taxonomy Classification

  • As an ML engineer, I want to categorize every test-set error into types (missed entity, wrong entity type, wrong boundary, hallucinated entity), so that I understand each model's failure patterns.
  • Acceptance Criteria:
    • Automated error classifier that buckets each prediction error
    • Error distribution report per model
    • Confusion matrix: title vs author vs translator misclassifications
    • Sample errors exported for manual inspection
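
A rough sketch of an automated error classifier for the four categories named above. The decision rules are assumptions: a prediction that overlaps no gold span is hallucinated, one that overlaps a gold span of the same label but with different boundaries is a boundary error, one that overlaps only gold spans of other labels is a type error, and a gold span that no prediction overlaps is missed.

```python
def overlaps(a, b):
    """True if two (label, start, end) spans share at least one character."""
    return a[1] < b[2] and b[1] < a[2]

def classify_errors(gold, pred):
    """Return a list of (error_type, span) pairs for one text."""
    errors = []
    for p in pred:
        if p in gold:
            continue  # exact match, not an error
        hits = [g for g in gold if overlaps(p, g)]
        if not hits:
            errors.append(("hallucinated", p))
        elif any(g[0] == p[0] for g in hits):
            errors.append(("wrong_boundary", p))
        else:
            errors.append(("wrong_type", p))
    for g in gold:
        if not any(overlaps(g, p) for p in pred):
            errors.append(("missed", g))
    return errors

gold = [("title", 0, 10), ("author", 20, 30), ("translator", 40, 50)]
pred = [("title", 0, 8), ("translator", 20, 30), ("author", 60, 70)]
errors = classify_errors(gold, pred)
```

Counting the `wrong_type` pairs by (gold label, predicted label) yields the title/author/translator confusion matrix.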

Story 3.3: Operational Metrics & Composite Scoring

  • As an ML engineer, I want to measure inference latency (p50, p95), model size, and memory footprint for each model on CPU, then compute a composite score weighting precision vs speed, so that I can make a data-driven architecture decision.
  • Acceptance Criteria:
    • Latency benchmark on standardized CPU hardware with 100+ test texts
    • Model size (parameters, disk size) and peak RAM usage
    • Composite score formula (user-defined weights, e.g., 70% precision / 30% operational)
    • Final comparison table and recommendation
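
The latency percentiles and the composite score might be computed along these lines. The 70/30 weighting comes from the example above; everything else is an assumption, in particular the nearest-rank percentile definition and the way the operational score is normalized against a latency budget (`budget_seconds` is a hypothetical parameter).

```python
import math
import time

def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(q / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]

def measure_latencies(predict, texts):
    """Wall-clock seconds per call of predict(text)."""
    latencies = []
    for text in texts:
        start = time.perf_counter()
        predict(text)
        latencies.append(time.perf_counter() - start)
    return latencies

def composite_score(precision, p95_seconds, budget_seconds=30.0, w_precision=0.7):
    """Weighted sum of precision and a latency-based operational score in [0, 1]."""
    operational = min(1.0, budget_seconds / p95_seconds) if p95_seconds > 0 else 1.0
    return w_precision * precision + (1 - w_precision) * operational

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
p50, p95 = percentile(sample, 50), percentile(sample, 95)
latencies = measure_latencies(len, ["a", "bb", "ccc"])  # len stands in for a model
score = composite_score(precision=0.9, p95_seconds=60.0)
```

Model size and peak RAM could be folded into the operational term in the same way once their acceptable ranges are agreed.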

Story 3.4: Model Selection Report

  • As a project lead, I want a written comparison report with a recommendation for which model to deploy, so that stakeholders understand the trade-offs and the decision rationale.
  • Acceptance Criteria:
    • Side-by-side metrics table for all three architectures
    • Error analysis summary highlighting each model's strengths/weaknesses
    • Deployment feasibility assessment (CPU inference viability)
    • Clear recommendation with justification

Epic 4: API Service -- Real-Time Prediction Endpoint

Goal: Deploy the selected model as a FastAPI REST service with confidence thresholding for annotator use.
Hosting: HuggingFace Inference Endpoints (GPU or CPU). Automatic scale-to-zero when not in use -- no charges during idle time.

Story 4.1: Model Serving Setup on HuggingFace

  • As a backend engineer, I want to deploy the selected fine-tuned model as a HuggingFace Inference Endpoint with a custom FastAPI handler that accepts Tibetan text and returns predicted spans, so that annotators can get real-time predictions with zero idle-time cost.
  • Acceptance Criteria:
    • POST endpoint: accepts raw Tibetan text
    • Response: list of {label, start, end, confidence, predicted_text} objects
    • Hosted on HuggingFace Inference Endpoints (GPU or CPU tier based on model selection)
    • Scale-to-zero enabled: endpoint auto-sleeps when not in use, no charges during idle
    • Cold start latency documented (time to wake from zero)
    • Response time under 30 seconds for typical text lengths (excluding cold start)
    • Health check endpoint

Story 4.2: Confidence Thresholding

  • As a backend engineer, I want to implement per-label confidence thresholds that filter out low-confidence predictions, so that annotators only see high-precision suggestions.
  • Acceptance Criteria:
    • Configurable threshold per label type (title, author, translator)
    • Threshold values determined from validation set precision-recall curves
    • API response indicates which predictions passed/failed threshold
    • Threshold values adjustable via configuration without redeployment
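
A minimal sketch of per-label thresholding and of deriving a threshold from validation data. The threshold values, the `passed_threshold` field name, and the rule used in `pick_threshold` (the lowest confidence cutoff at which precision still meets a target) are all assumptions for illustration.

```python
# Hypothetical values; in practice these would come from a config file.
THRESHOLDS = {"title": 0.90, "author": 0.85, "translator": 0.85}

def apply_thresholds(predictions, thresholds):
    """Annotate each prediction with whether it passed its label's threshold."""
    return [dict(p, passed_threshold=p["confidence"] >= thresholds[p["label"]])
            for p in predictions]

def pick_threshold(scored, target_precision):
    """scored: (confidence, is_correct) pairs for one label on the validation set.

    Returns the lowest cutoff whose kept predictions reach the target precision,
    or None if no cutoff does.
    """
    best, correct = None, 0
    for kept, (confidence, ok) in enumerate(sorted(scored, reverse=True), start=1):
        correct += ok
        if correct / kept >= target_precision:
            best = confidence
    return best

predictions = [
    {"label": "title", "start": 0, "end": 12, "confidence": 0.95},
    {"label": "author", "start": 20, "end": 31, "confidence": 0.60},
]
annotated = apply_thresholds(predictions, THRESHOLDS)
validation = [(0.95, True), (0.9, True), (0.8, False), (0.7, True), (0.6, False)]
cutoff = pick_threshold(validation, target_precision=0.75)
```

Reading `THRESHOLDS` from a file or environment variable at request time, rather than hard-coding it, is one way to meet the no-redeployment criterion.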

Story 4.3: Windowed Text Processing in API

  • As a backend engineer, I want the API to automatically apply windowed extraction (first/last N syllables) before model inference, so that callers can send full text without pre-processing.
  • Acceptance Criteria:
    • Automatic tsheg/shad tokenization and windowing
    • Span offsets in response mapped back to original full-text positions
    • Consistent with the windowing logic used during training
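
Mapping predicted offsets back to the full text is a fixed shift once the window boundaries are known. The sketch below assumes the windowed text is the head (`full[:head_end]`) concatenated directly with the tail (`full[tail_start:]`), with exclusive end offsets; `to_full_text` is a hypothetical name, and Latin letters stand in for Tibetan syllables.

```python
TSHEG, SHAD = "\u0f0b", "\u0f0d"

def to_full_text(start, end, head_end, tail_start):
    """Map a span in the windowed text back to offsets in the original text."""
    if end <= head_end:
        return start, end  # span lies in the head, offsets are unchanged
    if start >= head_end:
        shift = tail_start - head_end  # characters removed from the middle
        return start + shift, end + shift
    return None  # span straddles the join between head and tail

# Six placeholder syllables; keep the first two and the last two.
full = TSHEG.join("abcdef") + SHAD
head_end, tail_start = 4, 8
windowed = full[:head_end] + full[tail_start:]

predicted = (4, 5)  # the syllable "e" in the windowed text
restored = to_full_text(*predicted, head_end, tail_start)
```

A prediction that straddles the join cannot correspond to a real span in the source, so discarding it is one reasonable policy. Inserting a separator between head and tail at both training and inference time would prevent such predictions, at the cost of one more offset to account for.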

Story 4.4: API Documentation & Integration Guide

  • As a backend engineer, I want OpenAPI/Swagger documentation and an integration guide, so that the annotation tool team can integrate the API.
  • Acceptance Criteria:
    • Auto-generated OpenAPI spec from FastAPI
    • Example requests/responses for each endpoint
    • Error handling documentation
    • Integration guide with code samples

Epic 5: Deployment & Feedback Loop

Goal: Deploy the API to production, integrate with the annotation tool, and establish a manual retraining workflow.

Story 5.1: Production Deployment on HuggingFace

  • As a DevOps engineer, I want to configure the HuggingFace Inference Endpoint for production use with monitoring, so that annotators can use it reliably.
  • Acceptance Criteria:
    • HuggingFace Endpoint configured with appropriate instance type (GPU or CPU based on model selection)
    • Scale-to-zero enabled with acceptable cold start time documented
    • Basic monitoring via HuggingFace dashboard: uptime, request count, latency, error rate
    • Logging of predictions for audit and retraining data collection
    • Graceful error handling and auto-restart policy

Story 5.2: Annotation Tool Integration

  • As a frontend engineer, I want to integrate the prediction API with the existing annotation tool, so that annotators see model suggestions inline while they work.
  • Acceptance Criteria:
    • API called when annotator opens a text for annotation
    • Predicted spans displayed as suggestions that can be accepted, adjusted, or deleted
    • Annotator corrections saved alongside original predictions for tracking

Story 5.3: Prediction vs. Correction Tracking

  • As a data engineer, I want to log model predictions alongside annotator final decisions, so that correction data can be used for future retraining.
  • Acceptance Criteria:
    • Each annotation session records: model predictions, annotator final spans, diffs
    • Export tool to extract correction data in training-ready format
    • Basic dashboard or report: model accuracy vs. annotator corrections over time
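
The per-session diff could be as simple as a set comparison, sketched below. It assumes spans are `(label, start, end)` tuples and treats an adjusted span as one removal plus one addition; the function name and the bucket names are hypothetical.

```python
def diff_session(predicted, final):
    """Compare model predictions with the annotator's final spans."""
    pred, fin = set(predicted), set(final)
    return {
        "accepted": sorted(pred & fin),         # kept unchanged
        "removed": sorted(pred - fin),          # deleted or adjusted away
        "added": sorted(fin - pred),            # new or adjusted spans
        "acceptance_rate": len(pred & fin) / len(pred) if pred else 0.0,
    }

predicted = [("title", 0, 10), ("author", 20, 30)]
final = [("title", 0, 10), ("author", 20, 32), ("translator", 40, 50)]
diff = diff_session(predicted, final)
```

Tracking `acceptance_rate` per session over time gives a simple view of model accuracy against annotator corrections.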

Story 5.4: Manual Retraining Runbook

  • As an ML engineer, I want a documented runbook for retraining the model with accumulated corrected data, so that anyone on the team can trigger a model improvement cycle.
  • Acceptance Criteria:
    • Step-by-step guide: export corrections, merge with existing training data, retrain, evaluate, deploy
    • Criteria for when to retrain (e.g., every 5K new corrections)
    • Validation checklist: new model must beat current model on test set before deployment
