TECHNICAL DEBT

Code quality issues, refactoring priorities, and technical improvements across all project phases.

Last Updated: November 2025 Priority Legend: 🔴 High | 🟡 Medium | 🟢 Low

Code Quality Issues

Database Layer

1. Complete MDB to PostgreSQL Migration 🔴

Issue: Legacy .mdb files still in use; incomplete migration to PostgreSQL
Impact: Poor query performance, difficult to maintain, no referential integrity
Effort: 40 hours
Priority: High (Phase 1 blocker)
Solution:
- Migrate all 3 MDB files (avall.mdb, Pre2008.mdb, PRE1982.MDB)
- Implement foreign key constraints
- Create materialized views for complex joins
- Add full-text search indexes on narratives table

2. Schema Normalization 🟡

Issue: Denormalized tables with duplicate data (e.g., aircraft makes/models)
Impact: Data inconsistencies, increased storage, update anomalies
Effort: 20 hours
Priority: Medium
Solution:
- Create lookup tables: aircraft_makes, aircraft_models, airports
- Normalize events, aircraft, and crew tables
- Add database constraints and triggers

3. Missing Indexes 🔴

Issue: Slow queries on frequently accessed columns (ev_id, Aircraft_Key, occurrence_code)
Impact: API response time >500ms for complex queries
Effort: 8 hours
Priority: High
Solution:
- Add B-tree indexes on foreign keys
- Add GIN indexes for full-text search
- Add composite indexes for common query patterns
- Monitor query performance with pg_stat_statements

API Layer

4. Missing Input Validation 🔴

Issue: API endpoints lack comprehensive input validation (SQL injection risk)
Impact: Security vulnerabilities, potential data corruption
Effort: 15 hours
Priority: High
Solution:
- Use Pydantic models for all request/response schemas
- Add regex validation for dates, IDs, codes
- Implement rate limiting and request size limits
- Add input sanitization for user-generated content

5. No Request Pagination 🟡

Issue: GET /events returns all records (100K+), causing timeouts
Impact: API performance degradation, poor UX
Effort: 10 hours
Priority: Medium
Solution:
- Implement cursor-based pagination (offset/limit)
- Add page size limits (max 1000 records)
- Return total count in headers
- Document pagination in API docs

6. Inconsistent Error Handling 🟡

Issue: Some endpoints return 500 errors instead of 400/404
Impact: Poor API experience, difficult debugging
Effort: 12 hours
Priority: Medium
Solution:
- Standardize error response format: {"error": "message", "code": 400}
- Use FastAPI exception handlers
- Add detailed error messages with context
- Log all errors to centralized logging (Sentry, CloudWatch)

Machine Learning

7. No Model Versioning 🔴

Issue: ML models not versioned; difficult to rollback if performance degrades
Impact: Risk of deploying broken models, no audit trail
Effort: 15 hours
Priority: High
Solution:
- Implement MLflow Model Registry
- Tag models with versions (v1.0, v1.1, etc.)
- Track model lineage: training data → hyperparameters → metrics
- Automate model deployment with CI/CD

8. Feature Engineering Not Modular 🟡

Issue: Feature engineering code scattered across notebooks; not reusable
Impact: Difficult to reproduce, training/serving skew
Effort: 25 hours
Priority: Medium
Solution:
- Create AviationFeatureEngineer class
- Implement fit/transform methods (scikit-learn API)
- Serialize feature pipelines with joblib
- Add unit tests for feature transformations

9. No Model Monitoring 🔴

Issue: Production models not monitored for drift, performance degradation
Impact: Silent model failures, inaccurate predictions
Effort: 30 hours
Priority: High
Solution:
- Implement Evidently AI for drift detection
- Track data drift (KS test, PSI)
- Track prediction drift (distribution shifts)
- Set up alerts (Slack/email) for drift detection

NLP & AI

10. BERT Model Not Optimized 🟡

Issue: SafeAeroBERT model size 400MB+, slow inference (500ms)
Impact: High latency for narrative classification
Effort: 20 hours
Priority: Medium
Solution:
- Apply model quantization (INT8, ONNX)
- Use DistilBERT or TinyBERT (50% smaller)
- Implement model caching (Redis)
- Batch inference for throughput

11. RAG Chunking Strategy 🟡

Issue: Fixed chunk size (512 tokens) doesn't work for all narratives
Impact: Loss of context, poor retrieval quality
Effort: 15 hours
Priority: Medium
Solution:
- Implement semantic chunking (split on sentences/paragraphs)
- Use overlapping chunks (50-100 token overlap)
- Experiment with chunk sizes (256, 512, 1024)
- A/B test chunking strategies

12. Knowledge Graph Disconnected 🟢

Issue: 20%+ of nodes disconnected (no relationships)
Impact: Incomplete graph, poor traversals
Effort: 18 hours
Priority: Low
Solution:
- Improve relationship extraction (LLM prompts)
- Add implicit relationships (co-occurrence, temporal)
- Run entity resolution to merge duplicates
- Validate graph connectivity

Refactoring Priorities

High Impact

1. Modularize ETL Pipeline 🔴

Current: Single monolithic Airflow DAG
Target: 5+ reusable DAGs (extract, transform, load, validate, feature_eng)
Benefit: Easier to debug, parallel execution, independent scaling
Effort: 35 hours

2. Separate API from Business Logic 🔴

Current: FastAPI routes contain business logic (queries, transformations)
Target: 3-tier architecture (API → Service → Repository)
Benefit: Testability, reusability, clean separation of concerns
Effort: 40 hours

3. Extract Configuration to Environment Variables 🔴

Current: Hardcoded config values (database URLs, API keys)
Target: Use .env files, Kubernetes ConfigMaps/Secrets
Benefit: Security, flexibility, easier deployment
Effort: 10 hours

Medium Impact

4. Replace mdbtools with DuckDB 🟡

Current: mdbtools for MDB extraction (slow, dependencies)
Target: Use DuckDB to query MDB files directly (10x faster)
Benefit: Faster extraction, fewer dependencies, SQL queries
Effort: 15 hours

5. Consolidate Duplicate Code 🟡

Current: Copy-pasted code for data validation, error handling
Target: Create shared utility libraries (ntsb_utils/)
Benefit: DRY principle, easier maintenance
Effort: 20 hours

6. Improve Notebook Organization 🟡

Current: 50+ Jupyter notebooks with inconsistent naming
Target: Organize by phase (eda/, modeling/, analysis/), add README
Benefit: Easier to find, reproducibility
Effort: 8 hours

Low Impact

7. Upgrade Python Dependencies 🟢

Current: Some packages 2-3 versions behind (pandas 1.5.3 → 2.2.0)
Target: Update to latest stable versions
Benefit: Bug fixes, performance improvements, new features
Effort: 12 hours (includes testing)

8. Standardize Logging 🟢

Current: Inconsistent logging (print statements, custom loggers)
Target: Use structlog with JSON formatting
Benefit: Centralized logging, easier to parse, better debugging
Effort: 10 hours

9. Add Type Hints 🟢

Current: 30% of Python code lacks type hints
Target: 90%+ coverage with mypy static type checking
Benefit: Better IDE support, catch bugs early
Effort: 15 hours

Legacy Code Migration

Phase 1: Database Layer

MDB → PostgreSQL (40 hours)
- Extract all tables with mdb-export
- Transform data (fix dates, clean nulls, normalize)
- Load with COPY command (10x faster than INSERT)
- Validate record counts, data types
Add Foreign Keys (10 hours)
- events → aircraft (ev_id)
- aircraft → engines (Aircraft_Key)
- events → findings (ev_id)
- Add ON DELETE CASCADE where appropriate
Create Materialized Views (8 hours)
- accident_summary_view (pre-joined events + aircraft + injuries)
- monthly_stats_view (aggregated monthly statistics)
- Refresh strategy: daily or on-demand

Phase 2: ETL Pipeline

Airflow DAGs Refactoring (35 hours)
- Split monolithic DAG into 5 modular DAGs
- Add task dependencies (TaskGroups)
- Implement dynamic DAG generation (for monthly updates)
- Add retries and alerting
Data Quality Framework (25 hours)
- Replace custom validation with Great Expectations
- Create 50+ expectations (data types, ranges, nulls)
- Add data quality dashboard (Grafana)
- Automate quality reports (weekly)

Phase 3: API Layer

FastAPI Refactoring (40 hours)
- Move business logic to service layer
- Create repository pattern for database access
- Add dependency injection
- Improve test coverage (>80%)
Authentication Overhaul (20 hours)
- Replace custom JWT with Auth0 or Clerk
- Add OAuth2 support (Google, GitHub)
- Implement RBAC with permissions

Performance Bottlenecks

Database

1. Slow Queries 🔴

Query: SELECT * FROM events WHERE narrative LIKE '%engine failure%'
Current: 5-10 seconds (full table scan)
Target: <100ms with full-text search index

Solution:

CREATE INDEX idx_narrative_fts ON events USING GIN(to_tsvector('english', narrative));

-- Query with full-text search
SELECT * FROM events WHERE to_tsvector('english', narrative) @@ to_tsquery('engine & failure');

2. Large Table Joins 🔴

Query: JOIN events + aircraft + findings (100K+ records)
Current: 2-5 seconds
Target: <500ms with materialized view

Solution:

CREATE MATERIALIZED VIEW accident_details AS
SELECT e.*, a.make, a.model, f.cause_factor
FROM events e
LEFT JOIN aircraft a ON e.ev_id = a.ev_id
LEFT JOIN findings f ON e.ev_id = f.ev_id;

-- Refresh daily
REFRESH MATERIALIZED VIEW CONCURRENTLY accident_details;

API

3. No Response Caching 🟡

Endpoint: GET /stats (expensive aggregation query)
Current: Computed on every request (2-3 seconds)
Target: <50ms with Redis caching

Solution:

import redis
import json

redis_client = redis.Redis(host='redis', port=6379)

@app.get("/stats")
async def get_stats():
    # Check cache
    cached = redis_client.get("stats")
    if cached:
        return json.loads(cached)

    # Compute stats
    stats = compute_stats()

    # Cache for 1 hour
    redis_client.setex("stats", 3600, json.dumps(stats))

    return stats

Machine Learning

4. Slow Feature Engineering 🟡

Current: Feature engineering on single CPU (5 minutes for 100K records)
Target: <1 minute with parallelization

Solution:

from joblib import Parallel, delayed

def engineer_features_batch(batch):
    return feature_engineer.transform(batch)

# Parallel processing
batches = np.array_split(df, 10)
results = Parallel(n_jobs=10)(delayed(engineer_features_batch)(batch) for batch in batches)
df_features = pd.concat(results)

5. ML Model Inference 🟡

Current: 200ms per prediction (XGBoost)
Target: <50ms with batching

Solution:

# Batch prediction API
@app.post("/ml/predict/batch")
async def predict_batch(requests: list[PredictionRequest]):
    X = pd.DataFrame([req.dict() for req in requests])
    X = feature_engineer.transform(X)

    # Batch prediction (10x faster than individual)
    predictions = model.predict_proba(X)

    return [
        {"fatal_probability": pred[1]}
        for pred in predictions
    ]

Testing Coverage Gaps

Current Coverage: 45%

Missing Tests

API Endpoints (20% coverage)
- Unit tests for request validation
- Integration tests for database queries
- Load tests for performance
Feature Engineering (30% coverage)
- Unit tests for each feature transformer
- Edge cases (nulls, outliers, invalid codes)
- Integration tests (end-to-end pipeline)
ML Models (40% coverage)
- Unit tests for model training
- Performance regression tests
- Data drift detection tests
ETL Pipeline (50% coverage)
- DAG validation tests
- Data quality tests
- Error handling tests

Target Coverage: 80%+

Testing Strategy:

# API endpoint test
def test_get_events():
    response = client.get("/events?limit=10")
    assert response.status_code == 200
    assert len(response.json()) <= 10

# Feature engineering test
def test_temporal_features():
    df = pd.DataFrame({'event_date': ['2024-01-15']})
    features = feature_engineer.create_temporal_features(df)
    assert 'month_sin' in features.columns
    assert -1 <= features['month_sin'].iloc[0] <= 1

# ML model test
def test_xgboost_prediction():
    X_test = pd.DataFrame({...})  # Sample features
    predictions = model.predict_proba(X_test)
    assert predictions.shape == (len(X_test), 2)
    assert np.allclose(predictions.sum(axis=1), 1.0)

Documentation Gaps

Missing Documentation

API Reference 🔴
- Complete OpenAPI specification
- Request/response examples for all endpoints
- Error codes and handling
Database Schema 🔴
- Entity-relationship diagrams (ERD)
- Table descriptions, column definitions
- Index and constraint documentation
Deployment Guide 🟡
- Kubernetes deployment steps
- Configuration management (Helm values)
- Troubleshooting guide
User Guides 🟡
- Quickstart tutorial (5 minutes)
- Advanced usage (ML predictions, RAG queries)
- Dashboard walkthrough
Developer Guide 🟢
- Project structure overview
- Contributing guidelines
- Code style guide (Black, isort, flake8)

Dependency Updates Needed

Critical (Security)

requests 2.28.2 → 2.31.0 (CVE-2023-32681)
pillow 9.5.0 → 10.1.0 (CVE-2023-44271)
transformers 4.30.0 → 4.36.0 (security fixes)

Important (Features)

pandas 1.5.3 → 2.2.0 (performance improvements, nullable dtypes)
polars 0.18.0 → 0.20.0 (10-20% faster)
fastapi 0.100.0 → 0.109.0 (new features, bug fixes)
pydantic 1.10.12 → 2.5.3 (v2 rewrite, 20x faster validation)

Nice-to-Have (Performance)

scikit-learn 1.3.0 → 1.4.0 (optimized algorithms)
torch 2.0.1 → 2.1.2 (faster inference, better CUDA support)
xgboost 1.7.6 → 2.0.3 (GPU training improvements)

Estimated Total Effort

Category	Hours
Database refactoring	78
API improvements	72
ML/NLP optimization	88
Testing & documentation	85
Dependency updates	30
Total	353 hours

Recommendation: Prioritize high-impact, high-priority items (🔴) first. Tackle over 2-3 months alongside new feature development.

Last Updated: November 2025 Version: 1.0

FilesExpand file tree

TECHNICAL_DEBT.md

Latest commit

History

TECHNICAL_DEBT.md

File metadata and controls

TECHNICAL DEBT

Code Quality Issues

Database Layer

1. Complete MDB to PostgreSQL Migration 🔴

2. Schema Normalization 🟡

3. Missing Indexes 🔴

API Layer

4. Missing Input Validation 🔴

5. No Request Pagination 🟡

6. Inconsistent Error Handling 🟡

Machine Learning

7. No Model Versioning 🔴

8. Feature Engineering Not Modular 🟡

9. No Model Monitoring 🔴

NLP & AI

10. BERT Model Not Optimized 🟡

11. RAG Chunking Strategy 🟡

12. Knowledge Graph Disconnected 🟢

Refactoring Priorities

High Impact

1. Modularize ETL Pipeline 🔴

2. Separate API from Business Logic 🔴

3. Extract Configuration to Environment Variables 🔴

Medium Impact

4. Replace mdbtools with DuckDB 🟡

5. Consolidate Duplicate Code 🟡

6. Improve Notebook Organization 🟡

Low Impact

7. Upgrade Python Dependencies 🟢

8. Standardize Logging 🟢

9. Add Type Hints 🟢

Legacy Code Migration

Phase 1: Database Layer

Phase 2: ETL Pipeline

Phase 3: API Layer

Performance Bottlenecks

Database

1. Slow Queries 🔴

2. Large Table Joins 🔴

API

3. No Response Caching 🟡

Machine Learning

4. Slow Feature Engineering 🟡

5. ML Model Inference 🟡

Testing Coverage Gaps

Current Coverage: 45%

Missing Tests

Target Coverage: 80%+

Documentation Gaps

Missing Documentation

Dependency Updates Needed

Critical (Security)

Important (Features)

Nice-to-Have (Performance)

Estimated Total Effort