# Financial AI MLOps - Comprehensive Architecture & Runbook

This document serves as the single source of truth for the Enterprise Financial Market Anomaly Detection MLOps system. It provides a detailed architectural walkthrough, deployment guidelines, MLOps workflow documentation, and operational troubleshooting playbooks.

---

## 1. System Architecture

The project implements a real-time streaming MLOps architecture built entirely on Databricks. It identifies anomalous financial market transactions, trading behavior, and price action using streaming data and a multi-model evaluation process.

### 1.1 Data Flow & Medallion Architecture

```mermaid
graph TD
    A1[Finnhub WebSockets] --> B[Bronze Layer<br/>Raw Ingestion]
    A2[Alpha Vantage REST] --> B

    subgraph Delta Live Tables
        B --> C[Silver Layer<br/>Cleaning & Filtering]
        C --> D[Gold Layer<br/>Feature Store & Aggregations]
    end

    D --> E[Model Training<br/>Tournament]
    D --> F[Model Serving<br/>Real-time Scoring]

    E --> G[(MLflow Registry)]
    G --> F

    F --> H[Dashboard / Alerting]

    D --> I[Evidently AI<br/>Drift Monitoring]
    I -.-> E
```
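
To make the Gold-layer step concrete, here is a minimal pure-Python sketch of the kind of per-symbol features the Feature Store aggregates, using `price_volatility` and `trade_intensity` (the features the drift monitor tracks). The real implementation lives in `src/financial_transactions/features/` and runs as PySpark window aggregations; the function below is only an illustration of the feature definitions:

```python
from collections import deque
from math import sqrt

def rolling_features(trades, window=20):
    """Illustrative Gold-layer features over a fixed window of recent trades.

    `trades` is an iterable of (price, volume) tuples, oldest first.
    Returns one feature dict per incoming trade."""
    prices = deque(maxlen=window)
    volumes = deque(maxlen=window)
    features = []
    for price, volume in trades:
        prices.append(price)
        volumes.append(volume)
        mean = sum(prices) / len(prices)
        variance = sum((p - mean) ** 2 for p in prices) / len(prices)
        features.append({
            "price_volatility": sqrt(variance),  # std-dev of windowed prices
            "trade_intensity": sum(volumes),     # volume traded in the window
        })
    return features
```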

### 1.2 Core Technologies
- **Compute & Orchestration:** Databricks Asset Bundles (DABs), Delta Live Tables (DLT), Databricks Workflows.
- **Machine Learning:** MLflow (Tracking & Registry), scikit-learn, LightGBM, XGBoost.
- **Data Engineering:** PySpark, Delta Lake.
- **Monitoring & Data Drift:** Evidently AI.
- **Streaming Inputs:** Finnhub WebSocket (real-time trades), Alpha Vantage REST (historical context).

---

## 2. Directory Structure

```text
.
├── dashboard/                   # HTML/JS/CSS frontend for viewing anomalies
├── project_config.yml           # Centralized hyperparameters, feature lists, and thresholds
├── databricks.yml               # Databricks Asset Bundles (DABs) configuration
├── pyproject.toml               # Python dependencies (uv/pip) and build system
├── resources/                   # YAML definitions for Databricks infrastructure
│   ├── drift_monitoring.yml     # Scheduled drift detection job
│   ├── retraining_workflow.yml  # Retraining & multi-model tournament pipeline
│   └── streaming_pipeline.yml   # DLT pipeline definition
├── scripts/                     # Entry-point scripts / notebook tasks run by Databricks Jobs
│   └── financial/
│       ├── collect_finnhub_stream.py
│       ├── train_tournament.py
│       ├── deploy_anomaly_model.py
│       ├── detect_drift.py
│       └── rollback_model.py
└── src/                         # Core business logic module (financial_transactions)
    └── financial_transactions/
        ├── dlt/          # Bronze, Silver, Gold transformations
        ├── features/     # Feature engineering logic
        ├── models/       # Model topologies and wrappers
        └── monitoring/   # Evidently drift and data quality checks
```

---

## 3. Local Setup & Development

This project uses `uv` for lightning-fast Python dependency management and builds.

### 3.1 Environment Setup
1. **Install uv**: Follow the official guide to install `uv` (e.g., `curl -LsSf https://astral.sh/uv/install.sh | sh`).
2. **Create a Virtual Environment**:
   ```bash
   uv venv
   source .venv/bin/activate  # On Windows: .venv\Scripts\activate
   ```
3. **Install Dependencies**:
   ```bash
   uv pip install -e ".[dev,test,streaming]"
   ```

### 3.2 Testing
The project uses `pytest` for unit and integration testing. Tests are located in the `tests/` directory.
```bash
# Run all tests with coverage
pytest tests/ --cov=src/financial_transactions
```

---

## 4. Deployment Guide (Databricks Asset Bundles)

All infrastructure (Pipelines, Jobs, Experiments) is declared as code using Databricks Asset Bundles (`databricks.yml`).

### 4.1 Prerequisites
- **Databricks CLI**: Must be installed and configured (`databricks configure`).
- **API Keys**: Ensure `FINNHUB_API_KEY` and `ALPHAVANTAGE_API_KEY` are stored securely and available to jobs (e.g., Databricks Secrets or environment variables).

### 4.2 Environments
Target environments are configured in `databricks.yml`:
- `dev`: Development workspace (`mlops_dev` catalog).
- `acc`: Acceptance/Staging workspace (`mlops_acc` catalog).
- `prd`: Production workspace (`mlops_prd` catalog).

### 4.3 Deployment Commands
To build the Python wheel and deploy infrastructure to a specific target:
```bash
# Deploy to Development
databricks bundle deploy -t dev

# Deploy to Production
databricks bundle deploy -t prd
```

---

## 5. MLOps Workflow & Multi-Model Tournament

The core of this system is the autonomous retraining and evaluation engine.

### 5.1 The Tournament (`train_tournament.py`)
Triggered via the `financial-retraining-workflow` job, the tournament trains four model architectures in parallel:
1. **LightGBM**: Highly efficient gradient boosting (default primary).
2. **XGBoost**: Robust gradient-boosting alternative.
3. **Random Forest**: Bagging ensemble that resists overfitting.
4. **Isolation Forest**: Unsupervised anomaly detection.

Hyperparameters for all models are centrally managed in `project_config.yml`.

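The tournament's control flow can be sketched as follows. This is a simplified illustration only: the actual `train_tournament.py` logs every candidate to MLflow and reads hyperparameters from `project_config.yml`, and the `evaluate_pr_auc` callback here is a stand-in for real training plus holdout evaluation:

```python
def run_tournament(candidates, evaluate_pr_auc):
    """Train and score every candidate, returning the winner by pr_auc.

    `candidates` maps a model name to a zero-argument constructor;
    `evaluate_pr_auc` stands in for training + holdout evaluation and
    must return the model's pr_auc as a float."""
    scores = {}
    for name, build_model in candidates.items():
        model = build_model()
        scores[name] = evaluate_pr_auc(model)
    winner = max(scores, key=scores.get)  # highest pr_auc wins
    return winner, scores
```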
### 5.2 Champion / Challenger Gating (`deploy_anomaly_model.py`)
Models are evaluated against a holdout dataset, and promotion is gated:
- **Primary Metric**: `pr_auc` (area under the precision-recall curve).
- **Threshold**: The Challenger must improve upon the existing Champion's `pr_auc` by a minimum of `0.005` (configurable in `project_config.yml`).
- **Promotion**: If successful, the Challenger is registered in the MLflow Model Registry and assigned the `Champion` alias.

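The gating rule itself is simple enough to express directly. A minimal sketch, assuming the `0.005` margin from `project_config.yml` and treating a first deployment (no existing Champion) as an automatic promotion:

```python
MIN_IMPROVEMENT = 0.005  # promotion margin, mirrored from project_config.yml

def should_promote(champion_pr_auc, challenger_pr_auc,
                   min_improvement=MIN_IMPROVEMENT):
    """Champion/Challenger gate on the primary metric (pr_auc).

    Returns True when the Challenger beats the current Champion by at
    least `min_improvement`, or when no Champion exists yet."""
    if champion_pr_auc is None:
        return True  # first deployment: nothing to compare against
    return challenger_pr_auc - champion_pr_auc >= min_improvement
```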
---

## 6. Data Drift & Monitoring

Data drift monitoring is handled by Evidently AI and orchestrated by `resources/drift_monitoring.yml`.

- **Job**: `financial-drift-monitoring`
- **Schedule**: Every 30 minutes (`0 */30 * * * ?`).
- **Mechanism**: The `detect_drift.py` script compares a recent data window against a historical reference window.
- **Metrics Evaluated**: Population Stability Index (PSI) and Jensen-Shannon (JS) divergence on key features such as price volatility and trade intensity.
- **Alerting**: If drift exceeds the threshold defined in `project_config.yml`, an alert is generated and the retraining pipeline may be triggered automatically.
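
For reference, PSI compares binned proportions of the reference and current windows. The sketch below uses equal-width bins derived from the reference sample and an epsilon to guard against empty bins; the production job delegates this to Evidently AI, so treat the binning choices here as assumptions:

```python
from math import log

def psi(reference, current, bins=10):
    """Population Stability Index between two samples of a numeric feature.

    PSI > 0.2 is a common rule-of-thumb threshold for significant drift."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant features

    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            idx = min(int((x - lo) / width), bins - 1)  # clamp overshoot
            counts[max(idx, 0)] += 1                    # clamp undershoot
        eps = 1e-6  # guard empty bins so the log term stays defined
        return [max(c / len(sample), eps) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * log(c / r) for r, c in zip(ref_p, cur_p))
```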

---

## 7. Operational Playbooks & Troubleshooting

### Scenario A: Delta Live Tables (DLT) Pipeline Failures
- **Symptom**: The `financial-streaming-dlt` pipeline fails or stops processing records.
- **Investigation**:
  1. Check the DLT UI in Databricks.
  2. If the failure occurs at `bronze_ingest.py`, verify that the Finnhub API rate limits haven't been exceeded and that the payload schema hasn't changed.
  3. If the pipeline runs in batch mode (`continuous: false`), consider setting `continuous: true` in `resources/streaming_pipeline.yml` for uninterrupted real-time streaming.
- **Resolution**: Adjust schema evolution settings, or rotate API keys if rate-limited.
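
When a schema change is suspected, a quick sanity check on raw payloads can narrow things down. The sketch below assumes Finnhub's documented trade-message shape (`type: "trade"` with `data` records carrying `p`/`s`/`t`/`v` fields); the exact field set is an assumption to adapt to whatever `bronze_ingest.py` actually expects:

```python
import json

REQUIRED_TRADE_FIELDS = {"p", "s", "t", "v"}  # price, symbol, timestamp(ms), volume

def validate_trade_message(raw):
    """Return the trade records in a raw Finnhub message, raising
    ValueError when the payload does not match the expected schema."""
    msg = json.loads(raw)
    if msg.get("type") != "trade":
        raise ValueError(f"unexpected message type: {msg.get('type')!r}")
    trades = msg.get("data", [])
    for record in trades:
        missing = REQUIRED_TRADE_FIELDS - record.keys()
        if missing:
            raise ValueError(f"trade record missing fields: {sorted(missing)}")
    return trades
```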

### Scenario B: High Volume of False Positives
- **Symptom**: The dashboard shows a spike in detected anomalies during normal market conditions.
- **Investigation**:
  1. Manually trigger the `financial-drift-monitoring` job and review the Evidently AI drift reports.
  2. Check the drift thresholds in `project_config.yml` and whether market volatility features have drifted heavily.
- **Resolution**: If structural market drift is confirmed, manually trigger the `financial-retraining-workflow` to update the model baseline.

### Scenario C: Emergency Model Rollback
- **Symptom**: A newly deployed model exhibits severely degraded performance and is impacting downstream consumers.
- **Investigation**: Verify model performance via the dashboard and MLflow real-time metrics.
- **Resolution**: Execute the rollback script to immediately demote the current Champion and restore the previous approved version:
  ```bash
  # Can be executed via Databricks Workflows or a connected notebook
  python scripts/financial/rollback_model.py
  ```

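The selection logic behind the rollback can be sketched in isolation: given the promotion history, pick the most recently promoted version other than the current Champion (the script then reassigns the MLflow `Champion` alias to that version). The data shape below is hypothetical, purely for illustration:

```python
def previous_champion(promotion_history, current_version):
    """Choose the rollback target from past Champion promotions.

    `promotion_history` is a (hypothetical) list of
    (version, promoted_at_timestamp) tuples for every version that has
    held the Champion alias."""
    candidates = [entry for entry in promotion_history
                  if entry[0] != current_version]
    if not candidates:
        raise RuntimeError("no prior approved version to roll back to")
    return max(candidates, key=lambda entry: entry[1])[0]  # latest prior
```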
### Scenario D: Missing Dashboard Metrics
- **Symptom**: The frontend dashboard is blank or shows stale data.
- **Investigation**:
  1. Verify the `export_dashboard_metrics.py` task is completing successfully.
  2. Ensure the Databricks Model Serving endpoint (if active) is accessible and not in a scaled-to-zero / cold-start state.