
Commit 2f856de

Merge pull request #4 from Jakee4488/fix-ci-secrets
fix(ci): remove sensitive Databricks credentials from CI workflow and…
2 parents: f042487 + 573db90

3 files changed: 180 additions & 67 deletions


.github/workflows/ci.yml

Lines changed: 6 additions & 8 deletions
```diff
@@ -8,10 +8,6 @@ on:
 
 env:
   PYTHON_VERSION: "3.11"
-  DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
-  DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
-  DATABRICKS_CLIENT_ID: ${{ secrets.DATABRICKS_CLIENT_ID }}
-  DATABRICKS_CLIENT_SECRET: ${{ secrets.DATABRICKS_CLIENT_SECRET }}
   FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true
 
 jobs:
@@ -90,6 +86,7 @@ jobs:
   bundle-validate:
     name: Validate Databricks Bundle
     runs-on: ubuntu-latest
+    environment: dev
    needs: lint-and-test
     if: >
       github.event_name != 'pull_request' ||
@@ -101,23 +98,24 @@ jobs:
         shell: bash
         run: |
           if [[ -z "${DATABRICKS_HOST}" ]]; then
-            echo "Missing DATABRICKS_HOST secret."
+            echo "::error::Missing DATABRICKS_HOST secret."
             exit 1
           fi
-          if [[ -z "${DATABRICKS_TOKEN}" && -z "${DATABRICKS_CLIENT_ID}" ]]; then
-            echo "Missing authentication secrets! You must provide either DATABRICKS_TOKEN (for PAT) or DATABRICKS_CLIENT_ID and DATABRICKS_CLIENT_SECRET (for OAuth M2M)."
+          if [[ -z "${DATABRICKS_TOKEN}" ]] && [[ -z "${DATABRICKS_CLIENT_ID}" || -z "${DATABRICKS_CLIENT_SECRET}" ]]; then
+            echo "::error::Missing authentication secrets! Provide either DATABRICKS_TOKEN (for PAT) OR both DATABRICKS_CLIENT_ID and DATABRICKS_CLIENT_SECRET (for OAuth)."
             exit 1
           fi
         env:
           DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
           DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
           DATABRICKS_CLIENT_ID: ${{ secrets.DATABRICKS_CLIENT_ID }}
+          DATABRICKS_CLIENT_SECRET: ${{ secrets.DATABRICKS_CLIENT_SECRET }}
 
       - name: Install Databricks CLI
         run: curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
 
       - name: Validate bundle
-        run: databricks bundle validate
+        run: databricks bundle validate -t dev
         env:
           DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
           DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
```
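The reworked guard closes a real gap: under the old test `[[ -z "${DATABRICKS_TOKEN}" && -z "${DATABRICKS_CLIENT_ID}" ]]`, a run configured with a client ID but no client secret passed the pre-flight check and only failed later at bundle validation. The new form requires either a PAT or the complete OAuth pair. A quick illustrative truth table (plain Python, not project code; empty strings model missing secrets):

```python
# Compare the old and new CI secret checks across representative cases.
cases = [
    ("nothing set",     "",    "",    ""),
    ("PAT only",        "pat", "",    ""),
    ("OAuth ID only",   "",    "cid", ""),        # the gap the fix closes
    ("full OAuth pair", "",    "cid", "secret"),
]

for label, token, client_id, client_secret in cases:
    old_fails = not token and not client_id
    new_fails = not token and (not client_id or not client_secret)
    print(f"{label:15}  old: {'FAIL' if old_fails else 'pass'}"
          f"  new: {'FAIL' if new_fails else 'pass'}")
```

"OAuth ID only" passes the old check but fails the new one, which is exactly the misconfiguration the updated message now names. The `::error::` prefix additionally surfaces the message as a GitHub Actions error annotation rather than a plain log line.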

RUNBOOK.md

Lines changed: 156 additions & 59 deletions
````diff
@@ -1,90 +1,187 @@
-# Financial AI MLOps - Walkthrough & Runbook
-
-This document serves as a complete architectural walkthrough and operational runbook for the Enterprise Financial Market Anomaly Detection MLOps system.
+# Financial AI MLOps - Comprehensive Architecture & Runbook
+
+This document serves as the single source of truth for the Enterprise Financial Market Anomaly Detection MLOps system. It provides a detailed architectural walkthrough, deployment guidelines, MLOps workflow documentation, and operational troubleshooting playbooks.
+
+---
+
+## 1. System Architecture
+
+The project implements a real-time streaming MLOps architecture built entirely on Databricks. It identifies anomalous financial market transactions, trading behavior, and price action using streaming data and a multi-model evaluation process.
+
+### 1.1 Data Flow & Medallion Architecture
+
+```mermaid
+graph TD
+    A1[Finnhub WebSockets] --> B[Bronze Layer<br/>Raw Ingestion]
+    A2[Alpha Vantage REST] --> B
+
+    subgraph Delta Live Tables
+        B --> C[Silver Layer<br/>Cleaning & Filtering]
+        C --> D[Gold Layer<br/>Feature Store & Aggregations]
+    end
+
+    D --> E[Model Training<br/>Tournament]
+    D --> F[Model Serving<br/>Real-time Scoring]
+
+    E --> G[(MLflow Registry)]
+    G --> F
+
+    F --> H[Dashboard / Alerting]
+
+    D --> I[Evidently AI<br/>Drift Monitoring]
+    I -.-> E
+```
 
-## 1. Repository Walkthrough
+### 1.2 Core Technologies
+- **Compute & Orchestration:** Databricks Asset Bundles (DABs), Delta Live Tables (DLT), Databricks Workflows.
+- **Machine Learning:** MLflow (Tracking & Registry), Scikit-Learn, LightGBM, XGBoost.
+- **Data Engineering:** PySpark, Delta Lake.
+- **Monitoring & Data Drift:** Evidently AI.
+- **Streaming Inputs:** Finnhub Websocket (real-time trades), Alpha Vantage REST (historical context).
 
-### System Overview
-This project is an enterprise-grade streaming MLOps application built on Databricks. It consumes live financial market data (via Finnhub WebSockets and Alpha Vantage), passes it through a Medallion architecture (Bronze, Silver, Gold), trains models to detect anomalies using a multi-model tournament, and alerts via a customized dashboard.
+---
 
-### Core Architecture & Technologies
-- **Compute & Orchestration:** Databricks, Databricks Asset Bundles (DABs), Delta Live Tables (DLT)
-- **Machine Learning:** MLflow (Tracking & Registry), multi-model tournament (LightGBM, XGBoost, Random Forest, Isolation Forest).
-- **Data Engineering:** PySpark, Delta Lake.
-- **Monitoring & Drift:** Evidently AI
-- **Streaming:** Finnhub Websocket, Alpha Vantage REST.
+## 2. Directory Structure
 
-### Directory Structure
 ```text
 .
 ├── dashboard/              # HTML/JS/CSS frontend for viewing anomalies
 ├── project_config.yml      # Centralized hyperparameters, feature lists, and thresholds
 ├── databricks.yml          # Databricks Asset Bundles (DABs) configurations
-├── pyproject.toml          # Python dependencies and build system
+├── pyproject.toml          # Python dependencies (uv/pip) and build system
 ├── resources/              # YAML definitions for Databricks infrastructure
 │   ├── drift_monitoring.yml    # Scheduled drift detection job
-│   ├── retraining_workflow.yml# Retraining & multi-model tournament pipeline
-│   └── streaming_pipeline.yml# DLT pipeline definition
+│   ├── retraining_workflow.yml # Retraining & multi-model tournament pipeline
+│   └── streaming_pipeline.yml  # DLT pipeline definition
 ├── scripts/                # Entry level scripts / Notebook tasks run by Databricks Jobs
-│   ├── financial/
-│   │   ├── collect_finnhub_stream.py
-│   │   ├── train_tournament.py
-│   │   ├── deploy_anomaly_model.py
-│   │   └── detect_drift.py
-└── src/                    # Core business logic module
+│   └── financial/
+│       ├── collect_finnhub_stream.py
+│       ├── train_tournament.py
+│       ├── deploy_anomaly_model.py
+│       ├── detect_drift.py
+│       └── rollback_model.py
+└── src/                    # Core business logic module (financial_transactions)
     └── financial_transactions/
-        ├── config.py       # Config loader
         ├── dlt/            # Bronze, Silver, Gold transformations
         ├── features/       # Feature engineering logic
         ├── models/         # Model topologies and wrappers
         └── monitoring/     # Evidently drift and data quality checks
 ```
````
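The tree above centralizes hyperparameters, feature lists, and thresholds in `project_config.yml`, shared by the training, gating, and drift jobs. For orientation, a minimal sketch of reading such a config (hypothetical; the loader and key names below are illustrative, not taken from the repo):

```python
# Hypothetical loader for the centralized YAML config; key names are illustrative.
from pathlib import Path

import yaml  # pip install pyyaml

def load_project_config(path: str = "project_config.yml") -> dict:
    """Parse the centralized config shared by training, gating, and drift jobs."""
    return yaml.safe_load(Path(path).read_text())

config = load_project_config()
drift_cfg = config.get("drift", {})                       # e.g. windows, PSI threshold
promotion_margin = config.get("promotion_margin", 0.005)  # champion/challenger delta
```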

````diff
-## 2. Operational Runbook
+---
+
+## 3. Local Setup & Development
+
+This project uses `uv` for lightning-fast Python dependency management and builds.
 
-### Initial Setup and Deployment
-1. **API Keys**: Ensure `FINNHUB_API_KEY` and `ALPHAVANTAGE_API_KEY` are stored securely (e.g. Databricks Secrets or env vars).
-2. **Environment Targets**: Modifying variables per environment (`dev`, `acc`, `prd`) is handled in `databricks.yml`.
-3. **Deploying Infrastructure**:
-   To deploy pipelines and job updates to Databricks via DABs:
+### 3.1 Environment Setup
+1. **Install uv**: Follow the official guide to install `uv` (e.g., `curl -LsSf https://astral.sh/uv/install.sh | sh`).
+2. **Create Virtual Environment**:
    ```bash
-   databricks bundle deploy -t dev
+   uv venv
+   source .venv/bin/activate  # On Windows: .venv\Scripts\activate
+   ```
+3. **Install Dependencies**:
+   ```bash
+   uv pip install -e ".[dev,test,streaming]"
    ```
 
-### Day-to-Day Operations
+### 3.2 Testing
+The project uses `pytest` for unit and integration testing. Tests are located in the `tests/` directory.
+```bash
+# Run all tests with coverage
+pytest tests/ --cov=src/financial_transactions
+```
 
-#### A. Delta Live Tables (Streaming Data Pipeline)
-- **Name**: `financial-streaming-dlt` (defined in `resources/streaming_pipeline.yml`)
-- **Status Check**: Ensure the DLT pipeline is running. If `continuous: false` is configured, it will run as a batch. Change to `true` for 24/7 web-socket streaming.
-- **Failures in DLT**: Check the Databricks DLT UI. Look out for schema mismatches in data ingestion through `bronze_ingest.py`.
+---
 
-#### B. Model Retraining (Tournament)
-- **Job Name**: `financial-retraining-workflow`
-- **Trigger**: Run manually or via the specified schedule. (Currently defined as `PAUSED` in variables).
-- **Process**:
-  1. `train_tournament.py` evaluates LightGBM, XGBoost, Random Forest, & Isolation forest.
-  2. The primary metric (`pr_auc`) is optimized.
-  3. The `deploy_anomaly_model.py` task compares the winner with the current Champion (champion/challenger gating). If `pr_auc` improves by > `0.005`, the challenger replaces the champion in the Databricks Model Registry.
+## 4. Deployment Guide (Databricks Asset Bundles)
 
-#### C. Drift & Monitoring
-- **Job Name**: `financial-drift-monitoring`
-- **Schedule**: Every 30 minutes (`0 */30 * * * ?`).
-- **Logic**: Pulls `reference_window_days` vs current window. Calculates PSI & JS Divergence on features like price volatility and trade intensity.
-  - If drift threshold (e.g., PSI > 0.2) is breached, an alert event log is generated.
+All infrastructure (Pipelines, Jobs, Experiments) is declared as code using Databricks Asset Bundles (`databricks.yml`).
 
-### Troubleshooting Scenarios
+### 4.1 Prerequisites
+- **Databricks CLI**: Must be installed and configured (`databricks configure`).
+- **API Keys**: Ensure `FINNHUB_API_KEY` and `ALPHAVANTAGE_API_KEY` are available as environment variables or configured in your environment.
 
-**Issue: High volume of False Positives in Anomaly Detection**
-- **Action**: Check `project_config.yml` under `drift`.
-- **Action**: Verify if feature distributions have shifted. Run `financial-drift-monitoring` manually to review the Evidently reports.
-- **Mitigation**: Manually trigger `financial-retraining-workflow` to update the model to new market conditions.
+### 4.2 Environments
+Target environments are configured in `databricks.yml`:
+- `dev`: Development workspace (`mlops_dev` catalog).
+- `acc`: Acceptance/Staging workspace (`mlops_acc` catalog).
+- `prd`: Production workspace (`mlops_prd` catalog).
 
-**Issue: WebSockets disconnect or stop ingesting data**
-- **Action**: Investigate logs for `collect_finnhub_stream.py`. Ensure Databricks cluster has external internet access. Review Finnhub API rate limits. Restart DLT pipeline.
+### 4.3 Deployment Commands
+To build the Python wheel and deploy infrastructure to a specific target:
+```bash
+# Deploy to Development
+databricks bundle deploy -t dev
 
-**Issue: Dashboard is disconnected**
-- **Action**: The frontend connects to an exposed model endpoint or metrics export. Verify `export_dashboard_metrics.py` is running or API serving endpoints are accessible via Databricks Model Serving.
+# Deploy to Production
+databricks bundle deploy -t prd
+```
 
-### Emergency Rollback
-- **Scenario**: A newly deployed model exhibits poor production performance that is impacting business logic.
-- **Mitigation**: Execute `scripts/financial/rollback_model.py` to revert the `Champion` alias in MLflow to the previous approved version.
+---
+
+## 5. MLOps Workflow & Multi-Model Tournament
+
+The core of this system is the autonomous retraining and evaluation engine.
+
+### 5.1 The Tournament (`train_tournament.py`)
+Triggered via `financial-retraining-workflow`, the tournament trains four different model architectures simultaneously:
+1. **LightGBM**: Highly efficient gradient boosting (Default Primary).
+2. **XGBoost**: Robust gradient boosting alternative.
+3. **Random Forest**: Ensemble method to prevent overfitting.
+4. **Isolation Forest**: Unsupervised anomaly detection.
+
+Hyperparameters for all models are centrally managed in `project_config.yml`.
+
````
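Section 5.1 above describes a four-way tournament judged on `pr_auc`. Here is a condensed, self-contained sketch of the pattern on synthetic data (illustrative only, not the contents of `train_tournament.py`; Isolation Forest is omitted because, being unsupervised, it is scored via `decision_function` rather than class probabilities):

```python
# Tournament sketch: train several candidates, log to MLflow, keep the best pr_auc.
import mlflow
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Synthetic stand-in for the Gold-layer feature table (97% normal, 3% anomalous).
X, y = make_classification(n_samples=5000, weights=[0.97], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=0)

candidates = {
    "lightgbm": LGBMClassifier(n_estimators=200),              # hyperparameters would
    "xgboost": XGBClassifier(n_estimators=200,                 # come from
                             eval_metric="logloss"),           # project_config.yml
    "random_forest": RandomForestClassifier(n_estimators=200),
}

results = {}
for name, model in candidates.items():
    with mlflow.start_run(run_name=name):
        model.fit(X_train, y_train)
        scores = model.predict_proba(X_val)[:, 1]
        results[name] = average_precision_score(y_val, scores)  # PR-AUC proxy
        mlflow.log_metric("pr_auc", results[name])

winner = max(results, key=results.get)
print(f"Tournament winner: {winner} (pr_auc={results[winner]:.4f})")
```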
````diff
+### 5.2 Champion / Challenger Gating (`deploy_anomaly_model.py`)
+Models are evaluated against a holdout dataset. The system employs a rigorous gating mechanism:
+- **Primary Metric**: `pr_auc` (Precision-Recall Area Under Curve).
+- **Threshold**: The Challenger must improve upon the existing Champion's `pr_auc` by a minimum of `0.005` (configurable in `project_config.yml`).
+- **Promotion**: If successful, the Challenger is registered in the MLflow Model Registry and alias tagged as the new `Champion`.
+
````
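The promotion gate in §5.2 reduces to one comparison plus an alias move. A minimal sketch using MLflow registry aliases (the registered model name and the exact metric lookup are assumptions; they are not shown in this diff):

```python
# Champion/challenger gating sketch built on MLflow registry aliases.
from mlflow.tracking import MlflowClient

MODEL_NAME = "financial_anomaly_detector"  # assumed registry name
MIN_IMPROVEMENT = 0.005                    # promotion margin per the runbook

client = MlflowClient()

def pr_auc_of(run_id: str) -> float:
    """Read back the pr_auc metric the tournament logged for a run."""
    return client.get_run(run_id).data.metrics["pr_auc"]

def maybe_promote(challenger_version: str) -> bool:
    champion = client.get_model_version_by_alias(MODEL_NAME, "Champion")
    challenger = client.get_model_version(MODEL_NAME, challenger_version)
    if pr_auc_of(challenger.run_id) - pr_auc_of(champion.run_id) > MIN_IMPROVEMENT:
        # Re-point the alias; the previous champion version stays in the registry,
        # which is what makes the emergency rollback in Scenario C possible.
        client.set_registered_model_alias(MODEL_NAME, "Champion", challenger_version)
        return True
    return False
```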
````diff
+---
+
+## 6. Data Drift & Monitoring
+
+Data drift monitoring is handled by Evidently AI and orchestrated by `resources/drift_monitoring.yml`.
+
+- **Job**: `financial-drift-monitoring`
+- **Schedule**: Every 30 minutes (`0 */30 * * * ?`).
+- **Mechanism**: The `detect_drift.py` script compares a recent data window against a historical reference window.
+- **Metrics Evaluated**: Population Stability Index (PSI) and Jensen-Shannon (JS) Divergence on key features like price volatility and trade intensity.
+- **Alerting**: If drift exceeds the threshold defined in `project_config.yml`, an alert is generated, and the retraining pipeline may be triggered automatically.
+
````
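For reference, the PSI metric named above can be computed from binned frequencies of a reference window versus the current window. A generic numpy sketch of the textbook formula follows (not Evidently's internals; the pre-change runbook cited PSI > 0.2 as an alert threshold):

```python
# Population Stability Index: sum((cur% - ref%) * ln(cur% / ref%)) over bins.
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    # Note: current values outside the reference range fall out of the bins
    # in this simplified version.
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    ref_pct = np.clip(ref_pct, 1e-6, None)  # guard empty bins against log(0)
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 10_000)  # e.g. historical price volatility
drifted = rng.normal(0.5, 1.2, 10_000)   # shifted current window

print(f"stable window:  {psi(baseline, rng.normal(0.0, 1.0, 10_000)):.3f}")  # ~0
print(f"drifted window: {psi(baseline, drifted):.3f}")  # well above 0.2
```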
````diff
+---
+
+## 7. Operational Playbooks & Troubleshooting
+
+### Scenario A: Delta Live Tables (DLT) Pipeline Failures
+- **Symptom**: `financial-streaming-dlt` job fails or stops processing records.
+- **Investigation**:
+  1. Check the DLT UI in Databricks.
+  2. If the failure occurs at `bronze_ingest.py`, verify that the Finnhub API rate limits haven't been exceeded or that the payload schema hasn't changed.
+  3. If running in batch mode (`continuous: false`), consider changing to `true` in `resources/streaming_pipeline.yml` for uninterrupted real-time streaming.
+- **Resolution**: Adjust schema evolution settings or rotate API keys if rate-limited.
+
+### Scenario B: High Volume of False Positives
+- **Symptom**: The dashboard indicates a massive spike in detected anomalies during normal market conditions.
+- **Investigation**:
+  1. Manually trigger the `financial-drift-monitoring` job. Review the Evidently AI drift reports.
+  2. Check `project_config.yml` to see if market volatility features have heavily drifted.
+- **Resolution**: If structural market drift is confirmed, manually trigger the `financial-retraining-workflow` to update the model baseline.
+
+### Scenario C: Emergency Model Rollback
+- **Symptom**: A newly deployed model exhibits severely degraded performance and is impacting downstream consumers.
+- **Investigation**: Verify model performance via the dashboard and MLflow real-time metrics.
+- **Resolution**: Execute the rollback script to immediately demote the current Champion and restore the previous approved version:
+```bash
+# Can be executed via Databricks Workflows or a connected notebook
+python scripts/financial/rollback_model.py
+```
+
````
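The heart of `rollback_model.py` in Scenario C is a single alias reassignment in the MLflow registry. A hedged sketch (the model name and the way the "previous approved version" is discovered are assumptions, not the script's actual logic):

```python
# Emergency rollback sketch: re-point the Champion alias to the prior version.
from mlflow.tracking import MlflowClient

MODEL_NAME = "financial_anomaly_detector"  # assumed registry name

client = MlflowClient()
current = client.get_model_version_by_alias(MODEL_NAME, "Champion")

# Assumption: the previous approved version is the highest-numbered version
# below the current champion; a production script would track approvals explicitly.
versions = sorted(
    (int(v.version) for v in client.search_model_versions(f"name='{MODEL_NAME}'")),
    reverse=True,
)
previous = next(v for v in versions if v < int(current.version))

client.set_registered_model_alias(MODEL_NAME, "Champion", str(previous))
print(f"Champion rolled back: v{current.version} -> v{previous}")
```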
````diff
+### Scenario D: Missing Dashboard Metrics
+- **Symptom**: The frontend dashboard is blank or shows stale data.
+- **Investigation**:
+  1. Verify the `export_dashboard_metrics.py` task is completing successfully.
+  2. Ensure the Databricks Model Serving endpoint (if active) is accessible and not in a scaled-to-zero / cold-start state.
````

typings/__builtins__.pyi

Lines changed: 18 additions & 0 deletions
```diff
@@ -0,0 +1,18 @@
+
+from databricks.sdk.runtime import *
+from pyspark.sql.session import SparkSession
+from pyspark.sql.functions import udf as U
+from pyspark.sql.context import SQLContext
+
+udf = U
+spark: SparkSession
+sc = spark.sparkContext
+sqlContext: SQLContext
+sql = sqlContext.sql
+table = sqlContext.table
+getArgument = dbutils.widgets.getArgument
+
+def displayHTML(html): ...
+
+def display(input=None, *args, **kwargs): ...
+
```
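This stub exists for local development ergonomics: type checkers such as Pyright/Pylance look in a `typings/` directory by default, and a `__builtins__.pyi` file there injects its names into every module's builtin scope. That is why `spark`, `dbutils` (re-exported via `from databricks.sdk.runtime import *`), `display`, and `getArgument` stop showing up as undefined in an editor, even though they only exist at runtime on a Databricks cluster. Notebook-style code like the following then type-checks locally (table name illustrative; it still only runs inside Databricks):

```python
# These globals resolve via typings/__builtins__.pyi in a local editor,
# but are actually provided by the Databricks runtime when executed.
df = spark.read.table("mlops_dev.gold.trade_features")  # illustrative table name
display(df.limit(10))
```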
