# Financial AI MLOps - Comprehensive Architecture & Runbook

This document serves as the single source of truth for the Enterprise Financial Market Anomaly Detection MLOps system. It provides a detailed architectural walkthrough, deployment guidelines, MLOps workflow documentation, and operational troubleshooting playbooks.

---

## 1. System Architecture

The project implements a real-time streaming MLOps architecture built entirely on Databricks. It identifies anomalous financial market transactions, trading behavior, and price action using streaming data and a multi-model evaluation process.

### 1.1 Data Flow & Medallion Architecture

```mermaid
graph TD
    A1[Finnhub WebSockets] --> B[Bronze Layer<br/>Raw Ingestion]
    A2[Alpha Vantage REST] --> B

    subgraph Delta Live Tables
        B --> C[Silver Layer<br/>Cleaning & Filtering]
        C --> D[Gold Layer<br/>Feature Store & Aggregations]
    end

    D --> E[Model Training<br/>Tournament]
    D --> F[Model Serving<br/>Real-time Scoring]

    E --> G[(MLflow Registry)]
    G --> F

    F --> H[Dashboard / Alerting]

    D --> I[Evidently AI<br/>Drift Monitoring]
    I -.-> E
```
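
To make the Gold-layer step concrete, here is a minimal pure-Python sketch of the kind of per-symbol features the Feature Store aggregates, using `price_volatility` and `trade_intensity` (the features the drift monitor tracks). The real implementation lives in `src/financial_transactions/features/` and runs as PySpark window aggregations; the function below is only an illustration of the feature definitions:

```python
from collections import deque
from math import sqrt

def rolling_features(trades, window=20):
    """Illustrative Gold-layer features over a fixed window of recent trades.

    `trades` is an iterable of (price, volume) tuples, oldest first.
    Returns one feature dict per incoming trade."""
    prices = deque(maxlen=window)
    volumes = deque(maxlen=window)
    features = []
    for price, volume in trades:
        prices.append(price)
        volumes.append(volume)
        mean = sum(prices) / len(prices)
        variance = sum((p - mean) ** 2 for p in prices) / len(prices)
        features.append({
            "price_volatility": sqrt(variance),  # std-dev of windowed prices
            "trade_intensity": sum(volumes),     # volume traded in the window
        })
    return features
```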

### 1.2 Core Technologies
- **Compute & Orchestration:** Databricks Asset Bundles (DABs), Delta Live Tables (DLT), Databricks Workflows.
- **Machine Learning:** MLflow (Tracking & Registry), scikit-learn, LightGBM, XGBoost.
- **Data Engineering:** PySpark, Delta Lake.
- **Monitoring & Data Drift:** Evidently AI.
- **Streaming Inputs:** Finnhub WebSocket (real-time trades), Alpha Vantage REST (historical context).

---

## 2. Directory Structure

```text
.
├── dashboard/                   # HTML/JS/CSS frontend for viewing anomalies
├── project_config.yml           # Centralized hyperparameters, feature lists, and thresholds
├── databricks.yml               # Databricks Asset Bundles (DABs) configuration
├── pyproject.toml               # Python dependencies (uv/pip) and build system
├── resources/                   # YAML definitions for Databricks infrastructure
│   ├── drift_monitoring.yml     # Scheduled drift detection job
│   ├── retraining_workflow.yml  # Retraining & multi-model tournament pipeline
│   └── streaming_pipeline.yml   # DLT pipeline definition
├── scripts/                     # Entry-point scripts / notebook tasks run by Databricks Jobs
│   └── financial/
│       ├── collect_finnhub_stream.py
│       ├── train_tournament.py
│       ├── deploy_anomaly_model.py
│       ├── detect_drift.py
│       └── rollback_model.py
└── src/                         # Core business logic module (financial_transactions)
    └── financial_transactions/
        ├── dlt/          # Bronze, Silver, Gold transformations
        ├── features/     # Feature engineering logic
        ├── models/       # Model topologies and wrappers
        └── monitoring/   # Evidently drift and data quality checks
```

---

## 3. Local Setup & Development

This project uses `uv` for lightning-fast Python dependency management and builds.

### 3.1 Environment Setup
1. **Install uv**: Follow the official guide to install `uv` (e.g., `curl -LsSf https://astral.sh/uv/install.sh | sh`).
2. **Create a Virtual Environment**:
   ```bash
   uv venv
   source .venv/bin/activate  # On Windows: .venv\Scripts\activate
   ```
3. **Install Dependencies**:
   ```bash
   uv pip install -e ".[dev,test,streaming]"
   ```

### 3.2 Testing
The project uses `pytest` for unit and integration testing. Tests are located in the `tests/` directory.
```bash
# Run all tests with coverage
pytest tests/ --cov=src/financial_transactions
```

---

## 4. Deployment Guide (Databricks Asset Bundles)

All infrastructure (Pipelines, Jobs, Experiments) is declared as code using Databricks Asset Bundles (`databricks.yml`).

### 4.1 Prerequisites
- **Databricks CLI**: Must be installed and configured (`databricks configure`).
- **API Keys**: Ensure `FINNHUB_API_KEY` and `ALPHAVANTAGE_API_KEY` are stored securely and available to jobs (e.g., Databricks Secrets or environment variables).

### 4.2 Environments
Target environments are configured in `databricks.yml`:
- `dev`: Development workspace (`mlops_dev` catalog).
- `acc`: Acceptance/Staging workspace (`mlops_acc` catalog).
- `prd`: Production workspace (`mlops_prd` catalog).

### 4.3 Deployment Commands
To build the Python wheel and deploy infrastructure to a specific target:
```bash
# Deploy to Development
databricks bundle deploy -t dev

# Deploy to Production
databricks bundle deploy -t prd
```

---

## 5. MLOps Workflow & Multi-Model Tournament

The core of this system is the autonomous retraining and evaluation engine.

### 5.1 The Tournament (`train_tournament.py`)
Triggered via the `financial-retraining-workflow` job, the tournament trains four model architectures in parallel:
1. **LightGBM**: Highly efficient gradient boosting (default primary).
2. **XGBoost**: Robust gradient-boosting alternative.
3. **Random Forest**: Bagging ensemble that resists overfitting.
4. **Isolation Forest**: Unsupervised anomaly detection.

Hyperparameters for all models are centrally managed in `project_config.yml`.

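The tournament's control flow can be sketched as follows. This is a simplified illustration only: the actual `train_tournament.py` logs every candidate to MLflow and reads hyperparameters from `project_config.yml`, and the `evaluate_pr_auc` callback here is a stand-in for real training plus holdout evaluation:

```python
def run_tournament(candidates, evaluate_pr_auc):
    """Train and score every candidate, returning the winner by pr_auc.

    `candidates` maps a model name to a zero-argument constructor;
    `evaluate_pr_auc` stands in for training + holdout evaluation and
    must return the model's pr_auc as a float."""
    scores = {}
    for name, build_model in candidates.items():
        model = build_model()
        scores[name] = evaluate_pr_auc(model)
    winner = max(scores, key=scores.get)  # highest pr_auc wins
    return winner, scores
```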
### 5.2 Champion / Challenger Gating (`deploy_anomaly_model.py`)
Models are evaluated against a holdout dataset, and promotion is gated:
- **Primary Metric**: `pr_auc` (area under the precision-recall curve).
- **Threshold**: The Challenger must improve upon the existing Champion's `pr_auc` by a minimum of `0.005` (configurable in `project_config.yml`).
- **Promotion**: If successful, the Challenger is registered in the MLflow Model Registry and assigned the `Champion` alias.

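The gating rule itself is simple enough to express directly. A minimal sketch, assuming the `0.005` margin from `project_config.yml` and treating a first deployment (no existing Champion) as an automatic promotion:

```python
MIN_IMPROVEMENT = 0.005  # promotion margin, mirrored from project_config.yml

def should_promote(champion_pr_auc, challenger_pr_auc,
                   min_improvement=MIN_IMPROVEMENT):
    """Champion/Challenger gate on the primary metric (pr_auc).

    Returns True when the Challenger beats the current Champion by at
    least `min_improvement`, or when no Champion exists yet."""
    if champion_pr_auc is None:
        return True  # first deployment: nothing to compare against
    return challenger_pr_auc - champion_pr_auc >= min_improvement
```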
---

## 6. Data Drift & Monitoring

Data drift monitoring is handled by Evidently AI and orchestrated by `resources/drift_monitoring.yml`.

- **Job**: `financial-drift-monitoring`
- **Schedule**: Every 30 minutes (`0 */30 * * * ?`).
- **Mechanism**: The `detect_drift.py` script compares a recent data window against a historical reference window.
- **Metrics Evaluated**: Population Stability Index (PSI) and Jensen-Shannon (JS) divergence on key features such as price volatility and trade intensity.
- **Alerting**: If drift exceeds the threshold defined in `project_config.yml`, an alert is generated and the retraining pipeline may be triggered automatically.
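
For reference, PSI compares binned proportions of the reference and current windows. The sketch below uses equal-width bins derived from the reference sample and an epsilon to guard against empty bins; the production job delegates this to Evidently AI, so treat the binning choices here as assumptions:

```python
from math import log

def psi(reference, current, bins=10):
    """Population Stability Index between two samples of a numeric feature.

    PSI > 0.2 is a common rule-of-thumb threshold for significant drift."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant features

    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            idx = min(int((x - lo) / width), bins - 1)  # clamp overshoot
            counts[max(idx, 0)] += 1                    # clamp undershoot
        eps = 1e-6  # guard empty bins so the log term stays defined
        return [max(c / len(sample), eps) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * log(c / r) for r, c in zip(ref_p, cur_p))
```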

---

## 7. Operational Playbooks & Troubleshooting

### Scenario A: Delta Live Tables (DLT) Pipeline Failures
- **Symptom**: The `financial-streaming-dlt` pipeline fails or stops processing records.
- **Investigation**:
  1. Check the DLT UI in Databricks.
  2. If the failure occurs at `bronze_ingest.py`, verify that the Finnhub API rate limits haven't been exceeded and that the payload schema hasn't changed.
  3. If the pipeline runs in batch mode (`continuous: false`), consider setting `continuous: true` in `resources/streaming_pipeline.yml` for uninterrupted real-time streaming.
- **Resolution**: Adjust schema evolution settings, or rotate API keys if rate-limited.
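
When a schema change is suspected, a quick sanity check on raw payloads can narrow things down. The sketch below assumes Finnhub's documented trade-message shape (`type: "trade"` with `data` records carrying `p`/`s`/`t`/`v` fields); the exact field set is an assumption to adapt to whatever `bronze_ingest.py` actually expects:

```python
import json

REQUIRED_TRADE_FIELDS = {"p", "s", "t", "v"}  # price, symbol, timestamp(ms), volume

def validate_trade_message(raw):
    """Return the trade records in a raw Finnhub message, raising
    ValueError when the payload does not match the expected schema."""
    msg = json.loads(raw)
    if msg.get("type") != "trade":
        raise ValueError(f"unexpected message type: {msg.get('type')!r}")
    trades = msg.get("data", [])
    for record in trades:
        missing = REQUIRED_TRADE_FIELDS - record.keys()
        if missing:
            raise ValueError(f"trade record missing fields: {sorted(missing)}")
    return trades
```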

### Scenario B: High Volume of False Positives
- **Symptom**: The dashboard shows a spike in detected anomalies during normal market conditions.
- **Investigation**:
  1. Manually trigger the `financial-drift-monitoring` job and review the Evidently AI drift reports.
  2. Check the drift thresholds in `project_config.yml` and whether market volatility features have drifted heavily.
- **Resolution**: If structural market drift is confirmed, manually trigger the `financial-retraining-workflow` to update the model baseline.

### Scenario C: Emergency Model Rollback
- **Symptom**: A newly deployed model exhibits severely degraded performance and is impacting downstream consumers.
- **Investigation**: Verify model performance via the dashboard and MLflow real-time metrics.
- **Resolution**: Execute the rollback script to immediately demote the current Champion and restore the previous approved version:
  ```bash
  # Can be executed via Databricks Workflows or a connected notebook
  python scripts/financial/rollback_model.py
  ```

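The selection logic behind the rollback can be sketched in isolation: given the promotion history, pick the most recently promoted version other than the current Champion (the script then reassigns the MLflow `Champion` alias to that version). The data shape below is hypothetical, purely for illustration:

```python
def previous_champion(promotion_history, current_version):
    """Choose the rollback target from past Champion promotions.

    `promotion_history` is a (hypothetical) list of
    (version, promoted_at_timestamp) tuples for every version that has
    held the Champion alias."""
    candidates = [entry for entry in promotion_history
                  if entry[0] != current_version]
    if not candidates:
        raise RuntimeError("no prior approved version to roll back to")
    return max(candidates, key=lambda entry: entry[1])[0]  # latest prior
```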
### Scenario D: Missing Dashboard Metrics
- **Symptom**: The frontend dashboard is blank or shows stale data.
- **Investigation**:
  1. Verify the `export_dashboard_metrics.py` task is completing successfully.
  2. Ensure the Databricks Model Serving endpoint (if active) is accessible and not in a scaled-to-zero / cold-start state.