Skip to content

panwarnalini-hub/realtime-stock-analytics

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Stock Market Analytics Pipeline

Python 3.8+ PySpark Delta Lake License: MIT

Live Demo

Interactive Streamlit Dashboard:
https://realtime-stock-analytics.streamlit.app/

Real-time stock market visualization powered by Alpha Vantage.
(API rate limits apply on free tier.)

Production data pipeline for stock market analytics with medallion architecture on Databricks. Includes batch processing pipeline and interactive dashboard for different use cases.


Overview

This project demonstrates full-stack data engineering capabilities through two complementary implementations:

  1. Batch Processing Pipeline - Production medallion architecture on Databricks
  2. Interactive Dashboard - Real-time visualization with Streamlit

Key Results:

  • Pipeline latency: <5 seconds per batch execution
  • Processing capacity: 100+ quotes per minute
  • Scalability: 1M+ events per day
  • Infrastructure: $0 (free tier)

Project Components

1. Databricks Pipeline (Production Batch Processing)

Purpose: Scheduled data processing with medallion architecture for reliable, production-grade analytics.

Location: notebooks/

Architecture:

Alpha Vantage API → Bronze Layer → Silver Layer → Gold Layer
                    (Raw Data)    (Enriched)     (Analytics)
                    
Storage: Unity Catalog Volumes (/Volumes/main/stocks/data/)

Data Flow:

  • Bronze Layer - Raw API responses with minimal transformation
  • Silver Layer - Technical indicators (moving averages, volatility)
  • Gold Layer - Business analytics (top movers, alerts)

Features:

  • Micro-batch ingestion with configurable intervals
  • Delta Lake ACID transactions
  • Unity Catalog governance
  • Time travel capabilities
  • Partitioned storage by date and symbol

Tech Stack: Databricks, PySpark, Delta Lake, Unity Catalog

2. Streamlit Dashboard (Interactive Analysis)

Purpose: On-demand visualization and exploration for ad-hoc analysis.

Location: dashboard/streamlit_app.py

Features:

  • Live stock price fetching
  • Interactive stock selection
  • Price comparison charts
  • Volume analysis
  • Configurable refresh intervals

Data Source: Direct API calls (independent of Databricks pipeline)

Tech Stack: Streamlit, Plotly, Python

Why Two Implementations?

This architecture mirrors real-world data platforms:

  • Batch pipeline handles scheduled, reliable production processing
  • Dashboard enables analyst self-service and real-time exploration

Both demonstrate different aspects of data engineering:

  • Production ETL/ELT patterns
  • Interactive application development
  • API integration
  • Data visualization

Architecture Details

Batch Pipeline Architecture

Micro-Batch Processing: This uses scheduled batch processing with configurable intervals (default: 60 seconds). This is NOT Structured Streaming with readStream/writeStream, but rather API polling optimized for rate limits.

Design Rationale:

  • Alpha Vantage API has rate limits (5 calls/minute free tier)
  • Micro-batch approach respects API constraints
  • Suitable for scenarios where sub-minute latency is acceptable

For True Streaming: Would require Kafka/Event Hub as streaming source with Structured Streaming (readStream, writeStream, checkpointing, watermarking).

Bronze Layer:

  • Raw stock quotes from API
  • Minimal transformation
  • Partitioned by ingestion date
  • Delta Lake with ACID guarantees

Silver Layer:

  • Data quality validation
  • Technical indicators: 5min, 15min, 1hr moving averages
  • Volatility calculations (price standard deviation / mean)
  • Data quality scoring

Gold Layer:

  • Top movers identification (gainers/losers)
  • Volatility leaders ranking
  • Price spike/drop alerts
  • Business-ready analytics tables

Quick Start

Prerequisites

  • Python 3.8+
  • Databricks account (Community Edition compatible)
  • Alpha Vantage API key (Get free key)

Installation

# Clone repository
git clone https://github.com/panwarnalini-hub/realtime-stock-analytics.git
cd realtime-stock-analytics

# Install dependencies
pip install -r requirements.txt

# Create configuration
cat > config.py << EOF
API_KEY = "your_alpha_vantage_api_key"
EOF

Running the Databricks Pipeline

  1. Upload notebooks/ folder to Databricks workspace
  2. Update paths in each notebook (see Deployment Notes)
  3. Execute sequentially:
    • 01_bronze_ingestion.py - Fetch and store raw data
    • 02_silver_aggregations.py - Calculate technical indicators
    • 03_gold_analytics.py - Generate business analytics

Running the Dashboard

streamlit run dashboard/streamlit_app.py

Dashboard opens at http://localhost:8501

Note: Dashboard fetches data directly from API, independent of Databricks tables.


Project Structure

stock-analytics/
├── src/
│   └── stock_fetcher.py          # API client
├── notebooks/
│   ├── 01_bronze_ingestion.py    # Raw data ingestion
│   ├── 02_silver_aggregations.py # Technical indicators
│   └── 03_gold_analytics.py      # Business analytics
├── dashboard/
│   └── streamlit_app.py          # Interactive dashboard
├── config.py                      # API configuration (gitignored)
├── requirements.txt               # Dependencies
├── .gitignore
└── README.md

Configuration

API Key Setup

Get free API key: Alpha Vantage

Local Development:

# config.py
API_KEY = "your_api_key_here"

Databricks (Production):

# In notebook
API_KEY = dbutils.secrets.get(scope="stock-analytics", key="alpha-vantage-key")

Stock Symbols

Configurable in notebooks and dashboard:

SYMBOLS = ['AAPL', 'GOOGL', 'MSFT', 'AMZN', 'TSLA', 'META', 'NVDA', 'AMD']

Deployment Notes

Databricks Paths

Unity Catalog Volumes (Community Edition):

BRONZE_PATH = "/Volumes/main/stocks/data/bronze"
SILVER_PATH = "/Volumes/main/stocks/data/silver"
GOLD_PATH = "/Volumes/main/stocks/data/gold"

DBFS (Premium/Standard):

BRONZE_PATH = "/tmp/delta/stocks/bronze"
SILVER_PATH = "/tmp/delta/stocks/silver"
GOLD_PATH = "/tmp/delta/stocks/gold"

Implementation: Developed on Databricks Community Edition with Unity Catalog Volumes. Code samples show /tmp/ paths as standard pattern but were executed with Volume paths.

Setup:

  1. Community Edition → Use /Volumes/ paths
  2. Standard/Premium → Use /tmp/ or /dbfs/FileStore/ paths
  3. Update path variables in all notebooks

Performance Metrics

Pipeline Latency:

  • API to Bronze: <2 seconds
  • Bronze to Silver: <2 seconds
  • Silver to Gold: <1 second
  • Total: <5 seconds per batch

Throughput:

  • API rate: 5 calls/minute (free tier)
  • Processing: 100+ quotes/minute
  • Scalable to: 1M+ events/day

Storage:

  • Format: Delta Lake with Parquet compression
  • Partitioning: Date and symbol
  • Time travel: 30 days retention

Cost:

  • Databricks Community Edition: Free
  • Alpha Vantage API: Free tier
  • Total: $0/month

Data Schema

Bronze Table

symbol              STRING
price               DOUBLE
volume              LONG
timestamp           TIMESTAMP
change_percent      DOUBLE
ingestion_timestamp TIMESTAMP
ingestion_date      DATE
source              STRING

Silver Table

Includes Bronze fields plus:

ma_5min             DOUBLE
ma_15min            DOUBLE
ma_1hr              DOUBLE
volume_avg_1hr      DOUBLE
volatility          DOUBLE
data_quality_score  INT

Gold Tables

Top Movers:

symbol, price, change_percent, movement_type, timestamp

Volatility Leaders:

symbol, price, volatility_pct, volume

Alerts:

symbol, alert_type, price, change_percent, alert_timestamp

Security

API Key Protection:

  • Never commit config.py
  • Use .gitignore exclusions
  • Store production keys in Databricks Secrets
  • Rotate keys regularly

Excluded Files:

config.py
*.key
.env
__pycache__/
*.pyc

Troubleshooting

API Rate Limit:

  • Free tier: 500 calls/day, 5 calls/minute
  • Solution: Increase time.sleep() or upgrade plan

Delta Lake Errors:

  • Check path permissions
  • Verify Delta Lake compatibility
  • Ensure adequate storage

Dashboard Issues:

  • Verify config.py exists with valid API key
  • Check port 8501 availability
  • Install all dependencies from requirements.txt

Future Enhancements

  • Structured Streaming with Kafka/Event Hub
  • Machine learning price predictions
  • Multi-exchange support (NYSE, NASDAQ, LSE)
  • Automated trading signals
  • Slack/email notifications
  • Historical backtesting

Author

Nalini Panwar


License

MIT License - see LICENSE file


Acknowledgments

  • Alpha Vantage for market data API
  • Databricks Community Edition
  • Delta Lake open source project
  • Apache Spark community

About

Stock market analytics pipeline with medallion architecture on Databricks

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages