
FinTech Market Ingestion: Alpaca → Parquet → DuckDB + QA Observability

A production-style market data ingestion and validation framework for historical OHLCV bars (Daily + 1-Minute) using Alpaca market data. The pipeline writes curated, partitioned Parquet datasets and provides a structured QA layer with artifact-based observability and optional strict enforcement suitable for CI gating and trading research workflows.


Architecture Overview

Alpaca API
    ↓
Alpaca Client (retry + pagination)
    ↓
Normalization Layer (UTC + dedupe)
    ↓
Partitioned Parquet (symbol/year or symbol/date)
    ↓
DuckDB Analytics + QA Observability

Current Capabilities

Alpaca Historical Market Data Client

  • Auth via .env

  • Supports: symbol, start, end, timeframe (1Day, 1Min), feed (default: iex)

  • Automatic pagination

  • Exponential backoff for:

    • HTTP 429 (rate limiting)
    • HTTP 5xx (server errors)
    • network timeouts
  • Returns results as a pandas DataFrame
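
The retry behavior above can be sketched as follows. This is a minimal illustration, not the repository's actual client code: the helper name `with_backoff`, the `RetryableError` type, and the backoff constants are all assumptions.

```python
import random
import time


class RetryableError(Exception):
    """Raised for HTTP 429 / 5xx responses and network timeouts."""


def with_backoff(call, max_retries=5, base_delay=1.0, max_delay=60.0):
    """Retry `call` on retryable errors with exponential backoff plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except RetryableError:
            if attempt == max_retries:
                raise
            # exponential backoff, capped at max_delay, with small random jitter
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay + random.uniform(0, 0.1 * delay))
```

In the real client, the wrapped callable would be a single paginated request to the Alpaca bars endpoint, with `RetryableError` raised for 429, 5xx, and timeout conditions.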

Windowed Daily Backfill Pipeline

  • Iterates over a 50-ticker universe

  • Fetches daily bars in monthly windows

  • Normalizes into a canonical schema (UTC enforced)

  • Deterministic dedupe on primary key:

    (symbol, ts_utc, timeframe)
    
  • Writes partitioned Parquet:

    data/curated/bars_daily/symbol=XYZ/year=YYYY/
    
  • Safe to re-run (idempotent)
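
The monthly windowing can be sketched like this (the helper name `month_windows` is illustrative; the pipeline's actual windowing code may differ):

```python
from datetime import date


def month_windows(start: date, end: date):
    """Yield [window_start, window_end) pairs on month boundaries covering start..end."""
    cur = date(start.year, start.month, 1)
    while cur < end:
        # first day of the next month
        nxt = date(cur.year + 1, 1, 1) if cur.month == 12 else date(cur.year, cur.month + 1, 1)
        # clip the first and last windows to the requested range
        yield max(cur, start), min(nxt, end)
        cur = nxt
```

Each yielded window maps to one fetch per symbol, which keeps individual requests small and makes partial re-runs cheap.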

Windowed 1-Minute Backfill Pipeline

  • Day-windowed ingestion for intraday OHLCV bars

  • Failure isolation per (symbol, date)

  • Restartable + rate-limit safe

  • Writes partitioned Parquet:

    data/curated/bars_1m/symbol=XYZ/date=YYYY-MM-DD/
    

Partitioned Parquet Storage

  • Implemented via pyarrow.parquet.write_to_dataset
  • Uses existing_data_behavior="delete_matching" to safely overwrite partitions
  • DuckDB-compatible queries over glob paths

QA Layer + Artifact-Based Observability (M2)

  • Duplicate primary key detection

  • OHLC integrity checks

  • Timestamp sanity / ordering checks

  • Gap detection (daily + intraday)

  • Coverage metrics (calendar-aware)

  • Deterministic artifacts per QA run:

    artifacts/qa/<run_id>/
      qa_summary_by_symbol.csv
      qa_summary_global.csv
      qa_coverage_by_symbol.csv
    
  • Optional strict enforcement mode (CI-friendly)


QA Export CLI Usage

The repository includes a framework-level QA export command that computes integrity metrics and writes deterministic artifacts per run.

Artifacts are written to:

artifacts/qa/<run_id>/
  qa_summary_by_symbol.csv
  qa_summary_global.csv
  qa_coverage_by_symbol.csv

Observability Mode (Default)

Exports QA metrics but does not fail the pipeline.

python -m src.ingestion.qa_export \
  --timeframe all \
  --start 2025-11-01 \
  --end 2025-12-01

You may also run for a single timeframe:

Daily Only

python -m src.ingestion.qa_export \
  --timeframe 1D \
  --start 2025-11-01 \
  --end 2025-12-01

1-Minute Only

python -m src.ingestion.qa_export \
  --timeframe 1Min \
  --start 2025-11-01 \
  --end 2025-12-01

Strict Enforcement Mode (CI / Guardrail Mode)

Fails the pipeline if integrity thresholds are exceeded.

python -m src.ingestion.qa_export \
  --timeframe all \
  --start 2025-11-01 \
  --end 2025-12-01 \
  --strict \
  --max-duplicate-keys 0 \
  --max-ohlc-violations 0

Exit Codes

Condition                                Exit Code
PASS (within thresholds)                 0
WARN (coverage/gap thresholds exceeded)  0
FAIL (strict integrity violation)        1

Strict mode ensures:

  • Duplicate primary keys fail the pipeline
  • OHLC violations fail the pipeline
  • QA artifacts are still written for debugging
  • CI jobs automatically fail when data integrity is compromised

Example CI Integration

- name: Strict QA
  run: |
    python -m src.ingestion.qa_export \
      --timeframe all \
      --start 2025-11-01 \
      --end 2025-12-01 \
      --strict \
      --max-duplicate-keys 0 \
      --max-ohlc-violations 0

If integrity rules are violated, the CI job exits with code 1.


Why This Matters

In trading and backtesting systems:

  • Duplicate bars distort indicators (EMA, RSI, VWAP, etc.)
  • OHLC violations corrupt return calculations
  • Silent ingestion errors invalidate research
  • Coverage gaps can bias signal evaluation

The qa_export CLI formalizes dataset observability and enables production-grade guardrails before any modeling or execution layer consumes the data.


Universe Selection Criteria

The project operates on a curated universe (currently 50 tickers) optimized for stable historical availability.

The universe definition should document:

  • Selection logic (liquidity + large-cap bias)
  • Exclusion rules
  • Intended expansion strategy

Tip: include this as docs/universe.md so it’s easy to audit and evolve.


Alpaca Market Data Client

Reusable Alpaca historical market data client:

src/ingestion/alpaca_client.py

Example Usage

from src.ingestion.alpaca_client import AlpacaMarketDataClient

client = AlpacaMarketDataClient.from_env()

df = client.fetch_bars(
    symbol="AAPL",
    start="2025-01-01T00:00:00Z",
    end="2025-02-01T00:00:00Z",
    timeframe="1Day"
)

print(df.head())

Setup

1) Create + Activate Virtual Environment (Windows)

python -m venv venv
.\venv\Scripts\activate

2) Install Dependencies

pip install -r requirements.txt

3) Export Dependencies (when updating your environment)

If you install new packages and want to pin exact versions:

pip freeze > requirements.txt

Configuration (.env)

Create a .env file in the project root:

ALPACA_API_KEY=your_key_here
ALPACA_SECRET_KEY=your_secret_here
ALPACA_DATA_FEED=iex

Project Layout

data/
  curated/
    bars_daily/
    bars_1m/

artifacts/
  qa/

reports/

src/
  ingestion/
  qa/
  utils/

notebooks/

Daily Backfill

CLI

python -m src.ingestion.backfill_daily --start 2023-01-01 --end 2026-01-01

Storage Layout

data/curated/bars_daily/symbol=XYZ/year=YYYY/part-*.parquet

1-Minute Historical Backfill

Windowed ingestion of 1-minute OHLCV bars (IEX-compatible) into curated Parquet partitions.

Designed for:

  • Idempotent re-runs
  • Restartable backfills
  • Rate-limit safe API usage
  • Failure isolation
  • DuckDB/Pandas/backtesting compatibility

Data Contract (Canonical Schema)

Column     Type             Description
symbol     string           Ticker symbol
ts_utc     timestamp (UTC)  Bar open time in UTC
open       float            Open price
high       float            High price
low        float            Low price
close      float            Close price
volume     float/int        Trade volume
source     string           e.g., alpaca_iex
timeframe  string           "1Min"
date       string           YYYY-MM-DD, derived from ts_utc

Primary key (idempotency):

(symbol, ts_utc, timeframe)
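
Enforcing this contract can be sketched in pandas as follows (a minimal illustration; the helper name `normalize_bars` and the exact column handling are assumptions, not the repository's code):

```python
import pandas as pd

# canonical primary key for idempotent writes
PK = ["symbol", "ts_utc", "timeframe"]


def normalize_bars(df: pd.DataFrame) -> pd.DataFrame:
    """Enforce UTC timestamps, derive the date partition column,
    and dedupe deterministically on the primary key."""
    out = df.copy()
    out["ts_utc"] = pd.to_datetime(out["ts_utc"], utc=True)
    out["date"] = out["ts_utc"].dt.strftime("%Y-%m-%d")
    out = out.drop_duplicates(subset=PK, keep="first")
    return out.sort_values(PK).reset_index(drop=True)
```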

Partition Layout

data/curated/bars_1m/
  symbol=XYZ/
    date=YYYY-MM-DD/
      part-*.parquet

CLI Usage

python -m src.ingestion.backfill_1m \
  --start 2024-01-01 \
  --end 2024-04-01

Optional flags:

--symbols configs/tickers_50.txt
--max-symbols 5
--sleep-ms 100
--no-progress

Failure Logging

Failed (symbol, date) requests are recorded:

reports/backfill_1m_failures_YYYYMMDD_HHMMSS.csv

Backfill continues even when individual symbol/day fetches fail.
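
A sketch of the failure log writer, assuming failures are collected as (symbol, date, error) tuples (the function name and CSV columns are illustrative):

```python
import csv
from datetime import datetime, timezone


def write_failure_log(failures, reports_dir="reports"):
    """Persist failed (symbol, date) fetches so a later run can retry just those."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    path = f"{reports_dir}/backfill_1m_failures_{stamp}.csv"
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["symbol", "date", "error"])
        writer.writerows(failures)
    return path
```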


DuckDB Analytics Validation

Example: validate per-symbol ranges and row counts directly over Parquet:

import duckdb
con = duckdb.connect()

df = con.execute("""
  SELECT symbol,
         MIN(ts_utc) AS min_ts,
         MAX(ts_utc) AS max_ts,
         COUNT(*) AS row_count
  FROM 'data/curated/bars_daily/**/*.parquet'
  GROUP BY symbol
  ORDER BY symbol
""").df()

print(df.head())

Data Quality & Validation (QA Layer)

Validates curated market data after ingestion runs (daily and 1-minute).

Ensures:

  • Primary key uniqueness
  • OHLC integrity
  • Timestamp ordering
  • Coverage / gap detection
  • Repeatable reporting
  • Optional strict enforcement mode

Supported Datasets

Timeframe   Path
Daily (1D)  data/curated/bars_daily/**/*.parquet
1-Minute    data/curated/bars_1m/**/*.parquet

Expected columns:

symbol, ts_utc, open, high, low, close, volume, source, timeframe

Primary key:

(symbol, ts_utc, timeframe)

QA Checks Implemented

Duplicate Detection

Detects duplicates on:

(symbol, ts_utc, timeframe)

  • Reports per symbol
  • Strict mode fails if duplicates exist

OHLC Integrity Rules

Validates:

  • high >= max(open, close)
  • low <= min(open, close)
  • high >= low
  • open, high, low, close > 0
  • volume >= 0
  • Required fields non-null
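
These rules can be expressed as one vectorized pandas check; this is a sketch under the canonical schema, not the repository's exact QA implementation:

```python
import pandas as pd


def ohlc_violations(df: pd.DataFrame) -> pd.Series:
    """Boolean mask of rows violating the OHLC integrity rules listed above."""
    return (
        (df["high"] < df[["open", "close"]].max(axis=1))          # high >= max(open, close)
        | (df["low"] > df[["open", "close"]].min(axis=1))         # low <= min(open, close)
        | (df["high"] < df["low"])                                # high >= low
        | (df[["open", "high", "low", "close"]] <= 0).any(axis=1) # prices strictly positive
        | (df["volume"] < 0)                                      # volume non-negative
        | df[["open", "high", "low", "close", "volume"]].isna().any(axis=1)  # non-null
    )
```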

Timestamp Sanity

  • ts_utc non-null
  • Out-of-order detection per symbol

Coverage / Gap Detection

Daily (1D)

  • Calendar-aware expected sessions (XNYS default)

1-Minute (1Min) — Session-Aware Gap Logic

Gap metrics computed within (symbol, session_date) using timestamp deltas:

  • gap_count
  • gap_segments
  • max_gap_len

This prevents overnight/weekend closures from being counted as gaps.

Note on Alpaca IEX feed: missing minute bars may reflect “no trades” rather than corruption. Tune thresholds accordingly.
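
The session-level gap metrics can be sketched as below, assuming pandas and one group of timestamps per (symbol, session_date). The metric names match the bullets above, but this counting (missing minutes implied by timestamp deltas) is an assumption about the repository's exact definitions:

```python
import pandas as pd


def gap_metrics(ts: pd.Series, freq: str = "1min") -> dict:
    """Gap stats within one (symbol, session_date) group of 1-minute bar timestamps."""
    ts = ts.sort_values()
    # missing bars implied by each delta larger than one bar interval
    missed = (ts.diff().dropna() / pd.Timedelta(freq) - 1).astype(int)
    gaps = missed[missed > 0]
    return {
        "gap_count": int(gaps.sum()),        # total missing minutes
        "gap_segments": int(len(gaps)),      # contiguous gap runs
        "max_gap_len": int(gaps.max()) if len(gaps) else 0,
    }
```

Because the metric is computed per session, the delta from a session close to the next session's open is never seen, which is what keeps overnight and weekend closures out of the gap counts.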


QA Summary Export (Framework-Level Observability)

QA exports are written to:

artifacts/qa/<run_id>/
  qa_summary_by_symbol.csv
  qa_summary_global.csv
  qa_coverage_by_symbol.csv

Global summary includes:

  • total_rows / unique_rows / duplicate_rows
  • total_gap_count
  • total_ohlc_violation_count
  • pct_symbols_below_coverage_threshold
  • overall_status (PASS / WARN / FAIL)

Run IDs are deterministic from:

dataset_name + interval + start_ts + end_ts + calendar
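
Deriving the run ID can be sketched as a hash over those components (a minimal illustration; the hash function, separator, and ID length here are assumptions):

```python
import hashlib


def make_run_id(dataset_name: str, interval: str, start_ts: str, end_ts: str, calendar: str) -> str:
    """Deterministic run id: identical inputs always produce the identical id."""
    key = "|".join([dataset_name, interval, start_ts, end_ts, calendar])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
```

Determinism means re-running QA over the same dataset and window overwrites the same artifacts/qa/&lt;run_id&gt;/ directory instead of accumulating near-duplicate runs.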

Strict QA Mode (Guardrail Enforcement)

When strict mode is enabled:

  • Duplicate primary keys → FAIL (exit code 1)
  • OHLC violations over threshold → FAIL (exit code 1)
  • Artifacts still written for debugging
  • CI-compatible gating

CLI

python -m src.ingestion.qa_export \
  --timeframe all \
  --start 2025-11-01 \
  --end 2025-12-01 \
  --strict \
  --max-duplicate-keys 0 \
  --max-ohlc-violations 0

Tests

Run unit tests:

pytest -q

Tests cover:

  • Normalization logic (canonical schema + UTC + sorting)
  • Duplicate key detection
  • OHLC violation detection
  • Strict enforcement path

DuckDB Sanity Queries Notebook

Notebook: notebooks/01_duckdb_sanity_queries.ipynb

Validates:

  • Schema presence and timestamp parsing
  • Coverage + completeness
  • Duplicate primary key contract
  • Basic validity constraints
  • Example analytics with window functions

Exports:

reports/duckdb_sanity_summary.csv

Roadmap / Future Enhancements

  • Fully session-aware expected-minute index comparison
  • Extended-hours handling
  • Per-run historical QA trend table
  • Machine-readable JSON summary export
  • CI auto-fail policies based on qa_summary_global.csv
