Commit e7d9865 (parent dcb25df)

feat: enhance README with detailed usage instructions and integration examples

1 file changed: README.md (128 additions & 72 deletions)
@@ -1,7 +1,22 @@

# sdmxflow

[![PyPI](https://img.shields.io/pypi/v/sdmxflow.svg)](https://pypi.org/project/sdmxflow/)
[![Python versions](https://img.shields.io/pypi/pyversions/sdmxflow.svg)](https://pypi.org/project/sdmxflow/)
[![License](https://img.shields.io/pypi/l/sdmxflow.svg)](LICENSE.md)
[![CI](https://github.com/knifflig/sdmxflow/actions/workflows/ci.yml/badge.svg)](https://github.com/knifflig/sdmxflow/actions/workflows/ci.yml)

`sdmxflow` turns SDMX datasets (Eurostat today) into deterministic, append-only warehouse refresh artifacts: a facts CSV, a versioned metadata trail, and exported codelists.

**Problem:** SDMX is easy to query, but harder to operationalize for warehouses (repeatable artifacts, refresh semantics, reference data, governance).

**Solution:** `sdmxflow` fetches a dataset and writes a stable on-disk layout that you can load into your warehouse on a schedule.

**Proof:** Eurostat is supported now (`source_id="ESTAT"`), with append-only refresh and last-updated change detection.

> [!NOTE]
> **Status:** early but functional
> **Supported providers:** **Eurostat** (`source_id="ESTAT"`)
> **Docs:** https://knifflig.github.io/sdmxflow/

`sdmxflow` is designed for the common “ELT input dataset” pattern:
@@ -11,36 +26,112 @@ Download SDMX datasets into a reproducible, append-only on-disk layout for data

- keep a minimal but useful metadata trail (versions, timestamps, URLs, status, row counts),
- export the reference data (codelists) required to interpret coded columns.

---

## Quickstart

The primary entrypoint is `SdmxDataset`.

```python
from pathlib import Path

from sdmxflow.dataset import SdmxDataset

ds = SdmxDataset(
    out_dir=Path("./out/lfsa_egai2d"),
    source_id="ESTAT",
    dataset_id="lfsa_egai2d",
    # Optional:
    # agency_id="ESTAT",
    # key=...,       # provider-specific key restriction
    # params={...},  # provider-specific passthrough params
    save_logs=True,  # writes <out_dir>/logs/<agency>__<dataset>__<timestamp>.log
)

result = ds.fetch()

# `result` contains paths to the artifacts that were created/updated:
# - result.dataset_csv
# - result.metadata_json
# - result.codelists_dir
```

### What you get on disk

```text
<out_dir>/
  dataset.csv    # append-only facts across versions
  metadata.json  # version history + fetch metadata
  codelists/     # exported reference tables
  logs/          # only when save_logs=True
    <agency>__<dataset>__<timestamp>.log
```
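Because the layout is fixed, a loader task can resolve and sanity-check the artifacts with plain `pathlib`. A minimal sketch (the helper names are illustrative, not part of the `sdmxflow` API; only the file names shown above are assumed):

```python
from pathlib import Path


def artifact_paths(out_dir: Path) -> dict:
    """Map sdmxflow's fixed output layout to concrete paths under out_dir."""
    return {
        "facts": out_dir / "dataset.csv",       # append-only facts
        "metadata": out_dir / "metadata.json",  # version history + fetch metadata
        "codelists": out_dir / "codelists",     # exported reference tables
        "logs": out_dir / "logs",               # present only when save_logs=True
    }


def missing_artifacts(out_dir: Path) -> list:
    """Names of required artifacts that are absent (logs/ is optional)."""
    paths = artifact_paths(out_dir)
    return [name for name in ("facts", "metadata", "codelists") if not paths[name].exists()]
```

Calling `missing_artifacts(out_dir)` right after `fetch()` lets a pipeline fail fast before the warehouse load step.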

---

## Integrations (Airflow/dbt style)

The intended workflow is: fetch artifacts → load into your warehouse → model downstream.

Example (Airflow task pseudocode):

```python
from pathlib import Path

from sdmxflow.dataset import SdmxDataset


def refresh_eurostat_lfsa_egai2d() -> None:
    ds = SdmxDataset(
        out_dir=Path("/data/sdmx/lfsa_egai2d"),
        source_id="ESTAT",
        dataset_id="lfsa_egai2d",
    )
    ds.fetch()
```

Then:

- load `<out_dir>/dataset.csv` into a staging table,
- define it as a dbt source,
- build models on top; select the newest version via the `last_updated` column.
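For a lightweight consumer without dbt, the same "newest version" selection can be sketched with the standard library. This assumes only that `dataset.csv` carries a `last_updated` column whose values sort lexicographically (e.g. ISO-8601 timestamps); the helper name is illustrative:

```python
import csv
from pathlib import Path


def newest_version_rows(dataset_csv: Path) -> list:
    """Return only the rows belonging to the most recent upstream version.

    Assumes `last_updated` values sort lexicographically (e.g. ISO-8601).
    """
    with dataset_csv.open(newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    if not rows:
        return []
    latest = max(row["last_updated"] for row in rows)
    return [row for row in rows if row["last_updated"] == latest]
```

Reading the full file and filtering is the simple option; an incremental job would instead remember the last `last_updated` it processed and take only newer slices.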

---

## How refresh works

`fetch()` is designed for scheduled refresh jobs:

1. Fetch the upstream “last updated” timestamp.
2. Compare it with the latest locally recorded timestamp in `metadata.json`.
3. If unchanged: leave the dataset untouched (metadata and codelists are still kept up to date).
4. If changed: download and append a new slice to `dataset.csv`, then update metadata and codelists.
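The decision in steps 1–3 boils down to a timestamp comparison. A minimal sketch (the helper name and signature are illustrative, not the `sdmxflow` API):

```python
from datetime import datetime
from typing import Optional


def should_refresh(upstream: datetime, local: Optional[datetime]) -> bool:
    """Append a new slice only when the upstream timestamp moved past the local record.

    `local` is None on the first run, before metadata.json has any history.
    """
    if local is None:
        return True  # first fetch always downloads
    return upstream > local
```

On an unchanged upstream this returns `False`, which is why a scheduled `fetch()` is cheap when there is nothing new.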

---

## Use cases

- Refresh Eurostat indicators nightly into Postgres/Snowflake/BigQuery staging.
- Keep reference codelists versioned alongside fact extracts for governance.
- Produce reproducible ELT inputs (facts + metadata + reference tables) for analysts.

---

## Why sdmxflow

`sdmxflow` is intentionally opinionated about *operationalizing* SDMX datasets for warehouse refresh jobs.

- Compared to SDMX client libraries: they fetch data; `sdmxflow` produces deterministic refresh artifacts + metadata trail + codelists.
- Compared to flexible extractors: `sdmxflow` focuses on a stable layout and predictable refresh semantics.

See “Credits and acknowledgements” below for project influences and dependencies.

---

## Features

- **Append-only refresh**: only downloads and appends when upstream changed.
- **Warehouse-friendly layout**: `dataset.csv` (facts), `metadata.json` (versions + fetch info), `codelists/` (reference tables).
- **Fast upstream change detection** (Eurostat): uses SDMX annotations for last-updated.
- **User-friendly logging** at `INFO` and detailed diagnostics at `DEBUG`.
- Optional per-run log file capture via `save_logs=True`.
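Both logging levels can be switched on from the host application with the standard `logging` module. A sketch, assuming `sdmxflow` follows the usual convention of logging under its package name (the `"sdmxflow"` logger name is an assumption):

```python
import logging

# Show INFO from everything, with timestamps for scheduled-job logs.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)

# Turn on detailed diagnostics for sdmxflow only.
logging.getLogger("sdmxflow").setLevel(logging.DEBUG)
```

This keeps third-party libraries at `INFO` while surfacing `sdmxflow`'s `DEBUG` diagnostics.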
@@ -57,8 +148,6 @@ Non-goals (for now):

### From PyPI (recommended)

```bash
pip install sdmxflow
```
@@ -75,44 +164,6 @@ uv sync --group dev

---

## Output layout

`sdmxflow` writes a stable folder structure under your chosen `out_dir`:
@@ -160,21 +211,6 @@ Contains exported codelists needed to interpret coded dataset columns.

---

## Provider support and limitations

- Supported:

@@ -188,6 +224,26 @@ Planned/possible future work (not guaranteed):

---

## FAQ

**Does `sdmxflow` load into my warehouse directly?**

No. It produces deterministic on-disk artifacts (CSV/JSON/codelists); you load them with your existing tooling (Airflow, dbt, COPY/LOAD jobs, etc.).

**Does it support providers besides Eurostat?**

Not yet. Eurostat (`source_id="ESTAT"`) is the only provider supported at the moment.

**Does it deduplicate data?**

It is append-only across upstream versions. Each appended slice is marked with a `last_updated` value, so downstream jobs can select the newest version (or reprocess the full history).

**How does it detect upstream changes?**

For Eurostat, it reads a last-updated timestamp from SDMX annotations and compares it to the latest locally recorded timestamp.

---
## Development

Install dev dependencies:
