Commit e7d9865 (parent dcb25df)

feat: enhance README with detailed usage instructions and integration examples

1 file changed: README.md (128 additions & 72 deletions)
@@ -1,7 +1,22 @@

# sdmxflow

[![PyPI](https://img.shields.io/pypi/v/sdmxflow.svg)](https://pypi.org/project/sdmxflow/)
[![Python versions](https://img.shields.io/pypi/pyversions/sdmxflow.svg)](https://pypi.org/project/sdmxflow/)
[![License](https://img.shields.io/pypi/l/sdmxflow.svg)](LICENSE.md)
[![CI](https://github.com/knifflig/sdmxflow/actions/workflows/ci.yml/badge.svg)](https://github.com/knifflig/sdmxflow/actions/workflows/ci.yml)

`sdmxflow` turns SDMX datasets (Eurostat today) into deterministic, append-only warehouse refresh artifacts: a facts CSV, a versioned metadata trail, and exported codelists.

**Problem:** SDMX is easy to query, but harder to operationalize for warehouses (repeatable artifacts, refresh semantics, reference data, governance).

**Solution:** `sdmxflow` fetches a dataset and writes a stable on-disk layout that you can load into your warehouse on a schedule.

**Proof:** Eurostat is supported now (`source_id="ESTAT"`), with append-only refresh and last-updated change detection.

> [!NOTE]
> **Status:** early but functional
> **Supported providers:** **Eurostat** (`source_id="ESTAT"`)
> **Docs:** https://knifflig.github.io/sdmxflow/

`sdmxflow` is designed for the common “ELT input dataset” pattern:
@@ -11,36 +26,112 @@ Download SDMX datasets into a reproducible, append-only on-disk layout for data

- keep a minimal but useful metadata trail (versions, timestamps, URLs, status, row counts),
- export the reference data (codelists) required to interpret coded columns.

---

## Quickstart

The primary entrypoint is `SdmxDataset`.

```python
from pathlib import Path

from sdmxflow.dataset import SdmxDataset

ds = SdmxDataset(
    out_dir=Path("./out/lfsa_egai2d"),
    source_id="ESTAT",
    dataset_id="lfsa_egai2d",
    # Optional:
    # agency_id="ESTAT",
    # key=...,       # provider-specific key restriction
    # params={...},  # provider-specific passthrough params
    save_logs=True,  # writes <out_dir>/logs/<agency>__<dataset>__<timestamp>.log
)

result = ds.fetch()

# `result` contains paths to the artifacts that were created/updated:
# - result.dataset_csv
# - result.metadata_json
# - result.codelists_dir
```

### What you get on disk

```text
<out_dir>/
  dataset.csv    # append-only facts across versions
  metadata.json  # version history + fetch metadata
  codelists/     # exported reference tables
  logs/          # only when save_logs=True
    <agency>__<dataset>__<timestamp>.log
```
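Because the layout is fixed, a loader task can resolve and sanity-check the artifacts with plain `pathlib`. A minimal sketch (the helper names are illustrative, not part of the `sdmxflow` API; only the file names shown above are assumed):

```python
from pathlib import Path


def artifact_paths(out_dir: Path) -> dict:
    """Map sdmxflow's fixed output layout to concrete paths under out_dir."""
    return {
        "facts": out_dir / "dataset.csv",       # append-only facts
        "metadata": out_dir / "metadata.json",  # version history + fetch metadata
        "codelists": out_dir / "codelists",     # exported reference tables
        "logs": out_dir / "logs",               # present only when save_logs=True
    }


def missing_artifacts(out_dir: Path) -> list:
    """Names of required artifacts that are absent (logs/ is optional)."""
    paths = artifact_paths(out_dir)
    return [name for name in ("facts", "metadata", "codelists") if not paths[name].exists()]
```

Calling `missing_artifacts(out_dir)` right after `fetch()` lets a pipeline fail fast before the warehouse load step.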

---

## Integrations (Airflow/dbt style)

The intended workflow is: fetch artifacts → load into your warehouse → model downstream.

Example (Airflow task pseudocode):

```python
from pathlib import Path

from sdmxflow.dataset import SdmxDataset


def refresh_eurostat_lfsa_egai2d() -> None:
    ds = SdmxDataset(
        out_dir=Path("/data/sdmx/lfsa_egai2d"),
        source_id="ESTAT",
        dataset_id="lfsa_egai2d",
    )
    ds.fetch()
```

Then:

- load `<out_dir>/dataset.csv` into a staging table,
- define it as a dbt source,
- build models on top; select the newest version via the `last_updated` column.
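For a lightweight consumer without dbt, the same "newest version" selection can be sketched with the standard library. This assumes only that `dataset.csv` carries a `last_updated` column whose values sort lexicographically (e.g. ISO-8601 timestamps); the helper name is illustrative:

```python
import csv
from pathlib import Path


def newest_version_rows(dataset_csv: Path) -> list:
    """Return only the rows belonging to the most recent upstream version.

    Assumes `last_updated` values sort lexicographically (e.g. ISO-8601).
    """
    with dataset_csv.open(newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    if not rows:
        return []
    latest = max(row["last_updated"] for row in rows)
    return [row for row in rows if row["last_updated"] == latest]
```

Reading the full file and filtering is the simple option; an incremental job would instead remember the last `last_updated` it processed and take only newer slices.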

---

## How refresh works

`fetch()` is designed for scheduled refresh jobs:

1. Fetch the upstream “last updated” timestamp.
2. Compare it with the latest locally recorded timestamp in `metadata.json`.
3. If unchanged: leave the dataset untouched (metadata and codelists are still kept up to date).
4. If changed: download and append a new slice to `dataset.csv`, then update metadata and codelists.
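The decision in steps 1–3 boils down to a timestamp comparison. A minimal sketch (the helper name and signature are illustrative, not the `sdmxflow` API):

```python
from datetime import datetime
from typing import Optional


def should_refresh(upstream: datetime, local: Optional[datetime]) -> bool:
    """Append a new slice only when the upstream timestamp moved past the local record.

    `local` is None on the first run, before metadata.json has any history.
    """
    if local is None:
        return True  # first fetch always downloads
    return upstream > local
```

On an unchanged upstream this returns `False`, which is why a scheduled `fetch()` is cheap when there is nothing new.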

---

## Use cases

- Refresh Eurostat indicators nightly into Postgres/Snowflake/BigQuery staging.
- Keep reference codelists versioned alongside fact extracts for governance.
- Produce reproducible ELT inputs (facts + metadata + reference tables) for analysts.

---

## Why sdmxflow

`sdmxflow` is intentionally opinionated about *operationalizing* SDMX datasets for warehouse refresh jobs.

- Compared to SDMX client libraries: they fetch data; `sdmxflow` produces deterministic refresh artifacts + metadata trail + codelists.
- Compared to flexible extractors: `sdmxflow` focuses on a stable layout and predictable refresh semantics.

See “Credits and acknowledgements” below for project influences and dependencies.

---

## Features

- **Append-only refresh**: only downloads and appends when upstream changed.
- **Warehouse-friendly layout**: `dataset.csv` (facts), `metadata.json` (versions + fetch info), `codelists/` (reference tables).
- **Fast upstream change detection** (Eurostat): uses SDMX annotations for last-updated.
- **User-friendly logging** at `INFO` and detailed diagnostics at `DEBUG`.
- Optional per-run log file capture via `save_logs=True`.
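Both logging levels can be switched on from the host application with the standard `logging` module. A sketch, assuming `sdmxflow` follows the usual convention of logging under its package name (the `"sdmxflow"` logger name is an assumption):

```python
import logging

# Show INFO from everything, with timestamps for scheduled-job logs.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)

# Turn on detailed diagnostics for sdmxflow only.
logging.getLogger("sdmxflow").setLevel(logging.DEBUG)
```

This keeps third-party libraries at `INFO` while surfacing `sdmxflow`'s `DEBUG` diagnostics.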
@@ -57,8 +148,6 @@ Non-goals (for now):

### From PyPI (recommended)

```bash
pip install sdmxflow
```
@@ -75,44 +164,6 @@ uv sync --group dev

---

## Output layout

`sdmxflow` writes a stable folder structure under your chosen `out_dir`:
@@ -160,21 +211,6 @@ Contains exported codelists needed to interpret coded dataset columns.

---

## Provider support and limitations

- Supported:

@@ -188,6 +224,26 @@ Planned/possible future work (not guaranteed):

---

## FAQ

**Does `sdmxflow` load into my warehouse directly?**

No. It produces deterministic on-disk artifacts (CSV/JSON/codelists); you load them with your existing tooling (Airflow, dbt, COPY/LOAD jobs, etc.).

**Does it support providers besides Eurostat?**

Not yet. Eurostat (`source_id="ESTAT"`) is the only provider supported at the moment.

**Does it deduplicate data?**

It is append-only across upstream versions. Each appended slice is marked with a `last_updated` value, so downstream jobs can select the newest version (or reprocess the full history).

**How does it detect upstream changes?**

For Eurostat, it reads a last-updated timestamp from SDMX annotations and compares it to the latest locally recorded timestamp.

---
## Development

Install dev dependencies:
