# `result` contains paths to the artifacts that were created/updated:
# - result.dataset_csv
# - result.metadata_json
# - result.codelists_dir
```

### What you get on disk

```text
<out_dir>/
  dataset.csv      # append-only facts across versions
  metadata.json    # version history + fetch metadata
  codelists/       # exported reference tables
  logs/            # only when save_logs=True
    <agency>__<dataset>__<timestamp>.log
```
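
Downstream jobs can locate these artifacts by convention. A minimal sketch of that (the `out_dir` value is illustrative; use whatever you passed to `SdmxDataset`):

```python
from pathlib import Path

out_dir = Path("/data/sdmx/lfsa_egai2d")   # illustrative; match your SdmxDataset(out_dir=...)
dataset_csv = out_dir / "dataset.csv"      # append-only facts
metadata_json = out_dir / "metadata.json"  # version history + fetch metadata
codelists_dir = out_dir / "codelists"      # exported reference tables
```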

---

## Integrations (Airflow/dbt style)

The intended workflow is: fetch artifacts → load into your warehouse → model downstream.

Example (Airflow task pseudocode):

```python
from pathlib import Path

from sdmxflow.dataset import SdmxDataset


def refresh_eurostat_lfsa_egai2d() -> None:
    # fetch() downloads and appends only when the upstream dataset changed.
    ds = SdmxDataset(
        out_dir=Path("/data/sdmx/lfsa_egai2d"),
        source_id="ESTAT",
        dataset_id="lfsa_egai2d",
    )
    ds.fetch()
```
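
Wrapped as a real Airflow task, the same function might look like this (a sketch assuming Airflow 2's TaskFlow API; the DAG id and schedule are illustrative, and `refresh_eurostat_lfsa_egai2d` is the function from the example above):

```python
import pendulum
from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def sdmx_refresh():
    @task
    def refresh() -> None:
        # Appends to dataset.csv only when the provider published a new version.
        refresh_eurostat_lfsa_egai2d()

    refresh()


sdmx_refresh()
```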

Then:

- load `<out_dir>/dataset.csv` into a staging table,
- define it as a dbt source,
- build models on top; select the newest version via the `last_updated` column.
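
If you want the same "newest version" selection outside the warehouse, a minimal pandas sketch (an illustration, not part of `sdmxflow`; it assumes `last_updated` holds sortable timestamp strings, which is how appended slices are marked):

```python
import pandas as pd

df = pd.read_csv("dataset.csv")  # the full append-only history
# Keep only the rows belonging to the most recent upstream version.
latest = df[df["last_updated"] == df["last_updated"].max()]
```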

---

## How refresh works

`fetch()` is designed for scheduled refresh jobs:

1. Fetch the upstream “last updated” timestamp.
2. Compare it with the latest locally recorded timestamp in `metadata.json`.
3. If unchanged: leave the dataset untouched (metadata + codelists are still ensured).
4. If changed: download and append a new slice to `dataset.csv`, then update metadata + codelists.
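
In pseudocode, the decision reduces to a timestamp comparison. A minimal sketch, assuming a hypothetical `metadata.json` shape (a `versions` list whose entries carry `last_updated`); the helper is illustrative, not `sdmxflow`'s actual internals:

```python
import json
from pathlib import Path


def needs_refresh(out_dir: Path, upstream_last_updated: str) -> bool:
    """True when upstream reports a newer timestamp than anything recorded locally."""
    metadata_path = out_dir / "metadata.json"
    if not metadata_path.exists():
        return True  # first run: nothing recorded yet, always fetch
    metadata = json.loads(metadata_path.read_text())
    # Hypothetical shape: one entry per fetched version, each with "last_updated".
    recorded = [version["last_updated"] for version in metadata.get("versions", [])]
    # ISO-8601 timestamp strings compare correctly as plain strings.
    return not recorded or upstream_last_updated > max(recorded)
```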

---

## Use cases

- Refresh Eurostat indicators nightly into Postgres/Snowflake/BigQuery staging (see the sketch below).
- Keep reference codelists versioned alongside fact extracts for governance.
- Produce reproducible ELT inputs (facts + metadata + reference tables) for analysts.
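
For the Postgres case, the staging load can stay small (a sketch; it assumes `psycopg2`, an existing `staging.sdmx_facts` table whose columns match the CSV, and an illustrative connection string):

```python
import psycopg2

conn = psycopg2.connect("dbname=warehouse")  # illustrative DSN
with conn, conn.cursor() as cur, open("dataset.csv") as f:
    # Bulk-load the append-only extract into the staging table.
    cur.copy_expert("COPY staging.sdmx_facts FROM STDIN WITH CSV HEADER", f)
```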

---

## Why sdmxflow

`sdmxflow` is intentionally opinionated about *operationalizing* SDMX datasets for warehouse refresh jobs.

- Compared to SDMX client libraries: they fetch data; `sdmxflow` produces deterministic refresh artifacts + metadata trail + codelists.
- Compared to flexible extractors: `sdmxflow` focuses on stable layout and predictable refresh semantics.

See “Credits and acknowledgements” below for project influences and dependencies.

---

## Features

- **Append-only refresh**: only downloads and appends when upstream changed.

---

## Provider support and limitations

- Supported: Eurostat (`source_id="ESTAT"`).

---
## FAQ

**Does `sdmxflow` load into my warehouse directly?**

No. It produces deterministic on-disk artifacts (CSV/JSON/codelists). You load them using your existing tooling (Airflow, dbt, COPY/LOAD jobs, etc.).

**Does it support providers besides Eurostat?**

Not yet. Eurostat (`source_id="ESTAT"`) is the current supported provider.

**Does it deduplicate data?**

It is append-only across upstream versions. Each appended slice is marked with a `last_updated` value so downstream jobs can select the newest version (or reprocess full history).

**How does it detect upstream changes?**

For Eurostat, it uses SDMX annotations to obtain a last-updated timestamp and compares it to the latest locally recorded timestamp.