Tooling for a minimalist data lab running on top of DuckLake.
Minimum requirements:
- uv with Python 3.13 installed.
- Access to MinIO or S3-compatible object storage.
Tip
I keep a MinIO instance on my tiny home lab, made of an old laptop running Proxmox, but you can easily spin up your own using the minio service under the dev profile in infra/services/docker/compose.yml, after setting up your .env (see below).
To run your own infrastructure, you'll also need:
Tip
Most workflows are saved as just commands, which are available after you install uv dependencies and load the virtual environment. Run just -l to list all available commands (more details below).
Note
The following is no longer required, and will be updated and tested soon:
We rely on the official duckdb/dbt-duckdb adapter to connect to DuckLake. At this time, the latest stable version of the adapter does not support attaching the external DuckLake catalog with the DATA_PATH option and S3 credentials. PR #564 solves this, so we're using what is, at this point, unreleased code (see the dbt-duckdb dependency and the corresponding entry under [tool.uv.sources] in the pyproject.toml file).
First create your own .env file from the provided example:
cp .env.example .env
Make sure you fill in the S3 configuration for:
S3_ACCESS_KEY_ID=minio_username
S3_SECRET_ACCESS_KEY=minio_password
You can then activate just and dlctl via:
uv sync
source .venv/bin/activate
You can then set up the MinIO service as follows (it will use your env vars):
docker compose -p datalab -f infra/services/docker/compose.yml \
--profile dev up minio minio-init -d
Or you can spin up the whole infrastructure locally, after Docker is running, by using:
just infra-provision-local
Tip
If you're having trouble connecting to MinIO, make sure you're using the correct region, which you set via the S3_REGION variable in .env.
You should also generate the init.sql file, so you can easily connect to your DuckLake from the CLI as well:
dlctl tools generate-init-sql
duckdb -init local/init.sql local/engine.duckdb
Or simply run the following command whenever you want to access your DuckLake, which will take care of the setup process for you:
just lakehouse
The general workflow you're expected to follow for data engineering is illustrated in the following diagram:
You're expected to implement your own dbt models to power dlctl transform. We provide an example of this under transform/models/, based on the following Kaggle datasets:
A few datasets are already supported, with pipelines encoded as just commands (e.g., econ-compnet-etl, graphrag-etl, mlops-etl, which correspond to projects with their own YouTube videos).
You can learn all other details below.
This is where the dlctl command lives, standing for 'Data Lab Control'. It helps you run all the tasks supported by the data lab package. It is available as a script under pyproject.toml and it can be accessed via:
uv sync
source .venv/bin/activate
dlctl ...
Note
A few torch dependencies, like torch_sparse, require UV_FIND_LINKS to be set when adding or removing any dependencies, but not during install, where uv.lock already has all the required information. We currently don't rely on this but, if we do in the future, here's how to approach it:
export UV_FIND_LINKS="https://data.pyg.org/whl/torch-2.7.0+cu126.html"
uv add --no-build-isolation pyg_lib torch_scatter torch_sparse \
torch_cluster torch_spline_conv
Implements a 4-layer infrastructure architecture to help you deploy a data stack on-premise using Proxmox, Terraform, and Docker.
- Layer 1 (foundation/) is a Terraform project that will provision MinIO on an LXC running on Proxmox.
- Layer 2 (platform/) is a Terraform project, with state storage on MinIO, that will provision three Docker VMs and a GitLab VM. GitLab will provide a container registry and come preconfigured with a GitLab Runner that executes on top of one of the Docker VMs.
- Layer 3 (services/) contains a Terraform project (gitlab/) to optionally initialize CI/CD variables/secrets from the local .env, and a Docker Compose project (docker/) to provision the data stack services.
- Layer 4 (applications/) contains local application deployments via Dockerized services (e.g., ml.server) and CI/CD integration to provision the required resources (e.g., postgres database and credentials).
Helps manage ingestion from different data sources, retrieving the raw data and creating the proper directory structure for it (see Storage Layout). Raw data might be dropped manually, or retrieved from Kaggle, Hugging Face, or some other source. This makes it easy to load the data and keep it organized.
This is the core of the data lakehouse, using dbt to transform raw data into usable data, with DuckLake as the underlying catalog, running on top of SQLite.
We purposely keep this simple with SQLite, using a backup/restore strategy to/from S3, as this assumes exploratory lab work, but you can easily replace SQLite with a PostgreSQL node, if you prefer.
Gold tier datasets under your data marts are only usable externally after you export them. This component manages exports, creating them for a specific data mart catalog and schema, listing them, or purging old versions.
Graph loading and computation on top of KùzuDB. We support operations like graph loading from S3 parquet files, and node embedding via FRP (Fast Random Projection), which is implemented using node batching with input/output from/to KùzuDB and training on top of PyTorch.
Complete ML Engineering lifecycle implementation, including feature extraction, and model training and testing, with MLflow experiment tracking and evaluation. It also provides a REST API endpoint for inference, prediction logging, and user feedback tracking, implemented using an event-driven architecture based on Kafka topics. This optionally serves models using an A/B/n testing approach. Finally, we also implement several methods for simulating inference requests and user feedback, based on a monitoring dataset, that we use to compute monitoring metrics over time, like prediction drift, feature drift, estimated performance, or user evaluation.
Includes the following modules:
- settings: loads and provides access to environment variables and other relevant constants;
- storage: handles mid-level S3 storage operations, like creating a dated directory structure, uploading and downloading files and directories, or managing the manifest files;
- cache: provides utilities to manage filesystem-based caching based on a user data directory (usually ~/.cache/datalab);
- lakehouse: connects the DuckDB engine and helps with tasks like exporting datasets, or loading the latest snapshot for an export;
- templates: contains helper functions and string.Template instances to produce files like init.sql;
- color: palette and color processing utilities, mostly used to support plotting;
- logging: interceptor logger to replace inconsistent logging utilities (e.g., from uvicorn);
- tools: provides a function per CLI tool (callable via dlctl tools), for example to generate the init.sql file described in the templates module;
- utils: provides a @timed decorator to print run time, and function name sanitization code.
Jupyter notebooks for prototyping or standalone analyses. Notebooks are dropped directly on the root path, since all data is loaded and saved to the DuckLake instance.
Individual Bash or Python scripts for generic tasks (e.g., launching KùzuDB Explorer).
Untracked directory where all your local files will live. This includes the engine database (DuckDB) and the DuckLake catalogs (e.g., stage.sqlite, marts/graphs.sqlite), which you can restore from a backup, or create from scratch. KùzuDB databases will also live here, under graphs/, as well as the init.sql script for CLI access to the lakehouse.
All data is stored in a single S3 bucket (e.g., s3://lakehouse, tested with MinIO), with directory structure:
s3://lakehouse/
├── backups/
│   └── catalog/
│       ├── YYYY_MM_DD/
│       │   └── HH_mm_SS_sss/
│       │       └── lakehouse.dump
│       └── manifest.json
├── raw/
│   └── <dataset-name>/
│       ├── YYYY_MM_DD/
│       │   └── HH_mm_SS_sss/
│       │       ├── *.csv
│       │       ├── *.json
│       │       └── *.parquet
│       └── manifest.json
├── stage/
│   └── ducklake-*.parquet
├── marts/
│   └── <domain>/
│       └── ducklake-*.parquet
└── exports/
    └── <domain>/
        └── <dataset-name>/
            ├── YYYY_MM_DD/
            │   └── HH_mm_SS_sss/
            │       ├── *.csv
            │       ├── *.json
            │       └── *.parquet
            └── manifest.json
Note
Date/time entries should always be in UTC.
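The dated directory convention above can be sketched in a few lines of Python. This is only an illustration of the `YYYY_MM_DD/HH_mm_SS_sss` layout in UTC; the function names `dated_prefix` and `raw_dataset_path` are hypothetical and are not taken from the data lab's storage module.

```python
from datetime import datetime, timedelta, timezone


def dated_prefix(moment: datetime) -> str:
    """Format an aware timestamp as YYYY_MM_DD/HH_mm_SS_sss, always in UTC."""
    # Convert to UTC first, so local offsets never leak into the path.
    # (A naive datetime would be interpreted as local time by astimezone.)
    utc = moment.astimezone(timezone.utc)
    millis = utc.microsecond // 1000
    return f"{utc:%Y_%m_%d}/{utc:%H_%M_%S}_{millis:03d}"


def raw_dataset_path(bucket: str, prefix: str, dataset: str, moment: datetime) -> str:
    """Build the S3 path of a dataset version (hypothetical helper)."""
    return f"s3://{bucket}/{prefix}/{dataset}/{dated_prefix(moment)}"


# Same instant expressed in UTC and in UTC+2: both map to the same directory.
in_utc = datetime(2025, 6, 3, 19, 56, 3, tzinfo=timezone.utc)
in_utc_plus_2 = in_utc.astimezone(timezone(timedelta(hours=2)))

print(raw_dataset_path("lakehouse", "raw", "your_dataset_name", in_utc))
print(raw_dataset_path("lakehouse", "raw", "your_dataset_name", in_utc_plus_2))
```

Both calls print the same path, which is the point of normalizing to UTC before formatting.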
Configuration for data lab is all done through the environment variables defined in .env.
This will also support the generation of an init.sql file, which contains the DuckLake configurations, including the MinIO/S3 secret and all attached catalogs.
S3_ENDPOINT=localhost:9000
S3_USE_SSL=false
S3_URL_STYLE=path
S3_ACCESS_KEY_ID=minio_username
S3_SECRET_ACCESS_KEY=minio_password
S3_REGION=eu-west-1
S3_ENDPOINT and S3_URL_STYLE are only required if you're using a non-AWS object store like MinIO.
S3_REGION must match MinIO's region (explicitly setting one in MinIO is recommended).
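To give an idea of how these variables could end up in an init.sql, here is a minimal sketch using string.Template. The template text, the secret name `lakehouse_s3`, the catalog alias `stage`, the `local/` path, and the `render_init_sql` function are all assumptions for illustration; the file generated by dlctl tools generate-init-sql may look different, and a real file would also need the relevant DuckDB extensions loaded.

```python
from string import Template

# Hypothetical templates: they follow DuckDB's CREATE SECRET and DuckLake's
# ATTACH syntax, but are not copied from the data lab's templates module.
SECRET_TEMPLATE = Template(
    "CREATE OR REPLACE SECRET lakehouse_s3 (\n"
    "    TYPE s3,\n"
    "    KEY_ID '$key_id',\n"
    "    SECRET '$secret',\n"
    "    ENDPOINT '$endpoint',\n"
    "    URL_STYLE '$url_style',\n"
    "    USE_SSL $use_ssl,\n"
    "    REGION '$region'\n"
    ");"
)

ATTACH_TEMPLATE = Template(
    "ATTACH 'ducklake:sqlite:$catalog' AS $name (DATA_PATH 's3://$bucket/$prefix');"
)


def render_init_sql(env: dict[str, str]) -> str:
    """Render a secret plus one attached catalog (values are not escaped)."""
    secret = SECRET_TEMPLATE.substitute(
        key_id=env["S3_ACCESS_KEY_ID"],
        secret=env["S3_SECRET_ACCESS_KEY"],
        endpoint=env["S3_ENDPOINT"],
        url_style=env["S3_URL_STYLE"],
        use_ssl=env["S3_USE_SSL"],
        region=env["S3_REGION"],
    )
    attach = ATTACH_TEMPLATE.substitute(
        catalog=f"local/{env['STAGE_DB']}",  # assumed location of the catalog
        name="stage",
        bucket=env["S3_BUCKET"],
        prefix=env["S3_STAGE_PREFIX"],
    )
    return f"{secret}\n{attach}\n"


# Example values taken from the .env snippets in this document.
example_env = {
    "S3_ENDPOINT": "localhost:9000",
    "S3_USE_SSL": "false",
    "S3_URL_STYLE": "path",
    "S3_ACCESS_KEY_ID": "minio_username",
    "S3_SECRET_ACCESS_KEY": "minio_password",
    "S3_REGION": "eu-west-1",
    "S3_BUCKET": "lakehouse",
    "S3_STAGE_PREFIX": "stage",
    "STAGE_DB": "stage.sqlite",
}

init_sql = render_init_sql(example_env)
print(init_sql)
```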
PSQL_ROOT_PASSWORD=datalabtech
Set this to the root user password of your PostgreSQL database. It is only used when deploying your on-premise infrastructure, so that databases and credentials can be provisioned at a later stage; otherwise, it is not accessed.
S3_BUCKET=lakehouse
S3_INGEST_PREFIX=raw
S3_STAGE_PREFIX=stage
S3_SECURE_STAGE_PREFIX=secure-stage
S3_GRAPHS_MART_PREFIX=marts/graphs
S3_ANALYTICS_MART_PREFIX=marts/analytics
S3_EXPORTS_PREFIX=exports
S3_BACKUPS_PREFIX=backups
You can use the defaults here. Everything will live under the S3_BUCKET. Each stage has its own prefix under that bucket, but the mart prefixes are special: any environment variable that ends with *_MART_PREFIX will be associated with its own *_MART_DB, as shown in the next section.
ENGINE_DB=engine.duckdb
STAGE_DB=stage.sqlite
SECURE_STAGE_DB=secure_stage.sqlite
GRAPHS_MART_DB=marts/graphs.sqlite
ANALYTICS_MART_DB=marts/analytics.sqlite
These files will live under local/. The DuckDB ENGINE_DB will be leveraged for querying. All data is tracked on the STAGE_DB and *_MART_DB catalog databases and stored on the corresponding object storage locations, as shown in the previous section. You can also use SECURE_STAGE_DB if you need to encrypt your data (e.g., for sensitive user data).
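The pairing between *_MART_PREFIX and *_MART_DB variables can be illustrated with a short sketch. This assumes the prefix variables are named S3_<NAME>_MART_PREFIX and the catalogs <NAME>_MART_DB, as in the examples above; the `pair_marts` function is hypothetical and not the data lab's settings code.

```python
def pair_marts(env: dict[str, str]) -> dict[str, dict[str, str]]:
    """Match each S3_<NAME>_MART_PREFIX with its <NAME>_MART_DB, if present."""
    marts: dict[str, dict[str, str]] = {}
    for key, prefix in env.items():
        if key.startswith("S3_") and key.endswith("_MART_PREFIX"):
            # S3_GRAPHS_MART_PREFIX -> GRAPHS
            name = key[len("S3_"):-len("_MART_PREFIX")]
            db_key = f"{name}_MART_DB"
            if db_key in env:
                marts[name.lower()] = {"prefix": prefix, "db": env[db_key]}
    return marts


# Example values taken from the .env snippets in this document.
example_env = {
    "S3_STAGE_PREFIX": "stage",
    "S3_GRAPHS_MART_PREFIX": "marts/graphs",
    "S3_ANALYTICS_MART_PREFIX": "marts/analytics",
    "STAGE_DB": "stage.sqlite",
    "GRAPHS_MART_DB": "marts/graphs.sqlite",
    "ANALYTICS_MART_DB": "marts/analytics.sqlite",
}

for mart, config in pair_marts(example_env).items():
    print(mart, config["prefix"], config["db"])
```

Adding a new data mart would then only take a new pair of variables following the same naming pattern.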
MUSIC_TASTE_GRAPH_DB=graphs/music_taste.kuzu
ECON_COMP_GRAPH_DB=graphs/econ_comp.kuzu
The data lab also leverages Kuzu for graph data science tasks. The path for each graph database can be set here as *_GRAPH_DB.
OLLAMA_MODELS=gemma3:latest,phi4:latest
Here you can preconfigure the Ollama models you want to download when running your local or on-premise infrastructure, as comma-separated model:version entries.
MLFLOW_TRACKING_URI=http://docker-shared:5000
MLFLOW_TRACKING_USERNAME=datalabtech
S3_MLFLOW_BUCKET=mlflow
S3_MLFLOW_ARTIFACTS_PREFIX=artifacts
The MLFLOW_TRACKING_* variables configure how you interact with the MLflow server, while the S3_MLFLOW_* variables configure the S3 bucket where artifacts (e.g., serialized models) will be dropped into.
KAFKA_BROKER_ENDPOINT=docker-shared:9092
KAFKA_GROUP_TOPIC_LIST=ml_inference_results:lakehouse-inference-result-consumer,ml_inference_feedback:lakehouse-inference-feedback-consumer
You can configure your Kafka endpoint here, as well as any required topics. We initialize each topic via a comma-separated list of topic:group entries, so that consumers can be initialized and no warning is printed when first connecting to a topic from that consumer. This is likely overkill, but it feels cleaner.
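As an illustration of the topic:group format, the value of KAFKA_GROUP_TOPIC_LIST could be parsed as follows. The `parse_topic_groups` function is a hypothetical helper written for this example, not the data lab's own parser.

```python
def parse_topic_groups(value: str) -> list[tuple[str, str]]:
    """Split a comma-separated list of topic:group entries into pairs."""
    pairs: list[tuple[str, str]] = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate trailing commas and empty values
        topic, sep, group = entry.partition(":")
        if not sep or not topic or not group:
            raise ValueError(f"expected topic:group, got {entry!r}")
        pairs.append((topic, group))
    return pairs


# Example value taken from the .env snippet above.
example_value = (
    "ml_inference_results:lakehouse-inference-result-consumer,"
    "ml_inference_feedback:lakehouse-inference-feedback-consumer"
)

for topic, group in parse_topic_groups(example_value):
    print(f"topic={topic} group={group}")
```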
You can generate an init.sql once you set up your .env, so you can access your DuckLake from the CLI using duckdb:
dlctl tools generate-init-sql
duckdb -init local/init.sql local/engine.duckdb
As a rule of thumb, ingestion will be done via the dlctl ingest command. If a version for the current timestamp already exists, it will output an error and do nothing. Versions are timestamped down to the millisecond, so simply retry a moment later.
For manually uploaded datasets, you can create a directory in S3 by giving it the dataset name:
dlctl ingest dataset --manual "Your Dataset Name"
This will create a directory like s3://lakehouse/raw/your_dataset_name/2025_06_03/19_56_03_000, update s3://lakehouse/raw/your_dataset_name/manifest.json to point to it, and print the path to stdout.
dlctl ingest dataset \
"https://www.kaggle.com/datasets/<username>/<dataset>"
dlctl ingest dataset \
"https://huggingface.co/datasets/<username>/<dataset>"
The dataset name will be automatically extracted from the <dataset> slug and transformed into snake case for storage. Then, a directory like s3://lakehouse/raw/your_dataset_name/2025_06_03/19_56_03_000 will be created, s3://lakehouse/raw/your_dataset_name/manifest.json updated to point to it, and the final path printed to stdout.
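The slug-to-snake-case step can be approximated as follows. This is a sketch under the assumption that any run of non-alphanumeric characters becomes a single underscore; the functions `to_snake_case` and `dataset_name_from_url`, and the example URL, are hypothetical, and the actual rules used by dlctl ingest may differ.

```python
import re


def to_snake_case(name: str) -> str:
    """Lowercase a name and collapse non-alphanumeric runs into underscores."""
    return re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")


def dataset_name_from_url(url: str) -> str:
    """Take the last path segment of a dataset URL as the slug."""
    slug = url.rstrip("/").rsplit("/", 1)[-1]
    return to_snake_case(slug)


print(to_snake_case("Your Dataset Name"))
print(dataset_name_from_url("https://www.kaggle.com/datasets/some-user/some-dataset"))
```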
You can also list existing dataset paths for the most recent version, to be used for transformation:
dlctl ingest ls
Or all of them:
dlctl ingest ls -a
Sometimes you'll manually create a dataset and never upload data into the directory, or an ingestion process from a URL will fail and leave an empty directory behind. You can prune those directories using:
dlctl ingest prune
Transformations can be done via dlctl transform, which will call dbt with the appropriate arguments:
dlctl transform "<dataset-name>"
You can also run data tests using:
dlctl test
dlctl test -m test_type:singular
Or generate or serve dbt documentation using:
dlctl docs generate
dlctl docs serve
In order to externally use a dataset from the Lakehouse, you first need to export it. This can be done for any data mart catalog, over a selected schema. Exported datasets will be kept in dated directories with their own manifest.json.
dlctl export dataset "<data-mart-catalog>" "<schema>"
You can list the most recent versions of exported datasets:
dlctl export ls
Or all of them:
dlctl export ls -a
After a few exports, you might want to remove old versions to reclaim space. You can prune those directories using:
dlctl export prune
Since we rely on embedded databases and S3 object storage, we need to back up our databases.
Important
Data Lab was designed to be used in an education or research environment, so it currently doesn't support concurrent users. This could easily be added, though, as DuckLake supports PostgreSQL catalogs in place of SQLite, which we are using here.
You can create a backup by running:
dlctl backup create
In order to restore a backup, just run:
dlctl backup restore --source "<YYYY-mm-ddTHH:MM:SS.sss>"
Omitting --source will restore the latest backup.
Caution
Omitting --target will restore to local/ by default, so take care not to overwrite your working version by mistake!
You can list all backups using:
dlctl backup ls
And you can list all files in all backups using:
dlctl backup ls -a
This will load nodes and edges into a KùzuDB database stored under local/graphs/<schema>, where schema is a schema containing nodes and edges under the graphs data mart catalog. Table names for nodes or edges are usually prefixed with <dataset>_nodes_ or <dataset>_edges_, respectively, and should follow the format described on KùzuDB's docs.
dlctl graph load "<schema>"
A collection of graph computation calls will live here. These can be wrappers to native KùzuDB computations, or external computations. Currently, we just include the embeddings computation, which runs in Python using PyTorch. This will compute FRP embeddings with dimension 256, over batches of 9216 nodes, trained using 5 epochs, for the <schema> graph:
dlctl graph compute embeddings "<schema>" -d 256 -b 9216 -e 5
Train and evaluate a model using the dataset table under the provided <schema>, while tracking the experiment using MLflow:
dlctl ml train <schema> --method logreg --features embeddings
The --method can be one of the supported algorithms (e.g., logreg or xgboost), and --features follows a similar approach for supported features (e.g., tfidf or embeddings). Currently only text-based datasets are supported, but the schema and training code have been generalized to support tabular data as well.
A REST API endpoint can be run to provide an inference service with optional A/B/n testing and event-based logging, or to receive user feedback on the predictions:
dlctl ml server
dlctl ml server -h 0.0.0.0 -p 8000
To help us implement and test monitoring statistics, we implemented a request simulation framework, where feedback is provided based on a monitoring dataset, which is completely separate from the dataset used for training, validation and testing. For example, to use a 1% sample of the monitor table from <schema> for A/B testing with the latest dd_xgboost_embeddings and dd_logreg_tfidf models, we can use:
dlctl ml simulate <schema> \
--sample-fraction 0.01 \
--model-uri "models:/dd_xgboost_embeddings/latest" \
--model-uri "models:/dd_logreg_tfidf/latest"
The models:/ URIs correspond to models trained and logged within MLflow. For production, we usually replace latest with a particular tag that we assign to our production models (e.g., a version).
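To make the URI format and the A/B/n idea more concrete, here is a small sketch that splits a `models:/<name>/<version>` URI and assigns each request to one of the candidate models. Both functions are hypothetical: `assign_model` uses a hash of a request identifier so that assignments are stable, which is only one possible strategy and is not necessarily how the data lab's ML server routes requests.

```python
import hashlib


def parse_model_uri(uri: str) -> tuple[str, str]:
    """Split 'models:/<name>/<version>' into (name, version)."""
    prefix = "models:/"
    if not uri.startswith(prefix):
        raise ValueError(f"not a models:/ URI: {uri!r}")
    name, sep, version = uri[len(prefix):].partition("/")
    if not sep or not name or not version:
        raise ValueError(f"expected models:/<name>/<version>, got {uri!r}")
    return name, version


def assign_model(request_id: str, model_uris: list[str]) -> str:
    """Pick a model for a request, deterministically and roughly uniformly."""
    # Assumption: hashing the request ID keeps a request on the same variant.
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(model_uris)
    return model_uris[index]


candidates = [
    "models:/dd_xgboost_embeddings/latest",
    "models:/dd_logreg_tfidf/latest",
]

for request_id in ("request-1", "request-2", "request-3"):
    name, version = parse_model_uri(assign_model(request_id, candidates))
    print(request_id, name, version)
```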
There are several other options to help you control the simulation as well, which you can check under:
dlctl ml simulate --help
These include the number of passes, the batch size, the decision threshold, and several ranges to help control the fraction of feedback to provide, the fraction of wrong feedback, or the date range to simulate.
This will let you compute and plot monitor statistics over time for a specific <schema>, optionally specifying a date range and a window size:
dlctl ml monitor compute <schema>
dlctl ml monitor compute <schema> \
--since <start> \
--until <end> \
--window-size 7
For plotting, you must also specify one or several model URIs:
dlctl ml monitor plot <schema> \
--model-uri "models:/dd_xgboost_embeddings/latest" \
--model-uri "models:/dd_logreg_tfidf/latest"
This will produce several PNG plots under local/monitor/.
We provide several just commands, both for convenience and to keep track of data pipelines (e.g., ETL) for specific datasets. Below we provide an overview of these commands, excluding most secondary commands.
We provide a check command that looks for a specific binary in the path and checks whether it's executable. The command fails otherwise, causing any dependent commands to fail as well. We implement specific check commands per binary, since we cannot use parameters in dependencies. For example:
just check duckdb
just check-terraform
We also provide a confirm command, to add as a dependency of critical commands (e.g., terraform destroy). This will display a confirmation message and require user input to continue:
just confirm
Are you sure? [y/N] n
error: Recipe `confirm` failed with exit code 1
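For readers less familiar with just, the behavior of the check and confirm recipes can be approximated in Python as below. This is only an analogy: the recipes themselves are implemented in the justfile, and the `check_binary` and `confirm` functions are hypothetical names for this sketch.

```python
import shutil


def check_binary(name: str) -> str:
    """Return the full path of an executable found in PATH, or fail."""
    # shutil.which only returns entries that exist and are executable.
    path = shutil.which(name)
    if path is None:
        raise FileNotFoundError(f"{name} not found in PATH or not executable")
    return path


def confirm(answer: str) -> bool:
    """Interpret a reply to 'Are you sure? [y/N]', defaulting to no."""
    return answer.strip().lower() in {"y", "yes"}


# An empty reply or 'n' aborts, mirroring the failed recipe shown above.
for reply in ("y", "n", ""):
    print(repr(reply), "->", "continue" if confirm(reply) else "abort")
```

In an interactive script, the reply would come from `input("Are you sure? [y/N] ")`, and a failed check or a negative reply would stop any dependent step from running.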
Related video: https://youtu.be/zn69Q7FiFfo?si=tiG4DT_apbR_-sVC
In order to run a REPL for the datalab's DuckLake instance, you can simply run:
just lakehouse
This will take care of the init.sql generation for you, but you might want to regenerate it later as well:
just generate-init-sql
Related video: https://youtu.be/m61u3mqu1qY?si=kmjmPHTY5-8M8Q81
| Command | Description |
|---|---|
| graphrag-etl | Ingest DSN and MSDSL datasets, run DuckLake transformations, export to Parquet, and load graph into Kuzu. |
| graphrag-embeddings | Compute node embeddings of dimension 256 using 5 epochs and batches of size 9216, and create vector index. |
| graphrag | Launch REPL for graph RAG. |
| graphrag-all | Run all of the above, in order. |
Related video: https://youtu.be/pIwN7oe54i4?si=-nB0upswBGacklh4
| Command | Description |
|---|---|
| econ-compnet-ingest | Ingest The Atlas of Economic Complexity. |
| econ-compnet-transform | Run DuckLake transformations on the dataset, to produce a knowledge graph. |
| econ-compnet-export | Export the graph data to Parquet. |
| econ-compnet-load | Load the graph into Kuzu. |
| econ-compnet-etl | Run all of the above, in order. |
| econ-compnet-scoring | Compute the Common Out-Neighbor (CON) score for the Country-CompetesWith-Country graph projection. |
| econ-compnet-all | Run ETL and scoring commands. |
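As a rough intuition for the Common Out-Neighbor (CON) score mentioned in the table above, the sketch below counts, for each pair of source nodes, how many out-neighbors they share. This assumes the score is the raw count of shared out-neighbors; the version computed by econ-compnet-scoring may be normalized or weighted differently, and the tiny country/product graph and the `common_out_neighbors` function are invented for illustration.

```python
from itertools import combinations


def common_out_neighbors(edges: list[tuple[str, str]]) -> dict[tuple[str, str], int]:
    """Count shared out-neighbors for every pair of source nodes."""
    out: dict[str, set[str]] = {}
    for src, dst in edges:
        out.setdefault(src, set()).add(dst)

    scores: dict[tuple[str, str], int] = {}
    for a, b in combinations(sorted(out), 2):
        shared = len(out[a] & out[b])
        if shared:  # only keep pairs that actually overlap
            scores[(a, b)] = shared
    return scores


# Toy directed graph: country -> exported product (made-up data).
toy_edges = [
    ("PT", "wine"), ("PT", "cork"),
    ("ES", "wine"), ("ES", "cork"), ("ES", "olive_oil"),
    ("FR", "wine"),
    ("JP", "cars"),
]

for pair, score in sorted(common_out_neighbors(toy_edges).items()):
    print(pair, score)
```

Pairs with a higher count share more targets, which is the intuition behind treating them as competitors in a projection like Country-CompetesWith-Country.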
Related video: https://youtu.be/MGuj13NcdjE?si=i56T6updcLE-NFC3
| Command | Description |
|---|---|
| mlops-ingest | Ingest the depression dataset for training and monitoring. |
| mlops-transform | Run DuckLake transformations on the datasets, normalizing into a common format for the ML pipelines, including a train/test split and fixed folds on the training set for validation. |
| mlops-etl | Run all of the above, in order. |
| mlops-train-logreg-tfidf | Train a model using logistic regression and TF-IDF features. |
| mlops-train-logreg-embeddings | Train a model using logistic regression and text embedding features. |
| mlops-train-logreg | Train all logistic regression models. |
| mlops-train-xgboost-tfidf | Train a model using XGBoost and TF-IDF features. |
| mlops-train-xgboost-embeddings | Train a model using XGBoost and text embedding features. |
| mlops-train-xgboost | Train all XGBoost models. |
| mlops-train | Train all models. |
| mlops-all | Run ETL and training. |
| Command | Description |
|---|---|
| mlops-serve | Run ML server listening on 0.0.0.0 and port 8000. |
| mlops-test-inference | Use curl to test the inference endpoint. |
| mlops-test-feedback | Use curl to test the feedback endpoint. |
| Command | Description |
|---|---|
| mlops-simulate-inference | Run inference simulation for XGBoost with text embedding features, and logistic regression with TF-IDF features, using the monitor set to produce feedback. |
| mlops-monitor-compute | Compute monitoring statistics for the two models. |
| mlops-monitor-plot | Plot monitoring statistics for the two models. |
Related videos: https://www.youtube.com/playlist?list=PLeKtvIdgbljMyhjPgJeoXwa_7J9DTx3Fo
| Command | Description |
|---|---|
| infra-config-check-foundation | Look for terraform.tfvars under infra/foundation. |
| infra-config-check-platform | Look for terraform.tfvars and state.config under infra/platform. |
| infra-config-check-services | Look for the docker-shared context, which should point to the corresponding Docker VM. |
| infra-config-check-all | Run all of the above, in order. |
| Command | Description |
|---|---|
| infra-foundation-init | Run terraform init for infra/foundation. |
| infra-platform-init | Run terraform init for infra/platform. |
| infra-init | Run all of the above, in order. |
| Command | Description |
|---|---|
| infra-provision-foundation | Run terraform apply for infra/foundation. |
| infra-provision-platform | Run terraform apply for infra/platform. |
| infra-provision-services | Run terraform apply for infra/services/gitlab (requires a configured .env), and docker compose up under the appropriate docker-shared context, using infra/services/docker/compose.yml. |
| infra-provision-all | Run all of the above, in order. |
| infra-provision-local | Run docker compose up with the dev profile enabled, using infra/services/docker/compose.yml. |
| Command | Description |
|---|---|
| infra-destroy-foundation | Run terraform destroy for infra/foundation. |
| infra-destroy-platform | Run terraform destroy for infra/platform. |
| infra-destroy-services | Run docker compose down and terraform destroy for infra/services. |
| infra-destroy-all | Run all of the above, in reverse order. |
| infra-destroy-local | Run docker compose down with the dev profile enabled for infra/services. |
| Command | Description |
|---|---|
| infra-show-tf-credentials <layer> | Print the credentials for a specific layer (foundation or platform). |
| infra-show-credentials | Print all credentials. |

