Add mount dataset recipe by ErykKul · Pull Request #35 · gdcc/dataverse-recipes

ErykKul · 2026-05-25T20:41:15Z

What this PR adds

A new shell recipe under shell/mount_dataset/ that mounts any Dataverse dataset as a real read-only filesystem, with the dataset's folder structure and human-readable filenames preserved. Files are fetched on demand, so there's no upfront download and no disk space needed for the whole dataset. Optionally, the same recipe can publish the mount as a personal Globus endpoint, so you can transfer the dataset (preserved names and all) to any other Globus endpoint — e.g. your HPC cluster's scratch storage.

The recipe is meant for end-user / researcher use: clone the recipe, run ./mount.sh, browse the dataset as a local folder. No operator involvement, no Dataverse-side configuration, no paid Globus subscription, no S3 connector / Globus Connect Server / S3 keys.

Works with any Dataverse storage backend. The recipe only uses the standard Native API (/api/datasets/.../versions/... and /api/access/datafile/{id}), so it works the same on instances backed by local filesystem, S3, Swift, or anything else Dataverse supports. If the instance is on S3 with direct-download enabled, the backend automatically picks up the presigned-URL redirect and streams bytes straight from S3 (cutting Dataverse out of the data path); otherwise it streams through Dataverse's access endpoint with HTTP Range requests. Either way, only the bytes you actually read are fetched. Both code paths are unit-tested in the rclone backend and verified end-to-end against real datasets (see Testing below).
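
As a rough illustration of "only the bytes you actually read are fetched", the sketch below builds the access URL and a Range header for a partial read. The helper names `dv_access_url` and `range_header` are hypothetical and are not part of the recipe; only the `/api/access/datafile/{id}` path comes from the description above.

```shell
# Hypothetical helpers (not from the recipe) showing a ranged read
# against the access endpoint named in this PR.

dv_access_url() {
  # $1 = Dataverse base URL, $2 = numeric datafile id
  # Strips one trailing slash from the base URL before joining.
  printf '%s/api/access/datafile/%s\n' "${1%/}" "$2"
}

range_header() {
  # $1 = offset of the first byte, $2 = number of bytes (>= 1)
  # HTTP byte ranges are inclusive, hence the "- 1".
  printf 'Range: bytes=%d-%d\n' "$1" "$(( $1 + $2 - 1 ))"
}

# Example request for a public file as a guest (needs network, so it is
# left commented out; the file id 12345 is a placeholder):
# curl -sSL -H "$(range_header 0 128)" \
#   "$(dv_access_url https://demo.dataverse.org 12345)" | wc -c
```

With `-L`, curl follows a presigned-URL redirect when the instance issues one; otherwise the bytes come straight from the access endpoint.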

Replaces the earlier s3fs-based draft of this PR

The first iteration of this branch used s3fs to mount the Dataverse instance's S3 bucket and built a friendly symlink tree on top. That works, but it comes with trade-offs that matter for a recipe aimed at end users:

| | s3fs approach (earlier draft) | rclone+Dataverse approach (this draft) |
|---|---|---|
| Credentials | Operator's raw S3 access/secret keys | Per-user Dataverse API token (optional; guest works) |
| Access control | Whoever has the S3 keys can read every bucket object | Dataverse's normal per-user permissions apply |
| Filenames | Raw S3 object keys, fixed up via symlink-builder | Dataverse's `directoryLabel + label` (the author's intent), exposed directly |
| Granularity | One mount per Dataverse install | One mount per dataset, scoped to the chosen version |
| Storage assumed | S3 only | Any Dataverse storage driver (file, S3, Swift, …) |
| Tabular ingest | N/A (operator decides what's in the bucket) | Configurable: surface the original CSV/SPSS/Stata or the archival `.tab` |
| Globus on top | Out of scope | Optional `./mount-globus.sh` brings up a personal endpoint with the dataset at `/mnt/dataset` |
| Operator/admin involvement | Needed (S3 keys, S3 endpoint, bucket policy) | None |

The s3fs path is still the right answer for the operator mounting the whole Dataverse install for backup/inspection. This recipe targets the user who wants one dataset, with the file names the dataset author chose, and possibly a fast cross-institution transfer.

How it works

The Docker image is a single-stage debian:bookworm-slim runtime that downloads a pre-built rclone binary (with the Dataverse backend included) from the fork's release page at https://github.com/ErykKul/rclone/releases/tag/dataverse-backend-latest, installs fuse3, tini, ca-certificates, and — when built with INCLUDE_GLOBUS=1 (which mount-globus.sh does) — Globus Connect Personal plus the Python runtime it needs. The download approach keeps first-time builds in the ~30-second range instead of the multi-minute Go compile the earlier draft did. TARGETARCH picks linux-amd64 vs linux-arm64 automatically so the same Dockerfile works on x86 Linux and Apple Silicon Macs via Docker Desktop. To test backend changes from a different fork, point RCLONE_BINARY_URL at a different release asset, or -v mount a locally-built binary over /usr/local/bin/rclone at run time.
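
The architecture selection described above could look roughly like the sketch below. The function name `rclone_platform` is hypothetical, and how the Dockerfile actually maps TARGETARCH to a release asset is not shown in this PR text, so treat this as an assumption about the shape of the logic rather than a copy of it.

```shell
# Hypothetical sketch: map Docker's TARGETARCH to the platform strings
# mentioned in the description (linux-amd64 / linux-arm64).
rclone_platform() {
  # $1 = value of TARGETARCH as set by docker build
  case "$1" in
    amd64) echo "linux-amd64" ;;
    arm64) echo "linux-arm64" ;;
    *) echo "unsupported TARGETARCH: $1" >&2; return 1 ;;
  esac
}

# Overriding the binary source, per the text above. The image tag
# "dataverse-mount:dev" and both paths are placeholders:
# docker build --build-arg RCLONE_BINARY_URL="https://example.org/rclone" \
#   -t dataverse-mount:dev .
# docker run -v "$PWD/rclone:/usr/local/bin/rclone:ro" dataverse-mount:dev
```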

The Dataverse backend itself is read-only. It talks to /api/datasets/.../versions/... for listings and to /api/access/datafile/{id} for bytes. On the first read of each file it issues a Range: bytes=0-0 probe to detect the mode: a 3xx response with a Location: header means S3-direct (it follows the presigned URL and caches it until expiry); a 200 or 206 means proxy mode (it streams through the Dataverse endpoint, sending the API token on every read). Access URLs are cached per (file, format), concurrent first touches are deduplicated via singleflight, and mid-stream failures trigger a transparent in-place resume with Range: bytes=<bytes-read>-<end>. Tabular-ingest files surface under the user's original upload name with a verifiable MD5 by default (ingest_format=original), or in Dataverse's archival .tab form on demand.
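
The mode probe and the resume logic can be sketched in shell as below. This is an illustration of the decision rules stated above, not the backend's Go code. The names `classify_probe` and `resume_range` are hypothetical, and the explicit start offset in `resume_range` is an added generalization (with a start of 0 it reduces to the `bytes=<bytes-read>-<end>` form in the text).

```shell
# Hypothetical sketch of the probe decision: redirect with Location
# means S3-direct, 200/206 means proxy mode.
classify_probe() {
  # $1 = HTTP status code, $2 = Location header value (may be empty)
  case "$1" in
    301|302|303|307|308)
      if [ -n "${2:-}" ]; then echo "s3-direct"; else echo "unknown"; fi ;;
    200|206) echo "proxy" ;;
    *) echo "unknown" ;;
  esac
}

# Range header for resuming a read that failed mid-stream.
resume_range() {
  # $1 = first byte of the original request, $2 = last byte (inclusive),
  # $3 = number of bytes already delivered before the failure
  printf 'Range: bytes=%d-%d\n' "$(( $1 + $3 ))" "$2"
}

# Probing by hand with curl (needs network; $url is a placeholder):
# curl -s -o /dev/null -w '%{http_code} %{redirect_url}\n' \
#   -H 'Range: bytes=0-0' "$url"
```

For example, `resume_range 0 1048575 4096` yields a header asking for the remainder of a 1 MiB file after the first 4096 bytes arrived.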

The rclone fork is at ErykKul/rclone, branch dataverse-backend. The backend is being proposed upstream to rclone/rclone (PR #9467); once it merges, the Dockerfile's RCLONE_BINARY_URL build arg will point at an upstream release asset and the fork becomes unnecessary.

Files

shell/mount_dataset/
├── README.md              # recipe doc — lay-terms explainer, quickstart, platform notes
├── Dockerfile             # debian-slim runtime + pre-built rclone binary (optional Globus Connect Personal)
├── entrypoint.sh          # mount / mount-globus / globus-setup / status / shell modes
├── lib.sh                 # cross-platform helpers (detect_platform, abspath,
│                          #   is_mountpoint, fuse_unmount, require_docker)
├── mount.sh               # ./mount.sh — interactive prompt, foreground rclone mount
├── mount-globus.sh        # ./mount-globus.sh — same + Globus endpoint (handles first-run setup)
├── unmount.sh             # ./unmount.sh — stop container + clear stale FUSE mount
├── reset.sh               # ./reset.sh — wipe .env + ./data + ./globus-state for a fresh start
├── sample.env             # editable .env template
└── .gitignore             # /data, /globus-state (per-user runtime state)

Quickstart

Sparse-checkout pulls just this recipe directory, not the whole repo:

git clone --depth 1 --filter=blob:none --sparse \
  https://github.com/gdcc/dataverse-recipes.git
cd dataverse-recipes
git sparse-checkout add shell/mount_dataset
cd shell/mount_dataset
./mount.sh

mount.sh prompts for the Dataverse URL, dataset DOI, and optional API token (blank for guest access on public datasets); writes .env; builds the image on first run; mounts the dataset on ./data in the foreground; Ctrl-C unmounts.
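
Because the API token is optional, any script that runs under `set -u` has to tolerate an unset DV_TOKEN. A minimal sketch of that guard follows. The function name `auth_header` is hypothetical and this is not the recipe's actual code; DV_TOKEN and the X-Dataverse-Key header are the names used in this PR.

```shell
# Hypothetical sketch: emit the auth header only when a token is set.
# "${DV_TOKEN:-}" expands to an empty string instead of aborting when
# the variable is unset and the script runs under `set -u`.
auth_header() {
  if [ -n "${DV_TOKEN:-}" ]; then
    printf 'X-Dataverse-Key: %s\n' "$DV_TOKEN"
  fi
}
```

With no token the function prints nothing, so a guest request simply goes out without the header.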

For Globus mode: ./mount-globus.sh — first run walks you through a one-time browser device-code login.

Platforms

| Platform | Mount mode | Globus mode |
|---|---|---|
| Linux | ✅ full host visibility via bind-mount | ✅ full |
| WSL2 (Windows) | ✅ same as Linux (clone inside WSL for speed) | ✅ full |
| macOS | ⚠️ visible inside container only (see README) | ✅ full |

macOS limitation: Docker Desktop's VM boundary doesn't propagate FUSE mounts back to the host filesystem. The mount works inside the container (docker exec ... ls /mnt/dataset), and the Globus mode works fully (Globus is TCP, not bind-mount-dependent) — so the "transfer to my HPC" workflow is unaffected. Three workarounds for plain file browsing: (a) docker exec into the container, (b) use the Globus mode with a Globus Connect Personal running natively on the Mac as the receiving end, or (c) install rclone natively (with macFUSE or rclone's nfsmount) and skip Docker entirely — the README has the full walkthrough.

Testing

End-to-end verified on Linux against:

Public Dataverse, guest access (demo.dataverse.org, no API token): list + range read + full copy + rclone check (MD5s match) against a public dataset. Exercises the S3-direct path.
Local Dataverse + MinIO dev stack, S3-direct mode (download-redirect=true): mount, MD5s match Dataverse-reported hashes, range reads correct at 7 distinct offsets including the last byte (off=1048575, cnt=1), head -c 128 / tail -c 128 byte-identical, full-file cmp byte-identical, clean unmount. Confirms the presigned-URL follow + range-forward path.
Local Dataverse + file storage driver (localfs1), proxy mode: collection configured with the file-based storage driver, dataset files served via /api/access/datafile/{id} returning 206 Partial Content (no Location: redirect). Mount worked, files and subdirectories appeared correctly, all MD5s matched. Confirms the proxy-mode path with X-Dataverse-Key auth on every read.
Globus mode: real endpoint registration via ./mount-globus.sh's first-run device-code flow; endpoint came online in the Globus web app, dataset files visible under /mnt/dataset/.
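
A range-read check like the one described for the MinIO stack can be reproduced against any mounted file and a reference copy. The sketch below is one possible way to do it, not the script used for the tests above. The helper names `read_range` and `same_range` are hypothetical.

```shell
# Hypothetical helpers for spot-checking byte ranges on a mounted file.

read_range() {
  # $1 = file, $2 = byte offset, $3 = byte count
  # bs=1 keeps the arithmetic obvious; it is slow for large counts.
  dd if="$1" bs=1 skip="$2" count="$3" 2>/dev/null
}

same_range() {
  # $1, $2 = files to compare, $3 = byte offset, $4 = byte count
  # Succeeds when the checksum and length of both ranges agree.
  [ "$(read_range "$1" "$3" "$4" | cksum)" = "$(read_range "$2" "$3" "$4" | cksum)" ]
}

# Example: compare the last byte of a 1 MiB file on the mount with a
# reference copy (both paths are placeholders):
# same_range ./data/file.bin /tmp/reference.bin 1048575 1 && echo match
```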

The rclone Dataverse backend itself has unit-test coverage for both modes including TestOpenResumesOnMidStreamFailure (URL expires mid-transfer → backend re-fetches + re-issues with Range: bytes=N-), TestOpenPresignedRetryOn403, TestFetchAccessURLRedirect, TestFetchAccessURLProxyMode, and TestParsePresignExpiryFallback.

The macOS path has not been tested on macOS hardware; the cross-platform lib.sh helpers (abspath, is_mountpoint, fuse_unmount) are shell-only and target POSIX + BSD tools, but real-machine confirmation would be welcome.

Known limitations / follow-ups

rclone fork dependency. The Dockerfile pulls the rclone Dataverse backend from ErykKul/rclone@dataverse-backend. The backend has been proposed upstream (rclone/rclone PR #9467); once that merges, the defaults switch to upstream rclone.
Dataset version is frozen at mount time. New versions published by the dataset author won't appear until a restart — by design, so in-progress transfers see a stable file list.
Restricted-file listings. If your token can list but not read a restricted file, the listing includes it and the read fails with 401/403. Such files could be filtered out client-side later.
GHCR prebuilt images. The recipe currently builds the Docker image locally on first run. The parallel ErykKul/dataverse-mount repo already publishes the same image to GHCR if a pre-built one is preferred (IMAGE_TAG=ghcr.io/erykkul/dataverse-mount:latest ./mount.sh); a future iteration may move that publishing into the recipe itself.

Why this might be interesting beyond just a recipe

The mainline "Dataverse + managed Globus" story requires the institution to stand up a Globus Connect Server, register endpoints, manage identity mapping, and usually pay for a Globus subscription, which is realistic only with dedicated data-engineering staff. This recipe gets researchers a per-dataset Globus endpoint with zero institutional involvement, on the free Globus tier. We expect the biggest use case to be "researcher needs to move a 500 GB dataset to HPC scratch to run analyses", currently a multi-day manual chore for many users.

ErykKul · 2026-05-27T14:29:47Z

I had a good idea for extending this with a personal Globus endpoint on top of the mount. I already have it working, so I will update this PR soon (tomorrow at the latest; it turned out to be very easy to do).

ErykKul · 2026-05-28T13:27:29Z

Ready for review.

ErykKul added 3 commits May 25, 2026 21:55

add shell/mount_dataset recipe: mount a dataset as a local filesystem via Docker + s3fs

e4e7970

mount_dataset: add native s3fs path for macOS (Docker path stays for Linux/WSL2)

ab08085

special char in readme replaced with dash

01daf40

pdurbin added this to IQSS Dataverse Project May 26, 2026

pdurbin moved this to Ready for Triage in IQSS Dataverse Project May 26, 2026

ErykKul added 3 commits May 27, 2026 17:28

Rework mount_dataset: rclone+FUSE Docker image with optional Globus Connect Personal endpoint (replaces s3fs-based approach)

b50face

README: clarify mount_dataset entry (network drive / Globus endpoint)

49a2d73

mount_dataset: rename reset-globus.sh to reset.sh, also wipe .env and ./data

1b8ef68

ErykKul added 3 commits May 28, 2026 15:41

mount_dataset: clarify backend works with any Dataverse storage, not just S3

d255f45

mount_dataset: mark DV_TOKEN optional in entrypoint and sample.env, drop legacy GCP setup-key block

3a40496

mount_dataset: document native-rclone install path as Docker-free alternative

9959780

This was referenced May 28, 2026

doi: add Dataverse direct mode with token, versions, ingest and tree rclone/rclone#9467

Open

Would the standalone client have potential to become a cloud storage client? libis/rdm-integration#4

Closed

ErykKul added 6 commits May 28, 2026 18:44

mount_dataset: download pre-built rclone binary instead of building from source (faster Docker builds, simpler native install)

b517202

mount_dataset: README — note Intel-Mac binary not pre-built (build from source)

8fdca62

mount_dataset: default IMAGE_TAG to GHCR (ghcr.io/erykkul/dataverse-mount); fallback to local build

9df265f

mount_dataset: switch to the rclone doi backend

1441183

mount_dataset: guard DV_TOKEN under set -u so guest mounts work without a token

cf5d399

mount_dataset: supervise rclone+GCP in mount-globus, env-ize DIR_CACHE_TIME (1000h), tini -g

1bafb80

pdurbin moved this from Ready for Triage to Ready for Review ⏩ in IQSS Dataverse Project Jun 2, 2026

ErykKul wants to merge 15 commits into gdcc:main from ErykKul:add-mount-dataset-recipe

2 participants