Add mount dataset recipe #35

Open
ErykKul wants to merge 15 commits into gdcc:main from ErykKul:add-mount-dataset-recipe

ErykKul commented May 25, 2026

What this PR adds

A new shell recipe under shell/mount_dataset/ that mounts any Dataverse dataset as a real read-only filesystem, with the dataset's folder structure and human-readable filenames preserved. Files are fetched on demand, so there's no upfront download and no disk space needed for the whole dataset. Optionally, the same recipe can publish the mount as a personal Globus endpoint, so you can transfer the dataset (preserved names and all) to any other Globus endpoint — e.g. your HPC cluster's scratch storage.

The recipe is meant for end-user / researcher use: clone the recipe, run ./mount.sh, browse the dataset as a local folder. No operator involvement, no Dataverse-side configuration, no paid Globus subscription, no S3 connector / Globus Connect Server / S3 keys.

Works with any Dataverse storage backend. The recipe only uses the standard Native API (/api/datasets/.../versions/... and /api/access/datafile/{id}), so it works the same on instances backed by local filesystem, S3, Swift, or anything else Dataverse supports. If the instance is on S3 with direct-download enabled, the backend automatically picks up the presigned-URL redirect and streams bytes straight from S3 (cutting Dataverse out of the data path); otherwise it streams through Dataverse's access endpoint with HTTP Range requests. Either way, only the bytes you actually read are fetched. Both code paths are unit-tested in the rclone backend and verified end-to-end against real datasets (see Testing below).
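
As a rough illustration of "only the bytes you actually read are fetched", the sketch below composes (but does not run) a curl command for a single byte range of one file. It is not the backend's code: the function name `range_read_cmd` and its argument order are made up for this example, and the file id and token in the usage lines are placeholders. It assumes the access endpoint honours Range requests as described above, and it uses the `X-Dataverse-Key` header mentioned under Testing.

```shell
# Compose a curl command that reads bytes first..last of one datafile.
# $1 = Dataverse base URL, $2 = numeric file id, $3 = first byte,
# $4 = last byte, $5 = API token (optional; omit for guest access).
# The function body runs in a subshell, so its variables stay local.
range_read_cmd() (
  base=${1%/}   # drop one trailing slash, if any
  id=$2
  first=$3
  last=$4
  token=${5:-}
  # -r sends "Range: bytes=first-last"; -L follows a redirect if one is sent.
  cmd="curl -sS -L -r ${first}-${last}"
  if [ -n "$token" ]; then
    cmd="$cmd -H 'X-Dataverse-Key: ${token}'"
  fi
  printf "%s '%s'\n" "$cmd" "${base}/api/access/datafile/${id}"
)

# Usage (file id 42 and the token are placeholders):
#   range_read_cmd https://demo.dataverse.org 42 0 1023
#   range_read_cmd https://demo.dataverse.org 42 0 1023 "$API_TOKEN"
```

One caveat for anyone running the printed command with a token: curl keeps sending headers set with `-H` after following a redirect, so a real client would likely handle the redirect itself rather than forward the API token to the storage host.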

Replaces the earlier s3fs-based draft of this PR

The first iteration of this branch used s3fs to mount the Dataverse instance's S3 bucket and built a friendly symlink tree on top. That works but trades off in ways that matter for a recipe aimed at end users:

| | s3fs approach (earlier draft) | rclone+Dataverse approach (this draft) |
|---|---|---|
| Credentials | Operator's raw S3 access/secret keys | Per-user Dataverse API token (optional; guest works) |
| Access control | Whoever has the S3 keys can read every bucket object | Dataverse's normal per-user permissions apply |
| Filenames | Raw S3 object keys, fixed up via symlink-builder | Dataverse's directoryLabel + label (the author's intent), exposed directly |
| Granularity | One mount per Dataverse install | One mount per dataset, scoped to the chosen version |
| Storage assumed | S3 only | Any Dataverse storage driver (file, S3, Swift, …) |
| Tabular ingest | N/A (operator decides what's in the bucket) | Configurable: surface the original CSV/SPSS/Stata or the archival .tab |
| Globus on top | Out of scope | Optional ./mount-globus.sh brings up a personal endpoint with the dataset at /mnt/dataset |
| Operator/admin involvement | Needed (S3 keys, S3 endpoint, bucket policy) | None |

The s3fs path is still the right answer for the operator mounting the whole Dataverse install for backup/inspection. This recipe targets the user who wants one dataset, with the file names the dataset author chose, and possibly a fast cross-institution transfer.

How it works

The Docker image is a single-stage debian:bookworm-slim runtime that downloads a pre-built rclone binary (with the Dataverse backend included) from the fork's release page at https://github.com/ErykKul/rclone/releases/tag/dataverse-backend-latest, installs fuse3, tini, ca-certificates, and — when built with INCLUDE_GLOBUS=1 (which mount-globus.sh does) — Globus Connect Personal plus the Python runtime it needs. The download approach keeps first-time builds in the ~30-second range instead of the multi-minute Go compile the earlier draft did. TARGETARCH picks linux-amd64 vs linux-arm64 automatically so the same Dockerfile works on x86 Linux and Apple Silicon Macs via Docker Desktop. To test backend changes from a different fork, point RCLONE_BINARY_URL at a different release asset, or -v mount a locally-built binary over /usr/local/bin/rclone at run time.
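
The architecture selection and the binary override can be pictured with the sketch below. It is illustrative only and is not taken from the recipe's Dockerfile: the function names are hypothetical, the exact release asset names are not assumed, and treating `INCLUDE_GLOBUS` as a `--build-arg` is an assumption based on the wording above (`RCLONE_BINARY_URL` is described as a build arg in this PR).

```shell
# Map a Docker TARGETARCH value to the platform label named above.
# Fails (non-zero status) for architectures the description does not mention.
rclone_platform_label() (
  case "$1" in
    amd64) echo "linux-amd64" ;;
    arm64) echo "linux-arm64" ;;
    *) echo "unsupported TARGETARCH: $1" >&2; exit 1 ;;
  esac
)

# Print (do not run) a docker build command that overrides the rclone binary.
# $1 = URL of the release asset to use, $2 = 1 to include Globus (default 0).
print_build_cmd() (
  url=$1
  globus=${2:-0}
  printf 'docker build --build-arg RCLONE_BINARY_URL=%s --build-arg INCLUDE_GLOBUS=%s .\n' \
    "$url" "$globus"
)

# Usage (the URL is a placeholder, not a real asset):
#   rclone_platform_label arm64
#   print_build_cmd https://example.org/rclone-custom 1
```

Printing the command instead of executing it keeps the sketch safe to paste into a shell; the scripts in the recipe remain the supported way to build the image.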

The Dataverse backend itself is read-only. It talks to /api/datasets/.../versions/... for listings and to /api/access/datafile/{id} for bytes. On the first read of each file it issues a Range: bytes=0-0 probe to detect the mode: a 3xx response with a Location: header means S3-direct (it follows the presigned URL and caches it until expiry); a 200/206 means proxy mode (it streams through the Dataverse endpoint with the API token on every read). Access URLs are cached per (file, format), concurrent first touches are deduplicated via singleflight, and mid-stream failures trigger a transparent in-place resume with Range: bytes=<bytes-read>-<end>. By default (ingest_format=original), tabular-ingest files surface under the user's original upload name with a verifiable MD5; on demand they surface in Dataverse's archival .tab form instead.
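
The probe decision and the resume header described here can be written down as two small shell functions. This is a sketch of the logic as stated in this description, not the Go code in the rclone backend. The function names, the "unknown" result, and the extra start-offset parameter are assumptions made for the example.

```shell
# Classify the response to the "Range: bytes=0-0" probe.
# $1 = HTTP status code, $2 = value of the Location header (may be empty).
classify_probe() (
  status=$1
  location=${2:-}
  case "$status" in
    3[0-9][0-9])
      # A redirect only counts as S3-direct if it says where to go.
      if [ -n "$location" ]; then echo "s3-direct"; else echo "unknown"; fi ;;
    200|206) echo "proxy" ;;
    *) echo "unknown" ;;
  esac
)

# Build the Range header used to resume after a mid-stream failure.
# $1 = first byte of the original request, $2 = bytes already delivered,
# $3 = last byte of the original request.
# With $1 = 0 this is the "bytes=<bytes-read>-<end>" form quoted above.
resume_range_header() (
  start=$1
  delivered=$2
  end=$3
  printf 'Range: bytes=%s-%s\n' "$((start + delivered))" "$end"
)

# Usage:
#   classify_probe 302 "https://bucket.example/presigned"   # prints s3-direct
#   classify_probe 206                                      # prints proxy
#   resume_range_header 0 1000 4095                         # prints Range: bytes=1000-4095
```

Keeping the classification separate from the HTTP call makes both branches easy to exercise without a server, which is roughly what unit tests for the two modes would need.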

The rclone fork is at ErykKul/rclone, branch dataverse-backend. The backend is being proposed upstream to rclone/rclone (PR #9467); once it merges, the Dockerfile's RCLONE_BINARY_URL build arg will point at an upstream release asset and the fork becomes unnecessary.

Files

shell/mount_dataset/
├── README.md              # recipe doc — lay-terms explainer, quickstart, platform notes
├── Dockerfile             # debian-slim runtime + pre-built rclone binary (optional GCP)
├── entrypoint.sh          # mount / mount-globus / globus-setup / status / shell modes
├── lib.sh                 # cross-platform helpers (detect_platform, abspath,
│                          #   is_mountpoint, fuse_unmount, require_docker)
├── mount.sh               # ./mount.sh — interactive prompt, foreground rclone mount
├── mount-globus.sh        # ./mount-globus.sh — same + Globus endpoint (handles first-run setup)
├── unmount.sh             # ./unmount.sh — stop container + clear stale FUSE mount
├── reset.sh               # ./reset.sh — wipe .env + ./data + ./globus-state for a fresh start
├── sample.env             # editable .env template
└── .gitignore             # /data, /globus-state (per-user runtime state)

Quickstart

Sparse-checkout pulls just this recipe directory, not the whole repo:

git clone --depth 1 --filter=blob:none --sparse \
  https://github.com/gdcc/dataverse-recipes.git
cd dataverse-recipes
git sparse-checkout add shell/mount_dataset
cd shell/mount_dataset
./mount.sh

mount.sh prompts for the Dataverse URL, dataset DOI, and optional API token (blank for guest access on public datasets); writes .env; builds the image on first run; mounts the dataset on ./data in the foreground; Ctrl-C unmounts.

For Globus mode: ./mount-globus.sh — first run walks you through a one-time browser device-code login.

Platforms

| Platform | Mount mode | Globus mode |
|---|---|---|
| Linux | ✅ full host visibility via bind-mount | ✅ full |
| WSL2 (Windows) | ✅ same as Linux (clone inside WSL for speed) | ✅ full |
| macOS | ⚠️ visible inside container only (see README) | ✅ full |

macOS limitation: Docker Desktop's VM boundary doesn't propagate FUSE mounts back to the host filesystem. The mount works inside the container (docker exec ... ls /mnt/dataset), and the Globus mode works fully (Globus is TCP, not bind-mount-dependent) — so the "transfer to my HPC" workflow is unaffected. Three workarounds for plain file browsing: (a) docker exec into the container, (b) use the Globus mode with a Globus Connect Personal running natively on the Mac as the receiving end, or (c) install rclone natively (with macFUSE or rclone's nfsmount) and skip Docker entirely — the README has the full walkthrough.

Testing

End-to-end verified on Linux against:

  • Public Dataverse, guest access (demo.dataverse.org, no API token): list + range read + full copy + rclone check (MD5s match) against a public dataset. Exercises the S3-direct path.
  • Local Dataverse + MinIO dev stack, S3-direct mode (download-redirect=true): mount, MD5s match Dataverse-reported hashes, range reads correct at 7 distinct offsets including the last byte (off=1048575, cnt=1), head -c 128 / tail -c 128 byte-identical, full-file cmp byte-identical, clean unmount. Confirms the presigned-URL follow + range-forward path.
  • Local Dataverse + file storage driver (localfs1), proxy mode: collection configured with the file-based storage driver, dataset files served via /api/access/datafile/{id} returning 206 Partial Content (no Location: redirect). Mount worked, files and subdirectories appeared correctly, all MD5s matched. Confirms the proxy-mode path with X-Dataverse-Key auth on every read.
  • Globus mode: real endpoint registration via ./mount-globus.sh's first-run device-code flow; endpoint came online in the Globus web app, dataset files visible under /mnt/dataset/.
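
For reviewers who want to repeat the range-read checks by hand, a minimal sketch follows. It is not the script used for the results above. The helper names `read_range` and `same_range` are invented here, and the file paths in the usage lines are placeholders. It relies on `tail -c +N`, `head -c` and `cksum`, which are available on common Linux and BSD userlands.

```shell
# Print `count` bytes of a file, starting at zero-based `offset`.
# $1 = file, $2 = offset, $3 = count
read_range() (
  file=$1
  offset=$2
  count=$3
  # tail -c +N starts at the N-th byte (1-based), hence offset + 1.
  tail -c "+$((offset + 1))" "$file" | head -c "$count"
)

# Succeed if the same byte range is identical in two files.
# $1 = file on the mount, $2 = reference copy, $3 = offset, $4 = count
same_range() (
  a=$(read_range "$1" "$3" "$4" | cksum)
  b=$(read_range "$2" "$3" "$4" | cksum)
  [ "$a" = "$b" ]
)

# Usage (paths are placeholders): compare the last byte of a 1 MiB file.
#   same_range ./data/big.bin /tmp/reference.bin 1048575 1 && echo "range OK"
```

Comparing checksums instead of raw bytes avoids putting binary data into shell variables; a full-file `cmp` remains the stronger check once the spot checks pass.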

The rclone Dataverse backend itself has unit-test coverage for both modes including TestOpenResumesOnMidStreamFailure (URL expires mid-transfer → backend re-fetches + re-issues with Range: bytes=N-), TestOpenPresignedRetryOn403, TestFetchAccessURLRedirect, TestFetchAccessURLProxyMode, and TestParsePresignExpiryFallback.

The macOS path has not been tested on macOS hardware; the cross-platform lib.sh helpers (abspath, is_mountpoint, fuse_unmount) are shell-only and target POSIX + BSD tools, but real-machine confirmation would be welcome.
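
To make the platform split easier to review without a Mac at hand, here is one way such detection could be structured so the decision logic is testable on any machine. This is not the code in lib.sh: `classify_platform` and `detect_platform_sketch` are hypothetical names, and matching "microsoft" in the kernel release string is an assumption about how WSL identifies itself.

```shell
# Classify a host from its kernel name and kernel release strings.
# $1 = output of `uname -s`, $2 = output of `uname -r`
classify_platform() (
  kernel=$1
  release=${2:-}
  case "$kernel" in
    Darwin) echo "macos" ;;
    Linux)
      case "$release" in
        *[Mm]icrosoft*) echo "wsl" ;;   # WSL kernels carry "microsoft" in the release
        *) echo "linux" ;;
      esac ;;
    *) echo "unknown" ;;
  esac
)

# Thin wrapper that feeds the real host values into the classifier.
detect_platform_sketch() {
  classify_platform "$(uname -s)" "$(uname -r)"
}

# Usage:
#   classify_platform Darwin 23.4.0                            # prints macos
#   classify_platform Linux 5.15.90.1-microsoft-standard-WSL2  # prints wsl
#   detect_platform_sketch
```

Passing the `uname` output in as arguments means the macOS and WSL branches can be checked from a Linux CI runner, even though the mount behaviour itself still needs a real machine.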

Known limitations / follow-ups

  • rclone fork dependency. The Dockerfile pulls the rclone Dataverse backend from ErykKul/rclone@dataverse-backend. The backend is being proposed upstream to rclone/rclone (PR #9467); once it is merged, the defaults swap to upstream rclone.
  • Dataset version is frozen at mount time. New versions published by the dataset author won't appear until a restart — by design, so in-progress transfers see a stable file list.
  • Restricted-file listings. If your token can list but not read a restricted file, the listing includes it and the read fails with 401/403. Such files could be filtered out client-side later.
  • GHCR prebuilt images. The recipe currently builds the Docker image locally on first run. The parallel ErykKul/dataverse-mount repo already publishes the same image to GHCR if a pre-built one is preferred (IMAGE_TAG=ghcr.io/erykkul/dataverse-mount:latest ./mount.sh); a future iteration may move that publishing into the recipe itself.

Why this might be interesting beyond just a recipe

The mainline "Dataverse + managed Globus" story requires the institution to stand up a Globus Connect Server, register endpoints, manage identity mapping, and usually pay for a Globus subscription, which is realistic only with dedicated data-engineering staff. This recipe gets researchers a per-dataset Globus endpoint with zero institutional involvement, on the free Globus tier. We expect the biggest use case to be "researcher needs to move a 500 GB dataset to HPC scratch to run analyses", currently a multi-day manual chore for many users.

pdurbin moved this to Ready for Triage in IQSS Dataverse Project on May 26, 2026
ErykKul commented May 27, 2026

I had a good idea for extending this with a personal Globus endpoint on top of the mount. I already have it working, so I will update this PR soon (tomorrow at the latest; it turned out to be very easy to do).

ErykKul commented May 28, 2026

Ready for review.

pdurbin moved this from Ready for Triage to Ready for Review ⏩ in IQSS Dataverse Project on Jun 2, 2026