Add mount dataset recipe #35
Open
ErykKul wants to merge 15 commits into
Author
I have a good idea for extending this with an optional personal Globus endpoint on top. I already have it working, so I will update this PR soon (tomorrow at the latest; it turned out to be very easy to do).
…onnect Personal endpoint (replaces s3fs-based approach)
Author
Ready for review.
…rom source (faster Docker builds, simpler native install)
…ount); fallback to local build
…E_TIME (1000h), tini -g
What this PR adds
A new shell recipe under `shell/mount_dataset/` that mounts any Dataverse dataset as a real read-only filesystem, with the dataset's folder structure and human-readable filenames preserved. Files are fetched on demand, so there's no upfront download and no disk space needed for the whole dataset. Optionally, the same recipe can publish the mount as a personal Globus endpoint, so you can transfer the dataset (preserved names and all) to any other Globus endpoint, e.g. your HPC cluster's scratch storage.

The recipe is meant for end-user / researcher use: clone the recipe, run `./mount.sh`, browse the dataset as a local folder. No operator involvement, no Dataverse-side configuration, no paid Globus subscription, no S3 connector / Globus Connect Server / S3 keys.

Works with any Dataverse storage backend. The recipe only uses the standard Native API (`/api/datasets/.../versions/...` and `/api/access/datafile/{id}`), so it works the same on instances backed by local filesystem, S3, Swift, or anything else Dataverse supports. If the instance is on S3 with direct-download enabled, the backend automatically picks up the presigned-URL redirect and streams bytes straight from S3 (cutting Dataverse out of the data path); otherwise it streams through Dataverse's access endpoint with HTTP `Range` requests. Either way, only the bytes you actually read are fetched. Both code paths are unit-tested in the rclone backend and verified end-to-end against real datasets (see Testing below).

Replaces the earlier s3fs-based draft of this PR
The first iteration of this branch used `s3fs` to mount the Dataverse instance's S3 bucket and built a friendly symlink tree on top. That works but trades off in ways that matter for a recipe aimed at end users:

- … `directoryLabel + label` (the author's intent), exposed directly.
- … `.tab`
- … `./mount-globus.sh` brings up a personal endpoint with the dataset at `/mnt/dataset`

The s3fs path is still the right answer for the operator mounting the whole Dataverse install for backup/inspection. This recipe targets the user who wants one dataset, with the file names the dataset author chose, and possibly a fast cross-institution transfer.
How it works
The Docker image is a single-stage `debian:bookworm-slim` runtime that downloads a pre-built rclone binary (with the Dataverse backend included) from the fork's release page at https://github.com/ErykKul/rclone/releases/tag/dataverse-backend-latest, installs `fuse3`, `tini`, `ca-certificates`, and, when built with `INCLUDE_GLOBUS=1` (which `mount-globus.sh` does), Globus Connect Personal plus the Python runtime it needs. The download approach keeps first-time builds in the ~30-second range instead of the multi-minute Go compile the earlier draft did. `TARGETARCH` picks `linux-amd64` vs `linux-arm64` automatically so the same Dockerfile works on x86 Linux and Apple Silicon Macs via Docker Desktop. To test backend changes from a different fork, point `RCLONE_BINARY_URL` at a different release asset, or `-v` mount a locally-built binary over `/usr/local/bin/rclone` at run time.

The Dataverse backend itself is read-only and talks to `/api/datasets/.../versions/...` for listings and `/api/access/datafile/{id}` for bytes. On the first read of each file it issues a `Range: bytes=0-0` probe to detect mode: a 30x with `Location:` means S3-direct (it follows the presigned URL and caches it until expiry); a 200/206 means proxy mode (it streams through the Dataverse endpoint with the API token on every read). Access URLs are cached per `(file, format)`, concurrent first-touches are deduplicated via `singleflight`, and mid-stream failures trigger a transparent in-place resume with `Range: bytes=<bytes-read>-<end>`. Tabular-ingest files surface under the user's original upload name with a verifiable MD5 by default (`ingest_format=original`), or under Dataverse's archival `.tab` form on demand.

The rclone fork is at ErykKul/rclone, branch `dataverse-backend`. The backend is being proposed upstream to `rclone/rclone` (PR #9467); once it merges, the Dockerfile's `RCLONE_BINARY_URL` build arg will point at an upstream release asset and the fork becomes unnecessary.
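The first-read probe and the mid-stream resume described under How it works reduce to two small decisions, sketched here in shell purely for illustration. The function names `classify_probe` and `resume_range` are hypothetical, the `unknown` result for other status codes is an assumption, and none of this is the backend's actual code (which lives in the rclone fork); only the status-code mapping and the `Range` header shapes are taken from the description.

```shell
#!/bin/sh
# classify_probe STATUS [LOCATION]
# Map the response to a "Range: bytes=0-0" probe onto an access mode.
# Hypothetical helper; "unknown" for any other status is an assumption.
classify_probe() {
    status=$1
    location=${2:-}
    case $status in
        30[0-9])
            # A redirect signals S3-direct only when it carries a Location.
            if [ -n "$location" ]; then echo s3-direct; else echo unknown; fi
            ;;
        200|206) echo proxy ;;
        *) echo unknown ;;
    esac
}

# resume_range BYTES_READ END
# Build the Range header for resuming after BYTES_READ bytes have arrived.
resume_range() {
    printf 'Range: bytes=%s-%s\n' "$1" "$2"
}

# Example probe with curl (BASE_URL, FILE_ID and API_TOKEN are placeholders):
#   curl -s -o /dev/null -w '%{http_code} %{redirect_url}\n' -r 0-0 \
#        -H "X-Dataverse-Key: $API_TOKEN" \
#        "$BASE_URL/api/access/datafile/$FILE_ID"
# The two printed fields can be passed straight to classify_probe.
```

For instance, `classify_probe 206` prints `proxy`, and `resume_range 1024 1048575` prints `Range: bytes=1024-1048575`.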
Quickstart
Sparse-checkout pulls just this recipe directory, not the whole repo:
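A minimal sketch of such a sparse checkout with stock git (2.25 or newer). The repository URL is a placeholder, and the helper name `sparse_clone` is illustrative rather than part of the recipe; only the `shell/mount_dataset` path and `./mount.sh` come from this PR.

```shell
#!/bin/sh
# sparse_clone REPO_URL DEST SUBDIR
# Clone REPO_URL into DEST, checking out only SUBDIR.
# REPO_URL is a placeholder; substitute the real recipes repository.
sparse_clone() {
    repo_url=$1
    dest=$2
    subdir=$3
    # --filter=blob:none defers file contents until they are checked out;
    # --sparse starts with only top-level files in the working tree.
    git clone --filter=blob:none --sparse "$repo_url" "$dest" || return 1
    # Restrict the working tree to the recipe directory.
    git -C "$dest" sparse-checkout set "$subdir"
}

# Usage (hypothetical URL):
#   sparse_clone https://example.org/recipes.git recipes shell/mount_dataset
#   cd recipes/shell/mount_dataset && ./mount.sh
```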
`mount.sh` prompts for the Dataverse URL, dataset DOI, and optional API token (blank for guest access on public datasets); writes `.env`; builds the image on first run; mounts the dataset on `./data` in the foreground; `Ctrl-C` unmounts.

For Globus mode: `./mount-globus.sh`. The first run walks you through a one-time browser device-code login.

Platforms
macOS limitation: Docker Desktop's VM boundary doesn't propagate FUSE mounts back to the host filesystem. The mount works inside the container (`docker exec ... ls /mnt/dataset`), and the Globus mode works fully (Globus is TCP, not bind-mount-dependent), so the "transfer to my HPC" workflow is unaffected. Three workarounds for plain file browsing: (a) `docker exec` into the container, (b) use the Globus mode with a Globus Connect Personal running natively on the Mac as the receiving end, or (c) install rclone natively (with macFUSE or rclone's `nfsmount`) and skip Docker entirely; the README has the full walkthrough.

Testing
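Checks of the kind reported in this section (checksum comparison, range reads at chosen offsets) can be reproduced with standard tools against any file on the mount. A minimal sketch: the helper names are illustrative, `md5sum` assumes GNU coreutils (macOS would need `md5 -q` instead), and `head -c` is assumed to be available as it is on GNU and BSD systems.

```shell
#!/bin/sh
# read_range FILE OFFSET COUNT
# Print COUNT bytes of FILE starting at byte OFFSET (0-based).
read_range() {
    # tail -c +N is 1-based, hence the +1.
    tail -c "+$(( $2 + 1 ))" "$1" | head -c "$3"
}

# md5_matches FILE EXPECTED_HEX
# Succeed when FILE's MD5 equals the expected checksum,
# e.g. the one Dataverse reports for the file.
md5_matches() {
    actual=$(md5sum "$1" | cut -d' ' -f1)
    [ "$actual" = "$2" ]
}
```

With a 1 MiB file, `read_range FILE 1048575 1` reads only the last byte, which corresponds to the `off=1048575, cnt=1` case.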
End-to-end verified on Linux against:
- (… `demo.dataverse.org`, no API token): list + range read + full copy + `rclone check` (MD5s match) against a public dataset. Exercises the S3-direct path.
- (… `download-redirect=true`): mount, MD5s match Dataverse-reported hashes, range reads correct at 7 distinct offsets including the last byte (`off=1048575, cnt=1`), `head -c 128` / `tail -c 128` byte-identical, full-file `cmp` byte-identical, clean unmount. Confirms the presigned-URL follow + range-forward path.
- (… `localfs1`), proxy mode: collection configured with the file-based storage driver, dataset files served via `/api/access/datafile/{id}` returning `206 Partial Content` (no `Location:` redirect). Mount worked, files and subdirectories appeared correctly, all MD5s matched. Confirms the proxy-mode path with `X-Dataverse-Key` auth on every read.
- … `./mount-globus.sh`'s first-run device-code flow; endpoint came online in the Globus web app, dataset files visible under `/mnt/dataset/`.

The rclone Dataverse backend itself has unit-test coverage for both modes, including `TestOpenResumesOnMidStreamFailure` (URL expires mid-transfer → backend re-fetches + re-issues with `Range: bytes=N-`), `TestOpenPresignedRetryOn403`, `TestFetchAccessURLRedirect`, `TestFetchAccessURLProxyMode`, and `TestParsePresignExpiryFallback`.

The macOS path has not been tested on macOS hardware; the cross-platform `lib.sh` helpers (`abspath`, `is_mountpoint`, `fuse_unmount`) are shell-only and target POSIX + BSD tools, but real-machine confirmation would be welcome.

Known limitations / follow-ups
- … `rclone/rclone`; once merged the defaults swap to upstream rclone.
- (… `IMAGE_TAG=ghcr.io/erykkul/dataverse-mount:latest ./mount.sh`); a future iteration may move that publishing into the recipe itself.

Why this might be interesting beyond just a recipe
The mainline "Dataverse + managed Globus" story requires the institution to stand up a Globus Connect Server, register endpoints, manage identity mapping, and usually a paid Globus subscription — realistic only with dedicated data-engineering staff. This recipe gets researchers a per-dataset Globus endpoint with zero institutional involvement, on the free Globus tier. We expect the biggest use case to be "researcher needs to move a 500 GB dataset to HPC scratch to run analyses" — currently a multi-day manual chore for many users.