
Further fixes to job submission #335

Draft
allison-truhlar wants to merge 46 commits into main from further-fixes-to-job-submission

Conversation

@allison-truhlar
Collaborator

This PR troubleshoots submitting jobs to the Janelia cluster when running the Fileglancer server as root.

Changes so far:

  1. Work directory fix — resolved ~ to the actual user's home (~username), not root's home.
  2. seteuid job submission — wrapped work dir creation, symlink creation, and executor.submit() in EffectiveUserContext so filesystem operations happen as the user.
  3. Stale symlink fix — unlink any existing repo symlink before creating a new one (a failed job left its symlink behind, which broke the next attempt).
  4. bsub -U username flag — because bsub uses getuid() (real UID = root), not geteuid(), we pass -U username to tell LSF to run the job as the actual user. Note: this may be unnecessary now that the missing FGC_CLUSTER__EXECUTOR=lsf has been added to the fileglancer-hub .env file; we should test removing it once everything else is fixed.
  5. cluster.extra_paths config option — prepends directories to os.environ["PATH"] at server startup so bsub/bjobs/bkill are findable. Needed because systemd services have a minimal PATH. Documented in config.yaml.template.
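A minimal sketch of the EffectiveUserContext idea from item 2 — temporarily dropping the effective UID/GID so filesystem operations happen as the submitting user. This is illustrative, not the project's actual implementation; it assumes the server starts as root so seteuid/setegid can be restored on exit:

```python
import os
import pwd

class EffectiveUserContext:
    """Illustrative sketch: temporarily switch the process's effective
    UID/GID to a target user, restoring the originals on exit.
    Assumes the process has the privilege (e.g. runs as root)."""

    def __init__(self, username: str):
        entry = pwd.getpwnam(username)
        self.uid, self.gid = entry.pw_uid, entry.pw_gid

    def __enter__(self):
        self._saved_euid = os.geteuid()
        self._saved_egid = os.getegid()
        os.setegid(self.gid)   # set group first, while still privileged
        os.seteuid(self.uid)
        return self

    def __exit__(self, *exc):
        os.seteuid(self._saved_euid)  # regain privilege before restoring egid
        os.setegid(self._saved_egid)
        return False
```

Note that seteuid changes only the *effective* UID; getuid() still returns the real UID, which is why item 4's `-U` flag question arises for bsub.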

Where we are now (as of release 2.7.0a7, commit eff3c09)

  • bsub is found on PATH (no more FileNotFoundError), but it fails because LSF can't find its configuration

@krokicki

The work directory, repo symlink, and bsub/sbatch submission were all running as root. The repo cache (~/.fileglancer/apps/) must stay as root since users can't write there, so user_context is passed into submit_job and applied only around work_dir creation, symlink, and executor.submit.
os.path.expanduser("~") uses the real UID (root), ignoring seteuid.
Pass username to _build_work_dir so it expands ~username instead.
- work_dir can survive from a previous failed attempt with the same job ID
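The ~username expansion fix above can be sketched as follows; `build_work_dir` and the directory layout are illustrative names, not the project's exact code:

```python
import os
import pwd

def build_work_dir(username: str, job_id: str) -> str:
    """Sketch: expand the *submitting* user's home, not the server's.

    os.path.expanduser("~") consults the real UID's HOME (root under
    systemd), but "~username" is resolved via the passwd database, so
    it gives the right answer even while running as root."""
    home = os.path.expanduser(f"~{username}")
    return os.path.join(home, ".fileglancer", "jobs", job_id)
```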
Adds a cluster.extra_paths config option that prepends directories to the
server's PATH at startup, so scheduler commands (bsub, bjobs, bkill) are
findable even when the system service has a minimal default PATH.
Configurable via .env (FGC_CLUSTER__EXTRA_PATHS) to avoid committing
environment-specific paths.
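The extra_paths mechanism amounts to a one-time PATH mutation at startup; a minimal sketch (function name is illustrative):

```python
import os

def apply_extra_paths(extra_paths: list[str]) -> None:
    """Sketch of cluster.extra_paths: prepend scheduler bin directories
    (where bsub/bjobs/bkill live) to the server process's PATH so they
    are findable despite a minimal systemd default PATH. Prepending
    also lets them win over any same-named commands later in PATH."""
    current = os.environ.get("PATH", "")
    parts = [*extra_paths, current] if current else list(extra_paths)
    os.environ["PATH"] = os.pathsep.join(parts)
```

Because os.environ is consulted by subprocess lookups, this makes the scheduler commands visible to everything the server spawns afterwards.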
…variables

LSF's bsub requires LSF_ENVDIR to locate lsf.conf. When running as a
systemd service via pixi, this env var does not reach the server process.

Approaches that did NOT work:
- Environment=LSF_ENVDIR=... in the systemd unit file
- Adding LSF_ENVDIR to the .env file (pydantic-settings rejected it
  as an unknown field: "lsf_envdir: Extra inputs are not permitted")
- ExecStart with `source /misc/lsf/conf/profile.lsf && exec pixi run
  start`
- Inspecting /proc/<PID>/environ confirmed LSF_ENVDIR was absent from
  the worker process in all cases

These attempts ruled out systemd misconfiguration and shell sourcing
issues, and suggest that pixi strips inherited environment variables
during activation.

The fix mirrors the existing extra_paths approach: extra_env is a dict
in ClusterSettings that is applied to os.environ inside the Python
process at startup, after pixi has already set up its environment.
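A sketch of the extra_env application step described above — a dict applied to os.environ at startup, after pixi has already replaced the inherited environment (function name is illustrative):

```python
import os

def apply_extra_env(extra_env: dict[str, str]) -> None:
    """Sketch of cluster.extra_env: set variables such as LSF_ENVDIR
    inside the Python process itself, sidestepping the fact that the
    systemd/pixi startup chain does not pass them through. Configured
    values win over anything already present."""
    os.environ.update(extra_env)
```

For example, `apply_extra_env({"LSF_ENVDIR": "/misc/lsf/conf"})` would let bsub locate lsf.conf without any change to the unit file.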
- the previous commit fixed the lsf.conf issue, but there is a new error, "Failed in LSF library call: External authentication failed", so trying to remove the -U flag to see whether LSF picks up the euid set by seteuid
…level

These are general server environment settings, not cluster-specific. Also removes the now-unnecessary config.pop() calls in get_executor().
Adds a top-level env_source_script config option that sources a shell script (e.g., /misc/lsf/conf/profile.lsf) at startup and applies the resulting environment variables. This avoids manually enumerating LSF vars in extra_env when running under pixi, which strips inherited env.
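The env_source_script idea can be sketched as below: source the script in a child shell, dump the resulting environment NUL-separated, and hand the dict back for os.environ.update(). This assumes bash and GNU `env -0` are available; the function name is illustrative:

```python
import os
import subprocess

def source_env_script(script_path: str) -> dict[str, str]:
    """Sketch of env_source_script: capture the environment produced by
    sourcing a shell script (e.g. /misc/lsf/conf/profile.lsf), so LSF
    variables need not be enumerated by hand in extra_env."""
    out = subprocess.run(
        ["bash", "-c", f"source {script_path} >/dev/null 2>&1; env -0"],
        capture_output=True, check=True,
    ).stdout
    env: dict[str, str] = {}
    for chunk in out.split(b"\0"):
        if b"=" in chunk:
            key, _, value = chunk.partition(b"=")
            env[key.decode()] = value.decode()
    return env
```

The caller would then do `os.environ.update(source_env_script(path))` during startup.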
Ensures pixi is on PATH when job scripts run on cluster nodes.
The repo cache was expanded at import time to /root/.fileglancer/apps when running as a systemd service, making repos inaccessible from compute nodes. It now resolves ~username/.fileglancer/apps per user.

Also re-adds -U username to bsub so LSF runs jobs as the actual user, and wraps repo operations in user_context for correct file ownership.
@krokicki
Member

@allison-truhlar I made some changes to move the bjobs monitoring into the worker. So now all the LSF commands run with setuid. Once you have it on the dev server, we should coordinate to test it with multiple users submitting jobs at the same time.

…ctiveUserContext

EffectiveUserContext sets process-wide euid/egid, which gets corrupted when
concurrent async requests interleave at await points. This caused permission
errors for users with restrictive home directory permissions (drwx------) and
setegid failures ("Operation not permitted") for other concurrent users.

Replace in-process EffectiveUserContext usage for all apps/manifest operations
with worker subprocesses via _run_as_user, which sets the real UID/GID and
isolates each user's identity in a separate process. Also set HOME in the
worker environment so git finds the correct user config.
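The worker-subprocess approach above avoids the process-wide euid problem by setting the *real* UID/GID in a child between fork and exec. A minimal sketch, with an illustrative name standing in for `_run_as_user`:

```python
import os
import pwd
import subprocess
import sys

def run_as_user(username: str, argv: list[str]) -> subprocess.CompletedProcess:
    """Sketch of the _run_as_user pattern: spawn a worker subprocess
    whose real UID/GID are the target user's, isolating each user's
    identity per process instead of flipping euid process-wide (which
    interleaved async requests can corrupt). HOME is set so git finds
    the correct user config. Requires privilege to demote to other users."""
    entry = pwd.getpwnam(username)

    def demote():  # runs in the child between fork and exec
        os.setgid(entry.pw_gid)   # group first, while still privileged
        os.setuid(entry.pw_uid)

    env = dict(os.environ, HOME=entry.pw_dir, USER=username)
    return subprocess.run(argv, preexec_fn=demote, env=env,
                          capture_output=True, text=True)
```

Because setuid in the child changes the real UID, tools like bsub that call getuid() see the correct user without any -U flag.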
_convert_job after submit_job was missing a user context wrapper, so
get_service_url file reads ran as root instead of as the logged-in user.

validate_paths ran os.path.exists and os.access as root, so path
validation always reported paths as accessible regardless of the
user's actual permissions.
The poll worker seeded all job stubs as PENDING before calling bjobs.
If bjobs didn't return data for a job in a given cycle, the stub's
PENDING status was returned to the server and written back to the DB,
overwriting the correct RUNNING status. On the next cycle bjobs would
return RUNNING again, causing visible toggling on the jobs page.

Pass each job's current DB status to the worker so stubs preserve
their real status when bjobs doesn't mention them.
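The stub-seeding fix reduces to a merge that prefers bjobs data where it exists and otherwise keeps the DB status. A sketch with illustrative names:

```python
def merge_poll_results(db_statuses: dict[str, str],
                       bjobs_statuses: dict[str, str]) -> dict[str, str]:
    """Sketch of the fix: seed each job stub with its current DB status
    rather than hard-coded PENDING, then let bjobs output override only
    the jobs it actually reports. Jobs missing from a bjobs cycle keep
    their last known status instead of flapping back to PENDING."""
    merged = dict(db_statuses)     # seed stubs from the DB
    merged.update(bjobs_statuses)  # bjobs wins where it has data
    return merged
```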
Log uid/euid/HOME at key points:
- _run_as_user: logs target user, uid, gid, HOME before spawning worker
- worker main(): logs actual uid, euid, HOME on startup (to stderr)
- EffectiveUserContext: logs euid/egid on enter and exit
- _ensure_repo_cache, discover_app_manifests, fetch_app_manifest:
  log whether delegating to worker or running in-process
Previously, start_job_monitor() ran in every uvicorn worker's lifespan,
causing N independent poll loops (10 workers = 10x bjobs calls per
interval and spurious same-status DB updates). Now all workers run
the loop but race for a per-cycle file lock — only the winner polls.
If that worker dies, another takes over on the next interval.
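One way to sketch the per-cycle lock election, using an advisory flock held through both the poll and the sleep (names and structure are illustrative, not the project's code):

```python
import fcntl
import os
import time

def poll_cycle(lock_path: str, poll_fn, interval: float) -> bool:
    """Sketch of the single-poller election: every worker runs this
    loop body, but only the one that wins a non-blocking flock polls.
    The lock is held through the sleep as well, so workers with
    staggered timers cannot immediately grab it and poll again.
    Returns True if this worker was the one that polled."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR)
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return False        # another worker holds the lock
        poll_fn()
        time.sleep(interval)    # hold the lock through the sleep
        return True
    finally:
        os.close(fd)            # closing the fd releases the flock
```

If the winning worker dies, its fd is closed by the kernel and the flock is released, so another worker wins on the next interval.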
Two issues from production logs:

1. Status log showed "RUNNING -> RUNNING" because SQLAlchemy refreshed
   db_job.status after commit, before the log f-string evaluated.
   Fix: capture old_status before calling update_job_status.

2. Staggered worker sleep timers caused 2-3 polls per interval instead
   of 1 — the lock was released after poll but before sleep, so other
   workers whose timers expired could immediately acquire it.
   Fix: hold the lock through both poll and sleep.
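Fix 1 above is a snapshot-before-commit pattern; a minimal sketch with illustrative names:

```python
def log_status_change(db_job, new_status, update_fn, log) -> None:
    """Sketch of fix 1: capture the old status *before* the update and
    commit, since SQLAlchemy may refresh db_job.status on commit and
    make a later log read "RUNNING -> RUNNING"."""
    old_status = db_job.status          # snapshot before the commit
    update_fn(db_job, new_status)       # updates and commits
    if old_status != new_status:
        log(f"{old_status} -> {new_status}")
```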
