Draft
The work directory, repo symlink, and bsub/sbatch submission were all running as root. The repo cache (~/.fileglancer/apps/) must stay root-owned since users can't write there, so user_context is passed into submit_job and applied only around work_dir creation, the repo symlink, and executor.submit.
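A minimal sketch of such a narrowly-scoped effective-user context. `user_context` here is illustrative, not the project's actual implementation, and uid/gid lookup is elided:

```python
import os
from contextlib import contextmanager

@contextmanager
def user_context(uid: int, gid: int):
    # Temporarily switch effective ids so work_dir creation, the repo
    # symlink, and executor.submit run as the requesting user; the
    # root-owned repo cache is read outside this block.
    saved_uid, saved_gid = os.geteuid(), os.getegid()
    os.setegid(gid)  # group first, while still privileged
    os.seteuid(uid)
    try:
        yield
    finally:
        os.seteuid(saved_uid)
        os.setegid(saved_gid)
```

Note that this mutates process-wide state; a later commit in this PR replaces this pattern for async-heavy code paths.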
os.path.expanduser("~") uses the real UID (root), ignoring seteuid.
Pass username to _build_work_dir so it expands ~username instead.
- work_dir can survive from a previous failed attempt with the same job ID
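The ~username expansion described above can be sketched as follows (`build_work_dir` and the subpath argument are hypothetical names):

```python
import os

def build_work_dir(subpath: str, username: str | None = None) -> str:
    # expanduser("~") resolves via $HOME / the real UID, so under seteuid
    # it still points at root's home directory. "~username" is looked up
    # in the password database for that specific user instead.
    home_spec = f"~{username}" if username else "~"
    return os.path.join(os.path.expanduser(home_spec), subpath)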
Adds a cluster.extra_paths config option that prepends directories to the server's PATH at startup, so scheduler commands (bsub, bjobs, bkill) are findable even when the system service has a minimal default PATH. Configurable via .env (FGC_CLUSTER__EXTRA_PATHS) to avoid committing environment-specific paths.
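A sketch of what such a PATH prepend could look like at startup (the function name is hypothetical):

```python
import os

def apply_extra_paths(extra_paths: list[str]) -> None:
    # Prepend configured directories (e.g. the LSF bin dir) so scheduler
    # commands like bsub, bjobs, and bkill resolve even when the service
    # starts with a minimal default PATH.
    current = os.environ.get("PATH", "")
    parts = [p for p in extra_paths if p] + ([current] if current else [])
    os.environ["PATH"] = os.pathsep.join(parts)
```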
…variables

LSF's bsub requires LSF_ENVDIR to locate lsf.conf. When running as a systemd service via pixi, this env var does not reach the server process. Approaches that did NOT work:
- Environment=LSF_ENVDIR=... in the systemd unit file
- Adding LSF_ENVDIR to the .env file (pydantic-settings rejected it as an unknown field: "lsf_envdir: Extra inputs are not permitted")
- ExecStart with `source /misc/lsf/conf/profile.lsf && exec pixi run start`

Inspecting /proc/<PID>/environ confirmed LSF_ENVDIR was absent from the worker process in all cases. These attempts ruled out systemd misconfiguration and shell sourcing issues, and suggest that pixi strips inherited environment variables during activation. The fix mirrors the existing extra_paths approach: extra_env is a dict in ClusterSettings that is applied to os.environ inside the Python process at startup, after pixi has already set up its environment.
- The previous commit fixed the lsf.conf issue, but there is a new error, "Failed in LSF library call: External authentication failed", so try removing the -U flag to see if LSF picks up the euid set by seteuid.
krokicki reviewed Mar 20, 2026
…level

These are general server environment settings, not cluster-specific. Also removes the now-unnecessary config.pop() calls in get_executor().
Adds a top-level env_source_script config option that sources a shell script (e.g., /misc/lsf/conf/profile.lsf) at startup and applies the resulting environment variables. This avoids manually enumerating LSF vars in extra_env when running under pixi, which strips inherited env.
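One way such sourcing can be implemented, sketched with hypothetical names (the project's actual loader may differ). The script is sourced in a bash subshell and the resulting environment is imported into the Python process:

```python
import os
import subprocess

def apply_env_source_script(script_path: str) -> None:
    # Source the script in a bash subshell, then import its environment.
    # `env -0` (GNU coreutils) NUL-separates entries so values containing
    # newlines survive the round trip.
    out = subprocess.run(
        ["bash", "-c", f'source "{script_path}" >/dev/null 2>&1 && env -0'],
        check=True,
        capture_output=True,
    ).stdout
    for entry in out.split(b"\0"):
        if b"=" in entry:
            key, _, value = entry.decode().partition("=")
            os.environ[key] = value
```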
Ensures pixi is on PATH when job scripts run on cluster nodes.
The repo cache was expanded at import time to /root/.fileglancer/apps when running as a systemd service, making repos inaccessible from compute nodes. Now resolves ~username/.fileglancer/apps per-user. Also re-adds -U username to bsub so LSF runs jobs as the actual user, and wraps repo operations in user_context for correct file ownership.
…b-submission' into further-fixes-to-job-submission
Member
@allison-truhlar I made some changes to move the bjobs monitoring into the worker. So now all the LSF commands run with setuid. Once you have it on the dev server, we should coordinate to test it with multiple users submitting jobs at the same time.
…ctiveUserContext
EffectiveUserContext sets process-wide euid/egid, which gets corrupted when concurrent async requests interleave at await points. This caused permission errors for users with restrictive home directory permissions (drwx------) and setegid failures ("Operation not permitted") for other concurrent users.

Replace in-process EffectiveUserContext usage for all apps/manifest operations with worker subprocesses via _run_as_user, which sets the real UID/GID and isolates each user's identity in a separate process. Also set HOME in the worker environment so git finds the correct user config.
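The worker-subprocess pattern might look roughly like this; `run_as_user` is an illustrative stand-in for _run_as_user, not the project's actual code:

```python
import os
import pwd
import subprocess

def run_as_user(username: str, argv: list[str]) -> subprocess.CompletedProcess:
    # Set the *real* uid/gid in the child via preexec_fn, so the async
    # server never mutates its own process-wide effective ids. HOME is
    # set so git (and expanduser) see the target user's config.
    rec = pwd.getpwnam(username)

    def demote() -> None:
        os.setgid(rec.pw_gid)  # drop group first, while still privileged
        os.setuid(rec.pw_uid)

    env = dict(os.environ, HOME=rec.pw_dir, USER=username, LOGNAME=username)
    return subprocess.run(argv, preexec_fn=demote, env=env,
                          capture_output=True, text=True)
```

Because each identity lives in its own short-lived process, concurrent requests for different users can no longer corrupt each other's effective ids.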
_convert_job after submit_job was missing a user context wrapper, so get_service_url file reads ran as root instead of the logged-in user. validate_paths ran os.path.exists and os.access as root, so path validation always reported paths as accessible regardless of the user's actual permissions.
The poll worker seeded all job stubs as PENDING before calling bjobs. If bjobs didn't return data for a job in a given cycle, the stub's PENDING status was returned to the server and written back to the DB, overwriting the correct RUNNING status. On the next cycle bjobs would return RUNNING again, causing visible toggling on the jobs page. Pass each job's current DB status to the worker so stubs preserve their real status when bjobs doesn't mention them.
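The fix can be sketched as a simple merge (names hypothetical):

```python
def merge_poll_results(db_status: dict[str, str],
                       bjobs_status: dict[str, str]) -> dict[str, str]:
    # Seed each stub from its current DB status rather than a blanket
    # PENDING; bjobs output overrides only the jobs it actually reported
    # this cycle, so unmentioned jobs keep their real status.
    return {job_id: bjobs_status.get(job_id, status)
            for job_id, status in db_status.items()}
```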
Log uid/euid/HOME at key points:
- _run_as_user: logs target user, uid, gid, HOME before spawning worker
- worker main(): logs actual uid, euid, HOME on startup (to stderr)
- EffectiveUserContext: logs euid/egid on enter and exit
- _ensure_repo_cache, discover_app_manifests, fetch_app_manifest: log whether delegating to worker or running in-process
Previously, start_job_monitor() ran in every uvicorn worker's lifespan, causing N independent poll loops (10 workers = 10x bjobs calls per interval and spurious same-status DB updates). Now all workers run the loop but race for a per-cycle file lock — only the winner polls. If that worker dies, another takes over on the next interval.
Two issues from production logs:
1. Status log showed "RUNNING -> RUNNING" because SQLAlchemy refreshed db_job.status after commit, before the log f-string evaluated. Fix: capture old_status before calling update_job_status.
2. Staggered worker sleep timers caused 2-3 polls per interval instead of 1 — the lock was released after poll but before sleep, so other workers whose timers expired could immediately acquire it. Fix: hold the lock through both poll and sleep.
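The per-cycle lock described above, held across both the poll and the sleep, could be sketched with a non-blocking flock (the lock path and names are hypothetical):

```python
import asyncio
import fcntl
import os

LOCK_PATH = "/tmp/fileglancer-poll.lock"  # hypothetical lock location

async def poll_loop(interval: float, poll_once) -> None:
    # All workers run this loop; only the flock winner polls. Holding the
    # lock through the sleep stops workers with staggered timers from
    # polling again in the same interval. If the holder dies, the OS
    # releases the flock and another worker takes over next interval.
    fd = os.open(LOCK_PATH, os.O_CREAT | os.O_RDWR, 0o644)
    while True:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            await asyncio.sleep(interval)  # lost the race this cycle
            continue
        try:
            await poll_once()
            await asyncio.sleep(interval)  # keep lock until next cycle
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
```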
This PR troubleshoots submitting jobs to the Janelia cluster when running the Fileglancer server as root.
Changes so far:
Where we are now (as of release 2.7.0a7, commit eff3c09)