
Further fixes to job submission #335

Draft
allison-truhlar wants to merge 46 commits into main from further-fixes-to-job-submission

Conversation

@allison-truhlar
Collaborator

This PR troubleshoots submitting jobs to the Janelia cluster when running the Fileglancer server as root.

Changes so far:

  1. Work directory fix — resolved ~ to the actual user's home (~username), not root's home.
  2. seteuid job submission — wrapped work dir creation, symlink creation, and executor.submit() in EffectiveUserContext so filesystem operations happen as the user.
  3. Stale symlink fix — unlink any existing repo symlink before creating a new one (a failed job left its symlink behind, which broke the next attempt).
  4. bsub -U username flag — because bsub uses getuid() (real UID = root), not geteuid(), we pass -U username to tell LSF to run the job as the actual user. Note: this may be unnecessary now that the missing FGC_CLUSTER__EXECUTOR=lsf has been added to the fileglancer-hub .env file; we should test removing it once everything else is fixed.
  5. cluster.extra_paths config option — prepends directories to os.environ["PATH"] at server startup so bsub/bjobs/bkill are findable. Needed because systemd services have a minimal PATH. Documented in config.yaml.template.
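A minimal sketch of the EffectiveUserContext idea from item 2 — temporarily dropping the effective UID/GID so filesystem operations happen as the submitting user. This is illustrative, not the project's actual implementation; it assumes the server starts as root so seteuid/setegid can be restored on exit:

```python
import os
import pwd

class EffectiveUserContext:
    """Illustrative sketch: temporarily switch the process's effective
    UID/GID to a target user, restoring the originals on exit.
    Assumes the process has the privilege (e.g. runs as root)."""

    def __init__(self, username: str):
        entry = pwd.getpwnam(username)
        self.uid, self.gid = entry.pw_uid, entry.pw_gid

    def __enter__(self):
        self._saved_euid = os.geteuid()
        self._saved_egid = os.getegid()
        os.setegid(self.gid)   # set group first, while still privileged
        os.seteuid(self.uid)
        return self

    def __exit__(self, *exc):
        os.seteuid(self._saved_euid)  # regain privilege before restoring egid
        os.setegid(self._saved_egid)
        return False
```

Note that seteuid changes only the *effective* UID; getuid() still returns the real UID, which is why item 4's `-U` flag question arises for bsub.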

Where we are now (as of release 2.7.0a7, commit eff3c09)

  • bsub is found on PATH (no more FileNotFoundError), but it fails because LSF can't find its configuration

@krokicki

The work directory, repo symlink, and bsub/sbatch submission were all running as root. The repo cache (~/.fileglancer/apps/) must stay as root since users can't write there, so user_context is passed into submit_job and applied only around work_dir creation, symlink, and executor.submit.
os.path.expanduser("~") uses the real UID (root), ignoring seteuid.
Pass username to _build_work_dir so it expands ~username instead.
- work_dir can survive from a previous failed attempt with the same job ID
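The ~username expansion fix above can be sketched as follows; `build_work_dir` and the directory layout are illustrative names, not the project's exact code:

```python
import os
import pwd

def build_work_dir(username: str, job_id: str) -> str:
    """Sketch: expand the *submitting* user's home, not the server's.

    os.path.expanduser("~") consults the real UID's HOME (root under
    systemd), but "~username" is resolved via the passwd database, so
    it gives the right answer even while running as root."""
    home = os.path.expanduser(f"~{username}")
    return os.path.join(home, ".fileglancer", "jobs", job_id)
```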
Adds a cluster.extra_paths config option that prepends directories to the
server's PATH at startup, so scheduler commands (bsub, bjobs, bkill) are
findable even when the system service has a minimal default PATH.
Configurable via .env (FGC_CLUSTER__EXTRA_PATHS) to avoid committing
environment-specific paths.
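The extra_paths mechanism amounts to a one-time PATH mutation at startup; a minimal sketch (function name is illustrative):

```python
import os

def apply_extra_paths(extra_paths: list[str]) -> None:
    """Sketch of cluster.extra_paths: prepend scheduler bin directories
    (where bsub/bjobs/bkill live) to the server process's PATH so they
    are findable despite a minimal systemd default PATH. Prepending
    also lets them win over any same-named commands later in PATH."""
    current = os.environ.get("PATH", "")
    parts = [*extra_paths, current] if current else list(extra_paths)
    os.environ["PATH"] = os.pathsep.join(parts)
```

Because os.environ is consulted by subprocess lookups, this makes the scheduler commands visible to everything the server spawns afterwards.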
…variables

LSF's bsub requires LSF_ENVDIR to locate lsf.conf. When running as a
systemd service via pixi, this env var does not reach the server process.

Approaches that did NOT work:
- Environment=LSF_ENVDIR=... in the systemd unit file
- Adding LSF_ENVDIR to the .env file (pydantic-settings rejected it
  as an unknown field: "lsf_envdir: Extra inputs are not permitted")
- ExecStart with `source /misc/lsf/conf/profile.lsf && exec pixi run
  start`
- Inspecting /proc/<PID>/environ confirmed LSF_ENVDIR was absent from
  the worker process in all cases

These attempts ruled out systemd misconfiguration and shell sourcing
issues, and suggest that pixi strips inherited environment variables
during activation.

The fix mirrors the existing extra_paths approach: extra_env is a dict
in ClusterSettings that is applied to os.environ inside the Python
process at startup, after pixi has already set up its environment.
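A sketch of the extra_env application step described above — a dict applied to os.environ at startup, after pixi has already replaced the inherited environment (function name is illustrative):

```python
import os

def apply_extra_env(extra_env: dict[str, str]) -> None:
    """Sketch of cluster.extra_env: set variables such as LSF_ENVDIR
    inside the Python process itself, sidestepping the fact that the
    systemd/pixi startup chain does not pass them through. Configured
    values win over anything already present."""
    os.environ.update(extra_env)
```

For example, `apply_extra_env({"LSF_ENVDIR": "/misc/lsf/conf"})` would let bsub locate lsf.conf without any change to the unit file.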
- the previous commit fixed the lsf.conf issue, but there is a new error, "Failed in LSF library call: External authentication failed", so trying to remove the -U flag to see whether LSF picks up the euid set by seteuid
…level

These are general server environment settings, not cluster-specific. Also removes the now-unnecessary config.pop() calls in get_executor().
Adds a top-level env_source_script config option that sources a shell script (e.g., /misc/lsf/conf/profile.lsf) at startup and applies the resulting environment variables. This avoids manually enumerating LSF vars in extra_env when running under pixi, which strips inherited env.
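The env_source_script idea can be sketched as below: source the script in a child shell, dump the resulting environment NUL-separated, and hand the dict back for os.environ.update(). This assumes bash and GNU `env -0` are available; the function name is illustrative:

```python
import os
import subprocess

def source_env_script(script_path: str) -> dict[str, str]:
    """Sketch of env_source_script: capture the environment produced by
    sourcing a shell script (e.g. /misc/lsf/conf/profile.lsf), so LSF
    variables need not be enumerated by hand in extra_env."""
    out = subprocess.run(
        ["bash", "-c", f"source {script_path} >/dev/null 2>&1; env -0"],
        capture_output=True, check=True,
    ).stdout
    env: dict[str, str] = {}
    for chunk in out.split(b"\0"):
        if b"=" in chunk:
            key, _, value = chunk.partition(b"=")
            env[key.decode()] = value.decode()
    return env
```

The caller would then do `os.environ.update(source_env_script(path))` during startup.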
Ensures pixi is on PATH when job scripts run on cluster nodes.
The repo cache was expanded at import time to /root/.fileglancer/apps when running as a systemd service, making repos inaccessible from compute nodes. It now resolves ~username/.fileglancer/apps per user.

Also re-adds -U username to bsub so LSF runs jobs as the actual user, and wraps repo operations in user_context for correct file ownership.
@krokicki
Member

@allison-truhlar I made some changes to move the bjobs monitoring into the worker. So now all the LSF commands run with setuid. Once you have it on the dev server, we should coordinate to test it with multiple users submitting jobs at the same time.

…ctiveUserContext

EffectiveUserContext sets process-wide euid/egid, which gets corrupted when
concurrent async requests interleave at await points. This caused permission
errors for users with restrictive home directory permissions (drwx------) and
setegid failures ("Operation not permitted") for other concurrent users.

Replace in-process EffectiveUserContext usage for all apps/manifest operations
with worker subprocesses via _run_as_user, which sets the real UID/GID and
isolates each user's identity in a separate process. Also set HOME in the
worker environment so git finds the correct user config.
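The worker-subprocess approach above avoids the process-wide euid problem by setting the *real* UID/GID in a child between fork and exec. A minimal sketch, with an illustrative name standing in for `_run_as_user`:

```python
import os
import pwd
import subprocess
import sys

def run_as_user(username: str, argv: list[str]) -> subprocess.CompletedProcess:
    """Sketch of the _run_as_user pattern: spawn a worker subprocess
    whose real UID/GID are the target user's, isolating each user's
    identity per process instead of flipping euid process-wide (which
    interleaved async requests can corrupt). HOME is set so git finds
    the correct user config. Requires privilege to demote to other users."""
    entry = pwd.getpwnam(username)

    def demote():  # runs in the child between fork and exec
        os.setgid(entry.pw_gid)   # group first, while still privileged
        os.setuid(entry.pw_uid)

    env = dict(os.environ, HOME=entry.pw_dir, USER=username)
    return subprocess.run(argv, preexec_fn=demote, env=env,
                          capture_output=True, text=True)
```

Because setuid in the child changes the real UID, tools like bsub that call getuid() see the correct user without any -U flag.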
_convert_job after submit_job was missing a user context wrapper, so
get_service_url file reads ran as root instead of as the logged-in user.

validate_paths ran os.path.exists and os.access as root, so path
validation always reported paths as accessible regardless of the
user's actual permissions.
The poll worker seeded all job stubs as PENDING before calling bjobs.
If bjobs didn't return data for a job in a given cycle, the stub's
PENDING status was returned to the server and written back to the DB,
overwriting the correct RUNNING status. On the next cycle bjobs would
return RUNNING again, causing visible toggling on the jobs page.

Pass each job's current DB status to the worker so stubs preserve
their real status when bjobs doesn't mention them.
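The stub-seeding fix reduces to a merge that prefers bjobs data where it exists and otherwise keeps the DB status. A sketch with illustrative names:

```python
def merge_poll_results(db_statuses: dict[str, str],
                       bjobs_statuses: dict[str, str]) -> dict[str, str]:
    """Sketch of the fix: seed each job stub with its current DB status
    rather than hard-coded PENDING, then let bjobs output override only
    the jobs it actually reports. Jobs missing from a bjobs cycle keep
    their last known status instead of flapping back to PENDING."""
    merged = dict(db_statuses)     # seed stubs from the DB
    merged.update(bjobs_statuses)  # bjobs wins where it has data
    return merged
```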
Log uid/euid/HOME at key points:
- _run_as_user: logs target user, uid, gid, HOME before spawning worker
- worker main(): logs actual uid, euid, HOME on startup (to stderr)
- EffectiveUserContext: logs euid/egid on enter and exit
- _ensure_repo_cache, discover_app_manifests, fetch_app_manifest:
  log whether delegating to worker or running in-process
Previously, start_job_monitor() ran in every uvicorn worker's lifespan,
causing N independent poll loops (10 workers = 10x bjobs calls per
interval and spurious same-status DB updates). Now all workers run
the loop but race for a per-cycle file lock — only the winner polls.
If that worker dies, another takes over on the next interval.
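One way to sketch the per-cycle lock election, using an advisory flock held through both the poll and the sleep (names and structure are illustrative, not the project's code):

```python
import fcntl
import os
import time

def poll_cycle(lock_path: str, poll_fn, interval: float) -> bool:
    """Sketch of the single-poller election: every worker runs this
    loop body, but only the one that wins a non-blocking flock polls.
    The lock is held through the sleep as well, so workers with
    staggered timers cannot immediately grab it and poll again.
    Returns True if this worker was the one that polled."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR)
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return False        # another worker holds the lock
        poll_fn()
        time.sleep(interval)    # hold the lock through the sleep
        return True
    finally:
        os.close(fd)            # closing the fd releases the flock
```

If the winning worker dies, its fd is closed by the kernel and the flock is released, so another worker wins on the next interval.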
Two issues from production logs:

1. Status log showed "RUNNING -> RUNNING" because SQLAlchemy refreshed
   db_job.status after commit, before the log f-string evaluated.
   Fix: capture old_status before calling update_job_status.

2. Staggered worker sleep timers caused 2-3 polls per interval instead
   of 1 — the lock was released after poll but before sleep, so other
   workers whose timers expired could immediately acquire it.
   Fix: hold the lock through both poll and sleep.
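Fix 1 above is a snapshot-before-commit pattern; a minimal sketch with illustrative names:

```python
def log_status_change(db_job, new_status, update_fn, log) -> None:
    """Sketch of fix 1: capture the old status *before* the update and
    commit, since SQLAlchemy may refresh db_job.status on commit and
    make a later log read "RUNNING -> RUNNING"."""
    old_status = db_job.status          # snapshot before the commit
    update_fn(db_job, new_status)       # updates and commits
    if old_status != new_status:
        log(f"{old_status} -> {new_status}")
```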
