madengine — Codebase Wiki
-AI/ML model automation and benchmarking platform for local Docker, Kubernetes and SLURM. This wiki reflects branch
- develop. madengine is a streamlined CLI tool for running and benchmarking AI models on ROCm GPUs, offering a production‑ready workflow for local single node or remote multi node execution with integrated performance monitoring.
AI/ML model automation & benchmarking platform for local Docker, Kubernetes, and SLURM. + A Typer-based CLI that discovers models, builds Docker images, runs them across compute targets, + and writes structured performance results.
+Entry point: src/madengine/cli/app.py::cli_main
+ → console script madengine registered in pyproject.toml.
Overview
What it does
-madengine is a Typer-based CLI (madengine) that discovers models from a
- MAD package, builds Docker images, and runs them either locally or on distributed
- backends (Kubernetes, SLURM). It writes performance results to perf.csv
- and can generate HTML reports or upload to MongoDB.
Entry point: src/madengine/cli/app.py::cli_main
- (registered as the madengine console script in pyproject.toml).
What madengine does
+-
+
+ - Discover: finds model definitions from models.json or dynamic scripts, resolves tags
+ - Build: calls docker build for each model, writes build_manifest.json
+ - Run: reads manifest, infers compute target, dispatches containers, writes perf.csv
+ - Report: converts perf.csv to HTML or email; uploads to MongoDB
+
All four stages share a single --additional-context configuration spine that controls
+ GPU vendor, deployment type, launcher, profiling tools, and environment variables.
Why this branch matters
-The add_slurm_multi_launcher branch adds a self-managed multi-node SLURM launcher
- so that workloads with their own per-node Docker orchestration (e.g. SGLang Disaggregated
- prefill + decode + proxy) can run via a thin wrapper SBATCH that does not nest Docker
- inside the job step. It adds --use-image / --build-on-compute build modes,
- a registry gate, parallel image pull, and a bash-in-salloc execution path.
What's new in v2.1.0
+-
+
+ - slurm_multi: self-managed multi-node SLURM launcher for workloads with per-node Docker (e.g. SGLang Disagg)
+ - --use-image [auto] / --build-on-compute: new madengine build modes
+ - Docker --build-context tools=: shared tool APIs accessible in every Dockerfile
+ - Local MAD_MULTI_NODE_RUNNER: Megatron / DeepSpeed / TorchTitan now work on local Docker
+ - SLURM env-var escaping: double-quote escaping preserves spaces & paths
Quick start
# Install
+# 1. Install
pip install -e ".[dev]"
-# Discover models
+# 2. Discover available models
madengine discover --tags dummy
-# Run locally (build + run)
+# 3. Build + run (single command)
madengine run --tags dummy \
- --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'
+ --additional-context '{"gpu_vendor":"AMD","guest_os":"UBUNTU"}'
+
+# 4. Build only, then run from manifest
+madengine build --tags llama3 --registry registry.example.com/ml
+madengine run --manifest-file build_manifest.json \
+ --additional-context '{"docker_gpus":"0,1,2,3"}'
+Local mode: no k8s or slurm key in context → ContainerRunner (local Docker).
# Minimal K8s config — defaults applied automatically
-madengine run --tags model \
- --additional-context '{"k8s": {"gpu_count": 2}}'
+# Single-node K8s (minimal — defaults applied from presets/k8s/)
+madengine run --tags llama3 \
+ --additional-context '{"k8s":{"gpu_count":4}}'
-# Multi-node vLLM
-madengine run --tags model --additional-context '{
- "k8s": {"namespace": "ml-team", "gpu_count": 8},
- "distributed": {"launcher":"vllm","nnodes":2,"nproc_per_node":4}
-}'
+# Multi-node vLLM on K8s
+madengine run --tags vllm-serve \
+ --additional-context '{
+ "k8s": {"namespace":"ml-team","gpu_count":8},
+ "distributed": {"launcher":"vllm","nnodes":2,"nproc_per_node":4}
+ }'
+
+# K8s with NFS data PVC and secrets
+madengine run --tags model \
+ --additional-context '{
+ "k8s": {"namespace":"ml","gpu_count":8,"data_storage_class":"nfs-banff"},
+ "secrets": {"HF_TOKEN":"hf_xxx","WANDB_API_KEY":"yyy"}
+ }'
+Presence of "k8s" or "kubernetes" key → KubernetesDeployment. Requires pip install -e ".[all]".
# Build phase (login node or CI) then deploy
-madengine build --tags model --registry gcr.io/myproject
+# Single-node SLURM (build on login node, deploy via sbatch)
+madengine build --tags llama3 --registry registry.example.com/ml
+madengine run --manifest-file build_manifest.json \
+ --additional-context '{
+ "slurm": {"partition":"gpu","nodes":1,"gpus_per_node":8,"time":"12:00:00"}
+ }'
+# Multi-node torchrun
madengine run --manifest-file build_manifest.json \
--additional-context '{
- "slurm":{"partition":"gpu","nodes":4,"gpus_per_node":8,"time":"24:00:00"},
- "distributed":{"launcher":"torchtitan","nnodes":4,"nproc_per_node":8}
+ "slurm": {"partition":"gpu","nodes":4,"gpus_per_node":8,"time":"24:00:00"},
+ "distributed": {"launcher":"torchrun","nnodes":4,"nproc_per_node":8}
+ }'
+
+# DeepSpeed with reservation
+madengine run --manifest-file build_manifest.json \
+ --additional-context '{
+ "slurm": {"partition":"gpu","nodes":8,"gpus_per_node":8,
+ "time":"48:00:00","reservation":"ml-training"},
+ "distributed": {"launcher":"deepspeed","nnodes":8,"nproc_per_node":8}
}'
+Presence of "slurm" key → SlurmDeployment. Generates sbatch wrapper from Jinja2 template.
# slurm_multi — for workloads that run their own docker via srun
-madengine run --tags pyt_sglang_disagg_qwen3-32b_short \
+# SGLang Disaggregated (3+ nodes: proxy + prefill + decode)
+madengine run --tags pyt_sglang_disagg_qwen3-32b \
--additional-context '{
- "slurm":{"partition":"gpu","nodes":3,"gpus_per_node":8,"time":"02:00:00"},
- "distributed":{"launcher":"slurm_multi"}
+ "slurm": {"partition":"gpu","nodes":3,"gpus_per_node":8,"time":"02:00:00"},
+ "distributed": {"launcher":"slurm_multi"}
}'
-# Build on a compute node, push, then have run pull in parallel
-madengine build --tags model --build-on-compute --registry myreg.io/team
-# or skip build entirely and use a pre-baked image
-madengine build --tags model --use-image auto
+# Build options for slurm_multi models:
+# Option A — use pre-built registry image (skip local build)
+madengine build --tags pyt_sglang_disagg --use-image registry.io/sglang:latest
+
+# Option B — auto-resolve DOCKER_IMAGE_NAME from model card
+madengine build --tags pyt_sglang_disagg --use-image
+
+# Option C — build on compute node, push, then run pulls in parallel
+madengine build --tags pyt_sglang_disagg \
+ --registry registry.io/ml --build-on-compute
+slurm_multi bypasses the standard sbatch template: the model's own .slurm script runs directly on the head node so srun/scontrol work inside it.
# Store configuration in a JSON file and reference it
+cat > my_run.json <<'EOF'
+{
+ "gpu_vendor": "AMD",
+ "guest_os": "UBUNTU",
+ "slurm": {
+ "partition": "gpu",
+ "nodes": 4,
+ "gpus_per_node": 8,
+ "time": "24:00:00",
+ "exclusive": true
+ },
+ "distributed": {
+ "launcher": "torchrun",
+ "nnodes": 4,
+ "nproc_per_node": 8,
+ "backend": "nccl"
+ },
+ "env_vars": {
+ "NCCL_DEBUG": "WARN",
+ "HSA_ENABLE_SDMA": "0"
+ },
+ "tools": [{"name": "rocprofv3_compute"}]
+}
+EOF
+
+madengine run --tags llama3 --additional-context-file my_run.json
+--additional-context-file and --additional-context are mutually exclusive. The file is parsed as JSON (not ast.literal_eval).
Install & dev
Setup
-pip install -e ".[dev]" # base + dev
-pip install -e ".[all]" # + kubernetes
+# Base install (includes dev tools)
+pip install -e ".[dev]"
+
+# With Kubernetes support
+pip install -e ".[all]"
+
+# Enable pre-commit hooks
pre-commit install
+Optional extras
+
+| Extra | Adds |
+|---|---|
+| [dev] | pytest, black, flake8, mypy, isort, pre-commit |
+| [kubernetes] | kubernetes>=28.0.0, pyyaml |
+| [all] | dev + kubernetes |
+
+
Test & quality
-pytest # all tests
+pytest # all tests
+pytest tests/unit/ -v # unit only
pytest tests/unit/test_slurm_multi.py -v
pytest --cov=src/madengine --cov-report=html
-pytest -m "not slow"
-black src/ tests/ && isort src/ tests/
+pytest -m "not slow" # skip slow tests
+pytest -m "unit and amd" # combined markers
+
+black src/ tests/
+isort src/ tests/
flake8 src/ tests/
mypy src/madengine
pre-commit run --all-files
@@ -250,127 +377,142 @@ Test & quality
5-layer architecture
-Each layer talks only to the one below it. Layers are color-coded throughout this wiki.
+Each layer talks only to the layers below it. Layers are color-coded throughout this wiki.
| Layer | Path | Responsibilities | Key types |
|---|---|---|---|
| CLI | src/madengine/cli/ | -Typer app, command parsing, Rich output, exit-code mapping. | -app.py, commands/{build,run,discover,report,database}.py, constants.ExitCode |
| Orchestration | src/madengine/orchestration/ | -Discover → build → run pipeline. Decides whether to dispatch locally or to a deployment. | -BuildOrchestrator, RunOrchestrator, image_filtering.py |
| Deployment | src/madengine/deployment/ | -Factory + K8s/SLURM concrete deployments, preset merging, Jinja2 templates, monitoring. | -DeploymentFactory, BaseDeployment, KubernetesDeployment, SlurmDeployment |
| Execution | src/madengine/execution/ | -Local Docker build/run, log scanning, timeout resolution, perf parsing. | -ContainerRunner, DockerBuilder, container_runner_helpers.py |
| Core | src/madengine/core/ | -Cross-cutting primitives: context merging, console, docker wrapper, errors, auth, timeout. | -Context, Console, Docker, MADEngineError, load_credentials |
| Utils | src/madengine/utils/ | -Discovery, GPU vendor abstraction, ROCm path resolution, config parsing. | -DiscoverModels, gpu_tool_factory, rocm_path_resolver, ConfigParser |
| Reporting | src/madengine/reporting/ | -perf.csv writers, HTML/email report generation. | -update_perf_csv, csv_to_html, csv_to_email |
+| CLI | src/madengine/cli/ | Typer app, 5 commands, argument validation, Rich output, exit-code mapping. | app.py, commands/{build,run,discover,report,database}.py, constants.ExitCode |
+| Orchestration | src/madengine/orchestration/ | Discover → build → run pipeline. Decides whether to dispatch locally or to a deployment backend. | BuildOrchestrator, RunOrchestrator, image_filtering.py |
+| Deployment | src/madengine/deployment/ | Factory + Template Method pattern. K8s/SLURM concrete deployments, preset merging, Jinja2 templates, monitoring. | DeploymentFactory, BaseDeployment, KubernetesDeployment, SlurmDeployment, ConfigLoader |
+| Execution | src/madengine/execution/ | Local Docker build/run, log scanning, timeout resolution, perf parsing, self-managed launcher bypass. | ContainerRunner, DockerBuilder, container_runner_helpers |
+| Core | src/madengine/core/ | Cross-cutting primitives: context merging & GPU detection, shell execution, Docker wrapper, error hierarchy, auth, timeout. | Context, Console, Docker, MADEngineError, load_credentials |
+| Utils | src/madengine/utils/ | Model discovery, GPU vendor abstraction, ROCm path resolution, config parsing. | DiscoverModels, gpu_tool_factory, rocm_path_resolver, ConfigParser |
+| Reporting | src/madengine/reporting/ | perf.csv writers, HTML/email report generation. Database upload in src/madengine/database/. | update_perf_csv, csv_to_html, csv_to_email, mongodb.py |
+
Architecture diagram
Key data flows
Build flow
madengine build → BuildOrchestrator.execute()
- DiscoverModelsresolves--tagsagainst the MAD package - (rootmodels.json,scripts/{dir}/models.json, or -scripts/{dir}/get_models_json.py).
- - Each model is materialised through
Context(system + user -additional_context) and passed toDockerBuilder.
- - Optionally tags & pushes to
--registry.
- - Writes
build_manifest.jsonconsumed byrun.
+ - Context(build_only_mode=True): GPU vendor / arch detection skipped unless detect_local_gpu_arch=True
+ - ConfigLoader.load_config() applies preset defaults (SLURM or K8s) over user config
+ - DiscoverModels resolves --tags from root models.json, scripts/{dir}/models.json, or scripts/{dir}/get_models_json.py
+ - slurm_multi gate: if model uses slurm_multi and no --registry / --use-image given → auto-resolves DOCKER_IMAGE_NAME from model card or raises ConfigurationError
+ - DockerBuilder.build_all_models(): passes --build-context tools=scripts/common/tools if that dir exists
+ - After registry push: sets DOCKER_IMAGE_NAME in manifest env_vars for parallel SLURM pull
+ - Writes build_manifest.json
Special build modes on this branch:
--
-
--use-image [IMAGE|auto]— skip local build, use a prebuilt image (auto resolves -env_vars.DOCKER_IMAGE_NAMEfrom the model card). Mutually exclusive with ---registryand--build-on-compute.
- --build-on-compute— build on a SLURM compute node and push to--registry; - manifest carriesbuilt_on_compute: true.
-
Run flow
-
-
madengine run→RunOrchestratorloads existing manifest or triggers a build.
- - Target inference (Convention over Configuration):
-
-
-
"k8s"/"kubernetes"in context → KubernetesDeployment
- "slurm"in context → SlurmDeployment
- distributed.launcher == "slurm_multi"→ slurm_multi path
- - neither → ContainerRunner (local Docker) -
- scripts/common/is populated from the package (pre_scripts, post_scripts, tools) and cleaned up afterwards.
- - Per-model results parsed via
PERFORMANCE_LOG_PATTERNand appended to -perf.csv/perf_entry.csv. Failed runs are still recorded with -STATUS=FAILURE.
+ - madengine run → RunOrchestrator.execute()
+ - If manifest exists: skip build; else trigger _build_phase()
+ - Context(build_only_mode=False): full GPU detection, ROCm path resolution
+ - _load_and_merge_manifest(): runtime context overrides manifest deployment_config
+ - Target inference: "k8s"/"kubernetes" → K8s · "slurm" → SLURM · neither → local
+ - _copy_scripts(): populates scripts/common/{pre_scripts,post_scripts,tools} from madengine package
+ - Dispatch: ContainerRunner (local) or DeploymentFactory.create() (SLURM/K8s)
+ - Results → perf.csv / perf_entry.csv
+ - _cleanup_model_dir_copies(): removes populated scripts/common/ files
SLURM job flow (inside sbatch)
+-
+
+ - sbatch script sets MASTER_ADDR (via scontrol), WORLD_SIZE, NNODES, node-local GPU visibility
+ - Multi-node: generates a task script per node; runs via srun bash $TASK_SCRIPT; each node calls madengine run with local manifest
+ - Single-node: creates local manifest with deployment_config.target="docker", calls madengine run
+ - Each node's madengine run → ContainerRunner → docker run with SLURM env vars injected
+ - Results collected from per-node perf.csv and aggregated
+
additional_context — the configuration spine
---additional-context accepts a JSON or Python-dict string (parsed with
-ast.literal_eval(), not json.loads) or a path to a JSON file.
-It is merged into Context.ctx alongside system-detected values
-(GPU vendor, architecture, OS, ROCm path). Specific keys drive different subsystems.
CLI — discover
+Lists and validates model definitions without building or running.
+madengine discover [OPTIONS]
+
+ --tags TEXT Comma-separated tags/names to filter [required]
+ --verbose / --no-verbose Show full model JSON [default: no-verbose]
+Tag syntax
| Key | Where it goes | What it does |
|---|---|---|
| Pattern | Example | Meaning |
gpu_vendor | Core | AMD or NVIDIA. Defaults to AMD if missing. |
guest_os | Core | UBUNTU or CENTOS; selects package manager for in-container installs. |
MAD_ROCM_PATH | Core | Override host ROCm root (top-level only). |
docker_env_vars | Execution | Env vars injected into the container. docker_env_vars.MAD_ROCM_PATH overrides in-container ROCm root independently of host. |
docker_gpus | Execution | Comma list of GPU indices or all. |
k8s / kubernetes | Deployment | Selects K8s. Merged with preset defaults; supports namespace, gpu_count, storage class fallback chain (data_storage_class → nfs_storage_class → storage_class). |
slurm | Deployment | Selects SLURM. partition, nodes, gpus_per_node, time, exclusive, reservation, nodelist. Setting nodelist also skips automatic node health preflight. |
distributed.launcher | Deployment | torchrun, deepspeed, megatron, torchtitan, primus, vllm, sglang, sglang_disagg, slurm_multi / slurm-multi. |
distributed.nnodes / nproc_per_node | Deployment | Topology hints for launcher templates. |
tools | Execution | List of profilers/tracers to enable, e.g. [{"name":"rocprofv3_compute"}]. |
rocenv_mode | Execution | "lite" (default) or "full" — full collects lshw / dmidecode / dmesg / modinfo, best-effort installs missing tools per guest_os. |
log_error_pattern_scan | Execution | false disables post-run log substring scan (use when pytest/JUnit is authoritative). |
log_error_patterns / log_error_benign_patterns | Execution | Override or extend the failure-substring lists. |
pre_scripts / post_scripts | Execution | Custom scripts to run before/after the model. |
secrets | Deployment (K8s) | Auto-converted to a K8s Secret and mounted as env vars. |
| Simple tag | --tags llama3 | Any model with tag llama3 |
| Multiple tags | --tags llama3,vllm | Any model matching any listed tag |
| All models | --tags all | Every discovered model |
| Scoped (exact dir) | --tags MAD/llama3 | Only from scripts/MAD/ subdirectory |
| Dynamic + args | --tags dummy3:dummy_3:batch=512 | Dynamic model with arg override |
Context parses with ast.literal_eval(). Pass a Python dict
-repr (single quotes are fine in shells if you wrap the whole argument in single quotes and use
-double quotes inside) — strictly JSON also works since JSON ⊂ Python literals.
+Discovery sources (checked in order per directory)
+-
+
+ - Root models.json
+ - scripts/{dir}/models.json (static list)
+ - scripts/{dir}/get_models_json.py: dynamic; must export list_models() → List[CustomModel]
+
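A dynamic discovery source can be sketched as below. This is a minimal sketch, not the project's own file: the CustomModel dataclass here is a stand-in with only a few of the models.json fields (the real class lives in utils/discover_models.py and its constructor is not shown in this wiki), and the batch-size sweep, model names, and paths are purely illustrative.

```python
"""Hypothetical scripts/{dir}/get_models_json.py (illustrative only)."""
from dataclasses import dataclass, field
from typing import List


@dataclass
class CustomModel:
    # Stand-in for madengine's CustomModel; the real field set may differ.
    name: str
    dockerfile: str
    scripts: str
    tags: List[str] = field(default_factory=list)
    args: str = ""


def list_models() -> List[CustomModel]:
    """Return one model definition per batch size (an assumed sweep)."""
    models = []
    for batch in (128, 256, 512):
        models.append(
            CustomModel(
                name=f"dummy_bs{batch}",
                dockerfile="docker/dummy",
                scripts="scripts/dummy/run.sh",
                tags=["dummy", "sweep"],
                args=f"--batch-size {batch}",
            )
        )
    return models
```

Generating definitions in code keeps a parameter sweep in one place instead of repeating near-identical entries in a static models.json.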
CLI — build
+Builds Docker images for discovered models and writes build_manifest.json.
madengine build [OPTIONS]
+
+ --tags TEXT Tags to select models (mutually exclusive with --batch-manifest)
+ --batch-manifest FILE JSON file of multiple tag groups to build in sequence
+ --registry TEXT Push built images to this registry URL
+ --target-archs TEXT Comma-separated GPU arch list (e.g. "gfx90a,gfx942")
+ --use-image [IMAGE|auto] Skip local build; use named image or auto-resolve from model card
+ --build-on-compute Build on SLURM compute node + push (requires --registry)
+ --additional-context TEXT Python dict / JSON string of context overrides
+ --additional-context-file FILE Path to a JSON context file (mutually exclusive with --additional-context)
+ --clean-docker-cache Pass --no-cache to docker build
+ --manifest-output FILE Output path for build_manifest.json [default: build_manifest.json]
+ --summary-output FILE Output path for build summary JSON
+ --live-output / --no-live-output Stream docker build output line by line [default: no-live-output]
+ --verbose / --no-verbose
+
+-
+
+ - --batch-manifest vs --tags
+ - --use-image vs --registry
+ - --use-image vs --build-on-compute
+ - --build-on-compute requires --registry
+ - --additional-context-file vs --additional-context
+
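The flag constraints listed above can be expressed as a small validator. This is a sketch under stated assumptions: check_build_flags is a hypothetical helper, not madengine's actual validation code, and it assumes a bare --use-image arrives as the string "auto".

```python
from typing import List, Optional


def check_build_flags(
    tags: Optional[str] = None,
    batch_manifest: Optional[str] = None,
    registry: Optional[str] = None,
    use_image: Optional[str] = None,
    build_on_compute: bool = False,
    additional_context: Optional[str] = None,
    additional_context_file: Optional[str] = None,
) -> List[str]:
    """Return a list of violated constraints (empty means the flags are valid)."""
    errors = []
    if tags and batch_manifest:
        errors.append("--batch-manifest and --tags are mutually exclusive")
    if use_image is not None and registry:
        errors.append("--use-image and --registry are mutually exclusive")
    if use_image is not None and build_on_compute:
        errors.append("--use-image and --build-on-compute are mutually exclusive")
    if build_on_compute and not registry:
        errors.append("--build-on-compute requires --registry")
    if additional_context and additional_context_file:
        errors.append(
            "--additional-context-file and --additional-context are mutually exclusive"
        )
    return errors
```

Collecting every violation before reporting lets a CLI show all problems in one pass instead of failing on the first.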
--use-image modes
+| Invocation | Behavior |
|---|---|
| --use-image (bare flag) | Resolves to "auto": reads DOCKER_IMAGE_NAME from model card env_vars |
| --use-image registry.io/img:tag | Uses the explicit image name; skips all Docker build steps |
CLI commands
+ +CLI — run
+Runs models from a manifest (build if needed) and writes perf.csv.
madengine run [OPTIONS]
+
+ --tags TEXT Select models (triggers build if no manifest)
+ --manifest-file FILE Use existing manifest; skip build [default: build_manifest.json]
+ --registry TEXT Registry for image pull auth
+ --timeout INT Seconds per model; -1=7200s default, 0=disabled
+ --additional-context TEXT Python dict or JSON string
+ --additional-context-file FILE JSON file (mutually exclusive with --additional-context)
+ --keep-alive Leave container running after model completes
+ --keep-model-dir Do not clean up model directory copy
+ --clean-docker-cache Remove docker image before pull (SLURM mode)
+ --skip-model-run Build/pull only; skip execution
+ --manifest-output FILE
+ --summary-output FILE
+ --live-output / --no-live-output Stream container output [default: no-live-output]
+ --output FILE Redirect container stdout to file
+ --tools-json-file-name FILE Tools config [default: ./scripts/common/tools.json]
+ --generate-sys-env-details / --no-generate-sys-env-details
+ --force-mirror-local Force ContainerRunner even in SLURM/K8s context
+ --disable-skip-gpu-arch Ignore skip_gpu_arch model field
+ --cleanup-perf Remove existing perf.csv before run
+ --verbose / --no-verbose
+
+Timeout resolution
| Command | Source | Purpose | Notable flags |
|---|---|---|---|
| Value | Resolved timeout | ||
discover |
- cli/commands/discover.py | -List/validate models matching tags. | ---tags (scoped: MAD/foo, dynamic: dummy3:dummy_3:batch=512) |
build |
- cli/commands/build.py | -Build Docker images; write build_manifest.json. |
- --registry, --target-archs, --batch-manifest, --clean-docker-cache, --use-image new, --build-on-compute new |
run |
- cli/commands/run.py | -Run models from manifest or trigger a build first. | ---manifest-file, --additional-context[-file], --skip-model-run, --live-output, --keep-alive, --verbose, --timeout |
report |
- cli/commands/report.py | -Convert perf CSVs to HTML/email. | -Sub-apps: to-html --csv-file …, to-email --directory … |
database |
- cli/commands/database.py | -Upload perf CSV to MongoDB. | ---csv-file, --database-name, --collection-name (uses MONGO_HOST/USER/PASSWORD env) |
| -1 (default) | 7200 s (2 hours) |
| 0 | Disabled (no timeout) |
| model card timeout field | Used when CLI is default (-1) |
| Explicit positive int | That many seconds, overrides model card |
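The timeout table above reduces to a short precedence rule. The function below is a minimal sketch of that rule, assuming a hypothetical name (resolve_timeout) and assuming that any negative CLI value is treated like the -1 default; it is not the project's own resolver.

```python
from typing import Optional

DEFAULT_TIMEOUT_S = 7200  # 2 hours, per the table above


def resolve_timeout(cli_timeout: int = -1, model_timeout: Optional[int] = None) -> Optional[int]:
    """Return the effective timeout in seconds, or None when disabled."""
    if cli_timeout == 0:
        return None  # 0 disables the timeout
    if cli_timeout > 0:
        return cli_timeout  # explicit CLI value overrides the model card
    # CLI left at its default (-1): prefer the model card, else the global default.
    # Assumption: other negative values behave like -1.
    if model_timeout is not None and model_timeout > 0:
        return model_timeout
    return DEFAULT_TIMEOUT_S


print(resolve_timeout())            # 7200
print(resolve_timeout(-1, 14400))   # 14400
print(resolve_timeout(600, 14400))  # 600
print(resolve_timeout(0, 14400))    # None
```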
CLI — report & database
+report
+# Convert perf.csv to HTML
+madengine report to-html --csv-file perf.csv
+
+# Generate consolidated email report
+madengine report to-email \
+ --directory ./results \
+ --output run_results.html
+Source: cli/commands/report.py → reporting/csv_to_html.py, reporting/csv_to_email.py
+database
+madengine database \
+ --csv-file perf.csv \
+ --database-name benchmarks \
+ --collection-name runs
+Reads from env: MONGO_HOST, MONGO_PORT, MONGO_USER, MONGO_PASSWORD, MONGO_AUTH_SOURCE, MONGO_TIMEOUT_MS.
Source: cli/commands/database.py → database/mongodb.py
+Exit codes (CI contract)
-From src/madengine/cli/constants.py::ExitCode. Use these in pipelines instead of log scraping.
+Exit codes CI contract
+Defined in src/madengine/cli/constants.py::ExitCode. Use these in CI pipelines instead of log scraping.
| Code | Name | Meaning |
|---|---|---|
| 0 | SUCCESS | All operations succeeded. |
| 1 | FAILURE | General/unhandled failure. |
| 2 | BUILD_FAILURE | One or more image builds failed. |
| 3 | RUN_FAILURE | One or more model runs failed (still written to perf.csv with status FAILURE). |
| 4 | INVALID_ARGS | Argument validation rejected the invocation. |
| 0 | SUCCESS | All operations succeeded. |
| 1 | FAILURE | General / unhandled failure (keyboard interrupt, unexpected exception). |
| 2 | BUILD_FAILURE | One or more Docker image builds failed. |
| 3 | RUN_FAILURE | One or more model runs failed. Results still written to perf.csv with STATUS=FAILURE. |
| 4 | INVALID_ARGS | Argument validation rejected the invocation. |
... 2>&1 | tee madengine.run.log with bash -o pipefail
-so the step's exit code is still madengine's, not tee's.
+In Jenkins, use madengine run … 2>&1 | tee madengine.log with bash -o pipefail so tee doesn't swallow the exit code.
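A pipeline step written in Python can branch on these codes directly. The sketch below mirrors the table in a local IntEnum (the real enum is in src/madengine/cli/constants.py; this copy and the helper names describe and run_and_classify are assumptions for illustration) and maps a child process's return code to a name.

```python
import subprocess
import sys
from enum import IntEnum
from typing import List


class ExitCode(IntEnum):
    # Local mirror of the table above, for illustration only.
    SUCCESS = 0
    FAILURE = 1
    BUILD_FAILURE = 2
    RUN_FAILURE = 3
    INVALID_ARGS = 4


def describe(returncode: int) -> str:
    """Map a process return code to its ExitCode name."""
    try:
        return ExitCode(returncode).name
    except ValueError:
        return f"UNKNOWN({returncode})"


def run_and_classify(cmd: List[str]) -> str:
    """Run a command and classify its exit status."""
    return describe(subprocess.run(cmd).returncode)


# Stand-in child process that exits with 3, as a failed model run would.
print(run_and_classify([sys.executable, "-c", "import sys; sys.exit(3)"]))  # RUN_FAILURE
```

In a real pipeline the command list would be the madengine invocation; because subprocess.run is called without a shell pipe, there is no tee to mask the status.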
additional_context — configuration spine
+--additional-context accepts a Python dict string (parsed with ast.literal_eval, not json.loads); --additional-context-file accepts a path to a JSON file. The resulting dict is deep-merged into Context.ctx alongside system-detected values.
'{"key":"val"}' (valid JSON is also valid Python) or "{'key':'val'}". Do not use True/False as unquoted Python booleans in shell — shell expansion will fail. Use true/false (JSON) or single-quote the whole argument.
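The difference between the two parsers is easy to check in plain Python. The snippet below shows only standard-library behavior: ast.literal_eval accepts a JSON-style string made of strings, numbers, lists, and dicts, but rejects the JSON keywords true, false, and null. Whether madengine normalizes those keywords before parsing is not stated here, so treat the second result as a property of Python's parser, not of the CLI. The helper name literal_eval_ok is hypothetical.

```python
import ast
import json

json_style = '{"gpu_vendor": "AMD", "slurm": {"nodes": 4}}'
python_style = "{'gpu_vendor': 'AMD', 'slurm': {'nodes': 4}}"

# Both spellings produce the same dict under ast.literal_eval.
assert ast.literal_eval(json_style) == ast.literal_eval(python_style)


def literal_eval_ok(text: str) -> bool:
    """Return True if ast.literal_eval can parse the string."""
    try:
        ast.literal_eval(text)
        return True
    except (ValueError, SyntaxError):
        return False


print(literal_eval_ok('{"exclusive": True}'))  # True  (Python boolean)
print(literal_eval_ok('{"exclusive": true}'))  # False (JSON boolean is not a Python literal)
print(json.loads('{"exclusive": true}'))       # {'exclusive': True}
```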
+| Key | Type | Subsystem | Description & example |
|---|---|---|---|
| gpu_vendor | string | Core | Override GPU vendor detection. "AMD" or "NVIDIA". Defaults to "AMD" if not set and auto-detect fails. |
| guest_os | string | Core | Container OS for package manager selection. "UBUNTU" or "CENTOS". Affects rocEnvTool installer selection. |
| MAD_ROCM_PATH | string | Core | Override host ROCm root path (e.g. "/opt/rocm-6.2"). Takes priority over auto-detection and ROCM_PATH env. |
| docker_env_vars | dict | Exec | Env vars injected as --env into docker run. Keys are validated with _ENV_KEY_RE. Special: docker_env_vars.MAD_ROCM_PATH overrides in-container ROCm root independently of host. |
| docker_build_arg | dict | Exec | Extra --build-arg KEY=VAL flags passed to docker build. |
| docker_gpus | string | Exec | Comma-separated GPU indices to expose, or "all". E.g. "0,1,2,3". |
| docker_cpus | string | Exec | CPU affinity string for --cpuset-cpus. E.g. "0-15". |
| docker_mounts | dict | Exec | Extra volume mounts. E.g. {"host_path":"/data","container_path":"/mnt/data"}. |
| docker_image / MAD_CONTAINER_IMAGE | string | Orch | Skip build entirely; use this image for all models. Creates a synthetic manifest. |
| k8s / kubernetes | dict | Deploy | Selects Kubernetes deployment. See K8s config section for sub-keys. |
| slurm | dict | Deploy | Selects SLURM deployment. See SLURM config section for sub-keys. |
| distributed | dict | Deploy | Distributed launcher configuration. launcher, nnodes, nproc_per_node, backend, port. See Per-launcher config. |
| distributed.launcher | string | Deploy | "torchrun", "deepspeed", "megatron", "torchtitan", "primus", "vllm", "sglang", "sglang_disagg", "slurm_multi"/"slurm-multi". |
| distributed.sglang_disagg | dict | Deploy | Fine-tune prefill/decode node split. {"prefill_nodes":1,"decode_nodes":2}. Default ~40% prefill, rest decode. Min 3 nodes total. |
| vllm | dict | Deploy | vLLM-specific config (tensor/pipeline parallelism, model, etc.). |
| primus | dict | Deploy | Primus-specific config. config_path, cli_extra, backend. |
| secrets | dict | Deploy | K8s only. Auto-converted to a K8s Secret and mounted as env vars. E.g. {"HF_TOKEN":"hf_xxx"}. |
| tools | list | Exec | Profiling/tracing tools. Each item: {"name":"rocprofv3_compute"}. Stackable. See Profiling tools. |
| rocenv_mode | string | Exec | "lite" (default) or "full". Full mode runs lshw/dmidecode/dmesg/modinfo, installs missing tools per guest_os. |
| pre_scripts | list | Exec | Scripts to run inside the container before the model script. |
| post_scripts | list | Exec | Scripts to run inside the container after the model script. |
| encapsulate_script | string | Exec | Script prepended to the model run command (wraps the whole execution). |
| log_error_pattern_scan | bool | Exec | Set false to disable post-run log substring error detection. Useful when pytest/JUnit is authoritative. |
| log_error_patterns | list | Exec | Replace the default error patterns list entirely. Each string is matched as substring in log lines. |
| log_error_benign_patterns | list | Exec | Literal substrings that mark a matching log line as benign (not an error). |
| env_vars | dict | Deploy | Top-level env vars merged into deployment config (SLURM script / K8s job manifest). |
| gen_sys_env_details | bool | Exec | Enable/disable rocEnvTool system environment collection. Default: true. |
| debug | bool | Deploy | Enable debug-level logging in deployment templates. |
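The three log_error_* keys interact in a way that a few lines of code make concrete. The sketch below is an assumption-laden illustration, not the ContainerRunner implementation: scan_log is a hypothetical helper, and DEFAULT_ERROR_PATTERNS is a placeholder because the real default list is not shown in this wiki.

```python
from typing import Iterable, List

# Placeholder defaults; madengine's real default pattern list is not shown here.
DEFAULT_ERROR_PATTERNS = ["Traceback (most recent call last)", "RuntimeError"]


def scan_log(lines: Iterable[str], ctx: dict) -> List[str]:
    """Return log lines flagged as errors under the context's scan settings."""
    if ctx.get("log_error_pattern_scan", True) is False:
        return []  # scan disabled entirely
    # log_error_patterns replaces the defaults rather than extending them.
    patterns = ctx.get("log_error_patterns") or DEFAULT_ERROR_PATTERNS
    benign = ctx.get("log_error_benign_patterns", [])
    hits = []
    for line in lines:
        if any(b in line for b in benign):
            continue  # a benign substring clears the whole line
        if any(p in line for p in patterns):
            hits.append(line)
    return hits


log = [
    "step 1 ok",
    "RuntimeError: HIP out of memory",
    "RuntimeError: retrying (expected)",
]
print(scan_log(log, {"log_error_benign_patterns": ["(expected)"]}))
# ['RuntimeError: HIP out of memory']
```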
SLURM sub-keys (slurm dict)
+| Key | Default (from preset) | Description |
|---|---|---|
| partition | "amd-rccl" | SLURM partition name. |
| nodes | 1 | Number of nodes to allocate. |
| gpus_per_node | 8 | GPUs per node. |
| time | "24:00:00" | Wall time limit (HH:MM:SS). |
| exclusive | true | Request exclusive node access. |
| nodelist | (none) | Pin to specific nodes. Also skips node health preflight check. |
| exclude | (none) | Nodes to exclude. |
| constraint | (none) | Node feature constraints. |
| reservation | (none) | SLURM reservation name. Forwarded to srun health/cleanup commands. |
| qos | (none) | Quality of service. |
| account | (none) | SLURM account for billing. |
| modules | [] | List of environment modules to load before job. |
| output_dir | CWD | Directory for SLURM log/output files. |
| network_interface | (none) | Network interface for NCCL/RCCL (e.g. "ib0"). |
| shared_workspace | (none) | Shared filesystem path accessible from all nodes. |
Kubernetes sub-keys (k8s dict)
+| Key | Default | Description |
|---|---|---|
| namespace | "default" | Kubernetes namespace. |
| gpu_count | (none) | Number of GPUs per pod. |
| gpu_resource_name | "amd.com/gpu" | K8s GPU resource type. Auto-set by GPU-vendor preset. |
| image_pull_policy | "Always" | K8s imagePullPolicy. |
| kubeconfig | "~/.kube/config" | Path to kubeconfig. |
| data_storage_class | "nfs-banff" | Storage class for data PVC. Falls back to nfs_storage_class then storage_class. |
| storage_class | "nfs-banff" | Generic storage class fallback. |
| memory | "64Gi" | Container memory request. |
| memory_limit | "128Gi" | Container memory limit. |
| cpu | "16" | CPU request. |
| cpu_limit | "32" | CPU limit. |
| host_ipc | false | Enable hostIPC (needed for multi-node NCCL). |
| backoff_limit | 3 | K8s Job backoffLimit (retries). |
| ttl_seconds_after_finished | null | Auto-delete job after N seconds. |
| recreate_shared_data_pvc | false | Re-create data PVC even if it already exists. |
| secrets.strategy | "from_local_credentials" | How to load K8s image pull secrets. |
| secrets.image_pull_secret_names | [] | Existing K8s secret names to use as image pull secrets. |
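The storage class fallback chain (data_storage_class, then nfs_storage_class, then storage_class) can be sketched as a first-match lookup. This is an illustrative helper with a hypothetical name, applied to the user-supplied k8s dict before preset defaults; the "nfs-banff" default is taken from the table above, and how the real code interleaves presets with this chain is an assumption.

```python
def resolve_data_storage_class(k8s: dict, default: str = "nfs-banff") -> str:
    """Pick the storage class for the data PVC using the documented fallback order."""
    for key in ("data_storage_class", "nfs_storage_class", "storage_class"):
        value = k8s.get(key)
        if value:  # skip missing keys and empty strings
            return value
    return default


print(resolve_data_storage_class({"storage_class": "fast-ssd"}))  # fast-ssd
print(resolve_data_storage_class({"nfs_storage_class": "nfs-a", "storage_class": "fast-ssd"}))  # nfs-a
print(resolve_data_storage_class({}))  # nfs-banff
```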
Model definition — models.json
+Each model definition lives in a models.json file (or is returned by get_models_json.py::list_models()). Fields map to the CustomModel dataclass in utils/discover_models.py.
{
+ "name": "llama3-8b-train", // Unique model identifier
+ "dockerfile": "docker/Dockerfile.ubuntu.amd",
+ "dockercontext": ".", // Build context dir (relative to scripts dir)
+ "scripts": "scripts/llama3/train.sh",
+ "url": "https://github.com/org/repo",
+ "cred": "hf_token", // Credential key from credential.json
+ "owner": "ml-team",
+ "data": "llama3-dataset", // Data identifier for DataProvider
+ "n_gpus": "8", // "-1" = all available; "0" = CPU-only
+ "timeout": 14400, // Seconds; overridden by --timeout CLI flag
+ "training_precision": "bf16",
+ "tags": ["llama3", "training", "amd"],
+ "args": "--batch-size 4 --seq-len 4096",
+ "multiple_results": "results.csv", // CSV file with multiple perf rows
+ "skip_gpu_arch": "gfx908,gfx1100", // Comma-list of archs to skip this model on
+ "additional_docker_run_options": "--shm-size 64g",
+ "distributed": {
+ "launcher": "torchrun",
+ "nnodes": 2,
+ "nproc_per_node": 8
+ },
+ "env_vars": {
+ "HF_TOKEN": "auto", // Injected into container env
+ "DOCKER_IMAGE_NAME": "reg/img" // Used by slurm_multi parallel pull
+ }
+}
+
+Key field notes
+| Field | Notes |
|---|---|
| n_gpus | "-1" = use all GPUs on the host (MAD_SYSTEM_NGPUS). Positive int = that many GPUs. Used for perf CSV metadata. |
| timeout | Used when CLI --timeout=-1 (default). Explicit CLI value always wins. |
| skip_gpu_arch | Comma-separated GPU arch names (e.g. "gfx908,A100"). Model is skipped if detected arch matches. Disable with --disable-skip-gpu-arch. |
| multiple_results | Path to CSV file (relative to model dir) with per-result rows that are appended to perf.csv individually. |
| DOCKER_IMAGE_NAME in env_vars | Required for slurm_multi: specifies the registry image for parallel srun docker pull on compute nodes. Also set automatically by DockerBuilder after a successful push. |
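The skip_gpu_arch check can be sketched as a set-membership test. This is a minimal sketch with a hypothetical function name; it assumes an exact, whitespace-tolerant match against the detected architecture, whereas the real matcher may be looser (for example substring-based).

```python
def should_skip(skip_gpu_arch: str, detected_arch: str, disable_skip: bool = False) -> bool:
    """Return True if the model should be skipped on the detected GPU arch."""
    if disable_skip:  # corresponds to --disable-skip-gpu-arch
        return False
    skipped = {arch.strip() for arch in skip_gpu_arch.split(",") if arch.strip()}
    return detected_arch in skipped


print(should_skip("gfx908,gfx1100", "gfx908"))                     # True
print(should_skip("gfx908,gfx1100", "gfx942"))                     # False
print(should_skip("gfx908,gfx1100", "gfx908", disable_skip=True))  # False
```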
Build manifest — build_manifest.json
+Written by madengine build, consumed by madengine run. Pass with --manifest-file.
{
+ "built_images": {
+ "ci-llama3_Dockerfile.ubuntu.amd": {
+ "docker_image": "registry.io/ml/ci-llama3:sha256-abc",
+ "docker_sha": "sha256:abc123",
+ "build_duration": 183.4
+ }
+ },
+ "built_models": {
+ "ci-llama3_Dockerfile.ubuntu.amd": {
+ "name": "llama3-8b-train",
+ "dockerfile": "docker/Dockerfile.ubuntu.amd",
+ "docker_image": "ci-llama3_Dockerfile.ubuntu.amd",
+ "docker_sha": "sha256:abc123",
+ "build_duration": 183.4,
+ "scripts": "scripts/llama3/train.sh",
+ "args": "--batch-size 4",
+ "tags": ["llama3","training"],
+ "n_gpus": "8",
+ "timeout": 14400,
+ "skip_gpu_arch": "",
+ "multiple_results": "",
+ "distributed": {"launcher":"torchrun","nnodes":2,"nproc_per_node":8},
+ "env_vars": {"DOCKER_IMAGE_NAME":"registry.io/ml/ci-llama3:sha256-abc"},
+ "built_on_compute": false
+ }
+ },
+ "context": {
+ "gpu_vendor": "AMD",
+ "guest_os": "UBUNTU",
+ "docker_env_vars": {"MAD_GPU_VENDOR":"AMD","MAD_SYSTEM_NGPUS":"8"},
+ "docker_build_arg": {}
+ },
+ "deployment_config": {
+ "target": "slurm",
+ "slurm": {"partition":"gpu","nodes":4,"gpus_per_node":8,"time":"24:00:00"},
+ "distributed": {"launcher":"torchrun","nnodes":4,"nproc_per_node":8},
+ "env_vars": {"NCCL_DEBUG":"WARN"},
+ "debug": false
+ },
+ "summary": {"total":1,"success":1,"failed":0}
+}
+Values in deployment_config are merged into the runtime context at startup. Keys in --additional-context take precedence over deployment_config.
+Deployment target inference
-No explicit deploy field exists. The factory inspects additional_context:
No explicit deploy field needed. RunOrchestrator._infer_deployment_target() inspects the merged context:
| Trigger | Class | Source | |
|---|---|---|---|
| Context condition | Target | Class | Path |
no k8s/slurm key | Local ContainerRunner | execution/container_runner.py | |
"k8s" or "kubernetes" key | KubernetesDeployment | deployment/kubernetes.py | |
"slurm" key | SlurmDeployment | deployment/slurm.py | |
distributed.launcher == "slurm_multi" | slurm_multi path (within Slurm) | deployment/slurm.py + common.py | |
"k8s" or "kubernetes" key present | Kubernetes | KubernetesDeployment | deployment/kubernetes.py |
"slurm" key present | SLURM | SlurmDeployment | deployment/slurm.py |
| Neither | Local Docker | ContainerRunner | execution/container_runner.py |
The mixin deployment/kubernetes_launcher_mixin.py selects the correct Jinja2 template -under src/madengine/deployment/templates/{kubernetes,slurm}/ per launcher.
+Within SLURM deployment, if distributed.launcher == "slurm_multi" (or "slurm-multi"), SlurmDeployment.prepare() takes the slurm_multi path instead of generating the standard Jinja2 template.
Pass --force-mirror-local to madengine run to always use ContainerRunner, even when slurm/k8s keys are present in the context.
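The inference rule described above can be sketched as a small function. This is not the body of RunOrchestrator._infer_deployment_target(); the function name, the return strings, and the choice to check Kubernetes keys before the slurm key are assumptions.

```python
from typing import Any, Mapping


def infer_deployment_target(context: Mapping[str, Any], force_mirror_local: bool = False) -> str:
    """Map the merged context to 'kubernetes', 'slurm', or 'local'."""
    if force_mirror_local:
        return "local"              # --force-mirror-local always selects ContainerRunner
    if "k8s" in context or "kubernetes" in context:
        return "kubernetes"         # KubernetesDeployment
    if "slurm" in context:
        return "slurm"              # SlurmDeployment
    return "local"                  # ContainerRunner
```

Only the presence of the key matters in this sketch; the values under `k8s` or `slurm` are read later by the selected deployment class.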
+SLURM deployment
+Implemented in src/madengine/deployment/slurm.py. Generates an sbatch script from a Jinja2 template at src/madengine/deployment/templates/slurm/job.sh.j2.
+ +Preset merge order
+ConfigLoader.load_slurm_config() applies three layers (last wins):
+ - presets/slurm/defaults.json: base defaults for all SLURM runs
+ - presets/slurm/profiles/single-node.json or multi-node.json: profile selected by the nodes count
+ - User-supplied slurm / distributed / env_vars keys
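A minimal sketch of the "last wins" layering, assuming nested dictionaries are merged key by key and that the multi-node profile is chosen when the node count exceeds one. The helper names are hypothetical and ConfigLoader.load_slurm_config() may differ in detail.

```python
import copy
from typing import Any, Dict


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict where values in `override` win; nested dicts merge recursively."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def layered_slurm_config(defaults: Dict[str, Any],
                         profiles: Dict[str, Dict[str, Any]],
                         user: Dict[str, Any]) -> Dict[str, Any]:
    """Apply defaults, then the profile picked from the node count, then user keys."""
    nodes = user.get("slurm", {}).get("nodes", defaults.get("slurm", {}).get("nodes", 1))
    profile_name = "multi-node" if nodes > 1 else "single-node"  # assumed threshold
    config = deep_merge(defaults, profiles.get(profile_name, {}))
    return deep_merge(config, user)
```

Because the user layer is applied last, a user-supplied `slurm.partition` replaces the preset value while untouched preset keys such as `env_vars` survive.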
presets/slurm/defaults.json — base preset contents
+{
+ "gpu_vendor": "AMD",
+ "guest_os": "UBUNTU",
+ "debug": false,
+ "slurm": {
+ "partition": "amd-rccl",
+ "nodes": 1,
+ "gpus_per_node": 8,
+ "time": "24:00:00",
+ "exclusive": true,
+ "modules": []
+ },
+ "distributed": {
+ "backend": "nccl",
+ "port": 29500
+ },
+ "env_vars": {
+ "OMP_NUM_THREADS": "8",
+ "MIOPEN_FIND_MODE": "1",
+ "MIOPEN_USER_DB_PATH": "/tmp/.miopen"
+ }
+}
+presets/slurm/profiles/multi-node.json — additional env vars for multi-node
+{
+ "slurm": {"nodes": 2, "gpus_per_node": 8, "time": "24:00:00"},
+ "distributed": {"backend": "nccl", "port": 29500},
+ "env_vars": {
+ "NCCL_DEBUG": "WARN",
+ "NCCL_DEBUG_SUBSYS": "INIT",
+ "NCCL_IB_DISABLE": "0",
+ "NCCL_SOCKET_IFNAME": "ib0",
+ "TORCH_NCCL_HIGH_PRIORITY": "1",
+ "GPU_MAX_HW_QUEUES": "8",
+ "TORCH_NCCL_ASYNC_ERROR_HANDLING": "1",
+ "NCCL_TIMEOUT": "1200",
+ "HSA_ENABLE_SDMA": "0",
+ "HSA_FORCE_FINE_GRAIN_PCIE": "1",
+ "RCCL_ENABLE_HIPGRAPH": "0"
+ }
+}
+What the SLURM job script does
+ - Sets MASTER_ADDR via scontrol show hostnames, MASTER_PORT, WORLD_SIZE, NNODES
+ - Sets per-node HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES / CUDA_VISIBLE_DEVICES (vLLM/SGLang: only HIP_VISIBLE_DEVICES)
+ - Sets MIOPEN_USER_DB_PATH per-process: /tmp/.miopen/node_${SLURM_PROCID}_rank_${LOCAL_RANK:-0}
+ - Sets TORCH_ELASTIC_RDZV_TIMEOUT=3600 for PyTorch elastic
+ - Sets MAD_DEPLOYMENT_TYPE=slurm, MAD_SLURM_JOB_ID, MAD_NODE_RANK, MAD_IN_SLURM_JOB=1
+ - Multi-node: generates per-node task script; runs via srun bash $TASK_SCRIPT
+ - Single-node: creates synthetic manifest with deployment_config.target="docker" and calls madengine run
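The environment the job script prepares can be approximated in Python as below. The real logic lives in the job.sh.j2 shell template; the function name, its parameters, treating the first hostname as the head node, and listing GPU indices 0..N-1 are assumptions made for illustration.

```python
from typing import Dict, List


def slurm_job_env(node_list: List[str], node_rank: int, gpus_per_node: int,
                  job_id: int, proc_id: int = 0, local_rank: int = 0,
                  master_port: int = 29500) -> Dict[str, str]:
    """Build the per-node environment described in the list above."""
    nnodes = len(node_list)
    return {
        "MASTER_ADDR": node_list[0],                # assumed: first hostname is the head node
        "MASTER_PORT": str(master_port),
        "NNODES": str(nnodes),
        "WORLD_SIZE": str(nnodes * gpus_per_node),  # nodes x GPUs per node
        "HIP_VISIBLE_DEVICES": ",".join(str(i) for i in range(gpus_per_node)),
        "MIOPEN_USER_DB_PATH": f"/tmp/.miopen/node_{proc_id}_rank_{local_rank}",
        "TORCH_ELASTIC_RDZV_TIMEOUT": "3600",
        "MAD_DEPLOYMENT_TYPE": "slurm",
        "MAD_SLURM_JOB_ID": str(job_id),
        "MAD_NODE_RANK": str(node_rank),
        "MAD_IN_SLURM_JOB": "1",
    }
```

The per-process MIOPEN_USER_DB_PATH keeps each rank's kernel cache in its own directory, so ranks do not contend for one database.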
Node health preflight
+SlurmNodeSelector runs a health-check srun before the main job; the check is skipped when slurm.nodelist is set. A configured slurm.reservation is forwarded to the srun commands.
Monitoring
+Polls squeue every 30 seconds. Terminal states: COMPLETED, FAILED, CANCELLED. Because CANCELLED is terminal, a job removed with scancel does not leave the monitor looping forever.
Existing allocation (salloc): if SLURM_JOB_ID is set and the launcher is slurm_multi, madengine runs the wrapper script directly with bash instead of nesting a new sbatch. Other launchers still submit via sbatch even inside salloc.
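The monitoring loop described under Monitoring can be sketched like this. The state lookup is passed in as a callable so the sketch stays independent of how squeue is actually invoked; `wait_for_job` and its parameters are hypothetical names.

```python
import time
from typing import Callable

TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED"}


def wait_for_job(get_state: Callable[[], str],
                 poll_interval: float = 30.0,
                 sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll until the job reaches a terminal state and return that state."""
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state        # includes CANCELLED, so scancel ends the loop
        sleep(poll_interval)    # 30 s between polls, as described above
```

Injecting `sleep` makes the loop easy to exercise in a unit test without waiting for real time to pass.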
+Kubernetes deployment
+Implemented in src/madengine/deployment/kubernetes.py and 6 focused mixin modules (refactored in v2.0.3). Requires pip install -e ".[kubernetes]".
Mixin modules
+| Module | Concern |
|---|---|
| k8s_pvc.py | PVC lifecycle. Storage-class fallback: data_storage_class → nfs_storage_class → storage_class. Default: "nfs-banff". |
| k8s_results.py | Log/artifact collection, perf aggregation. Uses shared collector_pod_name() helper — truncated collector-{id[:15]} to stay within K8s name limits. |
| k8s_scripts.py | Script extraction, ConfigMap building. Carries rocenv_mode and guest_os into the ConfigMap. |
| k8s_template_context.py | Assembles Jinja2 template context dict passed to job.yaml.j2. |
| kubernetes_launcher_mixin.py | Selects the right Jinja2 template per launcher type. |
| k8s_secrets.py | Converts additional_context.secrets dict to K8s Secret objects mounted as env vars. |
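Two of the behaviors in the table, the storage-class fallback and the truncated collector pod name, can be sketched as follows. The source names a collector_pod_name() helper, but its signature here is a guess, and resolve_data_storage_class is a hypothetical name.

```python
from typing import Any, Mapping


def collector_pod_name(deployment_id: str) -> str:
    """Truncate the id to 15 characters so the pod name stays within K8s limits."""
    return f"collector-{deployment_id[:15]}"


def resolve_data_storage_class(k8s_cfg: Mapping[str, Any], default: str = "nfs-banff") -> str:
    """Walk data_storage_class -> nfs_storage_class -> storage_class, then the default."""
    for key in ("data_storage_class", "nfs_storage_class", "storage_class"):
        value = k8s_cfg.get(key)
        if value:
            return value
    return default
```

Using one shared naming helper for both creation and cleanup is what keeps the two sides agreeing on the truncated name.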
Preset merge order
+ConfigLoader.load_k8s_config() applies five layers (last wins):
+ - presets/k8s/defaults.json: base defaults
+ - presets/k8s/gpu-vendors/amd.json or nvidia.json: GPU resource name
+ - presets/k8s/gpu-vendors/amd-multi-gpu.json: AMD multi-GPU NCCL env vars (only if AMD + multi-GPU)
+ - presets/k8s/profiles/single-gpu.json, multi-gpu.json, or multi-node.json
+ - User config
presets/k8s/defaults.json — base preset contents
+{
+ "k8s": {
+ "kubeconfig": "~/.kube/config",
+ "namespace": "default",
+ "image_pull_policy": "Always",
+ "backoff_limit": 3,
+ "ttl_seconds_after_finished": null,
+ "nfs_storage_class": "nfs-banff",
+ "storage_class": "nfs-banff",
+ "data_storage_class": "nfs-banff",
+ "recreate_shared_data_pvc": false,
+ "secrets": {
+ "strategy": "from_local_credentials",
+ "image_pull_secret_names": [],
+ "runtime_secret_name": null
+ }
+ },
+ "env_vars": {"OMP_NUM_THREADS": "8"}
+}
+presets/k8s/gpu-vendors/amd-multi-gpu.json — AMD multi-GPU NCCL env vars
+{
+ "env_vars": {
+ "NCCL_DEBUG": "WARN",
+ "NCCL_IB_DISABLE": "0",
+ "NCCL_SOCKET_IFNAME": "ib0",
+ "TORCH_NCCL_HIGH_PRIORITY": "1",
+ "GPU_MAX_HW_QUEUES": "8",
+ "HSA_ENABLE_SDMA": "0",
+ "MIOPEN_FIND_MODE": "1",
+ "MIOPEN_USER_DB_PATH": "/tmp/.miopen",
+ "HSA_FORCE_FINE_GRAIN_PCIE": "1",
+ "RCCL_ENABLE_HIPGRAPH": "0"
+ }
+}
+A job can show FAILED in the results table even when the pod succeeded. This occurs when the kubelet returns 502 between job completion and log collection. PVC artifacts are still collected. Check kubectl describe pod <pod>.
+Secrets management
+# Pass secrets via additional_context
+madengine run --tags llm-serve \
+ --additional-context '{
+ "k8s": {"namespace":"ml","gpu_count":8},
+ "secrets": {"HF_TOKEN":"hf_xxx","WANDB_API_KEY":"yyy","S3_KEY":"zzz"}
+ }'
+Secrets in additional_context.secrets are auto-converted to a K8s Secret object and mounted as environment variables in the job pod. They are never written to perf.csv or build logs.
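As a rough picture of the conversion, the sketch below builds a Kubernetes Secret manifest and the matching container env entries from a secrets dict. The manifest shape follows the standard v1 Secret and secretKeyRef formats; the function names and the idea that madengine builds plain dicts this way are assumptions.

```python
import base64
from typing import Any, Dict, Iterable, List


def build_secret_manifest(name: str, namespace: str, secrets: Dict[str, str]) -> Dict[str, Any]:
    """Return a v1 Secret manifest; values under `data` must be base64-encoded."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        "data": {
            key: base64.b64encode(value.encode("utf-8")).decode("ascii")
            for key, value in secrets.items()
        },
    }


def secret_env_entries(secret_name: str, keys: Iterable[str]) -> List[Dict[str, Any]]:
    """Return container `env` entries that read each key from the Secret."""
    return [
        {"name": key, "valueFrom": {"secretKeyRef": {"name": secret_name, "key": key}}}
        for key in keys
    ]
```

Referencing the Secret by key keeps the raw values out of the job spec itself, which fits the statement that they never reach perf.csv or build logs.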
slurm_multi launcher branch focus
+slurm_multi launcher merged in v2.1.0
What it is
-A minimal-but-additive SLURM launcher for workloads that orchestrate their own per-node
-Docker containers via srun — for example SGLang Disaggregated (proxy +
-prefill + decode topologies) or anything that needs to call srun / scontrol from
-inside the job script.
Generates a wrapper SBATCH that runs the model's .slurm script
-directly on baremetal (not inside a container), so the workload can spawn its own
-per-node containers without the outer job step holding a container open.
An escape-hatch SLURM launcher for workloads that orchestrate their own per-node Docker containers via srun — for example SGLang Disaggregated (proxy + prefill + decode) or any topology that needs to call srun/scontrol from inside the job step.
Generates a wrapper SBATCH that runs the model's own .slurm (or .sh) script directly on the head node on baremetal — no outer container — so the workload can spawn its own per-node containers without nesting.
How to pick it
+How to select it
{
- "slurm": {"partition":"gpu","nodes":3,"gpus_per_node":8,"time":"02:00:00"},
- "distributed": {"launcher": "slurm_multi"}
- // aliases: "slurm-multi"
+ "slurm": {
+ "partition": "gpu",
+ "nodes": 3,
+ "gpus_per_node": 8,
+ "time": "02:00:00"
+ },
+ "distributed": {
+ "launcher": "slurm_multi"
+ }
}
-Honors model-card + context slurm fields:
-partition, nodes, gpus_per_node, time,
-exclusive, reservation, nodelist.
Alias "slurm-multi" (hyphen) is also accepted and normalized automatically.
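The alias handling might look like the following sketch. The mapping table and function name are hypothetical, and lowercasing the input is an assumption rather than documented behavior.

```python
# Hypothetical alias table; only the hyphenated slurm_multi alias is documented here.
_LAUNCHER_ALIASES = {"slurm-multi": "slurm_multi"}


def normalize_launcher(name: str) -> str:
    """Return the canonical launcher name for a user-supplied value."""
    key = name.strip().lower()
    return _LAUNCHER_ALIASES.get(key, key)
```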
Build modes added with this launcher
+Build modes
| Mode | Flag | Behaviour |
|---|---|---|
| Mode | Flag | Behavior |
| Local build (default) | — | Normal madengine build. |
| Use prebuilt image | --use-image [IMAGE | auto] | Skip local build. auto resolves to the model card's env_vars.DOCKER_IMAGE_NAME. Mutually exclusive with the two below. |
| Build on compute | --build-on-compute (requires --registry) | Build on a SLURM compute node, push to registry; manifest sets built_on_compute: true. run then does parallel srun docker pull on all allocated nodes. |
| Implicit auto-use-image | none | If build finds a slurm_multi model and none of --registry / --use-image / --build-on-compute is set, it either auto-resolves the model card's DOCKER_IMAGE_NAME or raises a structured ConfigurationError listing the four supported options. |
| Use prebuilt image | --use-image registry.io/img:tag | Skip local build. Uses explicit image. |
| Auto-resolve from model card | --use-image (bare) | Reads env_vars.DOCKER_IMAGE_NAME from model card. |
| Build on compute | --build-on-compute --registry reg.io/ml | Builds on SLURM compute node, pushes to registry. Manifest sets built_on_compute: true. Run phase pulls in parallel on all nodes. |
| Implicit fallback | no flags | If model card has DOCKER_IMAGE_NAME, auto-uses it. Otherwise raises ConfigurationError listing options. |
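The decision table above can be read as the following sketch. It is not madengine's build orchestrator: the function name, the returned tuples, and the local ConfigurationError stand-in are all assumptions, and `use_image=None` is used to mean "flag not given" while an empty string or "auto" means "resolve from the model card".

```python
from typing import Any, Mapping, Optional, Tuple


class ConfigurationError(Exception):
    """Local stand-in for madengine's ConfigurationError."""


def resolve_build_mode(model_card: Mapping[str, Any],
                       use_image: Optional[str] = None,
                       build_on_compute: bool = False,
                       registry: Optional[str] = None) -> Tuple[str, Optional[str]]:
    """Return (mode, image) for one model according to the table above."""
    card_image = model_card.get("env_vars", {}).get("DOCKER_IMAGE_NAME")
    launcher = model_card.get("distributed", {}).get("launcher", "")
    is_slurm_multi = launcher.replace("-", "_") == "slurm_multi"

    if use_image is not None and build_on_compute:
        raise ConfigurationError("--use-image and --build-on-compute are mutually exclusive")
    if build_on_compute:
        if not registry:
            raise ConfigurationError("--build-on-compute requires --registry")
        return ("build_on_compute", None)
    if use_image is not None:
        if use_image in ("", "auto"):          # bare flag: resolve from the model card
            if not card_image:
                raise ConfigurationError("model card has no env_vars.DOCKER_IMAGE_NAME")
            return ("use_image", card_image)
        return ("use_image", use_image)        # explicit image
    if is_slurm_multi and not registry:        # implicit fallback for slurm_multi models
        if card_image:
            return ("use_image", card_image)
        raise ConfigurationError(
            "slurm_multi needs --registry, --use-image, --build-on-compute, "
            "or env_vars.DOCKER_IMAGE_NAME in the model card"
        )
    return ("local_build", None)
```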
Execution paths
- - sbatch (default): wrapper SBATCH submitted to SLURM.
- - bash-in-salloc: when SLURM_JOB_ID is already set (inside an existing salloc), the slurm_multi launcher runs the wrapper synchronously with bash instead of nesting sbatch. Other launchers keep using sbatch even inside salloc. Uses DeploymentResult.skip_monitoring=True to skip the monitor poll.
+ - sbatch (default): wrapper SBATCH submitted to SLURM. Head node calls srun docker pull on all nodes in parallel, then runs the model's script.
+ - bash-in-salloc: if SLURM_JOB_ID env var is set (inside existing salloc), the launcher runs the wrapper synchronously with bash. Sets DeploymentResult.skip_monitoring=True so the monitor poll is skipped.
Results aggregation
-_collect_slurm_multi_results reads the per-job CSV at
-/shared_inference/$USER/$JOBID/perf.csv and now also writes those rows
-into cwd/perf.csv (copy if absent, append data rows if present), so the default
-reporter (display_performance_table) finds them without extra args. Local + classic-SLURM
-flows are unchanged.
Tests & examples
- - tests/unit/test_slurm_multi.py: registry membership, hyphen alias normalization, env_vars-export contract against MAD-private PR #186's pyt_sglang_disagg_qwen3-32b_short model card.
- - examples/slurm-configs/minimal/slurm-multi-minimal.json: reference config.
_collect_slurm_multi_results() reads per-job CSV from /shared_inference/$USER/$JOBID/perf.csv and writes those rows into cwd/perf.csv (copy if absent, append data rows if present). This ensures display_performance_table and madengine report to-html find results without extra arguments.
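The "copy if absent, append data rows if present" rule can be sketched as below. This is not the body of _collect_slurm_multi_results(); the function name is hypothetical, and the sketch assumes both files share the same column order and that the existing file ends with a newline.

```python
import csv
import shutil
from pathlib import Path


def aggregate_perf_csv(job_csv: Path, cwd_csv: Path) -> int:
    """Merge the per-job CSV into the working-directory CSV; return rows added."""
    if not job_csv.exists():
        return 0
    with job_csv.open(newline="") as src:
        rows = list(csv.reader(src))
    data_rows = rows[1:]                       # everything after the header
    if not cwd_csv.exists():
        shutil.copyfile(job_csv, cwd_csv)      # first result: copy header and rows
        return len(data_rows)
    with cwd_csv.open("a", newline="") as dst:
        # Append only the data rows so the header is not duplicated.
        csv.writer(dst, lineterminator="\n").writerows(data_rows)
    return len(data_rows)
```

Appending without the header is what lets a reporter read cwd/perf.csv as a single table after several jobs have written to it.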
Recent commits on this branch (most recent first)
-2e8f1a4 Merge remote-tracking branch 'upstream/develop' into add_slurm_multi_launcher
-68d0bf3 fix(slurm_multi): address Copilot review on PR #124
-dc3bc48 docs(slurm_multi): CHANGELOG entry + forward-compat TODO on --use-image
-e84506a fix(slurm_multi): aggregate per-job perf.csv into cwd for dashboard reporter
-e281e7e fix(deployment): add skip_monitoring to DeploymentResult for slurm_multi bash branch
-f7af062 test(slurm_multi): contract tests + minimal example config
-8a5e174 feat(cli): expose --use-image and --build-on-compute on madengine build
-bd371fe feat(orchestration): build_on_compute, registry gate, parallel pull for slurm_multi
-941d56d feat(deployment): add slurm_multi launcher (minimal additive)
-Local self-managed execution
+When slurm_multi is detected in a non-SLURM context (e.g. local Docker mode), ContainerRunner._run_self_managed() runs the model's script directly on the host. Env vars from model card and additional_context are injected; keys are logged without values to avoid leaking credentials.
Kubernetes deployment
-Decomposed (v2.0.3) into focused mixins composed by KubernetesDeployment:
| Module | Concern |
|---|---|
| k8s_pvc.py | PVC lifecycle (data PVC, single-node results PVC). |
| k8s_results.py | Log/artifact collection, performance aggregation. Uses the shared collector_pod_name() helper so cleanup matches the truncated collector-{deployment_id[:15]} name. |
| k8s_scripts.py | Script extraction, ConfigMap building. |
| k8s_template_context.py | Jinja2 template context assembly. |
| kubernetes_launcher_mixin.py | Per-launcher template selection. |
| k8s_secrets.py | secrets dict → K8s Secret objects → env vars. |
| k8s_pvc.py | Storage-class fallback: data_storage_class → nfs_storage_class → storage_class; single_node_results_storage_class → local_path_storage_class → storage_class. Default bundled preset: storage_class: "nfs-banff". |
FAILED in the results table
-even though the pod actually succeeded — this happens when the kubelet returns 502 between
-job completion and log collection, so madengine cannot parse perf metrics. PVC artifacts are still collected.
-Check kubectl describe pod <pod>.
+
+Docker --build-context tools= v2.1.0
+What it does
+Every docker build issued by DockerBuilder now passes --build-context tools=scripts/common/tools when that directory exists. Dockerfiles can pull shared helper scripts from the named context:
# In any model Dockerfile
+COPY --from=tools rocm_smi/*.py /opt/mad/tools/rocm_smi/
+COPY --from=tools gpu_info/*.py /opt/mad/tools/
+Eliminates duplication of shared APIs across model Dockerfiles.
+Conditional emission (PR #134)
+The flag is only added when scripts/common/tools/ exists at build time. Builds in MAD projects without a tools directory do not receive the flag and will not fail.
Implementation: single guarded block in execution/docker_builder.py.
+SLURM fix in same PR: switched from shlex.quote() to double-quote escaping in slurm.py env-var generation so spaces and paths in values survive correctly in the sbatch script.
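Both changes from this PR can be illustrated with a short sketch. The `--build-context name=path` flag is a real docker build option under BuildKit; the function names, the argument layout, and the exact set of characters escaped inside double quotes are assumptions and may not match docker_builder.py or slurm.py.

```python
from pathlib import Path
from typing import List


def docker_build_command(image_tag: str, dockerfile: str, context_dir: str = ".",
                         tools_dir: str = "scripts/common/tools") -> List[str]:
    """Build the docker argv, adding the named context only when the directory exists."""
    cmd = ["docker", "build", "-t", image_tag, "-f", dockerfile]
    if Path(tools_dir).is_dir():   # guarded: projects without the directory are unaffected
        cmd += ["--build-context", f"tools={tools_dir}"]
    cmd.append(context_dir)
    return cmd


def export_line(name: str, value: str) -> str:
    """Render an sbatch `export` line using double-quote escaping (assumed character set)."""
    escaped = (value.replace("\\", "\\\\")
                    .replace('"', '\\"')
                    .replace("$", "\\$")
                    .replace("`", "\\`"))
    return f'export {name}="{escaped}"'
```

Escaping `$` here prevents expansion inside the generated script; if the real implementation allows variable expansion in values, that line would differ.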
Launcher matrix
| Launcher | Local | K8s | SLURM | Type | Notes |
|---|---|---|---|---|---|
| torchrun | ✅ | ✅ | ✅ | Train | DDP / FSDP, elastic. |
| DeepSpeed | ✅ | ✅ | ✅ | Train | ZeRO, pipeline parallelism. |
| Megatron-LM | ✅ | ✅ | ✅ | Train | TP + PP, large transformers. |
| TorchTitan | ✅ | ✅ | ✅ | Train | FSDP2 + TP + PP + CP, Llama 3.1 8B–405B. |
| Primus | ✅ | ✅ | ✅ | Train | Megatron / TorchTitan / MaxText via Primus YAML. |
| vLLM | ✅ | ✅ | ✅ | Infer | v1 engine, PagedAttention. |
| SGLang | ✅ | ✅ | ✅ | Infer | RadixAttention, structured gen. |
| SGLang Disagg | ❌ | ✅ | ✅ | Infer | Disagg prefill/decode, Mooncake, 3+ nodes. |
slurm_multi branch | ❌ | ❌ | ✅ | Meta | Self-managed multi-node SLURM wrapper for workloads with their own per-node container orchestration. |
torchrun | ✅ | ✅ | ✅ | Train | DDP / FSDP, elastic rendezvous. |
megatron / megatron-lm | ✅ | ✅ | ✅ | Train | TP + PP parallelism; sets TP/PP/CP size env vars. |
torchtitan | ✅ | ✅ | ✅ | Train | FSDP2 + TP + PP + CP; Llama 3.1 8B–405B. |
deepspeed | ✅ | ✅ | ✅ | Train | ZeRO, pipeline parallelism; dynamic hostfile from SLURM. |
vllm | ✅ | ✅ | ✅ | Infer | PagedAttention; each node self-managing (no torchrun wrapper). |
sglang | ✅ | ✅ | ✅ | Infer | RadixAttention, structured gen; each node self-managing. |
sglang_disagg | ❌ | ✅ | ✅ | Infer | Disaggregated prefill/decode; min 3 nodes (1 proxy + ≥1P + ≥1D). |
primus | ✅ | ✅ | ✅ | Train | Megatron / TorchTitan / MaxText via Primus YAML config. |
slurm_multi | ✅ (self-mgd) | ❌ | ✅ | Meta | Bypasses template; model's own SLURM script on head node. |
Per-launcher configuration
+Standard PyTorch distributed launcher. Generates: torchrun --nnodes=N --nproc_per_node=N --node_rank=R --master_addr=ADDR --master_port=PORT
{
+ "slurm": {"partition":"gpu","nodes":4,"gpus_per_node":8,"time":"24:00:00"},
+ "distributed": {
+ "launcher": "torchrun",
+ "nnodes": 4,
+ "nproc_per_node": 8,
+ "backend": "nccl",
+ "port": 29500
+ },
+ "env_vars": {
+ "NCCL_DEBUG": "WARN",
+ "HSA_ENABLE_SDMA": "0",
+ "TORCH_NCCL_ASYNC_ERROR_HANDLING": "1"
+ }
+}
+Local: MAD_MULTI_NODE_RUNNER is set to torchrun --standalone --nproc_per_node=N (single-node only).
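The two command shapes mentioned for torchrun can be sketched as a single helper. The flags are standard torchrun options; the function name and the `local` switch are hypothetical, and the real generator may add further arguments.

```python
def torchrun_command(nnodes: int, nproc_per_node: int, node_rank: int = 0,
                     master_addr: str = "localhost", master_port: int = 29500,
                     local: bool = False) -> str:
    """Return the launcher prefix for a local or multi-node run."""
    if local:
        # Local Docker runs are single-node only.
        return f"torchrun --standalone --nproc_per_node={nproc_per_node}"
    return (
        f"torchrun --nnodes={nnodes} --nproc_per_node={nproc_per_node} "
        f"--node_rank={node_rank} --master_addr={master_addr} --master_port={master_port}"
    )
```

For the configuration above, node 1 of 4 would receive `--nnodes=4 --nproc_per_node=8 --node_rank=1` plus the head node's address and port 29500.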
Uses torchrun under the hood; sets TENSOR_MODEL_PARALLEL_SIZE, PIPELINE_MODEL_PARALLEL_SIZE, CONTEXT_PARALLEL_SIZE env vars for the Megatron script to read.
{
+ "slurm": {"partition":"gpu","nodes":8,"gpus_per_node":8,"time":"48:00:00"},
+ "distributed": {
+ "launcher": "megatron",
+ "nnodes": 8,
+ "nproc_per_node": 8
+ },
+ "env_vars": {
+ "TENSOR_MODEL_PARALLEL_SIZE": "4",
+ "PIPELINE_MODEL_PARALLEL_SIZE": "2",
+ "CONTEXT_PARALLEL_SIZE": "1",
+ "NCCL_IB_DISABLE": "0"
+ }
+}
+FSDP2 + TP + PP + CP. Sets TORCHTITAN_TENSOR_PARALLEL_SIZE, TORCHTITAN_PIPELINE_PARALLEL_SIZE, TORCHTITAN_FSDP_ENABLED, TORCHTITAN_CONTEXT_PARALLEL_SIZE.
{
+ "slurm": {"partition":"gpu","nodes":4,"gpus_per_node":8,"time":"24:00:00"},
+ "distributed": {
+ "launcher": "torchtitan",
+ "nnodes": 4,
+ "nproc_per_node": 8
+ },
+ "env_vars": {
+ "TORCHTITAN_TENSOR_PARALLEL_SIZE": "2",
+ "TORCHTITAN_FSDP_ENABLED": "true"
+ }
+}
+DeepSpeed with dynamic SLURM hostfile generation. Generates: deepspeed --hostfile=/tmp/hostfile …
{
+ "slurm": {
+ "partition": "gpu",
+ "nodes": 8,
+ "gpus_per_node": 8,
+ "time": "48:00:00",
+ "reservation": "ml-priority"
+ },
+ "distributed": {
+ "launcher": "deepspeed",
+ "nnodes": 8,
+ "nproc_per_node": 8,
+ "backend": "nccl"
+ },
+ "env_vars": {
+ "NCCL_DEBUG": "WARN",
+ "HSA_ENABLE_SDMA": "0"
+ }
+}
+Each node runs independently (no torchrun). Sets VLLM_TENSOR_PARALLEL_SIZE, VLLM_PIPELINE_PARALLEL_SIZE, VLLM_DISTRIBUTED_BACKEND. Only HIP_VISIBLE_DEVICES is set (not ROCR_VISIBLE_DEVICES/CUDA_VISIBLE_DEVICES) to avoid conflict with Ray.
{
+ "slurm": {"partition":"gpu","nodes":2,"gpus_per_node":8,"time":"12:00:00"},
+ "distributed": {
+ "launcher": "vllm",
+ "nnodes": 2,
+ "nproc_per_node": 8
+ },
+ "env_vars": {
+ "VLLM_TENSOR_PARALLEL_SIZE": "8",
+ "VLLM_PIPELINE_PARALLEL_SIZE": "2"
+ }
+}
+RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES is automatically overridden to "" when HIP_VISIBLE_DEVICES is set, preventing the rocm/vllm image from ignoring GPU visibility.
+SGLang standard (RadixAttention, structured gen). Each node self-managing. Sets SGLANG_TENSOR_PARALLEL_SIZE, SGLANG_PIPELINE_PARALLEL_SIZE.
{
+ "slurm": {"partition":"gpu","nodes":2,"gpus_per_node":8,"time":"06:00:00"},
+ "distributed": {
+ "launcher": "sglang",
+ "nnodes": 2,
+ "nproc_per_node": 8
+ },
+ "env_vars": {
+ "SGLANG_TENSOR_PARALLEL_SIZE": "8"
+ }
+}
+Disaggregated prefill + decode topology. Minimum 3 nodes: 1 proxy + ≥1 prefill + ≥1 decode. Node split: default ~40% prefill, rest decode.
+{
+ "slurm": {
+ "partition": "gpu",
+ "nodes": 5,
+ "gpus_per_node": 8,
+ "time": "04:00:00"
+ },
+ "distributed": {
+ "launcher": "sglang_disagg",
+ "nnodes": 5,
+ "nproc_per_node": 8,
+ "sglang_disagg": {
+ "prefill_nodes": 2,
+ "decode_nodes": 2
+ }
+ },
+ "env_vars": {
+ "SGLANG_TP_SIZE": "8"
+ }
+}
+Sets: SGLANG_DISAGG_MODE, SGLANG_DISAGG_PREFILL_NODES, SGLANG_DISAGG_DECODE_NODES, SGLANG_DISAGG_TOTAL_NODES, SGLANG_NODE_IPS, SGLANG_NODE_RANK.
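One way to read the node split is sketched below. It assumes the roughly 40% default applies to the nodes left after reserving one proxy, rounded down with a minimum of one prefill node; the source does not state the rounding rule, and the function name is hypothetical.

```python
from typing import Dict, Optional


def split_disagg_nodes(total_nodes: int,
                       prefill_nodes: Optional[int] = None,
                       decode_nodes: Optional[int] = None) -> Dict[str, int]:
    """Split nodes into 1 proxy plus prefill and decode groups."""
    if total_nodes < 3:
        raise ValueError("sglang_disagg needs at least 3 nodes (1 proxy + 1 prefill + 1 decode)")
    workers = total_nodes - 1                          # one node is the proxy
    if prefill_nodes is None and decode_nodes is None:
        prefill = max(1, workers * 2 // 5)             # assumed: ~40% of workers, rounded down
        decode = workers - prefill
    else:
        prefill = prefill_nodes if prefill_nodes is not None else workers - decode_nodes
        decode = decode_nodes if decode_nodes is not None else workers - prefill
    if prefill < 1 or decode < 1 or prefill + decode != workers:
        raise ValueError("prefill and decode nodes must each be >= 1 and sum to total - 1")
    return {"proxy": 1, "prefill": prefill, "decode": decode}
```

The 5-node example above (2 prefill, 2 decode) passes the check because 1 + 2 + 2 equals the node count.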
Config recipes
+Complete working configurations for common scenarios.
+ +Local — single GPU, AMD
+madengine run --tags llama3 \
+ --additional-context '{
+ "gpu_vendor": "AMD",
+ "guest_os": "UBUNTU",
+ "docker_gpus": "0"
+ }'
+Local — all 8 GPUs, with Megatron env vars
+madengine run --tags megatron-llama3 \
+ --additional-context '{
+ "gpu_vendor": "AMD",
+ "guest_os": "UBUNTU",
+ "docker_env_vars": {
+ "TENSOR_MODEL_PARALLEL_SIZE": "4",
+ "PIPELINE_MODEL_PARALLEL_SIZE": "2"
+ }
+ }'
+SLURM — single node torchrun
+cat > slurm-single.json <<'EOF'
+{
+ "slurm": {
+ "partition": "amd-gpu",
+ "nodes": 1,
+ "gpus_per_node": 8,
+ "time": "12:00:00",
+ "exclusive": true
+ },
+ "distributed": {
+ "launcher": "torchrun",
+ "nnodes": 1,
+ "nproc_per_node": 8
+ }
+}
+EOF
+madengine build --tags llama3 --registry registry.example.com/ml
+madengine run --manifest-file build_manifest.json \
+ --additional-context-file slurm-single.json
+SLURM — 4-node DeepSpeed with reservation
+cat > slurm-multi.json <<'EOF'
+{
+ "slurm": {
+ "partition": "amd-gpu",
+ "nodes": 4,
+ "gpus_per_node": 8,
+ "time": "24:00:00",
+ "exclusive": true,
+ "reservation": "ml-training-q1",
+ "network_interface": "ib0"
+ },
+ "distributed": {
+ "launcher": "deepspeed",
+ "nnodes": 4,
+ "nproc_per_node": 8,
+ "backend": "nccl"
+ },
+ "env_vars": {
+ "NCCL_IB_DISABLE": "0",
+ "NCCL_SOCKET_IFNAME": "ib0",
+ "NCCL_DEBUG": "WARN",
+ "HSA_ENABLE_SDMA": "0"
+ }
+}
+EOF
+madengine run --manifest-file build_manifest.json \
+ --additional-context-file slurm-multi.json
+K8s — single pod, 4 AMD GPUs
+madengine run --tags llama3-infer \
+ --additional-context '{
+ "k8s": {
+ "namespace": "ml-team",
+ "gpu_count": 4
+ }
+ }'
+K8s — multi-node vLLM with HF secret
+madengine run --tags vllm-llama3-70b \
+ --additional-context '{
+ "k8s": {
+ "namespace": "ml-team",
+ "gpu_count": 8,
+ "host_ipc": true,
+ "data_storage_class": "nfs-banff"
+ },
+ "distributed": {
+ "launcher": "vllm",
+ "nnodes": 2,
+ "nproc_per_node": 8
+ },
+ "secrets": {"HF_TOKEN": "hf_xxxxxxx"},
+ "env_vars": {
+ "VLLM_TENSOR_PARALLEL_SIZE": "8",
+ "VLLM_PIPELINE_PARALLEL_SIZE": "2"
+ }
+ }'
+SLURM — SGLang Disagg (3 nodes: 1 proxy + 1P + 1D)
+madengine build --tags pyt_sglang_disagg --use-image registry.io/sglang:v0.4
+
+madengine run --manifest-file build_manifest.json \
+ --additional-context '{
+ "slurm": {
+ "partition": "amd-gpu",
+ "nodes": 3,
+ "gpus_per_node": 8,
+ "time": "04:00:00"
+ },
+ "distributed": {
+ "launcher": "slurm_multi"
+ }
+ }'
+Local run with ROCm compute profiling
+madengine run --tags llama3 \
+ --additional-context '{
+ "gpu_vendor": "AMD",
+ "tools": [
+ {"name": "rocprofv3_compute"}
+ ],
+ "rocenv_mode": "full"
+ }'
+Stack multiple profilers:
+ "tools": [
+ {"name": "rocprofv3_compute"},
+ {"name": "rccl_trace"},
+ {"name": "gpu_info_power_profiler"}
+ ]
+Profiling & tracing
-Enable via --additional-context '{"tools":[{"name":"…"}]}'. Stackable.
Profiling & tracing tools
+Enable via --additional-context '{"tools":[{"name":"…"}]}'. Tools are stackable — list multiple objects. Implemented in scripts/common/tools/ and execution/container_runner.py::apply_tools().
Do not combine rocm_trace_lite with rocprof / rocprofv3_* in the same run; they conflict at the kernel-tracing level.
+| Tool | Purpose | Output | |
|---|---|---|---|
| Tool name | Purpose | Output location | Notes |
rocprof | Legacy GPU kernel profiling | Kernel timings/occupancy | |
rocprofv3_compute | Compute-bound (ROCm ≥ 7.0) | ALU, wave execution | |
rocprofv3_memory | Memory-bound | Cache hits, bandwidth | |
rocprofv3_communication | Multi-GPU | RCCL traces | |
rocprofv3_full | Comprehensive | All metrics, high overhead | |
rocprofv3_lightweight | Minimal overhead | HIP + kernel traces | |
rocprofv3_perfetto | Perfetto UI traces | Perfetto JSON | |
rocprofv3_api_overhead | API call timing | API timings | |
rocprofv3_pc_sampling | Kernel hotspots | PC sample histograms | |
rocm_trace_lite | RTL lite dispatch trace | rocm_trace_lite_output/trace.db | |
rocm_trace_lite_default | RTL default mode | Same paths, broader coverage | |
rocblas_trace / miopen_trace / tensile_trace / rccl_trace | Library call tracing | Per-library log | |
gpu_info_power_profiler / gpu_info_vram_profiler | Power / VRAM over time | CSV time series | |
therock_check | TheRock ROCm validation | Detection report | |
rocprof | Legacy GPU kernel profiling | Kernel timings / occupancy CSVs | Use rocprofv3_* on ROCm ≥ 7.0 |
rocprofv3_compute | Compute-bound kernels | ALU, wave execution metrics | ROCm ≥ 7.0 |
rocprofv3_memory | Memory-bound workloads | Cache hits, bandwidth | |
rocprofv3_communication | Multi-GPU communication | RCCL traces | |
rocprofv3_full | Comprehensive (all metrics) | All counters | High overhead — short runs only |
rocprofv3_lightweight | Minimal overhead tracing | HIP API + kernel traces | |
rocprofv3_perfetto | Perfetto UI traces | Perfetto JSON for ui.perfetto.dev | |
rocprofv3_api_overhead | API call timing | Per-API timing report | |
rocprofv3_pc_sampling | Kernel hotspot identification | PC sample histograms | |
rocm_trace_lite | RTL lite dispatch trace | rocm_trace_lite_output/trace.db | Pinned GitHub release wheel by default |
rocm_trace_lite_default | RTL default mode | Same paths, broader coverage | v2.0.3+ |
rocblas_trace | rocBLAS call tracing | Per-library log | |
miopen_trace | MIOpen call tracing | Per-library log | |
tensile_trace | Tensile call tracing | Per-library log | |
rccl_trace | RCCL communication tracing | Per-library log | |
gpu_info_power_profiler | Power consumption over time | CSV time series | |
gpu_info_vram_profiler | VRAM usage over time | CSV time series | |
therock_check | TheRock ROCm stack validation | Detection report | Identifies apt vs TheRock install |
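Since the tools list is plain JSON, a pre-flight check for the rocm_trace_lite / rocprof conflict is easy to sketch. Nothing in the source says madengine performs this check; the function name is hypothetical, and treating rocm_trace_lite_default as conflicting too is an assumption.

```python
from typing import Dict, List, Tuple


def find_tool_conflicts(tools: List[Dict[str, str]]) -> List[Tuple[str, str]]:
    """Return (rtl_tool, rocprof_tool) pairs that should not be stacked together."""
    names = [tool["name"] for tool in tools]
    rtl = [n for n in names if n.startswith("rocm_trace_lite")]
    rocprof = [n for n in names if n == "rocprof" or n.startswith("rocprofv3_")]
    return [(a, b) for a in rtl for b in rocprof]
```

An empty result means the stack is free of this particular conflict; it says nothing about overhead from stacking many profilers.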
rocm_trace_lite wheel control
+| Env var | Effect |
|---|---|
| ROCM_TRACE_LITE_FOLLOW_LATEST=1 | Always pull the latest wheel from GitHub |
| ROCM_TRACE_LITE_WHEEL_URL=https://… | Use a specific wheel URL (air-gapped installs) |
rocEnvTool modes
+| Mode (rocenv_mode) | Collects |
|---|---|
| "lite" (default) | Basic ROCm info, GPU topology, driver version |
| "full" | All of lite + lshw, dmidecode, dmesg, modinfo; best-effort installs missing tools per guest_os |
Do not combine rocm_trace_lite with rocprof /
-rocprofv3_* in the same run. RTL installs from a pinned GitHub release wheel by
-default; set ROCM_TRACE_LITE_FOLLOW_LATEST=1 or
-ROCM_TRACE_LITE_WHEEL_URL=… for latest / air-gapped installs.
-ROCm path resolution
-Implemented in src/madengine/utils/rocm_path_resolver.py.
-Host (build & tools)
+Implemented in src/madengine/utils/rocm_path_resolver.py and src/madengine/core/context.py. Two independent resolution chains run in parallel.
+Host path (build & tools)
- - Top-level MAD_ROCM_PATH in --additional-context
- - Auto-detect: /opt/rocm, /opt/rocm-*, TheRock rocm-sdk + markers, then rocminfo / amd-smi / rocm-smi on PATH
- - ROCM_PATH env var
- - /opt/rocm fallback
+ - MAD_ROCM_PATH in --additional-context
+ - Auto-detect: /opt/rocm, versioned /opt/rocm-*, TheRock (rocm-sdk + markers)
+ - rocminfo / amd-smi / rocm-smi location on PATH
+ - ROCM_PATH environment variable
+ - /opt/rocm fallback (with warning)
Set MAD_AUTO_ROCM_PATH=0 to disable scanning and use only the env var/default.
In-container (AMD Docker runs)
+Set MAD_AUTO_ROCM_PATH=0 to disable scanning and use only env var / default.
In-container path (AMD Docker runs)
- - docker_env_vars.MAD_ROCM_PATH (consumed; not forwarded as-is)
- - ROCM_PATH / ROCM_HOME from image OCI config (docker image inspect)
- - In-image shell probe (docker run --rm)
- - /opt/rocm with a warning
+ - docker_env_vars.MAD_ROCM_PATH in additional_context
+ - ROCM_PATH / ROCM_HOME from image OCI config (docker image inspect)
+ - In-image shell probe (docker run --rm image env)
+ - /opt/rocm fallback with warning
The run-phase environment table prints host vs container installation type
-(apt / therock / unknown), ROCm/CUDA root, and version side-by-side.
The run-phase env table prints host vs container ROCm root, installation type (apt / therock / unknown), and version side-by-side.
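The host-side chain can be sketched as a function that tries each source in order. This is not rocm_path_resolver.py: the function and parameter names are hypothetical, TheRock marker detection is omitted because the markers are not described here, and deriving the root as the parent of a tool's bin directory is an assumption.

```python
import glob
import os
import shutil
import warnings
from pathlib import Path
from typing import Callable, List, Mapping, Optional


def resolve_host_rocm_path(
    context: Mapping[str, str],
    env: Mapping[str, str],
    is_dir: Callable[[str], bool] = os.path.isdir,
    which: Callable[[str], Optional[str]] = shutil.which,
    list_versioned: Callable[[], List[str]] = lambda: sorted(glob.glob("/opt/rocm-*")),
) -> str:
    """Return the ROCm root, trying each source in priority order."""
    explicit = context.get("MAD_ROCM_PATH")
    if explicit:
        return explicit                                # 1. additional-context override
    if env.get("MAD_AUTO_ROCM_PATH", "1") != "0":      # scanning can be disabled
        for candidate in ["/opt/rocm", *list_versioned()]:
            if is_dir(candidate):
                return candidate                       # 2. well-known install locations
        for tool in ("rocminfo", "amd-smi", "rocm-smi"):
            found = which(tool)
            if found:
                return str(Path(found).parent.parent)  # 3. assumed <root>/bin/<tool> layout
    if env.get("ROCM_PATH"):
        return env["ROCM_PATH"]                        # 4. environment variable
    warnings.warn("ROCm root not detected; falling back to /opt/rocm")
    return "/opt/rocm"                                 # 5. last-resort default
```

The filesystem and PATH probes are injectable so the ordering can be tested on a machine without ROCm installed.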
unique_id method; 6.4.1+ uses amd-smi node_id. The gpu_renderDs context key maps GPU index → /dev/dri/renderD number. Guards against None entries on restricted ROCm installs.
+Environment variables
+ + +Read by madengine at runtime
+| Variable | Module | Purpose |
|---|---|---|
MAD_ROCM_PATH | context.py | Override ROCm root on host. Priority 1. |
ROCM_PATH | core/constants.py | Fallback ROCm root. Priority 3. |
MAD_AUTO_ROCM_PATH | rocm_path_resolver | Set 0 to disable auto-scan. |
MODEL_DIR | core/constants.py | Working directory for model scripts. Default: . |
MAD_VERBOSE_CONFIG | core/constants.py | Enable verbose config output. |
MAD_SETUP_MODEL_DIR | core/constants.py | Trigger model directory setup. |
MAD_SECRETS* | context.py | Any env var with this prefix is automatically copied to docker_build_arg AND docker_env_vars. |
MAD_DOCKERHUB_USER | build_orchestrator | Docker Hub username for registry auth. |
MAD_DOCKERHUB_PASSWORD | build_orchestrator | Docker Hub password for registry auth. |
SLURM_JOB_ID | slurm.py | Detect existing SLURM allocation (triggers bash-in-salloc for slurm_multi). |
SLURM_NNODES, SLURM_NPROCS | container_runner | Read in SLURM job to resolve GPU count per node. |
NPROC_PER_NODE, GPUS_PER_NODE | container_runner | Injected by SLURM template; read by ContainerRunner to set up docker run GPU args. |
MONGO_HOST, MONGO_PORT | database/mongodb.py | MongoDB connection. |
MONGO_USER, MONGO_PASSWORD | database/mongodb.py | MongoDB credentials. |
MONGO_AUTH_SOURCE, MONGO_TIMEOUT_MS | database/mongodb.py | MongoDB auth source and timeout. |
NAS_NODES | core/constants.py | NAS node config (JSON string). |
MAD_AWS_S3 | core/constants.py | AWS S3 credentials (JSON: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, …). |
MAD_MINIO | core/constants.py | MinIO credentials (JSON: MINIO_ENDPOINT, AWS_ENDPOINT_URL_S3, …). |
PUBLIC_GITHUB_ROCM_KEY | core/constants.py | GitHub ROCm key (JSON). |
ROCM_TRACE_LITE_FOLLOW_LATEST | tools | Set 1 to always pull latest RTL wheel. |
ROCM_TRACE_LITE_WHEEL_URL | tools | Override RTL wheel URL (air-gapped installs). |
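The MAD_SECRETS* row describes a prefix-based copy, which could look like the sketch below. The function name is hypothetical and the real logic in context.py may handle overrides or filtering differently.

```python
from typing import Any, Dict, Mapping


def inject_mad_secrets(environ: Mapping[str, str], context: Dict[str, Any]) -> Dict[str, Any]:
    """Copy every MAD_SECRETS* variable into docker_build_arg and docker_env_vars."""
    build_args = context.setdefault("docker_build_arg", {})
    env_vars = context.setdefault("docker_env_vars", {})
    for name, value in environ.items():
        if name.startswith("MAD_SECRETS"):
            build_args[name] = value   # visible to `docker build` as a build arg
            env_vars[name] = value     # visible inside the running container
    return context
```

In practice `environ` would be `os.environ`; passing it in explicitly keeps the sketch side-effect free.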
Set by madengine in Docker containers
+| Variable | Set by | Value / source |
|---|---|---|
MAD_GPU_VENDOR | context.py | "AMD" or "NVIDIA" |
MAD_SYSTEM_NGPUS | context.py | Total GPU count on host |
MAD_SYSTEM_GPU_ARCHITECTURE | context.py | GPU arch string (e.g. "gfx90a") |
MAD_SYSTEM_HIP_VERSION | context.py | HIP version string |
MAD_SYSTEM_GPU_PRODUCT_NAME | context.py | GPU product name |
MAD_GUEST_OS | container_runner | "UBUNTU" or "CENTOS" |
MAD_RUNTIME_NGPUS | container_runner | GPU count allocated for this specific run |
MAD_MULTI_NODE_RUNNER | container_runner | Distributed launcher command (e.g. torchrun --standalone --nproc_per_node=8) |
MAD_MODEL_NAME | container_runner | Model name from model definition |
MAD_OUTPUT_CSV | container_runner | Path for multiple_results CSV output |
ROCM_PATH | container_runner | Resolved in-container ROCm root |
JENKINS_BUILD_NUMBER | container_runner | CI build number (from shell env if set) |
RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES | container_runner | Force-set to "" when HIP_VISIBLE_DEVICES is active (AMD+Ray fix) |
Set by SLURM job script (job.sh.j2)
| Variable | Value |
|---|---|
| MAD_DEPLOYMENT_TYPE | "slurm" |
| MAD_SLURM_JOB_ID | SLURM job ID |
| MAD_NODE_RANK | This node's rank (0-indexed) |
| MAD_TOTAL_NODES | Total node count |
| MAD_IN_SLURM_JOB | "1" |
| MAD_LAUNCHER_TYPE | Launcher type string |
| MASTER_ADDR | Head node hostname (via scontrol) |
| MASTER_PORT | Communication port (default 29500) |
| WORLD_SIZE | Total GPU processes (nodes × GPUs/node) |
| NNODES | Node count |
| GPUS_PER_NODE | GPU count per node |
| NODE_RANK | This node's rank |
| TORCH_ELASTIC_RDZV_TIMEOUT | 3600 |
| MIOPEN_USER_DB_PATH | /tmp/.miopen/node_${SLURM_PROCID}_rank_${LOCAL_RANK:-0} |
| HIP_VISIBLE_DEVICES | GPU indices for this node's processes |
| ROCR_VISIBLE_DEVICES | GPU indices (not set for Ray-based launchers) |
| CUDA_VISIBLE_DEVICES | GPU indices (not set for Ray-based launchers) |
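The relationship between NNODES, GPUS_PER_NODE, NODE_RANK and WORLD_SIZE can be sketched as follows. The SlurmTopology class and topology_from_env function are hypothetical names. The global-rank formula is the common node_rank × gpus_per_node + local_rank convention and is an assumption here, since the table only states how WORLD_SIZE is derived.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class SlurmTopology:
    """Node/GPU layout read from the job environment (hypothetical type)."""

    nnodes: int
    gpus_per_node: int
    node_rank: int

    @property
    def world_size(self) -> int:
        # WORLD_SIZE is documented as nodes x GPUs per node.
        return self.nnodes * self.gpus_per_node

    def global_rank(self, local_rank: int) -> int:
        # Assumed convention: ranks are numbered node by node.
        if not 0 <= local_rank < self.gpus_per_node:
            raise ValueError(f"local_rank {local_rank} is out of range")
        return self.node_rank * self.gpus_per_node + local_rank


def topology_from_env(env: Mapping[str, str]) -> SlurmTopology:
    """Build the topology from NNODES, GPUS_PER_NODE and NODE_RANK."""
    return SlurmTopology(
        nnodes=int(env["NNODES"]),
        gpus_per_node=int(env["GPUS_PER_NODE"]),
        node_rank=int(env["NODE_RANK"]),
    )


# Example: second node (rank 1) of a two-node job with 8 GPUs per node.
topo = topology_from_env({"NNODES": "2", "GPUS_PER_NODE": "8", "NODE_RANK": "1"})
print(topo.world_size, topo.global_rank(3))
```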
Error types
Defined in src/madengine/core/errors.py. All inherit from MADEngineError(Exception), which carries message, category, context (an ErrorContext dataclass), cause, recoverable, and suggestions (a list). Errors are displayed as Rich panels.
| Class | Category | When raised |
|---|---|---|
| ValidationError | VALIDATION | Invalid CLI args, model field values, context key types. |
| NetworkError | CONNECTION | Registry connectivity, pull failures, MongoDB connection. |
| AuthenticationError | AUTHENTICATION | Registry login failure, invalid credentials format. |
| ExecutionError | RUNTIME | Container run failure, script non-zero exit, timeout. (RuntimeError is an alias.) |
| BuildError | BUILD | Docker build failure. |
| DiscoveryError | DISCOVERY | models.json parse failure, tag not found, no models matched. |
| OrchestrationError | ORCHESTRATION | Manifest load failure, incompatible build/run state. |
| RunnerError | RUNNER | ContainerRunner internal failure. |
| ConfigurationError | CONFIGURATION | slurm_multi registry gate violation, conflicting flags, missing required config. |
| DeploymentTimeoutError | TIMEOUT | SLURM/K8s job exceeded wall time. |
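The shape of this hierarchy can be illustrated with a stand-alone sketch. It does not import from madengine. The class names mirror the table, but the constructor signatures, the ErrorContext fields (operation, details), and the enum values are assumptions; only the attribute names message, category, context, cause, recoverable and suggestions come from the description above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List, Optional


class ErrorCategory(Enum):
    # Subset of the categories listed above; the string values are assumed.
    VALIDATION = "validation"
    BUILD = "build"
    RUNTIME = "runtime"


@dataclass
class ErrorContext:
    # Hypothetical fields; the real dataclass may differ.
    operation: str
    details: Dict[str, Any] = field(default_factory=dict)


class MADEngineError(Exception):
    """Base error carrying structured metadata alongside the message."""

    def __init__(
        self,
        message: str,
        category: ErrorCategory,
        context: Optional[ErrorContext] = None,
        cause: Optional[BaseException] = None,
        recoverable: bool = False,
        suggestions: Optional[List[str]] = None,
    ) -> None:
        super().__init__(message)
        self.message = message
        self.category = category
        self.context = context
        self.cause = cause
        self.recoverable = recoverable
        self.suggestions = list(suggestions or [])


class BuildError(MADEngineError):
    """Typed error that fixes its category to BUILD."""

    def __init__(self, message: str, **kwargs: Any) -> None:
        super().__init__(message, ErrorCategory.BUILD, **kwargs)


def summarize(exc: MADEngineError) -> str:
    """One-line summary built from the structured fields."""
    hint = exc.suggestions[0] if exc.suggestions else "no suggestion"
    return f"[{exc.category.value}] {exc.message} ({hint})"


try:
    raise BuildError(
        "docker build failed",
        context=ErrorContext("build", {"model": "example_model"}),
        suggestions=["Check the Dockerfile path"],
    )
except MADEngineError as err:
    summary = summarize(err)
print(summary)
```

Catching the base class lets a caller handle every typed error in one place while still branching on category when needed.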
Module reference
| Layer | Path | Contents |
|---|---|---|
| CLI | cli/app.py | Typer app, cli_main entry, --version, Rich traceback install. |
| CLI | cli/commands/build.py | madengine build: registry, batch, --use-image, --build-on-compute, mutex validation. |
| CLI | cli/commands/run.py | madengine run: manifest loading, all run flags, --force-mirror-local, --cleanup-perf. |
| CLI | cli/commands/discover.py | Model discovery command, scoped tag parsing. |
| CLI | cli/commands/report.py | report to-html / to-email sub-app. |
| CLI | cli/commands/database.py | MongoDB upload command. |
| CLI | cli/constants.py | ExitCode enum, DEFAULT_MANIFEST_FILE, DEFAULT_PERF_OUTPUT, DEFAULT_TIMEOUT=-1. |
| CLI | cli/validators.py | Argument validation: validate_additional_context(), create_args_namespace(). |
| Orch | orchestration/build_orchestrator.py | BuildOrchestrator.execute(): discover → context → build → registry gate → manifest. slurm_multi use-image / build-on-compute paths. |
| Orch | orchestration/run_orchestrator.py | RunOrchestrator.execute(): manifest loading, target inference, script copy/cleanup, local/distributed dispatch. |
| Orch | orchestration/image_filtering.py | Filters manifest entries by GPU vendor, GPU arch, skip_gpu_arch field. |
| Dep | deployment/factory.py | DeploymentFactory.create(). Registers SlurmDeployment + KubernetesDeployment. UserWarning if kubernetes package missing. |
| Dep | deployment/base.py | BaseDeployment (Template Method), DeploymentConfig, DeploymentResult (incl. skip_monitoring), DeploymentStatus, PERFORMANCE_LOG_PATTERN. |
| Dep | deployment/kubernetes.py | KubernetesDeployment: composes 6 mixins, orchestrates K8s job lifecycle. |
| Dep | deployment/k8s_pvc.py | PVC creation/deletion, storage-class fallback chain. |
| Dep | deployment/k8s_results.py | Log/artifact collection, perf aggregation, collector_pod_name(). |
| Dep | deployment/k8s_scripts.py | Script extraction, ConfigMap building (rocenv_mode, guest_os). |
| Dep | deployment/k8s_template_context.py | Assembles Jinja2 template context for K8s jobs. |
| Dep | deployment/k8s_secrets.py | secrets dict → K8s Secret objects. |
| Dep | deployment/k8s_names.py | Name truncation/sanitization helpers for K8s resource names. |
| Dep | deployment/kubernetes_launcher_mixin.py | Selects Jinja2 template per launcher; sets MAD_MULTI_NODE_RUNNER for K8s pods. |
| Dep | deployment/slurm.py | SlurmDeployment: template prep, sbatch submit, bash-in-salloc, slurm_multi dispatch, monitoring, results collection. |
| Dep | deployment/slurm_node_selector.py | SlurmNodeSelector: health/cleanup srun, reservation parameter, node preflight. |
| Dep | deployment/common.py | Shared helpers: VALID_LAUNCHERS, slurm_multi wrapper assembly, launcher normalization. |
| Dep | deployment/config_loader.py | ConfigLoader: deep-merge, preset loading, target inference. env_vars merged recursively (not replaced). |
| Dep | deployment/primus_backend.py | Primus YAML / backend selection helper. |
| Dep | deployment/presets/slurm/defaults.json | SLURM base preset. |
| Dep | deployment/presets/slurm/profiles/ | single-node.json, multi-node.json. |
| Dep | deployment/presets/k8s/defaults.json | K8s base preset. |
| Dep | deployment/presets/k8s/gpu-vendors/ | amd.json, nvidia.json, amd-multi-gpu.json. |
| Dep | deployment/presets/k8s/profiles/ | single-gpu.json, multi-gpu.json, multi-node.json. |
| Dep | deployment/templates/slurm/job.sh.j2 | Main sbatch template (~822 lines). Sets all SLURM env vars, runs srun task scripts. |
| Dep | deployment/templates/kubernetes/ | K8s YAML templates: configmap.yaml.j2, job.yaml.j2, pvc.yaml.j2, pvc-data.yaml.j2, service.yaml.j2. |
| Exec | execution/container_runner.py | ContainerRunner: local docker run, AMD/NVIDIA run options, env injection, tools, perf parsing, _run_self_managed(), _generate_local_launcher_command(). |
| Exec | execution/container_runner_helpers.py | Log error pattern scan, resolve_run_timeout(), make_run_log_file_path(). |
| Exec | execution/docker_builder.py | DockerBuilder: build args, --build-context tools= (conditional), registry push, DOCKER_IMAGE_NAME injection into manifest. |
| Exec | execution/dockerfile_utils.py | Dockerfile parsing: GPU vendor from filename + FROM line. |
| Core | core/context.py | Context: ast.literal_eval parse, GPU vendor/arch detection, ROCm path resolution, MAD_SECRETS* propagation, renderD mapping. |
| Core | core/additional_context_defaults.py | Default values merged before user context: DEFAULT_GPU_VENDOR="AMD", DEFAULT_GUEST_OS="UBUNTU". |
| Core | core/console.py | Console: Rich-backed shell executor, live output, timeout, secret=True for credential commands. |
| Core | core/docker.py | Docker wrapper: shlex.quote() on every interpolation, auto stop/remove on __del__. |
| Core | core/errors.py | 10-type error hierarchy, ErrorCategory, ErrorContext, ErrorHandler, Rich panel display. |
| Core | core/auth.py | load_credentials(), login_to_registry() using --password-stdin + MAD_REGISTRY_PASSWORD. |
| Core | core/timeout.py | Timeout context manager; guards signal.alarm(None) when seconds is 0/None. |
| Core | core/constants.py | Misc core constants. |
| Core | core/dataprovider.py | Data abstraction: local / NAS / S3 / MinIO. |
| Util | utils/discover_models.py | DiscoverModels: root, dir, dynamic discovery; scoped vs unscoped tags; CustomModel dataclass. |
| Util | utils/gpu_tool_factory.py | Singleton get_gpu_tool_manager(vendor, rocm_path); auto-detects vendor. |
| Util | utils/gpu_tool_manager.py | Abstract GPU tool manager interface. |
| Util | utils/rocm_tool_manager.py | AMD/ROCm implementation. |
| Util | utils/nvidia_tool_manager.py | NVIDIA implementation. |
| Util | utils/gpu_validator.py | GPUVendor enum, ROCmValidator, NVIDIAValidator, GPUValidationResult. |
| Util | utils/gpu_config.py | GPU configuration helpers. |
| Util | utils/rocm_path_resolver.py | Host + in-container ROCm path resolution chains. |
| Util | utils/therock_markers.py | Shared TheRock detection markers (rocm-sdk, layout probes). |
| Util | utils/config_parser.py | ConfigParser: 5-level config file resolution, CSV/JSON/YAML loading, multi-row result matching. |
| Util | utils/path_utils.py | Path helpers. |
| Util | utils/session_tracker.py | Session start/marker tracking. |
| Util | utils/ops.py | Misc operations. |
| Util | utils/log_formatting.py | Log formatting helpers. |
| Util | utils/run_details.py | Run metadata helpers. |
| Rep | reporting/update_perf_csv.py | Writes/appends perf.csv and perf_entry.csv. PERF_CSV_HEADER (28 columns). |
| Rep | reporting/csv_to_html.py | HTML performance report generation. |
| Rep | reporting/csv_to_email.py | Email-friendly consolidated report. |
| Rep | reporting/update_perf_super.py | Superset-shaped perf rollups. |
| DB | database/mongodb.py | MongoDBConfig.from_env(), UploadOptions, UploadResult; upsert + batch upload. |
| Scripts | scripts/common/pre_scripts/rocEnvTool/ | rocenv_tool.py, csv_parser.py, console.py — TheRock-compatible env capture (lite + full modes). |
| Scripts | scripts/common/tools/ | GPU info profilers, amd_smi / rocm_smi utils, rtl_trace wrapper, library tracers (rocblas, miopen, rccl, tensile). |
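The config_loader.py row above notes that presets are deep-merged with user config and that env_vars is merged recursively rather than replaced. A minimal sketch of that merge strategy follows. The deep_merge function and the sample keys partition, nodes and NCCL_DEBUG are hypothetical and are not taken from the actual ConfigLoader.

```python
from copy import deepcopy
from typing import Any, Dict


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict where override is merged into base.

    Nested dicts are merged key by key; any other value in override
    replaces the value in base. Neither input is modified.
    """
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical preset and user config, for illustration only.
preset = {"partition": "gpu", "env_vars": {"NCCL_DEBUG": "WARN", "A": "1"}}
user = {"nodes": 2, "env_vars": {"A": "2"}}

config = deep_merge(preset, user)
print(config)
```

With a shallow dict.update() the user's env_vars would replace the preset's entirely; the recursive merge keeps NCCL_DEBUG from the preset while letting the user override A.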
Test layout
unit/
Fast, isolated, mocked. Key files: test_slurm_multi.py, test_shell_quoting.py, test_error_handling.py, test_k8s.py, test_rocm_path.py, test_validators.py, test_deployment.py, test_container_runner.py.
integration/
Real Docker / GPU / platform calls. Includes test_docker_integration.py, test_container_execution.py, test_gpu_management.py, test_orchestrator_workflows.py, test_profiling_tools_config.py.
e2e/
Full workflows: test_build_workflows.py, test_run_workflows.py, test_profiling_workflows.py, test_data_workflows.py, test_execution_features.py, test_scripting_workflows.py.
| Marker | What it selects |
|---|---|
| unit | Fast unit tests with no external deps |
| integration | Tests requiring Docker / real GPU calls |
| e2e | Full end-to-end workflow tests |
| slow | Long-running tests |
| gpu | Requires GPU hardware |
| amd / nvidia | Vendor-specific tests |
| cpu | CPU-only tests |
| requires_docker | Tests requiring Docker daemon |
| requires_models | Tests requiring model files to be present |
Pytest config lives solely in [tool.pytest.ini_options] in pyproject.toml (minversion=7.0).
Contributing & code style
Style rules
- Formatting: Black (line-length 88), targets py3.8–py3.11
- Imports: isort with profile="black"; first-party = madengine
- Lint: flake8 + mypy (strict equality, warn unused) + bandit (skips B101)
- Docstrings: Google style; type hints required for public functions
- Conventional commits: feat:, fix:, docs:, test:, refactor:, style:, perf:, chore:
- Pre-commit: pip install pre-commit && pre-commit install
Security rules
- Use shlex.quote() on every shell interpolation of user-controlled values (image names, paths, container names, build-args)
- Registry passwords via --password-stdin (not command-line args); env var MAD_REGISTRY_PASSWORD
- Credential JSON must be a dict object, validated at load time (ConfigurationError on wrong type)
- MIOPEN_USER_DB_PATH is filtered from deployment_config to prevent leaking temp paths
- Never log secret values; log keys only
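The first two security rules can be illustrated with a short sketch. The function names docker_tag_command and registry_login are hypothetical and do not reproduce the code in core/docker.py or core/auth.py; the docker tag and docker login --password-stdin invocations are standard Docker CLI usage.

```python
import os
import shlex
import subprocess


def docker_tag_command(source_image: str, target_image: str) -> str:
    """Build a shell command string with every interpolated value quoted."""
    return f"docker tag {shlex.quote(source_image)} {shlex.quote(target_image)}"


def registry_login(registry: str, username: str) -> subprocess.CompletedProcess:
    """Log in with the password fed through stdin instead of argv.

    The password is read from MAD_REGISTRY_PASSWORD so that it never
    appears in the process argument list.
    """
    password = os.environ["MAD_REGISTRY_PASSWORD"]
    return subprocess.run(
        ["docker", "login", registry, "--username", username, "--password-stdin"],
        input=password,
        text=True,
        capture_output=True,
        check=False,
    )


# A hostile image name stays a single shell word after quoting.
print(docker_tag_command("img:latest", "reg/x y; rm -rf /"))
```

Passing an argument list to subprocess.run avoids the shell entirely; shlex.quote is needed only where a command has to be assembled as a single string.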
Changelog
[2.1.0] - 2026-05-28
Added
- slurm_multi self-managed SLURM launcher (PRs #130, #126): alias slurm-multi, parallel docker pull, bash-in-salloc path (used when SLURM_JOB_ID is already set), _run_self_managed() for local mode
- madengine build --use-image [IMAGE|auto]: skip local build
- madengine build --build-on-compute: build on compute node + push
- slurm_multi registry gate with structured ConfigurationError
- DeploymentResult.skip_monitoring for synchronous deploy paths
- SlurmNodeSelector.reservation parameter
- DockerBuilder: --build-context tools= (conditional on dir existence, PR #131 + #134)
- Local MAD_MULTI_NODE_RUNNER via ContainerRunner._generate_local_launcher_command() (PR #126)
- Model card distributed/slurm auto-merged into manifest deployment_config
- DOCKER_IMAGE_NAME injection into manifest env_vars after successful registry push
- perf.csv aggregation into cwd so the default reporter sees per-job rows
- Contract tests + minimal example config
Changed
- SLURM env-var escaping: double-quote instead of shlex.quote to preserve spaces/paths (PR #134)
- Early DiscoverModels result cached and reused for actual build (no duplicate get_models_json.py runs)
- E2E test cleanup defaults include build_manifest.json + perf artefacts
[2.0.3] - 2026-05-26
- rocEnvTool "full" mode (lshw, dmidecode, dmesg, modinfo) with guest_os-native installers
- K8s monolith decomposed into 6 focused mixin modules
- Generic storage_class fallback; default preset is now nfs-banff
- rocm_trace_lite_default tool (RTL default mode)
- Security: shlex.quote() on every shell interpolation in core/docker.py, container_runner.py, docker_builder.py, run_orchestrator.py
- Collector pod name mismatch fix (shared collector_pod_name() helper, truncated collector-{id[:15]})
- RPD pre-script: xxd install + sudo/root branch fixes
- CANCELLED added to terminal-state set so scancel'd jobs don't loop forever
- Context guards against None kfd_renderDs on restricted ROCm
- Local MAD_MULTI_NODE_RUNNER for Docker local (_generate_local_launcher_command())
[2.0.2] / [2.0.1]
- Host ROCm auto-detection via priority chain; in-container ROCm resolved independently
- TheRock (rocm-sdk) layout support
- GPU arch auto-detection injected into Docker build args
- Model discovery: scope-based tag selection replaces strict flag
- Registry password via --password-stdin + env var (no more /proc exposure)
- credential.json type validation: load_credentials() validates JSON object type, raises ConfigurationError
- Shared login_to_registry, centralised credential loading
- Unified PERFORMANCE_LOG_PATTERN across local + deployment paths
- Run-phase host/container env table printed at startup
[2.0.0] - 2026-04-09: Complete rewrite
- Unified madengine CLI; legacy mad-* removed
- 5-layer architecture (CLI / Orchestration / Deployment / Execution / Core)
- Factory + Template Method patterns; DeploymentFactory, BaseDeployment, ConfigLoader
- Multi-target deployment: presets + Jinja2 templates per launcher
- Launcher matrix: torchrun / DeepSpeed / Megatron-LM / TorchTitan / Primus / vLLM / SGLang
- Log error pattern scanning; --skip-model-run; batch build manifest
- Structured errors (10 types) with Rich panels; fixed exit codes
- RuntimeError renamed to ExecutionError (alias preserved)
- SLURM nodelist pinning; K8s Secrets management; data provider abstraction