diff --git a/docs/wiki/index.html b/docs/wiki/index.html
index 9cbc6ed1..a34889dc 100644
--- a/docs/wiki/index.html
+++ b/docs/wiki/index.html
@@ -3,246 +3,373 @@
-madengine — Codebase Wiki (branch: develop)
+madengine — Codebase Wiki v2.1.0
+ + +
+ +

madengine — Codebase Wiki

-

AI/ML model automation and benchmarking platform for local Docker, Kubernetes and SLURM. This wiki reflects branch develop. madengine is a streamlined CLI tool for running and benchmarking AI models on ROCm GPUs, offering a production‑ready workflow for local single-node or remote multi-node execution with integrated performance monitoring.

+

AI/ML model automation & benchmarking platform for local Docker, Kubernetes, and SLURM. A Typer-based CLI that discovers models, builds Docker images, runs them across compute targets, and writes structured performance results.

+

Entry point: src/madengine/cli/app.py::cli_main → console script madengine registered in pyproject.toml.

-branch: develop
+v2.1.0 — 2026-05-28
 Python ≥ 3.8
 5-layer CLI
-Local / K8s / SLURM / slurm_multi
+Local · K8s · SLURM · slurm_multi
 Typer + Rich
 ROCm & CUDA
+Jinja2 templates
- +

Overview

-

What it does

-

madengine is a Typer-based CLI (madengine) that discovers models from a MAD package, builds Docker images, and runs them either locally or on distributed backends (Kubernetes, SLURM). It writes performance results to perf.csv and can generate HTML reports or upload to MongoDB.

-

Entry point: src/madengine/cli/app.py::cli_main (registered as the madengine console script in pyproject.toml).

+

What madengine does

+
+  1. Discover: finds model definitions from models.json or dynamic scripts, resolves tags
+  2. Build: calls docker build for each model, writes build_manifest.json
+  3. Run: reads manifest, infers compute target, dispatches containers, writes perf.csv
+  4. Report: converts perf.csv to HTML or email; uploads to MongoDB
+

All four stages share a single --additional-context configuration spine that controls GPU vendor, deployment type, launcher, profiling tools, and environment variables.

-

Why this branch matters

-

The add_slurm_multi_launcher branch adds a self-managed multi-node SLURM launcher so that workloads with their own per-node Docker orchestration (e.g. SGLang Disaggregated prefill + decode + proxy) can run via a thin wrapper SBATCH that does not nest Docker inside the job step. It adds --use-image / --build-on-compute build modes, a registry gate, parallel image pull, and a bash-in-salloc execution path.

+

What's new in v2.1.0

+
+  • slurm_multi: self-managed multi-node SLURM launcher for workloads with per-node Docker (e.g. SGLang Disagg)
+  • --use-image [auto] / --build-on-compute: new madengine build modes
+  • Docker --build-context tools=: shared tool APIs accessible in every Dockerfile
+  • Local MAD_MULTI_NODE_RUNNER: Megatron / DeepSpeed / TorchTitan now work on local Docker
+  • SLURM env-var escaping: double-quote escaping preserves spaces & paths
- +

Quick start

- + +
+
-
# Install
+
# 1. Install
 pip install -e ".[dev]"
 
-# Discover models
+# 2. Discover available models
 madengine discover --tags dummy
 
-# Run locally (build + run)
+# 3. Build + run (single command)
 madengine run --tags dummy \
-  --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'
+  --additional-context '{"gpu_vendor":"AMD","guest_os":"UBUNTU"}'
+
+# 4. Build only, then run from manifest
+madengine build --tags llama3 --registry registry.example.com/ml
+madengine run --manifest-file build_manifest.json \
+  --additional-context '{"docker_gpus":"0,1,2,3"}'
+

Local mode: no k8s or slurm key in context → ContainerRunner (local Docker).

+
-
# Minimal K8s config — defaults applied automatically
-madengine run --tags model \
-  --additional-context '{"k8s": {"gpu_count": 2}}'
+
# Single-node K8s (minimal — defaults applied from presets/k8s/)
+madengine run --tags llama3 \
+  --additional-context '{"k8s":{"gpu_count":4}}'
 
-# Multi-node vLLM
-madengine run --tags model --additional-context '{
-  "k8s": {"namespace": "ml-team", "gpu_count": 8},
-  "distributed": {"launcher":"vllm","nnodes":2,"nproc_per_node":4}
-}'
+# Multi-node vLLM on K8s
+madengine run --tags vllm-serve \
+  --additional-context '{
+    "k8s": {"namespace":"ml-team","gpu_count":8},
+    "distributed": {"launcher":"vllm","nnodes":2,"nproc_per_node":4}
+  }'
+
+# K8s with NFS data PVC and secrets
+madengine run --tags model \
+  --additional-context '{
+    "k8s": {"namespace":"ml","gpu_count":8,"data_storage_class":"nfs-banff"},
+    "secrets": {"HF_TOKEN":"hf_xxx","WANDB_API_KEY":"yyy"}
+  }'
+

Presence of "k8s" or "kubernetes" key → KubernetesDeployment. Requires pip install -e ".[all]".

+
-
# Build phase (login node or CI) then deploy
-madengine build --tags model --registry gcr.io/myproject
+
# Single-node SLURM (build on login node, deploy via sbatch)
+madengine build --tags llama3 --registry registry.example.com/ml
+madengine run --manifest-file build_manifest.json \
+  --additional-context '{
+    "slurm": {"partition":"gpu","nodes":1,"gpus_per_node":8,"time":"12:00:00"}
+  }'
 
+# Multi-node torchrun
 madengine run --manifest-file build_manifest.json \
   --additional-context '{
-    "slurm":{"partition":"gpu","nodes":4,"gpus_per_node":8,"time":"24:00:00"},
-    "distributed":{"launcher":"torchtitan","nnodes":4,"nproc_per_node":8}
+    "slurm": {"partition":"gpu","nodes":4,"gpus_per_node":8,"time":"24:00:00"},
+    "distributed": {"launcher":"torchrun","nnodes":4,"nproc_per_node":8}
+  }'
+
+# DeepSpeed with reservation
+madengine run --manifest-file build_manifest.json \
+  --additional-context '{
+    "slurm": {"partition":"gpu","nodes":8,"gpus_per_node":8,
+              "time":"48:00:00","reservation":"ml-training"},
+    "distributed": {"launcher":"deepspeed","nnodes":8,"nproc_per_node":8}
   }'
+

Presence of "slurm" key → SlurmDeployment. Generates sbatch wrapper from Jinja2 template.

+
-
# slurm_multi — for workloads that run their own docker via srun
-madengine run --tags pyt_sglang_disagg_qwen3-32b_short \
+
# SGLang Disaggregated (3+ nodes: proxy + prefill + decode)
+madengine run --tags pyt_sglang_disagg_qwen3-32b \
   --additional-context '{
-    "slurm":{"partition":"gpu","nodes":3,"gpus_per_node":8,"time":"02:00:00"},
-    "distributed":{"launcher":"slurm_multi"}
+    "slurm": {"partition":"gpu","nodes":3,"gpus_per_node":8,"time":"02:00:00"},
+    "distributed": {"launcher":"slurm_multi"}
   }'
 
-# Build on a compute node, push, then have run pull in parallel
-madengine build --tags model --build-on-compute --registry myreg.io/team
-# or skip build entirely and use a pre-baked image
-madengine build --tags model --use-image auto
+# Build options for slurm_multi models:
+# Option A: use pre-built registry image (skip local build)
+madengine build --tags pyt_sglang_disagg --use-image registry.io/sglang:latest
+
+# Option B: auto-resolve DOCKER_IMAGE_NAME from model card
+madengine build --tags pyt_sglang_disagg --use-image
+
+# Option C: build on compute node, push, then run pulls in parallel
+madengine build --tags pyt_sglang_disagg \
+  --registry registry.io/ml --build-on-compute
+

slurm_multi bypasses the standard sbatch template: the model's own .slurm script runs directly on the head node so srun/scontrol work inside it.

+
+ +
+
# Store configuration in a JSON file and reference it
+cat > my_run.json <<'EOF'
+{
+  "gpu_vendor": "AMD",
+  "guest_os": "UBUNTU",
+  "slurm": {
+    "partition": "gpu",
+    "nodes": 4,
+    "gpus_per_node": 8,
+    "time": "24:00:00",
+    "exclusive": true
+  },
+  "distributed": {
+    "launcher": "torchrun",
+    "nnodes": 4,
+    "nproc_per_node": 8,
+    "backend": "nccl"
+  },
+  "env_vars": {
+    "NCCL_DEBUG": "WARN",
+    "HSA_ENABLE_SDMA": "0"
+  },
+  "tools": [{"name": "rocprofv3_compute"}]
+}
+EOF
+
+madengine run --tags llama3 --additional-context-file my_run.json
+

--additional-context-file and --additional-context are mutually exclusive. The file is parsed as JSON (not ast.literal_eval).

- +

Install & dev

Setup

-
pip install -e ".[dev]"      # base + dev
-pip install -e ".[all]"      # + kubernetes
+
# Base install (includes dev tools)
+pip install -e ".[dev]"
+
+# With Kubernetes support
+pip install -e ".[all]"
+
+# Enable pre-commit hooks
 pre-commit install
+

Optional extras

+ + + + + + + +
+Extra | Adds
+[dev] | pytest, black, flake8, mypy, isort, pre-commit
+[kubernetes] | kubernetes>=28.0.0, pyyaml
+[all] | dev + kubernetes

Test & quality

-
pytest                            # all tests
+
pytest                           # all tests
+pytest tests/unit/ -v            # unit only
 pytest tests/unit/test_slurm_multi.py -v
 pytest --cov=src/madengine --cov-report=html
-pytest -m "not slow"
-black src/ tests/ && isort src/ tests/
+pytest -m "not slow"             # skip slow tests
+pytest -m "unit and amd"         # combined markers
+
+black src/ tests/
+isort src/ tests/
 flake8 src/ tests/
 mypy src/madengine
 pre-commit run --all-files
@@ -250,127 +377,142 @@

Test & quality

- +

5-layer architecture

-

Each layer talks only to the one below it. Layers are color-coded throughout this wiki.

+

Each layer talks only to the layers below it. Layers are color-coded throughout this wiki.

- CLI - Orchestration - Deployment - Execution - Core - Utils - Reporting + CLI + Orchestration + Deployment + Execution + Core + Utils + Reporting
- - - - - - - - - - - - - - - - - - - - - + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
 Layer | Path | Responsibilities | Key types
-CLI | src/madengine/cli/ | Typer app, command parsing, Rich output, exit-code mapping. | app.py, commands/{build,run,discover,report,database}.py, constants.ExitCode
-Orchestration | src/madengine/orchestration/ | Discover → build → run pipeline. Decides whether to dispatch locally or to a deployment. | BuildOrchestrator, RunOrchestrator, image_filtering.py
-Deployment | src/madengine/deployment/ | Factory + K8s/SLURM concrete deployments, preset merging, Jinja2 templates, monitoring. | DeploymentFactory, BaseDeployment, KubernetesDeployment, SlurmDeployment
-Execution | src/madengine/execution/ | Local Docker build/run, log scanning, timeout resolution, perf parsing. | ContainerRunner, DockerBuilder, container_runner_helpers.py
-Core | src/madengine/core/ | Cross-cutting primitives: context merging, console, docker wrapper, errors, auth, timeout. | Context, Console, Docker, MADEngineError, load_credentials
-Utils | src/madengine/utils/ | Discovery, GPU vendor abstraction, ROCm path resolution, config parsing. | DiscoverModels, gpu_tool_factory, rocm_path_resolver, ConfigParser
-Reporting | src/madengine/reporting/ | perf.csv writers, HTML/email report generation. | update_perf_csv, csv_to_html, csv_to_email
+CLI | src/madengine/cli/ | Typer app, 5 commands, argument validation, Rich output, exit-code mapping. | app.py, commands/{build,run,discover,report,database}.py, constants.ExitCode
+Orchestration | src/madengine/orchestration/ | Discover → build → run pipeline. Decides whether to dispatch locally or to a deployment backend. | BuildOrchestrator, RunOrchestrator, image_filtering.py
+Deployment | src/madengine/deployment/ | Factory + Template Method pattern. K8s/SLURM concrete deployments, preset merging, Jinja2 templates, monitoring. | DeploymentFactory, BaseDeployment, KubernetesDeployment, SlurmDeployment, ConfigLoader
+Execution | src/madengine/execution/ | Local Docker build/run, log scanning, timeout resolution, perf parsing, self-managed launcher bypass. | ContainerRunner, DockerBuilder, container_runner_helpers
+Core | src/madengine/core/ | Cross-cutting primitives: context merging & GPU detection, shell execution, Docker wrapper, error hierarchy, auth, timeout. | Context, Console, Docker, MADEngineError, load_credentials
+Utils | src/madengine/utils/ | Model discovery, GPU vendor abstraction, ROCm path resolution, config parsing. | DiscoverModels, gpu_tool_factory, rocm_path_resolver, ConfigParser
+Reporting | src/madengine/reporting/ | perf.csv writers, HTML/email report generation. Database upload in src/madengine/database/. | update_perf_csv, csv_to_html, csv_to_email, mongodb.py
- +

Architecture diagram

[Architecture diagram (inline SVG). Text labels recovered from the removed (-) and added (+) versions:]
-CLI · Typer + Rich
-  discover · build · run · report · database → ExitCode { SUCCESS=0, BUILD_FAILURE=2, RUN_FAILURE=3, INVALID_ARGS=4 }
-Orchestration
-  BuildOrchestrator: DiscoverModels → DockerBuilder → manifest
-  RunOrchestrator: load manifest → infer target → dispatch
-  image_filtering: arch/tag selection
-Deployment · DeploymentFactory (inferred target)
-  no key → local Docker · "k8s"/"kubernetes" → K8s Jobs · "slurm" → SLURM · distributed.launcher = "slurm_multi" → self-managed
-  Local · ContainerRunner
-  KubernetesDeployment
-  SlurmDeployment
-  slurm_multi (this branch)
-Launchers (training + inference)
-  torchrun · DeepSpeed · Megatron-LM · TorchTitan · Primus · vLLM · SGLang · SGLang Disagg
-Reporting
-  perf.csv · perf_entry.csv · csv_to_html · csv_to_email
-  report to-html · report to-email
-Database
-  MongoDB upload (madengine database …)
+CLI · Typer + Rich
+  discover · build · run · report · database
+  ExitCode: SUCCESS=0 · BUILD_FAILURE=2 · RUN_FAILURE=3 · INVALID_ARGS=4
+Orchestration
+  BuildOrchestrator: DiscoverModels → DockerBuilder → build_manifest.json
+  RunOrchestrator: load manifest → merge context → infer target → dispatch
+  image_filtering: GPU arch / vendor + tag selection
+Deployment · DeploymentFactory (inferred from context keys)
+  no k8s/slurm → local · "k8s"/"kubernetes" → K8s Jobs · "slurm" → SLURM sbatch · distributed.launcher="slurm_multi" → self-managed
+  Local · ContainerRunner: docker run + perf.csv
+  KubernetesDeployment: K8s Jobs, PVCs, Secrets
+  SlurmDeployment: sbatch · Jinja2 template
+  slurm_multi (2.1.0): head-node script + srun pull
+Core
+  Context · Console · Docker · MADEngineError · auth · timeout
+Utils
+  DiscoverModels · gpu_tool_factory · rocm_path_resolver · ConfigParser
+Reporting
+  perf.csv · perf_entry.csv · csv_to_html · csv_to_email
+Database
+  MongoDB upload · MongoDBConfig.from_env()
- +

Key data flows

@@ -378,538 +520,1399 @@

Key data flows

Build flow

 1. madengine build → BuildOrchestrator.execute()
-2. DiscoverModels resolves --tags against the MAD package (root models.json, scripts/{dir}/models.json, or scripts/{dir}/get_models_json.py).
-3. Each model is materialised through Context (system + user additional_context) and passed to DockerBuilder.
-4. Optionally tags & pushes to --registry.
-5. Writes build_manifest.json consumed by run.
+2. Context(build_only_mode=True): GPU vendor / arch detection is skipped unless detect_local_gpu_arch=True
+3. ConfigLoader.load_config() merges the user config with the preset defaults (SLURM or K8s)
+4. DiscoverModels resolves --tags from root models.json, scripts/{dir}/models.json, or scripts/{dir}/get_models_json.py
+5. slurm_multi gate: if the model uses slurm_multi and neither --registry nor --use-image is given, auto-resolves DOCKER_IMAGE_NAME from the model card or raises ConfigurationError
+6. DockerBuilder.build_all_models(): passes --build-context tools=scripts/common/tools if that directory exists
+7. After registry push: sets DOCKER_IMAGE_NAME in manifest env_vars for parallel SLURM pull
+8. Writes build_manifest.json
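The preset step in the build flow (ConfigLoader.load_config()) can be pictured as a recursive dictionary merge. The sketch below is not madengine's implementation: `deep_merge` is a hypothetical helper, the preset values are made up, and it assumes user-supplied keys take precedence over preset defaults, which the source does not spell out.

```python
from copy import deepcopy


def deep_merge(defaults: dict, overrides: dict) -> dict:
    """Return a new dict in which `overrides` wins; nested dicts merge key by key."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical preset and user config, for illustration only.
preset = {"slurm": {"partition": "gpu", "nodes": 1, "gpus_per_node": 8}}
user = {"slurm": {"nodes": 4}, "gpu_vendor": "AMD"}
merged = deep_merge(preset, user)
```

Under these assumptions the user's `nodes` replaces the preset value while `partition` and `gpus_per_node` are inherited, and neither input dict is mutated.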
-

Special build modes on this branch:

-
-  • --use-image [IMAGE|auto]: skip local build, use a prebuilt image (auto resolves env_vars.DOCKER_IMAGE_NAME from the model card). Mutually exclusive with --registry and --build-on-compute.
-  • --build-on-compute: build on a SLURM compute node and push to --registry; manifest carries built_on_compute: true.
-

Run flow

-1. madengine run → RunOrchestrator loads existing manifest or triggers a build.
-2. Target inference (Convention over Configuration):
-   • "k8s"/"kubernetes" in context → KubernetesDeployment
-   • "slurm" in context → SlurmDeployment
-   • distributed.launcher == "slurm_multi" → slurm_multi path
-   • neither → ContainerRunner (local Docker)
-3. scripts/common/ is populated from the package (pre_scripts, post_scripts, tools) and cleaned up afterwards.
-4. Per-model results parsed via PERFORMANCE_LOG_PATTERN and appended to perf.csv/perf_entry.csv. Failed runs are still recorded with STATUS=FAILURE.
+1. madengine run → RunOrchestrator.execute()
+2. If manifest exists: skip build; else trigger _build_phase()
+3. Context(build_only_mode=False): full GPU detection, ROCm path resolution
+4. _load_and_merge_manifest(): runtime context overrides manifest deployment_config
+5. Target inference: "k8s"/"kubernetes" → K8s · "slurm" → SLURM · neither → local
+6. _copy_scripts(): populates scripts/common/{pre_scripts,post_scripts,tools} from madengine package
+7. Dispatch: ContainerRunner (local) or DeploymentFactory.create() (SLURM/K8s)
+8. Results → perf.csv / perf_entry.csv
+9. _cleanup_model_dir_copies(): removes populated scripts/common/ files
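The target-inference step in the run flow amounts to a lookup on top-level context keys. A minimal sketch follows; `infer_target` and its return strings are hypothetical, and two points are assumptions rather than documented behaviour: that a K8s key wins when both K8s and SLURM keys are present, and that slurm_multi is a sub-case of a context containing "slurm".

```python
def infer_target(context: dict) -> str:
    """Map additional_context keys to a target name (illustrative labels only)."""
    if "k8s" in context or "kubernetes" in context:
        return "k8s"
    if "slurm" in context:
        launcher = (context.get("distributed") or {}).get("launcher", "")
        # The wiki lists both "slurm_multi" and "slurm-multi" as accepted spellings.
        if launcher.replace("-", "_") == "slurm_multi":
            return "slurm_multi"
        return "slurm"
    return "local"
```

For example, `infer_target({"slurm": {"nodes": 3}, "distributed": {"launcher": "slurm_multi"}})` yields `"slurm_multi"`, while an empty context falls through to `"local"`.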
+ +
+

SLURM job flow (inside sbatch)

+
+  1. sbatch script sets MASTER_ADDR (via scontrol), WORLD_SIZE, NNODES, node-local GPU visibility
+  2. Multi-node: generates a task script per node; runs via srun bash $TASK_SCRIPT; each node calls madengine run with local manifest
+  3. Single-node: creates local manifest with deployment_config.target="docker", calls madengine run
+  4. Each node's madengine run → ContainerRunner → docker run with SLURM env vars injected
+  5. Results collected from per-node perf.csv and aggregated
+
- -
-

additional_context — the configuration spine

-

--additional-context accepts a JSON or Python-dict string (parsed with ast.literal_eval(), not json.loads) or a path to a JSON file. It is merged into Context.ctx alongside system-detected values (GPU vendor, architecture, OS, ROCm path). Specific keys drive different subsystems.

+ +
+

CLI — discover

+

Lists and validates model definitions without building or running.

+
madengine discover [OPTIONS]
+
+  --tags TEXT              Comma-separated tags/names to filter  [required]
+  --verbose / --no-verbose Show full model JSON  [default: no-verbose]
+

Tag syntax

- + - - - - - - - - - - - - - - - + + + + +
-Key | Where it goes | What it does
+Pattern | Example | Meaning
-gpu_vendor | Core | AMD or NVIDIA. Defaults to AMD if missing.
-guest_os | Core | UBUNTU or CENTOS; selects package manager for in-container installs.
-MAD_ROCM_PATH | Core | Override host ROCm root (top-level only).
-docker_env_vars | Execution | Env vars injected into the container. docker_env_vars.MAD_ROCM_PATH overrides in-container ROCm root independently of host.
-docker_gpus | Execution | Comma list of GPU indices or all.
-k8s / kubernetes | Deployment | Selects K8s. Merged with preset defaults; supports namespace, gpu_count, storage class fallback chain (data_storage_class → nfs_storage_class → storage_class).
-slurm | Deployment | Selects SLURM. partition, nodes, gpus_per_node, time, exclusive, reservation, nodelist. Setting nodelist also skips automatic node health preflight.
-distributed.launcher | Deployment | torchrun, deepspeed, megatron, torchtitan, primus, vllm, sglang, sglang_disagg, slurm_multi / slurm-multi.
-distributed.nnodes / nproc_per_node | Deployment | Topology hints for launcher templates.
-tools | Execution | List of profilers/tracers to enable, e.g. [{"name":"rocprofv3_compute"}].
-rocenv_mode | Execution | "lite" (default) or "full"; full collects lshw / dmidecode / dmesg / modinfo, best-effort installs missing tools per guest_os.
-log_error_pattern_scan | Execution | false disables post-run log substring scan (use when pytest/JUnit is authoritative).
-log_error_patterns / log_error_benign_patterns | Execution | Override or extend the failure-substring lists.
-pre_scripts / post_scripts | Execution | Custom scripts to run before/after the model.
-secrets | Deployment (K8s) | Auto-converted to a K8s Secret and mounted as env vars.
+Simple tag | --tags llama3 | Any model with tag llama3
+Multiple tags | --tags llama3,vllm | Any model matching any listed tag
+All models | --tags all | Every discovered model
+Scoped (exact dir) | --tags MAD/llama3 | Only from scripts/MAD/ subdirectory
+Dynamic + args | --tags dummy3:dummy_3:batch=512 | Dynamic model with arg override
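To make the tag patterns concrete, here is a rough sketch of splitting a --tags value into its parts. `TagSelector` and `parse_tags` are hypothetical names, not madengine's API; the sketch only separates the comma, slash, and colon delimiters shown in the table and leaves the colon-separated parts uninterpreted, since their exact meaning is not given here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TagSelector:
    scope: Optional[str]  # directory prefix such as "MAD"; None when unscoped
    name: str             # tag or model name; "all" is passed through unchanged
    extras: List[str] = field(default_factory=list)  # raw ':'-separated parts


def parse_tags(raw: str) -> List[TagSelector]:
    """Split a comma-separated --tags value into selectors (illustrative only)."""
    selectors = []
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue
        head, *extras = item.split(":")
        scope, sep, name = head.rpartition("/")
        selectors.append(TagSelector(scope if sep else None, name, extras))
    return selectors
```

With this sketch, `parse_tags("llama3,MAD/vllm,dummy3:dummy_3:batch=512")` produces three selectors: an unscoped `llama3`, a `vllm` scoped to `MAD`, and `dummy3` carrying the raw extras `["dummy_3", "batch=512"]`.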
-
-Gotcha: Context parses with ast.literal_eval(). Pass a Python dict -repr (single quotes are fine in shells if you wrap the whole argument in single quotes and use -double quotes inside) — strictly JSON also works since JSON ⊂ Python literals. +

Discovery sources (checked in order per directory)

+
+  1. Root models.json
+  2. scripts/{dir}/models.json (static list)
+  3. scripts/{dir}/get_models_json.py (dynamic; must export list_models() → List[CustomModel])
+
+ + +
+

CLI — build

+

Builds Docker images for discovered models and writes build_manifest.json.

+
madengine build [OPTIONS]
+
+  --tags TEXT                    Tags to select models (mutually exclusive with --batch-manifest)
+  --batch-manifest FILE          JSON file of multiple tag groups to build in sequence
+  --registry TEXT                Push built images to this registry URL
+  --target-archs TEXT            Comma-separated GPU arch list (e.g. "gfx90a,gfx942")
+  --use-image [IMAGE|auto]       Skip local build; use named image or auto-resolve from model card
+  --build-on-compute             Build on SLURM compute node + push (requires --registry)
+  --additional-context TEXT      Python dict / JSON string of context overrides
+  --additional-context-file FILE Path to a JSON context file (mutually exclusive with --additional-context)
+  --clean-docker-cache           Pass --no-cache to docker build
+  --manifest-output FILE         Output path for build_manifest.json  [default: build_manifest.json]
+  --summary-output FILE          Output path for build summary JSON
+  --live-output / --no-live-output   Stream docker build output line by line  [default: no-live-output]
+  --verbose / --no-verbose
+ +
+Mutual exclusions and requirements:
+  • --batch-manifest vs --tags
+  • --use-image vs --registry
+  • --use-image vs --build-on-compute
+  • --build-on-compute requires --registry
+  • --additional-context-file vs --additional-context
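The flag rules listed above can be expressed as a small validation pass. This is a sketch under assumptions: `check_build_args` is a hypothetical function, the error strings are invented, and the real CLI may report these conflicts differently (for instance by exiting with INVALID_ARGS).

```python
from typing import List, Optional


def check_build_args(
    tags: Optional[str] = None,
    batch_manifest: Optional[str] = None,
    registry: Optional[str] = None,
    use_image: Optional[str] = None,
    build_on_compute: bool = False,
    additional_context: Optional[str] = None,
    additional_context_file: Optional[str] = None,
) -> List[str]:
    """Return one message per violated rule; an empty list means the combination is allowed."""
    errors = []
    if tags and batch_manifest:
        errors.append("--batch-manifest cannot be combined with --tags")
    if use_image and registry:
        errors.append("--use-image cannot be combined with --registry")
    if use_image and build_on_compute:
        errors.append("--use-image cannot be combined with --build-on-compute")
    if build_on_compute and not registry:
        errors.append("--build-on-compute requires --registry")
    if additional_context and additional_context_file:
        errors.append("--additional-context-file cannot be combined with --additional-context")
    return errors
```

Collecting every violation before reporting, rather than stopping at the first, lets a user fix a bad invocation in one pass.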
+ +

--use-image modes

+ + + + + + +
+Invocation | Behavior
+--use-image (bare flag) | Resolves to "auto": reads DOCKER_IMAGE_NAME from model card env_vars
+--use-image registry.io/img:tag | Uses the explicit image name; skips all Docker build steps
- -
-

CLI commands

+ +
+

CLI — run

+

Runs models from a manifest (build if needed) and writes perf.csv.

+
madengine run [OPTIONS]
+
+  --tags TEXT                    Select models (triggers build if no manifest)
+  --manifest-file FILE           Use existing manifest; skip build  [default: build_manifest.json]
+  --registry TEXT                Registry for image pull auth
+  --timeout INT                  Seconds per model; -1=7200s default, 0=disabled
+  --additional-context TEXT      Python dict or JSON string
+  --additional-context-file FILE JSON file (mutually exclusive with --additional-context)
+  --keep-alive                   Leave container running after model completes
+  --keep-model-dir               Do not clean up model directory copy
+  --clean-docker-cache           Remove docker image before pull (SLURM mode)
+  --skip-model-run               Build/pull only; skip execution
+  --manifest-output FILE
+  --summary-output FILE
+  --live-output / --no-live-output  Stream container output  [default: no-live-output]
+  --output FILE                  Redirect container stdout to file
+  --tools-json-file-name FILE    Tools config  [default: ./scripts/common/tools.json]
+  --generate-sys-env-details / --no-generate-sys-env-details
+  --force-mirror-local           Force ContainerRunner even in SLURM/K8s context
+  --disable-skip-gpu-arch        Ignore skip_gpu_arch model field
+  --cleanup-perf                 Remove existing perf.csv before run
+  --verbose / --no-verbose
+ +

Timeout resolution

- + - - - - - - - - - - - - - - - - - - - - + + + +
-Command | Source | Purpose | Notable flags
+Value | Resolved timeout
-discover | cli/commands/discover.py | List/validate models matching tags. | --tags (scoped: MAD/foo, dynamic: dummy3:dummy_3:batch=512)
-build | cli/commands/build.py | Build Docker images; write build_manifest.json. | --registry, --target-archs, --batch-manifest, --clean-docker-cache, --use-image (new), --build-on-compute (new)
-run | cli/commands/run.py | Run models from manifest or trigger a build first. | --manifest-file, --additional-context[-file], --skip-model-run, --live-output, --keep-alive, --verbose, --timeout
-report | cli/commands/report.py | Convert perf CSVs to HTML/email. | Sub-apps: to-html --csv-file …, to-email --directory …
-database | cli/commands/database.py | Upload perf CSV to MongoDB. | --csv-file, --database-name, --collection-name (uses MONGO_HOST/USER/PASSWORD env)
+-1 (default) | 7200 s (2 hours)
+0 | Disabled (no timeout)
+model card timeout field | Used when CLI is default (-1)
+Explicit positive int | That many seconds, overrides model card
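Read together, the timeout rows describe a short precedence chain. The sketch below is one plausible reading rather than the project's code: `resolve_timeout` is a hypothetical name, it assumes the model card value applies only when the CLI value is left at -1, and it treats any other negative value the same as -1.

```python
from typing import Optional

DEFAULT_TIMEOUT_S = 7200  # 2 hours, the documented fallback


def resolve_timeout(cli_timeout: int = -1, model_timeout: Optional[int] = None) -> Optional[int]:
    """Return the effective timeout in seconds, or None when timeouts are disabled."""
    if cli_timeout == 0:
        return None                # 0 disables the timeout
    if cli_timeout > 0:
        return cli_timeout         # explicit CLI value overrides the model card
    if model_timeout is not None and model_timeout > 0:
        return model_timeout       # CLI left at default: use the model card
    return DEFAULT_TIMEOUT_S       # nothing specified anywhere
```

So a model card with `"timeout": 14400` runs for up to four hours unless the caller passes `--timeout 600`, in which case the CLI value applies.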
- + +
+

CLI — report & database

+
+
+

report

+
# Convert perf.csv to HTML
+madengine report to-html --csv-file perf.csv
+
+# Generate consolidated email report
+madengine report to-email \
+  --directory ./results \
+  --output run_results.html
+

Source: cli/commands/report.py → reporting/csv_to_html.py, reporting/csv_to_email.py

+
+
+

database

+
madengine database \
+  --csv-file perf.csv \
+  --database-name benchmarks \
+  --collection-name runs
+

Reads from env: MONGO_HOST, MONGO_PORT, MONGO_USER, MONGO_PASSWORD, MONGO_AUTH_SOURCE, MONGO_TIMEOUT_MS.

+

Source: cli/commands/database.py → database/mongodb.py
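As a rough picture of how the six environment variables could be gathered, the sketch below reads them into a small settings object. It is not the project's MongoDBConfig: `MongoSettings` is a hypothetical class, and the fallback values (localhost, 27017, 5000 ms) are assumptions chosen for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class MongoSettings:
    host: str
    port: int
    user: Optional[str]
    password: Optional[str]
    auth_source: Optional[str]
    timeout_ms: int

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "MongoSettings":
        """Build settings from MONGO_* variables; defaults here are illustrative."""
        return cls(
            host=env.get("MONGO_HOST", "localhost"),
            port=int(env.get("MONGO_PORT", "27017")),
            user=env.get("MONGO_USER"),
            password=env.get("MONGO_PASSWORD"),
            auth_source=env.get("MONGO_AUTH_SOURCE"),
            timeout_ms=int(env.get("MONGO_TIMEOUT_MS", "5000")),
        )
```

Accepting the environment as a mapping argument keeps the function easy to exercise in tests without touching the real process environment.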

+
+
+
+ +
-

Exit codes (CI contract)

-

From src/madengine/cli/constants.py::ExitCode. Use these in pipelines instead of log scraping.

+

Exit codes CI contract

+

Defined in src/madengine/cli/constants.py::ExitCode. Use these in CI pipelines instead of log scraping.

- - - - - + + + + +
 Code | Name | Meaning
-0 | SUCCESS | All operations succeeded.
-1 | FAILURE | General/unhandled failure.
-2 | BUILD_FAILURE | One or more image builds failed.
-3 | RUN_FAILURE | One or more model runs failed (still written to perf.csv with status FAILURE).
-4 | INVALID_ARGS | Argument validation rejected the invocation.
+0 | SUCCESS | All operations succeeded.
+1 | FAILURE | General / unhandled failure (keyboard interrupt, unexpected exception).
+2 | BUILD_FAILURE | One or more Docker image builds failed.
+3 | RUN_FAILURE | One or more model runs failed. Results still written to perf.csv with STATUS=FAILURE.
+4 | INVALID_ARGS | Argument validation rejected the invocation.
-In Jenkins use ... 2>&1 | tee madengine.run.log with bash -o pipefail so the step's exit code is still madengine's, not tee's.
+In Jenkins, use madengine run … 2>&1 | tee madengine.log with bash -o pipefail so tee doesn't swallow the exit code.
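For pipelines written in Python, the codes in the table can be mirrored as an enum and used to label a process return code. The values below are copied from the table; the enum is a local mirror for illustration, not an import of madengine's own ExitCode, and `classify` is a hypothetical helper.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    """Local mirror of the documented exit codes."""
    SUCCESS = 0
    FAILURE = 1
    BUILD_FAILURE = 2
    RUN_FAILURE = 3
    INVALID_ARGS = 4


def classify(returncode: int) -> str:
    """Translate a process return code into a readable label."""
    try:
        return ExitCode(returncode).name
    except ValueError:
        # Codes outside the table (e.g. signals reported by the shell) stay visible.
        return f"UNKNOWN({returncode})"
```

A CI step could call `classify(subprocess.run([...]).returncode)` and branch on `"BUILD_FAILURE"` versus `"RUN_FAILURE"` to decide whether a retry is worthwhile.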
- + +
+

additional_context — configuration spine

+

--additional-context accepts a Python dict string (parsed with ast.literal_eval, not json.loads); --additional-context-file accepts a path to a JSON file. In both cases the resulting dict is deep-merged into Context.ctx alongside system-detected values.

+ +
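+Gotcha: Python dict, not JSON. Pass '{"key":"val"}' (JSON made only of strings, numbers, lists, and objects is also a valid Python literal) or "{'key':'val'}". ast.literal_eval does not accept JSON's true/false/null, so write booleans as True/False (and None) on the command line, or move the config into --additional-context-file, which is parsed as JSON. Single-quote the whole argument so the shell passes it through unchanged.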
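Since the inline string is described as going through ast.literal_eval, it may help to see how that function treats booleans compared with json.loads. The sketch assumes the parser is plain ast.literal_eval with no JSON fallback; `parse_inline_context` is a hypothetical wrapper, not madengine's function.

```python
import ast
import json


def parse_inline_context(text: str) -> dict:
    """Parse a Python-literal dict string the way ast.literal_eval would."""
    value = ast.literal_eval(text)
    if not isinstance(value, dict):
        raise ValueError("additional context must be a dict")
    return value


# Python-style booleans are accepted.
ok = parse_inline_context('{"slurm": {"nodes": 4, "exclusive": True}}')

# JSON-style booleans are not Python literals, so literal_eval rejects them.
try:
    parse_inline_context('{"slurm": {"exclusive": true}}')
    json_bool_accepted = True
except ValueError:
    json_bool_accepted = False

# The same text is fine for a JSON parser, as used for the context file.
from_file = json.loads('{"slurm": {"exclusive": true}}')
```

The practical upshot is that a config containing booleans or nulls is easiest to keep in a JSON file, while simple string and number settings work either way.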
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+Key | Type | Subsystem | Description & example
+gpu_vendor | string | Core | Override GPU vendor detection. "AMD" or "NVIDIA". Defaults to "AMD" if not set and auto-detect fails.
+guest_os | string | Core | Container OS for package manager selection. "UBUNTU" or "CENTOS". Affects rocEnvTool installer selection.
+MAD_ROCM_PATH | string | Core | Override host ROCm root path (e.g. "/opt/rocm-6.2"). Takes priority over auto-detection and ROCM_PATH env.
+docker_env_vars | dict | Exec | Env vars injected as --env into docker run. Keys are validated with _ENV_KEY_RE. Special: docker_env_vars.MAD_ROCM_PATH overrides in-container ROCm root independently of host.
+docker_build_arg | dict | Exec | Extra --build-arg KEY=VAL flags passed to docker build.
+docker_gpus | string | Exec | Comma-separated GPU indices to expose, or "all". E.g. "0,1,2,3".
+docker_cpus | string | Exec | CPU affinity string for --cpuset-cpus. E.g. "0-15".
+docker_mounts | dict | Exec | Extra volume mounts. E.g. {"host_path":"/data","container_path":"/mnt/data"}.
+docker_image / MAD_CONTAINER_IMAGE | string | Orch | Skip build entirely; use this image for all models. Creates a synthetic manifest.
+k8s / kubernetes | dict | Deploy | Selects Kubernetes deployment. See K8s config section for sub-keys.
+slurm | dict | Deploy | Selects SLURM deployment. See SLURM config section for sub-keys.
+distributed | dict | Deploy | Distributed launcher configuration. launcher, nnodes, nproc_per_node, backend, port. See Per-launcher config.
+distributed.launcher | string | Deploy | "torchrun", "deepspeed", "megatron", "torchtitan", "primus", "vllm", "sglang", "sglang_disagg", "slurm_multi"/"slurm-multi".
+distributed.sglang_disagg | dict | Deploy | Fine-tune prefill/decode node split. {"prefill_nodes":1,"decode_nodes":2}. Default ~40% prefill, rest decode. Min 3 nodes total.
+vllm | dict | Deploy | vLLM-specific config (tensor/pipeline parallelism, model, etc.).
+primus | dict | Deploy | Primus-specific config. config_path, cli_extra, backend.
+secrets | dict | Deploy | K8s only. Auto-converted to a K8s Secret and mounted as env vars. E.g. {"HF_TOKEN":"hf_xxx"}.
+tools | list | Exec | Profiling/tracing tools. Each item: {"name":"rocprofv3_compute"}. Stackable. See Profiling tools.
+rocenv_mode | string | Exec | "lite" (default) or "full". Full mode runs lshw/dmidecode/dmesg/modinfo, installs missing tools per guest_os.
+pre_scripts | list | Exec | Scripts to run inside the container before the model script.
+post_scripts | list | Exec | Scripts to run inside the container after the model script.
+encapsulate_script | string | Exec | Script prepended to the model run command (wraps the whole execution).
+log_error_pattern_scan | bool | Exec | Set false to disable post-run log substring error detection. Useful when pytest/JUnit is authoritative.
+log_error_patterns | list | Exec | Replace the default error patterns list entirely. Each string is matched as substring in log lines.
+log_error_benign_patterns | list | Exec | Literal substrings that mark a matching log line as benign (not an error).
+env_vars | dict | Deploy | Top-level env vars merged into deployment config (SLURM script / K8s job manifest).
+gen_sys_env_details | bool | Exec | Enable/disable rocEnvTool system environment collection. Default: true.
+debug | bool | Deploy | Enable debug-level logging in deployment templates.
+ +

SLURM sub-keys (slurm dict)

+ + + + + + + + + + + + + + + + + + + +
+Key | Default (from preset) | Description
+partition | "amd-rccl" | SLURM partition name.
+nodes | 1 | Number of nodes to allocate.
+gpus_per_node | 8 | GPUs per node.
+time | "24:00:00" | Wall time limit (HH:MM:SS).
+exclusive | true | Request exclusive node access.
+nodelist | | Pin to specific nodes. Also skips node health preflight check.
+exclude | | Nodes to exclude.
+constraint | | Node feature constraints.
+reservation | | SLURM reservation name. Forwarded to srun health/cleanup commands.
+qos | | Quality of service.
+account | | SLURM account for billing.
+modules | [] | List of environment modules to load before job.
+output_dir | CWD | Directory for SLURM log/output files.
+network_interface | | Network interface for NCCL/RCCL (e.g. "ib0").
+shared_workspace | | Shared filesystem path accessible from all nodes.
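Several of these sub-keys correspond directly to standard sbatch options. As an illustration of that correspondence, the sketch below turns a slurm dict into #SBATCH lines. The real wrapper is generated from a Jinja2 template whose contents are not shown here, so the key-to-flag mapping and the `sbatch_directives` helper are assumptions.

```python
from typing import Any, Dict, List

# Assumed mapping from slurm sub-keys to standard sbatch long options.
_FLAG_FOR_KEY = {
    "partition": "--partition",
    "nodes": "--nodes",
    "gpus_per_node": "--gpus-per-node",
    "time": "--time",
    "nodelist": "--nodelist",
    "exclude": "--exclude",
    "constraint": "--constraint",
    "reservation": "--reservation",
    "qos": "--qos",
    "account": "--account",
}


def sbatch_directives(slurm: Dict[str, Any]) -> List[str]:
    """Render #SBATCH lines for the keys that are set; unset keys are skipped."""
    lines = []
    for key, flag in _FLAG_FOR_KEY.items():
        value = slurm.get(key)
        if value not in (None, ""):
            lines.append(f"#SBATCH {flag}={value}")
    if slurm.get("exclusive"):
        lines.append("#SBATCH --exclusive")  # boolean flag, takes no value
    return lines
```

Keys such as modules, output_dir, and network_interface have no single sbatch flag and are left out of this sketch.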
+ +

Kubernetes sub-keys (k8s dict)

+ + + + + + + + + + + + + + + + + + + + + +
+Key | Default | Description
+namespace | "default" | Kubernetes namespace.
+gpu_count | | Number of GPUs per pod.
+gpu_resource_name | "amd.com/gpu" | K8s GPU resource type. Auto-set by GPU-vendor preset.
+image_pull_policy | "Always" | K8s imagePullPolicy.
+kubeconfig | "~/.kube/config" | Path to kubeconfig.
+data_storage_class | "nfs-banff" | Storage class for data PVC. Falls back to nfs_storage_class then storage_class.
+storage_class | "nfs-banff" | Generic storage class fallback.
+memory | "64Gi" | Container memory request.
+memory_limit | "128Gi" | Container memory limit.
+cpu | "16" | CPU request.
+cpu_limit | "32" | CPU limit.
+host_ipc | false | Enable hostIPC (needed for multi-node NCCL).
+backoff_limit | 3 | K8s Job backoffLimit (retries).
+ttl_seconds_after_finished | null | Auto-delete job after N seconds.
+recreate_shared_data_pvc | false | Re-create data PVC even if it already exists.
+secrets.strategy | "from_local_credentials" | How to load K8s image pull secrets.
+secrets.image_pull_secret_names | [] | Existing K8s secret names to use as image pull secrets.
+
+ + +
+

Model definition — models.json

+

Each model definition lives in a models.json file (or is returned by get_models_json.py::list_models()). Fields map to the CustomModel dataclass in utils/discover_models.py.

+
{
+  "name": "llama3-8b-train",          // Unique model identifier
+  "dockerfile": "docker/Dockerfile.ubuntu.amd",
+  "dockercontext": ".",               // Build context dir (relative to scripts dir)
+  "scripts": "scripts/llama3/train.sh",
+  "url": "https://github.com/org/repo",
+  "cred": "hf_token",                 // Credential key from credential.json
+  "owner": "ml-team",
+  "data": "llama3-dataset",           // Data identifier for DataProvider
+  "n_gpus": "8",                      // "-1" = all available; "0" = CPU-only
+  "timeout": 14400,                   // Seconds; overridden by --timeout CLI flag
+  "training_precision": "bf16",
+  "tags": ["llama3", "training", "amd"],
+  "args": "--batch-size 4 --seq-len 4096",
+  "multiple_results": "results.csv",  // CSV file with multiple perf rows
+  "skip_gpu_arch": "gfx908,gfx1100", // Comma-list of archs to skip this model on
+  "additional_docker_run_options": "--shm-size 64g",
+  "distributed": {
+    "launcher": "torchrun",
+    "nnodes": 2,
+    "nproc_per_node": 8
+  },
+  "env_vars": {
+    "HF_TOKEN": "auto",              // Injected into container env
+    "DOCKER_IMAGE_NAME": "reg/img"   // Used by slurm_multi parallel pull
+  }
+}
+ +

Key field notes

+ + + + + + + + + +
Field | Notes
n_gpus | "-1" = use all GPUs on the host (MAD_SYSTEM_NGPUS). Positive int = that many GPUs. Used for perf CSV metadata.
timeout | Used when CLI --timeout=-1 (default). Explicit CLI value always wins.
skip_gpu_arch | Comma-separated GPU arch names (e.g. "gfx908,A100"). Model is skipped if detected arch matches. Disable with --disable-skip-gpu-arch.
multiple_results | Path to CSV file (relative to model dir) with per-result rows that are appended to perf.csv individually.
DOCKER_IMAGE_NAME in env_vars | Required for slurm_multi: specifies the registry image for parallel srun docker pull on compute nodes. Also set automatically by DockerBuilder after a successful push.
+
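The timeout and skip_gpu_arch rules in the notes above can be sketched in a few lines. This is an illustration only: the function names are hypothetical, and exact-match comparison of arch names is an assumption (the real matcher may be more lenient).

```python
from typing import Optional


def effective_timeout(cli_timeout: int, model_timeout: Optional[int]) -> Optional[int]:
    """Explicit CLI value wins; the -1 sentinel defers to the model definition."""
    if cli_timeout != -1:
        return cli_timeout
    return model_timeout


def should_skip(skip_gpu_arch: str, detected_arch: str, disable_skip: bool = False) -> bool:
    """True when the detected arch is listed in the comma-separated skip list.

    Assumption: names are compared exactly after trimming whitespace.
    """
    if disable_skip:  # mirrors --disable-skip-gpu-arch
        return False
    skipped = {arch.strip() for arch in skip_gpu_arch.split(",") if arch.strip()}
    return detected_arch in skipped
```

With the example model above, `effective_timeout(-1, 14400)` falls back to the model's 14400 seconds, while any other CLI value replaces it.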
+ + +
+

Build manifest — build_manifest.json

+

Written by madengine build, consumed by madengine run. Pass with --manifest-file.

+
{
+  "built_images": {
+    "ci-llama3_Dockerfile.ubuntu.amd": {
+      "docker_image": "registry.io/ml/ci-llama3:sha256-abc",
+      "docker_sha":   "sha256:abc123",
+      "build_duration": 183.4
+    }
+  },
+  "built_models": {
+    "ci-llama3_Dockerfile.ubuntu.amd": {
+      "name":          "llama3-8b-train",
+      "dockerfile":    "docker/Dockerfile.ubuntu.amd",
+      "docker_image":  "ci-llama3_Dockerfile.ubuntu.amd",
+      "docker_sha":    "sha256:abc123",
+      "build_duration": 183.4,
+      "scripts":       "scripts/llama3/train.sh",
+      "args":          "--batch-size 4",
+      "tags":          ["llama3","training"],
+      "n_gpus":        "8",
+      "timeout":       14400,
+      "skip_gpu_arch": "",
+      "multiple_results": "",
+      "distributed":   {"launcher":"torchrun","nnodes":2,"nproc_per_node":8},
+      "env_vars":      {"DOCKER_IMAGE_NAME":"registry.io/ml/ci-llama3:sha256-abc"},
+      "built_on_compute": false
+    }
+  },
+  "context": {
+    "gpu_vendor": "AMD",
+    "guest_os":   "UBUNTU",
+    "docker_env_vars": {"MAD_GPU_VENDOR":"AMD","MAD_SYSTEM_NGPUS":"8"},
+    "docker_build_arg": {}
+  },
+  "deployment_config": {
+    "target":  "slurm",
+    "slurm":   {"partition":"gpu","nodes":4,"gpus_per_node":8,"time":"24:00:00"},
+    "distributed": {"launcher":"torchrun","nnodes":4,"nproc_per_node":8},
+    "env_vars": {"NCCL_DEBUG":"WARN"},
+    "debug": false
+  },
+  "summary": {"total":1,"success":1,"failed":0}
+}
+
+Merging at runtime: values in deployment_config are merged into the runtime context at startup. Keys in --additional-context take precedence over deployment_config. +
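As a reading aid, the sketch below joins built_models with built_images by their shared key, using a trimmed copy of the manifest above. The helper name is hypothetical and this is not madengine's own loader.

```python
import json

# Trimmed copy of the example manifest; only the fields used here are kept.
MANIFEST_TEXT = """
{
  "built_images": {
    "ci-llama3_Dockerfile.ubuntu.amd": {
      "docker_image": "registry.io/ml/ci-llama3:sha256-abc"
    }
  },
  "built_models": {
    "ci-llama3_Dockerfile.ubuntu.amd": {"name": "llama3-8b-train", "n_gpus": "8"}
  },
  "summary": {"total": 1, "success": 1, "failed": 0}
}
"""


def list_built_models(manifest: dict) -> list:
    """Return (model name, registry image) pairs, joined on the manifest key."""
    images = manifest.get("built_images", {})
    rows = []
    for key, model in manifest.get("built_models", {}).items():
        image = images.get(key, {}).get("docker_image", "")
        rows.append((model["name"], image))
    return rows


manifest = json.loads(MANIFEST_TEXT)
for name, image in list_built_models(manifest):
    print(f"{name} -> {image}")
```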
+
+ +

Deployment target inference

-

No explicit deploy field exists. The factory inspects additional_context:

+

No explicit deploy field needed. RunOrchestrator._infer_deployment_target() inspects the merged context:

- + - - - - + + +
Trigger | Class | Source
Context condition | Target | Class | Path
no k8s/slurm key | Local ContainerRunner | execution/container_runner.py
"k8s" or "kubernetes" key | KubernetesDeployment | deployment/kubernetes.py
"slurm" key | SlurmDeployment | deployment/slurm.py
distributed.launcher == "slurm_multi" | slurm_multi path (within Slurm) | deployment/slurm.py + common.py
"k8s" or "kubernetes" key present | Kubernetes | KubernetesDeployment | deployment/kubernetes.py
"slurm" key present | SLURM | SlurmDeployment | deployment/slurm.py
Neither | Local Docker | ContainerRunner | execution/container_runner.py
-

The mixin deployment/kubernetes_launcher_mixin.py selects the correct Jinja2 template -under src/madengine/deployment/templates/{kubernetes,slurm}/ per launcher.

+

Within SLURM deployment, if distributed.launcher == "slurm_multi" (or "slurm-multi"), SlurmDeployment.prepare() takes the slurm_multi path instead of generating the standard Jinja2 template.

+
+Force local: use --force-mirror-local on madengine run to always use ContainerRunner even when slurm/k8s keys are in context. +
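The inference rules above amount to a few key checks. The sketch below is a simplified stand-in for RunOrchestrator._infer_deployment_target(), not its actual code: the function names are hypothetical, and checking the k8s keys before the slurm key is an assumption taken from the table order.

```python
# Hyphenated alias mentioned above, mapped to the canonical launcher name.
LAUNCHER_ALIASES = {"slurm-multi": "slurm_multi"}


def infer_target(context: dict, force_local: bool = False) -> str:
    """Return "k8s", "slurm" or "local" from the merged context."""
    if force_local:  # mirrors --force-mirror-local
        return "local"
    if "k8s" in context or "kubernetes" in context:
        return "k8s"
    if "slurm" in context:
        return "slurm"
    return "local"


def uses_slurm_multi(context: dict) -> bool:
    """True when distributed.launcher selects the slurm_multi path."""
    launcher = context.get("distributed", {}).get("launcher", "")
    return LAUNCHER_ALIASES.get(launcher, launcher) == "slurm_multi"
```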
+
+ + +
+

SLURM deployment

+

Implemented in src/madengine/deployment/slurm.py. Generates an sbatch script from a Jinja2 template at src/madengine/deployment/templates/slurm/job.sh.j2.

+ +

Preset merge order

+

ConfigLoader.load_slurm_config() applies three layers (last wins):

+
    +
  1. presets/slurm/defaults.json — base defaults for all SLURM runs
  2. +
  3. presets/slurm/profiles/single-node.json or multi-node.json — profile selected by nodes count
  4. +
  5. User-supplied slurm / distributed / env_vars keys
  6. +
+ +
+presets/slurm/defaults.json — base preset contents +
{
+  "gpu_vendor": "AMD",
+  "guest_os": "UBUNTU",
+  "debug": false,
+  "slurm": {
+    "partition": "amd-rccl",
+    "nodes": 1,
+    "gpus_per_node": 8,
+    "time": "24:00:00",
+    "exclusive": true,
+    "modules": []
+  },
+  "distributed": {
+    "backend": "nccl",
+    "port": 29500
+  },
+  "env_vars": {
+    "OMP_NUM_THREADS": "8",
+    "MIOPEN_FIND_MODE": "1",
+    "MIOPEN_USER_DB_PATH": "/tmp/.miopen"
+  }
+}
+
+ +
+presets/slurm/profiles/multi-node.json — additional env vars for multi-node +
{
+  "slurm": {"nodes": 2, "gpus_per_node": 8, "time": "24:00:00"},
+  "distributed": {"backend": "nccl", "port": 29500},
+  "env_vars": {
+    "NCCL_DEBUG": "WARN",
+    "NCCL_DEBUG_SUBSYS": "INIT",
+    "NCCL_IB_DISABLE": "0",
+    "NCCL_SOCKET_IFNAME": "ib0",
+    "TORCH_NCCL_HIGH_PRIORITY": "1",
+    "GPU_MAX_HW_QUEUES": "8",
+    "TORCH_NCCL_ASYNC_ERROR_HANDLING": "1",
+    "NCCL_TIMEOUT": "1200",
+    "HSA_ENABLE_SDMA": "0",
+    "HSA_FORCE_FINE_GRAIN_PCIE": "1",
+    "RCCL_ENABLE_HIPGRAPH": "0"
+  }
+}
+
+ +

What the SLURM job script does

+
    +
  • Sets MASTER_ADDR via scontrol show hostnames, MASTER_PORT, WORLD_SIZE, NNODES
  • +
  • Sets per-node HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES / CUDA_VISIBLE_DEVICES (vLLM/SGLang: only HIP_VISIBLE_DEVICES)
  • +
  • Sets MIOPEN_USER_DB_PATH per-process: /tmp/.miopen/node_${SLURM_PROCID}_rank_${LOCAL_RANK:-0}
  • +
  • Sets TORCH_ELASTIC_RDZV_TIMEOUT=3600 for PyTorch elastic
  • +
  • Sets MAD_DEPLOYMENT_TYPE=slurm, MAD_SLURM_JOB_ID, MAD_NODE_RANK, MAD_IN_SLURM_JOB=1
  • +
  • Multi-node: generates per-node task script; runs via srun bash $TASK_SCRIPT
  • +
  • Single-node: creates synthetic manifest with deployment_config.target="docker" and calls madengine run
  • +
+ +

Node health preflight

+

SlurmNodeSelector runs a health-check srun before the main job. The check is skipped when slurm.nodelist is set. When slurm.reservation is set, it is forwarded to the health-check srun commands.

+ +

Monitoring

+

Polls squeue every 30 seconds. Terminal states: COMPLETED, FAILED, CANCELLED — a scancel'd job will not loop forever.
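The polling behaviour can be pictured as a loop that exits on any terminal state. In this sketch the state lookup is injected as a callable, so no squeue call is shown; wait_for_terminal_state is a hypothetical name and the real monitor may handle more states.

```python
import time

# Terminal states named above; CANCELLED is what ends the loop after scancel.
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED"}


def wait_for_terminal_state(get_state, poll_interval=30.0, sleep=time.sleep):
    """Call get_state() every poll_interval seconds until it is terminal."""
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        sleep(poll_interval)
```

Passing `sleep` as a parameter keeps the loop testable without waiting in real time.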

+ +
+SLURM inside existing allocation (salloc): if SLURM_JOB_ID is set and the launcher is slurm_multi, madengine runs the wrapper script directly with bash instead of nesting a new sbatch. Other launchers still submit via sbatch even inside salloc. +
+
+ + +
+

Kubernetes deployment

+

Implemented in src/madengine/deployment/kubernetes.py and 6 focused mixin modules (refactored in v2.0.3). Requires pip install -e ".[kubernetes]".

+ +

Mixin modules

+ + + + + + + + + + +
Module | Concern
k8s_pvc.py | PVC lifecycle. Storage-class fallback: data_storage_class → nfs_storage_class → storage_class. Default: "nfs-banff".
k8s_results.py | Log/artifact collection, perf aggregation. Uses shared collector_pod_name() helper — truncated collector-{id[:15]} to stay within K8s name limits.
k8s_scripts.py | Script extraction, ConfigMap building. Carries rocenv_mode and guest_os into the ConfigMap.
k8s_template_context.py | Assembles Jinja2 template context dict passed to job.yaml.j2.
kubernetes_launcher_mixin.py | Selects the right Jinja2 template per launcher type.
k8s_secrets.py | Converts additional_context.secrets dict to K8s Secret objects mounted as env vars.
+ +
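The storage-class fallback described for k8s_pvc.py reduces to a first-non-empty lookup. The helper below is a hypothetical illustration, not the module's code.

```python
def resolve_data_storage_class(k8s_cfg: dict, default: str = "nfs-banff") -> str:
    """Walk data_storage_class, nfs_storage_class, storage_class in that order."""
    for key in ("data_storage_class", "nfs_storage_class", "storage_class"):
        value = k8s_cfg.get(key)
        if value:
            return value
    return default
```

For example, a config that sets only `storage_class` resolves to that value, and an empty config resolves to the bundled default.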

Preset merge order

+

ConfigLoader.load_k8s_config() applies five layers (last wins):

+
    +
  1. presets/k8s/defaults.json — base defaults
  2. +
  3. presets/k8s/gpu-vendors/amd.json or nvidia.json — GPU resource name
  4. +
  5. presets/k8s/gpu-vendors/amd-multi-gpu.json — AMD multi-GPU NCCL env vars (only if AMD + multi-GPU)
  6. +
  7. presets/k8s/profiles/single-gpu.json, multi-gpu.json, or multi-node.json
  8. +
  9. User config
  10. +
+ +
+presets/k8s/defaults.json — base preset contents +
{
+  "k8s": {
+    "kubeconfig": "~/.kube/config",
+    "namespace": "default",
+    "image_pull_policy": "Always",
+    "backoff_limit": 3,
+    "ttl_seconds_after_finished": null,
+    "nfs_storage_class": "nfs-banff",
+    "storage_class": "nfs-banff",
+    "data_storage_class": "nfs-banff",
+    "recreate_shared_data_pvc": false,
+    "secrets": {
+      "strategy": "from_local_credentials",
+      "image_pull_secret_names": [],
+      "runtime_secret_name": null
+    }
+  },
+  "env_vars": {"OMP_NUM_THREADS": "8"}
+}
+
+ +
+presets/k8s/gpu-vendors/amd-multi-gpu.json — AMD multi-GPU NCCL env vars +
{
+  "env_vars": {
+    "NCCL_DEBUG": "WARN",
+    "NCCL_IB_DISABLE": "0",
+    "NCCL_SOCKET_IFNAME": "ib0",
+    "TORCH_NCCL_HIGH_PRIORITY": "1",
+    "GPU_MAX_HW_QUEUES": "8",
+    "HSA_ENABLE_SDMA": "0",
+    "MIOPEN_FIND_MODE": "1",
+    "MIOPEN_USER_DB_PATH": "/tmp/.miopen",
+    "HSA_FORCE_FINE_GRAIN_PCIE": "1",
+    "RCCL_ENABLE_HIPGRAPH": "0"
+  }
+}
+
+ +
+Known issue: in multi-node K8s jobs, a node may show FAILED in the results table even when the pod succeeded — this occurs when the kubelet returns 502 between job completion and log collection. PVC artifacts are still collected. Check kubectl describe pod <pod>. +
+ +

Secrets management

+
# Pass secrets via additional_context
+madengine run --tags llm-serve \
+  --additional-context '{
+    "k8s": {"namespace":"ml","gpu_count":8},
+    "secrets": {"HF_TOKEN":"hf_xxx","WANDB_API_KEY":"yyy","S3_KEY":"zzz"}
+  }'
+

Secrets in additional_context.secrets are auto-converted to a K8s Secret object and mounted as environment variables in the job pod. They are never written to perf.csv or build logs.
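To make the conversion concrete, the sketch below builds a standard Opaque Secret manifest and the matching container env entries from a secrets dict. The function names and the secret name are hypothetical; the manifest fields follow the Kubernetes Secret schema, where data values are base64-encoded.

```python
import base64


def build_secret_manifest(name: str, namespace: str, secrets: dict) -> dict:
    """Return a Kubernetes Secret manifest with base64-encoded values."""
    data = {
        key: base64.b64encode(value.encode("utf-8")).decode("ascii")
        for key, value in secrets.items()
    }
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        "data": data,
    }


def env_from_secret(name: str, secrets: dict) -> list:
    """Return container env entries that read each key from the Secret."""
    return [
        {"name": key, "valueFrom": {"secretKeyRef": {"name": name, "key": key}}}
        for key in sorted(secrets)
    ]


# "madengine-run-secrets" is a made-up name used only for this example.
manifest = build_secret_manifest("madengine-run-secrets", "ml", {"HF_TOKEN": "hf_xxx"})
env = env_from_secret("madengine-run-secrets", {"HF_TOKEN": "hf_xxx"})
```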

- +
-

slurm_multi launcher branch focus

+

slurm_multi launcher merged in v2.1.0

What it is

-

A minimal-but-additive SLURM launcher for workloads that orchestrate their own per-node -Docker containers via srun — for example SGLang Disaggregated (proxy + -prefill + decode topologies) or anything that needs to call srun / scontrol from -inside the job script.

-

Generates a wrapper SBATCH that runs the model's .slurm script -directly on baremetal (not inside a container), so the workload can spawn its own -per-node containers without the outer job step holding a container open.

+

An escape-hatch SLURM launcher for workloads that orchestrate their own per-node Docker containers via srun — for example SGLang Disaggregated (proxy + prefill + decode) or any topology that needs to call srun/scontrol from inside the job step.

+

Generates a wrapper SBATCH that runs the model's own .slurm (or .sh) script directly on the head node on baremetal — no outer container — so the workload can spawn its own per-node containers without nesting.

-

How to pick it

+

How to select it

{
-  "slurm": {"partition":"gpu","nodes":3,"gpus_per_node":8,"time":"02:00:00"},
-  "distributed": {"launcher": "slurm_multi"}
-  // aliases: "slurm-multi"
+  "slurm": {
+    "partition": "gpu",
+    "nodes": 3,
+    "gpus_per_node": 8,
+    "time": "02:00:00"
+  },
+  "distributed": {
+    "launcher": "slurm_multi"
+  }
 }
-

Honors model-card + context slurm fields: -partition, nodes, gpus_per_node, time, -exclusive, reservation, nodelist.

+

Alias "slurm-multi" (hyphen) is also accepted and normalized automatically.

-

Build modes added with this launcher

+

Build modes

- + - - - - + + + +
Mode | Flag | Behaviour
Mode | Flag | Behavior
Local build (default) | — | Normal madengine build.
Use prebuilt image | --use-image [IMAGE | auto] | Skip local build. auto resolves to the model card's env_vars.DOCKER_IMAGE_NAME. Mutually exclusive with the two below.
Build on compute | --build-on-compute (requires --registry) | Build on a SLURM compute node, push to registry; manifest sets built_on_compute: true. run then does parallel srun docker pull on all allocated nodes.
Implicit auto-use-image | none | If build finds a slurm_multi model and none of --registry / --use-image / --build-on-compute is set, it either auto-resolves the model card's DOCKER_IMAGE_NAME or raises a structured ConfigurationError listing the four supported options.
Use prebuilt image | --use-image registry.io/img:tag | Skip local build. Uses explicit image.
Auto-resolve from model card | --use-image (bare) | Reads env_vars.DOCKER_IMAGE_NAME from model card.
Build on compute | --build-on-compute --registry reg.io/ml | Builds on SLURM compute node, pushes to registry. Manifest sets built_on_compute: true. Run phase pulls in parallel on all nodes.
Implicit fallback | no flags | If model card has DOCKER_IMAGE_NAME, auto-uses it. Otherwise raises ConfigurationError listing options.

Execution paths

    -
  • sbatch (default): wrapper SBATCH submitted to SLURM.
  • -
  • bash-in-salloc: when SLURM_JOB_ID is already set (inside an - existing salloc), the slurm_multi launcher runs the wrapper synchronously with - bash instead of nesting sbatch. Other launchers keep using - sbatch even inside salloc. Uses - DeploymentResult.skip_monitoring=True to skip the monitor poll.
  • +
  • sbatch (default): wrapper SBATCH submitted to SLURM. Head node calls srun docker pull on all nodes in parallel, then runs the model's script.
  • +
  • bash-in-salloc: if SLURM_JOB_ID env var is set (inside existing salloc), the launcher runs the wrapper synchronously with bash. Sets DeploymentResult.skip_monitoring=True so the monitor poll is skipped.

Results aggregation

-

_collect_slurm_multi_results reads the per-job CSV at -/shared_inference/$USER/$JOBID/perf.csv and now also writes those rows -into cwd/perf.csv (copy if absent, append data rows if present), so the default -reporter (display_performance_table) finds them without extra args. Local + classic-SLURM -flows are unchanged.

- -

Tests & examples

-
    -
  • tests/unit/test_slurm_multi.py — registry membership, hyphen alias - normalization, env_vars-export contract against MAD-private PR #186's - pyt_sglang_disagg_qwen3-32b_short model card.
  • -
  • examples/slurm-configs/minimal/slurm-multi-minimal.json — reference config.
  • -
+

_collect_slurm_multi_results() reads per-job CSV from /shared_inference/$USER/$JOBID/perf.csv and writes those rows into cwd/perf.csv (copy if absent, append data rows if present). This ensures display_performance_table and madengine report to-html find results without extra arguments.
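The copy-or-append rule can be sketched as below. This is not the body of _collect_slurm_multi_results(): aggregate_perf_csv is a hypothetical helper, and it assumes both files share the same header and column order.

```python
import shutil
from pathlib import Path


def aggregate_perf_csv(job_csv: Path, cwd_csv: Path) -> int:
    """Copy job_csv if cwd_csv is absent, else append its data rows.

    Returns the number of data rows contributed by job_csv.
    """
    lines = job_csv.read_text().splitlines()
    data_rows = [line for line in lines[1:] if line.strip()]  # skip the header
    if not cwd_csv.exists():
        shutil.copyfile(job_csv, cwd_csv)
        return len(data_rows)
    existing = cwd_csv.read_text()
    with cwd_csv.open("a") as out:
        if existing and not existing.endswith("\n"):
            out.write("\n")  # keep appended rows on their own lines
        for row in data_rows:
            out.write(row + "\n")
    return len(data_rows)
```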

-
-Recent commits on this branch (most recent first) -
2e8f1a4 Merge remote-tracking branch 'upstream/develop' into add_slurm_multi_launcher
-68d0bf3 fix(slurm_multi): address Copilot review on PR #124
-dc3bc48 docs(slurm_multi): CHANGELOG entry + forward-compat TODO on --use-image
-e84506a fix(slurm_multi): aggregate per-job perf.csv into cwd for dashboard reporter
-e281e7e fix(deployment): add skip_monitoring to DeploymentResult for slurm_multi bash branch
-f7af062 test(slurm_multi): contract tests + minimal example config
-8a5e174 feat(cli): expose --use-image and --build-on-compute on madengine build
-bd371fe feat(orchestration): build_on_compute, registry gate, parallel pull for slurm_multi
-941d56d feat(deployment): add slurm_multi launcher (minimal additive)
-
+

Local self-managed execution

+

When slurm_multi is detected in a non-SLURM context (e.g. local Docker mode), ContainerRunner._run_self_managed() runs the model's script directly on the host. Env vars from model card and additional_context are injected; keys are logged without values to avoid leaking credentials.

- -
-

Kubernetes deployment

-

Decomposed (v2.0.3) into focused mixins composed by KubernetesDeployment:

- - - - - - - - - - - -
Module | Concern
k8s_pvc.py | PVC lifecycle (data PVC, single-node results PVC).
k8s_results.py | Log/artifact collection, performance aggregation. Uses the shared collector_pod_name() helper so cleanup matches the truncated collector-{deployment_id[:15]} name.
k8s_scripts.py | Script extraction, ConfigMap building.
k8s_template_context.py | Jinja2 template context assembly.
kubernetes_launcher_mixin.py | Per-launcher template selection.
k8s_secrets.py | secrets dict → K8s Secret objects → env vars.
k8s_pvc.py | Storage-class fallback: data_storage_class → nfs_storage_class → storage_class; single_node_results_storage_class → local_path_storage_class → storage_class. Default bundled preset: storage_class: "nfs-banff".
-
-Known issue: in multi-node K8s jobs a node may report FAILED in the results table -even though the pod actually succeeded — this happens when the kubelet returns 502 between -job completion and log collection, so madengine cannot parse perf metrics. PVC artifacts are still collected. -Check kubectl describe pod <pod>. + +
+

Docker --build-context tools= v2.1.0

+
+
+

What it does

+

Every docker build issued by DockerBuilder now passes --build-context tools=scripts/common/tools when that directory exists. Dockerfiles can pull shared helper scripts from the named context:

+
# In any model Dockerfile
+COPY --from=tools rocm_smi/*.py /opt/mad/tools/rocm_smi/
+COPY --from=tools gpu_info/*.py /opt/mad/tools/
+

This removes the need to copy the shared helper scripts into each model's Dockerfile context.

+
+
+

Conditional emission (PR #134)

+

The flag is only added when scripts/common/tools/ exists at build time. Builds in MAD projects without a tools directory do not receive the flag and will not fail.

+

Implementation: single guarded block in execution/docker_builder.py.
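A guarded block of this kind might look like the sketch below. It is not the code in execution/docker_builder.py: docker_build_command and its parameters are hypothetical, and only the --build-context flag and the default tools path come from the text above.

```python
from pathlib import Path


def docker_build_command(image_tag, dockerfile, context_dir,
                         tools_dir="scripts/common/tools"):
    """Assemble a docker build argv, adding the named context only if present."""
    cmd = ["docker", "build", "-t", image_tag, "-f", dockerfile]
    if Path(tools_dir).is_dir():
        # Named build context; Dockerfiles read it with COPY --from=tools.
        cmd += ["--build-context", f"tools={tools_dir}"]
    cmd.append(context_dir)
    return cmd
```

Because the flag is appended only when the directory exists, projects without a tools directory get an unchanged command line.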

+

SLURM fix in same PR: switched from shlex.quote() to double-quote escaping in slurm.py env-var generation so spaces and paths in values survive correctly in the sbatch script.
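The difference between the two quoting styles is easy to see on a value with a space. The sketch below is illustrative only: double_quote and export_line are hypothetical helpers, and the set of escaped characters (backslash, double quote, dollar sign, backtick) is an assumption about what slurm.py does.

```python
import shlex


def double_quote(value: str) -> str:
    """Wrap value in double quotes, escaping characters the shell would expand."""
    escaped = value
    # Backslash goes first so the escapes added afterwards are not doubled.
    for ch in ("\\", '"', "$", "`"):
        escaped = escaped.replace(ch, "\\" + ch)
    return f'"{escaped}"'


def export_line(key: str, value: str) -> str:
    """Render one export statement for an sbatch script."""
    return f"export {key}={double_quote(value)}"


print(export_line("DATA_DIR", "/mnt/my data"))
print("export DATA_DIR=" + shlex.quote("/mnt/my data"))
```

The first line renders the value in double quotes, the second in single quotes as shlex.quote() produces for values containing a space.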

+
- +

Launcher matrix

- - - - - - - - - + + + + + + + + +
Launcher | Local | K8s | SLURM | Type | Notes
torchrun | ✅ | ✅ | ✅ | Train | DDP / FSDP, elastic.
DeepSpeed | ✅ | ✅ | ✅ | Train | ZeRO, pipeline parallelism.
Megatron-LM | ✅ | ✅ | ✅ | Train | TP + PP, large transformers.
TorchTitan | ✅ | ✅ | ✅ | Train | FSDP2 + TP + PP + CP, Llama 3.1 8B–405B.
Primus | ✅ | ✅ | ✅ | Train | Megatron / TorchTitan / MaxText via Primus YAML.
vLLM | ❌ | ✅ | ✅ | Infer | v1 engine, PagedAttention.
SGLang | ❌ | ✅ | ✅ | Infer | RadixAttention, structured gen.
SGLang Disagg | ❌ | ✅ | ✅ | Infer | Disagg prefill/decode, Mooncake, 3+ nodes.
slurm_multi branch | ❌ | ❌ | ✅ | Meta | Self-managed multi-node SLURM wrapper for workloads with their own per-node container orchestration.
torchrun | ✅ | ✅ | ✅ | Train | DDP / FSDP, elastic rendezvous.
megatron / megatron-lm | ❌ | ✅ | ✅ | Train | TP + PP parallelism; sets TP/PP/CP size env vars.
torchtitan | ❌ | ✅ | ✅ | Train | FSDP2 + TP + PP + CP; Llama 3.1 8B–405B.
deepspeed | ✅ | ✅ | ✅ | Train | ZeRO, pipeline parallelism; dynamic hostfile from SLURM.
vllm | ❌ | ✅ | ✅ | Infer | PagedAttention; each node self-managing (no torchrun wrapper).
sglang | ❌ | ✅ | ✅ | Infer | RadixAttention, structured gen; each node self-managing.
sglang_disagg | ❌ | ✅ | ✅ | Infer | Disaggregated prefill/decode; min 3 nodes (1 proxy + ≥1P + ≥1D).
primus | ✅ | ✅ | ✅ | Train | Megatron / TorchTitan / MaxText via Primus YAML config.
slurm_multi | ✅ (self-mgd) | ❌ | ✅ | Meta | Bypasses template; model's own SLURM script on head node.
- + +
+

Per-launcher configuration

+
+ + + + + + + + +
+ +
+

Standard PyTorch distributed launcher. Generates: torchrun --nnodes=NNODES --nproc_per_node=NPROC --node_rank=RANK --master_addr=ADDR --master_port=PORT

+
{
+  "slurm": {"partition":"gpu","nodes":4,"gpus_per_node":8,"time":"24:00:00"},
+  "distributed": {
+    "launcher": "torchrun",
+    "nnodes": 4,
+    "nproc_per_node": 8,
+    "backend": "nccl",
+    "port": 29500
+  },
+  "env_vars": {
+    "NCCL_DEBUG": "WARN",
+    "HSA_ENABLE_SDMA": "0",
+    "TORCH_NCCL_ASYNC_ERROR_HANDLING": "1"
+  }
+}
+

Local: MAD_MULTI_NODE_RUNNER is set to torchrun --standalone --nproc_per_node=N (single-node only).
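The two command shapes described for this launcher can be assembled as shown below. The helper names are hypothetical; the torchrun flags are the ones quoted above.

```python
def torchrun_command(nnodes, nproc_per_node, node_rank, master_addr, master_port=29500):
    """Argv for one node of a multi-node torchrun launch."""
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={nproc_per_node}",
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
    ]


def local_runner(nproc_per_node):
    """Single-node form, as described for MAD_MULTI_NODE_RUNNER."""
    return f"torchrun --standalone --nproc_per_node={nproc_per_node}"


# Node 1 of the 4-node, 8-GPU config above; the address is a placeholder.
print(" ".join(torchrun_command(4, 8, node_rank=1, master_addr="node01")))
```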

+
+ +
+

Uses torchrun under the hood; sets TENSOR_MODEL_PARALLEL_SIZE, PIPELINE_MODEL_PARALLEL_SIZE, CONTEXT_PARALLEL_SIZE env vars for the Megatron script to read.

+
{
+  "slurm": {"partition":"gpu","nodes":8,"gpus_per_node":8,"time":"48:00:00"},
+  "distributed": {
+    "launcher": "megatron",
+    "nnodes": 8,
+    "nproc_per_node": 8
+  },
+  "env_vars": {
+    "TENSOR_MODEL_PARALLEL_SIZE": "4",
+    "PIPELINE_MODEL_PARALLEL_SIZE": "2",
+    "CONTEXT_PARALLEL_SIZE": "1",
+    "NCCL_IB_DISABLE": "0"
+  }
+}
+
+ +
+

FSDP2 + TP + PP + CP. Sets TORCHTITAN_TENSOR_PARALLEL_SIZE, TORCHTITAN_PIPELINE_PARALLEL_SIZE, TORCHTITAN_FSDP_ENABLED, TORCHTITAN_CONTEXT_PARALLEL_SIZE.

+
{
+  "slurm": {"partition":"gpu","nodes":4,"gpus_per_node":8,"time":"24:00:00"},
+  "distributed": {
+    "launcher": "torchtitan",
+    "nnodes": 4,
+    "nproc_per_node": 8
+  },
+  "env_vars": {
+    "TORCHTITAN_TENSOR_PARALLEL_SIZE": "2",
+    "TORCHTITAN_FSDP_ENABLED": "true"
+  }
+}
+
+ +
+

DeepSpeed with dynamic SLURM hostfile generation. Generates: deepspeed --hostfile=/tmp/hostfile …

+
{
+  "slurm": {
+    "partition": "gpu",
+    "nodes": 8,
+    "gpus_per_node": 8,
+    "time": "48:00:00",
+    "reservation": "ml-priority"
+  },
+  "distributed": {
+    "launcher": "deepspeed",
+    "nnodes": 8,
+    "nproc_per_node": 8,
+    "backend": "nccl"
+  },
+  "env_vars": {
+    "NCCL_DEBUG": "WARN",
+    "HSA_ENABLE_SDMA": "0"
+  }
+}
+
+ +
+

Each node runs independently (no torchrun). Sets VLLM_TENSOR_PARALLEL_SIZE, VLLM_PIPELINE_PARALLEL_SIZE, VLLM_DISTRIBUTED_BACKEND. Only HIP_VISIBLE_DEVICES is set (not ROCR_VISIBLE_DEVICES/CUDA_VISIBLE_DEVICES) to avoid conflict with Ray.

+
{
+  "slurm": {"partition":"gpu","nodes":2,"gpus_per_node":8,"time":"12:00:00"},
+  "distributed": {
+    "launcher": "vllm",
+    "nnodes": 2,
+    "nproc_per_node": 8
+  },
+  "env_vars": {
+    "VLLM_TENSOR_PARALLEL_SIZE": "8",
+    "VLLM_PIPELINE_PARALLEL_SIZE": "2"
+  }
+}
+
+AMD+Ray gotcha: RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES is automatically overridden to "" when HIP_VISIBLE_DEVICES is set, preventing the rocm/vllm image from ignoring GPU visibility. +
+
+ +
+

SGLang standard (RadixAttention, structured gen). Each node self-managing. Sets SGLANG_TENSOR_PARALLEL_SIZE, SGLANG_PIPELINE_PARALLEL_SIZE.

+
{
+  "slurm": {"partition":"gpu","nodes":2,"gpus_per_node":8,"time":"06:00:00"},
+  "distributed": {
+    "launcher": "sglang",
+    "nnodes": 2,
+    "nproc_per_node": 8
+  },
+  "env_vars": {
+    "SGLANG_TENSOR_PARALLEL_SIZE": "8"
+  }
+}
+
+ +
+

Disaggregated prefill + decode topology. Minimum 3 nodes: 1 proxy + ≥1 prefill + ≥1 decode. Node split: default ~40% prefill, rest decode.

+
{
+  "slurm": {
+    "partition": "gpu",
+    "nodes": 5,
+    "gpus_per_node": 8,
+    "time": "04:00:00"
+  },
+  "distributed": {
+    "launcher": "sglang_disagg",
+    "nnodes": 5,
+    "nproc_per_node": 8,
+    "sglang_disagg": {
+      "prefill_nodes": 2,
+      "decode_nodes": 2
+    }
+  },
+  "env_vars": {
+    "SGLANG_TP_SIZE": "8"
+  }
+}
+

Sets: SGLANG_DISAGG_MODE, SGLANG_DISAGG_PREFILL_NODES, SGLANG_DISAGG_DECODE_NODES, SGLANG_DISAGG_TOTAL_NODES, SGLANG_NODE_IPS, SGLANG_NODE_RANK.
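The node split can be reasoned about with a small helper. This is a sketch under assumptions: the 40% default is applied to the non-proxy nodes and rounded to the nearest integer, which may differ from madengine's actual rule, and split_disagg_nodes is a hypothetical name.

```python
def split_disagg_nodes(total_nodes, prefill_nodes=None, decode_nodes=None,
                       prefill_fraction=0.4):
    """Split an allocation into 1 proxy plus prefill and decode nodes."""
    if total_nodes < 3:
        raise ValueError("sglang_disagg needs at least 3 nodes")
    workers = total_nodes - 1  # one node is reserved for the proxy
    if prefill_nodes is None and decode_nodes is None:
        # Assumed default: about 40% of workers prefill, at least one of each.
        prefill_nodes = min(max(1, round(workers * prefill_fraction)), workers - 1)
        decode_nodes = workers - prefill_nodes
    if prefill_nodes is None or decode_nodes is None:
        raise ValueError("set both prefill_nodes and decode_nodes, or neither")
    if prefill_nodes < 1 or decode_nodes < 1:
        raise ValueError("need at least one prefill and one decode node")
    if prefill_nodes + decode_nodes != workers:
        raise ValueError("prefill + decode must equal total_nodes - 1")
    return {"proxy": 1, "prefill": prefill_nodes, "decode": decode_nodes}
```

The explicit 5-node config above (2 prefill, 2 decode) passes this check, since 1 + 2 + 2 = 5.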

+
+
+ + +
+

Config recipes

+

Complete working configurations for common scenarios.

+ +
+ + + + + + + + +
+ +
+

Local — single GPU, AMD

+
madengine run --tags llama3 \
+  --additional-context '{
+    "gpu_vendor": "AMD",
+    "guest_os": "UBUNTU",
+    "docker_gpus": "0"
+  }'
+
+ +
+

Local — all 8 GPUs, with Megatron env vars

+
madengine run --tags megatron-llama3 \
+  --additional-context '{
+    "gpu_vendor": "AMD",
+    "guest_os": "UBUNTU",
+    "docker_env_vars": {
+      "TENSOR_MODEL_PARALLEL_SIZE": "4",
+      "PIPELINE_MODEL_PARALLEL_SIZE": "2"
+    }
+  }'
+
+ +
+

SLURM — single node torchrun

+
cat > slurm-single.json <<'EOF'
+{
+  "slurm": {
+    "partition": "amd-gpu",
+    "nodes": 1,
+    "gpus_per_node": 8,
+    "time": "12:00:00",
+    "exclusive": true
+  },
+  "distributed": {
+    "launcher": "torchrun",
+    "nnodes": 1,
+    "nproc_per_node": 8
+  }
+}
+EOF
+madengine build --tags llama3 --registry registry.example.com/ml
+madengine run --manifest-file build_manifest.json \
+  --additional-context-file slurm-single.json
+
+ +
+

SLURM — 4-node DeepSpeed with reservation

+
cat > slurm-multi.json <<'EOF'
+{
+  "slurm": {
+    "partition": "amd-gpu",
+    "nodes": 4,
+    "gpus_per_node": 8,
+    "time": "24:00:00",
+    "exclusive": true,
+    "reservation": "ml-training-q1",
+    "network_interface": "ib0"
+  },
+  "distributed": {
+    "launcher": "deepspeed",
+    "nnodes": 4,
+    "nproc_per_node": 8,
+    "backend": "nccl"
+  },
+  "env_vars": {
+    "NCCL_IB_DISABLE": "0",
+    "NCCL_SOCKET_IFNAME": "ib0",
+    "NCCL_DEBUG": "WARN",
+    "HSA_ENABLE_SDMA": "0"
+  }
+}
+EOF
+madengine run --manifest-file build_manifest.json \
+  --additional-context-file slurm-multi.json
+
+ +
+

K8s — single pod, 4 AMD GPUs

+
madengine run --tags llama3-infer \
+  --additional-context '{
+    "k8s": {
+      "namespace": "ml-team",
+      "gpu_count": 4
+    }
+  }'
+
+ +
+

K8s — multi-node vLLM with HF secret

+
madengine run --tags vllm-llama3-70b \
+  --additional-context '{
+    "k8s": {
+      "namespace": "ml-team",
+      "gpu_count": 8,
+      "host_ipc": true,
+      "data_storage_class": "nfs-banff"
+    },
+    "distributed": {
+      "launcher": "vllm",
+      "nnodes": 2,
+      "nproc_per_node": 8
+    },
+    "secrets": {"HF_TOKEN": "hf_xxxxxxx"},
+    "env_vars": {
+      "VLLM_TENSOR_PARALLEL_SIZE": "8",
+      "VLLM_PIPELINE_PARALLEL_SIZE": "2"
+    }
+  }'
+
+ +
+

SLURM — SGLang Disagg (3 nodes: 1 proxy + 1P + 1D)

+
madengine build --tags pyt_sglang_disagg --use-image registry.io/sglang:v0.4
+
+madengine run --manifest-file build_manifest.json \
+  --additional-context '{
+    "slurm": {
+      "partition": "amd-gpu",
+      "nodes": 3,
+      "gpus_per_node": 8,
+      "time": "04:00:00"
+    },
+    "distributed": {
+      "launcher": "slurm_multi"
+    }
+  }'
+
+ +
+

Local run with ROCm compute profiling

+
madengine run --tags llama3 \
+  --additional-context '{
+    "gpu_vendor": "AMD",
+    "tools": [
+      {"name": "rocprofv3_compute"}
+    ],
+    "rocenv_mode": "full"
+  }'
+

Stack multiple profilers:

+
  "tools": [
+    {"name": "rocprofv3_compute"},
+    {"name": "rccl_trace"},
+    {"name": "gpu_info_power_profiler"}
+  ]
+
+
+
-

Profiling & tracing

-

Enable via --additional-context '{"tools":[{"name":"…"}]}'. Stackable.

+

Profiling & tracing tools

+

Enable via --additional-context '{"tools":[{"name":"…"}]}'. Tools are stackable — list multiple objects. Implemented in scripts/common/tools/ and execution/container_runner.py::apply_tools().

+ +
+Do not combine rocm_trace_lite with rocprof / rocprofv3_* in the same run — they conflict at the kernel-tracing level. +
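A pre-flight check for this conflict could look like the sketch below. validate_tools is a hypothetical helper, not part of apply_tools(); treating rocm_trace_lite_default as part of the rocm_trace_lite family is an assumption.

```python
def validate_tools(tools):
    """Return the tool names, rejecting the documented conflicting combination."""
    names = [tool["name"] for tool in tools]
    has_rtl = any(name.startswith("rocm_trace_lite") for name in names)
    has_rocprof = any(
        name == "rocprof" or name.startswith("rocprofv3_") for name in names
    )
    if has_rtl and has_rocprof:
        raise ValueError(
            "rocm_trace_lite cannot be combined with rocprof / rocprofv3_* in one run"
        )
    return names
```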
+ - + - - - - - - - - - - - - - - - + + + + + + + + + + + + + + + + + + + +
Tool | Purpose | Output
Tool name | Purpose | Output location | Notes
rocprof | Legacy GPU kernel profiling | Kernel timings/occupancy
rocprofv3_compute | Compute-bound (ROCm ≥ 7.0) | ALU, wave execution
rocprofv3_memory | Memory-bound | Cache hits, bandwidth
rocprofv3_communication | Multi-GPU | RCCL traces
rocprofv3_full | Comprehensive | All metrics, high overhead
rocprofv3_lightweight | Minimal overhead | HIP + kernel traces
rocprofv3_perfetto | Perfetto UI traces | Perfetto JSON
rocprofv3_api_overhead | API call timing | API timings
rocprofv3_pc_sampling | Kernel hotspots | PC sample histograms
rocm_trace_lite | RTL lite dispatch trace | rocm_trace_lite_output/trace.db
rocm_trace_lite_default | RTL default mode | Same paths, broader coverage
rocblas_trace / miopen_trace / tensile_trace / rccl_trace | Library call tracing | Per-library log
gpu_info_power_profiler / gpu_info_vram_profiler | Power / VRAM over time | CSV time series
therock_check | TheRock ROCm validation | Detection report
rocprof | Legacy GPU kernel profiling | Kernel timings / occupancy CSVs | Use rocprofv3_* on ROCm ≥ 7.0
rocprofv3_compute | Compute-bound kernels | ALU, wave execution metrics | ROCm ≥ 7.0
rocprofv3_memory | Memory-bound workloads | Cache hits, bandwidth | —
rocprofv3_communication | Multi-GPU communication | RCCL traces | —
rocprofv3_full | Comprehensive (all metrics) | All counters | High overhead — short runs only
rocprofv3_lightweight | Minimal overhead tracing | HIP API + kernel traces | —
rocprofv3_perfetto | Perfetto UI traces | Perfetto JSON for ui.perfetto.dev | —
rocprofv3_api_overhead | API call timing | Per-API timing report | —
rocprofv3_pc_sampling | Kernel hotspot identification | PC sample histograms | —
rocm_trace_lite | RTL lite dispatch trace | rocm_trace_lite_output/trace.db | Pinned GitHub release wheel by default
rocm_trace_lite_default | RTL default mode | Same paths, broader coverage | v2.0.3+
rocblas_trace | rocBLAS call tracing | Per-library log | —
miopen_trace | MIOpen call tracing | Per-library log | —
tensile_trace | Tensile call tracing | Per-library log | —
rccl_trace | RCCL communication tracing | Per-library log | —
gpu_info_power_profiler | Power consumption over time | CSV time series | —
gpu_info_vram_profiler | VRAM usage over time | CSV time series | —
therock_check | TheRock ROCm stack validation | Detection report | Identifies apt vs TheRock install
+ +

rocm_trace_lite wheel control

+ + + + + + +
Env var | Effect
ROCM_TRACE_LITE_FOLLOW_LATEST=1 | Always pull the latest wheel from GitHub
ROCM_TRACE_LITE_WHEEL_URL=https://… | Use a specific wheel URL (air-gapped installs)
+ +

rocEnvTool modes

+ + + + +
Mode (rocenv_mode) | Collects
"lite" (default) | Basic ROCm info, GPU topology, driver version
"full" | All of lite + lshw, dmidecode, dmesg, modinfo; best-effort installs missing tools per guest_os
-
-Do not combine rocm_trace_lite with rocprof / -rocprofv3_* in the same run. RTL installs from a pinned GitHub release wheel by -default — set ROCM_TRACE_LITE_FOLLOW_LATEST=1 or -ROCM_TRACE_LITE_WHEEL_URL=… for latest / air-gapped installs. -
- +

ROCm path resolution

-

Implemented in src/madengine/utils/rocm_path_resolver.py.

-

Host (build & tools)

+

Implemented in src/madengine/utils/rocm_path_resolver.py and src/madengine/core/context.py. Two independent resolution chains apply: one for the host and one for the container.

+
+
+

Host path (build & tools)

    -
  1. Top-level MAD_ROCM_PATH in --additional-context
  2. -
  3. Auto-detect: /opt/rocm, /opt/rocm-*, TheRock rocm-sdk + markers, then rocminfo / amd-smi / rocm-smi on PATH
  4. -
  5. ROCM_PATH env var
  6. -
  7. /opt/rocm fallback
  8. +
  1. MAD_ROCM_PATH in --additional-context
  2. Auto-detect: /opt/rocm, versioned /opt/rocm-*, TheRock (rocm-sdk + markers)
  3. rocminfo / amd-smi / rocm-smi location on PATH
  4. ROCM_PATH environment variable
  5. /opt/rocm fallback (with warning)
-

Set MAD_AUTO_ROCM_PATH=0 to disable scanning and use only the env var/default.

-

In-container (AMD Docker runs)

+

Set MAD_AUTO_ROCM_PATH=0 to disable auto-detection scanning and use only the ROCM_PATH environment variable or the /opt/rocm default.
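The host chain above can be sketched as a simple first-match resolver. This is a minimal illustration, not the code in rocm_path_resolver.py: the function name, its injectable probes, and the assumption that a ROCm tool on PATH lives under <root>/bin are all hypothetical, TheRock marker probing is omitted, and the sketch assumes the MAD_ROCM_PATH override still wins when scanning is disabled.

```python
import os
import shutil
import warnings
from glob import glob
from typing import Callable, List, Mapping, Optional


def resolve_host_rocm_path(
    additional_context: Mapping[str, object],
    env: Mapping[str, str],
    is_dir: Callable[[str], bool] = os.path.isdir,
    which: Callable[[str], Optional[str]] = shutil.which,
    versioned_dirs: Callable[[], List[str]] = lambda: sorted(glob("/opt/rocm-*")),
) -> str:
    """Hypothetical helper: walk the host priority chain, return a ROCm root."""
    # 1. Explicit override from --additional-context.
    override = additional_context.get("MAD_ROCM_PATH")
    if override:
        return str(override)

    # Steps 2 and 3 are skipped when MAD_AUTO_ROCM_PATH=0.
    if env.get("MAD_AUTO_ROCM_PATH", "1") != "0":
        # 2. Well-known install directories (TheRock marker probing omitted).
        for candidate in ["/opt/rocm", *versioned_dirs()]:
            if is_dir(candidate):
                return candidate
        # 3. A ROCm tool on PATH; assumed to live in <root>/bin/<tool>.
        for tool in ("rocminfo", "amd-smi", "rocm-smi"):
            found = which(tool)
            if found:
                return os.path.dirname(os.path.dirname(found))

    # 4. ROCM_PATH environment variable.
    if env.get("ROCM_PATH"):
        return env["ROCM_PATH"]

    # 5. Last resort, flagged with a warning.
    warnings.warn("ROCm root not detected; falling back to /opt/rocm")
    return "/opt/rocm"
```

Passing the filesystem and PATH probes as parameters keeps the chain testable on machines without ROCm installed.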

+
+
+

In-container path (AMD Docker runs)

    -
  1. docker_env_vars.MAD_ROCM_PATH (consumed; not forwarded as-is)
  2. -
  3. ROCM_PATH/ROCM_HOME from image OCI config (docker image inspect)
  4. -
  5. In-image shell probe (docker run --rm)
  6. -
  7. /opt/rocm with a warning
  8. +
  1. docker_env_vars.MAD_ROCM_PATH in additional_context
  2. ROCM_PATH / ROCM_HOME from image OCI config (docker image inspect)
  3. In-image shell probe (docker run --rm image env)
  4. /opt/rocm fallback with warning
-

The run-phase environment table prints host vs container installation type -(apt / therock / unknown), ROCm/CUDA root, and version side-by-side.

+

The run-phase env table prints host vs container ROCm root, installation type (apt / therock / unknown), and version side-by-side.
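The OCI-config step of the in-container chain amounts to reading the image's Env list. The sketch below parses the JSON that docker image inspect prints (an array with one object per image, each carrying Config.Env as KEY=VALUE strings). The function name is hypothetical, and preferring ROCM_PATH over ROCM_HOME is an assumption taken from the order in which the list above names them.

```python
import json
from typing import Optional


def rocm_path_from_inspect(inspect_json: str) -> Optional[str]:
    """Hypothetical helper: pull ROCM_PATH / ROCM_HOME out of inspect output."""
    images = json.loads(inspect_json)
    if not images:
        return None
    # Config or Env may be null for images that declare no environment.
    config = images[0].get("Config") or {}
    env = dict(
        item.split("=", 1) for item in (config.get("Env") or []) if "=" in item
    )
    # Assumed precedence: ROCM_PATH first, then ROCM_HOME.
    return env.get("ROCM_PATH") or env.get("ROCM_HOME")
```

A None result would hand control to the next step in the chain, the in-image shell probe.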

+
+
+
+renderD mapping: ROCm < 6.4.1 uses the legacy unique_id method; 6.4.1+ uses amd-smi node_id. The gpu_renderDs context key maps GPU index → /dev/dri/renderD number. Context guards against None entries on restricted ROCm installs. +
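The version gate and the None guard described in the renderD note can be illustrated as follows. This is a sketch under stated assumptions: all three function names are hypothetical, the version string is assumed to start with dotted integers, and the real detection in context.py may differ.

```python
import re
from typing import Dict, List, Optional, Tuple


def parse_rocm_version(version: str) -> Tuple[int, int, int]:
    """Take the first three integers of a version string, padding with zeros."""
    numbers = [int(n) for n in re.findall(r"\d+", version)[:3]]
    numbers += [0] * (3 - len(numbers))
    return (numbers[0], numbers[1], numbers[2])


def uses_node_id_mapping(version: str) -> bool:
    """True for ROCm 6.4.1 and later, which use the amd-smi node_id method."""
    return parse_rocm_version(version) >= (6, 4, 1)


def build_render_map(render_ids: List[Optional[int]]) -> Dict[int, int]:
    """Map GPU index to renderD number, skipping entries reported as None."""
    return {gpu: rd for gpu, rd in enumerate(render_ids) if rd is not None}
```

Comparing integer tuples rather than strings keeps 6.10.0 ordered after 6.4.1.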
- + +
+

Environment variables

+ + +

Read by madengine at runtime

+ + + + + + + + + + + + + + + + + + + + + + + + + +
Variable | Module | Purpose
MAD_ROCM_PATH | context.py | Override ROCm root on host. Priority 1.
ROCM_PATH | core/constants.py | Fallback ROCm root. Priority 4 in the host chain, after auto-detection and the PATH tool lookup.
MAD_AUTO_ROCM_PATH | rocm_path_resolver | Set 0 to disable auto-scan.
MODEL_DIR | core/constants.py | Working directory for model scripts. Default: .
MAD_VERBOSE_CONFIG | core/constants.py | Enable verbose config output.
MAD_SETUP_MODEL_DIR | core/constants.py | Trigger model directory setup.
MAD_SECRETS* | context.py | Any env var with this prefix is automatically copied to docker_build_arg AND docker_env_vars.
MAD_DOCKERHUB_USER | build_orchestrator | Docker Hub username for registry auth.
MAD_DOCKERHUB_PASSWORD | build_orchestrator | Docker Hub password for registry auth.
SLURM_JOB_ID | slurm.py | Detect existing SLURM allocation (triggers bash-in-salloc for slurm_multi).
SLURM_NNODES, SLURM_NPROCS | container_runner | Read in SLURM job to resolve GPU count per node.
NPROC_PER_NODE, GPUS_PER_NODE | container_runner | Injected by SLURM template; read by ContainerRunner to set up docker run GPU args.
MONGO_HOST, MONGO_PORT | database/mongodb.py | MongoDB connection.
MONGO_USER, MONGO_PASSWORD | database/mongodb.py | MongoDB credentials.
MONGO_AUTH_SOURCE, MONGO_TIMEOUT_MS | database/mongodb.py | MongoDB auth source and timeout.
NAS_NODES | core/constants.py | NAS node config (JSON string).
MAD_AWS_S3 | core/constants.py | AWS S3 credentials (JSON: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, …).
MAD_MINIO | core/constants.py | MinIO credentials (JSON: MINIO_ENDPOINT, AWS_ENDPOINT_URL_S3, …).
PUBLIC_GITHUB_ROCM_KEY | core/constants.py | GitHub ROCm key (JSON).
ROCM_TRACE_LITE_FOLLOW_LATEST | tools | Set 1 to always pull latest RTL wheel.
ROCM_TRACE_LITE_WHEEL_URL | tools | Override RTL wheel URL (air-gapped installs).
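The MAD_SECRETS* row describes a prefix-based copy into two context keys. A minimal sketch of that behaviour, assuming the context is a plain dict and using a hypothetical function name (the actual logic lives in context.py and may differ):

```python
from typing import Dict, Mapping


def propagate_secrets(env: Mapping[str, str], context: Dict[str, dict]) -> Dict[str, dict]:
    """Copy every MAD_SECRETS* variable into build args and container env."""
    build_args = context.setdefault("docker_build_arg", {})
    run_env = context.setdefault("docker_env_vars", {})
    for name, value in env.items():
        if name.startswith("MAD_SECRETS"):
            build_args[name] = value
            run_env[name] = value
    return context
```

For example, propagate_secrets(os.environ, ctx) would make a MAD_SECRETS_HFTOKEN variable (an illustrative name) visible both at docker build time and inside the running container.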
+ +

Set by madengine in Docker containers

+ + + + + + + + + + + + + + + + + +
Variable | Set by | Value / source
MAD_GPU_VENDOR | context.py | "AMD" or "NVIDIA"
MAD_SYSTEM_NGPUS | context.py | Total GPU count on host
MAD_SYSTEM_GPU_ARCHITECTURE | context.py | GPU arch string (e.g. "gfx90a")
MAD_SYSTEM_HIP_VERSION | context.py | HIP version string
MAD_SYSTEM_GPU_PRODUCT_NAME | context.py | GPU product name
MAD_GUEST_OS | container_runner | "UBUNTU" or "CENTOS"
MAD_RUNTIME_NGPUS | container_runner | GPU count allocated for this specific run
MAD_MULTI_NODE_RUNNER | container_runner | Distributed launcher command (e.g. torchrun --standalone --nproc_per_node=8)
MAD_MODEL_NAME | container_runner | Model name from model definition
MAD_OUTPUT_CSV | container_runner | Path for multiple_results CSV output
ROCM_PATH | container_runner | Resolved in-container ROCm root
JENKINS_BUILD_NUMBER | container_runner | CI build number (from shell env if set)
RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES | container_runner | Force-set to "" when HIP_VISIBLE_DEVICES is active (AMD+Ray fix)
+ +

Set by SLURM job script (job.sh.j2)

+ + + + + + + + + + + + + + + + + + + + + +
Variable | Value
MAD_DEPLOYMENT_TYPE | "slurm"
MAD_SLURM_JOB_ID | SLURM job ID
MAD_NODE_RANK | This node's rank (0-indexed)
MAD_TOTAL_NODES | Total node count
MAD_IN_SLURM_JOB | "1"
MAD_LAUNCHER_TYPE | Launcher type string
MASTER_ADDR | Head node hostname (via scontrol)
MASTER_PORT | Communication port (default 29500)
WORLD_SIZE | Total GPU processes (nodes × GPUs/node)
NNODES | Node count
GPUS_PER_NODE | GPU count per node
NODE_RANK | This node's rank
TORCH_ELASTIC_RDZV_TIMEOUT | 3600
MIOPEN_USER_DB_PATH | /tmp/.miopen/node_${SLURM_PROCID}_rank_${LOCAL_RANK:-0}
HIP_VISIBLE_DEVICES | GPU indices for this node's processes
ROCR_VISIBLE_DEVICES | GPU indices (not set for Ray-based launchers)
CUDA_VISIBLE_DEVICES | GPU indices (not set for Ray-based launchers)
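The distributed-training variables in this table are related by simple arithmetic: WORLD_SIZE is nodes × GPUs per node, and MASTER_PORT defaults to 29500. The sketch below derives them in Python for illustration only; the real values are set in the job.sh.j2 shell template, and the function name is hypothetical.

```python
from typing import Dict


def distributed_env(
    nnodes: int,
    gpus_per_node: int,
    node_rank: int,
    master_addr: str,
    master_port: int = 29500,
) -> Dict[str, str]:
    """Derive the launcher-facing variables for one node of a SLURM job."""
    if not 0 <= node_rank < nnodes:
        raise ValueError("node_rank must be in [0, nnodes)")
    return {
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
        "NNODES": str(nnodes),
        "GPUS_PER_NODE": str(gpus_per_node),
        "NODE_RANK": str(node_rank),
        # One process per GPU across all nodes.
        "WORLD_SIZE": str(nnodes * gpus_per_node),
    }
```

A two-node job with eight GPUs per node therefore yields a WORLD_SIZE of 16.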
+
+ + +
+

Error types

+

Defined in src/madengine/core/errors.py. All inherit from MADEngineError(Exception) which carries: message, category, context (ErrorContext dataclass), cause, recoverable, suggestions (list). Rich panels are used for display.

+ + + + + + + + + + + + + + +
Class | Category | When raised
ValidationError | VALIDATION | Invalid CLI args, model field values, context key types.
NetworkError | CONNECTION | Registry connectivity, pull failures, MongoDB connection.
AuthenticationError | AUTHENTICATION | Registry login failure, invalid credentials format.
ExecutionError | RUNTIME | Container run failure, script non-zero exit, timeout. (RuntimeError is an alias.)
BuildError | BUILD | Docker build failure.
DiscoveryError | DISCOVERY | models.json parse failure, tag not found, no models matched.
OrchestrationError | ORCHESTRATION | Manifest load failure, incompatible build/run state.
RunnerError | RUNNER | ContainerRunner internal failure.
ConfigurationError | CONFIGURATION | slurm_multi registry gate violation, conflicting flags, missing required config.
DeploymentTimeoutError | TIMEOUT | SLURM/K8s job exceeded wall time.
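The shape of the hierarchy can be sketched as a base exception carrying the listed fields, with subclasses that fix the category. This is an illustration, not the contents of core/errors.py: the constructor signature and the ErrorContext fields (operation, details) are assumptions, and only two of the ten subclasses are shown.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class ErrorCategory(Enum):
    RUNTIME = "runtime"
    CONFIGURATION = "configuration"


@dataclass
class ErrorContext:
    # Field names are assumed; the source only says this is a dataclass.
    operation: str
    details: Dict[str, str] = field(default_factory=dict)


class MADEngineError(Exception):
    """Base error carrying the fields named in the paragraph above."""

    def __init__(
        self,
        message: str,
        category: ErrorCategory,
        context: Optional[ErrorContext] = None,
        cause: Optional[BaseException] = None,
        recoverable: bool = False,
        suggestions: Optional[List[str]] = None,
    ) -> None:
        super().__init__(message)
        self.message = message
        self.category = category
        self.context = context
        self.cause = cause
        self.recoverable = recoverable
        self.suggestions = list(suggestions or [])


class ExecutionError(MADEngineError):
    def __init__(self, message: str, **kwargs) -> None:
        super().__init__(message, ErrorCategory.RUNTIME, **kwargs)


class ConfigurationError(MADEngineError):
    def __init__(self, message: str, **kwargs) -> None:
        super().__init__(message, ErrorCategory.CONFIGURATION, **kwargs)
```

Callers can catch MADEngineError once and branch on category or recoverable instead of matching each subclass.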
+
+ +

Module reference

- - - + +
Layer | Path | What it contains
+ - - - - + + + + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + - - - - - - - - - - - - - - + + + + + + + - - - - - - + + - - - + - +
Layer | Path | Contents
CLI | cli/app.py | Typer app, cli_main entry, --version handling, rich traceback install.
CLI | cli/commands/build.py | madengine build command, registry options, batch builds, --use-image/--build-on-compute.
CLI | cli/commands/run.py | madengine run command, manifest loading, --skip-model-run.
CLI | cli/commands/discover.py | Model discovery command.
CLI | cli/app.py | Typer app, cli_main entry, --version, Rich traceback install.
CLI | cli/commands/build.py | madengine build: registry, batch, --use-image, --build-on-compute, mutex validation.
CLI | cli/commands/run.py | madengine run: manifest loading, all run flags, --force-mirror-local, --cleanup-perf.
CLI | cli/commands/discover.py | Model discovery command, scoped tag parsing.
CLI | cli/commands/report.py | report to-html / to-email sub-app.
CLI | cli/commands/database.py | MongoDB upload command.
CLI | cli/constants.py | ExitCode enum.
CLI | cli/validators.py | Argument validation.
Orch | orchestration/build_orchestrator.py | BuildOrchestrator.execute(), discover → build, registry login, batch manifest, slurm_multi registry gate.
Orch | orchestration/run_orchestrator.py | RunOrchestrator, build phase, target inference, local Docker dispatch, slurm_multi result aggregation.
Orch | orchestration/image_filtering.py | Target-arch / tag filtering of manifest entries.
Dep | deployment/factory.py | DeploymentFactory.create(), registers SlurmDeployment + KubernetesDeployment; UserWarning if kubernetes pkg missing.
Dep | deployment/base.py | BaseDeployment, DeploymentConfig, DeploymentResult (incl. skip_monitoring), DeploymentStatus, PERFORMANCE_LOG_PATTERN, terminal states (COMPLETED/FAILED/CANCELLED).
Dep | deployment/kubernetes.py | Composes K8s mixins; orchestrates job lifecycle.
Dep | deployment/k8s_pvc.py | PVC creation/deletion + storage-class resolution.
Dep | deployment/k8s_results.py | Log/artifact collection, perf aggregation; collector_pod_name().
Dep | deployment/k8s_scripts.py | Script extraction, ConfigMap building (carries rocenv_mode, guest_os).
Dep | deployment/k8s_template_context.py | Assembles Jinja2 template context.
Dep | deployment/k8s_secrets.py | secrets → K8s Secret objects.
Dep | deployment/k8s_names.py | Name truncation/sanitization helpers.
Dep | deployment/kubernetes_launcher_mixin.py | Selects K8s template per launcher.
Dep | deployment/slurm.py | SlurmDeployment; classic SLURM path; routes to slurm_multi when launcher matches.
Dep | deployment/slurm_node_selector.py | SlurmNodeSelector health/cleanup srun, supports reservation.
Dep | deployment/primus_backend.py | Primus YAML / backend selection.
Dep | deployment/common.py | Shared deployment helpers, slurm_multi wrapper assembly.
Dep | deployment/config_loader.py | Loads and deep-merges preset JSON with user config.
Dep | deployment/presets/{k8s,slurm}/defaults.json | Default values auto-merged with minimal user configs.
Dep | deployment/templates/{kubernetes,slurm}/ | Jinja2 templates per launcher.
Exec | execution/container_runner.py | ContainerRunner: local docker run, env injection (MAD_GUEST_OS, MAD_OUTPUT_CSV), tools wiring, perf parsing.
Exec | execution/container_runner_helpers.py | Log error pattern scan, timeout resolution.
Exec | execution/docker_builder.py | DockerBuilder: build args (incl. MAD_SYSTEM_GPU_ARCHITECTURE), push/tag, shell-quoted everywhere.
Exec | execution/dockerfile_utils.py | Dockerfile parsing helpers.
Core | core/context.py | Context: ast.literal_eval parse, system detect, GPU vendor/arch, ROCm path; guards against None kfd_renderDs entries on restricted ROCm.
Core | core/additional_context_defaults.py | Default values merged into context.
Core | core/console.py | Console: Rich-backed shell wrapper, live output mode.
Core | core/docker.py | Docker wrapper; shlex.quote() on every interpolation.
Core | core/errors.py | MADEngineError + 10 typed errors; create_error_context; Rich panels.
Core | core/auth.py | load_credentials(), login_to_registry() (uses --password-stdin + MAD_REGISTRY_PASSWORD env).
CLI | cli/constants.py | ExitCode enum, DEFAULT_MANIFEST_FILE, DEFAULT_PERF_OUTPUT, DEFAULT_TIMEOUT=-1.
CLI | cli/validators.py | Argument validation: validate_additional_context(), create_args_namespace().
Orch | orchestration/build_orchestrator.py | BuildOrchestrator.execute(): discover → context → build → registry gate → manifest. slurm_multi use-image / build-on-compute paths.
Orch | orchestration/run_orchestrator.py | RunOrchestrator.execute(): manifest loading, target inference, script copy/cleanup, local/distributed dispatch.
Orch | orchestration/image_filtering.py | Filters manifest entries by GPU vendor, GPU arch, skip_gpu_arch field.
Dep | deployment/factory.py | DeploymentFactory.create(). Registers SlurmDeployment + KubernetesDeployment. UserWarning if kubernetes package missing.
Dep | deployment/base.py | BaseDeployment (Template Method), DeploymentConfig, DeploymentResult (incl. skip_monitoring), DeploymentStatus, PERFORMANCE_LOG_PATTERN.
Dep | deployment/kubernetes.py | KubernetesDeployment: composes 6 mixins, orchestrates K8s job lifecycle.
Dep | deployment/k8s_pvc.py | PVC creation/deletion, storage-class fallback chain.
Dep | deployment/k8s_results.py | Log/artifact collection, perf aggregation, collector_pod_name().
Dep | deployment/k8s_scripts.py | Script extraction, ConfigMap building (rocenv_mode, guest_os).
Dep | deployment/k8s_template_context.py | Assembles Jinja2 template context for K8s jobs.
Dep | deployment/k8s_secrets.py | secrets dict → K8s Secret objects.
Dep | deployment/k8s_names.py | Name truncation/sanitization helpers for K8s resource names.
Dep | deployment/kubernetes_launcher_mixin.py | Selects Jinja2 template per launcher; sets MAD_MULTI_NODE_RUNNER for K8s pods.
Dep | deployment/slurm.py | SlurmDeployment: template prep, sbatch submit, bash-in-salloc, slurm_multi dispatch, monitoring, results collection.
Dep | deployment/slurm_node_selector.py | SlurmNodeSelector: health/cleanup srun, reservation parameter, node preflight.
Dep | deployment/common.py | Shared helpers: VALID_LAUNCHERS, slurm_multi wrapper assembly, launcher normalization.
Dep | deployment/config_loader.py | ConfigLoader: deep-merge, preset loading, target inference. env_vars merged recursively (not replaced).
Dep | deployment/primus_backend.py | Primus YAML / backend selection helper.
Dep | deployment/presets/slurm/defaults.json | SLURM base preset.
Dep | deployment/presets/slurm/profiles/ | single-node.json, multi-node.json.
Dep | deployment/presets/k8s/defaults.json | K8s base preset.
Dep | deployment/presets/k8s/gpu-vendors/ | amd.json, nvidia.json, amd-multi-gpu.json.
Dep | deployment/presets/k8s/profiles/ | single-gpu.json, multi-gpu.json, multi-node.json.
Dep | deployment/templates/slurm/job.sh.j2 | Main sbatch template (~822 lines). Sets all SLURM env vars, runs srun task scripts.
Dep | deployment/templates/kubernetes/ | K8s YAML templates: configmap.yaml.j2, job.yaml.j2, pvc.yaml.j2, pvc-data.yaml.j2, service.yaml.j2.
Exec | execution/container_runner.py | ContainerRunner: local docker run, AMD/NVIDIA run options, env injection, tools, perf parsing, _run_self_managed(), _generate_local_launcher_command().
Exec | execution/container_runner_helpers.py | Log error pattern scan, resolve_run_timeout(), make_run_log_file_path().
Exec | execution/docker_builder.py | DockerBuilder: build args, --build-context tools= (conditional), registry push, DOCKER_IMAGE_NAME injection into manifest.
Exec | execution/dockerfile_utils.py | Dockerfile parsing: GPU vendor from filename + FROM line.
Core | core/context.py | Context: ast.literal_eval parse, GPU vendor/arch detection, ROCm path resolution, MAD_SECRETS* propagation, renderD mapping.
Core | core/additional_context_defaults.py | Default values merged before user context: DEFAULT_GPU_VENDOR="AMD", DEFAULT_GUEST_OS="UBUNTU".
Core | core/console.py | Console: Rich-backed shell executor, live output, timeout, secret=True for credential commands.
Core | core/docker.py | Docker wrapper: shlex.quote() on every interpolation, auto stop/remove on __del__.
Core | core/errors.py | 10-type error hierarchy, ErrorCategory, ErrorContext, ErrorHandler, Rich panel display.
Core | core/auth.py | load_credentials(), login_to_registry() using --password-stdin + MAD_REGISTRY_PASSWORD.
Core | core/timeout.py | Timeout context manager; guards signal.alarm(None) when seconds is 0/None.
Core | core/constants.py | Misc core constants.
Core | core/dataprovider.py | Data: local / NAS / S3 / MinIO abstraction.
Util | utils/discover_models.py | DiscoverModels: root, dir, or dynamic discovery; scoped vs unscoped tags.
Util | utils/gpu_tool_factory.py | Returns AMD or NVIDIA tool manager based on vendor.
Util | utils/gpu_tool_manager.py | Abstract GPU tool manager interface.
Util | utils/rocm_tool_manager.py | AMD/ROCm implementation.
Util | utils/nvidia_tool_manager.py | NVIDIA implementation.
Util | utils/gpu_validator.py | ROCm install detection, GPU vendor detection.
Util | utils/gpu_config.py | GPU configuration helpers.
Util | utils/rocm_path_resolver.py | Host/in-container ROCm root resolver.
Util | utils/therock_markers.py | Shared TheRock detection markers.
Util | utils/config_parser.py | ConfigParser: parses additional context + tools config.
Util | utils/path_utils.py | Path helpers.
Core | core/dataprovider.py | Data abstraction: local / NAS / S3 / MinIO.
Util | utils/discover_models.py | DiscoverModels: root, dir, dynamic discovery; scoped vs unscoped tags; CustomModel dataclass.
Util | utils/gpu_tool_factory.py | Singleton get_gpu_tool_manager(vendor, rocm_path); auto-detects vendor.
Util | utils/gpu_validator.py | GPUVendor enum, ROCmValidator, NVIDIAValidator, GPUValidationResult.
Util | utils/rocm_path_resolver.py | Host + in-container ROCm path resolution chains.
Util | utils/therock_markers.py | Shared TheRock detection markers (rocm-sdk, layout probes).
Util | utils/config_parser.py | ConfigParser: 5-level config file resolution, CSV/JSON/YAML loading, multi-row result matching.
Util | utils/session_tracker.py | Session start/marker tracking.
Util | utils/ops.py | Misc operations.
Util | utils/log_formatting.py | Log formatting helpers.
Util | utils/run_details.py | Run metadata helpers.
Rep | reporting/update_perf_csv.py | Writes/appends to perf.csv and perf_entry.csv.
Rep | reporting/csv_to_html.py | HTML report generation.
Rep | reporting/update_perf_csv.py | Writes/appends perf.csv and perf_entry.csv. PERF_CSV_HEADER (28 columns).
Rep | reporting/csv_to_html.py | HTML performance report generation.
Rep | reporting/csv_to_email.py | Email-friendly consolidated report.
Rep | reporting/update_perf_super.py | Superset-shaped perf rollups.
DB | database/mongodb.py | MongoDB connection + insert; uses datetime.now(timezone.utc).
DB | database/mongodb.py | MongoDBConfig.from_env(), UploadOptions, UploadResult; upsert + batch upload.
Scripts | scripts/common/pre_scripts/rocEnvTool/ | rocenv_tool.py, csv_parser.py, console.py: TheRock-compatible env capture (lite + full modes).
Scripts | scripts/common/tools/ | GPU info profilers, amd_smi / rocm_smi utils, rtl_trace wrapper, library tracers.
Scripts | scripts/common/tools/ | GPU info profilers, amd_smi / rocm_smi utils, rtl_trace wrapper, library tracers (rocblas, miopen, rccl, tensile).
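The config_loader.py entry says presets are deep-merged with user config and that env_vars is merged recursively rather than replaced. A generic recursive merge with that property might look like the sketch below; the function name and the example keys are hypothetical, and the actual ConfigLoader may handle lists or special keys differently.

```python
import copy
from typing import Any, Dict


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict where nested dicts are merged key by key."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested dicts (for example env_vars) are combined, not replaced.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged
```

With this approach a user config that sets a single env_vars entry keeps every other env_vars entry supplied by the preset, and neither input dict is mutated.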
- +

Test layout

unit/

-

Fast, isolated, mocked. ~28 modules including test_slurm_multi.py, test_shell_quoting.py, test_error_handling.py, test_k8s.py, test_rocm_path.py, test_validators.py.

+

Fast, isolated, mocked. Key files: test_slurm_multi.py, test_shell_quoting.py, test_error_handling.py, test_k8s.py, test_rocm_path.py, test_validators.py, test_deployment.py, test_container_runner.py.

integration/

Real Docker / GPU / platform calls. Includes test_docker_integration.py, test_container_execution.py, test_gpu_management.py, test_orchestrator_workflows.py, test_profiling_tools_config.py.

e2e/

Full workflows: test_build_workflows.py, test_run_workflows.py, test_profiling_workflows.py, test_data_workflows.py, test_execution_features.py, test_scripting_workflows.py.

-

Pytest config lives solely in [tool.pytest.ini_options] in pyproject.toml (minversion=7.0). Markers: unit, integration, e2e, slow, gpu, amd, nvidia, cpu, requires_docker, requires_models.

+ + + + + + + + + + + + + +
Marker | What it selects
unit | Fast unit tests with no external deps
integration | Tests requiring Docker / real GPU calls
e2e | Full end-to-end workflow tests
slow | Long-running tests
gpu | Requires GPU hardware
amd / nvidia | Vendor-specific tests
cpu | CPU-only tests
requires_docker | Tests requiring Docker daemon
requires_models | Tests requiring model files to be present
+

Pytest config lives solely in [tool.pytest.ini_options] in pyproject.toml (minversion=7.0).

- +

Contributing & code style

+
+
+

Style rules

+
    +
  • Formatting: Black (line-length 88), targets py3.8–py3.11
  • +
  • Imports: isort with profile="black"; first-party = madengine
  • +
  • Lint: flake8 + mypy (strict equality, warn unused) + bandit (skips B101)
  • +
  • Docstrings: Google style; type hints required for public functions
  • +
  • Conventional commits: feat:, fix:, docs:, test:, refactor:, style:, perf:, chore:
  • +
+
+
+

Security rules

    -
  • Formatting: Black (line-length 88), targets py38–py311.
  • -
  • Imports: isort with profile = "black"; first-party = madengine.
  • -
  • Lint: flake8 + mypy (strict equality, warn unused, etc.) + bandit (skips B101).
  • -
  • Docstrings: Google style; type hints for public functions.
  • -
  • Conventional commits: feat:, fix:, docs:, test:, refactor:, style:, perf:, chore:.
  • -
  • Pre-commit: pip install pre-commit && pre-commit install.
  • +
  • Use shlex.quote() on every shell interpolation of user-controlled values (image names, paths, container names, build-args)
  • +
  • Pass registry passwords via --password-stdin, never as command-line arguments; the password is supplied through the MAD_REGISTRY_PASSWORD environment variable
  • +
  • Credential JSON must be a dict object — validated at load time (ConfigurationError on wrong type)
  • +
  • MIOPEN_USER_DB_PATH is filtered from deployment_config to prevent leaking temp paths
  • +
  • Never log secret values — log keys only
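Two of the rules above, quoting interpolated values and keeping passwords off the command line, can be illustrated with the standard library alone. This is a hedged sketch rather than madengine's own code: both function names are hypothetical, and the injectable run parameter exists only so the example can be exercised without a Docker daemon.

```python
import shlex
import subprocess
from typing import Callable


def docker_run_command(image: str, container_name: str) -> str:
    """Build a shell command string with every user value quoted."""
    return "docker run --rm --name {} {}".format(
        shlex.quote(container_name), shlex.quote(image)
    )


def registry_login(
    registry: str,
    username: str,
    password: str,
    run: Callable = subprocess.run,
):
    """Log in with the password sent on stdin, so it never appears in argv."""
    argv = ["docker", "login", registry, "--username", username, "--password-stdin"]
    return run(argv, input=password, text=True, capture_output=True, check=False)
```

A container name such as x; rm -rf / comes out of docker_run_command as a single quoted argument, so the shell never interprets the semicolon.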
+
+
- +
-

Recent notable changes

+

Changelog

-[Unreleased] — slurm_multi launcher +[2.1.0] — 2026-05-28 +

Added

    -
  • New slurm_multi SLURM launcher; slurm-multi alias accepted.
  • -
  • madengine build --use-image [IMAGE|auto] and --build-on-compute.
  • -
  • Build registry gate with structured ConfigurationError.
  • -
  • bash-in-salloc execution path when SLURM_JOB_ID is already set.
  • -
  • DeploymentResult.skip_monitoring for synchronous deploys.
  • -
  • SlurmNodeSelector accepts a reservation parameter.
  • -
  • perf.csv aggregation into cwd so the default reporter sees per-job rows.
  • -
  • Contract tests + minimal example config.
  • +
  • slurm_multi self-managed SLURM launcher (PRs #130, #126): alias slurm-multi, parallel docker pull, bash-in-salloc path, _run_self_managed() for local mode
  • +
  • madengine build --use-image [IMAGE|auto] — skip local build
  • +
  • madengine build --build-on-compute — build on compute node + push
  • +
  • slurm_multi registry gate with structured ConfigurationError
  • +
  • DeploymentResult.skip_monitoring for synchronous deploy paths
  • +
  • SlurmNodeSelector.reservation parameter
  • +
  • DockerBuilder: --build-context tools= (conditional on dir existence, PR #131 + #134)
  • +
  • Local MAD_MULTI_NODE_RUNNER via ContainerRunner._generate_local_launcher_command() (PR #126)
  • +
  • Model card distributed/slurm auto-merged into manifest deployment_config
  • +
  • DOCKER_IMAGE_NAME injection into manifest env_vars after successful registry push
  • +
+

Changed

+
    +
  • SLURM env-var escaping: double-quote instead of shlex.quote to preserve spaces/paths (PR #134)
  • +
  • Early DiscoverModels result cached and reused for actual build (no duplicate get_models_json.py runs)
  • +
  • E2E test cleanup defaults include build_manifest.json + perf artefacts
+
-[2.0.3] — rocEnvTool full mode, K8s refactor, security +[2.0.3] — 2026-05-26
    -
  • K8s monolith decomposed into k8s_pvc/k8s_results/k8s_scripts/k8s_template_context mixins.
  • -
  • rocEnvTool "full" mode (lshw, dmidecode, dmesg, modinfo) with guest_os-native installers.
  • -
  • Generic storage_class fallback added; default preset now nfs-banff.
  • -
  • rocm_trace_lite_default tool (RTL default mode).
  • -
  • Security: shlex.quote() on every shell interpolation in core/docker.py, container_runner.py, docker_builder.py, run_orchestrator.py.
  • -
  • Collector pod name mismatch fix (truncated collector-{id[:15]} shared helper).
  • -
  • RPD pre-script: xxd install + sudo/root branch fixes.
  • -
  • CANCELLED added to terminal-state set so scancel'd jobs don't loop forever.
  • -
  • Context guards against None kfd_renderDs on restricted ROCm.
  • +
  • rocEnvTool "full" mode (lshw, dmidecode, dmesg, modinfo)
  • +
  • K8s monolith decomposed into 6 focused mixin modules
  • +
  • Generic storage_class fallback; default preset nfs-banff
  • +
  • rocm_trace_lite_default tool (RTL default mode)
  • +
  • Security: shlex.quote() on every shell interpolation
  • +
  • Collector pod name mismatch fix (shared collector_pod_name() helper)
  • +
  • CANCELLED added to terminal-state set
  • +
  • Local MAD_MULTI_NODE_RUNNER for Docker local (_generate_local_launcher_command())
+
-[2.0.2] / [2.0.1] — credential validation, ROCm auto-detect, GPU arch +[2.0.2] / [2.0.1]
    -
  • load_credentials() validates JSON object type, raises ConfigurationError.
  • -
  • Host ROCm auto-detection via priority chain; in-container ROCm resolved independently.
  • -
  • TheRock layout support (rocm-sdk + markers).
  • -
  • GPU arch auto-detection injected into Docker build args for full-run mode.
  • -
  • Model discovery: scope-based tag selection replaces strict flag.
  • -
  • Shared login_to_registry, centralised credential loading.
  • -
  • Registry password via env + --password-stdin (no more /proc exposure).
  • -
  • Unified PERFORMANCE_LOG_PATTERN across local + deployment paths.
  • +
  • Host ROCm auto-detection via priority chain; in-container ROCm resolved independently
  • +
  • TheRock (rocm-sdk) layout support
  • +
  • GPU arch auto-detection injected into Docker build args
  • +
  • Model discovery: scope-based tag selection replaces strict flag
  • +
  • Registry password via --password-stdin + env var
  • +
  • credential.json type validation
  • +
  • Unified PERFORMANCE_LOG_PATTERN across local + deployment paths
  • +
  • Run-phase host/container env table printed at startup
+
-[2.0.0] — Complete rewrite +[2.0.0] — 2026-04-09 — Complete rewrite
    -
  • Unified madengine CLI; legacy mad-* removed.
  • -
  • 5-layer architecture (CLI / Orchestration / Deployment / Execution / Core).
  • -
  • Multi-target deployment via factory + presets + Jinja2 templates.
  • -
  • Launcher mixin with torchrun / DeepSpeed / Megatron-LM / TorchTitan / Primus / vLLM / SGLang.
  • -
  • Log error pattern scanning; --skip-model-run; batch build manifest.
  • -
  • SLURM nodelist pinning; K8s Secrets management.
  • -
  • Structured errors (10 types) with Rich panels; fixed exit codes.
  • -
  • RuntimeError renamed to ExecutionError (alias preserved).
  • +
  • Unified madengine CLI; legacy mad-* removed
  • +
  • 5-layer architecture (CLI / Orchestration / Deployment / Execution / Core)
  • +
  • Factory + Template Method patterns; DeploymentFactory, BaseDeployment, ConfigLoader
  • +
  • Multi-target deployment: presets + Jinja2 templates per launcher
  • +
  • Launcher matrix: torchrun / DeepSpeed / Megatron / TorchTitan / Primus / vLLM / SGLang
  • +
  • Log error pattern scanning; --skip-model-run; batch build manifest
  • +
  • Structured errors (10 types) with Rich panels; fixed exit codes
  • +
  • SLURM nodelist pinning; K8s Secrets management; data provider abstraction
+ - - - + \ No newline at end of file