ztxtech
diff --git a/README.md b/README.md
Lines changed: 5 additions & 0 deletions
diff --git a/docs_src/user-manual.zh.md b/docs_src/user-manual.zh.md
Lines changed: 174 additions & 0 deletions
diff --git a/examples/template_library/README.md b/examples/template_library/README.md
Lines changed: 1 addition & 0 deletions
diff --git a/examples/template_library/basics/exp_fn_contract_matrix.py b/examples/template_library/basics/exp_fn_contract_matrix.py
Lines changed: 90 additions & 0 deletions
diff --git a/mkdocs.yml b/mkdocs.yml
Lines changed: 1 addition & 0 deletions
diff --git a/tests/test_docs_user_manual.py b/tests/test_docs_user_manual.py
Lines changed: 24 additions & 0 deletions
diff --git a/tests/test_templates_smoke.py b/tests/test_templates_smoke.py
Lines changed: 44 additions & 1 deletion
@@ -189,6 +189,11 @@ analyzer.to_csv("./results_demo/summary.csv", sort_by=["model", "lr"])
 2. Automatically generate the home page `index.md` and the `reference/` API pages;
 3. `mkdocstrings` renders parameters, return values, and examples from class/function docstrings.
 
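+Read the user manual first, then consult the API reference:
+
+- [User manual (development workflow and artifact protocol)](user-manual.zh.md)
+- [API reference (function and type signatures)](reference/)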
+
 Local entry points:
 
 - Generation script: [`scripts/gen_ref_pages.py`](https://github.com/ztxtech/ztxexp/blob/main/scripts/gen_ref_pages.py)
 
@@ -0,0 +1,174 @@
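+# User Manual (Chinese edition)
+
+This manual is for users who **develop experiments directly with ztxexp**, not for those who only need to look up API parameters.  
+If you only need function signatures, see the API reference. If you want to know how to set up an experiment end to end, how to organize its artifacts, and how to troubleshoot failures, follow this manual.
+
+## 1. Start with the `exp_fn` contract
+
+A single-run experiment function in `ztxexp` has a fixed signature: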
+
+```python
+def exp_fn(ctx: RunContext) -> dict | None:
+    ...
+```
+
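+### 1.1 The most important fields on `ctx`
+
+- `ctx.run_id`: the unique ID of this run (usually also the directory name).
+- `ctx.run_dir`: the directory path of this run.
+- `ctx.config`: the current config dict (already final; there is no need to parse it again from argparse).
+- `ctx.logger`: the logger dedicated to this run; it writes to `run.log`.
+- `ctx.meta`: run metadata (experiment name, group, tags, seed, captured environment information, and so on).
+
+### 1.2 `ctx.log_metric(...)` records in-progress metrics
+
+To record step-level curves (for example, loss/acc for each epoch), use: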
+
+```python
+ctx.log_metric(step=1, metrics={"loss": 0.92, "acc": 0.71}, split="train", phase="fit")
+```
+
+This appends to `metrics.jsonl` (one event per line) and triggers the `on_metric` callback of every registered tracker.
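The one-event-per-line layout of `metrics.jsonl` can be illustrated with a minimal, self-contained sketch. It does not use ztxexp: the helper names `append_metric` and `read_metrics`, and the exact keys stored in each event, are assumptions made for illustration and may differ from what the framework actually writes.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path


def append_metric(path: Path, step: int, metrics: dict, split: str, phase: str) -> None:
    """Append one metric event as a single JSON line (hypothetical event schema)."""
    event = {"step": step, "metrics": metrics, "split": split, "phase": phase}
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(event, ensure_ascii=False) + "\n")


def read_metrics(path: Path) -> list[dict]:
    """Read every non-empty line back into a list of event dicts."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    jsonl_path = Path(tmp) / "metrics.jsonl"
    append_metric(jsonl_path, 1, {"loss": 0.92, "acc": 0.71}, "train", "fit")
    append_metric(jsonl_path, 2, {"loss": 0.78, "acc": 0.75}, "train", "fit")
    events = read_metrics(jsonl_path)

# Rebuild a step-level loss curve from the event stream.
loss_curve = [(e["step"], e["metrics"]["loss"]) for e in events]
print(loss_curve)
```

Because each line is an independent JSON document, a run that crashes mid-training still leaves every previously written event readable.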
+
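+## 2. Return values and status matrix (返回值与状态矩阵): what decides whether a run succeeds
+
+`exp_fn` may only return `dict | None`. Each return value or exception maps to the following behavior:
+
+| Scenario | What you do in `exp_fn` | Run status | Key artifacts |
+| --- | --- | --- | --- |
+| Return final metrics | `return {"score": 0.93}` | `succeeded` | `metrics.json` |
+| Process curves only | `ctx.log_metric(...); return None` | `succeeded` | `metrics.jsonl` (no `metrics.json`) |
+| Deliberate skip | `raise SkipRun("reason")` | `skipped` | `run.json` + `events.jsonl` (skip event) |
+| Business exception | Raise an exception (e.g. `RuntimeError`) | `failed` | `error.log` + `run.json.error_*` |
+| Invalid return value | `return 123` or any other value that is neither a `dict` nor `None` | `failed` | `error.log` (`TypeError`) |
+
+### 2.1 Key decision rules
+
+- Success is determined solely by `run.json.status == "succeeded"`.
+- The legacy `_SUCCESS` file is no longer used.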
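The matrix above can be restated as a small decision function. This is a sketch under stated assumptions: `SkipRun` here is a local stand-in class rather than the one exported by ztxexp, and `classify_outcome` is a hypothetical helper, not the framework's runner.

```python
from __future__ import annotations


class SkipRun(Exception):
    """Local stand-in for ztxexp's SkipRun so the sketch runs without the library."""


def classify_outcome(exp_fn, ctx=None) -> tuple[str, str | None]:
    """Return (status, error_type) following the manual's status matrix."""
    try:
        result = exp_fn(ctx)
    except SkipRun:
        return "skipped", None
    except Exception as exc:  # any other exception marks the run as failed
        return "failed", type(exc).__name__
    if result is None or isinstance(result, dict):
        return "succeeded", None
    # A value that is neither a dict nor None violates the contract.
    return "failed", "TypeError"


def skip(ctx):
    raise SkipRun("reason")


def boom(ctx):
    raise RuntimeError("boom")


outcomes = {
    "dict": classify_outcome(lambda ctx: {"score": 0.93}),
    "none": classify_outcome(lambda ctx: None),
    "skip": classify_outcome(skip),
    "error": classify_outcome(boom),
    "bad_return": classify_outcome(lambda ctx: 123),
}
for name, outcome in outcomes.items():
    print(name, outcome)
```

Catching `SkipRun` before the generic `Exception` handler is what keeps a deliberate skip from being reported as a failure.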
+
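+## 3. Artifact protocol matrix (产物协议矩阵): what to write and where
+
+Every run directory follows the v2 protocol. The core layout is:
+
+```text
+<results_root>/<run_id>/
+  config.json
+  run.json
+  meta.json
+  metrics.json            # optional
+  metrics.jsonl           # optional
+  events.jsonl            # optional
+  artifacts/
+  checkpoints/
+  run.log
+  error.log               # on failure
+```
+
+| Artifact | Written by | When it appears | Required/optional | Notes |
+| --- | --- | --- | --- | --- |
+| `config.json` | Framework | At run start | Required | Snapshot of the final config for this run. |
+| `run.json` | Framework | Created at start, backfilled at finish | Required | State-machine file with `status/start/finish/error` and related fields. |
+| `meta.json` | Framework | Written at start; may be updated on retry | Required (v0.4+) | Metadata for reproducibility and governance. |
+| `metrics.json` | Framework | When `exp_fn` returns a `dict` | Optional | Final metrics snapshot; suited to ranking and aggregation. |
+| `metrics.jsonl` | Framework | After `ctx.log_metric` is called | Optional | Step-level time-series metrics. |
+| `events.jsonl` | Framework | During the run lifecycle | Optional | Event stream: `start/retry/skip/error/end`. |
+| `artifacts/` | User (framework creates the directory) | Directory created at run start | Required directory | Put all business files here (models, charts, reports, and so on). |
+| `checkpoints/` | User (framework creates the directory) | Directory created at run start | Required directory | Recommended location for resume-from-checkpoint files. |
+| `error.log` | Framework | When the run fails | Optional | Failure traceback; the first place to look when troubleshooting. |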
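The required/optional column can be turned into a quick directory check. The sketch below is not part of ztxexp: `missing_required` is a hypothetical helper, and the required set is copied from the table above.

```python
from __future__ import annotations

import tempfile
from pathlib import Path

# Entries marked as required in the artifact protocol table.
REQUIRED_FILES = ("config.json", "run.json", "meta.json")
REQUIRED_DIRS = ("artifacts", "checkpoints")


def missing_required(run_dir: Path) -> list[str]:
    """List the required files and directories that are absent from run_dir."""
    missing = [name for name in REQUIRED_FILES if not (run_dir / name).is_file()]
    missing += [name + "/" for name in REQUIRED_DIRS if not (run_dir / name).is_dir()]
    return missing


with tempfile.TemporaryDirectory() as tmp:
    # Build a deliberately incomplete run directory for the demonstration.
    run_dir = Path(tmp) / "demo_run"
    (run_dir / "artifacts").mkdir(parents=True)
    for name in ("config.json", "run.json"):
        (run_dir / name).write_text("{}", encoding="utf-8")
    report = missing_required(run_dir)

print(report)
```

An empty list means the directory has every required entry; optional files such as `metrics.json` are intentionally not checked.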
+
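+## 4. How final metrics, process metrics, and business artifacts divide the work
+
+- Final metrics (for comparison across runs): `return dict`; saved automatically to `metrics.json`.
+- Process metrics (for plotting curves and diagnosis): `ctx.log_metric(...)`; saved to `metrics.jsonl`.
+- Business artifacts (models, logs, charts, prediction samples): write them manually to `artifacts/`.
+- Checkpoints (for resuming training): write them to `checkpoints/`.
+
+Recommended practice:
+
+1. Keep only key summary metrics in `metrics.json` (e.g. `best_val_f1`, `test_acc`).
+2. Record the fine-grained training process in `metrics.jsonl` (every step/epoch).
+3. Put all large files and intermediate outputs in `artifacts/` or `checkpoints/`; do not clutter the run root directory.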
+
+## 5. User development workflow (from zero to analyzable results)
+
+### 5.1 Build the configuration
+
+Use `ExperimentPipeline` or `ExpManager` to build the parameter space:
+
+```python
+from ztxexp import ExperimentPipeline
+
+pipeline = (
+    ExperimentPipeline("./results_demo", base_config={"seed": 42})
+    .grid({"lr": [1e-3, 1e-2]})
+    .variants([{"model": "tiny"}, {"model": "base"}])
+    .exclude_completed()
+)
+```
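How `.grid(...)` and `.variants(...)` combine is not spelled out in this section. One plausible reading, and it is only an assumption, is that every grid point is merged with every variant and with `base_config`, which would give 2 x 2 = 4 configs for the pipeline above. The stand-alone sketch below illustrates that reading with `itertools.product`; `expand_configs` is a hypothetical helper, not a ztxexp function.

```python
from __future__ import annotations

from itertools import product


def expand_configs(base: dict, grid: dict[str, list], variants: list[dict]) -> list[dict]:
    """Merge base config, each grid point, and each variant (assumed semantics)."""
    keys = list(grid)
    configs = []
    for values in product(*(grid[key] for key in keys)):
        point = dict(zip(keys, values))
        for variant in variants:
            # Later entries win on key collisions: base < grid point < variant.
            configs.append({**base, **point, **variant})
    return configs


configs = expand_configs(
    base={"seed": 42},
    grid={"lr": [1e-3, 1e-2]},
    variants=[{"model": "tiny"}, {"model": "base"}],
)
for cfg in configs:
    print(cfg)
```

If the real pipeline combines these differently, the number of runs in `results_demo` will differ from this sketch, so the actual behavior is worth confirming in the API reference.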
+
+### 5.2 Write `exp_fn`
+
+```python
+from pathlib import Path
+from ztxexp import RunContext
+
+
+def exp_fn(ctx: RunContext) -> dict | None:
+    lr = float(ctx.config["lr"])
+    model = str(ctx.config["model"])
+
+    # Process metrics
+    ctx.log_metric(step=1, metrics={"loss": 0.8}, split="train", phase="fit")
+
+    # Business artifacts
+    artifact = Path(ctx.run_dir) / "artifacts" / "summary.txt"
+    artifact.write_text(f"run={ctx.run_id}, model={model}, lr={lr}\n", encoding="utf-8")
+
+    # Final metrics
+    return {"score": round(1.0 - lr, 4)}
+```
+
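+### 5.3 Choose an execution mode
+
+- `sequential`: get correctness right first, then scale out concurrency.
+- `process_pool`: the first choice for CPU-bound tasks.
+- `joblib`: use when you need compatibility with the joblib ecosystem.
+- `dynamic`: experimental feature; submits tasks dynamically based on a CPU threshold.
+
+### 5.4 Analyze and clean up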
+
+```python
+from ztxexp import ResultAnalyzer
+
+analyzer = ResultAnalyzer("./results_demo")
+df = analyzer.to_dataframe(statuses=("succeeded",))
+curve_df = analyzer.to_curve_dataframe(metric_key="loss")
+```
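For a quick look at results without `ResultAnalyzer`, the protocol from section 3 is enough to go on. The sketch below is a stand-in built on stated assumptions, not a description of how `ResultAnalyzer` is implemented: `collect_final_metrics` is a hypothetical helper that relies only on the `status` field of `run.json` and on `metrics.json` holding the dict returned by `exp_fn`.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path


def collect_final_metrics(results_root: Path) -> dict[str, dict]:
    """Map run_id -> final metrics for runs whose status is 'succeeded'."""
    rows = {}
    for run_dir in sorted(p for p in results_root.iterdir() if p.is_dir()):
        run_file = run_dir / "run.json"
        metrics_file = run_dir / "metrics.json"
        if not run_file.is_file():
            continue
        status = json.loads(run_file.read_text(encoding="utf-8")).get("status")
        # Stream-only runs succeed without metrics.json, so they are skipped here.
        if status == "succeeded" and metrics_file.is_file():
            rows[run_dir.name] = json.loads(metrics_file.read_text(encoding="utf-8"))
    return rows


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    fake_runs = [
        ("run_a", "succeeded", {"score": 0.93}),
        ("run_b", "failed", None),
        ("run_c", "succeeded", None),  # returned None: no metrics.json
    ]
    for run_id, status, metrics in fake_runs:
        run_dir = root / run_id
        run_dir.mkdir()
        (run_dir / "run.json").write_text(json.dumps({"status": status}), encoding="utf-8")
        if metrics is not None:
            (run_dir / "metrics.json").write_text(json.dumps(metrics), encoding="utf-8")
    table = collect_final_metrics(root)

print(table)
```

The `run_c` case shows why a summary table can come back empty even when every run succeeded: only runs that returned a `dict` have final metrics.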
+
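+## 6. Debugging and troubleshooting checklist
+
+1. Check `run.json.status` first, then `error.log`.
+2. If results are empty, check whether `exp_fn` returned a `dict`, whether the run raised `SkipRun`, and whether a filter condition excluded it.
+3. If curves are missing, check whether `ctx.log_metric` was called.
+4. If `exclude_completed` misbehaves, check whether the historical directories follow the v2 protocol and whether their success status is `succeeded`.
+5. For failures under parallel execution, reproduce with `sequential` first, then return to the parallel mode.
+
+## 7. Copy and adapt: start with these templates
+
+- Contract matrix template (the companion to this manual):
+  - `examples/template_library/basics/exp_fn_contract_matrix.py`
+- Template library entry points:
+  - [Example template library navigation](examples-lib/index.md)
+  - [Template index table](examples-lib/catalog.md)
+  - [Scenario copy matrix](examples-lib/matrix.md)
+
+## 8. Relationship to the API reference
+
+- User manual (this page): answers "how do I develop a complete experiment with ztxexp".
+- API reference: answers "what are the parameters, return value, and signature of a given class or function".
+
+Suggested reading order:
+
+1. Read this manual first and get a first runnable experiment working;
+2. Then jump to the API pages as needed for parameter details.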
@@ -9,6 +9,7 @@
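 | `analysis/dataframe_csv_export.py` | DataFrame + CSV export | Aggregate run directories into a table and export it as CSV. |
 | `analysis/leaderboard_comparison.py` | Leaderboard comparison template | Quickly generate a Top-K config list for version reviews. |
 | `analysis/pivot_excel_report.py` | Pivot-table Excel report | Output a readable pivot-table report by model/hyperparameter dimension. |
+| `basics/exp_fn_contract_matrix.py` | `exp_fn` contract matrix template | Demonstrates in one place the four outcomes (return dict / return None / SkipRun / exception failure) and how their artifacts differ. |
 | `basics/grid_and_variants.py` | Grid search + variant experiments | Traverse a hyperparameter grid and architecture variants together; suited to early ablation work. |
 | `basics/manager_runner_split.py` | Manager/runner decoupling | Use when you need to build configs first and then hand them to different runners. |
 | `basics/minimal_pipeline.py` | Minimal runnable experiment | Quickly validates the environment, the directory protocol, and the basic execution path. |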
 
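@@ -0,0 +1,90 @@
+"""`exp_fn` contract matrix template.
+
+Scenario:
+1. One template demonstrates all four key outcome paths of `exp_fn`: return dict, return None, SkipRun, and exception failure.
+2. Helps a team agree on what to return, where to write, and how status is determined.
+
+Input config fields:
+- `scenario`:
+  - `return_metrics`: return a final-metrics dict, which triggers `metrics.json`.
+  - `stream_only`: write only the step metric stream and return `None`.
+  - `skip`: skip deliberately (`SkipRun`); the run status is `skipped`.
+  - `fail`: raise an exception; the run status is `failed` and `error.log` is generated.
+- `lr`: example hyperparameter (optional).
+
+Artifact differences (determined by the framework protocol):
+- Every scenario has: `config.json`, `run.json`, `meta.json`, `events.jsonl`, `artifacts/`.
+- `return_metrics` additionally writes `metrics.json` (and `metrics.jsonl`, because it also calls `ctx.log_metric`).
+- `stream_only` writes `metrics.jsonl` and usually has no `metrics.json`.
+- `fail` writes `error.log`.
+
+Minimum changes after copying:
+1. Replace the placeholder metrics in `exp_fn` with real training/evaluation logic.
+2. Keep the `scenario` branches for local self-testing, or turn them into your own business branch conditions.
+3. Write your models, samples, and reports to `ctx.run_dir / "artifacts"`.
+"""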
+
+from __future__ import annotations
+
+import json
+from pathlib import Path
+
+from ztxexp import ExperimentPipeline, RunContext, SkipRun
+
+
+def exp_fn(ctx: RunContext) -> dict[str, float] | None:
+    """Demonstrate the four typical branches of the `exp_fn` contract."""
+    scenario = str(ctx.config.get("scenario", "return_metrics"))
+    lr = float(ctx.config.get("lr", 0.001))
+
+    artifact_payload = {
+        "run_id": ctx.run_id,
+        "scenario": scenario,
+        "config": ctx.config,
+        "note": "replace with your real experiment artifacts",
+    }
+    artifact_path = Path(ctx.run_dir) / "artifacts" / f"{scenario}.json"
+    artifact_path.write_text(
+        json.dumps(artifact_payload, ensure_ascii=False, indent=2),
+        encoding="utf-8",
+    )
+
+    if scenario == "return_metrics":
+        ctx.log_metric(step=1, metrics={"loss": 0.83}, split="train", phase="fit")
+        return {
+            "score": round(1.0 - lr, 4),
+            "best_val_loss": 0.71,
+        }
+
+    if scenario == "stream_only":
+        ctx.log_metric(step=1, metrics={"loss": 0.92}, split="train", phase="fit")
+        ctx.log_metric(step=2, metrics={"loss": 0.78}, split="train", phase="fit")
+        return None
+
+    if scenario == "skip":
+        raise SkipRun("Scenario skip: this config should be skipped by design.")
+
+    if scenario == "fail":
+        raise RuntimeError("Scenario fail: intentional failure for contract demonstration.")
+
+    raise ValueError(f"Unknown scenario: {scenario}")
+
+
+if __name__ == "__main__":
+    pipeline = (
+        ExperimentPipeline(
+            results_root="./results_templates/exp_fn_contract_matrix",
+            base_config={"seed": 42, "task": "exp_fn_contract_matrix"},
+        )
+        .variants(
+            [
+                {"scenario": "return_metrics", "lr": 0.001},
+                {"scenario": "stream_only", "lr": 0.005},
+                {"scenario": "skip", "lr": 0.01},
+                {"scenario": "fail", "lr": 0.02},
+            ]
+        )
+    )
+
+    summary = pipeline.run(exp_fn, mode="sequential")
+    print(summary)
@@ -28,6 +28,7 @@ theme:
 
 nav:
   - 首页: index.md
+  - 用户手册: user-manual.zh.md
   - 迁移指南: migration-v04.zh.md
   - 示例模板库: examples-lib/
   - API 参考: reference/
 
@@ -0,0 +1,24 @@
+from __future__ import annotations
+
+from pathlib import Path
+
+ROOT = Path(__file__).resolve().parents[1]
+
+
+def test_user_manual_contains_exp_fn_contract_sections():
+    manual_path = ROOT / "docs_src" / "user-manual.zh.md"
+    assert manual_path.exists()
+
+    content = manual_path.read_text(encoding="utf-8")
+    required_keywords = [
+        "exp_fn(ctx: RunContext) -> dict | None",
+        "返回值与状态矩阵",
+        "产物协议矩阵",
+        "SkipRun",
+        "ctx.log_metric",
+        "metrics.json",
+        "metrics.jsonl",
+        "error.log",
+    ]
+    for keyword in required_keywords:
+        assert keyword in content
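The loop above stops at the first missing keyword. If a complete list is more useful in the failure message, a small variation collects everything that is absent. This is only a sketch: `missing_keywords` is a hypothetical helper, and the sample text stands in for the manual's content.

```python
from __future__ import annotations


def missing_keywords(content: str, required: list[str]) -> list[str]:
    """Return the required keywords that do not occur in content, in order."""
    return [keyword for keyword in required if keyword not in content]


# A short stand-in for the manual text; the real test reads user-manual.zh.md.
sample = "def exp_fn(ctx: RunContext) -> dict | None:\n    ctx.log_metric(...)\n"
missing = missing_keywords(
    sample,
    ["exp_fn(ctx: RunContext) -> dict | None", "ctx.log_metric", "SkipRun"],
)
print(missing)
```

In a test this could be asserted as `assert not missing, f"missing keywords: {missing}"`, which reports all gaps in a single run.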
@@ -4,7 +4,9 @@
 import uuid
 from pathlib import Path
 
-from ztxexp import ExpRunner, RunContext, RunMetadata, utils
+import pytest
+
+from ztxexp import ExpRunner, RunContext, RunMetadata, SkipRun, utils
 
 ROOT = Path(__file__).resolve().parents[1]
 TEMPLATE_ROOT = ROOT / "examples" / "template_library"
@@ -84,3 +86,44 @@ def test_template_smoke_analysis_category(tmp_path, monkeypatch):
         module = _load_module(path)
         assert hasattr(module, "main")
         module.main()
+
+
+def test_exp_fn_contract_matrix_template(tmp_path):
+    path = TEMPLATE_ROOT / "basics" / "exp_fn_contract_matrix.py"
+    module = _load_module(path)
+    assert hasattr(module, "exp_fn")
+
+    ctx_metrics = _make_ctx(tmp_path, {"scenario": "return_metrics", "lr": 0.001})
+    try:
+        result_metrics = module.exp_fn(ctx_metrics)
+        assert isinstance(result_metrics, dict)
+        assert "score" in result_metrics
+        assert (ctx_metrics.run_dir / "artifacts" / "return_metrics.json").exists()
+    finally:
+        _close_ctx_logger(ctx_metrics)
+
+    ctx_stream = _make_ctx(tmp_path, {"scenario": "stream_only", "lr": 0.005})
+    try:
+        result_stream = module.exp_fn(ctx_stream)
+        assert result_stream is None
+        assert (ctx_stream.run_dir / "artifacts" / "stream_only.json").exists()
+        rows = utils.load_jsonl(ctx_stream.run_dir / "metrics.jsonl", skip_invalid=True)
+        assert len(rows) >= 2
+    finally:
+        _close_ctx_logger(ctx_stream)
+
+    ctx_skip = _make_ctx(tmp_path, {"scenario": "skip", "lr": 0.01})
+    try:
+        with pytest.raises(SkipRun):
+            module.exp_fn(ctx_skip)
+        assert (ctx_skip.run_dir / "artifacts" / "skip.json").exists()
+    finally:
+        _close_ctx_logger(ctx_skip)
+
+    ctx_fail = _make_ctx(tmp_path, {"scenario": "fail", "lr": 0.02})
+    try:
+        with pytest.raises(RuntimeError):
+            module.exp_fn(ctx_fail)
+        assert (ctx_fail.run_dir / "artifacts" / "fail.json").exists()
+    finally:
+        _close_ctx_logger(ctx_fail)