feat(examples): add inference service performance benchmark using OpenClaw + TAU2-bench #1175
Conversation
Add end-to-end performance benchmark for AReaL inference service using OpenClaw agent on TAU2-bench.

Key changes:

- Add batchmode benchmark scripts: sweep runner, trajectory collector, SGLang metrics collector, server launcher
- Add `openclaw_tau2` integration package: OpenClaw agent adapter for TAU2-bench with socket-based cross-process tool execution
- Add `_normalize_messages_for_chat_template()` to the `ArealOpenAI` client: content flattening + `tool_calls` arguments dict parsing for SGLang token sequence alignment
- Add unit tests for message normalization (15 cases)
Code Review
This pull request introduces a message normalization layer in the ArealOpenAI client to align with SGLang's chat template requirements, specifically addressing content flattening and tool call argument parsing. It also adds a comprehensive benchmarking suite for the inference service, including Slurm integration, metric collection scripts, and a tau2-bench integration package that utilizes a socket-based environment server for agent evaluation. Feedback focuses on several technical improvements: the normalization logic incorrectly handles dictionary content by converting it to a list of keys, and its in-place mutation of messages causes unintended side effects in the interaction cache. Additionally, the socket-based JSON receiver lacks robustness against stream fragmentation, the OpenClaw CLI retry logic fails to catch timeout exceptions, and the benchmark configuration should be updated to use the chat API instead of the legacy completions API to ensure compatibility with the new client logic.
```python
return {
    "baseUrl": self.llm_base_url or "http://127.0.0.1:30000",
    "apiKey": self.llm_api_key or "dummy",
    "api": "openai-completions",
```
The OpenClaw provider is configured to use openai-completions, but the ArealOpenAI client implementation in this PR primarily focuses on overriding chat.completions and responses. If OpenClaw uses the legacy completions API, it may bypass the chat template normalization and engine logic added here, or fail if the IS Gateway doesn't explicitly support the completions endpoint. It is recommended to use openai-chat for modern chat models like Qwen3.
```diff
-    "api": "openai-completions",
+    "api": "openai-chat",
```
```python
if not isinstance(content, list):
    content = list(content)
```
The current logic incorrectly handles cases where message.content is a dictionary (a common format for single-part content in some SDKs). If content is a dictionary, list(content) will return a list of its keys, leading to corrupted content processing. It should check if the content is a mapping and wrap it in a list instead.
```diff
-if not isinstance(content, list):
-    content = list(content)
+if isinstance(content, Mapping):
+    content = [content]
+elif not isinstance(content, list):
+    content = list(content)
```
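To see the failure mode concretely: `list()` over a dict yields its keys, not the part itself. Below is a minimal sketch of the suggested fix (the helper name `as_part_list` is illustrative, not the PR's API):

```python
from collections.abc import Mapping

def as_part_list(content):
    # Wrap a single mapping-style content part in a one-element list;
    # otherwise fall back to coercing the value into a list.
    if isinstance(content, Mapping):
        return [content]
    if not isinstance(content, list):
        return list(content)
    return content

single_part = {"type": "text", "text": "hello"}
print(list(single_part))          # → ['type', 'text']  (the bug: keys only)
print(as_part_list(single_part))  # → [{'type': 'text', 'text': 'hello'}]
```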
Please check this comment.
```python
all_message_list += message_list

_normalize_messages_for_chat_template(all_message_list)
```
Calling _normalize_messages_for_chat_template here will mutate the message dictionaries in-place. Since all_message_list contains references to parent.messages which are stored in the InteractionCache, this results in side effects where cached interactions are modified. This can lead to unexpected behavior if the same interaction is reused or exported later. Consider deep-copying the messages before normalization.
```python
all_message_list = [deepcopy(m) for m in all_message_list]
_normalize_messages_for_chat_template(all_message_list)
```

```python
def _recv_json(sock: socket.socket) -> dict[str, Any] | None:
    chunks: list[bytes] = []
    while True:
        data = sock.recv(4096)
        if not data:
            break
        chunks.append(data)
        if b"\n" in data:
            break
    return json.loads(b"".join(chunks).split(b"\n", 1)[0].decode("utf-8")) if chunks else None
```
The _recv_json function is not robust for stream-oriented sockets. If multiple JSON messages are received in a single recv call, or if a message is followed by the start of another, the split(b"\n", 1)[0] logic will discard the subsequent data. While the current client implementation uses a new connection per request, the server loop (which uses while request := _recv_json(...)) is susceptible to data loss if a client sends multiple requests over the same connection. Consider using a buffered reader or a proper framing handler.
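One way to make the server loop robust is a buffered, newline-delimited reader that keeps leftover bytes between messages. A minimal sketch (the function name `iter_json_lines` is ours, not the PR's):

```python
import json
import socket
from typing import Any, Iterator

def iter_json_lines(sock: socket.socket) -> Iterator[dict[str, Any]]:
    """Yield one JSON object per newline-delimited message.

    Unlike a single recv()-until-newline pass, bytes left over after a
    newline are retained in `buf`, so back-to-back messages on one
    connection are not silently dropped.
    """
    buf = b""
    while True:
        data = sock.recv(4096)
        if not data:
            break
        buf += data
        while b"\n" in buf:
            line, buf = buf.split(b"\n", 1)
            if line:
                yield json.loads(line.decode("utf-8"))

# Demo with a socketpair: two messages arriving in one burst both survive.
a, b = socket.socketpair()
a.sendall(b'{"id": 1}\n{"id": 2}\n')
a.close()
received = list(iter_json_lines(b))
b.close()
print(received)  # → [{'id': 1}, {'id': 2}]
```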
```python
except subprocess.TimeoutExpired as exc:
    self._cleanup_lock_files(agent_id)
    raise OpenClawServiceError(f"OpenClaw CLI timeout after {self.timeout}s") from exc
```
The retry logic in the chat method does not cover subprocess.TimeoutExpired. If the OpenClaw CLI times out (which can happen under high concurrency in a benchmark), the method immediately raises an error instead of attempting the remaining retries. Timeouts should be included in the retry loop to improve benchmark stability.
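A minimal retry wrapper along these lines would catch `subprocess.TimeoutExpired` alongside other retryable failures (function and parameter names here are hypothetical, not the PR's actual API):

```python
import subprocess
import sys
import time

class OpenClawServiceError(RuntimeError):
    pass

def run_with_retries(cmd, timeout, max_retries=3, backoff=1.0):
    """Sketch: treat TimeoutExpired like any other retryable failure."""
    last_exc = None
    for attempt in range(max_retries):
        try:
            return subprocess.run(cmd, capture_output=True, timeout=timeout, check=True)
        except (subprocess.TimeoutExpired, subprocess.CalledProcessError) as exc:
            last_exc = exc
            # Simple linear backoff before the next attempt.
            time.sleep(backoff * (attempt + 1))
    raise OpenClawServiceError(f"command failed after {max_retries} attempts") from last_exc

# Usage: a command that succeeds on the first attempt.
result = run_with_retries([sys.executable, "-c", "print('ok')"], timeout=10)
print(result.stdout.decode().strip())  # → ok
```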
Pull request overview
Adds an end-to-end performance benchmarking harness for the AReaL inference service using an OpenClaw-driven TAU²-bench workload, and updates the ArealOpenAI client preprocessing to better match SGLang’s chat-template tokenization behavior.
Changes:
- Added batchmode benchmarking scripts (server launch, sweep runner, trajectory collection, metrics collection) and accompanying README.
- Added a TAU²-bench + OpenClaw socket-tool integration layer (environment socket server, evaluator glue, task runners, OpenClaw agent/service/workspace utilities).
- Added `_normalize_messages_for_chat_template()` to the ArealOpenAI client plus new unit tests for message normalization.
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 10 comments.
Summary per file:

| File | Description |
|---|---|
| `areal/experimental/openai/client.py` | Adds message normalization prior to `apply_chat_template` for SGLang-aligned tokenization. |
| `tests/test_openai_client_normalize.py` | Adds unit tests covering content flattening and tool-call-argument parsing behavior. |
| `examples/experimental/inference_service/batchmode/README.md` | Documents the benchmark setup, workflow, and sample results. |
| `examples/experimental/inference_service/batchmode/start_servers.sh` | Slurm launcher for agent/user SGLang servers. |
| `examples/experimental/inference_service/batchmode/sweep.sh` | Runs end-to-end IS sweep across concurrency/trials inside Singularity. |
| `examples/experimental/inference_service/batchmode/collect_trajectories.py` | Orchestrates concurrent task runs via IS sessions and exports trajectories. |
| `examples/experimental/inference_service/batchmode/collect_metrics.py` | Snapshots/diffs/monitors SGLang Prometheus metrics for throughput reporting. |
| `examples/experimental/inference_service/batchmode/worker.py` | Runs a single TAU² task using OpenClaw agent with socket-based tool execution. |
| `examples/experimental/inference_service/batchmode/tau2/pyproject.toml` | Defines an installable integration project + tau2 plugin entry point. |
| `examples/experimental/inference_service/batchmode/tau2/__init__.py` | Plugin-style registration helpers and re-exports for the integration package. |
| `examples/experimental/inference_service/batchmode/tau2/task_runner.py` | Non-socket task runner wrapper around shared implementation. |
| `examples/experimental/inference_service/batchmode/tau2/task_runner_socket.py` | Socket-server-enabled task runner and CLI wrapper with env/tool wiring. |
| `examples/experimental/inference_service/batchmode/tau2/tau2_env/__init__.py` | Exposes socket environment + evaluator utilities. |
| `examples/experimental/inference_service/batchmode/tau2/tau2_env/environment_socket.py` | Implements the socket server and generates OpenClaw-callable tool scripts. |
| `examples/experimental/inference_service/batchmode/tau2/tau2_env/evaluator.py` | Evaluates runs by comparing environment state/assertions and merges reward breakdowns. |
| `examples/experimental/inference_service/batchmode/tau2/openclaw/agent.py` | Implements a TAU² LocalAgent backed by OpenClaw CLI + socket tool injection. |
| `examples/experimental/inference_service/batchmode/tau2/openclaw/service.py` | Wraps OpenClaw CLI invocation and parses its JSON outputs. |
| `examples/experimental/inference_service/batchmode/tau2/openclaw/workspace_manager.py` | Manages isolated OpenClaw workspaces and tool/skill generation. |
| `examples/experimental/inference_service/batchmode/tau2/openclaw/__init__.py` | Re-exports OpenClaw integration components and a compatibility config module. |
```python
if not isinstance(content, list):
    content = list(content)
parts = []
for part in content:
```
content = list(content) will produce incorrect results for mapping-like content values (e.g., a dict becomes a list of keys) and for bytes/bytearray (becomes a list of ints). Since _ensure_message_dict_list() accepts arbitrary mappings/iterables, this normalization can silently corrupt message content before apply_chat_template. Consider handling Mapping separately (treat as a single part) and avoiding list() for bytes/bytearray, or otherwise restricting this branch to known OpenAI list-content formats.
```diff
-if not isinstance(content, list):
-    content = list(content)
-parts = []
-for part in content:
+if isinstance(content, list):
+    content_parts = content
+elif isinstance(content, Mapping):
+    content_parts = [content]
+elif isinstance(content, (bytes, bytearray)):
+    content_parts = [content]
+elif isinstance(content, Iterable):
+    content_parts = list(content)
+else:
+    content_parts = [content]
+parts = []
+for part in content_parts:
```
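A self-contained sketch of that branch order, covering the mapping and bytes cases the review calls out (the helper name `coerce_content_parts` is illustrative; note that plain `str` content would need to be handled before this branch, since strings are also iterable):

```python
from collections.abc import Iterable, Mapping

def coerce_content_parts(content):
    """Treat mappings and raw bytes as a single content part;
    materialize other non-list iterables; wrap scalars in a list."""
    if isinstance(content, list):
        return content
    if isinstance(content, Mapping):
        return [content]
    if isinstance(content, (bytes, bytearray)):
        return [content]
    if isinstance(content, Iterable):
        return list(content)
    return [content]

print(coerce_content_parts({"type": "text", "text": "hi"}))  # → [{'type': 'text', 'text': 'hi'}]
print(coerce_content_parts(b"raw"))                          # → [b'raw']  (not a list of ints)
print(coerce_content_parts((1, 2)))                          # → [1, 2]
```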
Please check this comment.
```toml
# Entry point for tau2 plugin system
[project.entry-points."tau2.plugins"]
openclaw = "openclaw_tau2.register:register_plugin"
```
The package metadata declares an entry point openclaw_tau2.register:register_plugin, but there is no openclaw_tau2 package directory in this tree (the code lives under a top-level tau2/ package). As-is, installing this project will fail at runtime when the entry point is loaded. Either rename the Python package to openclaw_tau2 (and adjust discovery/paths) or update the entry point to match the actual importable module.
```diff
-openclaw = "openclaw_tau2.register:register_plugin"
+openclaw = "tau2.register:register_plugin"
```
```python
import subprocess

from openclaw_tau2 import run_task_with_socket_server
from tau2.registry import registry
from tau2.run import load_tasks
```
This script imports run_task_with_socket_server from openclaw_tau2, but the PR does not add an openclaw_tau2 module/package (the integration code is under examples/.../batchmode/tau2/). Running the worker will raise ModuleNotFoundError. Align this import with the actual package/module name produced by the integration package (or rename the integration package accordingly).
```python
import sys
import types

from loguru import logger

from .task_runner import run_task
from .task_runner_socket import run_task_with_socket_server
from .tau2_env import (
    EnvironmentSocketServer,
    OpenClawEnvironmentEvaluator,
    create_openclaw_tool_script,
    evaluate_simulation_with_environment,
```
This integration package is implemented as a top-level Python package named tau2, which is very likely to conflict with (and shadow) the real tau2 module from tau2-bench when both are installed (your own scripts also pip install -e ${TAU2_DIR}). This can cause imports like from tau2.registry import registry to resolve to the wrong package. Consider renaming this package namespace (e.g., openclaw_tau2) and updating internal imports accordingly.
```bash
--wrap "singularity exec --nv --no-home --writable-tmpfs --bind /storage:/storage ${CONTAINER} bash -c '
    python -m sglang.launch_server \
        --model-path ${MODEL_PATH} \
        --served-model-name ${MODEL_NAME} \
        --tp 8 \
        --port ${USER_PORT} \
        --host 0.0.0.0 \
        --context-length ${CONTEXT_LENGTH} \
        --tool-call-parser qwen25 \
        --enable-metrics \
        --enable-deterministic-inference \
        --disable-radix-cache
'")
```
User SGLang section says radix cache is ON, but the submitted job includes --disable-radix-cache for the user server as well. Either remove that flag for the user server or update the comments/output so the script reflects the actual behavior (this impacts how baseline vs target comparisons are interpreted).
```text
Usage:
    python run_single_worker.py \
        --domain retail \
        --task-index 0 \
        --agent-endpoint http://127.0.0.1:30000/v1 \
        --user-endpoint http://<node>:30001/v1 \
        --model Qwen3-235B-A22B-Instruct-2507 \
        --output-dir /tmp/results
```
The docstring usage example references python run_single_worker.py, but this file is worker.py. Updating the example avoids confusion when users try to follow the instructions.
```python
tau2_src = os.environ.get("TAU2_SRC_PATH", "${TAU2_DIR}/src")
if os.path.isdir(tau2_src) and tau2_src not in sys.path:
    sys.path.insert(0, tau2_src)
```
Defaulting TAU2_SRC_PATH to the literal string ${TAU2_DIR}/src won’t expand the environment variable, so os.path.isdir() will always fail unless TAU2_SRC_PATH is explicitly set. Consider using os.path.expandvars() (or deriving from TAU2_DIR directly) so the fallback works as intended.
```diff
-tau2_src = os.environ.get("TAU2_SRC_PATH", "${TAU2_DIR}/src")
-if os.path.isdir(tau2_src) and tau2_src not in sys.path:
-    sys.path.insert(0, tau2_src)
+tau2_src = os.environ.get("TAU2_SRC_PATH")
+if not tau2_src:
+    tau2_dir = os.environ.get("TAU2_DIR")
+    if tau2_dir:
+        tau2_src = os.path.join(tau2_dir, "src")
+if tau2_src:
+    tau2_src = os.path.expandvars(tau2_src)
+    if os.path.isdir(tau2_src) and tau2_src not in sys.path:
+        sys.path.insert(0, tau2_src)
```
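The pitfall is easy to reproduce: `os.environ.get` performs no variable expansion on its default value, so the literal `${TAU2_DIR}/src` string never names a real directory (the `/opt/tau2` path below is hypothetical, used only for illustration):

```python
import os

os.environ["TAU2_DIR"] = "/opt/tau2"  # hypothetical path for illustration
# The unexpanded literal never matches a real directory:
print(os.path.isdir("${TAU2_DIR}/src"))       # → False
# os.path.expandvars substitutes the variable first:
print(os.path.expandvars("${TAU2_DIR}/src"))  # → /opt/tau2/src
```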
```python
    worker_id: int,
    work_dir: Path,
) -> dict:
    """Run tau2 task via run_single_worker.py subprocess (same pattern as v2).
```
The run_tau2_task_subprocess() docstring mentions run_single_worker.py, but the subprocess actually runs worker.py via WORKER_SCRIPT. Updating the docstring will keep the implementation/docs aligned.
```diff
-    """Run tau2 task via run_single_worker.py subprocess (same pattern as v2).
+    """Run a tau2 task via the worker subprocess referenced by `WORKER_SCRIPT`.
```
```bash
python collect_sglang_metrics.py snapshot http://127.0.0.1:30000

# Diff two snapshot files → JSON with deltas + derived throughput
python collect_sglang_metrics.py diff pre.json post.json --wall-clock 120
```
The usage examples in this header refer to collect_sglang_metrics.py, but the file name is collect_metrics.py. Please update the example commands so they can be copy/pasted as-is.
```diff
-python collect_sglang_metrics.py snapshot http://127.0.0.1:30000
+python collect_metrics.py snapshot http://127.0.0.1:30000
 # Diff two snapshot files → JSON with deltas + derived throughput
-python collect_sglang_metrics.py diff pre.json post.json --wall-clock 120
+python collect_metrics.py diff pre.json post.json --wall-clock 120
```
```python
def create_openclaw_tool_script(tool_name: str, server_config: dict) -> str:
    host = json.dumps(server_config["host"])
    tool = json.dumps(tool_name)
    return f"""#!/usr/bin/env python
```
create_openclaw_tool_script() emits helper scripts with shebang #!/usr/bin/env python, which may resolve to Python 2 or not exist in some environments. Since the integration targets Python 3.10+, consider using python3 in the shebang to make tool execution more reliable.
```diff
-    return f"""#!/usr/bin/env python
+    return f"""#!/usr/bin/env python3
```
nuzant left a comment
This example is meant to be an OpenClaw + tau2 example with a benchmark. There are several issues that need to be fixed before it can be merged:

- In `examples/experimental/inference_service/batchmode/sweep.sh`, the inference services are launched directly with a bash command. To showcase and benchmark the full-stack inference service you need to use the inference gateway controller to collect the results directly. Check `online_rollout.py`; you can reuse it in your benchmark scripts.
- The file structure should be reorganized. A better organization would be:

```text
examples/experimental/inference_service/
    openclaw_tau2/   # current batchmode/tau2/ folder, for openclaw + tau2 environment
    benchmark/       # scripts for benchmarking SGLang vs inference service
        ...          # launching script, collecting metrics, evaluation and so on
    ...              # online, human-in-the-loop and offline demos. Reuse them in the benchmark
    README.md        # add a section in current README.md
                     # about online mode with opencode + tau2:
                     # 1. introduction to openclaw + tau2 environments
                     # 2. guide on how to run the benchmark
                     # 3. benchmark results
```
```python
if not isinstance(content, list):
    content = list(content)
```

Please check this comment.
```python
if not isinstance(content, list):
    content = list(content)
parts = []
for part in content:
```

Please check this comment.
```python
all_message_list += message_list

_normalize_messages_for_chat_template(all_message_list)
```
Use a copy in `_normalize_messages_for_chat_template` and avoid modifying messages in place.
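The aliasing hazard behind this comment, sketched with a toy stand-in for the real normalizer (names below are illustrative, not the PR's code):

```python
from copy import deepcopy

messages = [{"role": "user", "content": [{"type": "text", "text": "hi"}]}]
cache = messages  # the interaction cache holds references to the same dicts

def normalize_in_place(msgs):
    # Toy stand-in for _normalize_messages_for_chat_template:
    # flattens list-of-parts content into a plain string.
    for m in msgs:
        if isinstance(m["content"], list):
            m["content"] = "".join(p["text"] for p in m["content"])

# Normalizing a deep copy leaves the cached messages untouched:
safe = [deepcopy(m) for m in messages]
normalize_in_place(safe)
print(cache[0]["content"])  # → [{'type': 'text', 'text': 'hi'}]  (unchanged)
print(safe[0]["content"])   # → hi
```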
```python
has_images = len(image_data) > 0

tokenizer_messages = messages_for_tokenizer if has_images else messages_list
_normalize_messages_for_chat_template(tokenizer_messages)
```
```python
def _normalize_messages_for_chat_template(messages: list[dict[str, Any]]) -> None:
    """Copied from areal.experimental.openai.client to avoid heavy import chain."""
```
Do not copy; import directly. Otherwise the test will break when this function is modified in the future.
```bash
AGENT_JOBID=$(sbatch --parsable \
    --job-name=agent-sglang \
    --nodes=1 \
    --cpus-per-task=100 \
    --gres=gpu:8 \
    --mem=1500G \
    --time="${HOURS}:00:00" \
    --output="${LOG_DIR}/agent-sglang-%j.log" \
    --wrap "singularity exec --nv --no-home --writable-tmpfs --bind /storage:/storage ${CONTAINER} bash -c '
```
Since this is only a single-node benchmark with 8 GPUs, do not assume that users have a Slurm environment. Make both start_servers.sh and sweep.sh directly runnable on a single GPU node with docker + the official AReaL image.
Also, a Python script will be much easier to read and understand than a shell script. Could you rewrite these scripts in Python?
```bash
singularity exec --writable-tmpfs --no-home --nv \
    -B /storage/openpsi \
    "$CONTAINER" bash -c '
```
`/storage/openpsi` is a hard-coded path in our internal cluster. Check if there are any other ones and clean them up.
```diff
@@ -0,0 +1,82 @@
+[project]
```
Translate Chinese comments. Also, we do not need a pyproject.toml here. Write the install commands in the README, and make sure that users can successfully run the scripts after following them.
Description
Add end-to-end performance benchmark for AReaL inference service using OpenClaw agent on TAU²-bench.
Key changes:

- Add batchmode benchmark scripts: sweep runner, trajectory collector, SGLang metrics collector, server launcher
- Add `openclaw_tau2` integration package: OpenClaw agent adapter for TAU²-bench with socket-based cross-process tool execution
- Add `_normalize_messages_for_chat_template()` to `ArealOpenAI` client: content flattening + `tool_calls` arguments dict parsing for SGLang token sequence alignment
- Add unit tests for message normalization (15 cases)

Related Issue

N/A — new benchmarking infrastructure
Type of Change

Checklist

- `pre-commit run --all-files`
- `./docs/build_all.sh`
- `main`
- `/review-pr` command
- `/create-pr`

Additional Context
Benchmark results for Qwen3-235B-A22B on TAU²-bench airline domain are included in the README, showing the IS layer adds negligible overhead (< 5% latency) while enabling RL training data collection.