feat(examples): add inference service performance benchmark using OpenClaw + TAU2-bench #1175
Conversation
Add end-to-end performance benchmark for AReaL inference service using OpenClaw agent on TAU2-bench.

Key changes:

- Add batchmode benchmark scripts: sweep runner, trajectory collector, SGLang metrics collector, server launcher
- Add `openclaw_tau2` integration package: OpenClaw agent adapter for TAU2-bench with socket-based cross-process tool execution
- Add `_normalize_messages_for_chat_template()` to the `ArealOpenAI` client: content flattening + `tool_calls` arguments dict parsing for SGLang token sequence alignment
- Add unit tests for message normalization (15 cases)
Code Review
This pull request introduces a message normalization layer in the ArealOpenAI client to align with SGLang's chat template requirements, specifically addressing content flattening and tool call argument parsing. It also adds a comprehensive benchmarking suite for the inference service, including Slurm integration, metric collection scripts, and a tau2-bench integration package that utilizes a socket-based environment server for agent evaluation. Feedback focuses on several technical improvements: the normalization logic incorrectly handles dictionary content by converting it to a list of keys, and its in-place mutation of messages causes unintended side effects in the interaction cache. Additionally, the socket-based JSON receiver lacks robustness against stream fragmentation, the OpenClaw CLI retry logic fails to catch timeout exceptions, and the benchmark configuration should be updated to use the chat API instead of the legacy completions API to ensure compatibility with the new client logic.
```python
return {
    "baseUrl": self.llm_base_url or "http://127.0.0.1:30000",
    "apiKey": self.llm_api_key or "dummy",
    "api": "openai-completions",
```
The OpenClaw provider is configured to use openai-completions, but the ArealOpenAI client implementation in this PR primarily focuses on overriding chat.completions and responses. If OpenClaw uses the legacy completions API, it may bypass the chat template normalization and engine logic added here, or fail if the IS Gateway doesn't explicitly support the completions endpoint. It is recommended to use openai-chat for modern chat models like Qwen3.
```diff
-    "api": "openai-completions",
+    "api": "openai-chat",
```
```python
if not isinstance(content, list):
    content = list(content)
```
The current logic incorrectly handles cases where message.content is a dictionary (a common format for single-part content in some SDKs). If content is a dictionary, list(content) will return a list of its keys, leading to corrupted content processing. It should check if the content is a mapping and wrap it in a list instead.
```diff
-if not isinstance(content, list):
-    content = list(content)
+if isinstance(content, Mapping):
+    content = [content]
+elif not isinstance(content, list):
+    content = list(content)
```
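To see the failure mode concretely: `list()` over a dict yields its keys, not the part itself. Below is a minimal sketch of the suggested fix (the helper name `as_part_list` is illustrative, not the PR's API):

```python
from collections.abc import Mapping

def as_part_list(content):
    # Wrap a single mapping-style content part in a one-element list;
    # otherwise fall back to coercing the value into a list.
    if isinstance(content, Mapping):
        return [content]
    if not isinstance(content, list):
        return list(content)
    return content

single_part = {"type": "text", "text": "hello"}
print(list(single_part))          # → ['type', 'text']  (the bug: keys only)
print(as_part_list(single_part))  # → [{'type': 'text', 'text': 'hello'}]
```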
Please check this comment.
```python
all_message_list += message_list

_normalize_messages_for_chat_template(all_message_list)
```
Calling _normalize_messages_for_chat_template here will mutate the message dictionaries in-place. Since all_message_list contains references to parent.messages which are stored in the InteractionCache, this results in side effects where cached interactions are modified. This can lead to unexpected behavior if the same interaction is reused or exported later. Consider deep-copying the messages before normalization.
```python
all_message_list = [deepcopy(m) for m in all_message_list]
_normalize_messages_for_chat_template(all_message_list)
```

```python
def _recv_json(sock: socket.socket) -> dict[str, Any] | None:
    chunks: list[bytes] = []
    while True:
        data = sock.recv(4096)
        if not data:
            break
        chunks.append(data)
        if b"\n" in data:
            break
    return json.loads(b"".join(chunks).split(b"\n", 1)[0].decode("utf-8")) if chunks else None
```
The _recv_json function is not robust for stream-oriented sockets. If multiple JSON messages are received in a single recv call, or if a message is followed by the start of another, the split(b"\n", 1)[0] logic will discard the subsequent data. While the current client implementation uses a new connection per request, the server loop (which uses while request := _recv_json(...)) is susceptible to data loss if a client sends multiple requests over the same connection. Consider using a buffered reader or a proper framing handler.
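One way to make the server loop robust is a buffered, newline-delimited reader that keeps leftover bytes between messages. A minimal sketch (the function name `iter_json_lines` is ours, not the PR's):

```python
import json
import socket
from typing import Any, Iterator

def iter_json_lines(sock: socket.socket) -> Iterator[dict[str, Any]]:
    """Yield one JSON object per newline-delimited message.

    Unlike a single recv()-until-newline pass, bytes left over after a
    newline are retained in `buf`, so back-to-back messages on one
    connection are not silently dropped.
    """
    buf = b""
    while True:
        data = sock.recv(4096)
        if not data:
            break
        buf += data
        while b"\n" in buf:
            line, buf = buf.split(b"\n", 1)
            if line:
                yield json.loads(line.decode("utf-8"))

# Demo with a socketpair: two messages arriving in one burst both survive.
a, b = socket.socketpair()
a.sendall(b'{"id": 1}\n{"id": 2}\n')
a.close()
received = list(iter_json_lines(b))
b.close()
print(received)  # → [{'id': 1}, {'id': 2}]
```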
```python
except subprocess.TimeoutExpired as exc:
    self._cleanup_lock_files(agent_id)
    raise OpenClawServiceError(f"OpenClaw CLI timeout after {self.timeout}s") from exc
```
The retry logic in the chat method does not cover subprocess.TimeoutExpired. If the OpenClaw CLI times out (which can happen under high concurrency in a benchmark), the method immediately raises an error instead of attempting the remaining retries. Timeouts should be included in the retry loop to improve benchmark stability.
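A minimal retry wrapper along these lines would catch `subprocess.TimeoutExpired` alongside other retryable failures (function and parameter names here are hypothetical, not the PR's actual API):

```python
import subprocess
import sys
import time

class OpenClawServiceError(RuntimeError):
    pass

def run_with_retries(cmd, timeout, max_retries=3, backoff=1.0):
    """Sketch: treat TimeoutExpired like any other retryable failure."""
    last_exc = None
    for attempt in range(max_retries):
        try:
            return subprocess.run(cmd, capture_output=True, timeout=timeout, check=True)
        except (subprocess.TimeoutExpired, subprocess.CalledProcessError) as exc:
            last_exc = exc
            # Simple linear backoff before the next attempt.
            time.sleep(backoff * (attempt + 1))
    raise OpenClawServiceError(f"command failed after {max_retries} attempts") from last_exc

# Usage: a command that succeeds on the first attempt.
result = run_with_retries([sys.executable, "-c", "print('ok')"], timeout=10)
print(result.stdout.decode().strip())  # → ok
```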
Pull request overview
Adds an end-to-end performance benchmarking harness for the AReaL inference service using an OpenClaw-driven TAU²-bench workload, and updates the ArealOpenAI client preprocessing to better match SGLang’s chat-template tokenization behavior.
Changes:
- Added batchmode benchmarking scripts (server launch, sweep runner, trajectory collection, metrics collection) and accompanying README.
- Added a TAU²-bench + OpenClaw socket-tool integration layer (environment socket server, evaluator glue, task runners, OpenClaw agent/service/workspace utilities).
- Added `_normalize_messages_for_chat_template()` to the ArealOpenAI client plus new unit tests for message normalization.
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 10 comments.
Summary per file:

| File | Description |
|---|---|
| `areal/experimental/openai/client.py` | Adds message normalization prior to `apply_chat_template` for SGLang-aligned tokenization. |
| `tests/test_openai_client_normalize.py` | Adds unit tests covering content flattening and tool-call-argument parsing behavior. |
| `examples/experimental/inference_service/batchmode/README.md` | Documents the benchmark setup, workflow, and sample results. |
| `examples/experimental/inference_service/batchmode/start_servers.sh` | Slurm launcher for agent/user SGLang servers. |
| `examples/experimental/inference_service/batchmode/sweep.sh` | Runs end-to-end IS sweep across concurrency/trials inside Singularity. |
| `examples/experimental/inference_service/batchmode/collect_trajectories.py` | Orchestrates concurrent task runs via IS sessions and exports trajectories. |
| `examples/experimental/inference_service/batchmode/collect_metrics.py` | Snapshots/diffs/monitors SGLang Prometheus metrics for throughput reporting. |
| `examples/experimental/inference_service/batchmode/worker.py` | Runs a single TAU² task using OpenClaw agent with socket-based tool execution. |
| `examples/experimental/inference_service/batchmode/tau2/pyproject.toml` | Defines an installable integration project + tau2 plugin entry point. |
| `examples/experimental/inference_service/batchmode/tau2/__init__.py` | Plugin-style registration helpers and re-exports for the integration package. |
| `examples/experimental/inference_service/batchmode/tau2/task_runner.py` | Non-socket task runner wrapper around shared implementation. |
| `examples/experimental/inference_service/batchmode/tau2/task_runner_socket.py` | Socket-server-enabled task runner and CLI wrapper with env/tool wiring. |
| `examples/experimental/inference_service/batchmode/tau2/tau2_env/__init__.py` | Exposes socket environment + evaluator utilities. |
| `examples/experimental/inference_service/batchmode/tau2/tau2_env/environment_socket.py` | Implements the socket server and generates OpenClaw-callable tool scripts. |
| `examples/experimental/inference_service/batchmode/tau2/tau2_env/evaluator.py` | Evaluates runs by comparing environment state/assertions and merges reward breakdowns. |
| `examples/experimental/inference_service/batchmode/tau2/openclaw/agent.py` | Implements a TAU² LocalAgent backed by OpenClaw CLI + socket tool injection. |
| `examples/experimental/inference_service/batchmode/tau2/openclaw/service.py` | Wraps OpenClaw CLI invocation and parses its JSON outputs. |
| `examples/experimental/inference_service/batchmode/tau2/openclaw/workspace_manager.py` | Manages isolated OpenClaw workspaces and tool/skill generation. |
| `examples/experimental/inference_service/batchmode/tau2/openclaw/__init__.py` | Re-exports OpenClaw integration components and a compatibility config module. |
```python
if not isinstance(content, list):
    content = list(content)
parts = []
for part in content:
```
content = list(content) will produce incorrect results for mapping-like content values (e.g., a dict becomes a list of keys) and for bytes/bytearray (becomes a list of ints). Since _ensure_message_dict_list() accepts arbitrary mappings/iterables, this normalization can silently corrupt message content before apply_chat_template. Consider handling Mapping separately (treat as a single part) and avoiding list() for bytes/bytearray, or otherwise restricting this branch to known OpenAI list-content formats.
```diff
-if not isinstance(content, list):
-    content = list(content)
-parts = []
-for part in content:
+if isinstance(content, list):
+    content_parts = content
+elif isinstance(content, Mapping):
+    content_parts = [content]
+elif isinstance(content, (bytes, bytearray)):
+    content_parts = [content]
+elif isinstance(content, Iterable):
+    content_parts = list(content)
+else:
+    content_parts = [content]
+parts = []
+for part in content_parts:
```
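A self-contained sketch of that branch order, covering the mapping and bytes cases the review calls out (the helper name `coerce_content_parts` is illustrative; note that plain `str` content would need to be handled before this branch, since strings are also iterable):

```python
from collections.abc import Iterable, Mapping

def coerce_content_parts(content):
    """Treat mappings and raw bytes as a single content part;
    materialize other non-list iterables; wrap scalars in a list."""
    if isinstance(content, list):
        return content
    if isinstance(content, Mapping):
        return [content]
    if isinstance(content, (bytes, bytearray)):
        return [content]
    if isinstance(content, Iterable):
        return list(content)
    return [content]

print(coerce_content_parts({"type": "text", "text": "hi"}))  # → [{'type': 'text', 'text': 'hi'}]
print(coerce_content_parts(b"raw"))                          # → [b'raw']  (not a list of ints)
print(coerce_content_parts((1, 2)))                          # → [1, 2]
```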
Please check this comment.
```toml
# Entry point for tau2 plugin system
[project.entry-points."tau2.plugins"]
openclaw = "openclaw_tau2.register:register_plugin"
```
The package metadata declares an entry point openclaw_tau2.register:register_plugin, but there is no openclaw_tau2 package directory in this tree (the code lives under a top-level tau2/ package). As-is, installing this project will fail at runtime when the entry point is loaded. Either rename the Python package to openclaw_tau2 (and adjust discovery/paths) or update the entry point to match the actual importable module.
```diff
-openclaw = "openclaw_tau2.register:register_plugin"
+openclaw = "tau2.register:register_plugin"
```
```python
import subprocess

from openclaw_tau2 import run_task_with_socket_server
from tau2.registry import registry
from tau2.run import load_tasks
```
This script imports run_task_with_socket_server from openclaw_tau2, but the PR does not add an openclaw_tau2 module/package (the integration code is under examples/.../batchmode/tau2/). Running the worker will raise ModuleNotFoundError. Align this import with the actual package/module name produced by the integration package (or rename the integration package accordingly).
```python
import sys
import types

from loguru import logger

from .task_runner import run_task
from .task_runner_socket import run_task_with_socket_server
from .tau2_env import (
    EnvironmentSocketServer,
    OpenClawEnvironmentEvaluator,
    create_openclaw_tool_script,
    evaluate_simulation_with_environment,
```
This integration package is implemented as a top-level Python package named tau2, which is very likely to conflict with (and shadow) the real tau2 module from tau2-bench when both are installed (your own scripts also pip install -e ${TAU2_DIR}). This can cause imports like from tau2.registry import registry to resolve to the wrong package. Consider renaming this package namespace (e.g., openclaw_tau2) and updating internal imports accordingly.
```bash
--wrap "singularity exec --nv --no-home --writable-tmpfs --bind /storage:/storage ${CONTAINER} bash -c '
    python -m sglang.launch_server \
        --model-path ${MODEL_PATH} \
        --served-model-name ${MODEL_NAME} \
        --tp 8 \
        --port ${USER_PORT} \
        --host 0.0.0.0 \
        --context-length ${CONTEXT_LENGTH} \
        --tool-call-parser qwen25 \
        --enable-metrics \
        --enable-deterministic-inference \
        --disable-radix-cache
'")
```
User SGLang section says radix cache is ON, but the submitted job includes --disable-radix-cache for the user server as well. Either remove that flag for the user server or update the comments/output so the script reflects the actual behavior (this impacts how baseline vs target comparisons are interpreted).
```text
Usage:
    python run_single_worker.py \
        --domain retail \
        --task-index 0 \
        --agent-endpoint http://127.0.0.1:30000/v1 \
        --user-endpoint http://<node>:30001/v1 \
        --model Qwen3-235B-A22B-Instruct-2507 \
        --output-dir /tmp/results
```
The docstring usage example references python run_single_worker.py, but this file is worker.py. Updating the example avoids confusion when users try to follow the instructions.
```python
tau2_src = os.environ.get("TAU2_SRC_PATH", "${TAU2_DIR}/src")
if os.path.isdir(tau2_src) and tau2_src not in sys.path:
    sys.path.insert(0, tau2_src)
```
Defaulting TAU2_SRC_PATH to the literal string ${TAU2_DIR}/src won’t expand the environment variable, so os.path.isdir() will always fail unless TAU2_SRC_PATH is explicitly set. Consider using os.path.expandvars() (or deriving from TAU2_DIR directly) so the fallback works as intended.
```diff
-tau2_src = os.environ.get("TAU2_SRC_PATH", "${TAU2_DIR}/src")
-if os.path.isdir(tau2_src) and tau2_src not in sys.path:
-    sys.path.insert(0, tau2_src)
+tau2_src = os.environ.get("TAU2_SRC_PATH")
+if not tau2_src:
+    tau2_dir = os.environ.get("TAU2_DIR")
+    if tau2_dir:
+        tau2_src = os.path.join(tau2_dir, "src")
+if tau2_src:
+    tau2_src = os.path.expandvars(tau2_src)
+    if os.path.isdir(tau2_src) and tau2_src not in sys.path:
+        sys.path.insert(0, tau2_src)
```
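The pitfall is easy to reproduce: `os.environ.get` performs no variable expansion on its default value, so the literal `${TAU2_DIR}/src` string never names a real directory (the `/opt/tau2` path below is hypothetical, used only for illustration):

```python
import os

os.environ["TAU2_DIR"] = "/opt/tau2"  # hypothetical path for illustration
# The unexpanded literal never matches a real directory:
print(os.path.isdir("${TAU2_DIR}/src"))       # → False
# os.path.expandvars substitutes the variable first:
print(os.path.expandvars("${TAU2_DIR}/src"))  # → /opt/tau2/src
```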
```python
    worker_id: int,
    work_dir: Path,
) -> dict:
    """Run tau2 task via run_single_worker.py subprocess (same pattern as v2).
```
The run_tau2_task_subprocess() docstring mentions run_single_worker.py, but the subprocess actually runs worker.py via WORKER_SCRIPT. Updating the docstring will keep the implementation/docs aligned.
```diff
-    """Run tau2 task via run_single_worker.py subprocess (same pattern as v2).
+    """Run a tau2 task via the worker subprocess referenced by `WORKER_SCRIPT`.
```
```bash
python collect_sglang_metrics.py snapshot http://127.0.0.1:30000

# Diff two snapshot files → JSON with deltas + derived throughput
python collect_sglang_metrics.py diff pre.json post.json --wall-clock 120
```
The usage examples in this header refer to collect_sglang_metrics.py, but the file name is collect_metrics.py. Please update the example commands so they can be copy/pasted as-is.
```diff
-python collect_sglang_metrics.py snapshot http://127.0.0.1:30000
+python collect_metrics.py snapshot http://127.0.0.1:30000
 # Diff two snapshot files → JSON with deltas + derived throughput
-python collect_sglang_metrics.py diff pre.json post.json --wall-clock 120
+python collect_metrics.py diff pre.json post.json --wall-clock 120
```
```python
def create_openclaw_tool_script(tool_name: str, server_config: dict) -> str:
    host = json.dumps(server_config["host"])
    tool = json.dumps(tool_name)
    return f"""#!/usr/bin/env python
```
create_openclaw_tool_script() emits helper scripts with shebang #!/usr/bin/env python, which may resolve to Python 2 or not exist in some environments. Since the integration targets Python 3.10+, consider using python3 in the shebang to make tool execution more reliable.
```diff
-    return f"""#!/usr/bin/env python
+    return f"""#!/usr/bin/env python3
```
nuzant left a comment
This example is meant to be an OpenClaw + tau2 example with a benchmark. There are several issues that need to be fixed before it can be merged:

- In `examples/experimental/inference_service/batchmode/sweep.sh`, the inference services are launched directly with a bash command. To showcase and benchmark the full-stack inference service you need to use the inference gateway controller to collect the results directly. Check `online_rollout.py`; you can reuse it in your benchmark scripts.
- The file structure should be reorganized. A better organization would be:

```text
examples/experimental/inference_service/
    openclaw_tau2/   # current batchmode/tau2/ folder, for openclaw + tau2 environment
    benchmark/       # scripts for benchmarking SGLang vs inference service
        ...          # launching script, collecting metrics, evaluation and so on
    ...              # online, human-in-the-loop and offline demos. Reuse them in the benchmark
    README.md        # add a section in current README.md
                     # about online mode with opencode + tau2:
                     # 1. introduction to openclaw + tau2 environments
                     # 2. guide on how to run the benchmark
                     # 3. benchmark results
```
```python
if not isinstance(content, list):
    content = list(content)
```

Please check this comment.
```python
if not isinstance(content, list):
    content = list(content)
parts = []
for part in content:
```

Please check this comment.
```python
all_message_list += message_list

_normalize_messages_for_chat_template(all_message_list)
```
Use a copy in `_normalize_messages_for_chat_template` and avoid modifying messages in place.
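The aliasing hazard behind this comment, sketched with a toy stand-in for the real normalizer (names below are illustrative, not the PR's code):

```python
from copy import deepcopy

messages = [{"role": "user", "content": [{"type": "text", "text": "hi"}]}]
cache = messages  # the interaction cache holds references to the same dicts

def normalize_in_place(msgs):
    # Toy stand-in for _normalize_messages_for_chat_template:
    # flattens list-of-parts content into a plain string.
    for m in msgs:
        if isinstance(m["content"], list):
            m["content"] = "".join(p["text"] for p in m["content"])

# Normalizing a deep copy leaves the cached messages untouched:
safe = [deepcopy(m) for m in messages]
normalize_in_place(safe)
print(cache[0]["content"])  # → [{'type': 'text', 'text': 'hi'}]  (unchanged)
print(safe[0]["content"])   # → hi
```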
```python
has_images = len(image_data) > 0

tokenizer_messages = messages_for_tokenizer if has_images else messages_list
_normalize_messages_for_chat_template(tokenizer_messages)
```
```python
def _normalize_messages_for_chat_template(messages: list[dict[str, Any]]) -> None:
    """Copied from areal.experimental.openai.client to avoid heavy import chain."""
```
Do not copy; import directly. Otherwise the test will break when this function is modified in the future.
```bash
AGENT_JOBID=$(sbatch --parsable \
    --job-name=agent-sglang \
    --nodes=1 \
    --cpus-per-task=100 \
    --gres=gpu:8 \
    --mem=1500G \
    --time="${HOURS}:00:00" \
    --output="${LOG_DIR}/agent-sglang-%j.log" \
    --wrap "singularity exec --nv --no-home --writable-tmpfs --bind /storage:/storage ${CONTAINER} bash -c '
```
Since this is only a single-node benchmark with 8 GPUs, do not assume that users have a Slurm environment. Make both start_servers.sh and sweep.sh directly runnable on a single GPU node with docker + the official AReaL image.
Also, a Python script will be much easier to read and understand than a shell script. Could you rewrite these scripts in Python?
```bash
singularity exec --writable-tmpfs --no-home --nv \
    -B /storage/openpsi \
    "$CONTAINER" bash -c '
```
`/storage/openpsi` is a hard-coded path in our internal cluster. Check if there are any other ones and clean them up.
```diff
@@ -0,0 +1,82 @@
+[project]
```
Translate Chinese comments. Also, we do not need a pyproject.toml here. Write the install commands in the README, and make sure that users can successfully run the scripts after following them.
Description
Add end-to-end performance benchmark for AReaL inference service using OpenClaw agent on TAU²-bench.
Key changes:

- Add batchmode benchmark scripts: sweep runner, trajectory collector, SGLang metrics collector, server launcher
- Add `openclaw_tau2` integration package: OpenClaw agent adapter for TAU²-bench with socket-based cross-process tool execution
- Add `_normalize_messages_for_chat_template()` to `ArealOpenAI` client: content flattening + `tool_calls` arguments dict parsing for SGLang token sequence alignment
- Add unit tests for message normalization (15 cases)

Related Issue

N/A — new benchmarking infrastructure
Type of Change

Checklist

- `pre-commit run --all-files`
- `./docs/build_all.sh`
- `main`
- `/review-pr` command
- `/create-pr`

Additional Context
Benchmark results for Qwen3-235B-A22B on TAU²-bench airline domain are included in the README, showing the IS layer adds negligible overhead (< 5% latency) while enabling RL training data collection.