
feat(examples): add inference service performance benchmark using OpenClaw + TAU2-bench#1175

Open
Le8r0nJames wants to merge 1 commit into inclusionAI:main from Le8r0nJames:zjw/inference-service-benchmark

Conversation

@Le8r0nJames
Collaborator

Description

Add end-to-end performance benchmark for AReaL inference service using OpenClaw agent on TAU²-bench.

Key changes:

  • Add batchmode benchmark scripts: sweep runner, trajectory collector, SGLang metrics collector, server launcher
  • Add openclaw_tau2 integration package: OpenClaw agent adapter for TAU²-bench with socket-based cross-process tool execution
  • Add _normalize_messages_for_chat_template() to ArealOpenAI client: content flattening + tool_calls arguments dict parsing for SGLang token sequence alignment
  • Add unit tests for message normalization (15 cases)
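The normalization described above can be sketched roughly as follows. This is a simplified illustration, not the actual helper in areal/experimental/openai/client.py; the function name and content shapes are assumptions based on the PR description:

```python
import json
from collections.abc import Mapping
from copy import deepcopy
from typing import Any


def normalize_messages_for_chat_template(messages: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Flatten list-style content into plain strings and parse tool-call
    arguments from JSON strings into dicts, so the tokenizer's chat template
    sees the shapes SGLang expects. Simplified illustration only."""
    normalized = []
    for msg in messages:
        msg = deepcopy(msg)  # never mutate the caller's (possibly cached) dicts
        content = msg.get("content")
        if isinstance(content, Mapping):
            content = [content]  # a single content part, not a list of keys
        if isinstance(content, list):
            # Keep only textual parts and join them into one string.
            msg["content"] = "".join(
                part.get("text", "") if isinstance(part, Mapping) else str(part)
                for part in content
            )
        for call in msg.get("tool_calls") or []:
            fn = call.get("function", {})
            args = fn.get("arguments")
            if isinstance(args, str):
                try:
                    fn["arguments"] = json.loads(args)  # JSON string -> dict
                except json.JSONDecodeError:
                    pass  # leave malformed arguments untouched
        normalized.append(msg)
    return normalized
```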

Related Issue

N/A — new benchmarking infrastructure

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📝 Documentation update
  • ♻️ Refactoring
  • ⚡ Performance improvement
  • ✅ Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • Pre-commit hooks pass (pre-commit run --all-files)
  • Relevant tests pass; new tests added for new functionality
  • Documentation updated (if applicable; built with ./docs/build_all.sh)
  • Branch is up to date with main
  • Self-reviewed via /review-pr command
  • This PR was created by a coding agent via /create-pr
  • This PR is a breaking change

Additional Context

Benchmark results for Qwen3-235B-A22B on TAU²-bench airline domain are included in the README, showing the IS layer adds negligible overhead (< 5% latency) while enabling RL training data collection.

Copilot AI review requested due to automatic review settings April 13, 2026 13:09
Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request introduces a message normalization layer in the ArealOpenAI client to align with SGLang's chat template requirements, specifically addressing content flattening and tool call argument parsing. It also adds a comprehensive benchmarking suite for the inference service, including Slurm integration, metric collection scripts, and a tau2-bench integration package that utilizes a socket-based environment server for agent evaluation.

Feedback focuses on several technical improvements:

  • The normalization logic incorrectly handles dictionary content by converting it to a list of keys, and its in-place mutation of messages causes unintended side effects in the interaction cache.
  • The socket-based JSON receiver lacks robustness against stream fragmentation.
  • The OpenClaw CLI retry logic fails to catch timeout exceptions.
  • The benchmark configuration should use the chat API instead of the legacy completions API to ensure compatibility with the new client logic.

return {
    "baseUrl": self.llm_base_url or "http://127.0.0.1:30000",
    "apiKey": self.llm_api_key or "dummy",
    "api": "openai-completions",
Contributor


high

The OpenClaw provider is configured to use openai-completions, but the ArealOpenAI client implementation in this PR primarily focuses on overriding chat.completions and responses. If OpenClaw uses the legacy completions API, it may bypass the chat template normalization and engine logic added here, or fail if the IS Gateway doesn't explicitly support the completions endpoint. It is recommended to use openai-chat for modern chat models like Qwen3.

Suggested change
"api": "openai-completions",
"api": "openai-chat",

Comment on lines +156 to +157
if not isinstance(content, list):
    content = list(content)
Contributor


medium

The current logic incorrectly handles cases where message.content is a dictionary (a common format for single-part content in some SDKs). If content is a dictionary, list(content) will return a list of its keys, leading to corrupted content processing. It should check if the content is a mapping and wrap it in a list instead.

Suggested change
if not isinstance(content, list):
    content = list(content)
if isinstance(content, Mapping):
    content = [content]
elif not isinstance(content, list):
    content = list(content)
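A quick way to see the pitfall this comment describes: `list()` over a dict iterates its keys, so a mapping-shaped content part is silently corrupted. The snippet below is a standalone illustration with a hypothetical content value, not code from the PR:

```python
from collections.abc import Mapping

# A single-part content value in dict form, as some SDKs emit it.
content = {"type": "text", "text": "hello"}

# Buggy branch: list() over a dict yields only its keys.
assert list(content) == ["type", "text"]

# Corrected branch: wrap a mapping as a single content part.
if isinstance(content, Mapping):
    parts = [content]
elif not isinstance(content, list):
    parts = list(content)
else:
    parts = content

assert parts == [{"type": "text", "text": "hello"}]
```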

Collaborator


Please check this comment.


all_message_list += message_list

_normalize_messages_for_chat_template(all_message_list)
Contributor


medium

Calling _normalize_messages_for_chat_template here will mutate the message dictionaries in-place. Since all_message_list contains references to parent.messages which are stored in the InteractionCache, this results in side effects where cached interactions are modified. This can lead to unexpected behavior if the same interaction is reused or exported later. Consider deep-copying the messages before normalization.

    all_message_list = [deepcopy(m) for m in all_message_list]
    _normalize_messages_for_chat_template(all_message_list)

Comment on lines +12 to +21
def _recv_json(sock: socket.socket) -> dict[str, Any] | None:
    chunks: list[bytes] = []
    while True:
        data = sock.recv(4096)
        if not data:
            break
        chunks.append(data)
        if b"\n" in data:
            break
    return json.loads(b"".join(chunks).split(b"\n", 1)[0].decode("utf-8")) if chunks else None
Contributor


medium

The _recv_json function is not robust for stream-oriented sockets. If multiple JSON messages are received in a single recv call, or if a message is followed by the start of another, the split(b"\n", 1)[0] logic will discard the subsequent data. While the current client implementation uses a new connection per request, the server loop (which uses while request := _recv_json(...)) is susceptible to data loss if a client sends multiple requests over the same connection. Consider using a buffered reader or a proper framing handler.
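One way to make the framing robust is to keep a per-connection buffer and consume only up to each newline, carrying any remainder over to the next frame. This is a sketch of the suggested approach, not the code in this PR:

```python
import json
import socket
from typing import Any, Iterator


def iter_json_lines(sock: socket.socket) -> Iterator[dict[str, Any]]:
    """Yield one JSON object per newline-terminated frame.

    Bytes received past a delimiter are buffered for the next message,
    so multiple requests on one connection are not lost."""
    buffer = b""
    while True:
        while b"\n" not in buffer:
            data = sock.recv(4096)
            if not data:  # peer closed the connection
                return
            buffer += data
        line, buffer = buffer.split(b"\n", 1)
        if line:
            yield json.loads(line.decode("utf-8"))
```

A server loop would then iterate `for request in iter_json_lines(conn): ...` instead of calling a one-shot receiver per connection.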

Comment on lines +199 to +201
except subprocess.TimeoutExpired as exc:
    self._cleanup_lock_files(agent_id)
    raise OpenClawServiceError(f"OpenClaw CLI timeout after {self.timeout}s") from exc
Contributor


medium

The retry logic in the chat method does not cover subprocess.TimeoutExpired. If the OpenClaw CLI times out (which can happen under high concurrency in a benchmark), the method immediately raises an error instead of attempting the remaining retries. Timeouts should be included in the retry loop to improve benchmark stability.
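A hedged sketch of a retry wrapper that also covers timeouts. The exception class and CLI invocation are illustrative, not the PR's actual `service.py` code:

```python
import subprocess
import time


class OpenClawServiceError(RuntimeError):
    """Raised when the CLI fails after exhausting retries (illustrative)."""


def run_cli_with_retries(cmd: list[str], timeout: float, retries: int = 3) -> str:
    """Retry on both non-zero exits and timeouts, with exponential backoff."""
    last_error: Exception | None = None
    for attempt in range(retries):
        try:
            result = subprocess.run(
                cmd, capture_output=True, text=True, timeout=timeout, check=True
            )
            return result.stdout
        except (subprocess.TimeoutExpired, subprocess.CalledProcessError) as exc:
            last_error = exc  # timeouts are retried, not raised immediately
            time.sleep(2**attempt)  # backoff: 1s, 2s, 4s, ...
    raise OpenClawServiceError(f"CLI failed after {retries} attempts") from last_error
```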

Contributor

Copilot AI left a comment


Pull request overview

Adds an end-to-end performance benchmarking harness for the AReaL inference service using an OpenClaw-driven TAU²-bench workload, and updates the ArealOpenAI client preprocessing to better match SGLang’s chat-template tokenization behavior.

Changes:

  • Added batchmode benchmarking scripts (server launch, sweep runner, trajectory collection, metrics collection) and accompanying README.
  • Added a TAU²-bench + OpenClaw socket-tool integration layer (environment socket server, evaluator glue, task runners, OpenClaw agent/service/workspace utilities).
  • Added _normalize_messages_for_chat_template() to the ArealOpenAI client plus new unit tests for message normalization.

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 10 comments.

File Description
areal/experimental/openai/client.py Adds message normalization prior to apply_chat_template for SGLang-aligned tokenization.
tests/test_openai_client_normalize.py Adds unit tests covering content flattening and tool-call-argument parsing behavior.
examples/experimental/inference_service/batchmode/README.md Documents the benchmark setup, workflow, and sample results.
examples/experimental/inference_service/batchmode/start_servers.sh Slurm launcher for agent/user SGLang servers.
examples/experimental/inference_service/batchmode/sweep.sh Runs end-to-end IS sweep across concurrency/trials inside Singularity.
examples/experimental/inference_service/batchmode/collect_trajectories.py Orchestrates concurrent task runs via IS sessions and exports trajectories.
examples/experimental/inference_service/batchmode/collect_metrics.py Snapshots/diffs/monitors SGLang Prometheus metrics for throughput reporting.
examples/experimental/inference_service/batchmode/worker.py Runs a single TAU² task using OpenClaw agent with socket-based tool execution.
examples/experimental/inference_service/batchmode/tau2/pyproject.toml Defines an installable integration project + tau2 plugin entry point.
examples/experimental/inference_service/batchmode/tau2/__init__.py Plugin-style registration helpers and re-exports for the integration package.
examples/experimental/inference_service/batchmode/tau2/task_runner.py Non-socket task runner wrapper around shared implementation.
examples/experimental/inference_service/batchmode/tau2/task_runner_socket.py Socket-server-enabled task runner and CLI wrapper with env/tool wiring.
examples/experimental/inference_service/batchmode/tau2/tau2_env/__init__.py Exposes socket environment + evaluator utilities.
examples/experimental/inference_service/batchmode/tau2/tau2_env/environment_socket.py Implements the socket server and generates OpenClaw-callable tool scripts.
examples/experimental/inference_service/batchmode/tau2/tau2_env/evaluator.py Evaluates runs by comparing environment state/assertions and merges reward breakdowns.
examples/experimental/inference_service/batchmode/tau2/openclaw/agent.py Implements a TAU² LocalAgent backed by OpenClaw CLI + socket tool injection.
examples/experimental/inference_service/batchmode/tau2/openclaw/service.py Wraps OpenClaw CLI invocation and parses its JSON outputs.
examples/experimental/inference_service/batchmode/tau2/openclaw/workspace_manager.py Manages isolated OpenClaw workspaces and tool/skill generation.
examples/experimental/inference_service/batchmode/tau2/openclaw/__init__.py Re-exports OpenClaw integration components and a compatibility config module.


Comment on lines +156 to +159
if not isinstance(content, list):
    content = list(content)
parts = []
for part in content:

Copilot AI Apr 13, 2026


content = list(content) will produce incorrect results for mapping-like content values (e.g., a dict becomes a list of keys) and for bytes/bytearray (becomes a list of ints). Since _ensure_message_dict_list() accepts arbitrary mappings/iterables, this normalization can silently corrupt message content before apply_chat_template. Consider handling Mapping separately (treat as a single part) and avoiding list() for bytes/bytearray, or otherwise restricting this branch to known OpenAI list-content formats.

Suggested change
if not isinstance(content, list):
    content = list(content)
parts = []
for part in content:
if isinstance(content, list):
    content_parts = content
elif isinstance(content, Mapping):
    content_parts = [content]
elif isinstance(content, (bytes, bytearray)):
    content_parts = [content]
elif isinstance(content, Iterable):
    content_parts = list(content)
else:
    content_parts = [content]
parts = []
for part in content_parts:

Collaborator


Please check this comment.


# Entry point for tau2 plugin system
[project.entry-points."tau2.plugins"]
openclaw = "openclaw_tau2.register:register_plugin"

Copilot AI Apr 13, 2026


The package metadata declares an entry point openclaw_tau2.register:register_plugin, but there is no openclaw_tau2 package directory in this tree (the code lives under a top-level tau2/ package). As-is, installing this project will fail at runtime when the entry point is loaded. Either rename the Python package to openclaw_tau2 (and adjust discovery/paths) or update the entry point to match the actual importable module.

Suggested change
openclaw = "openclaw_tau2.register:register_plugin"
openclaw = "tau2.register:register_plugin"

Comment on lines +71 to +75
import subprocess

from openclaw_tau2 import run_task_with_socket_server
from tau2.registry import registry
from tau2.run import load_tasks

Copilot AI Apr 13, 2026


This script imports run_task_with_socket_server from openclaw_tau2, but the PR does not add an openclaw_tau2 module/package (the integration code is under examples/.../batchmode/tau2/). Running the worker will raise ModuleNotFoundError. Align this import with the actual package/module name produced by the integration package (or rename the integration package accordingly).

Comment on lines +1 to +12
import sys
import types

from loguru import logger

from .task_runner import run_task
from .task_runner_socket import run_task_with_socket_server
from .tau2_env import (
EnvironmentSocketServer,
OpenClawEnvironmentEvaluator,
create_openclaw_tool_script,
evaluate_simulation_with_environment,

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This integration package is implemented as a top-level Python package named tau2, which is very likely to conflict with (and shadow) the real tau2 module from tau2-bench when both are installed (your own scripts also pip install -e ${TAU2_DIR}). This can cause imports like from tau2.registry import registry to resolve to the wrong package. Consider renaming this package namespace (e.g., openclaw_tau2) and updating internal imports accordingly.

Comment on lines +94 to +106
--wrap "singularity exec --nv --no-home --writable-tmpfs --bind /storage:/storage ${CONTAINER} bash -c '
python -m sglang.launch_server \
--model-path ${MODEL_PATH} \
--served-model-name ${MODEL_NAME} \
--tp 8 \
--port ${USER_PORT} \
--host 0.0.0.0 \
--context-length ${CONTEXT_LENGTH} \
--tool-call-parser qwen25 \
--enable-metrics \
--enable-deterministic-inference \
--disable-radix-cache
'")

Copilot AI Apr 13, 2026


User SGLang section says radix cache is ON, but the submitted job includes --disable-radix-cache for the user server as well. Either remove that flag for the user server or update the comments/output so the script reflects the actual behavior (this impacts how baseline vs target comparisons are interpreted).

Comment on lines +5 to +12
Usage:
python run_single_worker.py \
--domain retail \
--task-index 0 \
--agent-endpoint http://127.0.0.1:30000/v1 \
--user-endpoint http://<node>:30001/v1 \
--model Qwen3-235B-A22B-Instruct-2507 \
--output-dir /tmp/results

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring usage example references python run_single_worker.py, but this file is worker.py. Updating the example avoids confusion when users try to follow the instructions.

Comment on lines +67 to +69
tau2_src = os.environ.get("TAU2_SRC_PATH", "${TAU2_DIR}/src")
if os.path.isdir(tau2_src) and tau2_src not in sys.path:
sys.path.insert(0, tau2_src)

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Defaulting TAU2_SRC_PATH to the literal string ${TAU2_DIR}/src won’t expand the environment variable, so os.path.isdir() will always fail unless TAU2_SRC_PATH is explicitly set. Consider using os.path.expandvars() (or deriving from TAU2_DIR directly) so the fallback works as intended.

Suggested change
tau2_src = os.environ.get("TAU2_SRC_PATH", "${TAU2_DIR}/src")
if os.path.isdir(tau2_src) and tau2_src not in sys.path:
    sys.path.insert(0, tau2_src)
tau2_src = os.environ.get("TAU2_SRC_PATH")
if not tau2_src:
    tau2_dir = os.environ.get("TAU2_DIR")
    if tau2_dir:
        tau2_src = os.path.join(tau2_dir, "src")
if tau2_src:
    tau2_src = os.path.expandvars(tau2_src)
    if os.path.isdir(tau2_src) and tau2_src not in sys.path:
        sys.path.insert(0, tau2_src)

    worker_id: int,
    work_dir: Path,
) -> dict:
    """Run tau2 task via run_single_worker.py subprocess (same pattern as v2).

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The run_tau2_task_subprocess() docstring mentions run_single_worker.py, but the subprocess actually runs worker.py via WORKER_SCRIPT. Updating the docstring will keep the implementation/docs aligned.

Suggested change
"""Run tau2 task via run_single_worker.py subprocess (same pattern as v2).
"""Run a tau2 task via the worker subprocess referenced by `WORKER_SCRIPT`.

Comment on lines +6 to +9
python collect_sglang_metrics.py snapshot http://127.0.0.1:30000

# Diff two snapshot files → JSON with deltas + derived throughput
python collect_sglang_metrics.py diff pre.json post.json --wall-clock 120

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The usage examples in this header refer to collect_sglang_metrics.py, but the file name is collect_metrics.py. Please update the example commands so they can be copy/pasted as-is.

Suggested change
python collect_sglang_metrics.py snapshot http://127.0.0.1:30000
# Diff two snapshot files → JSON with deltas + derived throughput
python collect_sglang_metrics.py diff pre.json post.json --wall-clock 120
python collect_metrics.py snapshot http://127.0.0.1:30000
# Diff two snapshot files → JSON with deltas + derived throughput
python collect_metrics.py diff pre.json post.json --wall-clock 120

def create_openclaw_tool_script(tool_name: str, server_config: dict) -> str:
    host = json.dumps(server_config["host"])
    tool = json.dumps(tool_name)
    return f"""#!/usr/bin/env python

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

create_openclaw_tool_script() emits helper scripts with shebang #!/usr/bin/env python, which may resolve to Python 2 or not exist in some environments. Since the integration targets Python 3.10+, consider using python3 in the shebang to make tool execution more reliable.

Suggested change
return f"""#!/usr/bin/env python
return f"""#!/usr/bin/env python3

Collaborator

@nuzant left a comment


This example is meant to be an OpenClaw + tau2 example with a benchmark. Several issues need to be fixed before it can be merged:

  1. In examples/experimental/inference_service/batchmode/sweep.sh, the inference services are launched directly with a bash command. To showcase and benchmark the full-stack inference service, you need to use the inference gateway controller to collect the results directly. Check online_rollout.py; you can reuse it in your benchmark scripts.
  2. The file structure should be reorganized. A better organization would be:
examples/experimental/inference_service/
    openclaw_tau2/ # current batchmode/tau2/ folder, for openclaw + tau2 environment
    benchmark/ # scripts for benchmarking SGLang vs inference service
        ... # launching script, collecting metrics, evaluation and so on
    ... # online, human-in-the-loop and offline demos.  Reuse them in the benchmark
    README.md # add a section in current README.md 
        # about online mode with opencode + tau2: 
        # 1. introduction to openclaw + tau2 environments
        # 2. guide on how to run the benchmark
        # 3. benchmark results

Comment on lines +156 to +157
if not isinstance(content, list):
    content = list(content)
Collaborator


Please check this comment.

Comment on lines +156 to +159
if not isinstance(content, list):
    content = list(content)
parts = []
for part in content:
Collaborator


Please check this comment.


all_message_list += message_list

_normalize_messages_for_chat_template(all_message_list)
Collaborator


Use a copy in _normalize_messages_for_chat_template and avoid modifying messages in place.

has_images = len(image_data) > 0

tokenizer_messages = messages_for_tokenizer if has_images else messages_list
_normalize_messages_for_chat_template(tokenizer_messages)
Collaborator


Same as above.

has_images = len(image_data) > 0

tokenizer_messages = messages_for_tokenizer if has_images else messages_list
_normalize_messages_for_chat_template(tokenizer_messages)
Collaborator


Same as above.

Comment on lines +13 to +14
def _normalize_messages_for_chat_template(messages: list[dict[str, Any]]) -> None:
    """Copied from areal.experimental.openai.client to avoid heavy import chain."""
Collaborator


Do not copy; import directly. Otherwise this test will break when the function is modified in the future.

Comment on lines +51 to +59
AGENT_JOBID=$(sbatch --parsable \
--job-name=agent-sglang \
--nodes=1 \
--cpus-per-task=100 \
--gres=gpu:8 \
--mem=1500G \
--time="${HOURS}:00:00" \
--output="${LOG_DIR}/agent-sglang-%j.log" \
--wrap "singularity exec --nv --no-home --writable-tmpfs --bind /storage:/storage ${CONTAINER} bash -c '
Collaborator


Since this is only a single-node benchmark with 8 GPUs, do not assume that users have a Slurm environment. Make both start_servers.sh and sweep.sh directly runnable on a single GPU node with Docker and the official AReaL image.

Collaborator


Also, Python scripts are much easier to read and understand than shell scripts. Could you rewrite these scripts in Python?

Comment on lines +22 to +24
singularity exec --writable-tmpfs --no-home --nv \
-B /storage/openpsi \
"$CONTAINER" bash -c '
Collaborator


/storage/openpsi is a hard-coded path in our internal cluster. Check whether there are any others and clean them up.

@@ -0,0 +1,82 @@
[project]
Collaborator


Translate the Chinese comments. Also, we do not need a pyproject.toml here. Write the install commands in the README, and make sure that users can successfully run the scripts after following them.
