Initialize and reuse V1 renderer pools efficiently #1791

Closed
xeophon wants to merge 1 commit into main from codex/v1-renderer-pool-off-loop
Conversation

xeophon (Member) commented Jun 21, 2026

Overview

Initialize V1 training renderer pools through one shared background task, and reuse one explicitly pinned renderer client across LoRA/checkpoint sampling IDs. Cold tokenizer construction no longer blocks the env-server event loop, while adapters that share a base tokenizer no longer retain duplicate pools.

Dependency

This PR must not merge until PrimeIntellect-ai/renderers#91 is merged, released, and resolved by Verifiers. That upstream change removes the process-wide fastokens patch window, which is required before renderer construction can safely overlap unrelated tokenizer work.

Design

Off-loop initialization

The first TrainClient caller stores an asyncio.Task before yielding, and that task runs create_renderer_pool through asyncio.to_thread. Concurrent first requests share one initialization. Waiters use asyncio.shield, so cancelling one rollout does not cancel startup for others; the completed task caches either the renderer pool or its startup failure.
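The pattern described above can be sketched as follows. This is a minimal illustration, not the PR's code: `LazyPool`, `slow_factory`, and `bad_factory` are hypothetical names, and the factory stands in for `create_renderer_pool`. It assumes Python 3.9+ for `asyncio.to_thread`.

```python
import asyncio
import time


class LazyPool:
    """Hypothetical stand-in for the pool handling described above."""

    def __init__(self, factory):
        self._factory = factory   # synchronous and possibly slow
        self._pool_task = None    # shared asyncio.Task, created at most once

    async def get(self):
        if self._pool_task is None:
            # No await between the check and the assignment, so concurrent
            # first callers on one event loop cannot create two tasks.
            self._pool_task = asyncio.create_task(
                asyncio.to_thread(self._factory)
            )
        # shield() keeps a cancelled waiter from cancelling the shared task.
        # A finished task replays its result or its exception on every await.
        return await asyncio.shield(self._pool_task)


async def demo_success():
    calls = []

    def slow_factory():
        calls.append(1)
        time.sleep(0.05)  # simulates cold tokenizer construction
        return object()

    lazy = LazyPool(slow_factory)
    waiters = [asyncio.create_task(lazy.get()) for _ in range(8)]
    await asyncio.sleep(0)   # let every waiter reach the shared task
    waiters[0].cancel()      # cancel one rollout during startup
    results = await asyncio.gather(*waiters, return_exceptions=True)
    return len(calls), results


async def demo_failure():
    calls = []

    def bad_factory():
        calls.append(1)
        raise RuntimeError("startup failed")

    lazy = LazyPool(bad_factory)
    errors = []
    for _ in range(3):
        try:
            await lazy.get()
        except RuntimeError as exc:
            errors.append(exc)
    return len(calls), errors
```

In `demo_success` the cancelled waiter receives `CancelledError` while the remaining waiters all get the same pool object from a single factory call. In `demo_failure` the factory runs once and each later call re-raises the stored error.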

Pinned renderer reuse

EnvServer caches the client for a native TrainClientConfig under a key built from the serialized config plus the explicit renderer_model_name. LoRA/checkpoint request models that use the same pinned base tokenizer therefore share one client and pool. Unpinned training clients and eval clients remain keyed by request model.

The request model is still stored in each RolloutContext and passed to generation, so adapter routing is unchanged.
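A rough sketch of the cache-key rule, under stated assumptions: the `TrainClientConfig` below is a simplified hypothetical dataclass (the `base_url` field is invented for illustration), and `client_cache_key` is not a function from the PR. Only the keying behavior follows the description above.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class TrainClientConfig:
    """Simplified, hypothetical stand-in for the real config class."""
    base_url: str                              # assumed field, for illustration
    renderer_model_name: Optional[str] = None  # set when the renderer is pinned


def client_cache_key(config: TrainClientConfig, request_model: str) -> tuple:
    serialized = json.dumps(asdict(config), sort_keys=True)
    if config.renderer_model_name is not None:
        # Pinned: every adapter/checkpoint ID maps to the same key,
        # so they share one client and one renderer pool.
        return (serialized, config.renderer_model_name)
    # Unpinned: keep one client per request model, as before.
    return (serialized, request_model)
```

With a pinned config, `client_cache_key(cfg, "lora-a")` and `client_cache_key(cfg, "lora-b")` are equal; with an unpinned config they differ. The request model itself would still travel with each rollout for adapter routing.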

Performance

Off-loop initialization, using 64 concurrent first callers and a 200 ms synchronous initializer:

| Metric | Before | After |
| --- | --- | --- |
| Maximum event-loop gap | 211.1 ms | 4.3 ms |
| Initialization wall time | 210.3 ms | 206.4 ms |
| Factory invocations | 1 | 1 |
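One way an event-loop gap like the one above could be measured is with a heartbeat coroutine that records the longest interval between its wake-ups. The sketch below is an assumption about method, not the benchmark used for these numbers; `max_loop_gap`, `on_loop`, and `off_loop` are hypothetical names.

```python
import asyncio
import time


async def max_loop_gap(work, tick=0.001):
    """Await `work()` while a heartbeat records the longest interval
    between its consecutive wake-ups on the event loop."""
    gaps = []
    done = asyncio.Event()

    async def heartbeat():
        last = time.perf_counter()
        while not done.is_set():
            await asyncio.sleep(tick)
            now = time.perf_counter()
            gaps.append(now - last)
            last = now

    beat = asyncio.create_task(heartbeat())
    await asyncio.sleep(0)  # let the heartbeat take its first timestamp
    await work()
    done.set()
    await beat
    return max(gaps)


async def on_loop():
    time.sleep(0.2)  # synchronous initializer run directly on the loop


async def off_loop():
    await asyncio.to_thread(time.sleep, 0.2)  # same work in a worker thread
```

Running `asyncio.run(max_loop_gap(on_loop))` should report a gap of roughly the initializer's duration, while the `off_loop` variant should report a much smaller one, since the loop keeps servicing the heartbeat while the thread sleeps.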

Pinned-client reuse, using eight adapter IDs, 64 requests per adapter, concurrency 128, and a locally cached Qwen tokenizer:

| Metric | Before | After |
| --- | --- | --- |
| Cached clients / pool factory calls | 8 | 1 |
| Initialization wall time | 2.926 s | 1.461 s |
| Retained process RSS | 1383.3 MiB | 365.8 MiB |
| Incremental retained RSS | 1262.3 MiB | 256.7 MiB |

Note

Medium Risk
Touches rollout hot paths (async pool lifecycle and shared clients across adapter IDs); behavior is covered by new tests but depends on correct shielding and cache-key semantics for training vs eval.

Overview
TrainClient now builds renderer pools on a background thread via a single asyncio.Task, so concurrent first callers share one initialization and the env-server loop is not blocked by synchronous create_renderer_pool. Waiters use asyncio.shield so rollout cancellation does not abort startup; a failed init is cached and replayed for later calls.

EnvServer client cache keys training clients with an explicit renderer_model_name when set, so multiple LoRA/adapter request models share one TrainClient and pool while RolloutContext still passes the per-request model into generation.

Adds tests/v1/test_serve.py covering cache key behavior, one-pool routing for many pinned-adapter requests, and single-shot failure caching under concurrency.

Reviewed by Cursor Bugbot for commit 86ff846.

Note

Initialize and reuse V1 renderer pools efficiently across concurrent callers

  • TrainClient._renderer_pool in train.py is now async and uses asyncio.create_task + asyncio.shield to ensure the renderer pool is created exactly once, even under concurrent access.
  • Initialization failures are cached and re-raised to all awaiters without re-invoking create_renderer_pool.
  • EnvServer._client in server.py now keys TrainClientConfig cache entries by renderer_model_name (when set), so clients with the same pinned renderer model are shared across different adapters/models.
  • Behavioral Change: unpinned TrainClientConfig clients retain per-model caching; pinned clients now share a single TrainClient instance and its renderer pool across adapters.

Macroscope summarized 86ff846.

macroscopeapp Bot commented Jun 21, 2026

Approvability

Verdict: Needs human review

This PR modifies concurrency patterns for renderer pool initialization and client caching. An unresolved review comment raises a substantive concern about a missing upstream dependency fix that could cause concurrent tokenizer loading issues at runtime.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c16b823f2d

Comment thread verifiers/v1/clients/train.py
@xeophon xeophon force-pushed the codex/v1-renderer-pool-off-loop branch from c16b823 to 0726607 Compare June 21, 2026 10:12
@xeophon xeophon changed the title Initialize V1 renderer pools off-loop Document renderer pool initialization constraint Jun 21, 2026
macroscopeapp Bot previously approved these changes Jun 21, 2026
@xeophon xeophon force-pushed the codex/v1-renderer-pool-off-loop branch from 0726607 to 0528816 Compare June 21, 2026 10:28
@xeophon xeophon changed the title Document renderer pool initialization constraint Initialize V1 renderer pools off-loop Jun 21, 2026
@macroscopeapp macroscopeapp Bot dismissed their stale review June 21, 2026 10:28

Dismissing prior approval to re-evaluate 0528816

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 0528816.

Comment thread verifiers/v1/clients/train.py
@xeophon xeophon changed the title Initialize V1 renderer pools off-loop Initialize and reuse V1 renderer pools efficiently Jun 22, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1f165371f8

Comment on lines +199 to +200
self._pool_task = asyncio.create_task(
asyncio.to_thread(

[P2] Pin renderers before offloading pool creation

This moves create_renderer_pool into a worker thread, but the package metadata still allows renderers>=0.1.8.dev40 and the lockfile still resolves 0.1.8.dev43; the required upstream fix (renderers#91) is still open. Fresh evidence beyond the prior comment is that this commit did not bump or lock the dependency, so installs resolving the current range can still run the old process-wide fastokens/Transformers patch concurrently with environment or harness tokenizer loads during the first train request.

@xeophon xeophon changed the base branch from feat/nano-as-v1 to main June 23, 2026 04:10
@xeophon xeophon force-pushed the codex/v1-renderer-pool-off-loop branch from 1f16537 to 86ff846 Compare June 23, 2026 04:17
@xeophon xeophon closed this Jul 1, 2026