
feat(infra): Support for proxy server through RayScheduler#1161

Open
hlyli wants to merge 6 commits into inclusionAI:main from hlyli:ray-proxy

Conversation

@hlyli
Contributor

@hlyli hlyli commented Apr 10, 2026

Description

This commit enables proxy servers by introducing a RayHTTPLauncher actor for forked HTTP workers. The RayHTTPLauncher reads the command parameter that was previously unused in the RayRPCServer and launches the command through subprocess.Popen, much like how a LocalScheduler launches RPCServers. Communication between the RayHTTPLauncher and the proxy happens over HTTP, so as not to touch the preexisting proxy server code.

An additional small tweak: I added a __repr__ to the Ray actors that includes the actor name, making it easier to distinguish between rollout instances.

Coauthored with @ActuallyEdward
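The launch path described above can be sketched as follows. This is a minimal illustration of the idea, not the actual PR code: the class name, constructor arguments, and demo values are hypothetical, and only subprocess.Popen is taken directly from the description.

```python
import subprocess


class HTTPLauncherSketch:
    """Sketch of a launcher actor that starts a previously unused worker
    command as a subprocess, mirroring how a local scheduler would start
    an RPC server. In real Ray code this would be a @ray.remote actor."""

    def __init__(self, actor_name: str, command: list):
        self.actor_name = actor_name
        self.command = command
        self.worker_process = None

    def start(self) -> None:
        # Fork the HTTP worker; further communication happens over HTTP.
        self.worker_process = subprocess.Popen(self.command)

    def __repr__(self) -> str:
        # Include the actor name to distinguish rollout instances in logs.
        return f"{type(self).__name__}(name={self.actor_name!r})"


launcher = HTTPLauncherSketch("rollout-0", ["echo", "proxy up"])
print(repr(launcher))  # HTTPLauncherSketch(name='rollout-0')
```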

Related Issue

#963

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📝 Documentation update
  • ♻️ Refactoring
  • ⚡ Performance improvement
  • ✅ Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • Pre-commit hooks pass (pre-commit run --all-files)
  • Relevant tests pass; new tests added for new functionality
  • Documentation updated (if applicable; built with ./docs/build_all.sh)
  • Branch is up to date with main
  • Self-reviewed via /review-pr command
  • This PR was created by a coding agent via /create-pr
  • This PR is a breaking change

Breaking Change Details (if applicable):

Additional Context

/review-pr was done with codex
RayHTTPLauncher has only been tested with the proxy server. Other HTTP-style servers may be compatible but are untested; however, I hope this can more easily enable other types of servers.



Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a base RayServer class and a new RayHTTPLauncher to support launching proxy and HTTP servers within Ray actors, enabling proxy worker support in the RayScheduler. Key changes include refactoring RayRPCServer, updating the scheduler to handle forked workers with custom commands, and optimizing resource allocations for internal Ray tasks. Review feedback highlights several areas for improvement: the HTTP retry logic in RayHTTPLauncher lacks proper exception handling and robust status code checks, the rpc_meta parameter is currently ignored in remote calls, and the heuristic for selecting the actor class is considered fragile. Additionally, there is dead code in the launcher's cleanup logic and opportunities to modernize Ray API usage.

Several comment threads on areal/infra/rpc/ray_rpc_server.py (most marked Outdated; content collapsed)
Comment on lines +277 to +280
if command and "rpc.rpc_server" not in command:
    actor_cls = RayHTTPLauncher
else:
    actor_cls = RayRPCServer
Contributor


medium

The heuristic used to decide between RayHTTPLauncher and RayRPCServer based on the presence of "rpc.rpc_server" in the command string is fragile. It would be better to pass an explicit flag or use a more robust mechanism to determine the worker type.
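One way to make the selection explicit, along the lines the review suggests. The enum, mapping, and function names here are hypothetical, and the two classes are placeholders for the real actor classes:

```python
from enum import Enum


class RayRPCServer:  # placeholder for the real actor class
    pass


class RayHTTPLauncher:  # placeholder for the real actor class
    pass


class WorkerKind(Enum):
    RPC = "rpc"
    HTTP = "http"


_ACTOR_BY_KIND = {
    WorkerKind.RPC: RayRPCServer,
    WorkerKind.HTTP: RayHTTPLauncher,
}


def select_actor_cls(kind: WorkerKind) -> type:
    # Explicit dispatch: the caller states the worker type, instead of the
    # scheduler inferring it from a substring of the launch command.
    return _ACTOR_BY_KIND[kind]
```

With this shape, adding a third worker type means extending the enum and the mapping rather than growing a chain of string checks.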

@hlyli
Contributor Author

hlyli commented Apr 10, 2026

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a base RayServer class and a new RayHTTPLauncher to support launching HTTP-based proxy servers within Ray actors. It also updates the RayScheduler to handle these forked HTTP workers and removes previous restrictions on proxy workers in Ray environments. Feedback includes concerns regarding subprocess output handling in Ray, incorrect retry logic for client-side HTTP errors, and potential blocking of the actor thread during retries. Additionally, there are suggestions to improve the robustness of actor class selection and to revert a change in default GPU allocation for forked workers to avoid regressions.

Comment on lines +374 to +376
self.worker_process = subprocess.Popen(
    cmd, env=_env, stdout=sys.stdout, stderr=subprocess.STDOUT
)
Contributor


high

Using sys.stdout as the stdout argument for subprocess.Popen within a Ray actor can be problematic. In many Ray environments, sys.stdout is redirected to a custom stream object that does not have a valid file descriptor (fileno()), which will cause Popen to raise an UnsupportedOperation or ValueError. It is safer to use None to inherit the standard output or redirect to a specific log file.

Suggested change (replace stdout=sys.stdout with stdout=None):

    self.worker_process = subprocess.Popen(
        cmd, env=_env, stdout=None, stderr=subprocess.STDOUT
    )

Contributor Author


I didn't face such an issue when running with Ray. I can change to None if needed.

Comment thread areal/infra/rpc/ray_rpc_server.py Outdated
        f"(attempt {attempt}/{max_retries}): {last_error}. "
        f"Retrying in {delay:.1f}s..."
    )
    time.sleep(delay)
Contributor


medium

Using time.sleep inside a Ray actor method blocks the actor's execution thread. This prevents the actor from responding to other incoming messages, such as ping() or destroy(), which could lead to unnecessary timeouts in the scheduler. If the actor is intended to be responsive during retries, consider using asyncio.sleep (if the actor is async) or reducing the retry delay.
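The async-actor variant the reviewer mentions can be sketched as below. This is an illustration under stated assumptions: the class, method names, and retry parameters are hypothetical, and the health check is a stub standing in for an HTTP request to the forked worker. With asyncio.sleep, the actor's event loop can still serve ping() or destroy() calls between retry attempts.

```python
import asyncio


class HTTPLauncherAsyncSketch:
    """Illustrates non-blocking retries in an async actor. In real Ray
    code this would be decorated with @ray.remote, and _check_health
    would hit the forked worker's HTTP health endpoint."""

    def __init__(self):
        self.alive = True

    async def ping(self) -> bool:
        # Still answerable while wait_ready() is sleeping between retries.
        return self.alive

    async def wait_ready(self, max_retries: int = 3, delay: float = 0.01) -> bool:
        for attempt in range(1, max_retries + 1):
            if self._check_health():
                return True
            # asyncio.sleep yields control to the event loop;
            # time.sleep here would block the whole actor.
            await asyncio.sleep(delay)
        return False

    def _check_health(self) -> bool:
        # Stub for an HTTP GET against the worker's health endpoint.
        return True


async def main():
    actor = HTTPLauncherAsyncSketch()
    # ping() resolves even while wait_ready() is in flight.
    ready, alive = await asyncio.gather(actor.wait_ready(), actor.ping())
    print(ready, alive)  # True True


asyncio.run(main())
```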

Comment on lines +277 to +280
if command and "rpc.rpc_server" not in command:
    actor_cls = RayHTTPLauncher
else:
    actor_cls = RayRPCServer
Contributor


medium

The logic to select the actor class based on the presence of "rpc.rpc_server" in the command string is brittle. This assumes that any forked worker not using the default RPC server module must be an HTTP server. It would be more robust to explicitly pass the desired launcher type or use a more reliable detection mechanism.

schedulings.append(SchedulingSpec(cpu=0, mem=0, gpu=1, port_count=1))
# use 0 gpu to prevent any scheduling issues since forks so far only use cpu
# future forks that require gpu should change fork implementation to accept a scheduling spec
schedulings.append(SchedulingSpec(cpu=0, mem=0, gpu=0, port_count=1))
Contributor


medium

Changing the default GPU allocation for forked workers from 1 to 0 is a regression for any forked workers that require GPU access (e.g., a reference model forked from an actor to share GPU memory). If gpu=0 is specified, Ray will not set the CUDA_VISIBLE_DEVICES environment variable for the forked actor, which may cause it to fail or incorrectly access all GPUs on the node. While this works for the current proxy server use case, it breaks the fork_workers API for other GPU-dependent tasks.

Suggested change (restore gpu=1):

    schedulings.append(SchedulingSpec(cpu=0, mem=0, gpu=1, port_count=1))

Contributor Author


Will leave this for a future implementation to support GPU colocation for forked workers. The current proxy is colocated with rollout, which often uses more than one GPU, so GPU colocation is not supported through Ray.

Contributor Author


Different idea: I will always use 0 GPUs when forking and simply copy the device env var from the parent to the forked worker. This prevents any scheduling issues, especially for multi-GPU workers. I can address this in another PR.
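The env-var-copying idea could look roughly like this. A minimal sketch, assuming the device variable is CUDA_VISIBLE_DEVICES; the function name and the demo values are hypothetical, not the planned implementation:

```python
import os
import subprocess
import sys


def fork_with_parent_devices(cmd):
    """Fork a worker scheduled with gpu=0 while still inheriting the
    parent's GPU visibility (the 'copy the device env var' idea)."""
    env = dict(os.environ)  # start from a copy of the parent's environment
    # Explicitly carry the device variable over so the fork sees the same
    # GPUs as its parent, even though the scheduler allocated it zero GPUs.
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible is not None:
        env["CUDA_VISIBLE_DEVICES"] = visible
    return subprocess.Popen(cmd, env=env)


# Example: the child prints the device list it inherited from the parent.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
child = fork_with_parent_devices(
    [sys.executable, "-c", "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"]
)
child.wait()  # child prints 0,1
```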

@hlyli hlyli changed the title Support for proxy server through RayScheduler feat(infra): Support for proxy server through RayScheduler Apr 10, 2026
Collaborator

@garrett4wade garrett4wade left a comment


I think we should use Ray only to schedule and create workers. The RayRPCServer should be a guard or daemon process in which we create the initial HTTP RPC server. Ray should not be involved in any further forking, engine creation, or calling.

The create_workers method of RayScheduler should schedule and create a RayRPCServer, as in the current code. Then, upon initialization, the RayRPCServer should launch a subprocess that runs areal/infra/rpc/rpc_server.py. Every call to this worker would then be redirected to the HTTP subprocess, which shares the same logic with the current local and Slurm schedulers.
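As a rough illustration of this guard/daemon design (not an implementation): the Ray actor only forks the real HTTP RPC server and forwards calls to it over HTTP. The class name, constructor, port handling, and call interface below are all hypothetical:

```python
import subprocess
import sys
import urllib.request


class GuardServerSketch:
    """Guard/daemon design sketched above: the scheduler only creates this
    actor; the real server runs in a forked subprocess, and every worker
    call is redirected to that subprocess over HTTP."""

    def __init__(self, port: int):
        self.port = port
        self.proc = None

    def start(self, module: str = "areal.infra.rpc.rpc_server") -> None:
        # Launch the same HTTP RPC server module used by the local/Slurm
        # schedulers, so the call path is shared across schedulers.
        self.proc = subprocess.Popen(
            [sys.executable, "-m", module, "--port", str(self.port)]
        )

    def call(self, endpoint: str) -> bytes:
        # Redirect the worker call to the HTTP subprocess.
        url = f"http://127.0.0.1:{self.port}/{endpoint}"
        with urllib.request.urlopen(url) as resp:
            return resp.read()
```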

Should we have a discussion before proceeding?

@hlyli
Contributor Author

hlyli commented Apr 14, 2026

I think I agree with that. We have also had some internal discussion within our group about possibly retiring the current RayRPCServer and moving to a full HTTP design, so as to make maintaining Ray easier when new features are added to the RPCServer. Perhaps I could also try to schedule a discussion with some of our team members.

@hlyli
Contributor Author

hlyli commented Apr 14, 2026

We can close the PR for now. We agree with the idea and can discuss it in this week's meeting.

