
CI: backends for vllm testing are redundant and cause CUDA context lock #625

@avinash2692

Description


By the looks of it, we tend to spin up three vLLM servers for three separate modules:

  • test/backends/test_openai_vllm.py: runs vLLM in a subprocess, so CUDA context cleanup happens automatically when the subprocess exits.
  • test/backends/test_vllm.py and test/backends/test_vllm_tools.py: two separate backends, each reserving ~80% of a GPU and each running a different model (Qwen3 and Mistral).

When the latter two run sequentially in the same process, the CUDA driver holds GPU memory at the process level. Even after cleanup, the freed memory remains fragmented, and the second backend's initialization fails with out-of-memory (OOM) errors.

Proposed solution: spin up a single shared, session-scoped LocalVLLMBackend in conftest.py and reuse it across all vLLM-dependent modules (we may still need to figure out why test_openai_vllm.py needs a subprocess, or whether that test can simply point at an already-running vLLM instance). This:

  • Eliminates fragmentation (only one backend initialization per session)
  • Speeds up tests (no repeated model loading)
  • Stays compatible with --isolate-heavy for process isolation (if we want to use it for some reason)

Trade-off: all vLLM tests would use the same model (IMHO it should be fine to use granite micro).
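A minimal sketch of what the shared fixture could look like in conftest.py. The LocalVLLMBackend class and the model id below are stand-in placeholders so the example is self-contained; in the real suite the fixture would construct the repo's actual backend class:

```python
# conftest.py (sketch) -- one shared vLLM backend per test session.
# NOTE: LocalVLLMBackend here is a stub for illustration; the real
# class would load a model onto the GPU.
import pytest


class LocalVLLMBackend:
    """Stub standing in for the real vLLM backend class."""

    init_count = 0  # counts initializations to show the fixture runs once

    def __init__(self, model_id: str):
        LocalVLLMBackend.init_count += 1
        self.model_id = model_id  # real code would load the model here

    def shutdown(self) -> None:
        pass  # real code would release GPU memory here


@pytest.fixture(scope="session")
def vllm_backend():
    # Initialized once per session: no repeated model loads, and only
    # a single CUDA context is ever created in this process.
    backend = LocalVLLMBackend("granite-micro")  # placeholder model id
    yield backend
    backend.shutdown()  # single teardown at end of session
```

Each vLLM-dependent test module would then take vllm_backend as a fixture argument instead of constructing its own backend, so the model loads once for the whole session.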

Extra points: this could also give us the opportunity to run tests grouped by backend (hf, vllm, ollama, etc.), enabling us to have aggressive startup and teardown for each group. We could add a --group-by-backend flag that runs all tests for one backend together with coordinated lifecycle management.
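As a rough sketch of how such a flag could work, pytest's standard hooks are enough to reorder collected tests so each backend's tests run contiguously. The `backend` marker convention below is an assumption, not something the repo defines today:

```python
# conftest.py (sketch) -- hypothetical --group-by-backend flag.
# Assumes tests are tagged like @pytest.mark.backend("vllm"); that
# marker convention would need to be introduced alongside this flag.

def pytest_addoption(parser):
    parser.addoption(
        "--group-by-backend",
        action="store_true",
        help="Run all tests for one backend together (hf, vllm, ollama, ...)",
    )


def _backend_of(item) -> str:
    marker = item.get_closest_marker("backend")
    return marker.args[0] if marker else ""  # unmarked tests sort first


def pytest_collection_modifyitems(config, items):
    if config.getoption("--group-by-backend"):
        # Stable sort: tests for the same backend become contiguous, so a
        # session fixture can start and tear down each backend exactly once.
        items.sort(key=_backend_of)
```

Because `list.sort` is stable, the relative order of tests within each backend group is preserved.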

Originally posted by @avinash2692 in #605 (review)
