
CI: backends for vllm testing are redundant and cause CUDA context lock #625

@avinash2692

Description


By the looks of it, we tend to spin up three vLLM servers for three separate modules:

  • test/backends/test_openai_vllm.py: runs vLLM in a subprocess, so CUDA context cleanup happens automatically when the subprocess exits.
  • test/backends/test_vllm.py and test/backends/test_vllm_tools.py: two separate backends, each reserving ~80% of a GPU and each running a different model (Qwen3 and Mistral).

When the latter two run sequentially in the same process, the CUDA driver holds GPU memory at the process level. Even after cleanup, the freed memory remains fragmented, and the second backend's initialization fails with out-of-memory (OOM) errors.

Proposed solution: spin up a single shared, session-scoped LocalVLLMBackend in conftest.py and reuse it across all vLLM-dependent modules (we may still need to figure out why test_openai_vllm.py needs a subprocess, or whether that test can simply point at an already-running vLLM instance). This:

  • Eliminates fragmentation (only one backend initialization per session)
  • Speeds up tests (no repeated model loading)
  • Stays compatible with --isolate-heavy for process isolation (if we want to use it for some reason)

Trade-off: all vLLM tests would use the same model (IMHO it should be fine to use granite micro).
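A minimal sketch of what the shared fixture could look like in conftest.py. The LocalVLLMBackend class and the model id below are stand-in placeholders so the example is self-contained; in the real suite the fixture would construct the repo's actual backend class:

```python
# conftest.py (sketch) -- one shared vLLM backend per test session.
# NOTE: LocalVLLMBackend here is a stub for illustration; the real
# class would load a model onto the GPU.
import pytest


class LocalVLLMBackend:
    """Stub standing in for the real vLLM backend class."""

    init_count = 0  # counts initializations to show the fixture runs once

    def __init__(self, model_id: str):
        LocalVLLMBackend.init_count += 1
        self.model_id = model_id  # real code would load the model here

    def shutdown(self) -> None:
        pass  # real code would release GPU memory here


@pytest.fixture(scope="session")
def vllm_backend():
    # Initialized once per session: no repeated model loads, and only
    # a single CUDA context is ever created in this process.
    backend = LocalVLLMBackend("granite-micro")  # placeholder model id
    yield backend
    backend.shutdown()  # single teardown at end of session
```

Each vLLM-dependent test module would then take vllm_backend as a fixture argument instead of constructing its own backend, so the model loads once for the whole session.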

Extra points: this could also give us the opportunity to run tests grouped by backend (hf, vllm, ollama, etc.), enabling us to have aggressive startup and teardown for each group. We could add a --group-by-backend flag that runs all tests for one backend together with coordinated lifecycle management.
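As a rough sketch of how such a flag could work, pytest's standard hooks are enough to reorder collected tests so each backend's tests run contiguously. The `backend` marker convention below is an assumption, not something the repo defines today:

```python
# conftest.py (sketch) -- hypothetical --group-by-backend flag.
# Assumes tests are tagged like @pytest.mark.backend("vllm"); that
# marker convention would need to be introduced alongside this flag.

def pytest_addoption(parser):
    parser.addoption(
        "--group-by-backend",
        action="store_true",
        help="Run all tests for one backend together (hf, vllm, ollama, ...)",
    )


def _backend_of(item) -> str:
    marker = item.get_closest_marker("backend")
    return marker.args[0] if marker else ""  # unmarked tests sort first


def pytest_collection_modifyitems(config, items):
    if config.getoption("--group-by-backend"):
        # Stable sort: tests for the same backend become contiguous, so a
        # session fixture can start and tear down each backend exactly once.
        items.sort(key=_backend_of)
```

Because `list.sort` is stable, the relative order of tests within each backend group is preserved.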

Originally posted by @avinash2692 in #605 (review)
