- [5] https://github.com/vllm-project/vllm/commit/e7d9d9c08c79b386f6d0477e87b77a572390317d
- [6] https://github.com/vllm-project/vllm/commit/0b98ba15c744f1dfb0ea4f2135e85ca23d572ae1
- [7] https://arxiv.org/abs/2309.06180
- [8] https://github.com/vllm-project/vllm/blob/main/LICENSE
- [9] https://github.com/vllm-project/vllm/issues/835
- [11] https://github.com/vllm-project/flash-attention
- [12] https://github.com/flashinfer-ai/flashinfer
- [13] https://docs.vllm.ai/en/v0.9.2/configuration/optimization.html#chunked-prefill_1
- [14] https://docs.vllm.ai/en/v0.9.2/features/spec_decode.html
- [16] https://docs.vllm.ai/en/v0.9.2/serving/openai_compatible_server.html
- [17] https://platform.openai.com/docs/api-reference/models/list
- [18] https://platform.openai.com/docs/api-reference/chat/create
- [20] https://docs.vllm.ai/en/v0.9.2/features/tool_calling.html
- [21] https://qwen.readthedocs.io/en/latest/deployment/vllm.html#parsing-tool-calls
- [22] https://docs.vllm.ai/en/v0.9.2/features/structured_outputs.html
- [23] https://github.com/vllm-project/vllm/blob/v0.9.2/vllm/entrypoints/openai/protocol.py#L247-L252
- [24] https://github.com/vllm-project/vllm/blob/v0.9.2/vllm/entrypoints/openai/protocol.py#L497-L506
- [25] https://github.com/vllm-project/vllm/blob/v0.9.2/vllm/entrypoints/openai/protocol.py#L538
- [26] https://github.com/vllm-project/vllm/blob/v0.9.2/vllm/engine/async_llm_engine.py#L473-L484
- [27] https://github.com/vllm-project/vllm/blob/v0.9.2/vllm/model_executor/guided_decoding/__init__.py#L100-L139
- [28] https://github.com/vllm-project/vllm/issues/22740
- [29] https://github.com/vllm-project/vllm/issues/22772
- [30] https://docs.vllm.ai/en/v0.9.2/features/reasoning_outputs.html
- [31] https://qwen.readthedocs.io/en/latest/deployment/vllm.html#parsing-thinking-content
- [32] https://github.com/vllm-project/vllm/blob/v0.9.2/vllm/engine/arg_utils.py#L626-L634
- [34] https://docs.vllm.ai/en/v0.9.2/serving/openai_compatible_server.html#chat-template_1
- [35] https://github.com/vllm-project/vllm/tree/main/examples
- [36] https://arxiv.org/abs/1706.03762
- [37] https://huggingface.co/blog/not-lain/kv-caching
- [38] https://docs.vllm.ai/en/v0.9.2/design/kernel/paged_attention.html
- [39] https://blog.vllm.ai/2023/06/20/vllm.html
- [40] https://github.com/HandsOnLLM/Hands-On-Large-Language-Models/blob/main/chapter03/Chapter%203%20-%20Looking%20Inside%20LLMs.ipynb
- [41] https://www.anyscale.com/blog/continuous-batching-llm-inference
- [42] https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/batcher.html
- [43] https://docs.vllm.ai/en/v0.9.2/design/arch_overview.html
- [44] https://docs.vllm.ai/en/v0.9.2/usage/v1_guide.html
- [45] https://blog.vllm.ai/2025/01/27/v1-alpha-release.html
- [46] https://github.com/vllm-project/vllm/issues/18571
- [50] https://docs.vllm.ai/en/v0.9.2/configuration/optimization.html#parallelism-strategies
- [51] https://www.youtube.com/watch?v=4i76hmmnJEo
- [52] https://docs.vllm.ai/en/v0.9.2/examples/online_serving/run_cluster.html
- [53] https://docs.vllm.ai/en/v0.9.2/usage/security.html
- [54] https://docs.pytorch.org/docs/stable/distributed.html#common-environment-variables
- [55] https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-ib-disable
- [56] https://docs.vllm.ai/en/v0.9.2/serving/distributed_serving.html
- [57] https://blog.vllm.ai/2025/02/17/distributed-inference.html
- [58] https://zerohertz.github.io/distributed-computing-rdma-roce/
- [59] https://developer.nvidia.com/gpudirect
- [61] https://docs.vllm.ai/en/v0.9.2/deployment/k8s.html
- [62] https://github.com/vllm-project/aibrix
- [63] https://arxiv.org/abs/2504.03648
- [64] https://blog.vllm.ai/2025/02/21/aibrix-release.html
- [65] https://aibrix.readthedocs.io/latest/
- [66] https://github.com/vllm-project/production-stack
- [67] https://docs.vllm.ai/en/v0.9.2/deployment/integrations/production-stack.html
- [68] https://blog.vllm.ai/2025/01/21/stack-release.html
- [69] https://blog.vllm.ai/production-stack/
- [70] https://docs.vllm.ai/en/v0.9.2/deployment/frameworks/lws.html
- [71] https://github.com/kubernetes-sigs/lws
- [72] https://docs.vllm.ai/en/v0.9.2/usage/metrics.html
- [73] https://docs.vllm.ai/en/v0.9.2/examples/online_serving/prometheus_grafana.html
- [74] https://docs.vllm.ai/en/v0.9.2/cli/index.html#bench
- [75] https://docs.vllm.ai/en/v0.9.2/contributing/profiling.html
- [76] https://github.com/vllm-project/vllm/tree/v0.9.2/benchmarks
- [77] https://github.com/vllm-project/guidellm
- [78] https://arxiv.org/abs/2502.06494