LLM inference engine from scratch — paged KV cache, continuous batching, chunked prefill, prefix caching, speculative decoding, CUDA graph, tensor parallelism, OpenAI-compatible serving
Updated Apr 24, 2026 - Python
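Several repositories in this topic list speculative decoding among their features. A minimal sketch of the idea, with toy greedy draft/target models standing in for real networks (all names here are illustrative, and greedy prefix matching replaces the rejection-sampling acceptance rule used in practice; real engines also verify all drafted tokens in a single batched forward pass rather than a loop):

```python
def speculative_step(prefix, draft_model, target_model, k=4):
    """One round of speculative decoding: draft proposes k tokens,
    target verifies them and keeps the longest agreeing prefix."""
    draft, ctx = [], list(prefix)
    for _ in range(k):                      # cheap draft model proposes k tokens
        tok = draft_model(ctx)
        draft.append(tok)
        ctx.append(tok)
    accepted, verify_ctx = [], list(prefix)
    for tok in draft:                       # target checks each proposal in order
        expected = target_model(verify_ctx)
        if expected != tok:
            accepted.append(expected)       # correct the first mismatch, then stop
            break
        accepted.append(tok)
        verify_ctx.append(tok)
    else:
        # All k proposals accepted: the verification pass yields one bonus token.
        accepted.append(target_model(verify_ctx))
    return prefix + accepted
```

When the draft agrees with the target, each round emits up to k+1 tokens for roughly the cost of one target forward pass; when it disagrees early, you still make one token of guaranteed progress.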
A High-Performance LLM Inference Engine with vLLM-Style Continuous Batching
gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling
Continuous batching for TTS — like vLLM, but for voice. Serve 10+ simultaneous text-to-speech requests on a single GPU.
A from-scratch LLM inference engine built in PyTorch with custom GPT2/LLaMA transformers, KV cache, paged KV cache, continuous batching, and A100 benchmarks
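The paged KV cache these engines describe maps each request's logical token positions onto fixed-size physical blocks, so memory is allocated on demand instead of reserved for the maximum sequence length. A minimal sketch, assuming a simple free-list allocator (class and method names are illustrative, not taken from any of the repositories above):

```python
BLOCK_SIZE = 16  # tokens per physical cache block

class PagedKVCache:
    """Maps each request's token positions to fixed-size physical blocks."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))        # free-list allocator
        self.block_tables: dict[str, list[int]] = {}      # req id -> block ids
        self.lengths: dict[str, int] = {}                 # tokens stored per req

    def append_token(self, req_id: str) -> tuple[int, int]:
        """Return the (physical_block, offset) slot for the request's next token."""
        table = self.block_tables.setdefault(req_id, [])
        pos = self.lengths.get(req_id, 0)
        if pos % BLOCK_SIZE == 0:                         # block full (or first token)
            table.append(self.free_blocks.pop())          # allocate a fresh block
        self.lengths[req_id] = pos + 1
        return table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def release(self, req_id: str) -> None:
        """Return a finished request's blocks to the free list."""
        self.free_blocks.extend(self.block_tables.pop(req_id, []))
        self.lengths.pop(req_id, None)
```

Because blocks are released as soon as a request finishes, internal fragmentation is bounded by at most one partially filled block per request, which is where the "memory waste under 5%" figures quoted by PagedAttention-style engines come from.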
Fork of OpenAI and Anthropic compatible server for Apple Silicon. Native MLX backend, 500+ tok/s. Run LLMs and vision-language models with continuous batching, MCP tool calling, and multimodal support.
OpenAI-compatible server with continuous batching for MLX on Apple Silicon
Adaptive LLM inference scheduler simulation — continuous batching, priority preemption, KV-cache routing, and speculative decoding in Python/asyncio.
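The continuous (iteration-level) batching that most of these projects implement admits new requests into the running batch between decode steps, rather than waiting for an entire static batch to finish. A toy synchronous sketch of that scheduling loop (the one-token-per-step "model" and all names are assumptions for illustration):

```python
from collections import deque

MAX_BATCH = 4  # illustrative cap on concurrent requests per step

def decode_step(batch):
    """Toy stand-in for one forward pass: each running request emits one token."""
    for req in batch:
        req["generated"] += 1

def serve(requests):
    """Run iteration-level scheduling until every request completes."""
    waiting = deque(requests)  # each: {"id": ..., "target": n, "generated": 0}
    running, finished = [], []
    while waiting or running:
        # Admit new requests at every iteration, not per batch.
        while waiting and len(running) < MAX_BATCH:
            running.append(waiting.popleft())
        decode_step(running)
        still = []
        for req in running:
            (finished if req["generated"] >= req["target"] else still).append(req)
        running = still  # finished requests free their slots immediately
    return finished
```

Short requests exit the batch as soon as they complete and their slots are refilled on the next step, which is the mechanism behind the large throughput gains reported by vLLM-style engines.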
High-Performance LLM Inference Engine with PagedAttention & Continuous Batching | memory waste under 5%, throughput up 50%
Process batches of large language model tasks efficiently using multithreading in C++ for faster, more scalable LLM workflows.
Mistral-7B inference server achieving roughly 15x throughput gains through continuous batching, priority scheduling, and CPU-optimized execution. Exposes a FastAPI surface with per-request latency and queue metrics.