Version: 1.0 | Last Updated: 2026-05-31 | Module Path: src/llm/
The LLM module provides inference execution, routing, model and adapter lifecycle control, streaming, policy enforcement, and runtime safety surfaces.
| Surface | Source files |
|---|---|
| Core inference engines | src/llm/async_inference_engine.cpp, src/llm/inference_engine_enhanced.cpp |
| Scheduling and queueing | src/llm/shared_worker_pool.cpp, src/llm/continuous_batch_scheduler.cpp |
| Routing and orchestration | src/llm/model_router.cpp, src/llm/ai_orchestrator.cpp |
| Model and plugin lifecycle | src/llm/llm_plugin_manager.cpp, src/llm/model_loader.cpp, src/llm/model_downloader.cpp |
| Adapter and LoRA lifecycle | src/llm/multi_lora_manager.cpp, src/llm/adapter_load_balancer.cpp, src/llm/lora_router.cpp |
| Streaming and response shaping | src/llm/streaming_handler.cpp, src/llm/openai_compat_adapter.cpp |
| Policy and safety controls | src/llm/prompt_policy.cpp, src/llm/llm_security_utils.cpp, src/llm/production_validator.cpp |
| Caching and resource controls | src/llm/llm_response_cache.cpp, src/llm/kv_cache_buffer.cpp, src/llm/token_quota_manager.cpp |
- A request enters the engine's submit path.
- Policy and guard checks run before the backend inference call.
- The router and scheduler choose the model and worker execution path.
- The plugin or backend executes inference under cache, adapter, and resource controls.
- The result is emitted either as a full response or as stream callback frames.
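
The flow above can be sketched roughly as follows. This is a minimal illustration, not the module's actual API: `Request`, `Result`, `StreamCallback`, `check_policy`, `route_model`, `run_backend`, and `submit` are all hypothetical names, and the policy limits, routing rule, and model names are placeholder assumptions chosen only to make the ordering of the steps visible.

```cpp
#include <functional>
#include <optional>
#include <string>
#include <vector>

// All types and functions here are hypothetical, for illustration only.
struct Request {
    std::string prompt;
    bool stream = false;  // true: emit frames through the callback as well
};

struct Result {
    bool ok = false;
    std::string text;  // full response, or the rejection reason when !ok
};

using StreamCallback = std::function<void(const std::string& frame)>;

// Policy/guard check. Returns a rejection reason, or nullopt to allow.
// The limits are assumed placeholder values.
std::optional<std::string> check_policy(const Request& req) {
    if (req.prompt.empty()) return "empty prompt";
    if (req.prompt.size() > 4096) return "prompt too long";
    return std::nullopt;
}

// Routing: picks a model name with a placeholder length-based rule.
std::string route_model(const Request& req) {
    return req.prompt.size() > 64 ? "large-model" : "small-model";
}

// Backend execution: a stub that returns a fixed sequence of tokens.
std::vector<std::string> run_backend(const std::string& model,
                                     const Request& req) {
    return {"[" + model + "]", " echo:", " " + req.prompt};
}

// Submit path: guard first, then route, execute, and emit.
Result submit(const Request& req, StreamCallback on_frame = {}) {
    Result out;
    if (auto reason = check_policy(req)) {
        out.text = *reason;  // rejected before any backend call
        return out;
    }
    const std::string model = route_model(req);
    const std::vector<std::string> tokens = run_backend(model, req);

    out.ok = true;
    for (const auto& tok : tokens) {
        if (req.stream && on_frame) on_frame(tok);  // stream callback frame
        out.text += tok;                            // accumulated full response
    }
    return out;
}
```

In this sketch the guard runs before routing, so a rejected request never reaches the backend stub. A streaming caller receives each token through the callback and still gets the accumulated text in the returned `Result`.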
| Direction | Integration |
|---|---|
| Used by | API handlers, orchestration layers, AI runtime features |
| Uses | LLM backends/plugins, optional acceleration stacks, module-local safety controls |
| Exposes | inference APIs, routing hooks, adapter lifecycle hooks, streaming callbacks |
- Shared worker and scheduler components coordinate concurrent inference jobs.
- Engine and manager components coordinate lifecycle state for models/adapters/plugins.
- Streaming callbacks and cancellation paths are coordinated with request lifecycle state.
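
One way to picture the coordination between a shared worker pool, streaming callbacks, and cancellation is the sketch below. It is an assumption-laden illustration rather than the module's real classes: `RequestState`, `WorkerPool`, and `stream_job` are hypothetical names, and cancellation is modeled as a single atomic flag that the streaming loop checks at each frame boundary.

```cpp
#include <atomic>
#include <condition_variable>
#include <functional>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical per-request lifecycle state shared by caller and worker.
struct RequestState {
    std::atomic<bool> cancelled{false};
    std::atomic<int> frames_emitted{0};
};

// Hypothetical fixed-size pool; workers pull jobs from one shared queue.
class WorkerPool {
public:
    explicit WorkerPool(unsigned n) {
        for (unsigned i = 0; i < n; ++i)
            workers_.emplace_back([this] { run(); });
    }
    // Drains queued jobs, then joins all workers.
    ~WorkerPool() {
        {
            std::lock_guard<std::mutex> lock(mu_);
            stopping_ = true;
        }
        cv_.notify_all();
        for (auto& t : workers_) t.join();
    }
    void submit(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lock(mu_);
            jobs_.push(std::move(job));
        }
        cv_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mu_);
                cv_.wait(lock, [this] { return stopping_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // stopping and queue drained
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();  // run outside the lock so other workers can proceed
        }
    }
    std::mutex mu_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> jobs_;
    std::vector<std::thread> workers_;
    bool stopping_ = false;
};

// Emits up to max_frames frames, checking the cancel flag before each one.
void stream_job(const std::shared_ptr<RequestState>& state, int max_frames,
                const std::function<void(int)>& on_frame) {
    for (int i = 0; i < max_frames; ++i) {
        if (state->cancelled.load()) return;  // stop at a frame boundary
        on_frame(i);
        state->frames_emitted.fetch_add(1);
    }
}
```

A caller would hold the `std::shared_ptr<RequestState>`, submit a lambda that calls `stream_job`, and set `cancelled` when the client disconnects. Because the flag is only checked between frames, a frame already in flight still completes; whether that granularity is acceptable depends on the backend.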
- Some distributed and federated execution paths depend on deployment wiring and are not default-on.
- Runtime behavior can vary across backend and acceleration configurations.
- Cross-node and topology-sensitive benchmark coverage remains an ongoing hardening area.
- Verified files:
  - src/llm/async_inference_engine.cpp
  - src/llm/inference_engine_enhanced.cpp
  - src/llm/shared_worker_pool.cpp
  - src/llm/continuous_batch_scheduler.cpp
  - src/llm/model_router.cpp
  - src/llm/llm_plugin_manager.cpp
  - src/llm/multi_lora_manager.cpp
  - src/llm/streaming_handler.cpp
  - src/llm/openai_compat_adapter.cpp
  - src/llm/prompt_policy.cpp
  - src/llm/llm_security_utils.cpp
  - src/llm/token_quota_manager.cpp
- Verified interfaces and behavior:
  - request submit/schedule/execute flow
  - routing and lifecycle integration points
  - streaming and policy guard control surfaces
- Note:
  - Wave B tracking issue: https://github.com/makr-code/ThemisDB/issues/5039
  - Dependent Wave A issue: https://github.com/makr-code/ThemisDB/issues/5038
  - Follow-on Wave C issue: https://github.com/makr-code/ThemisDB/issues/5040