What's Changed
Added Docker build system.
Added inference cancellation for LLM/VLM. You can now cancel streaming mid-flight by closing the client connection; the KV cache is rolled back to its state before that request. Previously, cancelling mid-flight left inference running in a background thread until it completed on its own, burning GPU cycles in the meantime. Not anymore! Note: cancellation does not work for non-streaming requests (stream=False). In the stream=False case, generation continues until max_tokens is reached, the process runs out of memory, or inference completes on its own.
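A minimal sketch of cancelling mid-stream from a client, assuming an OpenAI-style streaming chat endpoint at `/v1/chat/completions` (the host, port, route, and model name below are placeholders, not confirmed by this release; adapt them to your server):

```python
import http.client
import json

def cancel_midstream(host="localhost", port=8080, max_chunks=5):
    """Stream a completion and abort early by closing the connection.

    The endpoint and payload are assumptions (OpenAI-style API).
    Closing the socket mid-stream is what triggers the server-side
    cancel described in the release notes.
    """
    conn = http.client.HTTPConnection(host, port)
    body = json.dumps({
        "model": "default",  # hypothetical model name
        "stream": True,      # cancellation only applies to streaming
        "messages": [{"role": "user", "content": "Write a long story."}],
    })
    conn.request("POST", "/v1/chat/completions", body,
                 {"Content-Type": "application/json"})
    resp = conn.getresponse()
    # Consume a few SSE chunks, then stop.
    for _ in range(max_chunks):
        chunk = resp.readline()
        if not chunk:
            break
        print(chunk.decode(errors="replace").rstrip())
    # Closing the client connection cancels the in-flight inference.
    conn.close()

if __name__ == "__main__":
    cancel_midstream()
```

With this change the server stops generating as soon as the connection drops, rather than finishing the request in the background.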