2.0.3

@SearchSavior SearchSavior released this 06 Feb 03:25
· 97 commits to main since this release
0b92514

What's Changed

  • Added Docker build system.
  • Added inference cancellation for LLM/VLM. Cancel a streaming request mid-flight by closing the client connection; the KV cache is reset to its state before that request. Previously, canceling mid-flight left inference running in a background thread until it completed organically, which in practice burns GPU cycles. Not anymore! Note: cancellation does not work for non-streaming requests (stream=False). In the stream=False case, generation continues until you reach max_tokens, hit an OOM, or inference completes on its own.
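A minimal client-side sketch of the cancel-by-disconnect pattern described above. It assumes an OpenAI-compatible `/v1/chat/completions` streaming endpoint; the base URL, model name, and chunk-count cutoff are all illustrative, not part of this release.

```python
import requests  # pip install requests

def stream_and_cancel(base_url, prompt, max_chunks=5):
    """Stream a chat completion, then cancel early by closing the
    connection after max_chunks chunks. Payload shape assumes an
    OpenAI-compatible server; adjust for your deployment."""
    payload = {
        "model": "my-model",  # hypothetical model name
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,       # cancel only works for streaming requests
        "max_tokens": 1024,
    }
    chunks = []
    with requests.post(f"{base_url}/v1/chat/completions",
                       json=payload, stream=True) as resp:
        for i, line in enumerate(resp.iter_lines()):
            if line:
                chunks.append(line)
            if i + 1 >= max_chunks:
                break
        # Leaving the `with` block closes the connection. Per the notes
        # above, the server detects the disconnect, stops generation,
        # and resets the KV cache to its state before this request.
    return chunks
```

With `stream=False` there is no open connection to drop mid-generation, which is why cancellation does not apply in that case.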