# llama-buddy

A friendly CLI wrapper for llama.cpp. Manage, download, and serve local LLMs with a single command. Think of it as an ollama-like experience built on top of `llama-server`.

## Features
- Background server — start/stop/restart `llama-server` as a daemon
- Multi-model routing — preset-based configuration with automatic model load/unload
- Interactive downloads — search HuggingFace, pick a quant, download with progress and resume
- Rich terminal UI — tables, panels, interactive selectors, and live search
- GGUF inspector — view model metadata, architecture, and sampling parameters
- Server props — inspect active sampling parameters on loaded models
- Sampling sync — automatically applies GGUF-recommended sampling params to your preset
- Per-model settings — context size, GPU layers, flash attention, and more
- Idle model unloading — background watchdog automatically unloads models after configurable idle timeout
- VRAM tracking — automatically parses server logs to show memory usage per model
- Auto-sync — preset file stays in sync with the llama.cpp cache automatically
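The VRAM-tracking feature above can be sketched in a few lines. This is not llama-buddy's actual parser, just an illustration of the idea: it assumes log lines shaped like llama.cpp's `... buffer size = 4096.00 MiB` messages and sums them per model.

```python
import re

# Hypothetical sketch of parsing VRAM usage out of a llama-server log.
# The real log format (and llama-buddy's real parser) may differ.
BUFFER_RE = re.compile(r"buffer size\s*=\s*([\d.]+)\s*MiB")

def total_vram_mib(log_text: str) -> float:
    """Sum every reported buffer size in a server log, in MiB."""
    return sum(float(m.group(1)) for m in BUFFER_RE.finditer(log_text))

sample = (
    "llm_load_tensors: CUDA0 model buffer size = 4096.00 MiB\n"
    "llama_kv_cache_init: CUDA0 KV buffer size = 512.00 MiB\n"
)
print(total_vram_mib(sample))  # 4608.0
```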
## Screenshots

- Model listing — `llb models`
- Interactive download — `llb download`
- Model info — `llb info`
## Installation

```shell
pipx install llama-buddy
```

Or with uv:

```shell
uv tool install llama-buddy
```

This installs the `llb` command into an isolated environment and adds it to your `PATH`.
### Requirements

- Python 3.10+
- llama.cpp installed and `llama-server` on your `PATH`
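A quick way to check the requirements are met is to look for the llama.cpp binaries on your `PATH`. This is a standalone sanity check, not a llama-buddy command:

```python
import shutil

def missing_tools(tools=("llama-server", "llama-cli")) -> list[str]:
    """Return the required executables that are not found on PATH."""
    return [t for t in tools if shutil.which(t) is None]

if __name__ == "__main__":
    gone = missing_tools()
    print("All set!" if not gone else f"Missing from PATH: {', '.join(gone)}")
```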
## Quick start

```shell
# Download a model (interactive search)
llb download

# Or specify directly
llb download mistralai/Ministral-3-3B-Instruct-2512-GGUF:Q4_K_M

# Start the server
llb start

# List all models
llb models

# Chat with a model (uses llama-cli)
llb chat

# Inspect model metadata
llb info

# Show active sampling params for a loaded model
llb props

# Apply GGUF-recommended sampling params to all models
llb info --apply-sampling

# Configure settings (interactive TUI)
llb settings

# Open the web UI in your browser
llb open

# Stop the server
llb stop
```

## Commands

| Command | Description |
|---|---|
| `llb start` | Start llama-server in the background. Extra args are forwarded. |
| `llb stop` | Stop the running server. |
| `llb restart` | Restart the server. |
| `llb status` | Show whether the server is running. |
| `llb models` | List all models with status, size, VRAM usage, and grouping. Supports `--sort size`. |
| `llb download [model]` | Download a model. Interactive HF search when no model is given. |
| `llb remove [model]` | Remove a model with a confirmation dialog. `--keep-files` to preserve GGUFs. |
| `llb info [model]` | Show GGUF metadata. Interactive selector when no model is given. |
| `llb info --apply-sampling [model]` | Write GGUF sampling params into the preset. All models when no model is given. |
| `llb props [model]` | Show active server sampling params for a loaded model. |
| `llb settings` | Interactive editor for global and per-model settings. |
| `llb chat [model]` | Interactive chat via llama-cli. Model selector when no model is given. |
| `llb open` | Open the llama-server web UI in your browser. |
| `llb logs` | Tail the server log file. |
## Configuration

Config files live in `~/.config/llama/`:
| File | Purpose |
|---|---|
| `models.ini` | Model preset file — sections are HF repo IDs, auto-synced with cache |
| `settings.json` | Global server settings (port, context size, GPU layers, etc.) |
| `vram.json` | Cached per-model VRAM usage (parsed from server logs) |
| `server.pid` | PID of the running server |
| `server.log` | Server stdout/stderr |
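The `server.pid` file enables the classic pid-file liveness check that a `llb status`-style command can perform. Here is an illustrative sketch of that pattern; llama-buddy's actual implementation may differ:

```python
import os
from pathlib import Path

def server_running(pid_file: Path) -> bool:
    """True if the pid file exists and names a live process."""
    try:
        pid = int(pid_file.read_text().strip())
    except (FileNotFoundError, ValueError):
        return False  # no pid file, or garbage inside it
    try:
        os.kill(pid, 0)  # signal 0 probes existence without sending anything
        return True
    except ProcessLookupError:
        return False  # stale pid file: process is gone
    except PermissionError:
        return True  # process exists but belongs to another user
```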
## Per-model settings

Run `llb settings` and select **Model Settings** to configure per-model overrides:
- Context size, GPU layers, flash attention
- Custom aliases
- Any `llama-server` parameter
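For illustration, a per-model override could look like the following `models.ini` fragment. The section name is the HF repo ID (see the configuration table); the key names below mirror `llama-server` flags but are hypothetical here, shown only to convey the shape of the file. Use `llb settings` to edit real values:

```ini
; hypothetical example, actual key names may differ
[mistralai/Ministral-3-3B-Instruct-2512-GGUF]
ctx-size = 8192
n-gpu-layers = 99
flash-attn = true
alias = ministral
```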
## Development

```shell
# Clone and install
git clone https://github.com/thilomichael/llama-buddy.git
cd llama-buddy
uv sync

# Run
uv run llb <command>

# Test
uv run pytest

# Lint
uv run ruff check src/ tests/
```

## License

MIT