Minimal, CPU-only llama.cpp inference server in a container. Podman-first. Rootless. Daemonless.
Run any GGUF model with an OpenAI-compatible API. No GPU required. No daemon. No root.
```shell
podman run -v ./models:/models:Z -p 8080:8080 \
  ghcr.io/karthik-sivadas/llama-server-podman:latest \
  --model /models/your-model.gguf
```

Works with Docker too — just swap `podman` for `docker`.
The official llama.cpp repo only publishes GPU-accelerated container images (CUDA, Vulkan, ROCm). If you're running on a VPS, Raspberry Pi, or any CPU-only machine, you're left building from source.
This image solves that:
- ~95MB compressed (Ubuntu 24.04 minimal + statically-linked binary)
- Zero GPU dependencies — runs anywhere
- Rootless by default — runs as the non-root `llama` user
- Daemonless — Podman needs no background daemon, unlike Docker
- Weekly automated builds from latest llama.cpp `master`
- OpenAI-compatible API at `/v1/chat/completions`
- GBNF grammar support — force structured output at the token level
Download any GGUF model from HuggingFace:
| Model | Size | RAM needed | Good for |
|---|---|---|---|
| Qwen3-0.6B | 500MB | ~1GB | Edge/IoT, simple tasks |
| Qwen3-1.7B | 1.4GB | ~2GB | Chatbots, classification |
| Qwen3-4B | 2.5GB | ~4GB | General purpose, tool use |
| Gemma-3-4B | 2.3GB | ~4GB | Instruction following |
| Llama-3.1-8B | 4.9GB | ~7GB | High quality, reasoning |
```shell
mkdir -p models
curl -L -o models/Qwen3-4B-Q4_K_M.gguf \
  "https://huggingface.co/Qwen/Qwen3-4B-GGUF/resolve/main/Qwen3-4B-Q4_K_M.gguf"
```

```shell
podman run -d --name llama \
  -v ./models:/models:Z \
  -p 8080:8080 \
  ghcr.io/karthik-sivadas/llama-server-podman:latest \
  --model /models/Qwen3-4B-Q4_K_M.gguf \
  --ctx-size 32768 \
  --flash-attn on \
  --threads 4
```

Note: The `:Z` volume flag handles SELinux relabeling on Fedora/RHEL/CentOS. On other distros you can omit it.
```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain containers in one sentence."}
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'
```

```shell
podman logs llama    # View logs
podman stats llama   # Monitor CPU/RAM
podman stop llama    # Stop
podman start llama   # Start again
podman rm llama      # Remove
```

| | Podman | Docker |
|---|---|---|
| Daemon | None — each container is a process | Requires the `dockerd` daemon |
| Root | Rootless by default | Rootless optional, root by default |
| Systemd | Native `podman generate systemd` | Needs extra config |
| SELinux | First-class `:Z` / `:z` support | Supported but less common |
| OCI compliant | Yes | Yes |
| CLI compatible | `alias docker=podman` works | — |
| Socket activation | Yes (systemd) | Via socket file |
| Fork/exec | Direct fork — no daemon middleman | Everything routes through the daemon |
For an LLM server that should just start, run, and not die — Podman's daemonless architecture means one less thing that can crash.
Implements the OpenAI Chat Completions API — drop-in compatible with any OpenAI SDK:
| Endpoint | Method | Description |
|---|---|---|
| `/v1/chat/completions` | POST | Chat completion (OpenAI-compatible) |
| `/v1/completions` | POST | Text completion |
| `/health` | GET | Health check (`{"status":"ok"}`) |
| `/metrics` | GET | Prometheus metrics (with `--metrics`) |
| `/slots` | GET | Active inference slot info |
```json
{
  "messages": [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "..."}
  ],
  "temperature": 0.7,
  "top_p": 0.9,
  "max_tokens": 256,
  "stream": false,
  "grammar": "..."
}
```

The killer feature of llama.cpp: GBNF grammars enforce output structure at the token level. The model literally cannot produce output that doesn't match your schema.
```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Rate this movie: Inception"}],
    "max_tokens": 100,
    "grammar": "root ::= \"{\" ws \"\\\"rating\\\":\" ws num \",\" ws \"\\\"review\\\":\" ws string ws \"}\"\nnum ::= [1-9] | \"10\"\nstring ::= \"\\\"\" [^\"]+ \"\\\"\"\nws ::= [ \\t\\n]*"
  }'
```

Guaranteed output:

```json
{"rating": 9, "review": "A mind-bending masterpiece that redefines sci-fi cinema."}
```

No regex. No "please respond in JSON". No retries. Guaranteed at the token level.
See `examples/structured-output.json` for a full example and the GBNF docs for grammar syntax.
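Hand-escaping a grammar inside a JSON string (as in the curl example above) is fiddly. A small sketch of building the same request from Python, letting `json.dumps` do the escaping:

```python
import json

# GBNF grammar written as a plain (raw) Python string -- json.dumps handles
# the escaping that makes hand-written curl payloads error-prone.
GRAMMAR = r'''
root   ::= "{" ws "\"rating\":" ws num "," ws "\"review\":" ws string ws "}"
num    ::= [1-9] | "10"
string ::= "\"" [^"]+ "\""
ws     ::= [ \t\n]*
'''.strip()

payload = json.dumps({
    "messages": [{"role": "user", "content": "Rate this movie: Inception"}],
    "max_tokens": 100,
    "grammar": GRAMMAR,
})
# `payload` is ready to POST to /v1/chat/completions
```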
| Flag | Default | Description |
|---|---|---|
| `--ctx-size` | 2048 | Context window. Higher = more RAM. |
| `--threads` | auto | CPU threads for inference. Set to physical cores. |
| `--flash-attn on` | auto | Flash Attention — faster, less memory. |
| `--cache-type-k q8_0` | f16 | KV cache key quantization. Saves ~50% cache RAM. |
| `--cache-type-v q4_0` | f16 | KV cache value quantization. Saves ~75% cache RAM. |
| `--batch-size` | 2048 | Prompt processing batch size. |
| `--ubatch-size` | 512 | Micro-batch size. |
| `--cont-batching` | off | Continuous batching for concurrent requests. |
| `--metrics` | off | Expose Prometheus metrics at `/metrics`. |
| `--n-predict` | -1 | Max tokens to generate (-1 = unlimited). |
`RAM ≈ model_size + kv_cache_size`
| Context | f16 (default) | q8_0 keys + q4_0 values | Savings |
|---|---|---|---|
| 4K | ~256MB | ~96MB | 62% |
| 8K | ~512MB | ~192MB | 62% |
| 32K | ~2GB | ~750MB | 62% |
| 128K | ~8GB | ~3GB | 62% |
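The table above can be reproduced with a back-of-the-envelope estimate: one K and one V tensor per layer per token, sized by KV head count and head dimension. This is a sketch, not llama.cpp's actual allocator; the layer and head counts below are illustrative for a 4B-class model:

```python
# Approximate bytes per element for common KV cache types (block overhead included):
# f16 stores raw halves; q8_0 packs 32 values into 34 bytes; q4_0 into 18.
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}

def kv_cache_bytes(ctx, n_layers, n_kv_heads, head_dim, k_type="f16", v_type="f16"):
    """Estimate KV cache size: one K and one V tensor per layer per token."""
    per_token = n_layers * n_kv_heads * head_dim
    return ctx * per_token * (BYTES_PER_ELEM[k_type] + BYTES_PER_ELEM[v_type])

# Hypothetical 4B-class model: 36 layers, 8 KV heads, head_dim 128, 8K context.
f16   = kv_cache_bytes(8192, 36, 8, 128)
mixed = kv_cache_bytes(8192, 36, 8, 128, "q8_0", "q4_0")
print(f"f16: {f16/2**20:.0f} MiB, q8_0+q4_0: {mixed/2**20:.0f} MiB")
# prints "f16: 1152 MiB, q8_0+q4_0: 468 MiB" -- roughly the ~60% savings above
```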
**Tiny (1-2GB RAM)** — Raspberry Pi, edge:

```shell
podman run -v ./models:/models:Z -p 8080:8080 \
  ghcr.io/karthik-sivadas/llama-server-podman:latest \
  --model /models/model-0.5b.gguf --ctx-size 4096 --threads 2
```

**Small VPS (4GB RAM):**

```shell
podman run -v ./models:/models:Z -p 8080:8080 \
  ghcr.io/karthik-sivadas/llama-server-podman:latest \
  --model /models/model-4b-q4.gguf --ctx-size 8192 --threads 4 \
  --flash-attn on --cache-type-k q8_0 --cache-type-v q4_0
```

**Server (16GB+ RAM):**

```shell
podman run -v ./models:/models:Z -p 8080:8080 \
  ghcr.io/karthik-sivadas/llama-server-podman:latest \
  --model /models/model-8b-q4.gguf --ctx-size 32768 --threads 8 \
  --flash-attn on --cache-type-k q8_0 --cache-type-v q4_0 \
  --cont-batching --batch-size 512 --metrics
```

Podman integrates natively with systemd — no daemon needed:
```shell
# Create the container (don't start yet)
podman create --name llama \
  -v ./models:/models:Z -p 8080:8080 \
  ghcr.io/karthik-sivadas/llama-server-podman:latest \
  --model /models/your-model.gguf --ctx-size 32768 --flash-attn on --threads 4

# Generate a systemd unit
podman generate systemd --new --name llama > ~/.config/systemd/user/llama.service

# Enable and start
systemctl --user daemon-reload
systemctl --user enable --now llama.service

# Check status
systemctl --user status llama.service

# Survives logout (lingering)
loginctl enable-linger $USER
```

Now your LLM server starts on boot, restarts on crash, and needs zero root access.
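On Podman 4.4+ there is also Quadlet, which newer Podman releases recommend over `podman generate systemd`: drop a `.container` unit into `~/.config/containers/systemd/` and systemd generates the service automatically. A sketch, with paths and options assumed from the run examples above:

```ini
# ~/.config/containers/systemd/llama.container
[Unit]
Description=llama.cpp inference server

[Container]
Image=ghcr.io/karthik-sivadas/llama-server-podman:latest
Volume=%h/models:/models:Z
PublishPort=8080:8080
Exec=--model /models/your-model.gguf --ctx-size 32768 --threads 4

[Service]
Restart=always

[Install]
WantedBy=default.target
```

Then `systemctl --user daemon-reload && systemctl --user start llama.service`.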
See `examples/podman-compose.yml`:

```shell
podman-compose up -d
```

See `examples/kubernetes.yaml` — includes an init container for automatic model download:

```shell
kubectl apply -f examples/kubernetes.yaml
```

Run the LLM server alongside your app in a Podman pod:
```shell
# Create a pod
podman pod create --name ai-stack -p 8080:8080 -p 3000:3000

# Add llama-server
podman run -d --pod ai-stack --name llama \
  -v ./models:/models:Z \
  ghcr.io/karthik-sivadas/llama-server-podman:latest \
  --model /models/Qwen3-4B-Q4_K_M.gguf --ctx-size 32768

# Add your app (can reach llama at localhost:8080 inside the pod)
podman run -d --pod ai-stack --name my-app my-app-image
```

```shell
# Latest llama.cpp
podman build -t llama-server .

# Pin to a specific release
podman build --build-arg LLAMA_CPP_VERSION=b4726 -t llama-server .
```

```
┌───────────────────────────────────────────────────┐
│  Podman Container (rootless)                      │
│                                                   │
│  ┌──────────────┐    ┌─────────────────────────┐  │
│  │ llama-server │    │  /models/model.gguf     │  │
│  │ (static bin) │◄───│  (mounted volume)       │  │
│  │    :8080     │    └─────────────────────────┘  │
│  └──────┬───────┘                                 │
│         │ OpenAI-compatible API                   │
│         │ user: llama (non-root)                  │
└─────────┼─────────────────────────────────────────┘
          │
    ┌─────▼─────┐
    │  Clients  │  curl, Python, JS, any OpenAI SDK
    └───────────┘
```
- Build: Clones llama.cpp, compiles `llama-server` with `BUILD_SHARED_LIBS=OFF` (static linking)
- Runtime: Ubuntu 24.04 minimal + binary + `libgomp` (OpenMP threading)
- Model: Mounted via volume — no model data baked into the image
- User: Runs as `llama` (non-root), works with Podman's rootless mode
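The steps above could look roughly like the following two-stage Containerfile. This is a hypothetical sketch, not the repository's actual build file; the clone URL, package list, and binary path are assumptions:

```dockerfile
# Build stage: clone and statically link llama-server (assumed layout)
FROM ubuntu:24.04 AS build
RUN apt-get update && apt-get install -y git cmake g++ libcurl4-openssl-dev
RUN git clone https://github.com/ggml-org/llama.cpp /src
RUN cmake -S /src -B /src/build -DBUILD_SHARED_LIBS=OFF \
 && cmake --build /src/build --target llama-server -j"$(nproc)"

# Runtime stage: minimal Ubuntu + libgomp + non-root user, no network tools
FROM ubuntu:24.04
RUN apt-get update && apt-get install -y --no-install-recommends libgomp1 \
 && rm -rf /var/lib/apt/lists/* \
 && useradd --system --create-home llama
COPY --from=build /src/build/bin/llama-server /usr/local/bin/
USER llama
EXPOSE 8080
ENTRYPOINT ["llama-server", "--host", "0.0.0.0", "--port", "8080"]
```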
- Non-root — runs as the `llama` user inside the container
- Rootless Podman — no root on the host either
- No model data baked into the image
- No network tools installed (no curl, wget, nc)
- Static binary — minimal attack surface, no shared library hijacking
- Weekly Trivy scans via GitHub Actions
- Base image: `ubuntu:24.04` LTS (regularly patched)
- No daemon — nothing listening on a socket that could be exploited
| | This image | Ollama |
|---|---|---|
| Container runtime | Podman-first (Docker compatible) | Docker-focused |
| Image size | ~95MB | ~500MB+ |
| Runtime overhead | ~5MB (static C++ binary) | ~200MB (Go runtime + daemon) |
| Root required | No (rootless Podman) | No (but daemon pattern) |
| KV cache quantization | ✅ Full control | ❌ Not exposed |
| GBNF grammars | ✅ Per-request structured output | ❌ Limited JSON mode |
| Systemd integration | ✅ Native via `podman generate systemd` | Manual |
| Model management | BYO (mount volume) | Built-in pull/push |
| Multi-model | One model per container | Auto load/unload |
| GPU support | ❌ CPU only | ✅ CUDA, ROCm, Metal |
Use this if: You want maximum control, minimal overhead, structured output, rootless security, and Podman-native workflow.
Use Ollama if: You want simplicity and don't need GBNF grammars or fine-grained tuning.
MIT — see LICENSE.
llama.cpp is also MIT licensed. Model weights have their own licenses — check the model card on HuggingFace.
Built by Karthik Sivadas · Running in production on a single VPS powering karthik.page's AI terminal