🦙 llama-server-podman

Minimal, CPU-only llama.cpp inference server in a container. Podman-first. Rootless. Daemonless.

Run any GGUF model with an OpenAI-compatible API. No GPU required. No daemon. No root.

podman run -v ./models:/models:Z -p 8080:8080 \
  ghcr.io/karthik-sivadas/llama-server-podman:latest \
  --model /models/your-model.gguf

Works with Docker too — just swap `podman` for `docker`.

Why this exists

The official llama.cpp repo only publishes GPU-accelerated container images (CUDA, Vulkan, ROCm). If you're running on a VPS, Raspberry Pi, or any CPU-only machine, you're left building from source.

This image solves that:

  • ~95MB compressed (Ubuntu 24.04 minimal + statically-linked binary)
  • Zero GPU dependencies — runs anywhere
  • Rootless by default — runs as non-root llama user
  • Daemonless — Podman needs no background daemon, unlike Docker
  • Weekly automated builds from latest llama.cpp master
  • OpenAI-compatible API at /v1/chat/completions
  • GBNF grammar support — force structured output at the token level

Quick Start

1. Get a model

Download any GGUF model from HuggingFace:

| Model | Size | RAM needed | Good for |
|---|---|---|---|
| Qwen3-0.6B | 500MB | ~1GB | Edge/IoT, simple tasks |
| Qwen3-1.7B | 1.4GB | ~2GB | Chatbots, classification |
| Qwen3-4B | 2.5GB | ~4GB | General purpose, tool use |
| Gemma-3-4B | 2.3GB | ~4GB | Instruction following |
| Llama-3.1-8B | 4.9GB | ~7GB | High quality, reasoning |

mkdir -p models
curl -L -o models/Qwen3-4B-Q4_K_M.gguf \
  "https://huggingface.co/Qwen/Qwen3-4B-GGUF/resolve/main/Qwen3-4B-Q4_K_M.gguf"

2. Run

podman run -d --name llama \
  -v ./models:/models:Z \
  -p 8080:8080 \
  ghcr.io/karthik-sivadas/llama-server-podman:latest \
  --model /models/Qwen3-4B-Q4_K_M.gguf \
  --ctx-size 32768 \
  --flash-attn on \
  --threads 4

Note: The :Z volume flag handles SELinux relabeling on Fedora/RHEL/CentOS. On other distros you can omit it.

3. Use

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain containers in one sentence."}
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'
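Because the API is OpenAI-compatible, the same call works from code. A minimal Python sketch using only the standard library (it assumes the container from step 2 is listening on localhost:8080; the helper names here are illustrative, not part of the image):

```python
# Minimal chat client sketch; assumes llama-server is up on localhost:8080.
import json
import urllib.request

def build_payload(user_msg, system_msg="You are a helpful assistant.",
                  temperature=0.7, max_tokens=100):
    """Assemble an OpenAI-style chat completion request body."""
    return {
        "messages": [
            {"role": "system", "content": system_msg},
            {"role": "user", "content": user_msg},
        ],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

def chat(user_msg, url="http://localhost:8080/v1/chat/completions"):
    """POST a chat request and return the assistant's reply text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(build_payload(user_msg)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # llama-server answers in the standard OpenAI response shape.
    return body["choices"][0]["message"]["content"]

# usage (with the server running):
#   print(chat("Explain containers in one sentence."))
```

Any OpenAI SDK works the same way: point its base URL at `http://localhost:8080/v1` and use a placeholder API key.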

4. Manage

podman logs llama          # View logs
podman stats llama         # Monitor CPU/RAM
podman stop llama          # Stop
podman start llama         # Start again
podman rm llama            # Remove

Why Podman?

| | Podman | Docker |
|---|---|---|
| Daemon | None — each container is a process | Requires `dockerd` daemon |
| Root | Rootless by default | Rootless optional, root by default |
| Systemd | Native `podman generate systemd` | Needs extra config |
| SELinux | First-class `:Z` / `:z` support | Supported but less common |
| OCI compliant | Yes | Yes |
| CLI compatible | `alias docker=podman` works | — |
| Socket activation | Yes (systemd) | Via socket file |
| Fork/exec | Direct fork — no daemon middleman | Everything routes through daemon |

For an LLM server that should just start, run, and not die — Podman's daemonless architecture means one less thing that can crash.

API Reference

Implements the OpenAI Chat Completions API — drop-in compatible with any OpenAI SDK:

| Endpoint | Method | Description |
|---|---|---|
| `/v1/chat/completions` | POST | Chat completion (OpenAI-compatible) |
| `/v1/completions` | POST | Text completion |
| `/health` | GET | Health check (`{"status":"ok"}`) |
| `/metrics` | GET | Prometheus metrics (with `--metrics`) |
| `/slots` | GET | Active inference slot info |
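Large models can take a while to load into RAM, so it helps to gate traffic on `/health`. A small polling sketch (the endpoint and `{"status":"ok"}` body are per the table above; the probe is injected as a parameter so the loop itself is easy to test):

```python
# Readiness probe sketch: poll /health until the model has loaded.
import json
import time
import urllib.error
import urllib.request

def is_healthy(url="http://localhost:8080/health"):
    """True once llama-server reports {"status":"ok"}."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return json.load(resp).get("status") == "ok"
    except (urllib.error.URLError, OSError):
        return False

def wait_until_ready(probe=is_healthy, timeout=120.0, interval=2.0):
    """Poll `probe` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False
```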

Request body

{
  "messages": [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "..."}
  ],
  "temperature": 0.7,
  "top_p": 0.9,
  "max_tokens": 256,
  "stream": false,
  "grammar": "..."
}
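With `"stream": true` the server answers with Server-Sent Events in the OpenAI streaming format: one `data: {...}` chunk per delta, terminated by `data: [DONE]`. A client-side parsing sketch, assuming that chunk shape:

```python
# Sketch: extracting text from a streamed (SSE) chat completion response.
import json

def iter_deltas(sse_lines):
    """Yield content fragments from an iterable of decoded SSE lines."""
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines between events
        payload = line[len("data: "):]
        if payload == "[DONE]":
            return  # OpenAI-style end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]
```

Feed it the decoded lines of the HTTP response body and join the yielded fragments to reassemble the reply.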

Structured Output (GBNF Grammar)

The killer feature of llama.cpp: GBNF grammars enforce output structure at the token level. The model literally cannot produce output that doesn't match your schema.

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Rate this movie: Inception"}],
    "max_tokens": 100,
    "grammar": "root ::= \"{\" ws \"\\\"rating\\\":\" ws num \",\" ws \"\\\"review\\\":\" ws string ws \"}\"\nnum ::= [1-9] | \"10\"\nstring ::= \"\\\"\" [^\"]+ \"\\\"\"\nws ::= [ \\t\\n]*"
  }'

Guaranteed output:

{"rating": 9, "review": "A mind-bending masterpiece that redefines sci-fi cinema."}

No regex. No "please respond in JSON". No retries. Guaranteed at the token level.

See examples/structured-output.json for a full example and the GBNF docs for grammar syntax.
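The backslash-escaped one-liner above is hard to maintain by hand. A sketch that writes the same grammar readably and lets `json.dumps` do the escaping when building the request body:

```python
# Sketch: same grammar as the curl example, written readably.
import json

GRAMMAR = r'''
root   ::= "{" ws "\"rating\":" ws num "," ws "\"review\":" ws string ws "}"
num    ::= [1-9] | "10"
string ::= "\"" [^"]+ "\""
ws     ::= [ \t\n]*
'''.strip()

body = {
    "messages": [{"role": "user", "content": "Rate this movie: Inception"}],
    "max_tokens": 100,
    "grammar": GRAMMAR,
}
# json.dumps produces the fully escaped request body shown above.
request_json = json.dumps(body)
```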

Performance Tuning

Key flags

| Flag | Default | Description |
|---|---|---|
| `--ctx-size` | `2048` | Context window. Higher = more RAM. |
| `--threads` | auto | CPU threads for inference. Set to physical cores. |
| `--flash-attn` | `auto` | Flash Attention — faster, less memory. Set to `on`. |
| `--cache-type-k` | `f16` | KV cache key quantization. `q8_0` saves ~50% cache RAM. |
| `--cache-type-v` | `f16` | KV cache value quantization. `q4_0` saves ~75% cache RAM. |
| `--batch-size` | `2048` | Prompt processing batch size. |
| `--ubatch-size` | `512` | Micro-batch size. |
| `--cont-batching` | off | Continuous batching for concurrent requests. |
| `--metrics` | off | Expose Prometheus metrics at `/metrics`. |
| `--n-predict` | `-1` | Max tokens to generate (`-1` = unlimited). |

Memory estimates

RAM ≈ model_size + kv_cache_size

| Context | f16 (default) | q8_0 keys + q4_0 values | Savings |
|---|---|---|---|
| 4K | ~256MB | ~96MB | 62% |
| 8K | ~512MB | ~192MB | 62% |
| 32K | ~2GB | ~750MB | 62% |
| 128K | ~8GB | ~3GB | 62% |
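The savings column follows from nominal element sizes: 2 bytes for f16, ~1 for q8_0, ~0.5 for q4_0 (ignoring quantization block overhead). A sketch of the arithmetic; the layer count and KV dimension below are illustrative, and absolute sizes depend on the model's architecture (GQA models have much smaller KV dimensions):

```python
# KV cache size estimate: one key and one value vector per layer per token.
# Nominal bytes per element, ignoring q8_0/q4_0 block overhead.
BYTES = {"f16": 2.0, "q8_0": 1.0, "q4_0": 0.5}

def kv_cache_bytes(ctx, n_layers, kv_dim, k_type="f16", v_type="f16"):
    """ctx tokens x n_layers x kv_dim elements, for keys plus values."""
    per_token = n_layers * kv_dim * (BYTES[k_type] + BYTES[v_type])
    return ctx * per_token

# The relative savings of q8_0 keys + q4_0 values is shape-independent;
# 32 layers and kv_dim=1024 here are hypothetical example values.
f16 = kv_cache_bytes(32768, 32, 1024)
mixed = kv_cache_bytes(32768, 32, 1024, k_type="q8_0", v_type="q4_0")
savings = 1 - mixed / f16  # 0.625, i.e. the ~62% in the table
```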

Recommended configs

Tiny (1-2GB RAM) — Raspberry Pi, edge:

podman run -v ./models:/models:Z -p 8080:8080 \
  ghcr.io/karthik-sivadas/llama-server-podman:latest \
  --model /models/model-0.5b.gguf --ctx-size 4096 --threads 2

Small VPS (4GB RAM):

podman run -v ./models:/models:Z -p 8080:8080 \
  ghcr.io/karthik-sivadas/llama-server-podman:latest \
  --model /models/model-4b-q4.gguf --ctx-size 8192 --threads 4 \
  --flash-attn on --cache-type-k q8_0 --cache-type-v q4_0

Server (16GB+ RAM):

podman run -v ./models:/models:Z -p 8080:8080 \
  ghcr.io/karthik-sivadas/llama-server-podman:latest \
  --model /models/model-8b-q4.gguf --ctx-size 32768 --threads 8 \
  --flash-attn on --cache-type-k q8_0 --cache-type-v q4_0 \
  --cont-batching --batch-size 512 --metrics

Running as a Systemd Service

Podman integrates natively with systemd — no daemon needed:

# Create the container (don't start yet)
podman create --name llama \
  -v ./models:/models:Z -p 8080:8080 \
  ghcr.io/karthik-sivadas/llama-server-podman:latest \
  --model /models/your-model.gguf --ctx-size 32768 --flash-attn on --threads 4

# Generate a systemd unit
podman generate systemd --new --name llama > ~/.config/systemd/user/llama.service

# Enable and start
systemctl --user daemon-reload
systemctl --user enable --now llama.service

# Check status
systemctl --user status llama.service

# Survives logout (lingering)
loginctl enable-linger $USER

Now your LLM server starts on boot, restarts on crash, and needs zero root access.

Deployment Examples

Podman Compose

See examples/podman-compose.yml

podman-compose up -d

Kubernetes / K3s

See examples/kubernetes.yaml — includes init container for automatic model download.

kubectl apply -f examples/kubernetes.yaml

Podman Pod (multi-container)

Run the LLM server alongside your app in a Podman pod:

# Create a pod
podman pod create --name ai-stack -p 8080:8080 -p 3000:3000

# Add llama-server
podman run -d --pod ai-stack --name llama \
  -v ./models:/models:Z \
  ghcr.io/karthik-sivadas/llama-server-podman:latest \
  --model /models/Qwen3-4B-Q4_K_M.gguf --ctx-size 32768

# Add your app (can reach llama at localhost:8080 inside the pod)
podman run -d --pod ai-stack --name my-app my-app-image

Building from Source

# Latest llama.cpp
podman build -t llama-server .

# Pin to a specific release
podman build --build-arg LLAMA_CPP_VERSION=b4726 -t llama-server .

Architecture

┌────────────────────────────────────────────────────┐
│           Podman Container (rootless)              │
│                                                    │
│  ┌──────────────┐    ┌──────────────────────────┐  │
│  │ llama-server │    │ /models/model.gguf       │  │
│  │ (static bin) │◄───│ (mounted volume)         │  │
│  │  :8080       │    └──────────────────────────┘  │
│  └──────┬───────┘                                  │
│         │ OpenAI-compatible API                    │
│    user: llama (non-root)                          │
└─────────┼──────────────────────────────────────────┘
          │
    ┌─────▼─────┐
    │  Clients  │  curl, Python, JS, any OpenAI SDK
    └───────────┘
  1. Build: Clones llama.cpp, compiles llama-server with BUILD_SHARED_LIBS=OFF (static linking)
  2. Runtime: Ubuntu 24.04 minimal + binary + libgomp (OpenMP threading)
  3. Model: Mounted via volume — no model data baked into the image
  4. User: Runs as llama (non-root), works with Podman's rootless mode

Security

  • Non-root — runs as llama user inside the container
  • Rootless Podman — no root on the host either
  • No model data baked into the image
  • No network tools installed (no curl, wget, nc)
  • Static binary — minimal attack surface, no shared library hijacking
  • Weekly Trivy scans via GitHub Actions
  • Base image: ubuntu:24.04 LTS (regularly patched)
  • No daemon — nothing listening on a socket that could be exploited

Compared to Ollama

| | This image | Ollama |
|---|---|---|
| Container runtime | Podman-first (Docker compatible) | Docker-focused |
| Image size | ~95MB | ~500MB+ |
| Runtime overhead | ~5MB (static C++ binary) | ~200MB (Go runtime + daemon) |
| Root required | No (rootless Podman) | No (but daemon pattern) |
| KV cache quantization | ✅ Full control | ❌ Not exposed |
| GBNF grammars | ✅ Per-request structured output | ❌ Limited JSON mode |
| Systemd integration | ✅ Native via `podman generate systemd` | Manual |
| Model management | BYO (mount volume) | Built-in pull/push |
| Multi-model | One model per container | Auto load/unload |
| GPU support | ❌ CPU only | ✅ CUDA, ROCm, Metal |

Use this if: You want maximum control, minimal overhead, structured output, rootless security, and Podman-native workflow.

Use Ollama if: You want simplicity and don't need GBNF grammars or fine-grained tuning.

License

MIT — see LICENSE.

llama.cpp is also MIT licensed. Model weights have their own licenses — check the model card on HuggingFace.


Built by Karthik Sivadas · Running in production on a single VPS powering karthik.page's AI terminal
