🦙 llama-server-podman

Minimal, CPU-only llama.cpp inference server in a container. Podman-first. Rootless. Daemonless.

Run any GGUF model with an OpenAI-compatible API. No GPU required. No daemon. No root.

podman run -v ./models:/models:Z -p 8080:8080 \
  ghcr.io/karthik-sivadas/llama-server-podman:latest \
  --model /models/your-model.gguf

Works with Docker too — just swap `podman` for `docker`.

Why this exists

The official llama.cpp repo only publishes GPU-accelerated container images (CUDA, Vulkan, ROCm). If you're running on a VPS, Raspberry Pi, or any CPU-only machine, you're left building from source.

This image solves that:

  • ~95MB compressed (Ubuntu 24.04 minimal + statically-linked binary)
  • Zero GPU dependencies — runs anywhere
  • Rootless by default — runs as non-root llama user
  • Daemonless — Podman needs no background daemon, unlike Docker
  • Weekly automated builds from latest llama.cpp master
  • OpenAI-compatible API at /v1/chat/completions
  • GBNF grammar support — force structured output at the token level

Quick Start

1. Get a model

Download any GGUF model from HuggingFace:

| Model | Size | RAM needed | Good for |
|---|---|---|---|
| Qwen3-0.6B | 500MB | ~1GB | Edge/IoT, simple tasks |
| Qwen3-1.7B | 1.4GB | ~2GB | Chatbots, classification |
| Qwen3-4B | 2.5GB | ~4GB | General purpose, tool use |
| Gemma-3-4B | 2.3GB | ~4GB | Instruction following |
| Llama-3.1-8B | 4.9GB | ~7GB | High quality, reasoning |

mkdir -p models
curl -L -o models/Qwen3-4B-Q4_K_M.gguf \
  "https://huggingface.co/Qwen/Qwen3-4B-GGUF/resolve/main/Qwen3-4B-Q4_K_M.gguf"

2. Run

podman run -d --name llama \
  -v ./models:/models:Z \
  -p 8080:8080 \
  ghcr.io/karthik-sivadas/llama-server-podman:latest \
  --model /models/Qwen3-4B-Q4_K_M.gguf \
  --ctx-size 32768 \
  --flash-attn on \
  --threads 4

Note: The :Z volume flag handles SELinux relabeling on Fedora/RHEL/CentOS. On other distros you can omit it.

3. Use

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain containers in one sentence."}
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'
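Because the API is OpenAI-compatible, the same call works from code. A minimal Python sketch using only the standard library (it assumes the container from step 2 is listening on localhost:8080; the helper names here are illustrative, not part of the image):

```python
# Minimal chat client sketch; assumes llama-server is up on localhost:8080.
import json
import urllib.request

def build_payload(user_msg, system_msg="You are a helpful assistant.",
                  temperature=0.7, max_tokens=100):
    """Assemble an OpenAI-style chat completion request body."""
    return {
        "messages": [
            {"role": "system", "content": system_msg},
            {"role": "user", "content": user_msg},
        ],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

def chat(user_msg, url="http://localhost:8080/v1/chat/completions"):
    """POST a chat request and return the assistant's reply text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(build_payload(user_msg)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # llama-server answers in the standard OpenAI response shape.
    return body["choices"][0]["message"]["content"]

# usage (with the server running):
#   print(chat("Explain containers in one sentence."))
```

Any OpenAI SDK works the same way: point its base URL at `http://localhost:8080/v1` and use a placeholder API key.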

4. Manage

podman logs llama          # View logs
podman stats llama         # Monitor CPU/RAM
podman stop llama          # Stop
podman start llama         # Start again
podman rm llama            # Remove

Why Podman?

| | Podman | Docker |
|---|---|---|
| Daemon | None — each container is a process | Requires `dockerd` daemon |
| Root | Rootless by default | Rootless optional, root by default |
| Systemd | Native `podman generate systemd` | Needs extra config |
| SELinux | First-class `:Z` / `:z` support | Supported but less common |
| OCI compliant | Yes | Yes |
| CLI compatible | `alias docker=podman` works | — |
| Socket activation | Yes (systemd) | Via socket file |
| Fork/exec | Direct fork — no daemon middleman | Everything routes through daemon |

For an LLM server that should just start, run, and not die — Podman's daemonless architecture means one less thing that can crash.

API Reference

Implements the OpenAI Chat Completions API — drop-in compatible with any OpenAI SDK:

| Endpoint | Method | Description |
|---|---|---|
| `/v1/chat/completions` | POST | Chat completion (OpenAI-compatible) |
| `/v1/completions` | POST | Text completion |
| `/health` | GET | Health check (`{"status":"ok"}`) |
| `/metrics` | GET | Prometheus metrics (with `--metrics`) |
| `/slots` | GET | Active inference slot info |
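Large models can take a while to load into RAM, so it helps to gate traffic on `/health`. A small polling sketch (the endpoint and `{"status":"ok"}` body are per the table above; the probe is injected as a parameter so the loop itself is easy to test):

```python
# Readiness probe sketch: poll /health until the model has loaded.
import json
import time
import urllib.error
import urllib.request

def is_healthy(url="http://localhost:8080/health"):
    """True once llama-server reports {"status":"ok"}."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return json.load(resp).get("status") == "ok"
    except (urllib.error.URLError, OSError):
        return False

def wait_until_ready(probe=is_healthy, timeout=120.0, interval=2.0):
    """Poll `probe` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False
```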

Request body

{
  "messages": [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "..."}
  ],
  "temperature": 0.7,
  "top_p": 0.9,
  "max_tokens": 256,
  "stream": false,
  "grammar": "..."
}
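With `"stream": true` the server answers with Server-Sent Events in the OpenAI streaming format: one `data: {...}` chunk per delta, terminated by `data: [DONE]`. A client-side parsing sketch, assuming that chunk shape:

```python
# Sketch: extracting text from a streamed (SSE) chat completion response.
import json

def iter_deltas(sse_lines):
    """Yield content fragments from an iterable of decoded SSE lines."""
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines between events
        payload = line[len("data: "):]
        if payload == "[DONE]":
            return  # OpenAI-style end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]
```

Feed it the decoded lines of the HTTP response body and join the yielded fragments to reassemble the reply.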

Structured Output (GBNF Grammar)

The killer feature of llama.cpp: GBNF grammars enforce output structure at the token level. The model literally cannot produce output that doesn't match your schema.

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Rate this movie: Inception"}],
    "max_tokens": 100,
    "grammar": "root ::= \"{\" ws \"\\\"rating\\\":\" ws num \",\" ws \"\\\"review\\\":\" ws string ws \"}\"\nnum ::= [1-9] | \"10\"\nstring ::= \"\\\"\" [^\"]+ \"\\\"\"\nws ::= [ \\t\\n]*"
  }'

Guaranteed output:

{"rating": 9, "review": "A mind-bending masterpiece that redefines sci-fi cinema."}

No regex. No "please respond in JSON". No retries. Guaranteed at the token level.

See examples/structured-output.json for a full example and the GBNF docs for grammar syntax.
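The backslash-escaped one-liner above is hard to maintain by hand. A sketch that writes the same grammar readably and lets `json.dumps` do the escaping when building the request body:

```python
# Sketch: same grammar as the curl example, written readably.
import json

GRAMMAR = r'''
root   ::= "{" ws "\"rating\":" ws num "," ws "\"review\":" ws string ws "}"
num    ::= [1-9] | "10"
string ::= "\"" [^"]+ "\""
ws     ::= [ \t\n]*
'''.strip()

body = {
    "messages": [{"role": "user", "content": "Rate this movie: Inception"}],
    "max_tokens": 100,
    "grammar": GRAMMAR,
}
# json.dumps produces the fully escaped request body shown above.
request_json = json.dumps(body)
```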

Performance Tuning

Key flags

| Flag | Default | Description |
|---|---|---|
| `--ctx-size` | `2048` | Context window. Higher = more RAM. |
| `--threads` | auto | CPU threads for inference. Set to physical cores. |
| `--flash-attn` | `auto` | Flash Attention — faster, less memory. Set to `on`. |
| `--cache-type-k` | `f16` | KV cache key quantization. `q8_0` saves ~50% cache RAM. |
| `--cache-type-v` | `f16` | KV cache value quantization. `q4_0` saves ~75% cache RAM. |
| `--batch-size` | `2048` | Prompt processing batch size. |
| `--ubatch-size` | `512` | Micro-batch size. |
| `--cont-batching` | off | Continuous batching for concurrent requests. |
| `--metrics` | off | Expose Prometheus metrics at `/metrics`. |
| `--n-predict` | `-1` | Max tokens to generate (`-1` = unlimited). |

Memory estimates

RAM ≈ model_size + kv_cache_size

| Context | f16 (default) | q8_0 keys + q4_0 values | Savings |
|---|---|---|---|
| 4K | ~256MB | ~96MB | 62% |
| 8K | ~512MB | ~192MB | 62% |
| 32K | ~2GB | ~750MB | 62% |
| 128K | ~8GB | ~3GB | 62% |
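The savings column follows from nominal element sizes: 2 bytes for f16, ~1 for q8_0, ~0.5 for q4_0 (ignoring quantization block overhead). A sketch of the arithmetic; the layer count and KV dimension below are illustrative, and absolute sizes depend on the model's architecture (GQA models have much smaller KV dimensions):

```python
# KV cache size estimate: one key and one value vector per layer per token.
# Nominal bytes per element, ignoring q8_0/q4_0 block overhead.
BYTES = {"f16": 2.0, "q8_0": 1.0, "q4_0": 0.5}

def kv_cache_bytes(ctx, n_layers, kv_dim, k_type="f16", v_type="f16"):
    """ctx tokens x n_layers x kv_dim elements, for keys plus values."""
    per_token = n_layers * kv_dim * (BYTES[k_type] + BYTES[v_type])
    return ctx * per_token

# The relative savings of q8_0 keys + q4_0 values is shape-independent;
# 32 layers and kv_dim=1024 here are hypothetical example values.
f16 = kv_cache_bytes(32768, 32, 1024)
mixed = kv_cache_bytes(32768, 32, 1024, k_type="q8_0", v_type="q4_0")
savings = 1 - mixed / f16  # 0.625, i.e. the ~62% in the table
```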

Recommended configs

Tiny (1-2GB RAM) — Raspberry Pi, edge:

podman run -v ./models:/models:Z -p 8080:8080 \
  ghcr.io/karthik-sivadas/llama-server-podman:latest \
  --model /models/model-0.5b.gguf --ctx-size 4096 --threads 2

Small VPS (4GB RAM):

podman run -v ./models:/models:Z -p 8080:8080 \
  ghcr.io/karthik-sivadas/llama-server-podman:latest \
  --model /models/model-4b-q4.gguf --ctx-size 8192 --threads 4 \
  --flash-attn on --cache-type-k q8_0 --cache-type-v q4_0

Server (16GB+ RAM):

podman run -v ./models:/models:Z -p 8080:8080 \
  ghcr.io/karthik-sivadas/llama-server-podman:latest \
  --model /models/model-8b-q4.gguf --ctx-size 32768 --threads 8 \
  --flash-attn on --cache-type-k q8_0 --cache-type-v q4_0 \
  --cont-batching --batch-size 512 --metrics

Running as a Systemd Service

Podman integrates natively with systemd — no daemon needed:

# Create the container (don't start yet)
podman create --name llama \
  -v ./models:/models:Z -p 8080:8080 \
  ghcr.io/karthik-sivadas/llama-server-podman:latest \
  --model /models/your-model.gguf --ctx-size 32768 --flash-attn on --threads 4

# Generate a systemd unit
podman generate systemd --new --name llama > ~/.config/systemd/user/llama.service

# Enable and start
systemctl --user daemon-reload
systemctl --user enable --now llama.service

# Check status
systemctl --user status llama.service

# Survives logout (lingering)
loginctl enable-linger $USER

Now your LLM server starts on boot, restarts on crash, and needs zero root access.

Deployment Examples

Podman Compose

See examples/podman-compose.yml

podman-compose up -d

Kubernetes / K3s

See examples/kubernetes.yaml — includes init container for automatic model download.

kubectl apply -f examples/kubernetes.yaml

Podman Pod (multi-container)

Run the LLM server alongside your app in a Podman pod:

# Create a pod
podman pod create --name ai-stack -p 8080:8080 -p 3000:3000

# Add llama-server
podman run -d --pod ai-stack --name llama \
  -v ./models:/models:Z \
  ghcr.io/karthik-sivadas/llama-server-podman:latest \
  --model /models/Qwen3-4B-Q4_K_M.gguf --ctx-size 32768

# Add your app (can reach llama at localhost:8080 inside the pod)
podman run -d --pod ai-stack --name my-app my-app-image

Building from Source

# Latest llama.cpp
podman build -t llama-server .

# Pin to a specific release
podman build --build-arg LLAMA_CPP_VERSION=b4726 -t llama-server .

Architecture

┌────────────────────────────────────────────────────┐
│           Podman Container (rootless)              │
│                                                    │
│  ┌──────────────┐    ┌──────────────────────────┐  │
│  │ llama-server │    │ /models/model.gguf       │  │
│  │ (static bin) │◄───│ (mounted volume)         │  │
│  │  :8080       │    └──────────────────────────┘  │
│  └──────┬───────┘                                  │
│         │ OpenAI-compatible API                    │
│    user: llama (non-root)                          │
└─────────┼──────────────────────────────────────────┘
          │
    ┌─────▼─────┐
    │  Clients  │  curl, Python, JS, any OpenAI SDK
    └───────────┘
  1. Build: Clones llama.cpp, compiles llama-server with BUILD_SHARED_LIBS=OFF (static linking)
  2. Runtime: Ubuntu 24.04 minimal + binary + libgomp (OpenMP threading)
  3. Model: Mounted via volume — no model data baked into the image
  4. User: Runs as llama (non-root), works with Podman's rootless mode

Security

  • Non-root — runs as llama user inside the container
  • Rootless Podman — no root on the host either
  • No model data baked into the image
  • No network tools installed (no curl, wget, nc)
  • Static binary — minimal attack surface, no shared library hijacking
  • Weekly Trivy scans via GitHub Actions
  • Base image: ubuntu:24.04 LTS (regularly patched)
  • No daemon — nothing listening on a socket that could be exploited

Compared to Ollama

| | This image | Ollama |
|---|---|---|
| Container runtime | Podman-first (Docker compatible) | Docker-focused |
| Image size | ~95MB | ~500MB+ |
| Runtime overhead | ~5MB (static C++ binary) | ~200MB (Go runtime + daemon) |
| Root required | No (rootless Podman) | No (but daemon pattern) |
| KV cache quantization | ✅ Full control | ❌ Not exposed |
| GBNF grammars | ✅ Per-request structured output | ❌ Limited JSON mode |
| Systemd integration | ✅ Native via `podman generate systemd` | Manual |
| Model management | BYO (mount volume) | Built-in pull/push |
| Multi-model | One model per container | Auto load/unload |
| GPU support | ❌ CPU only | ✅ CUDA, ROCm, Metal |

Use this if: You want maximum control, minimal overhead, structured output, rootless security, and Podman-native workflow.

Use Ollama if: You want simplicity and don't need GBNF grammars or fine-grained tuning.

License

MIT — see LICENSE.

llama.cpp is also MIT licensed. Model weights have their own licenses — check the model card on HuggingFace.


Built by Karthik Sivadas · Running in production on a single VPS powering karthik.page's AI terminal
