|
| 1 | +# TAGLINE |
| 2 | + |
| 3 | +Local AI text generation server and inference engine |
| 4 | + |
| 5 | +# TLDR |
| 6 | + |
| 7 | +**Launch** with a GGUF model file |
| 8 | + |
| 9 | +```koboldcpp --model [path/to/model.gguf]``` |
| 10 | + |
| 11 | +**Launch** with GPU acceleration using CUDA |
| 12 | + |
| 13 | +```koboldcpp --model [path/to/model.gguf] --usecuda --gpulayers [35]``` |
| 14 | + |
| 15 | +**Launch** with Vulkan GPU support on a custom port |
| 16 | + |
| 17 | +```koboldcpp --model [path/to/model.gguf] --usevulkan --gpulayers [35] --port [8080]``` |
| 18 | + |
| 19 | +**Run a single prompt** without starting a server |
| 20 | + |
| 21 | +```koboldcpp --model [path/to/model.gguf] --prompt "[What is the meaning of life?]"``` |
| 22 | + |
| 23 | +**Launch in CLI interactive mode** without the web UI |
| 24 | + |
| 25 | +```koboldcpp --model [path/to/model.gguf] --cli``` |
| 26 | + |
| 27 | +**Load a saved configuration** file |
| 28 | + |
| 29 | +```koboldcpp --config [path/to/config.kcpps]``` |
| 30 | + |
| 31 | +# SYNOPSIS |
| 32 | + |
| 33 | +**koboldcpp** [_options_] [**--model** _model_path_] |
| 34 | + |
| 35 | +# PARAMETERS |
| 36 | + |
| 37 | +**--model** _path_ |
| 38 | +> Specify the GGUF/GGML model file to load |
| 39 | +
|
| 40 | +**--config** _file_ |
| 41 | +> Load a .kcpps configuration file |
| 42 | +
|
| 43 | +**--usecuda** |
| 44 | +> Enable NVIDIA CUDA GPU acceleration |
| 45 | +
|
| 46 | +**--usevulkan** |
| 47 | +> Enable Vulkan GPU acceleration (AMD/NVIDIA) |
| 48 | +
|
| 49 | +**--gpulayers** _n_ |
| 50 | +> Number of model layers to offload to GPU |
| 51 | +
|
| 52 | +**--threads** _n_ |
| 53 | +> Set CPU thread count for inference |
| 54 | +
|
| 55 | +**--contextsize** _n_ |
| 56 | +> Set maximum context length in tokens |
| 57 | +
|
| 58 | +**--port** _n_ |
| 59 | +> Change server port (default: 5001) |
| 60 | +
|
| 61 | +**--host** _addr_ |
| 62 | +> Bind to a specific IP address |
| 63 | +
|
| 64 | +**--multiuser** _n_ |
| 65 | +> Enable multiuser mode with _n_ concurrent slots |
| 66 | +
|
| 67 | +**--password** _key_ |
| 68 | +> Require API authentication with the given key |
| 69 | +
|
| 70 | +**--cli** |
| 71 | +> Launch interactive command-line interface without starting a server |
| 72 | +
|
| 73 | +**--prompt** _text_ |
| 74 | +> Run a single prompt, print output, and exit |
| 75 | +
|
| 76 | +**--benchmark** |
| 77 | +> Run performance benchmarking mode |
| 78 | +
|
| 79 | +**--flashattention** |
| 80 | +> Enable flash attention, which can improve performance on compatible model architectures |
| 81 | +
|
| 82 | +**--smartcontext** |
| 83 | +> Enable smart context handling to reduce reprocessing |
| 84 | +
|
| 85 | +**--usemmap** |
| 86 | +> Enable memory-mapped file I/O for model loading |
| 87 | +
|
| 88 | +**--usemlock** |
| 89 | +> Lock the model in RAM to prevent it from being swapped out |
| 90 | +
|
| 91 | +**--ssl** |
| 92 | +> Enable SSL for HTTPS connections |
| 93 | +
|
| 94 | +**--remotetunnel** |
| 95 | +> Enable remote tunnel access for sharing the server |
| 96 | +
|
| 97 | +**--sdmodel** _path_ |
| 98 | +> Load a Stable Diffusion model for image generation |
| 99 | +
|
| 100 | +**--noavx2** |
| 101 | +> Compatibility mode for older CPUs without AVX2 |
| 102 | +
|
| 103 | +**--showgui** |
| 104 | +> Show the GUI launcher even when command-line flags are used |
| 105 | +
|
| 106 | +**--help** |
| 107 | +> Display all available command-line options and exit |
| 108 | +
|
| 109 | +# DESCRIPTION |
| 110 | + |
| 111 | +**koboldcpp** is a self-contained AI text generation server that runs large language models locally. Built on top of **llama.cpp**, it provides a bundled web UI (KoboldAI Lite) and supports all GGML and GGUF model formats. It requires no external dependencies and runs as a single executable. |
| 112 | + |
| 113 | +The server exposes an API compatible with KoboldAI and OpenAI formats, making it usable with a wide range of frontends and applications. It supports CPU inference as well as GPU acceleration through **CUDA** (NVIDIA), **Vulkan** (AMD/NVIDIA), and **Metal** (Apple Silicon). |
| 114 | + |
| 115 | +Beyond text generation, koboldcpp supports **image generation** (Stable Diffusion), **speech recognition** (Whisper), and **text-to-speech**, all within the same executable. The bundled web UI offers multiple interaction modes including chat, instruct, adventure, and story writing. |
| 116 | + |
| 117 | +# CONFIGURATION |
| 118 | + |
| 119 | +When launched without arguments, koboldcpp opens a **GUI launcher** for interactive configuration. Settings can be saved to and loaded from **.kcpps** configuration files. Command-line flags override GUI settings when both are used. |
| 120 | + |
| 121 | +Key configuration considerations include **GPU layer offloading** (offloading more layers to the GPU speeds up inference but requires more VRAM), **context size** (larger contexts use more memory), and **thread count** (typically set to the number of physical CPU cores). |
| 122 | + |
| 123 | +# CAVEATS |
| 124 | + |
| 125 | +Model files can be very large (several GB to over 100 GB) and require significant RAM or VRAM. GPU acceleration requires appropriate drivers and hardware support. Performance varies significantly based on model size, quantization level, and available hardware. The Vulkan backend is more broadly compatible but generally slower than CUDA on NVIDIA hardware. Flash attention requires compatible model architectures. |
| 126 | + |
| 127 | +# HISTORY |
| 128 | + |
| 129 | +KoboldCpp was created by a developer known as **LostRuins** (alias **Concedo**) and first released on **March 16, 2023** as a fork of **llama.cpp** combined with the KoboldAI interface. It was designed to provide a simple, self-contained way to run large language models locally without complex setup. The project grew rapidly alongside the open-source LLM movement, continuously adding features like multi-modal support, GPU backends, and image generation capabilities. It is licensed under **AGPL-3.0**. |
| 130 | + |
| 131 | +# SEE ALSO |
| 132 | + |
| 133 | +[llama](/man/llama)(1), [ollama](/man/ollama)(1) |
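
The tuning considerations named in the CONFIGURATION section (GPU layer offloading, context size, thread count) map directly onto the `--gpulayers`, `--contextsize`, and `--threads` flags listed under PARAMETERS. The following is a minimal sketch of a wrapper that assembles such a command line. The function name `build_kcpp_cmd`, the environment variable names, and every default value are assumptions chosen for illustration; suitable values depend on the model and the hardware.

```shell
#!/bin/sh
# Sketch only: assemble a koboldcpp command line from three tunables.
# build_kcpp_cmd, the variable names, and the defaults are hypothetical
# placeholders, not part of koboldcpp itself.

MODEL="${MODEL:-path/to/model.gguf}"
GPU_LAYERS="${GPU_LAYERS:-35}"        # more layers on the GPU: faster, but more VRAM
CONTEXT_SIZE="${CONTEXT_SIZE:-4096}"  # larger contexts use more memory
THREADS="${THREADS:-4}"               # typically the number of physical CPU cores

build_kcpp_cmd() {
    # Note: the model path is printed unquoted, so paths containing
    # spaces need quoting before the command is pasted into a shell.
    printf 'koboldcpp --model %s --usecuda --gpulayers %s --contextsize %s --threads %s\n' \
        "$MODEL" "$GPU_LAYERS" "$CONTEXT_SIZE" "$THREADS"
}

# Print the command for review instead of running it.
build_kcpp_cmd
```

Printing the command rather than executing it keeps the sketch safe to run on a machine without koboldcpp installed. If VRAM runs out, lowering `GPU_LAYERS` or `CONTEXT_SIZE` and printing the command again is the intended workflow; on non-NVIDIA hardware, `--usecuda` would be swapped for `--usevulkan`.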