---
title: Deployment Frameworks
description: NexusGate compatibility with LLM deployment frameworks
icon: Server
---

This page explains how to integrate NexusGate with various local model deployment frameworks.

<Callout type="info">
Self-hosted model services are configured as upstream providers through the NexusGate web console. Add them in the **Upstreams** page using their OpenAI-compatible API endpoint.
</Callout>
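
Once an upstream is configured, clients call NexusGate's OpenAI-compatible endpoint rather than the framework directly. A minimal smoke test with `curl`, assuming NexusGate is reachable at `http://localhost:3000` and `$NEXUSGATE_API_KEY` holds a key created in the console (both are placeholders for your own deployment):

```bash
# Hypothetical gateway host/port and key -- substitute your deployment's values
curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $NEXUSGATE_API_KEY" \
  -d '{
    "model": "qwen2.5-7b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```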

## vLLM

vLLM is a high-throughput LLM inference engine built around optimizations such as PagedAttention.

### Deploy vLLM

```bash
# Using Docker
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-7B-Instruct \
  --served-model-name qwen2.5-7b
```

### Configure in NexusGate

In the NexusGate console, add a new upstream provider:

| Field | Value |
|-------|-------|
| Provider Type | OpenAI Compatible |
| Base URL | `http://localhost:8000/v1` |
| API Key | Your vLLM API key (set via `--api-key`, if configured) |
| Models | `qwen2.5-7b` |
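
Before adding the upstream, you can confirm the endpoint and the served model name directly against vLLM:

```bash
# The returned model id should match --served-model-name above
curl http://localhost:8000/v1/models

# Quick completion against the served name
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen2.5-7b", "messages": [{"role": "user", "content": "ping"}]}'
```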

### Compatibility

| Feature | Status |
|---------|--------|
| Chat Completions | Supported |
| Streaming | Supported |
| Usage Statistics | Supported |
| Function Calling | Supported (model-dependent) |
| Vision | Supported (requires vision model) |
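
Function calling typically requires launching vLLM with tool support enabled for the model family, e.g. `--enable-auto-tool-choice --tool-call-parser hermes` for Qwen models (flag availability varies across vLLM versions; check the vLLM docs). A minimal sketch of a tools request, assuming such a launch and a hypothetical `get_weather` tool:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-7b",
    "messages": [{"role": "user", "content": "What is the weather in Berlin?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'
```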

## SGLang

SGLang is an efficient LLM serving framework focused on structured generation.

### Deploy SGLang

```bash
# Install (quote the extras so shells like zsh don't expand the brackets)
pip install "sglang[all]"

# Start service
python -m sglang.launch_server \
  --model-path Qwen/Qwen2.5-7B-Instruct \
  --port 8000 \
  --host 0.0.0.0
```

### Configure in NexusGate

In the NexusGate console, add a new upstream provider:

| Field | Value |
|-------|-------|
| Provider Type | OpenAI Compatible |
| Base URL | `http://localhost:8000/v1` |
| Models | `Qwen/Qwen2.5-7B-Instruct` |
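
By default SGLang registers the model under its full Hugging Face path, as shown in the Models field above; pass `--served-model-name` at launch if you prefer a shorter alias. Either way, confirm the exact id before configuring the upstream:

```bash
# The returned id is what goes in the Models field
curl http://localhost:8000/v1/models
```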

### Compatibility

| Feature | Status |
|---------|--------|
| Chat Completions | Supported |
| Streaming | Supported |
| Usage Statistics | Supported |
| Function Calling | Partial support |
| Vision | Supported (requires vision model) |

### Known Issues

SGLang's `reasoning_tokens` field may not follow the OpenAI response format exactly. Verify usage reporting in NexusGate when proxying reasoning models.

## TGI (Text Generation Inference)

TGI is a high-performance inference server from Hugging Face.

### Deploy TGI

```bash
docker run --gpus all --shm-size 1g \
  -p 8080:80 \
  -v ~/.cache/huggingface:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id Qwen/Qwen2.5-7B-Instruct \
  --max-input-tokens 4096 \
  --max-total-tokens 8192
```
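
Recent TGI versions expose an OpenAI-compatible Messages API at `/v1/chat/completions`, which is what NexusGate consumes. A quick check against the container above (TGI serves a single model per server, so the `model` field mostly acts as a label):

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```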

### Configure in NexusGate

In the NexusGate console, add a new upstream provider:

| Field | Value |
|-------|-------|
| Provider Type | OpenAI Compatible |
| Base URL | `http://localhost:8080/v1` |
| Models | `Qwen/Qwen2.5-7B-Instruct` |

### Compatibility

| Feature | Status |
|---------|--------|
| Chat Completions | Supported |
| Streaming | Supported |
| Usage Statistics | Supported |
| Function Calling | Model-dependent |
| Vision | Not supported |

## Ollama

Ollama is an easy-to-use local model runtime.

### Deploy Ollama

```bash
# Install Ollama (on Linux, the script also sets up a systemd service)
curl -fsSL https://ollama.com/install.sh | sh

# Pull model
ollama pull llama3.2

# Start the service manually if it is not already running (default port 11434)
ollama serve
```
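
Ollama exposes an OpenAI-compatible API under `/v1` alongside its native API, which is the endpoint NexusGate uses:

```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2", "messages": [{"role": "user", "content": "Hello"}]}'
```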

### Configure in NexusGate

In the NexusGate console, add a new upstream provider:

| Field | Value |
|-------|-------|
| Provider Type | Ollama |
| Base URL | `http://localhost:11434/v1` |
| Models | `llama3.2`, `qwen2.5`, `deepseek-r1` |

### Compatibility

| Feature | Status |
|---------|--------|
| Chat Completions | Supported |
| Streaming | Supported |
| Usage Statistics | Supported |
| Function Calling | Model-dependent |
| Vision | Supported (requires vision model) |

### Ollama Special Configuration

If Ollama rejects requests from other origins, allow them via the `OLLAMA_ORIGINS` environment variable (see the systemd variant below):

```bash
# Applies to a manually launched `ollama serve` in this shell
export OLLAMA_ORIGINS="*"
```
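
When Ollama runs as a systemd service (the default after the Linux install script), the `export` above has no effect; set the variable on the service instead. A sketch:

```bash
# Open a drop-in override for the service
sudo systemctl edit ollama.service

# In the editor, add:
#   [Service]
#   Environment="OLLAMA_ORIGINS=*"

sudo systemctl restart ollama
```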

## llama.cpp

llama.cpp provides lightweight CPU/GPU inference.

### Deploy llama.cpp Server

```bash
# Build (CUDA backend shown; see the repo README for other backends)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

# Start service
./build/bin/llama-server \
  -m models/qwen2.5-7b.gguf \
  --host 0.0.0.0 \
  --port 8080
```
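
`llama-server` exposes a health probe and an OpenAI-compatible chat endpoint (with a single loaded model, the `model` field is effectively a label):

```bash
# Returns OK once the model has finished loading
curl http://localhost:8080/health

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen2.5-7b", "messages": [{"role": "user", "content": "Hello"}]}'
```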

### Configure in NexusGate

In the NexusGate console, add a new upstream provider:

| Field | Value |
|-------|-------|
| Provider Type | OpenAI Compatible |
| Base URL | `http://localhost:8080/v1` |
| Models | `qwen2.5-7b` |

### Compatibility

| Feature | Status |
|---------|--------|
| Chat Completions | Supported |
| Streaming | Supported |
| Usage Statistics | Supported |
| Function Calling | Not supported |
| Vision | Requires a multimodal model |

## MindIE

MindIE is Huawei's inference engine for Ascend hardware.

### Configure in NexusGate

In the NexusGate console, add a new upstream provider:

| Field | Value |
|-------|-------|
| Provider Type | OpenAI Compatible |
| Base URL | `http://localhost:8000/v1` |
| Models | Your model name |

### Compatibility

| Feature | Status |
|---------|--------|
| Chat Completions | Supported |
| Streaming | Supported |
| Usage Statistics | Partial (version-dependent) |
| Function Calling | Partial support |
| Vision | Supported |

## Performance Tuning Tips

### vLLM

```bash
# Optimization parameters
--tensor-parallel-size 2       # Multi-GPU parallelism
--gpu-memory-utilization 0.9   # Fraction of GPU memory to use
--max-num-seqs 256             # Max concurrent sequences
```
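
Combined with the deployment command shown earlier, a tuned launch looks like:

```bash
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-7B-Instruct \
  --served-model-name qwen2.5-7b \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.9 \
  --max-num-seqs 256
```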

### Ollama

```bash
# Environment variables
OLLAMA_NUM_PARALLEL=4        # Parallel requests per model
OLLAMA_MAX_LOADED_MODELS=2   # Max models kept in memory
```

### llama.cpp

```bash
# Startup parameters
--n-gpu-layers 35   # Layers to offload to the GPU
--batch-size 512    # Batch size
--ctx-size 8192     # Context length
```

## FAQ

### Q: The model list displays incorrectly

Local deployment frameworks' `/models` endpoints may return non-standard formats. Manually specify the model names when configuring the upstream in NexusGate.

### Q: Usage statistics are inaccurate

Some frameworks count tokens differently or omit `usage` fields, particularly on streamed responses. This is a known limitation of certain self-hosted inference frameworks.

### Q: Streaming responses get interrupted

1. Check network connection stability
2. Increase the timeout configuration
3. Check the model service logs

## Related Links

- [vLLM Documentation](https://docs.vllm.ai/)
- [SGLang Documentation](https://github.com/sgl-project/sglang)
- [TGI Documentation](https://huggingface.co/docs/text-generation-inference)
- [Ollama Documentation](https://ollama.com/)
- [llama.cpp Documentation](https://github.com/ggerganov/llama.cpp)