Commit 3c799ba

pescn and claude committed

docs: add comprehensive bilingual documentation

- Add guides: introduction and getting-started (en/zh)
- Add integrations: chat assistants, dev tools, productivity, translation
- Add compatibility: deployment, matrix, upstream providers
- Update index page with feature cards

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

1 parent e1026dd commit 3c799ba

68 files changed: 9193 additions & 30 deletions

Lines changed: 286 additions & 0 deletions

---
title: Deployment Frameworks
description: NexusGate compatibility with LLM deployment frameworks
icon: Server
---

This page explains how to integrate NexusGate with various local model deployment frameworks.

<Callout type="info">
Self-hosted model services are configured as upstream providers through the NexusGate web console. Add them in the **Upstreams** page using their OpenAI-compatible API endpoint.
</Callout>

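Before registering an upstream, it can help to confirm that the service actually speaks the OpenAI-compatible protocol. A minimal probe, assuming the service listens on `http://localhost:8000` as in the examples below:

```bash
# Any OpenAI-compatible server should answer this with a JSON model list.
curl http://localhost:8000/v1/models

# A minimal chat completion; replace the model name with one the upstream serves.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen2.5-7b", "messages": [{"role": "user", "content": "Hello"}]}'
```
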
## vLLM

vLLM is a high-performance LLM inference framework that supports optimizations such as PagedAttention.

### Deploy vLLM

```bash
# Using Docker
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-7B-Instruct \
  --served-model-name qwen2.5-7b
```

### Configure in NexusGate

In the NexusGate console, add a new upstream provider:

| Field | Value |
|-------|-------|
| Provider Type | OpenAI Compatible |
| Base URL | `http://localhost:8000/v1` |
| API Key | Your vLLM API key (if configured) |
| Models | `qwen2.5-7b` |

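Once the upstream is saved, traffic can be verified end to end through the gateway. A sketch; `https://nexusgate.example.com` and `$NEXUSGATE_API_KEY` are placeholders for your own deployment's address and an API key created in the console:

```bash
# Call the model through NexusGate rather than vLLM directly.
# Gateway URL and API key below are placeholders for your deployment.
curl https://nexusgate.example.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $NEXUSGATE_API_KEY" \
  -d '{"model": "qwen2.5-7b", "messages": [{"role": "user", "content": "Hello"}]}'
```
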
### Compatibility

| Feature | Status |
|---------|--------|
| Chat Completions | Supported |
| Streaming | Supported |
| Usage Statistics | Supported |
| Function Calling | Supported (model-dependent) |
| Vision | Supported (requires a vision model) |

## SGLang

SGLang is an efficient LLM serving framework focused on structured generation.

### Deploy SGLang

```bash
# Install (quotes keep the extra from being glob-expanded by some shells)
pip install "sglang[all]"

# Start service
python -m sglang.launch_server \
  --model-path Qwen/Qwen2.5-7B-Instruct \
  --port 8000 \
  --host 0.0.0.0
```

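For multi-GPU serving, SGLang supports tensor parallelism via a launch flag. A sketch; the flag name has shifted between releases, so confirm it against `python -m sglang.launch_server --help` for your installed version:

```bash
# Shard the model across two GPUs; verify the flag name with --help first.
python -m sglang.launch_server \
  --model-path Qwen/Qwen2.5-7B-Instruct \
  --tp 2 \
  --port 8000 \
  --host 0.0.0.0
```
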
### Configure in NexusGate

In the NexusGate console, add a new upstream provider:

| Field | Value |
|-------|-------|
| Provider Type | OpenAI Compatible |
| Base URL | `http://localhost:8000/v1` |
| Models | `Qwen/Qwen2.5-7B-Instruct` |

### Compatibility

| Feature | Status |
|---------|--------|
| Chat Completions | Supported |
| Streaming | Supported |
| Usage Statistics | Supported |
| Function Calling | Partial support |
| Vision | Supported (requires a vision model) |

### Known Issues

SGLang may return `reasoning_tokens` in a format that differs from the OpenAI standard. Keep this in mind when routing reasoning models through NexusGate.

## TGI (Text Generation Inference)

TGI is a high-performance inference server from Hugging Face.

### Deploy TGI

```bash
docker run --gpus all --shm-size 1g \
  -p 8080:80 \
  -v ~/.cache/huggingface:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id Qwen/Qwen2.5-7B-Instruct \
  --max-input-length 4096 \
  --max-total-tokens 8192
```

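Note that the container serves on port 80 internally and is mapped to 8080 on the host, so the OpenAI-compatible API is reached at `http://localhost:8080/v1`. A quick check, assuming a recent TGI version that exposes the Messages API:

```bash
# Chat through TGI's OpenAI-compatible Messages API on the mapped host port.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-7B-Instruct", "messages": [{"role": "user", "content": "Hello"}]}'
```
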
### Configure in NexusGate

In the NexusGate console, add a new upstream provider:

| Field | Value |
|-------|-------|
| Provider Type | OpenAI Compatible |
| Base URL | `http://localhost:8080/v1` |
| Models | `Qwen/Qwen2.5-7B-Instruct` |

### Compatibility

| Feature | Status |
|---------|--------|
| Chat Completions | Supported |
| Streaming | Supported |
| Usage Statistics | Supported |
| Function Calling | Model-dependent |
| Vision | Not supported |

## Ollama

Ollama is an easy-to-use tool for running models locally.

### Deploy Ollama

```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull model
ollama pull llama3.2

# Start service (default port 11434)
ollama serve
```

### Configure in NexusGate

In the NexusGate console, add a new upstream provider:

| Field | Value |
|-------|-------|
| Provider Type | Ollama |
| Base URL | `http://localhost:11434/v1` |
| Models | `llama3.2`, `qwen2.5`, `deepseek-r1` |

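The model names entered here must match the tags Ollama knows locally; `ollama list` shows exactly what is available:

```bash
# Show locally available models; the NAME column is what goes
# into the Models field above (include the tag if it is not :latest).
ollama list

# Pull any model that is missing before adding it to NexusGate.
ollama pull qwen2.5
```
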
### Compatibility

| Feature | Status |
|---------|--------|
| Chat Completions | Supported |
| Streaming | Supported |
| Usage Statistics | Supported |
| Function Calling | Model-dependent |
| Vision | Supported (requires a vision model) |

### Ollama Special Configuration

Enable cross-origin requests if needed:

```bash
# Set environment variable
export OLLAMA_ORIGINS="*"
```

## llama.cpp

llama.cpp provides lightweight CPU/GPU inference capabilities.

### Deploy llama.cpp Server

```bash
# Build
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make LLAMA_CUDA=1

# Start service
./llama-server \
  -m models/qwen2.5-7b.gguf \
  --host 0.0.0.0 \
  --port 8080
```

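Recent `llama-server` builds expose a health endpoint that makes a handy readiness check (worth confirming on your build; `/v1/models` also works on servers implementing the OpenAI-compatible surface):

```bash
# Readiness probe; returns a small JSON status once the model is loaded.
curl http://localhost:8080/health
```
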
### Configure in NexusGate

In the NexusGate console, add a new upstream provider:

| Field | Value |
|-------|-------|
| Provider Type | OpenAI Compatible |
| Base URL | `http://localhost:8080/v1` |
| Models | `qwen2.5-7b` |

### Compatibility

| Feature | Status |
|---------|--------|
| Chat Completions | Supported |
| Streaming | Supported |
| Usage Statistics | Supported |
| Function Calling | Not supported |
| Vision | Supported (requires a multimodal model) |

## MindIE

MindIE is Huawei's AI inference engine.

### Configure in NexusGate

In the NexusGate console, add a new upstream provider:

| Field | Value |
|-------|-------|
| Provider Type | OpenAI Compatible |
| Base URL | `http://localhost:8000/v1` |
| Models | Your model name |

### Compatibility

| Feature | Status |
|---------|--------|
| Chat Completions | Supported |
| Streaming | Supported |
| Usage Statistics | Partial (version-dependent) |
| Function Calling | Partial support |
| Vision | Supported |

## Performance Tuning Tips

### vLLM

```bash
# Optimization parameters
--tensor-parallel-size 2        # Multi-GPU parallelism
--gpu-memory-utilization 0.9    # GPU memory usage
--max-num-seqs 256              # Max concurrent sequences
```

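In context, these flags are appended to the serving command; for example, extending the Docker invocation from the vLLM section above:

```bash
# The same vLLM container start as earlier, with the tuning flags applied.
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-7B-Instruct \
  --served-model-name qwen2.5-7b \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.9 \
  --max-num-seqs 256
```
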
### Ollama

```bash
# Environment variables
OLLAMA_NUM_PARALLEL=4        # Parallel requests
OLLAMA_MAX_LOADED_MODELS=2   # Max loaded models
```

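On systemd-based Linux installs, these variables belong on the Ollama service rather than in your shell. A sketch of the usual pattern, assuming the service installed by the official script:

```bash
# Create a systemd override so the variables persist across restarts.
sudo systemctl edit ollama
# In the editor, add:
#   [Service]
#   Environment="OLLAMA_NUM_PARALLEL=4"
#   Environment="OLLAMA_MAX_LOADED_MODELS=2"
sudo systemctl restart ollama
```
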
### llama.cpp

```bash
# Start parameters
--n-gpu-layers 35   # GPU layers
--batch-size 512    # Batch size
--ctx-size 8192     # Context length
```

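Combined with the serve command from the llama.cpp section above, a fully tuned invocation looks like:

```bash
# The same llama-server start as above, with the tuning flags applied.
./llama-server \
  -m models/qwen2.5-7b.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --n-gpu-layers 35 \
  --batch-size 512 \
  --ctx-size 8192
```
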
## FAQ

### Q: Why does the model list display incorrectly?

The `/models` endpoint of local deployment frameworks may return non-standard formats. Manually specify the model names when configuring the upstream in NexusGate.

### Q: Why are usage statistics inaccurate?

Some frameworks count tokens differently from the OpenAI specification. This is a known limitation of certain self-hosted inference frameworks.

### Q: Why are streaming responses interrupted?

1. Check network connection stability
2. Increase the timeout configuration
3. Check the model service logs

## Related Links

- [vLLM Documentation](https://docs.vllm.ai/)
- [SGLang Documentation](https://github.com/sgl-project/sglang)
- [TGI Documentation](https://huggingface.co/docs/text-generation-inference)
- [Ollama Documentation](https://ollama.com/)
- [llama.cpp Documentation](https://github.com/ggerganov/llama.cpp)
