
Commit 583335e

vdwarak and nazneenn authored
Added support for Vision models (#141)
Co-authored-by: nazneenn <nazneen.nighar.sultana@intel.com>
1 parent a7916e5 commit 583335e

4 files changed, with 82 additions and 20 deletions


PyTorch/vLLM_Tutorials/Deploying_vLLM/README.md

Lines changed: 32 additions & 2 deletions
@@ -18,6 +18,8 @@ This folder contains scripts and configuration files that can be used to build a
 |Qwen/Qwen2.5-32B-Instruct |1|
 |Qwen/Qwen2.5-72B-Instruct |4|
 |Qwen/Qwen2.5-7B-Instruct |1|
+|meta-llama/Llama-3.2-11B-Vision-Instruct |1|
+|meta-llama/Llama-3.2-90B-Vision-Instruct |4|
 ## Quick Start
 To run these models on your Gaudi machine:
 

@@ -53,7 +55,7 @@ docker build -f Dockerfile-1.21.1-ub24-vllm-v0.7.2+Gaudi $BUILD_ARGS -t vllm-v0.
 > You can do this by adding parameters to the docker run command.
 > Example: "-e HF_HOME=/mnt/huggingface -v /mnt/huggingface:/mnt"
 
-5) Start the vLLM server with a default context of 4K and default TP from the table above
+5) Start the vLLM server with the default context length (4K for text models, 8K for vision models) and the default TP per the table above
 ```bash
 docker run -it --rm \
 -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e no_proxy=$no_proxy \
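For clarity, here is how the Hugging Face cache settings from the note above would slot into the run command. This is an illustrative fragment only; the image tag and the remaining flags follow the full command shown in the README, and the host path is an example, not a requirement:

```bash
# Illustrative fragment; /mnt/huggingface is an example host path.
docker run -it --rm \
  -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e no_proxy=$no_proxy \
  -e HF_HOME=/mnt/huggingface -v /mnt/huggingface:/mnt \
  ...
```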
@@ -83,7 +85,7 @@ curl -s --noproxy '*' http://${target}:8000/v1/completions -H 'Content-Type: app
 </code>
 &nbsp;
 
-8) (Optional) Run the perftest.sh command in a **separate terminal** for obtaining basic metrics like the example below for Gaudi3:
+8.1) (Optional, for text-based models) Run the perftest.sh command in a **separate terminal** to obtain basic metrics like the example below for Gaudi3:
 ```bash
 docker exec vllm-server /root/scripts/perftest.sh
 ```
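Since the vision models are exercised through the chat endpoint rather than /v1/completions (as the new benchmark script below also does), a request against them looks roughly like the sketch below. The model name comes from the table above; the prompt and image URL are placeholders:

```bash
# Hypothetical vision request via the OpenAI-compatible chat API; the image URL is a placeholder.
curl -s --noproxy '*' http://${target}:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
        "messages": [{
          "role": "user",
          "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}}
          ]
        }],
        "max_tokens": 128
      }'
```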
@@ -142,6 +144,34 @@ P90 ITL (ms): 61.32
 > OUTPUT_TOKENS=2048
 > CONCURRENT_REQUESTS=64
 
+8.2) (Optional, for vision models) Run the perftest_vision.sh command in a **separate terminal** to obtain basic metrics like the example below for Gaudi3:
+```bash
+docker exec vllm-server /root/scripts/perftest_vision.sh
+```
+<pre>
+# meta-llama/Llama-3.2-11B-Vision-Instruct
+============ Serving Benchmark Result ============
+Successful requests: 500
+Benchmark duration (s): 121.53
+Total input tokens: 31710
+Total generated tokens: 64000
+Request throughput (req/s): 4.11
+Output token throughput (tok/s): 526.63
+Total Token throughput (tok/s): 787.56
+---------------Time to First Token----------------
+Mean TTFT (ms): 5642.06
+Median TTFT (ms): 5589.81
+P90 TTFT (ms): 8825.33
+-----Time per Output Token (excl. 1st token)------
+Mean TPOT (ms): 74.14
+Median TPOT (ms): 72.15
+P90 TPOT (ms): 101.27
+---------------Inter-token Latency----------------
+Mean ITL (ms): 73.56
+Median ITL (ms): 34.46
+P90 ITL (ms): 88.77
+==================================================
+
 9) Optionally, you can run perftest.sh with custom parameters like so:
 ```bash
 ## Usage: docker exec vllm-server /root/scripts/perftest.sh <INPUT_TOKENS> <OUTPUT_TOKENS> <CONCURRENT_REQUESTS>
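As a quick consistency check on the vision benchmark above, the reported throughputs follow from the totals: 500 requests / 121.53 s ≈ 4.11 req/s, 64000 generated tokens / 121.53 s ≈ 526.6 tok/s, and (31710 + 64000) tokens / 121.53 s ≈ 787.6 tok/s.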
Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
+#!/bin/bash
+
+## Edit the following variables to test for alternate performance scenarios
+DATASET=$1
+NUM_PROMPTS=$2
+CONCURRENT_REQ=$3
+DATASET=${DATASET:-"lmarena-ai/vision-arena-bench-v0.1"}
+NUM_PROMPTS=${NUM_PROMPTS:-500}
+CONCURRENT_REQ=${CONCURRENT_REQ:-64}
+
+cd /root
+python3 vllm-fork/benchmarks/benchmark_serving.py \
+    --model $MODEL \
+    --base-url http://localhost:8000 \
+    --backend openai-chat \
+    --endpoint /v1/chat/completions \
+    --dataset-name hf \
+    --dataset-path $DATASET \
+    --hf-split train \
+    --num-prompts $NUM_PROMPTS \
+    --max-concurrency $CONCURRENT_REQ \
+    --metric-percentiles 90 \
+    2>&1 | tee -a perftest_dataset${DATASET}_prompts${NUM_PROMPTS}_user${CONCURRENT_REQ}.log
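The script above takes the dataset, prompt count, and concurrency as optional positional arguments and expects MODEL to already be set in the container environment (presumably exported when the vLLM server container is started). A hypothetical invocation with explicit values:

```bash
# Hypothetical invocation; arguments are DATASET, NUM_PROMPTS and CONCURRENT_REQ, in that order.
docker exec vllm-server /root/scripts/perftest_vision.sh \
    lmarena-ai/vision-arena-bench-v0.1 500 64
```

Note that the default DATASET value contains a slash, so the log path built in the tee call resolves into a subdirectory that may not exist; the benchmark itself still runs even if tee cannot create the log file.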
Lines changed: 18 additions & 16 deletions
@@ -1,16 +1,18 @@
-MODEL,INPUT,TENSOR_PARALLEL_SIZE,MAX_MODEL_LEN,TOTAL_GPU_MEM,UNAVAILABLE_MEM_ABS,MODEL_MEM_FROM_CONFIG,MODEL_DTYPE,QUANT_DTYPE,MODEL_MEM,PROFILER_MEM_OVERHEAD,APPROX_MEM_PER_GRAPH_MB,fsdpa,GPU_FREE_MEM_TARGET,BLOCK_SIZE,VLLM_PROMPT_BS_BUCKET_MIN,VLLM_PROMPT_BS_BUCKET_STEP,VLLM_DECODE_BS_BUCKET_MIN,VLLM_DECODE_BS_BUCKET_STEP,VLLM_PROMPT_SEQ_BUCKET_MIN,VLLM_PROMPT_SEQ_BUCKET_STEP,VLLM_DECODE_BLOCK_BUCKET_MIN,VLLM_DECODE_BLOCK_BUCKET_STEP,MAX_NUM_PREFILL_SEQS,NUM_HIDDEN_LAYERS,HIDDEN_SIZE,NUM_KEY_VALUE_HEADS,NUM_ATTENTION_HEADS,CACHE_DTYPE_BYTES,LIMIT_MODEL_LEN,PT_HPU_LAZY_MODE,VLLM_DELAYED_SAMPLING,VLLM_SKIP_WARMUP,EXPERIMENTAL_WEIGHT_SHARING
-meta-llama/Llama-3.1-8B-Instruct,,1,4352,128,2,16060522496,2,2,14.95752716,5.5,10,1,1,128,1,32,1,32,128,256,128,256,16,32,4096,8,32,2,131072,1,TRUE,FALSE,0
-meta-llama/Llama-3.1-70B-Instruct,,4,4352,512,2,1.41107E+11,2,2,131.4165192,5.5,10,1,1,128,1,32,1,32,128,256,128,256,16,80,8192,8,64,2,131072,1,TRUE,FALSE,0
-meta-llama/Llama-3.3-70B-Instruct,,4,4352,512,2,1.41107E+11,2,2,131.4165192,5.5,10,1,1,128,1,32,1,32,128,256,128,256,16,80,8192,8,64,2,131072,1,TRUE,FALSE,0
-meta-llama/Llama-3.2-1B-Instruct,,1,4352,128,2,2471645608,2,2,2.301899351,5.5,10,1,1,128,1,32,1,32,128,256,128,256,16,16,2048,8,32,2,131072,1,TRUE,FALSE,0
-meta-llama/Llama-3.2-3B-Instruct,,1,4352,128,2,6425499648,2,2,5.984212875,5.5,10,1,1,128,1,32,1,32,128,256,128,256,16,28,3072,8,24,2,131072,1,TRUE,FALSE,0
-mistralai/Mixtral-8x7B-Instruct-v0.1,,2,4352,256,2,93405585408,2,2,86.99073029,5.5,10,1,1,128,1,32,1,32,128,256,128,256,16,32,4096,8,32,2,32768,1,TRUE,FALSE,0
-mistralai/Mixtral-8x22B-Instruct-v0.1,,4,4352,512,2,2.8126E+11,2,2,261.9439201,5.5,10,1,1,128,1,32,1,32,128,256,128,256,16,56,6144,8,48,2,65536,1,TRUE,FALSE,0
-mistralai/Mistral-7B-Instruct-v0.2,,1,4352,128,2,14483464192,2,2,13.48877716,5.5,10,1,1,128,1,32,1,32,128,256,128,256,16,32,4096,8,32,2,32768,1,TRUE,FALSE,0
-meta-llama/Llama-3.1-405B-Instruct,,8,4352,1024,2,8.11707E+11,2,2,755.9608459,5.5,10,1,1,128,1,32,1,32,128,256,128,256,16,126,16384,8,128,2,131072,1,TRUE,FALSE,0
-Qwen/Qwen2.5-14B-Instruct,,1,4352,128,2,29540067328,2,2,27.51133156,5.5,10,0,3,128,1,32,1,32,128,256,128,256,16,48,5120,8,40,2,32768,1,TRUE,FALSE,0
-deepseek-ai/DeepSeek-R1-Distill-Llama-70B,,4,4352,512,2,1.41107E+11,2,2,131.4165192,5.5,10,1,1,128,1,32,1,32,128,256,128,256,16,80,8192,8,64,2,131072,1,TRUE,FALSE,0
-Qwen/Qwen2.5-32B-Instruct,,1,4352,128,2,65527752704,2,2,61.02747536,5.5,10,1,1,128,1,32,1,32,128,256,128,256,16,64,5120,8,40,2,32768,1,TRUE,FALSE,0
-Qwen/Qwen2.5-72B-Instruct,,4,4352,512,2,1.45412E+11,2,2,135.4258575,5.5,10,0,3,128,1,32,1,32,128,256,128,256,16,80,8192,8,64,2,32768,1,TRUE,FALSE,0
-Qwen/Qwen2.5-7B-Instruct,,1,4352,128,2,15231233024,2,2,14.18519115,5.5,10,0,3,128,1,32,1,32,128,256,128,256,16,28,3584,4,28,2,32768,1,TRUE,FALSE,0
-Qwen/Qwen2.5-32B-Instruct,,1,4352,128,2,65527752704,2,2,61.02747536,5.5,10,0,3,128,1,32,1,32,128,256,128,256,16,64,5120,8,40,2,32768,1,TRUE,FALSE,0
+MODEL,TENSOR_PARALLEL_SIZE,MAX_MODEL_LEN,TOTAL_GPU_MEM,UNAVAILABLE_MEM_ABS,MODEL_MEM_FROM_CONFIG,MODEL_DTYPE,QUANT_DTYPE,MODEL_MEM,PROFILER_MEM_OVERHEAD,APPROX_MEM_PER_GRAPH_MB,fsdpa,GPU_FREE_MEM_TARGET,BLOCK_SIZE,VLLM_PROMPT_BS_BUCKET_MIN,VLLM_PROMPT_BS_BUCKET_STEP,VLLM_DECODE_BS_BUCKET_MIN,VLLM_DECODE_BS_BUCKET_STEP,VLLM_PROMPT_SEQ_BUCKET_MIN,VLLM_PROMPT_SEQ_BUCKET_STEP,VLLM_DECODE_BLOCK_BUCKET_MIN,VLLM_DECODE_BLOCK_BUCKET_STEP,MAX_NUM_PREFILL_SEQS,NUM_HIDDEN_LAYERS,HIDDEN_SIZE,NUM_KEY_VALUE_HEADS,NUM_ATTENTION_HEADS,CACHE_DTYPE_BYTES,LIMIT_MODEL_LEN,PT_HPU_LAZY_MODE,VLLM_DELAYED_SAMPLING,VLLM_SKIP_WARMUP,EXPERIMENTAL_WEIGHT_SHARING
+meta-llama/Llama-3.1-8B-Instruct,1,4352,128,2,16060522496,2,2,14.95752716,5.5,10,1,1,128,1,32,1,32,128,256,128,256,16,32,4096,8,32,2,131072,1,TRUE,FALSE,0
+meta-llama/Llama-3.1-70B-Instruct,4,4352,512,2,1.41107E+11,2,2,131.4165192,5.5,10,1,1,128,1,32,1,32,128,256,128,256,16,80,8192,8,64,2,131072,1,TRUE,FALSE,0
+meta-llama/Llama-3.3-70B-Instruct,4,4352,512,2,1.41107E+11,2,2,131.4165192,5.5,10,1,1,128,1,32,1,32,128,256,128,256,16,80,8192,8,64,2,131072,1,TRUE,FALSE,0
+meta-llama/Llama-3.2-1B-Instruct,1,4352,128,2,2471645608,2,2,2.301899351,5.5,10,1,1,128,1,32,1,32,128,256,128,256,16,16,2048,8,32,2,131072,1,TRUE,FALSE,0
+meta-llama/Llama-3.2-3B-Instruct,1,4352,128,2,6425499648,2,2,5.984212875,5.5,10,1,1,128,1,32,1,32,128,256,128,256,16,28,3072,8,24,2,131072,1,TRUE,FALSE,0
+mistralai/Mixtral-8x7B-Instruct-v0.1,2,4352,256,2,93405585408,2,2,86.99073029,5.5,10,1,1,128,1,32,1,32,128,256,128,256,16,32,4096,8,32,2,32768,1,TRUE,FALSE,0
+mistralai/Mixtral-8x22B-Instruct-v0.1,4,4352,512,2,2.8126E+11,2,2,261.9439201,5.5,10,1,1,128,1,32,1,32,128,256,128,256,16,56,6144,8,48,2,65536,1,TRUE,FALSE,0
+mistralai/Mistral-7B-Instruct-v0.2,1,4352,128,2,14483464192,2,2,13.48877716,5.5,10,1,1,128,1,32,1,32,128,256,128,256,16,32,4096,8,32,2,32768,1,TRUE,FALSE,0
+meta-llama/Llama-3.1-405B-Instruct,8,4352,1024,2,8.11707E+11,2,2,755.9608459,5.5,10,1,1,128,1,32,1,32,128,256,128,256,16,126,16384,8,128,2,131072,1,TRUE,FALSE,0
+Qwen/Qwen2.5-14B-Instruct,1,4352,128,2,29540067328,2,2,27.51133156,5.5,10,0,3,128,1,32,1,32,128,256,128,256,16,48,5120,8,40,2,32768,1,TRUE,FALSE,0
+deepseek-ai/DeepSeek-R1-Distill-Llama-70B,4,4352,512,2,1.41107E+11,2,2,131.4165192,5.5,10,1,1,128,1,32,1,32,128,256,128,256,16,80,8192,8,64,2,131072,1,TRUE,FALSE,0
+Qwen/Qwen2.5-32B-Instruct,1,4352,128,2,65527752704,2,2,61.02747536,5.5,10,1,1,128,1,32,1,32,128,256,128,256,16,64,5120,8,40,2,32768,1,TRUE,FALSE,0
+Qwen/Qwen2.5-72B-Instruct,4,4352,512,2,1.45412E+11,2,2,135.4258575,5.5,10,0,3,128,1,32,1,32,128,256,128,256,16,80,8192,8,64,2,32768,1,TRUE,FALSE,0
+Qwen/Qwen2.5-7B-Instruct,1,4352,128,2,15231233024,2,2,14.18519115,5.5,10,0,3,128,1,32,1,32,128,256,128,256,16,28,3584,4,28,2,32768,1,TRUE,FALSE,0
+Qwen/Qwen2.5-32B-Instruct,1,4352,128,2,65527752704,2,2,61.02747536,5.5,10,0,3,128,1,32,1,32,128,256,128,256,16,64,5120,8,40,2,32768,1,TRUE,FALSE,0
+meta-llama/Llama-3.2-11B-Vision-Instruct,1,8448,128,2,21340441670,2,2,19.87483507,5.5,10,0,3,128,1,32,1,32,128,256,128,256,1,40,4096,8,32,2,131072,1,TRUE,FALSE,0
+meta-llama/Llama-3.2-90B-Vision-Instruct,4,8448,512,2,177186710646,2,2,165.0179835,5.5,10,0,3,128,1,32,1,32,128,256,128,256,1,100,8192,8,64,2,131072,1,TRUE,FALSE,0
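This CSV is the per-model parameter table consumed by the auto-calculator; the new vision rows use an 8448-token MAX_MODEL_LEN and a MAX_NUM_PREFILL_SEQS of 1. A minimal sketch of looking up one model's row, assuming a local copy of the file (the filename below is a placeholder, not the repository's actual name):

```python
import csv

def load_model_settings(path: str, model: str) -> dict:
    """Return the settings row for `model` from the per-model CSV table."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if row["MODEL"] == model:
                # Coerce a couple of commonly used numeric fields.
                row["TENSOR_PARALLEL_SIZE"] = int(row["TENSOR_PARALLEL_SIZE"])
                row["MAX_MODEL_LEN"] = int(row["MAX_MODEL_LEN"])
                return row
    raise KeyError(f"{model} not found in {path}")

# Example: the 11B vision row added by this commit (path is a placeholder).
settings = load_model_settings("model_settings.csv", "meta-llama/Llama-3.2-11B-Vision-Instruct")
print(settings["MAX_MODEL_LEN"])  # 8448 per the table above
```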

PyTorch/vLLM_Tutorials/Deploying_vLLM/vllm_autocalc.py

Lines changed: 9 additions & 2 deletions
@@ -20,7 +20,7 @@ def vllm_auto_calc(fd):
         print(f"Clamping TENSOR_PARALLEL_SIZE to {tensor_parallel_size_new}")
         fd['TENSOR_PARALLEL_SIZE'] = tensor_parallel_size_new
 
-    fd['MAX_MODEL_LEN'] = max(1, fd['MAX_MODEL_LEN'])
+    fd['MAX_MODEL_LEN'] = max(1, fd['MAX_MODEL_LEN'])
 
     if fd['TENSOR_PARALLEL_SIZE'] > 1:
         fd['PT_HPU_ENABLE_LAZY_COLLECTIVES'] = True
@@ -134,10 +134,11 @@ def vllm_auto_calc(fd):
                               0.5)
     fd['KV_CACHE_MEM'] = (fd['USABLE_MEM'] * fd['GPU_MEM_UTILIZATION'] *
                           (1 - fd['VLLM_GRAPH_RESERVED_MEM']))
-
+
     if fd.get('MAX_NUM_SEQS') is None:
         fd['MAX_NUM_SEQS'] = (fd['TENSOR_PARALLEL_SIZE'] * fd['KV_CACHE_MEM'] /
                               fd['KV_CACHE_PER_SEQ'])
+        print("max num seq", fd['MAX_NUM_SEQS'])
         if DTYPE == 'fp8':
             fd['MAX_NUM_SEQS'] = (max(
                 1,
@@ -153,9 +154,15 @@ def vllm_auto_calc(fd):
             raise ValueError(
                 "Not enough memory for kv cache increase TENSOR_PARALLEL_SIZE "
                 "or reduce MAX_MODEL_LEN or increase bucket step")
+
+        if fd['MODEL'] in ['meta-llama/Llama-3.2-11B-Vision-Instruct', 'meta-llama/Llama-3.2-90B-Vision-Instruct']:
+            if fd['MAX_NUM_SEQS'] > 128:
+                fd['MAX_NUM_SEQS'] = 128
+                print(f"{fd['MODEL']} currently does not support max-num-seqs > 128, hence limiting the max-num-seqs to 128")
     else:
         fd['MAX_NUM_SEQS'] = max(1, fd['MAX_NUM_SEQS'])
 
+
     fd['VLLM_DECODE_BLOCK_BUCKET_MAX'] = max(
         128, math.ceil((fd['MAX_NUM_SEQS'] * fd['MAX_MODEL_LEN']) / 128))
     fd['VLLM_PROMPT_SEQ_BUCKET_MAX'] = fd['MAX_MODEL_LEN']
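The functional change in this hunk is a cap on the auto-calculated MAX_NUM_SEQS for the two Llama 3.2 Vision models. A simplified, standalone restatement of that cap (not the repository's actual function, which applies it inside vllm_auto_calc() alongside the KV-cache sizing):

```python
# Simplified restatement of the new cap; the real code operates on the fd settings dict.
VISION_MODELS = (
    'meta-llama/Llama-3.2-11B-Vision-Instruct',
    'meta-llama/Llama-3.2-90B-Vision-Instruct',
)

def cap_max_num_seqs(model: str, max_num_seqs: float) -> int:
    """Cap max-num-seqs at 128 for the Llama 3.2 Vision models; floor at 1 otherwise."""
    if model in VISION_MODELS and max_num_seqs > 128:
        print(f"{model} currently does not support max-num-seqs > 128, limiting to 128")
        return 128
    return max(1, int(max_num_seqs))
```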
