The following guide shows the reader how to use Triton Perf Analyzer to measure and characterize the performance behaviors of Large Language Models (LLMs) using Triton with TensorRT-LLM and vLLM.
- Follow step 1 of the Installation section. It includes instructions for cloning Llama if you do not already have it downloaded.
```bash
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git --branch release/0.5.0
# Update the submodules
cd tensorrtllm_backend
# Install git-lfs if needed
sudo apt-get update && sudo apt-get install git-lfs -y --no-install-recommends
git lfs install
git submodule update --init --recursive
```
- Launch the Triton docker container with the TensorRT-LLM backend. This requires mounting the repo from step 1 into the docker container, along with any models you plan to serve.
For the tensorrtllm_backend repository, you need the following directories mounted:
- backend: .../tensorrtllm_backend/:/tensorrtllm_backend
- llama repo: .../llama/repo:/Llama-2-7b-hf
- engines: .../tensorrtllm_backend/tensorrt_llm/examples/llama/engines:/engines
```bash
docker run --rm -it --net host --shm-size=2g \
--ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
-v $(pwd):/tensorrtllm_backend \
-v /path/to/llama/repo:/Llama-2-7b-hf \
-v $(pwd)/tensorrt_llm/examples/llama/engines:/engines \
nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3 \
bash
```
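Once inside the container, a quick sanity check that all three mounts are visible (the paths are the mount targets from the command above):

```bash
# Each of these should list contents rather than fail
ls /tensorrtllm_backend
ls /Llama-2-7b-hf
ls /engines
```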
- Follow the engine-building steps in the tensorrtllm_backend repository to create the engine. Building the engine in the container with the `--output_dir /engines` flag will place the compiled `.engine` file under the expected directory mounted in the previous step.

Note:
- Compiling the wheel and engine can take more than 1 hour.
- If you get an error compiling with bfloat16, you can remove that option to fall back to the default data type.
Once the engine is created, copy the directory containing the engine file and config.json over to the following directory: /tensorrt_llm/1
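For reference, here is a hedged sketch of what the engine build might look like; the flag names follow the TensorRT-LLM 0.5.0 llama example and are assumptions to verify against the build.py in your checkout:

```bash
# Illustrative only: confirm flags against tensorrt_llm/examples/llama/build.py
cd /tensorrtllm_backend/tensorrt_llm/examples/llama
python build.py --model_dir /Llama-2-7b-hf \
                --dtype float16 \
                --use_gpt_attention_plugin float16 \
                --use_gemm_plugin float16 \
                --output_dir /engines/1-gpu
```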
- Serve the model with Triton.
```bash
cp -R /tensorrtllm_backend/all_models/inflight_batcher_llm /opt/tritonserver/.
```
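The copied repository should contain the four models used in the steps that follow; a quick check (expected layout based on the model names referenced in this guide):

```bash
ls /opt/tritonserver/inflight_batcher_llm
# ensemble  postprocessing  preprocessing  tensorrt_llm
```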
After copying the model repository, use the following sed commands to set some required values in the config.pbtxt files.
```bash
sed -i 's#${tokenizer_dir}#/Llama-2-7b-hf/#' /opt/tritonserver/inflight_batcher_llm/preprocessing/config.pbtxt
sed -i 's#${tokenizer_type}#auto#' /opt/tritonserver/inflight_batcher_llm/preprocessing/config.pbtxt
sed -i 's#${tokenizer_dir}#/Llama-2-7b-hf/#' /opt/tritonserver/inflight_batcher_llm/postprocessing/config.pbtxt
sed -i 's#${tokenizer_type}#auto#' /opt/tritonserver/inflight_batcher_llm/postprocessing/config.pbtxt
sed -i 's#${decoupled_mode}#true#' /opt/tritonserver/inflight_batcher_llm/tensorrt_llm/config.pbtxt
sed -i 's#${engine_dir}#/engines/1-gpu/#' /opt/tritonserver/inflight_batcher_llm/tensorrt_llm/config.pbtxt
```
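Optionally, verify that no required substitution placeholders were missed; depending on the release, the configs may contain additional `${...}` parameters beyond the ones set above, so treat any hits as prompts to inspect rather than errors:

```bash
# List any remaining ${...} placeholders in the copied configs
grep -n '\${' /opt/tritonserver/inflight_batcher_llm/*/config.pbtxt
```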
Note: Due to a known bug, all `model_version` values in `/opt/tritonserver/inflight_batcher_llm/ensemble/config.pbtxt` must be manually set to 1.
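A hedged one-liner to apply the fix, assuming the placeholder values in that file read model_version: -1 (check the file first; the -1 default is an assumption, not from the guide):

```bash
# Set every model_version entry in the ensemble config to 1
sed -i 's#model_version: -1#model_version: 1#g' /opt/tritonserver/inflight_batcher_llm/ensemble/config.pbtxt
```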
```bash
python3 /tensorrtllm_backend/scripts/launch_triton_server.py --world_size=<world size of the engine> --model_repo=/opt/tritonserver/inflight_batcher_llm
```
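For example, the single-GPU engine placed under /engines/1-gpu above has a world size of 1, so the concrete invocation would be:

```bash
python3 /tensorrtllm_backend/scripts/launch_triton_server.py --world_size=1 --model_repo=/opt/tritonserver/inflight_batcher_llm
```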
Alternatively, you can serve the model with the vLLM backend instead of TensorRT-LLM:

- Download the pre-built Triton Server container with the vLLM backend from the NGC registry.

```bash
docker pull nvcr.io/nvidia/tritonserver:23.10-vllm-python-py3
```

- Run the Triton Server container with the vLLM backend and launch the server.
```bash
git clone -b r23.10 https://github.com/triton-inference-server/vllm_backend.git
cd vllm_backend
docker run --gpus all --rm -it --net host \
--shm-size=2G --ulimit memlock=-1 --ulimit stack=67108864 \
-v $(pwd)/samples/model_repository:/model_repository \
nvcr.io/nvidia/tritonserver:23.10-vllm-python-py3 \
tritonserver --model-repository /model_repository
```
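Because the container runs with --net host, you can confirm from another shell that the server is ready before benchmarking; Triton's HTTP health endpoint returns 200 once the models are loaded:

```bash
# Returns HTTP 200 when the server and models are ready
curl -v localhost:8000/v2/health/ready
```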
Next, run the following commands to start the Triton SDK container:

```bash
git clone https://github.com/triton-inference-server/client.git
cd client/src/c++/perf_analyzer/docs/examples
docker run --gpus all -it --rm --net host -v ${PWD}:/work -w /work nvcr.io/nvidia/tritonserver:23.10-py3-sdk
```

In this benchmarking scenario, we want to measure the effect of input prompt size on first-token latency. We issue single requests of fixed input sizes to the server and ask the model to generate at most one new token. This essentially means one pass through the model.
Inside the client container, run the following command to generate dummy prompts of sizes 100, 300, and 500 and receive a single token from the model for each prompt.
```bash
# trtllm: -m ensemble -b trtllm
# vllm: -m vllm_model -b vllm
python profile.py -m ensemble -b trtllm --prompt-size-range 100 500 200 --max-tokens 1
# [ BENCHMARK SUMMARY ]
# Prompt size: 100
# * Max first token latency: 35.2451 ms
# * Min first token latency: 11.0879 ms
# * Avg first token latency: 18.3775 ms
# ...
```

Note:
Users can also run a custom prompt by providing an input data JSON file using the `--input-data` option. They can also specify input tensors or parameters to the model. However, when a parameter is defined both in the input data JSON file and through a command line option (e.g. `max_tokens`), the command line value will overwrite the one in the input data JSON file.

```bash
$ echo '
{
  "data": [
    {
      "text_input": [
        "Hello, my name is" // user-provided prompt
      ],
      "stream": [
        true
      ],
      "sampling_parameters": [
        "{ \"max_tokens\": 1 }"
      ]
    }
  ]
}
' > input_data.json

$ python profile.py -m ensemble -b trtllm --input-data input_data.json
```
In this benchmarking scenario, we want to measure the effect of input prompt size on token-to-token latency. We issue single requests of fixed input sizes to the server and ask the model to generate a fixed number of tokens.
Inside the client container, run the following command to generate dummy prompts of sizes 100, 300, and 500 and receive a total of 256 tokens from the model for each prompt.
```bash
# trtllm: -m ensemble -b trtllm
# vllm: -m vllm_model -b vllm
python profile.py -m ensemble -b trtllm --prompt-size-range 100 500 200 --max-tokens 256 --ignore-eos
# [ BENCHMARK SUMMARY ]
# Prompt size: 100
# * Max first token latency: 23.2899 ms
# * Min first token latency: 11.0127 ms
# * Avg first token latency: 16.0468 ms
# ...
```

In this benchmarking scenario, we want to measure the effect of in-flight batch size on token-to-token (T2T) latency. We systematically issue requests of fixed input sizes to the server and ask the model to generate a fixed number of tokens, so that the in-flight batch size grows over time.
In this benchmark, we will run Perf Analyzer in periodic concurrency mode, which periodically launches new concurrent requests to the model using the `--periodic-concurrency-range START END STEP` option. In this example, Perf Analyzer starts with a single request and keeps launching new ones until the total number reaches 100. You can also control the timing of the new requests: setting `--request-period` to 32 (as shown below) makes Perf Analyzer wait for all outstanding requests to receive 32 responses before launching new ones.
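As a quicker sanity pass before committing to the full 100-request sweep below, you can run the same command at a smaller scale; the values here are illustrative and reuse only flags already shown in this guide:

```bash
# Smaller dry run: ramp from 1 to 4 concurrent requests instead of 1 to 100
python profile.py -m ensemble -b trtllm --prompt-size-range 10 10 1 --periodic-concurrency-range 1 4 1 --request-period 32 --max-tokens 256 --ignore-eos
```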
Run the following command inside the client container.
```bash
# Install matplotlib to generate the benchmark plot
pip install matplotlib
# Run Perf Analyzer
# trtllm: -m ensemble -b trtllm
# vllm: -m vllm_model -b vllm
python profile.py -m ensemble -b trtllm --prompt-size-range 10 10 1 --periodic-concurrency-range 1 100 1 --request-period 32 --max-tokens 1024 --ignore-eos
# [ BENCHMARK SUMMARY ]
# Prompt size: 10
# * Max first token latency: 125.7212 ms
# * Min first token latency: 18.4281 ms
# * Avg first token latency: 61.8372 ms
# ...
# Saved in-flight batching benchmark plots @ 'inflight_batching_benchmark-*.png'.
```

The script saves the resulting plot as `inflight_batching_benchmark-*.png`.
The plot demonstrates how the average T2T latency changes across the entire benchmark as we increase the number of requests. To observe the change, we first align the responses of every request and then split them into multiple segments of responses. For instance, assume we ran the following benchmark command:
```bash
# trtllm: -m ensemble -b trtllm
# vllm: -m vllm_model -b vllm
python profile.py -m ensemble -b trtllm --periodic-concurrency-range 1 4 1 --request-period 32 --max-tokens 1024 --ignore-eos
```

We start from a single request and increment up to 4 requests, one at a time, for
every 32 responses (defined by `--request-period`).
For each request, there are a total of 1024 generated responses (defined by `--max-tokens`).
We align these 1024 responses and split them by request period,
giving us 1024/32 = 32 segments per request, as shown below:
```
             32 responses (=request period)
          ┌────┐
request 1 ──────┊──────┊──────┊──────┊─ ··· ─┊──────┊
request 2       ┊──────┊──────┊──────┊─ ··· ─┊──────┊──────┊
request 3       ┊      ┊──────┊──────┊─ ··· ─┊──────┊──────┊──────┊
request 4       ┊      ┊      ┊──────┊─ ··· ─┊──────┊──────┊──────┊──────
segment #    1      2      3      4     ···     32     33     34     35
```
Then, for each segment, we compute the mean of the T2T latencies of its responses. This allows us to visualize how T2T latency changes as the number of requests increases, filling up the in-flight batch slots, and again as the requests terminate. See `profile.py` for more details.
