Where the `-m` flag sets the configuration file (see below) that describes the model, and `-i` sets the URL, or filename, of the image to classify.
The file [`resnet_v1-50.yml`](resnet_v1-50.yml) provides, in [YAML format](https://docs.ansible.com/ansible/latest/reference_appendices/YAMLSyntax.html), information about the model:

- `name`: Name to use to save the model after downloading it
- `class`: The name of the class that implements the model architecture in `torchvision.models`
### Object detection
The script [`detect_objects.py`](detect_object.py) demonstrates how to run object detection using SSD-ResNet-34.
The SSD-ResNet-34 model is trained on the Common Objects in Context (COCO) image dataset.
This is a multiscale SSD (Single Shot Detection) model based on the ResNet-34 backbone network that performs object detection.
To run inference with SSD-ResNet-34 on an example image, call:
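
A minimal sketch of the invocation (the image argument is a placeholder; the `-m` and `-i` flags are described below):

```bash
# sketch: run detection with the YAML config from this folder,
# replacing <image-url-or-file> with the image you want to analyse
python detect_objects.py -m ssd_resnet34.yml -i <image-url-or-file>
```
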
Where `-m` sets the configuration file (see below) that describes the model, and `-i` sets the URL, or filename, of the image in which you want to detect objects.
The output of the script will list which objects the model detected and with what confidence. It will also draw bounding boxes around those objects in a new image.
[`ssd_resnet34.yml`](ssd_resnet34.yml) provides, in [YAML format](https://docs.ansible.com/ansible/latest/reference_appendices/YAMLSyntax.html), information about the model:

- `name`: Name of the model used for inference
- `script`: Script to download the Python model class and put it in the `PYTHONPATH`
- `class`: Name of the Python model class to import
- `source`: URL from where to download the model
- `labels`: URL from where to download labels for the model
- `threshold`: If confidence is below this threshold, then the object will not be reported as detected

There is also additional information in `image_preprocess` that is used to preprocess the image before doing inference:

- `input_shape`: Input shape used for inference, in NCHW format
- `mean`: Mean values per channel for normalizing the image
- `std`: Standard deviation values per channel for normalizing the image
### Question answering
The script `answer_questions.py` demonstrates how to build a simple question answering system using the pre-trained DistilBERT model (default) or BERT LARGE model (using the `--bert-large` flag).
The script can answer questions from the [Stanford Question Answering Dataset (SQuAD)](https://rajpurkar.github.io/SQuAD-explorer/), or can be provided with a user defined context and question.
To run the script on a random entry from the SQuAD dev-v2.0 dataset, call:
```bash
python answer_questions.py
```
To pick a random question on a specific topic in the SQuAD dataset, use the `-s` flag:
```bash
python answer_questions.py -s "Normans"
```
The routine `print_squad_questions` can be used to give a list of available subjects; see below.
To choose a specific entry from the SQuAD dataset, use the `-id` flag to supply the ID of the question, for example:
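
A minimal sketch of the call, with a placeholder where a real question ID (for instance one of the Normans entries, listable via `print_squad_questions`) would go:

```bash
# sketch: substitute a real SQuAD question ID
python answer_questions.py -id <question-id>
```
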
With the ID of one of the Normans questions, this will attempt to answer "When was the battle of Hastings?" based on one of the entries on the [Normans from the SQuAD dataset](https://rajpurkar.github.io/SQuAD-explorer/explore/v2.0/dev/Normans.html) (originally derived from Wikipedia).
The expected answer is "In 1066".
In the `utils` folder, [`nlp.py`](utils/nlp.py) provides some simple tools for obtaining and browsing the dataset.
This also displays the ID of each question which can be supplied to `answer_questions.py`.
Calling `print_squad_questions` will display a list of the various subjects contained in the dataset, and supplying a `subject` argument will print the details of all the questions on that subject. For example, from within your Python environment:
```python
from utils import nlp
nlp.print_squad_questions(subject="Normans")
```
This will print all the SQuAD entries on Normans: the context, questions, and reference answers.

It is also possible to supply a context file and question directly to `answer_questions.py` using the flags `-t` and `-q`, for example:
```bash
python answer_questions.py -t README.md -q "What does this folder contain?"
```
If no context file is provided, `answer_questions.py` will search through the SQuAD dataset for the question and, if the question can be located, use the context associated with it.
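
For example, a sketch that relies on this dataset lookup (assuming the question appears in SQuAD):

```bash
# sketch: no -t context file is given, so the SQuAD dataset is searched for this question
python answer_questions.py -q "When was the battle of Hastings?"
```
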
We can also quantize some of the layers of the models using the `--quantize` flag, which can make the model run several times faster:
```bash
python answer_questions.py --quantize
```
See the section on [dynamic quantization](#dynamic-quantization) for more information.
To remove the setup from the measured inference time, you can use the `--warmup` flag.
This will run the model twice, and report the time of the second run.

## Dynamic quantization
Quantization reduces the precision of the inputs to your operators to speed up computation.
Typically this takes us from the default float data type (32 bits) to an integer (8 bits).
Currently dynamic quantization is supported (inputs are quantized at run time), and can be easily applied to a model using:
```python
model = torch.ao.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},
    dtype=torch.qint8)
```
Note that currently only the `Linear` layer can be quantized, although for many models this layer contributes the largest runtime, so the overall speedup can still be large.
`quantized_linear.py` is a very simple example of the dynamic quantization of a single `Linear` layer.
It takes the 3 dimensions of the `Linear` layer, and returns the runtime of the unquantized and quantized models along with the ratio.
```bash
python quantized_linear.py 384 1024 768
```
By running this you can see how the ratio of unquantized to quantized time varies with the number of threads and the size of the layer.

| Threads\M=K=N | 256 | 512 | 1024 |
|---------------|-----|-----|------|
| 4 | 1.4 | 3.9 | 5.5 |

The speedup is more pronounced for large linear layers and lower numbers of threads.
To see the effect on a full model, you can run the `answer_questions.py` script from the earlier NLP example with the `--quantize` flag.
Again, the effect is most pronounced for fewer threads and larger layers/models (hence `--bert-large`); in such cases you can see up to ~3x speedup.
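
For example, a sketch of such a run, combining the flags described above:

```bash
# sketch: quantized BERT Large, with a warmup pass so setup is excluded from the timing
python answer_questions.py --bert-large --quantize --warmup
```
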
| Threads | BERT Large speedup |
|---------|--------------------|
| 4 | 2.9 |
| 8 | 2.4 |
| 16 | 1.7 |

Note that in the above data we used the `--warmup` flag to run the model once before timing.
## General optimization guidelines
### Weight prepacking
`Linear` layers calling [Arm ComputeLibrary](https://github.com/ARM-software/ComputeLibrary) (ACL) matmuls reorder weights during runtime by default. These reorders can be eliminated by calling `pack_linear_weights` as shown in `pack_linear_weights.py`. This improves the performance of any models calling a `Linear` layer multiple times.
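
To see the effect, you can run the example script; a minimal sketch, assuming it needs no arguments:

```bash
python pack_linear_weights.py
```
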
### General flags
There are several flags which typically improve the performance of PyTorch.
`DNNL_DEFAULT_FPMATH_MODE`: setting the environment variable `DNNL_DEFAULT_FPMATH_MODE` to `BF16` or `ANY` will instruct ACL to dispatch fp32 workloads to bfloat16 kernels where hardware support permits. _Note: this may introduce a drop in accuracy._
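
For example, a sketch reusing the question-answering script from above (any of the example scripts can be run the same way):

```bash
# dispatch eligible fp32 operations to bfloat16 kernels where the hardware supports it
DNNL_DEFAULT_FPMATH_MODE=BF16 python answer_questions.py --warmup
```
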
You can use `tcmalloc` to handle memory allocation in PyTorch, which often leads to better performance.
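
One common way to do this is to preload the library; a sketch, where the library path is an assumption that varies by distribution:

```bash
# install tcmalloc (e.g. the libtcmalloc-minimal4 package) and point LD_PRELOAD at it
LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4 python answer_questions.py
```
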
- `IDEEP_CACHE_MATMUL_REORDERS=1`: Caches reordered weight tensors. This increases performance but also increases memory usage. `LRU_CACHE_CAPACITY` should be set to a meaningful amount for this cache to be effective.
- `LRU_CACHE_CAPACITY=<cache size>`: Number of objects to cache in the LRU cache
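
A sketch of setting both cache-related variables for a run (the capacity value is an arbitrary example):

```bash
IDEEP_CACHE_MATMUL_REORDERS=1 LRU_CACHE_CAPACITY=1024 python answer_questions.py --warmup
```
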
Tool-Solutions leverages 4-bit dynamic weight quantization to accelerate GenAI workloads.

_Note: Model repo access might be required to run certain models correctly._
To access the protected models, run:
```bash
huggingface-cli login --token @hf_token
```
### Vision
The script [`llama_vision_instruct.py`](llama_vision_instruct.py) runs and benchmarks Llama-3.2-11B-Vision-Instruct using text + image input and text output.
#### Command-Line Options

- `--num-new-tokens`: The model will always generate this number of new tokens.
- `--prompt`: Input prompt.
- `--image-url`: URL to image.
- `--benchmark`: Run a benchmark, with warmup and multiple iterations.
- `--dtype {bfloat16,float32}`: Precision to run the model in (or the non-`Linear` layers for a quantized model).
- `--quantize`: Quantize weights to `torch.int4` symmetric channelwise.
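
A sketch of a benchmarking run that combines these options; the prompt, image URL, and token count are placeholder values:

```bash
python llama_vision_instruct.py --benchmark --quantize --dtype bfloat16 \
    --num-new-tokens 64 --image-url <image-url> --prompt "Describe this image."
```
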
### Text Generation
The script [`transformers_llm_text_gen.py`](transformers_llm_text_gen.py) demonstrates how to generate text using the TinyLlama-1.1B-Chat-v1.0 model via Transformers. It leverages 4-bit dynamic quantization and can support a wide range of text models.
To run inference using the default (groupwise, layout-aware INT4) quantization, call:
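
A minimal sketch of the call, assuming the script's defaults:

```bash
python transformers_llm_text_gen.py
```
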