
Commit 4e5d2ac

Merge pull request #391 from puneetmatharu/remove-tcmalloc-references
Tidy up `README` and remove `tcmalloc` references.
2 parents 108ea47 + fbb068f commit 4e5d2ac

4 files changed: 107 additions & 83 deletions

ML-Frameworks/pytorch-aarch64/CHANGELOG.md

Lines changed: 1 addition & 0 deletions

@@ -13,6 +13,7 @@ where `YY` is the year, and `MM` the month of the increment.
 
 ### Removed
 - Delete unused submodules of PyTorch's third-party modules.
+- Removed all references to tcmalloc as the default build now uses mimalloc.
 
 ### Fixed

ML-Frameworks/pytorch-aarch64/Dockerfile

Lines changed: 1 addition & 6 deletions

@@ -37,12 +37,7 @@ RUN apt-get update && apt-get install -y \
     python-is-python3 \
     # To allow users to install new things if they want
     sudo \
-    # tcmalloc can speed up some models, see README.md for more details
-    # we use minimal package instead of gperftools to reduce dependencies
-    libtcmalloc-minimal4 \
-    && rm -rf /var/lib/apt/lists/* \
-    # Make libtcmalloc_minimal accessible from the usual libtcmalloc location
-    && sudo ln -s /usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4 /usr/lib/aarch64-linux-gnu/libtcmalloc.so.4
+    && rm -rf /var/lib/apt/lists/*
 
 # DOCKER_USER for the Docker user
 ENV DOCKER_USER=${USERNAME}
ML-Frameworks/pytorch-aarch64/examples/README.md

Lines changed: 105 additions & 73 deletions
@@ -1,22 +1,47 @@
 # Examples
 
+<!-- Generated with VS Code's 'Markdown All in One' extension. -->
+<!-- Regenerate with: 'Markdown All in One: Update Table of Contents'. -->
+
+- [Examples](#examples)
+  - [Description](#description)
+  - [Vision](#vision)
+    - [Image classification](#image-classification)
+    - [Object detection](#object-detection)
+  - [Natural Language Processing (NLP)](#natural-language-processing-nlp)
+    - [Question answering](#question-answering)
+    - [Dynamic quantization](#dynamic-quantization)
+  - [General optimization guidelines](#general-optimization-guidelines)
+    - [Weight prepacking](#weight-prepacking)
+    - [General flags](#general-flags)
+    - [Compiled mode flags](#compiled-mode-flags)
+    - [Eager mode flags](#eager-mode-flags)
+  - [Generative AI](#generative-ai)
+    - [4 bit Dynamic Quantization](#4-bit-dynamic-quantization)
+    - [Vision](#vision-1)
+      - [Command-Line Options](#command-line-options)
+    - [Text Generation](#text-generation)
+      - [Command-Line Options](#command-line-options-1)
+
+## Description
+
 This folder contains a number of scripts that demonstrate how to run inference with various machine learning models.
 
 ## Vision
 
 ### Image classification
 
-The script [classify_image.py](classify_image.py) demonstrates how to run inference using the ResNet-50 model trained on the ImageNet data set.
+The script [`classify_image.py`](classify_image.py) demonstrates how to run inference using the ResNet-50 model trained on the ImageNet data set.
 
 To run inference on an image, call:
 
-```
+```bash
 python classify_image.py -m ./resnet_v1-50.yml -i https://upload.wikimedia.org/wikipedia/commons/3/32/Weimaraner_wb.jpg
 ```
 
 Where the `-m` flag sets the configuration file (see below) that describes the model, and `-i` sets the URL, or filename, of the image to classify.
 
-The file [resnet_v1-50.yml](resnet_v1-50.yml) provides, in [YAML format](https://docs.ansible.com/ansible/latest/reference_appendices/YAMLSyntax.html), information about the model:
+The file [`resnet_v1-50.yml`](resnet_v1-50.yml) provides, in [YAML format](https://docs.ansible.com/ansible/latest/reference_appendices/YAMLSyntax.html), information about the model:
 
 - `name`: Name to use to save the model after downloading it
 - `class`: The name of the class that implements the model architecture in `torchvision.models`
@@ -25,25 +50,31 @@ The file [resnet_v1-50.yml](resnet_v1-50.yml) provides, in [YAML format](https:/
 
 ### Object detection
 
-The script [detect_objects.py](detect_object.py) demonstrates how to run object detection using SSD-ResNet-34.
+The script [`detect_objects.py`](detect_object.py) demonstrates how to run object detection using SSD-ResNet-34.
 
-The SSD-ResNet-34 model is trained from the Common Object in Context (COCO) image dataset. This is a multiscale SSD (Single Shot Detection) model based on the ResNet-34 backbone network that performs object detection.
+The SSD-ResNet-34 model is trained on the Common Objects in Context (COCO) image dataset.
+This is a multiscale SSD (Single Shot Detection) model based on the ResNet-34 backbone network that performs object detection.
 
 To run inference with SSD-ResNet-34 on an example image, call:
-```
+
+```bash
 python detect_objects.py -m ./ssd_resnet34.yml -i https://raw.githubusercontent.com/zhreshold/mxnet-ssd/master/data/demo/street.jpg
 ```
 
-Where `-m` sets the configuration file (see below) that describes the model, and `-i` sets the URL, or filename, of the image in which you want to detect objects. The output of the script will list what object the model detected and with what confidence. It will also draw bounding boxes around those objects in a new image.
+Where `-m` sets the configuration file (see below) that describes the model, and `-i` sets the URL, or filename, of the image in which you want to detect objects.
+The output of the script will list what objects the model detected and with what confidence. It will also draw bounding boxes around those objects in a new image.
+
+[`ssd_resnet34.yml`](ssd_resnet34.yml) provides, in [YAML format](https://docs.ansible.com/ansible/latest/reference_appendices/YAMLSyntax.html), information about the model:
 
-[ssd_resnet34.yml](ssd_resnet34.yml) provides, in [YAML format](https://docs.ansible.com/ansible/latest/reference_appendices/YAMLSyntax.html) information about the model:
 - `name`: Name of the model used for inference
 - `script`: Script to download the Python model class and put it in the `PYTHONPATH`
 - `class`: Name of the Python model class to import
 - `source`: URL from where to download the model
 - `labels`: URL from where to download labels for the model
 - `threshold`: If confidence is below this threshold, then the object will not be reported as detected
+
 There is also additional information in `image_preprocess` that is used to preprocess the image before doing inference:
+
 - `input_shape`: Input shape used for inference in format NCHW
 - `mean`: Mean values per channel for normalizing image
 - `std`: Standard deviation values per channel for normalizing image
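For illustration, the snippet below shows one way a configuration like this could be consumed from Python. It assumes PyYAML and the field names described above; it is not the example scripts' actual loading code.

```python
# Hypothetical sketch of reading a model config such as ssd_resnet34.yml;
# field names follow the README description above, not the scripts' code.
import yaml  # PyYAML

with open("ssd_resnet34.yml") as f:
    cfg = yaml.safe_load(f)

print(cfg["name"])       # model name used for inference
print(cfg["threshold"])  # confidence threshold for reporting detections

pre = cfg["image_preprocess"]
# NCHW input shape plus per-channel normalization statistics
print(pre["input_shape"], pre["mean"], pre["std"])
```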
@@ -54,34 +85,37 @@ _Note: in PyTorch, in order to load the model from saved checkpoint, it is also
 
 ### Question answering
 
-The script `answer_questions.py` demonstrates how to build a simple question answering system using the pre-trained DistilBERT model (default) or BERT LARGE model (using the `--bert-large` flag). The script can answer questions from the [Stanford Question Answering Dataset (SQuAD)](https://rajpurkar.github.io/SQuAD-explorer/), or can be provided with a user defined context and question.
+The script `answer_questions.py` demonstrates how to build a simple question answering system using the pre-trained DistilBERT model (default) or BERT LARGE model (using the `--bert-large` flag).
+The script can answer questions from the [Stanford Question Answering Dataset (SQuAD)](https://rajpurkar.github.io/SQuAD-explorer/), or can be provided with a user-defined context and question.
 
 To run the script on a random entry from the SQuAD dev-v2.0 dataset, call:
 
-```
+```bash
 python answer_questions.py
 ```
 
 To pick a random question on a specific topic in the SQuAD dataset, use the `-s` flag:
 
-```
+```bash
 python answer_questions.py -s "Normans"
 ```
 
 The routine `print_squad_questions` can be used to give a list of available subjects, see below.
 
-
 To choose a specific entry from the SQuAD dataset, use the `-id` flag to supply the ID of the question, for example:
 
-```
+```bash
 python answer_questions.py -id 56de16ca4396321400ee25c7
 ```
 
-will attempt to answer "When was the battle of Hastings?" based on one of the entries on the [Normans from the SQuAD dataset](https://rajpurkar.github.io/SQuAD-explorer/explore/v2.0/dev/Normans.html) (originally derived from Wikipedia). The expected answer is "In 1066".
+will attempt to answer "When was the battle of Hastings?" based on one of the entries on the [Normans from the SQuAD dataset](https://rajpurkar.github.io/SQuAD-explorer/explore/v2.0/dev/Normans.html) (originally derived from Wikipedia).
+The expected answer is "In 1066".
 
-In the `utils` folder, [nlp.py](utils/nlp.py) provides a some simple tools for obtaining and browsing the dataset. This also displays the ID of each question which can be supplied to `answer_questions.py`. Calling `print_squad_questions` will display a list of the various subjects contained in the dataset and supplying a `subject` argument will print the details of all the questions on that subject, for example, from within your Python environment:
+In the `utils` folder, [`nlp.py`](utils/nlp.py) provides some simple tools for obtaining and browsing the dataset.
+This also displays the ID of each question, which can be supplied to `answer_questions.py`.
+Calling `print_squad_questions` will display a list of the various subjects contained in the dataset, and supplying a `subject` argument will print the details of all the questions on that subject, for example, from within your Python environment:
 
-```
+```python
 from utils import nlp
 nlp.print_squad_questions(subject="Normans")
 ```
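As an aside, the kind of extractive question answering that `answer_questions.py` performs can be sketched directly with the Hugging Face Transformers `pipeline` API. The checkpoint below is a standard SQuAD-fine-tuned DistilBERT model, not necessarily the one the script itself loads.

```python
# Minimal extractive QA sketch using Transformers; illustrative only,
# not the actual answer_questions.py implementation.
from transformers import pipeline

qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad")

result = qa(
    question="When was the battle of Hastings?",
    context="The Normans conquered England after winning "
            "the battle of Hastings in 1066.",
)
print(result["answer"], result["score"])  # answer span plus confidence
```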
@@ -90,16 +124,18 @@ will print all the SQuAD entries on Normans - the context, questions, reference
 
 It is also possible to supply a context file and question directly to `answer_questions.py` using the flags `-t` and `-q`, for example:
 
-```
+```bash
 python answer_questions.py -t README.md -q "What does this folder contain?"
 ```
 
 If no context file is provided, `answer_questions.py` will search through the SQuAD dataset for the question and, if the question can be located, use the context associated with it.
 
 We can also quantize some of the layers of the models using the `--quantize` flag, which can make the model run several times faster:
-```
+
+```bash
 python answer_questions.py --quantize
 ```
+
 See the section on [dynamic quantization](#dynamic-quantization) for more information.
 
 To remove the setup from the measured inference time, you can use the `--warmup` flag.
@@ -110,20 +146,25 @@ This will run the model twice, and report the time of the second run.
 Quantization reduces the precision of the inputs to your operators to speed up computation.
 Typically this takes us from the default float data type (32 bits) to an integer (8 bits).
 Currently dynamic quantization is supported (inputs are quantized at run time), and can be easily applied to a model using
+
 ```python
 model = torch.ao.quantization.quantize_dynamic(
     model,
     {torch.nn.Linear},
     dtype=torch.qint8)
 ```
-Note that currently only the linear layer can be quantized, although for many models this layer contributes the largest runtime, so the overall speedup can still be large.
 
-`quantized_linear.py ` is a very simple example of the dynamic quantization of a single linear layer.
-It takes the 3 dimensions of the linear layer, and returns the runtime of the unquantized and quantized models along with the ratio.
-```
+Note that currently only the `Linear` layer can be quantized, although for many models this layer contributes the largest runtime, so the overall speedup can still be large.
+
+`quantized_linear.py` is a very simple example of the dynamic quantization of a single `Linear` layer.
+It takes the 3 dimensions of the `Linear` layer, and returns the runtime of the unquantized and quantized models along with the ratio.
+
+```bash
 python quantized_linear.py 384 1024 768
 ```
+
 By running this you can see how the ratio of unquantized to quantized time varies with the number of threads and the size of the layer.
+
 | Threads\M=K=N | 256 | 512 | 1024 |
 |---------------|-----|-----|------|
 | 4             | 1.4 | 3.9 | 5.5  |
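For readers without the repository to hand, here is a rough sketch of the measurement `quantized_linear.py` performs: dynamically quantize a single `Linear` layer and time it against the fp32 original. The sizes and timing loop are illustrative, not the script's actual code.

```python
# Illustrative benchmark of dynamic quantization on one Linear layer.
import time
import torch

M, K, N = 384, 1024, 768  # batch size and layer dimensions, as above

model = torch.nn.Sequential(torch.nn.Linear(K, N)).eval()
# Swap the Linear layer for a dynamically quantized int8 version.
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8)

x = torch.randn(M, K)

def bench(m, iters=100):
    m(x)  # warm up once so setup cost is not measured
    t0 = time.perf_counter()
    for _ in range(iters):
        m(x)
    return (time.perf_counter() - t0) / iters

t_fp32, t_int8 = bench(model), bench(qmodel)
print(f"fp32 {t_fp32:.6f}s, int8 {t_int8:.6f}s, ratio {t_fp32 / t_int8:.2f}")
```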
@@ -133,51 +174,55 @@ By running this you can see how the ratio unquantized/quantized time varies with
 The speedup is more pronounced for large linear layers and lower numbers of threads.
 
 To see the effect on a full model, you can run the `answer_questions.py` script from the earlier NLP example with the `--quantize` flag.
-```
+
+```bash
 python answer_questions.py -id 56de16ca4396321400ee25c7 --quantize --bert-large
 ```
+
 Again, the effect is most pronounced for fewer threads and larger layers/models (hence `--bert-large`); in such cases you can see up to a ~3x speedup.
 
 | Threads | BERT Large speedup |
 |---------|--------------------|
 | 4       | 2.9                |
 | 8       | 2.4                |
 | 16      | 1.7                |
+
 Note that in the above data we used the `--warmup` flag to run the model once before timing.
 
 ## General optimization guidelines
 
 ### Weight prepacking
-Linear layers calling ACL matmuls reorder weights during runtime by default. These reorders can be eliminated by calling `pack_linear_weights` as shown in `pack_linear_weights.py`. This improves the performance of any models calling a linear layer multiple times.
+
+`Linear` layers calling [Arm ComputeLibrary](https://github.com/ARM-software/ComputeLibrary) (ACL) matmuls reorder weights at runtime by default. These reorders can be eliminated by calling `pack_linear_weights` as shown in `pack_linear_weights.py`. This improves the performance of any model calling a `Linear` layer multiple times.
 
 ### General flags
+
 There are several flags which typically improve the performance of PyTorch.
 
 `DNNL_DEFAULT_FPMATH_MODE`: setting the environment variable `DNNL_DEFAULT_FPMATH_MODE` to `BF16` or `ANY` will instruct ACL to dispatch fp32 workloads to bfloat16 kernels where hardware support permits. _Note: this may introduce a drop in accuracy._
 
-You can use `tcmalloc` to handle memory allocation in PyTorch, which often leads to better performance
-`LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc.so.4`
-
 You can control the number of threads with `OMP_NUM_THREADS`; smaller models may perform better with fewer threads.
 
 ### Compiled mode flags
 
-* `TORCHINDUCTOR_CPP_WRAPPER=1` - Reduces Python overhead within the graph for torch.compile
-* `TORCHINDUCTOR_FREEZING=1` - Freezing will attempt to inline weights as constants in optimization
+- `TORCHINDUCTOR_CPP_WRAPPER=1`: Reduces Python overhead within the graph for `torch.compile`
+- `TORCHINDUCTOR_FREEZING=1`: Freezing will attempt to inline weights as constants during optimization
 
 e.g.
-```
-LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc.so.4 TORCHINDUCTOR_CPP_WRAPPER=1 TORCHINDUCTOR_FREEZING=1 OMP_NUM_THREADS=16 python <your_model_script>.py
+
+```bash
+TORCHINDUCTOR_CPP_WRAPPER=1 TORCHINDUCTOR_FREEZING=1 OMP_NUM_THREADS=16 python <your_model_script>.py
 ```
 
 ### Eager mode flags
 
-* `IDEEP_CACHE_MATMUL_REORDERS=1` - Caches reordered weight tensors. This increases performance but also increases memory usage. `LRU_CACHE_CAPACITY` should be set to a meaningful amount for this cache to be effective.
-* `LRU_CACHE_CAPACITY=<cache size>` - Number of objects to cache in the LRU cache
+- `IDEEP_CACHE_MATMUL_REORDERS=1`: Caches reordered weight tensors. This increases performance but also increases memory usage. `LRU_CACHE_CAPACITY` should be set to a meaningful amount for this cache to be effective.
+- `LRU_CACHE_CAPACITY=<cache size>`: Number of objects to cache in the LRU cache
 
 e.g.
-```
-LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc.so.4 IDEEP_CACHE_MATMUL_REORDERS=1 LRU_CACHE_CAPACITY=256 DNNL_DEFAULT_FPMATH_MODE=BF16 python <your_model_script>.py
+
+```bash
+IDEEP_CACHE_MATMUL_REORDERS=1 LRU_CACHE_CAPACITY=256 DNNL_DEFAULT_FPMATH_MODE=BF16 python <your_model_script>.py
 ```
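To make the compiled-mode flags above concrete, here is a minimal sketch of how they might be combined with `torch.compile` from within Python rather than on the command line. The environment variables are exported before `torch` is imported so they take effect, and the model is a stand-in.

```python
import os

# Environment flags must be set before PyTorch is initialized.
os.environ["TORCHINDUCTOR_CPP_WRAPPER"] = "1"
os.environ["TORCHINDUCTOR_FREEZING"] = "1"
os.environ["OMP_NUM_THREADS"] = "16"

import torch

# Stand-in model; replace with your own.
model = torch.nn.Sequential(torch.nn.Linear(1024, 1024),
                            torch.nn.ReLU()).eval()
compiled = torch.compile(model)  # TorchInductor is the default backend

x = torch.randn(8, 1024)
with torch.no_grad():
    print(compiled(x).shape)  # the first call triggers compilation
```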
 
 ## Generative AI
@@ -189,78 +234,65 @@ Tool-Solutions leverage 4-bit dynamic weight quantization to accelerate GenAI wo
 _Note: Model repo access might be required to run certain models correctly._
 
 To access the protected models, run:
-```
+
+```bash
 huggingface-cli login --token @hf_token
 ```
 
 ### Vision
 
-The script [llama_vision_instruct.py](llama_vision_instruct.py) runs and benchmarks Llama-3.2-11B-Vision-Instruct using text + image input and text output.
+The script [`llama_vision_instruct.py`](llama_vision_instruct.py) runs and benchmarks Llama-3.2-11B-Vision-Instruct using text + image input and text output.
 
+```bash
+OMP_NUM_THREADS=16 python llama_vision_instruct.py --benchmark --dtype bfloat16 --quantize
 ```
-LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc.so.4 OMP_NUM_THREADS=16 python llama_vision_instruct.py --benchmark --dtype bfloat16 --quantize
-```
-
-#### Command line options
+
+#### Command-Line Options
 
-`--num-new-tokens`
-The model will always generate this number of new tokens.
+`--num-new-tokens`: The model will always generate this number of new tokens.
 
-`--prompt`
-Input prompt.
+`--prompt`: Input prompt.
 
-`--image-url`
-URL to image.
+`--image-url`: URL to the image.
 
-`--benchmark`
-Run a benchmark, with warmup and multiple iterations.
+`--benchmark`: Run a benchmark, with warmup and multiple iterations.
 
-`--dtype {bfloat16,float32}`
-Precision to run the model in (or the non-linear layers for quantized model).
+`--dtype {bfloat16,float32}`: Precision to run the model in (or the non-`Linear` layers for a quantized model).
 
-`--quantize`
-Quantize weights to int4 symmetric channelwise.
+`--quantize`: Quantize weights to `torch.int4` symmetric channelwise.
 
 ### Text Generation
 
-The script [transformers_llm_text_gen.py](transformers_llm_text_gen.py) demonstrates how to generate text using TinyLlama-1.1B-Chat-v1.0 model via Transformers. It leverages the 4 bit dynamic quantization and can support a wide range of text models.
+The script [`transformers_llm_text_gen.py`](transformers_llm_text_gen.py) demonstrates how to generate text using the TinyLlama-1.1B-Chat-v1.0 model via Transformers. It leverages 4 bit dynamic quantization and can support a wide range of text models.
 
 Run inference using the default (groupwise, layout-aware INT4) Transformers call:
 
-```
-LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc.so.4 TORCHINDUCTOR_CPP_WRAPPER=1 TORCHINDUCTOR_FREEZING=1 OMP_NUM_THREADS=16 python transformers_llm_text_gen.py
+```bash
+TORCHINDUCTOR_CPP_WRAPPER=1 TORCHINDUCTOR_FREEZING=1 OMP_NUM_THREADS=16 python transformers_llm_text_gen.py
 ```
 
 Run with symmetric_channelwise quantization:
 
-```
-LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc.so.4 TORCHINDUCTOR_CPP_WRAPPER=1 TORCHINDUCTOR_FREEZING=1 OMP_NUM_THREADS=16 python transformers_llm_text_gen.py --quant-scheme symmetric_channelwise
+```bash
+TORCHINDUCTOR_CPP_WRAPPER=1 TORCHINDUCTOR_FREEZING=1 OMP_NUM_THREADS=16 python transformers_llm_text_gen.py --quant-scheme symmetric_channelwise
 ```
 
 Run with custom group size (e.g. 64):
 
+```bash
+TORCHINDUCTOR_CPP_WRAPPER=1 TORCHINDUCTOR_FREEZING=1 OMP_NUM_THREADS=16 python transformers_llm_text_gen.py --quant-scheme symmetric_groupwise --groupsize 64
 ```
-LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc.so.4 TORCHINDUCTOR_CPP_WRAPPER=1 TORCHINDUCTOR_FREEZING=1 OMP_NUM_THREADS=16 python transformers_llm_text_gen.py --quant-scheme symmetric_groupwise --groupsize 64
-```
-
 
 #### Command-Line Options
 
-`--quant-scheme`
-Description: Quantization scheme to apply: symmetric_channelwise or symmetric_groupwise.
+`--quant-scheme`: Quantization scheme to apply (`symmetric_channelwise` or `symmetric_groupwise`).
 
-`--groupsize`
-Description: groupsize (used only with symmetric_groupwise).
+`--groupsize`: Group size (used only with `symmetric_groupwise`).
 
-`--max-new-tokens`
-Description: Max new tokens to generate.
+`--max-new-tokens`: Max new tokens to generate.
 
-`--compile`
-Description: Whether to compile the model (default: `False`).
+`--compile`: Whether to compile the model (default: `False`).
 
-`--model`
-Description: Local Path to model repo or huggingface model id. (Default: `"meta-llama/Llama-2-7b-hf"` )
+`--model`: Local path to model repo or huggingface model id. (Default: `"meta-llama/Llama-2-7b-hf"`)
 
-`--prompt`
-Description: Input prompt for model generation.
+`--prompt`: Input prompt for model generation.
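For orientation, text generation of the sort `transformers_llm_text_gen.py` performs looks roughly like the plain-Transformers sketch below, without the 4 bit quantization machinery. The model id matches the script's documented default; any causal LM repo id should work.

```python
# Illustrative plain-Transformers text generation; not the script's code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # default from the options above

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             torch_dtype=torch.bfloat16)

inputs = tokenizer("The key advantage of Arm servers is",
                   return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```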

ML-Frameworks/pytorch-aarch64/examples/test-examples.sh

Lines changed: 0 additions & 4 deletions

@@ -28,7 +28,3 @@ for example in "${examples[@]}"; do
     bash -c "$example"
     echo ""
 done
-
-# Check an example with some of the flags from REAMDE.md > "General optimization
-# guidelines" There is no verbatim example for this
-LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc.so.4 DNNL_DEFAULT_FPMATH_MODE=BF16 ${examples[0]}
