Where the `-m` flag sets the configuration file (see below) that describes the model, and `-i` sets the URL, or filename, of the image to classify.
The file [`resnet_v1-50.yml`](resnet_v1-50.yml) provides, in [YAML format](https://docs.ansible.com/ansible/latest/reference_appendices/YAMLSyntax.html), information about the model:

- `name`: Name to use to save the model after downloading it
- `class`: The name of the class that implements the model architecture in `torchvision.models`
### Object detection
The script [`detect_objects.py`](detect_object.py) demonstrates how to run object detection using SSD-ResNet-34.
The SSD-ResNet-34 model is trained on the Common Objects in Context (COCO) image dataset.
This is a multiscale SSD (Single Shot Detection) model based on the ResNet-34 backbone network that performs object detection.
To run inference with SSD-ResNet-34 on an example image, call:
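
A minimal sketch of the invocation (the image argument is a placeholder; the `-m` and `-i` flags are described below):

```bash
# sketch: run detection with the YAML config from this folder,
# replacing <image-url-or-file> with the image you want to analyse
python detect_objects.py -m ssd_resnet34.yml -i <image-url-or-file>
```
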
Where `-m` sets the configuration file (see below) that describes the model, and `-i` sets the URL, or filename, of the image in which you want to detect objects.
The output of the script will list which objects the model detected and with what confidence. It will also draw bounding boxes around those objects in a new image.
[`ssd_resnet34.yml`](ssd_resnet34.yml) provides, in [YAML format](https://docs.ansible.com/ansible/latest/reference_appendices/YAMLSyntax.html), information about the model:

- `name`: Name of the model used for inference
- `script`: Script to download the Python model class and put it in the `PYTHONPATH`
- `class`: Name of the Python model class to import
- `source`: URL from where to download the model
- `labels`: URL from where to download labels for the model
- `threshold`: If confidence is below this threshold, then the object will not be reported as detected

There is also additional information in `image_preprocess` that is used to preprocess the image before doing inference:

- `input_shape`: Input shape used for inference, in NCHW format
- `mean`: Mean values per channel for normalizing the image
- `std`: Standard deviation values per channel for normalizing the image
### Question answering
The script `answer_questions.py` demonstrates how to build a simple question answering system using the pre-trained DistilBERT model (default) or BERT LARGE model (using the `--bert-large` flag).
The script can answer questions from the [Stanford Question Answering Dataset (SQuAD)](https://rajpurkar.github.io/SQuAD-explorer/), or can be provided with a user defined context and question.
To run the script on a random entry from the SQuAD dev-v2.0 dataset, call:
```bash
python answer_questions.py
```
To pick a random question on a specific topic in the SQuAD dataset, use the `-s` flag:
```bash
python answer_questions.py -s "Normans"
```
The routine `print_squad_questions` can be used to give a list of available subjects; see below.
To choose a specific entry from the SQuAD dataset, use the `-id` flag to supply the ID of the question, for example:
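
A minimal sketch of the call, with a placeholder where a real question ID (for instance one of the Normans entries, listable via `print_squad_questions`) would go:

```bash
# sketch: substitute a real SQuAD question ID
python answer_questions.py -id <question-id>
```
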
With the ID of one of the Normans questions, this will attempt to answer "When was the battle of Hastings?" based on one of the entries on the [Normans from the SQuAD dataset](https://rajpurkar.github.io/SQuAD-explorer/explore/v2.0/dev/Normans.html) (originally derived from Wikipedia).
The expected answer is "In 1066".
In the `utils` folder, [`nlp.py`](utils/nlp.py) provides some simple tools for obtaining and browsing the dataset.
This also displays the ID of each question which can be supplied to `answer_questions.py`.
Calling `print_squad_questions` will display a list of the various subjects contained in the dataset, and supplying a `subject` argument will print the details of all the questions on that subject. For example, from within your Python environment:
```python
from utils import nlp
nlp.print_squad_questions(subject="Normans")
```
This will print all the SQuAD entries on Normans: the context, questions, and reference answers.

It is also possible to supply a context file and question directly to `answer_questions.py` using the flags `-t` and `-q`, for example:
```bash
python answer_questions.py -t README.md -q "What does this folder contain?"
```
If no context file is provided, `answer_questions.py` will search through the SQuAD dataset for the question and, if the question can be located, use the context associated with it.
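
For example, a sketch that relies on this dataset lookup (assuming the question appears in SQuAD):

```bash
# sketch: no -t context file is given, so the SQuAD dataset is searched for this question
python answer_questions.py -q "When was the battle of Hastings?"
```
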
We can also quantize some of the layers of the models using the `--quantize` flag, which can make the model run several times faster:
```bash
python answer_questions.py --quantize
```
See the section on [dynamic quantization](#dynamic-quantization) for more information.
To remove the setup from the measured inference time, you can use the `--warmup` flag.
This will run the model twice, and report the time of the second run.

## Dynamic quantization
Quantization reduces the precision of the inputs to your operators to speed up computation.
Typically this takes us from the default float data type (32 bits) to an integer (8 bits).
Currently dynamic quantization is supported (inputs are quantized at run time), and can be easily applied to a model using:
```python
model = torch.ao.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},
    dtype=torch.qint8)
```
Note that currently only the `Linear` layer can be quantized, although for many models this layer contributes the largest runtime, so the overall speedup can still be large.
`quantized_linear.py` is a very simple example of the dynamic quantization of a single `Linear` layer.
It takes the 3 dimensions of the `Linear` layer, and returns the runtime of the unquantized and quantized models along with the ratio.
```bash
python quantized_linear.py 384 1024 768
```
By running this you can see how the ratio of unquantized to quantized time varies with the number of threads and the size of the layer.

| Threads\M=K=N | 256 | 512 | 1024 |
|---------------|-----|-----|------|
| 4 | 1.4 | 3.9 | 5.5 |

The speedup is more pronounced for large linear layers and lower numbers of threads.
To see the effect on a full model, you can run the `answer_questions.py` script from the earlier NLP example with the `--quantize` flag.
Again, the effect is most pronounced for fewer threads and larger layers/models (hence `--bert-large`); in such cases you can see up to ~3x speedup.
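
For example, a sketch of such a run, combining the flags described above:

```bash
# sketch: quantized BERT Large, with a warmup pass so setup is excluded from the timing
python answer_questions.py --bert-large --quantize --warmup
```
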
| Threads | BERT Large speedup |
|---------|--------------------|
| 4 | 2.9 |
| 8 | 2.4 |
| 16 | 1.7 |

Note that in the above data we used the `--warmup` flag to run the model once before timing.
## General optimization guidelines
### Weight prepacking
`Linear` layers calling [Arm ComputeLibrary](https://github.com/ARM-software/ComputeLibrary) (ACL) matmuls reorder weights during runtime by default. These reorders can be eliminated by calling `pack_linear_weights` as shown in `pack_linear_weights.py`. This improves the performance of any models calling a `Linear` layer multiple times.
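
To see the effect, you can run the example script; a minimal sketch, assuming it needs no arguments:

```bash
python pack_linear_weights.py
```
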
### General flags
There are several flags which typically improve the performance of PyTorch.
`DNNL_DEFAULT_FPMATH_MODE`: setting the environment variable `DNNL_DEFAULT_FPMATH_MODE` to `BF16` or `ANY` will instruct ACL to dispatch fp32 workloads to bfloat16 kernels where hardware support permits. _Note: this may introduce a drop in accuracy._
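
For example, a sketch reusing the question-answering script from above (any of the example scripts can be run the same way):

```bash
# dispatch eligible fp32 operations to bfloat16 kernels where the hardware supports it
DNNL_DEFAULT_FPMATH_MODE=BF16 python answer_questions.py --warmup
```
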
You can use `tcmalloc` to handle memory allocation in PyTorch, which often leads to better performance.
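
One common way to do this is to preload the library; a sketch, where the library path is an assumption that varies by distribution:

```bash
# install tcmalloc (e.g. the libtcmalloc-minimal4 package) and point LD_PRELOAD at it
LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4 python answer_questions.py
```
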
- `IDEEP_CACHE_MATMUL_REORDERS=1`: Caches reordered weight tensors. This increases performance but also increases memory usage. `LRU_CACHE_CAPACITY` should be set to a meaningful amount for this cache to be effective.
- `LRU_CACHE_CAPACITY=<cache size>`: Number of objects to cache in the LRU cache
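
A sketch of setting both cache-related variables for a run (the capacity value is an arbitrary example):

```bash
IDEEP_CACHE_MATMUL_REORDERS=1 LRU_CACHE_CAPACITY=1024 python answer_questions.py --warmup
```
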
Tool-Solutions leverages 4-bit dynamic weight quantization to accelerate GenAI workloads.

_Note: Model repo access might be required to run certain models correctly._
To access the protected models, run:
```bash
huggingface-cli login --token @hf_token
```
### Vision
The script [`llama_vision_instruct.py`](llama_vision_instruct.py) runs and benchmarks Llama-3.2-11B-Vision-Instruct using text + image input and text output.
#### Command-Line Options

- `--num-new-tokens`: The model will always generate this number of new tokens.
- `--prompt`: Input prompt.
- `--image-url`: URL to image.
- `--benchmark`: Run a benchmark, with warmup and multiple iterations.
- `--dtype {bfloat16,float32}`: Precision to run the model in (or the non-`Linear` layers for a quantized model).
- `--quantize`: Quantize weights to `torch.int4` symmetric channelwise.
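
A sketch of a benchmarking run that combines these options; the prompt, image URL, and token count are placeholder values:

```bash
python llama_vision_instruct.py --benchmark --quantize --dtype bfloat16 \
    --num-new-tokens 64 --image-url <image-url> --prompt "Describe this image."
```
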
### Text Generation
The script [`transformers_llm_text_gen.py`](transformers_llm_text_gen.py) demonstrates how to generate text using the TinyLlama-1.1B-Chat-v1.0 model via Transformers. It leverages 4-bit dynamic quantization and can support a wide range of text models.
To run inference using the default (groupwise, layout-aware INT4) quantization, call:
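
A minimal sketch of the call, assuming the script's defaults:

```bash
python transformers_llm_text_gen.py
```
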