G2TM can now be run with PyTorch 2.4.1 and PyTorch Geometric!
The code can be found in the `torch2` branch.
For the optimal results reported in the paper, we still recommend using the main branch with the NetworkX library.
Authors: Victor BERCY, Martyna POREBA, Michal SZCZEPANSKI, Samia BOUCHAFA
Graph-Guided Token Merging (G2TM) is a lightweight one-shot module designed to eliminate redundant tokens early in the ViT architecture. It performs a single merging step after a shallow attention block, enabling all subsequent layers to operate on a compact token set. It leverages graph theory to identify groups of semantically redundant patches.
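The merging idea can be sketched in a few lines of pure Python. This is an illustrative toy, not the repository implementation: the real module operates on ViT token embeddings and builds the graph with NetworkX (a small union-find stands in here for `nx.connected_components`), and the cosine similarity and mean-merge rule below are assumptions made for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two token embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def merge_tokens(tokens, threshold=0.88):
    """One-shot merge: connect tokens whose similarity exceeds the
    threshold, then average each connected component into one token."""
    n = len(tokens)
    parent = list(range(n))

    def find(i):  # union-find root lookup with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if cosine(tokens[i], tokens[j]) >= threshold:
                parent[find(i)] = find(j)  # union: add edge (i, j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(tokens[i])
    # Each group of redundant tokens collapses into its mean.
    return [[sum(col) / len(g) for col in zip(*g)] for g in groups.values()]

tokens = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
merged = merge_tokens(tokens, threshold=0.88)  # the two near-duplicates collapse
```

Because the merge happens once, early, every subsequent block simply sees a shorter token sequence; no per-layer bookkeeping is needed.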
In this repository, Graph-Guided Token Merging (G2TM) is applied to Segmenter: Transformer for Semantic Segmentation (Robin Strudel*, Ricardo Garcia*, Ivan Laptev and Cordelia Schmid, ICCV 2021) by extending its code with token merging modules.
In this section, we explain step by step how to set up the environment for this repository (Segmenter + G2TM).
1. Clone the repository:
git clone https://github.com/vbercy/g2tm-segmenter
cd g2tm-segmenter
2. Setting up a conda environment:
G2TM targets PyTorch 1.13.1 and NetworkX 3.4.2.
# create environment
conda create -n G2TM python==3.10.*
conda activate G2TM
# install pytorch with cuda
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia
3. Installing the Segmenter requirements:
# build tools to handle old packages
pip install "MKL==2024.0.0" "setuptools<80.9.0" wheel
# install mmcv separately
pip install --no-build-isolation mmcv==1.7.2
# set up the Segmenter package
pip install -v -e .
4. Installing the G2TM requirements:
# set up the G2TM package
cd g2tm/ && pip install -v -e . && cd ../
5. Prepare the datasets
If needed, to download and prepare ADE20K, Cityscapes and/or PascalContext dataset(s), use the following command(s):
python ./segm/scripts/prepare_ade20k.py <ade20k_dir>
python ./segm/scripts/prepare_cityscapes.py <cityscapes_dir>
python ./segm/scripts/prepare_pcontext.py <pcontext_dir>
Then, define an OS environment variable pointing to the directory corresponding to the dataset you want to use:
export DATASET=/path/to/dataset/dir
To train a Segmenter model (size tiny, small, base or large) with G2TM on a specific dataset (whose path is provided by DATASET), use the command provided below. In the paper, for example, we chose to apply G2TM at the 2nd layer with a threshold of 0.88 and without any modified attention formulation.
Note: a log file and a tensorboard directory will automatically be created for you to monitor your training.
python ./segm/train.py --log-dir <model_dir> \
--dataset <dataset_name> \
--backbone vit_<size>_patch16_384 \
--decoder mask_transformer \
--patch-type graph \
--selected-layer 2 \
--threshold 0.88 \
--batch-size 8
We explain here the specific options for G2TM:
--patch-type graph: Applies the G2TM token reduction method.
--selected-layer 2: Specifies the layer of the network at which G2TM is applied; here, the 2nd layer.
--threshold 0.88: Sets the similarity threshold for merging tokens in G2TM.
All training commands can be run with or without G2TM using the patch-type option, as well as with or without (Inverse) Proportional Attention using the prop-attn or iprop-attn options.
For more examples of training commands (e.g.: with curriculum, with Inverse Proportional Attention, etc.), see TRAINING.
You can download a checkpoint together with its configuration file from the Results and Models section.
To evaluate (mIoU) a Segmenter model with G2TM on the dataset it was trained on, execute the following command. Make sure that the directory provided for the model-path option contains the checkpoint AND the variant.yaml file. Here, we evaluate the model with G2TM applied at the 2nd layer with a threshold of 0.88, as trained in the previous section.
NOTE: Please use the correct values for the selected-layer and threshold options for the evaluated model. You can find these values in the variant.yaml file.
python ./segm/test.py <ckpt_file> \
--patch-type graph \
--selected-layer 2 \
--threshold 0.88
Note that you can still use the evaluation script (mIoU, mAcc, pAcc) provided by Segmenter using this command:
# single-scale baseline + G2TM evaluation:
python ./segm/eval/miou.py <ckpt_file> <dataset_name> \
--singlescale \
--patch-type graph \
--selected-layer 2 \
--threshold 0.88
# Explanation:
# --singlescale: Evaluates the model using a single scale of input images (otherwise --multiscale).
# --patch-type graph: Applies the G2TM token reduction (use --patch-type pure for the standard patch processing without any modifications).
All evaluation commands can be run with or without G2TM using the patch-type option, as well as with or without (Inverse) Proportional Attention using the prop-attn or iprop-attn options.
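The evaluation and benchmarking commands must reuse the selected-layer and threshold values the model was trained with, which are recorded in variant.yaml. Below is a minimal stdlib-only sketch for turning such a file into command-line flags. The key names `selected_layer` and `threshold` are assumptions about the file layout, so check your own variant.yaml (which you can also simply load with PyYAML instead of this naive reader):

```python
def read_flat_yaml(text):
    """Naive 'key: value' reader for flat YAML files (no nesting)."""
    values = {}
    for line in text.splitlines():
        if ":" in line and not line.lstrip().startswith("#"):
            key, _, value = line.partition(":")
            values[key.strip()] = value.strip()
    return values

# Stand-in for open("<model_dir>/variant.yaml").read(); key names are assumed.
variant = read_flat_yaml("selected_layer: 2\nthreshold: 0.88\n")
flags = f"--selected-layer {variant['selected_layer']} --threshold {variant['threshold']}"
```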
To calculate the throughput and GFLOPs of a model, execute the following commands. Again, ensure that the directory provided for the model-path option contains the checkpoint AND the variant.yaml file.
NOTE: Please use the correct values for the selected-layer and threshold options for the evaluated model. You can find these values in the variant.yaml file.
# Im/sec
python ./segm/speedtest.py <ckpt_file> <dataset_name> \
--batch-size 1 \
--patch-type graph \
--selected-layer 2 \
--threshold 0.88
# GFLOPs
python ./segm/flops.py <ckpt_file> <dataset_name> \
--batch-size 1 \
--patch-type graph \
--selected-layer 2 \
--threshold 0.88
To profile model activity during inference on CPU and GPU using PyTorch tools, use the following command:
python ./segm/profile_model.py <ckpt_file> <dataset_name> \
--patch-type graph \
--selected-layer 2 \
--threshold 0.88
All benchmarking commands can be run with or without G2TM using the patch-type option, as well as with or without (Inverse) Proportional Attention using the prop-attn or iprop-attn options.
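As a back-of-the-envelope check on why merging reduces GFLOPs (an assumption-laden sketch, not what flops.py computes): per-block ViT cost splits into attention matmuls that grow quadratically with the token count N and projection/MLP terms that grow linearly, so shrinking N after an early layer cuts the cost of every later block. The numbers below are illustrative: 576 tokens before merging for a 384x384 input with 16x16 patches, and an assumed ~300 tokens remaining after the merge.

```python
def block_flops(n_tokens, dim, mlp_ratio=4):
    """Rough multiply-accumulate count for one ViT encoder block."""
    attn = 4 * n_tokens * dim**2 + 2 * n_tokens**2 * dim  # qkv/out projections + attention matmuls
    mlp = 2 * mlp_ratio * n_tokens * dim**2               # two MLP projections
    return attn + mlp

DIM, DEPTH, N0 = 192, 12, 576   # ViT-Tiny at 384x384 with 16x16 patches
N1 = 300                        # assumed token count after the merge

full = DEPTH * block_flops(N0, DIM)
g2tm = 2 * block_flops(N0, DIM) + (DEPTH - 2) * block_flops(N1, DIM)  # merge after layer 2
savings = 1 - g2tm / full  # roughly 0.47 with these assumed numbers
```

This also explains why the paper applies G2TM at a shallow layer: the earlier the merge, the more blocks run on the reduced token set.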
To visualize segmentation maps, as well as the tokens and the attention maps at a specified layer, for a specific image, execute the following command. It supports models both with and without token reduction. For more details on the outputs, see the function documentation. In the example below, we generate visualizations for a Segmenter model with G2TM applied at the 1st layer with a threshold of 0.95.
python ./segm/show_attn_map.py <ckpt_file> <img_path> \
<output_dir> <dataset_cmap> \
--cls --enc --layer-id <layer> \
--patch-type graph \
--selected-layer 1 \
--threshold 0.95
We explain here the visualization-specific options:
--cls: Computes attention maps with respect to the [CLS] token; otherwise (--patch), you have to provide the coordinates of the reference patch (--x-patch <x> --y-patch <y>).
--enc: Visualizes the encoder; use --dec for the decoder.
--layer-id <layer>: The index of the layer (starting from 0) for visualization.
To get some statistics on the remaining tokens after merging, please run the following command:
python ./segm/token_stats.py <ckpt_file> <dataset> \
--layer-id <layer> \
--patch-type graph \
--selected-layer 1 \
--threshold 0.95
We explain here the specific option for token statistics:
--layer-id <layer>: The index of the layer (starting from 0) at which to measure the token statistics (measured after the merging operation if the Transformer block contains a G2TM module).
All token commands can be run with or without G2TM using the patch-type option, as well as with or without (Inverse) Proportional Attention using the prop-attn or iprop-attn options.
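To make the reported statistics concrete, here is a hedged sketch of the kind of summary such a measurement yields; the function, counts, and output format below are hypothetical, not token_stats.py's actual API. Given per-image token counts recorded after the merge, the mean remaining tokens and the keep ratio follow directly:

```python
def summarize(token_counts, initial_tokens):
    """Mean remaining tokens and fraction kept, from per-image counts."""
    mean_remaining = sum(token_counts) / len(token_counts)
    return mean_remaining, mean_remaining / initial_tokens

# A 384x384 input with 16x16 patches yields 24 * 24 = 576 tokens before merging.
mean_remaining, keep_ratio = summarize([300, 320, 280], initial_tokens=576)
```

A lower keep ratio means more aggressive merging; comparing it against the mIoU from the evaluation step shows the accuracy/efficiency trade-off for a given threshold.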
See RESULTS for some comparative results for Segmenter + G2TM and the corresponding model checkpoints.
NOTE: We are still looking for a solution to host all model checkpoints, in the meantime do not hesitate to request the checkpoints by contacting one of the authors.
- [x] Training and Inference scripts
- [x] Flops and Speedtest scripts
- [x] Token and attention map visualization scripts
- [x] Experiments on ADE20K and Cityscapes datasets
- [ ] Experiments on Pascal-Context dataset
- [ ] ONNX export script
This code extends the official Segmenter code (under MIT Licence). It uses the repository structure and some utility functions from ToMe (under CC-BY-NC licence), as well as utility functions from AGLM.
Inheriting from the Segmenter repository, the Vision Transformer code is based on the timm library (under Apache 2.0 Licence), and the semantic segmentation training and evaluation pipelines use the mmsegmentation and mmcv libraries (under Apache 2.0 Licence).
All files covered by Segmenter's or ToMe's licences include a header indicating the licence and whether the file has been modified. You can find such files from Segmenter's repository in the segm directory and from ToMe's repository in the patch and vis folders.
Below are other Python libraries, along with their corresponding licenses, used in this work:
- Click under BSD-3-Clause License
- einops under MIT License
- FVCore under Apache 2.0 License
- Matplotlib under PSF License
- NetworkX under BSD-3-Clause License
- Numpy under BSD-3-Clause License
- ONNX under Apache 2.0 License
- ONNXRuntime under MIT License
- OpenCV under MIT License
- Pillow under MIT-CMU License
- PyTorch under BSD-3-Clause License
- Scipy under BSD-3-Clause License
- Tqdm under MPL v. 2.0 and MIT Licenses
By contributing to G2TM, you agree that your contributions will be licensed under the LICENSE file in the root directory of this source tree.
Copyright © 2025 Commissariat à l'Energie Atomique et aux Energies Alternatives (CEA)
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
If you cite G2TM or use this repository in your work, please cite:
@conference{bercy2026g2tm,
author={Victor Bercy and Martyna Poreba and Michal Szczepanski and Samia Bouchafa},
title={G2TM: Single-Module Graph-Guided Token Merging for Efficient Semantic Segmentation},
booktitle={Proceedings of the 21st International Conference on Computer Vision Theory and Applications - Volume 2: VISAPP},
year={2026},
pages={43-54},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0014267600004084},
isbn={978-989-758-804-4},
issn={2184-4321},
}

