Efficient Masked Attention Transformer for Few-Shot Classification and Segmentation

Dustin Carrión-Ojeda¹,²     Stefan Roth¹,²     Simone Schaub-Meyer¹,²

¹TU Darmstadt     ²hessian.AI

Accepted to GCPR 2025


TL;DR: EMAT processes high-resolution correlation tokens, boosting few-shot classification and segmentation, especially for small objects, while using at least four times fewer parameters than existing methods. It supports N-way K-shot tasks and correctly outputs empty masks when no target is present.

Installation

This project was originally developed using Python 3.9, PyTorch 2.1.1, and CUDA 12.1 on Linux. To reproduce the environment, follow these steps:

# 1) Clone the repository
git clone https://github.com/visinf/emat

# 2) Move into the repository
cd emat

# 3) Create the conda environment
conda create -n emat -c conda-forge -c nvidia -c pytorch \
    python=3.9 cuda-version=12.1 \
    pytorch==2.1.1 torchvision==0.16.1 pytorch-cuda=12.1

# 4) Activate the conda environment
conda activate emat

# 5) Install additional required packages using pip
pip install -r requirements.txt
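
Optionally, you can sanity-check the resulting environment with a quick PyTorch import. This extra step is not part of the original setup; it only confirms that PyTorch was installed and that CUDA is visible:

# 6) (Optional) Check PyTorch version and CUDA availability
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"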

Dataset Preparation

This project uses the PASCAL-5i and COCO-20i datasets. After downloading both datasets, organize them in the following directory structure:

<DIR_WITH_DATASETS>/
    PASCAL/
        JPEGImages/
        SegmentationClassAug/
        ...
    COCO/
        annotations/
        train2014/
        val2014/
        ...
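
If the datasets are already stored elsewhere on disk, one way to obtain this layout without copying data is to link them into a common root. The source paths below are placeholders; point them at your existing dataset copies:

# Placeholder paths -- adapt to where your dataset copies live
mkdir -p <DIR_WITH_DATASETS>
ln -s /path/to/pascal_voc <DIR_WITH_DATASETS>/PASCAL
ln -s /path/to/coco2014 <DIR_WITH_DATASETS>/COCO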

Training

EMAT can be trained for Few-Shot Classification and Segmentation (FS-CS) or Few-Shot Segmentation (FS-S). Below, we describe how to launch training for each task. We trained on three NVIDIA RTX A6000 GPUs (48 GB); you can train EMAT with fewer GPUs by adjusting the batch size in the configuration file located in configs/.

Prerequisites

  1. The PASCAL-5i and COCO-20i datasets must be organized as described in the Dataset Preparation section. Additionally, set the path <DIR_WITH_DATASETS> in the configuration file under DATA.PATH.
  2. EMAT uses a ViT-S/14 (without registers) pre-trained with DINOv2 as its backbone. Download the corresponding checkpoint and set its path in the configuration file under METHOD.BACKBONE_CHECKPOINT (see the configuration sketch after this list).
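
For reference, a minimal configuration excerpt might look like the following. The key names DATA.PATH and METHOD.BACKBONE_CHECKPOINT come from the prerequisites above; the exact nesting and remaining entries should be taken from the provided files in configs/ (e.g., configs/emat-pascal.yaml), and the paths below are placeholders:

# Illustrative excerpt -- check configs/emat-pascal.yaml for the exact layout
DATA:
  PATH: /path/to/DIR_WITH_DATASETS        # directory containing PASCAL/ and COCO/
METHOD:
  BACKBONE_CHECKPOINT: /path/to/dinov2_vits14_pretrain.pth   # DINOv2 ViT-S/14 weights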

Few-Shot Classification and Segmentation (FS-CS)

Each dataset includes four folds. For example, to train EMAT on Fold-0 of PASCAL-5i, run:

python main.py \
       --config_path configs/emat-pascal.yaml \
       --fold 0 \
       --way 1 \
       --shot 1 \
       --gpus 0,1,2

Few-Shot Segmentation (FS-S)

Similarly, to train EMAT on Fold-0 of PASCAL-5i for segmentation only, run:

python main.py \
       --config_path configs/ematseg-pascal.yaml \
       --fold 0 \
       --way 1 \
       --shot 1 \
       --only_seg \
       --no_empty_masks \
       --gpus 0,1,2

Checkpoints

We demonstrate that EMAT outperforms the recent state of the art (CST) across different evaluation settings. For a fair comparison, we updated CST (denoted CST*) to use the same backbone as EMAT (i.e., DINOv2 instead of DINO). The final checkpoints are provided in Table 1.

Table 1. Comparison of EMAT and the previous SOTA (CST*) in FS-CS on PASCAL-5i and COCO-20i across all evaluation settings: original, partially augmented, and fully augmented, using 2-way 1-shot tasks (base configuration).

Dataset      Method   Checkpoint   Original         Partially Augmented   Fully Augmented
                                   Acc.    mIoU     Acc.    mIoU          Acc.    mIoU
PASCAL-5i    CST*     download     80.58   63.28    80.60   63.23         78.57   63.08
PASCAL-5i    EMAT     download     82.70   63.38    82.92   63.32         81.23   63.24
COCO-20i     CST*     download     78.70   51.47    78.87   51.53         71.18   50.76
COCO-20i     EMAT     download     80.07   52.81    80.25   52.82         73.00   51.99

Additional checkpoints can be found here.

Evaluation

As explained earlier, EMAT can be used for both Few-Shot Classification and Segmentation (FS-CS) and Few-Shot Segmentation (FS-S). To evaluate a checkpoint for FS-CS (either one obtained after training or one of our provided checkpoints), run:

python main.py \
       --experiment_path <PATH_TO_EXPERIMENT> \
       --fold {0, 1, 2, 3} \
       --way 2 \
       --shot 1 \
       --setting {original, partially-augmented, fully-augmented} \
       --eval
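
For example, to evaluate a PASCAL-5i checkpoint from Table 1 on Fold-0 under the original setting, the call could look as follows. The experiment directory name is only illustrative; use the path where you stored the checkpoint:

python main.py \
       --experiment_path experiments/emat_pascal_fold0 \
       --fold 0 \
       --way 2 \
       --shot 1 \
       --setting original \
       --eval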

To evaluate a checkpoint on our splits based on object size, run:

python main.py \
       --experiment_path <PATH_TO_EXPERIMENT> \
       --fold {0, 1, 2, 3} \
       --way 1 \
       --shot 1 \
       --object_size_split {0-5, 5-10, 10-15, 0-15} \
       --eval \
       --no_empty_masks

Finally, to evaluate a checkpoint on FS-S, use:

python main.py \
       --experiment_path <PATH_TO_EXPERIMENT> \
       --fold {0, 1, 2, 3} \
       --way 1 \
       --shot 1 \
       --eval \
       --only_seg \
       --no_empty_masks

Evaluation Scripts

We also provide evaluation scripts in the evaluation/ directory to reproduce the results presented in our paper.

Table 2. Available evaluation scripts.

Script Name Description
eval_emat.sh Evaluates EMAT on all folds of PASCAL-5i and COCO-20i across all evaluation settings and object size splits (FS-CS).
eval_ematseg.sh Evaluates EMAT on all folds of PASCAL-5i and COCO-20i using 1-way 1- and 5-shot tasks (FS-S).
eval_cst.sh Evaluates CST* on all folds of PASCAL-5i and COCO-20i across all evaluation settings and object size splits (FS-CS).
eval_cst-large.sh Evaluates CST* with a larger support dimension on all folds of PASCAL-5i, using both the full dataset and only small-object subsets (FS-CS).

To reproduce all results:

  1. Download all checkpoints provided here, and place them in the experiments/ directory.
  2. Run each script as follows:
cd evaluation
bash <SCRIPT_NAME> <GPU_ID>
  3. After executing all scripts, process the results using the following command:
python process_results.py
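
For example, evaluating EMAT for FS-CS on GPU 0 and then collecting the results would look as follows:

cd evaluation
bash eval_emat.sh 0
python process_results.py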

Citation

If you find our work helpful, please consider citing the following paper and giving the repo a ⭐.

@inproceedings{carrion2025emat,
    title={Efficient Masked Attention Transformer for Few-Shot Classification and Segmentation},
    author={Dustin Carrión-Ojeda and Stefan Roth and Simone Schaub-Meyer},
    booktitle={Proceedings of the German Conference on Pattern Recognition (GCPR)},
    year={2025},
}

Acknowledgements

We acknowledge the authors of CST and DINOv2 for open-sourcing their implementations. This project was funded by the Hessian Ministry of Science and Research, Arts and Culture (HMWK) through the project "The Third Wave of Artificial Intelligence - 3AI". The project was further supported by the Deutsche Forschungsgemeinschaft (German Research Foundation, DFG) under Germany's Excellence Strategy (EXC 3057/1 "Reasonable Artificial Intelligence", Project No. 533677015). Stefan Roth acknowledges support by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 866008).
