Improving State Space Models for Computer Vision

CSCI-GA.2271-001 (Advanced) Computer Vision - Fall 2024

This project explores techniques that may improve state-space models (SSMs) for computer vision tasks, focusing on lightweight architectures and efficient training methods. Our work builds on prior research, including the original Vision Mamba (Vim) [1] and MambaVision [2], and extends these architectures with optimization techniques and experimental integrations such as Squeeze-and-Excitation (SE) blocks, the Early-Bird Lottery Ticket Hypothesis, and Sliding Window Attention (SWA).

Repository Structure

This repository is organized into two main folders:

vision_mamba/
Contains the implementation, experiments, and checkpoints for Vision Mamba (Vim).
mamba_vision/
Contains the implementation of MambaVision, a hybrid Vision Transformer and Mamba-based architecture, along with its checkpoints and experiments.

Subfolder: `checkpoints/`

Both the vision_mamba/ and mamba_vision/ folders include a checkpoints/ directory to store trained models and weights for reproducibility.

Experiments

The experiments conducted in this project are organized into two categories, corresponding to Vision Mamba (Vim) and MambaVision. Each experiment is located in its respective folder (vision_mamba/ or mamba_vision/) and includes some checkpoints for reproducibility. Below is an overview of the experiments performed:

With Vision Mamba (Vim):

Squeeze-and-Excitation (SE) Integration [3]: Integrated SE blocks to recalibrate channel-wise features, with mixed results.
MobileNetV2-Inspired Depthwise Convolutions [4]: Reduced computational overhead with a slight variation in accuracy.
Dynamic Pruning with Cosine Scheduler: Increased sparsity on a cosine schedule while preserving accuracy, with a best accuracy of 70%.
Iterative Magnitude Pruning [5]: Repeatedly pruned low-magnitude weights to identify sparse, efficient subnetworks while retaining performance.
Early-Bird Lottery Ticket Hypothesis [6]: Identified sparse subnetworks within the first 6 epochs by detecting when the distance between successive pruning masks stabilized, enabling early optimization.
Gradient Pruning: Pruned weights based on gradient magnitudes during training, giving adaptive sparsity that targets the least significant weights while maintaining performance.
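
The pruning experiments above share a few small building blocks: a sparsity schedule, a magnitude-based mask, and a distance between successive masks. The following is a minimal pure-Python sketch of those pieces, not the code used in this repository. All function names, the `threshold` and `window` values, and the toy weight lists are assumptions made for illustration.

```python
import math

def cosine_sparsity(step, total_steps, final_sparsity):
    """Ramp the target sparsity from 0 to final_sparsity along a half cosine."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return final_sparsity * 0.5 * (1.0 - math.cos(math.pi * progress))

def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that prunes the `sparsity` fraction of smallest |w|."""
    n_prune = int(len(weights) * sparsity)
    # Indices sorted by ascending magnitude; the first n_prune are pruned.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    mask = [1] * len(weights)
    for i in order[:n_prune]:
        mask[i] = 0
    return mask

def mask_distance(mask_a, mask_b):
    """Normalized Hamming distance between two masks of equal length."""
    differing = sum(a != b for a, b in zip(mask_a, mask_b))
    return differing / len(mask_a)

def early_bird_found(distances, threshold=0.1, window=3):
    """True once the last `window` mask distances all fall below threshold.

    threshold and window are illustrative defaults, not values from this project.
    """
    recent = distances[-window:]
    return len(recent) == window and all(d < threshold for d in recent)

# Toy weights standing in for one layer at two consecutive epochs (hypothetical).
weights_epoch1 = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2]
weights_epoch2 = [0.8, -0.02, 0.5, 0.03, -0.6, 0.25]

m1 = magnitude_mask(weights_epoch1, 0.5)
m2 = magnitude_mask(weights_epoch2, 0.5)
print(m1, m2, mask_distance(m1, m2))
```

In this toy case both epochs prune the same three weights, so the mask distance is 0. In an Early-Bird style search, a run of small distances like this is the signal that the sparse subnetwork has stopped changing and can be extracted early.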

With MambaVision:

Sliding Window Attention (SWA): Investigated increased receptive fields for MambaVision on larger-scale tasks.
Exploration of S4/S5 SSMs: Tested alternative kernels (S4 and S5) for continuous signals, showing promise in handling vision tasks with continuous characteristics.
Incorporating Attention-Based Pooling: Replaced standard pooling layers with attention-based pooling. This approach emphasizes critical features while suppressing irrelevant ones.
Integrating ConvNeXt Principles into MambaVision: Reduced kernel sizes (7×7 → 3×3) and strides (4 → 2), changed the downsampling layers to 2×2 kernels with stride 2, and reduced the number of blocks to prevent feature over-reduction. The modified model also employs depthwise separable convolutions, GELU activations, and layer normalization for better computational efficiency and training stability.
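
To see why smaller strides and fewer downsampling stages help avoid feature over-reduction, it can be useful to trace the spatial size of the feature map through the network. The sketch below uses the standard convolution output-size formula. The 32×32 input, the padding values, and the number of downsampling layers in each configuration are assumptions chosen for illustration, not the settings used in this repository.

```python
def conv_out_size(size, kernel, stride, padding=0):
    """Spatial output size of a convolution (floor convention, dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1

def trace_sizes(input_size, layers):
    """Apply (kernel, stride, padding) layers in order; return every size."""
    sizes = [input_size]
    for kernel, stride, padding in layers:
        sizes.append(conv_out_size(sizes[-1], kernel, stride, padding))
    return sizes

# Hypothetical configurations, for illustration only.
original = [(7, 4, 3)] + [(2, 2, 0)] * 3  # 7x7 stride-4 stem, three 2x2 downsamples
modified = [(3, 2, 1)] + [(2, 2, 0)] * 2  # 3x3 stride-2 stem, two 2x2 downsamples

print(trace_sizes(32, original))
print(trace_sizes(32, modified))
```

Under these assumptions the first configuration shrinks a 32×32 input to 8, 4, 2, and finally 1 pixel per side, while the second ends at 4×4 after passing through 16 and 8, leaving more spatial information for the later blocks.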

References

[1] L. Zhu, B. Liao, Q. Zhang, X. Wang, W. Liu, and X. Wang, “Vision mamba: Efficient visual representation learning with bidirectional state space model,” 2024. [Online]. Available: https://arxiv.org/abs/2401.09417
[2] A. Hatamizadeh and J. Kautz, “Mambavision: A hybrid mamba-transformer vision backbone,” 2024. [Online]. Available: https://arxiv.org/abs/2407.08083
[3] J. Hu, L. Shen, S. Albanie, G. Sun, and E. Wu, “Squeeze-and-excitation networks,” 2019. [Online]. Available: https://arxiv.org/abs/1709.01507
[4] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” 2019. [Online]. Available: https://arxiv.org/abs/1801.04381
[5] M. Paul, F. Chen, B. W. Larsen, J. Frankle, S. Ganguli, and G. K. Dziugaite, “Unmasking the lottery ticket hypothesis: What’s encoded in a winning ticket’s mask?” 2022. [Online]. Available: https://arxiv.org/abs/2210.03044
[6] H. You, C. Li, P. Xu, Y. Fu, Y. Wang, X. Chen, R. G. Baraniuk, Z. Wang, and Y. Lin, “Drawing early-bird tickets: Towards more efficient training of deep networks,” 2022. [Online]. Available: https://arxiv.org/abs/1909.11957
