Visual-Attention-Network

Abstract
While originally designed for natural language processing (NLP) tasks, the self-attention mechanism has recently taken various computer vision areas by storm. However, the 2D nature of images brings three challenges for applying self-attention in computer vision. (1) Treating images as 1D sequences neglects their 2D structures. (2) The quadratic complexity is too expensive for high-resolution images. (3) It only captures spatial adaptability but ignores channel adaptability. In this paper, we propose a novel large kernel attention (LKA) module to enable self-adaptive and long-range correlations in self-attention while avoiding the above issues. We further introduce a novel neural network based on LKA, namely Visual Attention Network (VAN). While extremely simple and efficient, VAN outperforms the state-of-the-art vision transformers and convolutional neural networks by a large margin in extensive experiments, including image classification, object detection, semantic segmentation, instance segmentation, etc.
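The decomposition behind LKA can be sketched in a few lines of PyTorch. Below is a minimal illustration of the idea described in the paper (a 21x21 large-kernel convolution approximated by a 5x5 depth-wise convolution, a 7x7 depth-wise convolution with dilation 3, and a 1x1 convolution, whose output gates the input); the class and variable names are ours for illustration and are not part of mmpretrain:

```python
import torch
import torch.nn as nn


class LKASketch(nn.Module):
    """Minimal sketch of Large Kernel Attention (illustrative only).

    A large-kernel convolution is decomposed into a depth-wise convolution,
    a depth-wise dilated convolution, and a point-wise convolution; the
    result is used as an attention map that gates the input element-wise.
    """

    def __init__(self, dim: int):
        super().__init__()
        # 5x5 depth-wise convolution: local spatial context.
        self.dw_conv = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)
        # 7x7 depth-wise convolution with dilation 3: long-range context.
        self.dw_dilated = nn.Conv2d(dim, dim, 7, padding=9, groups=dim, dilation=3)
        # 1x1 convolution: channel mixing, giving channel adaptability.
        self.pw_conv = nn.Conv2d(dim, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = self.pw_conv(self.dw_dilated(self.dw_conv(x)))
        # Element-wise gating combines spatial and channel attention.
        return x * attn


x = torch.rand(1, 64, 56, 56)
print(LKASketch(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```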

Usage
Predict with the model:

```python
from mmpretrain import inference_model

# Run single-image inference with a pretrained VAN model.
predict = inference_model('van-tiny_3rdparty_in1k', 'demo/bird.JPEG')
print(predict['pred_class'])
print(predict['pred_score'])
```
Use the model directly:

```python
import torch
from mmpretrain import get_model

# Build the model and load the pretrained weights.
model = get_model('van-tiny_3rdparty_in1k', pretrained=True)
inputs = torch.rand(1, 3, 224, 224)
out = model(inputs)
print(type(out))

# To extract features.
feats = model.extract_feat(inputs)
print(type(feats))
```
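`extract_feat` typically returns a tuple of feature tensors. A quick way to inspect them, assuming the default config (this snippet is our addition, not part of the original example):

```python
# Print the shape of each extracted feature tensor.
for f in feats:
    print(f.shape)
```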
Prepare your dataset according to the docs.

Test:

```shell
python tools/test.py configs/van/van-tiny_8xb128_in1k.py https://download.openmmlab.com/mmclassification/v0/van/van-tiny_8xb128_in1k_20220501-385941af.pth
```
Models and results

Image Classification on ImageNet-1k
| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
| :--- | :--- | ---: | ---: | ---: | ---: | :--- | :--- |
| VAN-Tiny | From scratch | 4.11 | 0.88 | 75.41 | 93.02 | | |
| VAN-Small | From scratch | 13.86 | 2.52 | 81.01 | 95.63 | | |
| VAN-Base | From scratch | 26.58 | 5.03 | 82.80 | 96.21 | | |
| VAN-Large | From scratch | 44.77 | 8.99 | 83.86 | 96.73 | | |
Models with * are converted from the official repo. The config files of these models are only for inference. We haven't reproduced the training results.
Citation
```bibtex
@article{guo2022visual,
  title={Visual Attention Network},
  author={Guo, Meng-Hao and Lu, Cheng-Ze and Liu, Zheng-Ning and Cheng, Ming-Ming and Hu, Shi-Min},
  journal={arXiv preprint arXiv:2202.09741},
  year={2022}
}
```