# EVA-02

## Abstract

We launch EVA-02, a next-generation Transformer-based visual representation pre-trained to reconstruct strong and robust language-aligned vision features via masked image modeling. With an updated plain Transformer architecture as well as extensive pre-training from an open & accessible giant CLIP vision encoder, EVA-02 demonstrates superior performance compared to prior state-of-the-art approaches across various representative vision tasks, while utilizing significantly fewer parameters and compute budgets. Notably, using exclusively publicly accessible training data, EVA-02 with only 304M parameters achieves a phenomenal 90.0 fine-tuning top-1 accuracy on ImageNet-1K val set. Additionally, our EVA-02-CLIP can reach up to 80.4 zero-shot top-1 on ImageNet-1K, outperforming the previous largest & best open-sourced CLIP with only ~1/6 parameters and ~1/6 image-text training data. We offer four EVA-02 variants in various model sizes, ranging from 6M to 304M parameters, all with impressive performance. To facilitate open access and open research, we release the complete suite of EVA-02 to the community.

TrV, the EVA-02 Transformer architecture, builds upon the original plain ViT architecture and includes several enhancements: SwiGLU FFN, sub-LN, 2D RoPE, and JAX weight initialization. To keep the parameters & FLOPs consistent with the baseline, the FFN hidden dim of SwiGLU is 2/3× of the typical MLP counterpart.
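As a rough illustration of the 2/3 rule, here is a minimal PyTorch sketch of a SwiGLU FFN (illustrative only, not EVA-02's actual implementation; the class and attribute names are made up). Because SwiGLU uses three projection matrices instead of the standard MLP's two, shrinking the hidden dim to 2/3 of the usual 4× width keeps the parameter count roughly unchanged: 3 × d × (2/3 × 4d) equals 2 × d × 4d.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Minimal SwiGLU feed-forward block (illustrative sketch, not EVA-02's code)."""

    def __init__(self, dim: int, mlp_ratio: float = 4.0):
        super().__init__()
        # Scale the hidden dim by 2/3 so the three projections below cost
        # about the same as a standard 2-projection MLP of width 4 * dim.
        hidden = int(dim * mlp_ratio * 2 / 3)
        self.w1 = nn.Linear(dim, hidden)   # gate branch
        self.w2 = nn.Linear(dim, hidden)   # value branch
        self.w3 = nn.Linear(hidden, dim)   # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU(x) = (SiLU(x @ W1) * (x @ W2)) @ W3
        return self.w3(F.silu(self.w1(x)) * self.w2(x))
```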

## Usage

```python
from mmpretrain import inference_model

predict = inference_model('vit-tiny-p14_eva02-in21k-pre_3rdparty_in1k-336px', 'demo/bird.JPEG')
print(predict['pred_class'])
print(predict['pred_score'])
```
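If you need the model object itself, for example to extract backbone features for a downstream task, mmpretrain also provides `get_model`; a short sketch, assuming the same checkpoint name and a 336px input:

```python
import torch
from mmpretrain import get_model

# Load the checkpoint as a model object (weights are downloaded if needed).
model = get_model('vit-tiny-p14_eva02-in21k-pre_3rdparty_in1k-336px', pretrained=True)

# Extract backbone features from a dummy batch of one 3x336x336 image.
inputs = torch.rand(1, 3, 336, 336)
feats = model.extract_feat(inputs)
```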

## Models and results

### Pretrained models

| Model                             | Params (M) | Flops (G) | Config | Download |
| :-------------------------------- | :--------: | :-------: | :----: | :------: |
| `vit-tiny-p14_eva02-pre_in21k`\*  |    5.50    |   1.70    | config |  model   |
| `vit-small-p14_eva02-pre_in21k`\* |   21.62    |   6.14    | config |  model   |
| `vit-base-p14_eva02-pre_in21k`\*  |   85.77    |   23.22   | config |  model   |
| `vit-large-p14_eva02-pre_in21k`\* |   303.29   |   81.15   | config |  model   |
| `vit-large-p14_eva02-pre_m38m`\*  |   303.29   |   81.15   | config |  model   |

- The input size / patch size of MIM pre-trained EVA-02 is 224x224 / 14x14 (see the patch-grid sketch below).

Models with * are converted from the official repo.
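For reference, those input and patch sizes mean the ViT backbone operates on a 16x16 grid of patch tokens; a quick check:

```python
# 224x224 inputs with 14x14 patches give a 16x16 grid of patch tokens.
image_size, patch_size = 224, 14
grid = image_size // patch_size  # 16
num_patches = grid ** 2          # 256 patch tokens per image
print(grid, num_patches)
```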

### Image Classification on ImageNet-1k

(w/o IN-21K intermediate fine-tuning)

| Model                                                 | Pretrain           | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
| :---------------------------------------------------- | :----------------- | :--------: | :-------: | :-------: | :-------: | :----: | :------: |
| `vit-tiny-p14_eva02-in21k-pre_3rdparty_in1k-336px`\*  | EVA02 ImageNet-21k |    5.76    |   4.68    |   80.69   |   95.54   | config |  model   |
| `vit-small-p14_eva02-in21k-pre_3rdparty_in1k-336px`\* | EVA02 ImageNet-21k |   22.13    |   15.48   |   85.78   |   97.60   | config |  model   |
| `vit-base-p14_eva02-in21k-pre_3rdparty_in1k-448px`\*  | EVA02 ImageNet-21k |   87.13    |  107.11   |   88.29   |   98.53   | config |  model   |

Models with * are converted from the official repo. The config files of these models are only for inference. We haven't reproduced the training results.

(w/ IN-21K intermediate fine-tuning)

| Model                                                             | Pretrain           | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
| :---------------------------------------------------------------- | :----------------- | :--------: | :-------: | :-------: | :-------: | :----: | :------: |
| `vit-base-p14_eva02-in21k-pre_in21k-medft_3rdparty_in1k-448px`\*  | EVA02 ImageNet-21k |   87.13    |  107.11   |   88.47   |   98.62   | config |  model   |
| `vit-large-p14_eva02-in21k-pre_in21k-medft_3rdparty_in1k-448px`\* | EVA02 ImageNet-21k |   305.08   |  362.33   |   89.65   |   98.95   | config |  model   |
| `vit-large-p14_eva02_m38m-pre_in21k-medft_3rdparty_in1k-448px`\*  | EVA02 Merged-38M   |   305.10   |  362.33   |   89.83   |   99.00   | config |  model   |

Models with * are converted from the official repo. The config files of these models are only for inference. We haven't reproduced the training results.

## Citation

```bibtex
@article{EVA-02,
  title={EVA-02: A Visual Representation for Neon Genesis},
  author={Yuxin Fang and Quan Sun and Xinggang Wang and Tiejun Huang and Xinlong Wang and Yue Cao},
  journal={arXiv preprint arXiv:2303.11331},
  year={2023}
}
```