EVA-02
Abstract
We launch EVA-02, a next-generation Transformer-based visual representation pre-trained to reconstruct strong and robust language-aligned vision features via masked image modeling. With an updated plain Transformer architecture as well as extensive pre-training from an open & accessible giant CLIP vision encoder, EVA-02 demonstrates superior performance compared to prior state-of-the-art approaches across various representative vision tasks, while utilizing significantly fewer parameters and compute budgets. Notably, using exclusively publicly accessible training data, EVA-02 with only 304M parameters achieves a phenomenal 90.0 fine-tuning top-1 accuracy on ImageNet-1K val set. Additionally, our EVA-02-CLIP can reach up to 80.4 zero-shot top-1 on ImageNet-1K, outperforming the previous largest & best open-sourced CLIP with only ~1/6 parameters and ~1/6 image-text training data. We offer four EVA-02 variants in various model sizes, ranging from 6M to 304M parameters, all with impressive performance. To facilitate open access and open research, we release the complete suite of EVA-02 to the community.
How to use it?
Predict image:

from mmpretrain import inference_model

# Classify a single demo image with the fine-tuned EVA-02-tiny model.
predict = inference_model('vit-tiny-p14_eva02-in21k-pre_3rdparty_in1k-336px', 'demo/bird.JPEG')
print(predict['pred_class'])
print(predict['pred_score'])
Use the model:

import torch
from mmpretrain import get_model

# Build the model and load the released weights.
model = get_model('vit-tiny-p14_eva02-in21k-pre_3rdparty_in1k-336px', pretrained=True)
inputs = torch.rand(1, 3, 336, 336)

# Forward pass returns the classification scores.
out = model(inputs)
print(type(out))

# To extract backbone features.
feats = model.extract_feat(inputs)
print(type(feats))
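
To turn the raw forward output into a readable prediction, the sketch below applies a softmax and takes the top class. It assumes `out` is a logits tensor of shape (1, num_classes), which is the usual case for an mmpretrain image classifier in tensor mode; adapt it if your model returns something else.

import torch.nn.functional as F

# Assumption: `out` holds un-normalized class scores (logits) of shape (1, num_classes).
probs = F.softmax(out, dim=-1)
score, class_id = probs.max(dim=-1)
print(class_id.item(), score.item())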
Prepare your dataset according to the docs.
Train:
python tools/train.py configs/eva02/eva02-tiny-p14_in1k.py
Test:
python tools/test.py configs/eva02/eva02-tiny-p14_in1k.py /path/to/eva02-tiny-p14_in1k.pth
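
If you have multiple GPUs, mmpretrain also ships a distributed launcher; for example, to train the same config on a single node with 8 GPUs:

bash tools/dist_train.sh configs/eva02/eva02-tiny-p14_in1k.py 8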
Models and results
Pretrained models

| Model | Params (M) | Flops (G) | Config | Download |
| :--- | ---: | ---: | :---: | :---: |
| EVA-02-tiny\* | 5.50 | 1.70 | | |
| EVA-02-small\* | 21.62 | 6.14 | | |
| EVA-02-base\* | 85.77 | 23.22 | | |
| EVA-02-large\* (ImageNet-21k) | 303.29 | 81.15 | | |
| EVA-02-large\* (Merged-38M) | 303.29 | 81.15 | | |

The input size / patch size of MIM pre-trained EVA-02 is 224x224 / 14x14.
Models with * are converted from the official repo.
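
To fine-tune from one of these MIM checkpoints yourself (rather than using the converted classification weights), the usual mmpretrain pattern is to point the backbone's init_cfg at the checkpoint inside your fine-tuning config. A minimal sketch; the checkpoint path is a placeholder, and the prefix argument is only needed if the saved state dict keys carry a 'backbone.' prefix:

# Fragment of a fine-tuning config (checkpoint path is a placeholder, not an official URL).
model = dict(
    backbone=dict(
        init_cfg=dict(
            type='Pretrained',
            checkpoint='/path/to/eva02-mim-pretrained.pth',
            # Strip the 'backbone.' key prefix if the checkpoint stores a full model.
            prefix='backbone.')))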
Image Classification on ImageNet-1k
(w/o IN-21K intermediate fine-tuning)

| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
| :--- | :--- | ---: | ---: | ---: | ---: | :---: | :---: |
| EVA-02-tiny\* | EVA02 ImageNet-21k | 5.76 | 4.68 | 80.69 | 95.54 | | |
| EVA-02-small\* | EVA02 ImageNet-21k | 22.13 | 15.48 | 85.78 | 97.60 | | |
| EVA-02-base\* | EVA02 ImageNet-21k | 87.13 | 107.11 | 88.29 | 98.53 | | |

Models with * are converted from the official repo. The config files of these models are only for inference. We haven't reproduced the training results.
(w IN-21K intermediate fine-tuning)

| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
| :--- | :--- | ---: | ---: | ---: | ---: | :---: | :---: |
| EVA-02-base\* | EVA02 ImageNet-21k | 87.13 | 107.11 | 88.47 | 98.62 | | |
| EVA-02-large\* | EVA02 ImageNet-21k | 305.08 | 362.33 | 89.65 | 98.95 | | |
| EVA-02-large\* | EVA02 Merged-38M | 305.10 | 362.33 | 89.83 | 99.00 | | |

Models with * are converted from the official repo. The config files of these models are only for inference. We haven't reproduced the training results.
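
To check which of these checkpoints your installed mmpretrain version actually provides, you can query the model registry with a glob-style pattern:

from mmpretrain import list_models

# Print every registered model whose name contains 'eva02'.
print(list_models('*eva02*'))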
Citation
@article{EVA-02,
title={EVA-02: A Visual Representation for Neon Genesis},
author={Yuxin Fang and Quan Sun and Xinggang Wang and Tiejun Huang and Xinlong Wang and Yue Cao},
journal={arXiv preprint arXiv:2303.11331},
year={2023}
}