# MFF

## Abstract

There has been significant progress in Masked Image Modeling (MIM). Existing MIM methods can be broadly categorized into two groups based on the reconstruction target: pixel-based and tokenizer-based approaches. The former offers a simpler pipeline and lower computational cost, but it is known to be biased toward high-frequency details. In this paper, we provide a set of empirical studies to confirm this limitation of pixel-based MIM and propose a new method that explicitly utilizes low-level features from shallow layers to aid pixel reconstruction. By incorporating this design into our base method, MAE, we reduce the wasted modeling capability of pixel-based MIM, improving its convergence and achieving non-trivial improvements across various downstream tasks. To the best of our knowledge, we are the first to systematically investigate multi-level feature fusion for isotropic architectures like the standard Vision Transformer (ViT). Notably, when applied to a smaller model (e.g., ViT-S), our method yields significant performance gains, such as 1.2% on fine-tuning, 2.8% on linear probing, and 2.6% on semantic segmentation.
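The fusion described in the abstract can be sketched in a few lines of PyTorch. This is a minimal illustration of the idea, not the repo's implementation: the class and attribute names (`FeatureFusion`, `projs`, `weights`) are invented here, and the choice of which shallow layers to tap is a hyperparameter of the method.

```python
# Minimal sketch of the multi-level feature fusion idea, assuming a generic
# PyTorch ViT encoder. The names (FeatureFusion, projs, weights) and the set
# of tapped layers are illustrative, not this repo's actual API.
import torch
import torch.nn as nn


class FeatureFusion(nn.Module):
    """Fuse shallow-layer features into the final encoder output with
    learnable, softmax-normalized weights before pixel decoding."""

    def __init__(self, embed_dim: int, num_shallow_layers: int):
        super().__init__()
        # One linear projection per shallow layer aligns it with the last layer.
        self.projs = nn.ModuleList(
            [nn.Linear(embed_dim, embed_dim) for _ in range(num_shallow_layers)]
        )
        # Learnable scalar weight per fused layer (shallow layers + last layer).
        # Zero init -> the softmax starts as a uniform average.
        self.weights = nn.Parameter(torch.zeros(num_shallow_layers + 1))

    def forward(self, shallow_feats, last_feat):
        feats = [proj(f) for proj, f in zip(self.projs, shallow_feats)]
        feats.append(last_feat)
        w = torch.softmax(self.weights, dim=0)
        # Weighted sum over layers; output shape matches a single layer's.
        return sum(wi * fi for wi, fi in zip(w, feats))
```

During pretraining, the fused feature (rather than only the last layer's output) feeds the pixel decoder, which is what lets shallow, low-frequency information reach the reconstruction loss.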

## Train/Test Command

Prepare your dataset according to the docs.

Train:

```shell
python tools/train.py configs/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k.py
```

Test:

```shell
python tools/test.py configs/mff/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py None
```
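If your checkout ships the stock OpenMMLab launcher scripts, multi-GPU runs go through `tools/dist_train.sh` / `tools/dist_test.sh`. The GPU count and the `CHECKPOINT.pth` placeholder below are illustrative, not values from this repo's docs.

```shell
# Assumed multi-GPU invocations via the standard OpenMMLab launchers.
bash tools/dist_train.sh configs/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k.py 8
bash tools/dist_test.sh configs/mff/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py CHECKPOINT.pth 8
```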

## Models and results

### Pretrained models

| Model                                       | Params (M) | Flops (G) | Config | Download     |
| :------------------------------------------ | :--------: | :-------: | :----: | :----------: |
| mff_vit-base-p16_8xb512-amp-coslr-300e_in1k |     -      |     -     | config | model \| log |
| mff_vit-base-p16_8xb512-amp-coslr-800e_in1k |     -      |     -     | config | model \| log |
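To sanity-check a downloaded pretraining checkpoint before running the benchmarks below, you can inspect it with plain PyTorch. The file name is a placeholder for whichever checkpoint you download, and the `backbone.` key prefix follows the usual OpenMMLab layout (an assumption, not verified against these exact files).

```python
# Sketch: inspect a downloaded MFF pretraining checkpoint with plain PyTorch.
import torch

ckpt = torch.load("mff_vit-base-p16_300e.pth", map_location="cpu")  # placeholder name
state_dict = ckpt.get("state_dict", ckpt)  # OpenMMLab checkpoints nest weights

# Only the backbone is reused downstream; the pixel decoder and the fusion
# weights exist solely for pretraining.
backbone_keys = [k for k in state_dict if k.startswith("backbone.")]
print(f"{len(backbone_keys)} backbone tensors found")
```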

### Image Classification on ImageNet-1k

| Model                                                   | Pretrain       | Params (M) | Flops (G) | Top-1 (%) | Config | Download    |
| :------------------------------------------------------ | :------------- | :--------: | :-------: | :-------: | :----: | :---------: |
| vit-base-p16_mff-300e-pre_8xb128-coslr-100e_in1k        | MFF 300-Epochs |   86.57    |   17.58   |   83.00   | config | model / log |
| vit-base-p16_mff-800e-pre_8xb128-coslr-100e_in1k        | MFF 800-Epochs |   86.57    |   17.58   |   83.70   | config | model / log |
| vit-base-p16_mff-300e-pre_8xb2048-linear-coslr-90e_in1k | MFF 300-Epochs |   304.33   |   61.60   |   64.20   | config | log         |
| vit-base-p16_mff-800e-pre_8xb2048-linear-coslr-90e_in1k | MFF 800-Epochs |   304.33   |   61.60   |   68.30   | config | model / log |

## Citation

```bibtex
@article{MFF,
  title={Improving Pixel-based MIM by Reducing Wasted Modeling Capability},
  author={Liu, Yuan and Zhang, Songyang and Chen, Jiacheng and Yu, Zhaohui and Chen, Kai and Lin, Dahua},
  journal={arXiv},
  year={2023}
}
```