# MFF

## Abstract
There has been significant progress in Masked Image Modeling (MIM). Existing MIM methods can be broadly categorized into two groups based on the reconstruction target: pixel-based and tokenizer-based approaches. The former offers a simpler pipeline and lower computational cost, but it is known to be biased toward high-frequency details. In this paper, we provide a set of empirical studies to confirm this limitation of pixel-based MIM and propose a new method that explicitly utilizes low-level features from shallow layers to aid pixel reconstruction. By incorporating this design into our base method, MAE, we reduce the wasted modeling capability of pixel-based MIM, improving its convergence and achieving non-trivial improvements across various downstream tasks. To the best of our knowledge, we are the first to systematically investigate multi-level feature fusion for isotropic architectures like the standard Vision Transformer (ViT). Notably, when applied to a smaller model (e.g., ViT-S), our method yields significant performance gains, such as 1.2% on fine-tuning, 2.8% on linear probing, and 2.6% on semantic segmentation.
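For intuition, here is a minimal PyTorch sketch of the multi-level feature fusion idea, not the authors' implementation: shallow-layer token features are projected and combined with the final-layer output through a learnable weighted average before pixel reconstruction. The tapped layer indices, the per-layer linear projections, and the softmax-normalized weights below are illustrative assumptions layered on an MAE-style encoder.

```python
import torch
import torch.nn as nn


class MFFFusion(nn.Module):
    """Sketch: fuse shallow-layer ViT features with the last-layer output."""

    def __init__(self, blocks: nn.ModuleList, embed_dim: int,
                 fusion_layers=(0, 2, 4, 6)):
        super().__init__()
        self.blocks = blocks                    # the ViT encoder blocks
        self.fusion_layers = tuple(fusion_layers)
        # One linear projection per tapped shallow layer (assumed design).
        self.projections = nn.ModuleDict(
            {str(i): nn.Linear(embed_dim, embed_dim) for i in fusion_layers})
        # Learnable weights over tapped layers + the final layer,
        # normalized with a softmax at fusion time.
        self.fusion_weights = nn.Parameter(torch.zeros(len(fusion_layers) + 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = []
        for i, block in enumerate(self.blocks):
            x = block(x)
            if i in self.fusion_layers:
                feats.append(self.projections[str(i)](x))
        feats.append(x)  # final-layer features
        weights = torch.softmax(self.fusion_weights, dim=0)
        # Weighted average; the fused tokens would feed the MAE-style
        # decoder that reconstructs the masked pixels.
        fused = sum(w * f for w, f in zip(weights, feats))
        return fused
```

The weighted average keeps the encoder's output dimensionality unchanged, so the fusion can be dropped into an existing MAE pipeline without touching the decoder.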
## Train/Test Command
Prepare your dataset according to the docs.
Train:

```shell
python tools/train.py configs/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k.py
```
Test (replace `None` with the path to your fine-tuned checkpoint):

```shell
python tools/test.py configs/mff/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py None
```
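The training entry point above is essentially a thin wrapper around MMEngine's `Runner`, so the same run can be launched from Python. A minimal sketch, assuming mmpretrain and mmengine are installed and using a hypothetical `work_dir`:

```python
from mmengine.config import Config
from mmengine.runner import Runner

# Load the same pretraining config used by tools/train.py.
cfg = Config.fromfile(
    'configs/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k.py')
cfg.work_dir = 'work_dirs/mff_vit-base-p16'  # hypothetical output directory

runner = Runner.from_cfg(cfg)
runner.train()
```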
## Models and results

### Pretrained models
| Model | Params (M) | Flops (G) | Config | Download |
| :--- | :--- | :--- | :--- | :--- |
| `mff_vit-base-p16_8xb512-amp-coslr-300e_in1k` | - | - | `configs/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k.py` | |
| `mff_vit-base-p16_8xb512-amp-coslr-800e_in1k` | - | - | `configs/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k.py` | |
### Image Classification on ImageNet-1k
| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| `vit-base-p16_8xb128-coslr-100e_in1k` (fine-tuning) | MFF 300-epoch | 86.57 | 17.58 | 83.00 | `configs/mff/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py` | |
| `vit-base-p16_8xb128-coslr-100e_in1k` (fine-tuning) | MFF 800-epoch | 86.57 | 17.58 | 83.70 | `configs/mff/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py` | |
| `vit-base-p16_8xb2048-linear-coslr-90e_in1k` (linear probing) | MFF 300-epoch | 304.33 | 61.60 | 64.20 | `configs/mff/benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py` | |
| `vit-base-p16_8xb2048-linear-coslr-90e_in1k` (linear probing) | MFF 800-epoch | 304.33 | 61.60 | 68.30 | `configs/mff/benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py` | |
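To experiment with the pretrained checkpoints listed above, mmpretrain's model-zoo API can build a model by name. A hedged sketch; the model name is taken from the config above and must be registered in the metafile of your installed mmpretrain version:

```python
from mmpretrain import get_model

# Build the MFF pretraining model by its config name; pretrained=True
# fetches the matching checkpoint when one is registered.
model = get_model('mff_vit-base-p16_8xb512-amp-coslr-300e_in1k',
                  pretrained=True)
print(type(model.backbone))  # the ViT encoder to transfer downstream
```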
## Citation
```bibtex
@article{MFF,
  title={Improving Pixel-based MIM by Reducing Wasted Modeling Capability},
  author={Yuan Liu and Songyang Zhang and Jiacheng Chen and Zhaohui Yu and Kai Chen and Dahua Lin},
  journal={arXiv},
  year={2023}
}
```