MAEViT¶
- class mmpretrain.models.selfsup.MAEViT(arch='b', img_size=224, patch_size=16, out_indices=-1, drop_rate=0, drop_path_rate=0, norm_cfg={'eps': 1e-06, 'type': 'LN'}, final_norm=True, out_type='raw', interpolate_mode='bicubic', patch_cfg={}, layer_cfgs={}, mask_ratio=0.75, init_cfg=None)[source]¶
Vision Transformer for MAE pre-training.
A PyTorch implementation of: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. This module implements the patch masking in MAE and initializes the position embedding with a sine-cosine position embedding.
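As a rough illustration of the fixed sine-cosine scheme, here is a minimal 1-D numpy sketch (the actual mmpretrain implementation builds a 2-D variant over the patch grid; the helper name below is hypothetical):

```python
import numpy as np

def sincos_pos_embed_1d(embed_dim, positions):
    """Hypothetical helper: fixed (non-learned) sine-cosine embedding
    for a 1-D grid of positions, as in the original Transformer paper.
    MAE applies a 2-D variant of this over the patch grid."""
    assert embed_dim % 2 == 0
    # Frequencies spanning several octaves: 1 / 10000^(2i / d).
    omega = 1.0 / 10000 ** (np.arange(embed_dim // 2) / (embed_dim / 2.0))
    out = np.einsum("p,d->pd", positions, omega)  # (num_pos, embed_dim/2)
    return np.concatenate([np.sin(out), np.cos(out)], axis=1)

emb = sincos_pos_embed_1d(8, np.arange(4, dtype=float))
print(emb.shape)  # (4, 8)
```

Because the embedding is fixed rather than learned, it needs no gradient updates and can be regenerated for any sequence length.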
- Parameters:
arch (str | dict) – Vision Transformer architecture. Defaults to ‘b’.
out_indices (Sequence | int) – Output from which stages. Defaults to -1, meaning the last stage.
drop_rate (float) – Probability of an element to be zeroed. Defaults to 0.
drop_path_rate (float) – Stochastic depth rate. Defaults to 0.
norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type='LN').
final_norm (bool) – Whether to add an additional layer to normalize the final feature map. Defaults to True.
out_type (str) – The type of output features. Please choose from:
- "cls_token": The class token tensor, with shape (B, C).
- "featmap": The feature map tensor from the patch tokens, with shape (B, C, H, W).
- "avg_featmap": The globally averaged feature map tensor, with shape (B, C).
- "raw": The raw feature tensor, including patch tokens and class tokens, with shape (B, L, C).
These options only take effect without an input mask. Defaults to "raw".
interpolate_mode (str) – Select the interpolation mode for resizing the position embedding vector. Defaults to “bicubic”.
patch_cfg (dict) – Configs of patch embedding. Defaults to an empty dict.
layer_cfgs (Sequence | dict) – Configs of each transformer layer in the encoder. Defaults to an empty dict.
mask_ratio (float) – The ratio of the total number of patches to be masked. Defaults to 0.75.
init_cfg (Union[List[dict], dict], optional) – Initialization config dict. Defaults to None.
- forward(x, mask=True)[source]¶
Generate features for masked images.
The function supports two kinds of forward behavior. If mask is True, the function randomly generates a mask, masks out some patches, and returns the hidden features of the visible patches only; that is, it runs as masked image modeling pre-training. If mask is None or False, the forward function calls super().forward(), which extracts features from images without masking.
- Parameters:
x (torch.Tensor) – Input images of shape B x C x H x W.
mask (bool, optional) – Whether the forward function generates a mask or not.
- Returns:
Hidden features, the mask, and the ids to restore the original image.
x (torch.Tensor): Hidden features of the visible patches, of shape B x (L * (1 - mask_ratio)) x C.
mask (torch.Tensor): The mask used to mask the image.
ids_restore (torch.Tensor): The ids to restore the original image.
- Return type:
Tuple[torch.Tensor, torch.Tensor, torch.Tensor]
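A quick back-of-the-envelope check of the visible-token count under the default mask_ratio of 0.75 (assuming a 224x224 input with 16x16 patches, as in the defaults above; the encoder additionally keeps the class token alongside the visible patches):

```python
# With the default settings, the encoder sees only a quarter of the patches.
img_size, patch_size, mask_ratio = 224, 16, 0.75
num_patches = (img_size // patch_size) ** 2    # 14 * 14 = 196 patch tokens
visible = int(num_patches * (1 - mask_ratio))  # 49 patches stay visible
print(num_patches, visible)  # 196 49
```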
- random_masking(x, mask_ratio=0.75)[source]¶
Generate the mask for MAE pre-training.
- Parameters:
x (torch.Tensor) – Image tokens with data augmentation applied, of shape B x L x C.
mask_ratio (float) – The mask ratio of total patches. Defaults to 0.75.
- Returns:
The masked image, the mask, and the ids to restore the original image.
x_masked (torch.Tensor): The masked image.
mask (torch.Tensor): The mask used to mask the image.
ids_restore (torch.Tensor): The ids to restore the original image.
- Return type:
Tuple[torch.Tensor, torch.Tensor, torch.Tensor]
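The argsort-of-noise algorithm behind random_masking can be sketched in a few lines of numpy (a re-implementation for illustration, not the mmpretrain source; the real version operates on torch tensors):

```python
import numpy as np

def random_masking(x, mask_ratio=0.75, rng=None):
    """Sketch of MAE's per-sample random masking: sort uniform noise,
    keep the first (1 - mask_ratio) fraction of patches, and record the
    inverse permutation so the decoder can restore the original order."""
    if rng is None:
        rng = np.random.default_rng(0)
    B, L, C = x.shape
    len_keep = int(L * (1 - mask_ratio))
    noise = rng.random((B, L))                     # one noise value per patch
    ids_shuffle = np.argsort(noise, axis=1)        # low noise -> kept first
    ids_restore = np.argsort(ids_shuffle, axis=1)  # inverse permutation
    ids_keep = ids_shuffle[:, :len_keep]
    x_masked = np.take_along_axis(x, ids_keep[:, :, None], axis=1)
    # Binary mask in the original patch order: 0 = kept, 1 = removed.
    mask = np.ones((B, L))
    mask[:, :len_keep] = 0
    mask = np.take_along_axis(mask, ids_restore, axis=1)
    return x_masked, mask, ids_restore

x = np.arange(2 * 8 * 4, dtype=float).reshape(2, 8, 4)
x_masked, mask, ids_restore = random_masking(x, mask_ratio=0.75)
print(x_masked.shape, mask.sum(axis=1))  # (2, 2, 4) [6. 6.]
```

Sorting the inverse permutation back (ids_restore) is what lets the decoder later scatter mask tokens into the removed positions without storing the shuffle itself.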