MAEHiViT

class mmpretrain.models.selfsup.MAEHiViT(arch='b', img_size=224, patch_size=16, inner_patches=4, out_indices=[23], drop_rate=0.0, drop_path_rate=0.0, norm_cfg={'eps': 1e-06, 'type': 'LN'}, ape=True, rpe=False, layer_scale_init_value=0.0, mask_ratio=0.75, init_cfg=None)

HiViT for MAE pre-training.

A PyTorch implementation of: HiViT: A Simple and More Efficient Design of Hierarchical Vision Transformer. This module implements the patch masking used in MAE and initializes the position embedding with a sine-cosine position embedding.

Parameters:

arch (str | dict) – Vision Transformer architecture. Defaults to 'b'.
img_size (int | tuple) – Input image size. Defaults to 224.
patch_size (int | tuple) – The patch size. Defaults to 16.
inner_patches (int) – The inner patches within a token. Defaults to 4.
out_indices (Sequence | int) – Indices of the stages to output from. Defaults to [23].
drop_rate (float) – Probability of an element to be zeroed. Defaults to 0.
drop_path_rate (float) – Stochastic depth rate. Defaults to 0.
norm_cfg (dict) – Config dict for the normalization layer. Defaults to dict(type='LN', eps=1e-6).
ape (bool) – Whether to use the absolute position embedding. Defaults to True.
rpe (bool) – Whether to use the relative position embedding. Defaults to False.
layer_scale_init_value (float) – The layer scale init value. Defaults to 0.0.
mask_ratio (float) – The ratio of the total number of patches to be masked. Defaults to 0.75.
init_cfg (Union[List[dict], dict], optional) – Initialization config dict. Defaults to None.
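As a rough illustration of how the parameters above fit together, the following is a sketch of a backbone config fragment. It assumes the usual mmpretrain convention of building models from a dict whose `type` key names the registered class; only keyword arguments from the signature above are used, and the variable name `backbone` is illustrative.

```python
# Hypothetical config fragment: every value mirrors a default from the
# class signature above. The registry lookup via `type` is an assumption
# about how the surrounding framework builds the module.
backbone = dict(
    type='MAEHiViT',
    arch='b',
    img_size=224,
    patch_size=16,
    inner_patches=4,
    mask_ratio=0.75,  # fraction of patches hidden during pre-training
)
```

Keys that are omitted fall back to the defaults listed in the signature.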

forward(x, mask=True)

Generate features for masked images.

The function supports two kinds of forward behavior. If mask is True, the function generates a random mask over the patches and returns the hidden features for the visible patches, i.e. it runs as masked image modeling pre-training. If mask is None or False, the function calls super().forward(), which extracts features from the images without masking.

Parameters:

x (torch.Tensor) – Input images of shape B x C x H x W.
mask (bool, optional) – Whether the forward function generates a mask. Defaults to True.

Returns:

Hidden features, mask and the ids to restore original image.

x (torch.Tensor): hidden features for the visible patches, which is of shape B x (L * (1 - mask_ratio)) x C.
mask (torch.Tensor): mask used to mask image.
ids_restore (torch.Tensor): ids to restore original image.

Return type:

Tuple[torch.Tensor, torch.Tensor, torch.Tensor]
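Because the masked branch returns hidden features for the visible patches only, the token dimension of the output is the number of patches that survive masking. The arithmetic below is a small sketch for the default settings. It assumes the token count L is `(img_size // patch_size) ** 2` and that the kept count is truncated to an integer, as in the original MAE reference code; neither detail is stated on this page.

```python
# Defaults taken from the class signature.
img_size, patch_size, mask_ratio = 224, 16, 0.75

# Assumption: one token per patch_size x patch_size patch.
num_tokens = (img_size // patch_size) ** 2

# Visible tokens are the ones that are NOT masked.
len_keep = int(num_tokens * (1 - mask_ratio))

print(num_tokens, len_keep)  # 196 49
```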

init_weights()

Initialize the position embedding and patch embedding.

masking_id(batch_size, mask_ratio)

Generate the mask for MAE pre-training.

Parameters:

batch_size – The batch size of the input data.
mask_ratio – The ratio of the total patches to be masked.

Returns:

The ids of the retained tokens, the ids to restore the original image, and the mask.

Return type:

Tuple[torch.Tensor, torch.Tensor, torch.Tensor]
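To make the three return values concrete, here is a minimal sketch of MAE-style random masking in PyTorch. It follows the noise-and-argsort approach from the original MAE reference code and is not necessarily how `masking_id` is implemented; the function name `random_masking_ids` and the `num_tokens` argument are hypothetical.

```python
import torch


def random_masking_ids(batch_size, num_tokens, mask_ratio=0.75):
    """Sketch of per-sample random masking over token indices."""
    len_keep = int(num_tokens * (1 - mask_ratio))

    # One uniform noise value per token; sorting it yields an
    # independent random permutation for every sample.
    noise = torch.rand(batch_size, num_tokens)
    ids_shuffle = torch.argsort(noise, dim=1)

    # Inverse permutation: maps shuffled positions back to original order.
    ids_restore = torch.argsort(ids_shuffle, dim=1)

    # The first len_keep shuffled tokens are the visible ones.
    ids_keep = ids_shuffle[:, :len_keep]

    # Binary mask in original token order: 0 = kept, 1 = masked.
    mask = torch.ones(batch_size, num_tokens)
    mask[:, :len_keep] = 0
    mask = torch.gather(mask, dim=1, index=ids_restore)
    return ids_keep, ids_restore, mask


ids_keep, ids_restore, mask = random_masking_ids(batch_size=2, num_tokens=196)
```

With 196 tokens and a ratio of 0.75, each sample keeps 49 token ids and marks the remaining 147 positions in `mask`. Gathering `mask` at `ids_keep` returns all zeros, which is a quick sanity check that the mask and the kept ids agree.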