class mmpretrain.models.selfsup.CAEPretrainViT(arch='b', img_size=224, patch_size=16, in_channels=3, out_indices=-1, drop_rate=0, drop_path_rate=0, bias='qv_bias', norm_cfg={'eps': 1e-06, 'type': 'LN'}, final_norm=True, out_type='raw', frozen_stages=-1, use_abs_pos_emb=True, use_rel_pos_bias=False, use_shared_rel_pos_bias=False, layer_scale_init_value=None, interpolate_mode='bicubic', patch_cfg={}, layer_cfgs={}, init_cfg=[{'type': 'Constant', 'val': 1, 'layer': ['LayerNorm']}, {'type': 'TruncNormal', 'std': 0.02, 'layer': ['Conv2d']}, {'type': 'Xavier', 'distribution': 'uniform', 'layer': ['Linear']}])[source]

Vision Transformer for CAE pre-training. The implementation is based on BEiTViT.

  • arch (str | dict) – Vision Transformer architecture. Default: ‘b’

  • img_size (int | tuple) – Input image size. Defaults to 224.

  • patch_size (int | tuple) – The patch size. Defaults to 16.

  • out_indices (Sequence | int) – Output from which stages. Defaults to -1, which means the last stage.

  • drop_rate (float) – Probability of an element to be zeroed. Defaults to 0.

  • drop_path_rate (float) – Stochastic depth rate. Defaults to 0.

  • bias (bool | str) – Whether to add learnable bias for q, k, v. If bias is True, learnable bias is added to q, k and v. If bias is ‘qv_bias’, learnable bias is added only to q and v. If bias is False, no bias is added to q, k or v. Defaults to ‘qv_bias’.

  • norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type='LN', eps=1e-6).

  • final_norm (bool) – Whether to add an additional layer to normalize the final feature map. Defaults to True.

  • out_type (str) –

    The type of output features. Please choose from:

    • "cls_token": The class token tensor with shape (B, C).

    • "featmap": The feature map tensor from the patch tokens with shape (B, C, H, W).

    • "avg_featmap": The global averaged feature map tensor with shape (B, C).

    • "raw": The raw feature tensor includes patch tokens and class tokens with shape (B, L, C).

    It only takes effect when no input mask is given. Defaults to "raw".

  • interpolate_mode (str) – The interpolation mode used to resize the position embedding vector. Defaults to “bicubic”.

  • layer_scale_init_value (float, optional) – The init value of gamma in BEiTTransformerEncoderLayer. Defaults to None.

  • patch_cfg (dict) – Configs of patch embedding. Defaults to an empty dict.

  • layer_cfgs (Sequence | dict) – Configs of each transformer layer in encoder. Defaults to an empty dict.

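  • init_cfg (dict | list[dict], optional) – Initialization config. Defaults to the Constant (LayerNorm), TruncNormal (Conv2d) and Xavier (Linear) initializers listed in the class signature.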
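
As a rough illustration of how the parameters above fit together, the sketch below writes a backbone config in the dict style used by OpenMMLab configs and works out the output shapes listed under out_type. It uses plain Python only and does not build the model. The helpers `patch_grid` and `expected_shape` are hypothetical, not part of mmpretrain. The sketch also assumes non-overlapping patches, a single class token, and an embedding width of 768 for arch='b'.

```python
# Backbone config in OpenMMLab dict style; values mirror the documented defaults.
backbone_cfg = dict(
    type='CAEPretrainViT',
    arch='b',
    img_size=224,
    patch_size=16,
    bias='qv_bias',   # learnable bias for q and v only
    out_type='raw',   # only relevant when forward() receives no mask
)


def patch_grid(img_size, patch_size):
    """Hypothetical helper: (H, W) of the patch grid.

    Assumes non-overlapping patches (stride equal to the patch size).
    """
    if isinstance(img_size, int):
        img_size = (img_size, img_size)
    if isinstance(patch_size, int):
        patch_size = (patch_size, patch_size)
    return img_size[0] // patch_size[0], img_size[1] // patch_size[1]


def expected_shape(out_type, batch, embed_dim, img_size, patch_size):
    """Hypothetical helper: output shape for each documented out_type."""
    h, w = patch_grid(img_size, patch_size)
    if out_type in ('cls_token', 'avg_featmap'):
        return (batch, embed_dim)
    if out_type == 'featmap':
        return (batch, embed_dim, h, w)
    if out_type == 'raw':
        # Patch tokens plus one class token (assumed).
        return (batch, h * w + 1, embed_dim)
    raise ValueError(f'Unsupported out_type: {out_type!r}')


EMBED_DIM_BASE = 768  # assumed width of the 'b' architecture

for name in ('cls_token', 'featmap', 'avg_featmap', 'raw'):
    shape = expected_shape(name, 2, EMBED_DIM_BASE,
                           backbone_cfg['img_size'],
                           backbone_cfg['patch_size'])
    print(name, shape)
```

With the default 224 x 224 input and 16 x 16 patches, the grid is 14 x 14, so the "raw" output carries 197 tokens per image under these assumptions.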

forward(x, mask)[source]

Generate features for masked images.

This function generates masked images and gets the hidden features for the visible patches.

The function supports two kinds of forward behavior. If the mask is not None, the forward function is executed as masked image modeling pre-training; if the mask is None, the forward function calls super().forward(), which extracts features from images without a mask.

  • x (torch.Tensor) – Input images, which is of shape B x C x H x W.

  • mask (torch.Tensor, optional) – Mask for input, which is of shape B x L.
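
The two forward behaviors described above can be pictured with a small sketch that does not depend on PyTorch. It assumes the mask is boolean with True marking a masked patch, and that every sample masks the same number of patches so the visible tokens still form a rectangular batch. `select_visible` and `forward_sketch` are hypothetical names. The real method works on tensors and also handles the class token and position embeddings, which are omitted here.

```python
def select_visible(tokens, mask):
    """Keep the patch tokens whose mask entry is False.

    tokens: B lists of L patch embeddings (each a list of C floats).
    mask:   B lists of L booleans, True = masked (assumed convention).
    """
    visible = []
    for sample_tokens, sample_mask in zip(tokens, mask):
        if len(sample_tokens) != len(sample_mask):
            raise ValueError('mask must have one entry per patch token')
        visible.append(
            [t for t, m in zip(sample_tokens, sample_mask) if not m])
    # All samples must keep the same count to form a (B, L_visible, C) batch.
    if len({len(v) for v in visible}) > 1:
        raise ValueError('every sample must mask the same number of patches')
    return visible


def forward_sketch(tokens, mask=None):
    """Mimic the two documented behaviors: with and without a mask."""
    if mask is None:
        # Stand-in for the plain feature-extraction path.
        return tokens
    return select_visible(tokens, mask)


# B=2 samples, L=4 patches, C=2 channels
tokens = [[[float(b), float(i)] for i in range(4)] for b in range(2)]
mask = [[True, False, True, False],
        [False, False, True, True]]

visible = forward_sketch(tokens, mask)
print(len(visible), len(visible[0]), len(visible[0][0]))
```

In this toy batch each sample keeps two of its four patches, so only half of the tokens would be passed on to the encoder layers.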


Returns:

hidden features.

Return type:

torch.Tensor


init_weights()[source]

Initialize position embedding, patch embedding and cls token.
