class mmpretrain.models.selfsup.MixMIMPretrainTransformer(arch='base', mlp_ratio=4, img_size=224, patch_size=4, in_channels=3, window_size=[14, 14, 14, 7], qkv_bias=True, patch_cfg={}, norm_cfg={'type': 'LN'}, drop_rate=0.0, drop_path_rate=0.0, attn_drop_rate=0.0, use_checkpoint=False, mask_ratio=0.5, range_mask_ratio=0.0, init_cfg=None)[source]

MixMIM backbone for MixMIM pre-training.

A PyTorch implementation of "MixMIM: Mixed and Masked Image Modeling for Efficient Visual Representation Learning".

Parameters:

  • arch (str | dict) –

    MixMIM architecture. If a string, choose from ‘base’, ‘large’ and ‘huge’. If a dict, it should have the following keys:

    • embed_dims (int): The dimensions of embedding.

    • depths (int): The number of transformer encoder layers.

    • num_heads (int): The number of heads in attention modules.

    Defaults to ‘base’.

  • mlp_ratio (int) – The mlp ratio in FFN. Defaults to 4.

  • img_size (int | tuple) – The expected input image shape. Because dynamic input shapes are supported, set this argument to the most common input image shape. Defaults to 224.

  • patch_size (int | tuple) – The patch size in patch embedding. Defaults to 4.

  • in_channels (int) – The num of input channels. Defaults to 3.

  • window_size (list) – The height and width of the window. Defaults to [14, 14, 14, 7].

  • qkv_bias (bool) – Whether to add bias for qkv in attention modules. Defaults to True.

  • patch_cfg (dict) – Extra config dict for patch embedding. Defaults to an empty dict.

  • norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type='LN').

  • drop_rate (float) – Probability of an element to be zeroed. Defaults to 0.

  • drop_path_rate (float) – Stochastic depth rate. Defaults to 0.

  • attn_drop_rate (float) – Attention drop rate. Defaults to 0.

  • use_checkpoint (bool) – Whether to use checkpointing to reduce GPU memory cost. Defaults to False.

  • mask_ratio (float) – The base ratio of the total number of patches to be masked. Defaults to 0.5.

  • range_mask_ratio (float) – The range of mask ratio. Defaults to 0.

  • init_cfg (dict, optional) – Initialization config dict. Defaults to None.
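As a minimal sketch, the documented arguments can be collected in a config-style dict. The registry name in `type` is assumed to match the class name, and the values only repeat the defaults from the signature above. The fragment builds a plain dict and does not import mmpretrain.

```python
# Hypothetical config fragment; keys mirror the documented constructor arguments.
backbone = dict(
    type='MixMIMPretrainTransformer',  # assumed registry name (same as the class)
    arch='base',
    img_size=224,
    patch_size=4,
    window_size=[14, 14, 14, 7],
    mask_ratio=0.5,
    range_mask_ratio=0.0,
    drop_path_rate=0.0,
)

# Tokens per side right after patch embedding: img_size // patch_size.
tokens_per_side = backbone['img_size'] // backbone['patch_size']
print(tokens_per_side)
```

With the defaults, a 224-pixel image and a patch size of 4 give 56 tokens per side after patch embedding.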

forward(x, mask=True)[source]

Generate features for masked images.

This function generates a mask, randomly masks some patches, and returns the hidden features for the visible patches.

Parameters:

  • x (torch.Tensor) – Input images, which is of shape B x C x H x W.

  • mask (bool, optional) – Whether to apply masking in the forward pass. Defaults to True.


Returns:

  • x (torch.Tensor): hidden features, which is of shape B x L x C.

  • mask_s4 (torch.Tensor): the mask tensor for the last layer.

Return type:

Tuple[torch.Tensor, torch.Tensor]
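The B x L x C output shape can be illustrated with a small shape calculation. This is a sketch under stated assumptions rather than the library's code: it assumes the last stage runs at a total stride of `encoder_stride` (32 here, an assumed value), so that L is the number of stride-sized cells in the input. The helper name `forward_output_shape` and the `channels` argument are hypothetical, and the final channel width is supplied by the caller.

```python
def forward_output_shape(batch, height, width, channels, encoder_stride=32):
    """Return the assumed (B, L, C) shape of the hidden features.

    L counts the cells of size encoder_stride x encoder_stride in the image.
    `channels` is the final-stage width, passed in by the caller.
    """
    tokens = (height // encoder_stride) * (width // encoder_stride)
    return (batch, tokens, channels)


# Illustrative values only: batch of 2, 224 x 224 input, width 1024.
shape = forward_output_shape(2, 224, 224, 1024)
print(shape)
```

Under these assumptions a 224 x 224 input yields a 7 x 7 grid, so L is 49, and the mask for the last layer covers the same 49 positions.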


init_weights()[source]

Initialize the position embedding and patch embedding.

random_masking(x, mask_ratio=0.5)[source]

Generate the mask for MixMIM Pretraining.

Parameters:

  • x (torch.Tensor) – Image with data augmentation applied, which is of shape B x L x C.

  • mask_ratio (float) – The mask ratio of total patches. Defaults to 0.5.


Returns:

  • mask_s1 (torch.Tensor): mask with stride of self.encoder_stride // 8.

  • mask_s2 (torch.Tensor): mask with stride of self.encoder_stride // 4.

  • mask_s3 (torch.Tensor): mask with stride of self.encoder_stride // 2.

  • mask (torch.Tensor): mask with stride of self.encoder_stride.

Return type:

Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]
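The relationship between the four masks can be sketched without any framework. The code below is an illustration under assumptions, not the library's implementation: it assumes one coarse mask is drawn at stride `encoder_stride` (32 is an assumed value), that `range_mask_ratio` adds a uniform offset in [0, range_mask_ratio] to the base ratio, and that the finer masks are nearest-neighbor upsamplings of the coarse one. The function name `random_masking_sketch` and its `rng` argument are hypothetical.

```python
import random


def random_masking_sketch(height, width, mask_ratio=0.5, range_mask_ratio=0.0,
                          encoder_stride=32, rng=None):
    """Return (mask_s1, mask_s2, mask_s3, mask) as nested lists of 0/1."""
    rng = rng or random.Random()
    out_h, out_w = height // encoder_stride, width // encoder_stride
    seq_len = out_h * out_w

    # Assumed behavior: shift the base ratio up by at most range_mask_ratio.
    ratio = mask_ratio + rng.uniform(0.0, range_mask_ratio)
    num_masked = min(seq_len, int(seq_len * ratio))

    # Choose which coarse cells are masked (1 = masked, 0 = visible).
    masked = set(rng.sample(range(seq_len), num_masked))
    coarse = [[1 if r * out_w + c in masked else 0 for c in range(out_w)]
              for r in range(out_h)]

    def upsample(grid, factor):
        # Nearest-neighbor upsampling: each cell becomes a factor x factor block.
        return [[grid[r // factor][c // factor]
                 for c in range(len(grid[0]) * factor)]
                for r in range(len(grid) * factor)]

    # Strides encoder_stride // 8, // 4, // 2, then encoder_stride itself.
    return upsample(coarse, 8), upsample(coarse, 4), upsample(coarse, 2), coarse


mask_s1, mask_s2, mask_s3, mask = random_masking_sketch(
    224, 224, rng=random.Random(0))
print(len(mask_s1), len(mask_s2), len(mask_s3), len(mask))
```

For a 224 x 224 input under these assumptions, the four grids have 56, 28, 14 and 7 cells per side, and every grid masks the same image regions.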
