class mmpretrain.models.backbones.MViT(arch='base', img_size=224, in_channels=3, out_scales=-1, drop_path_rate=0.0, use_abs_pos_embed=False, interpolate_mode='bicubic', pool_kernel=(3, 3), dim_mul=2, head_mul=2, adaptive_kv_stride=4, rel_pos_spatial=True, residual_pooling=True, dim_mul_in_attention=True, rel_pos_zero_init=False, mlp_ratio=4.0, qkv_bias=True, norm_cfg={'eps': 1e-06, 'type': 'LN'}, patch_cfg={'kernel_size': 7, 'padding': 3, 'stride': 4}, init_cfg=None)

Multi-scale ViT v2.

A PyTorch implementation of: MViTv2: Improved Multiscale Vision Transformers for Classification and Detection

Inspired by the official implementation and the detectron2 implementation.

  • arch (str | dict) –

    MViT architecture. If a string, choose from ‘tiny’, ‘small’, ‘base’ and ‘large’. If a dict, it must contain the following keys:

    • embed_dims (int): The dimensions of embedding.

    • num_layers (int): The number of layers.

    • num_heads (int): The number of heads in attention modules of the initial layer.

    • downscale_indices (List[int]): The layer indices to downscale the feature map.

    Defaults to ‘base’.

  • img_size (int) – The expected input image shape. Defaults to 224.

  • in_channels (int) – The number of input channels. Defaults to 3.

  • out_scales (int | Sequence[int]) – The output scale indices. They should not exceed the length of downscale_indices. Defaults to -1, which means the last scale.

  • drop_path_rate (float) – Stochastic depth rate. Defaults to 0.0.

  • use_abs_pos_embed (bool) – If True, add absolute position embedding to the patch embedding. Defaults to False.

  • interpolate_mode (str) – The interpolation mode used to resize the absolute position embedding. Defaults to “bicubic”.

  • pool_kernel (tuple) – Kernel size for the qkv pooling layers. Defaults to (3, 3).

  • dim_mul (int) – The magnification for embed_dims in the downscale layers. Defaults to 2.

  • head_mul (int) – The magnification for num_heads in the downscale layers. Defaults to 2.

  • adaptive_kv_stride (int) – The stride size for kv pooling in the initial layer. Defaults to 4.

  • rel_pos_spatial (bool) – Whether to enable the spatial relative position embedding. Defaults to True.

  • residual_pooling (bool) – Whether to enable the residual connection after attention pooling. Defaults to True.

  • dim_mul_in_attention (bool) – Whether to multiply the embed_dims in attention layers. If False, multiply it in MLP layers. Defaults to True.

  • rel_pos_zero_init (bool) – If True, zero initialize relative positional parameters. Defaults to False.

  • mlp_ratio (float) – Ratio of hidden dimensions in MLP layers. Defaults to 4.0.

  • qkv_bias (bool) – If True, add a bias to the qkv projections. Defaults to True.

  • norm_cfg (dict) – Config dict for normalization layer for all output features. Defaults to dict(type='LN', eps=1e-6).

  • patch_cfg (dict) – Config dict for the patch embedding layer. Defaults to dict(kernel_size=7, stride=4, padding=3).

  • init_cfg (dict, optional) – The Config for initialization. Defaults to None.
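
When arch is passed as a dict, it is easy to omit one of the four required keys or to request an output scale that does not exist. The sketch below builds such a config as a plain dict and checks it before it is handed to a builder. The helper check_mvit_cfg is hypothetical and not part of mmpretrain; the arch values are illustrative only; and the rule that an out_scales index may be at most len(downscale_indices) is an assumption read from the parameter description above.

```python
# Hypothetical helper, NOT part of mmpretrain: sanity-check an MViT config dict.
REQUIRED_ARCH_KEYS = {'embed_dims', 'num_layers', 'num_heads', 'downscale_indices'}


def check_mvit_cfg(cfg):
    """Return cfg unchanged if a dict-style arch looks consistent, else raise."""
    arch = cfg.get('arch', 'base')
    if not isinstance(arch, dict):
        return cfg  # string presets are resolved by the library itself

    missing = REQUIRED_ARCH_KEYS - set(arch)
    if missing:
        raise KeyError(f'arch is missing keys: {sorted(missing)}')

    # Assumption: scale indices may not exceed the number of downscale layers.
    num_downscales = len(arch['downscale_indices'])
    out_scales = cfg.get('out_scales', -1)
    if isinstance(out_scales, int):
        out_scales = [out_scales]
    for scale in out_scales:
        if scale > num_downscales:
            raise ValueError(
                f'out_scales index {scale} exceeds {num_downscales} downscale layers')
    return cfg


# Illustrative values only; adjust them to your own model.
custom_cfg = dict(
    type='MViT',
    arch=dict(
        embed_dims=96,
        num_layers=10,
        num_heads=1,
        downscale_indices=[1, 3, 8],
    ),
    out_scales=[0, 1, 2, 3],
    drop_path_rate=0.1,
)
check_mvit_cfg(custom_cfg)
```

A config that passes this check can then be given to build_backbone in the same way as the string-based config in the example that follows.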


>>> import torch
>>> from mmpretrain.models import build_backbone
>>> cfg = dict(type='MViT', arch='tiny', out_scales=[0, 1, 2, 3])
>>> model = build_backbone(cfg)
>>> inputs = torch.rand(1, 3, 224, 224)
>>> outputs = model(inputs)
>>> for i, output in enumerate(outputs):
...     print(f'scale{i}: {output.shape}')
scale0: torch.Size([1, 96, 56, 56])
scale1: torch.Size([1, 192, 28, 28])
scale2: torch.Size([1, 384, 14, 14])
scale3: torch.Size([1, 768, 7, 7])
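
The shapes printed above can be reproduced with plain arithmetic, which is a quick way to predict feature-map sizes for other settings without building the model. This is a minimal sketch under stated assumptions, not the library's code: expected_mvit_shapes is a hypothetical helper, embed_dims=96 is inferred from the scale0 output, and the halving of spatial resolution at every downscale layer is inferred from the printed shapes. The patch embedding size uses the standard convolution output formula with the default patch_cfg.

```python
def conv_out_size(size, kernel_size, stride, padding):
    """Standard convolution output size along one spatial dimension."""
    return (size + 2 * padding - kernel_size) // stride + 1


def expected_mvit_shapes(img_size=224, embed_dims=96, num_downscales=3,
                         dim_mul=2, patch_cfg=None):
    """Return a (channels, height, width) tuple for every output scale."""
    patch_cfg = patch_cfg or dict(kernel_size=7, stride=4, padding=3)
    size = conv_out_size(img_size, **patch_cfg)  # 224 -> 56 with the defaults
    dims = embed_dims
    shapes = [(dims, size, size)]
    for _ in range(num_downscales):
        dims *= dim_mul  # channels grow by dim_mul at each downscale layer
        size //= 2       # assumption: resolution halves at each downscale layer
        shapes.append((dims, size, size))
    return shapes


for i, shape in enumerate(expected_mvit_shapes()):
    print(f'scale{i}: {shape}')
# scale0: (96, 56, 56)
# scale1: (192, 28, 28)
# scale2: (384, 14, 14)
# scale3: (768, 7, 7)
```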

Forward the MViT.
