MobileViT

class mmpretrain.models.backbones.MobileViT(arch='small', in_channels=3, stem_channels=16, last_exp_factor=4, out_indices=(4,), frozen_stages=-1, conv_cfg=None, norm_cfg={'type': 'BN'}, act_cfg={'type': 'Swish'}, init_cfg=[{'type': 'Kaiming', 'layer': ['Conv2d']}, {'type': 'Constant', 'val': 1, 'layer': ['_BatchNorm', 'GroupNorm']}])

MobileViT backbone.

A PyTorch implementation of MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer.

Modified from the official repo and timm.

Parameters:

arch (str | List[list]) –
Architecture of MobileViT.
- If a string, choose from “small”, “x_small” and “xx_small”.
- If a list, every item should also be a list. The first item of each sub-list is either “mobilenetv2” or “mobilevit” and indicates the type of that layer sequence. For “mobilenetv2”, the remaining items are the arguments of make_mobilenetv2_layer (except in_channels); for “mobilevit”, they are the arguments of make_mobilevit_layer (except in_channels).
Defaults to “small”.
in_channels (int) – Number of input image channels. Defaults to 3.
stem_channels (int) – Channels of stem layer. Defaults to 16.
last_exp_factor (int) – Channels expand factor of last layer. Defaults to 4.
out_indices (Sequence[int]) – Indices of the stages whose outputs are returned. Defaults to (4, ).
frozen_stages (int) – Stages to be frozen (all parameters fixed). Defaults to -1, which means no parameters are frozen.
conv_cfg (dict, optional) – Config dict for convolution layer. Defaults to None, which means using conv2d.
norm_cfg (dict, optional) – Config dict for normalization layer. Defaults to dict(type='BN').
act_cfg (dict, optional) – Config dict for activation layer. Defaults to dict(type='Swish').
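init_cfg (dict | list[dict], optional) – Initialization config dict. Defaults to Kaiming initialization for Conv2d layers and constant initialization with val=1 for _BatchNorm and GroupNorm layers.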
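The list form of arch is easier to read with the items labeled. The following is a minimal sketch in plain Python, not part of mmpretrain: describe_arch is a hypothetical helper, the values in custom_arch are illustrative only, and the positional layouts are taken from the signatures of make_mobilenetv2_layer and make_mobilevit_layer with in_channels removed. It also assumes that each layer's in_channels is the previous layer's out_channels, starting from stem_channels.

```python
# Positional layouts taken from the two builder signatures, minus in_channels.
MOBILENETV2_ARGS = ("out_channels", "stride", "num_blocks", "expand_ratio")
MOBILEVIT_ARGS = (
    "out_channels", "stride", "transformer_dim", "ffn_dim",
    "num_transformer_blocks", "expand_ratio",
)

# Illustrative values only; not guaranteed to match any built-in preset.
custom_arch = [
    ["mobilenetv2", 32, 1, 1, 4],
    ["mobilenetv2", 64, 2, 3, 4],
    ["mobilevit", 96, 2, 144, 288, 2, 4],
]


def describe_arch(arch, stem_channels=16):
    """Hypothetical helper: label each sub-list's items with argument names.

    Assumes each layer's in_channels equals the previous layer's
    out_channels, starting from stem_channels.
    """
    in_channels = stem_channels
    described = []
    for layer_type, *args in arch:
        if layer_type == "mobilenetv2":
            names = MOBILENETV2_ARGS
        elif layer_type == "mobilevit":
            names = MOBILEVIT_ARGS
        else:
            raise ValueError(f"unknown layer type: {layer_type!r}")
        # Only expand_ratio has a default, so it is the only optional item.
        if not len(names) - 1 <= len(args) <= len(names):
            raise ValueError(f"wrong argument count for {layer_type!r}: {args}")
        kwargs = dict(zip(names, args))
        described.append((layer_type, {"in_channels": in_channels, **kwargs}))
        in_channels = kwargs["out_channels"]
    return described


for layer_type, kwargs in describe_arch(custom_arch):
    print(layer_type, kwargs)
```

A list such as custom_arch would then be passed as MobileViT(arch=custom_arch) in place of one of the preset strings.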

static make_mobilenetv2_layer(in_channels, out_channels, stride, num_blocks, expand_ratio=4)

Build a MobileNetV2 layer, which consists of several InvertedResidual blocks.

Parameters:

in_channels (int) – The input channels.
out_channels (int) – The output channels.
stride (int) – The stride of the first 3x3 convolution in the InvertedResidual layers.
num_blocks (int) – The number of InvertedResidual blocks.
expand_ratio (int) – Factor by which the number of channels of the hidden layer in InvertedResidual is adjusted. Defaults to 4.

static make_mobilevit_layer(in_channels, out_channels, stride, transformer_dim, ffn_dim, num_transformer_blocks, expand_ratio=4)

Build a MobileViT layer, which consists of one InvertedResidual block and one MobileVitBlock.

Parameters:

in_channels (int) – The input channels.
out_channels (int) – The output channels.
stride (int) – The stride of the first 3x3 convolution in the InvertedResidual layers.
transformer_dim (int) – The channels of the transformer layers.
ffn_dim (int) – The mid-channels of the feedforward network in transformer layers.
num_transformer_blocks (int) – The number of transformer blocks.
expand_ratio (int) – Factor by which the number of channels of the hidden layer in InvertedResidual is adjusted. Defaults to 4.
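Because both builders are static methods, they can be called on the class without constructing a backbone. The sketch below is a hedged usage example rather than an official recipe: it uses only the two signatures documented above, assumes torch and mmpretrain are installed (and skips the demo otherwise), and assumes the returned objects are ordinary torch.nn.Module instances that can be called on a tensor. The channel and dimension values are illustrative, and build_demo_layers is a hypothetical helper.

```python
try:
    import torch
    from mmpretrain.models.backbones import MobileViT
except ImportError:  # let the sketch run even without torch / mmpretrain
    torch = None
    MobileViT = None


def build_demo_layers():
    """Build one layer of each kind; return None if mmpretrain is missing."""
    if MobileViT is None:
        return None
    # Argument names follow the documented static-method signatures.
    conv_layer = MobileViT.make_mobilenetv2_layer(
        in_channels=16, out_channels=32, stride=1, num_blocks=1,
        expand_ratio=4)
    vit_layer = MobileViT.make_mobilevit_layer(
        in_channels=32, out_channels=64, stride=2, transformer_dim=96,
        ffn_dim=192, num_transformer_blocks=2, expand_ratio=4)
    return conv_layer, vit_layer


layers = build_demo_layers()
if layers is None:
    print("mmpretrain is not installed; skipping the demo")
else:
    conv_layer, vit_layer = layers
    # Assumption: both layers behave like regular nn.Module objects.
    conv_layer.eval()
    vit_layer.eval()
    x = torch.rand(1, 16, 64, 64)  # dummy feature map with 16 channels
    with torch.no_grad():
        y = conv_layer(x)
        z = vit_layer(y)
    print(tuple(y.shape), tuple(z.shape))
```

Note that the out_channels of the first layer (32) is passed as the in_channels of the second, which mirrors how consecutive entries of an arch list are expected to chain.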