SwinTransformer

class mmpretrain.models.backbones.SwinTransformer(arch='tiny', img_size=224, patch_size=4, in_channels=3, window_size=7, drop_rate=0.0, drop_path_rate=0.1, out_indices=(3,), out_after_downsample=False, use_abs_pos_embed=False, interpolate_mode='bicubic', with_cp=False, frozen_stages=-1, norm_eval=False, pad_small_map=False, norm_cfg={'type': 'LN'}, stage_cfgs={}, patch_cfg={}, init_cfg=None)

Swin Transformer.

A PyTorch implementation of: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Inspired by https://github.com/microsoft/Swin-Transformer

Parameters:

arch (str | dict) –
Swin Transformer architecture. If a string, choose from ‘tiny’, ‘small’, ‘base’ and ‘large’. If a dict, it must contain the following keys:
- embed_dims (int): The embedding dimension.
- depths (List[int]): The number of blocks in each stage.
- num_heads (List[int]): The number of heads in attention modules of each stage.
Defaults to ‘tiny’.
img_size (int | tuple) – The expected input image shape. Dynamic input shapes are supported, so set this to the most common input image shape. Defaults to 224.
patch_size (int | tuple) – The patch size in patch embedding. Defaults to 4.
in_channels (int) – The number of input channels. Defaults to 3.
window_size (int) – The height and width of the window. Defaults to 7.
drop_rate (float) – Dropout rate after embedding. Defaults to 0.
drop_path_rate (float) – Stochastic depth rate. Defaults to 0.1.
out_after_downsample (bool) – Whether to output the feature map of a stage after the following downsample layer. Defaults to False.
use_abs_pos_embed (bool) – If True, add absolute position embedding to the patch embedding. Defaults to False.
interpolate_mode (str) – The interpolation mode used to resize the absolute position embedding. Defaults to “bicubic”.
with_cp (bool) – Whether to use checkpointing. Checkpointing saves some memory at the cost of slower training. Defaults to False.
frozen_stages (int) – Stages to be frozen (stop grad and set eval mode). -1 means not freezing any parameters. Defaults to -1.
norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: this only affects Batch Norm and its variants. Defaults to False.
pad_small_map (bool) – If True, pad small feature maps to the window size, which is commonly used in detection and segmentation. If False, skip the window shift and shrink the window size to the size of the feature map, which is commonly used in classification. Defaults to False.
norm_cfg (dict) – Config dict for the normalization layer applied to all output features. Defaults to dict(type='LN').
stage_cfgs (Sequence[dict] | dict) – Extra config dict for each stage. Defaults to an empty dict.
patch_cfg (dict) – Extra config dict for patch embedding. Defaults to an empty dict.
init_cfg (dict, optional) – The config for initialization. Defaults to None.
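The relationship between arch, img_size, patch_size and window_size can be sketched in plain Python. This is a minimal illustration, not mmpretrain code: stage_geometry is a hypothetical helper, and it assumes the standard Swin layout in which the patch embedding divides the image side by patch_size and every stage after the first halves the side and doubles the channels. The 'tiny' values are the Swin-T settings from the Swin Transformer paper. The window is clamped to the feature map size, which mirrors the pad_small_map=False behaviour described above.

```python
# Hypothetical helper for illustration; it is NOT part of mmpretrain.
def stage_geometry(arch, img_size=224, patch_size=4, window_size=7):
    """Return a list of (channels, feature_map_side, effective_window).

    Assumptions: square inputs, patch embedding reduces the side by
    `patch_size`, and each stage after the first halves the side and
    doubles the channel count (the standard Swin layout).
    """
    side = img_size // patch_size
    channels = arch['embed_dims']
    stages = []
    for i in range(len(arch['depths'])):
        if i > 0:
            side //= 2      # downsampling between stages
            channels *= 2   # channel expansion between stages
        # With pad_small_map=False the window shrinks to fit the map.
        stages.append((channels, side, min(window_size, side)))
    return stages


# Swin-T settings as given in the Swin Transformer paper.
tiny = dict(embed_dims=96, depths=[2, 2, 6, 2], num_heads=[3, 6, 12, 24])

for i, (channels, side, window) in enumerate(stage_geometry(tiny)):
    head_dim = channels // tiny['num_heads'][i]
    print(f'stage {i}: channels={channels}, map={side}x{side}, '
          f'window={window}, head_dim={head_dim}')
```

With a smaller input, for example img_size=96, the last two stages in this sketch have 6x6 and 3x3 feature maps, so the window no longer fits and is shrunk accordingly.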

Examples

>>> from mmpretrain.models import SwinTransformer
>>> import torch
>>> extra_config = dict(
...     arch='tiny',
...     stage_cfgs=dict(downsample_cfg={'kernel_size': 3,
...                                     'expansion_ratio': 3}))
>>> self = SwinTransformer(**extra_config)
>>> inputs = torch.rand(1, 3, 224, 224)
>>> output = self.forward(inputs)
>>> print(output.shape)
(1, 2592, 4)

get_layer_depth(param_name, prefix='')

Get the layer-wise depth of a parameter.

Parameters:

param_name (str) – The name of the parameter.
prefix (str) – The prefix for the parameter. Defaults to an empty string.

Returns:

The layer-wise depth and the number of layers.

Return type:

Tuple[int, int]

Note

The first depth is the stem module (layer_depth=0), and the last depth is the subsequent module (layer_depth=num_layers-1).
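A (layer_depth, num_layers) pair of this kind is typically consumed by layer-wise learning-rate decay, where shallower layers get smaller learning rates. The sketch below shows one common convention and is an assumption, not the library's confirmed formula: lr_scale is a hypothetical helper, and the value num_layers=14 assumes a 'tiny' model whose 12 blocks are counted together with the stem and the subsequent module.

```python
# Hypothetical helper for illustration; it is NOT part of mmpretrain.
def lr_scale(layer_depth, num_layers, decay_rate=0.9):
    """Scale factor for a parameter's learning rate.

    The deepest layer (layer_depth == num_layers - 1) keeps the base
    learning rate; every step toward the stem multiplies it by
    `decay_rate` once more.
    """
    return decay_rate ** (num_layers - 1 - layer_depth)


num_layers = 14  # assumed: stem + 12 blocks + subsequent module
base_lr = 1e-3

for depth in (0, 1, 12, 13):
    lr = base_lr * lr_scale(depth, num_layers)
    print(f'layer_depth={depth}: lr={lr:.6f}')
```

Under this convention the stem (layer_depth=0) receives the smallest learning rate and the subsequent module receives the base rate unchanged.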