class mmpretrain.models.backbones.ViTSAM(arch='base', img_size=224, patch_size=16, in_channels=3, out_channels=256, out_indices=-1, out_type='raw', drop_rate=0.0, drop_path_rate=0.0, qkv_bias=True, use_abs_pos=True, use_rel_pos=True, window_size=14, norm_cfg={'eps': 1e-06, 'type': 'LN'}, frozen_stages=-1, interpolate_mode='bicubic', patch_cfg={}, layer_cfgs={}, init_cfg=None)

Vision Transformer as image encoder used in SAM.

A PyTorch implementation of the backbone from Segment Anything.

Parameters:

  • arch (str | dict) –

    Vision Transformer architecture. If a string, choose from ‘base’, ‘large’, or ‘huge’. If a dict, it should have the following keys:

    • embed_dims (int): The dimensions of embedding.

    • num_layers (int): The number of transformer encoder layers.

    • num_heads (int): The number of heads in attention modules.

    • feedforward_channels (int): The hidden dimensions in feedforward modules.

    • global_attn_indexes (Sequence[int]): The indexes of the layers that use global attention.

    Defaults to ‘base’.

  • img_size (int | tuple) – The expected input image shape. Dynamic input shapes are supported, so set this argument to the most common input image shape. Defaults to 224.

  • patch_size (int | tuple) – The patch size in patch embedding. Defaults to 16.

  • in_channels (int) – The number of input channels. Defaults to 3.

  • out_channels (int) – The number of output channels. If set to 0, the channel reduction layer is disabled. Defaults to 256.

  • out_indices (Sequence | int) – The stages from which to take output. Defaults to -1, which means the last stage.

  • out_type (str) –

    The type of output features. Choose from:

    • "raw" or "featmap": The feature map tensor from the patch tokens with shape (B, C, H, W).

    • "avg_featmap": The global averaged feature map tensor with shape (B, C).

    Defaults to "raw".

  • drop_rate (float) – Probability of an element to be zeroed. Defaults to 0.

  • drop_path_rate (float) – Stochastic depth rate. Defaults to 0.

  • qkv_bias (bool) – Whether to add bias for qkv in attention modules. Defaults to True.

  • use_abs_pos (bool) – Whether to use absolute position embedding. Defaults to True.

  • use_rel_pos (bool) – Whether to use relative position embedding. Defaults to True.

  • window_size (int) – Window size for window attention. Defaults to 14.

  • norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type='LN', eps=1e-6).

  • frozen_stages (int) – Stages to be frozen (stop grad and set eval mode). -1 means not freezing any parameters. Defaults to -1.

  • interpolate_mode (str) – The interpolation mode used when resizing the position embedding. Defaults to “bicubic”.

  • patch_cfg (dict) – Configs of patch embedding. Defaults to an empty dict.

  • layer_cfgs (Sequence | dict) – Configs of each transformer layer in encoder. Defaults to an empty dict.

  • init_cfg (dict, optional) – Initialization config dict. Defaults to None.
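
The parameters above can be combined into a backbone config. The following is a minimal sketch, not a verified configuration: the arch values are illustrative assumptions for a ViT-Base-sized model, and `expected_featmap_shape` is a hypothetical helper that is not part of mmpretrain. The shape estimate assumes non-overlapping patches, and assumes the output keeps `embed_dims` channels when `out_channels` is 0.

```python
# Hypothetical custom architecture dict; the values are illustrative assumptions.
custom_arch = dict(
    embed_dims=768,
    num_layers=12,
    num_heads=12,
    feedforward_channels=3072,
    global_attn_indexes=[2, 5, 8, 11],
)

# Config-style dict using only the parameters documented above.
backbone_cfg = dict(
    type='ViTSAM',
    arch=custom_arch,
    img_size=1024,
    patch_size=16,
    out_channels=256,
    out_indices=-1,
    out_type='raw',
    window_size=14,
)


def _to_pair(value):
    """Turn an int into (int, int); pass a 2-tuple through unchanged."""
    return (value, value) if isinstance(value, int) else tuple(value)


def expected_featmap_shape(cfg, batch_size=1):
    """Estimate (B, C, H, W) for out_type 'raw' or 'featmap'.

    Assumes non-overlapping patches and a dict-style ``arch``.
    """
    height, width = _to_pair(cfg['img_size'])
    patch_h, patch_w = _to_pair(cfg['patch_size'])
    # Assumption: with out_channels == 0 the output keeps embed_dims channels.
    channels = cfg['out_channels'] or cfg['arch']['embed_dims']
    return (batch_size, channels, height // patch_h, width // patch_w)


print(expected_featmap_shape(backbone_cfg))
```

With `out_type="avg_featmap"` the spatial dimensions are averaged away, so only the `(B, C)` part of this estimate applies.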

get_layer_depth(param_name, prefix='')

Get the layer-wise depth of a parameter.

Parameters:

  • param_name (str) – The name of the parameter.

  • prefix (str) – The prefix for the parameter. Defaults to an empty string.


Returns:

The layer-wise depth and the number of layers.

Return type:

Tuple[int, int]


Note: The first depth (layer_depth=0) is the stem module, and the last depth (layer_depth=num_layers-1) is the subsequent module that follows the transformer layers.
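
A common use of a (layer_depth, num_layers) pair is layer-wise learning-rate decay. The sketch below is illustrative only and is not the library's implementation: `toy_layer_depth` and `lr_scale` are hypothetical helpers, the parameter-name patterns (`patch_embed`, `pos_embed`, `layers.<i>`) are assumptions, and the returned layer count is assumed to include the stem and the subsequent module.

```python
def toy_layer_depth(param_name, num_encoder_layers, prefix=''):
    """Hypothetical stand-in that mimics the documented depth convention."""
    # Assumption: stem + encoder layers + subsequent module.
    num_layers = num_encoder_layers + 2
    if not param_name.startswith(prefix):
        # Parameters outside the backbone are treated as the last depth.
        return num_layers - 1, num_layers
    name = param_name[len(prefix):]
    if name.startswith(('patch_embed', 'pos_embed')):
        return 0, num_layers  # stem module
    if name.startswith('layers.'):
        # 'layers.<i>....' maps to depth i + 1.
        return int(name.split('.')[1]) + 1, num_layers
    return num_layers - 1, num_layers  # subsequent module


def lr_scale(layer_depth, num_layers, decay_rate):
    """Scale factor that shrinks the learning rate for shallower layers."""
    return decay_rate ** (num_layers - layer_depth - 1)


for name in ('patch_embed.projection.weight',
             'layers.3.attn.qkv.weight',
             'channel_reduction.0.weight'):
    depth, total = toy_layer_depth(name, num_encoder_layers=12)
    print(name, depth, total, lr_scale(depth, total, decay_rate=0.75))
```

Under this convention the deepest parameters keep the base learning rate (scale 1.0), while the stem receives the smallest scale.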
