BEiTViT

class mmpretrain.models.backbones.BEiTViT(arch='base', img_size=224, patch_size=16, in_channels=3, out_indices=-1, drop_rate=0, drop_path_rate=0, bias='qv_bias', norm_cfg={'eps': 1e-06, 'type': 'LN'}, final_norm=False, out_type='avg_featmap', with_cls_token=True, frozen_stages=-1, use_abs_pos_emb=False, use_rel_pos_bias=True, use_shared_rel_pos_bias=False, interpolate_mode='bicubic', layer_scale_init_value=0.1, patch_cfg={}, layer_cfgs={}, init_cfg=None)

Backbone for BEiT.

A PyTorch implementation of BEiT: BERT Pre-Training of Image Transformers and BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers.

Parameters:

arch (str | dict) –
BEiT architecture. If a string, choose from ‘base’ or ‘large’. If a dict, it should have the following keys:
- embed_dims (int): The dimensions of embedding.
- num_layers (int): The number of transformer encoder layers.
- num_heads (int): The number of heads in attention modules.
- feedforward_channels (int): The hidden dimensions in feedforward modules.
Defaults to ‘base’.
img_size (int | tuple) – The expected input image shape. Because dynamic input shapes are supported, set this argument to the most common input image shape. Defaults to 224.
patch_size (int | tuple) – The patch size in patch embedding. Defaults to 16.
in_channels (int) – The number of input channels. Defaults to 3.
out_indices (Sequence | int) – Indices of the stages to output from. Defaults to -1, which means the last stage.
drop_rate (float) – Probability of an element to be zeroed. Defaults to 0.
drop_path_rate (float) – Stochastic depth rate. Defaults to 0.
bias (bool | str) – Whether to add a learnable bias for q, k, v. If bias is True, a learnable bias is added to all three. If bias is ‘qv_bias’, a learnable bias is added only for q and v. If bias is False, no bias is added for q, k, v. Defaults to ‘qv_bias’.
norm_cfg (dict) – Config dict for normalization layer. Defaults to dict(type='LN', eps=1e-6).
final_norm (bool) – Whether to add an additional layer to normalize the final feature map. Defaults to False.
out_type (str) –
The type of output features. Choose from:
- "cls_token": The class token tensor with shape (B, C).
- "featmap": The feature map tensor from the patch tokens with shape (B, C, H, W).
- "avg_featmap": The global averaged feature map tensor with shape (B, C).
- "raw": The raw feature tensor, including patch tokens and class tokens, with shape (B, L, C).
Defaults to "avg_featmap".
with_cls_token (bool) – Whether to concatenate the class token to the image tokens as transformer input. Defaults to True.
frozen_stages (int) – Stages to be frozen (stop grad and set eval mode). -1 means not freezing any parameters. Defaults to -1.
use_abs_pos_emb (bool) – Use absolute position embedding, as in vanilla ViT. Defaults to False.
use_rel_pos_bias (bool) – Use relative position embedding in each transformer encoder layer. Defaults to True.
use_shared_rel_pos_bias (bool) – Use a shared relative position embedding, so that all transformer encoder layers share the same one. Defaults to False.
layer_scale_init_value (float) – The initialization value for the learnable scaling of attention and FFN. Defaults to 0.1.
interpolate_mode (str) – The interpolation mode used to resize the position embedding vector. Defaults to “bicubic”.
patch_cfg (dict) – Configs of patch embedding. Defaults to an empty dict.
layer_cfgs (Sequence | dict) – Configs of each transformer layer in encoder. Defaults to an empty dict.
init_cfg (dict, optional) – Initialization config dict. Defaults to None.

get_layer_depth(param_name, prefix='')

Get the layer-wise depth of a parameter.

Parameters:

param_name (str) – The name of the parameter.
prefix (str) – The prefix for the parameter. Defaults to an empty string.

Returns:

The layer-wise depth and the number of layers.

Return type:

Tuple[int, int]

Note

The first depth is the stem module (layer_depth=0), and the last depth is the subsequent module (layer_depth=num_layers-1).
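
One plausible use of the returned (layer_depth, num_layers) pair is layer-wise learning-rate decay, where parameters closer to the stem get a smaller learning-rate multiplier. The sketch below shows only that arithmetic. lr_scale is a hypothetical helper, and the decay rate and layer count are illustrative assumptions rather than values taken from this class.

```python
def lr_scale(layer_depth, num_layers, decay_rate=0.75):
    """Hypothetical helper: multiplier that shrinks geometrically toward the stem.

    The last depth (num_layers - 1) keeps the full learning rate, and each
    step toward depth 0 multiplies the scale by decay_rate once more.
    """
    return decay_rate ** (num_layers - 1 - layer_depth)


# Illustrative value only: pretend get_layer_depth reported num_layers = 14.
num_layers = 14
scales = [lr_scale(depth, num_layers) for depth in range(num_layers)]

for depth in (0, num_layers - 1):
    print(f'layer_depth={depth}: scale={scales[depth]:.4f}')
```

With this convention the stem (layer_depth=0) receives the smallest multiplier and the subsequent module (layer_depth=num_layers-1) receives a multiplier of 1.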

static resize_pos_embed(*args, **kwargs)

Interface for backward compatibility.