BEiTV2Neck¶
- class mmpretrain.models.necks.BEiTV2Neck(num_layers=2, early_layers=9, backbone_arch='base', drop_rate=0.0, drop_path_rate=0.0, layer_scale_init_value=0.1, use_rel_pos_bias=False, norm_cfg={'eps': 1e-06, 'type': 'LN'}, init_cfg={'bias': 0, 'layer': 'Linear', 'std': 0.02, 'type': 'TruncNormal'})[source]¶
Neck for BEiTV2 Pre-training.
This module constructs the decoder for the final prediction.
- Parameters:
num_layers (int) – Number of encoder layers of neck. Defaults to 2.
early_layers (int) – The layer index of the early output from the backbone. Defaults to 9.
backbone_arch (str) – Vision Transformer architecture. Defaults to base.
drop_rate (float) – Probability of an element to be zeroed. Defaults to 0.
drop_path_rate (float) – stochastic depth rate. Defaults to 0.
layer_scale_init_value (float) – The initialization value for the learnable scaling of attention and FFN. Defaults to 0.1.
use_rel_pos_bias (bool) – Whether to use a unique relative position bias; if False, use the shared relative position bias defined in the backbone.
norm_cfg (dict) – Config dict for the normalization layer. Defaults to dict(type='LN').
init_cfg (dict, optional) – Initialization config dict. Defaults to None.
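As an illustration of the parameters above, a neck definition in an mmpretrain-style config might look like the sketch below (values are the documented defaults; whether such a fragment fits your overall model config depends on the rest of your setup):

```python
# Hypothetical config fragment built from the documented defaults.
neck = dict(
    type='BEiTV2Neck',
    num_layers=2,             # transformer layers in the neck
    early_layers=9,           # backbone layer index of the early output
    backbone_arch='base',     # should match the backbone's ViT architecture
    drop_rate=0.0,
    drop_path_rate=0.0,
    layer_scale_init_value=0.1,
    use_rel_pos_bias=False,   # fall back to the backbone's shared bias
    norm_cfg=dict(type='LN', eps=1e-6),
)
```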
- forward(inputs, rel_pos_bias, **kwargs)[source]¶
Get the latent prediction and final prediction.
- Parameters:
inputs (Tuple[torch.Tensor]) – Features of tokens.
rel_pos_bias (torch.Tensor) – Shared relative position bias table.
- Returns:
x: The final layer features from the backbone, which are normed in BEiTV2Neck.
x_cls_pt: The early state features from the backbone, which consist of the final layer cls_token and the early state patch tokens from the backbone, and are sent to the PatchAggregation layers in the neck.
- Return type:
Tuple[torch.Tensor, torch.Tensor]
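To make the two return values concrete, here is a minimal pure-Python sketch (no torch, tokens modeled as list entries) of how the neck assembles them; the real module operates on batched torch.Tensor inputs and additionally applies its transformer and norm layers, which this sketch omits:

```python
def beitv2_neck_outputs(early_states, final_states):
    """Mimic BEiTV2Neck's data flow on toy token lists.

    early_states / final_states: lists of tokens where index 0 is the
    cls_token and the remaining entries are patch tokens.
    """
    # x: the final-layer features (normed by the neck in the real module).
    x = list(final_states)
    # x_cls_pt: the final-layer cls_token concatenated with the early-layer
    # patch tokens, which the neck then refines with its own layers.
    x_cls_pt = [final_states[0]] + early_states[1:]
    return x, x_cls_pt

early = ["early_cls", "early_p1", "early_p2"]
final = ["final_cls", "final_p1", "final_p2"]
x, x_cls_pt = beitv2_neck_outputs(early, final)
# x_cls_pt -> ['final_cls', 'early_p1', 'early_p2']
```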