TNT
- class mmpretrain.models.backbones.TNT(arch='b', img_size=224, patch_size=16, in_channels=3, ffn_ratio=4, qkv_bias=False, drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.0, act_cfg={'type': 'GELU'}, norm_cfg={'type': 'LN'}, first_stride=4, num_fcs=2, init_cfg=[{'type': 'TruncNormal', 'layer': 'Linear', 'std': 0.02}, {'type': 'Constant', 'layer': 'LayerNorm', 'val': 1.0, 'bias': 0.0}])
Transformer in Transformer.
A PyTorch implementation of: Transformer in Transformer
Inspired by https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/tnt.py
- Parameters:
arch (str | dict) – TNT architecture. Defaults to 'b'.
img_size (int | tuple) – Input image size. Defaults to 224.
patch_size (int | tuple) – The patch size. Defaults to 16.
in_channels (int) – Number of input channels. Defaults to 3.
ffn_ratio (int) – Ratio used to compute the hidden dimension of the FFN layers. Defaults to 4.
qkv_bias (bool) – Whether to add a bias to the qkv projection. Defaults to False.
drop_rate (float) – Probability of an element to be zeroed after the feed-forward layer. Defaults to 0.
attn_drop_rate (float) – The dropout rate of the attention layers. Defaults to 0.
drop_path_rate (float) – Stochastic depth rate. Defaults to 0.
act_cfg (dict) – The activation config for FFNs. Defaults to GELU.
norm_cfg (dict) – Config dict for the normalization layers. Defaults to layer normalization.
first_stride (int) – The stride of the conv2d layer used for pixel embedding; a conv2d layer followed by an unfold layer converts the image into pixel embeddings. Defaults to 4.
num_fcs (int) – The number of fully-connected layers in the FFNs. Defaults to 2.
init_cfg (dict, optional) – Initialization config dict.
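The `first_stride` mechanism described above (a conv2d layer plus an unfold layer producing per-patch pixel embeddings) can be sketched in plain PyTorch. This is an illustrative approximation, not the exact mmpretrain implementation; the 7x7 kernel, padding, and embedding dimension are assumptions chosen for the sketch:

```python
import torch
import torch.nn as nn

img_size, patch_size, first_stride, embed_dim = 224, 16, 4, 40
num_patches = (img_size // patch_size) ** 2           # 14 * 14 = 196 patches
inner_tokens = (patch_size // first_stride) ** 2      # 4 * 4 = 16 pixel tokens per patch

# Strided conv projects the image to a feature map at 1/first_stride resolution.
proj = nn.Conv2d(3, embed_dim, kernel_size=7, stride=first_stride, padding=3)
# Unfold groups the feature map into non-overlapping per-patch windows.
unfold = nn.Unfold(kernel_size=patch_size // first_stride,
                   stride=patch_size // first_stride)

x = torch.randn(2, 3, img_size, img_size)
y = proj(x)                  # (2, embed_dim, 56, 56)
y = unfold(y)                # (2, embed_dim * inner_tokens, num_patches)
# Rearrange into a sequence of pixel tokens for each patch.
y = y.transpose(1, 2).reshape(2 * num_patches, embed_dim, inner_tokens)
y = y.transpose(1, 2)        # (2 * 196, 16, 40): inner_tokens x embed_dim per patch
print(y.shape)
```

Each patch thus carries its own short sequence of pixel embeddings, which the inner transformer processes before the outer transformer mixes patch-level tokens.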