mmpretrain.models¶
The models package contains several sub-packages for addressing the different components of a model; a sketch of how these pieces compose follows the list below.
classifiers: The top-level module which defines the whole process of a classification model.
selfsup: The top-level module which defines the whole process of a self-supervised learning model.
retrievers: The top-level module which defines the whole process of a retrieval model.
backbones: Usually a feature extraction network, e.g., ResNet, MobileNet.
necks: The component between backbones and heads, e.g., GlobalAveragePooling.
heads: The component for specific tasks.
losses: Loss functions.
peft: The PEFT (Parameter-Efficient Fine-Tuning) module, e.g., LoRAModel.
utils: Some helper functions and common components used in various networks.
data_preprocessor: The component before the model to preprocess the inputs, e.g., ClsDataPreprocessor.
Common Components: Common components used in various networks.
Helper Functions: Helper functions.
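As a rough illustration of how these sub-packages fit together, a classifier config nests a backbone, a neck, a head (which owns the loss), and a data_preprocessor. The values below are a minimal sketch based on the common ResNet-50 ImageNet recipe, not a verbatim copy of any shipped config.

```python
# A minimal sketch of how the sub-packages compose into one model config.
# Concrete argument values are assumptions (ResNet-50 ImageNet settings).
model = dict(
    type='ImageClassifier',                    # classifiers: the top-level module
    data_preprocessor=dict(
        type='ClsDataPreprocessor',            # data_preprocessor: runs before the model
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        to_rgb=True),
    backbone=dict(type='ResNet', depth=50),    # backbones: feature extraction network
    neck=dict(type='GlobalAveragePooling'),    # necks: between backbone and head
    head=dict(
        type='LinearClsHead',                  # heads: the task-specific component
        num_classes=1000,
        in_channels=2048,
        loss=dict(type='CrossEntropyLoss')))   # losses: the loss function
```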
Build Functions¶
build_classifier: Build classifier.
build_backbone: Build backbone.
build_neck: Build neck.
build_head: Build head.
build_loss: Build loss.
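Assuming these build functions are re-exported at mmpretrain.models and, as in the usual MMEngine pattern, each takes a config dict whose type key names a registered class, a minimal sketch of building individual components looks like this:

```python
# Build single components from config dicts and run a dummy forward pass.
# The argument values (depth, num_classes, ...) are illustrative assumptions.
import torch
from mmpretrain.models import build_backbone, build_neck, build_head, build_loss

backbone = build_backbone(dict(type='ResNet', depth=18))
neck = build_neck(dict(type='GlobalAveragePooling'))
head = build_head(dict(type='LinearClsHead', num_classes=10, in_channels=512,
                       loss=dict(type='CrossEntropyLoss')))
criterion = build_loss(dict(type='CrossEntropyLoss'))

feats = backbone(torch.rand(2, 3, 224, 224))   # tuple of feature maps
pooled = neck(feats)                           # tuple of pooled feature vectors
scores = head(pooled)                          # raw class scores, here (2, 10)
```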
Classifiers¶
BaseClassifier: Base class for classifiers.
ImageClassifier: Image classifiers for the supervised classification task.
TimmClassifier: Image classifiers for pytorch-image-models (timm) models.
HuggingFaceClassifier: Image classifiers for HuggingFace models.
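The high-level helper mmpretrain.get_model is usually the quickest way to obtain a ready-made classifier; the model name below is only an example and should be checked against mmpretrain.list_models().

```python
# Sketch of loading a classifier from the model zoo and extracting features.
# 'resnet18_8xb32_in1k' is an assumed zoo name; set pretrained=True to load weights.
import torch
from mmpretrain import get_model

model = get_model('resnet18_8xb32_in1k', pretrained=False)
model.eval()
with torch.no_grad():
    feats = model.extract_feat(torch.rand(1, 3, 224, 224))
```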
Self-supervised Algorithms¶
BaseModel for Self-Supervised Learning.
BEiT v1/v2.
BYOL.
BarlowTwins.
CAE.
DenseCL.
EVA.
iTPN.
MAE.
MILAN.
MaskFeat.
MixMIM.
MoCo.
MoCo v3.
SimCLR.
SimMIM.
SimSiam.
Implementation of SparK.
SwAV.
Some of the above algorithms modify the backbone module to accept extra inputs such as mask. Here is a list of these modified backbone modules (a configuration sketch follows the list).
Vision Transformer for BEiT pre-training.
Vision Transformer for CAE pre-training; the implementation is based on BEiTViT.
HiViT for iTPN pre-training.
HiViT for MAE pre-training.
Vision Transformer for MAE pre-training.
Vision Transformer for MILAN pre-training.
Vision Transformer for MaskFeat pre-training.
MixMIM backbone for MixMIM pre-training.
Vision Transformer for MoCoV3 pre-training.
Swin Transformer for SimMIM pre-training.
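For instance, MAE pairs the mask-aware MAEViT backbone with a decoder neck and a pre-training head. The sketch below is loosely based on the ViT-B/16 MAE recipe; the argument names and values are assumptions, not a verbatim copy of a shipped config.

```python
# A minimal sketch of a self-supervised model config that uses one of the
# mask-adapted backbones listed above. Values follow the common MAE ViT-B/16
# recipe and are assumptions; check the official configs before training.
model = dict(
    type='MAE',
    backbone=dict(type='MAEViT', arch='b', patch_size=16, mask_ratio=0.75),
    neck=dict(
        type='MAEPretrainDecoder',
        patch_size=16,
        in_chans=3,
        embed_dim=768,
        decoder_embed_dim=512,
        decoder_depth=8,
        decoder_num_heads=16),
    head=dict(
        type='MAEPretrainHead',
        norm_pix=True,
        patch_size=16,
        loss=dict(type='PixelReconstructionLoss', criterion='L2')))
```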
Some self-supervised algorithms need an external target generator to produce the optimization target. Here is a list of target generators (a small usage sketch follows the list).
Vector-Quantized Knowledge Distillation.
DALL-E Encoder for feature extraction.
Generate HOG feature for images.
Get the features and attention from the last layer of CLIP.
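As a small usage sketch, a target generator can also be built and applied on its own; it is assumed here that the generator classes are registered in the MODELS registry and take an image batch in their forward call.

```python
# Build a target generator and produce a regression target for masked image
# modeling. HOGGenerator is used with its default arguments (an assumption);
# in a full config the same dict would go into the algorithm's target_generator field.
import torch
from mmpretrain.registry import MODELS

hog = MODELS.build(dict(type='HOGGenerator'))
target = hog(torch.rand(2, 3, 224, 224))   # HOG features serving as the MIM target
print(target.shape)
```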
Retrievers¶
Base class for retriever.
Image To Image Retriever for supervised retrieval task.
Multi-Modality Algorithms¶
BLIP2 Caption.
BLIP2 Retriever.
BLIP2 VQA.
BLIP Caption.
BLIP Grounding.
BLIP NLVR.
BLIP Retriever.
BLIP VQA.
The Open Flamingo model for multiple tasks.
The OFA model for multiple tasks.
The multi-modality model of MiniGPT-4.
The LLaVA model for multiple tasks.
The Otter model for multiple tasks.
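For the multi-modality models, the high-level inference helpers are usually the easiest entry point. The model name below is an assumption and should be verified with list_models() first.

```python
# Sketch of running a captioning model through the high-level inference API.
# 'blip-base_3rdparty_caption' is an assumed model-zoo name and 'demo.jpg' a
# placeholder image path; verify both before running.
from mmpretrain import list_models, inference_model

print(list_models('*blip*'))   # discover available BLIP checkpoints
result = inference_model('blip-base_3rdparty_caption', 'demo.jpg')
print(result)
```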
Backbones¶
AlexNet backbone.
Backbone for BEiT.
CSP-Darknet backbone used in YOLOv4.
The abstract CSP Network class.
CSP-ResNeXt backbone.
CSP-ResNet backbone.
Conformer backbone.
ConvMixer.
ConvNeXt v1&v2 backbone.
DaViT.
DeiT3 backbone.
DenseNet.
Distilled Vision Transformer.
EdgeNeXt.
EfficientFormer.
EfficientNet backbone.
EfficientNetV2 backbone.
HiViT.
HRNet backbone.
HorNet backbone.
Inception V3 backbone.
LeNet5 backbone.
LeViT backbone.
Multi-scale ViT v2.
Mlp-Mixer backbone.
MobileNetV2 backbone.
MobileNetV3 backbone.
MobileOne backbone.
MobileViT backbone.
The backbone of Twins-PCPVT.
PoolFormer.
Pyramid Vision GNN backbone.
RegNet backbone.
RepLKNet backbone.
RepMLPNet backbone.
RepVGG backbone.
Res2Net backbone.
ResNeSt backbone.
ResNeXt backbone.
ResNet backbone.
ResNetV1c backbone.
ResNetV1d backbone.
ResNet backbone for CIFAR.
Reversible Vision Transformer.
SEResNeXt backbone.
SEResNet backbone.
The backbone of Twins-SVT.
ShuffleNetV1 backbone.
ShuffleNetV2 backbone.
ResNet with sparse module conversion function.
ConvNeXt with sparse module conversion function.
Swin Transformer.
Swin Transformer V2.
Tokens-to-Token Vision Transformer (T2T-ViT).
Wrapper to use backbones from timm library.
Transformer in Transformer.
Visual Attention Network.
VGG backbone.
Vision GNN backbone.
Vision Transformer.
Vision Transformer as image encoder used in SAM.
XCiT backbone.
EVA02 Vision Transformer.
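Every backbone can also be built and used on its own; the sketch below requests all four ResNet stages explicitly, and the printed shapes assume the standard 224x224 input and ResNet strides.

```python
# Build a backbone from the registry and inspect its multi-scale outputs.
import torch
from mmpretrain.registry import MODELS

backbone = MODELS.build(dict(type='ResNet', depth=50, out_indices=(0, 1, 2, 3)))
backbone.eval()
with torch.no_grad():
    feats = backbone(torch.rand(1, 3, 224, 224))
for f in feats:
    print(f.shape)   # (1, 256, 56, 56) ... (1, 2048, 7, 7)
```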
Necks¶
Neck for BEiTV2 Pre-training.
Neck for CAE Pre-training.
Normalize cls token across batch before head.
The non-linear neck of DenseCL.
Generalized Mean Pooling neck.
Global Average Pooling neck.
Fuse feature map of multiple scales in HRNet.
Linear neck with dimension projection.
Decoder for MAE Pre-training.
Prompt decoder for MILAN.
Decoder for MixMIM Pre-training.
The non-linear neck of MoCo v2: fc-relu-fc.
The non-linear neck.
Linear decoder for SimMIM Pre-training.
The non-linear neck of SwAV: fc-bn-relu-fc-normalization.
The neck module of iTPN (transformer pyramid network).
The decoder for SparK, which upsamples the feature maps.
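Most necks consume the tuple of feature maps produced by a backbone and return a tuple in turn; a minimal sketch with GlobalAveragePooling (the input shape mimics a stage-4 ResNet-50 feature map):

```python
# Pool a backbone-style feature map into a flat vector with a neck module.
import torch
from mmpretrain.registry import MODELS

neck = MODELS.build(dict(type='GlobalAveragePooling'))
feats = (torch.rand(2, 2048, 7, 7),)   # fake (N, C, H, W) stage-4 feature map
pooled = neck(feats)
print(pooled[0].shape)                 # torch.Size([2, 2048])
```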
Heads¶
ArcFace classifier head.
Head for BEiT v1 Pre-training.
Head for BEiT v2 Pre-training.
Head for CAE Pre-training.
Class-specific residual attention classifier head.
Classification head.
Linear classifier head.
Head for contrastive learning.
Distilled Vision Transformer classifier head.
EfficientFormer classifier head.
Head for latent feature cross correlation.
Head for latent feature prediction.
Linear classifier head.
Head for MAE Pre-training.
Pre-training head for Masked Image Modeling.
Head for MixMIM Pre-training.
Head for MoCo v3 Pre-training.
Classification head for multilabel task.
Linear classification head for multilabel task.
Multi task head.
Head for SimMIM Pre-training.
Classifier head with several hidden fc layers and an output fc layer.
Head for SwAV Pre-training.
The classification head for Vision GNN.
Vision Transformer classifier head.
Head for iTPN Pre-training using CLIP.
Pre-training head for SparK.
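Heads consume the tuple produced by the neck. Assuming the plain forward of LinearClsHead returns the classification logits (loss computation normally goes through the head's loss-related methods), a minimal sketch:

```python
# Run a classification head on pooled features to obtain raw class scores.
# num_classes and in_channels are illustrative assumptions.
import torch
from mmpretrain.registry import MODELS

head = MODELS.build(dict(
    type='LinearClsHead',
    num_classes=10,
    in_channels=2048,
    loss=dict(type='CrossEntropyLoss', loss_weight=1.0)))
scores = head((torch.rand(2, 2048),))   # feature tuple in, logits out
print(scores.shape)                     # torch.Size([2, 10])
```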
Losses¶
Asymmetric loss.
Loss function for CAE.
Cosine similarity loss function.
Cross correlation loss function.
Cross entropy loss.
Focal loss.
Initializer for the label-smoothed cross entropy loss.
Loss for the reconstruction of pixels in Masked Image Modeling.
Implementation of seesaw loss.
The loss for SwAV.
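The loss modules follow the usual MM-series convention of a forward that takes predictions and targets plus optional weighting arguments; a minimal sketch with CrossEntropyLoss (the keyword names are assumptions from that convention):

```python
# Build a loss from a config dict and apply it to dummy logits and labels.
import torch
from mmpretrain.registry import MODELS

criterion = MODELS.build(dict(type='CrossEntropyLoss', loss_weight=1.0))
cls_score = torch.rand(4, 10)        # raw logits for 4 samples, 10 classes
label = torch.randint(0, 10, (4,))
loss = criterion(cls_score, label)
print(loss)                          # scalar loss tensor
```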
PEFT¶
Implements LoRA in a module.
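A sketch of wrapping a backbone with LoRAModel is shown below; the argument names (module, alpha, rank, drop_rate, targets) are assumptions based on common descriptions of this class, so check its docstring before relying on them.

```python
# Config sketch: wrap a ViT backbone with LoRA so only the injected low-rank
# branches are trainable. Argument names and the target pattern are
# assumptions; verify against the LoRAModel API documentation.
backbone = dict(
    type='LoRAModel',
    module=dict(type='VisionTransformer', arch='b', patch_size=16),
    alpha=4,
    rank=4,
    drop_rate=0.1,
    targets=[dict(type='.*qkv')])   # regex selecting the linear layers to adapt
```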
models.utils¶
This package includes some helper functions and common components used in various networks.
Common Components¶
The Conditional Position Encoding (CPE) module.
CosineEMA is implemented for updating the momentum parameter, used in BYOL, MoCoV3, etc.
CNN Feature Map Embedding.
Inverted Residual Block.
LayerScale layer.
Multi-head Attention Module.
Image to Patch Embedding.
Merge patch feature map.
Squeeze-and-Excitation Module.
Shift Window Multihead Self-Attention Module.
Window based multi-head self-attention (W-MSA) module with relative position bias.
Window based multi-head self-attention (W-MSA) module with relative position bias.
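These building blocks can be used directly in custom modules; a small sketch with the multi-head attention component, assuming the constructor follows the usual MM-series argument names (embed_dims, num_heads):

```python
# Apply the self-attention building block to a fake token sequence.
import torch
from mmpretrain.models.utils import MultiheadAttention

attn = MultiheadAttention(embed_dims=64, num_heads=8)
tokens = torch.rand(2, 49, 64)   # (batch, num_tokens, embed_dims)
out = attn(tokens)               # self-attention output, same shape as input
print(out.shape)
```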
Helper Functions¶
Channel Shuffle operation.
Determine whether the model is called during the tracing of code with torch.jit.trace.
Make divisible function.
Resize pos_embed weights.
Resize relative position bias table.
A to_tuple function generator.
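Two of the helpers in a quick sketch; the signatures (value/divisor for make_divisible, an integer for the to_ntuple factory) are assumed from common MM-series usage:

```python
# make_divisible rounds a channel count to a multiple of the divisor without
# dropping too far below the original value (the MobileNet-style rule);
# to_ntuple builds a function that expands a scalar into an n-tuple.
from mmpretrain.models.utils import make_divisible, to_ntuple

print(make_divisible(34, 8))   # -> 32
to_2tuple = to_ntuple(2)
print(to_2tuple(7))            # -> (7, 7)
```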