备注
您正在阅读 MMClassification 0.x 版本的文档。MMClassification 0.x 会在 2022 年末被切换为次要分支。建议您升级到 MMClassification 1.0 版本,体验更多新特性和新功能。请查阅 MMClassification 1.0 的安装教程、迁移教程以及更新日志。
欢迎来到 MMClassification 中文教程!¶
您可以在页面左下角切换中英文文档。
依赖环境¶
在本节中,我们将演示如何准备 PyTorch 相关的依赖环境。
MMClassification 适用于 Linux、Windows 和 macOS。它需要 Python 3.6+、CUDA 9.2+ 和 PyTorch 1.5+。
备注
如果你对配置 PyTorch 环境已经很熟悉,并且已经完成了配置,可以直接进入下一节。 否则的话,请依照以下步骤完成配置。
第 1 步 从官网下载并安装 Miniconda。
第 2 步 创建一个 conda 虚拟环境并激活它。
conda create --name openmmlab python=3.8 -y
conda activate openmmlab
第 3 步 按照官方指南安装 PyTorch。例如:
在 GPU 平台:
conda install pytorch torchvision -c pytorch
警告
以上命令会自动安装最新版的 PyTorch 与对应的 cudatoolkit,请检查它们是否与你的环境匹配。
在 CPU 平台:
conda install pytorch torchvision cpuonly -c pytorch
安装¶
我们推荐用户按照我们的最佳实践来安装 MMClassification。但除此之外,如果你想根据 你的习惯完成安装流程,也可以参见自定义安装一节来获取更多信息。
最佳实践¶
第 1 步 使用 MIM 安装 MMCV
pip install -U openmim
mim install mmcv-full
第 2 步 安装 MMClassification
根据具体需求,我们支持两种安装模式:
从源码安装(推荐):希望基于 MMClassification 框架开发自己的图像分类任务,需要添加新的功能,比如新的模型或是数据集,或者使用我们提供的各种工具。
作为 Python 包安装:只是希望调用 MMClassification 的 API 接口,或者在自己的项目中导入 MMClassification 中的模块。
从源码安装¶
这种情况下,从源码按如下方式安装 mmcls:
git clone https://github.com/open-mmlab/mmclassification.git
cd mmclassification
pip install -v -e .
# "-v" 表示输出更多安装相关的信息
# "-e" 表示以可编辑形式安装,这样可以在不重新安装的情况下,让本地修改直接生效
另外,如果你希望向 MMClassification 贡献代码,或者使用试验中的功能,请签出到 dev 分支。
git checkout dev
作为 Python 包安装¶
直接使用 pip 安装即可。
pip install mmcls
验证安装¶
为了验证 MMClassification 的安装是否正确,我们提供了一些示例代码来执行模型推理。
第 1 步 我们需要下载配置文件和模型权重文件
mim download mmcls --config resnet50_8xb32_in1k --dest .
第 2 步 验证示例的推理流程
如果你是从源码安装的 mmcls,那么直接运行以下命令进行验证:
python demo/image_demo.py demo/demo.JPEG resnet50_8xb32_in1k.py resnet50_8xb32_in1k_20210831-ea4938fc.pth --device cpu
你可以看到命令行中输出了结果字典,包括 pred_label、pred_score 和 pred_class 三个字段。另外,如果你拥有图形界面(而不是使用远程终端),那么可以启用 --show 选项,将示例图像和对应的预测结果在窗口中进行显示。
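例如,下面的命令演示了如何开启 --show 选项(配置文件与权重文件沿用上文下载的文件,具体支持的参数请以 python demo/image_demo.py --help 的输出为准):
python demo/image_demo.py demo/demo.JPEG resnet50_8xb32_in1k.py resnet50_8xb32_in1k_20210831-ea4938fc.pth --show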
如果你是作为 Python 包安装,那么可以打开你的 Python 解释器,并粘贴如下代码:
from mmcls.apis import init_model, inference_model
config_file = 'resnet50_8xb32_in1k.py'
checkpoint_file = 'resnet50_8xb32_in1k_20210831-ea4938fc.pth'
model = init_model(config_file, checkpoint_file, device='cpu') # 或者 device='cuda:0'
inference_model(model, 'demo/demo.JPEG')
你会看到输出一个字典,包含预测的标签、得分及类别名。
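如果希望进一步把推理结果绘制在图像上,可以参考下面的示意代码(show_result_pyplot 为 mmcls.apis 中的可视化接口,具体参数请以实际 API 文档为准):
from mmcls.apis import init_model, inference_model, show_result_pyplot

config_file = 'resnet50_8xb32_in1k.py'
checkpoint_file = 'resnet50_8xb32_in1k_20210831-ea4938fc.pth'
model = init_model(config_file, checkpoint_file, device='cpu')
result = inference_model(model, 'demo/demo.JPEG')
show_result_pyplot(model, 'demo/demo.JPEG', result)  # 在窗口中绘制图像与预测结果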
自定义安装¶
CUDA 版本¶
安装 PyTorch 时,需要指定 CUDA 版本。如果您不清楚选择哪个,请遵循我们的建议:
对于 Ampere 架构的 NVIDIA GPU,例如 GeForce 30 series 以及 NVIDIA A100,CUDA 11 是必需的。
对于更早的 NVIDIA GPU,CUDA 11 是向前兼容的,但 CUDA 10.2 能够提供更好的兼容性,也更加轻量。
请确保你的 GPU 驱动版本满足最低的版本需求,参阅这张表。
备注
如果按照我们的最佳实践进行安装,CUDA 运行时库就足够了,因为我们提供相关 CUDA 代码的预编译,你不需要进行本地编译。
但如果你希望从源码进行 MMCV 的编译,或是进行其他 CUDA 算子的开发,那么就必须安装完整的 CUDA 工具链,参见 NVIDIA 官网,另外还需要确保该 CUDA 工具链的版本与 PyTorch 安装时的配置相匹配(如用 conda install 安装 PyTorch 时指定的 cudatoolkit 版本)。
不使用 MIM 安装 MMCV¶
MMCV 包含 C++ 和 CUDA 扩展,因此其对 PyTorch 的依赖比较复杂。MIM 会自动解析这些 依赖,选择合适的 MMCV 预编译包,使安装更简单,但它并不是必需的。
要使用 pip 而不是 MIM 来安装 MMCV,请遵照 MMCV 安装指南。 它需要你用指定 url 的形式手动指定对应的 PyTorch 和 CUDA 版本。
举个例子,如下命令将会安装基于 PyTorch 1.10.x 和 CUDA 11.3 编译的 mmcv-full。
pip install mmcv-full -f https://download.openmmlab.com/mmcv/dist/cu113/torch1.10/index.html
在 CPU 环境中安装¶
MMClassification 可以仅在 CPU 环境中安装,在 CPU 模式下,你可以完成训练(需要 MMCV 版本 >= 1.4.4)、测试和模型推理等所有操作。
在 CPU 模式下,MMCV 的部分功能将不可用,通常是一些 GPU 编译的算子。不过不用担心, MMClassification 中几乎所有的模型都不会依赖这些算子。
在 Google Colab 中安装¶
Google Colab 通常已经包含了 PyTorch 环境,因此我们只需要安装 MMCV 和 MMClassification 即可,命令如下:
第 1 步 安装 MMCV
!pip3 install openmim
!mim install mmcv-full
第 2 步 从源码安装 MMClassification
!git clone https://github.com/open-mmlab/mmclassification.git
%cd mmclassification
!pip install -e .
第 3 步 验证
import mmcls
print(mmcls.__version__)
# 预期输出: 0.23.0 或更新的版本号
备注
在 Jupyter 中,感叹号 ! 用于执行外部命令,而 %cd 是一个魔术命令,用于切换 Python 的工作路径。
通过 Docker 使用 MMClassification¶
MMClassification 提供 Dockerfile 用于构建镜像。请确保你的 Docker 版本 >=19.03。
# 构建默认的 PyTorch 1.8.1,CUDA 10.2 版本镜像
# 如果你希望使用其他版本,请修改 Dockerfile
docker build -t mmclassification docker/
用以下命令运行 Docker 镜像:
docker run --gpus all --shm-size=8g -it -v {DATA_DIR}:/mmclassification/data mmclassification
故障解决¶
基础教程¶
本文档提供 MMClassification 相关用法的基本教程。
准备数据集¶
MMClassification 建议用户将数据集根目录链接到 $MMCLASSIFICATION/data 下。
如果用户的文件夹结构与默认结构不同,则需要在配置文件中进行对应路径的修改。
mmclassification
├── mmcls
├── tools
├── configs
├── docs
├── data
│ ├── imagenet
│ │ ├── meta
│ │ ├── train
│ │ ├── val
│ ├── cifar
│ │ ├── cifar-10-batches-py
│ ├── mnist
│ │ ├── train-images-idx3-ubyte
│ │ ├── train-labels-idx1-ubyte
│ │ ├── t10k-images-idx3-ubyte
│ │ ├── t10k-labels-idx1-ubyte
对于 ImageNet,其存在多个版本,但最为常用的一个是 ILSVRC 2012,可以通过以下步骤获取该数据集。
注册账号并登录 下载页面
获取 ILSVRC2012 下载链接并下载以下文件
ILSVRC2012_img_train.tar (~138GB)
ILSVRC2012_img_val.tar (~6.3GB)
解压下载的文件
使用 该脚本 获取元数据
对于 MNIST,CIFAR10 和 CIFAR100,程序将会在需要的时候自动下载数据集。
对于用户自定义数据集的准备,请参阅 教程 3:如何自定义数据集
使用预训练模型进行推理¶
MMClassification 提供了一些脚本用于进行单张图像的推理、数据集的推理和数据集的测试(如 ImageNet 等)
单张图像的推理¶
python demo/image_demo.py ${IMAGE_FILE} ${CONFIG_FILE} ${CHECKPOINT_FILE}
# Example
python demo/image_demo.py demo/demo.JPEG configs/resnet/resnet50_8xb32_in1k.py \
https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb32_in1k_20210831-ea4938fc.pth
数据集的推理与测试¶
支持单 GPU
支持 CPU
支持单节点多 GPU
支持多节点
用户可使用以下命令进行数据集的推理:
# 单 GPU
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [--metrics ${METRICS}] [--out ${RESULT_FILE}]
# CPU: 禁用 GPU 并运行单 GPU 测试脚本
export CUDA_VISIBLE_DEVICES=-1
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [--metrics ${METRICS}] [--out ${RESULT_FILE}]
# 多 GPU
./tools/dist_test.sh ${CONFIG_FILE} ${CHECKPOINT_FILE} ${GPU_NUM} [--metrics ${METRICS}] [--out ${RESULT_FILE}]
# 基于 slurm 分布式环境的多节点
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [--metrics ${METRICS}] [--out ${RESULT_FILE}] --launcher slurm
可选参数:
RESULT_FILE:输出结果的文件名。如果未指定,结果将不会保存到文件中。支持 json、yaml、pickle 格式。
METRICS:数据集测试指标,如准确率 (accuracy)、精确率 (precision)、召回率 (recall) 等。
例子:
在 CIFAR10 验证集上,使用 ResNet-50 进行推理并获得预测标签及其对应的预测得分。
python tools/test.py configs/resnet/resnet50_8xb16_cifar10.py \
https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_b16x8_cifar10_20210528-f54bfad9.pth \
--out result.pkl
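测试完成后,可以使用 mmcv.load 读取保存的结果文件做进一步分析。下面是一个简单示意(结果的具体结构取决于所使用的指标与数据集):
import mmcv

results = mmcv.load('result.pkl')  # 读取上面命令保存的预测结果
print(type(results), len(results))  # 查看结果的类型与数量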
模型训练¶
MMClassification 使用 MMDistributedDataParallel 进行分布式训练,使用 MMDataParallel 进行非分布式训练。
所有的输出(日志文件和模型权重文件)都会被保存到工作目录下。工作目录通过配置文件中的参数 work_dir 指定。
默认情况下,MMClassification 在每个周期后会在验证集上评估模型,可以通过在训练配置中修改 interval 参数来更改评估间隔:
evaluation = dict(interval=12) # 每进行 12 轮训练后评估一次模型
使用单个 GPU 进行训练¶
python tools/train.py ${CONFIG_FILE} [optional arguments]
如果用户想在命令中指定工作目录,则需要增加参数 --work-dir ${YOUR_WORK_DIR}
使用 CPU 训练¶
使用 CPU 训练的流程和使用单 GPU 训练的流程一致,我们仅需要在训练流程开始前禁用 GPU。
export CUDA_VISIBLE_DEVICES=-1
之后运行单 GPU 训练脚本即可。
警告
我们不推荐用户使用 CPU 进行训练,这太过缓慢。我们支持这个功能是为了方便用户在没有 GPU 的机器上进行调试。
使用单台机器多个 GPU 进行训练¶
./tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} [optional arguments]
可选参数为:
--no-validate(不建议):默认情况下,程序将会在训练期间的每 k(默认为 1)个周期进行一次验证,要禁用这一功能,使用 --no-validate。
--work-dir ${WORK_DIR}:覆盖配置文件中指定的工作目录。
--resume-from ${CHECKPOINT_FILE}:从以前的模型权重文件恢复训练。
resume-from 和 load-from 的不同点:
resume-from 加载模型参数和优化器状态,并且保留检查点所在的周期数,常被用于恢复意外被中断的训练。
load-from 只加载模型参数,但周期数从 0 开始计数,常被用于微调模型。
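例如,下面的命令演示了如何从一个中断的训练中恢复(权重文件路径仅为示意,请替换为实际保存的检查点):
python tools/train.py configs/resnet/resnet50_8xb32_in1k.py \
    --resume-from work_dirs/resnet50_8xb32_in1k/latest.pth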
使用多台机器进行训练¶
如果您想使用由 ethernet 连接起来的多台机器, 您可以使用以下命令:
在第一台机器上:
NNODES=2 NODE_RANK=0 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR sh tools/dist_train.sh $CONFIG $GPUS
在第二台机器上:
NNODES=2 NODE_RANK=1 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR sh tools/dist_train.sh $CONFIG $GPUS
但是,如果您不使用高速网路连接这几台机器的话,训练将会非常慢。
如果用户在 slurm 集群上运行 MMClassification,可使用 slurm_train.sh 脚本(该脚本也支持单台机器上进行训练)。
[GPUS=${GPUS}] ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${WORK_DIR}
用户可以在 slurm_train.sh 中检查所有的参数和环境变量
如果用户的多台机器通过 Ethernet 连接,则可以参考 pytorch launch utility。如果用户没有高速网络,如 InfiniBand,速度将会非常慢。
使用单台机器启动多个任务¶
如果用使用单台机器启动多个任务,如在有 8 块 GPU 的单台机器上启动 2 个需要 4 块 GPU 的训练任务,则需要为每个任务指定不同端口,以避免通信冲突。
如果用户使用 dist_train.sh 脚本启动训练任务,则可以通过以下命令指定端口:
CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 ./tools/dist_train.sh ${CONFIG_FILE} 4
CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 ./tools/dist_train.sh ${CONFIG_FILE} 4
如果用户在 slurm 集群下启动多个训练任务,则需要修改配置文件中的 dist_params 变量,以设置不同的通信端口。
在 config1.py 中,
dist_params = dict(backend='nccl', port=29500)
在 config2.py 中,
dist_params = dict(backend='nccl', port=29501)
之后便可启动两个任务,分别对应 config1.py 和 config2.py。
CUDA_VISIBLE_DEVICES=0,1,2,3 GPUS=4 ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config1.py ${WORK_DIR}
CUDA_VISIBLE_DEVICES=4,5,6,7 GPUS=4 ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config2.py ${WORK_DIR}
实用工具¶
我们在 tools/ 目录下提供了一些对训练和测试十分有用的工具。
计算 FLOPs 和参数量(试验性的)¶
我们根据 flops-counter.pytorch 提供了一个脚本用于计算给定模型的 FLOPs 和参数量
python tools/analysis_tools/get_flops.py ${CONFIG_FILE} [--shape ${INPUT_SHAPE}]
用户将获得如下结果:
==============================
Input shape: (3, 224, 224)
Flops: 4.12 GFLOPs
Params: 25.56 M
==============================
警告
此工具仍处于试验阶段,我们不保证该数字正确无误。您最好将结果用于简单比较,但在技术报告或论文中采用该结果之前,请仔细检查。
FLOPs 与输入的尺寸有关,而参数量与输入尺寸无关。默认输入尺寸为 (1, 3, 224, 224)。
一些运算不会被计入 FLOPs 的统计中,例如 GN 和自定义运算。详细信息请参考 mmcv.cnn.get_model_complexity_info()。
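如果希望在自己的脚本中直接调用该接口,可以参考下面的示意代码(这里用 torchvision 的 ResNet-50 构建模型仅作演示,实际使用时请根据自己的配置文件构建模型):
from mmcv.cnn import get_model_complexity_info
from torchvision.models import resnet50  # 仅作演示用的模型

model = resnet50()
model.eval()
# input_shape 不包含 batch 维度
flops, params = get_model_complexity_info(model, (3, 224, 224))
print(f'Flops: {flops}, Params: {params}')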
模型发布¶
在发布模型之前,你也许会需要
转换模型权重至 CPU 张量
删除优化器状态
计算模型权重文件的哈希值,并添加至文件名之后
python tools/convert_models/publish_model.py ${INPUT_FILENAME} ${OUTPUT_FILENAME}
例如:
python tools/convert_models/publish_model.py work_dirs/resnet50/latest.pth imagenet_resnet50.pth
最终输出的文件名将会是 imagenet_resnet50_{date}-{hash id}.pth
详细教程¶
目前,MMClassification 提供以下几种更详细的教程:
教程 1:如何编写配置文件¶
MMClassification 主要使用 python 文件作为配置文件。其配置文件系统的设计将模块化与继承整合进来,方便用户进行各种实验。所有配置文件都放置在 configs 文件夹下,主要包含 _base_ 原始配置文件夹以及 resnet、swin_transformer、vision_transformer 等诸多算法文件夹。
可以使用 python tools/misc/print_config.py /PATH/TO/CONFIG 命令来查看完整的配置信息,从而方便检查所对应的配置文件。
配置文件以及权重命名规则¶
MMClassification 按照以下风格进行配置文件命名,代码库的贡献者需要遵循相同的命名规则。文件名总体分为四部分:算法信息、模块信息、训练信息和数据信息。逻辑上属于不同部分的单词之间用下划线 '_' 连接,同一部分有多个单词时用短横线 '-' 连接。
{algorithm info}_{module info}_{training info}_{data info}.py
algorithm info:算法信息,算法名称或者网络架构,如 resnet 等;
module info:模块信息,因任务而异,用以表示一些特殊的 neck、head 和 pretrain 信息;
training info:一些训练信息,训练策略设置,包括 batch size、schedule、数据增强等;
data info:数据信息,数据集名称、模态、输入尺寸等,如 imagenet、cifar 等;
算法信息¶
指论文中的算法名称缩写,以及相应的分支架构信息。例如:
resnet50
mobilenet-v3-large
vit-small-patch32:patch32 表示 ViT 切分的分块大小
seresnext101-32x4d:SeResNet101 为基本网络结构,32x4d 表示在 Bottleneck 中 groups 和 width_per_group 分别为 32 和 4
模块信息¶
指一些特殊的 neck、head 或者 pretrain 的信息,在分类中常见为预训练信息,比如:
in21k-pre:在 ImageNet21k 上预训练
in21k-pre-3rd-party:在 ImageNet21k 上预训练,其权重来自其他仓库
训练信息¶
训练策略的一些设置,包括训练类型、batch size、lr schedule、数据增强以及特殊的损失函数等等,比如:
Batch size 信息:格式为 {gpu x batch_per_gpu},如 8xb32
训练类型(主要见于 transformer 网络,如 ViT 算法,这类算法通常分为预训练和微调两种模式):
ft:Finetune config,用于微调的配置文件
pt:Pretrain config,用于预训练的配置文件
训练策略信息,训练策略以复现配置文件为基础,此基础不必标注训练策略。但如果在此基础上进行改进,则需注明训练策略,按照应用点位顺序排列,如:{pipeline aug}-{train aug}-{loss trick}-{scheduler}-{epochs}
coslr-200e:使用 cosine scheduler,训练 200 个 epoch
autoaug-mixup-lbs-coslr-50e:使用了 autoaug、mixup、label smooth、cosine scheduler,训练了 50 个轮次
数据信息¶
in1k:ImageNet1k 数据集,默认使用 224x224 大小的图片
in21k:ImageNet21k 数据集,有些地方也称为 ImageNet22k 数据集,默认使用 224x224 大小的图片
in1k-384px:表示训练的输入图片大小为 384x384
cifar100
配置文件命名案例:¶
repvgg-D2se_deploy_4xb64-autoaug-lbs-mixup-coslr-200e_in1k.py
repvgg-D2se:算法信息
repvgg:主要算法名称。
D2se:模型的结构。
deploy:模块信息,该模型为推理状态。
4xb64-autoaug-lbs-mixup-coslr-200e:训练信息
4xb64:使用 4 块 GPU,并且每块 GPU 的批大小为 64。
autoaug:使用 AutoAugment 数据增强方法。
lbs:使用 label smoothing 损失函数。
mixup:使用 mixup 训练增强方法。
coslr:使用 cosine scheduler 优化策略。
200e:训练 200 轮次。
in1k:数据信息。配置文件用于 ImageNet1k 数据集上使用 224x224 大小图片训练。
备注
部分配置文件目前还没有遵循此命名规范,相关文件命名近期会更新。
权重命名规则¶
权重的命名主要包括配置文件名,日期和哈希值。
{config_name}_{date}-{hash}.pth
配置文件结构¶
在 configs/_base_ 文件夹下有 4 个基本组件类型,分别是模型(models)、数据集(datasets)、训练策略(schedules)和运行设置(default_runtime)。
你可以通过继承一些基本配置文件轻松构建自己的训练配置文件。由来自 _base_ 的组件组成的配置称为 primitive。
为了帮助用户对 MMClassification 配置系统中的完整配置和模块有一个基本的了解,我们使用 ResNet50 原始配置文件作为案例进行说明并注释每一行含义。更详细的用法和各个模块对应的替代方案,请参考 API 文档。
_base_ = [
'../_base_/models/resnet50.py', # 模型
'../_base_/datasets/imagenet_bs32.py', # 数据
'../_base_/schedules/imagenet_bs256.py', # 训练策略
'../_base_/default_runtime.py' # 默认运行设置
]
下面对这四个部分分别进行说明,仍然以上述 ResNet50 原始配置文件作为案例。
模型¶
模型参数 model 在配置文件中为一个 python 字典,主要包括网络结构、损失函数等信息:
type:分类器名称,目前 MMClassification 只支持 ImageClassifier,参考 API 文档。
backbone:主干网络类型,可用选项参考 API 文档。
neck:颈网络类型,目前 MMClassification 只支持 GlobalAveragePooling,参考 API 文档。
head:头网络类型,包括单标签分类与多标签分类头网络,可用选项参考 API 文档。
loss:损失函数类型,支持 CrossEntropyLoss、LabelSmoothLoss 等,可用选项参考 API 文档。
备注
配置文件中的 ‘type’ 不是构造时的参数,而是类名。
model = dict(
type='ImageClassifier', # 分类器类型
backbone=dict(
type='ResNet', # 主干网络类型
depth=50, # 主干网网络深度, ResNet 一般有18, 34, 50, 101, 152 可以选择
num_stages=4, # 主干网络状态(stages)的数目,这些状态产生的特征图作为后续的 head 的输入。
out_indices=(3, ), # 输出的特征图输出索引。越远离输入图像,索引越大
frozen_stages=-1, # 网络微调时,冻结网络的 stage(训练时不执行反向传播算法),若 num_stages=4,backbone 包含 stem 与 4 个 stages。frozen_stages 为 -1 时,不冻结网络;为 0 时,冻结 stem;为 1 时,冻结 stem 和 stage1;为 4 时,冻结整个 backbone
style='pytorch'), # 主干网络的风格,'pytorch' 意思是步长为2的层为 3x3 卷积, 'caffe' 意思是步长为2的层为 1x1 卷积。
neck=dict(type='GlobalAveragePooling'), # 颈网络类型
head=dict(
type='LinearClsHead', # 线性分类头,
num_classes=1000, # 输出类别数,这与数据集的类别数一致
in_channels=2048, # 输入通道数,这与 neck 的输出通道一致
loss=dict(type='CrossEntropyLoss', loss_weight=1.0), # 损失函数配置信息
topk=(1, 5), # 评估指标,Top-k 准确率, 这里为 top1 与 top5 准确率
))
数据¶
数据参数 data 在配置文件中为一个 python 字典,主要包含构造数据集加载器(dataloader)的配置信息:
samples_per_gpu:构建 dataloader 时,每个 GPU 的 Batch Size
workers_per_gpu:构建 dataloader 时,每个 GPU 的线程数
train | val | test:构造数据集
type:数据集类型,MMClassification 支持 ImageNet、Cifar 等,参考 API 文档
data_prefix:数据集根目录
pipeline:数据处理流水线,参考相关教程文档 如何设计数据处理流水线
评估参数 evaluation 也是一个字典,为 evaluation hook 的配置信息,主要包括评估间隔、评估指标等。
# dataset settings
dataset_type = 'ImageNet' # 数据集名称,
img_norm_cfg = dict( #图像归一化配置,用来归一化输入的图像。
mean=[123.675, 116.28, 103.53], # 预训练里用于预训练主干网络模型的平均值。
std=[58.395, 57.12, 57.375], # 预训练里用于预训练主干网络模型的标准差。
to_rgb=True) # 是否反转通道,使用 cv2, mmcv 读取图片默认为 BGR 通道顺序,这里 Normalize 均值方差数组的数值是以 RGB 通道顺序, 因此需要反转通道顺序。
# 训练数据流水线
train_pipeline = [
dict(type='LoadImageFromFile'), # 读取图片
dict(type='RandomResizedCrop', size=224), # 随机缩放抠图
dict(type='RandomFlip', flip_prob=0.5, direction='horizontal'), # 以概率为0.5随机水平翻转图片
dict(type='Normalize', **img_norm_cfg), # 归一化
dict(type='ImageToTensor', keys=['img']), # image 转为 torch.Tensor
dict(type='ToTensor', keys=['gt_label']), # gt_label 转为 torch.Tensor
dict(type='Collect', keys=['img', 'gt_label']) # 决定数据中哪些键应该传递给检测器的流程
]
# 测试数据流水线
test_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='Resize', size=(256, -1)),
dict(type='CenterCrop', crop_size=224),
dict(type='Normalize', **img_norm_cfg),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img']) # test 时不传递 gt_label
]
data = dict(
samples_per_gpu=32, # 单个 GPU 的 Batch size
workers_per_gpu=2, # 单个 GPU 的 线程数
train=dict( # 训练数据信息
type=dataset_type, # 数据集名称
data_prefix='data/imagenet/train', # 数据集目录,当不存在 ann_file 时,类别信息从文件夹自动获取
pipeline=train_pipeline), # 数据集需要经过的 数据流水线
val=dict( # 验证数据集信息
type=dataset_type,
data_prefix='data/imagenet/val',
ann_file='data/imagenet/meta/val.txt', # 标注文件路径,存在 ann_file 时,不通过文件夹自动获取类别信息
pipeline=test_pipeline),
test=dict( # 测试数据集信息
type=dataset_type,
data_prefix='data/imagenet/val',
ann_file='data/imagenet/meta/val.txt',
pipeline=test_pipeline))
evaluation = dict( # evaluation hook 的配置
interval=1, # 验证期间的间隔,单位为 epoch 或者 iter, 取决于 runner 类型。
metric='accuracy') # 验证期间使用的指标。
训练策略¶
主要包含优化器设置、optimizer hook 设置、学习率策略和 runner 设置:
optimizer:优化器设置信息,支持 pytorch 所有的优化器,参考相关 mmcv 文档
optimizer_config:optimizer hook 的配置文件,如设置梯度限制,参考相关 mmcv 代码
lr_config:学习率策略,支持 “CosineAnnealing”、“Step”、“Cyclic” 等等,参考相关 mmcv 文档
runner:有关 runner 可以参考 mmcv 对于 runner 的介绍文档
# 用于构建优化器的配置文件。支持 PyTorch 中的所有优化器,同时它们的参数与 PyTorch 里的优化器参数一致。
optimizer = dict(type='SGD', # 优化器类型
lr=0.1, # 优化器的学习率,参数的使用细节请参照对应的 PyTorch 文档。
momentum=0.9, # 动量(Momentum)
weight_decay=0.0001) # 权重衰减系数(weight decay)。
# optimizer hook 的配置文件
optimizer_config = dict(grad_clip=None) # 大多数方法不使用梯度限制(grad_clip)。
# 学习率调整配置,用于注册 LrUpdater hook。
lr_config = dict(policy='step', # 调度流程(scheduler)的策略,也支持 CosineAnnealing, Cyclic, 等。
step=[30, 60, 90]) # 在 epoch 为 30, 60, 90 时, lr 进行衰减
runner = dict(type='EpochBasedRunner', # 将使用的 runner 的类别,如 IterBasedRunner 或 EpochBasedRunner。
max_epochs=100) # runner 总回合数, 对于 IterBasedRunner 使用 `max_iters`
运行设置¶
本部分主要包括保存权重策略、日志配置、训练参数、断点权重路径和工作目录等等。
# Checkpoint hook 的配置文件。
checkpoint_config = dict(interval=1) # 保存的间隔是 1,单位会根据 runner 不同变动,可以为 epoch 或者 iter。
# 日志配置信息。
log_config = dict(
interval=100, # 打印日志的间隔, 单位 iters
hooks=[
dict(type='TextLoggerHook'), # 用于记录训练过程的文本记录器(logger)。
# dict(type='TensorboardLoggerHook') # 同样支持 Tensorboard 日志
])
dist_params = dict(backend='nccl') # 用于设置分布式训练的参数,端口也同样可被设置。
log_level = 'INFO' # 日志的输出级别。
resume_from = None # 从给定路径里恢复检查点(checkpoints),训练模式将从检查点保存的轮次开始恢复训练。
workflow = [('train', 1)] # runner 的工作流程,[('train', 1)] 表示只有一个工作流且工作流仅执行一次。
work_dir = 'work_dir' # 用于保存当前实验的模型检查点和日志的目录文件地址。
继承并修改配置文件¶
为了精简代码、更快的修改配置文件以及便于理解,我们建议继承现有方法。
对于在同一算法文件夹下的所有配置文件,MMClassification 推荐只存在 一个 对应的 原始配置 文件。 所有其他的配置文件都应该继承 原始配置 文件,这样就能保证配置文件的最大继承深度为 3。
例如,如果在 ResNet 的基础上做了一些修改,用户首先可以通过指定 _base_ = './resnet50_8xb32_in1k.py'(相对于你的配置文件的路径),来继承基础的 ResNet 结构、数据集以及其他训练配置信息,然后修改配置文件中的必要参数以完成继承。如想在基础 resnet50 的基础上将训练轮数由 100 改为 300、修改学习率衰减轮数,同时修改数据集路径,可以建立新的配置文件 configs/resnet/resnet50_8xb32-300e_in1k.py,文件中写入以下内容:
_base_ = './resnet50_8xb32_in1k.py'
runner = dict(max_epochs=300)
lr_config = dict(step=[150, 200, 250])
data = dict(
train=dict(data_prefix='mydata/imagenet/train'),
val=dict(data_prefix='mydata/imagenet/train', ),
test=dict(data_prefix='mydata/imagenet/train', )
)
使用配置文件里的中间变量¶
使用一些中间变量可以让配置文件更加清晰,也更容易修改。
例如数据集里的 train_pipeline / test_pipeline 是作为数据流水线的中间变量。我们首先要定义 train_pipeline / test_pipeline,然后将它们传递到 data 中。如果想修改训练或测试时输入图片的大小,就需要修改 train_pipeline / test_pipeline 这些中间变量。
img_norm_cfg = dict(
mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='RandomResizedCrop', size=384, backend='pillow',),
dict(type='RandomFlip', flip_prob=0.5, direction='horizontal'),
dict(type='Normalize', **img_norm_cfg),
dict(type='ImageToTensor', keys=['img']),
dict(type='ToTensor', keys=['gt_label']),
dict(type='Collect', keys=['img', 'gt_label'])
]
test_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='Resize', size=384, backend='pillow'),
dict(type='Normalize', **img_norm_cfg),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img'])
]
data = dict(
train=dict(pipeline=train_pipeline),
val=dict(pipeline=test_pipeline),
test=dict(pipeline=test_pipeline))
忽略基础配置文件里的部分内容¶
有时,您需要设置 _delete_=True 去忽略基础配置文件里的一些域内容。可以参照 mmcv 来获得一些简单的指导。
以下是一个简单应用案例。如果在上述 ResNet50 案例中使用 cosine schedule,使用继承并直接修改会报 get unexpected keyword 'step' 错,因为基础配置文件 lr_config 域信息的 'step' 字段被保留下来了,需要加入 _delete_=True 去忽略基础配置文件里的 lr_config 相关域内容:
_base_ = '../../configs/resnet/resnet50_8xb32_in1k.py'
lr_config = dict(
_delete_=True,
policy='CosineAnnealing',
min_lr=0,
warmup='linear',
by_epoch=True,
warmup_iters=5,
warmup_ratio=0.1
)
引用基础配置文件里的变量¶
有时,您可以引用 _base_ 配置信息的一些域内容,这样可以避免重复定义。可以参照 mmcv 来获得一些简单的指导。
以下是一个简单应用案例,在训练数据预处理流水线中使用 auto augment 数据增强,参考配置文件 configs/_base_/datasets/imagenet_bs64_autoaug.py。在定义 train_pipeline 时,可以直接在 _base_ 中加入定义 auto augment 数据增强的文件命名,再通过 {{_base_.auto_increasing_policies}} 引用变量:
_base_ = ['./pipelines/auto_aug.py']
# dataset settings
dataset_type = 'ImageNet'
img_norm_cfg = dict(
mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='RandomResizedCrop', size=224),
dict(type='RandomFlip', flip_prob=0.5, direction='horizontal'),
dict(type='AutoAugment', policies={{_base_.auto_increasing_policies}}),
dict(type='Normalize', **img_norm_cfg),
dict(type='ImageToTensor', keys=['img']),
dict(type='ToTensor', keys=['gt_label']),
dict(type='Collect', keys=['img', 'gt_label'])
]
test_pipeline = [...]
data = dict(
samples_per_gpu=64,
workers_per_gpu=2,
train=dict(..., pipeline=train_pipeline),
val=dict(..., pipeline=test_pipeline))
evaluation = dict(interval=1, metric='accuracy')
通过命令行参数修改配置信息¶
当用户使用脚本 “tools/train.py” 或者 “tools/test.py” 提交任务,以及使用一些工具脚本时,可以通过指定 --cfg-options 参数来直接修改所使用的配置文件内容。
更新配置文件内的字典
可以按照原始配置文件中字典的键的顺序指定配置选项。例如,--cfg-options model.backbone.norm_eval=False 将主干网络中的所有 BN 模块更改为 train 模式。
更新配置文件内列表的键
一些配置字典在配置文件中会形成一个列表。例如,训练流水线 data.train.pipeline 通常是一个列表,如 [dict(type='LoadImageFromFile'), dict(type='TopDownRandomFlip', flip_prob=0.5), ...]。如果要将流水线中的 'flip_prob=0.5' 更改为 'flip_prob=0.0',您可以这样指定 --cfg-options data.train.pipeline.1.flip_prob=0.0。
更新列表/元组的值
当配置文件中需要更新的是一个列表或者元组,例如,配置文件通常会设置 workflow=[('train', 1)],用户如果想更改,需要指定 --cfg-options workflow="[(train,1),(val,1)]"。注意这里的引号 " 对于列表以及元组数据类型的修改是必要的,并且 不允许 引号内所指定的值的书写存在空格。
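下面是一个组合使用 --cfg-options 的示例命令(其中的配置文件与字段仅作演示,请根据实际配置调整):
python tools/train.py configs/resnet/resnet50_8xb32_in1k.py \
    --cfg-options model.backbone.norm_eval=False \
                  data.samples_per_gpu=64 \
                  workflow="[(train,1),(val,1)]"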
导入用户自定义模块¶
备注
本部分仅在当将 MMClassification 当作库构建自己项目时可能用到,初学者可跳过。
在学习完后续教程 如何添加新数据集、如何设计数据处理流程 、如何增加新模块 后,您可能使用 MMClassification 完成自己的项目并在项目中自定义了数据集、模型、数据增强等。为了精简代码,可以将 MMClassification 作为一个第三方库,只需要保留自己的额外的代码,并在配置文件中导入自定义的模块。案例可以参考 OpenMMLab 算法大赛项目。
只需要在你的配置文件中添加以下代码:
custom_imports = dict(
imports=['your_dataset_class',
'your_transform_class',
'your_model_class',
'your_module_class'],
allow_failed_imports=False)
常见问题¶
无
教程 2:如何微调模型¶
已经证明,在 ImageNet 数据集上预先训练的分类模型对于其他数据集和其他下游任务有很好的效果。
该教程提供了如何将 Model Zoo 中提供的预训练模型用于其他数据集,以获得更好的效果。
在新数据集上微调模型分为两步:
按照 教程 3:如何自定义数据集 添加对新数据集的支持。
按照本教程中讨论的内容修改配置文件
假设我们现在有一个在 ImageNet-2012 数据集上训练好的 ResNet-50 模型,并且希望在 CIFAR-10 数据集上进行模型微调,我们需要修改配置文件中的五个部分。
继承基础配置¶
首先,创建一个新的配置文件 configs/tutorial/resnet50_finetune_cifar.py 来保存我们的配置,当然,这个文件名可以自由设定。
为了重用不同配置之间的通用部分,我们支持从多个现有配置中继承配置。要微调 ResNet-50 模型,新配置需要继承 _base_/models/resnet50.py 来搭建模型的基本结构。
为了使用 CIFAR10 数据集,新的配置文件可以直接继承 _base_/datasets/cifar10.py。
而为了保留运行相关设置,比如训练调整器,新的配置文件需要继承 _base_/default_runtime.py。
要继承以上这些配置文件,只需要把下面一段代码放在我们的配置文件开头。
_base_ = [
'../_base_/models/resnet50.py',
'../_base_/datasets/cifar10.py', '../_base_/default_runtime.py'
]
除此之外,你也可以不使用继承,直接编写完整的配置文件,例如 configs/lenet/lenet5_mnist.py。
修改模型¶
在进行模型微调时,我们通常希望在主干网络(backbone)加载预训练模型,再用我们的数据集训练一个新的分类头(head)。
为了在主干网络加载预训练模型,我们需要修改主干网络的初始化设置,使用 Pretrained 类型的初始化函数。另外,在初始化设置中,我们使用 prefix='backbone' 来告诉初始化函数移除权重文件中键值名称的前缀,比如把 backbone.conv1 变成 conv1。方便起见,我们这里使用一个在线的权重文件链接,它会在训练前自动下载对应的文件,你也可以提前下载这个模型,然后使用本地路径。
接下来,新的配置文件需要按照新数据集的类别数目来修改分类头的配置。只需要修改分类头中的 num_classes 设置即可。
model = dict(
backbone=dict(
init_cfg=dict(
type='Pretrained',
checkpoint='https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb32_in1k_20210831-ea4938fc.pth',
prefix='backbone',
)),
head=dict(num_classes=10),
)
小技巧
这里我们只需要设定我们想要修改的部分配置,其他配置将会自动从我们的父配置文件中获取。
另外,有时我们在进行微调时会希望冻结主干网络前面几层的参数,这么做有助于在后续训练中,保持网络从预训练权重中获得的提取低阶特征的能力。在 MMClassification 中,这一功能可以通过简单的一个 frozen_stages 参数来实现。比如我们需要冻结前两层网络的参数,只需要在上面的配置中添加一行:
model = dict(
backbone=dict(
frozen_stages=2,
init_cfg=dict(
type='Pretrained',
checkpoint='https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb32_in1k_20210831-ea4938fc.pth',
prefix='backbone',
)),
head=dict(num_classes=10),
)
备注
目前还不是所有的网络都支持 frozen_stages 参数,在使用之前,请先检查文档以确认你所使用的主干网络是否支持。
修改数据集¶
当针对一个新的数据集进行微调时,我们通常都需要修改一些数据集相关的配置。比如这 里,我们就需要把 CIFAR-10 数据集中的图像大小从 32 缩放到 224 来配合 ImageNet 上 预训练模型的输入。这一需要可以通过修改数据集的预处理流水线(pipeline)来实现。
img_norm_cfg = dict(
mean=[125.307, 122.961, 113.8575],
std=[51.5865, 50.847, 51.255],
to_rgb=False,
)
train_pipeline = [
dict(type='RandomCrop', size=32, padding=4),
dict(type='RandomFlip', flip_prob=0.5, direction='horizontal'),
dict(type='Resize', size=224),
dict(type='Normalize', **img_norm_cfg),
dict(type='ImageToTensor', keys=['img']),
dict(type='ToTensor', keys=['gt_label']),
dict(type='Collect', keys=['img', 'gt_label']),
]
test_pipeline = [
dict(type='Resize', size=224),
dict(type='Normalize', **img_norm_cfg),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img']),
]
data = dict(
train=dict(pipeline=train_pipeline),
val=dict(pipeline=test_pipeline),
test=dict(pipeline=test_pipeline),
)
修改训练策略设置¶
用于微调任务的超参数与默认配置不同,通常只需要较小的学习率和较少的训练时间。
# 用于批大小为 128 的优化器学习率
optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=None)
# 学习率衰减策略
lr_config = dict(policy='step', step=[15])
runner = dict(type='EpochBasedRunner', max_epochs=200)
log_config = dict(interval=100)
开始训练¶
现在,我们完成了用于微调的配置文件,完整的文件如下:
_base_ = [
'../_base_/models/resnet50.py',
'../_base_/datasets/cifar10_bs16.py', '../_base_/default_runtime.py'
]
# 模型设置
model = dict(
backbone=dict(
frozen_stages=2,
init_cfg=dict(
type='Pretrained',
checkpoint='https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb32_in1k_20210831-ea4938fc.pth',
prefix='backbone',
)),
head=dict(num_classes=10),
)
# 数据集设置
img_norm_cfg = dict(
mean=[125.307, 122.961, 113.8575],
std=[51.5865, 50.847, 51.255],
to_rgb=False,
)
train_pipeline = [
dict(type='RandomCrop', size=32, padding=4),
dict(type='RandomFlip', flip_prob=0.5, direction='horizontal'),
dict(type='Resize', size=224),
dict(type='Normalize', **img_norm_cfg),
dict(type='ImageToTensor', keys=['img']),
dict(type='ToTensor', keys=['gt_label']),
dict(type='Collect', keys=['img', 'gt_label']),
]
test_pipeline = [
dict(type='Resize', size=224),
dict(type='Normalize', **img_norm_cfg),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img']),
]
data = dict(
train=dict(pipeline=train_pipeline),
val=dict(pipeline=test_pipeline),
test=dict(pipeline=test_pipeline),
)
# 训练策略设置
# 用于批大小为 128 的优化器学习率
optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=None)
# 学习率衰减策略
lr_config = dict(policy='step', step=[15])
runner = dict(type='EpochBasedRunner', max_epochs=200)
log_config = dict(interval=100)
接下来,我们使用一台 8 张 GPU 的电脑来训练我们的模型,指令如下:
bash tools/dist_train.sh configs/tutorial/resnet50_finetune_cifar.py 8
当然,我们也可以使用单张 GPU 来进行训练,使用如下命令:
python tools/train.py configs/tutorial/resnet50_finetune_cifar.py
但是如果我们使用单张 GPU 进行训练的话,需要在数据集设置部分作如下修改:
data = dict(
samples_per_gpu=128,
train=dict(pipeline=train_pipeline),
val=dict(pipeline=test_pipeline),
test=dict(pipeline=test_pipeline),
)
这是因为我们的训练策略是针对批次大小(batch size)为 128 设置的。在父配置文件中,设置了 samples_per_gpu=16,如果使用 8 张 GPU,总的批次大小就是 128。而如果使用单张 GPU,就必须手动修改 samples_per_gpu=128 来匹配训练策略。
教程 3:如何自定义数据集¶
我们支持许多常用的图像分类领域公开数据集,你可以在 此页面中找到它们。
在本节中,我们将介绍如何使用自己的数据集以及如何使用数据集包装。
使用自己的数据集¶
将数据集重新组织为已有格式¶
想要使用自己的数据集,最简单的方法就是将数据集转换为现有的数据集格式。
对于多分类任务,我们推荐使用 CustomDataset 格式。CustomDataset 支持两种类型的数据格式:
1. 提供一个标注文件,其中每一行表示一张样本图片。
样本图片可以以任意的结构进行组织,比如:
train/
├── folder_1
│   ├── xxx.png
│   ├── xxy.png
│   └── ...
├── 123.png
├── nsdf3.png
└── ...
而标注文件则记录了所有样本图片的文件路径以及相应的类别序号。其中第一列表示图像相对于主目录(本例中为 train 目录)的路径,第二列表示类别序号:
folder_1/xxx.png 0
folder_1/xxy.png 1
123.png 1
nsdf3.png 2
...
备注
类别序号的值应当属于 [0, num_classes - 1] 范围。
2. 将所有样本文件按如下结构进行组织:
train/
├── cat
│   ├── xxx.png
│   ├── xxy.png
│   ├── ...
│   └── xxz.png
├── bird
│   ├── bird1.png
│   ├── bird2.png
│   └── ...
└── dog
    ├── 123.png
    ├── nsdf3.png
    ├── ...
    └── asd932_.png
这种情况下,你不需要提供标注文件,所有位于 cat 目录下的图片文件都会被视为 cat 类别的样本。
通常而言,我们会将整个数据集分为三个子数据集:train、val 和 test,分别用于训练、验证和测试。每一个子数据集都需要被组织成如上的一种结构。
举个例子,完整的数据集结构如下所示(使用第一种组织结构):
mmclassification
└── data
└── my_dataset
├── meta
│ ├── train.txt
│ ├── val.txt
│ └── test.txt
├── train
├── val
└── test
之后在你的配置文件中,可以修改其中的 data 字段为如下格式:
...
dataset_type = 'CustomDataset'
classes = ['cat', 'bird', 'dog'] # 数据集中各类别的名称
data = dict(
train=dict(
type=dataset_type,
data_prefix='data/my_dataset/train',
ann_file='data/my_dataset/meta/train.txt',
classes=classes,
pipeline=train_pipeline
),
val=dict(
type=dataset_type,
data_prefix='data/my_dataset/val',
ann_file='data/my_dataset/meta/val.txt',
classes=classes,
pipeline=test_pipeline
),
test=dict(
type=dataset_type,
data_prefix='data/my_dataset/test',
ann_file='data/my_dataset/meta/test.txt',
classes=classes,
pipeline=test_pipeline
)
)
...
创建一个新的数据集类¶
用户可以编写一个继承自 BaseDataset 的新数据集类,并重载 load_annotations(self) 方法,类似 CIFAR10 和 ImageNet。
通常,此方法返回一个包含所有样本的列表,其中的每个样本都是一个字典。字典中包含了必要的数据信息,例如 img 和 gt_label。
假设我们将要实现一个 Filelist 数据集,该数据集将使用文件列表进行训练和测试。标注列表的格式如下:
000001.jpg 0
000002.jpg 1
我们可以在 mmcls/datasets/filelist.py 中创建一个新的数据集类以加载数据。
import mmcv
import numpy as np
from .builder import DATASETS
from .base_dataset import BaseDataset
@DATASETS.register_module()
class Filelist(BaseDataset):
def load_annotations(self):
assert isinstance(self.ann_file, str)
data_infos = []
with open(self.ann_file) as f:
samples = [x.strip().split(' ') for x in f.readlines()]
for filename, gt_label in samples:
info = {'img_prefix': self.data_prefix}
info['img_info'] = {'filename': filename}
info['gt_label'] = np.array(gt_label, dtype=np.int64)
data_infos.append(info)
return data_infos
将新的数据集类加入到 mmcls/datasets/__init__.py 中:
from .base_dataset import BaseDataset
...
from .filelist import Filelist
__all__ = [
'BaseDataset', ... ,'Filelist'
]
然后在配置文件中,为了使用 Filelist,用户可以按以下方式修改配置:
train = dict(
type='Filelist',
ann_file = 'image_list.txt',
pipeline=train_pipeline
)
使用数据集包装¶
数据集包装是一种可以改变数据集类行为的类,比如将数据集中的样本进行重复,或是将不同类别的数据进行再平衡。
重复数据集¶
我们使用 RepeatDataset 作为一个重复数据集的封装。举个例子,假设原始数据集是 Dataset_A,为了重复它,我们需要如下的配置文件:
data = dict(
train=dict(
type='RepeatDataset',
times=N,
dataset=dict( # 这里是 Dataset_A 的原始配置
type='Dataset_A',
...
pipeline=train_pipeline
)
)
...
)
类别平衡数据集¶
我们使用 ClassBalancedDataset 作为根据类别频率对数据集进行重复采样的封装类。进行重复采样的数据集需要实现函数 self.get_cat_ids(idx) 以支持 ClassBalancedDataset。
举个例子,按照 oversample_thr=1e-3 对 Dataset_A 进行重复采样,需要如下的配置文件:
data = dict(
train = dict(
type='ClassBalancedDataset',
oversample_thr=1e-3,
dataset=dict( # 这里是 Dataset_A 的原始配置
type='Dataset_A',
...
pipeline=train_pipeline
)
)
...
)
更加具体的细节,请参考 API 文档。
教程 4:如何设计数据处理流程¶
设计数据流水线¶
按照典型的用法,我们通过 Dataset 和 DataLoader 来使用多个 worker 进行数据加载。对 Dataset 的索引操作将返回一个与模型的 forward 方法的参数相对应的字典。
数据流水线和数据集在这里是解耦的。通常,数据集定义如何处理标注文件,而数据流水 线定义所有准备数据字典的步骤。流水线由一系列操作组成。每个操作都将一个字典作为 输入,并输出一个字典。
这些操作分为数据加载,预处理和格式化。
这里使用 ResNet-50 在 ImageNet 数据集上的数据流水线作为示例。
img_norm_cfg = dict(
mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='RandomResizedCrop', size=224),
dict(type='RandomFlip', flip_prob=0.5, direction='horizontal'),
dict(type='Normalize', **img_norm_cfg),
dict(type='ImageToTensor', keys=['img']),
dict(type='ToTensor', keys=['gt_label']),
dict(type='Collect', keys=['img', 'gt_label'])
]
test_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='Resize', size=256),
dict(type='CenterCrop', crop_size=224),
dict(type='Normalize', **img_norm_cfg),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img'])
]
对于每个操作,我们列出了添加、更新、删除的相关字典字段。在流水线的最后,我们使用 Collect 仅保留进行模型 forward 方法所需的项。
数据加载¶
LoadImageFromFile - 从文件中加载图像
添加:img, img_shape, ori_shape
默认情况下,LoadImageFromFile 将会直接从硬盘加载图像,但对于一些效率较高、规模较小的模型,这可能会导致 IO 瓶颈。MMCV 支持多种数据加载后端来加速这一过程。例如,如果训练设备上配置了 memcached,那么我们按照如下方式修改配置文件。
memcached_root = '/mnt/xxx/memcached_client/'
train_pipeline = [
dict(
type='LoadImageFromFile',
file_client_args=dict(
backend='memcached',
server_list_cfg=osp.join(memcached_root, 'server_list.conf'),
client_cfg=osp.join(memcached_root, 'client.conf'))),
]
更多支持的数据加载后端,可以参见 mmcv.fileio.FileClient。
预处理¶
Resize - 缩放图像尺寸
添加:scale, scale_idx, pad_shape, scale_factor, keep_ratio
更新:img, img_shape
RandomFlip - 随机翻转图像
添加:flip, flip_direction
更新:img
RandomCrop - 随机裁剪图像
更新:img, pad_shape
Normalize - 图像数据归一化
添加:img_norm_cfg
更新:img
格式化¶
ToTensor - 转换(标签)数据至 torch.Tensor
更新:根据参数 keys 指定
ImageToTensor - 转换图像数据至 torch.Tensor
更新:根据参数 keys 指定
Collect - 保留指定键值
删除:除了参数 keys 指定以外的所有键值对
扩展及使用自定义流水线¶
1. 编写一个新的数据处理操作,并放置在 mmcls/datasets/pipelines/ 目录下的任何一个文件中,例如 my_pipeline.py。这个类需要重载 __call__ 方法,接受一个字典作为输入,并返回一个字典。
from mmcls.datasets import PIPELINES

@PIPELINES.register_module()
class MyTransform(object):

    def __call__(self, results):
        # 对 results['img'] 进行变换操作
        return results
2. 在 mmcls/datasets/pipelines/__init__.py 中导入这个新的类。
...
from .my_pipeline import MyTransform

__all__ = [
    ..., 'MyTransform'
]
3. 在数据流水线的配置中添加这一操作。
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='RandomResizedCrop', size=224),
    dict(type='RandomFlip', flip_prob=0.5, direction='horizontal'),
    dict(type='MyTransform'),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='ImageToTensor', keys=['img']),
    dict(type='ToTensor', keys=['gt_label']),
    dict(type='Collect', keys=['img', 'gt_label'])
]
流水线可视化¶
设计好数据流水线后,可以使用可视化工具查看具体的效果。
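例如,可以尝试使用仓库中提供的可视化脚本查看数据流水线的输出效果(脚本路径与参数以你所使用的版本为准,以下命令仅为示意):
python tools/visualizations/vis_pipeline.py configs/resnet/resnet50_8xb32_in1k.py \
    --output-dir tmp_vis --number 10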
教程 5:如何增加新模块¶
开发新组件¶
我们基本上将模型组件分为 3 种类型。
主干网络:通常是一个特征提取网络,例如 ResNet、MobileNet
颈部:用于连接主干网络和头部的组件,例如 GlobalAveragePooling
头部:用于执行特定任务的组件,例如分类和回归
添加新的主干网络¶
这里,我们以 ResNet_CIFAR 为例,展示了如何开发一个新的主干网络组件。
ResNet_CIFAR 针对 CIFAR 32x32 的图像输入,将 ResNet 中 kernel_size=7, stride=2 的设置替换为 kernel_size=3, stride=1,并移除了 stem 层之后的 MaxPooling,以避免传递过小的特征图到残差块中。
它继承自 ResNet 并只修改了 stem 层。
1. 创建一个新文件 mmcls/models/backbones/resnet_cifar.py。
import torch.nn as nn
from mmcv.cnn import build_conv_layer, build_norm_layer  # 下文 _make_stem_layer 中会用到

from ..builder import BACKBONES
from .resnet import ResNet
@BACKBONES.register_module()
class ResNet_CIFAR(ResNet):
"""ResNet backbone for CIFAR.
(对这个主干网络的简短描述)
Args:
depth(int): Network depth, from {18, 34, 50, 101, 152}.
...
(参数文档)
"""
def __init__(self, depth, deep_stem=False, **kwargs):
# 调用基类 ResNet 的初始化函数
super(ResNet_CIFAR, self).__init__(depth, deep_stem=deep_stem, **kwargs)
# 其他特殊的初始化流程
assert not self.deep_stem, 'ResNet_CIFAR do not support deep_stem'
def _make_stem_layer(self, in_channels, base_channels):
# 重载基类的方法,以实现对网络结构的修改
self.conv1 = build_conv_layer(
self.conv_cfg,
in_channels,
base_channels,
kernel_size=3,
stride=1,
padding=1,
bias=False)
self.norm1_name, norm1 = build_norm_layer(
self.norm_cfg, base_channels, postfix=1)
self.add_module(self.norm1_name, norm1)
self.relu = nn.ReLU(inplace=True)
def forward(self, x): # 需要返回一个元组
pass # 此处省略了网络的前向实现
def init_weights(self, pretrained=None):
pass # 如果有必要的话,重载基类 ResNet 的参数初始化函数
def train(self, mode=True):
pass # 如果有必要的话,重载基类 ResNet 的训练状态函数
2. 在 mmcls/models/backbones/__init__.py 中导入新模块
...
from .resnet_cifar import ResNet_CIFAR
__all__ = [
..., 'ResNet_CIFAR'
]
3. 在配置文件中使用新的主干网络
model = dict(
...
backbone=dict(
type='ResNet_CIFAR',
depth=18,
other_arg=xxx),
...
添加新的颈部组件¶
这里我们以 GlobalAveragePooling 为例。这是一个非常简单的颈部组件,没有任何参数。
要添加新的颈部组件,我们主要需要实现 forward 函数,该函数对主干网络的输出进行一些操作并将结果传递到头部。
1. 创建一个新文件 mmcls/models/necks/gap.py
import torch.nn as nn

from ..builder import NECKS

@NECKS.register_module()
class GlobalAveragePooling(nn.Module):

    def __init__(self):
        super().__init__()  # 初始化 nn.Module 基类
        self.gap = nn.AdaptiveAvgPool2d((1, 1))

    def forward(self, inputs):
        # 简单起见,我们默认输入是一个张量
        outs = self.gap(inputs)
        outs = outs.view(inputs.size(0), -1)
        return outs
2. 在 mmcls/models/necks/__init__.py 中导入新模块
...
from .gap import GlobalAveragePooling

__all__ = [
    ..., 'GlobalAveragePooling'
]
3. 修改配置文件以使用新的颈部组件
model = dict(
    neck=dict(type='GlobalAveragePooling'),
)
添加新的头部组件¶
在此,我们以 LinearClsHead 为例,说明如何开发新的头部组件。
要添加一个新的头部组件,基本上我们需要实现 forward_train 函数,它接受来自颈部或主干网络的特征图作为输入,并基于真实标签计算损失。
1. 创建一个文件 mmcls/models/heads/linear_head.py。
import torch.nn as nn
from mmcv.cnn import normal_init

from ..builder import HEADS
from .cls_head import ClsHead

@HEADS.register_module()
class LinearClsHead(ClsHead):

    def __init__(self,
                 num_classes,
                 in_channels,
                 loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
                 topk=(1, )):
        super(LinearClsHead, self).__init__(loss=loss, topk=topk)
        self.in_channels = in_channels
        self.num_classes = num_classes

        if self.num_classes <= 0:
            raise ValueError(
                f'num_classes={num_classes} must be a positive integer')

        self._init_layers()

    def _init_layers(self):
        self.fc = nn.Linear(self.in_channels, self.num_classes)

    def init_weights(self):
        normal_init(self.fc, mean=0, std=0.01, bias=0)

    def forward_train(self, x, gt_label):
        cls_score = self.fc(x)
        losses = self.loss(cls_score, gt_label)
        return losses
2. 在 mmcls/models/heads/__init__.py 中导入这个模块
...
from .linear_head import LinearClsHead

__all__ = [
    ..., 'LinearClsHead'
]
3. 修改配置文件以使用新的头部组件。连同 GlobalAveragePooling 颈部组件,完整的模型配置如下:
model = dict(
type='ImageClassifier',
backbone=dict(
type='ResNet',
depth=50,
num_stages=4,
out_indices=(3, ),
style='pytorch'),
neck=dict(type='GlobalAveragePooling'),
head=dict(
type='LinearClsHead',
num_classes=1000,
in_channels=2048,
loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
topk=(1, 5),
))
添加新的损失函数¶
要添加新的损失函数,我们主要需要实现损失函数模块中的 forward 函数。另外,利用装饰器 weighted_loss 可以方便地实现对每个元素的损失进行加权平均。
假设我们要模拟从另一个分类模型生成的概率分布,需要添加 L1Loss 来实现该目的。
1. 创建一个新文件 mmcls/models/losses/l1_loss.py
import torch
import torch.nn as nn

from ..builder import LOSSES
from .utils import weighted_loss

@weighted_loss
def l1_loss(pred, target):
    assert pred.size() == target.size() and target.numel() > 0
    loss = torch.abs(pred - target)
    return loss

@LOSSES.register_module()
class L1Loss(nn.Module):

    def __init__(self, reduction='mean', loss_weight=1.0):
        super(L1Loss, self).__init__()
        self.reduction = reduction
        self.loss_weight = loss_weight

    def forward(self,
                pred,
                target,
                weight=None,
                avg_factor=None,
                reduction_override=None):
        assert reduction_override in (None, 'none', 'mean', 'sum')
        reduction = (
            reduction_override if reduction_override else self.reduction)
        loss = self.loss_weight * l1_loss(
            pred, target, weight, reduction=reduction, avg_factor=avg_factor)
        return loss
2. 在文件 mmcls/models/losses/__init__.py 中导入这个模块
...
from .l1_loss import L1Loss, l1_loss

__all__ = [
    ..., 'L1Loss', 'l1_loss'
]
3. 修改配置文件中的 loss 字段以使用新的损失函数
loss=dict(type='L1Loss', loss_weight=1.0))
教程 6:如何自定义优化策略¶
在本教程中,我们将介绍如何在运行自定义模型时,进行构造优化器、定制学习率及动量调整策略、梯度裁剪、梯度累计以及用户自定义优化方法等。
构造 PyTorch 内置优化器¶
MMClassification 支持 PyTorch 实现的所有优化器,仅需在配置文件中,指定 “optimizer” 字段。 例如,如果要使用 “SGD”,则修改如下。
optimizer = dict(type='SGD', lr=0.0003, weight_decay=0.0001)
要修改模型的学习率,只需要在优化器的配置中修改 lr 即可。
要配置其他参数,可直接根据 PyTorch API 文档 进行。
备注
配置文件中的 ‘type’ 不是构造时的参数,而是 PyTorch 内置优化器的类名。
例如,如果想使用 Adam 并设置参数为 torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False),则需要进行如下修改:
optimizer = dict(type='Adam', lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False)
定制学习率调整策略¶
定制学习率衰减曲线¶
深度学习研究中,广泛应用学习率衰减来提高网络的性能。要使用学习率衰减,可以在配置中设置 lr_config 字段。
比如在默认的 ResNet 网络训练中,我们使用阶梯式的学习率衰减策略,配置文件为:
lr_config = dict(policy='step', step=[100, 150])
在训练过程中,程序会周期性地调用 MMCV 中的 StepLrUpdaterHook 来进行学习率更新。
此外,我们也支持其他学习率调整方法,如 CosineAnnealing 和 Poly 等。详情可见 这里
CosineAnnealing:
lr_config = dict(policy='CosineAnnealing', min_lr_ratio=1e-5)
Poly:
lr_config = dict(policy='poly', power=0.9, min_lr=1e-4, by_epoch=False)
定制学习率预热策略¶
在训练的早期阶段,网络容易不稳定,而学习率的预热就是为了减少这种不稳定性。通过预热,学习率将会从一个很小的值逐步提高到预定值。
在 MMClassification 中,我们同样使用 lr_config 配置学习率预热策略,主要的参数有以下几个:
warmup:学习率预热曲线类别,必须为 'constant'、'linear'、'exp' 或者 None 其一,如果为 None,则不使用学习率预热策略。
warmup_by_epoch:是否以轮次(epoch)为单位进行预热。
warmup_iters:预热的迭代次数,当 warmup_by_epoch=True 时,单位为轮次(epoch);当 warmup_by_epoch=False 时,单位为迭代次数(iter)。
warmup_ratio:预热的初始学习率 lr = lr * warmup_ratio。
例如:
逐迭代次数地线性预热
lr_config = dict(
    policy='CosineAnnealing',
    by_epoch=False,
    min_lr_ratio=1e-2,
    warmup='linear',
    warmup_ratio=1e-3,
    warmup_iters=20 * 1252,
    warmup_by_epoch=False)
逐轮次地指数预热
lr_config = dict(
    policy='CosineAnnealing',
    min_lr=0,
    warmup='exp',
    warmup_iters=5,
    warmup_ratio=0.1,
    warmup_by_epoch=True)
小技巧
配置完成后,可以使用 MMClassification 提供的 学习率可视化工具 画出对应学习率调整曲线。
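例如,可以尝试用类似下面的命令绘制配置文件对应的学习率调整曲线(脚本路径与参数以你所使用的版本为准,此处仅作示意):
python tools/visualizations/vis_lr.py configs/resnet/resnet50_8xb32_in1k.py \
    --save-path lr_curve.png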
定制动量调整策略¶
MMClassification 支持动量调整器根据学习率修改模型的动量,从而使模型收敛更快。
动量调整程序通常与学习率调整器一起使用,例如,以下配置用于加速收敛。 更多细节可参考 CyclicLrUpdater 和 CyclicMomentumUpdater。
这里是一个用例:
lr_config = dict(
policy='cyclic',
target_ratio=(10, 1e-4),
cyclic_times=1,
step_ratio_up=0.4,
)
momentum_config = dict(
policy='cyclic',
target_ratio=(0.85 / 0.95, 1),
cyclic_times=1,
step_ratio_up=0.4,
)
参数化精细配置¶
一些模型可能具有一些特定于参数的设置以进行优化,例如 BatchNorm 层不添加权重衰减或者对不同的网络层使用不同的学习率。
在 MMClassification 中,我们通过 optimizer 的 paramwise_cfg 参数进行配置,可以参考 MMCV。
1. 使用指定选项
MMClassification 提供了包括 bias_lr_mult、bias_decay_mult、norm_decay_mult、dwconv_decay_mult、dcn_offset_lr_mult 和 bypass_duplicate 选项,指定相关所有的 bias、norm、dwconv、dcn 和 bypass 参数。例如令模型中所有的 BN 不进行参数衰减:
optimizer = dict(
    type='SGD',
    lr=0.8,
    weight_decay=1e-4,
    paramwise_cfg=dict(norm_decay_mult=0.))
2. 使用 custom_keys 指定参数
MMClassification 可通过 custom_keys 指定不同的参数使用不同的学习率或者权重衰减,例如对特定的参数不使用权重衰减:
paramwise_cfg = dict(
    custom_keys={
        'backbone.cls_token': dict(decay_mult=0.0),
        'backbone.pos_embed': dict(decay_mult=0.0)
    })

optimizer = dict(
    type='SGD',
    lr=0.8,
    weight_decay=1e-4,
    paramwise_cfg=paramwise_cfg)
对 backbone 使用更小的学习率与衰减系数:
optimizer = dict(
    type='SGD',
    lr=0.8,
    weight_decay=1e-4,
    # backbone 的 'lr' 和 'weight_decay' 分别为 0.1 * lr 和 0.9 * weight_decay
    paramwise_cfg=dict(custom_keys={'backbone': dict(lr_mult=0.1, decay_mult=0.9)}))
梯度裁剪与梯度累计¶
除了 PyTorch 优化器的基本功能,我们还提供了一些对优化器的增强功能,例如梯度裁剪、梯度累计等,参考 MMCV。
梯度裁剪¶
在训练过程中,损失函数可能接近于一些异常陡峭的区域,从而导致梯度爆炸。而梯度裁剪可以帮助稳定训练过程,更多介绍可以参见该页面。
目前我们支持在 optimizer_config 字段中添加 grad_clip 参数来进行梯度裁剪,更详细的参数可参考 PyTorch 文档。
用例如下:
# norm_type: 使用的范数类型,此处使用范数2。
optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))
当使用继承并修改基础配置方式时,如果基础配置中 grad_clip=None,需要添加 _delete_=True。有关 _delete_ 可以参考教程 1:如何编写配置文件。案例如下:
_base_ = ['./_base_/schedules/imagenet_bs256_coslr.py']

optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2), _delete_=True, type='OptimizerHook')
# 当 type 为 'OptimizerHook' 时,可以省略 type;其他情况下,此处必须指明 type='xxxOptimizerHook'。
梯度累计¶
计算资源缺乏时,每个训练批次的大小(batch size)只能设置为较小的值,这可能会影响模型的性能。可以使用梯度累计来规避这一问题。
用例如下:
data = dict(samples_per_gpu=64)
optimizer_config = dict(type="GradientCumulativeOptimizerHook", cumulative_iters=4)
表示训练时,每 4 个 iter 执行一次反向传播。由于此时单张 GPU 上的批次大小为 64,也就等价于单张 GPU 上一次迭代的批次大小为 256,也即:
data = dict(samples_per_gpu=256)
optimizer_config = dict(type="OptimizerHook")
备注
当在 optimizer_config 中不指定优化器钩子类型时,默认使用 OptimizerHook。
用户自定义优化方法¶
在学术研究和工业实践中,可能需要使用 MMClassification 未实现的优化方法,可以通过以下方法添加。
备注
本部分将修改 MMClassification 源码或者向 MMClassification 框架添加代码,初学者可跳过。
自定义优化器¶
1. 定义一个新的优化器¶
一个自定义的优化器可根据如下规则进行定制
假设我们想添加一个名为 MyOptimizer 的优化器,其拥有参数 a,b 和 c。
可以创建一个名为 mmcls/core/optimizer 的文件夹,并在目录下的一个文件,如 mmcls/core/optimizer/my_optimizer.py 中实现该自定义优化器:
from mmcv.runner import OPTIMIZERS
from torch.optim import Optimizer
@OPTIMIZERS.register_module()
class MyOptimizer(Optimizer):
def __init__(self, a, b, c):
    pass  # 在此实现优化器的初始化逻辑(例如保存超参数并调用父类构造函数)
2. 注册优化器¶
要注册上面定义的上述模块,首先需要将此模块导入到主命名空间中。有两种方法可以实现它。
1. 修改 mmcls/core/optimizer/__init__.py,将其导入至 optimizer 包;再修改 mmcls/core/__init__.py 以导入 optimizer 包。
创建 mmcls/core/optimizer/__init__.py 文件。新定义的模块应导入到 mmcls/core/optimizer/__init__.py 中,以便注册器能找到新模块并将其添加:
# 在 mmcls/core/optimizer/__init__.py 中
from .my_optimizer import MyOptimizer # MyOptimizer 是我们自定义的优化器的名字
__all__ = ['MyOptimizer']
# 在 mmcls/core/__init__.py 中
...
from .optimizer import * # noqa: F401, F403
2. 在配置中使用 custom_imports 手动导入
custom_imports = dict(imports=['mmcls.core.optimizer.my_optimizer'], allow_failed_imports=False)
mmcls.core.optimizer.my_optimizer 模块将会在程序开始阶段被导入,MyOptimizer 类会随之自动被注册。
注意,只有包含 MyOptimizer 类的包会被导入,mmcls.core.optimizer.my_optimizer.MyOptimizer 不会 被直接导入。
3. 在配置文件中指定优化器¶
之后,用户便可在配置文件的 optimizer 域中使用 MyOptimizer。
在配置中,优化器由 “optimizer” 字段定义,如下所示:
optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001)
要使用自定义的优化器,可以将该字段更改为
optimizer = dict(type='MyOptimizer', a=a_value, b=b_value, c=c_value)
自定义优化器构造器¶
某些模型可能具有一些特定于参数的设置以进行优化,例如 BatchNorm 层的权重衰减。
虽然我们的 DefaultOptimizerConstructor 已经提供了这些强大的功能,但可能仍然无法覆盖需求。此时我们可以通过自定义优化器构造器来进行其他细粒度的参数调整。
from mmcv.runner.optimizer import OPTIMIZER_BUILDERS
@OPTIMIZER_BUILDERS.register_module()
class MyOptimizerConstructor:
def __init__(self, optimizer_cfg, paramwise_cfg=None):
pass
def __call__(self, model):
... # 在这里实现自己的优化器构造器。
return my_optimizer
这里是我们默认的优化器构造器的实现,可以作为新优化器构造器实现的模板。
教程 7:如何自定义模型运行参数¶
在本教程中,我们将介绍如何在运行自定义模型时,进行自定义工作流和钩子的方法。
定制工作流¶
工作流是一个形如 (任务名,周期数) 的列表,用于指定运行顺序和周期。这里“周期数”的单位由执行器的类型来决定。
比如在 MMClassification 中,我们默认使用基于轮次的执行器(EpochBasedRunner),那么“周期数”指的就是对应的任务在一个周期中要执行多少个轮次。通常,我们只希望执行训练任务,那么只需要使用以下设置:
workflow = [('train', 1)]
有时我们可能希望在训练过程中穿插检查模型在验证集上的一些指标(例如,损失,准确性)。
在这种情况下,可以将工作流程设置为:
[('train', 1), ('val', 1)]
这样一来,程序会一轮训练一轮测试地反复执行。
需要注意的是,默认情况下,我们并不推荐用这种方式来进行模型验证,而是推荐在训练中使用 EvalHook 进行模型验证。使用上述工作流的方式进行模型验证只是一个替代方案。
备注
在验证周期时不会更新模型参数。
配置文件内的关键词 max_epochs 控制训练时期数,并且不会影响验证工作流程。
工作流 [('train', 1), ('val', 1)] 和 [('train', 1)] 不会改变 EvalHook 的行为,因为 EvalHook 由 after_train_epoch 调用,而验证工作流只会影响 after_val_epoch 调用的钩子。因此,[('train', 1), ('val', 1)] 和 [('train', 1)] 的区别在于,runner 在完成每一轮训练后,会计算验证集上的损失。
钩子¶
钩子机制在 OpenMMLab 开源算法库中应用非常广泛,结合执行器可以实现对训练过程的整个生命周期进行管理,可以通过相关文章进一步理解钩子。
钩子只有在构造器中被注册才起作用,目前钩子主要分为两类:
默认训练钩子
默认训练钩子由运行器默认注册,一般为一些基础型功能的钩子,已经有确定的优先级,一般不需要修改优先级。
定制钩子
定制钩子通过 custom_hooks 注册,一般为一些增强型功能的钩子,需要在配置文件中指定优先级,不指定时该钩子的优先级将默认被设定为 'NORMAL'。
优先级列表
Level | Value
---|---
HIGHEST | 0
VERY_HIGH | 10
HIGH | 30
ABOVE_NORMAL | 40
NORMAL(default) | 50
BELOW_NORMAL | 60
LOW | 70
VERY_LOW | 90
LOWEST | 100
优先级确定钩子的执行顺序,每次训练前,日志会打印出各个阶段钩子的执行顺序,方便调试。
默认训练钩子¶
有一些常见的钩子未通过 custom_hooks 注册,但会在运行器(Runner)中默认注册,它们是:
Hooks | Priority
---|---
LrUpdaterHook | VERY_HIGH (10)
MomentumUpdaterHook | HIGH (30)
OptimizerHook | ABOVE_NORMAL (40)
CheckpointHook | NORMAL (50)
IterTimerHook | LOW (70)
EvalHook | LOW (70)
LoggerHook(s) | VERY_LOW (90)
OptimizerHook、MomentumUpdaterHook 和 LrUpdaterHook 在 优化策略 部分进行了介绍,IterTimerHook 用于记录所用时间,目前不支持修改;
下面介绍如何定制 CheckpointHook、LoggerHooks 以及 EvalHook。
权重文件钩子(CheckpointHook)¶
MMCV 的 runner 使用 checkpoint_config 来初始化 CheckpointHook。
checkpoint_config = dict(interval=1)
用户可以设置 “max_keep_ckpts” 来仅保存少量模型权重文件,或者通过 “save_optimizer” 决定是否存储优化器的状态字典。 更多细节可参考 这里。
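例如,下面的配置演示了这两个参数的一种可能用法(参数含义以 MMCV 中 CheckpointHook 的文档为准):
checkpoint_config = dict(
    interval=1,           # 每 1 个周期保存一次权重
    max_keep_ckpts=3,     # 最多保留最近的 3 个权重文件
    save_optimizer=True)  # 同时保存优化器状态,便于恢复训练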
日志钩子(LoggerHooks)¶
log_config 包装了多个记录器钩子,并可以设置间隔。目前,MMCV 支持 TextLoggerHook、WandbLoggerHook、MlflowLoggerHook 和 TensorboardLoggerHook。更多细节可参考这里。
log_config = dict(
interval=50,
hooks=[
dict(type='TextLoggerHook'),
dict(type='TensorboardLoggerHook')
])
验证钩子(EvalHook)¶
配置中的 evaluation 字段将用于初始化 EvalHook。
EvalHook 有一些保留参数,如 interval、save_best 和 start 等。其他的参数,如 “metrics” 将被传递给 dataset.evaluate()。
evaluation = dict(interval=1, metric='accuracy', metric_options={'topk': (1, )})
我们可以通过参数 save_best 保存取得最好验证结果时的模型权重:
# "auto" 表示自动选择指标来进行模型的比较。也可以指定一个特定的 key 比如 "accuracy_top-1"。
evaluation = dict(interval=1, save_best=True, metric='accuracy', metric_options={'topk': (1, )})
在跑一些大型实验时,可以通过修改参数 start 跳过训练靠前轮次时的验证步骤,以节约时间。如下:
evaluation = dict(interval=1, start=200, metric='accuracy', metric_options={'topk': (1, )})
表示在第 200 轮之前,只执行训练流程,不执行验证;从轮次 200 开始,在每一轮训练之后进行验证。
备注
在 MMClassification 的默认配置文件中,evaluation 字段一般被放在 datasets 基础配置文件中。
使用内置钩子¶
一些钩子已在 MMCV 和 MMClassification 中实现:
可以直接修改配置以使用该钩子,如下格式:
custom_hooks = [
dict(type='MMCVHook', a=a_value, b=b_value, priority='NORMAL')
]
例如使用 EMAHook,进行一次 EMA 的间隔是 100 个迭代:
custom_hooks = [
dict(type='EMAHook', interval=100, priority='HIGH')
]
自定义钩子¶
创建一个新钩子¶
这里举一个在 MMClassification 中创建一个新钩子,并在训练中使用它的示例:
from mmcv.runner import HOOKS, Hook
@HOOKS.register_module()
class MyHook(Hook):
def __init__(self, a, b):
pass
def before_run(self, runner):
pass
def after_run(self, runner):
pass
def before_epoch(self, runner):
pass
def after_epoch(self, runner):
pass
def before_iter(self, runner):
pass
def after_iter(self, runner):
pass
根据钩子的功能,用户需要指定钩子在训练的每个阶段将要执行的操作,比如 before_run、after_run、before_epoch、after_epoch、before_iter 和 after_iter。
注册新钩子¶
之后,需要导入 MyHook。假设该文件在 mmcls/core/utils/my_hook.py,有两种办法导入它:
1. 修改 mmcls/core/utils/__init__.py 进行导入
新定义的模块应导入到 mmcls/core/utils/__init__.py 中,以便注册器能找到并添加新模块:
__all__ = ['MyHook']
2. 使用配置文件中的 custom_imports 变量手动导入
custom_imports = dict(imports=['mmcls.core.utils.my_hook'], allow_failed_imports=False)
修改配置¶
custom_hooks = [
dict(type='MyHook', a=a_value, b=b_value)
]
还可通过 priority 参数设置钩子优先级,如下所示:
custom_hooks = [
dict(type='MyHook', a=a_value, b=b_value, priority='NORMAL')
]
默认情况下,在注册过程中,钩子的优先级设置为“NORMAL”。
常见问题¶
1. resume_from, load_from,init_cfg.Pretrained 区别¶
load_from:仅仅加载模型权重,主要用于加载预训练或者训练好的模型;
resume_from:不仅导入模型权重,还会导入优化器信息和当前轮次(epoch)信息,主要用于从断点继续训练;
init_cfg.Pretrained:在权重初始化期间加载权重,您可以指定要加载的模块。这通常在微调模型时使用,请参阅教程 2:如何微调模型
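下面用一个简化的配置片段对比三者(其中的路径均为示意,请替换为实际文件;load_from 与 resume_from 一般不会同时设置):
# 仅加载模型权重,常用于在训练好的模型上继续开发或评估(示意路径)
load_from = 'checkpoints/resnet50_8xb32_in1k_20210831-ea4938fc.pth'

# 恢复中断的训练:同时恢复优化器状态与轮次信息(示意路径)
resume_from = 'work_dirs/resnet50_8xb32_in1k/epoch_50.pth'

# 仅在初始化主干网络时加载预训练权重,常用于微调
model = dict(
    backbone=dict(
        init_cfg=dict(
            type='Pretrained',
            checkpoint='checkpoints/resnet50_8xb32_in1k_20210831-ea4938fc.pth',
            prefix='backbone')))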
模型库统计¶
论文数量: 34
ALGORITHM: 34
模型权重文件数量: 224
[ALGORITHM] Conformer: Local Features Coupling Global Representations for Visual Recognition (4 ckpts)
[ALGORITHM] Patches Are All You Need? (3 ckpts)
[ALGORITHM] A ConvNet for the 2020s (13 ckpts)
[ALGORITHM] CSPNet: A New Backbone that can Enhance Learning Capability of CNN (3 ckpts)
[ALGORITHM] Residual Attention: A Simple but Effective Method for Multi-Label Recognition (1 ckpts)
[ALGORITHM] Training data-efficient image transformers & distillation through attention (9 ckpts)
[ALGORITHM] Densely Connected Convolutional Networks (4 ckpts)
[ALGORITHM] EfficientFormer: Vision Transformers at MobileNet Speed (3 ckpts)
[ALGORITHM] Rethinking Model Scaling for Convolutional Neural Networks (23 ckpts)
[ALGORITHM] HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions (9 ckpts)
[ALGORITHM] Deep High-Resolution Representation Learning for Visual Recognition (9 ckpts)
[ALGORITHM] MLP-Mixer: An all-MLP Architecture for Vision (2 ckpts)
[ALGORITHM] MobileNetV2: Inverted Residuals and Linear Bottlenecks (1 ckpts)
[ALGORITHM] Searching for MobileNetV3 (2 ckpts)
[ALGORITHM] MViTv2: Improved Multiscale Vision Transformers for Classification and Detection (4 ckpts)
[ALGORITHM] MetaFormer is Actually What You Need for Vision (5 ckpts)
[ALGORITHM] Designing Network Design Spaces (16 ckpts)
[ALGORITHM] RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition (2 ckpts)
[ALGORITHM] Repvgg: Making vgg-style convnets great again (12 ckpts)
[ALGORITHM] Res2Net: A New Multi-scale Backbone Architecture (3 ckpts)
[ALGORITHM] Deep Residual Learning for Image Recognition (26 ckpts)
[ALGORITHM] Aggregated Residual Transformations for Deep Neural Networks (4 ckpts)
[ALGORITHM] Squeeze-and-Excitation Networks (2 ckpts)
[ALGORITHM] ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices (1 ckpts)
[ALGORITHM] Shufflenet v2: Practical guidelines for efficient cnn architecture design (1 ckpts)
[ALGORITHM] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (14 ckpts)
[ALGORITHM] Swin Transformer V2: Scaling Up Capacity and Resolution (12 ckpts)
[ALGORITHM] Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet (3 ckpts)
[ALGORITHM] Transformer in Transformer (1 ckpts)
[ALGORITHM] Twins: Revisiting the Design of Spatial Attention in Vision Transformers (6 ckpts)
[ALGORITHM] Visual Attention Network (8 ckpts)
[ALGORITHM] Very Deep Convolutional Networks for Large-Scale Image Recognition (8 ckpts)
[ALGORITHM] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (7 ckpts)
[ALGORITHM] Wide Residual Networks (3 ckpts)
Model Zoo¶
ImageNet¶
ImageNet has multiple versions, but the most commonly used one is ILSVRC 2012. The ResNet family models below are trained by standard data augmentations, i.e., RandomResizedCrop, RandomHorizontalFlip and Normalize.
Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download
---|---|---|---|---|---|---
VGG-11 | 132.86 | 7.63 | 68.75 | 88.87 | |
VGG-13 | 133.05 | 11.34 | 70.02 | 89.46 | |
VGG-16 | 138.36 | 15.5 | 71.62 | 90.49 | |
VGG-19 | 143.67 | 19.67 | 72.41 | 90.80 | |
VGG-11-BN | 132.87 | 7.64 | 70.75 | 90.12 | |
VGG-13-BN | 133.05 | 11.36 | 72.15 | 90.71 | |
VGG-16-BN | 138.37 | 15.53 | 73.72 | 91.68 | |
VGG-19-BN | 143.68 | 19.7 | 74.70 | 92.24 | |
RepVGG-A0* | 9.11 (train) / 8.31 (deploy) | 1.52 (train) / 1.36 (deploy) | 72.41 | 90.50 | |
RepVGG-A1* | 14.09 (train) / 12.79 (deploy) | 2.64 (train) / 2.37 (deploy) | 74.47 | 91.85 | |
RepVGG-A2* | 28.21 (train) / 25.5 (deploy) | 5.7 (train) / 5.12 (deploy) | 76.48 | 93.01 | |
RepVGG-B0* | 15.82 (train) / 14.34 (deploy) | 3.42 (train) / 3.06 (deploy) | 75.14 | 92.42 | |
RepVGG-B1* | 57.42 (train) / 51.83 (deploy) | 13.16 (train) / 11.82 (deploy) | 78.37 | 94.11 | |
RepVGG-B1g2* | 45.78 (train) / 41.36 (deploy) | 9.82 (train) / 8.82 (deploy) | 77.79 | 93.88 | |
RepVGG-B1g4* | 39.97 (train) / 36.13 (deploy) | 8.15 (train) / 7.32 (deploy) | 77.58 | 93.84 | |
RepVGG-B2* | 89.02 (train) / 80.32 (deploy) | 20.46 (train) / 18.39 (deploy) | 78.78 | 94.42 | |
RepVGG-B2g4* | 61.76 (train) / 55.78 (deploy) | 12.63 (train) / 11.34 (deploy) | 79.38 | 94.68 | |
RepVGG-B3* | 123.09 (train) / 110.96 (deploy) | 29.17 (train) / 26.22 (deploy) | 80.52 | 95.26 | |
RepVGG-B3g4* | 83.83 (train) / 75.63 (deploy) | 17.9 (train) / 16.08 (deploy) | 80.22 | 95.10 | |
RepVGG-D2se* | 133.33 (train) / 120.39 (deploy) | 36.56 (train) / 32.85 (deploy) | 81.81 | 95.94 | |
ResNet-18 | 11.69 | 1.82 | 70.07 | 89.44 | |
ResNet-34 | 21.8 | 3.68 | 73.85 | 91.53 | |
ResNet-50 (rsb-a1) | 25.56 | 4.12 | 80.12 | 94.78 | |
ResNet-101 | 44.55 | 7.85 | 78.18 | 94.03 | |
ResNet-152 | 60.19 | 11.58 | 78.63 | 94.16 | |
Res2Net-50-14w-8s* | 25.06 | 4.22 | 78.14 | 93.85 | |
Res2Net-50-26w-8s* | 48.40 | 8.39 | 79.20 | 94.36 | |
Res2Net-101-26w-4s* | 45.21 | 8.12 | 79.19 | 94.44 | |
ResNeSt-50* | 27.48 | 5.41 | 81.13 | 95.59 | |
ResNeSt-101* | 48.28 | 10.27 | 82.32 | 96.24 | |
ResNeSt-200* | 70.2 | 17.53 | 82.41 | 96.22 | |
ResNeSt-269* | 110.93 | 22.58 | 82.70 | 96.28 | |
ResNetV1D-50 | 25.58 | 4.36 | 77.54 | 93.57 | |
ResNetV1D-101 | 44.57 | 8.09 | 78.93 | 94.48 | |
ResNetV1D-152 | 60.21 | 11.82 | 79.41 | 94.7 | |
ResNeXt-32x4d-50 | 25.03 | 4.27 | 77.90 | 93.66 | |
ResNeXt-32x4d-101 | 44.18 | 8.03 | 78.71 | 94.12 | |
ResNeXt-32x8d-101 | 88.79 | 16.5 | 79.23 | 94.58 | |
ResNeXt-32x4d-152 | 59.95 | 11.8 | 78.93 | 94.41 | |
SE-ResNet-50 | 28.09 | 4.13 | 77.74 | 93.84 | |
SE-ResNet-101 | 49.33 | 7.86 | 78.26 | 94.07 | |
RegNetX-400MF | 5.16 | 0.41 | 72.56 | 90.78 | |
RegNetX-800MF | 7.26 | 0.81 | 74.76 | 92.32 | |
RegNetX-1.6GF | 9.19 | 1.63 | 76.84 | 93.31 | |
RegNetX-3.2GF | 15.3 | 3.21 | 78.09 | 94.08 | |
RegNetX-4.0GF | 22.12 | 4.0 | 78.60 | 94.17 | |
RegNetX-6.4GF | 26.21 | 6.51 | 79.38 | 94.65 | |
RegNetX-8.0GF | 39.57 | 8.03 | 79.12 | 94.51 | |
RegNetX-12GF | 46.11 | 12.15 | 79.67 | 95.03 | |
ShuffleNetV1 1.0x (group=3) | 1.87 | 0.146 | 68.13 | 87.81 | |
ShuffleNetV2 1.0x | 2.28 | 0.149 | 69.55 | 88.92 | |
MobileNet V2 | 3.5 | 0.319 | 71.86 | 90.42 | |
ViT-B/16* | 86.86 | 33.03 | 85.43 | 97.77 | |
ViT-B/32* | 88.3 | 8.56 | 84.01 | 97.08 | |
ViT-L/16* | 304.72 | 116.68 | 85.63 | 97.63 | |
Swin-Transformer tiny | 28.29 | 4.36 | 81.18 | 95.61 | |
Swin-Transformer small | 49.61 | 8.52 | 83.02 | 96.29 | |
Swin-Transformer base | 87.77 | 15.14 | 83.36 | 96.44 | |
Transformer in Transformer small* | 23.76 | 3.36 | 81.52 | 95.73 | |
T2T-ViT_t-14 | 21.47 | 4.34 | 81.83 | 95.84 | |
T2T-ViT_t-19 | 39.08 | 7.80 | 82.63 | 96.18 | |
T2T-ViT_t-24 | 64.00 | 12.69 | 82.71 | 96.09 | |
Mixer-B/16* | 59.88 | 12.61 | 76.68 | 92.25 | |
Mixer-L/16* | 208.2 | 44.57 | 72.34 | 88.02 | |
DeiT-tiny | 5.72 | 1.08 | 74.50 | 92.24 | |
DeiT-tiny distilled* | 5.72 | 1.08 | 74.51 | 91.90 | |
DeiT-small | 22.05 | 4.24 | 80.69 | 95.06 | |
DeiT-small distilled* | 22.05 | 4.24 | 81.17 | 95.40 | |
DeiT-base | 86.57 | 16.86 | 81.76 | 95.81 | |
DeiT-base distilled* | 86.57 | 16.86 | 83.33 | 96.49 | |
DeiT-base 384px* | 86.86 | 49.37 | 83.04 | 96.31 | |
DeiT-base distilled 384px* | 86.86 | 49.37 | 85.55 | 97.35 | |
Conformer-tiny-p16* | 23.52 | 4.90 | 81.31 | 95.60 | |
Conformer-small-p32* | 38.85 | 7.09 | 81.96 | 96.02 | |
Conformer-small-p16* | 37.67 | 10.31 | 83.32 | 96.46 | |
Conformer-base-p16* | 83.29 | 22.89 | 83.82 | 96.59 | |
PCPVT-small* | 24.11 | 3.67 | 81.14 | 95.69 | |
PCPVT-base* | 43.83 | 6.45 | 82.66 | 96.26 | |
PCPVT-large* | 60.99 | 9.51 | 83.09 | 96.59 | |
SVT-small* | 24.06 | 2.82 | 81.77 | 95.57 | |
SVT-base* | 56.07 | 8.35 | 83.13 | 96.29 | |
SVT-large* | 99.27 | 14.82 | 83.60 | 96.50 | |
EfficientNet-B0* | 5.29 | 0.02 | 76.74 | 93.17 | |
EfficientNet-B0 (AA)* | 5.29 | 0.02 | 77.26 | 93.41 | |
EfficientNet-B0 (AA + AdvProp)* | 5.29 | 0.02 | 77.53 | 93.61 | |
EfficientNet-B1* | 7.79 | 0.03 | 78.68 | 94.28 | |
EfficientNet-B1 (AA)* | 7.79 | 0.03 | 79.20 | 94.42 | |
EfficientNet-B1 (AA + AdvProp)* | 7.79 | 0.03 | 79.52 | 94.43 | |
EfficientNet-B2* | 9.11 | 0.03 | 79.64 | 94.80 | |
EfficientNet-B2 (AA)* | 9.11 | 0.03 | 80.21 | 94.96 | |
EfficientNet-B2 (AA + AdvProp)* | 9.11 | 0.03 | 80.45 | 95.07 | |
EfficientNet-B3* | 12.23 | 0.06 | 81.01 | 95.34 | |
EfficientNet-B3 (AA)* | 12.23 | 0.06 | 81.58 | 95.67 | |
EfficientNet-B3 (AA + AdvProp)* | 12.23 | 0.06 | 81.81 | 95.69 | |
EfficientNet-B4* | 19.34 | 0.12 | 82.57 | 96.09 | |
EfficientNet-B4 (AA)* | 19.34 | 0.12 | 82.95 | 96.26 | |
EfficientNet-B4 (AA + AdvProp)* | 19.34 | 0.12 | 83.25 | 96.44 | |
EfficientNet-B5* | 30.39 | 0.24 | 83.18 | 96.47 | |
EfficientNet-B5 (AA)* | 30.39 | 0.24 | 83.82 | 96.76 | |
EfficientNet-B5 (AA + AdvProp)* | 30.39 | 0.24 | 84.21 | 96.98 | |
EfficientNet-B6 (AA)* | 43.04 | 0.41 | 84.05 | 96.82 | |
EfficientNet-B6 (AA + AdvProp)* | 43.04 | 0.41 | 84.74 | 97.14 | |
EfficientNet-B7 (AA)* | 66.35 | 0.72 | 84.38 | 96.88 | |
EfficientNet-B7 (AA + AdvProp)* | 66.35 | 0.72 | 85.14 | 97.23 | |
EfficientNet-B8 (AA + AdvProp)* | 87.41 | 1.09 | 85.38 | 97.28 | |
ConvNeXt-T* | 28.59 | 4.46 | 82.05 | 95.86 | |
ConvNeXt-S* | 50.22 | 8.69 | 83.13 | 96.44 | |
ConvNeXt-B* | 88.59 | 15.36 | 83.85 | 96.74 | |
ConvNeXt-B* | 88.59 | 15.36 | 85.81 | 97.86 | |
ConvNeXt-L* | 197.77 | 34.37 | 84.30 | 96.89 | |
ConvNeXt-L* | 197.77 | 34.37 | 86.61 | 98.04 | |
ConvNeXt-XL* | 350.20 | 60.93 | 86.97 | 98.20 | |
HRNet-W18* | 21.30 | 4.33 | 76.75 | 93.44 | |
HRNet-W30* | 37.71 | 8.17 | 78.19 | 94.22 | |
HRNet-W32* | 41.23 | 8.99 | 78.44 | 94.19 | |
HRNet-W40* | 57.55 | 12.77 | 78.94 | 94.47 | |
HRNet-W44* | 67.06 | 14.96 | 78.88 | 94.37 | |
HRNet-W48* | 77.47 | 17.36 | 79.32 | 94.52 | |
HRNet-W64* | 128.06 | 29.00 | 79.46 | 94.65 | |
HRNet-W18 (ssld)* | 21.30 | 4.33 | 81.06 | 95.70 | |
HRNet-W48 (ssld)* | 77.47 | 17.36 | 83.63 | 96.79 | |
WRN-50* | 68.88 | 11.44 | 81.45 | 95.53 | |
WRN-101* | 126.89 | 22.81 | 78.84 | 94.28 | |
CSPDarkNet50* | 27.64 | 5.04 | 80.05 | 95.07 | |
CSPResNet50* | 21.62 | 3.48 | 79.55 | 94.68 | |
CSPResNeXt50* | 20.57 | 3.11 | 79.96 | 94.96 | |
DenseNet121* | 7.98 | 2.88 | 74.96 | 92.21 | |
DenseNet169* | 14.15 | 3.42 | 76.08 | 93.11 | |
DenseNet201* | 20.01 | 4.37 | 77.32 | 93.64 | |
DenseNet161* | 28.68 | 7.82 | 77.61 | 93.83 | |
VAN-T* | 4.11 | 0.88 | 75.41 | 93.02 | |
VAN-S* | 13.86 | 2.52 | 81.01 | 95.63 | |
VAN-B* | 26.58 | 5.03 | 82.80 | 96.21 | |
VAN-L* | 44.77 | 8.99 | 83.86 | 96.73 | |
MViTv2-tiny* | 24.17 | 4.70 | 82.33 | 96.15 | |
MViTv2-small* | 34.87 | 7.00 | 83.63 | 96.51 | |
MViTv2-base* | 51.47 | 10.20 | 84.34 | 96.86 | |
MViTv2-large* | 217.99 | 42.10 | 85.25 | 97.14 | |
EfficientFormer-l1* | 12.19 | 1.30 | 80.46 | 94.99 | |
EfficientFormer-l3* | 31.41 | 3.93 | 82.45 | 96.18 | |
EfficientFormer-l7* | 82.23 | 10.16 | 83.40 | 96.60 | |
Models with * are converted from other repos, while the others are trained by ourselves.
CIFAR10¶
| Model | Params(M) | Flops(G) | Top-1 (%) |
| --- | --- | --- | --- |
| ResNet-18-b16x8 | 11.17 | 0.56 | 94.82 |
| ResNet-34-b16x8 | 21.28 | 1.16 | 95.34 |
| ResNet-50-b16x8 | 23.52 | 1.31 | 95.55 |
| ResNet-101-b16x8 | 42.51 | 2.52 | 95.58 |
| ResNet-152-b16x8 | 58.16 | 3.74 | 95.76 |
Conformer¶
Conformer: Local Features Coupling Global Representations for Visual Recognition
Abstract¶
Within Convolutional Neural Network (CNN), the convolution operations are good at extracting local features but experience difficulty to capture global representations. Within visual transformer, the cascaded self-attention modules can capture long-distance feature dependencies but unfortunately deteriorate local feature details. In this paper, we propose a hybrid network structure, termed Conformer, to take advantage of convolutional operations and self-attention mechanisms for enhanced representation learning. Conformer roots in the Feature Coupling Unit (FCU), which fuses local features and global representations under different resolutions in an interactive fashion. Conformer adopts a concurrent structure so that local features and global representations are retained to the maximum extent. Experiments show that Conformer, under the comparable parameter complexity, outperforms the visual transformer (DeiT-B) by 2.3% on ImageNet. On MSCOCO, it outperforms ResNet-101 by 3.7% and 3.6% mAPs for object detection and instance segmentation, respectively, demonstrating the great potential to be a general backbone network.

Results and models¶
ImageNet-1k¶
| Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) |
| --- | --- | --- | --- | --- |
| Conformer-tiny-p16* | 23.52 | 4.90 | 81.31 | 95.60 |
| Conformer-small-p32* | 38.85 | 7.09 | 81.96 | 96.02 |
| Conformer-small-p16* | 37.67 | 10.31 | 83.32 | 96.46 |
| Conformer-base-p16* | 83.29 | 22.89 | 83.82 | 96.59 |
Models with * are converted from the official repo. The config files of these models are only for validation. We don’t ensure these config files’ training accuracy and welcome you to contribute your reproduction results.
Citation¶
@article{peng2021conformer,
title={Conformer: Local Features Coupling Global Representations for Visual Recognition},
author={Zhiliang Peng and Wei Huang and Shanzhi Gu and Lingxi Xie and Yaowei Wang and Jianbin Jiao and Qixiang Ye},
journal={arXiv preprint arXiv:2105.03889},
year={2021},
}
ConvMixer¶
Abstract¶
Although convolutional networks have been the dominant architecture for vision tasks for many years, recent experiments have shown that Transformer-based models, most notably the Vision Transformer (ViT), may exceed their performance in some settings. However, due to the quadratic runtime of the self-attention layers in Transformers, ViTs require the use of patch embeddings, which group together small regions of the image into single input features, in order to be applied to larger image sizes. This raises a question: Is the performance of ViTs due to the inherently-more-powerful Transformer architecture, or is it at least partly due to using patches as the input representation? In this paper, we present some evidence for the latter: specifically, we propose the ConvMixer, an extremely simple model that is similar in spirit to the ViT and the even-more-basic MLP-Mixer in that it operates directly on patches as input, separates the mixing of spatial and channel dimensions, and maintains equal size and resolution throughout the network. In contrast, however, the ConvMixer uses only standard convolutions to achieve the mixing steps. Despite its simplicity, we show that the ConvMixer outperforms the ViT, MLP-Mixer, and some of their variants for similar parameter counts and data set sizes, in addition to outperforming classical vision models such as the ResNet.
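The design is compact enough to sketch directly. Below is a minimal, hedged PyTorch re-implementation of the idea described above (patch embedding by a strided convolution, then alternating depthwise "spatial mixing" and pointwise "channel mixing" convolutions). It is a sketch of the concept, not the exact code behind the checkpoints below; the default arguments merely mirror the ConvMixer-768/32 naming.

import torch
import torch.nn as nn

class Residual(nn.Module):
    def __init__(self, fn):
        super().__init__()
        self.fn = fn
    def forward(self, x):
        return self.fn(x) + x

def conv_mixer(dim=768, depth=32, kernel_size=7, patch_size=7, num_classes=1000):
    # Patch embedding, then `depth` blocks of depthwise (spatial) and pointwise (channel) mixing.
    return nn.Sequential(
        nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size),
        nn.GELU(),
        nn.BatchNorm2d(dim),
        *[nn.Sequential(
            Residual(nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim),
                nn.GELU(),
                nn.BatchNorm2d(dim))),
            nn.Conv2d(dim, dim, kernel_size=1),
            nn.GELU(),
            nn.BatchNorm2d(dim)) for _ in range(depth)],
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(dim, num_classes))

x = torch.randn(1, 3, 224, 224)
print(conv_mixer(depth=2)(x).shape)  # torch.Size([1, 1000]); a shallow depth is used here only to keep the demo fast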

Results and models¶
ImageNet-1k¶
| Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) |
| --- | --- | --- | --- | --- |
| ConvMixer-768/32* | 21.11 | 19.62 | 80.16 | 95.08 |
| ConvMixer-1024/20* | 24.38 | 5.55 | 76.94 | 93.36 |
| ConvMixer-1536/20* | 51.63 | 48.71 | 81.37 | 95.61 |
Models with * are converted from the official repo. The config files of these models are only for inference. We don’t ensure these config files’ training accuracy and welcome you to contribute your reproduction results.
Citation¶
@misc{trockman2022patches,
title={Patches Are All You Need?},
author={Asher Trockman and J. Zico Kolter},
year={2022},
eprint={2201.09792},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
ConvNeXt¶
Abstract¶
The “Roaring 20s” of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model. A vanilla ViT, on the other hand, faces difficulties when applied to general computer vision tasks such as object detection and semantic segmentation. It is the hierarchical Transformers (e.g., Swin Transformers) that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone and demonstrating remarkable performance on a wide variety of vision tasks. However, the effectiveness of such hybrid approaches is still largely credited to the intrinsic superiority of Transformers, rather than the inherent inductive biases of convolutions. In this work, we reexamine the design spaces and test the limits of what a pure ConvNet can achieve. We gradually “modernize” a standard ResNet toward the design of a vision Transformer, and discover several key components that contribute to the performance difference along the way. The outcome of this exploration is a family of pure ConvNet models dubbed ConvNeXt. Constructed entirely from standard ConvNet modules, ConvNeXts compete favorably with Transformers in terms of accuracy and scalability, achieving 87.8% ImageNet top-1 accuracy and outperforming Swin Transformers on COCO detection and ADE20K segmentation, while maintaining the simplicity and efficiency of standard ConvNets.

Results and models¶
ImageNet-1k¶
| Model | Pretrain | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) |
| --- | --- | --- | --- | --- | --- |
| ConvNeXt-T* | From scratch | 28.59 | 4.46 | 82.05 | 95.86 |
| ConvNeXt-S* | From scratch | 50.22 | 8.69 | 83.13 | 96.44 |
| ConvNeXt-B* | From scratch | 88.59 | 15.36 | 83.85 | 96.74 |
| ConvNeXt-B* | ImageNet-21k | 88.59 | 15.36 | 85.81 | 97.86 |
| ConvNeXt-L* | From scratch | 197.77 | 34.37 | 84.30 | 96.89 |
| ConvNeXt-L* | ImageNet-21k | 197.77 | 34.37 | 86.61 | 98.04 |
| ConvNeXt-XL* | ImageNet-21k | 350.20 | 60.93 | 86.97 | 98.20 |
Models with * are converted from the official repo. The config files of these models are only for inference. We don’t ensure these config files’ training accuracy and welcome you to contribute your reproduction results.
Pre-trained Models¶
The pre-trained models on ImageNet-1k or ImageNet-21k are used to fine-tune on the downstream tasks.
| Model | Training Data | Params(M) | Flops(G) |
| --- | --- | --- | --- |
| ConvNeXt-T* | ImageNet-1k | 28.59 | 4.46 |
| ConvNeXt-S* | ImageNet-1k | 50.22 | 8.69 |
| ConvNeXt-B* | ImageNet-1k | 88.59 | 15.36 |
| ConvNeXt-B* | ImageNet-21k | 88.59 | 15.36 |
| ConvNeXt-L* | ImageNet-21k | 197.77 | 34.37 |
| ConvNeXt-XL* | ImageNet-21k | 350.20 | 60.93 |
Models with * are converted from the official repo.
Citation¶
@Article{liu2022convnet,
author = {Zhuang Liu and Hanzi Mao and Chao-Yuan Wu and Christoph Feichtenhofer and Trevor Darrell and Saining Xie},
title = {A ConvNet for the 2020s},
journal = {arXiv preprint arXiv:2201.03545},
year = {2022},
}
CSPNet¶
CSPNet: A New Backbone that can Enhance Learning Capability of CNN
Abstract¶
Neural networks have enabled state-of-the-art approaches to achieve incredible results on computer vision tasks such as object detection. However, such success greatly relies on costly computation resources, which hinders people with cheap devices from appreciating the advanced technology. In this paper, we propose Cross Stage Partial Network (CSPNet) to mitigate the problem that previous works require heavy inference computations from the network architecture perspective. We attribute the problem to the duplicate gradient information within network optimization. The proposed networks respect the variability of the gradients by integrating feature maps from the beginning and the end of a network stage, which, in our experiments, reduces computations by 20% with equivalent or even superior accuracy on the ImageNet dataset, and significantly outperforms state-of-the-art approaches in terms of AP50 on the MS COCO object detection dataset. The CSPNet is easy to implement and general enough to cope with architectures based on ResNet, ResNeXt, and DenseNet. Source code is at this https URL.
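To make the cross-stage-partial idea concrete, here is a schematic PyTorch sketch under the assumption of a 50/50 channel split: one part of the channels passes through the stage's blocks while the other part bypasses them, and the two paths are concatenated and fused. It only illustrates the principle, not the exact CSPDarkNet/CSPResNet topology behind the checkpoints below.

import torch
import torch.nn as nn

class CSPStage(nn.Module):
    """Schematic cross-stage-partial stage: half of the channels go through the
    blocks, the other half skips them, and the two paths are fused at the end."""
    def __init__(self, channels, blocks):
        super().__init__()
        half = channels // 2
        self.split_a = nn.Conv2d(channels, half, 1)
        self.split_b = nn.Conv2d(channels, half, 1)
        self.blocks = blocks                      # any blocks working on `half` channels
        self.fuse = nn.Conv2d(2 * half, channels, 1)

    def forward(self, x):
        a = self.blocks(self.split_a(x))          # dense/residual path
        b = self.split_b(x)                       # partial (bypass) path
        return self.fuse(torch.cat([a, b], dim=1))

stage = CSPStage(64, nn.Sequential(nn.Conv2d(32, 32, 3, padding=1), nn.ReLU()))
print(stage(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])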

Results and models¶
ImageNet-1k¶
| Model | Pretrain | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) |
| --- | --- | --- | --- | --- | --- |
| CSPDarkNet50* | From scratch | 27.64 | 5.04 | 80.05 | 95.07 |
| CSPResNet50* | From scratch | 21.62 | 3.48 | 79.55 | 94.68 |
| CSPResNeXt50* | From scratch | 20.57 | 3.11 | 79.96 | 94.96 |
Models with * are converted from the timm repo. The config files of these models are only for inference. We don’t ensure these config files’ training accuracy and welcome you to contribute your reproduction results.
Citation¶
@inproceedings{wang2020cspnet,
title={CSPNet: A new backbone that can enhance learning capability of CNN},
author={Wang, Chien-Yao and Liao, Hong-Yuan Mark and Wu, Yueh-Hua and Chen, Ping-Yang and Hsieh, Jun-Wei and Yeh, I-Hau},
booktitle={Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops},
pages={390--391},
year={2020}
}
CSRA¶
Residual Attention: A Simple but Effective Method for Multi-Label Recognition
Abstract¶
Multi-label image recognition is a challenging computer vision task of practical use. Progresses in this area, however, are often characterized by complicated methods, heavy computations, and lack of intuitive explanations. To effectively capture different spatial regions occupied by objects from different categories, we propose an embarrassingly simple module, named class-specific residual attention (CSRA). CSRA generates class-specific features for every category by proposing a simple spatial attention score, and then combines it with the class-agnostic average pooling feature. CSRA achieves state-of-the-art results on multilabel recognition, and at the same time is much simpler than them. Furthermore, with only 4 lines of code, CSRA also leads to consistent improvement across many diverse pretrained models and datasets without any extra training. CSRA is both easy to implement and light in computations, which also enjoys intuitive explanations and visualizations.
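The module is simple enough to sketch. The snippet below is a minimal single-head version of the class-specific residual attention idea: a class-agnostic average-pooling logit plus a lambda-weighted, spatially attended class-specific logit. The class name and the `lam`/`T` values are illustrative assumptions, not the exact MMClassification implementation.

import torch
import torch.nn as nn

class CSRAHead(nn.Module):
    """Minimal sketch of class-specific residual attention (single head)."""
    def __init__(self, in_channels, num_classes, lam=0.1, T=1.0):
        super().__init__()
        self.classifier = nn.Conv2d(in_channels, num_classes, kernel_size=1, bias=False)
        self.lam, self.T = lam, T

    def forward(self, feat):              # feat: (N, C, H, W) backbone feature map
        score = self.classifier(feat)     # (N, num_classes, H, W) per-location class scores
        score = score.flatten(2)          # (N, num_classes, H*W)
        base = score.mean(dim=2)          # class-agnostic average-pooling logit
        attn = torch.softmax(score * self.T, dim=2)
        residual = (attn * score).sum(dim=2)   # class-specific spatial attention logit
        return base + self.lam * residual

head = CSRAHead(2048, num_classes=20)
print(head(torch.randn(2, 2048, 14, 14)).shape)  # torch.Size([2, 20])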

Results and models¶
VOC2007¶
| Model | Params(M) | Flops(G) | mAP | OF1 (%) | CF1 (%) |
| --- | --- | --- | --- | --- | --- |
| Resnet101-CSRA | 23.55 | 4.12 | 94.98 | 90.80 | 89.16 |
Citation¶
@misc{https://doi.org/10.48550/arxiv.2108.02456,
doi = {10.48550/ARXIV.2108.02456},
url = {https://arxiv.org/abs/2108.02456},
author = {Zhu, Ke and Wu, Jianxin},
keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences, FOS: Computer and information sciences},
title = {Residual Attention: A Simple but Effective Method for Multi-Label Recognition},
publisher = {arXiv},
year = {2021},
copyright = {arXiv.org perpetual, non-exclusive license}
}
DeiT¶
Training data-efficient image transformers & distillation through attention
Abstract¶
Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. However, these visual transformers are pre-trained with hundreds of millions of images using an expensive infrastructure, thereby limiting their adoption. In this work, we produce a competitive convolution-free transformer by training on Imagenet only. We train them on a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop evaluation) on ImageNet with no external data. More importantly, we introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention. We show the interest of this token-based distillation, especially when using a convnet as a teacher. This leads us to report results competitive with convnets for both Imagenet (where we obtain up to 85.2% accuracy) and when transferring to other tasks. We share our code and models.
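As a concrete illustration of the token-based distillation described above, the sketch below implements the hard-label variant of the objective: the class token is supervised by the ground truth, while the distillation token is supervised by the teacher's hard prediction. It is a schematic of the paper's idea only, not MMClassification training code (training the distilled models is not supported, as noted in the warning further down).

import torch
import torch.nn.functional as F

def deit_hard_distillation_loss(cls_logits, dist_logits, teacher_logits, target):
    """Sketch of DeiT-style hard-label distillation: half of the loss comes from
    the ground-truth label on the class token, half from the teacher's hard
    prediction on the distillation token (a convnet teacher is assumed)."""
    teacher_label = teacher_logits.argmax(dim=1)
    return 0.5 * F.cross_entropy(cls_logits, target) + \
           0.5 * F.cross_entropy(dist_logits, teacher_label)

cls_logits = torch.randn(4, 1000)
dist_logits = torch.randn(4, 1000)
teacher_logits = torch.randn(4, 1000)
target = torch.randint(0, 1000, (4,))
print(deit_hard_distillation_loss(cls_logits, dist_logits, teacher_logits, target))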

Results and models¶
ImageNet-1k¶
The teacher of the distilled DeiT models is RegNetY-16GF.
| Model | Pretrain | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) |
| --- | --- | --- | --- | --- | --- |
| DeiT-tiny | From scratch | 5.72 | 1.08 | 74.50 | 92.24 |
| DeiT-tiny distilled* | From scratch | 5.72 | 1.08 | 74.51 | 91.90 |
| DeiT-small | From scratch | 22.05 | 4.24 | 80.69 | 95.06 |
| DeiT-small distilled* | From scratch | 22.05 | 4.24 | 81.17 | 95.40 |
| DeiT-base | From scratch | 86.57 | 16.86 | 81.76 | 95.81 |
| DeiT-base* | From scratch | 86.57 | 16.86 | 81.79 | 95.59 |
| DeiT-base distilled* | From scratch | 86.57 | 16.86 | 83.33 | 96.49 |
| DeiT-base 384px* | ImageNet-1k | 86.86 | 49.37 | 83.04 | 96.31 |
| DeiT-base distilled 384px* | ImageNet-1k | 86.86 | 49.37 | 85.55 | 97.35 |
Models with * are converted from the official repo. The config files of these models are only for validation. We don’t ensure these config files’ training accuracy and welcome you to contribute your reproduction results.
Warning
MMClassification doesn’t support training the distilled versions of DeiT; the distilled checkpoints are provided for inference only.
Citation¶
@InProceedings{pmlr-v139-touvron21a,
title = {Training data-efficient image transformers & distillation through attention},
author = {Touvron, Hugo and Cord, Matthieu and Douze, Matthijs and Massa, Francisco and Sablayrolles, Alexandre and Jegou, Herve},
booktitle = {International Conference on Machine Learning},
pages = {10347--10357},
year = {2021},
volume = {139},
month = {July}
}
DenseNet¶
Densely Connected Convolutional Networks
Abstract¶
Recent work has shown that convolutional networks can be substantially deeper, more accurate, and efficient to train if they contain shorter connections between layers close to the input and those close to the output. In this paper, we embrace this observation and introduce the Dense Convolutional Network (DenseNet), which connects each layer to every other layer in a feed-forward fashion. Whereas traditional convolutional networks with L layers have L connections - one between each layer and its subsequent layer - our network has L(L+1)/2 direct connections. For each layer, the feature-maps of all preceding layers are used as inputs, and its own feature-maps are used as inputs into all subsequent layers. DenseNets have several compelling advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters. We evaluate our proposed architecture on four highly competitive object recognition benchmark tasks (CIFAR-10, CIFAR-100, SVHN, and ImageNet). DenseNets obtain significant improvements over the state-of-the-art on most of them, whilst requiring less computation to achieve high performance.
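The connectivity pattern can be sketched in a few lines. The block below is a minimal dense block in PyTorch: each layer receives the concatenation of all preceding feature maps and contributes `growth_rate` new channels. The bottleneck 1x1 convolution and the transition layers of the full DenseNet are omitted, so treat it as an illustration rather than the exact implementation behind the checkpoints below.

import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Minimal dense block: every layer takes the concatenation of all
    preceding feature maps and adds `growth_rate` new channels."""
    def __init__(self, in_channels, growth_rate=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            channels = in_channels + i * growth_rate
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, growth_rate, kernel_size=3, padding=1, bias=False)))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)

block = DenseBlock(64)
print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 192, 56, 56])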

Results and models¶
ImageNet-1k¶
| Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) |
| --- | --- | --- | --- | --- |
| DenseNet121* | 7.98 | 2.88 | 74.96 | 92.21 |
| DenseNet169* | 14.15 | 3.42 | 76.08 | 93.11 |
| DenseNet201* | 20.01 | 4.37 | 77.32 | 93.64 |
| DenseNet161* | 28.68 | 7.82 | 77.61 | 93.83 |
Models with * are converted from PyTorch, guided by the original repo. The config files of these models are only for inference. We don’t ensure these config files’ training accuracy and welcome you to contribute your reproduction results.
Citation¶
@misc{https://doi.org/10.48550/arxiv.1608.06993,
doi = {10.48550/ARXIV.1608.06993},
url = {https://arxiv.org/abs/1608.06993},
author = {Huang, Gao and Liu, Zhuang and van der Maaten, Laurens and Weinberger, Kilian Q.},
keywords = {Computer Vision and Pattern Recognition (cs.CV), Machine Learning (cs.LG), FOS: Computer and information sciences, FOS: Computer and information sciences},
title = {Densely Connected Convolutional Networks},
publisher = {arXiv},
year = {2016},
copyright = {arXiv.org perpetual, non-exclusive license}
}
EfficientFormer¶
EfficientFormer: Vision Transformers at MobileNet Speed
Abstract¶
Vision Transformers (ViT) have shown rapid progress in computer vision tasks, achieving promising results on various benchmarks. However, due to the massive number of parameters and model design, e.g., attention mechanism, ViT-based models are generally times slower than lightweight convolutional networks. Therefore, the deployment of ViT for real-time applications is particularly challenging, especially on resource-constrained hardware such as mobile devices. Recent efforts try to reduce the computation complexity of ViT through network architecture search or hybrid design with MobileNet block, yet the inference speed is still unsatisfactory. This leads to an important question: can transformers run as fast as MobileNet while obtaining high performance? To answer this, we first revisit the network architecture and operators used in ViT-based models and identify inefficient designs. Then we introduce a dimension-consistent pure transformer (without MobileNet blocks) as a design paradigm. Finally, we perform latency-driven slimming to get a series of final models dubbed EfficientFormer. Extensive experiments show the superiority of EfficientFormer in performance and speed on mobile devices. Our fastest model, EfficientFormer-L1, achieves 79.2% top-1 accuracy on ImageNet-1K with only 1.6 ms inference latency on iPhone 12 (compiled with CoreML), which runs as fast as MobileNetV2×1.4 (1.6 ms, 74.7% top-1), and our largest model, EfficientFormer-L7, obtains 83.3% accuracy with only 7.0 ms latency. Our work proves that properly designed transformers can reach extremely low latency on mobile devices while maintaining high performance.

Results and models¶
ImageNet-1k¶
| Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) |
| --- | --- | --- | --- | --- |
| EfficientFormer-l1* | 12.19 | 1.30 | 80.46 | 94.99 |
| EfficientFormer-l3* | 31.41 | 3.93 | 82.45 | 96.18 |
| EfficientFormer-l7* | 82.23 | 10.16 | 83.40 | 96.60 |
Models with * are converted from the official repo. The config files of these models are only for inference. We don’t ensure these config files’ training accuracy and welcome you to contribute your reproduction results.
Citation¶
@misc{https://doi.org/10.48550/arxiv.2206.01191,
doi = {10.48550/ARXIV.2206.01191},
url = {https://arxiv.org/abs/2206.01191},
author = {Li, Yanyu and Yuan, Geng and Wen, Yang and Hu, Eric and Evangelidis, Georgios and Tulyakov, Sergey and Wang, Yanzhi and Ren, Jian},
keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences, FOS: Computer and information sciences},
title = {EfficientFormer: Vision Transformers at MobileNet Speed},
publisher = {arXiv},
year = {2022},
copyright = {Creative Commons Attribution 4.0 International}
}
EfficientNet¶
Rethinking Model Scaling for Convolutional Neural Networks
Abstract¶
Convolutional Neural Networks (ConvNets) are commonly developed at a fixed resource budget, and then scaled up for better accuracy if more resources are available. In this paper, we systematically study model scaling and identify that carefully balancing network depth, width, and resolution can lead to better performance. Based on this observation, we propose a new scaling method that uniformly scales all dimensions of depth/width/resolution using a simple yet highly effective compound coefficient. We demonstrate the effectiveness of this method on scaling up MobileNets and ResNet. To go even further, we use neural architecture search to design a new baseline network and scale it up to obtain a family of models, called EfficientNets, which achieve much better accuracy and efficiency than previous ConvNets. In particular, our EfficientNet-B7 achieves state-of-the-art 84.3% top-1 accuracy on ImageNet, while being 8.4x smaller and 6.1x faster on inference than the best existing ConvNet. Our EfficientNets also transfer well and achieve state-of-the-art accuracy on CIFAR-100 (91.7%), Flowers (98.8%), and 3 other transfer learning datasets, with an order of magnitude fewer parameters.
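The compound scaling rule itself is easy to state in code. The sketch below uses the coefficients reported in the paper (alpha=1.2, beta=1.1, gamma=1.15, chosen so that alpha * beta^2 * gamma^2 is roughly 2) to scale depth, width and input resolution from the B0 baseline. The published B1 to B8 configurations additionally round these numbers by hand, so treat this as an illustration rather than an exact model generator.

def compound_scaling(phi, alpha=1.2, beta=1.1, gamma=1.15, base_resolution=224):
    """Scale depth, width and resolution jointly by a single coefficient phi."""
    depth_mult = alpha ** phi
    width_mult = beta ** phi
    resolution = int(round(base_resolution * gamma ** phi))
    return depth_mult, width_mult, resolution

for phi in range(4):
    d, w, r = compound_scaling(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, input ~{r}px")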

Results and models¶
ImageNet-1k¶
In the result table, AA means the model is trained with AutoAugment pre-processing, and AdvProp is a method of training with adversarial examples; more details can be found in the corresponding papers.
Note: MMClassification supports training with AutoAugment, but does not support AdvProp yet.
| Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) |
| --- | --- | --- | --- | --- |
| EfficientNet-B0* | 5.29 | 0.02 | 76.74 | 93.17 |
| EfficientNet-B0 (AA)* | 5.29 | 0.02 | 77.26 | 93.41 |
| EfficientNet-B0 (AA + AdvProp)* | 5.29 | 0.02 | 77.53 | 93.61 |
| EfficientNet-B1* | 7.79 | 0.03 | 78.68 | 94.28 |
| EfficientNet-B1 (AA)* | 7.79 | 0.03 | 79.20 | 94.42 |
| EfficientNet-B1 (AA + AdvProp)* | 7.79 | 0.03 | 79.52 | 94.43 |
| EfficientNet-B2* | 9.11 | 0.03 | 79.64 | 94.80 |
| EfficientNet-B2 (AA)* | 9.11 | 0.03 | 80.21 | 94.96 |
| EfficientNet-B2 (AA + AdvProp)* | 9.11 | 0.03 | 80.45 | 95.07 |
| EfficientNet-B3* | 12.23 | 0.06 | 81.01 | 95.34 |
| EfficientNet-B3 (AA)* | 12.23 | 0.06 | 81.58 | 95.67 |
| EfficientNet-B3 (AA + AdvProp)* | 12.23 | 0.06 | 81.81 | 95.69 |
| EfficientNet-B4* | 19.34 | 0.12 | 82.57 | 96.09 |
| EfficientNet-B4 (AA)* | 19.34 | 0.12 | 82.95 | 96.26 |
| EfficientNet-B4 (AA + AdvProp)* | 19.34 | 0.12 | 83.25 | 96.44 |
| EfficientNet-B5* | 30.39 | 0.24 | 83.18 | 96.47 |
| EfficientNet-B5 (AA)* | 30.39 | 0.24 | 83.82 | 96.76 |
| EfficientNet-B5 (AA + AdvProp)* | 30.39 | 0.24 | 84.21 | 96.98 |
| EfficientNet-B6 (AA)* | 43.04 | 0.41 | 84.05 | 96.82 |
| EfficientNet-B6 (AA + AdvProp)* | 43.04 | 0.41 | 84.74 | 97.14 |
| EfficientNet-B7 (AA)* | 66.35 | 0.72 | 84.38 | 96.88 |
| EfficientNet-B7 (AA + AdvProp)* | 66.35 | 0.72 | 85.14 | 97.23 |
| EfficientNet-B8 (AA + AdvProp)* | 87.41 | 1.09 | 85.38 | 97.28 |
Models with * are converted from the official repo. The config files of these models are only for inference. We don’t ensure these config files’ training accuracy and welcome you to contribute your reproduction results.
Citation¶
@inproceedings{tan2019efficientnet,
title={Efficientnet: Rethinking model scaling for convolutional neural networks},
author={Tan, Mingxing and Le, Quoc},
booktitle={International Conference on Machine Learning},
pages={6105--6114},
year={2019},
organization={PMLR}
}
HorNet¶
HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions
Abstract¶
Recent progress in vision Transformers exhibits great success in various tasks driven by the new spatial modeling mechanism based on dot-product self-attention. In this paper, we show that the key ingredients behind the vision Transformers, namely input-adaptive, long-range and high-order spatial interactions, can also be efficiently implemented with a convolution-based framework. We present the Recursive Gated Convolution (gnConv) that performs high-order spatial interactions with gated convolutions and recursive designs. The new operation is highly flexible and customizable, which is compatible with various variants of convolution and extends the two-order interactions in self-attention to arbitrary orders without introducing significant extra computation. gnConv can serve as a plug-and-play module to improve various vision Transformers and convolution-based models. Based on the operation, we construct a new family of generic vision backbones named HorNet. Extensive experiments on ImageNet classification, COCO object detection and ADE20K semantic segmentation show HorNet outperform Swin Transformers and ConvNeXt by a significant margin with similar overall architecture and training configurations. HorNet also shows favorable scalability to more training data and a larger model size. Apart from the effectiveness in visual encoders, we also show gnConv can be applied to task-specific decoders and consistently improve dense prediction performance with less computation. Our results demonstrate that gnConv can be a new basic module for visual modeling that effectively combines the merits of both vision Transformers and CNNs. Code is available at https://github.com/raoyongming/HorNet.
Results and models¶
ImageNet-1k¶
| Model | Pretrain | resolution | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) |
| --- | --- | --- | --- | --- | --- | --- |
| HorNet-T* | From scratch | 224x224 | 22.41 | 3.98 | 82.84 | 96.24 |
| HorNet-T-GF* | From scratch | 224x224 | 22.99 | 3.9 | 82.98 | 96.38 |
| HorNet-S* | From scratch | 224x224 | 49.53 | 8.83 | 83.79 | 96.75 |
| HorNet-S-GF* | From scratch | 224x224 | 50.4 | 8.71 | 83.98 | 96.77 |
| HorNet-B* | From scratch | 224x224 | 87.26 | 15.59 | 84.24 | 96.94 |
| HorNet-B-GF* | From scratch | 224x224 | 88.42 | 15.42 | 84.32 | 96.95 |
Models with * are converted from the official repo. The config files of these models are only for validation. We don’t ensure these config files’ training accuracy and welcome you to contribute your reproduction results.
Pre-trained Models¶
The pre-trained models on ImageNet-21k are used to fine-tune on the downstream tasks.
| Model | Pretrain | resolution | Params(M) | Flops(G) |
| --- | --- | --- | --- | --- |
| HorNet-L* | ImageNet-21k | 224x224 | 194.54 | 34.83 |
| HorNet-L-GF* | ImageNet-21k | 224x224 | 196.29 | 34.58 |
| HorNet-L-GF384* | ImageNet-21k | 384x384 | 201.23 | 101.63 |
Models with * are converted from the official repo.
Citation¶
@article{rao2022hornet,
title={HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions},
author={Rao, Yongming and Zhao, Wenliang and Tang, Yansong and Zhou, Jie and Lim, Ser-Lam and Lu, Jiwen},
journal={arXiv preprint arXiv:2207.14284},
year={2022}
}
HRNet¶
Deep High-Resolution Representation Learning for Visual Recognition
Abstract¶
High-resolution representations are essential for position-sensitive vision problems, such as human pose estimation, semantic segmentation, and object detection. Existing state-of-the-art frameworks first encode the input image as a low-resolution representation through a subnetwork that is formed by connecting high-to-low resolution convolutions in series (e.g., ResNet, VGGNet), and then recover the high-resolution representation from the encoded low-resolution representation. Instead, our proposed network, named as High-Resolution Network (HRNet), maintains high-resolution representations through the whole process. There are two key characteristics: (i) Connect the high-to-low resolution convolution streams in parallel; (ii) Repeatedly exchange the information across resolutions. The benefit is that the resulting representation is semantically richer and spatially more precise. We show the superiority of the proposed HRNet in a wide range of applications, including human pose estimation, semantic segmentation, and object detection, suggesting that the HRNet is a stronger backbone for computer vision problems.

Results and models¶
ImageNet-1k¶
| Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) |
| --- | --- | --- | --- | --- |
| HRNet-W18* | 21.30 | 4.33 | 76.75 | 93.44 |
| HRNet-W30* | 37.71 | 8.17 | 78.19 | 94.22 |
| HRNet-W32* | 41.23 | 8.99 | 78.44 | 94.19 |
| HRNet-W40* | 57.55 | 12.77 | 78.94 | 94.47 |
| HRNet-W44* | 67.06 | 14.96 | 78.88 | 94.37 |
| HRNet-W48* | 77.47 | 17.36 | 79.32 | 94.52 |
| HRNet-W64* | 128.06 | 29.00 | 79.46 | 94.65 |
| HRNet-W18 (ssld)* | 21.30 | 4.33 | 81.06 | 95.70 |
| HRNet-W48 (ssld)* | 77.47 | 17.36 | 83.63 | 96.79 |
Models with * are converted from the official repo. The config files of these models are only for inference. We don’t ensure these config files’ training accuracy and welcome you to contribute your reproduction results.
Citation¶
@article{WangSCJDZLMTWLX19,
title={Deep High-Resolution Representation Learning for Visual Recognition},
author={Jingdong Wang and Ke Sun and Tianheng Cheng and
Borui Jiang and Chaorui Deng and Yang Zhao and Dong Liu and Yadong Mu and
Mingkui Tan and Xinggang Wang and Wenyu Liu and Bin Xiao},
journal = {TPAMI},
year={2019}
}
Mlp-Mixer¶
MLP-Mixer: An all-MLP Architecture for Vision
Abstract¶
Convolutional Neural Networks (CNNs) are the go-to model for computer vision. Recently, attention-based networks, such as the Vision Transformer, have also become popular. In this paper we show that while convolutions and attention are both sufficient for good performance, neither of them are necessary. We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs). MLP-Mixer contains two types of layers: one with MLPs applied independently to image patches (i.e. “mixing” the per-location features), and one with MLPs applied across patches (i.e. “mixing” spatial information). When trained on large datasets, or with modern regularization schemes, MLP-Mixer attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models. We hope that these results spark further research beyond the realms of well established CNNs and Transformers.
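A Mixer block is short enough to sketch. The snippet below is a minimal PyTorch version of the two MLP types described above: a token-mixing MLP applied across patches and a channel-mixing MLP applied across channels, each behind a LayerNorm and a residual connection. The hidden sizes are illustrative, not the exact Mixer-B/L hyperparameters.

import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """Minimal Mixer block: an MLP across patches (token mixing) followed by
    an MLP across channels (channel mixing)."""
    def __init__(self, num_patches, dim, token_hidden=256, channel_hidden=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(
            nn.Linear(num_patches, token_hidden), nn.GELU(), nn.Linear(token_hidden, num_patches))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, channel_hidden), nn.GELU(), nn.Linear(channel_hidden, dim))

    def forward(self, x):                      # x: (N, num_patches, dim)
        y = self.norm1(x).transpose(1, 2)      # (N, dim, num_patches)
        x = x + self.token_mlp(y).transpose(1, 2)
        return x + self.channel_mlp(self.norm2(x))

block = MixerBlock(num_patches=196, dim=768)
print(block(torch.randn(2, 196, 768)).shape)  # torch.Size([2, 196, 768])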

Results and models¶
ImageNet-1k¶
| Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) |
| --- | --- | --- | --- | --- |
| Mixer-B/16* | 59.88 | 12.61 | 76.68 | 92.25 |
| Mixer-L/16* | 208.2 | 44.57 | 72.34 | 88.02 |
Models with * are converted from timm. The config files of these models are only for validation. We don’t ensure these config files’ training accuracy and welcome you to contribute your reproduction results.
Citation¶
@misc{tolstikhin2021mlpmixer,
title={MLP-Mixer: An all-MLP Architecture for Vision},
author={Ilya Tolstikhin and Neil Houlsby and Alexander Kolesnikov and Lucas Beyer and Xiaohua Zhai and Thomas Unterthiner and Jessica Yung and Andreas Steiner and Daniel Keysers and Jakob Uszkoreit and Mario Lucic and Alexey Dosovitskiy},
year={2021},
eprint={2105.01601},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
MobileNet V2¶
MobileNetV2: Inverted Residuals and Linear Bottlenecks
Abstract¶
In this paper we describe a new mobile architecture, MobileNetV2, that improves the state of the art performance of mobile models on multiple tasks and benchmarks as well as across a spectrum of different model sizes. We also describe efficient ways of applying these mobile models to object detection in a novel framework we call SSDLite. Additionally, we demonstrate how to build mobile semantic segmentation models through a reduced form of DeepLabv3 which we call Mobile DeepLabv3.
The MobileNetV2 architecture is based on an inverted residual structure where the input and output of the residual block are thin bottleneck layers, opposite to traditional residual models which use expanded representations in the input. MobileNetV2 uses lightweight depthwise convolutions to filter features in the intermediate expansion layer. Additionally, we find that it is important to remove non-linearities in the narrow layers in order to maintain representational power. We demonstrate that this improves performance and provide an intuition that led to this design. Finally, our approach allows decoupling of the input/output domains from the expressiveness of the transformation, which provides a convenient framework for further analysis. We measure our performance on ImageNet classification, COCO object detection, and VOC image segmentation. We evaluate the trade-offs between accuracy and the number of operations measured by multiply-adds (MAdds), as well as the number of parameters.
Results and models¶
ImageNet-1k¶
| Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) |
| --- | --- | --- | --- | --- |
| MobileNet V2 | 3.5 | 0.319 | 71.86 | 90.42 |
Citation¶
@INPROCEEDINGS{8578572,
author={M. {Sandler} and A. {Howard} and M. {Zhu} and A. {Zhmoginov} and L. {Chen}},
booktitle={2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition},
title={MobileNetV2: Inverted Residuals and Linear Bottlenecks},
year={2018},
volume={},
number={},
pages={4510-4520},
doi={10.1109/CVPR.2018.00474}
}
MobileNet V3¶
Abstract¶
We present the next generation of MobileNets based on a combination of complementary search techniques as well as a novel architecture design. MobileNetV3 is tuned to mobile phone CPUs through a combination of hardware-aware network architecture search (NAS) complemented by the NetAdapt algorithm and then subsequently improved through novel architecture advances. This paper starts the exploration of how automated search algorithms and network design can work together to harness complementary approaches improving the overall state of the art. Through this process we create two new MobileNet models for release: MobileNetV3-Large and MobileNetV3-Small which are targeted for high and low resource use cases. These models are then adapted and applied to the tasks of object detection and semantic segmentation. For the task of semantic segmentation (or any dense pixel prediction), we propose a new efficient segmentation decoder Lite Reduced Atrous Spatial Pyramid Pooling (LR-ASPP). We achieve new state of the art results for mobile classification, detection and segmentation. MobileNetV3-Large is 3.2% more accurate on ImageNet classification while reducing latency by 15% compared to MobileNetV2. MobileNetV3-Small is 4.6% more accurate while reducing latency by 5% compared to MobileNetV2. MobileNetV3-Large detection is 25% faster at roughly the same accuracy as MobileNetV2 on COCO detection. MobileNetV3-Large LR-ASPP is 30% faster than MobileNetV2 R-ASPP at similar accuracy for Cityscapes segmentation.

Results and models¶
ImageNet-1k¶
| Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) |
| --- | --- | --- | --- | --- |
| MobileNetV3-Small* | 2.54 | 0.06 | 67.66 | 87.41 |
| MobileNetV3-Large* | 5.48 | 0.23 | 74.04 | 91.34 |
Models with * are converted from torchvision. The config files of these models are only for validation. We don’t ensure these config files’ training accuracy and welcome you to contribute your reproduction results.
Citation¶
@inproceedings{Howard_2019_ICCV,
author = {Howard, Andrew and Sandler, Mark and Chu, Grace and Chen, Liang-Chieh and Chen, Bo and Tan, Mingxing and Wang, Weijun and Zhu, Yukun and Pang, Ruoming and Vasudevan, Vijay and Le, Quoc V. and Adam, Hartwig},
title = {Searching for MobileNetV3},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2019}
}
MViT V2¶
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection
Abstract¶
In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for image and video classification, as well as object detection. We present an improved version of MViT that incorporates decomposed relative positional embeddings and residual pooling connections. We instantiate this architecture in five sizes and evaluate it for ImageNet classification, COCO detection and Kinetics video recognition where it outperforms prior work. We further compare MViTv2s’ pooling attention to window attention mechanisms where it outperforms the latter in accuracy/compute. Without bells-and-whistles, MViTv2 has state-of-the-art performance in 3 domains: 88.8% accuracy on ImageNet classification, 58.7 boxAP on COCO object detection as well as 86.1% on Kinetics-400 video classification.

Results and models¶
ImageNet-1k¶
| Model | Pretrain | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) |
| --- | --- | --- | --- | --- | --- |
| MViTv2-tiny* | From scratch | 24.17 | 4.70 | 82.33 | 96.15 |
| MViTv2-small* | From scratch | 34.87 | 7.00 | 83.63 | 96.51 |
| MViTv2-base* | From scratch | 51.47 | 10.20 | 84.34 | 96.86 |
| MViTv2-large* | From scratch | 217.99 | 42.10 | 85.25 | 97.14 |
Models with * are converted from the official repo. The config files of these models are only for inference. We don’t ensure these config files’ training accuracy and welcome you to contribute your reproduction results.
Citation¶
@inproceedings{li2021improved,
title={MViTv2: Improved multiscale vision transformers for classification and detection},
author={Li, Yanghao and Wu, Chao-Yuan and Fan, Haoqi and Mangalam, Karttikeya and Xiong, Bo and Malik, Jitendra and Feichtenhofer, Christoph},
booktitle={CVPR},
year={2022}
}
PoolFormer¶
MetaFormer is Actually What You Need for Vision
Abstract¶
Transformers have shown great potential in computer vision tasks. A common belief is their attention-based token mixer module contributes most to their competence. However, recent works show the attention-based module in transformers can be replaced by spatial MLPs and the resulted models still perform quite well. Based on this observation, we hypothesize that the general architecture of the transformers, instead of the specific token mixer module, is more essential to the model’s performance. To verify this, we deliberately replace the attention module in transformers with an embarrassingly simple spatial pooling operator to conduct only basic token mixing. Surprisingly, we observe that the derived model, termed as PoolFormer, achieves competitive performance on multiple computer vision tasks. For example, on ImageNet-1K, PoolFormer achieves 82.1% top-1 accuracy, surpassing well-tuned vision transformer/MLP-like baselines DeiT-B/ResMLP-B24 by 0.3%/1.1% accuracy with 35%/52% fewer parameters and 49%/61% fewer MACs. The effectiveness of PoolFormer verifies our hypothesis and urges us to initiate the concept of “MetaFormer”, a general architecture abstracted from transformers without specifying the token mixer. Based on the extensive experiments, we argue that MetaFormer is the key player in achieving superior results for recent transformer and MLP-like models on vision tasks. This work calls for more future research dedicated to improving MetaFormer instead of focusing on the token mixer modules. Additionally, our proposed PoolFormer could serve as a starting baseline for future MetaFormer architecture design.
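The token mixer at the heart of PoolFormer is just pooling. The sketch below shows that operator in isolation: average pooling minus the identity, so that the residual connection of the surrounding block adds the input back. The full MetaFormer block around it, with its norms and channel MLP, is omitted, so this is only an illustration of the idea.

import torch
import torch.nn as nn

class PoolingTokenMixer(nn.Module):
    """PoolFormer's token mixer in a nutshell: pooling used in place of attention."""
    def __init__(self, pool_size=3):
        super().__init__()
        self.pool = nn.AvgPool2d(pool_size, stride=1, padding=pool_size // 2,
                                 count_include_pad=False)

    def forward(self, x):          # x: (N, C, H, W)
        return self.pool(x) - x    # the block's residual connection adds x back

mixer = PoolingTokenMixer()
print(mixer(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])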

Results and models¶
ImageNet-1k¶
| Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) |
| --- | --- | --- | --- | --- |
| PoolFormer-S12* | 11.92 | 1.87 | 77.24 | 93.51 |
| PoolFormer-S24* | 21.39 | 3.51 | 80.33 | 95.05 |
| PoolFormer-S36* | 30.86 | 5.15 | 81.43 | 95.45 |
| PoolFormer-M36* | 56.17 | 8.96 | 82.14 | 95.71 |
| PoolFormer-M48* | 73.47 | 11.80 | 82.51 | 95.95 |
Models with * are converted from the official repo. The config files of these models are only for inference. We don’t ensure these config files’ training accuracy and welcome you to contribute your reproduction results.
Citation¶
@article{yu2021metaformer,
title={MetaFormer is Actually What You Need for Vision},
author={Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng},
journal={arXiv preprint arXiv:2111.11418},
year={2021}
}
RegNet¶
Designing Network Design Spaces
Abstract¶
In this work, we present a new network design paradigm. Our goal is to help advance the understanding of network design and discover design principles that generalize across settings. Instead of focusing on designing individual network instances, we design network design spaces that parametrize populations of networks. The overall process is analogous to classic manual design of networks, but elevated to the design space level. Using our methodology we explore the structure aspect of network design and arrive at a low-dimensional design space consisting of simple, regular networks that we call RegNet. The core insight of the RegNet parametrization is surprisingly simple: widths and depths of good networks can be explained by a quantized linear function. We analyze the RegNet design space and arrive at interesting findings that do not match the current practice of network design. The RegNet design space provides simple and fast networks that work well across a wide range of flop regimes. Under comparable training settings and flops, the RegNet models outperform the popular EfficientNet models while being up to 5x faster on GPUs.
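The quantized linear rule mentioned above can be written down directly. The sketch below follows the paper's parameterization: per-block widths u_j = w_0 + w_a * j are snapped to powers of w_m and rounded to a multiple of 8, which yields the stage widths. The generator values shown are illustrative assumptions, not the exact parameters of the RegNetX configs listed below.

import numpy as np

def regnet_widths(depth, w0, wa, wm, divisor=8):
    """Sketch of the RegNet width generator: linear rule, then quantization."""
    u = w0 + wa * np.arange(depth)                    # continuous linear widths
    s = np.round(np.log(u / w0) / np.log(wm))         # quantization exponents
    widths = w0 * np.power(wm, s)
    widths = (np.round(widths / divisor) * divisor).astype(int)
    return widths

# Illustrative generator values only, for demonstration purposes.
print(regnet_widths(depth=16, w0=48, wa=36.44, wm=2.49))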

Results and models¶
ImageNet-1k¶
| Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) |
| --- | --- | --- | --- | --- |
| RegNetX-400MF | 5.16 | 0.41 | 72.56 | 90.78 |
| RegNetX-800MF | 7.26 | 0.81 | 74.76 | 92.32 |
| RegNetX-1.6GF | 9.19 | 1.63 | 76.84 | 93.31 |
| RegNetX-3.2GF | 15.3 | 3.21 | 78.09 | 94.08 |
| RegNetX-4.0GF | 22.12 | 4.0 | 78.60 | 94.17 |
| RegNetX-6.4GF | 26.21 | 6.51 | 79.38 | 94.65 |
| RegNetX-8.0GF | 39.57 | 8.03 | 79.12 | 94.51 |
| RegNetX-12GF | 46.11 | 12.15 | 79.67 | 95.03 |
| RegNetX-400MF* | 5.16 | 0.41 | 72.55 | 90.91 |
| RegNetX-800MF* | 7.26 | 0.81 | 75.21 | 92.37 |
| RegNetX-1.6GF* | 9.19 | 1.63 | 77.04 | 93.51 |
| RegNetX-3.2GF* | 15.3 | 3.21 | 78.26 | 94.20 |
| RegNetX-4.0GF* | 22.12 | 4.0 | 78.72 | 94.22 |
| RegNetX-6.4GF* | 26.21 | 6.51 | 79.22 | 94.61 |
| RegNetX-8.0GF* | 39.57 | 8.03 | 79.31 | 94.57 |
| RegNetX-12GF* | 46.11 | 12.15 | 79.91 | 94.78 |
Models with * are converted from pycls. The config files of these models are only for validation.
Citation¶
@article{radosavovic2020designing,
title={Designing Network Design Spaces},
author={Ilija Radosavovic and Raj Prateek Kosaraju and Ross Girshick and Kaiming He and Piotr Dollár},
year={2020},
eprint={2003.13678},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
RepMLP¶
RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition
Abstract¶
We propose RepMLP, a multi-layer-perceptron-style neural network building block for image recognition, which is composed of a series of fully-connected (FC) layers. Compared to convolutional layers, FC layers are more efficient, better at modeling the long-range dependencies and positional patterns, but worse at capturing the local structures, hence usually less favored for image recognition. We propose a structural re-parameterization technique that adds local prior into an FC to make it powerful for image recognition. Specifically, we construct convolutional layers inside a RepMLP during training and merge them into the FC for inference. On CIFAR, a simple pure-MLP model shows performance very close to CNN. By inserting RepMLP in traditional CNN, we improve ResNets by 1.8% accuracy on ImageNet, 2.9% for face recognition, and 2.3% mIoU on Cityscapes with lower FLOPs. Our intriguing findings highlight that combining the global representational capacity and positional perception of FC with the local prior of convolution can improve the performance of neural network with faster speed on both the tasks with translation invariance (e.g., semantic segmentation) and those with aligned images and positional patterns (e.g., face recognition).

Results and models¶
ImageNet-1k¶
| Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) |
| --- | --- | --- | --- | --- |
| RepMLP-B224* | 68.24 | 6.71 | 80.41 | 95.12 |
| RepMLP-B256* | 96.45 | 9.69 | 81.11 | 95.5 |
Models with * are converted from the official repo. The config files of these models are only for validation. We don’t ensure these config files’ training accuracy and welcome you to contribute your reproduction results.
How to use¶
The checkpoints provided are all training-time models. Use the reparameterize tool to switch them to the more efficient inference-time architecture, which not only has fewer parameters but also requires fewer computations.
Use tool¶
Use the provided tool to reparameterize the given model and save the checkpoint:
python tools/convert_models/reparameterize_model.py ${CFG_PATH} ${SRC_CKPT_PATH} ${TARGET_CKPT_PATH}
${CFG_PATH} is the config file, ${SRC_CKPT_PATH} is the source checkpoint file, and ${TARGET_CKPT_PATH} is the target deploy weight file path.
To use the reparameterized weights, the config file must be switched to the corresponding deploy config file.
python tools/test.py ${Deploy_CFG} ${Deploy_Checkpoint} --metrics accuracy
In the code¶
Use backbone.switch_to_deploy() or classifier.backbone.switch_to_deploy() to switch to the deploy mode. For example:
from mmcls.models import build_backbone
backbone_cfg=dict(type='RepMLPNet', arch='B', img_size=224, reparam_conv_kernels=(1, 3), deploy=False)
backbone = build_backbone(backbone_cfg)
backbone.switch_to_deploy()
or
from mmcls.models import build_classifier
cfg = dict(
    type='ImageClassifier',
    backbone=dict(
        type='RepMLPNet',
        arch='B',
        img_size=224,
        reparam_conv_kernels=(1, 3),
        deploy=False),
    neck=dict(type='GlobalAveragePooling'),
    head=dict(
        type='LinearClsHead',
        num_classes=1000,
        in_channels=768,
        loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
        topk=(1, 5),
    ))
classifier = build_classifier(cfg)
classifier.backbone.switch_to_deploy()
Citation¶
@article{ding2021repmlp,
title={Repmlp: Re-parameterizing convolutions into fully-connected layers for image recognition},
author={Ding, Xiaohan and Xia, Chunlong and Zhang, Xiangyu and Chu, Xiaojie and Han, Jungong and Ding, Guiguang},
journal={arXiv preprint arXiv:2105.01883},
year={2021}
}
RepVGG¶
RepVGG: Making VGG-style ConvNets Great Again
Abstract¶
We present a simple but powerful architecture of convolutional neural network, which has a VGG-like inference-time body composed of nothing but a stack of 3x3 convolution and ReLU, while the training-time model has a multi-branch topology. Such decoupling of the training-time and inference-time architecture is realized by a structural re-parameterization technique so that the model is named RepVGG. On ImageNet, RepVGG reaches over 80% top-1 accuracy, which is the first time for a plain model, to the best of our knowledge. On NVIDIA 1080Ti GPU, RepVGG models run 83% faster than ResNet-50 or 101% faster than ResNet-101 with higher accuracy and show favorable accuracy-speed trade-off compared to the state-of-the-art models like EfficientNet and RegNet.
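The structural re-parameterization is a purely linear operation, so it can be verified numerically. The sketch below fuses a 3x3 branch, a 1x1 branch and an identity branch into a single 3x3 kernel (BatchNorm folding is omitted for brevity) and checks that the fused convolution matches the multi-branch output. It illustrates the principle only; the actual conversion is done by the reparameterize tool described further down.

import torch
import torch.nn.functional as F

def fuse_repvgg_branches(w3x3, w1x1, in_channels):
    """Express the 1x1 and identity branches as 3x3 kernels and add them to
    the 3x3 branch, giving a single convolution for inference."""
    w1x1_as_3x3 = F.pad(w1x1, [1, 1, 1, 1])          # place the 1x1 kernel at the center
    identity = torch.zeros_like(w3x3)
    for i in range(in_channels):
        identity[i, i, 1, 1] = 1.0                   # identity mapping as a 3x3 kernel
    return w3x3 + w1x1_as_3x3 + identity

c = 8
w3x3 = torch.randn(c, c, 3, 3)
w1x1 = torch.randn(c, c, 1, 1)
w_fused = fuse_repvgg_branches(w3x3, w1x1, c)

x = torch.randn(1, c, 32, 32)
y_multi = F.conv2d(x, w3x3, padding=1) + F.conv2d(x, w1x1) + x
y_fused = F.conv2d(x, w_fused, padding=1)
print(torch.allclose(y_multi, y_fused, atol=1e-5))   # True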

Results and models¶
ImageNet-1k¶
| Model | Epochs | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) |
| --- | --- | --- | --- | --- | --- |
| RepVGG-A0* | 120 | 9.11 (train) / 8.31 (deploy) | 1.52 (train) / 1.36 (deploy) | 72.41 | 90.50 |
| RepVGG-A1* | 120 | 14.09 (train) / 12.79 (deploy) | 2.64 (train) / 2.37 (deploy) | 74.47 | 91.85 |
| RepVGG-A2* | 120 | 28.21 (train) / 25.5 (deploy) | 5.7 (train) / 5.12 (deploy) | 76.48 | 93.01 |
| RepVGG-B0* | 120 | 15.82 (train) / 14.34 (deploy) | 3.42 (train) / 3.06 (deploy) | 75.14 | 92.42 |
| RepVGG-B1* | 120 | 57.42 (train) / 51.83 (deploy) | 13.16 (train) / 11.82 (deploy) | 78.37 | 94.11 |
| RepVGG-B1g2* | 120 | 45.78 (train) / 41.36 (deploy) | 9.82 (train) / 8.82 (deploy) | 77.79 | 93.88 |
| RepVGG-B1g4* | 120 | 39.97 (train) / 36.13 (deploy) | 8.15 (train) / 7.32 (deploy) | 77.58 | 93.84 |
| RepVGG-B2* | 120 | 89.02 (train) / 80.32 (deploy) | 20.46 (train) / 18.39 (deploy) | 78.78 | 94.42 |
| RepVGG-B2g4* | 200 | 61.76 (train) / 55.78 (deploy) | 12.63 (train) / 11.34 (deploy) | 79.38 | 94.68 |
| RepVGG-B3* | 200 | 123.09 (train) / 110.96 (deploy) | 29.17 (train) / 26.22 (deploy) | 80.52 | 95.26 |
| RepVGG-B3g4* | 200 | 83.83 (train) / 75.63 (deploy) | 17.9 (train) / 16.08 (deploy) | 80.22 | 95.10 |
| RepVGG-D2se* | 200 | 133.33 (train) / 120.39 (deploy) | 36.56 (train) / 32.85 (deploy) | 81.81 | 95.94 |
Models with * are converted from the official repo. The config files of these models are only for validation. We don’t ensure these config files’ training accuracy and welcome you to contribute your reproduction results.
How to use¶
The checkpoints provided are all training-time models. Use the reparameterize tool to switch them to the more efficient inference-time architecture, which not only has fewer parameters but also requires fewer computations.
Use tool¶
Use the provided tool to reparameterize the given model and save the checkpoint:
python tools/convert_models/reparameterize_model.py ${CFG_PATH} ${SRC_CKPT_PATH} ${TARGET_CKPT_PATH}
${CFG_PATH} is the config file, ${SRC_CKPT_PATH} is the source checkpoint file, and ${TARGET_CKPT_PATH} is the target deploy weight file path.
To use the reparameterized weights, the config file must be switched to the corresponding deploy config file.
python tools/test.py ${Deploy_CFG} ${Deploy_Checkpoint} --metrics accuracy
In the code¶
Use backbone.switch_to_deploy() or classifier.backbone.switch_to_deploy() to switch to the deploy mode. For example:
from mmcls.models import build_backbone
backbone_cfg = dict(type='RepVGG', arch='A0')
backbone = build_backbone(backbone_cfg)
backbone.switch_to_deploy()
or
from mmcls.models import build_classifier
cfg = dict(
    type='ImageClassifier',
    backbone=dict(
        type='RepVGG',
        arch='A0'),
    neck=dict(type='GlobalAveragePooling'),
    head=dict(
        type='LinearClsHead',
        num_classes=1000,
        in_channels=1280,
        loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
        topk=(1, 5),
    ))
classifier = build_classifier(cfg)
classifier.backbone.switch_to_deploy()
Citation¶
@inproceedings{ding2021repvgg,
title={Repvgg: Making vgg-style convnets great again},
author={Ding, Xiaohan and Zhang, Xiangyu and Ma, Ningning and Han, Jungong and Ding, Guiguang and Sun, Jian},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={13733--13742},
year={2021}
}
Res2Net¶
Res2Net: A New Multi-scale Backbone Architecture
Abstract¶
Representing features at multiple scales is of great importance for numerous vision tasks. Recent advances in backbone convolutional neural networks (CNNs) continually demonstrate stronger multi-scale representation ability, leading to consistent performance gains on a wide range of applications. However, most existing methods represent the multi-scale features in a layer-wise manner. In this paper, we propose a novel building block for CNNs, namely Res2Net, by constructing hierarchical residual-like connections within one single residual block. The Res2Net represents multi-scale features at a granular level and increases the range of receptive fields for each network layer. The proposed Res2Net block can be plugged into the state-of-the-art backbone CNN models, e.g., ResNet, ResNeXt, and DLA. We evaluate the Res2Net block on all these models and demonstrate consistent performance gains over baseline models on widely-used datasets, e.g., CIFAR-100 and ImageNet. Further ablation studies and experimental results on representative computer vision tasks, i.e., object detection, class activation mapping, and salient object detection, further verify the superiority of the Res2Net over the state-of-the-art baseline methods.

Results and models¶
ImageNet-1k¶
| Model | resolution | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) |
| --- | --- | --- | --- | --- | --- |
| Res2Net-50-14w-8s* | 224x224 | 25.06 | 4.22 | 78.14 | 93.85 |
| Res2Net-50-26w-8s* | 224x224 | 48.40 | 8.39 | 79.20 | 94.36 |
| Res2Net-101-26w-4s* | 224x224 | 45.21 | 8.12 | 79.19 | 94.44 |
Models with * are converted from the official repo. The config files of these models are only for validation. We don’t ensure these config files’ training accuracy and welcome you to contribute your reproduction results.
Citation¶
@article{gao2019res2net,
title={Res2Net: A New Multi-scale Backbone Architecture},
author={Gao, Shang-Hua and Cheng, Ming-Ming and Zhao, Kai and Zhang, Xin-Yu and Yang, Ming-Hsuan and Torr, Philip},
journal={IEEE TPAMI},
year={2021},
doi={10.1109/TPAMI.2019.2938758},
}
ResNet¶
Deep Residual Learning for Image Recognition
Abstract¶
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers.
The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
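The residual formulation is worth seeing in code. Below is a minimal PyTorch basic block that learns F(x) and returns F(x) + x through an identity shortcut. Downsampling shortcuts and the bottleneck variant are omitted, so this is an illustration rather than the exact implementation behind the checkpoints below.

import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Minimal residual block: the stacked layers learn F(x), the block outputs F(x) + x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)     # identity shortcut

block = BasicBlock(64)
print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])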

Results and models¶
The pre-trained models on ImageNet-21k are used to fine-tune on downstream tasks, and therefore don’t have evaluation results.
| Model | resolution | Params(M) | Flops(G) |
| --- | --- | --- | --- |
| ResNet-50-mill | 224x224 | 86.74 | 15.14 |
The “mill” suffix means the model uses the multi-label pre-trained weights from ImageNet-21K Pretraining for the Masses.
Cifar10¶
| Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) |
| --- | --- | --- | --- | --- |
| ResNet-18 | 11.17 | 0.56 | 94.82 | 99.87 |
| ResNet-34 | 21.28 | 1.16 | 95.34 | 99.87 |
| ResNet-50 | 23.52 | 1.31 | 95.55 | 99.91 |
| ResNet-101 | 42.51 | 2.52 | 95.58 | 99.87 |
| ResNet-152 | 58.16 | 3.74 | 95.76 | 99.89 |
Cifar100¶
| Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) |
| --- | --- | --- | --- | --- |
| ResNet-50 | 23.71 | 1.31 | 79.90 | 95.19 |
ImageNet-1k¶
| Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) |
| --- | --- | --- | --- | --- |
| ResNet-18 | 11.69 | 1.82 | 69.90 | 89.43 |
| ResNet-34 | 21.8 | 3.68 | 73.62 | 91.59 |
| ResNet-50 | 25.56 | 4.12 | 76.55 | 93.06 |
| ResNet-101 | 44.55 | 7.85 | 77.97 | 94.06 |
| ResNet-152 | 60.19 | 11.58 | 78.48 | 94.13 |
| ResNetV1C-50 | 25.58 | 4.36 | 77.01 | 93.58 |
| ResNetV1C-101 | 44.57 | 8.09 | 78.30 | 94.27 |
| ResNetV1C-152 | 60.21 | 11.82 | 78.76 | 94.41 |
| ResNetV1D-50 | 25.58 | 4.36 | 77.54 | 93.57 |
| ResNetV1D-101 | 44.57 | 8.09 | 78.93 | 94.48 |
| ResNetV1D-152 | 60.21 | 11.82 | 79.41 | 94.70 |
| ResNet-50 (fp16) | 25.56 | 4.12 | 76.30 | 93.07 |
| Wide-ResNet-50* | 68.88 | 11.44 | 78.48 | 94.08 |
| Wide-ResNet-101* | 126.89 | 22.81 | 78.84 | 94.28 |
| ResNet-50 (rsb-a1) | 25.56 | 4.12 | 80.12 | 94.78 |
| ResNet-50 (rsb-a2) | 25.56 | 4.12 | 79.55 | 94.37 |
| ResNet-50 (rsb-a3) | 25.56 | 4.12 | 78.30 | 93.80 |
The “rsb” means using the training settings from ResNet strikes back: An improved training procedure in timm.
Models with * are converted from the official repo. The config files of these models are only for validation. We don’t ensure these config files’ training accuracy and welcome you to contribute your reproduction results.
CUB-200-2011¶
Model | Pretrain | resolution | Params(M) | Flops(G) | Top-1 (%) | Config | Download |
---|---|---|---|---|---|---|---|
ResNet-50 | | 448x448 | 23.92 | 16.48 | 88.45 | | |
Stanford-Cars¶
Model | Pretrain | resolution | Params(M) | Flops(G) | Top-1 (%) | Config | Download |
---|---|---|---|---|---|---|---|
ResNet-50 | | 448x448 | 23.92 | 16.48 | 92.82 | | |
Citation¶
@inproceedings{he2016deep,
title={Deep residual learning for image recognition},
author={He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian},
booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition},
pages={770--778},
year={2016}
}
ResNeXt¶
Aggregated Residual Transformations for Deep Neural Networks
Abstract¶
We present a simple, highly modularized network architecture for image classification. Our network is constructed by repeating a building block that aggregates a set of transformations with the same topology. Our simple design results in a homogeneous, multi-branch architecture that has only a few hyper-parameters to set. This strategy exposes a new dimension, which we call “cardinality” (the size of the set of transformations), as an essential factor in addition to the dimensions of depth and width. On the ImageNet-1K dataset, we empirically show that even under the restricted condition of maintaining complexity, increasing cardinality is able to improve classification accuracy. Moreover, increasing cardinality is more effective than going deeper or wider when we increase the capacity. Our models, named ResNeXt, are the foundations of our entry to the ILSVRC 2016 classification task in which we secured 2nd place. We further investigate ResNeXt on an ImageNet-5K set and the COCO detection set, also showing better results than its ResNet counterpart. The code and models are publicly available online.

Results and models¶
ImageNet-1k¶
Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
---|---|---|---|---|---|---|
ResNeXt-32x4d-50 | 25.03 | 4.27 | 77.90 | 93.66 | | |
ResNeXt-32x4d-101 | 44.18 | 8.03 | 78.61 | 94.17 | | |
ResNeXt-32x8d-101 | 88.79 | 16.5 | 79.27 | 94.58 | | |
ResNeXt-32x4d-152 | 59.95 | 11.8 | 78.88 | 94.33 | | |
Citation¶
@inproceedings{xie2017aggregated,
title={Aggregated residual transformations for deep neural networks},
author={Xie, Saining and Girshick, Ross and Doll{\'a}r, Piotr and Tu, Zhuowen and He, Kaiming},
booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition},
pages={1492--1500},
year={2017}
}
SE-ResNet¶
Squeeze-and-Excitation Networks
Abstract¶
The central building block of convolutional neural networks (CNNs) is the convolution operator, which enables networks to construct informative features by fusing both spatial and channel-wise information within local receptive fields at each layer. A broad range of prior research has investigated the spatial component of this relationship, seeking to strengthen the representational power of a CNN by enhancing the quality of spatial encodings throughout its feature hierarchy. In this work, we focus instead on the channel relationship and propose a novel architectural unit, which we term the “Squeeze-and-Excitation” (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels. We show that these blocks can be stacked together to form SENet architectures that generalise extremely effectively across different datasets. We further demonstrate that SE blocks bring significant improvements in performance for existing state-of-the-art CNNs at slight additional computational cost. Squeeze-and-Excitation Networks formed the foundation of our ILSVRC 2017 classification submission which won first place and reduced the top-5 error to 2.251%, surpassing the winning entry of 2016 by a relative improvement of ~25%.

Results and models¶
ImageNet-1k¶
Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
---|---|---|---|---|---|---|
SE-ResNet-50 | 28.09 | 4.13 | 77.74 | 93.84 | | |
SE-ResNet-101 | 49.33 | 7.86 | 78.26 | 94.07 | | |
Citation¶
@inproceedings{hu2018squeeze,
title={Squeeze-and-excitation networks},
author={Hu, Jie and Shen, Li and Sun, Gang},
booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition},
pages={7132--7141},
year={2018}
}
ShuffleNet V1¶
ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices
Abstract¶
We introduce an extremely computation-efficient CNN architecture named ShuffleNet, which is designed specially for mobile devices with very limited computing power (e.g., 10-150 MFLOPs). The new architecture utilizes two new operations, pointwise group convolution and channel shuffle, to greatly reduce computation cost while maintaining accuracy. Experiments on ImageNet classification and MS COCO object detection demonstrate the superior performance of ShuffleNet over other structures, e.g. lower top-1 error (absolute 7.8%) than recent MobileNet on ImageNet classification task, under the computation budget of 40 MFLOPs. On an ARM-based mobile device, ShuffleNet achieves ~13x actual speedup over AlexNet while maintaining comparable accuracy.

Results and models¶
ImageNet-1k¶
Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
---|---|---|---|---|---|---|
ShuffleNetV1 1.0x (group=3) | 1.87 | 0.146 | 68.13 | 87.81 | | |
Citation¶
@inproceedings{zhang2018shufflenet,
title={Shufflenet: An extremely efficient convolutional neural network for mobile devices},
author={Zhang, Xiangyu and Zhou, Xinyu and Lin, Mengxiao and Sun, Jian},
booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition},
pages={6848--6856},
year={2018}
}
ShuffleNet V2¶
Shufflenet v2: Practical guidelines for efficient cnn architecture design
Abstract¶
Currently, the neural network architecture design is mostly guided by the indirect metric of computation complexity, i.e., FLOPs. However, the direct metric, e.g., speed, also depends on the other factors such as memory access cost and platform characterics. Thus, this work proposes to evaluate the direct metric on the target platform, beyond only considering FLOPs. Based on a series of controlled experiments, this work derives several practical guidelines for efficient network design. Accordingly, a new architecture is presented, called ShuffleNet V2. Comprehensive ablation experiments verify that our model is the state-of-the-art in terms of speed and accuracy tradeoff.

Results and models¶
ImageNet-1k¶
Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
---|---|---|---|---|---|---|
ShuffleNetV2 1.0x | 2.28 | 0.149 | 69.55 | 88.92 | | |
Citation¶
@inproceedings{ma2018shufflenet,
title={Shufflenet v2: Practical guidelines for efficient cnn architecture design},
author={Ma, Ningning and Zhang, Xiangyu and Zheng, Hai-Tao and Sun, Jian},
booktitle={Proceedings of the European conference on computer vision (ECCV)},
pages={116--131},
year={2018}
}
Swin Transformer¶
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Abstract¶
This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with Shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures.

Results and models¶
ImageNet-21k¶
The pre-trained models on ImageNet-21k are only used for fine-tuning, and therefore don't have evaluation results.
Model | resolution | Params(M) | Flops(G) | Download |
---|---|---|---|---|
Swin-B | 224x224 | 86.74 | 15.14 | |
Swin-B | 384x384 | 86.88 | 44.49 | |
Swin-L | 224x224 | 195.00 | 34.04 | |
Swin-L | 384x384 | 195.20 | 100.04 | |
ImageNet-1k¶
Model | Pretrain | resolution | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
---|---|---|---|---|---|---|---|---|
Swin-T | From scratch | 224x224 | 28.29 | 4.36 | 81.18 | 95.61 | | |
Swin-S | From scratch | 224x224 | 49.61 | 8.52 | 83.02 | 96.29 | | |
Swin-B | From scratch | 224x224 | 87.77 | 15.14 | 83.36 | 96.44 | | |
Swin-S* | From scratch | 224x224 | 49.61 | 8.52 | 83.21 | 96.25 | | |
Swin-B* | From scratch | 224x224 | 87.77 | 15.14 | 83.42 | 96.44 | | |
Swin-B* | From scratch | 384x384 | 87.90 | 44.49 | 84.49 | 96.95 | | |
Swin-B* | ImageNet-21k | 224x224 | 87.77 | 15.14 | 85.16 | 97.50 | | |
Swin-B* | ImageNet-21k | 384x384 | 87.90 | 44.49 | 86.44 | 98.05 | | |
Swin-L* | ImageNet-21k | 224x224 | 196.53 | 34.04 | 86.24 | 97.88 | | |
Swin-L* | ImageNet-21k | 384x384 | 196.74 | 100.04 | 87.25 | 98.25 | | |
Models with * are converted from the official repo. The config files of these models are only for validation. We don’t ensure these config files’ training accuracy and welcome you to contribute your reproduction results.
CUB-200-2011¶
Model | Pretrain | resolution | Params(M) | Flops(G) | Top-1 (%) | Config | Download |
---|---|---|---|---|---|---|---|
Swin-L | | 384x384 | 195.51 | 100.04 | 91.87 | | |
Citation¶
@article{liu2021Swin,
title={Swin Transformer: Hierarchical Vision Transformer using Shifted Windows},
author={Liu, Ze and Lin, Yutong and Cao, Yue and Hu, Han and Wei, Yixuan and Zhang, Zheng and Lin, Stephen and Guo, Baining},
journal={arXiv preprint arXiv:2103.14030},
year={2021}
}
Swin Transformer V2¶
Swin Transformer V2: Scaling Up Capacity and Resolution
Abstract¶
Large-scale NLP models have been shown to significantly improve the performance on language tasks with no signs of saturation. They also demonstrate amazing few-shot capabilities like that of human beings. This paper aims to explore large-scale models in computer vision. We tackle three major issues in training and application of large vision models, including training instability, resolution gaps between pre-training and fine-tuning, and hunger on labelled data. Three main techniques are proposed: 1) a residual-post-norm method combined with cosine attention to improve training stability; 2) A log-spaced continuous position bias method to effectively transfer models pre-trained using low-resolution images to downstream tasks with high-resolution inputs; 3) A self-supervised pre-training method, SimMIM, to reduce the needs of vast labeled images. Through these techniques, this paper successfully trained a 3 billion-parameter Swin Transformer V2 model, which is the largest dense vision model to date, and makes it capable of training with images of up to 1,536×1,536 resolution. It set new performance records on 4 representative vision tasks, including ImageNet-V2 image classification, COCO object detection, ADE20K semantic segmentation, and Kinetics-400 video action classification. Also note our training is much more efficient than that in Google’s billion-level visual models, which consumes 40 times less labelled data and 40 times less training time.

Results and models¶
ImageNet-21k¶
The pre-trained models on ImageNet-21k are only used for fine-tuning, and therefore don't have evaluation results.
Model | resolution | Params(M) | Flops(G) | Download |
---|---|---|---|---|
Swin-B* | 192x192 | 87.92 | 8.51 | |
Swin-L* | 192x192 | 196.74 | 19.04 | |
ImageNet-1k¶
Model | Pretrain | resolution | window | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
---|---|---|---|---|---|---|---|---|---|
Swin-T* | From scratch | 256x256 | 8x8 | 28.35 | 4.35 | 81.76 | 95.87 | | |
Swin-T* | From scratch | 256x256 | 16x16 | 28.35 | 4.4 | 82.81 | 96.23 | | |
Swin-S* | From scratch | 256x256 | 8x8 | 49.73 | 8.45 | 83.74 | 96.6 | | |
Swin-S* | From scratch | 256x256 | 16x16 | 49.73 | 8.57 | 84.13 | 96.83 | | |
Swin-B* | From scratch | 256x256 | 8x8 | 87.92 | 14.99 | 84.2 | 96.86 | | |
Swin-B* | From scratch | 256x256 | 16x16 | 87.92 | 15.14 | 84.6 | 97.05 | | |
Swin-B* | ImageNet-21k | 256x256 | 16x16 | 87.92 | 15.14 | 86.17 | 97.88 | | |
Swin-B* | ImageNet-21k | 384x384 | 24x24 | 87.92 | 34.07 | 87.14 | 98.23 | | |
Swin-L* | ImageNet-21k | 256x256 | 16x16 | 196.75 | 33.86 | 86.93 | 98.06 | | |
Swin-L* | ImageNet-21k | 384x384 | 24x24 | 196.75 | 76.2 | 87.59 | 98.27 | | |
Models with * are converted from the official repo. The config files of these models are only for validation. We don’t ensure these config files’ training accuracy and welcome you to contribute your reproduction results.
The ImageNet-21k pre-trained models with input resolutions of 256x256 and 384x384 are both fine-tuned from the same pre-trained model, which uses a smaller input resolution of 192x192.
Citation¶
@article{https://doi.org/10.48550/arxiv.2111.09883,
doi = {10.48550/ARXIV.2111.09883},
url = {https://arxiv.org/abs/2111.09883},
author = {Liu, Ze and Hu, Han and Lin, Yutong and Yao, Zhuliang and Xie, Zhenda and Wei, Yixuan and Ning, Jia and Cao, Yue and Zhang, Zheng and Dong, Li and Wei, Furu and Guo, Baining},
keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences, FOS: Computer and information sciences},
title = {Swin Transformer V2: Scaling Up Capacity and Resolution},
publisher = {arXiv},
year = {2021},
copyright = {Creative Commons Attribution 4.0 International}
}
Tokens-to-Token ViT¶
Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet
Abstract¶
Transformers, which are popular for language modeling, have been explored for solving vision tasks recently, e.g., the Vision Transformer (ViT) for image classification. The ViT model splits each image into a sequence of tokens with fixed length and then applies multiple Transformer layers to model their global relation for classification. However, ViT achieves inferior performance to CNNs when trained from scratch on a midsize dataset like ImageNet. We find it is because: 1) the simple tokenization of input images fails to model the important local structure such as edges and lines among neighboring pixels, leading to low training sample efficiency; 2) the redundant attention backbone design of ViT leads to limited feature richness for fixed computation budgets and limited training samples. To overcome such limitations, we propose a new Tokens-To-Token Vision Transformer (T2T-ViT), which incorporates 1) a layer-wise Tokens-to-Token (T2T) transformation to progressively structurize the image to tokens by recursively aggregating neighboring Tokens into one Token (Tokens-to-Token), such that local structure represented by surrounding tokens can be modeled and tokens length can be reduced; 2) an efficient backbone with a deep-narrow structure for vision transformer motivated by CNN architecture design after empirical study. Notably, T2T-ViT reduces the parameter count and MACs of vanilla ViT by half, while achieving more than 3.0% improvement when trained from scratch on ImageNet. It also outperforms ResNets and achieves comparable performance with MobileNets by directly training on ImageNet. For example, T2T-ViT with comparable size to ResNet50 (21.5M parameters) can achieve 83.3% top1 accuracy in image resolution 384×384 on ImageNet.

Results and models¶
ImageNet-1k¶
Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
---|---|---|---|---|---|---|
T2T-ViT_t-14 | 21.47 | 4.34 | 81.83 | 95.84 | | |
T2T-ViT_t-19 | 39.08 | 7.80 | 82.63 | 96.18 | | |
T2T-ViT_t-24 | 64.00 | 12.69 | 82.71 | 96.09 | | |
Consistent with the official repo, we adopt the best checkpoints during training.
Citation¶
@article{yuan2021tokens,
title={Tokens-to-token vit: Training vision transformers from scratch on imagenet},
author={Yuan, Li and Chen, Yunpeng and Wang, Tao and Yu, Weihao and Shi, Yujun and Tay, Francis EH and Feng, Jiashi and Yan, Shuicheng},
journal={arXiv preprint arXiv:2101.11986},
year={2021}
}
TNT¶
Abstract¶
Transformer is a new kind of neural architecture which encodes the input data as powerful features via the attention mechanism. Basically, the visual transformers first divide the input images into several local patches and then calculate both representations and their relationship. Since natural images are of high complexity with abundant detail and color information, the granularity of the patch dividing is not fine enough for excavating features of objects in different scales and locations. In this paper, we point out that the attention inside these local patches are also essential for building visual transformers with high performance and we explore a new architecture, namely, Transformer iN Transformer (TNT). Specifically, we regard the local patches (e.g., 16×16) as “visual sentences” and present to further divide them into smaller patches (e.g., 4×4) as “visual words”. The attention of each word will be calculated with other words in the given visual sentence with negligible computational costs. Features of both words and sentences will be aggregated to enhance the representation ability. Experiments on several benchmarks demonstrate the effectiveness of the proposed TNT architecture, e.g., we achieve an 81.5% top-1 accuracy on the ImageNet, which is about 1.7% higher than that of the state-of-the-art visual transformer with similar computational cost.

Results and models¶
ImageNet-1k¶
Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
---|---|---|---|---|---|---|
TNT-small* | 23.76 | 3.36 | 81.52 | 95.73 | | |
Models with * are converted from timm. The config files of these models are only for validation. We don’t ensure these config files’ training accuracy and welcome you to contribute your reproduction results.
Citation¶
@misc{han2021transformer,
title={Transformer in Transformer},
author={Kai Han and An Xiao and Enhua Wu and Jianyuan Guo and Chunjing Xu and Yunhe Wang},
year={2021},
eprint={2103.00112},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
Twins¶
Twins: Revisiting the Design of Spatial Attention in Vision Transformers
Abstract¶
Very recently, a variety of vision transformer architectures for dense prediction tasks have been proposed and they show that the design of spatial attention is critical to their success in these tasks. In this work, we revisit the design of the spatial attention and demonstrate that a carefully-devised yet simple spatial attention mechanism performs favourably against the state-of-the-art schemes. As a result, we propose two vision transformer architectures, namely, Twins-PCPVT and Twins-SVT. Our proposed architectures are highly-efficient and easy to implement, only involving matrix multiplications that are highly optimized in modern deep learning frameworks. More importantly, the proposed architectures achieve excellent performance on a wide range of visual tasks, including image level classification as well as dense detection and segmentation. The simplicity and strong performance suggest that our proposed architectures may serve as stronger backbones for many vision tasks. Our code is released at this https URL.

Results and models¶
ImageNet-1k¶
Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
---|---|---|---|---|---|---|
PCPVT-small* | 24.11 | 3.67 | 81.14 | 95.69 | | |
PCPVT-base* | 43.83 | 6.45 | 82.66 | 96.26 | | |
PCPVT-large* | 60.99 | 9.51 | 83.09 | 96.59 | | |
SVT-small* | 24.06 | 2.82 | 81.77 | 95.57 | | |
SVT-base* | 56.07 | 8.35 | 83.13 | 96.29 | | |
SVT-large* | 99.27 | 14.82 | 83.60 | 96.50 | | |
Models with * are converted from the official repo. The config files of these models are only for validation. We don't ensure these config files' training accuracy and welcome you to contribute your reproduction results. The validation accuracy is a little different from the official paper because of the PyTorch version: this result is obtained with PyTorch 1.9, while the official result is obtained with PyTorch 1.7.
Citation¶
@article{chu2021twins,
title={Twins: Revisiting spatial attention design in vision transformers},
author={Chu, Xiangxiang and Tian, Zhi and Wang, Yuqing and Zhang, Bo and Ren, Haibing and Wei, Xiaolin and Xia, Huaxia and Shen, Chunhua},
journal={arXiv preprint arXiv:2104.13840},
year={2021}
}
Visual Attention Network¶
Abstract¶
While originally designed for natural language processing (NLP) tasks, the self-attention mechanism has recently taken various computer vision areas by storm. However, the 2D nature of images brings three challenges for applying self-attention in computer vision. (1) Treating images as 1D sequences neglects their 2D structures. (2) The quadratic complexity is too expensive for high-resolution images. (3) It only captures spatial adaptability but ignores channel adaptability. In this paper, we propose a novel large kernel attention (LKA) module to enable self-adaptive and long-range correlations in self-attention while avoiding the above issues. We further introduce a novel neural network based on LKA, namely Visual Attention Network (VAN). While extremely simple and efficient, VAN outperforms the state-of-the-art vision transformers and convolutional neural networks with a large margin in extensive experiments, including image classification, object detection, semantic segmentation, instance segmentation, etc.

Results and models¶
ImageNet-1k¶
Model | Pretrain | resolution | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
---|---|---|---|---|---|---|---|---|
VAN-B0* | From scratch | 224x224 | 4.11 | 0.88 | 75.41 | 93.02 | | |
VAN-B1* | From scratch | 224x224 | 13.86 | 2.52 | 81.01 | 95.63 | | |
VAN-B2* | From scratch | 224x224 | 26.58 | 5.03 | 82.80 | 96.21 | | |
VAN-B3* | From scratch | 224x224 | 44.77 | 8.99 | 83.86 | 96.73 | | |
VAN-B4* | From scratch | 224x224 | 60.28 | 12.22 | 84.13 | 96.86 | | |
Models with * are converted from the official repo. The config files of these models are only for validation. We don't ensure these config files' training accuracy and welcome you to contribute your reproduction results.
Pre-trained Models¶
The pre-trained models on ImageNet-21k are used to fine-tune on the downstream tasks.
Model | Pretrain | resolution | Params(M) | Flops(G) | Download |
---|---|---|---|---|---|
VAN-B4* | ImageNet-21k | 224x224 | 60.28 | 12.22 | |
VAN-B5* | ImageNet-21k | 224x224 | 89.97 | 17.21 | |
VAN-B6* | ImageNet-21k | 224x224 | 283.9 | 55.28 | |
Models with * are converted from the official repo.
Citation¶
@article{guo2022visual,
title={Visual Attention Network},
author={Guo, Meng-Hao and Lu, Cheng-Ze and Liu, Zheng-Ning and Cheng, Ming-Ming and Hu, Shi-Min},
journal={arXiv preprint arXiv:2202.09741},
year={2022}
}
VGG¶
Very Deep Convolutional Networks for Large-Scale Image Recognition
Abstract¶
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.

Results and models¶
ImageNet-1k¶
Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
---|---|---|---|---|---|---|
VGG-11 | 132.86 | 7.63 | 68.75 | 88.87 | | |
VGG-13 | 133.05 | 11.34 | 70.02 | 89.46 | | |
VGG-16 | 138.36 | 15.5 | 71.62 | 90.49 | | |
VGG-19 | 143.67 | 19.67 | 72.41 | 90.80 | | |
VGG-11-BN | 132.87 | 7.64 | 70.67 | 90.16 | | |
VGG-13-BN | 133.05 | 11.36 | 72.12 | 90.66 | | |
VGG-16-BN | 138.37 | 15.53 | 73.74 | 91.66 | | |
VGG-19-BN | 143.68 | 19.7 | 74.68 | 92.27 | | |
Citation¶
@article{simonyan2014very,
title={Very deep convolutional networks for large-scale image recognition},
author={Simonyan, Karen and Zisserman, Andrew},
journal={arXiv preprint arXiv:1409.1556},
year={2014}
}
Vision Transformer¶
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Abstract¶
While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.

Results and models¶
The training of Vision Transformers is divided into two steps. The first step is to pre-train the model on a large dataset, like ImageNet-21k, to get the pre-trained model. The second step is to fine-tune the model on the target dataset, like ImageNet-1k, to get the fine-tuned model. Here, we provide both pre-trained models and fine-tuned models.
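As a rough illustration of the fine-tuning step, the sketch below shows how a config could initialize the backbone from an ImageNet-21k pre-trained checkpoint before training on ImageNet-1k. The checkpoint path and class count are placeholders for illustration, not an official recipe; the fine-tuning configs provided in the repo already handle this.
# A minimal sketch (assumes it extends a base ViT config); the checkpoint path is a placeholder.
model = dict(
    backbone=dict(
        init_cfg=dict(
            type='Pretrained',
            checkpoint='path/to/vit-base-p16_in21k-pretrained.pth',  # placeholder path
            prefix='backbone')),
    # Fine-tune for the 1000 ImageNet-1k classes.
    head=dict(num_classes=1000),
)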
ImageNet-21k¶
The pre-trained models on ImageNet-21k are only used for fine-tuning, and therefore don't have evaluation results.
Model | resolution | Params(M) | Flops(G) | Download |
---|---|---|---|---|
ViT-B16* | 224x224 | 86.86 | 33.03 | |
ViT-B32* | 224x224 | 88.30 | 8.56 | |
ViT-L16* | 224x224 | 304.72 | 116.68 | |
Models with * are converted from the official repo.
ImageNet-1k¶
Model | Pretrain | resolution | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
---|---|---|---|---|---|---|---|---|
ViT-B16* | ImageNet-21k | 384x384 | 86.86 | 33.03 | 85.43 | 97.77 | | |
ViT-B32* | ImageNet-21k | 384x384 | 88.30 | 8.56 | 84.01 | 97.08 | | |
ViT-L16* | ImageNet-21k | 384x384 | 304.72 | 116.68 | 85.63 | 97.63 | | |
ViT-B16 (IPU) | ImageNet-21k | 224x224 | 86.86 | 33.03 | 81.22 | 95.56 | | |
Models with * are converted from the official repo. The config files of these models are only for validation. We don’t ensure these config files’ training accuracy and welcome you to contribute your reproduction results.
Citation¶
@inproceedings{
dosovitskiy2021an,
title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
author={Alexey Dosovitskiy and Lucas Beyer and Alexander Kolesnikov and Dirk Weissenborn and Xiaohua Zhai and Thomas Unterthiner and Mostafa Dehghani and Matthias Minderer and Georg Heigold and Sylvain Gelly and Jakob Uszkoreit and Neil Houlsby},
booktitle={International Conference on Learning Representations},
year={2021},
url={https://openreview.net/forum?id=YicbFdNTTy}
}
Wide-ResNet¶
Abstract¶
Deep residual networks were shown to be able to scale up to thousands of layers and still have improving performance. However, each fraction of a percent of improved accuracy costs nearly doubling the number of layers, and so training very deep residual networks has a problem of diminishing feature reuse, which makes these networks very slow to train. To tackle these problems, in this paper we conduct a detailed experimental study on the architecture of ResNet blocks, based on which we propose a novel architecture where we decrease depth and increase width of residual networks. We call the resulting network structures wide residual networks (WRNs) and show that these are far superior over their commonly used thin and very deep counterparts. For example, we demonstrate that even a simple 16-layer-deep wide residual network outperforms in accuracy and efficiency all previous deep residual networks, including thousand-layer-deep networks, achieving new state-of-the-art results on CIFAR, SVHN, COCO, and significant improvements on ImageNet.

Results and models¶
ImageNet-1k¶
Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
---|---|---|---|---|---|---|
WRN-50* | 68.88 | 11.44 | 78.48 | 94.08 | | |
WRN-101* | 126.89 | 22.81 | 78.84 | 94.28 | | |
WRN-50 (timm)* | 68.88 | 11.44 | 81.45 | 95.53 | | |
Models with * are converted from TorchVision and timm. The config files of these models are only for inference. We don't ensure these config files' training accuracy and welcome you to contribute your reproduction results.
Citation¶
@INPROCEEDINGS{Zagoruyko2016WRN,
author = {Sergey Zagoruyko and Nikos Komodakis},
title = {Wide Residual Networks},
booktitle = {BMVC},
year = {2016}}
PyTorch 转 ONNX (试验性的)¶
如何将模型从 PyTorch 转换到 ONNX¶
准备工作¶
请参照 安装指南 从源码安装 MMClassification。
安装 onnx 和 onnxruntime。
pip install onnx onnxruntime==1.5.1
使用方法¶
python tools/deployment/pytorch2onnx.py \
${CONFIG_FILE} \
--checkpoint ${CHECKPOINT_FILE} \
--output-file ${OUTPUT_FILE} \
--shape ${IMAGE_SHAPE} \
--opset-version ${OPSET_VERSION} \
--dynamic-shape \
--show \
--simplify \
--verify \
所有参数的说明:
- config:模型配置文件的路径。
- --checkpoint:模型权重文件的路径。
- --output-file:ONNX 模型的输出路径。如果没有指定,默认为当前脚本执行路径下的 tmp.onnx。
- --shape:模型输入的高度和宽度。如果没有指定,默认为 224 224。
- --opset-version:ONNX 的 opset 版本。如果没有指定,默认为 11。
- --dynamic-shape:是否以动态输入尺寸导出 ONNX。如果没有指定,默认为 False。
- --show:是否打印导出模型的架构。如果没有指定,默认为 False。
- --simplify:是否精简导出的 ONNX 模型。如果没有指定,默认为 False。
- --verify:是否验证导出模型的正确性。如果没有指定,默认为 False。
示例:
python tools/deployment/pytorch2onnx.py \
configs/resnet/resnet18_8xb16_cifar10.py \
--checkpoint checkpoints/resnet/resnet18_b16x8_cifar10.pth \
--output-file checkpoints/resnet/resnet18_b16x8_cifar10.onnx \
--dynamic-shape \
--show \
--simplify \
--verify \
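导出完成后,也可以在 Python 中用 onnxruntime 简单检查模型能否正常推理。下面是一个简单示意(文件路径沿用上面的示例,输入为随机张量;这里假设输入尺寸为默认的 224 224,若导出时指定了其他 --shape,请相应修改):
import numpy as np
import onnxruntime as ort

# 加载上面导出的 ONNX 模型(路径为示例)
sess = ort.InferenceSession('checkpoints/resnet/resnet18_b16x8_cifar10.onnx')

# 构造一个随机输入,形状需与导出时的 --shape 一致
input_name = sess.get_inputs()[0].name
dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)

# 运行推理并打印输出形状
outputs = sess.run(None, {input_name: dummy})
print(outputs[0].shape)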
支持导出至 ONNX 的模型列表¶
下表列出了保证可导出至 ONNX,并在 ONNX Runtime 中运行的模型。
模型 | 配置文件 | 批推理 | 动态输入尺寸 | 备注 |
---|---|---|---|---|
MobileNetV2 | | Y | Y | |
ResNet | | Y | Y | |
ResNeXt | | Y | Y | |
SE-ResNet | | Y | Y | |
ShuffleNetV1 | | Y | Y | |
ShuffleNetV2 | | Y | Y | |
注:
以上所有模型转换测试基于 Pytorch==1.6.0 进行
提示¶
如果你在上述模型的转换中遇到问题,请在 GitHub 中创建一个 issue,我们会尽快处理。未在上表中列出的模型,由于资源限制,我们可能无法提供很多帮助,如果遇到问题,请尝试自行解决。
常见问题¶
无
ONNX 转 TensorRT(试验性的)¶
如何将模型从 ONNX 转换到 TensorRT¶
准备工作¶
请参照 安装指南 从源码安装 MMClassification。
使用我们的工具 pytorch2onnx.md 将 PyTorch 模型转换至 ONNX。
使用方法¶
python tools/deployment/onnx2tensorrt.py \
${MODEL} \
--trt-file ${TRT_FILE} \
--shape ${IMAGE_SHAPE} \
--workspace-size {WORKSPACE_SIZE} \
--show \
--verify \
所有参数的说明:
- model:ONNX 模型的路径。
- --trt-file:TensorRT 引擎文件的输出路径。如果没有指定,默认为当前脚本执行路径下的 tmp.trt。
- --shape:模型输入的高度和宽度。如果没有指定,默认为 224 224。
- --workspace-size:构建 TensorRT 引擎所需要的 GPU 空间大小,单位为 GiB。如果没有指定,默认为 1 GiB。
- --show:是否展示模型的输出。如果没有指定,默认为 False。
- --verify:是否使用 ONNXRuntime 和 TensorRT 验证模型转换的正确性。如果没有指定,默认为 False。
示例:
python tools/deployment/onnx2tensorrt.py \
checkpoints/resnet/resnet18_b16x8_cifar10.onnx \
--trt-file checkpoints/resnet/resnet18_b16x8_cifar10.trt \
--shape 224 224 \
--show \
--verify \
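转换完成后,可以在 Python 中反序列化生成的引擎文件,简单确认它能被 TensorRT 正常加载。下面是一个简单示意(路径沿用上面的示例,假设已按官方文档安装 TensorRT 的 Python 包):
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)

# 读取并反序列化上面生成的 TensorRT 引擎(路径为示例)
with open('checkpoints/resnet/resnet18_b16x8_cifar10.trt', 'rb') as f, \
        trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

# 打印各个绑定(输入/输出)的名称与形状
for i in range(engine.num_bindings):
    print(engine.get_binding_name(i), engine.get_binding_shape(i))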
支持转换至 TensorRT 的模型列表¶
下表列出了保证可转换为 TensorRT 的模型。
模型 | 配置文件 | 状态 |
---|---|---|
MobileNetV2 | | Y |
ResNet | | Y |
ResNeXt | | Y |
ShuffleNetV1 | | Y |
ShuffleNetV2 | | Y |
注:
以上所有模型转换测试基于 Pytorch==1.6.0 和 TensorRT-7.2.1.6.Ubuntu-16.04.x86_64-gnu.cuda-10.2.cudnn8.0 进行
提示¶
如果你在上述模型的转换中遇到问题,请在 GitHub 中创建一个 issue,我们会尽快处理。未在上表中列出的模型,由于资源限制,我们可能无法提供很多帮助,如果遇到问题,请尝试自行解决。
常见问题¶
无
PyTorch 转 TorchScript (试验性的)¶
如何将 PyTorch 模型转换至 TorchScript¶
使用方法¶
python tools/deployment/pytorch2torchscript.py \
${CONFIG_FILE} \
--checkpoint ${CHECKPOINT_FILE} \
--output-file ${OUTPUT_FILE} \
--shape ${IMAGE_SHAPE} \
--verify \
所有参数的说明:
- config:模型配置文件的路径。
- --checkpoint:模型权重文件的路径。
- --output-file:TorchScript 模型的输出路径。如果没有指定,默认为当前脚本执行路径下的 tmp.pt。
- --shape:模型输入的高度和宽度。如果没有指定,默认为 224 224。
- --verify:是否验证导出模型的正确性。如果没有指定,默认为 False。
示例:
python tools/deployment/pytorch2torchscript.py \
configs/resnet/resnet18_8xb16_cifar10.py \
--checkpoint checkpoints/resnet/resnet18_b16x8_cifar10.pth \
--output-file checkpoints/resnet/resnet18_b16x8_cifar10.pt \
--verify \
注:
所有模型基于 Pytorch==1.8.1 通过了转换测试
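导出完成后,可以直接用 torch.jit.load 加载生成的 TorchScript 模型并做一次前向推理。下面是一个简单示意(路径沿用上面的示例,输入为随机张量,输出的具体格式取决于导出时的包装方式):
import torch

# 加载上面导出的 TorchScript 模型(路径为示例)
ts_model = torch.jit.load('checkpoints/resnet/resnet18_b16x8_cifar10.pt', map_location='cpu')
ts_model.eval()

# 构造一个随机输入,尺寸需与导出时的 --shape 一致(这里假设为 224 224)
dummy = torch.rand(1, 3, 224, 224)

with torch.no_grad():
    out = ts_model(dummy)
print(out)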
提示¶
由于 torch.jit.is_tracing() 只在 PyTorch 1.6 之后的版本中得到支持,对于 PyTorch 1.3-1.5 的用户,我们建议手动提前返回结果。
如果你在本仓库的模型转换中遇到问题,请在 GitHub 中创建一个 issue,我们会尽快处理。
常见问题¶
无
模型部署至 TorchServe¶
为了使用 TorchServe 部署一个 MMClassification 模型,需要进行以下几步:
1. 转换 MMClassification 模型至 TorchServe¶
python tools/deployment/mmcls2torchserve.py ${CONFIG_FILE} ${CHECKPOINT_FILE} \
--output-folder ${MODEL_STORE} \
--model-name ${MODEL_NAME}
备注
${MODEL_STORE} 需要是一个文件夹的绝对路径。
示例:
python tools/deployment/mmcls2torchserve.py \
configs/resnet/resnet18_8xb32_in1k.py \
checkpoints/resnet18_8xb32_in1k_20210831-fbbb1da6.pth \
--output-folder ./checkpoints \
--model-name resnet18_in1k
2. 构建 mmcls-serve docker 镜像¶
docker build -t mmcls-serve:latest docker/serve/
3. 运行 mmcls-serve 镜像¶
请参考官方文档 基于 docker 运行 TorchServe.
为了使镜像能够使用 GPU 资源,需要安装 nvidia-docker。之后可以传递 --gpus 参数以在 GPU 上运行。
示例:
docker run --rm \
--cpus 8 \
--gpus device=0 \
-p8080:8080 -p8081:8081 -p8082:8082 \
--mount type=bind,source=`realpath ./checkpoints`,target=/home/model-server/model-store \
mmcls-serve:latest
备注
realpath ./checkpoints 是 "./checkpoints" 的绝对路径,你可以将其替换为你保存 TorchServe 模型的目录的绝对路径。
参考 该文档 了解关于推理 (8080),管理 (8081) 和指标 (8082) 等 API 的信息。
4. 测试部署¶
curl http://127.0.0.1:8080/predictions/${MODEL_NAME} -T demo/demo.JPEG
您应该获得类似于以下内容的响应:
{
"pred_label": 58,
"pred_score": 0.38102269172668457,
"pred_class": "water snake"
}
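除了 curl,也可以用 Python 的 requests 库发送同样的请求。下面是一个简单示意(假设服务运行在本地 8080 端口,模型名为 resnet18_in1k):
import requests

# 读取示例图片并发送到 TorchServe 的推理接口
with open('demo/demo.JPEG', 'rb') as f:
    resp = requests.post('http://127.0.0.1:8080/predictions/resnet18_in1k', data=f)

# 打印返回的 JSON 结果,内容与上面 curl 的响应一致
print(resp.json())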
另外,你也可以使用 test_torchserver.py 来比较 TorchServe 和 PyTorch 的结果,并进行可视化。
python tools/deployment/test_torchserver.py ${IMAGE_FILE} ${CONFIG_FILE} ${CHECKPOINT_FILE} ${MODEL_NAME}
[--inference-addr ${INFERENCE_ADDR}] [--device ${DEVICE}]
示例:
python tools/deployment/test_torchserver.py \
demo/demo.JPEG \
configs/resnet/resnet18_8xb32_in1k.py \
checkpoints/resnet18_8xb32_in1k_20210831-fbbb1da6.pth \
resnet18_in1k
可视化¶
数据流水线可视化¶
python tools/visualizations/vis_pipeline.py \
${CONFIG_FILE} \
[--output-dir ${OUTPUT_DIR}] \
[--phase ${DATASET_PHASE}] \
[--number ${NUMBER_IMAGES_DISPLAY}] \
[--skip-type ${SKIP_TRANSFORM_TYPE}] \
[--mode ${DISPLAY_MODE}] \
[--show] \
[--adaptive] \
[--min-edge-length ${MIN_EDGE_LENGTH}] \
[--max-edge-length ${MAX_EDGE_LENGTH}] \
[--bgr2rgb] \
[--window-size ${WINDOW_SIZE}] \
[--cfg-options ${CFG_OPTIONS}]
所有参数的说明:
- config:模型配置文件的路径。
- --output-dir:保存图片的文件夹,如果没有指定,默认为 '',表示不保存。
- --phase:可视化数据集的阶段,只能为 [train, val, test] 之一,默认为 train。
- --number:可视化样本数量。如果没有指定,默认展示数据集的所有图片。
- --skip-type:预设跳过的数据流水线过程。如果没有指定,默认为 ['ToTensor', 'Normalize', 'ImageToTensor', 'Collect']。
- --mode:可视化的模式,只能为 [original, transformed, concat, pipeline] 之一,如果没有指定,默认为 concat。
- --show:将可视化图片以弹窗形式展示。
- --adaptive:自动调节可视化图片的大小。
- --min-edge-length:最短边长度,当使用了 --adaptive 时有效。当图片任意边小于 ${MIN_EDGE_LENGTH} 时,会保持长宽比不变放大图片,短边对齐至 ${MIN_EDGE_LENGTH},默认为 200。
- --max-edge-length:最长边长度,当使用了 --adaptive 时有效。当图片任意边大于 ${MAX_EDGE_LENGTH} 时,会保持长宽比不变缩小图片,长边对齐至 ${MAX_EDGE_LENGTH},默认为 1000。
- --bgr2rgb:将图片的颜色通道翻转。
- --window-size:可视化窗口大小,如果没有指定,默认为 12*7。如果需要指定,按照格式 'W*H'。
- --cfg-options:对配置文件的修改,参考教程 1:如何编写配置文件。
备注
如果不指定 --mode,默认设置为 concat,获取原始图片和预处理后图片拼接的图片;如果 --mode 设置为 original,则获取原始图片;如果 --mode 设置为 transformed,则获取预处理后的图片;如果 --mode 设置为 pipeline,则获得数据流水线所有中间过程图片。
当指定了 --adaptive 选项时,会自动调整尺寸过大和过小的图片,你可以通过设定 --min-edge-length 与 --max-edge-length 来指定自动调整的图片尺寸。
示例:
‘original’ 模式,可视化 CIFAR100 验证集中的 100 张原始图片,显示并保存在 ./tmp 文件夹下:
python ./tools/visualizations/vis_pipeline.py configs/resnet/resnet50_8xb16_cifar100.py --phase val --output-dir tmp --mode original --number 100 --show --adaptive --bgr2rgb

‘transformed’ 模式,可视化 ImageNet 训练集的所有经过预处理的图片,并以弹窗形式显示:
python ./tools/visualizations/vis_pipeline.py ./configs/resnet/resnet50_8xb32_in1k.py --show --mode transformed

‘concat’ 模式,可视化 ImageNet 训练集的 10 张原始图片与预处理后图片的对比图,保存在 ./tmp 文件夹下:
python ./tools/visualizations/vis_pipeline.py configs/swin_transformer/swin_base_224_b16x64_300e_imagenet.py --phase train --output-dir tmp --number 10 --adaptive
‘pipeline’ 模式,可视化 ImageNet 训练集经过数据流水线的过程图像:
python ./tools/visualizations/vis_pipeline.py configs/swin_transformer/swin_base_224_b16x64_300e_imagenet.py --phase train --adaptive --mode pipeline --show
学习率策略可视化¶
python tools/visualizations/vis_lr.py \
${CONFIG_FILE} \
[--dataset-size ${Dataset_Size}] \
[--ngpus ${NUM_GPUs}] \
[--save-path ${SAVE_PATH}] \
[--title ${TITLE}] \
[--style ${STYLE}] \
[--window-size ${WINDOW_SIZE}] \
[--cfg-options ${CFG_OPTIONS}] \
所有参数的说明:
- config:模型配置文件的路径。
- --dataset-size:数据集的大小。如果指定,build_dataset 将被跳过并使用这个大小作为数据集大小,默认使用 build_dataset 所得数据集的大小。
- --ngpus:使用 GPU 的数量。
- --save-path:保存的可视化图片的路径,默认不保存。
- --title:可视化图片的标题,默认为配置文件名。
- --style:可视化图片的风格,默认为 whitegrid。
- --window-size:可视化窗口大小,如果没有指定,默认为 12*7。如果需要指定,按照格式 'W*H'。
- --cfg-options:对配置文件的修改,参考教程 1:如何编写配置文件。
备注
部分数据集在解析标注阶段比较耗时,可直接通过 --dataset-size 指定数据集的大小,以节约时间。
示例:
python tools/visualizations/vis_lr.py configs/resnet/resnet50_b16x8_cifar100.py

当数据集为 ImageNet 时,通过直接指定数据集大小来节约时间,并保存图片:
python tools/visualizations/vis_lr.py configs/repvgg/repvgg-B3g4_4xb64-autoaug-lbs-mixup-coslr-200e_in1k.py --dataset-size 1281167 --ngpus 4 --save-path ./repvgg-B3g4_4xb64-lr.jpg

类别激活图可视化¶
MMClassification 提供 tools/visualizations/vis_cam.py 工具来可视化类别激活图。请使用 pip install "grad-cam>=1.3.6" 安装依赖的 pytorch-grad-cam。
目前支持的方法有:
Method | What it does |
---|---|
GradCAM | 使用平均梯度对 2D 激活进行加权 |
GradCAM++ | 类似 GradCAM,但使用了二阶梯度 |
XGradCAM | 类似 GradCAM,但通过归一化的激活对梯度进行了加权 |
EigenCAM | 使用 2D 激活的第一主成分(无法区分类别,但效果似乎不错) |
EigenGradCAM | 类似 EigenCAM,但支持类别区分,使用了激活 * 梯度的第一主成分,看起来和 GradCAM 差不多,但是更干净 |
LayerCAM | 使用正梯度对激活进行空间加权,对于浅层有更好的效果 |
命令行:
python tools/visualizations/vis_cam.py \
${IMG} \
${CONFIG_FILE} \
${CHECKPOINT} \
[--target-layers ${TARGET-LAYERS}] \
[--preview-model] \
[--method ${METHOD}] \
[--target-category ${TARGET-CATEGORY}] \
[--save-path ${SAVE_PATH}] \
[--vit-like] \
[--num-extra-tokens ${NUM-EXTRA-TOKENS}] \
[--aug-smooth] \
[--eigen-smooth] \
[--device ${DEVICE}] \
[--cfg-options ${CFG-OPTIONS}]
所有参数的说明:
- img:目标图片路径。
- config:模型配置文件的路径。
- checkpoint:权重路径。
- --target-layers:所查看的网络层名称,可输入一个或者多个网络层,如果不设置,将使用最后一个 block 中的 norm 层。
- --preview-model:是否查看模型所有网络层。
- --method:类别激活图可视化的方法,目前支持 GradCAM,GradCAM++,XGradCAM,EigenCAM,EigenGradCAM,LayerCAM,不区分大小写。如果不设置,默认为 GradCAM。
- --target-category:查看的目标类别,如果不设置,使用模型检测出来的类别作为目标类别。
- --save-path:保存的可视化图片的路径,默认不保存。
- --eigen-smooth:是否使用主成分降低噪音,默认不开启。
- --vit-like:是否为 ViT 类似的 Transformer-based 网络。
- --num-extra-tokens:ViT 类网络的额外的 tokens 通道数,默认使用主干网络的 num_extra_tokens。
- --aug-smooth:是否使用测试时增强。
- --device:使用的计算设备,如果不设置,默认为 'cpu'。
- --cfg-options:对配置文件的修改,参考教程 1:如何编写配置文件。
备注
在指定 --target-layers 时,如果不知道模型有哪些网络层,可在命令行添加 --preview-model 查看所有网络层名称。
示例(CNN):
--target-layers 在 Resnet-50 中的一些示例如下:
- 'backbone.layer4',表示第四个 ResLayer 层的输出。
- 'backbone.layer4.2',表示第四个 ResLayer 层中第三个 BottleNeck 块的输出。
- 'backbone.layer4.2.conv1',表示上述 BottleNeck 块中 conv1 层的输出。
备注
对于 ModuleList 或者 Sequential 类型的网络层,可以直接使用索引的方式指定子模块。比如 backbone.layer4[-1] 和 backbone.layer4.2 是相同的,因为 layer4 是一个拥有三个子模块的 Sequential。
使用不同方法可视化 ResNet50,默认 target-category 为模型检测的结果,使用默认推导的 target-layers:
python tools/visualizations/vis_cam.py \
demo/bird.JPEG \
configs/resnet/resnet50_8xb32_in1k.py \
https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_batch256_imagenet_20200708-cfb998bf.pth \
--method GradCAM # GradCAM++, XGradCAM, EigenCAM, EigenGradCAM, LayerCAM
(示例图片:原图与 GradCAM、GradCAM++、EigenGradCAM、LayerCAM 可视化结果的对比)
同一张图不同类别的激活图效果图。在 ImageNet 数据集中,类别 238 为 ‘Greater Swiss Mountain dog’,类别 281 为 ‘tabby, tabby cat’:
python tools/visualizations/vis_cam.py \
demo/cat-dog.png \
configs/resnet/resnet50_8xb32_in1k.py \
https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_batch256_imagenet_20200708-cfb998bf.pth \
--target-layers 'backbone.layer4.2' \
--method GradCAM \
--target-category 238 # --target-category 281
(示例图片:Dog、Cat 两个类别分别在原图、GradCAM、XGradCAM、LayerCAM 下的可视化对比)
使用 --eigen-smooth 以及 --aug-smooth 获取更好的可视化效果:
python tools/visualizations/vis_cam.py \
demo/dog.jpg \
configs/mobilenet_v3/mobilenet-v3-large_8xb32_in1k.py \
https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/convert/mobilenet_v3_large-3ea3c186.pth \
--target-layers 'backbone.layer16' \
--method LayerCAM \
--eigen-smooth --aug-smooth
(示例图片:LayerCAM 结果与分别使用 eigen-smooth、aug-smooth 以及两者同时使用时的对比)
示例(Transformer):
--target-layers 在 Transformer-based 网络中的一些示例如下:
- Swin-Transformer 中:'backbone.norm3'
- ViT 中:'backbone.layers[-1].ln1'
对于 Transformer-based 的网络,比如 ViT、T2T-ViT 和 Swin-Transformer,特征是被展平的。为了绘制 CAM 图,我们需要指定 --vit-like 选项,从而让被展平的特征恢复方形的特征图。
除了特征被展平之外,一些类 ViT 的网络还会添加额外的 tokens。比如 ViT 和 T2T-ViT 中添加了分类 token,DeiT 中还添加了蒸馏 token。在这些网络中,分类计算在最后一个注意力模块之后就已经完成了,分类得分也只和这些额外的 tokens 有关,与特征图无关,也就是说,分类得分对这些特征图的导数为 0。因此,我们不能使用最后一个注意力模块的输出作为 CAM 绘制的目标层。
另外,为了去除这些额外的 tokens 以获得特征图,我们需要知道这些额外 tokens 的数量。MMClassification 中几乎所有 Transformer-based 的网络都拥有 num_extra_tokens 属性。而如果你希望将此工具应用于新的,或者第三方的网络,而且该网络没有指定 num_extra_tokens 属性,那么可以使用 --num-extra-tokens 参数手动指定其数量。
对 Swin Transformer 使用默认 target-layers 进行 CAM 可视化:
python tools/visualizations/vis_cam.py \
demo/bird.JPEG \
configs/swin_transformer/swin-tiny_16xb64_in1k.py \
https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin_tiny_224_b16x64_300e_imagenet_20210616_090925-66df6be6.pth \
--vit-like
对 Vision Transformer(ViT) 进行 CAM 可视化:
python tools/visualizations/vis_cam.py \
demo/bird.JPEG \
configs/vision_transformer/vit-base-p16_ft-64xb64_in1k-384.py \
https://download.openmmlab.com/mmclassification/v0/vit/finetune/vit-base-p16_in21k-pre-3rdparty_ft-64xb64_in1k-384_20210928-98e8652b.pth \
--vit-like \
--target-layers 'backbone.layers[-1].ln1'
对 T2T-ViT 进行 CAM 可视化:
python tools/visualizations/vis_cam.py \
demo/bird.JPEG \
configs/t2t_vit/t2t-vit-t-14_8xb64_in1k.py \
https://download.openmmlab.com/mmclassification/v0/t2t-vit/t2t-vit-t-14_3rdparty_8xb64_in1k_20210928-b7c09b62.pth \
--vit-like \
--target-layers 'backbone.encoder[-1].ln1'
(示例图片:同一张图在 ResNet50、ViT、Swin、T2T-ViT 上的 CAM 可视化对比)
常见问题¶
无
分析¶
日志分析¶
绘制曲线图¶
指定一个训练日志文件,可通过 tools/analysis_tools/analyze_logs.py 脚本绘制指定键值的变化曲线。

python tools/analysis_tools/analyze_logs.py plot_curve \
${JSON_LOGS} \
[--keys ${KEYS}] \
[--title ${TITLE}] \
[--legend ${LEGEND}] \
[--backend ${BACKEND}] \
[--style ${STYLE}] \
[--out ${OUT_FILE}] \
[--window-size ${WINDOW_SIZE}]
所有参数的说明
- json_logs:训练日志文件的路径(可同时传入多个,使用空格分开)。
- --keys:分析日志的关键字段,数量为 len(${JSON_LOGS}) * len(${KEYS}),默认为 'loss'。
- --title:分析日志的图片标题,默认为空。
- --legend:图例名(可同时传入多个,使用空格分开,数目与 ${JSON_LOGS} * ${KEYS} 数目一致)。默认使用 "${JSON_LOG}-${KEYS}"。
- --backend:matplotlib 的绘图后端,默认由 matplotlib 自动选择。
- --style:绘图配色风格,默认为 whitegrid。
- --out:保存分析图片的路径,如不指定则不保存。
- --window-size:可视化窗口大小,如果没有指定,默认为 12*7。如果需要指定,需按照格式 'W*H'。
备注
--style 选项依赖于第三方库 seaborn,需要设置绘图风格请先安装该库。
例如:
绘制某日志文件对应的损失曲线图。
python tools/analysis_tools/analyze_logs.py plot_curve your_log_json --keys loss --legend loss
绘制某日志文件对应的 top-1 和 top-5 准确率曲线图,并将曲线图导出为 results.jpg 文件。
python tools/analysis_tools/analyze_logs.py plot_curve your_log_json --keys accuracy_top-1 accuracy_top-5 --legend top1 top5 --out results.jpg
在同一图像内绘制两份日志文件对应的 top-1 准确率曲线图。
python tools/analysis_tools/analyze_logs.py plot_curve log1.json log2.json --keys accuracy_top-1 --legend run1 run2
备注
本工具会自动根据关键字段选择从日志的训练部分还是验证部分读取,因此如果你添加了自定义的验证指标,请把相对应的关键字段加入到本工具的 TEST_METRICS 变量中。
统计训练时间¶
tools/analysis_tools/analyze_logs.py 也可以根据日志文件统计训练耗时。
python tools/analysis_tools/analyze_logs.py cal_train_time \
${JSON_LOGS}
[--include-outliers]
所有参数的说明:
- json_logs:训练日志文件的路径(可同时传入多个,使用空格分开)。
- --include-outliers:如果指定,将不会排除每个轮次中第一轮迭代的记录(有时第一轮迭代会耗时较长)。
示例:
python tools/analysis_tools/analyze_logs.py cal_train_time work_dirs/some_exp/20200422_153324.log.json
预计输出结果如下所示:
-----Analyze train time of work_dirs/some_exp/20200422_153324.log.json-----
slowest epoch 68, average time is 0.3818
fastest epoch 1, average time is 0.3694
time std over epochs is 0.0020
average iter time: 0.3777 s/iter
结果分析¶
利用 tools/test.py 的 --out 参数,我们可以将所有样本的推理结果保存到输出文件中。利用这一文件,我们可以进行进一步的分析。
评估结果¶
tools/analysis_tools/eval_metric.py 可以用来再次计算评估结果。
python tools/analysis_tools/eval_metric.py \
${CONFIG} \
${RESULT} \
[--metrics ${METRICS}] \
[--cfg-options ${CFG_OPTIONS}] \
[--metric-options ${METRIC_OPTIONS}]
所有参数说明:
- config:配置文件的路径。
- result:tools/test.py 的输出结果文件。
- metrics:评估的衡量指标,可接受的值取决于数据集类。
- --cfg-options:额外的配置选项,会被合入配置文件,参考教程 1:如何编写配置文件。
- --metric-options:如果指定了,这些选项将被传递给数据集 evaluate 函数的 metric_options 参数。
备注
在 tools/test.py 中,我们支持使用 --out-items 选项来选择保存哪些结果。为了使用本工具,请确保结果文件中包含 "class_scores"。
示例:
python tools/analysis_tools/eval_metric.py configs/t2t_vit/t2t-vit-t-14_8xb64_in1k.py ./result.pkl --metrics accuracy --metric-options "topk=(1,5)"
查看典型结果¶
tools/analysis_tools/analyze_results.py 可以保存预测成功/失败,同时得分最高的 k 个图像。
python tools/analysis_tools/analyze_results.py \
${CONFIG} \
${RESULT} \
[--out-dir ${OUT_DIR}] \
[--topk ${TOPK}] \
[--cfg-options ${CFG_OPTIONS}]
所有参数说明:
- config:配置文件的路径。
- result:tools/test.py 的输出结果文件。
- --out-dir:保存结果分析的文件夹路径。
- --topk:分别保存多少张预测成功/失败的图像。如果不指定,默认为 20。
- --cfg-options:额外的配置选项,会被合入配置文件,参考教程 1:如何编写配置文件。
备注
在 tools/test.py 中,我们支持使用 --out-items 选项来选择保存哪些结果。为了使用本工具,请确保结果文件中包含 "pred_score"、"pred_label" 和 "pred_class"。
示例:
python tools/analysis_tools/analyze_results.py \
configs/resnet/resnet50_xxxx.py \
result.pkl \
--out-dir results \
--topk 50
模型复杂度分析¶
计算 FLOPs 和参数量(试验性的)¶
我们根据 flops-counter.pytorch 提供了一个脚本用于计算给定模型的 FLOPs 和参数量。
python tools/analysis_tools/get_flops.py ${CONFIG_FILE} [--shape ${INPUT_SHAPE}]
所有参数说明:
- config:配置文件的路径。
- --shape:输入尺寸,支持单值或者双值,如:--shape 256、--shape 224 256。默认为 224 224。
用户将获得如下结果:
==============================
Input shape: (3, 224, 224)
Flops: 4.12 GFLOPs
Params: 25.56 M
==============================
警告
此工具仍处于试验阶段,我们不保证该数字正确无误。您最好将结果用于简单比较,但在技术报告或论文中采用该结果之前,请仔细检查。
FLOPs 与输入的尺寸有关,而参数量与输入尺寸无关。默认输入尺寸为 (1, 3, 224, 224)。
一些运算不会被计入 FLOPs 的统计中,例如 GN 和自定义运算。详细信息请参考 mmcv.cnn.get_model_complexity_info()。
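也可以在 Python 中直接调用上述接口来统计复杂度。下面是一个简单示意(配置文件路径为示例;与 get_flops.py 类似,这里用 forward_dummy 做前向,结果仅供粗略比较):
import mmcv
from mmcv.cnn import get_model_complexity_info
from mmcls.models import build_classifier

# 从配置文件构建分类器(路径为示例)
cfg = mmcv.Config.fromfile('configs/resnet/resnet50_8xb32_in1k.py')
model = build_classifier(cfg.model)
model.eval()

# 使用 forward_dummy 进行前向,只包含特征提取和分类头
if hasattr(model, 'forward_dummy'):
    model.forward = model.forward_dummy

# 输入尺寸不包含 batch 维度,默认返回字符串形式的 FLOPs 与参数量
flops, params = get_model_complexity_info(model, (3, 224, 224))
print(flops, params)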
常见问题¶
无
其他工具¶
打印完整配置¶
tools/misc/print_config.py 脚本会解析所有输入变量,并打印完整配置信息。
python tools/misc/print_config.py ${CONFIG} [--cfg-options ${CFG_OPTIONS}]
所有参数说明:
- config:配置文件的路径。
- --cfg-options:额外的配置选项,会被合入配置文件,参考教程 1:如何编写配置文件。
示例:
python tools/misc/print_config.py configs/t2t_vit/t2t-vit-t-14_8xb64_in1k.py
检查数据集¶
tools/misc/verify_dataset.py 脚本会检查数据集的所有图片,查看是否有已经损坏的图片。
python tools/misc/verify_dataset.py \
${CONFIG} \
[--out-path ${OUT-PATH}] \
[--phase ${PHASE}] \
[--num-process ${NUM-PROCESS}] \
[--cfg-options ${CFG_OPTIONS}]
所有参数说明:
- config:配置文件的路径。
- --out-path:输出结果路径,默认为 'brokenfiles.log'。
- --phase:检查哪个阶段的数据集,可用值为 "train"、"test" 或者 "val",默认为 "train"。
- --num-process:指定的进程数,默认为 1。
- --cfg-options:额外的配置选项,会被合入配置文件,参考教程 1:如何编写配置文件。
示例:
python tools/misc/verify_dataset.py configs/t2t_vit/t2t-vit-t-14_8xb64_in1k.py --out-path broken_imgs.log --phase val --num-process 8
常见问题¶
无
参与贡献 OpenMMLab¶
欢迎任何类型的贡献,包括但不限于
修改拼写错误或代码错误
添加文档或将文档翻译成其他语言
添加新功能和新组件
工作流程¶
fork 并 pull 最新的 OpenMMLab 仓库 (MMClassification)
签出到一个新分支(不要使用 master 分支提交 PR)
进行修改并提交至 fork 出的自己的远程仓库
在我们的仓库中创建一个 PR
备注
如果你计划添加一些新的功能,并引入大量改动,请尽量首先创建一个 issue 来进行讨论。
代码风格¶
Python¶
我们采用 PEP8 作为统一的代码风格。
我们使用下列工具来进行代码风格检查与格式化:
flake8: Python 官方发布的代码规范检查工具,是多个检查工具的封装
isort: 自动调整模块导入顺序的工具
yapf: 一个 Python 文件的格式化工具。
codespell: 检查单词拼写是否有误
mdformat: 检查 markdown 文件的工具
docformatter: 一个 docstring 格式化工具。
yapf 和 isort 的格式设置位于 setup.cfg
我们使用 pre-commit hook 来保证每次提交时自动进行代码检查和格式化,启用的功能包括 flake8、yapf、isort、trailing whitespaces、markdown files、修复 end-of-files、double-quoted-strings、python-encoding-pragma、mixed-line-ending、对 requirements.txt 的排序等。
pre-commit hook 的配置文件位于 .pre-commit-config
在你克隆仓库后,你需要按照如下步骤安装并初始化 pre-commit hook。
pip install -U pre-commit
在仓库文件夹中执行
pre-commit install
在此之后,每次提交,代码规范检查和格式化工具都将被强制执行。
重要
在创建 PR 之前,请确保你的代码完成了代码规范检查,并经过了 yapf 的格式化。
C++ 和 CUDA¶
C++ 和 CUDA 的代码规范遵从 Google C++ Style Guide
mmcls.apis¶
These are some high-level APIs for classification tasks.
Train¶
Test¶
Inference¶
mmcls.core¶
This package includes some runtime components. These components are useful in classification tasks but not supported by MMCV yet.
备注
Some components may be moved to MMCV in the future.
mmcls.core
Evaluation¶
Evaluation metrics calculation functions
Hook¶
Optimizers¶
mmcls.models¶
The models package contains several sub-packages for addressing the different components of a model.
Classifier: The top-level module which defines the whole process of a classification model.
Backbones: Usually a feature extraction network, e.g., ResNet, MobileNet.
Necks: The component between backbones and heads, e.g., GlobalAveragePooling.
Heads: The component for specific tasks. In MMClassification, we provide heads for classification.
Losses: Loss functions.
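As a rough illustration of how these components fit together, the sketch below shows a typical classifier config in the style used by MMClassification; the concrete values mirror a common ResNet-50 ImageNet setting and are for illustration only.
# A minimal sketch of an ImageClassifier config combining the components above.
model = dict(
    type='ImageClassifier',
    # Backbone: the feature extraction network.
    backbone=dict(
        type='ResNet',
        depth=50,
        num_stages=4,
        out_indices=(3, ),
        style='pytorch'),
    # Neck: pools the feature map into a feature vector.
    neck=dict(type='GlobalAveragePooling'),
    # Head: the task-specific classification head and its loss.
    head=dict(
        type='LinearClsHead',
        num_classes=1000,
        in_channels=2048,
        loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
        topk=(1, 5)))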
Classifier¶
Backbones¶
Necks¶
Heads¶
Losses¶
mmcls.models.utils¶
This package includes some helper functions and common components used in various networks.
mmcls.models.utils
Common Components¶
Helper Functions¶
channel_shuffle¶
make_divisible¶
to_ntuple¶
is_tracing¶
mmcls.datasets¶
The datasets package contains several common datasets for image classification tasks and some dataset wrappers.
Custom Dataset¶
ImageNet¶
CIFAR¶
MNIST¶
VOC¶
Stanford Cars¶
Base classes¶
Dataset Wrappers¶
Data Transformations¶
In MMClassification, data preparation and the dataset are decoupled. The datasets only define how to get samples' basic information from the file system. This basic information includes the ground-truth label and the raw image data or the paths of images.
To prepare the input data, we need to do some transformations on this basic information. These transformations include loading, preprocessing and formatting, and a series of data transformations makes up a data pipeline. Therefore, you can find a pipeline argument in the configs of datasets, for example:
img_norm_cfg = dict(
mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='RandomResizedCrop', size=224),
dict(type='RandomFlip', flip_prob=0.5, direction='horizontal'),
dict(type='Normalize', **img_norm_cfg),
dict(type='ImageToTensor', keys=['img']),
dict(type='ToTensor', keys=['gt_label']),
dict(type='Collect', keys=['img', 'gt_label'])
]
test_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='Resize', size=256),
dict(type='CenterCrop', crop_size=224),
dict(type='Normalize', **img_norm_cfg),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img'])
]
data = dict(
train=dict(..., pipeline=train_pipeline),
val=dict(..., pipeline=test_pipeline),
test=dict(..., pipeline=test_pipeline),
)
Every item of a pipeline list is one of the following data transformation classes. If you want to add a custom data transformation class, the tutorial Custom Data Pipelines will help you.
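For reference, a custom transformation is usually a small callable class registered to the pipeline registry. The sketch below is a minimal, hypothetical example (the class name and the operation are made up for illustration); it could then be used in a pipeline as dict(type='MyFlipChannels').
import numpy as np

from mmcls.datasets.builder import PIPELINES


@PIPELINES.register_module()
class MyFlipChannels:
    """A toy transform that reverses the channel order of the image."""

    def __call__(self, results):
        # 'results' is the dict passed along the pipeline; 'img' holds the image array.
        results['img'] = np.ascontiguousarray(results['img'][..., ::-1])
        return results

    def __repr__(self):
        return self.__class__.__name__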
mmcls.datasets.pipelines
Loading¶
LoadImageFromFile¶
Preprocessing and Augmentation¶
CenterCrop¶
Lighting¶
Normalize¶
Pad¶
Resize¶
RandomCrop¶
RandomErasing¶
RandomFlip¶
RandomGrayscale¶
RandomResizedCrop¶
ColorJitter¶
Composed Augmentation¶
Composed augmentation is a kind of method that composes a series of data augmentation transformations, such as AutoAugment and RandAugment.
In composed augmentation, we need to specify several data transformations or several groups of data transformations (the policies argument) as the random sampling space. These data transformations are chosen from the table below. In addition, we provide some preset policies in this folder.
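As a rough sketch of how composed augmentation appears in a pipeline config, the snippet below inserts an AutoAugment step with a toy two-branch policy. The sub-policies here are illustrative only (real recipes live in the preset policy files mentioned above), and img_norm_cfg refers to the normalization config from the earlier example.
# Illustrative only: a tiny AutoAugment policy with two sub-policies.
policies = [
    [dict(type='Posterize', bits=4, prob=0.4),
     dict(type='Rotate', angle=30., prob=0.6)],
    [dict(type='Equalize', prob=0.8),
     dict(type='Solarize', thr=128, prob=0.6)],
]

train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='RandomResizedCrop', size=224),
    dict(type='RandomFlip', flip_prob=0.5, direction='horizontal'),
    # One sub-policy is randomly sampled and applied to each image.
    dict(type='AutoAugment', policies=policies),
    dict(type='Normalize', **img_norm_cfg),  # img_norm_cfg as in the example above
    dict(type='ImageToTensor', keys=['img']),
    dict(type='ToTensor', keys=['gt_label']),
    dict(type='Collect', keys=['img', 'gt_label']),
]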
Formatting¶
Collect¶
ImageToTensor¶
ToNumpy¶
ToPIL¶
ToTensor¶
Transpose¶
Batch Augmentation¶
Batch augmentation is augmentation that involves multiple samples, such as Mixup and CutMix.
In MMClassification, these batch augmentations are used as a part of the Classifier. A typical usage is as below:
model = dict(
backbone = ...,
neck = ...,
head = ...,
train_cfg=dict(augments=[
dict(type='BatchMixup', alpha=0.8, prob=0.5, num_classes=num_classes),
dict(type='BatchCutMix', alpha=1.0, prob=0.5, num_classes=num_classes),
]))
Mixup¶
CutMix¶
ResizeMix¶
mmcls.utils¶
These are some useful helper functions in the utils package.
Changelog¶
v0.25.0(06/12/2022)¶
Highlights¶
Support MLU backend.
New Features¶
Improvements¶
Add dist_train_arm.sh for ARM device and update NPU results. (#1218)
Bug Fixes¶
Docs Update¶
v0.24.1(31/10/2022)¶
New Features¶
Support mmcls with NPU backend. (#1072)
Bug Fixes¶
Fix performance issue in convnext DDP train. (#1098)
v0.24.0(30/9/2022)¶
Highlights¶
Support HorNet, EfficientFormer, SwinTransformer V2 and MViT backbones.
Support Stanford Cars dataset.
New Features¶
Improvements¶
[Improve] replace loop of progressbar in api/test. (#878)
[Enhance] RepVGG for YOLOX-PAI. (#1025)
[Enhancement] Update VAN. (#1017)
[Refactor] Re-write get_sinusoid_encoding from third-party implementation. (#965)
[Improve] Upgrade onnxsim to v0.4.0. (#915)
[Improve] Fixed typo in RepVGG. (#985)
[Improve] Using train_step instead of forward in PreciseBNHook. (#964)
[Improve] Use forward_dummy to calculate FLOPS. (#953)
Bug Fixes¶
Fix warning with torch.meshgrid. (#860)
Add matplotlib minimum version requirements. (#909)
The val loader should not drop the last batch by default. (#857)
Fix config.device bug in tutorial. (#1059)
Fix attention clamp max params. (#1034)
Fix device mismatch in Swin-v2. (#976)
Fix the output position of Swin-Transformer. (#947)
Docs Update¶
v0.23.2(28/7/2022)¶
New Features¶
Support MPS device. (#894)
Bug Fixes¶
Fix a bug in Albu which caused crashing. (#918)
v0.23.1(2/6/2022)¶
New Features¶
Dedicated MMClsWandbHook for MMClassification (Weights and Biases Integration) (#764)
Improvements¶
Use mdformat instead of markdownlint to format markdown. (#844)
Bug Fixes¶
Fix wrong --local_rank.
Docs Update¶
v0.23.0(1/5/2022)¶
New Features¶
Improvements¶
Support training on IPU and add fine-tuning configs of ViT. (#723)
Docs Update¶
v0.22.1(15/4/2022)¶
New Features¶
Improvements¶
v0.22.0(30/3/2022)¶
Highlights¶
Support a series of CSP Networks, such as CSP-ResNet, CSP-ResNeXt and CSP-DarkNet.
A new CustomDataset class to help you build your own dataset!
Support ConvMixer, RepMLP and a new dataset - the CUB dataset.
New Features¶
[Feature] Add CSPNet backbone and checkpoints. (#735)
[Feature] Add CustomDataset. (#738)
[Feature] Add different seeds to different ranks. (#744)
[Feature] Support ConvMixer. (#716)
[Feature] Our dist_train & dist_test tools support distributed training on multiple machines. (#734)
[Feature] Add RepMLP backbone and checkpoints. (#709)
[Feature] Support CUB dataset. (#703)
[Feature] Support ResizeMix. (#676)
Improvements¶
Bug Fixes¶
[Fix] Fix the discontiguous output feature map of ConvNeXt. (#743)
Docs Update¶
v0.21.0(04/03/2022)¶
Highlights¶
Support ResNetV1c and Wide-ResNet, and provide pre-trained models.
Support dynamic input shape for ViT-based algorithms. Now our ViT, DeiT, Swin-Transformer and T2T-ViT support forwarding with any input shape.
Reproduce the training results of DeiT. Our DeiT-T and DeiT-S have higher accuracy compared with the official weights.
New Features¶
Improvements¶
Reproduce training results of DeiT. (#711)
Add ConvNeXt pretrain models on ImageNet-1k. (#707)
Support dynamic input shape for ViT-based algorithms. (#706)
Add evaluate function for ConcatDataset. (#650)
Enhance vis-pipeline tool. (#604)
Return code 1 if a script fails. (#694)
Use PyTorch official one_hot to implement convert_to_one_hot. (#696)
Add a new pre-commit hook to automatically add a copyright. (#710)
Add deprecation message for deploy tools. (#697)
Upgrade isort pre-commit hooks. (#687)
Use --gpu-id instead of --gpu-ids in non-distributed multi-gpu training/testing. (#688)
Remove deprecation. (#633)
Bug Fixes¶
v0.20.1(07/02/2022)¶
Bug Fixes¶
Fix the MMCV dependency version.
v0.20.0(30/01/2022)¶
Highlights¶
Support K-fold cross-validation. The tutorial will be released later.
Support HRNet, ConvNeXt, Twins and EfficientNet.
Support model conversion from PyTorch to Core-ML by a tool.
New Features¶
Support K-fold cross-validation. (#563)
Support HRNet and add pre-trained models. (#660)
Support ConvNeXt and add pre-trained models. (#670)
Support Twins and add pre-trained models. (#642)
Support EfficientNet and add pre-trained models. (#649)
Support `features_only` option in `TIMMBackbone`. (#668)
Add conversion script from PyTorch to Core-ML model. (#597)
Improvements¶
New-style CPU training and inference. (#674)
Add setup multi-processing both in train and test. (#671)
Rewrite channel split operation in ShufflenetV2. (#632)
Deprecate the support for “python setup.py test”. (#646)
Support single-label, softmax and custom eps in asymmetric loss. (#609)
Save class names in best checkpoint created by evaluation hook. (#641)
Bug Fixes¶
Docs Update¶
v0.19.0(31/12/2021)¶
Highlights¶
The feature extraction function has been enhanced. See #593 for more details.
Provide the high-acc ResNet-50 training settings from ResNet strikes back.
Reproduce the training accuracy of T2T-ViT & RegNetX, and provide self-training checkpoints.
Support DeiT & Conformer backbone and checkpoints.
Provide a CAM visualization tool based on pytorch-grad-cam, and a detailed user guide!
New Features¶
Support Precise BN. (#401)
Add CAM visualization tool. (#577)
Repeated Aug and Sampler Registry. (#588)
Add DeiT backbone and checkpoints. (#576)
Support LAMB optimizer. (#591)
Implement the conformer backbone. (#494)
Add the frozen function for Swin Transformer model. (#574)
Support using checkpoint in Swin Transformer to save memory. (#557)
Improvements¶
[Reproduction] Reproduce RegNetX training accuracy. (#587)
[Reproduction] Reproduce training results of T2T-ViT. (#610)
[Enhance] Provide high-acc training settings of ResNet. (#572)
[Enhance] Set a random seed when the user does not set a seed. (#554)
[Enhance] Added `NumClassCheckHook` and unit tests. (#559)
[Enhance] Enhance feature extraction function. (#593)
[Enhance] Improve efficiency of precision, recall, f1_score and support. (#595)
[Enhance] Improve accuracy calculation performance. (#592)
[Refactor] Refactor `analysis_log.py`. (#529)
[Refactor] Use new API of matplotlib to handle blocking input in visualization. (#568)
[CI] Cancel previous runs that are not completed. (#583)
[CI] Skip build CI if only configs or docs modification. (#575)
Bug Fixes¶
Docs Update¶
v0.18.0(30/11/2021)¶
Highlights¶
Support MLP-Mixer backbone and provide pre-trained checkpoints.
Add a tool to visualize the learning rate curve of the training phase. Welcome to use with the tutorial!
New Features¶
Improvements¶
Use CircleCI to do unit tests. (#567)
Focal loss for single label tasks. (#548)
Remove useless `import_modules_from_string`. (#544)
Rename config files according to the config name standard. (#508)
Use `reset_classifier` to remove head of timm backbones. (#534)
Support passing arguments to loss from head. (#523)
Refactor `Resize` transform and add `Pad` transform. (#506)
Update mmcv dependency version. (#509)
Bug Fixes¶
Fix bug when using `ClassBalancedDataset`. (#555)
Fix a bug when using iter-based runner with 'val' workflow. (#542)
Fix interpolation method checking in `Resize`. (#547)
Fix a bug when loading checkpoints in multi-GPU environments. (#527)
Fix an error on indexing scalar metrics in `analyze_result.py`. (#518)
Fix wrong condition judgment in `analyze_logs.py` and prevent empty curve. (#510)
Docs Update¶
Fix vit config and model broken links. (#564)
Add abstract and image for every paper. (#546)
Add mmflow and mim in banner and readme. (#543)
Add schedule and runtime tutorial docs. (#499)
Add the top-5 acc in ResNet-CIFAR README. (#531)
Fix TOC of `visualization.md` and add example images. (#513)
Use docs link of other projects and add MMCV docs. (#511)
v0.17.0(29/10/2021)¶
Highlights¶
Support Tokens-to-Token ViT backbone and Res2Net backbone. Welcome to use!
Support ImageNet21k dataset.
Add a pipeline visualization tool. Try it with the tutorials!
New Features¶
Add Tokens-to-Token ViT backbone and converted checkpoints. (#467)
Add Res2Net backbone and converted weights. (#465)
Support ImageNet21k dataset. (#461)
Support seesaw loss. (#500)
Add a pipeline visualization tool. (#406)
Add a tool to find broken files. (#482)
Add a tool to test TorchServe. (#468)
Improvements¶
Bug Fixes¶
Docs Update¶
v0.16.0(30/9/2021)¶
Highlights¶
We have improved compatibility with downstream repositories like MMDetection and MMSegmentation. We will add some examples about how to use our backbones in MMDetection.
Add RepVGG backbone and checkpoints. Welcome to use it!
Add timm backbones wrapper, now you can simply use backbones of pytorch-image-models in MMClassification!
New Features¶
Improvements¶
Fix TnT compatibility and verbose warning. (#436)
Support setting `--out-items` in `tools/test.py`. (#437)
Add datetime info and saving model using torch<1.6 format. (#439)
Improve downstream repositories compatibility. (#421)
Rename the option `--options` to `--cfg-options` in some tools. (#425)
Add PyTorch 1.9 and Python 3.9 build workflow, and remove some CI. (#422)
Bug Fixes¶
Docs Update¶
v0.15.0(31/8/2021)¶
Highlights¶
Support `hparams` argument in `AutoAugment` and `RandAugment` to provide hyperparameters for sub-policies.
Support custom squeeze channels in `SELayer`.
Support classwise weight in losses.
New Features¶
Code Refactor¶
Better result visualization. (#419)
Use `post_process` function to handle pred result processing. (#390)
Update `digit_version` function. (#402)
Avoid installing both opencv and opencv-headless when using albumentations. (#397)
Avoid unnecessary listdir when building ImageNet. (#396)
Use dynamic mmcv download link in TorchServe dockerfile. (#387)
Docs Improvement¶
v0.14.0(4/8/2021)¶
Highlights¶
Add transformer-in-transformer backbone and pretrained checkpoints; refer to the paper.
Add Chinese colab tutorial.
Provide dockerfile to build mmcls dev docker image.
New Features¶
Improvements¶
Bug Fixes¶
Fix ImageNet dataset annotation file parse bug. (#370)
Fix docstring typo and init bug in ShuffleNetV1. (#374)
Use local ATTENTION registry to avoid conflict with other repositories. (#376)
Fix swin transformer config bug. (#355)
Fix `patch_cfg` argument bug in SwinTransformer. (#368)
Fix duplicate `init_weights` call in ViT init function. (#373)
Fix broken `_base_` link in a resnet config. (#361)
Fix missing vgg-19 model link. (#363)
v0.13.0(3/7/2021)¶
Support Swin-Transformer backbone and add training configs for Swin-Transformer on ImageNet.
New Features¶
Support Swin-Transformer backbone and add training configs for Swin-Transformer on ImageNet. (#271)
Add pretrained model of RegNetX. (#269)
Support adding custom hooks in config file. (#305)
Improve and add Chinese translation of `CONTRIBUTING.md` and all tools tutorials. (#320)
Dump config before training. (#282)
Add torchscript and torchserve deployment tools. (#279, #284)
Improvements¶
Improve test tools and add some new tools. (#322)
Correct MobilenetV3 backbone structure and add pretrained models. (#291)
Refactor `PatchEmbed` and `HybridEmbed` as independent components. (#330)
Refactor mixup and cutmix as `Augments` to support more functions. (#278)
Refactor weights initialization method. (#270, #318, #319)
Refactor `LabelSmoothLoss` to support multiple calculation formulas. (#285)
Bug Fixes¶
Fix bug for CPU training. (#286)
Fix missing test data when `num_imgs` cannot be evenly divided by `num_gpus`. (#299)
Fix build compatibility with pytorch v1.3-1.5. (#301)
Fix `magnitude_std` bug in `RandAugment`. (#309)
Fix bug when `samples_per_gpu` is 1. (#311)
v0.12.0(3/6/2021)¶
Finish adding Chinese tutorials and build Chinese documentation on readthedocs.
Update ResNeXt checkpoints and ResNet checkpoints on CIFAR.
New Features¶
Improve and add Chinese translation of `data_pipeline.md` and `new_modules.md`. (#265)
Build Chinese translation on readthedocs. (#267)
Add an argument `efficientnet_style` to `RandomResizedCrop` and `CenterCrop`. (#268)
Improvements¶
Only allow directory operation when rank==0 when testing. (#258)
Fix typo in `base_head`. (#274)
Update ResNeXt checkpoints. (#283)
Bug Fixes¶
Add attribute `data.test` in MNIST configs. (#264)
Download CIFAR/MNIST dataset only on rank 0. (#273)
Fix MMCV version compatibility. (#276)
Fix CIFAR color channels bug and update checkpoints in model zoo. (#280)
v0.11.1(21/5/2021)¶
Refine `new_dataset.md` and add Chinese translation of `finture.md`, `new_dataset.md`.
New Features¶
Add `dim` argument for `GlobalAveragePooling`. (#236)
Add random noise to `RandAugment` magnitude. (#240)
Refine `new_dataset.md` and add Chinese translation of `finture.md`, `new_dataset.md`. (#243)
Improvements¶
Refactor arguments passing for Heads. (#239)
Allow more flexible `magnitude_range` in `RandAugment`. (#249)
Inherit MMCV registry so that in the future OpenMMLab repos like MMDet and MMSeg could directly use the backbones supported in MMCls. (#252)
Bug Fixes¶
Fix typo in `analyze_results.py`. (#237)
Fix typo in unittests. (#238)
Check if specified tmpdir exists when testing to avoid deleting existing data. (#242 & #258)
Add missing config files in `MANIFEST.in`. (#250 & #255)
Use temporary directory under shared directory to collect results to avoid unavailability of temporary directory for multi-node testing. (#251)
v0.11.0(1/5/2021)¶
Support cutmix trick.
Support random augmentation.
Add `tools/deployment/test.py` as an ONNX runtime test tool.
Support ViT backbone and add training configs for ViT on ImageNet.
Add Chinese `README.md` and some Chinese tutorials.
New Features¶
Support cutmix trick. (#198)
Add `simplify` option in `pytorch2onnx.py`. (#200)
Support random augmentation. (#201)
Add config and checkpoint for training ResNet on CIFAR-100. (#208)
Add `tools/deployment/test.py` as an ONNX runtime test tool. (#212)
Support ViT backbone and add training configs for ViT on ImageNet. (#214)
Add finetuning configs for ViT on ImageNet. (#217)
Add `device` option to support training on CPU. (#219)
Add Chinese `README.md` and some Chinese tutorials. (#221)
Add `metafile.yml` in configs to support interaction with Papers with Code (PWC) and MMCLI. (#225)
Upload configs and converted checkpoints for ViT finetuning on ImageNet. (#230)
Improvements¶
Fix `LabelSmoothLoss` so that label smoothing and mixup could be enabled at the same time. (#203)
Add `cal_acc` option in `ClsHead`. (#206)
Check `CLASSES` in checkpoint to avoid unexpected key error. (#207)
Check mmcv version when importing mmcls to ensure compatibility. (#209)
Update `CONTRIBUTING.md` to align with that in MMCV. (#210)
Change tags to HTML comments in configs README.md. (#226)
Clean code in ViT backbone. (#227)
Reformat `pytorch2onnx.md` tutorial. (#229)
Update `setup.py` to support MMCLI. (#232)
Bug Fixes¶
Fix missing `cutmix_prob` in ViT configs. (#220)
Fix backend for resize in ResNeXt configs. (#222)
v0.10.0(1/4/2021)¶
Support AutoAugmentation
Add tutorials for installation and usage.
New Features¶
Add `Rotate` pipeline for data augmentation. (#167)
Add `Invert` pipeline for data augmentation. (#168)
Add `Color` pipeline for data augmentation. (#171)
Add `Solarize` and `Posterize` pipelines for data augmentation. (#172)
Support fp16 training. (#178)
Add tutorials for installation and basic usage of MMClassification. (#176)
Support `AutoAugmentation`, `AutoContrast`, `Equalize`, `Contrast`, `Brightness` and `Sharpness` pipelines for data augmentation. (#179)
Improvements¶
Support dynamic shape export to onnx. (#175)
Release training configs and update model zoo for fp16 (#184)
Use MMCV’s EvalHook in MMClassification (#182)
Bug Fixes¶
Fix wrong naming in vgg config (#181)
v0.9.0(1/3/2021)¶
Implement mixup trick.
Add a new tool to create TensorRT engine from ONNX, run inference and verify outputs in Python.
New Features¶
Implement mixup and provide configs of training ResNet50 using mixup. (#160)
Add `Shear` pipeline for data augmentation. (#163)
Add `Translate` pipeline for data augmentation. (#165)
Add `tools/onnx2tensorrt.py` as a tool to create TensorRT engine from ONNX, run inference and verify outputs in Python. (#153)
Improvements¶
Add `--eval-options` in `tools/test.py` to support eval options override, matching the behavior of other open-mmlab projects. (#158)
Support showing and saving painted results in `mmcls.apis.test` and `tools/test.py`, matching the behavior of other open-mmlab projects. (#162)
Bug Fixes¶
Fix configs for VGG, replace checkpoints converted from other repos with the ones trained by ourselves and upload the missing logs in the model zoo. (#161)
v0.8.0(31/1/2021)¶
Support multi-label task.
Support more flexible metrics settings.
Fix bugs.
New Features¶
Add evaluation metrics: mAP, CP, CR, CF1, OP, OR, OF1 for multi-label task. (#123)
Add BCE loss for multi-label task. (#130)
Add focal loss for multi-label task. (#131)
Support PASCAL VOC 2007 dataset for multi-label task. (#134)
Add asymmetric loss for multi-label task. (#132)
Add analyze_results.py to select images for success/fail demonstration. (#142)
Support new metric that calculates the total number of occurrences of each label. (#143)
Support class-wise evaluation results. (#143)
Add thresholds in eval_metrics. (#146)
Add heads and a baseline config for multilabel task. (#145)
Improvements¶
Remove the models with 0 checkpoint and ignore the repeated papers when counting papers to gain more accurate model statistics. (#135)
Add tags in README.md. (#137)
Fix optional issues in docstring. (#138)
Update stat.py to classify papers. (#139)
Fix mismatched columns in README.md. (#150)
Fix test.py to support more evaluation metrics. (#155)
Bug Fixes¶
Fix bug in VGG weight_init. (#140)
Fix bug in 2 ResNet configs in which outdated heads were used. (#147)
Fix bug of misordered height and width in `RandomCrop` and `RandomResizedCrop`. (#151)
Fix missing `meta_keys` in `Collect`. (#149 & #152)
v0.7.0(31/12/2020)¶
Add more evaluation metrics.
Fix bugs.
New Features¶
Remove installation of MMCV from requirements. (#90)
Add 3 evaluation metrics: precision, recall and F-1 score. (#93)
Allow config override during testing and inference with `--options`. (#91 & #96)
Improvements¶
Use `build_runner` to make runners more flexible. (#54)
Support getting category ids in `BaseDataset`. (#72)
Allow `CLASSES` override during `BaseDataset` initialization. (#85)
Allow input image as ndarray during inference. (#87)
Optimize MNIST config. (#98)
Add config links in model zoo documentation. (#99)
Use functions from MMCV to collect environment. (#103)
Refactor config files so that they are now categorized by methods. (#116)
Add README in config directory. (#117)
Add model statistics. (#119)
Refactor documentation in consistency with other MM repositories. (#126)
Bug Fixes¶
Add missing `CLASSES` argument to dataset wrappers. (#66)
Fix slurm evaluation error during training. (#69)
Resolve error caused by shape in `Accuracy`. (#104)
Fix bug caused by extremely insufficient data in distributed sampler. (#108)
Fix bug in `gpu_ids` in distributed training. (#107)
Fix bug caused by extremely insufficient data in collecting results during testing. (#114)
v0.6.0(11/10/2020)¶
Support new method: ResNeSt and VGG.
Support new dataset: CIFAR10.
Provide new tools for model inference and model conversion from PyTorch to ONNX.
New Features¶
Add model inference. (#16)
Add pytorch2onnx. (#20)
Add PIL backend for transform `Resize`. (#21)
Add ResNeSt. (#25)
Add VGG and its pretrained models. (#27)
Add CIFAR10 configs and models. (#38)
Add albumentations transforms. (#45)
Visualize results on image demo. (#58)
Improvements¶
Replace urlretrieve with urlopen in dataset.utils. (#13)
Resize image according to its short edge. (#22)
Update ShuffleNet config. (#31)
Update pre-trained models for shufflenet_v2, shufflenet_v1, se-resnet50, se-resnet101. (#33)
Bug Fixes¶
Fix init_weights in `shufflenet_v2.py`. (#29)
Fix the parameter `size` in test_pipeline. (#30)
Fix the parameter in cosine lr schedule. (#32)
Fix the convert tools for mobilenet_v2. (#34)
Fix crash in CenterCrop transform when image is greyscale (#40)
Fix outdated configs. (#53)
0.x Compatibility Issues¶
MMClassification 0.20.1¶
MMCV Compatibility¶
In the Twins backbone, we use the `PatchEmbed` module provided by MMCV. This module was added in MMCV 1.4.2, so the MMCV dependency needs to be upgraded to 1.4.2.
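If you are unsure whether your environment satisfies this requirement, the following minimal sketch checks it at runtime; it assumes MMCV 1.x and the usual `mmcv.cnn.bricks.transformer` import path.
# Minimal sketch: check that the installed MMCV is new enough to provide the
# PatchEmbed module required by the Twins backbone (added in MMCV 1.4.2).
import mmcv
from mmcv import digit_version

if digit_version(mmcv.__version__) < digit_version('1.4.2'):
    raise RuntimeError(
        f'Twins needs MMCV >= 1.4.2, but MMCV {mmcv.__version__} is installed.')

# Import path assumed for MMCV 1.4.2+; adjust if your MMCV differs.
from mmcv.cnn.bricks.transformer import PatchEmbed  # noqa: E402

print('PatchEmbed is available:', PatchEmbed.__name__)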
Frequently Asked Questions¶
We list some common problems and their solutions here. Feel free to enrich this list if you find any frequent issues and know how to solve them. If the contents here do not cover your problem, please open an issue on GitHub using the issue template and fill in all the information required by the template.
Installation¶
Compatibility between MMCV and MMClassification, e.g. "AssertionError: MMCV==xxx is used but incompatible. Please install mmcv>=xxx, <=xxx."
Here we list the MMCV version requirements of each MMClassification version. Please choose the appropriate MMCV version to avoid problems during installation and use. A snippet for checking the versions installed in your environment follows the table.
| MMClassification version | MMCV version |
| --- | --- |
| dev | mmcv>=1.7.0, <1.9.0 |
| 0.25.0 (master) | mmcv>=1.4.2, <1.9.0 |
| 0.24.1 | mmcv>=1.4.2, <1.9.0 |
| 0.23.2 | mmcv>=1.4.2, <1.7.0 |
| 0.22.1 | mmcv>=1.4.2, <1.6.0 |
| 0.21.0 | mmcv>=1.4.2, <=1.5.0 |
| 0.20.1 | mmcv>=1.4.2, <=1.5.0 |
| 0.19.0 | mmcv>=1.3.16, <=1.5.0 |
| 0.18.0 | mmcv>=1.3.16, <=1.5.0 |
| 0.17.0 | mmcv>=1.3.8, <=1.5.0 |
| 0.16.0 | mmcv>=1.3.8, <=1.5.0 |
| 0.15.0 | mmcv>=1.3.8, <=1.5.0 |
| 0.14.0 | mmcv>=1.3.8, <=1.5.0 |
| 0.13.0 | mmcv>=1.3.8, <=1.5.0 |
| 0.12.0 | mmcv>=1.3.1, <=1.5.0 |
| 0.11.1 | mmcv>=1.3.1, <=1.5.0 |
| 0.11.0 | mmcv>=1.3.0 |
| 0.10.0 | mmcv>=1.3.0 |
| 0.9.0 | mmcv>=1.1.4 |
| 0.8.0 | mmcv>=1.1.4 |
| 0.7.0 | mmcv>=1.1.4 |
| 0.6.0 | mmcv>=1.1.4 |
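To compare your environment with the table above, the following minimal sketch prints the installed MMCV and MMClassification versions (it assumes both packages are already installed):
# Minimal sketch: print the installed versions so they can be checked
# against the compatibility table above.
import mmcv
import mmcls

print('mmcv version :', mmcv.__version__)
print('mmcls version:', mmcls.__version__)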
Note
Since the `dev` branch is under active development, the MMCV version requirement above may be inaccurate. If you run into problems when using the `dev` branch, please try updating MMCV to the latest version.
Using Albumentations
If you want to use the `albumentations`-related features, we recommend installing it with `pip install -r requirements/optional.txt` or `pip install -U albumentations>=0.3.2 --no-binary qudida,albumentations`.
If you simply run `pip install albumentations>=0.3.2`, it will also install `opencv-python-headless` (even if `opencv-python` is already installed). For details, please refer to the official documentation.
Development¶
Do I need to reinstall mmcls after making changes to the source code?
If you install mmcls from source following the best practice, any local modification takes effect without reinstalling.
How can I develop with multiple MMClassification versions?
Generally, we recommend managing multiple development copies of MMClassification with separate virtual environments. However, if you want to develop in several directories (e.g. mmcls-0.21, mmcls-0.23) with the same environment, the training and testing shell scripts we provide automatically use the mmcls in the current directory, and for other Python scripts you can prepend ``PYTHONPATH=`pwd` `` to the command to use the code in the current directory.
Conversely, if you want the shell scripts to use the MMClassification installed in the environment rather than the one in the current directory, you can remove the following line from the shell scripts:
PYTHONPATH="$(dirname $0)/..":$PYTHONPATH
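To confirm which copy of mmcls a Python process actually picks up, whether the one installed in the environment or the one in the current directory, a minimal check:
# Minimal sketch: show where the imported mmcls package comes from and which
# version it is, useful when switching between multiple checkout directories.
import mmcls

print('mmcls version:', mmcls.__version__)
print('imported from:', mmcls.__file__)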
NPU (HUAWEI Ascend)¶
Usage¶
First, please refer to the tutorial to install MMCV with NPU support.
Use the following command to train a model with 8 NPUs on a machine (taking ResNet as an example):
bash tools/dist_train.sh configs/resnet/resnet50_8xb32_in1k.py 8 --device npu
Or use the following command to train a model on a single NPU (taking ResNet as an example):
python tools/train.py configs/resnet/resnet50_8xb32_in1k.py --device npu
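If you also want to run inference through the Python API on an NPU, the snippet below is only a sketch: it assumes an NPU-enabled MMCV build with `torch_npu` available so that the `'npu'` device string is accepted, and the checkpoint path is a hypothetical example (e.g. one produced by the training command above).
# Sketch of single-image inference on an Ascend NPU. The 'npu' device string
# assumes an NPU-enabled MMCV build and torch_npu; the checkpoint path below
# is a hypothetical example.
from mmcls.apis import inference_model, init_model

config_file = 'configs/resnet/resnet50_8xb32_in1k.py'
checkpoint_file = 'work_dirs/resnet50_8xb32_in1k/latest.pth'  # hypothetical path

model = init_model(config_file, checkpoint_file, device='npu')
print(inference_model(model, 'demo/demo.JPEG'))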
Verified Models¶
[Table: verified models with their Top-1 (%) / Top-5 (%) accuracy, config files, and model / log download links.]
All of the above model weights and training logs are provided by the HUAWEI Ascend team.