备注

您正在阅读 MMClassification 0.x 版本的文档。MMClassification 0.x 会在 2022 年末被切换为次要分支。建议您升级到 MMClassification 1.0 版本,体验更多新特性和新功能。请查阅 MMClassification 1.0 的安装教程、迁移教程以及更新日志。

欢迎来到 MMClassification 中文教程!

您可以在页面左下角切换中英文文档。

依赖环境

在本节中,我们将演示如何准备 PyTorch 相关的依赖环境。

MMClassification 适用于 Linux、Windows 和 macOS。它需要 Python 3.6+、CUDA 9.2+ 和 PyTorch 1.5+。

备注

如果你对配置 PyTorch 环境已经很熟悉,并且已经完成了配置,可以直接进入下一节。 否则的话,请依照以下步骤完成配置。

第 1 步 从官网下载并安装 Miniconda。

第 2 步 创建一个 conda 虚拟环境并激活它。

conda create --name openmmlab python=3.8 -y
conda activate openmmlab

第 3 步 按照官方指南安装 PyTorch。例如:

在 GPU 平台:

conda install pytorch torchvision -c pytorch

警告

以上命令会自动安装最新版的 PyTorch 与对应的 cudatoolkit,请检查它们是否与你的环境匹配。

在 CPU 平台:

conda install pytorch torchvision cpuonly -c pytorch

安装

我们推荐用户按照我们的最佳实践来安装 MMClassification。但除此之外,如果你想根据 你的习惯完成安装流程,也可以参见自定义安装一节来获取更多信息。

最佳实践

第 1 步 使用 MIM 安装 MMCV

pip install -U openmim
mim install mmcv-full

第 2 步 安装 MMClassification

根据具体需求,我们支持两种安装模式:

  • 从源码安装(推荐):希望基于 MMClassification 框架开发自己的图像分类任务,需要添加新的功能,比如新的模型或是数据集,或者使用我们提供的各种工具。

  • 作为 Python 包安装:只是希望调用 MMClassification 的 API 接口,或者在自己的项目中导入 MMClassification 中的模块。

从源码安装

这种情况下,从源码按如下方式安装 mmcls:

git clone https://github.com/open-mmlab/mmclassification.git
cd mmclassification
pip install -v -e .
# "-v" 表示输出更多安装相关的信息
# "-e" 表示以可编辑形式安装,这样可以在不重新安装的情况下,让本地修改直接生效

另外,如果你希望向 MMClassification 贡献代码,或者使用试验中的功能,请签出到 dev 分支。

git checkout dev

作为 Python 包安装

直接使用 pip 安装即可。

pip install mmcls

验证安装

为了验证 MMClassification 的安装是否正确,我们提供了一些示例代码来执行模型推理。

第 1 步 我们需要下载配置文件和模型权重文件

mim download mmcls --config resnet50_8xb32_in1k --dest .

第 2 步 验证示例的推理流程

如果你是从源码安装的 mmcls,那么直接运行以下命令进行验证:

python demo/image_demo.py demo/demo.JPEG resnet50_8xb32_in1k.py resnet50_8xb32_in1k_20210831-ea4938fc.pth --device cpu

你可以看到命令行中输出了结果字典,包括 pred_label、pred_score 和 pred_class 三个字段。另外,如果你拥有图形界面(而不是使用远程终端),那么可以启用 --show 选项,将示例图像和对应的预测结果在窗口中进行显示。

如果你是作为 Python 包安装的,那么可以打开你的 Python 解释器,并粘贴如下代码:

from mmcls.apis import init_model, inference_model

config_file = 'resnet50_8xb32_in1k.py'
checkpoint_file = 'resnet50_8xb32_in1k_20210831-ea4938fc.pth'
model = init_model(config_file, checkpoint_file, device='cpu')  # 或者 device='cuda:0'
inference_model(model, 'demo/demo.JPEG')

你会看到输出一个字典,包含预测的标签、得分及类别名。
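
输出字典的形式大致如下(其中的数值与类别名仅为示意,实际结果取决于所用模型与输入图片):

result = inference_model(model, 'demo/demo.JPEG')
print(result)
# 形如:{'pred_label': 65, 'pred_score': 0.66, 'pred_class': 'sea snake'}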

自定义安装

CUDA 版本

安装 PyTorch 时,需要指定 CUDA 版本。如果您不清楚选择哪个,请遵循我们的建议:

  • 对于 Ampere 架构的 NVIDIA GPU,例如 GeForce 30 series 以及 NVIDIA A100,CUDA 11 是必需的。

  • 对于更早的 NVIDIA GPU,CUDA 11 是向前兼容的,但 CUDA 10.2 能够提供更好的兼容性,也更加轻量。

请确保你的 GPU 驱动版本满足最低的版本需求,参阅这张表

备注

如果按照我们的最佳实践进行安装,CUDA 运行时库就足够了,因为我们提供相关 CUDA 代码的预编译,你不需要进行本地编译。 但如果你希望从源码进行 MMCV 的编译,或是进行其他 CUDA 算子的开发,那么就必须安装完整的 CUDA 工具链,参见 NVIDIA 官网,另外还需要确保该 CUDA 工具链的版本与 PyTorch 安装时 的配置相匹配(如用 conda install 安装 PyTorch 时指定的 cudatoolkit 版本)。

不使用 MIM 安装 MMCV

MMCV 包含 C++ 和 CUDA 扩展,因此其对 PyTorch 的依赖比较复杂。MIM 会自动解析这些 依赖,选择合适的 MMCV 预编译包,使安装更简单,但它并不是必需的。

要使用 pip 而不是 MIM 来安装 MMCV,请遵照 MMCV 安装指南。 它需要你用指定 url 的形式手动指定对应的 PyTorch 和 CUDA 版本。

举个例子,如下命令将会安装基于 PyTorch 1.10.x 和 CUDA 11.3 编译的 mmcv-full。

pip install mmcv-full -f https://download.openmmlab.com/mmcv/dist/cu113/torch1.10/index.html

在 CPU 环境中安装

MMClassification 可以仅在 CPU 环境中安装,在 CPU 模式下,你可以完成训练(需要 MMCV 版本 >= 1.4.4)、测试和模型推理等所有操作。

在 CPU 模式下,MMCV 的部分功能将不可用,通常是一些 GPU 编译的算子。不过不用担心, MMClassification 中几乎所有的模型都不会依赖这些算子。

在 Google Colab 中安装

Google Colab 通常已经包含了 PyTorch 环境,因此我们只需要安装 MMCV 和 MMClassification 即可,命令如下:

第 1 步 使用 MIM 安装 MMCV

!pip3 install openmim
!mim install mmcv-full

第 2 步 从源码安装 MMClassification

!git clone https://github.com/open-mmlab/mmclassification.git
%cd mmclassification
!pip install -e .

第 3 步 验证

import mmcls
print(mmcls.__version__)
# 预期输出: 0.23.0 或更新的版本号

备注

在 Jupyter 中,感叹号 ! 用于执行外部命令,而 %cd 是一个魔术命令,用于切换 Python 的工作路径。

通过 Docker 使用 MMClassification

MMClassification 提供 Dockerfile 用于构建镜像。请确保你的 Docker 版本 >=19.03。

# 构建默认的 PyTorch 1.8.1,CUDA 10.2 版本镜像
# 如果你希望使用其他版本,请修改 Dockerfile
docker build -t mmclassification docker/

用以下命令运行 Docker 镜像:

docker run --gpus all --shm-size=8g -it -v {DATA_DIR}:/mmclassification/data mmclassification

故障解决

如果你在安装过程中遇到了什么问题,请先查阅常见问题。如果没有找到解决方法,可以在 GitHub 上提出 issue

基础教程

本文档提供 MMClassification 相关用法的基本教程。

准备数据集

MMClassification 建议用户将数据集根目录链接到 $MMCLASSIFICATION/data 下。 如果用户的文件夹结构与默认结构不同,则需要在配置文件中进行对应路径的修改。

mmclassification
├── mmcls
├── tools
├── configs
├── docs
├── data
│   ├── imagenet
│   │   ├── meta
│   │   ├── train
│   │   ├── val
│   ├── cifar
│   │   ├── cifar-10-batches-py
│   ├── mnist
│   │   ├── train-images-idx3-ubyte
│   │   ├── train-labels-idx1-ubyte
│   │   ├── t10k-images-idx3-ubyte
│   │   ├── t10k-labels-idx1-ubyte

对于 ImageNet,其存在多个版本,但最为常用的一个是 ILSVRC 2012,可以通过以下步骤获取该数据集。

  1. 注册账号并登录 下载页面

  2. 获取 ILSVRC2012 下载链接并下载以下文件

    • ILSVRC2012_img_train.tar (~138GB)

    • ILSVRC2012_img_val.tar (~6.3GB)

  3. 解压下载的文件

  4. 使用 该脚本 获取元数据

对于 MNIST,CIFAR10 和 CIFAR100,程序将会在需要的时候自动下载数据集。
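
以 CIFAR10 为例,数据配置大致如下(以下字段取值仅作示意,实际请参考 configs/_base_/datasets/ 目录下的相应配置文件);当 data_prefix 指向的目录中没有数据时,将在首次使用时自动下载:

# CIFAR10 数据配置的简化示意
# train_pipeline / test_pipeline 为该配置文件中定义的数据流水线
dataset_type = 'CIFAR10'
data = dict(
    samples_per_gpu=16,
    workers_per_gpu=2,
    train=dict(type=dataset_type, data_prefix='data/cifar10', pipeline=train_pipeline),
    val=dict(type=dataset_type, data_prefix='data/cifar10', pipeline=test_pipeline, test_mode=True),
    test=dict(type=dataset_type, data_prefix='data/cifar10', pipeline=test_pipeline, test_mode=True))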

对于用户自定义数据集的准备,请参阅 教程 3:如何自定义数据集

使用预训练模型进行推理

MMClassification 提供了一些脚本用于进行单张图像的推理、数据集的推理和数据集的测试(如 ImageNet 等)

单张图像的推理

python demo/image_demo.py ${IMAGE_FILE} ${CONFIG_FILE} ${CHECKPOINT_FILE}

# Example
python demo/image_demo.py demo/demo.JPEG configs/resnet/resnet50_8xb32_in1k.py \
  https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb32_in1k_20210831-ea4938fc.pth

数据集的推理与测试

  • 支持单 GPU

  • 支持 CPU

  • 支持单节点多 GPU

  • 支持多节点

用户可使用以下命令进行数据集的推理:

# 单 GPU
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [--metrics ${METRICS}] [--out ${RESULT_FILE}]

# CPU: 禁用 GPU 并运行单 GPU 测试脚本
export CUDA_VISIBLE_DEVICES=-1
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [--metrics ${METRICS}] [--out ${RESULT_FILE}]

# 多 GPU
./tools/dist_test.sh ${CONFIG_FILE} ${CHECKPOINT_FILE} ${GPU_NUM} [--metrics ${METRICS}] [--out ${RESULT_FILE}]

# 基于 slurm 分布式环境的多节点
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [--metrics ${METRICS}] [--out ${RESULT_FILE}] --launcher slurm

可选参数:

  • RESULT_FILE:输出结果的文件名。如果未指定,结果将不会保存到文件中。支持 json, yaml, pickle 格式。

  • METRICS:数据集测试指标,如准确率 (accuracy), 精确率 (precision), 召回率 (recall) 等

例子:

在 CIFAR10 验证集上,使用 ResNet-50 进行推理并获得预测标签及其对应的预测得分。

python tools/test.py configs/resnet/resnet50_8xb16_cifar10.py \
  https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_b16x8_cifar10_20210528-f54bfad9.pth \
  --out result.pkl

模型训练

MMClassification 使用 MMDistributedDataParallel 进行分布式训练,使用 MMDataParallel 进行非分布式训练。

所有的输出(日志文件和模型权重文件)会被将保存到工作目录下。工作目录通过配置文件中的参数 work_dir 指定。

默认情况下,MMClassification 在每个周期后会在验证集上评估模型,可以通过在训练配置中修改 interval 参数来更改评估间隔

evaluation = dict(interval=12)  # 每进行 12 轮训练后评估一次模型

使用单个 GPU 进行训练

python tools/train.py ${CONFIG_FILE} [optional arguments]

如果用户想在命令中指定工作目录,则需要增加参数 --work-dir ${YOUR_WORK_DIR}

使用 CPU 训练

使用 CPU 训练的流程和使用单 GPU 训练的流程一致,我们仅需要在训练流程开始前禁用 GPU。

export CUDA_VISIBLE_DEVICES=-1

之后运行单 GPU 训练脚本即可。

警告

我们不推荐用户使用 CPU 进行训练,这太过缓慢。我们支持这个功能是为了方便用户在没有 GPU 的机器上进行调试。

使用单台机器多个 GPU 进行训练

./tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} [optional arguments]

可选参数为:

  • --no-validate (不建议): 默认情况下,程序将会在训练期间的每 k (默认为 1) 个周期进行一次验证。要禁用这一功能,使用 --no-validate

  • --work-dir ${WORK_DIR}:覆盖配置文件中指定的工作目录。

  • --resume-from ${CHECKPOINT_FILE}:从以前的模型权重文件恢复训练。

resume-from 和 load-from 的不同点:resume-from 加载模型参数和优化器状态,并且保留检查点所在的周期数,常被用于恢复意外被中断的训练;load-from 只加载模型参数,但周期数从 0 开始计数,常被用于微调模型。
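
下面用一个配置片段示意两者在配置文件中的用法(其中的权重路径均为假设的示例):

# 恢复被意外中断的训练:同时加载模型参数和优化器状态,并从检查点记录的轮次继续
resume_from = 'work_dirs/resnet50_8xb32_in1k/latest.pth'  # 假设的检查点路径
load_from = None

# 微调模型:只加载模型参数,轮次从 0 开始计数
# resume_from = None
# load_from = 'checkpoints/resnet50_8xb32_in1k_20210831-ea4938fc.pth'  # 假设的权重路径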

使用多台机器进行训练

如果您想使用由 ethernet 连接起来的多台机器, 您可以使用以下命令:

在第一台机器上:

NNODES=2 NODE_RANK=0 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR sh tools/dist_train.sh $CONFIG $GPUS

在第二台机器上:

NNODES=2 NODE_RANK=1 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR sh tools/dist_train.sh $CONFIG $GPUS

但是,如果您不使用高速网络连接这几台机器的话,训练将会非常慢。

如果用户在 slurm 集群上运行 MMClassification,可使用 slurm_train.sh 脚本。(该脚本也支持单台机器上进行训练)

[GPUS=${GPUS}] ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${WORK_DIR}

用户可以在 slurm_train.sh 中检查所有的参数和环境变量

如果用户的多台机器通过 Ethernet 连接,则可以参考 pytorch launch utility。如果用户没有高速网络,如 InfiniBand,速度将会非常慢。

使用单台机器启动多个任务

如果用户使用单台机器启动多个任务,如在有 8 块 GPU 的单台机器上启动 2 个需要 4 块 GPU 的训练任务,则需要为每个任务指定不同端口,以避免通信冲突。

如果用户使用 dist_train.sh 脚本启动训练任务,则可以通过以下命令指定端口

CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 ./tools/dist_train.sh ${CONFIG_FILE} 4
CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 ./tools/dist_train.sh ${CONFIG_FILE} 4

如果用户在 slurm 集群下启动多个训练任务,则需要修改配置文件中的 dist_params 变量,以设置不同的通信端口。

config1.py 中,

dist_params = dict(backend='nccl', port=29500)

config2.py 中,

dist_params = dict(backend='nccl', port=29501)

之后便可启动两个任务,分别对应 config1.py 和 config2.py。

CUDA_VISIBLE_DEVICES=0,1,2,3 GPUS=4 ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config1.py ${WORK_DIR}
CUDA_VISIBLE_DEVICES=4,5,6,7 GPUS=4 ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config2.py ${WORK_DIR}

实用工具

我们在 tools/ 目录下提供了一些对训练和测试十分有用的工具。

计算 FLOPs 和参数量(试验性的)

我们根据 flops-counter.pytorch 提供了一个脚本用于计算给定模型的 FLOPs 和参数量

python tools/analysis_tools/get_flops.py ${CONFIG_FILE} [--shape ${INPUT_SHAPE}]

用户将获得如下结果:

==============================
Input shape: (3, 224, 224)
Flops: 4.12 GFLOPs
Params: 25.56 M
==============================

警告

此工具仍处于试验阶段,我们不保证该数字正确无误。您最好将结果用于简单比较,但在技术报告或论文中采用该结果之前,请仔细检查。

  • FLOPs 与输入的尺寸有关,而参数量与输入尺寸无关。默认输入尺寸为 (1, 3, 224, 224)

  • 一些运算不会被计入 FLOPs 的统计中,例如 GN 和自定义运算。详细信息请参考 mmcv.cnn.get_model_complexity_info()

模型发布

在发布模型之前,你也许会需要

  1. 转换模型权重至 CPU 张量

  2. 删除优化器状态

  3. 计算模型权重文件的哈希值,并添加至文件名之后

python tools/convert_models/publish_model.py ${INPUT_FILENAME} ${OUTPUT_FILENAME}

例如:

python tools/convert_models/publish_model.py work_dirs/resnet50/latest.pth imagenet_resnet50.pth

最终输出的文件名将会是 imagenet_resnet50_{date}-{hash id}.pth

详细教程

目前,MMClassification 提供以下几种更详细的教程:

教程 1:如何编写配置文件

MMClassification 主要使用 python 文件作为配置文件。其配置文件系统的设计将模块化与继承整合进来,方便用户进行各种实验。所有配置文件都放置在 configs 文件夹下,主要包含 _base_ 原始配置文件夹,以及 resnet、swin_transformer、vision_transformer 等诸多算法文件夹。

可以使用 python tools/misc/print_config.py /PATH/TO/CONFIG 命令来查看完整的配置信息,从而方便检查所对应的配置文件。

配置文件以及权重命名规则

MMClassification 按照以下风格进行配置文件命名,代码库的贡献者需要遵循相同的命名规则。文件名总体分为四部分:算法信息,模块信息,训练信息和数据信息。逻辑上属于不同部分的单词之间用下划线 '_' 连接,同一部分有多个单词用短横线 '-' 连接。

{algorithm info}_{module info}_{training info}_{data info}.py
  • algorithm info:算法信息,算法名称或者网络架构,如 resnet 等;

  • module info: 模块信息,因任务而异,用以表示一些特殊的 neck、head 和 pretrain 信息;

  • training info:一些训练信息,即训练策略设置,包括 batch size、schedule、数据增强等;

  • data info:数据信息,数据集名称、模态、输入尺寸等,如 imagenet, cifar 等;

算法信息

指论文中的算法名称缩写,以及相应的分支架构信息。例如:

  • resnet50

  • mobilenet-v3-large

  • vit-small-patch32 : patch32 表示 ViT 切分的分块大小

  • seresnext101-32x4d : SeResNet101 基本网络结构,32x4d 表示在 Bottleneck 中 groups 和 width_per_group 分别为 32 和 4

模块信息

指一些特殊的 neck、head 或者 pretrain 的信息,在分类中常见为预训练信息,比如:

  • in21k-pre : 在 ImageNet21k 上预训练

  • in21k-pre-3rd-party : 在 ImageNet21k 上预训练,其权重来自其他仓库

训练信息

训练策略的一些设置,包括训练类型、batch size、lr schedule、数据增强以及特殊的损失函数等等,比如:

Batch size 信息:

  • 格式为{gpu x batch_per_gpu}, 如 8xb32

训练类型(主要见于 transformer 网络,如 ViT 算法,这类算法通常分为预训练和微调两种模式):

  • ft : Finetune config,用于微调的配置文件

  • pt : Pretrain config,用于预训练的配置文件

训练策略信息,训练策略以复现配置文件为基础,此基础不必标注训练策略。但如果在此基础上进行改进,则需注明训练策略,按照应用点位顺序排列,如:{pipeline aug}-{train aug}-{loss trick}-{scheduler}-{epochs}

  • coslr-200e : 使用 cosine scheduler, 训练 200 个 epoch

  • autoaug-mixup-lbs-coslr-50e : 使用了 autoaug、mixup、label smooth、cosine scheduler,训练了 50 个轮次

数据信息

  • in1k : ImageNet1k 数据集,默认使用 224x224 大小的图片

  • in21k : ImageNet21k 数据集,有些地方也称为 ImageNet22k 数据集,默认使用 224x224 大小的图片

  • in1k-384px : 表示训练的输出图片大小为 384x384

  • cifar100

配置文件命名案例:

repvgg-D2se_deploy_4xb64-autoaug-lbs-mixup-coslr-200e_in1k.py
  • repvgg-D2se: 算法信息

    • repvgg: 主要算法名称。

    • D2se: 模型的结构。

  • deploy:模块信息,该模型为推理状态。

  • 4xb64-autoaug-lbs-mixup-coslr-200e: 训练信息

    • 4xb64: 使用4块 GPU 并且 每块 GPU 的批大小为64。

    • autoaug: 使用 AutoAugment 数据增强方法。

    • lbs: 使用 label smoothing 损失函数。

    • mixup: 使用 mixup 训练增强方法。

    • coslr: 使用 cosine scheduler 优化策略。

    • 200e: 训练 200 轮次。

  • in1k: 数据信息。 配置文件用于 ImageNet1k 数据集上使用 224x224 大小图片训练。

备注

部分配置文件目前还没有遵循此命名规范,相关文件命名近期会更新。

权重命名规则

权重的命名主要包括配置文件名,日期和哈希值。

{config_name}_{date}-{hash}.pth

配置文件结构

configs/_base_ 文件夹下有 4 个基本组件类型,分别是:模型(models)、数据集(datasets)、训练策略(schedules)和运行设置(default_runtime)。

你可以通过继承一些基本配置文件轻松构建自己的训练配置文件。由来自 _base_ 的组件组成的配置被称为 primitive。

为了帮助用户对 MMClassification 分类系统中的完整配置和模块有一个基本的了解,我们使用 ResNet50 原始配置文件作为案例进行说明并注释每一行含义。更详细的用法和各个模块对应的替代方案,请参考 API 文档。

_base_ = [
    '../_base_/models/resnet50.py',           # 模型
    '../_base_/datasets/imagenet_bs32.py',    # 数据
    '../_base_/schedules/imagenet_bs256.py',  # 训练策略
    '../_base_/default_runtime.py'            # 默认运行设置
]

下面对这四个部分分别进行说明,仍然以上述 ResNet50 原始配置文件作为案例。

模型

模型参数 model 在配置文件中为一个 python 字典,主要包括网络结构、损失函数等信息:

  • type : 分类器名称, 目前 MMClassification 只支持 ImageClassifier, 参考 API 文档

  • backbone : 主干网类型,可用选项参考 API 文档

  • neck : 颈网络类型,目前 MMClassification 只支持 GlobalAveragePooling, 参考 API 文档

  • head : 头网络类型, 包括单标签分类与多标签分类头网络,可用选项参考 API 文档

  • train_cfg :训练配置, 支持 mixup, cutmix 等训练增强。

备注

配置文件中的 ‘type’ 不是构造时的参数,而是类名。

model = dict(
    type='ImageClassifier',     # 分类器类型
    backbone=dict(
        type='ResNet',          # 主干网络类型
        depth=50,               # 主干网网络深度, ResNet 一般有18, 34, 50, 101, 152 可以选择
        num_stages=4,           # 主干网络状态(stages)的数目,这些状态产生的特征图作为后续的 head 的输入。
        out_indices=(3, ),      # 输出的特征图输出索引。越远离输入图像,索引越大
        frozen_stages=-1,       # 网络微调时,冻结网络的stage(训练时不执行反相传播算法),若num_stages=4,backbone包含stem 与 4 个 stages。frozen_stages为-1时,不冻结网络; 为0时,冻结 stem; 为1时,冻结 stem 和 stage1; 为4时,冻结整个backbone
        style='pytorch'),       # 主干网络的风格,'pytorch' 意思是步长为2的层为 3x3 卷积, 'caffe' 意思是步长为2的层为 1x1 卷积。
    neck=dict(type='GlobalAveragePooling'),    # 颈网络类型
    head=dict(
        type='LinearClsHead',     # 线性分类头,
        num_classes=1000,         # 输出类别数,这与数据集的类别数一致
        in_channels=2048,         # 输入通道数,这与 neck 的输出通道一致
        loss=dict(type='CrossEntropyLoss', loss_weight=1.0), # 损失函数配置信息
        topk=(1, 5),              # 评估指标,Top-k 准确率, 这里为 top1 与 top5 准确率
    ))

数据

数据参数 data 在配置文件中为一个 python 字典,主要包含构造数据集加载器(dataloader)配置信息:

  • samples_per_gpu : 构建 dataloader 时,每个 GPU 的 Batch Size

  • workers_per_gpu : 构建 dataloader 时,每个 GPU 的 线程数

  • train val test : 构造数据集

    • type : 数据集类型,MMClassification 支持 ImageNet、CIFAR 等,参考 API 文档

    • data_prefix : 数据集根目录

    • pipeline : 数据处理流水线,参考相关教程文档 如何设计数据处理流水线

评估参数 evaluation 也是一个字典, 为 evaluation hook 的配置信息, 主要包括评估间隔、评估指标等。

# dataset settings
dataset_type = 'ImageNet'  # 数据集名称,
img_norm_cfg = dict(       #图像归一化配置,用来归一化输入的图像。
    mean=[123.675, 116.28, 103.53],  # 预训练里用于预训练主干网络模型的平均值。
    std=[58.395, 57.12, 57.375],     # 预训练里用于预训练主干网络模型的标准差。
    to_rgb=True)                     # 是否反转通道,使用 cv2, mmcv 读取图片默认为 BGR 通道顺序,这里 Normalize 均值方差数组的数值是以 RGB 通道顺序, 因此需要反转通道顺序。
# 训练数据流水线
train_pipeline = [
    dict(type='LoadImageFromFile'),                # 读取图片
    dict(type='RandomResizedCrop', size=224),      # 随机缩放抠图
    dict(type='RandomFlip', flip_prob=0.5, direction='horizontal'),  # 以概率为0.5随机水平翻转图片
    dict(type='Normalize', **img_norm_cfg),        # 归一化
    dict(type='ImageToTensor', keys=['img']),      # image 转为 torch.Tensor
    dict(type='ToTensor', keys=['gt_label']),      # gt_label 转为 torch.Tensor
    dict(type='Collect', keys=['img', 'gt_label']) # 决定数据中哪些键应该传递给检测器的流程
]
# 测试数据流水线
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='Resize', size=(256, -1)),
    dict(type='CenterCrop', crop_size=224),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='ImageToTensor', keys=['img']),
    dict(type='Collect', keys=['img'])             # test 时不传递 gt_label
]
data = dict(
    samples_per_gpu=32,    # 单个 GPU 的 Batch size
    workers_per_gpu=2,     # 单个 GPU 的 线程数
    train=dict(            # 训练数据信息
        type=dataset_type,                  # 数据集名称
        data_prefix='data/imagenet/train',  # 数据集目录,当不存在 ann_file 时,类别信息从文件夹自动获取
        pipeline=train_pipeline),           # 数据集需要经过的 数据流水线
    val=dict(              # 验证数据集信息
        type=dataset_type,
        data_prefix='data/imagenet/val',
        ann_file='data/imagenet/meta/val.txt',   # 标注文件路径,存在 ann_file 时,不通过文件夹自动获取类别信息
        pipeline=test_pipeline),
    test=dict(             # 测试数据集信息
        type=dataset_type,
        data_prefix='data/imagenet/val',
        ann_file='data/imagenet/meta/val.txt',
        pipeline=test_pipeline))
evaluation = dict(       # evaluation hook 的配置
    interval=1,          # 验证期间的间隔,单位为 epoch 或者 iter, 取决于 runner 类型。
    metric='accuracy')   # 验证期间使用的指标。

训练策略

主要包含优化器设置、optimizer hook 设置、学习率策略和 runner 设置:

  • optimizer : 优化器设置信息, 支持 pytorch 所有的优化器,参考相关 mmcv 文档

  • optimizer_config : optimizer hook 的配置文件,如设置梯度限制,参考相关 mmcv 代码

  • lr_config : 学习率策略,支持 “CosineAnnealing”、 “Step”、 “Cyclic” 等等,参考相关 mmcv 文档

  • runner : 有关 runner 可以参考 mmcv 对于 runner 介绍文档

# 用于构建优化器的配置文件。支持 PyTorch 中的所有优化器,同时它们的参数与 PyTorch 里的优化器参数一致。
optimizer = dict(type='SGD',         # 优化器类型
                lr=0.1,              # 优化器的学习率,参数的使用细节请参照对应的 PyTorch 文档。
                momentum=0.9,        # 动量(Momentum)
                weight_decay=0.0001) # 权重衰减系数(weight decay)。
 # optimizer hook 的配置文件
optimizer_config = dict(grad_clip=None)  # 大多数方法不使用梯度限制(grad_clip)。
# 学习率调整配置,用于注册 LrUpdater hook。
lr_config = dict(policy='step',          # 调度流程(scheduler)的策略,也支持 CosineAnnealing, Cyclic, 等。
                 step=[30, 60, 90])      # 在 epoch 为 30, 60, 90 时, lr 进行衰减
runner = dict(type='EpochBasedRunner',   # 将使用的 runner 的类别,如 IterBasedRunner 或 EpochBasedRunner。
            max_epochs=100)              # runner 总回合数, 对于 IterBasedRunner 使用 `max_iters`

运行设置

本部分主要包括保存权重策略、日志配置、训练参数、断点权重路径和工作目录等等。

# Checkpoint hook 的配置文件。
checkpoint_config = dict(interval=1)   # 保存的间隔是 1,单位会根据 runner 不同变动,可以为 epoch 或者 iter。
# 日志配置信息。
log_config = dict(
    interval=100,                      # 打印日志的间隔, 单位 iters
    hooks=[
        dict(type='TextLoggerHook'),          # 用于记录训练过程的文本记录器(logger)。
        # dict(type='TensorboardLoggerHook')  # 同样支持 Tensorboard 日志
    ])

dist_params = dict(backend='nccl')   # 用于设置分布式训练的参数,端口也同样可被设置。
log_level = 'INFO'             # 日志的输出级别。
resume_from = None             # 从给定路径里恢复检查点(checkpoints),训练模式将从检查点保存的轮次开始恢复训练。
workflow = [('train', 1)]      # runner 的工作流程,[('train', 1)] 表示只有一个工作流且工作流仅执行一次。
work_dir = 'work_dir'          # 用于保存当前实验的模型检查点和日志的目录文件地址。

继承并修改配置文件

为了精简代码、更快的修改配置文件以及便于理解,我们建议继承现有方法。

对于在同一算法文件夹下的所有配置文件,MMClassification 推荐只存在 一个 对应的 原始配置 文件。 所有其他的配置文件都应该继承 原始配置 文件,这样就能保证配置文件的最大继承深度为 3。

例如,如果在 ResNet 的基础上做了一些修改,用户首先可以通过指定 _base_ = './resnet50_8xb32_in1k.py'(相对于你的配置文件的路径),来继承基础的 ResNet 结构、数据集以及其他训练配置信息,然后修改配置文件中的必要参数以完成继承。如想在基础 resnet50 的基础上将训练轮数由 100 改为 300 和修改学习率衰减轮数,同时修改数据集路径,可以建立新的配置文件 configs/resnet/resnet50_8xb32-300e_in1k.py, 文件中写入以下内容:

_base_ = './resnet50_8xb32_in1k.py'

runner = dict(max_epochs=300)
lr_config = dict(step=[150, 200, 250])

data = dict(
    train=dict(data_prefix='mydata/imagenet/train'),
    val=dict(data_prefix='mydata/imagenet/train', ),
    test=dict(data_prefix='mydata/imagenet/train', )
)

使用配置文件里的中间变量

配置文件中会使用一些中间变量,这些中间变量让配置文件更加清晰,也更容易修改。

例如数据集里的 train_pipeline / test_pipeline 是作为数据流水线的中间变量。我们首先要定义 train_pipeline / test_pipeline,然后将它们传递到 data 中。如果想修改训练或测试时输入图片的大小,就需要修改 train_pipeline / test_pipeline 这些中间变量。

img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='RandomResizedCrop', size=384, backend='pillow',),
    dict(type='RandomFlip', flip_prob=0.5, direction='horizontal'),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='ImageToTensor', keys=['img']),
    dict(type='ToTensor', keys=['gt_label']),
    dict(type='Collect', keys=['img', 'gt_label'])
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='Resize', size=384, backend='pillow'),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='ImageToTensor', keys=['img']),
    dict(type='Collect', keys=['img'])
]
data = dict(
    train=dict(pipeline=train_pipeline),
    val=dict(pipeline=test_pipeline),
    test=dict(pipeline=test_pipeline))

忽略基础配置文件里的部分内容

有时,您需要设置 _delete_=True 去忽略基础配置文件里的一些域内容。 可以参照 mmcv 来获得一些简单的指导。

以下是一个简单应用案例。如果在上述 ResNet50 案例中使用 cosine schedule,使用继承并直接修改会报 get unexpected keyword 'step' 错,因为基础配置文件 lr_config 域信息的 'step' 字段被保留下来了,需要加入 _delete_=True 去忽略基础配置文件里的 lr_config 相关域内容:

_base_ = '../../configs/resnet/resnet50_8xb32_in1k.py'

lr_config = dict(
    _delete_=True,
    policy='CosineAnnealing',
    min_lr=0,
    warmup='linear',
    by_epoch=True,
    warmup_iters=5,
    warmup_ratio=0.1
)

引用基础配置文件里的变量

有时,您可以引用 _base_ 配置信息的一些域内容,这样可以避免重复定义。 可以参照 mmcv 来获得一些简单的指导。

以下是一个简单应用案例,在训练数据预处理流水线中使用 auto augment 数据增强,参考配置文件 configs/_base_/datasets/imagenet_bs64_autoaug.py。 在定义 train_pipeline 时,可以直接在 _base_ 中加入定义 auto augment 数据增强的文件命名,再通过 {{_base_.auto_increasing_policies}} 引用变量:

_base_ = ['./pipelines/auto_aug.py']

# dataset settings
dataset_type = 'ImageNet'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='RandomResizedCrop', size=224),
    dict(type='RandomFlip', flip_prob=0.5, direction='horizontal'),
    dict(type='AutoAugment', policies={{_base_.auto_increasing_policies}}),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='ImageToTensor', keys=['img']),
    dict(type='ToTensor', keys=['gt_label']),
    dict(type='Collect', keys=['img', 'gt_label'])
]
test_pipeline = [...]
data = dict(
    samples_per_gpu=64,
    workers_per_gpu=2,
    train=dict(..., pipeline=train_pipeline),
    val=dict(..., pipeline=test_pipeline))
evaluation = dict(interval=1, metric='accuracy')

通过命令行参数修改配置信息

当用户使用脚本 “tools/train.py” 或者 “tools/test.py” 提交任务,以及使用一些工具脚本时,可以通过指定 --cfg-options 参数来直接修改所使用的配置文件内容。

  • 更新配置文件内的字典

    可以按照原始配置文件中字典的键的顺序指定配置选项。 例如,--cfg-options model.backbone.norm_eval=False 将主干网络中的所有 BN 模块更改为 train 模式。

  • 更新配置文件内列表的键

    一些配置字典在配置文件中会形成一个列表。例如,训练流水线 data.train.pipeline 通常是一个列表。 例如,[dict(type='LoadImageFromFile'), dict(type='TopDownRandomFlip', flip_prob=0.5), ...] 。如果要将流水线中的 'flip_prob=0.5' 更改为 'flip_prob=0.0',您可以这样指定 --cfg-options data.train.pipeline.1.flip_prob=0.0

  • 更新列表/元组的值。

    当配置文件中需要更新的是一个列表或者元组,例如,配置文件通常会设置 workflow=[('train', 1)],用户如果想更改, 需要指定 --cfg-options workflow="[(train,1),(val,1)]"。注意这里的引号 ” 对于列表以及元组数据类型的修改是必要的, 并且 不允许 引号内所指定的值的书写存在空格。

导入用户自定义模块

备注

本部分仅在将 MMClassification 当作库构建自己的项目时可能用到,初学者可跳过。

在学习完后续教程如何添加新数据集、如何设计数据处理流程、如何增加新模块后,您可能会使用 MMClassification 完成自己的项目,并在项目中自定义数据集、模型、数据增强等。为了精简代码,可以将 MMClassification 作为一个第三方库,只需要保留自己的额外的代码,并在配置文件中导入自定义的模块。案例可以参考 OpenMMLab 算法大赛项目。

只需要在你的配置文件中添加以下代码:

custom_imports = dict(
    imports=['your_dataset_class',
             'your_transforme_class',
             'your_model_class',
             'your_module_class'],
    allow_failed_imports=False)

常见问题

教程 2:如何微调模型

已经证明,在 ImageNet 数据集上预先训练的分类模型对于其他数据集和其他下游任务有很好的效果。

该教程介绍了如何将 Model Zoo 中提供的预训练模型用于其他数据集,以获得更好的效果。

在新数据集上微调模型分为两步:一是按照教程 3:如何自定义数据集,添加对新数据集的支持;二是按照本教程中讨论的内容修改配置文件。

假设我们现在有一个在 ImageNet-2012 数据集上训练好的 ResNet-50 模型,并且希望在 CIFAR-10 数据集上进行模型微调,我们需要修改配置文件中的五个部分。

继承基础配置

首先,创建一个新的配置文件 configs/tutorial/resnet50_finetune_cifar.py 来保存我们的配置,当然,这个文件名可以自由设定。

为了重用不同配置之间的通用部分,我们支持从多个现有配置中继承配置。要微调 ResNet-50 模型,新配置需要继承 _base_/models/resnet50.py 来搭建模型的基本结构。 为了使用 CIFAR10 数据集,新的配置文件可以直接继承 _base_/datasets/cifar10.py。 而为了保留运行相关设置,比如训练调整器,新的配置文件需要继承 _base_/default_runtime.py

要继承以上这些配置文件,只需要把下面一段代码放在我们的配置文件开头。

_base_ = [
    '../_base_/models/resnet50.py',
    '../_base_/datasets/cifar10.py', '../_base_/default_runtime.py'
]

除此之外,你也可以不使用继承,直接编写完整的配置文件,例如 configs/lenet/lenet5_mnist.py

修改模型

在进行模型微调时,我们通常希望在主干网络(backbone)加载预训练模型,再用我们的数据集训练一个新的分类头(head)。

为了在主干网络加载预训练模型,我们需要修改主干网络的初始化设置,使用 Pretrained 类型的初始化函数。另外,在初始化设置中,我们使用 prefix='backbone' 来告诉初始化函数移除权重文件中键值名称的前缀,比如把 backbone.conv1 变成 conv1。方便起见,我们这里使用一个在线的权重文件链接,它 会在训练前自动下载对应的文件,你也可以提前下载这个模型,然后使用本地路径。

接下来,新的配置文件需要按照新数据集的类别数目来修改分类头的配置。只需要修改分类头中的 num_classes 设置即可。

model = dict(
    backbone=dict(
        init_cfg=dict(
            type='Pretrained',
            checkpoint='https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb32_in1k_20210831-ea4938fc.pth',
            prefix='backbone',
        )),
    head=dict(num_classes=10),
)

小技巧

这里我们只需要设定我们想要修改的部分配置,其他配置将会自动从我们的父配置文件中获取。

另外,有时我们在进行微调时会希望冻结主干网络前面几层的参数,这么做有助于在后续训练中,保持网络从预训练权重中获得的提取低阶特征的能力。在 MMClassification 中,这一功能可以通过一个简单的 frozen_stages 参数来实现。比如我们需要冻结前两层网络的参数,只需要在上面的配置中添加一行:

model = dict(
    backbone=dict(
        frozen_stages=2,
        init_cfg=dict(
            type='Pretrained',
            checkpoint='https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb32_in1k_20210831-ea4938fc.pth',
            prefix='backbone',
        )),
    head=dict(num_classes=10),
)

备注

目前还不是所有的网络都支持 frozen_stages 参数,在使用之前,请先检查 文档 以确认你所使用的主干网络是否支持。

修改数据集

当针对一个新的数据集进行微调时,我们通常都需要修改一些数据集相关的配置。比如这里,我们就需要把 CIFAR-10 数据集中的图像大小从 32 缩放到 224,来配合 ImageNet 上预训练模型的输入。这一需求可以通过修改数据集的预处理流水线(pipeline)来实现。

img_norm_cfg = dict(
    mean=[125.307, 122.961, 113.8575],
    std=[51.5865, 50.847, 51.255],
    to_rgb=False,
)
train_pipeline = [
    dict(type='RandomCrop', size=32, padding=4),
    dict(type='RandomFlip', flip_prob=0.5, direction='horizontal'),
    dict(type='Resize', size=224),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='ImageToTensor', keys=['img']),
    dict(type='ToTensor', keys=['gt_label']),
    dict(type='Collect', keys=['img', 'gt_label']),
]
test_pipeline = [
    dict(type='Resize', size=224),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='ImageToTensor', keys=['img']),
    dict(type='Collect', keys=['img']),
]
data = dict(
    train=dict(pipeline=train_pipeline),
    val=dict(pipeline=test_pipeline),
    test=dict(pipeline=test_pipeline),
)

修改训练策略设置

用于微调任务的超参数与默认配置不同,通常只需要较小的学习率和较少的训练时间。

# 用于批大小为 128 的优化器学习率
optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=None)
# 学习率衰减策略
lr_config = dict(policy='step', step=[15])
runner = dict(type='EpochBasedRunner', max_epochs=200)
log_config = dict(interval=100)

开始训练

现在,我们完成了用于微调的配置文件,完整的文件如下:

_base_ = [
    '../_base_/models/resnet50.py',
    '../_base_/datasets/cifar10_bs16.py', '../_base_/default_runtime.py'
]

# 模型设置
model = dict(
    backbone=dict(
        frozen_stages=2,
        init_cfg=dict(
            type='Pretrained',
            checkpoint='https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb32_in1k_20210831-ea4938fc.pth',
            prefix='backbone',
        )),
    head=dict(num_classes=10),
)

# 数据集设置
img_norm_cfg = dict(
    mean=[125.307, 122.961, 113.8575],
    std=[51.5865, 50.847, 51.255],
    to_rgb=False,
)
train_pipeline = [
    dict(type='RandomCrop', size=32, padding=4),
    dict(type='RandomFlip', flip_prob=0.5, direction='horizontal'),
    dict(type='Resize', size=224),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='ImageToTensor', keys=['img']),
    dict(type='ToTensor', keys=['gt_label']),
    dict(type='Collect', keys=['img', 'gt_label']),
]
test_pipeline = [
    dict(type='Resize', size=224),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='ImageToTensor', keys=['img']),
    dict(type='Collect', keys=['img']),
]
data = dict(
    train=dict(pipeline=train_pipeline),
    val=dict(pipeline=test_pipeline),
    test=dict(pipeline=test_pipeline),
)

# 训练策略设置
# 用于批大小为 128 的优化器学习率
optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=None)
# 学习率衰减策略
lr_config = dict(policy='step', step=[15])
runner = dict(type='EpochBasedRunner', max_epochs=200)
log_config = dict(interval=100)

接下来,我们使用一台 8 张 GPU 的电脑来训练我们的模型,指令如下:

bash tools/dist_train.sh configs/tutorial/resnet50_finetune_cifar.py 8

当然,我们也可以使用单张 GPU 来进行训练,使用如下命令:

python tools/train.py configs/tutorial/resnet50_finetune_cifar.py

但是如果我们使用单张 GPU 进行训练的话,需要在数据集设置部分作如下修改:

data = dict(
    samples_per_gpu=128,
    train=dict(pipeline=train_pipeline),
    val=dict(pipeline=test_pipeline),
    test=dict(pipeline=test_pipeline),
)

这是因为我们的训练策略是针对批次大小(batch size)为 128 设置的。在父配置文件中,设置了 samples_per_gpu=16,如果使用 8 张 GPU,总的批次大小就是 128。而如果使用单张 GPU,就必须手动修改 samples_per_gpu=128 来匹配训练策略。

教程 3:如何自定义数据集

我们支持许多常用的图像分类领域公开数据集,你可以在 此页面中找到它们。

在本节中,我们将介绍如何使用自己的数据集以及如何使用数据集包装

使用自己的数据集

将数据集重新组织为已有格式

想要使用自己的数据集,最简单的方法就是将数据集转换为现有的数据集格式。

对于多分类任务,我们推荐使用 CustomDataset 格式。

CustomDataset 支持两种类型的数据格式:

  1. 提供一个标注文件,其中每一行表示一张样本图片。

    样本图片可以以任意的结构进行组织,比如:

    train/
    ├── folder_1
    │   ├── xxx.png
    │   ├── xxy.png
    │   └── ...
    ├── 123.png
    ├── nsdf3.png
    └── ...
    

    而标注文件则记录了所有样本图片的文件路径以及相应的类别序号。其中第一列表示图像 相对于主目录(本例中为 train 目录)的路径,第二列表示类别序号:

    folder_1/xxx.png 0
    folder_1/xxy.png 1
    123.png 1
    nsdf3.png 2
    ...
    

    备注

    类别序号的值应当属于 [0, num_classes - 1] 范围。

  2. 将所有样本文件按如下结构进行组织:

    train/
    ├── cat
    │   ├── xxx.png
    │   ├── xxy.png
    │   └── ...
    │       └── xxz.png
    ├── bird
    │   ├── bird1.png
    │   ├── bird2.png
    │   └── ...
    └── dog
        ├── 123.png
        ├── nsdf3.png
        ├── ...
        └── asd932_.png
    

    这种情况下,你不需要提供标注文件,所有位于 cat 目录下的图片文件都会被视为 cat 类别的样本。

通常而言,我们会将整个数据集分为三个子数据集:train、val 和 test,分别用于训练、验证和测试。每一个子数据集都需要被组织成如上的一种结构。

举个例子,完整的数据集结构如下所示(使用第一种组织结构):

mmclassification
└── data
    └── my_dataset
        ├── meta
        │   ├── train.txt
        │   ├── val.txt
        │   └── test.txt
        ├── train
        ├── val
        └── test

之后在你的配置文件中,可以修改其中的 data 字段为如下格式:

...
dataset_type = 'CustomDataset'
classes = ['cat', 'bird', 'dog']  # 数据集中各类别的名称

data = dict(
    train=dict(
        type=dataset_type,
        data_prefix='data/my_dataset/train',
        ann_file='data/my_dataset/meta/train.txt',
        classes=classes,
        pipeline=train_pipeline
    ),
    val=dict(
        type=dataset_type,
        data_prefix='data/my_dataset/val',
        ann_file='data/my_dataset/meta/val.txt',
        classes=classes,
        pipeline=test_pipeline
    ),
    test=dict(
        type=dataset_type,
        data_prefix='data/my_dataset/test',
        ann_file='data/my_dataset/meta/test.txt',
        classes=classes,
        pipeline=test_pipeline
    )
)
...

创建一个新的数据集类

用户可以编写一个继承自 BaseDataset 的新数据集类,并重载 load_annotations(self) 方法,类似 CIFAR10 和 ImageNet。

通常,此方法返回一个包含所有样本的列表,其中的每个样本都是一个字典。字典中包含了必要的数据信息,例如 img 和 gt_label。

假设我们将要实现一个 Filelist 数据集,该数据集将使用文件列表进行训练和测试。注释列表的格式如下:

000001.jpg 0
000002.jpg 1

我们可以在 mmcls/datasets/filelist.py 中创建一个新的数据集类以加载数据。

import mmcv
import numpy as np

from .builder import DATASETS
from .base_dataset import BaseDataset


@DATASETS.register_module()
class Filelist(BaseDataset):

    def load_annotations(self):
        assert isinstance(self.ann_file, str)

        data_infos = []
        with open(self.ann_file) as f:
            samples = [x.strip().split(' ') for x in f.readlines()]
            for filename, gt_label in samples:
                info = {'img_prefix': self.data_prefix}
                info['img_info'] = {'filename': filename}
                info['gt_label'] = np.array(gt_label, dtype=np.int64)
                data_infos.append(info)
            return data_infos

将新的数据集类加入到 mmcls/datasets/__init__.py 中:

from .base_dataset import BaseDataset
...
from .filelist import Filelist

__all__ = [
    'BaseDataset', ... ,'Filelist'
]

然后在配置文件中,为了使用 Filelist,用户可以按以下方式修改配置

train = dict(
    type='Filelist',
    ann_file = 'image_list.txt',
    pipeline=train_pipeline
)

使用数据集包装

数据集包装是一种可以改变数据集类行为的类,比如将数据集中的样本进行重复,或是将不同类别的数据进行再平衡。

重复数据集

我们使用 RepeatDataset 作为一个重复数据集的封装。举个例子,假设原始数据集是 Dataset_A,为了重复它,我们需要如下的配置文件:

data = dict(
    train=dict(
        type='RepeatDataset',
        times=N,
        dataset=dict(  # 这里是 Dataset_A 的原始配置
            type='Dataset_A',
            ...
            pipeline=train_pipeline
        )
    )
    ...
)

类别平衡数据集

我们使用 ClassBalancedDataset 作为根据类别频率对数据集进行重复采样的封装类。进行重复采样的数据集需要实现函数 self.get_cat_ids(idx) 以支持 ClassBalancedDataset
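
在自定义数据集类(例如上文的 Filelist)中,该函数的一个最简实现大致如下(这是一个示意性的草稿,假设每个样本只有一个类别标签,且沿用上文 data_infos 的组织方式):

    def get_cat_ids(self, idx):
        """返回第 idx 个样本所属类别序号组成的列表(单标签情形下长度为 1)。"""
        return [int(self.data_infos[idx]['gt_label'])]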

举个例子,按照 oversample_thr=1e-3Dataset_A 进行重复采样,需要如下的配置文件:

data = dict(
    train = dict(
        type='ClassBalancedDataset',
        oversample_thr=1e-3,
        dataset=dict(  # 这里是 Dataset_A 的原始配置
            type='Dataset_A',
            ...
            pipeline=train_pipeline
        )
    )
    ...
)

更加具体的细节,请参考 API 文档

教程 4:如何设计数据处理流程

设计数据流水线

按照典型的用法,我们通过 Dataset 和 DataLoader 来使用多个 worker 进行数据加载。对 Dataset 的索引操作将返回一个与模型的 forward 方法的参数相对应的字典。

数据流水线和数据集在这里是解耦的。通常,数据集定义如何处理标注文件,而数据流水线定义所有准备数据字典的步骤。流水线由一系列操作组成。每个操作都将一个字典作为输入,并输出一个字典。

这些操作分为数据加载,预处理和格式化。

这里使用 ResNet-50 在 ImageNet 数据集上的数据流水线作为示例。

img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='RandomResizedCrop', size=224),
    dict(type='RandomFlip', flip_prob=0.5, direction='horizontal'),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='ImageToTensor', keys=['img']),
    dict(type='ToTensor', keys=['gt_label']),
    dict(type='Collect', keys=['img', 'gt_label'])
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='Resize', size=256),
    dict(type='CenterCrop', crop_size=224),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='ImageToTensor', keys=['img']),
    dict(type='Collect', keys=['img'])
]

对于每个操作,我们列出了添加、更新、删除的相关字典字段。在流水线的最后,我们使用 Collect 仅保留进行模型 forward 方法所需的项。

数据加载

LoadImageFromFile - 从文件中加载图像

  • 添加:img, img_shape, ori_shape

默认情况下,LoadImageFromFile 将会直接从硬盘加载图像,但对于一些效率较高、规模较小的模型,这可能会导致 IO 瓶颈。MMCV 支持多种数据加载后端来加速这一过程。例如,如果训练设备上配置了 memcached,那么我们按照如下方式修改配置文件。

import os.path as osp  # 用于拼接 memcached 配置文件路径

memcached_root = '/mnt/xxx/memcached_client/'
train_pipeline = [
    dict(
        type='LoadImageFromFile',
        file_client_args=dict(
            backend='memcached',
            server_list_cfg=osp.join(memcached_root, 'server_list.conf'),
            client_cfg=osp.join(memcached_root, 'client.conf'))),
]

更多支持的数据加载后端,可以参见 mmcv.fileio.FileClient

预处理

Resize - 缩放图像尺寸

  • 添加:scale, scale_idx, pad_shape, scale_factor, keep_ratio

  • 更新:img, img_shape

RandomFlip - 随机翻转图像

  • 添加:flip, flip_direction

  • 更新:img

RandomCrop - 随机裁剪图像

  • 更新:img, pad_shape

Normalize - 图像数据归一化

  • 添加:img_norm_cfg

  • 更新:img

格式化

ToTensor - 转换(标签)数据至 torch.Tensor

  • 更新:根据参数 keys 指定

ImageToTensor - 转换图像数据至 torch.Tensor

  • 更新:根据参数 keys 指定

Collect - 保留指定键值

  • 删除:除了参数 keys 指定以外的所有键值对

扩展及使用自定义流水线

  1. 编写一个新的数据处理操作,并放置在 mmcls/datasets/pipelines/ 目录下的任何 一个文件中,例如 my_pipeline.py。这个类需要重载 __call__ 方法,接受一个 字典作为输入,并返回一个字典。

    from mmcls.datasets import PIPELINES
    
    @PIPELINES.register_module()
    class MyTransform(object):
    
        def __call__(self, results):
            # 对 results['img'] 进行变换操作
            return results
    
  2. mmcls/datasets/pipelines/__init__.py 中导入这个新的类。

    ...
    from .my_pipeline import MyTransform
    
    __all__ = [
        ..., 'MyTransform'
    ]
    
  3. 在数据流水线的配置中添加这一操作。

    img_norm_cfg = dict(
        mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
    train_pipeline = [
        dict(type='LoadImageFromFile'),
        dict(type='RandomResizedCrop', size=224),
        dict(type='RandomFlip', flip_prob=0.5, direction='horizontal'),
        dict(type='MyTransform'),
        dict(type='Normalize', **img_norm_cfg),
        dict(type='ImageToTensor', keys=['img']),
        dict(type='ToTensor', keys=['gt_label']),
        dict(type='Collect', keys=['img', 'gt_label'])
    ]
    

流水线可视化

设计好数据流水线后,可以使用可视化工具查看具体的效果。

教程 5:如何增加新模块

开发新组件

我们基本上将模型组件分为 3 种类型。

  • 主干网络:通常是一个特征提取网络,例如 ResNet、MobileNet

  • 颈部:用于连接主干网络和头部的组件,例如 GlobalAveragePooling

  • 头部:用于执行特定任务的组件,例如分类和回归

添加新的主干网络

这里,我们以 ResNet_CIFAR 为例,展示了如何开发一个新的主干网络组件。

ResNet_CIFAR 针对 CIFAR 32x32 的图像输入,将 ResNet 中 kernel_size=7, stride=2 的设置替换为 kernel_size=3, stride=1,并移除了 stem 层之后的 MaxPooling,以避免传递过小的特征图到残差块中。

它继承自 ResNet 并只修改了 stem 层。

  1. 创建一个新文件 mmcls/models/backbones/resnet_cifar.py

import torch.nn as nn
from mmcv.cnn import build_conv_layer, build_norm_layer

from ..builder import BACKBONES
from .resnet import ResNet


@BACKBONES.register_module()
class ResNet_CIFAR(ResNet):

    """ResNet backbone for CIFAR.

    (对这个主干网络的简短描述)

    Args:
        depth(int): Network depth, from {18, 34, 50, 101, 152}.
        ...
        (参数文档)
    """

    def __init__(self, depth, deep_stem=False, **kwargs):
        # 调用基类 ResNet 的初始化函数
        super(ResNet_CIFAR, self).__init__(depth, deep_stem=deep_stem, **kwargs)
        # 其他特殊的初始化流程
        assert not self.deep_stem, 'ResNet_CIFAR do not support deep_stem'

    def _make_stem_layer(self, in_channels, base_channels):
        # 重载基类的方法,以实现对网络结构的修改
        self.conv1 = build_conv_layer(
            self.conv_cfg,
            in_channels,
            base_channels,
            kernel_size=3,
            stride=1,
            padding=1,
            bias=False)
        self.norm1_name, norm1 = build_norm_layer(
            self.norm_cfg, base_channels, postfix=1)
        self.add_module(self.norm1_name, norm1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):  # 需要返回一个元组
        pass  # 此处省略了网络的前向实现

    def init_weights(self, pretrained=None):
        pass  # 如果有必要的话,重载基类 ResNet 的参数初始化函数

    def train(self, mode=True):
        pass  # 如果有必要的话,重载基类 ResNet 的训练状态函数
  1. mmcls/models/backbones/__init__.py 中导入新模块

...
from .resnet_cifar import ResNet_CIFAR

__all__ = [
    ..., 'ResNet_CIFAR'
]
  3. 在配置文件中使用新的主干网络

model = dict(
    ...
    backbone=dict(
        type='ResNet_CIFAR',
        depth=18,
        other_arg=xxx),
    ...

添加新的颈部组件

这里我们以 GlobalAveragePooling 为例。这是一个非常简单的颈部组件,没有任何参数。

要添加新的颈部组件,我们主要需要实现 forward 函数,该函数对主干网络的输出进行 一些操作并将结果传递到头部。

  1. 创建一个新文件 mmcls/models/necks/gap.py

    import torch.nn as nn
    
    from ..builder import NECKS
    
    @NECKS.register_module()
    class GlobalAveragePooling(nn.Module):
    
        def __init__(self):
            super(GlobalAveragePooling, self).__init__()
            self.gap = nn.AdaptiveAvgPool2d((1, 1))
    
        def forward(self, inputs):
            # 简单起见,我们默认输入是一个张量
            outs = self.gap(inputs)
            outs = outs.view(inputs.size(0), -1)
            return outs
    
  2. mmcls/models/necks/__init__.py 中导入新模块

    ...
    from .gap import GlobalAveragePooling
    
    __all__ = [
        ..., 'GlobalAveragePooling'
    ]
    
  3. 修改配置文件以使用新的颈部组件

    model = dict(
        neck=dict(type='GlobalAveragePooling'),
    )
    

添加新的头部组件

在此,我们以 LinearClsHead 为例,说明如何开发新的头部组件。

要添加一个新的头部组件,基本上我们需要实现 forward_train 函数,它接受来自颈部 或主干网络的特征图作为输入,并基于真实标签计算。

  1. 创建一个文件 mmcls/models/heads/linear_head.py.

    import torch.nn as nn
    from mmcv.cnn import normal_init

    from ..builder import HEADS
    from .cls_head import ClsHead
    
    
    @HEADS.register_module()
    class LinearClsHead(ClsHead):
    
        def __init__(self,
                  num_classes,
                  in_channels,
                  loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
                  topk=(1, )):
            super(LinearClsHead, self).__init__(loss=loss, topk=topk)
            self.in_channels = in_channels
            self.num_classes = num_classes
    
            if self.num_classes <= 0:
                raise ValueError(
                    f'num_classes={num_classes} must be a positive integer')
    
            self._init_layers()
    
        def _init_layers(self):
            self.fc = nn.Linear(self.in_channels, self.num_classes)
    
        def init_weights(self):
            normal_init(self.fc, mean=0, std=0.01, bias=0)
    
        def forward_train(self, x, gt_label):
            cls_score = self.fc(x)
            losses = self.loss(cls_score, gt_label)
            return losses
    
    
  2. mmcls/models/heads/__init__.py 中导入这个模块

    ...
    from .linear_head import LinearClsHead
    
    __all__ = [
        ..., 'LinearClsHead'
    ]
    
  3. 修改配置文件以使用新的头部组件。

连同 GlobalAveragePooling 颈部组件,完整的模型配置如下:

model = dict(
    type='ImageClassifier',
    backbone=dict(
        type='ResNet',
        depth=50,
        num_stages=4,
        out_indices=(3, ),
        style='pytorch'),
    neck=dict(type='GlobalAveragePooling'),
    head=dict(
        type='LinearClsHead',
        num_classes=1000,
        in_channels=2048,
        loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
        topk=(1, 5),
    ))

添加新的损失函数

要添加新的损失函数,我们主要需要实现损失函数模块中的 forward 函数。另外,利用装饰器 weighted_loss 可以方便地对每个元素的损失进行加权平均。

假设我们要模拟从另一个分类模型生成的概率分布,需要添加 L1loss 来实现该目的。

  1. 创建一个新文件 mmcls/models/losses/l1_loss.py

    import torch
    import torch.nn as nn
    
    from ..builder import LOSSES
    from .utils import weighted_loss
    
    @weighted_loss
    def l1_loss(pred, target):
        assert pred.size() == target.size() and target.numel() > 0
        loss = torch.abs(pred - target)
        return loss
    
    @LOSSES.register_module()
    class L1Loss(nn.Module):
    
        def __init__(self, reduction='mean', loss_weight=1.0):
            super(L1Loss, self).__init__()
            self.reduction = reduction
            self.loss_weight = loss_weight
    
        def forward(self,
                    pred,
                    target,
                    weight=None,
                    avg_factor=None,
                    reduction_override=None):
            assert reduction_override in (None, 'none', 'mean', 'sum')
            reduction = (
                reduction_override if reduction_override else self.reduction)
            loss = self.loss_weight * l1_loss(
                pred, target, weight, reduction=reduction, avg_factor=avg_factor)
            return loss
    
  2. 在文件 mmcls/models/losses/__init__.py 中导入这个模块

    ...
    from .l1_loss import L1Loss, l1_loss
    
    __all__ = [
        ..., 'L1Loss', 'l1_loss'
    ]
    
  3. 修改配置文件中的 loss 字段以使用新的损失函数

    loss=dict(type='L1Loss', loss_weight=1.0))
    

教程 6:如何自定义优化策略

在本教程中,我们将介绍如何在运行自定义模型时,进行构造优化器、定制学习率及动量调整策略、梯度裁剪、梯度累计以及用户自定义优化方法等。

构造 PyTorch 内置优化器

MMClassification 支持 PyTorch 实现的所有优化器,仅需在配置文件中,指定 “optimizer” 字段。 例如,如果要使用 “SGD”,则修改如下。

optimizer = dict(type='SGD', lr=0.0003, weight_decay=0.0001)

要修改模型的学习率,只需要在优化器的配置中修改 lr 即可。 要配置其他参数,可直接根据 PyTorch API 文档 进行。

备注

配置文件中的 ‘type’ 不是构造时的参数,而是 PyTorch 内置优化器的类名。

例如,如果想使用 Adam 并设置参数为 torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False), 则需要进行如下修改

optimizer = dict(type='Adam', lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False)

定制学习率调整策略

定制学习率衰减曲线

深度学习研究中,广泛应用学习率衰减来提高网络的性能。要使用学习率衰减,可以在配置中设置 lr_config 字段。

比如在默认的 ResNet 网络训练中,我们使用阶梯式的学习率衰减策略,配置文件为:

lr_config = dict(policy='step', step=[100, 150])

在训练过程中,程序会周期性地调用 MMCV 中的 StepLrUpdaterHook 来进行学习率更新。

此外,我们也支持其他学习率调整方法,如 CosineAnnealing 和 Poly 等。详情可见这里。

  • CosineAnnealing:

    lr_config = dict(policy='CosineAnnealing', min_lr_ratio=1e-5)
    
  • Poly:

    lr_config = dict(policy='poly', power=0.9, min_lr=1e-4, by_epoch=False)
    

定制学习率预热策略

在训练的早期阶段,网络容易不稳定,而学习率的预热就是为了减少这种不稳定性。通过预热,学习率将会从一个很小的值逐步提高到预定值。

在 MMClassification 中,我们同样使用 lr_config 配置学习率预热策略,主要的参数有以下几个:

  • warmup : 学习率预热曲线类别,必须为 ‘constant’、 ‘linear’, ‘exp’ 或者 None 其一, 如果为 None, 则不使用学习率预热策略。

  • warmup_by_epoch : 是否以轮次(epoch)为单位进行预热。

  • warmup_iters : 预热的迭代次数,当 warmup_by_epoch=True 时,单位为轮次(epoch);当 warmup_by_epoch=False 时,单位为迭代次数(iter)。

  • warmup_ratio : 预热的初始学习率 lr = lr * warmup_ratio

例如:

  1. 迭代次数线性预热

    lr_config = dict(
        policy='CosineAnnealing',
        by_epoch=False,
        min_lr_ratio=1e-2,
        warmup='linear',
        warmup_ratio=1e-3,
        warmup_iters=20 * 1252,
        warmup_by_epoch=False)
    
  2. 轮次指数预热

    lr_config = dict(
        policy='CosineAnnealing',
        min_lr=0,
        warmup='exp',
        warmup_iters=5,
        warmup_ratio=0.1,
        warmup_by_epoch=True)
    

小技巧

配置完成后,可以使用 MMClassification 提供的 学习率可视化工具 画出对应学习率调整曲线。

定制动量调整策略

MMClassification 支持动量调整器根据学习率修改模型的动量,从而使模型收敛更快。

动量调整程序通常与学习率调整器一起使用,例如,以下配置用于加速收敛。更多细节可参考 CyclicLrUpdater 和 CyclicMomentumUpdater。

这里是一个用例:

lr_config = dict(
    policy='cyclic',
    target_ratio=(10, 1e-4),
    cyclic_times=1,
    step_ratio_up=0.4,
)
momentum_config = dict(
    policy='cyclic',
    target_ratio=(0.85 / 0.95, 1),
    cyclic_times=1,
    step_ratio_up=0.4,
)

参数化精细配置

一些模型可能具有一些特定于参数的设置以进行优化,例如 BatchNorm 层不添加权重衰减,或者对不同的网络层使用不同的学习率。在 MMClassification 中,我们通过 optimizer 中的 paramwise_cfg 参数进行配置,可以参考 MMCV。

  • 使用指定选项

    MMClassification 提供了包括 bias_lr_mult、bias_decay_mult、norm_decay_mult、dwconv_decay_mult、dcn_offset_lr_mult、bypass_duplicate 在内的选项,用于指定所有相关的 bias、norm、dwconv、dcn、bypass 参数。例如令模型中所有的 BN 不进行参数衰减:

    optimizer = dict(
        type='SGD',
        lr=0.8,
        weight_decay=1e-4,
        paramwise_cfg=dict(norm_decay_mult=0.)
    )
    
  • 使用 custom_keys 指定参数

    MMClassification 可通过 custom_keys 指定不同的参数使用不同的学习率或者权重衰减,例如对特定的参数不使用权重衰减:

    paramwise_cfg = dict(
        custom_keys={
            'backbone.cls_token': dict(decay_mult=0.0),
            'backbone.pos_embed': dict(decay_mult=0.0)
        })
    
    optimizer = dict(
        type='SGD',
        lr=0.8,
        weight_decay=1e-4,
        paramwise_cfg=paramwise_cfg)
    

    对 backbone 使用更小的学习率与衰减系数:

    optimizer = dict(
        type='SGD',
        lr=0.8,
        weight_decay=1e-4,
        # backbone 的 'lr' 和 'weight_decay' 分别为 0.1 * lr 和 0.9 * weight_decay
        paramwise_cfg = dict(custom_keys={'backbone': dict(lr_mult=0.1, decay_mult=0.9)}))
    

梯度裁剪与梯度累计

除了 PyTorch 优化器的基本功能,我们还提供了一些对优化器的增强功能,例如梯度裁剪、梯度累计等,参考 MMCV

梯度裁剪

在训练过程中,损失函数可能接近于一些异常陡峭的区域,从而导致梯度爆炸。而梯度裁剪可以帮助稳定训练过程,更多介绍可以参见该页面

目前我们支持在 optimizer_config 字段中添加 grad_clip 参数来进行梯度裁剪,更详细的参数可参考 PyTorch 文档

用例如下:

# norm_type: 使用的范数类型,此处使用范数2。
optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))

当使用继承并修改基础配置方式时,如果基础配置中 grad_clip=None,需要添加 _delete_=True。有关 _delete_ 可以参考教程 1:如何编写配置文件。案例如下:

_base_ = ['./_base_/schedules/imagenet_bs256_coslr.py']

optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2), _delete_=True, type='OptimizerHook')
# 当 type 为 'OptimizerHook',可以省略 type;其他情况下,此处必须指明 type='xxxOptimizerHook'。

梯度累计

计算资源缺乏时,每个训练批次的大小(batch size)只能设置为较小的值,这可能会影响模型的性能。

可以使用梯度累计来规避这一问题。

用例如下:

data = dict(samples_per_gpu=64)
optimizer_config = dict(type="GradientCumulativeOptimizerHook", cumulative_iters=4)

表示训练时,每 4 个 iter 执行一次反向传播。由于此时单张 GPU 上的批次大小为 64,也就等价于单张 GPU 上一次迭代的批次大小为 256,也即:

data = dict(samples_per_gpu=256)
optimizer_config = dict(type="OptimizerHook")

备注

当在 optimizer_config 中不指定优化器钩子类型时,默认使用 OptimizerHook。

用户自定义优化方法

在学术研究和工业实践中,可能需要使用 MMClassification 未实现的优化方法,可以通过以下方法添加。

备注

本部分将修改 MMClassification 源码或者向 MMClassification 框架添加代码,初学者可跳过。

自定义优化器

1. 定义一个新的优化器

一个自定义的优化器可根据如下规则进行定制

假设我们想添加一个名为 MyOptimizer 的优化器,其拥有参数 a、b 和 c。可以创建一个名为 mmcls/core/optimizer 的文件夹,并在该目录下的一个文件(如 mmcls/core/optimizer/my_optimizer.py)中实现该自定义优化器:

from mmcv.runner import OPTIMIZERS
from torch.optim import Optimizer


@OPTIMIZERS.register_module()
class MyOptimizer(Optimizer):

    def __init__(self, a, b, c):
        # 在此处实现自定义优化器的初始化逻辑
        ...

2. 注册优化器

要注册上面定义的上述模块,首先需要将此模块导入到主命名空间中。有两种方法可以实现它。

  • 修改 mmcls/core/optimizer/__init__.py,将其导入至 optimizer 包;再修改 mmcls/core/__init__.py 以导入 optimizer

    创建 mmcls/core/optimizer/__init__.py 文件。 新定义的模块应导入到 mmcls/core/optimizer/__init__.py 中,以便注册器能找到新模块并将其添加:

# 在 mmcls/core/optimizer/__init__.py 中
from .my_optimizer import MyOptimizer # MyOptimizer 是我们自定义的优化器的名字

__all__ = ['MyOptimizer']
# 在 mmcls/core/__init__.py 中
...
from .optimizer import *  # noqa: F401, F403
  • 在配置中使用 custom_imports 手动导入

custom_imports = dict(imports=['mmcls.core.optimizer.my_optimizer'], allow_failed_imports=False)

mmcls.core.optimizer.my_optimizer 模块将会在程序开始阶段被导入,MyOptimizer 类会随之自动被注册。注意,只有包含 MyOptimizer 类的包会被导入,mmcls.core.optimizer.my_optimizer.MyOptimizer 不会被直接导入。

3. 在配置文件中指定优化器

之后,用户便可在配置文件的 optimizer 域中使用 MyOptimizer。 在配置中,优化器由 “optimizer” 字段定义,如下所示:

optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001)

要使用自定义的优化器,可以将该字段更改为

optimizer = dict(type='MyOptimizer', a=a_value, b=b_value, c=c_value)

自定义优化器构造器

某些模型可能具有一些特定于参数的设置以进行优化,例如 BatchNorm 层的权重衰减。

虽然我们的 DefaultOptimizerConstructor 已经提供了这些强大的功能,但可能仍然无法覆盖需求。 此时我们可以通过自定义优化器构造函数来进行其他细粒度的参数调整。

from mmcv.runner.optimizer import OPTIMIZER_BUILDERS


@OPTIMIZER_BUILDERS.register_module()
class MyOptimizerConstructor:

    def __init__(self, optimizer_cfg, paramwise_cfg=None):
        pass

    def __call__(self, model):
        ...    # 在这里实现自己的优化器构造器。
        return my_optimizer

这里是我们默认的优化器构造器的实现,可以作为新优化器构造器实现的模板。

教程 7:如何自定义模型运行参数

在本教程中,我们将介绍如何在运行自定义模型时,进行自定义工作流和钩子的方法。

定制工作流

工作流是一个形如 (任务名,周期数) 的列表,用于指定运行顺序和周期。这里“周期数”的单位由执行器的类型来决定。

比如在 MMClassification 中,我们默认使用基于轮次的执行器(EpochBasedRunner),那么“周期数”指的就是对应的任务在一个周期中 要执行多少个轮次。通常,我们只希望执行训练任务,那么只需要使用以下设置:

workflow = [('train', 1)]

有时我们可能希望在训练过程中穿插检查模型在验证集上的一些指标(例如,损失,准确性)。

在这种情况下,可以将工作流程设置为:

[('train', 1), ('val', 1)]

这样一来,程序会一轮训练一轮测试地反复执行。

需要注意的是,默认情况下,我们并不推荐用这种方式来进行模型验证,而是推荐在训练中使用 EvalHook 进行模型验证。使用上述工作流的方式进行模型验证只是一个替代方案。

备注

  1. 在验证周期时不会更新模型参数。

  2. 配置文件内的关键词 max_epochs 控制训练时期数,并且不会影响验证工作流程。

  3. 工作流 [('train', 1), ('val', 1)] 和 [('train', 1)] 不会改变 EvalHook 的行为。因为 EvalHook 在 after_train_epoch 阶段被调用,而验证工作流只会影响通过 after_val_epoch 调用的钩子。因此,[('train', 1), ('val', 1)] 和 [('train', 1)] 的区别在于,runner 在完成每一轮训练后,会额外计算验证集上的损失。

钩子

钩子机制在 OpenMMLab 开源算法库中应用非常广泛,结合执行器可以实现对训练过程的整个生命周期进行管理,可以通过相关文章进一步理解钩子。

钩子只有在构造器中被注册才起作用,目前钩子主要分为两类:

  • 默认训练钩子

默认训练钩子由运行器默认注册,一般为一些基础型功能的钩子,已经有确定的优先级,一般不需要修改优先级。

  • 定制钩子

定制钩子通过 custom_hooks 注册,一般为一些增强型功能的钩子,需要在配置文件中指定优先级,不指定该钩子的优先级将默认被设定为 'NORMAL'。

优先级列表

| Level | Value |
| --- | --- |
| HIGHEST | 0 |
| VERY_HIGH | 10 |
| HIGH | 30 |
| ABOVE_NORMAL | 40 |
| NORMAL (default) | 50 |
| BELOW_NORMAL | 60 |
| LOW | 70 |
| VERY_LOW | 90 |
| LOWEST | 100 |

优先级确定钩子的执行顺序,每次训练前,日志会打印出各个阶段钩子的执行顺序,方便调试。

默认训练钩子

有一些常见的钩子未通过 custom_hooks 注册,但会在运行器(Runner)中默认注册,它们是:

| Hooks | Priority |
| --- | --- |
| LrUpdaterHook | VERY_HIGH (10) |
| MomentumUpdaterHook | HIGH (30) |
| OptimizerHook | ABOVE_NORMAL (40) |
| CheckpointHook | NORMAL (50) |
| IterTimerHook | LOW (70) |
| EvalHook | LOW (70) |
| LoggerHook(s) | VERY_LOW (90) |

OptimizerHook、MomentumUpdaterHook 和 LrUpdaterHook 在优化策略部分进行了介绍;IterTimerHook 用于记录所用时间,目前不支持修改。

下面介绍如何定制 CheckpointHook、LoggerHooks 以及 EvalHook。

权重文件钩子(CheckpointHook)

MMCV 的 runner 使用 checkpoint_config 来初始化 CheckpointHook

checkpoint_config = dict(interval=1)

用户可以设置 “max_keep_ckpts” 来仅保存少量模型权重文件,或者通过 “save_optimizer” 决定是否存储优化器的状态字典。 更多细节可参考 这里
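
例如,下面的配置(参数取值仅作示意)表示每个 epoch 保存一次权重,至多保留最近的 3 个权重文件,并同时保存优化器的状态字典:

checkpoint_config = dict(interval=1, max_keep_ckpts=3, save_optimizer=True)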

日志钩子(LoggerHooks)

log_config 包装了多个记录器钩子,并可以设置间隔。目前,MMCV 支持 TextLoggerHook、WandbLoggerHook、MlflowLoggerHook 和 TensorboardLoggerHook。更多细节可参考这里。

log_config = dict(
    interval=50,
    hooks=[
        dict(type='TextLoggerHook'),
        dict(type='TensorboardLoggerHook')
    ])

验证钩子(EvalHook)

配置中的 evaluation 字段将用于初始化 EvalHook

EvalHook 有一些保留参数,如 interval、save_best 和 start 等。其他的参数,如 "metrics",将被传递给 dataset.evaluate()。

evaluation = dict(interval=1, metric='accuracy', metric_options={'topk': (1, )})

我们可以通过参数 save_best 保存取得最好验证结果时的模型权重:

# "auto" 表示自动选择指标来进行模型的比较。也可以指定一个特定的 key 比如 "accuracy_top-1"。
evaluation = dict(interval=1, save_best=True, metric='accuracy', metric_options={'topk': (1, )})

在跑一些大型实验时,可以通过修改参数 start 跳过训练靠前轮次时的验证步骤,以节约时间。如下:

evaluation = dict(interval=1, start=200, metric='accuracy', metric_options={'topk': (1, )})

表示在第 200 轮之前,只执行训练流程,不执行验证;从轮次 200 开始,在每一轮训练之后进行验证。

备注

在 MMClassification 的默认配置文件中,evaluation 字段一般被放在 datasets 基础配置文件中。

使用内置钩子

一些钩子已在 MMCV 和 MMClassification 中实现:

可以直接修改配置以使用该钩子,如下格式:

custom_hooks = [
    dict(type='MMCVHook', a=a_value, b=b_value, priority='NORMAL')
]

例如使用 EMAHook,进行一次 EMA 的间隔是 100 个迭代:

custom_hooks = [
    dict(type='EMAHook', interval=100, priority='HIGH')
]

自定义钩子

创建一个新钩子

这里举一个在 MMClassification 中创建一个新钩子,并在训练中使用它的示例:

from mmcv.runner import HOOKS, Hook


@HOOKS.register_module()
class MyHook(Hook):

    def __init__(self, a, b):
        pass

    def before_run(self, runner):
        pass

    def after_run(self, runner):
        pass

    def before_epoch(self, runner):
        pass

    def after_epoch(self, runner):
        pass

    def before_iter(self, runner):
        pass

    def after_iter(self, runner):
        pass

根据钩子的功能,用户需要指定钩子在训练的每个阶段将要执行的操作,比如 before_run、after_run、before_epoch、after_epoch、before_iter 和 after_iter。

注册新钩子

之后,需要导入 MyHook。假设该文件在 mmcls/core/utils/my_hook.py,有两种办法导入它:

  • 修改 mmcls/core/utils/__init__.py 进行导入

    新定义的模块应导入到 mmcls/core/utils/__init__.py 中,以便注册器能找到并添加新模块:

from .my_hook import MyHook

__all__ = ['MyHook']
  • 使用配置文件中的 custom_imports 变量手动导入

custom_imports = dict(imports=['mmcls.core.utils.my_hook'], allow_failed_imports=False)

修改配置

custom_hooks = [
    dict(type='MyHook', a=a_value, b=b_value)
]

还可通过 priority 参数设置钩子优先级,如下所示:

custom_hooks = [
    dict(type='MyHook', a=a_value, b=b_value, priority='NORMAL')
]

默认情况下,在注册过程中,钩子的优先级设置为“NORMAL”。

常见问题

1. resume_from、load_from 和 init_cfg.Pretrained 的区别

  • load_from :仅仅加载模型权重,主要用于加载预训练或者训练好的模型;

  • resume_from :不仅导入模型权重,还会导入优化器信息,当前轮次(epoch)信息,主要用于从断点继续训练。

  • init_cfg.Pretrained :在权重初始化期间加载权重,您可以指定要加载的模块。 这通常在微调模型时使用,请参阅教程 2:如何微调模型

模型库统计

Model Zoo

ImageNet

ImageNet has multiple versions, but the most commonly used one is ILSVRC 2012. The ResNet family models below are trained by standard data augmentations, i.e., RandomResizedCrop, RandomHorizontalFlip and Normalize.

| Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
| --- | --- | --- | --- | --- | --- | --- |
| VGG-11 | 132.86 | 7.63 | 68.75 | 88.87 | config | model / log |
| VGG-13 | 133.05 | 11.34 | 70.02 | 89.46 | config | model / log |
| VGG-16 | 138.36 | 15.5 | 71.62 | 90.49 | config | model / log |
| VGG-19 | 143.67 | 19.67 | 72.41 | 90.80 | config | model / log |
| VGG-11-BN | 132.87 | 7.64 | 70.75 | 90.12 | config | model / log |
| VGG-13-BN | 133.05 | 11.36 | 72.15 | 90.71 | config | model / log |
| VGG-16-BN | 138.37 | 15.53 | 73.72 | 91.68 | config | model / log |
| VGG-19-BN | 143.68 | 19.7 | 74.70 | 92.24 | config | model / log |
| RepVGG-A0* | 9.11 (train) / 8.31 (deploy) | 1.52 (train) / 1.36 (deploy) | 72.41 | 90.50 | config (train) / config (deploy) | model |
| RepVGG-A1* | 14.09 (train) / 12.79 (deploy) | 2.64 (train) / 2.37 (deploy) | 74.47 | 91.85 | config (train) / config (deploy) | model |
| RepVGG-A2* | 28.21 (train) / 25.5 (deploy) | 5.7 (train) / 5.12 (deploy) | 76.48 | 93.01 | config (train) / config (deploy) | model |
| RepVGG-B0* | 15.82 (train) / 14.34 (deploy) | 3.42 (train) / 3.06 (deploy) | 75.14 | 92.42 | config (train) / config (deploy) | model |
| RepVGG-B1* | 57.42 (train) / 51.83 (deploy) | 13.16 (train) / 11.82 (deploy) | 78.37 | 94.11 | config (train) / config (deploy) | model |
| RepVGG-B1g2* | 45.78 (train) / 41.36 (deploy) | 9.82 (train) / 8.82 (deploy) | 77.79 | 93.88 | config (train) / config (deploy) | model |
| RepVGG-B1g4* | 39.97 (train) / 36.13 (deploy) | 8.15 (train) / 7.32 (deploy) | 77.58 | 93.84 | config (train) / config (deploy) | model |
| RepVGG-B2* | 89.02 (train) / 80.32 (deploy) | 20.46 (train) / 18.39 (deploy) | 78.78 | 94.42 | config (train) / config (deploy) | model |
| RepVGG-B2g4* | 61.76 (train) / 55.78 (deploy) | 12.63 (train) / 11.34 (deploy) | 79.38 | 94.68 | config (train) / config (deploy) | model |
| RepVGG-B3* | 123.09 (train) / 110.96 (deploy) | 29.17 (train) / 26.22 (deploy) | 80.52 | 95.26 | config (train) / config (deploy) | model |
| RepVGG-B3g4* | 83.83 (train) / 75.63 (deploy) | 17.9 (train) / 16.08 (deploy) | 80.22 | 95.10 | config (train) / config (deploy) | model |
| RepVGG-D2se* | 133.33 (train) / 120.39 (deploy) | 36.56 (train) / 32.85 (deploy) | 81.81 | 95.94 | config (train) / config (deploy) | model |
| ResNet-18 | 11.69 | 1.82 | 70.07 | 89.44 | config | model / log |
| ResNet-34 | 21.8 | 3.68 | 73.85 | 91.53 | config | model / log |
| ResNet-50 (rsb-a1) | 25.56 | 4.12 | 80.12 | 94.78 | config | model / log |
| ResNet-101 | 44.55 | 7.85 | 78.18 | 94.03 | config | model / log |
| ResNet-152 | 60.19 | 11.58 | 78.63 | 94.16 | config | model / log |
| Res2Net-50-14w-8s* | 25.06 | 4.22 | 78.14 | 93.85 | config | model |
| Res2Net-50-26w-8s* | 48.40 | 8.39 | 79.20 | 94.36 | config | model |
| Res2Net-101-26w-4s* | 45.21 | 8.12 | 79.19 | 94.44 | config | model |
| ResNeSt-50* | 27.48 | 5.41 | 81.13 | 95.59 | config | model |
| ResNeSt-101* | 48.28 | 10.27 | 82.32 | 96.24 | config | model |
| ResNeSt-200* | 70.2 | 17.53 | 82.41 | 96.22 | config | model |
| ResNeSt-269* | 110.93 | 22.58 | 82.70 | 96.28 | config | model |
| ResNetV1D-50 | 25.58 | 4.36 | 77.54 | 93.57 | config | model / log |
| ResNetV1D-101 | 44.57 | 8.09 | 78.93 | 94.48 | config | model / log |
| ResNetV1D-152 | 60.21 | 11.82 | 79.41 | 94.7 | config | model / log |
| ResNeXt-32x4d-50 | 25.03 | 4.27 | 77.90 | 93.66 | config | model / log |
| ResNeXt-32x4d-101 | 44.18 | 8.03 | 78.71 | 94.12 | config | model / log |
| ResNeXt-32x8d-101 | 88.79 | 16.5 | 79.23 | 94.58 | config | model / log |
| ResNeXt-32x4d-152 | 59.95 | 11.8 | 78.93 | 94.41 | config | model / log |
| SE-ResNet-50 | 28.09 | 4.13 | 77.74 | 93.84 | config | model / log |
| SE-ResNet-101 | 49.33 | 7.86 | 78.26 | 94.07 | config | model / log |
| RegNetX-400MF | 5.16 | 0.41 | 72.56 | 90.78 | config | model / log |
| RegNetX-800MF | 7.26 | 0.81 | 74.76 | 92.32 | config | model / log |
| RegNetX-1.6GF | 9.19 | 1.63 | 76.84 | 93.31 | config | model / log |
| RegNetX-3.2GF | 15.3 | 3.21 | 78.09 | 94.08 | config | model / log |
| RegNetX-4.0GF | 22.12 | 4.0 | 78.60 | 94.17 | config | model / log |
| RegNetX-6.4GF | 26.21 | 6.51 | 79.38 | 94.65 | config | model / log |
| RegNetX-8.0GF | 39.57 | 8.03 | 79.12 | 94.51 | config | model / log |
| RegNetX-12GF | 46.11 | 12.15 | 79.67 | 95.03 | config | model / log |
| ShuffleNetV1 1.0x (group=3) | 1.87 | 0.146 | 68.13 | 87.81 | config | model / log |
| ShuffleNetV2 1.0x | 2.28 | 0.149 | 69.55 | 88.92 | config | model / log |
| MobileNet V2 | 3.5 | 0.319 | 71.86 | 90.42 | config | model / log |
| ViT-B/16* | 86.86 | 33.03 | 85.43 | 97.77 | config | model |
| ViT-B/32* | 88.3 | 8.56 | 84.01 | 97.08 | config | model |
| ViT-L/16* | 304.72 | 116.68 | 85.63 | 97.63 | config | model |
| Swin-Transformer tiny | 28.29 | 4.36 | 81.18 | 95.61 | config | model / log |
| Swin-Transformer small | 49.61 | 8.52 | 83.02 | 96.29 | config | model / log |
| Swin-Transformer base | 87.77 | 15.14 | 83.36 | 96.44 | config | model / log |
| Transformer in Transformer small* | 23.76 | 3.36 | 81.52 | 95.73 | config | model |
| T2T-ViT_t-14 | 21.47 | 4.34 | 81.83 | 95.84 | config | model / log |
| T2T-ViT_t-19 | 39.08 | 7.80 | 82.63 | 96.18 | config | model / log |
| T2T-ViT_t-24 | 64.00 | 12.69 | 82.71 | 96.09 | config | model / log |
| Mixer-B/16* | 59.88 | 12.61 | 76.68 | 92.25 | config | model |
| Mixer-L/16* | 208.2 | 44.57 | 72.34 | 88.02 | config | model |
| DeiT-tiny | 5.72 | 1.08 | 74.50 | 92.24 | config | model / log |
| DeiT-tiny distilled* | 5.72 | 1.08 | 74.51 | 91.90 | config | model |
| DeiT-small | 22.05 | 4.24 | 80.69 | 95.06 | config | model / log |
| DeiT-small distilled* | 22.05 | 4.24 | 81.17 | 95.40 | config | model |
| DeiT-base | 86.57 | 16.86 | 81.76 | 95.81 | config | model / log |
| DeiT-base distilled* | 86.57 | 16.86 | 83.33 | 96.49 | config | model |
| DeiT-base 384px* | 86.86 | 49.37 | 83.04 | 96.31 | config | model |
| DeiT-base distilled 384px* | 86.86 | 49.37 | 85.55 | 97.35 | config | model |
| Conformer-tiny-p16* | 23.52 | 4.90 | 81.31 | 95.60 | config | model |
| Conformer-small-p32* | 38.85 | 7.09 | 81.96 | 96.02 | config | model |
| Conformer-small-p16* | 37.67 | 10.31 | 83.32 | 96.46 | config | model |
| Conformer-base-p16* | 83.29 | 22.89 | 83.82 | 96.59 | config | model |
| PCPVT-small* | 24.11 | 3.67 | 81.14 | 95.69 | config | model |
| PCPVT-base* | 43.83 | 6.45 | 82.66 | 96.26 | config | model |
| PCPVT-large* | 60.99 | 9.51 | 83.09 | 96.59 | config | model |
| SVT-small* | 24.06 | 2.82 | 81.77 | 95.57 | config | model |
| SVT-base* | 56.07 | 8.35 | 83.13 | 96.29 | config | model |
| SVT-large* | 99.27 | 14.82 | 83.60 | 96.50 | config | model |
| EfficientNet-B0* | 5.29 | 0.02 | 76.74 | 93.17 | config | model |
| EfficientNet-B0 (AA)* | 5.29 | 0.02 | 77.26 | 93.41 | config | model |
| EfficientNet-B0 (AA + AdvProp)* | 5.29 | 0.02 | 77.53 | 93.61 | config | model |
| EfficientNet-B1* | 7.79 | 0.03 | 78.68 | 94.28 | config | model |
| EfficientNet-B1 (AA)* | 7.79 | 0.03 | 79.20 | 94.42 | config | model |
| EfficientNet-B1 (AA + AdvProp)* | 7.79 | 0.03 | 79.52 | 94.43 | config | model |
| EfficientNet-B2* | 9.11 | 0.03 | 79.64 | 94.80 | config | model |
| EfficientNet-B2 (AA)* | 9.11 | 0.03 | 80.21 | 94.96 | config | model |
| EfficientNet-B2 (AA + AdvProp)* | 9.11 | 0.03 | 80.45 | 95.07 | config | model |
| EfficientNet-B3* | 12.23 | 0.06 | 81.01 | 95.34 | config | model |
| EfficientNet-B3 (AA)* | 12.23 | 0.06 | 81.58 | 95.67 | config | model |
| EfficientNet-B3 (AA + AdvProp)* | 12.23 | 0.06 | 81.81 | 95.69 | config | model |
| EfficientNet-B4* | 19.34 | 0.12 | 82.57 | 96.09 | config | model |
| EfficientNet-B4 (AA)* | 19.34 | 0.12 | 82.95 | 96.26 | config | model |
| EfficientNet-B4 (AA + AdvProp)* | 19.34 | 0.12 | 83.25 | 96.44 | config | model |
| EfficientNet-B5* | 30.39 | 0.24 | 83.18 | 96.47 | config | model |
| EfficientNet-B5 (AA)* | 30.39 | 0.24 | 83.82 | 96.76 | config | model |
| EfficientNet-B5 (AA + AdvProp)* | 30.39 | 0.24 | 84.21 | 96.98 | config | model |
| EfficientNet-B6 (AA)* | 43.04 | 0.41 | 84.05 | 96.82 | config | model |
| EfficientNet-B6 (AA + AdvProp)* | 43.04 | 0.41 | 84.74 | 97.14 | config | model |
| EfficientNet-B7 (AA)* | 66.35 | 0.72 | 84.38 | 96.88 | config | model |
| EfficientNet-B7 (AA + AdvProp)* | 66.35 | 0.72 | 85.14 | 97.23 | config | model |
| EfficientNet-B8 (AA + AdvProp)* | 87.41 | 1.09 | 85.38 | 97.28 | config | model |
| ConvNeXt-T* | 28.59 | 4.46 | 82.05 | 95.86 | config | model |
| ConvNeXt-S* | 50.22 | 8.69 | 83.13 | 96.44 | config | model |
| ConvNeXt-B* | 88.59 | 15.36 | 83.85 | 96.74 | config | model |

ConvNeXt-B*

88.59

15.36

85.81

97.86

config

model

ConvNeXt-L*

197.77

34.37

84.30

96.89

config

model

ConvNeXt-L*

197.77

34.37

86.61

98.04

config

model

ConvNeXt-XL*

350.20

60.93

86.97

98.20

config

model

HRNet-W18*

21.30

4.33

76.75

93.44

config

model

HRNet-W30*

37.71

8.17

78.19

94.22

config

model

HRNet-W32*

41.23

8.99

78.44

94.19

config

model

HRNet-W40*

57.55

12.77

78.94

94.47

config

model

HRNet-W44*

67.06

14.96

78.88

94.37

config

model

HRNet-W48*

77.47

17.36

79.32

94.52

config

model

HRNet-W64*

128.06

29.00

79.46

94.65

config

model

HRNet-W18 (ssld)*

21.30

4.33

81.06

95.70

config

model

HRNet-W48 (ssld)*

77.47

17.36

83.63

96.79

config

model

WRN-50*

68.88

11.44

81.45

95.53

config

model

WRN-101*

126.89

22.81

78.84

94.28

config

model

CSPDarkNet50*

27.64

5.04

80.05

95.07

config

model

CSPResNet50*

21.62

3.48

79.55

94.68

config

model

CSPResNeXt50*

20.57

3.11

79.96

94.96

config

model

DenseNet121*

7.98

2.88

74.96

92.21

config

model

DenseNet169*

14.15

3.42

76.08

93.11

config

model

DenseNet201*

20.01

4.37

77.32

93.64

config

model

DenseNet161*

28.68

7.82

77.61

93.83

config

model

VAN-T*

4.11

0.88

75.41

93.02

config

model

VAN-S*

13.86

2.52

81.01

95.63

config

model

VAN-B*

26.58

5.03

82.80

96.21

config

model

VAN-L*

44.77

8.99

83.86

96.73

config

model

MViTv2-tiny*

24.17

4.70

82.33

96.15

config

model

MViTv2-small*

34.87

7.00

83.63

96.51

config

model

MViTv2-base*

51.47

10.20

84.34

96.86

config

model

MViTv2-large*

217.99

42.10

85.25

97.14

config

model

EfficientFormer-l1*

12.19

1.30

80.46

94.99

config

model

EfficientFormer-l3*

31.41

3.93

82.45

96.18

config

model

EfficientFormer-l7*

82.23

10.16

83.40

96.60

config

model

Models with * are converted from other repos, others are trained by ourselves.

CIFAR10

| Model | Params(M) | Flops(G) | Top-1 (%) | Config | Download |
|---|---|---|---|---|---|
| ResNet-18-b16x8 | 11.17 | 0.56 | 94.82 | config | |
| ResNet-34-b16x8 | 21.28 | 1.16 | 95.34 | config | |
| ResNet-50-b16x8 | 23.52 | 1.31 | 95.55 | config | |
| ResNet-101-b16x8 | 42.51 | 2.52 | 95.58 | config | |
| ResNet-152-b16x8 | 58.16 | 3.74 | 95.76 | config | |

Conformer

Conformer: Local Features Coupling Global Representations for Visual Recognition

Abstract

Within Convolutional Neural Network (CNN), the convolution operations are good at extracting local features but experience difficulty to capture global representations. Within visual transformer, the cascaded self-attention modules can capture long-distance feature dependencies but unfortunately deteriorate local feature details. In this paper, we propose a hybrid network structure, termed Conformer, to take advantage of convolutional operations and self-attention mechanisms for enhanced representation learning. Conformer roots in the Feature Coupling Unit (FCU), which fuses local features and global representations under different resolutions in an interactive fashion. Conformer adopts a concurrent structure so that local features and global representations are retained to the maximum extent. Experiments show that Conformer, under the comparable parameter complexity, outperforms the visual transformer (DeiT-B) by 2.3% on ImageNet. On MSCOCO, it outperforms ResNet-101 by 3.7% and 3.6% mAPs for object detection and instance segmentation, respectively, demonstrating the great potential to be a general backbone network.

Results and models

ImageNet-1k

| Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
|---|---|---|---|---|---|---|
| Conformer-tiny-p16* | 23.52 | 4.90 | 81.31 | 95.60 | config | model |
| Conformer-small-p32* | 38.85 | 7.09 | 81.96 | 96.02 | config | model |
| Conformer-small-p16* | 37.67 | 10.31 | 83.32 | 96.46 | config | model |
| Conformer-base-p16* | 83.29 | 22.89 | 83.82 | 96.59 | config | model |

Models with * are converted from the official repo. The config files of these models are only for validation. We don’t ensure these config files’ training accuracy and welcome you to contribute your reproduction results.

Citation

@article{peng2021conformer,
      title={Conformer: Local Features Coupling Global Representations for Visual Recognition},
      author={Zhiliang Peng and Wei Huang and Shanzhi Gu and Lingxi Xie and Yaowei Wang and Jianbin Jiao and Qixiang Ye},
      journal={arXiv preprint arXiv:2105.03889},
      year={2021},
}

ConvMixer

Patches Are All You Need?

Abstract

Although convolutional networks have been the dominant architecture for vision tasks for many years, recent experiments have shown that Transformer-based models, most notably the Vision Transformer (ViT), may exceed their performance in some settings. However, due to the quadratic runtime of the self-attention layers in Transformers, ViTs require the use of patch embeddings, which group together small regions of the image into single input features, in order to be applied to larger image sizes. This raises a question: Is the performance of ViTs due to the inherently-more-powerful Transformer architecture, or is it at least partly due to using patches as the input representation? In this paper, we present some evidence for the latter: specifically, we propose the ConvMixer, an extremely simple model that is similar in spirit to the ViT and the even-more-basic MLP-Mixer in that it operates directly on patches as input, separates the mixing of spatial and channel dimensions, and maintains equal size and resolution throughout the network. In contrast, however, the ConvMixer uses only standard convolutions to achieve the mixing steps. Despite its simplicity, we show that the ConvMixer outperforms the ViT, MLP-Mixer, and some of their variants for similar parameter counts and data set sizes, in addition to outperforming classical vision models such as the ResNet.

Results and models

ImageNet-1k

| Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
|---|---|---|---|---|---|---|
| ConvMixer-768/32* | 21.11 | 19.62 | 80.16 | 95.08 | config | model |
| ConvMixer-1024/20* | 24.38 | 5.55 | 76.94 | 93.36 | config | model |
| ConvMixer-1536/20* | 51.63 | 48.71 | 81.37 | 95.61 | config | model |

Models with * are converted from the official repo. The config files of these models are only for inference. We don’t ensure these config files’ training accuracy and welcome you to contribute your reproduction results.

Citation

@misc{trockman2022patches,
      title={Patches Are All You Need?},
      author={Asher Trockman and J. Zico Kolter},
      year={2022},
      eprint={2201.09792},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

ConvNeXt

A ConvNet for the 2020s

Abstract

The “Roaring 20s” of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model. A vanilla ViT, on the other hand, faces difficulties when applied to general computer vision tasks such as object detection and semantic segmentation. It is the hierarchical Transformers (e.g., Swin Transformers) that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone and demonstrating remarkable performance on a wide variety of vision tasks. However, the effectiveness of such hybrid approaches is still largely credited to the intrinsic superiority of Transformers, rather than the inherent inductive biases of convolutions. In this work, we reexamine the design spaces and test the limits of what a pure ConvNet can achieve. We gradually “modernize” a standard ResNet toward the design of a vision Transformer, and discover several key components that contribute to the performance difference along the way. The outcome of this exploration is a family of pure ConvNet models dubbed ConvNeXt. Constructed entirely from standard ConvNet modules, ConvNeXts compete favorably with Transformers in terms of accuracy and scalability, achieving 87.8% ImageNet top-1 accuracy and outperforming Swin Transformers on COCO detection and ADE20K segmentation, while maintaining the simplicity and efficiency of standard ConvNets.

Results and models

ImageNet-1k

| Model | Pretrain | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
|---|---|---|---|---|---|---|---|
| ConvNeXt-T* | From scratch | 28.59 | 4.46 | 82.05 | 95.86 | config | model |
| ConvNeXt-S* | From scratch | 50.22 | 8.69 | 83.13 | 96.44 | config | model |
| ConvNeXt-B* | From scratch | 88.59 | 15.36 | 83.85 | 96.74 | config | model |
| ConvNeXt-B* | ImageNet-21k | 88.59 | 15.36 | 85.81 | 97.86 | config | model |
| ConvNeXt-L* | From scratch | 197.77 | 34.37 | 84.30 | 96.89 | config | model |
| ConvNeXt-L* | ImageNet-21k | 197.77 | 34.37 | 86.61 | 98.04 | config | model |
| ConvNeXt-XL* | ImageNet-21k | 350.20 | 60.93 | 86.97 | 98.20 | config | model |

Models with * are converted from the official repo. The config files of these models are only for inference. We don’t ensure these config files’ training accuracy and welcome you to contribute your reproduction results.

Pre-trained Models

The pre-trained models on ImageNet-1k or ImageNet-21k are used to fine-tune on the downstream tasks.

| Model | Training Data | Params(M) | Flops(G) | Download |
|---|---|---|---|---|
| ConvNeXt-T* | ImageNet-1k | 28.59 | 4.46 | model |
| ConvNeXt-S* | ImageNet-1k | 50.22 | 8.69 | model |
| ConvNeXt-B* | ImageNet-1k | 88.59 | 15.36 | model |
| ConvNeXt-B* | ImageNet-21k | 88.59 | 15.36 | model |
| ConvNeXt-L* | ImageNet-21k | 197.77 | 34.37 | model |
| ConvNeXt-XL* | ImageNet-21k | 350.20 | 60.93 | model |

Models with * are converted from the official repo.
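
The fine-tuning workflow itself is not shown on this page. Below is a minimal config sketch for initializing a classifier from one of the pre-trained checkpoints above through init_cfg; the base config path, the checkpoint URL and the number of classes are placeholders and should be adapted to your own setup (check the configs/convnext directory of the repository for the exact base config name).

# finetune_convnext-tiny.py -- hypothetical fine-tuning config
_base_ = [
    '../convnext/convnext-tiny_32xb128_in1k.py',  # assumed base config; verify the file name in configs/convnext/
]

model = dict(
    backbone=dict(
        init_cfg=dict(
            type='Pretrained',
            checkpoint='<url-or-path-of-the-pre-trained-checkpoint>',  # taken from the table above
            prefix='backbone',
        )),
    head=dict(num_classes=10),  # adapt the classification head to the downstream dataset
)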

Citation

@Article{liu2022convnet,
  author  = {Zhuang Liu and Hanzi Mao and Chao-Yuan Wu and Christoph Feichtenhofer and Trevor Darrell and Saining Xie},
  title   = {A ConvNet for the 2020s},
  journal = {arXiv preprint arXiv:2201.03545},
  year    = {2022},
}

CSPNet

CSPNet: A New Backbone that can Enhance Learning Capability of CNN

Abstract

Neural networks have enabled state-of-the-art approaches to achieve incredible results on computer vision tasks such as object detection. However, such success greatly relies on costly computation resources, which hinders people with cheap devices from appreciating the advanced technology. In this paper, we propose Cross Stage Partial Network (CSPNet) to mitigate the problem that previous works require heavy inference computations from the network architecture perspective. We attribute the problem to the duplicate gradient information within network optimization. The proposed networks respect the variability of the gradients by integrating feature maps from the beginning and the end of a network stage, which, in our experiments, reduces computations by 20% with equivalent or even superior accuracy on the ImageNet dataset, and significantly outperforms state-of-the-art approaches in terms of AP50 on the MS COCO object detection dataset. The CSPNet is easy to implement and general enough to cope with architectures based on ResNet, ResNeXt, and DenseNet. Source code is at this https URL.

Results and models

ImageNet-1k

| Model | Pretrain | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
|---|---|---|---|---|---|---|---|
| CSPDarkNet50* | From scratch | 27.64 | 5.04 | 80.05 | 95.07 | config | model |
| CSPResNet50* | From scratch | 21.62 | 3.48 | 79.55 | 94.68 | config | model |
| CSPResNeXt50* | From scratch | 20.57 | 3.11 | 79.96 | 94.96 | config | model |

Models with * are converted from the timm repo. The config files of these models are only for inference. We don’t ensure these config files’ training accuracy and welcome you to contribute your reproduction results.

Citation

@inproceedings{wang2020cspnet,
  title={CSPNet: A new backbone that can enhance learning capability of CNN},
  author={Wang, Chien-Yao and Liao, Hong-Yuan Mark and Wu, Yueh-Hua and Chen, Ping-Yang and Hsieh, Jun-Wei and Yeh, I-Hau},
  booktitle={Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops},
  pages={390--391},
  year={2020}
}

CSRA

Residual Attention: A Simple but Effective Method for Multi-Label Recognition

Abstract

Multi-label image recognition is a challenging computer vision task of practical use. Progresses in this area, however, are often characterized by complicated methods, heavy computations, and lack of intuitive explanations. To effectively capture different spatial regions occupied by objects from different categories, we propose an embarrassingly simple module, named class-specific residual attention (CSRA). CSRA generates class-specific features for every category by proposing a simple spatial attention score, and then combines it with the class-agnostic average pooling feature. CSRA achieves state-of-the-art results on multilabel recognition, and at the same time is much simpler than them. Furthermore, with only 4 lines of code, CSRA also leads to consistent improvement across many diverse pretrained models and datasets without any extra training. CSRA is both easy to implement and light in computations, which also enjoys intuitive explanations and visualizations.

Results and models

VOC2007

| Model | Pretrain | Params(M) | Flops(G) | mAP | OF1 (%) | CF1 (%) | Config | Download |
|---|---|---|---|---|---|---|---|---|
| Resnet101-CSRA | ImageNet-1k | 23.55 | 4.12 | 94.98 | 90.80 | 89.16 | config | model \| log |

Citation

@misc{https://doi.org/10.48550/arxiv.2108.02456,
  doi = {10.48550/ARXIV.2108.02456},
  url = {https://arxiv.org/abs/2108.02456},
  author = {Zhu, Ke and Wu, Jianxin},
  keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences, FOS: Computer and information sciences},
  title = {Residual Attention: A Simple but Effective Method for Multi-Label Recognition},
  publisher = {arXiv},
  year = {2021},
  copyright = {arXiv.org perpetual, non-exclusive license}
}

DeiT

Training data-efficient image transformers & distillation through attention

Abstract

Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. However, these visual transformers are pre-trained with hundreds of millions of images using an expensive infrastructure, thereby limiting their adoption. In this work, we produce a competitive convolution-free transformer by training on Imagenet only. We train them on a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop evaluation) on ImageNet with no external data. More importantly, we introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention. We show the interest of this token-based distillation, especially when using a convnet as a teacher. This leads us to report results competitive with convnets for both Imagenet (where we obtain up to 85.2% accuracy) and when transferring to other tasks. We share our code and models.

Results and models

ImageNet-1k

The teacher of the distilled version of DeiT is RegNetY-16GF.

| Model | Pretrain | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
|---|---|---|---|---|---|---|---|
| DeiT-tiny | From scratch | 5.72 | 1.08 | 74.50 | 92.24 | config | model \| log |
| DeiT-tiny distilled* | From scratch | 5.72 | 1.08 | 74.51 | 91.90 | config | model |
| DeiT-small | From scratch | 22.05 | 4.24 | 80.69 | 95.06 | config | model \| log |
| DeiT-small distilled* | From scratch | 22.05 | 4.24 | 81.17 | 95.40 | config | model |
| DeiT-base | From scratch | 86.57 | 16.86 | 81.76 | 95.81 | config | model \| log |
| DeiT-base* | From scratch | 86.57 | 16.86 | 81.79 | 95.59 | config | model |
| DeiT-base distilled* | From scratch | 86.57 | 16.86 | 83.33 | 96.49 | config | model |
| DeiT-base 384px* | ImageNet-1k | 86.86 | 49.37 | 83.04 | 96.31 | config | model |
| DeiT-base distilled 384px* | ImageNet-1k | 86.86 | 49.37 | 85.55 | 97.35 | config | model |

Models with * are converted from the official repo. The config files of these models are only for validation. We don’t ensure these config files’ training accuracy and welcome you to contribute your reproduction results.

Warning

MMClassification doesn’t support training the distilled version DeiT. And we provide distilled version checkpoints for inference only.

Citation

@InProceedings{pmlr-v139-touvron21a,
  title =     {Training data-efficient image transformers & distillation through attention},
  author =    {Touvron, Hugo and Cord, Matthieu and Douze, Matthijs and Massa, Francisco and Sablayrolles, Alexandre and Jegou, Herve},
  booktitle = {International Conference on Machine Learning},
  pages =     {10347--10357},
  year =      {2021},
  volume =    {139},
  month =     {July}
}

DenseNet

Densely Connected Convolutional Networks

Abstract

Recent work has shown that convolutional networks can be substantially deeper, more accurate, and efficient to train if they contain shorter connections between layers close to the input and those close to the output. In this paper, we embrace this observation and introduce the Dense Convolutional Network (DenseNet), which connects each layer to every other layer in a feed-forward fashion. Whereas traditional convolutional networks with L layers have L connections - one between each layer and its subsequent layer - our network has L(L+1)/2 direct connections. For each layer, the feature-maps of all preceding layers are used as inputs, and its own feature-maps are used as inputs into all subsequent layers. DenseNets have several compelling advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters. We evaluate our proposed architecture on four highly competitive object recognition benchmark tasks (CIFAR-10, CIFAR-100, SVHN, and ImageNet). DenseNets obtain significant improvements over the state-of-the-art on most of them, whilst requiring less computation to achieve high performance.

Results and models

ImageNet-1k

| Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
|---|---|---|---|---|---|---|
| DenseNet121* | 7.98 | 2.88 | 74.96 | 92.21 | config | model |
| DenseNet169* | 14.15 | 3.42 | 76.08 | 93.11 | config | model |
| DenseNet201* | 20.01 | 4.37 | 77.32 | 93.64 | config | model |
| DenseNet161* | 28.68 | 7.82 | 77.61 | 93.83 | config | model |

Models with * are converted from PyTorch, guided by the original repo. The config files of these models are only for inference. We don’t ensure these config files’ training accuracy and welcome you to contribute your reproduction results.

Citation

@misc{https://doi.org/10.48550/arxiv.1608.06993,
      doi = {10.48550/ARXIV.1608.06993},
      url = {https://arxiv.org/abs/1608.06993},
      author = {Huang, Gao and Liu, Zhuang and van der Maaten, Laurens and Weinberger, Kilian Q.},
      keywords = {Computer Vision and Pattern Recognition (cs.CV), Machine Learning (cs.LG), FOS: Computer and information sciences, FOS: Computer and information sciences},
      title = {Densely Connected Convolutional Networks},
      publisher = {arXiv},
      year = {2016},
      copyright = {arXiv.org perpetual, non-exclusive license}
}

EfficientFormer

EfficientFormer: Vision Transformers at MobileNet Speed

Abstract

Vision Transformers (ViT) have shown rapid progress in computer vision tasks, achieving promising results on various benchmarks. However, due to the massive number of parameters and model design, e.g., attention mechanism, ViT-based models are generally times slower than lightweight convolutional networks. Therefore, the deployment of ViT for real-time applications is particularly challenging, especially on resource-constrained hardware such as mobile devices. Recent efforts try to reduce the computation complexity of ViT through network architecture search or hybrid design with MobileNet block, yet the inference speed is still unsatisfactory. This leads to an important question: can transformers run as fast as MobileNet while obtaining high performance? To answer this, we first revisit the network architecture and operators used in ViT-based models and identify inefficient designs. Then we introduce a dimension-consistent pure transformer (without MobileNet blocks) as a design paradigm. Finally, we perform latency-driven slimming to get a series of final models dubbed EfficientFormer. Extensive experiments show the superiority of EfficientFormer in performance and speed on mobile devices. Our fastest model, EfficientFormer-L1, achieves 79.2% top-1 accuracy on ImageNet-1K with only 1.6 ms inference latency on iPhone 12 (compiled with CoreML), which runs as fast as MobileNetV2×1.4 (1.6 ms, 74.7% top-1), and our largest model, EfficientFormer-L7, obtains 83.3% accuracy with only 7.0 ms latency. Our work proves that properly designed transformers can reach extremely low latency on mobile devices while maintaining high performance.

Results and models

ImageNet-1k

| Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
|---|---|---|---|---|---|---|
| EfficientFormer-l1* | 12.19 | 1.30 | 80.46 | 94.99 | config | model |
| EfficientFormer-l3* | 31.41 | 3.93 | 82.45 | 96.18 | config | model |
| EfficientFormer-l7* | 82.23 | 10.16 | 83.40 | 96.60 | config | model |

Models with * are converted from the official repo. The config files of these models are only for inference. We don’t ensure these config files’ training accuracy and welcome you to contribute your reproduction results.

Citation

@misc{https://doi.org/10.48550/arxiv.2206.01191,
  doi = {10.48550/ARXIV.2206.01191},
  url = {https://arxiv.org/abs/2206.01191},
  author = {Li, Yanyu and Yuan, Geng and Wen, Yang and Hu, Eric and Evangelidis, Georgios and Tulyakov, Sergey and Wang, Yanzhi and Ren, Jian},
  keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences, FOS: Computer and information sciences},
  title = {EfficientFormer: Vision Transformers at MobileNet Speed},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}

EfficientNet

EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

Abstract

Convolutional Neural Networks (ConvNets) are commonly developed at a fixed resource budget, and then scaled up for better accuracy if more resources are available. In this paper, we systematically study model scaling and identify that carefully balancing network depth, width, and resolution can lead to better performance. Based on this observation, we propose a new scaling method that uniformly scales all dimensions of depth/width/resolution using a simple yet highly effective compound coefficient. We demonstrate the effectiveness of this method on scaling up MobileNets and ResNet. To go even further, we use neural architecture search to design a new baseline network and scale it up to obtain a family of models, called EfficientNets, which achieve much better accuracy and efficiency than previous ConvNets. In particular, our EfficientNet-B7 achieves state-of-the-art 84.3% top-1 accuracy on ImageNet, while being 8.4x smaller and 6.1x faster on inference than the best existing ConvNet. Our EfficientNets also transfer well and achieve state-of-the-art accuracy on CIFAR-100 (91.7%), Flowers (98.8%), and 3 other transfer learning datasets, with an order of magnitude fewer parameters.

Results and models

ImageNet-1k

In the result table, AA means trained with AutoAugment pre-processing (more details can be found in the AutoAugment paper), and AdvProp means trained with adversarial examples (more details can be found in the AdvProp paper).

Note: In MMClassification, we support training with AutoAugment, but do not support AdvProp for now.
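
As a rough illustration of what AutoAugment pre-processing looks like in an MMClassification training pipeline, the sketch below wires an AutoAugment transform with two made-up sub-policies into a standard pipeline. The sub-policies and their argument names are illustrative assumptions only; the full ImageNet policy ships with the repository's base configs, and the exact transform interfaces should be checked against mmcls.datasets.pipelines.

# Illustrative only: two toy sub-policies, not the full ImageNet AutoAugment policy.
policies = [
    [dict(type='Posterize', bits=4, prob=0.4), dict(type='Rotate', angle=30., prob=0.6)],
    [dict(type='Solarize', thr=110, prob=0.6), dict(type='Equalize', prob=0.8)],
]

img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)

train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='RandomResizedCrop', size=224),
    dict(type='RandomFlip', flip_prob=0.5, direction='horizontal'),
    dict(type='AutoAugment', policies=policies),  # applied before normalization
    dict(type='Normalize', **img_norm_cfg),
    dict(type='ImageToTensor', keys=['img']),
    dict(type='ToTensor', keys=['gt_label']),
    dict(type='Collect', keys=['img', 'gt_label']),
]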

| Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
|---|---|---|---|---|---|---|
| EfficientNet-B0* | 5.29 | 0.02 | 76.74 | 93.17 | config | model |
| EfficientNet-B0 (AA)* | 5.29 | 0.02 | 77.26 | 93.41 | config | model |
| EfficientNet-B0 (AA + AdvProp)* | 5.29 | 0.02 | 77.53 | 93.61 | config | model |
| EfficientNet-B1* | 7.79 | 0.03 | 78.68 | 94.28 | config | model |
| EfficientNet-B1 (AA)* | 7.79 | 0.03 | 79.20 | 94.42 | config | model |
| EfficientNet-B1 (AA + AdvProp)* | 7.79 | 0.03 | 79.52 | 94.43 | config | model |
| EfficientNet-B2* | 9.11 | 0.03 | 79.64 | 94.80 | config | model |
| EfficientNet-B2 (AA)* | 9.11 | 0.03 | 80.21 | 94.96 | config | model |
| EfficientNet-B2 (AA + AdvProp)* | 9.11 | 0.03 | 80.45 | 95.07 | config | model |
| EfficientNet-B3* | 12.23 | 0.06 | 81.01 | 95.34 | config | model |
| EfficientNet-B3 (AA)* | 12.23 | 0.06 | 81.58 | 95.67 | config | model |
| EfficientNet-B3 (AA + AdvProp)* | 12.23 | 0.06 | 81.81 | 95.69 | config | model |
| EfficientNet-B4* | 19.34 | 0.12 | 82.57 | 96.09 | config | model |
| EfficientNet-B4 (AA)* | 19.34 | 0.12 | 82.95 | 96.26 | config | model |
| EfficientNet-B4 (AA + AdvProp)* | 19.34 | 0.12 | 83.25 | 96.44 | config | model |
| EfficientNet-B5* | 30.39 | 0.24 | 83.18 | 96.47 | config | model |
| EfficientNet-B5 (AA)* | 30.39 | 0.24 | 83.82 | 96.76 | config | model |
| EfficientNet-B5 (AA + AdvProp)* | 30.39 | 0.24 | 84.21 | 96.98 | config | model |
| EfficientNet-B6 (AA)* | 43.04 | 0.41 | 84.05 | 96.82 | config | model |
| EfficientNet-B6 (AA + AdvProp)* | 43.04 | 0.41 | 84.74 | 97.14 | config | model |
| EfficientNet-B7 (AA)* | 66.35 | 0.72 | 84.38 | 96.88 | config | model |
| EfficientNet-B7 (AA + AdvProp)* | 66.35 | 0.72 | 85.14 | 97.23 | config | model |
| EfficientNet-B8 (AA + AdvProp)* | 87.41 | 1.09 | 85.38 | 97.28 | config | model |

Models with * are converted from the official repo. The config files of these models are only for inference. We don’t ensure these config files’ training accuracy and welcome you to contribute your reproduction results.

Citation

@inproceedings{tan2019efficientnet,
  title={Efficientnet: Rethinking model scaling for convolutional neural networks},
  author={Tan, Mingxing and Le, Quoc},
  booktitle={International Conference on Machine Learning},
  pages={6105--6114},
  year={2019},
  organization={PMLR}
}

HorNet

HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions

Abstract

Recent progress in vision Transformers exhibits great success in various tasks driven by the new spatial modeling mechanism based on dot-product self-attention. In this paper, we show that the key ingredients behind the vision Transformers, namely input-adaptive, long-range and high-order spatial interactions, can also be efficiently implemented with a convolution-based framework. We present the Recursive Gated Convolution (gnConv) that performs high-order spatial interactions with gated convolutions and recursive designs. The new operation is highly flexible and customizable, which is compatible with various variants of convolution and extends the two-order interactions in self-attention to arbitrary orders without introducing significant extra computation. gnConv can serve as a plug-and-play module to improve various vision Transformers and convolution-based models. Based on the operation, we construct a new family of generic vision backbones named HorNet. Extensive experiments on ImageNet classification, COCO object detection and ADE20K semantic segmentation show HorNet outperform Swin Transformers and ConvNeXt by a significant margin with similar overall architecture and training configurations. HorNet also shows favorable scalability to more training data and a larger model size. Apart from the effectiveness in visual encoders, we also show gnConv can be applied to task-specific decoders and consistently improve dense prediction performance with less computation. Our results demonstrate that gnConv can be a new basic module for visual modeling that effectively combines the merits of both vision Transformers and CNNs. Code is available at https://github.com/raoyongming/HorNet.

Results and models

ImageNet-1k

| Model | Pretrain | resolution | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
|---|---|---|---|---|---|---|---|---|
| HorNet-T* | From scratch | 224x224 | 22.41 | 3.98 | 82.84 | 96.24 | config | model |
| HorNet-T-GF* | From scratch | 224x224 | 22.99 | 3.9 | 82.98 | 96.38 | config | model |
| HorNet-S* | From scratch | 224x224 | 49.53 | 8.83 | 83.79 | 96.75 | config | model |
| HorNet-S-GF* | From scratch | 224x224 | 50.4 | 8.71 | 83.98 | 96.77 | config | model |
| HorNet-B* | From scratch | 224x224 | 87.26 | 15.59 | 84.24 | 96.94 | config | model |
| HorNet-B-GF* | From scratch | 224x224 | 88.42 | 15.42 | 84.32 | 96.95 | config | model |

Models with * are converted from the official repo. The config files of these models are only for validation. We don’t ensure these config files’ training accuracy and welcome you to contribute your reproduction results.

Pre-trained Models

The pre-trained models on ImageNet-21k are used to fine-tune on the downstream tasks.

| Model | Pretrain | resolution | Params(M) | Flops(G) | Download |
|---|---|---|---|---|---|
| HorNet-L* | ImageNet-21k | 224x224 | 194.54 | 34.83 | model |
| HorNet-L-GF* | ImageNet-21k | 224x224 | 196.29 | 34.58 | model |
| HorNet-L-GF384* | ImageNet-21k | 384x384 | 201.23 | 101.63 | model |

Models with * are converted from the official repo.

Citation

@article{rao2022hornet,
  title={HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions},
  author={Rao, Yongming and Zhao, Wenliang and Tang, Yansong and Zhou, Jie and Lim, Ser-Nam and Lu, Jiwen},
  journal={arXiv preprint arXiv:2207.14284},
  year={2022}
}

HRNet

Deep High-Resolution Representation Learning for Visual Recognition

Abstract

High-resolution representations are essential for position-sensitive vision problems, such as human pose estimation, semantic segmentation, and object detection. Existing state-of-the-art frameworks first encode the input image as a low-resolution representation through a subnetwork that is formed by connecting high-to-low resolution convolutions in series (e.g., ResNet, VGGNet), and then recover the high-resolution representation from the encoded low-resolution representation. Instead, our proposed network, named as High-Resolution Network (HRNet), maintains high-resolution representations through the whole process. There are two key characteristics: (i) Connect the high-to-low resolution convolution streams in parallel; (ii) Repeatedly exchange the information across resolutions. The benefit is that the resulting representation is semantically richer and spatially more precise. We show the superiority of the proposed HRNet in a wide range of applications, including human pose estimation, semantic segmentation, and object detection, suggesting that the HRNet is a stronger backbone for computer vision problems.

Results and models

ImageNet-1k

| Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
|---|---|---|---|---|---|---|
| HRNet-W18* | 21.30 | 4.33 | 76.75 | 93.44 | config | model |
| HRNet-W30* | 37.71 | 8.17 | 78.19 | 94.22 | config | model |
| HRNet-W32* | 41.23 | 8.99 | 78.44 | 94.19 | config | model |
| HRNet-W40* | 57.55 | 12.77 | 78.94 | 94.47 | config | model |
| HRNet-W44* | 67.06 | 14.96 | 78.88 | 94.37 | config | model |
| HRNet-W48* | 77.47 | 17.36 | 79.32 | 94.52 | config | model |
| HRNet-W64* | 128.06 | 29.00 | 79.46 | 94.65 | config | model |
| HRNet-W18 (ssld)* | 21.30 | 4.33 | 81.06 | 95.70 | config | model |
| HRNet-W48 (ssld)* | 77.47 | 17.36 | 83.63 | 96.79 | config | model |

Models with * are converted from the official repo. The config files of these models are only for inference. We don’t ensure these config files’ training accuracy and welcome you to contribute your reproduction results.

Citation

@article{WangSCJDZLMTWLX19,
  title={Deep High-Resolution Representation Learning for Visual Recognition},
  author={Jingdong Wang and Ke Sun and Tianheng Cheng and
          Borui Jiang and Chaorui Deng and Yang Zhao and Dong Liu and Yadong Mu and
          Mingkui Tan and Xinggang Wang and Wenyu Liu and Bin Xiao},
  journal = {TPAMI},
  year={2019}
}

Mlp-Mixer

MLP-Mixer: An all-MLP Architecture for Vision

Abstract

Convolutional Neural Networks (CNNs) are the go-to model for computer vision. Recently, attention-based networks, such as the Vision Transformer, have also become popular. In this paper we show that while convolutions and attention are both sufficient for good performance, neither of them are necessary. We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs). MLP-Mixer contains two types of layers: one with MLPs applied independently to image patches (i.e. “mixing” the per-location features), and one with MLPs applied across patches (i.e. “mixing” spatial information). When trained on large datasets, or with modern regularization schemes, MLP-Mixer attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models. We hope that these results spark further research beyond the realms of well established CNNs and Transformers.

Results and models

ImageNet-1k

| Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
|---|---|---|---|---|---|---|
| Mixer-B/16* | 59.88 | 12.61 | 76.68 | 92.25 | config | model |
| Mixer-L/16* | 208.2 | 44.57 | 72.34 | 88.02 | config | model |

Models with * are converted from timm. The config files of these models are only for validation. We don’t ensure these config files’ training accuracy and welcome you to contribute your reproduction results.

Citation

@misc{tolstikhin2021mlpmixer,
      title={MLP-Mixer: An all-MLP Architecture for Vision},
      author={Ilya Tolstikhin and Neil Houlsby and Alexander Kolesnikov and Lucas Beyer and Xiaohua Zhai and Thomas Unterthiner and Jessica Yung and Andreas Steiner and Daniel Keysers and Jakob Uszkoreit and Mario Lucic and Alexey Dosovitskiy},
      year={2021},
      eprint={2105.01601},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

MobileNet V2

MobileNetV2: Inverted Residuals and Linear Bottlenecks

Abstract

In this paper we describe a new mobile architecture, MobileNetV2, that improves the state of the art performance of mobile models on multiple tasks and benchmarks as well as across a spectrum of different model sizes. We also describe efficient ways of applying these mobile models to object detection in a novel framework we call SSDLite. Additionally, we demonstrate how to build mobile semantic segmentation models through a reduced form of DeepLabv3 which we call Mobile DeepLabv3.

The MobileNetV2 architecture is based on an inverted residual structure where the input and output of the residual block are thin bottleneck layers opposite to traditional residual models which use expanded representations in the input. MobileNetV2 uses lightweight depthwise convolutions to filter features in the intermediate expansion layer. Additionally, we find that it is important to remove non-linearities in the narrow layers in order to maintain representational power. We demonstrate that this improves performance and provide an intuition that led to this design. Finally, our approach allows decoupling of the input/output domains from the expressiveness of the transformation, which provides a convenient framework for further analysis. We measure our performance on Imagenet classification, COCO object detection, VOC image segmentation. We evaluate the trade-offs between accuracy, and number of operations measured by multiply-adds (MAdd), as well as the number of parameters.

Results and models

ImageNet-1k

| Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
|---|---|---|---|---|---|---|
| MobileNet V2 | 3.5 | 0.319 | 71.86 | 90.42 | config | model \| log |

Citation

@INPROCEEDINGS{8578572,
  author={M. {Sandler} and A. {Howard} and M. {Zhu} and A. {Zhmoginov} and L. {Chen}},
  booktitle={2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  title={MobileNetV2: Inverted Residuals and Linear Bottlenecks},
  year={2018},
  volume={},
  number={},
  pages={4510-4520},
  doi={10.1109/CVPR.2018.00474}
}

MobileNet V3

Searching for MobileNetV3

Abstract

We present the next generation of MobileNets based on a combination of complementary search techniques as well as a novel architecture design. MobileNetV3 is tuned to mobile phone CPUs through a combination of hardware-aware network architecture search (NAS) complemented by the NetAdapt algorithm and then subsequently improved through novel architecture advances. This paper starts the exploration of how automated search algorithms and network design can work together to harness complementary approaches improving the overall state of the art. Through this process we create two new MobileNet models for release: MobileNetV3-Large and MobileNetV3-Small which are targeted for high and low resource use cases. These models are then adapted and applied to the tasks of object detection and semantic segmentation. For the task of semantic segmentation (or any dense pixel prediction), we propose a new efficient segmentation decoder Lite Reduced Atrous Spatial Pyramid Pooling (LR-ASPP). We achieve new state of the art results for mobile classification, detection and segmentation. MobileNetV3-Large is 3.2% more accurate on ImageNet classification while reducing latency by 15% compared to MobileNetV2. MobileNetV3-Small is 4.6% more accurate while reducing latency by 5% compared to MobileNetV2. MobileNetV3-Large detection is 25% faster at roughly the same accuracy as MobileNetV2 on COCO detection. MobileNetV3-Large LR-ASPP is 30% faster than MobileNetV2 R-ASPP at similar accuracy for Cityscapes segmentation.

Results and models

ImageNet-1k

| Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
|---|---|---|---|---|---|---|
| MobileNetV3-Small* | 2.54 | 0.06 | 67.66 | 87.41 | config | model |
| MobileNetV3-Large* | 5.48 | 0.23 | 74.04 | 91.34 | config | model |

Models with * are converted from torchvision. The config files of these models are only for validation. We don’t ensure these config files’ training accuracy and welcome you to contribute your reproduction results.

Citation

@inproceedings{Howard_2019_ICCV,
    author = {Howard, Andrew and Sandler, Mark and Chu, Grace and Chen, Liang-Chieh and Chen, Bo and Tan, Mingxing and Wang, Weijun and Zhu, Yukun and Pang, Ruoming and Vasudevan, Vijay and Le, Quoc V. and Adam, Hartwig},
    title = {Searching for MobileNetV3},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month = {October},
    year = {2019}
}

MViT V2

MViTv2: Improved Multiscale Vision Transformers for Classification and Detection

Abstract

In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for image and video classification, as well as object detection. We present an improved version of MViT that incorporates decomposed relative positional embeddings and residual pooling connections. We instantiate this architecture in five sizes and evaluate it for ImageNet classification, COCO detection and Kinetics video recognition where it outperforms prior work. We further compare MViTv2s’ pooling attention to window attention mechanisms where it outperforms the latter in accuracy/compute. Without bells-and-whistles, MViTv2 has state-of-the-art performance in 3 domains: 88.8% accuracy on ImageNet classification, 58.7 boxAP on COCO object detection as well as 86.1% on Kinetics-400 video classification.

Results and models

ImageNet-1k

| Model | Pretrain | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
|---|---|---|---|---|---|---|---|
| MViTv2-tiny* | From scratch | 24.17 | 4.70 | 82.33 | 96.15 | config | model |
| MViTv2-small* | From scratch | 34.87 | 7.00 | 83.63 | 96.51 | config | model |
| MViTv2-base* | From scratch | 51.47 | 10.20 | 84.34 | 96.86 | config | model |
| MViTv2-large* | From scratch | 217.99 | 42.10 | 85.25 | 97.14 | config | model |

Models with * are converted from the official repo. The config files of these models are only for inference. We don’t ensure these config files’ training accuracy and welcome you to contribute your reproduction results.

Citation

@inproceedings{li2021improved,
  title={MViTv2: Improved multiscale vision transformers for classification and detection},
  author={Li, Yanghao and Wu, Chao-Yuan and Fan, Haoqi and Mangalam, Karttikeya and Xiong, Bo and Malik, Jitendra and Feichtenhofer, Christoph},
  booktitle={CVPR},
  year={2022}
}

PoolFormer

MetaFormer is Actually What You Need for Vision

Abstract

Transformers have shown great potential in computer vision tasks. A common belief is their attention-based token mixer module contributes most to their competence. However, recent works show the attention-based module in transformers can be replaced by spatial MLPs and the resulted models still perform quite well. Based on this observation, we hypothesize that the general architecture of the transformers, instead of the specific token mixer module, is more essential to the model’s performance. To verify this, we deliberately replace the attention module in transformers with an embarrassingly simple spatial pooling operator to conduct only basic token mixing. Surprisingly, we observe that the derived model, termed as PoolFormer, achieves competitive performance on multiple computer vision tasks. For example, on ImageNet-1K, PoolFormer achieves 82.1% top-1 accuracy, surpassing well-tuned vision transformer/MLP-like baselines DeiT-B/ResMLP-B24 by 0.3%/1.1% accuracy with 35%/52% fewer parameters and 49%/61% fewer MACs. The effectiveness of PoolFormer verifies our hypothesis and urges us to initiate the concept of “MetaFormer”, a general architecture abstracted from transformers without specifying the token mixer. Based on the extensive experiments, we argue that MetaFormer is the key player in achieving superior results for recent transformer and MLP-like models on vision tasks. This work calls for more future research dedicated to improving MetaFormer instead of focusing on the token mixer modules. Additionally, our proposed PoolFormer could serve as a starting baseline for future MetaFormer architecture design.

Results and models

ImageNet-1k

| Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
|---|---|---|---|---|---|---|
| PoolFormer-S12* | 11.92 | 1.87 | 77.24 | 93.51 | config | model |
| PoolFormer-S24* | 21.39 | 3.51 | 80.33 | 95.05 | config | model |
| PoolFormer-S36* | 30.86 | 5.15 | 81.43 | 95.45 | config | model |
| PoolFormer-M36* | 56.17 | 8.96 | 82.14 | 95.71 | config | model |
| PoolFormer-M48* | 73.47 | 11.80 | 82.51 | 95.95 | config | model |

Models with * are converted from the official repo. The config files of these models are only for inference. We don’t ensure these config files’ training accuracy and welcome you to contribute your reproduction results.

Citation

@article{yu2021metaformer,
  title={MetaFormer is Actually What You Need for Vision},
  author={Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng},
  journal={arXiv preprint arXiv:2111.11418},
  year={2021}
}

RegNet

Designing Network Design Spaces

Abstract

In this work, we present a new network design paradigm. Our goal is to help advance the understanding of network design and discover design principles that generalize across settings. Instead of focusing on designing individual network instances, we design network design spaces that parametrize populations of networks. The overall process is analogous to classic manual design of networks, but elevated to the design space level. Using our methodology we explore the structure aspect of network design and arrive at a low-dimensional design space consisting of simple, regular networks that we call RegNet. The core insight of the RegNet parametrization is surprisingly simple: widths and depths of good networks can be explained by a quantized linear function. We analyze the RegNet design space and arrive at interesting findings that do not match the current practice of network design. The RegNet design space provides simple and fast networks that work well across a wide range of flop regimes. Under comparable training settings and flops, the RegNet models outperform the popular EfficientNet models while being up to 5x faster on GPUs.

Results and models

ImageNet-1k

| Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
|---|---|---|---|---|---|---|
| RegNetX-400MF | 5.16 | 0.41 | 72.56 | 90.78 | config | model \| log |
| RegNetX-800MF | 7.26 | 0.81 | 74.76 | 92.32 | config | model \| log |
| RegNetX-1.6GF | 9.19 | 1.63 | 76.84 | 93.31 | config | model \| log |
| RegNetX-3.2GF | 15.3 | 3.21 | 78.09 | 94.08 | config | model \| log |
| RegNetX-4.0GF | 22.12 | 4.0 | 78.60 | 94.17 | config | model \| log |
| RegNetX-6.4GF | 26.21 | 6.51 | 79.38 | 94.65 | config | model \| log |
| RegNetX-8.0GF | 39.57 | 8.03 | 79.12 | 94.51 | config | model \| log |
| RegNetX-12GF | 46.11 | 12.15 | 79.67 | 95.03 | config | model \| log |
| RegNetX-400MF* | 5.16 | 0.41 | 72.55 | 90.91 | config | model |
| RegNetX-800MF* | 7.26 | 0.81 | 75.21 | 92.37 | config | model |
| RegNetX-1.6GF* | 9.19 | 1.63 | 77.04 | 93.51 | config | model |
| RegNetX-3.2GF* | 15.3 | 3.21 | 78.26 | 94.20 | config | model |
| RegNetX-4.0GF* | 22.12 | 4.0 | 78.72 | 94.22 | config | model |
| RegNetX-6.4GF* | 26.21 | 6.51 | 79.22 | 94.61 | config | model |
| RegNetX-8.0GF* | 39.57 | 8.03 | 79.31 | 94.57 | config | model |
| RegNetX-12GF* | 46.11 | 12.15 | 79.91 | 94.78 | config | model |

Models with * are converted from pycls. The config files of these models are only for validation.

Citation

@article{radosavovic2020designing,
    title={Designing Network Design Spaces},
    author={Ilija Radosavovic and Raj Prateek Kosaraju and Ross Girshick and Kaiming He and Piotr Dollár},
    year={2020},
    eprint={2003.13678},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

RepMLP

RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition

Abstract

We propose RepMLP, a multi-layer-perceptron-style neural network building block for image recognition, which is composed of a series of fully-connected (FC) layers. Compared to convolutional layers, FC layers are more efficient, better at modeling the long-range dependencies and positional patterns, but worse at capturing the local structures, hence usually less favored for image recognition. We propose a structural re-parameterization technique that adds local prior into an FC to make it powerful for image recognition. Specifically, we construct convolutional layers inside a RepMLP during training and merge them into the FC for inference. On CIFAR, a simple pure-MLP model shows performance very close to CNN. By inserting RepMLP in traditional CNN, we improve ResNets by 1.8% accuracy on ImageNet, 2.9% for face recognition, and 2.3% mIoU on Cityscapes with lower FLOPs. Our intriguing findings highlight that combining the global representational capacity and positional perception of FC with the local prior of convolution can improve the performance of neural network with faster speed on both the tasks with translation invariance (e.g., semantic segmentation) and those with aligned images and positional patterns (e.g., face recognition).

Results and models

ImageNet-1k

| Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
|---|---|---|---|---|---|---|
| RepMLP-B224* | 68.24 | 6.71 | 80.41 | 95.12 | train_cfg \| deploy_cfg | model |
| RepMLP-B256* | 96.45 | 9.69 | 81.11 | 95.5 | train_cfg \| deploy_cfg | model |

Models with * are converted from the official repo. The config files of these models are only for validation. We don’t ensure these config files’ training accuracy and welcome you to contribute your reproduction results.

How to use

The checkpoints provided are all training-time models. Use the reparameterize tool to switch them to the more efficient inference-time architecture, which not only has fewer parameters but also requires less computation.

Use tool

Use the provided tool to reparameterize the given model and save the checkpoint:

python tools/convert_models/reparameterize_model.py ${CFG_PATH} ${SRC_CKPT_PATH} ${TARGET_CKPT_PATH}

${CFG_PATH} is the config file, ${SRC_CKPT_PATH} is the source checkpoint file, and ${TARGET_CKPT_PATH} is the target deploy weight file path.

To use the reparameterized weights, you must switch to the corresponding deploy config file.

python tools/test.py ${Deploy_CFG} ${Deploy_Checkpoint} --metrics accuracy

In the code

Use backbone.switch_to_deploy() or classifier.backbone.switch_to_deploy() to switch to deploy mode. For example:

from mmcls.models import build_backbone

backbone_cfg=dict(type='RepMLPNet', arch='B', img_size=224, reparam_conv_kernels=(1, 3), deploy=False)
backbone = build_backbone(backbone_cfg)
backbone.switch_to_deploy()

or

from mmcls.models import build_classifier

cfg = dict(
    type='ImageClassifier',
    backbone=dict(
        type='RepMLPNet',
        arch='B',
        img_size=224,
        reparam_conv_kernels=(1, 3),
        deploy=False),
    neck=dict(type='GlobalAveragePooling'),
    head=dict(
        type='LinearClsHead',
        num_classes=1000,
        in_channels=768,
        loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
        topk=(1, 5),
    ))

classifier = build_classifier(cfg)
classifier.backbone.switch_to_deploy()

Citation

@article{ding2021repmlp,
  title={Repmlp: Re-parameterizing convolutions into fully-connected layers for image recognition},
  author={Ding, Xiaohan and Xia, Chunlong and Zhang, Xiangyu and Chu, Xiaojie and Han, Jungong and Ding, Guiguang},
  journal={arXiv preprint arXiv:2105.01883},
  year={2021}
}

RepVGG

RepVGG: Making VGG-style ConvNets Great Again

Abstract

We present a simple but powerful architecture of convolutional neural network, which has a VGG-like inference-time body composed of nothing but a stack of 3x3 convolution and ReLU, while the training-time model has a multi-branch topology. Such decoupling of the training-time and inference-time architecture is realized by a structural re-parameterization technique so that the model is named RepVGG. On ImageNet, RepVGG reaches over 80% top-1 accuracy, which is the first time for a plain model, to the best of our knowledge. On NVIDIA 1080Ti GPU, RepVGG models run 83% faster than ResNet-50 or 101% faster than ResNet-101 with higher accuracy and show favorable accuracy-speed trade-off compared to the state-of-the-art models like EfficientNet and RegNet.

Results and models

ImageNet-1k

| Model | Epochs | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
|---|---|---|---|---|---|---|---|
| RepVGG-A0* | 120 | 9.11 (train) \| 8.31 (deploy) | 1.52 (train) \| 1.36 (deploy) | 72.41 | 90.50 | config (train) \| config (deploy) | model |
| RepVGG-A1* | 120 | 14.09 (train) \| 12.79 (deploy) | 2.64 (train) \| 2.37 (deploy) | 74.47 | 91.85 | config (train) \| config (deploy) | model |
| RepVGG-A2* | 120 | 28.21 (train) \| 25.5 (deploy) | 5.7 (train) \| 5.12 (deploy) | 76.48 | 93.01 | config (train) \| config (deploy) | model |
| RepVGG-B0* | 120 | 15.82 (train) \| 14.34 (deploy) | 3.42 (train) \| 3.06 (deploy) | 75.14 | 92.42 | config (train) \| config (deploy) | model |
| RepVGG-B1* | 120 | 57.42 (train) \| 51.83 (deploy) | 13.16 (train) \| 11.82 (deploy) | 78.37 | 94.11 | config (train) \| config (deploy) | model |
| RepVGG-B1g2* | 120 | 45.78 (train) \| 41.36 (deploy) | 9.82 (train) \| 8.82 (deploy) | 77.79 | 93.88 | config (train) \| config (deploy) | model |
| RepVGG-B1g4* | 120 | 39.97 (train) \| 36.13 (deploy) | 8.15 (train) \| 7.32 (deploy) | 77.58 | 93.84 | config (train) \| config (deploy) | model |
| RepVGG-B2* | 120 | 89.02 (train) \| 80.32 (deploy) | 20.46 (train) \| 18.39 (deploy) | 78.78 | 94.42 | config (train) \| config (deploy) | model |
| RepVGG-B2g4* | 200 | 61.76 (train) \| 55.78 (deploy) | 12.63 (train) \| 11.34 (deploy) | 79.38 | 94.68 | config (train) \| config (deploy) | model |
| RepVGG-B3* | 200 | 123.09 (train) \| 110.96 (deploy) | 29.17 (train) \| 26.22 (deploy) | 80.52 | 95.26 | config (train) \| config (deploy) | model |
| RepVGG-B3g4* | 200 | 83.83 (train) \| 75.63 (deploy) | 17.9 (train) \| 16.08 (deploy) | 80.22 | 95.10 | config (train) \| config (deploy) | model |
| RepVGG-D2se* | 200 | 133.33 (train) \| 120.39 (deploy) | 36.56 (train) \| 32.85 (deploy) | 81.81 | 95.94 | config (train) \| config (deploy) | model |

Models with * are converted from the official repo. The config files of these models are only for validation. We don’t ensure these config files’ training accuracy and welcome you to contribute your reproduction results.

How to use

The checkpoints provided are all training-time models. Use the reparameterize tool to switch them to the more efficient inference-time architecture, which not only has fewer parameters but also requires less computation.

Use tool

Use the provided tool to reparameterize the given model and save the checkpoint:

python tools/convert_models/reparameterize_model.py ${CFG_PATH} ${SRC_CKPT_PATH} ${TARGET_CKPT_PATH}

${CFG_PATH} is the config file, ${SRC_CKPT_PATH} is the source checkpoint file, and ${TARGET_CKPT_PATH} is the target deploy weight file path.

To use the reparameterized weights, you must switch to the corresponding deploy config file.

python tools/test.py ${Deploy_CFG} ${Deploy_Checkpoint} --metrics accuracy

In the code

Use backbone.switch_to_deploy() or classifier.backbone.switch_to_deploy() to switch to deploy mode. For example:

from mmcls.models import build_backbone

backbone_cfg = dict(type='RepVGG', arch='A0')
backbone = build_backbone(backbone_cfg)
backbone.switch_to_deploy()

or

from mmcls.models import build_classifier

cfg = dict(
    type='ImageClassifier',
    backbone=dict(
        type='RepVGG',
        arch='A0'),
    neck=dict(type='GlobalAveragePooling'),
    head=dict(
        type='LinearClsHead',
        num_classes=1000,
        in_channels=1280,
        loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
        topk=(1, 5),
    ))

classifier = build_classifier(cfg)
classifier.backbone.switch_to_deploy()
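
Since re-parameterization only folds the multi-branch blocks into single convolutions, the deploy-mode backbone should produce (numerically) the same outputs as the training-time one. A minimal sanity-check sketch, assuming a randomly initialized backbone evaluated in eval mode so that batch norm uses its running statistics:

import torch
from mmcls.models import build_backbone

backbone = build_backbone(dict(type='RepVGG', arch='A0'))
backbone.eval()  # BN must use running statistics, otherwise the fusion is not exact

x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    outs_train = backbone(x)       # training-time (multi-branch) architecture
    backbone.switch_to_deploy()
    outs_deploy = backbone(x)      # inference-time (single-branch) architecture

# outputs are returned as a tuple of feature maps; they should match up to numerical precision
for a, b in zip(outs_train, outs_deploy):
    assert torch.allclose(a, b, atol=1e-5)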

Citation

@inproceedings{ding2021repvgg,
  title={Repvgg: Making vgg-style convnets great again},
  author={Ding, Xiaohan and Zhang, Xiangyu and Ma, Ningning and Han, Jungong and Ding, Guiguang and Sun, Jian},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={13733--13742},
  year={2021}
}

Res2Net

Res2Net: A New Multi-scale Backbone Architecture

Abstract

Representing features at multiple scales is of great importance for numerous vision tasks. Recent advances in backbone convolutional neural networks (CNNs) continually demonstrate stronger multi-scale representation ability, leading to consistent performance gains on a wide range of applications. However, most existing methods represent the multi-scale features in a layer-wise manner. In this paper, we propose a novel building block for CNNs, namely Res2Net, by constructing hierarchical residual-like connections within one single residual block. The Res2Net represents multi-scale features at a granular level and increases the range of receptive fields for each network layer. The proposed Res2Net block can be plugged into the state-of-the-art backbone CNN models, e.g., ResNet, ResNeXt, and DLA. We evaluate the Res2Net block on all these models and demonstrate consistent performance gains over baseline models on widely-used datasets, e.g., CIFAR-100 and ImageNet. Further ablation studies and experimental results on representative computer vision tasks, i.e., object detection, class activation mapping, and salient object detection, further verify the superiority of the Res2Net over the state-of-the-art baseline methods.

Results and models

ImageNet-1k

| Model | resolution | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Res2Net-50-14w-8s* | 224x224 | 25.06 | 4.22 | 78.14 | 93.85 | config | model \| log |
| Res2Net-50-26w-8s* | 224x224 | 48.40 | 8.39 | 79.20 | 94.36 | config | model \| log |
| Res2Net-101-26w-4s* | 224x224 | 45.21 | 8.12 | 79.19 | 94.44 | config | model \| log |

Models with * are converted from the official repo. The config files of these models are only for validation. We don’t ensure these config files’ training accuracy and welcome you to contribute your reproduction results.

Citation

@article{gao2019res2net,
  title={Res2Net: A New Multi-scale Backbone Architecture},
  author={Gao, Shang-Hua and Cheng, Ming-Ming and Zhao, Kai and Zhang, Xin-Yu and Yang, Ming-Hsuan and Torr, Philip},
  journal={IEEE TPAMI},
  year={2021},
  doi={10.1109/TPAMI.2019.2938758},
}

ResNet

Deep Residual Learning for Image Recognition

Abstract

Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers.

The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

Results and models

The pre-trained models on ImageNet-21k are used to fine-tune, and therefore don’t have evaluation results.

| Model | resolution | Params(M) | Flops(G) | Download |
| --- | --- | --- | --- | --- |
| ResNet-50-mill | 224x224 | 86.74 | 15.14 | model |

The “mill” suffix means using the multi-label pre-trained weights from ImageNet-21K Pretraining for the Masses.

Cifar10

| Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
| --- | --- | --- | --- | --- | --- | --- |
| ResNet-18 | 11.17 | 0.56 | 94.82 | 99.87 | config | model \| log |
| ResNet-34 | 21.28 | 1.16 | 95.34 | 99.87 | config | model \| log |
| ResNet-50 | 23.52 | 1.31 | 95.55 | 99.91 | config | model \| log |
| ResNet-101 | 42.51 | 2.52 | 95.58 | 99.87 | config | model \| log |
| ResNet-152 | 58.16 | 3.74 | 95.76 | 99.89 | config | model \| log |

Cifar100

| Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
| --- | --- | --- | --- | --- | --- | --- |
| ResNet-50 | 23.71 | 1.31 | 79.90 | 95.19 | config | model \| log |

ImageNet-1k

| Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
| --- | --- | --- | --- | --- | --- | --- |
| ResNet-18 | 11.69 | 1.82 | 69.90 | 89.43 | config | model \| log |
| ResNet-34 | 21.8 | 3.68 | 73.62 | 91.59 | config | model \| log |
| ResNet-50 | 25.56 | 4.12 | 76.55 | 93.06 | config | model \| log |
| ResNet-101 | 44.55 | 7.85 | 77.97 | 94.06 | config | model \| log |
| ResNet-152 | 60.19 | 11.58 | 78.48 | 94.13 | config | model \| log |
| ResNetV1C-50 | 25.58 | 4.36 | 77.01 | 93.58 | config | model \| log |
| ResNetV1C-101 | 44.57 | 8.09 | 78.30 | 94.27 | config | model \| log |
| ResNetV1C-152 | 60.21 | 11.82 | 78.76 | 94.41 | config | model \| log |
| ResNetV1D-50 | 25.58 | 4.36 | 77.54 | 93.57 | config | model \| log |
| ResNetV1D-101 | 44.57 | 8.09 | 78.93 | 94.48 | config | model \| log |
| ResNetV1D-152 | 60.21 | 11.82 | 79.41 | 94.70 | config | model \| log |
| ResNet-50 (fp16) | 25.56 | 4.12 | 76.30 | 93.07 | config | model \| log |
| Wide-ResNet-50* | 68.88 | 11.44 | 78.48 | 94.08 | config | model |
| Wide-ResNet-101* | 126.89 | 22.81 | 78.84 | 94.28 | config | model |
| ResNet-50 (rsb-a1) | 25.56 | 4.12 | 80.12 | 94.78 | config | model \| log |
| ResNet-50 (rsb-a2) | 25.56 | 4.12 | 79.55 | 94.37 | config | model \| log |
| ResNet-50 (rsb-a3) | 25.56 | 4.12 | 78.30 | 93.80 | config | model \| log |

The “rsb” means using the training settings from ResNet strikes back: An improved training procedure in timm.

Models with * are converted from the official repo. The config files of these models are only for validation. We don’t ensure these config files’ training accuracy and welcome you to contribute your reproduction results.

CUB-200-2011

| Model | Pretrain | resolution | Params(M) | Flops(G) | Top-1 (%) | Config | Download |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ResNet-50 | ImageNet-21k-mill | 448x448 | 23.92 | 16.48 | 88.45 | config | model \| log |

Stanford-Cars

| Model | Pretrain | resolution | Params(M) | Flops(G) | Top-1 (%) | Config | Download |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ResNet-50 | ImageNet-21k-mill | 448x448 | 23.92 | 16.48 | 92.82 | config | model \| log |

Citation

@inproceedings{he2016deep,
  title={Deep residual learning for image recognition},
  author={He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian},
  booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition},
  pages={770--778},
  year={2016}
}

ResNeXt

Aggregated Residual Transformations for Deep Neural Networks

Abstract

We present a simple, highly modularized network architecture for image classification. Our network is constructed by repeating a building block that aggregates a set of transformations with the same topology. Our simple design results in a homogeneous, multi-branch architecture that has only a few hyper-parameters to set. This strategy exposes a new dimension, which we call “cardinality” (the size of the set of transformations), as an essential factor in addition to the dimensions of depth and width. On the ImageNet-1K dataset, we empirically show that even under the restricted condition of maintaining complexity, increasing cardinality is able to improve classification accuracy. Moreover, increasing cardinality is more effective than going deeper or wider when we increase the capacity. Our models, named ResNeXt, are the foundations of our entry to the ILSVRC 2016 classification task in which we secured 2nd place. We further investigate ResNeXt on an ImageNet-5K set and the COCO detection set, also showing better results than its ResNet counterpart. The code and models are publicly available online.

Results and models

ImageNet-1k

| Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
| --- | --- | --- | --- | --- | --- | --- |
| ResNeXt-32x4d-50 | 25.03 | 4.27 | 77.90 | 93.66 | config | model \| log |
| ResNeXt-32x4d-101 | 44.18 | 8.03 | 78.61 | 94.17 | config | model \| log |
| ResNeXt-32x8d-101 | 88.79 | 16.5 | 79.27 | 94.58 | config | model \| log |
| ResNeXt-32x4d-152 | 59.95 | 11.8 | 78.88 | 94.33 | config | model \| log |

Citation

@inproceedings{xie2017aggregated,
  title={Aggregated residual transformations for deep neural networks},
  author={Xie, Saining and Girshick, Ross and Doll{\'a}r, Piotr and Tu, Zhuowen and He, Kaiming},
  booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition},
  pages={1492--1500},
  year={2017}
}

SE-ResNet

Squeeze-and-Excitation Networks

Abstract

The central building block of convolutional neural networks (CNNs) is the convolution operator, which enables networks to construct informative features by fusing both spatial and channel-wise information within local receptive fields at each layer. A broad range of prior research has investigated the spatial component of this relationship, seeking to strengthen the representational power of a CNN by enhancing the quality of spatial encodings throughout its feature hierarchy. In this work, we focus instead on the channel relationship and propose a novel architectural unit, which we term the “Squeeze-and-Excitation” (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels. We show that these blocks can be stacked together to form SENet architectures that generalise extremely effectively across different datasets. We further demonstrate that SE blocks bring significant improvements in performance for existing state-of-the-art CNNs at slight additional computational cost. Squeeze-and-Excitation Networks formed the foundation of our ILSVRC 2017 classification submission which won first place and reduced the top-5 error to 2.251%, surpassing the winning entry of 2016 by a relative improvement of ~25%.

Results and models

ImageNet-1k

| Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
| --- | --- | --- | --- | --- | --- | --- |
| SE-ResNet-50 | 28.09 | 4.13 | 77.74 | 93.84 | config | model \| log |
| SE-ResNet-101 | 49.33 | 7.86 | 78.26 | 94.07 | config | model \| log |

Citation

@inproceedings{hu2018squeeze,
  title={Squeeze-and-excitation networks},
  author={Hu, Jie and Shen, Li and Sun, Gang},
  booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition},
  pages={7132--7141},
  year={2018}
}

ShuffleNet V1

ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices

Abstract

We introduce an extremely computation-efficient CNN architecture named ShuffleNet, which is designed specially for mobile devices with very limited computing power (e.g., 10-150 MFLOPs). The new architecture utilizes two new operations, pointwise group convolution and channel shuffle, to greatly reduce computation cost while maintaining accuracy. Experiments on ImageNet classification and MS COCO object detection demonstrate the superior performance of ShuffleNet over other structures, e.g. lower top-1 error (absolute 7.8%) than recent MobileNet on ImageNet classification task, under the computation budget of 40 MFLOPs. On an ARM-based mobile device, ShuffleNet achieves ~13x actual speedup over AlexNet while maintaining comparable accuracy.

Results and models

ImageNet-1k

| Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
| --- | --- | --- | --- | --- | --- | --- |
| ShuffleNetV1 1.0x (group=3) | 1.87 | 0.146 | 68.13 | 87.81 | config | model \| log |

Citation

@inproceedings{zhang2018shufflenet,
  title={Shufflenet: An extremely efficient convolutional neural network for mobile devices},
  author={Zhang, Xiangyu and Zhou, Xinyu and Lin, Mengxiao and Sun, Jian},
  booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition},
  pages={6848--6856},
  year={2018}
}

ShuffleNet V2

Shufflenet v2: Practical guidelines for efficient cnn architecture design

Abstract

Currently, the neural network architecture design is mostly guided by the indirect metric of computation complexity, i.e., FLOPs. However, the direct metric, e.g., speed, also depends on the other factors such as memory access cost and platform characterics. Thus, this work proposes to evaluate the direct metric on the target platform, beyond only considering FLOPs. Based on a series of controlled experiments, this work derives several practical guidelines for efficient network design. Accordingly, a new architecture is presented, called ShuffleNet V2. Comprehensive ablation experiments verify that our model is the state-of-the-art in terms of speed and accuracy tradeoff.

Results and models

ImageNet-1k

| Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
| --- | --- | --- | --- | --- | --- | --- |
| ShuffleNetV2 1.0x | 2.28 | 0.149 | 69.55 | 88.92 | config | model \| log |

Citation

@inproceedings{ma2018shufflenet,
  title={Shufflenet v2: Practical guidelines for efficient cnn architecture design},
  author={Ma, Ningning and Zhang, Xiangyu and Zheng, Hai-Tao and Sun, Jian},
  booktitle={Proceedings of the European conference on computer vision (ECCV)},
  pages={116--131},
  year={2018}
}

Swin Transformer

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Abstract

This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with Shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures.

Results and models

ImageNet-21k

The pre-trained models on ImageNet-21k are used to fine-tune, and therefore don’t have evaluation results.

| Model | resolution | Params(M) | Flops(G) | Download |
| --- | --- | --- | --- | --- |
| Swin-B | 224x224 | 86.74 | 15.14 | model |
| Swin-B | 384x384 | 86.88 | 44.49 | model |
| Swin-L | 224x224 | 195.00 | 34.04 | model |
| Swin-L | 384x384 | 195.20 | 100.04 | model |

ImageNet-1k

| Model | Pretrain | resolution | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Swin-T | From scratch | 224x224 | 28.29 | 4.36 | 81.18 | 95.61 | config | model \| log |
| Swin-S | From scratch | 224x224 | 49.61 | 8.52 | 83.02 | 96.29 | config | model \| log |
| Swin-B | From scratch | 224x224 | 87.77 | 15.14 | 83.36 | 96.44 | config | model \| log |
| Swin-S* | From scratch | 224x224 | 49.61 | 8.52 | 83.21 | 96.25 | config | model |
| Swin-B* | From scratch | 224x224 | 87.77 | 15.14 | 83.42 | 96.44 | config | model |
| Swin-B* | From scratch | 384x384 | 87.90 | 44.49 | 84.49 | 96.95 | config | model |
| Swin-B* | ImageNet-21k | 224x224 | 87.77 | 15.14 | 85.16 | 97.50 | config | model |
| Swin-B* | ImageNet-21k | 384x384 | 87.90 | 44.49 | 86.44 | 98.05 | config | model |
| Swin-L* | ImageNet-21k | 224x224 | 196.53 | 34.04 | 86.24 | 97.88 | config | model |
| Swin-L* | ImageNet-21k | 384x384 | 196.74 | 100.04 | 87.25 | 98.25 | config | model |

Models with * are converted from the official repo. The config files of these models are only for validation. We don’t ensure these config files’ training accuracy and welcome you to contribute your reproduction results.

CUB-200-2011

| Model | Pretrain | resolution | Params(M) | Flops(G) | Top-1 (%) | Config | Download |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Swin-L | ImageNet-21k | 384x384 | 195.51 | 100.04 | 91.87 | config | model \| log |

Citation

@article{liu2021Swin,
  title={Swin Transformer: Hierarchical Vision Transformer using Shifted Windows},
  author={Liu, Ze and Lin, Yutong and Cao, Yue and Hu, Han and Wei, Yixuan and Zhang, Zheng and Lin, Stephen and Guo, Baining},
  journal={arXiv preprint arXiv:2103.14030},
  year={2021}
}

Swin Transformer V2

Swin Transformer V2: Scaling Up Capacity and Resolution

Abstract

Large-scale NLP models have been shown to significantly improve the performance on language tasks with no signs of saturation. They also demonstrate amazing few-shot capabilities like that of human beings. This paper aims to explore large-scale models in computer vision. We tackle three major issues in training and application of large vision models, including training instability, resolution gaps between pre-training and fine-tuning, and hunger on labelled data. Three main techniques are proposed: 1) a residual-post-norm method combined with cosine attention to improve training stability; 2) A log-spaced continuous position bias method to effectively transfer models pre-trained using low-resolution images to downstream tasks with high-resolution inputs; 3) A self-supervised pre-training method, SimMIM, to reduce the needs of vast labeled images. Through these techniques, this paper successfully trained a 3 billion-parameter Swin Transformer V2 model, which is the largest dense vision model to date, and makes it capable of training with images of up to 1,536×1,536 resolution. It set new performance records on 4 representative vision tasks, including ImageNet-V2 image classification, COCO object detection, ADE20K semantic segmentation, and Kinetics-400 video action classification. Also note our training is much more efficient than that in Google’s billion-level visual models, which consumes 40 times less labelled data and 40 times less training time.

Results and models

ImageNet-21k

The pre-trained models on ImageNet-21k are used to fine-tune, and therefore don’t have evaluation results.

| Model | resolution | Params(M) | Flops(G) | Download |
| --- | --- | --- | --- | --- |
| Swin-B* | 192x192 | 87.92 | 8.51 | model |
| Swin-L* | 192x192 | 196.74 | 19.04 | model |

ImageNet-1k

| Model | Pretrain | resolution | window | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Swin-T* | From scratch | 256x256 | 8x8 | 28.35 | 4.35 | 81.76 | 95.87 | config | model |
| Swin-T* | From scratch | 256x256 | 16x16 | 28.35 | 4.4 | 82.81 | 96.23 | config | model |
| Swin-S* | From scratch | 256x256 | 8x8 | 49.73 | 8.45 | 83.74 | 96.6 | config | model |
| Swin-S* | From scratch | 256x256 | 16x16 | 49.73 | 8.57 | 84.13 | 96.83 | config | model |
| Swin-B* | From scratch | 256x256 | 8x8 | 87.92 | 14.99 | 84.2 | 96.86 | config | model |
| Swin-B* | From scratch | 256x256 | 16x16 | 87.92 | 15.14 | 84.6 | 97.05 | config | model |
| Swin-B* | ImageNet-21k | 256x256 | 16x16 | 87.92 | 15.14 | 86.17 | 97.88 | config | model |
| Swin-B* | ImageNet-21k | 384x384 | 24x24 | 87.92 | 34.07 | 87.14 | 98.23 | config | model |
| Swin-L* | ImageNet-21k | 256x256 | 16x16 | 196.75 | 33.86 | 86.93 | 98.06 | config | model |
| Swin-L* | ImageNet-21k | 384x384 | 24x24 | 196.75 | 76.2 | 87.59 | 98.27 | config | model |

Models with * are converted from the official repo. The config files of these models are only for validation. We don’t ensure these config files’ training accuracy and welcome you to contribute your reproduction results.

The ImageNet-21k pre-trained models with input resolutions of 256x256 and 384x384 are both fine-tuned from the same pre-training model, which uses a smaller input resolution of 192x192.

Citation

@article{https://doi.org/10.48550/arxiv.2111.09883,
  doi = {10.48550/ARXIV.2111.09883},
  url = {https://arxiv.org/abs/2111.09883},
  author = {Liu, Ze and Hu, Han and Lin, Yutong and Yao, Zhuliang and Xie, Zhenda and Wei, Yixuan and Ning, Jia and Cao, Yue and Zhang, Zheng and Dong, Li and Wei, Furu and Guo, Baining},
  keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences, FOS: Computer and information sciences},
  title = {Swin Transformer V2: Scaling Up Capacity and Resolution},
  publisher = {arXiv},
  year = {2021},
  copyright = {Creative Commons Attribution 4.0 International}
}

Tokens-to-Token ViT

Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

Abstract

Transformers, which are popular for language modeling, have been explored for solving vision tasks recently, e.g., the Vision Transformer (ViT) for image classification. The ViT model splits each image into a sequence of tokens with fixed length and then applies multiple Transformer layers to model their global relation for classification. However, ViT achieves inferior performance to CNNs when trained from scratch on a midsize dataset like ImageNet. We find it is because: 1) the simple tokenization of input images fails to model the important local structure such as edges and lines among neighboring pixels, leading to low training sample efficiency; 2) the redundant attention backbone design of ViT leads to limited feature richness for fixed computation budgets and limited training samples. To overcome such limitations, we propose a new Tokens-To-Token Vision Transformer (T2T-ViT), which incorporates 1) a layer-wise Tokens-to-Token (T2T) transformation to progressively structurize the image to tokens by recursively aggregating neighboring Tokens into one Token (Tokens-to-Token), such that local structure represented by surrounding tokens can be modeled and tokens length can be reduced; 2) an efficient backbone with a deep-narrow structure for vision transformer motivated by CNN architecture design after empirical study. Notably, T2T-ViT reduces the parameter count and MACs of vanilla ViT by half, while achieving more than 3.0% improvement when trained from scratch on ImageNet. It also outperforms ResNets and achieves comparable performance with MobileNets by directly training on ImageNet. For example, T2T-ViT with comparable size to ResNet50 (21.5M parameters) can achieve 83.3% top1 accuracy in image resolution 384×384 on ImageNet.

Results and models

ImageNet-1k

| Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
| --- | --- | --- | --- | --- | --- | --- |
| T2T-ViT_t-14 | 21.47 | 4.34 | 81.83 | 95.84 | config | model \| log |
| T2T-ViT_t-19 | 39.08 | 7.80 | 82.63 | 96.18 | config | model \| log |
| T2T-ViT_t-24 | 64.00 | 12.69 | 82.71 | 96.09 | config | model \| log |

Consistent with the official repo, we adopt the best checkpoints during training.

Citation

@article{yuan2021tokens,
  title={Tokens-to-token vit: Training vision transformers from scratch on imagenet},
  author={Yuan, Li and Chen, Yunpeng and Wang, Tao and Yu, Weihao and Shi, Yujun and Tay, Francis EH and Feng, Jiashi and Yan, Shuicheng},
  journal={arXiv preprint arXiv:2101.11986},
  year={2021}
}

TNT

Transformer in Transformer

Abstract

Transformer is a new kind of neural architecture which encodes the input data as powerful features via the attention mechanism. Basically, the visual transformers first divide the input images into several local patches and then calculate both representations and their relationship. Since natural images are of high complexity with abundant detail and color information, the granularity of the patch dividing is not fine enough for excavating features of objects in different scales and locations. In this paper, we point out that the attention inside these local patches are also essential for building visual transformers with high performance and we explore a new architecture, namely, Transformer iN Transformer (TNT). Specifically, we regard the local patches (e.g., 16×16) as “visual sentences” and present to further divide them into smaller patches (e.g., 4×4) as “visual words”. The attention of each word will be calculated with other words in the given visual sentence with negligible computational costs. Features of both words and sentences will be aggregated to enhance the representation ability. Experiments on several benchmarks demonstrate the effectiveness of the proposed TNT architecture, e.g., we achieve an 81.5% top-1 accuracy on the ImageNet, which is about 1.7% higher than that of the state-of-the-art visual transformer with similar computational cost.

Results and models

ImageNet-1k

| Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
| --- | --- | --- | --- | --- | --- | --- |
| TNT-small* | 23.76 | 3.36 | 81.52 | 95.73 | config | model |

Models with * are converted from timm. The config files of these models are only for validation. We don’t ensure these config files’ training accuracy and welcome you to contribute your reproduction results.

Citation

@misc{han2021transformer,
      title={Transformer in Transformer},
      author={Kai Han and An Xiao and Enhua Wu and Jianyuan Guo and Chunjing Xu and Yunhe Wang},
      year={2021},
      eprint={2103.00112},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Twins

Twins: Revisiting the Design of Spatial Attention in Vision Transformers

Abstract

Very recently, a variety of vision transformer architectures for dense prediction tasks have been proposed and they show that the design of spatial attention is critical to their success in these tasks. In this work, we revisit the design of the spatial attention and demonstrate that a carefully-devised yet simple spatial attention mechanism performs favourably against the state-of-the-art schemes. As a result, we propose two vision transformer architectures, namely, Twins-PCPVT and Twins-SVT. Our proposed architectures are highly-efficient and easy to implement, only involving matrix multiplications that are highly optimized in modern deep learning frameworks. More importantly, the proposed architectures achieve excellent performance on a wide range of visual tasks, including image level classification as well as dense detection and segmentation. The simplicity and strong performance suggest that our proposed architectures may serve as stronger backbones for many vision tasks. Our code is released at this https URL.

Results and models

ImageNet-1k

| Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
| --- | --- | --- | --- | --- | --- | --- |
| PCPVT-small* | 24.11 | 3.67 | 81.14 | 95.69 | config | model |
| PCPVT-base* | 43.83 | 6.45 | 82.66 | 96.26 | config | model |
| PCPVT-large* | 60.99 | 9.51 | 83.09 | 96.59 | config | model |
| SVT-small* | 24.06 | 2.82 | 81.77 | 95.57 | config | model |
| SVT-base* | 56.07 | 8.35 | 83.13 | 96.29 | config | model |
| SVT-large* | 99.27 | 14.82 | 83.60 | 96.50 | config | model |

Models with * are converted from the official repo. The config files of these models are only for validation. We don’t ensure these config files’ training accuracy and welcome you to contribute your reproduction results. The validation accuracy differs slightly from the official paper because of the PyTorch version: this result was obtained with PyTorch 1.9, while the official result was obtained with PyTorch 1.7.

Citation

@article{chu2021twins,
  title={Twins: Revisiting spatial attention design in vision transformers},
  author={Chu, Xiangxiang and Tian, Zhi and Wang, Yuqing and Zhang, Bo and Ren, Haibing and Wei, Xiaolin and Xia, Huaxia and Shen, Chunhua},
  journal={arXiv preprint arXiv:2104.13840},
  year={2021}
}

Visual Attention Network

Visual Attention Network

Abstract

While originally designed for natural language processing (NLP) tasks, the self-attention mechanism has recently taken various computer vision areas by storm. However, the 2D nature of images brings three challenges for applying self-attention in computer vision. (1) Treating images as 1D sequences neglects their 2D structures. (2) The quadratic complexity is too expensive for high-resolution images. (3) It only captures spatial adaptability but ignores channel adaptability. In this paper, we propose a novel large kernel attention (LKA) module to enable self-adaptive and long-range correlations in self-attention while avoiding the above issues. We further introduce a novel neural network based on LKA, namely Visual Attention Network (VAN). While extremely simple and efficient, VAN outperforms the state-of-the-art vision transformers and convolutional neural networks with a large margin in extensive experiments, including image classification, object detection, semantic segmentation, instance segmentation, etc.

Results and models

ImageNet-1k

| Model | Pretrain | resolution | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| VAN-B0* | From scratch | 224x224 | 4.11 | 0.88 | 75.41 | 93.02 | config | model |
| VAN-B1* | From scratch | 224x224 | 13.86 | 2.52 | 81.01 | 95.63 | config | model |
| VAN-B2* | From scratch | 224x224 | 26.58 | 5.03 | 82.80 | 96.21 | config | model |
| VAN-B3* | From scratch | 224x224 | 44.77 | 8.99 | 83.86 | 96.73 | config | model |
| VAN-B4* | From scratch | 224x224 | 60.28 | 12.22 | 84.13 | 96.86 | config | model |

Models with * are converted from the official repo. The config files of these models are only for validation. We don’t ensure these config files’ training accuracy and welcome you to contribute your reproduction results.

Pre-trained Models

The pre-trained models on ImageNet-21k are used to fine-tune on the downstream tasks.

| Model | Pretrain | resolution | Params(M) | Flops(G) | Download |
| --- | --- | --- | --- | --- | --- |
| VAN-B4* | ImageNet-21k | 224x224 | 60.28 | 12.22 | model |
| VAN-B5* | ImageNet-21k | 224x224 | 89.97 | 17.21 | model |
| VAN-B6* | ImageNet-21k | 224x224 | 283.9 | 55.28 | model |

Models with * are converted from the official repo.

Citation

@article{guo2022visual,
  title={Visual Attention Network},
  author={Guo, Meng-Hao and Lu, Cheng-Ze and Liu, Zheng-Ning and Cheng, Ming-Ming and Hu, Shi-Min},
  journal={arXiv preprint arXiv:2202.09741},
  year={2022}
}

VGG

Very Deep Convolutional Networks for Large-Scale Image Recognition

Abstract

In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.

Results and models

ImageNet-1k

| Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
| --- | --- | --- | --- | --- | --- | --- |
| VGG-11 | 132.86 | 7.63 | 68.75 | 88.87 | config | model \| log |
| VGG-13 | 133.05 | 11.34 | 70.02 | 89.46 | config | model \| log |
| VGG-16 | 138.36 | 15.5 | 71.62 | 90.49 | config | model \| log |
| VGG-19 | 143.67 | 19.67 | 72.41 | 90.80 | config | model \| log |
| VGG-11-BN | 132.87 | 7.64 | 70.67 | 90.16 | config | model \| log |
| VGG-13-BN | 133.05 | 11.36 | 72.12 | 90.66 | config | model \| log |
| VGG-16-BN | 138.37 | 15.53 | 73.74 | 91.66 | config | model \| log |
| VGG-19-BN | 143.68 | 19.7 | 74.68 | 92.27 | config | model \| log |

Citation

@article{simonyan2014very,
  title={Very deep convolutional networks for large-scale image recognition},
  author={Simonyan, Karen and Zisserman, Andrew},
  journal={arXiv preprint arXiv:1409.1556},
  year={2014}
}

Vision Transformer

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Abstract

While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.

Results and models

The training of Vision Transformers is divided into two steps. The first step is to pre-train the model on a large dataset, like ImageNet-21k, to obtain the pre-trained model. The second step is to fine-tune the model on the target dataset, like ImageNet-1k, to obtain the fine-tuned model. Here, we provide both pre-trained models and fine-tuned models.

ImageNet-21k

The pre-trained models on ImageNet-21k are used to fine-tune, and therefore don’t have evaluation results.

| Model | resolution | Params(M) | Flops(G) | Download |
| --- | --- | --- | --- | --- |
| ViT-B16* | 224x224 | 86.86 | 33.03 | model |
| ViT-B32* | 224x224 | 88.30 | 8.56 | model |
| ViT-L16* | 224x224 | 304.72 | 116.68 | model |

Models with * are converted from the official repo.

ImageNet-1k

| Model | Pretrain | resolution | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ViT-B16* | ImageNet-21k | 384x384 | 86.86 | 33.03 | 85.43 | 97.77 | config | model |
| ViT-B32* | ImageNet-21k | 384x384 | 88.30 | 8.56 | 84.01 | 97.08 | config | model |
| ViT-L16* | ImageNet-21k | 384x384 | 304.72 | 116.68 | 85.63 | 97.63 | config | model |
| ViT-B16 (IPU) | ImageNet-21k | 224x224 | 86.86 | 33.03 | 81.22 | 95.56 | config | model \| log |

Models with * are converted from the official repo. The config files of these models are only for validation. We don’t ensure these config files’ training accuracy and welcome you to contribute your reproduction results.

Citation

@inproceedings{
  dosovitskiy2021an,
  title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
  author={Alexey Dosovitskiy and Lucas Beyer and Alexander Kolesnikov and Dirk Weissenborn and Xiaohua Zhai and Thomas Unterthiner and Mostafa Dehghani and Matthias Minderer and Georg Heigold and Sylvain Gelly and Jakob Uszkoreit and Neil Houlsby},
  booktitle={International Conference on Learning Representations},
  year={2021},
  url={https://openreview.net/forum?id=YicbFdNTTy}
}

Wide-ResNet

Wide Residual Networks

Abstract

Deep residual networks were shown to be able to scale up to thousands of layers and still have improving performance. However, each fraction of a percent of improved accuracy costs nearly doubling the number of layers, and so training very deep residual networks has a problem of diminishing feature reuse, which makes these networks very slow to train. To tackle these problems, in this paper we conduct a detailed experimental study on the architecture of ResNet blocks, based on which we propose a novel architecture where we decrease depth and increase width of residual networks. We call the resulting network structures wide residual networks (WRNs) and show that these are far superior over their commonly used thin and very deep counterparts. For example, we demonstrate that even a simple 16-layer-deep wide residual network outperforms in accuracy and efficiency all previous deep residual networks, including thousand-layer-deep networks, achieving new state-of-the-art results on CIFAR, SVHN, COCO, and significant improvements on ImageNet.

Results and models

ImageNet-1k

| Model | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
| --- | --- | --- | --- | --- | --- | --- |
| WRN-50* | 68.88 | 11.44 | 78.48 | 94.08 | config | model |
| WRN-101* | 126.89 | 22.81 | 78.84 | 94.28 | config | model |
| WRN-50 (timm)* | 68.88 | 11.44 | 81.45 | 95.53 | config | model |

Models with * are converted from TorchVision and TIMM. The config files of these models are only for inference. We don’t ensure these config files’ training accuracy and welcome you to contribute your reproduction results.

Citation

@INPROCEEDINGS{Zagoruyko2016WRN,
    author = {Sergey Zagoruyko and Nikos Komodakis},
    title = {Wide Residual Networks},
    booktitle = {BMVC},
    year = {2016}}

PyTorch 转 ONNX (试验性的)

如何将模型从 PyTorch 转换到 ONNX

准备工作

  1. 请参照 安装指南 从源码安装 MMClassification。

  2. 安装 onnx 和 onnxruntime。

pip install onnx onnxruntime==1.5.1

使用方法

python tools/deployment/pytorch2onnx.py \
    ${CONFIG_FILE} \
    --checkpoint ${CHECKPOINT_FILE} \
    --output-file ${OUTPUT_FILE} \
    --shape ${IMAGE_SHAPE} \
    --opset-version ${OPSET_VERSION} \
    --dynamic-shape \
    --show \
    --simplify \
    --verify \

所有参数的说明:

  • config : 模型配置文件的路径。

  • --checkpoint : 模型权重文件的路径。

  • --output-file: ONNX 模型的输出路径。如果没有指定,默认为当前脚本执行路径下的 tmp.onnx

  • --shape: 模型输入的高度和宽度。如果没有指定,默认为 224 224

  • --opset-version : ONNX 的 opset 版本。如果没有指定,默认为 11

  • --dynamic-shape : 是否以动态输入尺寸导出 ONNX。 如果没有指定,默认为 False

  • --show: 是否打印导出模型的架构。如果没有指定,默认为 False

  • --simplify: 是否精简导出的 ONNX 模型。如果没有指定,默认为 False

  • --verify: 是否验证导出模型的正确性。如果没有指定,默认为False

示例:

python tools/deployment/pytorch2onnx.py \
    configs/resnet/resnet18_8xb16_cifar10.py \
    --checkpoint checkpoints/resnet/resnet18_b16x8_cifar10.pth \
    --output-file checkpoints/resnet/resnet18_b16x8_cifar10.onnx \
    --dynamic-shape \
    --show \
    --simplify \
    --verify \
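
导出完成后,也可以直接用 ONNX Runtime 的 Python 接口做一次简单的前向检查(以下仅为示意代码,文件名 tmp.onnx 与 224x224 的输入尺寸均为示例假设):

import numpy as np
import onnxruntime as ort

# 加载导出的 ONNX 模型(路径为示例假设)
sess = ort.InferenceSession('tmp.onnx')
input_name = sess.get_inputs()[0].name

# 构造一个随机输入,形状需与导出时的 --shape 保持一致
dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)
scores = sess.run(None, {input_name: dummy})[0]
print(scores.shape)  # 预期为 (1, num_classes)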

支持导出至 ONNX 的模型列表

下表列出了保证可导出至 ONNX,并在 ONNX Runtime 中运行的模型。

| 模型 | 配置文件 | 批推理 | 动态输入尺寸 | 备注 |
| --- | --- | --- | --- | --- |
| MobileNetV2 | configs/mobilenet_v2/mobilenet-v2_8xb32_in1k.py | Y | Y | |
| ResNet | configs/resnet/resnet18_8xb16_cifar10.py | Y | Y | |
| ResNeXt | configs/resnext/resnext50-32x4d_8xb32_in1k.py | Y | Y | |
| SE-ResNet | configs/seresnet/seresnet50_8xb32_in1k.py | Y | Y | |
| ShuffleNetV1 | configs/shufflenet_v1/shufflenet-v1-1x_16xb64_in1k.py | Y | Y | |
| ShuffleNetV2 | configs/shufflenet_v2/shufflenet-v2-1x_16xb64_in1k.py | Y | Y | |

注:

  • 以上所有模型转换测试基于 PyTorch==1.6.0 进行

提示

  • 如果你在上述模型的转换中遇到问题,请在 GitHub 中创建一个 issue,我们会尽快处理。未在上表中列出的模型,由于资源限制,我们可能无法提供很多帮助,如果遇到问题,请尝试自行解决。

常见问题

ONNX 转 TensorRT(试验性的)

如何将模型从 ONNX 转换到 TensorRT

准备工作

  1. 请参照 安装指南 从源码安装 MMClassification。

  2. 使用我们的工具 pytorch2onnx.md 将 PyTorch 模型转换至 ONNX。

使用方法

python tools/deployment/onnx2tensorrt.py \
    ${MODEL} \
    --trt-file ${TRT_FILE} \
    --shape ${IMAGE_SHAPE} \
    --workspace-size {WORKSPACE_SIZE} \
    --show \
    --verify \

所有参数的说明:

  • model : ONNX 模型的路径。

  • --trt-file: TensorRT 引擎文件的输出路径。如果没有指定,默认为当前脚本执行路径下的 tmp.trt

  • --shape: 模型输入的高度和宽度。如果没有指定,默认为 224 224

  • --workspace-size : 构建 TensorRT 引擎所需要的 GPU 空间大小,单位为 GiB。如果没有指定,默认为 1 GiB。

  • --show: 是否展示模型的输出。如果没有指定,默认为 False

  • --verify: 是否使用 ONNXRuntime 和 TensorRT 验证模型转换的正确性。如果没有指定,默认为False

示例:

python tools/deployment/onnx2tensorrt.py \
    checkpoints/resnet/resnet18_b16x8_cifar10.onnx \
    --trt-file checkpoints/resnet/resnet18_b16x8_cifar10.trt \
    --shape 224 224 \
    --show \
    --verify \

支持转换至 TensorRT 的模型列表

下表列出了保证可转换为 TensorRT 的模型。

| 模型 | 配置文件 | 状态 |
| --- | --- | --- |
| MobileNetV2 | configs/mobilenet_v2/mobilenet-v2_8xb32_in1k.py | Y |
| ResNet | configs/resnet/resnet18_8xb16_cifar10.py | Y |
| ResNeXt | configs/resnext/resnext50-32x4d_8xb32_in1k.py | Y |
| ShuffleNetV1 | configs/shufflenet_v1/shufflenet-v1-1x_16xb64_in1k.py | Y |
| ShuffleNetV2 | configs/shufflenet_v2/shufflenet-v2-1x_16xb64_in1k.py | Y |

注:

  • 以上所有模型转换测试基于 PyTorch==1.6.0 和 TensorRT-7.2.1.6.Ubuntu-16.04.x86_64-gnu.cuda-10.2.cudnn8.0 进行

提示

  • 如果你在上述模型的转换中遇到问题,请在 GitHub 中创建一个 issue,我们会尽快处理。未在上表中列出的模型,由于资源限制,我们可能无法提供很多帮助,如果遇到问题,请尝试自行解决。

常见问题

PyTorch 转 TorchScript (试验性的)

如何将 PyTorch 模型转换至 TorchScript

使用方法

python tools/deployment/pytorch2torchscript.py \
    ${CONFIG_FILE} \
    --checkpoint ${CHECKPOINT_FILE} \
    --output-file ${OUTPUT_FILE} \
    --shape ${IMAGE_SHAPE} \
    --verify \

所有参数的说明:

  • config : 模型配置文件的路径。

  • --checkpoint : 模型权重文件的路径。

  • --output-file: TorchScript 模型的输出路径。如果没有指定,默认为当前脚本执行路径下的 tmp.pt

  • --shape: 模型输入的高度和宽度。如果没有指定,默认为 224 224

  • --verify: 是否验证导出模型的正确性。如果没有指定,默认为False

示例:

python tools/deployment/pytorch2torchscript.py \
    configs/resnet/resnet18_8xb16_cifar10.py \
    --checkpoint checkpoints/resnet/resnet18_b16x8_cifar10.pth \
    --output-file checkpoints/resnet/resnet18_b16x8_cifar10.pt \
    --verify \
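
转换完成后,可以用 torch.jit.load 直接加载导出的 TorchScript 模型做一次前向检查(以下仅为示意代码,文件路径沿用上文示例,输入尺寸为假设值):

import torch

# 加载导出的 TorchScript 模型
ts_model = torch.jit.load('checkpoints/resnet/resnet18_b16x8_cifar10.pt')
ts_model.eval()

dummy = torch.rand(1, 3, 224, 224)
with torch.no_grad():
    out = ts_model(dummy)
print(out)  # 输出格式取决于被导出的模型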

注:

  • 所有模型基于 PyTorch==1.8.1 通过了转换测试

提示

  • 由于 torch.jit.is_tracing() 只在 PyTorch 1.6 之后的版本中得到支持,对于 PyTorch 1.3-1.5 的用户,我们建议手动提前返回结果。

  • 如果你在本仓库的模型转换中遇到问题,请在 GitHub 中创建一个 issue,我们会尽快处理。

常见问题

模型部署至 TorchServe

为了使用 TorchServe 部署一个 MMClassification 模型,需要进行以下几步:

1. 转换 MMClassification 模型至 TorchServe

python tools/deployment/mmcls2torchserve.py ${CONFIG_FILE} ${CHECKPOINT_FILE} \
--output-folder ${MODEL_STORE} \
--model-name ${MODEL_NAME}

备注

${MODEL_STORE} 需要是一个文件夹的绝对路径。

示例:

python tools/deployment/mmcls2torchserve.py \
  configs/resnet/resnet18_8xb32_in1k.py \
  checkpoints/resnet18_8xb32_in1k_20210831-fbbb1da6.pth \
  --output-folder ./checkpoints \
  --model-name resnet18_in1k

2. 构建 mmcls-serve docker 镜像

docker build -t mmcls-serve:latest docker/serve/

3. 运行 mmcls-serve 镜像

请参考官方文档 基于 docker 运行 TorchServe.

为了使镜像能够使用 GPU 资源,需要安装 nvidia-docker。之后可以传递 --gpus 参数以在 GPU 上运行。

示例:

docker run --rm \
--cpus 8 \
--gpus device=0 \
-p8080:8080 -p8081:8081 -p8082:8082 \
--mount type=bind,source=`realpath ./checkpoints`,target=/home/model-server/model-store \
mmcls-serve:latest

备注

realpath ./checkpoints 是 “./checkpoints” 的绝对路径,你可以将其替换为你保存 TorchServe 模型的目录的绝对路径。

参考 该文档 了解关于推理 (8080),管理 (8081) 和指标 (8082) 等 API 的信息。

4. 测试部署

curl http://127.0.0.1:8080/predictions/${MODEL_NAME} -T demo/demo.JPEG

您应该获得类似于以下内容的响应:

{
  "pred_label": 58,
  "pred_score": 0.38102269172668457,
  "pred_class": "water snake"
}
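
除了 curl,也可以用 Python 的 requests 库发送同样的推理请求(仅为示意,模型名 resnet18_in1k 沿用上文示例):

import requests

# 向 TorchServe 的推理接口(8080 端口)发送一张图片
with open('demo/demo.JPEG', 'rb') as f:
    resp = requests.post('http://127.0.0.1:8080/predictions/resnet18_in1k', data=f)
print(resp.json())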

另外,你也可以使用 test_torchserver.py 来比较 TorchServe 和 PyTorch 的结果,并进行可视化。

python tools/deployment/test_torchserver.py ${IMAGE_FILE} ${CONFIG_FILE} ${CHECKPOINT_FILE} ${MODEL_NAME}
[--inference-addr ${INFERENCE_ADDR}] [--device ${DEVICE}]

示例:

python tools/deployment/test_torchserver.py \
  demo/demo.JPEG \
  configs/resnet/resnet18_8xb32_in1k.py \
  checkpoints/resnet18_8xb32_in1k_20210831-fbbb1da6.pth \
  resnet18_in1k

可视化

数据流水线可视化

python tools/visualizations/vis_pipeline.py \
    ${CONFIG_FILE} \
    [--output-dir ${OUTPUT_DIR}] \
    [--phase ${DATASET_PHASE}] \
    [--number ${NUMBER_IMAGES_DISPLAY}] \
    [--skip-type ${SKIP_TRANSFORM_TYPE}] \
    [--mode ${DISPLAY_MODE}] \
    [--show] \
    [--adaptive] \
    [--min-edge-length ${MIN_EDGE_LENGTH}] \
    [--max-edge-length ${MAX_EDGE_LENGTH}] \
    [--bgr2rgb] \
    [--window-size ${WINDOW_SIZE}] \
    [--cfg-options ${CFG_OPTIONS}]

所有参数的说明

  • config : 模型配置文件的路径。

  • --output-dir: 保存图片文件夹,如果没有指定,默认为 '',表示不保存。

  • --phase: 可视化数据集的阶段,只能为 [train, val, test] 之一,默认为 train

  • --number: 可视化样本数量。如果没有指定,默认展示数据集的所有图片。

  • --skip-type: 预设跳过的数据流水线过程。如果没有指定,默认为 ['ToTensor', 'Normalize', 'ImageToTensor', 'Collect']

  • --mode: 可视化的模式,只能为 [original, transformed, concat, pipeline] 之一,如果没有指定,默认为 concat

  • --show: 将可视化图片以弹窗形式展示。

  • --adaptive: 自动调节可视化图片的大小。

  • --min-edge-length: 最短边长度,当使用了 --adaptive 时有效。 当图片任意边小于 ${MIN_EDGE_LENGTH} 时,会保持长宽比不变放大图片,短边对齐至 ${MIN_EDGE_LENGTH},默认为200。

  • --max-edge-length: 最长边长度,当使用了 --adaptive 时有效。当图片任意边大于 ${MAX_EDGE_LENGTH} 时,会保持长宽比不变缩小图片,长边对齐至 ${MAX_EDGE_LENGTH},默认为1000。

  • --bgr2rgb: 将图片的颜色通道翻转。

  • --window-size: 可视化窗口大小,如果没有指定,默认为 12*7。如果需要指定,按照格式 'W*H'

  • --cfg-options : 对配置文件的修改,参考教程 1:如何编写配置文件

备注

  1. 如果不指定 --mode,默认设置为 concat,获取原始图片和预处理后图片拼接的图片;如果 --mode 设置为 original,则获取原始图片;如果 --mode 设置为 transformed,则获取预处理后的图片;如果 --mode 设置为 pipeline,则获得数据流水线所有中间过程图片。

  2. 当指定了 --adaptive 选项时,会自动的调整尺寸过大和过小的图片,你可以通过设定 --min-edge-length--max-edge-length 来指定自动调整的图片尺寸。

示例

  1. ‘original’ 模式,可视化 CIFAR100 验证集中的100张原始图片,显示并保存在 ./tmp 文件夹下:

python ./tools/visualizations/vis_pipeline.py configs/resnet/resnet50_8xb16_cifar100.py --phase val --output-dir tmp --mode original --number 100 --show --adaptive --bgr2rgb

  2. ‘transformed’ 模式,可视化 ImageNet 训练集的所有经过预处理的图片,并以弹窗形式显示:

python ./tools/visualizations/vis_pipeline.py ./configs/resnet/resnet50_8xb32_in1k.py --show --mode transformed

  3. ‘concat’ 模式,可视化 ImageNet 训练集的10张原始图片与预处理后图片对比图,保存在 ./tmp 文件夹下:

python ./tools/visualizations/vis_pipeline.py configs/swin_transformer/swin_base_224_b16x64_300e_imagenet.py --phase train --output-dir tmp --number 10 --adaptive

  4. ‘pipeline’ 模式,可视化 ImageNet 训练集经过数据流水线的过程图像:

python ./tools/visualizations/vis_pipeline.py configs/swin_transformer/swin_base_224_b16x64_300e_imagenet.py --phase train --adaptive --mode pipeline --show

学习率策略可视化

python tools/visualizations/vis_lr.py \
    ${CONFIG_FILE} \
    [--dataset-size ${Dataset_Size}] \
    [--ngpus ${NUM_GPUs}] \
    [--save-path ${SAVE_PATH}] \
    [--title ${TITLE}] \
    [--style ${STYLE}] \
    [--window-size ${WINDOW_SIZE}] \
    [--cfg-options ${CFG_OPTIONS}] \

所有参数的说明

  • config : 模型配置文件的路径。

  • --dataset-size : 数据集的大小。如果指定,build_dataset 将被跳过并使用这个大小作为数据集大小,默认使用 build_dataset 所得数据集的大小。

  • --ngpus : 使用 GPU 的数量。

  • --save-path : 保存的可视化图片的路径,默认不保存。

  • --title : 可视化图片的标题,默认为配置文件名。

  • --style : 可视化图片的风格,默认为 whitegrid

  • --window-size: 可视化窗口大小,如果没有指定,默认为 12*7。如果需要指定,按照格式 'W*H'

  • --cfg-options : 对配置文件的修改,参考教程 1:如何编写配置文件

备注

部分数据集在解析标注阶段比较耗时,可直接通过 --dataset-size 指定数据集的大小,以节约时间。

示例

python tools/visualizations/vis_lr.py configs/resnet/resnet50_b16x8_cifar100.py

当数据集为 ImageNet 时,通过直接指定数据集大小来节约时间,并保存图片:

python tools/visualizations/vis_lr.py configs/repvgg/repvgg-B3g4_4xb64-autoaug-lbs-mixup-coslr-200e_in1k.py --dataset-size 1281167 --ngpus 4 --save-path ./repvgg-B3g4_4xb64-lr.jpg

类别激活图可视化

MMClassification 提供 tools/visualizations/vis_cam.py 工具来可视化类别激活图。请使用 pip install "grad-cam>=1.3.6" 安装依赖的 pytorch-grad-cam。

目前支持的方法有:

| Method | What it does |
| --- | --- |
| GradCAM | 使用平均梯度对 2D 激活进行加权 |
| GradCAM++ | 类似 GradCAM,但使用了二阶梯度 |
| XGradCAM | 类似 GradCAM,但通过归一化的激活对梯度进行了加权 |
| EigenCAM | 使用 2D 激活的第一主成分(无法区分类别,但效果似乎不错) |
| EigenGradCAM | 类似 EigenCAM,但支持类别区分,使用了激活 * 梯度的第一主成分,看起来和 GradCAM 差不多,但是更干净 |
| LayerCAM | 使用正梯度对激活进行空间加权,对于浅层有更好的效果 |

命令行

python tools/visualizations/vis_cam.py \
    ${IMG} \
    ${CONFIG_FILE} \
    ${CHECKPOINT} \
    [--target-layers ${TARGET-LAYERS}] \
    [--preview-model] \
    [--method ${METHOD}] \
    [--target-category ${TARGET-CATEGORY}] \
    [--save-path ${SAVE_PATH}] \
    [--vit-like] \
    [--num-extra-tokens ${NUM-EXTRA-TOKENS}]
    [--aug_smooth] \
    [--eigen_smooth] \
    [--device ${DEVICE}] \
    [--cfg-options ${CFG-OPTIONS}]

所有参数的说明

  • img:目标图片路径。

  • config:模型配置文件的路径。

  • checkpoint:权重路径。

  • --target-layers:所查看的网络层名称,可输入一个或者多个网络层, 如果不设置,将使用最后一个block中的norm层。

  • --preview-model:是否查看模型所有网络层。

  • --method:类别激活图可视化的方法,目前支持 GradCAM, GradCAM++, XGradCAM, EigenCAM, EigenGradCAM, LayerCAM,不区分大小写。如果不设置,默认为 GradCAM。

  • --target-category:查看的目标类别,如果不设置,使用模型检测出来的类别作为目标类别。

  • --save-path:保存的可视化图片的路径,默认不保存。

  • --eigen-smooth:是否使用主成分降低噪音,默认不开启。

  • --vit-like: 是否为 ViT 类似的 Transformer-based 网络

  • --num-extra-tokens: ViT 类网络的额外的 tokens 通道数,默认使用主干网络的 num_extra_tokens

  • --aug-smooth:是否使用测试时增强

  • --device:使用的计算设备,如果不设置,默认为’cpu’。

  • --cfg-options:对配置文件的修改,参考教程 1:如何编写配置文件

备注

在指定 --target-layers 时,如果不知道模型有哪些网络层,可使用命令行添加 --preview-model 查看所有网络层名称;

示例(CNN)

--target-layersResnet-50 中的一些示例如下:

  • 'backbone.layer4',表示第四个 ResLayer 层的输出。

  • 'backbone.layer4.2' 表示第四个 ResLayer 层中第三个 BottleNeck 块的输出。

  • 'backbone.layer4.2.conv1' 表示上述 BottleNeck 块中 conv1 层的输出。

备注

对于 ModuleList 或者 Sequential 类型的网络层,可以直接使用索引的方式指定子模块。比如 backbone.layer4[-1]backbone.layer4.2 是相同的,因为 layer4 是一个拥有三个子模块的 Sequential
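
如果希望在 Python 中查看可用的网络层名称(作用与 --preview-model 类似),也可以直接遍历模型的 named_modules(以下仅为示意代码,配置文件沿用上文 ResNet-50 示例,此处不加载权重):

from mmcls.apis import init_model

# 仅用于查看层名,因此不传入权重文件
model = init_model('configs/resnet/resnet50_8xb32_in1k.py', device='cpu')
for name, _ in model.backbone.named_modules():
    print(name)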

  1. 使用不同方法可视化 ResNet50,默认 target-category 为模型检测的结果,使用默认推导的 target-layers

    python tools/visualizations/vis_cam.py \
        demo/bird.JPEG \
        configs/resnet/resnet50_8xb32_in1k.py \
        https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_batch256_imagenet_20200708-cfb998bf.pth \
        --method GradCAM
        # GradCAM++, XGradCAM, EigenCAM, EigenGradCAM, LayerCAM
    

    (对比图:Image、GradCAM、GradCAM++、EigenGradCAM、LayerCAM 的可视化结果)

  2. 同一张图不同类别的激活图效果图,在 ImageNet 数据集中,类别238为 ‘Greater Swiss Mountain dog’,类别281为 ‘tabby, tabby cat’。

    python tools/visualizations/vis_cam.py \
        demo/cat-dog.png configs/resnet/resnet50_8xb32_in1k.py \
        https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_batch256_imagenet_20200708-cfb998bf.pth \
        --target-layers 'backbone.layer4.2' \
        --method GradCAM \
        --target-category 238
        # --target-category 281
    

    (对比图:Dog 与 Cat 两个类别下的 Image、GradCAM、XGradCAM、LayerCAM 可视化结果)

  3. 使用 --eigen-smooth 以及 --aug-smooth 获取更好的可视化效果。

    python tools/visualizations/vis_cam.py \
        demo/dog.jpg  \
        configs/mobilenet_v3/mobilenet-v3-large_8xb32_in1k.py \
        https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/convert/mobilenet_v3_large-3ea3c186.pth \
        --target-layers 'backbone.layer16' \
        --method LayerCAM \
        --eigen-smooth --aug-smooth
    

    (对比图:Image、LayerCAM、eigen-smooth、aug-smooth、eigen&aug 的可视化结果)

示例(Transformer)

--target-layers 在 Transformer-based 网络中的一些示例如下:

  • Swin-Transformer 中:'backbone.norm3'

  • ViT 中:'backbone.layers[-1].ln1'

对于 Transformer-based 的网络,比如 ViT、T2T-ViT 和 Swin-Transformer,特征是被展平的。为了绘制 CAM 图,我们需要指定 --vit-like 选项,从而让被展平的特征恢复方形的特征图。

除了特征被展平之外,一些类 ViT 的网络还会添加额外的 tokens。比如 ViT 和 T2T-ViT 中添加了分类 token,DeiT 中还添加了蒸馏 token。在这些网络中,分类计算在最后一个注意力模块之后就已经完成了,分类得分也只和这些额外的 tokens 有关,与特征图无关,也就是说,分类得分对这些特征图的导数为 0。因此,我们不能使用最后一个注意力模块的输出作为 CAM 绘制的目标层。

另外,为了去除这些额外的 tokens 以获得特征图,我们需要知道这些额外 tokens 的数量。MMClassification 中几乎所有 Transformer-based 的网络都拥有 num_extra_tokens 属性。而如果你希望将此工具应用于新的,或者第三方的网络,而且该网络没有指定 num_extra_tokens 属性,那么可以使用 --num-extra-tokens 参数手动指定其数量。

  1. Swin Transformer 使用默认 target-layers 进行 CAM 可视化:

    python tools/visualizations/vis_cam.py \
        demo/bird.JPEG  \
        configs/swin_transformer/swin-tiny_16xb64_in1k.py \
        https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin_tiny_224_b16x64_300e_imagenet_20210616_090925-66df6be6.pth \
        --vit-like
    
  2. Vision Transformer(ViT) 进行 CAM 可视化:

    python tools/visualizations/vis_cam.py \
        demo/bird.JPEG  \
        configs/vision_transformer/vit-base-p16_ft-64xb64_in1k-384.py \
        https://download.openmmlab.com/mmclassification/v0/vit/finetune/vit-base-p16_in21k-pre-3rdparty_ft-64xb64_in1k-384_20210928-98e8652b.pth \
        --vit-like \
        --target-layers 'backbone.layers[-1].ln1'
    
  3. T2T-ViT 进行 CAM 可视化:

    python tools/visualizations/vis_cam.py \
        demo/bird.JPEG  \
        configs/t2t_vit/t2t-vit-t-14_8xb64_in1k.py \
        https://download.openmmlab.com/mmclassification/v0/t2t-vit/t2t-vit-t-14_3rdparty_8xb64_in1k_20210928-b7c09b62.pth \
        --vit-like \
        --target-layers 'backbone.encoder[-1].ln1'
    

(对比图:Image 及 ResNet50、ViT、Swin、T2T-ViT 的 CAM 可视化结果)

常见问题

分析

日志分析

绘制曲线图

指定一个训练日志文件,可通过 tools/analysis_tools/analyze_logs.py 脚本绘制指定键值的变化曲线

python tools/analysis_tools/analyze_logs.py plot_curve \
    ${JSON_LOGS}  \
    [--keys ${KEYS}]  \
    [--title ${TITLE}]  \
    [--legend ${LEGEND}]  \
    [--backend ${BACKEND}]  \
    [--style ${STYLE}]  \
    [--out ${OUT_FILE}] \
    [--window-size ${WINDOW_SIZE}]

所有参数的说明

  • json_logs :训练日志文件的路径(可同时传入多个,使用空格分开)。

  • --keys :分析日志的关键字段,数量为 len(${JSON_LOGS}) * len(${KEYS}),默认为 ‘loss’。

  • --title :分析日志的图片名称,默认使用配置文件名。

  • --legend :图例名(可同时传入多个,使用空格分开,数目与 ${JSON_LOGS} * ${KEYS} 数目一致)。默认使用 "${JSON_LOG}-${KEYS}"

  • --backend :matplotlib 的绘图后端,默认由 matplotlib 自动选择。

  • --style :绘图配色风格,默认为 whitegrid

  • --out :保存分析图片的路径,如不指定则不保存。

  • --window-size: 可视化窗口大小,如果没有指定,默认为 12*7。如果需要指定,需按照格式 'W*H'

备注

--style 选项依赖于第三方库 seaborn,需要设置绘图风格请先安装该库。

例如:

  • 绘制某日志文件对应的损失曲线图。

    python tools/analysis_tools/analyze_logs.py plot_curve your_log_json --keys loss --legend loss
    
  • 绘制某日志文件对应的 top-1 和 top-5 准确率曲线图,并将曲线图导出为 results.jpg 文件。

    python tools/analysis_tools/analyze_logs.py plot_curve your_log_json --keys accuracy_top-1 accuracy_top-5  --legend top1 top5 --out results.jpg
    
  • 在同一图像内绘制两份日志文件对应的 top-1 准确率曲线图。

    python tools/analysis_tools/analyze_logs.py plot_curve log1.json log2.json --keys accuracy_top-1 --legend run1 run2
    

备注

本工具会自动根据关键字段选择从日志的训练部分还是验证部分读取,因此如果你添加了 自定义的验证指标,请把相对应的关键字段加入到本工具的 TEST_METRICS 变量中。

统计训练时间

tools/analysis_tools/analyze_logs.py 也可以根据日志文件统计训练耗时。

python tools/analysis_tools/analyze_logs.py cal_train_time \
    ${JSON_LOGS}
    [--include-outliers]

所有参数的说明

  • json_logs :训练日志文件的路径(可同时传入多个,使用空格分开)。

  • --include-outliers :如果指定,将不会排除每个轮次中第一轮迭代的记录(有时第一轮迭代会耗时较长)

示例:

python tools/analysis_tools/analyze_logs.py cal_train_time work_dirs/some_exp/20200422_153324.log.json

预计输出结果如下所示:

-----Analyze train time of work_dirs/some_exp/20200422_153324.log.json-----
slowest epoch 68, average time is 0.3818
fastest epoch 1, average time is 0.3694
time std over epochs is 0.0020
average iter time: 0.3777 s/iter

结果分析

利用 tools/test.py 的 --out 参数,我们可以将所有样本的推理结果保存到输出文件中。利用这一文件,我们可以进行进一步的分析。
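
例如,可以先用 mmcv.load 读取保存的结果文件,查看其中包含的字段(以下仅为示意代码,文件名 result.pkl 为示例假设):

import mmcv

results = mmcv.load('result.pkl')
print(results.keys())  # 具体字段取决于 --out-items 设置,例如 'class_scores'、'pred_label' 等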

评估结果

tools/analysis_tools/eval_metric.py 可以用来再次计算评估结果。

python tools/analysis_tools/eval_metric.py \
      ${CONFIG} \
      ${RESULT} \
      [--metrics ${METRICS}]  \
      [--cfg-options ${CFG_OPTIONS}] \
      [--metric-options ${METRIC_OPTIONS}]

所有参数说明

  • config :配置文件的路径。

  • resulttools/test.py 的输出结果文件。

  • metrics : 评估的衡量指标,可接受的值取决于数据集类。

  • --cfg-options: 额外的配置选项,会被合入配置文件,参考教程 1:如何编写配置文件

  • --metric-options: 如果指定了,这些选项将被传递给数据集 evaluate 函数的 metric_options 参数。

备注

tools/test.py 中,我们支持使用 --out-items 选项来选择保存哪些结果。为了使用本工具,请确保结果文件中包含 “class_scores”。

示例

python tools/analysis_tools/eval_metric.py configs/t2t_vit/t2t-vit-t-14_8xb64_in1k.py ./result.pkl --metrics accuracy --metric-options "topk=(1,5)"

查看典型结果

tools/analysis_tools/analyze_results.py 可以保存预测成功/失败,同时得分最高的 k 个图像。

python tools/analysis_tools/analyze_results.py \
      ${CONFIG} \
      ${RESULT} \
      [--out-dir ${OUT_DIR}] \
      [--topk ${TOPK}] \
      [--cfg-options ${CFG_OPTIONS}]

所有参数说明

  • config :配置文件的路径。

  • resulttools/test.py 的输出结果文件。

  • --out-dir :保存结果分析的文件夹路径。

  • --topk :分别保存多少张预测成功/失败的图像。如果不指定,默认为 20

  • --cfg-options: 额外的配置选项,会被合入配置文件,参考教程 1:如何编写配置文件

备注

tools/test.py 中,我们支持使用 --out-items 选项来选择保存哪些结果。为了使用本工具,请确保结果文件中包含 “pred_score”、”pred_label” 和 “pred_class”。

示例

python tools/analysis_tools/analyze_results.py \
       configs/resnet/resnet50_xxxx.py \
       result.pkl \
       --out-dir results \
       --topk 50

模型复杂度分析

计算 FLOPs 和参数量(试验性的)

我们根据 flops-counter.pytorch 提供了一个脚本用于计算给定模型的 FLOPs 和参数量。

python tools/analysis_tools/get_flops.py ${CONFIG_FILE} [--shape ${INPUT_SHAPE}]

所有参数说明

  • config :配置文件的路径。

  • --shape: 输入尺寸,支持单值或者双值, 如: --shape 256--shape 224 256。默认为224 224

用户将获得如下结果:

==============================
Input shape: (3, 224, 224)
Flops: 4.12 GFLOPs
Params: 25.56 M
==============================

警告

此工具仍处于试验阶段,我们不保证该数字正确无误。您最好将结果用于简单比较,但在技术报告或论文中采用该结果之前,请仔细检查。

  • FLOPs 与输入的尺寸有关,而参数量与输入尺寸无关。默认输入尺寸为 (1, 3, 224, 224)

  • 一些运算不会被计入 FLOPs 的统计中,例如 GN 和自定义运算。详细信息请参考 mmcv.cnn.get_model_complexity_info()
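
除了上述脚本,也可以在 Python 中直接调用 mmcv.cnn.get_model_complexity_info 进行统计(以下仅为示意代码,这里以 ResNet-50 骨干网络和 224x224 输入为例):

from mmcv.cnn import get_model_complexity_info
from mmcls.models import build_backbone

backbone = build_backbone(dict(type='ResNet', depth=50))
backbone.eval()

# 返回字符串形式的 FLOPs 与参数量
flops, params = get_model_complexity_info(
    backbone, (3, 224, 224), print_per_layer_stat=False)
print(f'Flops: {flops}\nParams: {params}')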

常见问题

其他工具

打印完整配置

tools/misc/print_config.py 脚本会解析所有输入变量,并打印完整配置信息。

python tools/misc/print_config.py ${CONFIG} [--cfg-options ${CFG_OPTIONS}]

所有参数说明:

  • config :配置文件的路径。

  • --cfg-options: 额外的配置选项,会被合入配置文件,参考教程 1:如何编写配置文件。

示例

python tools/misc/print_config.py configs/t2t_vit/t2t-vit-t-14_8xb64_in1k.py

检查数据集

tools/misc/verify_dataset.py 脚本会检查数据集的所有图片,查看是否有已经损坏的图片。

python tools/misc/verify_dataset.py \
    ${CONFIG} \
    [--out-path ${OUT-PATH}] \
    [--phase ${PHASE}] \
    [--num-process ${NUM-PROCESS}] \
    [--cfg-options ${CFG_OPTIONS}]

所有参数说明:

  • config : 配置文件的路径。

  • --out-path : 输出结果路径,默认为 ‘brokenfiles.log’。

  • --phase : 检查哪个阶段的数据集,可用值为 “train” 、”test” 或者 “val”, 默认为 “train”。

  • --num-process : 指定的进程数,默认为1。

  • --cfg-options: 额外的配置选项,会被合入配置文件,参考教程 1:如何编写配置文件

示例:

python tools/misc/verify_dataset.py configs/t2t_vit/t2t-vit-t-14_8xb64_in1k.py --out-path broken_imgs.log --phase val --num-process 8

常见问题

参与贡献 OpenMMLab

欢迎任何类型的贡献,包括但不限于

  • 修改拼写错误或代码错误

  • 添加文档或将文档翻译成其他语言

  • 添加新功能和新组件

工作流程

  1. fork 并 pull 最新的 OpenMMLab 仓库 (MMClassification)

  2. 签出到一个新分支(不要使用 master 分支提交 PR)

  3. 进行修改并提交至 fork 出的自己的远程仓库

  4. 在我们的仓库中创建一个 PR

备注

如果你计划添加一些新的功能,并引入大量改动,请尽量首先创建一个 issue 来进行讨论。

代码风格

Python

我们采用 PEP8 作为统一的代码风格。

我们使用下列工具来进行代码风格检查与格式化:

  • flake8: Python 官方发布的代码规范检查工具,是多个检查工具的封装

  • isort: 自动调整模块导入顺序的工具

  • yapf: 一个 Python 文件的格式化工具。

  • codespell: 检查单词拼写是否有误

  • mdformat: 检查 markdown 文件的工具

  • docformatter: 一个 docstring 格式化工具。

yapf 和 isort 的格式设置位于 setup.cfg

我们使用 pre-commit hook 来保证每次提交时自动进行代码检查和格式化,启用的功能包括 flake8, yapf, isort, trailing whitespaces, markdown files, 修复 end-of-files, double-quoted-strings, python-encoding-pragma, mixed-line-ending, 对 requirements.txt 的排序等。pre-commit hook 的配置文件位于 .pre-commit-config。

在你克隆仓库后,你需要按照如下步骤安装并初始化 pre-commit hook。

pip install -U pre-commit

在仓库文件夹中执行

pre-commit install

在此之后,每次提交,代码规范检查和格式化工具都将被强制执行。

重要

在创建 PR 之前,请确保你的代码完成了代码规范检查,并经过了 yapf 的格式化。

C++ 和 CUDA

C++ 和 CUDA 的代码规范遵从 Google C++ Style Guide

mmcls.apis

These are some high-level APIs for classification tasks.

mmcls.apis

Train

Test

Inference

mmcls.core

This package includes some runtime components. These components are useful in classification tasks but not supported by MMCV yet.

备注

Some components may be moved to MMCV in the future.

Evaluation

Evaluation metrics calculation functions

Hook

Optimizers

mmcls.models

The models package contains several sub-packages for addressing the different components of a model.

  • Classifier: The top-level module which defines the whole process of a classification model.

  • Backbones: Usually a feature extraction network, e.g., ResNet, MobileNet.

  • Necks: The component between backbones and heads, e.g., GlobalAveragePooling.

  • Heads: The component for specific tasks. In MMClassification, we provide heads for classification.

  • Losses: Loss functions.

Classifier

Backbones

Necks

Heads

Losses

mmcls.models.utils

This package includes some helper functions and common components used in various networks.

Common Components

Helper Functions

channel_shuffle

make_divisible

to_ntuple

is_tracing
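
As a quick illustration of the helper functions (a minimal sketch; the concrete values below are arbitrary examples):

from mmcls.models.utils import make_divisible, to_ntuple

# Round a channel number to the nearest value divisible by 8
print(make_divisible(37, 8))

# Turn a scalar into an n-tuple, e.g. an image size into (H, W)
print(to_ntuple(2)(224))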

mmcls.datasets

The datasets package contains several usual datasets for image classification tasks and some dataset wrappers.

Custom Dataset

ImageNet

CIFAR

MNIST

VOC

StanfordCars

Base classes

Dataset Wrappers

Data Transformations

In MMClassification, the data preparation and the dataset are decoupled. The datasets only define how to get samples’ basic information from the file system. This basic information includes the ground-truth label and the raw image data / the paths of images.

To prepare the inputs data, we need to do some transformations on these basic information. These transformations includes loading, preprocessing and formatting. And a series of data transformations makes up a data pipeline. Therefore, you can find the a pipeline argument in the configs of dataset, for example:

img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='RandomResizedCrop', size=224),
    dict(type='RandomFlip', flip_prob=0.5, direction='horizontal'),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='ImageToTensor', keys=['img']),
    dict(type='ToTensor', keys=['gt_label']),
    dict(type='Collect', keys=['img', 'gt_label'])
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='Resize', size=256),
    dict(type='CenterCrop', crop_size=224),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='ImageToTensor', keys=['img']),
    dict(type='Collect', keys=['img'])
]

data = dict(
    train=dict(..., pipeline=train_pipeline),
    val=dict(..., pipeline=test_pipeline),
    test=dict(..., pipeline=test_pipeline),
)

Every item of a pipeline list is one of the following data transformation classes. If you want to add a custom data transformation class, the tutorial Custom Data Pipelines will help you.
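
To inspect what a pipeline produces, you can compose it and feed it the kind of sample dict a dataset would create; a hedged sketch reusing the test_pipeline above with the demo image shipped in the repository (run from the repository root):

from mmcls.datasets.pipelines import Compose

pipeline = Compose(test_pipeline)
# A dataset would normally fill this dict from its annotation list.
sample = dict(img_prefix=None, img_info=dict(filename='demo/demo.JPEG'))
data = pipeline(sample)
print(data['img'].shape)  # e.g. torch.Size([3, 224, 224]) after CenterCrop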

Loading

LoadImageFromFile

Preprocessing and Augmentation

CenterCrop

Lighting

Normalize

Pad

Resize

RandomCrop

RandomErasing

RandomFlip

RandomGrayscale

RandomResizedCrop

ColorJitter

Composed Augmentation

Composed augmentation is a kind of method that composes a series of data augmentation transformations, such as AutoAugment and RandAugment.

In composed augmentation, we need to specify several data transformations or several groups of data transformations (the policies argument) as the random sampling space. These data transformations are chosen from the table below. In addition, we provide some preset policies in this folder.
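
For reference, a hedged example of an AutoAugment transform inside a training pipeline; the two sub-policies are illustrative only, and img_norm_cfg is assumed to be defined as in the previous section:

auto_aug_train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='RandomResizedCrop', size=224),
    dict(
        type='AutoAugment',
        policies=[
            [dict(type='Posterize', bits=4, prob=0.4),
             dict(type='Rotate', angle=30., prob=0.6)],
            [dict(type='Solarize', thr=256, prob=0.6),
             dict(type='AutoContrast', prob=0.6)],
        ]),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='ImageToTensor', keys=['img']),
    dict(type='ToTensor', keys=['gt_label']),
    dict(type='Collect', keys=['img', 'gt_label']),
]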

Formatting

Collect

ImageToTensor

ToNumpy

ToPIL

ToTensor

Transpose

Batch Augmentation

Batch augmentation is augmentation that involves multiple samples, such as Mixup and CutMix.

In MMClassification, these batch augmentations are used as a part of the Classifier. A typical usage is as below:

model = dict(
    backbone=...,
    neck=...,
    head=...,
    train_cfg=dict(augments=[
        dict(type='BatchMixup', alpha=0.8, prob=0.5, num_classes=num_classes),
        dict(type='BatchCutMix', alpha=1.0, prob=0.5, num_classes=num_classes),
    ]))
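
Putting it together, a hedged sketch of a complete classifier config that enables these batch augmentations (the backbone and head values are illustrative; when Mixup or CutMix is enabled, the head loss should accept soft labels, hence use_soft=True):

from mmcls.models import build_classifier

model_cfg = dict(
    type='ImageClassifier',
    backbone=dict(type='ResNet', depth=18, out_indices=(3, )),
    neck=dict(type='GlobalAveragePooling'),
    head=dict(
        type='LinearClsHead',
        num_classes=1000,
        in_channels=512,
        loss=dict(type='CrossEntropyLoss', use_soft=True)),
    train_cfg=dict(augments=[
        dict(type='BatchMixup', alpha=0.8, prob=0.5, num_classes=1000),
        dict(type='BatchCutMix', alpha=1.0, prob=0.5, num_classes=1000),
    ]))
model = build_classifier(model_cfg)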

Mixup

CutMix

ResizeMix

mmcls.utils

These are some useful helper functions in the utils package.

Changelog

v0.25.0(06/12/2022)

Highlights

  • Support MLU backend.

New Features

  • Support MLU backend. (#1159)

  • Support Activation Checkpointing for ConvNeXt. (#1152)

Improvements

  • Add dist_train_arm.sh for ARM device and update NPU results. (#1218)

Bug Fixes

  • Fix a bug that caused MMClsWandbHook to get stuck. (#1242)

  • Fix the redundant device_ids in tools/test.py. (#1215)

Docs Update

  • Add version banner and version warning in master docs. (#1216)

  • Update NPU support doc. (#1198)

  • Fixed typo in pytorch2torchscript.md. (#1173)

  • Fix typo in miscellaneous.md. (#1137)

  • Add further detail to the doc for ClassBalancedDataset. (#901)

v0.24.1(31/10/2022)

New Features

  • Support mmcls with NPU backend. (#1072)

Bug Fixes

  • Fix performance issue in convnext DDP train. (#1098)

v0.24.0(30/9/2022)

Highlights

  • Support HorNet, EfficientFormer, Swin Transformer V2 and MViT backbones.

  • Support Stanford Cars dataset.

New Features

  • Support HorNet Backbone. (#1013)

  • Support EfficientFormer. (#954)

  • Support Stanford Cars dataset. (#893)

  • Support CSRA head. (#881)

  • Support Swin Transformer V2. (#799)

  • Support MViT and add checkpoints. (#924)

Improvements

  • [Improve] replace loop of progressbar in api/test. (#878)

  • [Enhance] RepVGG for YOLOX-PAI. (#1025)

  • [Enhancement] Update VAN. (#1017)

  • [Refactor] Re-write get_sinusoid_encoding from third-party implementation. (#965)

  • [Improve] Upgrade onnxsim to v0.4.0. (#915)

  • [Improve] Fixed typo in RepVGG. (#985)

  • [Improve] Using train_step instead of forward in PreciseBNHook (#964)

  • [Improve] Use forward_dummy to calculate FLOPS. (#953)

Bug Fixes

  • Fix warning with torch.meshgrid. (#860)

  • Add matplotlib minimum version requirements. (#909)

  • Val loader should not drop last by default. (#857)

  • Fix config.device bug in tutorial. (#1059)

  • Fix attention clamp max params. (#1034)

  • Fix device mismatch in Swin-v2. (#976)

  • Fix the output position of Swin-Transformer. (#947)

Docs Update

  • Fix typo in config.md. (#827)

  • Add version for torchvision to avoid error. (#903)

  • Fixed typo for --out-dir option of analyze_results.py. (#898)

  • Refine the docstring of RegNet (#935)

v0.23.2(28/7/2022)

New Features

  • Support MPS device. (#894)

Bug Fixes

  • Fix a bug in Albu which caused crashing. (#918)

v0.23.1(2/6/2022)

New Features

  • Dedicated MMClsWandbHook for MMClassification (Weights and Biases Integration) (#764)

Improvements

  • Use mdformat instead of markdownlint to format markdown. (#844)

Bug Fixes

  • Fix wrong --local_rank.

Docs Update

  • Update install tutorials. (#854)

  • Fix wrong link in README. (#835)

v0.23.0(1/5/2022)

New Features

  • Support DenseNet. (#750)

  • Support VAN. (#739)

Improvements

  • Support training on IPU and add fine-tuning configs of ViT. (#723)

Docs Update

  • New-style API reference that is easier to use. Welcome to check it out! (#774)

v0.22.1(15/4/2022)

New Features

  • [Feature] Support resize relative position embedding in SwinTransformer. (#749)

  • [Feature] Add PoolFormer backbone and checkpoints. (#746)

Improvements

  • [Enhance] Improve CPE performance by reducing memory copies. (#762)

  • [Enhance] Add extra dataloader settings in configs. (#752)

v0.22.0(30/3/2022)

Highlights

  • Support a series of CSP Network, such as CSP-ResNet, CSP-ResNeXt and CSP-DarkNet.

  • A new CustomDataset class to help you build dataset of yourself!

  • Support ConvMixer, RepMLP and new dataset - CUB dataset.

New Features

  • [Feature] Add CSPNet backbone and checkpoints. (#735)

  • [Feature] Add CustomDataset. (#738)

  • [Feature] Add diff seeds to diff ranks. (#744)

  • [Feature] Support ConvMixer. (#716)

  • [Feature] Our dist_train & dist_test tools support distributed training on multiple machines. (#734)

  • [Feature] Add RepMLP backbone and checkpoints. (#709)

  • [Feature] Support CUB dataset. (#703)

  • [Feature] Support ResizeMix. (#676)

Improvements

  • [Enhance] Use --a-b instead of --a_b in arguments. (#754)

  • [Enhance] Add get_cat_ids and get_gt_labels to KFoldDataset. (#721)

  • [Enhance] Set torch seed in worker_init_fn. (#733)

Bug Fixes

  • [Fix] Fix the discontiguous output feature map of ConvNeXt. (#743)

Docs Update

  • [Docs] Add brief installation steps in README for copy&paste. (#755)

  • [Docs] fix logo url link from mmocr to mmcls. (#732)

v0.21.0(04/03/2022)

Highlights

  • Support ResNetV1c and Wide-ResNet, and provide pre-trained models.

  • Support dynamic input shape for ViT-based algorithms. Now our ViT, DeiT, Swin-Transformer and T2T-ViT support forwarding with any input shape.

  • Reproduce training results of DeiT. Our DeiT-T and DeiT-S have higher accuracy compared with the official weights.

New Features

  • Add ResNetV1c. (#692)

  • Support Wide-ResNet. (#715)

  • Support gem pooling (#677)

Improvements

  • Reproduce training results of DeiT. (#711)

  • Add ConvNeXt pretrain models on ImageNet-1k. (#707)

  • Support dynamic input shape for ViT-based algorithms. (#706)

  • Add evaluate function for ConcatDataset. (#650)

  • Enhance vis-pipeline tool. (#604)

  • Return code 1 if a script fails. (#694)

  • Use PyTorch official one_hot to implement convert_to_one_hot. (#696)

  • Add a new pre-commit-hook to automatically add a copyright. (#710)

  • Add deprecation message for deploy tools. (#697)

  • Upgrade isort pre-commit hooks. (#687)

  • Use --gpu-id instead of --gpu-ids in non-distributed multi-gpu training/testing. (#688)

  • Remove deprecation. (#633)

Bug Fixes

  • Fix Conformer forward with irregular input size. (#686)

  • Add dist.barrier to fix a bug in directory checking. (#666)

v0.20.1(07/02/2022)

Bug Fixes

  • Fix the MMCV dependency version.

v0.20.0(30/01/2022)

Highlights

  • Support K-fold cross-validation. The tutorial will be released later.

  • Support HRNet, ConvNeXt, Twins and EfficientNet.

  • Support model conversion from PyTorch to Core-ML by a tool.

New Features

  • Support K-fold cross-validation. (#563)

  • Support HRNet and add pre-trained models. (#660)

  • Support ConvNeXt and add pre-trained models. (#670)

  • Support Twins and add pre-trained models. (#642)

  • Support EfficientNet and add pre-trained models.(#649)

  • Support features_only option in TIMMBackbone. (#668)

  • Add conversion script from pytorch to Core-ML model. (#597)

Improvements

  • New-style CPU training and inference. (#674)

  • Add setup multi-processing both in train and test. (#671)

  • Rewrite channel split operation in ShufflenetV2. (#632)

  • Deprecate the support for “python setup.py test”. (#646)

  • Support single-label, softmax, custom eps by asymmetric loss. (#609)

  • Save class names in best checkpoint created by evaluation hook. (#641)

Bug Fixes

  • Fix potential unexpected behaviors if metric_options is not specified in multi-label evaluation. (#647)

  • Fix API changes in pytorch-grad-cam>=1.3.7. (#656)

  • Fix bug which breaks cal_train_time in analyze_logs.py. (#662)

Docs Update

  • Update README in configs according to OpenMMLab standard. (#672)

  • Update installation guide and README. (#624)

v0.19.0(31/12/2021)

Highlights

  • The feature extraction function has been enhanced. See #593 for more details.

  • Provide the high-acc ResNet-50 training settings from ResNet strikes back.

  • Reproduce the training accuracy of T2T-ViT & RegNetX, and provide self-training checkpoints.

  • Support DeiT & Conformer backbone and checkpoints.

  • Provide a CAM visualization tool based on pytorch-grad-cam, and detailed user guide!

New Features

  • Support Precise BN. (#401)

  • Add CAM visualization tool. (#577)

  • Repeated Aug and Sampler Registry. (#588)

  • Add DeiT backbone and checkpoints. (#576)

  • Support LAMB optimizer. (#591)

  • Implement the conformer backbone. (#494)

  • Add the frozen function for Swin Transformer model. (#574)

  • Support using checkpoint in Swin Transformer to save memory. (#557)

Improvements

  • [Reproduction] Reproduce RegNetX training accuracy. (#587)

  • [Reproduction] Reproduce training results of T2T-ViT. (#610)

  • [Enhance] Provide high-acc training settings of ResNet. (#572)

  • [Enhance] Set a random seed when the user does not set a seed. (#554)

  • [Enhance] Added NumClassCheckHook and unit tests. (#559)

  • [Enhance] Enhance feature extraction function. (#593)

  • [Enhance] Improve efficiency of precision, recall, f1_score and support. (#595)

  • [Enhance] Improve accuracy calculation performance. (#592)

  • [Refactor] Refactor analysis_log.py. (#529)

  • [Refactor] Use new API of matplotlib to handle blocking input in visualization. (#568)

  • [CI] Cancel previous runs that are not completed. (#583)

  • [CI] Skip build CI if only configs or docs modification. (#575)

Bug Fixes

  • Fix test sampler bug. (#611)

  • Try to create a symbolic link, otherwise copy. (#580)

  • Fix a bug for multiple output in swin transformer. (#571)

Docs Update

  • Update mmcv, torch, cuda version in Dockerfile and docs. (#594)

  • Add analysis&misc docs. (#525)

  • Fix docs build dependency. (#584)

v0.18.0(30/11/2021)

Highlights

  • Support MLP-Mixer backbone and provide pre-trained checkpoints.

  • Add a tool to visualize the learning rate curve of the training phase. Welcome to use with the tutorial!

New Features

  • Add MLP Mixer Backbone. (#528, #539)

  • Support positive weights in BCE. (#516)

  • Add a tool to visualize the learning rate in each iteration. (#498)

Improvements

  • Use CircleCI to do unit tests. (#567)

  • Focal loss for single label tasks. (#548)

  • Remove useless import_modules_from_string. (#544)

  • Rename config files according to the config name standard. (#508)

  • Use reset_classifier to remove head of timm backbones. (#534)

  • Support passing arguments to loss from head. (#523)

  • Refactor Resize transform and add Pad transform. (#506)

  • Update mmcv dependency version. (#509)

Bug Fixes

  • Fix bug when using ClassBalancedDataset. (#555)

  • Fix a bug when using iter-based runner with ‘val’ workflow. (#542)

  • Fix interpolation method checking in Resize. (#547)

  • Fix a bug when loading checkpoints in a multi-GPU environment. (#527)

  • Fix an error on indexing scalar metrics in analyze_result.py. (#518)

  • Fix wrong condition judgment in analyze_logs.py and prevent empty curve. (#510)

Docs Update

  • Fix vit config and model broken links. (#564)

  • Add abstract and image for every paper. (#546)

  • Add mmflow and mim in banner and readme. (#543)

  • Add schedule and runtime tutorial docs. (#499)

  • Add the top-5 acc in ResNet-CIFAR README. (#531)

  • Fix TOC of visualization.md and add example images. (#513)

  • Use docs link of other projects and add MMCV docs. (#511)

v0.17.0(29/10/2021)

Highlights

  • Support Tokens-to-Token ViT backbone and Res2Net backbone. Welcome to use!

  • Support ImageNet21k dataset.

  • Add a pipeline visualization tool. Try it with the tutorials!

New Features

  • Add Tokens-to-Token ViT backbone and converted checkpoints. (#467)

  • Add Res2Net backbone and converted weights. (#465)

  • Support ImageNet21k dataset. (#461)

  • Support seesaw loss. (#500)

  • Add a pipeline visualization tool. (#406)

  • Add a tool to find broken files. (#482)

  • Add a tool to test TorchServe. (#468)

Improvements

  • Refactor Vision Transformer. (#395)

  • Use context manager to reuse matplotlib figures. (#432)

Bug Fixes

  • Remove DistSamplerSeedHook if use IterBasedRunner. (#501)

  • Set the priority of EvalHook to “LOW” to avoid a bug when using IterBasedRunner. (#488)

  • Fix a wrong parameter of get_root_logger in apis/train.py. (#486)

  • Fix version check in dataset builder. (#474)

Docs Update

  • Add English Colab tutorials and update Chinese Colab tutorials. (#483, #497)

  • Add tutorial for config files. (#487)

  • Add model-pages in Model Zoo. (#480)

  • Add code-spell pre-commit hook and fix a large amount of typos. (#470)

v0.16.0(30/9/2021)

Highlights

  • We have improved compatibility with downstream repositories like MMDetection and MMSegmentation. We will add some examples about how to use our backbones in MMDetection.

  • Add RepVGG backbone and checkpoints. Welcome to use it!

  • Add timm backbones wrapper, now you can simply use backbones of pytorch-image-models in MMClassification!

New Features

  • Add RepVGG backbone and checkpoints. (#414)

  • Add timm backbones wrapper. (#427)

Improvements

  • Fix TnT compatibility and verbose warning. (#436)

  • Support setting --out-items in tools/test.py. (#437)

  • Add datetime info and save models in the torch<1.6 format. (#439)

  • Improve downstream repositories compatibility. (#421)

  • Rename the option --options to --cfg-options in some tools. (#425)

  • Add PyTorch 1.9 and Python 3.9 build workflow, and remove some CI. (#422)

Bug Fixes

  • Fix format error in test.py when metric returns np.ndarray. (#441)

  • Fix publish_model bug if no parent of out_file. (#463)

  • Fix num_classes bug in pytorch2onnx.py. (#458)

  • Fix missing runtime requirement packaging. (#459)

  • Fix saving simplified model bug in ONNX export tool. (#438)

Docs Update

  • Update getting_started.md and install.md. And rewrite finetune.md. (#466)

  • Use PyTorch style docs theme. (#457)

  • Update metafile and Readme. (#435)

  • Add CITATION.cff. (#428)

v0.15.0(31/8/2021)

Highlights

  • Support hparams argument in AutoAugment and RandAugment to provide hyperparameters for sub-policies.

  • Support custom squeeze channels in SELayer.

  • Support classwise weight in losses.

New Features

  • Add hparams argument in AutoAugment and RandAugment and some other improvements. (#398)

  • Support classwise weight in losses. (#388)

  • Enhance SELayer to support custom squeeze channels. (#417)

Code Refactor

  • Better result visualization. (#419)

  • Use post_process function to handle pred result processing. (#390)

  • Update digit_version function. (#402)

  • Avoid albumentations installing both opencv and opencv-headless. (#397)

  • Avoid unnecessary listdir when building ImageNet. (#396)

  • Use dynamic mmcv download link in TorchServe dockerfile. (#387)

Docs Improvement

  • Add readme of some algorithms and update meta yml. (#418)

  • Add Copyright information. (#413)

  • Fix typo ‘metirc’. (#411)

  • Update QQ group QR code. (#393)

  • Add PR template and modify issue template. (#380)

v0.14.0(4/8/2021)

Highlights

  • Add transformer-in-transformer backbone and pretrain checkpoints, refers to the paper.

  • Add Chinese colab tutorial.

  • Provide dockerfile to build mmcls dev docker image.

New Features

  • Add transformer in transformer backbone and pretrain checkpoints. (#339)

  • Support mim, welcome to use mim to manage your mmcls project. (#376)

  • Add Dockerfile. (#365)

  • Add ResNeSt configs. (#332)

Improvements

  • Use the persistent_workers option if available to accelerate training. (#349)

  • Add Chinese ipynb tutorial. (#306)

  • Refactor unit tests. (#321)

  • Support to test mmdet inference with mmcls backbone. (#343)

  • Use zero as default value of thrs in metrics. (#341)

Bug Fixes

  • Fix ImageNet dataset annotation file parse bug. (#370)

  • Fix docstring typo and init bug in ShuffleNetV1. (#374)

  • Use local ATTENTION registry to avoid conflict with other repositories. (#376)

  • Fix swin transformer config bug. (#355)

  • Fix patch_cfg argument bug in SwinTransformer. (#368)

  • Fix duplicate init_weights call in ViT init function. (#373)

  • Fix broken _base_ link in a resnet config. (#361)

  • Fix vgg-19 model link missing. (#363)

v0.13.0(3/7/2021)

  • Support Swin-Transformer backbone and add training configs for Swin-Transformer on ImageNet.

New Features

  • Support Swin-Transformer backbone and add training configs for Swin-Transformer on ImageNet. (#271)

  • Add pre-trained model of RegNetX. (#269)

  • Support adding custom hooks in config file. (#305)

  • Improve and add Chinese translation of CONTRIBUTING.md and all tools tutorials. (#320)

  • Dump config before training. (#282)

  • Add torchscript and torchserve deployment tools. (#279, #284)

Improvements

  • Improve test tools and add some new tools. (#322)

  • Correct MobilenetV3 backbone structure and add pre-trained models. (#291)

  • Refactor PatchEmbed and HybridEmbed as independent components. (#330)

  • Refactor mixup and cutmix as Augments to support more functions. (#278)

  • Refactor weights initialization method. (#270, #318, #319)

  • Refactor LabelSmoothLoss to support multiple calculation formulas. (#285)

Bug Fixes

  • Fix bug for CPU training. (#286)

  • Fix missing test data when num_imgs can not be evenly divided by num_gpus. (#299)

  • Fix build compatible with pytorch v1.3-1.5. (#301)

  • Fix magnitude_std bug in RandAugment. (#309)

  • Fix bug when samples_per_gpu is 1. (#311)

v0.12.0(3/6/2021)

  • Finish adding Chinese tutorials and build Chinese documentation on readthedocs.

  • Update ResNeXt checkpoints and ResNet checkpoints on CIFAR.

New Features

  • Improve and add Chinese translation of data_pipeline.md and new_modules.md. (#265)

  • Build Chinese translation on readthedocs. (#267)

  • Add an argument efficientnet_style to RandomResizedCrop and CenterCrop. (#268)

Improvements

  • Only allow directory operation when rank==0 when testing. (#258)

  • Fix typo in base_head. (#274)

  • Update ResNeXt checkpoints. (#283)

Bug Fixes

  • Add attribute data.test in MNIST configs. (#264)

  • Download CIFAR/MNIST dataset only on rank 0. (#273)

  • Fix MMCV version compatibility. (#276)

  • Fix CIFAR color channels bug and update checkpoints in model zoo. (#280)

v0.11.1(21/5/2021)

  • Refine new_dataset.md and add Chinese translation of finetune.md, new_dataset.md.

New Features

  • Add dim argument for GlobalAveragePooling. (#236)

  • Add random noise to RandAugment magnitude. (#240)

  • Refine new_dataset.md and add Chinese translation of finetune.md, new_dataset.md. (#243)

Improvements

  • Refactor arguments passing for Heads. (#239)

  • Allow more flexible magnitude_range in RandAugment. (#249)

  • Inherits MMCV registry so that in the future OpenMMLab repos like MMDet and MMSeg could directly use the backbones supported in MMCls. (#252)

Bug Fixes

  • Fix typo in analyze_results.py. (#237)

  • Fix typo in unittests. (#238)

  • Check if specified tmpdir exists when testing to avoid deleting existing data. (#242 & #258)

  • Add missing config files in MANIFEST.in. (#250 & #255)

  • Use temporary directory under shared directory to collect results to avoid unavailability of temporary directory for multi-node testing. (#251)

v0.11.0(1/5/2021)

  • Support cutmix trick.

  • Support random augmentation.

  • Add tools/deployment/test.py as an ONNX runtime test tool.

  • Support ViT backbone and add training configs for ViT on ImageNet.

  • Add Chinese README.md and some Chinese tutorials.

New Features

  • Support cutmix trick. (#198)

  • Add simplify option in pytorch2onnx.py. (#200)

  • Support random augmentation. (#201)

  • Add config and checkpoint for training ResNet on CIFAR-100. (#208)

  • Add tools/deployment/test.py as an ONNX runtime test tool. (#212)

  • Support ViT backbone and add training configs for ViT on ImageNet. (#214)

  • Add finetuning configs for ViT on ImageNet. (#217)

  • Add device option to support training on CPU. (#219)

  • Add Chinese README.md and some Chinese tutorials. (#221)

  • Add metafile.yml in configs to support interaction with Papers With Code (PWC) and MMCLI. (#225)

  • Upload configs and converted checkpoints for ViT fine-tuning on ImageNet. (#230)

Improvements

  • Fix LabelSmoothLoss so that label smoothing and mixup could be enabled at the same time. (#203)

  • Add cal_acc option in ClsHead. (#206)

  • Check CLASSES in checkpoint to avoid unexpected key error. (#207)

  • Check mmcv version when importing mmcls to ensure compatibility. (#209)

  • Update CONTRIBUTING.md to align with that in MMCV. (#210)

  • Change tags to html comments in configs README.md. (#226)

  • Clean codes in ViT backbone. (#227)

  • Reformat pytorch2onnx.md tutorial. (#229)

  • Update setup.py to support MMCLI. (#232)

Bug Fixes

  • Fix missing cutmix_prob in ViT configs. (#220)

  • Fix backend for resize in ResNeXt configs. (#222)

v0.10.0(1/4/2021)

  • Support AutoAugmentation

  • Add tutorials for installation and usage.

New Features

  • Add Rotate pipeline for data augmentation. (#167)

  • Add Invert pipeline for data augmentation. (#168)

  • Add Color pipeline for data augmentation. (#171)

  • Add Solarize and Posterize pipeline for data augmentation. (#172)

  • Support fp16 training. (#178)

  • Add tutorials for installation and basic usage of MMClassification.(#176)

  • Support AutoAugmentation, AutoContrast, Equalize, Contrast, Brightness and Sharpness pipelines for data augmentation. (#179)

Improvements

  • Support dynamic shape export to onnx. (#175)

  • Release training configs and update model zoo for fp16 (#184)

  • Use MMCV’s EvalHook in MMClassification (#182)

Bug Fixes

  • Fix wrong naming in vgg config (#181)

v0.9.0(1/3/2021)

  • Implement mixup trick.

  • Add a new tool to create TensorRT engine from ONNX, run inference and verify outputs in Python.

New Features

  • Implement mixup and provide configs of training ResNet50 using mixup. (#160)

  • Add Shear pipeline for data augmentation. (#163)

  • Add Translate pipeline for data augmentation. (#165)

  • Add tools/onnx2tensorrt.py as a tool to create TensorRT engine from ONNX, run inference and verify outputs in Python. (#153)

Improvements

  • Add --eval-options in tools/test.py to support eval options override, matching the behavior of other open-mmlab projects. (#158)

  • Support showing and saving painted results in mmcls.apis.test and tools/test.py, matching the behavior of other open-mmlab projects. (#162)

Bug Fixes

  • Fix configs for VGG, replace checkpoints converted from other repos with the ones trained by ourselves and upload the missing logs in the model zoo. (#161)

v0.8.0(31/1/2021)

  • Support multi-label task.

  • Support more flexible metrics settings.

  • Fix bugs.

New Features

  • Add evaluation metrics: mAP, CP, CR, CF1, OP, OR, OF1 for multi-label task. (#123)

  • Add BCE loss for multi-label task. (#130)

  • Add focal loss for multi-label task. (#131)

  • Support PASCAL VOC 2007 dataset for multi-label task. (#134)

  • Add asymmetric loss for multi-label task. (#132)

  • Add analyze_results.py to select images for success/fail demonstration. (#142)

  • Support new metric that calculates the total number of occurrences of each label. (#143)

  • Support class-wise evaluation results. (#143)

  • Add thresholds in eval_metrics. (#146)

  • Add heads and a baseline config for multilabel task. (#145)

Improvements

  • Remove the models with 0 checkpoint and ignore the repeated papers when counting papers to gain more accurate model statistics. (#135)

  • Add tags in README.md. (#137)

  • Fix optional issues in docstring. (#138)

  • Update stat.py to classify papers. (#139)

  • Fix mismatched columns in README.md. (#150)

  • Fix test.py to support more evaluation metrics. (#155)

Bug Fixes

  • Fix bug in VGG weight_init. (#140)

  • Fix bug in 2 ResNet configs in which outdated heads were used. (#147)

  • Fix bug of misordered height and width in RandomCrop and RandomResizedCrop. (#151)

  • Fix missing meta_keys in Collect. (#149 & #152)

v0.7.0(31/12/2020)

  • Add more evaluation metrics.

  • Fix bugs.

New Features

  • Remove installation of MMCV from requirements. (#90)

  • Add 3 evaluation metrics: precision, recall and F-1 score. (#93)

  • Allow config override during testing and inference with --options. (#91 & #96)

Improvements

  • Use build_runner to make runners more flexible. (#54)

  • Support to get category ids in BaseDataset. (#72)

  • Allow CLASSES override during BaseDataset initialization. (#85)

  • Allow input image as ndarray during inference. (#87)

  • Optimize MNIST config. (#98)

  • Add config links in model zoo documentation. (#99)

  • Use functions from MMCV to collect environment. (#103)

  • Refactor config files so that they are now categorized by methods. (#116)

  • Add README in config directory. (#117)

  • Add model statistics. (#119)

  • Refactor documentation in consistency with other MM repositories. (#126)

Bug Fixes

  • Add missing CLASSES argument to dataset wrappers. (#66)

  • Fix slurm evaluation error during training. (#69)

  • Resolve error caused by shape in Accuracy. (#104)

  • Fix bug caused by extremely insufficient data in distributed sampler.(#108)

  • Fix bug in gpu_ids in distributed training. (#107)

  • Fix bug caused by extremely insufficient data in collect results during testing (#114)

v0.6.0(11/10/2020)

  • Support new methods: ResNeSt and VGG.

  • Support new dataset: CIFAR10.

  • Provide new tools to do model inference, model conversion from pytorch to onnx.

New Features

  • Add model inference. (#16)

  • Add pytorch2onnx. (#20)

  • Add PIL backend for transform Resize. (#21)

  • Add ResNeSt. (#25)

  • Add VGG and its pre-trained models. (#27)

  • Add CIFAR10 configs and models. (#38)

  • Add albumentations transforms. (#45)

  • Visualize results on image demo. (#58)

Improvements

  • Replace urlretrieve with urlopen in dataset.utils. (#13)

  • Resize image according to its short edge. (#22)

  • Update ShuffleNet config. (#31)

  • Update pre-trained models for shufflenet_v2, shufflenet_v1, se-resnet50, se-resnet101. (#33)

Bug Fixes

  • Fix init_weights in shufflenet_v2.py. (#29)

  • Fix the parameter size in test_pipeline. (#30)

  • Fix the parameter in cosine lr schedule. (#32)

  • Fix the convert tools for mobilenet_v2. (#34)

  • Fix crash in CenterCrop transform when image is greyscale (#40)

  • Fix outdated configs. (#53)

Compatibility of 0.x Versions

MMClassification 0.20.1

MMCV compatibility

In the Twins backbone, we use the PatchEmbed module provided by MMCV, which was added in MMCV 1.4.2. Therefore, the minimum required MMCV version is raised to 1.4.2.

Frequently Asked Questions

We list some common problems and their corresponding solutions here. Feel free to enrich this list if you find any frequent issues and know how to solve them. If the contents here do not cover your issue, please create an issue on GitHub using the provided issue template and fill in all required information.

Installation

  • Compatibility issues between MMCV and MMClassification, e.g. "AssertionError: MMCV==xxx is used but incompatible. Please install mmcv>=xxx, <=xxx."

    We list the MMCV version requirements of each MMClassification version below. Please choose the correct MMCV version to avoid installation and usage problems.

    | MMClassification version | MMCV version          |
    | :----------------------: | :-------------------- |
    | dev                      | mmcv>=1.7.0, <1.9.0   |
    | 0.25.0 (master)          | mmcv>=1.4.2, <1.9.0   |
    | 0.24.1                   | mmcv>=1.4.2, <1.9.0   |
    | 0.23.2                   | mmcv>=1.4.2, <1.7.0   |
    | 0.22.1                   | mmcv>=1.4.2, <1.6.0   |
    | 0.21.0                   | mmcv>=1.4.2, <=1.5.0  |
    | 0.20.1                   | mmcv>=1.4.2, <=1.5.0  |
    | 0.19.0                   | mmcv>=1.3.16, <=1.5.0 |
    | 0.18.0                   | mmcv>=1.3.16, <=1.5.0 |
    | 0.17.0                   | mmcv>=1.3.8, <=1.5.0  |
    | 0.16.0                   | mmcv>=1.3.8, <=1.5.0  |
    | 0.15.0                   | mmcv>=1.3.8, <=1.5.0  |
    | 0.14.0                   | mmcv>=1.3.8, <=1.5.0  |
    | 0.13.0                   | mmcv>=1.3.8, <=1.5.0  |
    | 0.12.0                   | mmcv>=1.3.1, <=1.5.0  |
    | 0.11.1                   | mmcv>=1.3.1, <=1.5.0  |
    | 0.11.0                   | mmcv>=1.3.0           |
    | 0.10.0                   | mmcv>=1.3.0           |
    | 0.9.0                    | mmcv>=1.1.4           |
    | 0.8.0                    | mmcv>=1.1.4           |
    | 0.7.0                    | mmcv>=1.1.4           |
    | 0.6.0                    | mmcv>=1.1.4           |

    Note

    Since the dev branch is under frequent development, the MMCV version requirement above may be inaccurate. If you encounter problems when using the dev branch, please try updating MMCV to the latest version.

  • Using Albumentations

    If you want to use features that depend on albumentations, we recommend installing it with pip install -r requirements/optional.txt or pip install -U albumentations>=0.3.2 --no-binary qudida,albumentations.

    If you simply run pip install albumentations>=0.3.2, it will also install opencv-python-headless (even if you already have opencv-python installed). See the official documentation for details.

Development

  • Do I need to reinstall mmcls after modifying the source code to make the changes take effect?

    If you follow the best practice and install mmcls from source, any local modifications take effect without reinstalling.

  • How to develop with multiple MMClassification versions?

    Generally speaking, we recommend using separate virtual environments to manage MMClassification in different working directories. But if you want to use the same environment to develop in different directories (e.g. mmcls-0.21, mmcls-0.23), the training and testing shell scripts we provide automatically use the mmcls in the current directory, and other Python scripts can use the code in the current directory by prepending PYTHONPATH=`pwd` to the command.

    Conversely, if you want the shell scripts to use the MMClassification installed in the environment instead of the one in the current directory, remove the following line from the shell scripts:

    PYTHONPATH="$(dirname $0)/..":$PYTHONPATH
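
    For example (a hedged illustration, the config path is only an example), to run a training script with the mmcls in the current directory rather than the installed one:

    PYTHONPATH=`pwd` python tools/train.py configs/resnet/resnet18_8xb32_in1k.py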
    

NPU (Huawei Ascend)

Usage

First, install MMCV with NPU support by following the tutorial.

Run the following command to train a model with 8 NPUs on a single machine (taking ResNet as an example):

bash tools/dist_train.sh configs/cspnet/resnet50_8xb32_in1k.py 8 --device npu

Alternatively, run the following command to train a model on a single NPU (taking ResNet as an example):

python tools/train.py configs/cspnet/resnet50_8xb32_in1k.py --device npu

Verified Models

| Model                           | Top-1 (%) | Top-5 (%) | Config | Download     |
| :------------------------------ | :-------: | :-------: | :----: | :----------- |
| CSPResNeXt50                    | 77.10     | 93.55     | config | model \| log |
| DenseNet121                     | 72.62     | 91.04     | config | model \| log |
| EfficientNet-B4 (AA + AdvProp)  | 75.55     | 92.86     | config | model \| log |
| HRNet-W18                       | 77.01     | 93.46     | config | model \| log |
| ResNetV1D-152                   | 77.11     | 94.54     | config | model \| log |
| ResNet-50                       | 76.40     | -         | config | model \| log |
| ResNeXt-32x4d-50                | 77.55     | 93.75     | config | model \| log |
| SE-ResNet-50                    | 77.64     | 93.76     | config | model \| log |
| VGG-11                          | 68.92     | 88.83     | config | model \| log |
| ShuffleNetV2 1.0x               | 69.53     | 88.82     | config | model \| log |

All the model weights and training logs above are provided by the Huawei Ascend team.
