BLIP

Abstract

Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner.
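Since the caption bootstrapping (CapFilt) is the core of the method, the loop below sketches the idea in plain Python. The `captioner` and `filter_model` callables are hypothetical stand-ins for the finetuned captioning and image-text matching heads described in the paper, not mmpretrain APIs:

```python
def bootstrap_captions(web_pairs, captioner, filter_model, threshold=0.5):
    """Sketch of CapFilt: clean noisy (image, caption) web pairs.

    captioner: hypothetical callable, image -> synthetic caption.
    filter_model: hypothetical callable, (image, caption) -> match score.
    """
    clean_pairs = []
    for image, web_caption in web_pairs:
        synthetic_caption = captioner(image)  # generate a synthetic caption
        # Keep a caption (web or synthetic) only if the filter judges it
        # to match the image; noisy captions are dropped.
        for caption in (web_caption, synthetic_caption):
            if filter_model(image, caption) >= threshold:
                clean_pairs.append((image, caption))
    return clean_pairs
```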

Usage

from mmpretrain import inference_model

# Generate a caption for an image with the pretrained BLIP captioner.
result = inference_model('blip-base_3rdparty_caption', 'demo/cat-dog.png')
print(result)
# {'pred_caption': 'a puppy and a cat sitting on a blanket'}
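The other BLIP checkpoints listed below are driven through the same `inference_model` entry point, which selects the matching inferencer from the model name. A minimal sketch for the VQA checkpoint, assuming the question string is forwarded as the second input (the example question and output field are illustrative):

```python
from mmpretrain import inference_model

# Ask a free-form question about an image with the BLIP VQA checkpoint.
# The question below is an illustrative example.
result = inference_model('blip-base_3rdparty_vqa', 'demo/cat-dog.png',
                         'What animals are in the picture?')
print(result)  # expected to contain the predicted answer
```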

Models and results

Image Caption on COCO

| Model | Params (M) | BLEU-4 | CIDEr | Config | Download |
| :--- | ---: | ---: | ---: | :---: | :---: |
| blip-base_3rdparty_caption* | 223.97 | 40.12 | 132.82 | config | model |

Image Caption on NoCaps

| Model | Params (M) | SPICE | CIDEr | Config | Download |
| :--- | ---: | ---: | ---: | :---: | :---: |
| blip-base_3rdparty_caption* | 223.97 | 14.69 | 109.12 | config | model |

Image Caption on Flickr30k

| Model | Params (M) | SPICE | CIDEr | Config | Download |
| :--- | ---: | ---: | ---: | :---: | :---: |
| blip-base_3rdparty_caption* | 223.97 | 15.58 | 68.89 | config | model |

Visual Grounding on RefCOCO

| Model | Params (M) | Accuracy (testA) | Accuracy (testB) | Config | Download |
| :--- | ---: | ---: | ---: | :---: | :---: |
| blip-base_8xb16_refcoco | 498.49 | 86.14 | 77.33 | config | model \| log |
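For the grounding checkpoint, inference takes an image together with a referring expression. A minimal sketch, assuming `inference_model` dispatches to the visual-grounding inferencer (the expression and output format are illustrative):

```python
from mmpretrain import inference_model

# Localize the region described by a referring expression (illustrative).
result = inference_model('blip-base_8xb16_refcoco', 'demo/cat-dog.png',
                         'the puppy on the left')
print(result)  # expected to contain the predicted bounding box
```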

Visual Question Answering on VQAv2

| Model | Params (M) | Accuracy | Config | Download |
| :--- | ---: | ---: | :---: | :---: |
| blip-base_3rdparty_vqa* | 361.48 | 78.20 | config | model |

Visual Question Answering on OK-VQA

| Model | Params (M) | Accuracy | Config | Download |
| :--- | ---: | ---: | :---: | :---: |
| blip-base_3rdparty_vqa* | 361.48 | 40.59# | config | model |

Visual Question Answering on OCR-VQA

| Model | Params (M) | Accuracy | Config | Download |
| :--- | ---: | ---: | :---: | :---: |
| blip-base_3rdparty_vqa* | 361.48 | 28.30# | config | model |

Image-To-Text Retrieval on COCO

| Model | Params (M) | Recall@1 | Recall@5 | Config | Download |
| :--- | ---: | ---: | ---: | :---: | :---: |
| blip-base_3rdparty_retrieval* | 447.49 | 82.52 | 95.34 | config | model |

Text-To-Image Retrieval on COCO

| Model | Params (M) | Recall@1 | Recall@5 | Config | Download |
| :--- | ---: | ---: | ---: | :---: | :---: |
| blip-base_3rdparty_retrieval* | 447.49 | 64.82 | 86.28 | config | model |

Image-To-Text Retrieval on Flickr30k

| Model | Params (M) | Recall@1 | Recall@5 | Config | Download |
| :--- | ---: | ---: | ---: | :---: | :---: |
| blip-base_3rdparty_retrieval* | 447.49 | 95.10# | 99.60# | config | model |

Text-To-Image Retrieval on Flickr30k

| Model | Params (M) | Recall@1 | Recall@5 | Config | Download |
| :--- | ---: | ---: | ---: | :---: | :---: |
| blip-base_3rdparty_retrieval* | 447.49 | 85.26# | 96.58# | config | model |

NLVR on NLVR2

| Model | Params (M) | Top-1 (%) | Config | Download |
| :--- | ---: | ---: | :---: | :---: |
| blip-base_3rdparty_nlvr* | 259.37 | 82.33 | config | model |

Models with * are converted from the official repo. The config files of these models are only for inference; we haven't reproduced the training results.

Results with # denote zero-shot evaluation: the corresponding model has not been fine-tuned on that dataset.
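To enumerate every BLIP checkpoint registered in mmpretrain (for example, to pick one of the variants above), `list_models` accepts a Unix-style wildcard pattern:

```python
from mmpretrain import list_models

# List all registered BLIP checkpoints by wildcard pattern.
print(list_models('blip*'))
```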

Citation

@inproceedings{li2022blip,
      title={BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation},
      author={Junnan Li and Dongxu Li and Caiming Xiong and Steven Hoi},
      year={2022},
      booktitle={ICML},
}