# BLIP

## Abstract
Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner.

## How to use it?

```python
from mmpretrain import inference_model

result = inference_model('blip-base_3rdparty_caption', 'demo/cat-dog.png')
print(result)
# {'pred_caption': 'a puppy and a cat sitting on a blanket'}
```
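The same `inference_model` entry point covers BLIP's other tasks. Below is a minimal sketch for visual question answering; the `blip-base_3rdparty_vqa` alias and the argument order (image first, question second) are assumptions not taken from this page, so confirm them with `mmpretrain.list_models` before relying on them:

```python
from mmpretrain import inference_model, list_models

# List the BLIP checkpoints registered for VQA in your installed version
# (the alias used below is an assumption, not taken from this page).
print(list_models(task='Visual Question Answering', pattern='blip'))

# Ask a free-form question about the same demo image.
result = inference_model(
    'blip-base_3rdparty_vqa',          # assumed VQA checkpoint alias
    'demo/cat-dog.png',                # image
    'What animals are in the image?',  # question
)
print(result)  # expected to contain a predicted answer field
```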
Prepare your dataset according to the docs.
Test:

```shell
python tools/test.py configs/blip/blip-base_8xb32_caption.py https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty_coco-caption_20230419-a5b71af3.pth
```
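If you prefer to work with the pretrained network directly in Python rather than through `tools/test.py`, `mmpretrain.get_model` builds the model and loads its weights. The sketch below reuses the `blip-base_3rdparty_caption` alias from the inference example above; the parameter count is printed only as a sanity check:

```python
from mmpretrain import get_model

# Build the BLIP captioning model and load the released weights.
# Pass a local path or URL to `pretrained` instead of True to pin a
# specific checkpoint file.
model = get_model('blip-base_3rdparty_caption', pretrained=True)
model.eval()

# The result is a regular torch.nn.Module, so it can be inspected,
# moved to GPU, or wrapped in your own inference loop.
n_params = sum(p.numel() for p in model.parameters()) / 1e6
print(f'{type(model).__name__}: {n_params:.2f}M parameters')
```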
## Models and results

### Image Caption on COCO

| Model | Params (M) | BLEU-4 | CIDEr | Config | Download |
|---|---|---|---|---|---|
|  | 223.97 | 40.12 | 132.82 |  |  |
### Image Caption on NoCaps

| Model | Params (M) | SPICE | CIDEr | Config | Download |
|---|---|---|---|---|---|
|  | 223.97 | 14.69 | 109.12 |  |  |
### Image Caption on Flickr30k

| Model | Params (M) | SPICE | CIDEr | Config | Download |
|---|---|---|---|---|---|
|  | 223.97 | 15.58 | 68.89 |  |  |
### Visual Grounding on RefCOCO

| Model | Params (M) | Accuracy (testA) | Accuracy (testB) | Config | Download |
|---|---|---|---|---|---|
|  | 498.49 | 86.14 | 77.33 |  |  |
### Visual Question Answering on VQAv2

| Model | Params (M) | Accuracy | Config | Download |
|---|---|---|---|---|
|  | 361.48 | 78.20 |  |  |
### Visual Question Answering on OK-VQA

| Model | Params (M) | Accuracy | Config | Download |
|---|---|---|---|---|
|  | 361.48 | 40.59# |  |  |
### Visual Question Answering on OCR-VQA

| Model | Params (M) | Accuracy | Config | Download |
|---|---|---|---|---|
|  | 361.48 | 28.30# |  |  |
### Image-To-Text Retrieval on COCO

| Model | Params (M) | Recall@1 | Recall@5 | Config | Download |
|---|---|---|---|---|---|
|  | 447.49 | 82.52 | 95.34 |  |  |
### Text-To-Image Retrieval on COCO

| Model | Params (M) | Recall@1 | Recall@5 | Config | Download |
|---|---|---|---|---|---|
|  | 447.49 | 64.82 | 86.28 |  |  |
### Image-To-Text Retrieval on Flickr30k

| Model | Params (M) | Recall@1 | Recall@5 | Config | Download |
|---|---|---|---|---|---|
|  | 447.49 | 95.10# | 99.60# |  |  |
### Text-To-Image Retrieval on Flickr30k

| Model | Params (M) | Recall@1 | Recall@5 | Config | Download |
|---|---|---|---|---|---|
|  | 447.49 | 85.26# | 96.58# |  |  |
### NLVR on NLVR2

| Model | Params (M) | Top-1 (%) | Config | Download |
|---|---|---|---|---|
|  | 259.37 | 82.33 |  |  |
Models with * are converted from the official repo. The config files of these models are only for inference. We haven't reproduced the training results.

Results with # denote zero-shot evaluation. The corresponding model hasn't been fine-tuned on that dataset.
## Citation

```bibtex
@inproceedings{li2022blip,
  title={BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation},
  author={Junnan Li and Dongxu Li and Caiming Xiong and Steven Hoi},
  booktitle={ICML},
  year={2022},
}
```