Blip2VQA

class mmpretrain.models.multimodal.Blip2VQA(vision_backbone, text_backbone, multimodal_backbone, vision_neck, tokenizer=None, prompt='', max_txt_len=20, num_captions=1, data_preprocessor=None, init_cfg=None)[source]

BLIP2 VQA.

Module for BLIP2 VQA task. For more details about the initialization params, please refer to Blip2Caption.
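
Below is a minimal inference sketch using mmpretrain's high-level VisualQuestionAnsweringInferencer rather than constructing Blip2VQA by hand. The checkpoint name and image path are assumptions for illustration; verify them against the mmpretrain model zoo.

>>> from mmpretrain import VisualQuestionAnsweringInferencer
>>> # 'blip2-opt2.7b_3rdparty-zeroshot_vqa' is an assumed model zoo name;
>>> # check list_models(task='Visual Question Answering') for available checkpoints.
>>> inferencer = VisualQuestionAnsweringInferencer('blip2-opt2.7b_3rdparty-zeroshot_vqa')
>>> results = inferencer('demo/cat-dog.png', 'What animal is next to the dog?')
>>> print(results[0]['pred_answer'])  # the generated answer string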

predict(images, data_samples=None, **kwargs)[source]

Predict answers from a batch of inputs.

Parameters:
  • images (torch.Tensor) – The input tensor with shape (N, C, …) in general.

  • data_samples (List[DataSample], optional) – The annotation data of every sample. Defaults to None.

  • **kwargs – Other keyword arguments accepted by the predict method of the head.

Returns:

A list of data samples with the predicted answers.

Return type:

List[DataSample]
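
For calling predict directly, here is a hedged sketch assuming an already-built Blip2VQA instance (obtained via get_model with an assumed checkpoint name) and assuming mmpretrain's usual VQA convention of carrying the question in each DataSample and returning the answer in pred_answer:

>>> import torch
>>> from mmpretrain import get_model
>>> from mmpretrain.structures import DataSample
>>> # Assumed checkpoint name; verify against the mmpretrain model zoo.
>>> model = get_model('blip2-opt2.7b_3rdparty-zeroshot_vqa', pretrained=True).eval()
>>> images = torch.rand(1, 3, 224, 224)  # dummy batch of one RGB image
>>> sample = DataSample()
>>> sample.set_field('How many cats are in the picture?', 'question')
>>> results = model.predict(images, data_samples=[sample])
>>> print(results[0].pred_answer)  # assumed output field, per mmpretrain VQA models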
