BlipVQA
- class mmpretrain.models.multimodal.BlipVQA(tokenizer, vision_backbone, multimodal_backbone, head, data_preprocessor=None, init_cfg=None)[source]
BLIP VQA.
- Parameters:
tokenizer (dict) – The config for the tokenizer.
vision_backbone (dict) – Encoder for extracting image features.
multimodal_backbone (dict) – Backbone for extracting multi-modal features, used as the VQA fusion module.
head (dict) – The head module to calculate loss from the processed features.
data_preprocessor (Optional[dict]) – The config for preprocessing input data. If None, or if no type is specified, MultiModalDataPreprocessor is used. See MultiModalDataPreprocessor for more details. Defaults to None.
init_cfg (Optional[dict]) – The config to control the initialization. Defaults to None.
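As a sketch, a config for this class might look like the following. Only the top-level keys mirror the constructor signature above; the `type` names inside the sub-configs are illustrative assumptions, not values taken from mmpretrain:

```python
# Hypothetical config sketch for BlipVQA. Only the top-level keys follow
# the documented constructor signature; the sub-config contents are assumptions.
blip_vqa_cfg = dict(
    type='BlipVQA',
    tokenizer=dict(type='BlipTokenizer'),            # config for the tokenizer
    vision_backbone=dict(type='VisionTransformer'),  # image feature encoder
    multimodal_backbone=dict(type='XBertEncoder'),   # VQA fusion module
    head=dict(type='VQAGenerationHead'),             # computes loss from features
    data_preprocessor=None,  # falls back to MultiModalDataPreprocessor
)
```

In mmpretrain-style configs, such a dict would typically be built through the model registry rather than instantiated directly.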
- extract_feat(images)[source]
Extract features from the input tensor with shape (N, C, ...).
- Parameters:
images (Tensor) – A batch of images, with shape (B, C, H, W) for images or (B, T, C, H, W) for videos.
- Returns:
The output features.
- Return type:
visual_embeds (Tensor)
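For intuition, the shape bookkeeping of extract_feat can be sketched as follows. This assumes a ViT-style vision_backbone with 16×16 patches and a prepended [CLS] token; the actual embedding length and width depend on the configured backbone:

```python
def visual_embed_shape(batch, height, width, patch=16, dim=768):
    """Expected shape of visual_embeds for a ViT-style backbone (a sketch,
    assuming square patches and one extra [CLS] token)."""
    num_tokens = (height // patch) * (width // patch) + 1  # patches + [CLS]
    return (batch, num_tokens, dim)

visual_embed_shape(8, 224, 224)  # (8, 197, 768)
```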
- forward(images, data_samples=None, mode='loss')[source]
The unified entry for a forward process in both training and test.
"loss": For training. Forward and return a dict of losses according to the given inputs and data samples. Note that this method handles neither back propagation nor optimizer updating, which are done in train_step().
"predict": For testing. Forward and return a list of data samples that contain a pred_answer for each question.
- Parameters:
images (Tensor) – A batch of images, with shape (B, C, H, W) for images or (B, T, C, H, W) for videos.
data_samples (List[DataSample], optional) – The annotation data of every sample. Required when mode="loss". Defaults to None.
mode (str) – Return what kind of value. Defaults to 'loss'.
- Returns:
The return type depends on mode.
- If mode="loss", return a dict of tensors.
- If mode="predict", return a list of DataSample.
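The dispatch on mode can be sketched with a minimal stand-in (plain Python, not the real implementation; the real method delegates to loss() and predict() and operates on tensors and DataSample objects):

```python
def forward(images, data_samples=None, mode='loss'):
    # Minimal mode dispatch mirroring the contract described above.
    if mode == 'loss':
        return {'loss': 0.0}  # a dict of tensors in the real model
    elif mode == 'predict':
        # the real model fills pred_answer on each returned data sample
        return list(data_samples or [])
    raise RuntimeError(f'Invalid mode "{mode}".')

forward(None, mode='loss')                 # {'loss': 0.0}
forward(None, ['sample'], mode='predict')  # ['sample']
```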
- loss(images, data_samples=None)[source]
Calculate the training loss from the input images and data samples.
- Parameters:
images (Tensor) – A batch of images, with shape (B, C, H, W) for images or (B, T, C, H, W) for videos.
data_samples (List[DataSample], optional) – The annotation data of every sample.
- Returns:
The dict of computed losses.
- Return type:
Dict[str, torch.Tensor]
- predict(images, data_samples=None)[source]
Predict the answer for each question and update data_samples with the pred_answer.
- Parameters:
images (Tensor) – A batch of images, with shape (B, C, H, W) for images or (B, T, C, H, W) for videos.
data_samples (List[DataSample], optional) – The annotation data of every sample.
- Returns:
The data samples, each updated with a pred_answer for its question.
- Return type:
List[DataSample]