
BlipVQA

class mmpretrain.models.multimodal.BlipVQA(tokenizer, vision_backbone, multimodal_backbone, head, data_preprocessor=None, init_cfg=None)[source]

BLIP VQA.

Parameters:
  • tokenizer (dict) – The config for the tokenizer.

  • vision_backbone (dict) – Encoder for extracting image features.

  • multimodal_backbone (dict) – Backbone for extracting multi-modal features. This part is used as the VQA fusion module.

  • head (dict) – The head module to calculate loss from processed features.

  • data_preprocessor (Optional[dict]) – The config for preprocessing input data. If None or the type is not specified, MultiModalDataPreprocessor will be used. See MultiModalDataPreprocessor for more details. Defaults to None.

  • init_cfg (Optional[dict]) – The config to control the initialization. Defaults to None.
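
For orientation, a minimal config sketch wiring these arguments together is shown below. The component type names (tokenizer, backbone, and head types) are illustrative assumptions and not taken from this page; consult the released BLIP VQA configs in mmpretrain for the exact values.

  # Hypothetical config sketch for BlipVQA; the nested `type` values are
  # illustrative placeholders and may not match the released configs.
  model = dict(
      type='BlipVQA',
      tokenizer=dict(type='BlipTokenizer', name_or_path='bert-base-uncased'),  # assumed tokenizer config
      vision_backbone=dict(type='VisionTransformer', arch='b', img_size=480),  # assumed image encoder
      multimodal_backbone=dict(type='XBertEncoder'),                           # assumed VQA fusion module
      head=dict(type='VQAGenerationHead'),                                     # assumed answer head
      data_preprocessor=None,  # falls back to the multimodal data preprocessor
      init_cfg=None,
  )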

extract_feat(images)[source]

Extract features from the input tensor with shape (N, C, ...).

Parameters:

images (Tensor) – A batch of images. Its shape should be (B, C, H, W) for images and (B, T, C, H, W) for videos.

Returns:

The output features.

Return type:

visual_embeds (Tensor)
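
A minimal usage sketch, assuming model is an already-built BlipVQA instance:

  import torch

  # Sketch only: `model` is assumed to be a built BlipVQA instance.
  images = torch.rand(2, 3, 480, 480)         # a batch of 2 RGB images, shape (B, C, H, W)
  visual_embeds = model.extract_feat(images)  # image features from the vision backbone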

forward(images, data_samples=None, mode='loss')[source]

The unified entry point for forward passes in both training and testing. The method accepts two modes:

  • “loss”: For training. Forward and return a dict of losses according to the given inputs and data samples. Note that this method does not handle back propagation or optimizer updates, which are done in train_step().

  • “predict”: For testing. Forward and return a list of DataSample, each containing the pred_answer for its question.

Parameters:
  • images (Tensor) – A batch of images. Its shape should be (B, C, H, W) for images and (B, T, C, H, W) for videos.

  • data_samples (List[DataSample], optional) – The annotation data of each sample. Required when mode="loss". Defaults to None.

  • mode (str) – Which kind of value to return. Defaults to ‘loss’.

Returns:

The return type depends on mode.

  • If mode="loss", return a dict of tensors.

  • If mode="predict", return a list of DataSample.
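
A brief sketch of the two modes; model, images, and data_samples are assumed to be prepared elsewhere (e.g. by the data pipeline):

  # Sketch only: `model`, `images` and `data_samples` are assumed to exist.
  losses = model.forward(images, data_samples, mode='loss')          # dict of loss tensors
  predictions = model.forward(images, data_samples, mode='predict')  # list of DataSample with `pred_answer`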

loss(images, data_samples=None)[source]

Generate the training loss from the input tensor and data_samples.

Parameters:
  • images (Tensor) – A batch of images. Its shape should be (B, C, H, W) for images and (B, T, C, H, W) for videos.

  • data_samples (List[DataSample], optional) – The annotation data of each sample.

Returns:

The computed losses.

Return type:

Dict[str, torch.Tensor]
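
For illustration only, the returned dict might be aggregated for logging or backward; the key names depend on the configured head, so this is a sketch rather than the library's training loop:

  import torch

  # Sketch only: `model`, `images` and `data_samples` are assumed to exist.
  losses = model.loss(images, data_samples)
  total_loss = sum(v for v in losses.values() if isinstance(v, torch.Tensor))  # simple aggregation sketch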

predict(images, data_samples=None)[source]

Update data_samples with the pred_answer for each question.

Parameters:
  • images (Tensor) – A batch of images. Its shape should be (B, C, H, W) for images and (B, T, C, H, W) for videos.

  • data_samples (List[DataSample], optional) – The annotation data of each sample.

Returns:

The data samples updated with pred_answer for each question.

Return type:

List[DataSample]
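
A short sketch of reading the predictions back, assuming model, images, and data_samples were prepared by the usual data pipeline:

  # Sketch only: read the predicted answers from the updated data samples.
  data_samples = model.predict(images, data_samples)
  for sample in data_samples:
      print(sample.pred_answer)  # the predicted answer for each question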
