
BlipRetrieval

class mmpretrain.models.multimodal.BlipRetrieval(vision_backbone, text_backbone, multimodal_backbone=None, vision_neck=None, text_neck=None, head=None, multimodal_head=None, tokenizer=None, momentum=0.995, negative_all_rank=True, temperature=0.07, fast_match=False, topk=256, max_txt_len=20, data_preprocessor=None, init_cfg=None)[source]

BLIP Retriever.

Parameters:
  • vision_backbone (dict) – Backbone for extracting image features.

  • text_backbone (dict) – Backbone for extracting text features.

  • multimodal_backbone (Optional[dict]) – Backbone for extracting multi-modal features.

  • vision_neck (Optional[dict]) – The neck module to process image features from vision backbone. Defaults to None.

  • text_neck (Optional[dict]) – The neck module to process text features from text backbone. Defaults to None.

  • head (Optional[Union[List[dict], dict]]) – The head module to calculate loss from processed single-modality features. See mmpretrain.models.heads. Note that if the head is not set, the loss method cannot be used. Defaults to None.

  • multimodal_head (Optional[Union[List[dict], dict]]) – The multi-modal head module to calculate loss from processed multimodal features. See mmpretrain.models.heads. Note that if the head is not set, the loss method cannot be used. Defaults to None.

  • momentum (float) – Momentum used for momentum contrast. Defaults to 0.995.

  • negative_all_rank (bool) – Whether to sample negative data from all ranks for image-text matching during training. Defaults to True.

  • temperature (float) – Temperature parameter that controls the concentration level of the distribution. Defaults to 0.07.

  • fast_match (bool) – If False, select the top-k similarities as candidates and compute matching scores on them. If True, return the similarity directly as the matching score. Defaults to False.

  • topk (int) – The number of top-k similarity candidates selected when computing matching scores. Note that this is not the top-k used in evaluation. Defaults to 256.

  • data_preprocessor (Optional[dict]) – The config for preprocessing input data. If None or the type is not specified, “MultiModalDataPreprocessor” will be used as the type. See MultiModalDataPreprocessor for more details. Defaults to None.

  • init_cfg (Optional[dict]) – The config to control the initialization. Defaults to None.
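
For orientation, here is a hedged configuration sketch in mmpretrain's config style. The component types and dimensions below are illustrative assumptions drawn from typical BLIP-style configs, not a verbatim copy of the official retrieval config:

```python
# Hypothetical config sketch; exact types and arguments are assumptions.
model = dict(
    type='BlipRetrieval',
    vision_backbone=dict(type='VisionTransformer', arch='b', img_size=384),
    text_backbone=dict(type='XBertEncoder'),        # BERT-style text encoder
    multimodal_backbone=dict(type='XBertEncoder'),  # cross-modal encoder
    vision_neck=dict(type='Linear', in_features=768, out_features=256),
    text_neck=dict(type='Linear', in_features=768, out_features=256),
    head=dict(type='ITCHead', embed_dim=256),            # contrastive head
    multimodal_head=dict(type='ITMHead', hidden_size=768),  # matching head
    momentum=0.995,
    temperature=0.07,
    fast_match=False,
    topk=256,
)
# A model instance could then be built via the registry:
# from mmpretrain.registry import MODELS
# blip = MODELS.build(model)
```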

compute_score_matrix_i2t(img_feats, img_embeds, text_feats, text_ids, text_atts)[source]

Compute the score matrix for image-to-text retrieval. Every image is compared against all the text features (see the sketch below).

Parameters:
  • img_feats (torch.Tensor) – The input image features tensor with shape (M, C), where M is the number of samples on a single GPU.

  • img_embeds (torch.Tensor) – The input image embeddings tensor with shape (M, C), where M is the number of samples on a single GPU.

  • text_feats (torch.Tensor) – The input text features tensor with shape (N, C), where N is the number of samples across all GPUs.

  • text_ids (torch.Tensor) – The input text token ids tensor with shape (N, C).

  • text_atts (torch.Tensor) – The input text attention mask tensor with shape (N, C).

Returns:

Score matrix of image-to-text retrieval.

Return type:

torch.Tensor
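
A minimal PyTorch sketch of the candidate-selection step this method performs. The function name and the -100 fill value are illustrative assumptions; when fast_match is disabled, the actual method additionally re-scores the selected candidates with the multimodal backbone and matching head:

```python
import torch

def score_matrix_i2t_sketch(img_feats, text_feats, topk=256):
    """Illustrative skeleton only, not the actual implementation.

    img_feats:  (M, C) image features on this GPU, assumed L2-normalized.
    text_feats: (N, C) text features gathered from all GPUs.
    """
    sim = img_feats @ text_feats.t()       # (M, N) similarity matrix
    score = torch.full_like(sim, -100.0)   # low default for non-candidates
    topk_sim, topk_idx = sim.topk(k=topk, dim=1)
    # The real model re-scores the top-k candidates with the multimodal
    # head; this sketch keeps the raw similarity instead.
    score.scatter_(1, topk_idx, topk_sim)
    return score
```

compute_score_matrix_t2i below is symmetric: each text is scored against all image features.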

compute_score_matrix_t2i(img_feats, img_embeds, text_feats, text_ids, text_atts)[source]

Compute the score matrix for text-to-image retrieval. Every text is compared against all the image features.

Parameters:
  • img_feats (torch.Tensor) – The input image features tensor with shape (M, C), where M is the number of samples on a single GPU.

  • img_embeds (torch.Tensor) – The input image embeddings tensor with shape (M, C), where M is the number of samples on a single GPU.

  • text_feats (torch.Tensor) – The input text features tensor with shape (N, C), where N is the number of samples across all GPUs.

  • text_ids (torch.Tensor) – The input text token ids tensor with shape (M, C).

  • text_atts (torch.Tensor) – The input text attention mask tensor with shape (M, C).

Returns:

Score matrix of text-to-image retrieval.

Return type:

torch.Tensor

extract_feat(images=None, data_samples=None, return_texts=True, return_embeds=None)[source]

Extract features from the input images and data samples.

Parameters:
  • images (torch.Tensor, optional) – The images from which to extract features. Defaults to None.

  • data_samples (list, optional) – The data samples containing the texts from which to extract features. Defaults to None.

  • return_texts (bool) – Whether to return the tokenized text and the corresponding attention masks. Defaults to True.

  • return_embeds (bool) – Whether to return the text embedding and image embedding. Defaults to None, which means to use self.fast_match.

Returns:

The output features.

If multimodal_backbone does not exist, a tuple of torch.Tensor will be returned.

Return type:

Tuple[torch.Tensor]
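
A hedged usage sketch of extract_feat, assuming `model` is an already-built BlipRetrieval instance and that each DataSample carries its raw caption in a `text` field (both are assumptions; shapes are illustrative):

```python
import torch
from mmpretrain.structures import DataSample

images = torch.rand(2, 3, 384, 384)                 # illustrative shape
data_samples = [DataSample(text='a dog running'),
                DataSample(text='a red car')]

# `model` is assumed to be a built BlipRetrieval instance.
feats = model.extract_feat(images=images, data_samples=data_samples)
# With return_texts=True (the default), tokenized text ids and attention
# masks are returned alongside the extracted features.
```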

forward(images=None, data_samples=None, mode='tensor')[source]

The unified entry for a forward process in both training and test. The method accepts two modes: “tensor” and “loss”:

  • “tensor”: Forward the whole network and return tensor without any post-processing, same as a common nn.Module.

  • “loss”: Forward and return a dict of losses according to the given inputs and data samples.

Note that this method handles neither back-propagation nor optimizer updates, which are done in train_step().

Regarding the unified “predict” mode used in other OpenMMLab repos: image-text retrieval cannot perform batch prediction, since each sample must be compared against all samples. A standard retrieval evaluation first extracts and collects all features, and then predicts over all samples. The “predict” mode here is therefore kept only as a trigger to remind users to choose the right configuration.

Parameters:
  • images (torch.Tensor) – The input images tensor of shape (N, C, …) in general.

  • data_samples (List[DataSample], optional) – The annotation data of every sample. It’s required if mode="loss". Defaults to None.

  • mode (str) – The kind of value to return. Defaults to ‘tensor’.

Returns:

The return type depends on mode.

  • If mode="tensor", return a tuple.

  • If mode="loss", return a dict of tensors.
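
A short sketch of the two supported modes, reusing the hypothetical model, images, and data_samples from the examples above:

```python
# "tensor" mode: raw features without any post-processing.
feats = model(images, data_samples, mode='tensor')

# "loss" mode: data_samples are required; returns a dict of loss tensors.
losses = model(images, data_samples, mode='loss')
```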

loss(images, data_samples=None)[source]

Calculate losses from a batch of inputs and data samples.

Parameters:
  • images (torch.Tensor) – The input images tensor of shape (N, C, …) in general. The corresponding texts are carried in data_samples.

  • data_samples (Optional[List[DataSample]]) – The annotation data of every sample. Defaults to None.

Returns:

A dictionary of loss components from both the head and the multimodal head.

Return type:

Dict[str, torch.Tensor]
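
In normal training the reduction of this dict is handled by train_step() via mmengine's loss parsing; a manual equivalent might look like the following sketch (the key-naming convention is an assumption):

```python
losses = model.loss(images, data_samples)
# mmengine conventionally sums entries whose key contains 'loss'.
total = sum(v for k, v in losses.items() if 'loss' in k)
total.backward()
```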