BlipCaption

class mmpretrain.models.multimodal.BlipCaption(vision_encoder, decoder_head, tokenizer=None, prompt='', max_txt_len=20, num_captions=1, data_preprocessor=None, init_cfg=None)

BLIP Caption.

Parameters:

vision_encoder (dict) – Encoder for extracting image features.
decoder_head (dict) – The decoder head module to forward and calculate loss from processed features.
tokenizer (Optional[dict]) – The config for the tokenizer. Defaults to None.
prompt (str) – Prompt used for training and eval. Defaults to ‘’.
max_txt_len (int) – Maximum length of the input text. Defaults to 20.
num_captions (int) – Number of captions to generate for each image. Defaults to 1.
data_preprocessor (Optional[dict]) – The config for preprocessing input data. If None, or if no type is specified, “MultiModalDataPreprocessor” is used as the type. See MultiModalDataPreprocessor for more details. Defaults to None.
init_cfg (Optional[dict]) – The config to control the initialization. Defaults to None.
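
The constructor arguments map directly onto a config dict. The following is a minimal sketch rather than a verified configuration, and it is not complete enough to build a model: the top-level keys come from the signature above, while the nested type values and their options (VisionTransformer, SeqGenerationHead, BlipTokenizer, bert-base-uncased, the prompt text) are assumptions to check against the installed mmpretrain version.

```python
# Sketch of a BlipCaption config. Top-level keys follow the documented
# constructor signature; nested values are ASSUMPTIONS for illustration.
model_cfg = dict(
    type='BlipCaption',
    # Assumed encoder type and options; substitute the encoder you use.
    vision_encoder=dict(type='VisionTransformer', arch='b', img_size=384),
    # Assumed head type; its own sub-config is omitted in this sketch.
    decoder_head=dict(type='SeqGenerationHead'),
    # Assumed tokenizer type and checkpoint name.
    tokenizer=dict(type='BlipTokenizer', name_or_path='bert-base-uncased'),
    prompt='a picture of ',   # hypothetical prompt; the documented default is ''
    max_txt_len=20,           # documented default
    num_captions=1,           # documented default
    data_preprocessor=None,   # None selects the default preprocessor type
    init_cfg=None,
)

# Every key other than 'type' should match a constructor parameter name.
constructor_args = sorted(k for k in model_cfg if k != 'type')
print(constructor_args)
```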

forward(images, data_samples=None, mode='loss')

The unified entry point for the forward pass in both training and testing. The method accepts two modes, “predict” and “loss”:

“predict”: Forward and return the predictions, which are fully processed to a list of DataSample.
“loss”: Forward and return a dict of losses according to the given inputs and data samples.

Note that this method handles neither back propagation nor optimizer updating; both are done in train_step().

Parameters:

images (torch.Tensor) – The pre-processed image tensor with shape (N, C, …).
data_samples (List[DataSample], optional) – Data samples with additional information. Defaults to None.
mode (str) – The kind of value to return, either “loss” or “predict”. Defaults to ‘loss’.

Returns:

The return type depends on mode.

If mode="loss", return a dict of tensors.
If mode="predict", return a list of DataSample.
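
The mode-based dispatch described above can be sketched with a toy stand-in class. This is an illustration under assumptions, not the mmpretrain source: CaptionModelSketch, its placeholder return values, and the choice of RuntimeError for an unknown mode are all hypothetical, and plain Python lists stand in for tensors and DataSample objects.

```python
class CaptionModelSketch:
    """Toy stand-in illustrating mode dispatch; not the mmpretrain class."""

    def loss(self, images, data_samples):
        # Placeholder: a real model would compute losses from the inputs.
        return {'loss': 0.0}

    def predict(self, images, data_samples=None, **kwargs):
        # Placeholder: a real model would return a list of DataSample objects.
        return [f'caption for image {i}' for i in range(len(images))]

    def forward(self, images, data_samples=None, mode='loss'):
        # Route the call to the method that matches the requested mode.
        if mode == 'loss':
            return self.loss(images, data_samples)
        if mode == 'predict':
            return self.predict(images, data_samples)
        # Assumed behaviour for unsupported modes in this sketch.
        raise RuntimeError(f'Invalid mode "{mode}".')


model = CaptionModelSketch()
images = ['img0', 'img1']  # stand-ins for an (N, C, ...) tensor with N = 2
loss_out = model.forward(images, data_samples=[None, None], mode='loss')
pred_out = model.forward(images, mode='predict')
```

Keeping a single forward entry point lets the same model object serve the training loop and the evaluation loop, with the caller choosing the behaviour through mode.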

loss(images, data_samples)

Calculate losses from a batch of images and data samples.

Parameters:

images (torch.Tensor) – The input images tensor with shape (N, C, …) in general.
data_samples (List[ImageTextDataSample]) – The annotation data of every sample.

Returns:

A dictionary of loss components.

Return type:

dict[str, Tensor]
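
A dictionary of loss components is typically reduced to a single scalar before back propagation. The following is a minimal sketch of that reduction, assuming the training loop sums every entry whose key contains 'loss' and treats other entries as logged metrics. The helper total_loss and the component names and values are hypothetical, and plain floats stand in for tensors.

```python
def total_loss(losses):
    """Sum every entry whose key contains 'loss' (assumed convention)."""
    return sum(value for key, value in losses.items() if 'loss' in key)


# Hypothetical component names and values; real values would be tensors.
example_losses = {'seq_gen_lm_loss': 2.5, 'accuracy': 0.75}
total = total_loss(example_losses)
print(total)
```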

predict(images, data_samples=None, **kwargs)

Predict captions from a batch of inputs.

Parameters:

images (torch.Tensor) – The input images tensor with shape (N, C, …) in general.
data_samples (List[DataSample], optional) – The annotation data of every sample. Defaults to None.
**kwargs – Other keyword arguments accepted by the predict method of the head.

Returns:

A list of data samples.

Return type:

List[DataSample]