Blip2Caption¶
- class mmpretrain.models.multimodal.Blip2Caption(vision_backbone, text_backbone, multimodal_backbone, vision_neck, tokenizer=None, prompt='', max_txt_len=20, num_captions=1, data_preprocessor=None, init_cfg=None)[source]¶
BLIP2 Caption.
Module for the BLIP-2 image captioning task.
- Parameters:
vision_backbone (dict) – The config dict for vision backbone.
text_backbone (dict) – The config dict for text backbone.
multimodal_backbone (dict) – The config dict for multimodal backbone.
vision_neck (dict) – The config dict for vision neck.
tokenizer (Optional[dict]) – The config for tokenizer. Defaults to None.
prompt (str) – Prompt used for training and eval. Defaults to ‘’.
max_txt_len (int) – Max text length of the input text. Defaults to 20.
num_captions (int) – Number of captions to generate for each image. Defaults to 1.
data_preprocessor (Optional[dict]) – The config for preprocessing input data. If None or the type is not specified, it will use “MultiModalDataPreprocessor” as the type. See MultiModalDataPreprocessor for more details. Defaults to None.
init_cfg (Optional[dict]) – The config to control the initialization. Defaults to None.
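A minimal construction sketch. The sub-module configs (vision backbone, text backbone, multimodal backbone, neck, tokenizer) are intricate, so rather than hand-writing the dicts it is usually easier to start from a shipped config through the high-level API; exact model names vary between releases, hence the lookup via list_models:

```python
from mmpretrain import get_model, list_models

# Model names differ between mmpretrain releases, so look them up
# instead of hard-coding one; the pattern below is only an assumption.
blip2_models = list_models(pattern='blip2')
print(blip2_models)

# Build one variant; pretrained=False skips the weight download.
model = get_model(blip2_models[0], pretrained=False)
```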
- forward(images, data_samples=None, mode='loss')[source]¶
The unified entry for a forward process in both training and test. The method accepts two modes: “predict” and “loss”:
- “predict”: Forward and return the predictions, which are fully processed to a list of DataSample.
- “loss”: Forward and return a dict of losses according to the given inputs and data samples.
Note that this method handles neither back-propagation nor optimizer updating; those are done in train_step().
- Parameters:
images (torch.Tensor) – The pre-processed image tensor with shape (N, C, …).
data_samples (List[DataSample], optional) – The annotation data of every sample. Defaults to None.
mode (str) – Return what kind of value. Defaults to ‘loss’.
- Returns:
The return type depends on mode.
- If mode="loss", return a dict of tensors.
- If mode="predict", return a list of DataSample.
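A short sketch of both modes, assuming model is a Blip2Caption instance (e.g. built as in the sketch above) and that ground-truth captions live in the gt_caption field of DataSample (an assumption to verify against your dataset annotations):

```python
import torch
from mmpretrain.structures import DataSample

# Assumes `model` is a built Blip2Caption (see the construction sketch
# above) and that captions are stored in the `gt_caption` field.
images = torch.rand(2, 3, 224, 224)  # pre-processed batch, (N, C, H, W)
samples = [DataSample(gt_caption='a cat sleeping on a sofa'),
           DataSample(gt_caption='a red bus on the street')]

# mode='loss': returns Dict[str, torch.Tensor], used during training.
losses = model(images, data_samples=samples, mode='loss')

# mode='predict': returns List[DataSample] with generated captions attached.
with torch.no_grad():
    preds = model(images, mode='predict')
```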
- loss(images, data_samples=None, **kwargs)[source]¶
The forward function in training.
- Parameters:
images (torch.Tensor) – The input tensor with shape (N, C, …) in general.
data_samples (List[DataSample], optional) – The annotation data of every sample. Defaults to None.
**kwargs – Other keyword arguments accepted by the loss method of the head.
- Returns:
A dictionary of loss components.
- Return type:
Dict[str, torch.Tensor]
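Because loss() only returns the loss dict, back-propagation is the caller's job (normally handled by MMEngine's train_step() with an optimizer wrapper). A hand-rolled sketch, reusing model, images and samples from the forward() sketch above and using a naive reduction over the dict:

```python
import torch

# Illustrative manual training step; real training should go through
# MMEngine's train_step() and an OptimWrapper instead.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

losses = model.loss(images, data_samples=samples)    # Dict[str, torch.Tensor]
total_loss = sum(v.mean() for v in losses.values())  # naive loss reduction

optimizer.zero_grad()
total_loss.backward()
optimizer.step()
```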
- predict(images, data_samples=None, **kwargs)[source]¶
Predict captions from a batch of inputs.
- Parameters:
images (torch.Tensor) – The input tensor with shape (N, C, …) in general.
data_samples (List[DataSample], optional) – The annotation data of every sample. Defaults to None.
**kwargs – Other keyword arguments accepted by the predict method of the head.
- Returns:
A list of data samples containing the generated captions.
- Return type:
List[DataSample]
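For end-to-end captioning from an image file, the high-level ImageCaptionInferencer wraps pre-processing and predict(). A sketch; the model name here is an assumption, pick a real one from list_models(pattern='blip2'):

```python
from mmpretrain import ImageCaptionInferencer

# Model name is an assumption — confirm with list_models(pattern='blip2').
inferencer = ImageCaptionInferencer('blip2-opt2.7b_3rdparty-zeroshot_caption')
result = inferencer('path/to/image.jpg')[0]  # one result dict per input image
print(result['pred_caption'])                # the generated caption string
```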