BlipGrounding¶
- class mmpretrain.models.multimodal.BlipGrounding(tokenizer=None, visual_encoder=None, text_encoder=None, multimodal_encoder=None, head=None, data_preprocessor=None, init_cfg=None)[source]¶
BLIP Grounding.
- Parameters:
visual_encoder (dict) – Backbone for extracting image features.
text_encoder (dict) – Backbone for extracting text features. The text extraction step is integrated into the tokenizer in datasets/transform/, so a separate text backbone is not needed.
multimodal_encoder (Optional[dict]) – Backbone for extracting multi-modal features. We apply this part as the fusion module.
neck (Optional[dict]) – The neck module to process features from backbone. Defaults to None.
head (Optional[Union[List[dict], dict]]) – The head module to calculate loss from processed features. See mmpretrain.models.heads. Notice that if the head is not set, the loss method cannot be used. Defaults to None.
data_preprocessor (Optional[dict]) – The config for preprocessing input data. If None or no specified type, it will use “MultiModalDataPreprocessor” as type. See MultiModalDataPreprocessor for more details. Defaults to None.
init_cfg (Optional[dict]) – The config to control the initialization. Defaults to None.
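A minimal construction sketch follows, assuming the model is built through the MODELS registry from a config file; the config path shown is hypothetical and should be replaced with the BLIP grounding config shipped with your mmpretrain version.

```python
from mmengine.config import Config
from mmpretrain.registry import MODELS

# Hypothetical config path -- substitute the actual BLIP grounding config
# from your mmpretrain checkout.
cfg = Config.fromfile('configs/blip/blip-base_8xb16_refcoco.py')
model = MODELS.build(cfg.model)  # builds the encoders, fusion module and head
```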
- extract_feat(images)[source]¶
Extract features from the input tensor with shape (N, C, …).
- Parameters:
inputs (Tensor) – A batch of inputs. The shape of it should be (num_samples, num_channels, *img_shape).
- Returns:
The output features.
- Return type:
image_embeds (Tensor)
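A short usage sketch, assuming model was built as above; the 384×384 input resolution is an assumption and should match whatever the visual encoder config expects.

```python
import torch

model.eval()
# (num_samples, num_channels, *img_shape); 384x384 is an assumed input size.
images = torch.randn(2, 3, 384, 384)
with torch.no_grad():
    image_embeds = model.extract_feat(images)
# For a ViT visual encoder, image_embeds is typically (N, num_tokens, embed_dim).
print(image_embeds.shape)
```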
- forward(images, data_samples=None, mode='loss')[source]¶
The unified entry for a forward process in both training and testing. The method accepts only one mode, “loss”:
“loss”: Forward and return a dict of losses according to the given inputs and data samples.
Note that this method doesn’t handle back propagation or optimizer updating; those are done in train_step().
- Parameters:
inputs (torch.Tensor, tuple) – The input tensor with shape (N, C, …) in general.
data_samples (List[VQADataSample], optional) – The annotation data of every sample. It’s required if mode="loss". Defaults to None.
mode (str) – Return what kind of value. Defaults to ‘loss’.
- Returns:
The return type depends on mode.
- If mode="loss", return a dict of tensors.
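A training-style sketch of the “loss” mode, assuming a dataloader built from the same config; the data preprocessor call follows the standard mmengine convention of returning a dict with inputs and data_samples.

```python
# `train_dataloader` is assumed to come from the same config's dataset pipeline.
batch = next(iter(train_dataloader))
batch = model.data_preprocessor(batch, training=True)
losses = model(batch['inputs'], data_samples=batch['data_samples'], mode='loss')
# `losses` is a dict of tensors keyed by loss name.
```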
- loss(images, data_samples=None)[source]¶
Generate the training loss from the input tensor and data_samples.
- Parameters:
inputs (Tensor) – A batch of inputs. The shape of it should be (num_samples, num_channels, *img_shape).
data_samples (List[VQADataSample], optional) – The annotation data of every sample. Defaults to None.
- Returns:
A dict of loss components.
- Return type:
Dict[str, torch.Tensor]
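In the standard mmpretrain pattern, forward(mode="loss") delegates to loss(); a sketch of calling it directly and aggregating the returned dict, assuming the batch was prepared as in the previous sketch:

```python
losses = model.loss(batch['inputs'], data_samples=batch['data_samples'])
# Simple aggregation sketch: sum the entries that are actual loss terms
# before back-propagating (the runner normally does this via train_step()).
total_loss = sum(v for k, v in losses.items() if 'loss' in k)
total_loss.backward()
```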