Flamingo¶
- class mmpretrain.models.multimodal.Flamingo(vision_encoder, lang_encoder, tokenizer, task='caption', zeroshot_prompt='<image>Output:', shot_prompt_tmpl='<image>Output:{caption}<|endofchunk|>', final_prompt_tmpl='<image>Output:', generation_cfg={}, data_preprocessor=None, init_cfg=None)[source]¶
The Open Flamingo model for multiple tasks.
- Parameters:
vision_encoder (dict) – The config of the vision encoder.
lang_encoder (dict) – The config of the language encoder.
tokenizer (dict) – The tokenizer to encode the text.
task (str) – The task to perform prediction for. Defaults to 'caption'.
zeroshot_prompt (str) – Prompt used for zero-shot inference. Defaults to ‘<image>Output:’.
shot_prompt_tmpl (str) – Prompt template used for few-shot inference. Defaults to '<image>Output:{caption}<|endofchunk|>'.
final_prompt_tmpl (str) – Final part of the prompt used for inference. Defaults to '<image>Output:'.
generation_cfg (dict) – The extra generation config, which accepts the keyword arguments of transformers.GenerationConfig. Defaults to an empty dict.
data_preprocessor (Optional[dict]) – The config for preprocessing input data. If None or no type is specified, it will use "MultiModalDataPreprocessor" as the type. See MultiModalDataPreprocessor for more details. Defaults to None.
init_cfg (dict, optional) – The initialization config. Defaults to None.
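The three prompt arguments control how in-context examples are assembled before generation. Below is a minimal sketch, assuming the few-shot prompt is built by concatenating one filled shot_prompt_tmpl per in-context example followed by final_prompt_tmpl; only the default template strings come from the signature above, while the composition logic and captions are illustrative.

```python
# Default templates from the signature above.
shot_prompt_tmpl = '<image>Output:{caption}<|endofchunk|>'
final_prompt_tmpl = '<image>Output:'

# Two in-context examples (captions are made up), then the query image slot.
shots = ['a cat sitting on a sofa', 'a dog running on grass']
few_shot_prompt = ''.join(
    shot_prompt_tmpl.format(caption=caption) for caption in shots
) + final_prompt_tmpl
print(few_shot_prompt)
# <image>Output:a cat sitting on a sofa<|endofchunk|><image>Output:a dog running on grass<|endofchunk|><image>Output:
```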
- extract_vision_feats(images)[source]¶
Extract vision features.
- Parameters:
images (torch.Tensor) – For zero-shot inference, the input image tensor has shape (B, C, H, W); for few-shot inference, it generally has shape (B, T_img, C, H, W). Images in the same chunk are collated along T_img. Video data is not supported yet.
- Returns:
The extracted vision features.
- Return type:
torch.Tensor
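The expected image layouts can be illustrated with plain tensors. The shapes below follow the description above; the batch size, number of shots, and the 224×224 resolution are arbitrary assumptions, and the commented call assumes an already built model instance.

```python
import torch

# Zero-shot: one image per sample, shape (B, C, H, W).
zero_shot_images = torch.rand(2, 3, 224, 224)

# Few-shot: T_img images per sample collated along dim 1, shape (B, T_img, C, H, W).
# Here: 2 samples with 3 in-context images each.
few_shot_images = torch.rand(2, 3, 3, 224, 224)

# feats = model.extract_vision_feats(few_shot_images)  # hypothetical call, `model` built elsewhere
```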
- forward(images, data_samples=None, mode='loss')[source]¶
The unified entry for a forward process in both training and testing. The method accepts only one mode, "loss":
“loss”: Forward and return a dict of losses according to the given inputs and data samples.
Note that this method doesn't handle back-propagation or optimizer updating, which are done in train_step().
- Parameters:
images (torch.Tensor) – The input image tensor, whose ndim varies according to the inputs.
data_samples (List[DataSample], optional) – The annotation data of every sample. It is required when mode="loss". Defaults to None.
mode (str) – Return what kind of value. Defaults to 'loss'.
- Returns:
The return type depends on mode.
- If mode="loss", return a dict of tensors.
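A minimal sketch of a training-mode call, assuming model is an already built Flamingo instance; the gt_caption field name used to attach the ground-truth caption is an assumption for the caption task, not confirmed by this page.

```python
import torch
from mmpretrain.structures import DataSample

images = torch.rand(2, 3, 224, 224)  # zero-shot layout (B, C, H, W)

data_samples = []
for caption in ['a cat sitting on a sofa', 'a dog running on grass']:
    sample = DataSample()
    sample.set_field(caption, 'gt_caption')  # field name is an assumption
    data_samples.append(sample)

# losses = model(images, data_samples, mode='loss')  # returns a dict of loss tensors
```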
- post_process(outputs, data_samples)[source]¶
Perform post-processing on the outputs for different tasks.
- Parameters:
outputs (torch.Tensor) – The generated outputs.
data_samples (List[DataSample], optional) – The annotation data of every sample.
- Returns:
Return a list of data samples.
- Return type:
List[DataSample]
- predict(images, data_samples=None, **generation_cfg)[source]¶
Predict generation results from a batch of inputs.
- Parameters:
images (torch.Tensor) – For zero-shot inference, the input image tensor has shape (B, C, H, W); for few-shot inference, it generally has shape (B, T_img, C, H, W). Images in the same chunk are collated along T_img. Video data is not supported yet.
data_samples (List[DataSample], optional) – The annotation data of every sample. Defaults to None.
**generation_cfg – Other keyword arguments accepted by the generate method of the lang_encoder.
- Returns:
Return a list of data samples.
- Return type:
List[DataSample]
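A usage sketch for predict, assuming a built model in eval mode and the zero-shot image layout; max_new_tokens and num_beams are standard transformers.GenerationConfig keywords forwarded through **generation_cfg, not Flamingo-specific options.

```python
import torch

images = torch.rand(2, 3, 224, 224)  # (B, C, H, W)

# Assumes `model` is a built Flamingo instance (construction omitted).
# results = model.predict(images, max_new_tokens=20, num_beams=3)
# for sample in results:   # one DataSample per input image
#     print(sample)
```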
- preprocess_text(data_samples, device)[source]¶
Preprocess text in advance, before it is fed into the language model.
- Parameters:
data_samples (List[DataSample]) – The annotation data of every sample.
device (torch.device) – The device to put the text on.
- Returns:
Return a list of data samples.
- Return type:
List[DataSample]
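A sketch of calling preprocess_text directly, following the signature above; the data samples here are empty placeholders, whereas in practice they would carry the captions and shots used to build the prompt.

```python
import torch
from mmpretrain.structures import DataSample

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
data_samples = [DataSample(), DataSample()]  # placeholders for two inputs

# Assumes `model` is a built Flamingo instance (construction omitted).
# processed = model.preprocess_text(data_samples, device)
```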