vision_encoder (dict) – Encoder for extracting image features.
decoder_head (dict) – The decoder head module to forward and
calculate loss from processed features.
tokenizer (Optional[dict]) – The config for the tokenizer.
Defaults to None.
prompt (str) – Prompt used for training and evaluation.
Defaults to '' (an empty string).
max_txt_len (int) – Max text length of input text.
num_captions (int) – Number of captions to be generated for each image.
data_preprocessor (Optional[dict]) – The config for preprocessing input
data. If None, or if no type is specified, it will use
“MultiModalDataPreprocessor” as the type.
See MultiModalDataPreprocessor for more details.
Defaults to None.
init_cfg (Optional[dict]) – The config to control the initialization.
Defaults to None.
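
The parameters above can be pictured as a plain config dictionary. The following is a minimal sketch, not the library's own code: the `type` values `MyVisionEncoder` and `MyDecoderHead` are hypothetical placeholders, the values for `max_txt_len` and `num_captions` are illustrative only, and the helper `resolve_data_preprocessor` is an assumed stand-in that mimics the documented fallback for `data_preprocessor`. The default type name is spelled `MultiModalDataPreprocessor` here, which is an assumption about the intended spelling.

```python
from typing import Optional

# Assumed spelling of the default preprocessor type described above.
DEFAULT_PREPROCESSOR_TYPE = "MultiModalDataPreprocessor"


def resolve_data_preprocessor(cfg: Optional[dict]) -> dict:
    """Hypothetical helper: fill in the default type when cfg is None
    or has no "type" key, leaving any user-supplied type untouched."""
    resolved = dict(cfg or {})  # copy so the caller's dict is not mutated
    resolved.setdefault("type", DEFAULT_PREPROCESSOR_TYPE)
    return resolved


# Illustrative config; every "type" value below is a placeholder.
model_cfg = dict(
    vision_encoder=dict(type="MyVisionEncoder"),  # hypothetical encoder
    decoder_head=dict(type="MyDecoderHead"),      # hypothetical head
    tokenizer=None,          # documented default
    prompt="",               # documented default
    max_txt_len=20,          # illustrative value
    num_captions=1,          # illustrative value
    data_preprocessor=None,  # documented default
    init_cfg=None,           # documented default
)

# Apply the documented fallback for data_preprocessor.
model_cfg["data_preprocessor"] = resolve_data_preprocessor(
    model_cfg["data_preprocessor"]
)
```

Passing a dict that already carries a `type` key, such as `resolve_data_preprocessor(dict(type="Custom"))`, keeps that type in this sketch, while `None` or `{}` falls back to the default.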