Multimodal Data

  • Vision-Language Tasks: image-text retrieval, captioning, visual question answering (VQA)
  • Audio-Text Tasks: speech translation, audio captioning
  • Cross-Modal Retrieval and Alignment: text-to-image and image-to-text search
  • Multimodal Generation: text-to-image, text-to-video, image-conditioned text generation
  • Reasoning and Instruction Tasks: multimodal question answering, instruction following