Multimodal Data
- Vision-Language Tasks: image-text retrieval, captioning, visual question answering (VQA)
- Audio-Text Tasks: speech translation, audio captioning
- Cross-Modal Retrieval and Alignment: text-to-image and image-to-text search
- Multimodal Generation: text-to-image, text-to-video, image-conditioned text generation
- Reasoning and Instruction Tasks: multimodal question answering, instruction following