🔹 1. What your table shows
| Data Type | Transformer Type / Adaptation |
|---|---|
| Text | GPT, T5, BERT |
| Image | ViT, ViT + generative decoder |
| Audio | SpeechT5, MusicLM |
| Video | TimeSformer, VideoMAE |
| Multimodal | CLIP, BLIP |
Observation:
Some of these models are purely generative, some are purely discriminative, and some can play either role depending on how they are used.
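To make that distinction concrete, here is a minimal sketch, assuming the Hugging Face `transformers` library is installed and the public `gpt2` and DistilBERT SST-2 checkpoints are reachable, that uses one transformer generatively and another discriminatively:

```python
from transformers import pipeline

# Generative use: GPT-2 continues the prompt token by token.
generator = pipeline("text-generation", model="gpt2")
print(generator("Transformers can", max_new_tokens=20)[0]["generated_text"])

# Discriminative use: a BERT-style encoder assigns a label instead of producing text.
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
print(classifier("Transformers are remarkably flexible."))
```

The same underlying architecture family can sit on either side of this line; what changes is the head attached to it and whether you decode tokens or read off a score.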
🔹 2. Which are generative?
| Data Type | Generative? | Notes |
|---|---|---|
| Text | ✅ Yes (GPT, T5 in generation mode) | GPT and T5 can generate text sequences; BERT on its own is mostly discriminative. |
| Image | ✅ Yes (ViT + generative decoder, diffusion) | ViT itself is a feature extractor; a generative decoder (e.g., a diffusion model) is needed to create images. |
| Audio | ✅ Yes (SpeechT5, MusicLM) | Can generate speech (TTS) and music. |
| Video | ✅ Yes (VideoMAE / TimeSformer + generative decoder) | TimeSformer and VideoMAE are primarily video-understanding models; generation requires a decoder on top of the transformer embeddings. |
| Multimodal | ✅ Yes (BLIP; CLIP in a generation pipeline) | BLIP can generate captions; CLIP itself is discriminative (image-text alignment) but is often paired with a generative decoder (text-to-image, VQGAN + CLIP); see the sketch below the table. |
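To illustrate the CLIP row, here is a minimal sketch, assuming the Hugging Face `transformers` library, `torch`, Pillow, and a reachable example image URL, that uses CLIP purely discriminatively: it scores image-text alignment and never generates anything. Pipelines like VQGAN + CLIP reuse this same score to steer a separate image decoder.

```python
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Example image (any local file works too) and candidate captions to score against it.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
captions = ["a photo of two cats", "a photo of a dog"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # image-text similarity scores
probs = logits.softmax(dim=-1)

# CLIP only ranks the captions; generation would require a separate decoder.
print(dict(zip(captions, probs[0].tolist())))
```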