## 🔹 1. What your table shows
| Data Type | Transformer Type / Adaptation |
| ---------- | ----------------------------- |
| Text | GPT, T5, BERT |
| Image | ViT, ViT + generative decoder |
| Audio | SpeechT5, MusicLM |
| Video | TimeSformer, VideoMAE |
| Multimodal | CLIP, BLIP |
Observation:
Some of these models are purely generative, some are discriminative, and some can be either, depending on how they are used (see the sketch below).
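To make "generative vs. discriminative depending on usage" concrete, here is a minimal sketch using Hugging Face `transformers` pipelines (assumed installed, along with a backend such as PyTorch; the checkpoint names are just common public examples, not a prescription):

```python
# Minimal sketch: the same transformer toolkit supports generative and discriminative usage.
# Assumes the Hugging Face `transformers` library and PyTorch are installed.
from transformers import pipeline

# Generative usage: GPT-2 autoregressively continues a prompt.
generator = pipeline("text-generation", model="gpt2")
print(generator("Transformers can", max_new_tokens=20)[0]["generated_text"])

# Discriminative usage: a fine-tuned BERT-style encoder classifies the input instead of generating.
classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
print(classifier("Transformers can generate and classify text."))
```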
## 🔹 2. Which are generative?
| Data Type | Generative? | Notes |
| ---------- | ------------------------------------------------------ | ---------------------------------------------------------------------------------------------------------- |
| Text | ✅ Yes (GPT, T5 in generation mode) | Can generate text sequences. BERT alone is mostly discriminative. |
| Image      | ✅ Yes (ViT + generative decoder, diffusion models)     | ViT itself is a **feature extractor**; a generative decoder (e.g., a diffusion or VQGAN decoder) is needed to create images. |
| Audio      | ✅ Yes (SpeechT5, MusicLM)                              | Can generate speech (TTS) and music. |
| Video      | ✅ Yes (VideoMAE/TimeSformer backbone + decoder)        | TimeSformer and VideoMAE are discriminative/self-supervised backbones; video generation requires a generative decoder on top of the transformer embeddings. |
| Multimodal | ✅ Yes (BLIP; CLIP in generation pipelines)             | CLIP itself is discriminative (image–text alignment) but is paired with a generative decoder in text-to-image pipelines (e.g., VQGAN + CLIP); BLIP adds a text decoder for captioning. |
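To make the CLIP row concrete: CLIP scores how well a caption aligns with an image (discriminative); it does not generate pixels or text itself. A minimal sketch, assuming `transformers`, `torch`, and `Pillow` are installed, with `cat.jpg` as a placeholder path for any local image:

```python
# Minimal sketch of CLIP's discriminative (alignment) role.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")  # placeholder image path
texts = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them into a
# ranking over the candidate captions -- alignment, not generation.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))
```

In a text-to-image pipeline such as VQGAN + CLIP, exactly this similarity score serves as the objective that steers a separate generative decoder toward images matching the prompt.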