## 🔹 1. What your table shows

| Data Type  | Example Transformer Models / Adaptations |
| ---------- | ----------------------------- |
| Text       | GPT, T5, BERT                 |
| Image      | ViT, ViT + generative decoder |
| Audio      | SpeechT5, MusicLM             |
| Video      | TimeSformer, VideoMAE         |
| Multimodal | CLIP, BLIP                    |

Observation:

Some of these models are purely generative, some are purely discriminative, and some can serve either role depending on how they are used. The short sketch below contrasts the two roles for text models.
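
A minimal sketch, assuming the Hugging Face `transformers` library with its `pipeline` API and the public `gpt2` and `bert-base-uncased` checkpoints (not something specific to this table), contrasting a generative text model with a discriminative one:

```python
# Minimal sketch: generative vs. discriminative use of text transformers.
# Assumes: pip install transformers torch
from transformers import pipeline

# GPT-2 is an autoregressive decoder: it generates new tokens from a prompt.
generator = pipeline("text-generation", model="gpt2")
print(generator("Transformers can", max_new_tokens=20)[0]["generated_text"])

# BERT is an encoder: here it only *scores* candidates for a masked token,
# a discriminative use rather than free-running generation.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("Transformers are widely used for [MASK] tasks.")[0]["token_str"])
```

The same GPT-style model can also be used discriminatively (e.g. by reading off token log-likelihoods), which is exactly the "either role depending on usage" case above.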

## 🔹 2. Which are generative?
| Data Type  | Generative?                                  | Notes                                                                                                                                                            |
| ---------- | -------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Text       | ✅ Yes (GPT, T5 in generation mode)           | Can generate text sequences. BERT on its own is mostly discriminative.                                                                                           |
| Image      | ✅ Yes (ViT + decoder, diffusion models)      | ViT by itself is a **feature extractor**; a generative decoder is needed to produce images.                                                                      |
| Audio      | ✅ Yes (SpeechT5, MusicLM)                    | Can generate speech (TTS) and music.                                                                                                                             |
| Video      | ✅ Yes (VideoMAE / TimeSformer + decoder)     | These transformers are primarily encoders; video generation requires a decoder on top of their embeddings.                                                      |
| Multimodal | ✅ Yes (BLIP; CLIP in a generation pipeline)  | CLIP itself is discriminative (image-text alignment) but is often paired with a generative decoder, e.g. VQGAN + CLIP for text-to-image (see the sketch below). |
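
To make the CLIP point concrete, here is a minimal sketch, assuming `transformers`, `torch`, and `Pillow` are installed and using the public `openai/clip-vit-base-patch32` checkpoint (the image path `cat.jpg` is a placeholder). It shows CLIP used discriminatively, i.e. scoring image-text alignment rather than generating anything:

```python
# Minimal sketch: CLIP as a discriminative (alignment) model.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")  # placeholder path, supply your own image
texts = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Higher logit = stronger image-text alignment; no pixels or tokens are generated.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))
```

In a pipeline such as VQGAN + CLIP, these same alignment scores are used as a guidance signal to steer a separate generative model; CLIP itself never produces the image.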