
🔹 1. What your table shows

| Data Type | Transformer Type / Adaptation |
|---|---|
| Text | GPT, T5, BERT |
| Image | ViT, ViT + generative decoder |
| Audio | SpeechT5, MusicLM |
| Video | TimeSformer, VideoMAE |
| Multimodal | CLIP, BLIP |

Observation:

Some of these models are used purely generatively (e.g. GPT), some discriminatively (e.g. BERT, CLIP), and some either way depending on how they are used (e.g. T5).
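
To make "depending on usage" concrete, here is a minimal sketch in plain Python. It uses a toy bigram model rather than a real transformer, and the vocabulary, the probabilities, and the function names are all made up for illustration. The same probability table is used in two ways: to score and rank existing sequences (the way a classifier would be used), and to sample a new sequence token by token (the generative use).

```python
import random

# Toy bigram "language model": next-token probabilities given the current token.
# Vocabulary and probabilities are invented for illustration only.
BIGRAMS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.3, "dog": 0.7},
    "cat": {"sleeps": 0.8, "</s>": 0.2},
    "dog": {"barks": 0.9, "</s>": 0.1},
    "sleeps": {"</s>": 1.0},
    "barks": {"</s>": 1.0},
}


def sequence_probability(tokens):
    """Scoring use: probability the model assigns to an existing sequence."""
    prob = 1.0
    prev = "<s>"
    for tok in list(tokens) + ["</s>"]:
        prob *= BIGRAMS.get(prev, {}).get(tok, 0.0)
        prev = tok
    return prob


def most_likely(candidates):
    """Ranking use: pick the candidate sequence the model scores highest."""
    return max(candidates, key=sequence_probability)


def generate(rng, max_len=10):
    """Generative use: sample one token at a time until end-of-sequence."""
    prev, out = "<s>", []
    for _ in range(max_len):
        choices = BIGRAMS[prev]
        tok = rng.choices(list(choices), weights=list(choices.values()))[0]
        if tok == "</s>":
            break
        out.append(tok)
        prev = tok
    return out


if __name__ == "__main__":
    print(most_likely([["the", "cat", "sleeps"], ["a", "dog", "barks"]]))
    print(generate(random.Random(0)))
```

A real transformer replaces the lookup table with a learned network, but the split is the same: scoring consumes a sequence that already exists, while generation runs a decoding loop that produces one.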

🔹 2. Which are generative?

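| Data Type | Generative? | Notes |
|---|---|---|
| Text | ✅ Yes (GPT; T5 in generation mode) | GPT and T5 can generate text sequences. BERT is encoder-only and mostly used discriminatively. |
| Image | ✅ Yes (ViT + decoder, diffusion) | ViT itself is a feature extractor; a generative decoder is needed to create images. |
| Audio | ✅ Yes (SpeechT5, MusicLM) | SpeechT5 can generate speech (text-to-speech); MusicLM can generate music. |
| Video | ✅ Yes, with an added generative decoder (VideoMAE or TimeSformer embeddings + decoder) | TimeSformer and VideoMAE are primarily video-understanding models: TimeSformer targets classification, and VideoMAE's pre-training decoder only reconstructs masked patches. Video generation requires a generative decoder after the transformer embeddings. |
| Multimodal | ✅ Yes (BLIP; CLIP used in a generation pipeline) | BLIP can generate captions. CLIP itself is discriminative (image-text alignment), but it is used together with a generative decoder for text-to-image, e.g. VQGAN + CLIP. |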
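
As a rough illustration of how a discriminative alignment model can sit inside a generation pipeline, the sketch below reranks candidate outputs by their similarity to a text prompt. The embeddings are hand-made stand-ins, not real CLIP outputs, and the candidate names are hypothetical. Reranking is also a simpler scheme than the gradient-based guidance used in VQGAN + CLIP, so treat this as an assumption-laden toy rather than the actual method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Hypothetical embeddings standing in for a text encoder and an image encoder.
text_embedding = [1.0, 0.0, 0.5]  # e.g. the prompt "a red apple"
candidate_images = {
    "candidate_a": [0.9, 0.1, 0.4],
    "candidate_b": [0.0, 1.0, 0.0],
    "candidate_c": [0.5, 0.5, 0.5],
}


def rerank(text_vec, candidates):
    """Order candidate names from best to worst match with the text."""
    scored = [(cosine(text_vec, vec), name) for name, vec in candidates.items()]
    return [name for _, name in sorted(scored, reverse=True)]


if __name__ == "__main__":
    print(rerank(text_embedding, candidate_images))
```

The alignment model never creates an image here; it only scores what a separate generator produced, which is why CLIP counts as discriminative even when it appears in a text-to-image system.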