The Hugging Face Transformers library absolutely supports models for images, audio/voice, and video — not just text.
Here’s how it works:
## ✅ Hugging Face Transformers Supports Multiple Data Types
## 1. Text (NLP)
Models: BERT, GPT-2, T5, LLaMA, Mistral
Classes: AutoModelForSequenceClassification, AutoModelForCausalLM, etc.
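Example (a minimal text-generation sketch using GPT-2 with AutoModelForCausalLM; any causal LM checkpoint works the same way):
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The Transformers library supports", return_tensors="pt")
# Greedy decoding of 20 new tokens; pad_token_id silences the GPT-2 padding warning
outputs = model.generate(**inputs, max_new_tokens=20, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```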
## 2. Images (Vision Transformers)
Models: ViT, DeiT, BEiT, Swin, ConvNeXT, DINO, MAE
Class: AutoModelForImageClassification
Example:
```python
from transformers import AutoImageProcessor, AutoModelForImageClassification
from PIL import Image
import torch

processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = AutoModelForImageClassification.from_pretrained("google/vit-base-patch16-224")

image = Image.open("cat.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)

# Highest-scoring ImageNet class index, mapped to its human-readable label
pred = outputs.logits.argmax(-1).item()
print(model.config.id2label[pred])
```
## 3. Audio / Speech
Models: Wav2Vec2, Whisper, HuBERT, SpeechT5
Classes: AutoModelForCTC (e.g. Wav2Vec2, HuBERT), AutoModelForSpeechSeq2Seq (e.g. Whisper)
Example:
```python
from transformers import AutoProcessor, AutoModelForCTC
import torch
import soundfile as sf

processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
model = AutoModelForCTC.from_pretrained("facebook/wav2vec2-base-960h")

speech, rate = sf.read("speech.wav")  # this checkpoint expects 16 kHz mono audio
inputs = processor(speech, sampling_rate=rate, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Greedy CTC decoding: pick the most likely token at each time step
pred_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(pred_ids)
print(transcription)
```
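Note: Whisper (listed above) is a sequence-to-sequence model, so it uses AutoModelForSpeechSeq2Seq rather than AutoModelForCTC. A minimal sketch, assuming the same 16 kHz "speech.wav" and the small openai/whisper-tiny checkpoint:
```python
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
import soundfile as sf

processor = AutoProcessor.from_pretrained("openai/whisper-tiny")
model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-tiny")

speech, rate = sf.read("speech.wav")  # Whisper's feature extractor expects 16 kHz audio
inputs = processor(speech, sampling_rate=rate, return_tensors="pt")

# Seq2seq models transcribe by generating token IDs, then decoding them to text
generated_ids = model.generate(input_features=inputs["input_features"])
print(processor.batch_decode(generated_ids, skip_special_tokens=True))
```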
## 4. Video
Models: TimeSformer, VideoMAE
Class: AutoModelForVideoClassification
Example:
```python
from transformers import AutoImageProcessor, AutoModelForVideoClassification
import av

# Checkpoint fine-tuned on Kinetics-400 (plain videomae-base has no trained classification head)
ckpt = "MCG-NJU/videomae-base-finetuned-kinetics"
processor = AutoImageProcessor.from_pretrained(ckpt)
model = AutoModelForVideoClassification.from_pretrained(ckpt)

# Decode the first 16 frames of a local clip (VideoMAE expects 16 frames per video)
container = av.open("video.mp4")
frames = []
for frame in container.decode(video=0):
    frames.append(frame.to_ndarray(format="rgb24"))
    if len(frames) == 16:
        break

inputs = processor(frames, return_tensors="pt")
outputs = model(**inputs)
print(model.config.id2label[outputs.logits.argmax(-1).item()])  # Kinetics-400 action label
```
## 5. Multimodal (Text + Image / Video)
Models: CLIP, BLIP, ViLT
Classes: model-specific classes such as CLIPModel or BlipForConditionalGeneration, depending on the task
Example (CLIP):
```python
from transformers import CLIPProcessor, CLIPModel
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dog.jpg").convert("RGB")
inputs = processor(text=["a dog", "a cat"], images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Image-text similarity scores; softmax turns them into a probability per caption
logits = outputs.logits_per_image
print(logits.softmax(dim=-1))
```
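Example (image captioning with BLIP, which is listed above; a short sketch assuming the same "dog.jpg" file and the Salesforce/blip-image-captioning-base checkpoint):
```python
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("dog.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

# Generate a caption for the image and decode it to a string
out = model.generate(**inputs, max_new_tokens=20)
print(processor.decode(out[0], skip_special_tokens=True))
```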
## 🔹 Conclusion
➡️ Yes, Hugging Face Transformers includes models for text, images, audio/voice, video, and multimodal tasks.
It started with NLP but has expanded into computer vision, speech, and multimodal AI.