Create huggingface_library_transformer.md

The Hugging Face Transformers library supports models for images, audio/voice, and video, not just text.

Here’s how it works:

## ✅ Hugging Face Transformers Supports Multiple Data Types

## 1. Text (NLP)

- Models: BERT, GPT-2, T5, LLaMA, Mistral
- Class: AutoModelForSequenceClassification, AutoModelForCausalLM, etc.
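
Example (a minimal sketch; the sentiment checkpoint and the sample sentence are just illustrative choices, any text checkpoint works the same way through the Auto classes):
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Illustrative checkpoint: a small, fine-tuned sentiment classifier
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

inputs = tokenizer("Transformers handles more than just text.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
pred = logits.argmax(-1).item()
print(model.config.id2label[pred])  # e.g. POSITIVE or NEGATIVE
```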

## 2. Images (Vision Transformers)

- Models: ViT, DeiT, BEiT, Swin, ConvNeXT, DINO, MAE
- Class: AutoModelForImageClassification

Example:
```python
from transformers import AutoImageProcessor, AutoModelForImageClassification
from PIL import Image
import torch

processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = AutoModelForImageClassification.from_pretrained("google/vit-base-patch16-224")

# Preprocess the image into the pixel values the model expects
image = Image.open("cat.jpg")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
pred = outputs.logits.argmax(-1).item()
print(model.config.id2label[pred])  # human-readable ImageNet class name
```

## 3. Audio / Speech

- Models: Wav2Vec2, Whisper, HuBERT, SpeechT5
- Class: AutoModelForCTC (e.g. Wav2Vec2, HuBERT) or AutoModelForSpeechSeq2Seq (e.g. Whisper)

Example (Wav2Vec2, a CTC model):
```python
from transformers import AutoProcessor, AutoModelForCTC
import torch
import soundfile as sf

processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
model = AutoModelForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# This checkpoint expects 16 kHz mono audio
speech, rate = sf.read("speech.wav")
inputs = processor(speech, sampling_rate=rate, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
pred_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(pred_ids)
print(transcription)
```
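
For a sequence-to-sequence model like Whisper, the transcription is generated rather than decoded from CTC logits. A minimal sketch, assuming the `openai/whisper-tiny` checkpoint and a 16 kHz `speech.wav` file as illustrative inputs:
```python
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
import soundfile as sf

processor = AutoProcessor.from_pretrained("openai/whisper-tiny")
model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-tiny")

# Whisper's feature extractor expects 16 kHz mono audio
speech, rate = sf.read("speech.wav")
input_features = processor(speech, sampling_rate=rate, return_tensors="pt").input_features

# Generate the transcription token IDs, then decode them to text
generated_ids = model.generate(input_features)
print(processor.batch_decode(generated_ids, skip_special_tokens=True))
```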

## 4. Video

- Models: TimeSformer, VideoMAE
- Class: AutoModelForVideoClassification

Example:
```python
from transformers import AutoImageProcessor, AutoModelForVideoClassification
import torch

# A VideoMAE checkpoint fine-tuned for classification (Kinetics-400 labels)
processor = AutoImageProcessor.from_pretrained("MCG-NJU/videomae-base-finetuned-kinetics")
model = AutoModelForVideoClassification.from_pretrained("MCG-NJU/videomae-base-finetuned-kinetics")

# Pseudo example: extract the video frames first (see the PyAV sketch below)
frames = [...]
inputs = processor(frames, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
print(outputs.logits.argmax(-1))
```
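
The frame-extraction step could look like this with PyAV; a minimal sketch assuming a local `video.mp4` and the 16 frames per clip that VideoMAE expects:
```python
import av
import numpy as np

# Decode the video and sample 16 evenly spaced RGB frames
container = av.open("video.mp4")
all_frames = [f.to_ndarray(format="rgb24") for f in container.decode(video=0)]
indices = np.linspace(0, len(all_frames) - 1, num=16).astype(int)
frames = [all_frames[i] for i in indices]
```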

## 5. Multimodal (Text + Image / Video)

- Models: CLIP, BLIP, ViLT
- Class: varies by task (e.g. CLIPModel, BlipForConditionalGeneration, ViltForQuestionAnswering)

Example (CLIP):
```python
from transformers import CLIPProcessor, CLIPModel
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Score the image against each candidate caption
image = Image.open("dog.jpg")
inputs = processor(text=["a dog", "a cat"], images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits = outputs.logits_per_image
print(logits.softmax(dim=-1))  # probabilities over the candidate captions
```
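
BLIP works similarly but generates text from the image; a minimal sketch (the captioning checkpoint and the `dog.jpg` path are illustrative):
```python
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Generate a short caption for the image
image = Image.open("dog.jpg")
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20)
print(processor.decode(out[0], skip_special_tokens=True))
```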

## 🔹 Conclusion

➡️ Yes, Hugging Face Transformers includes models for text, images, audio/voice, video, and multimodal tasks.
It started with NLP but has expanded into computer vision, speech, and multimodal AI.