ankitkushwaha90 committed · verified · Commit 2c1a634 · 1 Parent(s): 964a3d3

Create huggingface_library_transformer.md

Files changed (1): huggingface_library_transformer.md (+95, -0)

huggingface_library_transformer.md ADDED
The Hugging Face Transformers library supports models for images, audio/voice, and video, not just text.

Here’s how it works:

## ✅ Hugging Face Transformers Supports Multiple Data Types

## 1. Text (NLP)

Models: BERT, GPT-2, T5, LLaMA, Mistral
Class: AutoModelForSequenceClassification, AutoModelForCausalLM, etc.
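
Example (a minimal sketch; the gpt2 checkpoint, the prompt, and max_new_tokens are illustrative choices):
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load a small causal language model; "gpt2" is an illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Tokenize a prompt and generate a short continuation
inputs = tokenizer("Transformers can process", return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```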

## 2. Images (Vision Transformers)

Models: ViT, DeiT, BEiT, Swin, ConvNeXT, DINO, MAE
Class: AutoModelForImageClassification

Example:
```python
from transformers import AutoImageProcessor, AutoModelForImageClassification
from PIL import Image
import torch

processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = AutoModelForImageClassification.from_pretrained("google/vit-base-patch16-224")

# Preprocess a local image and classify it
image = Image.open("cat.jpg")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
pred = outputs.logits.argmax(-1).item()
print(model.config.id2label[pred])
```

## 3. Audio / Speech

Models: Wav2Vec2, Whisper, HuBERT, SpeechT5
Class: AutoModelForCTC (e.g. Wav2Vec2, HuBERT) or AutoModelForSpeechSeq2Seq (e.g. Whisper)

Example:
```python
from transformers import AutoProcessor, AutoModelForCTC
import torch
import soundfile as sf

processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
model = AutoModelForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Load a 16 kHz mono WAV file and run CTC speech recognition
speech, rate = sf.read("speech.wav")
inputs = processor(speech, sampling_rate=rate, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
pred_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(pred_ids)
print(transcription)
```
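
Whisper, also listed above, is a sequence-to-sequence model, so it is loaded through AutoModelForSpeechSeq2Seq and decoded with generate() instead of CTC. A minimal sketch (the openai/whisper-tiny checkpoint and the 16 kHz speech.wav file are illustrative assumptions):
```python
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
import soundfile as sf
import torch

processor = AutoProcessor.from_pretrained("openai/whisper-tiny")
model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-tiny")

# Whisper's feature extractor expects 16 kHz mono audio
speech, rate = sf.read("speech.wav")
inputs = processor(speech, sampling_rate=rate, return_tensors="pt")
with torch.no_grad():
    generated_ids = model.generate(inputs.input_features)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```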

## 4. Video

Models: TimeSformer, VideoMAE
Class: AutoModelForVideoClassification

Example:
```python
from transformers import AutoImageProcessor, AutoModelForVideoClassification
import av
import numpy as np
import torch

processor = AutoImageProcessor.from_pretrained("MCG-NJU/videomae-base")
model = AutoModelForVideoClassification.from_pretrained("MCG-NJU/videomae-base")

# Decode frames from a local video file (placeholder path) with PyAV
container = av.open("video.mp4")
frames = [f.to_ndarray(format="rgb24") for f in container.decode(video=0)]

# Sample 16 evenly spaced frames, the clip length VideoMAE expects
indices = np.linspace(0, len(frames) - 1, num=16).astype(int)
clip = [frames[i] for i in indices]

inputs = processor(clip, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.logits.argmax(-1))
```

## 5. Multimodal (Text + Image / Video)

Models: CLIP, BLIP, ViLT
Class: AutoModel or a task-specific class (e.g. CLIPModel), depending on the task

Example (CLIP):
```python
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Score an image against two text prompts
image = Image.open("dog.jpg")
inputs = processor(text=["a dog", "a cat"], images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)
# logits_per_image holds one image-text similarity score per prompt
logits = outputs.logits_per_image
print(logits.softmax(dim=-1))
```
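
In the CLIP example, logits_per_image contains one similarity score for the image against each text prompt, so the softmax turns those scores into probabilities that the picture shows "a dog" versus "a cat".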

## 🔹 Conclusion

➡️ Yes, Hugging Face Transformers includes models for text, images, audio/voice, video, and multimodal tasks.
It started with NLP but has expanded into computer vision, speech, and multimodal AI.