Update README.md

README.md

---
library_name: transformers
license: mit
datasets:
- ajaykarthick/imdb-movie-reviews
language:
- en
metrics:
- accuracy
- f1
- recall
- precision
base_model:
- distilbert/distilbert-base-uncased-finetuned-sst-2-english
---
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load the model and tokenizer from the Hugging Face Hub
model_name = "DeepAxion/distilbert-imdb-sentiment"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Put the model in eval mode
model.eval()

# Example inference
text = "This movie totally blew me away, absolutely brilliant acting and a fantastic plot!"

inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Disable gradient computation for inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    probabilities = torch.softmax(logits, dim=-1)
    prediction = torch.argmax(probabilities, dim=-1).item()

sentiment_labels = {0: "Negative", 1: "Positive"}

print(f"Input Text: \"{text}\"")
print(f"Predicted Sentiment: {sentiment_labels[prediction]}")
```
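
For quick experiments, the same checkpoint can also be called through the `pipeline` API. This is a minimal sketch, assuming the hosted config's `id2label` mapping reflects the Negative/Positive labels used above:

```python
from transformers import pipeline

# Minimal sketch: a text-classification pipeline around the same checkpoint.
# Assumption: the checkpoint's id2label mapping matches the Negative/Positive labels above.
classifier = pipeline("text-classification", model="DeepAxion/distilbert-imdb-sentiment")

result = classifier("This movie totally blew me away!")[0]
print(result["label"], round(result["score"], 3))
```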

The model was fine-tuned on the IMDb Large Movie Review Dataset.

Dataset Card: https://huggingface.co/datasets/ajaykarthick/imdb-movie-reviews (or the official IMDb dataset link if different)

### Preprocessing
Text was tokenized using the DistilBertTokenizerFast associated with the base model. Input sequences were truncated to a maximum length of 512 tokens and padded to the longest sequence in the batch. Labels were mapped to 0 for negative and 1 for positive.
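
A rough sketch of this preprocessing is shown below; the dataset column names (`review`, `label`) and the use of `datasets.map` with dynamic padding are assumptions, not the exact training script.

```python
from datasets import load_dataset
from transformers import DataCollatorWithPadding, DistilBertTokenizerFast

# Illustrative preprocessing sketch. Assumption: the dataset exposes a text column
# named "review" and an integer "label" column (0 = negative, 1 = positive).
dataset = load_dataset("ajaykarthick/imdb-movie-reviews")
tokenizer = DistilBertTokenizerFast.from_pretrained(
    "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
)

def tokenize(batch):
    # Truncate to at most 512 tokens; padding to the longest sequence in each
    # batch is handled later by the data collator.
    return tokenizer(batch["review"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)
data_collator = DataCollatorWithPadding(tokenizer)
```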

### Training Hyperparameters
- Training regime: Mixed precision (fp16) was likely used for faster training and a reduced memory footprint. (Confirm this if you know your specific training setup.)
- Optimizer: AdamW
- Framework: PyTorch
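
For orientation, a Hugging Face `Trainer` setup consistent with these bullets might look like the sketch below. The learning rate, batch size, and epoch count are illustrative placeholders rather than the values actually used for this checkpoint; `tokenized` and `data_collator` refer to the preprocessing sketch above, and the `train` split name is an assumption.

```python
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

# Illustrative sketch only: Trainer uses AdamW by default, and fp16=True enables
# mixed-precision training. Learning rate, batch size, and epochs are placeholders.
base_model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
)

training_args = TrainingArguments(
    output_dir="distilbert-imdb-sentiment",
    fp16=True,                        # mixed-precision training
    learning_rate=2e-5,               # placeholder
    per_device_train_batch_size=16,   # placeholder
    num_train_epochs=2,               # placeholder
)

trainer = Trainer(
    model=base_model,
    args=training_args,
    train_dataset=tokenized["train"],  # tokenized split from the preprocessing sketch
    data_collator=data_collator,
)
# trainer.train()
```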

### Speeds, Sizes, Times
Training Time: [E.g., Approximately 1-2 hours on a single Colab T4 GPU] (Estimate based on your experience)

Model Size: The model.safetensors file is approximately 255 MB.

## Metrics
The primary evaluation metrics used were:

- Accuracy: The proportion of correctly classified samples.
- F1-Score (weighted/macro): A measure combining precision and recall, useful for a balanced assessment.
- Recall: The proportion of actual positive/negative samples that were correctly identified.
- Precision: The proportion of samples classified as positive/negative that were actually positive/negative.
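
As a quick sketch of how such numbers can be computed from held-out predictions (the `weighted` averaging mode and the example label arrays below are assumptions):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Illustrative only: y_true / y_pred stand in for test-set labels and model
# predictions (0 = negative, 1 = positive).
y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted"  # averaging mode is an assumption
)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```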

### Results
- Accuracy: 94%
- Recall: 94%
- Precision: 94%
- F1: 93%

## Summary
The fine-tuned DistilBERT model demonstrates strong performance on the IMDb sentiment classification task, achieving high accuracy, F1-score, and recall on the test set.
|