---
# For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
# Doc / guide: https://huggingface.co/docs/hub/model-cards
{}
---
# Model Card for RandomForest
<!-- Provide a quick summary of what the model is/does. -->
A scikit-learn Random Forest classifier trained on TF-IDF features of news headlines. This card is based on [this raw template](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/modelcard_template.md?plain=1).
## Model Details
This model classifies a news headline as coming from either NBC News or Fox News.
### Model Description
<!-- Provide a longer summary of what this model is. -->
- **Developed by:** Jack Bader, Kaiyuan Wang, Pairan Xu
- **Task:** Binary classification (NBC News vs. Fox News)
- **Preprocessing:** TF-IDF vectorization applied to the text data
  - `stop_words="english"`
  - `max_features=1000`
- **Model type:** Random Forest
- **Framework:** scikit-learn
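
The preprocessing and model settings listed above can be sketched as a small training script. This is a minimal illustration, not the authors' training code: the toy headlines, the label names, and the `n_estimators` and `random_state` values are assumptions. Only TF-IDF with `stop_words="english"` and `max_features=1000`, followed by a Random Forest, comes from this card.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data; the real training set is not part of this card.
headlines = [
    "Senate passes budget bill after late-night vote",
    "Local team wins championship in overtime thriller",
    "Storm brings heavy rain to the coast",
    "New study links sleep to heart health",
]
labels = ["NBC", "FoxNews", "NBC", "FoxNews"]  # assumed label names

# Settings taken from the card: English stop words, at most 1000 features.
vectorizer = TfidfVectorizer(stop_words="english", max_features=1000)
X = vectorizer.fit_transform(headlines)

# n_estimators and random_state are illustrative assumptions.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, labels)

# New headlines must go through the same fitted vectorizer.
new_X = vectorizer.transform(["Senate vote expected on new budget"])
prediction = clf.predict(new_X)
print(prediction)
```

The vectorizer is fitted once and reused at prediction time, which is why the fitted vectorizer has to be saved alongside the model.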
#### Metrics
<!-- These are the evaluation metrics being used, ideally with a description of why. -->
- Accuracy Score
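
Accuracy is the fraction of headlines whose predicted source matches the true source. A minimal sketch with scikit-learn's `accuracy_score` follows; the labels are made up for illustration and are not results from this model.

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels for illustration only; not results from this model.
y_true = ["NBC", "FoxNews", "NBC", "FoxNews", "NBC"]
y_pred = ["NBC", "FoxNews", "FoxNews", "FoxNews", "NBC"]

# Accuracy = number of correct predictions / total number of predictions.
accuracy = accuracy_score(y_true, y_pred)

# The same value computed by hand, to show what the metric measures.
manual = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, manual)
```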
### Model Evaluation
```python
import joblib
import pandas as pd
from google.colab import drive
from huggingface_hub import hf_hub_download
from sklearn.metrics import classification_report

# Mount Google Drive (this snippet is written for a Colab notebook)
drive.mount('/content/drive')

# Load the test set
test_df = pd.read_csv("/content/drive/MyDrive/test_data_random_subset.csv", encoding="Windows-1252")

# Log in with a Hugging Face token (notebook shell command)
!huggingface-cli login

# Download the model and the fitted vectorizer from the Hub
model_path = hf_hub_download(repo_id="CIS5190FinalProj/RandomForest", filename="best_rf_model.pkl")
vectorizer_path = hf_hub_download(repo_id="CIS5190FinalProj/RandomForest", filename="tfidf_vectorizer.pkl")

# Load both artifacts from the downloaded files
rf_model = joblib.load(model_path)
tfidf_vectorizer = joblib.load(vectorizer_path)

# Extract the headlines and the target labels from the test set
X_test = test_df['title']
y_test = test_df['label']

# Transform the headlines into TF-IDF features
X_test_transformed = tfidf_vectorizer.transform(X_test)

# Predict with the loaded model
y_pred = rf_model.predict(X_test_transformed)

# Print the classification report
print(classification_report(y_test, y_pred))
```