rwmasood committed
Commit 7b682b4 · verified · 1 Parent(s): c261d05

Update README.md

Files changed (1): README.md (+8 -8)
README.md CHANGED
@@ -35,7 +35,7 @@ base_model:
  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
 
- tokenizer = AutoTokenizer.from_pretrained("upstage/llama-65b-instruct")
+ tokenizer = AutoTokenizer.from_pretrained("empirischtech/Llama-3.1-10b-instruct")
  model = AutoModelForCausalLM.from_pretrained(
      "upstage/llama-65b-instruct",
      device_map="auto",
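
Note that this hunk updates only the tokenizer id: the `model = AutoModelForCausalLM.from_pretrained(...)` context lines still point at `upstage/llama-65b-instruct`. Below is a minimal load-and-generate sketch assuming the intent is to pull both tokenizer and weights from `empirischtech/Llama-3.1-10b-instruct`; the dtype, prompt, and generation settings are illustrative assumptions, not part of the commit.

```python
# Sketch only: assumes tokenizer AND weights should come from the same repo,
# since the hunk above changes just the tokenizer line.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

repo_id = "empirischtech/Llama-3.1-10b-instruct"  # repo id taken from the added line above

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    device_map="auto",          # spread layers over the available devices
    torch_dtype=torch.float16,  # assumption: half precision to reduce memory use
)

prompt = "Explain in one paragraph what a tokenizer does."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Stream tokens to stdout as they are generated, matching the TextStreamer import above.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
output = model.generate(**inputs, streamer=streamer, max_new_tokens=256)
output_text = tokenizer.decode(output[0], skip_special_tokens=True)
```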
@@ -61,13 +61,13 @@ output_text = tokenizer.decode(output[0], skip_special_tokens=True)
  ## Evaluation Results
 
  ### Overview
- - We conducted a performance evaluation based on the tasks being evaluated on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
- We evaluated our model on four benchmark datasets, which include `ARC-Challenge`, `HellaSwag`, `MMLU`, and `TruthfulQA`.
- We used the [lm-evaluation-harness repository](https://github.com/EleutherAI/lm-evaluation-harness), specifically commit [b281b0921b636bc36ad05c0b0b0763bd6dd43463](https://github.com/EleutherAI/lm-evaluation-harness/tree/b281b0921b636bc36ad05c0b0b0763bd6dd43463)
- - We used [MT-bench](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge), a set of challenging multi-turn open-ended questions, to evaluate the models
+ - The performance evaluation is based on the tasks being evaluated on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
+ The model is evaluated on three benchmark datasets, which include `ARC-Challenge`, `HellaSwag` and `MMLU`.
+ The library used is [lm-evaluation-harness repository](https://github.com/EleutherAI/lm-evaluation-harness)
+
 
  ### Main Results
- | Model | H4(Avg) | ARC | HellaSwag | MMLU | TruthfulQA | | MT_Bench |
+ | Model | ARC | HellaSwag | MMLU | TruthfulQA | | MT_Bench |
  |--------------------------------------------------------------------|----------|----------|----------|------|----------|-|-------------|
  | **[Llama-2-70b-instruct-v2](https://huggingface.co/upstage/Llama-2-70b-instruct-v2)**(Ours, Open LLM Leaderboard) | **73** | **71.1** | **87.9** | **70.6** | **62.2** | | **7.44063** |
  | [Llama-2-70b-instruct](https://huggingface.co/upstage/Llama-2-70b-instruct) (Ours, Open LLM Leaderboard) | 72.3 | 70.9 | 87.5 | 69.8 | 61 | | 7.24375 |
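
One observation on this hunk: the removed `H4(Avg)` header looks like the plain mean of the four benchmark scores that remain in each row, and the added header names one column fewer than the unchanged data rows carry. A hedged sanity check of that average (the interpretation of `H4(Avg)` is an assumption; the numbers come from the first data row above):

```python
# Assumes H4(Avg) is the arithmetic mean of ARC, HellaSwag, MMLU and TruthfulQA.
scores = {"ARC": 71.1, "HellaSwag": 87.9, "MMLU": 70.6, "TruthfulQA": 62.2}
h4_avg = sum(scores.values()) / len(scores)
print(round(h4_avg, 2))  # 72.95, consistent with the 73 shown for Llama-2-70b-instruct-v2
```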
@@ -80,8 +80,8 @@ We used the [lm-evaluation-harness repository](https://github.com/EleutherAI/lm-
 
 
  ### Scripts to generate evalution results
- - Prepare evaluation environments:
- ```
+
+ ```python
  # install from https://github.com/EleutherAI/lm-evaluation-harness
  pip install lm-eval>=0.4.7
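
The added lines open a `python` fence, yet the first command in the block is a shell `pip install`. For reference, here is a sketch of what a Python evaluation script for the three benchmarks named in the Overview could look like with lm-evaluation-harness >= 0.4; the task names, batch size, and result keys follow the harness's usual conventions and are assumptions, not content from this commit.

```python
# Hedged sketch: run ARC-Challenge, HellaSwag and MMLU through lm-evaluation-harness.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=empirischtech/Llama-3.1-10b-instruct,dtype=float16",
    tasks=["arc_challenge", "hellaswag", "mmlu"],
    batch_size=8,
)

# Per-task metrics (accuracy, normalized accuracy, ...) are reported under "results".
for task, metrics in results["results"].items():
    print(task, metrics)
```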
 
 