rwmasood committed
Commit 7b682b4 · verified · 1 Parent(s): c261d05

Update README.md

Files changed (1): README.md (+8 -8)
README.md CHANGED
@@ -35,7 +35,7 @@ base_model:
  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
 
- tokenizer = AutoTokenizer.from_pretrained("upstage/llama-65b-instruct")
+ tokenizer = AutoTokenizer.from_pretrained("empirischtech/Llama-3.1-10b-instruct")
  model = AutoModelForCausalLM.from_pretrained(
      "upstage/llama-65b-instruct",
      device_map="auto",
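
Note that this hunk updates only the tokenizer id: the `model = AutoModelForCausalLM.from_pretrained(...)` context lines still point at `upstage/llama-65b-instruct`. Below is a minimal load-and-generate sketch assuming the intent is to pull both tokenizer and weights from `empirischtech/Llama-3.1-10b-instruct`; the dtype, prompt, and generation settings are illustrative assumptions, not part of the commit.

```python
# Sketch only: assumes tokenizer AND weights should come from the same repo,
# since the hunk above changes just the tokenizer line.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

repo_id = "empirischtech/Llama-3.1-10b-instruct"  # repo id taken from the added line above

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    device_map="auto",          # spread layers over the available devices
    torch_dtype=torch.float16,  # assumption: half precision to reduce memory use
)

prompt = "Explain in one paragraph what a tokenizer does."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Stream tokens to stdout as they are generated, matching the TextStreamer import above.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
output = model.generate(**inputs, streamer=streamer, max_new_tokens=256)
output_text = tokenizer.decode(output[0], skip_special_tokens=True)
```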
@@ -61,13 +61,13 @@ output_text = tokenizer.decode(output[0], skip_special_tokens=True)
  ## Evaluation Results
 
  ### Overview
- - We conducted a performance evaluation based on the tasks being evaluated on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
- We evaluated our model on four benchmark datasets, which include `ARC-Challenge`, `HellaSwag`, `MMLU`, and `TruthfulQA`.
- We used the [lm-evaluation-harness repository](https://github.com/EleutherAI/lm-evaluation-harness), specifically commit [b281b0921b636bc36ad05c0b0b0763bd6dd43463](https://github.com/EleutherAI/lm-evaluation-harness/tree/b281b0921b636bc36ad05c0b0b0763bd6dd43463)
- - We used [MT-bench](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge), a set of challenging multi-turn open-ended questions, to evaluate the models
+ - The performance evaluation is based on the tasks being evaluated on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
+ The model is evaluated on three benchmark datasets, which include `ARC-Challenge`, `HellaSwag` and `MMLU`.
+ The library used is [lm-evaluation-harness repository](https://github.com/EleutherAI/lm-evaluation-harness)
+
 
  ### Main Results
- | Model | H4(Avg) | ARC | HellaSwag | MMLU | TruthfulQA | | MT_Bench |
+ | Model | ARC | HellaSwag | MMLU | TruthfulQA | | MT_Bench |
  |--------------------------------------------------------------------|----------|----------|----------|------|----------|-|-------------|
  | **[Llama-2-70b-instruct-v2](https://huggingface.co/upstage/Llama-2-70b-instruct-v2)**(Ours, Open LLM Leaderboard) | **73** | **71.1** | **87.9** | **70.6** | **62.2** | | **7.44063** |
  | [Llama-2-70b-instruct](https://huggingface.co/upstage/Llama-2-70b-instruct) (Ours, Open LLM Leaderboard) | 72.3 | 70.9 | 87.5 | 69.8 | 61 | | 7.24375 |
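
One observation on this hunk: the removed `H4(Avg)` header looks like the plain mean of the four benchmark scores that remain in each row, and the added header names one column fewer than the unchanged data rows carry. A hedged sanity check of that average (the interpretation of `H4(Avg)` is an assumption; the numbers come from the first data row above):

```python
# Assumes H4(Avg) is the arithmetic mean of ARC, HellaSwag, MMLU and TruthfulQA.
scores = {"ARC": 71.1, "HellaSwag": 87.9, "MMLU": 70.6, "TruthfulQA": 62.2}
h4_avg = sum(scores.values()) / len(scores)
print(round(h4_avg, 2))  # 72.95, consistent with the 73 shown for Llama-2-70b-instruct-v2
```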
@@ -80,8 +80,8 @@ We used the [lm-evaluation-harness repository](https://github.com/EleutherAI/lm-
 
 
  ### Scripts to generate evalution results
- - Prepare evaluation environments:
- ```
+
+ ```python
  # install from https://github.com/EleutherAI/lm-evaluation-harness
  pip install lm-eval>=0.4.7
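
The added lines open a `python` fence, yet the first command in the block is a shell `pip install`. For reference, here is a sketch of what a Python evaluation script for the three benchmarks named in the Overview could look like with lm-evaluation-harness >= 0.4; the task names, batch size, and result keys follow the harness's usual conventions and are assumptions, not content from this commit.

```python
# Hedged sketch: run ARC-Challenge, HellaSwag and MMLU through lm-evaluation-harness.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=empirischtech/Llama-3.1-10b-instruct,dtype=float16",
    tasks=["arc_challenge", "hellaswag", "mmlu"],
    batch_size=8,
)

# Per-task metrics (accuracy, normalized accuracy, ...) are reported under "results".
for task, metrics in results["results"].items():
    print(task, metrics)
```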
 
 