zhiyucheng committed
Commit b64a154 · verified · 1 Parent(s): b335a0b

Reformat table

Files changed (1): README.md (+27 -6)
README.md CHANGED
@@ -90,12 +90,33 @@ Please refer to the [TensorRT-LLM benchmarking documentation](https://github.com
 
 ## Evaluation
 The accuracy (MMLU, 5-shot) and throughput (tokens per second, TPS) benchmark results are presented in the table below:
-
-| Precision | MMLU | TPS |
-|-----------|-------|---------|
-| FP16 | 68.6 | 8,579.93 |
-| FP8 | 68.3 | 11,062.90 |
-
+<table>
+<tr>
+<td><strong>Precision</strong>
+</td>
+<td><strong>MMLU</strong>
+</td>
+<td><strong>TPS</strong>
+</td>
+</tr>
+<tr>
+<td>FP16
+</td>
+<td>68.6
+</td>
+<td>8,579.93
+</td>
+</tr>
+<tr>
+<td>FP8
+</td>
+<td>68.3
+</td>
+<td>11,062.90
+</td>
+</tr>
+</table>
 
 We benchmarked with tensorrt-llm v0.13 on 8 H100 GPUs, using batch size 1024 for the throughput measurements with in-flight batching enabled. We achieved a **~1.3x** speedup with FP8.
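As a quick sanity check on the quoted **~1.3x** figure, the speedup is simply the ratio of the two TPS numbers in the table above (a minimal sketch; the values are copied from the table, not re-measured):

```python
# Throughput values (tokens per second) from the evaluation table above.
fp16_tps = 8_579.93
fp8_tps = 11_062.90

# FP8 speedup over FP16 is the ratio of the two throughputs.
speedup = fp8_tps / fp16_tps
print(f"FP8 speedup over FP16: {speedup:.2f}x")
```

This evaluates to roughly 1.29x, consistent with the ~1.3x claimed in the README.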