Update README.md
README.md CHANGED
@@ -27,7 +27,7 @@ base_model:
Bigger models, more data, and better hardware have consistently improved deep learning performance. Whether in NLP or computer vision, larger models have led to major breakthroughs. However, most cutting-edge models are still trained from scratch, meaning they start with randomly initialized weights. The problem? Training costs are skyrocketing.
- To address the escalating computational costs of training large-scale models, various approaches have been proposed. For instance, **[arXiv.2212.05055](https://doi.org/10.48550/arXiv.2212.05055)** demonstrates a method where pretrained large models are upscaled by selectively retaining dense layers called Mixture-of-Experts (MoE)
+ To address the escalating computational costs of training large-scale models, various approaches have been proposed. For instance, **[arXiv.2212.05055](https://doi.org/10.48550/arXiv.2212.05055)** demonstrates a method where pretrained dense models are upcycled into **Mixture-of-Experts (MoE)** models, reusing the pretrained dense layers to initialize the experts, followed by continued pretraining. This strategy can potentially reduce the training budget by up to **50%** while maintaining performance.
In this work, we take a step toward realizing such an approach. Specifically, we extend an existing **8B**-parameter model to **10B** parameters by initializing the additional layers with pretrained weights, followed by continued pretraining on a smaller dataset across multiple epochs. Due to budget constraints, we were unable to surpass the base model on the **EleutherAI** evaluation harness benchmarks. However, our approach yielded improved **perplexity**, demonstrating the potential of cost-efficient scaling strategies for large language model development.
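
For concreteness, below is a minimal sketch of one way the layer-extension step can be done with Hugging Face `transformers`: depth up-scaling a Llama-style 8B checkpoint toward roughly 10B parameters by duplicating a block of pretrained decoder layers instead of adding randomly initialized ones. The checkpoint name, the duplicated layer indices, and the duplicate-and-copy initialization are illustrative assumptions, not the exact recipe used for this model.

```python
# Illustrative depth up-scaling sketch: grow a pretrained decoder-only model
# by duplicating a block of its middle transformer layers, so the new layers
# start from pretrained weights instead of random initialization.
import copy

import torch
from torch import nn
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "meta-llama/Meta-Llama-3-8B"  # placeholder; substitute the actual 8B base checkpoint

model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(BASE)

# Choose which pretrained layers to duplicate; repeating middle layers is a
# common heuristic, and the number of copies sets the final parameter count
# (about 8 extra Llama-3-8B-sized layers lands near 10B parameters).
dup_indices = set(range(12, 20))  # illustrative choice

old_layers = model.model.layers
new_layers = nn.ModuleList()
for i, layer in enumerate(old_layers):
    new_layers.append(layer)
    if i in dup_indices:
        # Deep-copy so the duplicate trains independently of the original layer.
        new_layers.append(copy.deepcopy(layer))

# Re-index attention modules that track their position for the KV cache
# (attribute present in recent transformers versions).
for idx, layer in enumerate(new_layers):
    if hasattr(layer, "self_attn") and hasattr(layer.self_attn, "layer_idx"):
        layer.self_attn.layer_idx = idx

model.model.layers = new_layers
model.config.num_hidden_layers = len(new_layers)

print(f"layers: {len(old_layers)} -> {len(new_layers)}")
print(f"params: {sum(p.numel() for p in model.parameters()) / 1e9:.2f}B")

# The up-scaled model is then continued-pretrained with the usual causal-LM
# objective before evaluation (e.g. perplexity, lm-evaluation-harness).
model.save_pretrained("upscaled-10b")
tokenizer.save_pretrained("upscaled-10b")
```

The intent of initializing the extra layers from pretrained weights is that continued pretraining on a comparatively small corpus can recover a coherent larger model rather than training the new capacity from scratch.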