Update README.md
README.md
@@ -9,17 +9,19 @@ This repo contains the **context-based instruction synthesizer** used in our pap
 We explore supervised multitask pre-training by proposing ***Instruction Pre-Training***, a framework that scalably augments massive raw corpora with instruction-response pairs to pre-train language models. The instruction-response pairs are generated by an efficient instruction synthesizer built on open-source models. In our experiments, we synthesize 200M instruction-response pairs covering 40+ task categories to verify the effectiveness of *Instruction Pre-Training*. ***Instruction Pre-Training* outperforms *Vanilla Pre-training* in both general pre-training from scratch and domain-adaptive continued pre-training.** In pre-training from scratch, *Instruction Pre-Training* not only improves pre-trained base models but also benefits more from further instruction tuning. In continual pre-training, *Instruction Pre-Training* enables Llama3-8B to be comparable to or even outperform Llama3-70B.
 
 <p align='center'>
-    <img src="
+    <img src="https://cdn-uploads.huggingface.co/production/uploads/66711d2ee12fa6cc5f5dfc89/vRdsFIVQptbNaGiZ18Lih.png" width="400">
 </p>
 
 ## Synthesize Instruction-Response Pairs from Any Raw Corpora
 
 We conduct multitask fine-tuning on a language model to develop an instruction synthesizer capable of generating instruction-response pairs from any raw text.
 
 <p align='center'>
-    <img src="./
+    <img src="https://cdn-uploads.huggingface.co/production/uploads/66711d2ee12fa6cc5f5dfc89/0889QyG59QM3rPeZlcTzZ.png" width="700">
 </p>
 
-
+The fine-tuning data are available at [ft-instruction-synthesizer-collection](https://huggingface.co/datasets/instruction-pretrain/ft-instruction-synthesizer-collection).
+
+To prompt the synthesizer to generate instruction-response pairs based on a given raw text:
 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
 
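One of the added lines links the fine-tuning data used to train the synthesizer. For readers who want a local copy of that collection, here is a minimal sketch using `huggingface_hub`; only the repo id comes from the diff, while the `local_dir` name is an arbitrary choice and nothing about the collection's internal file layout is implied here:

```python
# Minimal sketch: download the fine-tuning data linked in the new README line.
# Only the repo id comes from the diff; the local_dir name is an arbitrary choice.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="instruction-pretrain/ft-instruction-synthesizer-collection",
    repo_type="dataset",  # it is a dataset repo, not a model repo
    local_dir="./ft-instruction-synthesizer-collection",
)
print(f"Fine-tuning data downloaded to: {local_path}")
```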
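The Python block in the hunk is cut off right after the import line, so the full usage is not visible in this diff. The sketch below shows one way the synthesizer could be driven through `transformers`; the model id `instruction-pretrain/instruction-synthesizer`, the example raw text, and the bare-text prompt are assumptions rather than the repo's documented template, so defer to the full README for the exact prompt format:

```python
# Minimal sketch only: the model id and prompt format are assumptions
# (the hunk stops right after the import); see the full README for the
# synthesizer's documented prompt template.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "instruction-pretrain/instruction-synthesizer"  # assumed repo id
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Example raw text from which instruction-response pairs should be synthesized.
raw_text = "The immune system protects the body against infection by identifying and neutralizing pathogens."

# Feed the raw text to the synthesizer and decode only its continuation,
# which is expected to contain the generated instruction-response pairs.
inputs = tokenizer(raw_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=400, do_sample=False)
generated = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(generated, skip_special_tokens=True))
```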
