Update README.md

README.md CHANGED

@@ -92,3 +92,90 @@ cd text-generation-inference/integration_tests

make gen-client
pip install . --no-cache-dir
```

### Minimal Sample

*To try this out with the fms-native compiled model, please execute the following:*

#### Install

```bash
git clone https://github.com/foundation-model-stack/fms-extras
(cd fms-extras && pip install -e .)
pip install transformers==4.35.0 sentencepiece numpy
```
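
If the editable install succeeded, the package should be importable. A quick optional sanity check (assuming the `fms-extras` distribution exposes the `fms_extras` module, as in its repository layout):

```bash
python -c "import fms_extras, transformers; print(transformers.__version__)"  # expect 4.35.0
```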

#### Run Sample

```bash
python sample_client.py
```

_Note: the first prompt may be slower, as there is a slight warmup time._

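The warmup is `torch.compile` doing its tracing and code generation on the first invocation; later calls reuse the compiled artifact. A minimal standalone illustration of that first-call cost (not part of the sample; TorchInductor needs a working C/C++ toolchain):

```bash
python - <<'EOF'
import time
import torch

# Compilation is deferred until the first call.
f = torch.compile(lambda x: torch.sin(x) + torch.cos(x))
x = torch.randn(1000)

t0 = time.time(); f(x); print(f"first call (compiles): {time.time() - t0:.3f}s")
t0 = time.time(); f(x); print(f"second call (cached):  {time.time() - t0:.5f}s")
EOF
```
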
### Minimal Sample (Llama 3)

#### Install

```bash
git clone --branch llama_3_variants --single-branch https://github.com/JRosenkranz/fms-extras
(cd fms-extras && pip install -e .)
pip install transformers==4.35.0 sentencepiece numpy
```
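
The commands below expect a local Hugging Face checkpoint at `MODEL_PATH`. One way to fetch it, sketched here as an assumption rather than part of the original instructions (requires `huggingface_hub` and a Hub account that has accepted the Meta-Llama-3 license):

```bash
pip install "huggingface_hub[cli]"
huggingface-cli login   # the Meta-Llama-3 weights are gated
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct \
    --local-dir /path/to/llama3/hf/Meta-Llama-3-8B-Instruct
```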

#### Run Sample

##### batch_size=1 (compile + cudagraphs)

```bash
MODEL_PATH=/path/to/llama3/hf/Meta-Llama-3-8B-Instruct
python fms-extras/scripts/paged_speculative_inference.py \
    --architecture=llama3 \
    --variant=8b \
    --model_path=$MODEL_PATH \
    --model_source=hf \
    --tokenizer=$MODEL_PATH \
    --speculator_path=ibm-fms/llama3-8b-accelerator \
    --speculator_source=hf \
    --speculator_variant=3_2b \
    --top_k_tokens_per_head=4,3,2,2 \
    --compile \
    --compile_mode=reduce-overhead
```
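
For intuition on `--top_k_tokens_per_head=4,3,2,2` (assuming the flag works like fms-extras' multi-head speculator candidate trees): each of the four prediction heads proposes its top-k next tokens, and candidate continuations are the Cartesian product across heads, so this setting evaluates 4 × 3 × 2 × 2 = 48 candidate four-token continuations per base-model forward pass:

```bash
python -c "from math import prod; print(prod([4, 3, 2, 2]))"  # 48 candidate sequences
```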

##### batch_size=1 (compile)

```bash
MODEL_PATH=/path/to/llama3/hf/Meta-Llama-3-8B-Instruct
python fms-extras/scripts/paged_speculative_inference.py \
    --architecture=llama3 \
    --variant=8b \
    --model_path=$MODEL_PATH \
    --model_source=hf \
    --tokenizer=$MODEL_PATH \
    --speculator_path=ibm-fms/llama3-8b-accelerator \
    --speculator_source=hf \
    --speculator_variant=3_2b \
    --top_k_tokens_per_head=4,3,2,2 \
    --compile
```
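
For context on the two variants above: plain `--compile` uses `torch.compile`'s default mode, while `--compile_mode=reduce-overhead` additionally captures CUDA graphs to amortize kernel-launch overhead, which is most visible at batch_size=1. A rough standalone sketch of the two modes (hypothetical example, requires a CUDA GPU):

```bash
python - <<'EOF'
import torch

a = torch.nn.Linear(4096, 4096).cuda()
b = torch.nn.Linear(4096, 4096).cuda()

compiled_default = torch.compile(a)                          # TorchInductor codegen
compiled_graphs = torch.compile(b, mode="reduce-overhead")   # + CUDA graph capture

x = torch.randn(1, 4096, device="cuda")
print(compiled_default(x).shape, compiled_graphs(x).shape)
EOF
```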

##### batch_size=4 (compile)

```bash
MODEL_PATH=/path/to/llama3/hf/Meta-Llama-3-8B-Instruct
python fms-extras/scripts/paged_speculative_inference.py \
    --architecture=llama3 \
    --variant=8b \
    --model_path=$MODEL_PATH \
    --model_source=hf \
    --tokenizer=$MODEL_PATH \
    --speculator_path=ibm-fms/llama3-8b-accelerator \
    --speculator_source=hf \
    --speculator_variant=3_2b \
    --top_k_tokens_per_head=4,3,2,2 \
    --batch_input \
    --compile
```

Sample code can be found [here](https://github.com/foundation-model-stack/fms-extras/pull/24).
