dinerburger committed (verified)
Commit 0051aa1 · 1 Parent(s): 2552762

Update README.md

Files changed (1): README.md (+11 -2)
README.md CHANGED
@@ -20,8 +20,17 @@ This is a 4.125 EXL2 quant of [Qwen/Qwen2.5-Coder-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct)

This quant was made using a [customized version](https://github.com/dinerburger/exllamav2/tree/max-quant-first-last) of exllamav2-0.2.7 (patch graciously provided by [DeusImperator](https://huggingface.co/DeusImperator)) with the default dataset and an extended quantization sample length (8k instead of the default 2k). It also uses -head_bits=8 and a maximum-accuracy quant (8bpw) for the first and last layers; all other layers use the normally chosen methods. The method and name (4.125bpw_L) are inspired by the GGUF naming scheme.

- This allows use of a staggering 64K context at Q4 KV cache quantization on a single 24GB VRAM card
- with minimal loss of accuracy. (Remember to set `rope_scale` to 2 in your tabbyAPI config file, however.)
+ ## A note about context length
+ By default, this model caps out at 32K context. Additional configuration is required to unlock full 128K context. Namely, this code block must be added to config.json:
+
+ ```
+ "rope_scaling": {
+   "factor": 4.0,
+   "original_max_position_embeddings": 32768,
+   "type": "yarn"
+ }
+ ```
+
+ Once this is done, you can push the model to 64K context at Q4 KV cache quantization on a single 24GB VRAM card with minimal loss of accuracy.

# Qwen2.5-Coder-32B-Instruct Original Card
<a href="https://chat.qwenlm.ai/" target="_blank" style="margin: 2px;">
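
For reference, here is a minimal sketch of what the added section describes: loading this quant at 64K context with a Q4 KV cache through exllamav2's Python API. The model path is a placeholder, and the class and argument names follow exllamav2 0.2.x example code, so they may differ slightly between releases.

```
# Minimal sketch (assumes exllamav2 0.2.x; the model path is a placeholder)
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_Q4, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

# Point at a local download of this quant (placeholder path)
config = ExLlamaV2Config("/models/Qwen2.5-Coder-32B-Instruct-exl2-4.125bpw_L")
config.max_seq_len = 65536  # 64K tokens; the rope_scaling block above is what makes >32K usable

model = ExLlamaV2(config)

# Q4 KV cache: quantized keys/values keep a 64K context within 24GB of VRAM
cache = ExLlamaV2Cache_Q4(model, max_seq_len=65536, lazy=True)
model.load_autosplit(cache, progress=True)

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)

print(generator.generate(prompt="Write a Python function that merges two sorted lists.",
                         max_new_tokens=256))
```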