joaogante HF Staff committed on
Commit d938814 · 1 Parent(s): bdb93c2

add example

Files changed (1)
  1. README.md +80 -7
README.md CHANGED
@@ -4,23 +4,24 @@ tags:
  - custom_generate
  ---

- ⚠️ WORK IN PROGRESS ⚠️
-
  ## Description
- Implementation of the cache introduced in the [Attention Sinks paper](https://huggingface.co/papers/2309.17453).
  It allows the model to generate beyond the length of its context window, without losing fluency in the conversation.
- It's also a solution to contain the memory footprint of the KV cache. As it discards past tokens, the model will lose
- the ability to generate tokens that depend on the context that was discarded.

- This implementation should match the `SinkCache` class present in `transformers<4.53.0`.

  ![Sink Cache diagram from the original paper](https://arxiv.org/html/2309.17453v4/x1.png)

  ## Base model


  ## Model compatibility
- - Decoder-only models


  ## Additional Arguments
@@ -35,7 +36,79 @@ in `generate.py`, in this repository.

  ## Example usage

  ```py
  # requires `transformers>=4.52.0`

  ```
 
  - custom_generate
  ---

  ## Description
+ Implementation of the KV cache introduced in the [Attention Sinks paper](https://huggingface.co/papers/2309.17453).
  It allows the model to generate beyond the length of its context window, without losing fluency in the conversation.
+ This is done by always keeping the first few tokens ("sink tokens") in the KV cache, as models often pay a large
+ amount of attention to them. As it discards past non-sink tokens, the model will lose the ability to generate tokens
+ that depend on the context that was discarded. It's also a solution to contain the memory footprint of the KV cache.

+ This implementation matches the `SinkCache` class present in `transformers<4.53.0`.

  ![Sink Cache diagram from the original paper](https://arxiv.org/html/2309.17453v4/x1.png)

+
  ## Base model
+ - [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)


  ## Model compatibility
+ - Decoder-only transformers models


  ## Additional Arguments
 
  ## Example usage

+ We can use the custom generation method in this repository like the base `generate` from `transformers`:
+
  ```py
  # requires `transformers>=4.52.0`
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ # Preparing model, tokenizer, and model inputs
+ tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
+ model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B", device_map="auto")
+ messages = [{"role": "user", "content": "Tell me a story about a cat."}]
+ text = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True,
+     enable_thinking=False
+ )
+ model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+
+ # Using sink cache
+ gen_out = model.generate(
+     # usual `generate` arguments
+     **model_inputs,
+     do_sample=False,
+     max_new_tokens=100,
+     return_dict_in_generate=True,
+     # sink cache arguments (default `window_length=256`)
+     custom_generate="transformers-community/sink_cache",
+     trust_remote_code=True,
+ )
+ print(tokenizer.batch_decode(gen_out.sequences, skip_special_tokens=True))
+ assert "sinkcache" in str(type(gen_out.past_key_values)).lower()
+ # ['user\nTell me a story about a cat.\nassistant\n<think>\n\n</think>\n\nOnce upon a time, in a cozy village nestled
+ # between rolling hills and a sparkling lake, there lived a cat named Luna. Luna was small and fluffy, with a curious
+ # eyes that sparkled with wonder. She had a soft, warm coat that shimmered like the morning sun, and her tail was
+ # always wagging in playful motions.\n\nOne day, while exploring the village, Luna noticed a curious sight: a young
+ # boy playing with a ball on the lake. She followed him closely, her heart racing']
+ ```
+
+ Continuing the example above, we can confirm some properties of the `SinkCache`:
+
+ ```py
+ # `max_new_tokens` < `window_length` in the example above -> matches output with the default cache
+ gen_out = model.generate(
+     **model_inputs,
+     do_sample=False,
+     max_new_tokens=100,
+     return_dict_in_generate=True,
+ )
+ print(tokenizer.batch_decode(gen_out.sequences, skip_special_tokens=True))
+ assert "dynamiccache" in str(type(gen_out.past_key_values)).lower()
+ # ['user\nTell me a story about a cat.\nassistant\n<think>\n\n</think>\n\nOnce upon a time, in a cozy village nestled
+ # between rolling hills and a sparkling lake, there lived a cat named Luna. Luna was small and fluffy, with a curious
+ # eyes that sparkled with wonder. She had a soft, warm coat that shimmered like the morning sun, and her tail was
+ # always wagging in playful motions.\n\nOne day, while exploring the village, Luna noticed a curious sight: a young
+ # boy playing with a ball on the lake. She followed him closely, her heart racing']

+ # if we set a smaller `window_length`, the story is less coherent after that point, but the used cache is also
+ # significantly smaller
+ gen_out = model.generate(
+     # usual `generate` arguments
+     **model_inputs,
+     do_sample=False,
+     max_new_tokens=100,
+     return_dict_in_generate=True,
+     # sink cache arguments
+     custom_generate="transformers-community/sink_cache",
+     trust_remote_code=True,
+     window_length=50,
+ )
+ print(tokenizer.batch_decode(gen_out.sequences, skip_special_tokens=True))
+ # ["user\nTell me a story about a cat.\nassistant\n<think>\n\n</think>\n\nOnce upon a time, in a cozy village nestled
+ # between rolling hills and a sparkling lake, there lived a cat named Luna. Luna was small and fluffy, with a curious
+ # heart. She loved exploring the village and playing with her friends.\n\nOne day, Luna noticed something unusual.
+ # She looked around and saw a shadow moving in the dark. She ran quickly, but she couldn't see the shadow. She
+ # thought maybe it was a ghost or something else.\n\nAs she was running, she heard a voice."]
  ```
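
The Description in this diff says the cache always keeps the first few "sink" tokens and discards older non-sink tokens. As a rough illustration only (not the code in this repository's `generate.py`), the retention rule can be sketched on plain token positions. The helper name `kept_positions` and the parameter `num_sink_tokens` with its default of 4 are assumptions made for this sketch; `window_length` is taken here to mean the total number of cached entries, sinks included.

```python
def kept_positions(seq_len: int, window_length: int, num_sink_tokens: int = 4) -> list[int]:
    """Return the token positions a sink-style cache would still hold.

    Toy model (assumption): the first `num_sink_tokens` positions are always
    kept, and the remaining slots hold the most recent tokens, for at most
    `window_length` cached entries in total.
    """
    if not 0 <= num_sink_tokens <= window_length:
        raise ValueError("num_sink_tokens must be between 0 and window_length")
    if seq_len <= window_length:
        # The cache is still growing: nothing has been evicted yet.
        return list(range(seq_len))
    sinks = list(range(num_sink_tokens))
    num_recent = window_length - num_sink_tokens
    recent = list(range(seq_len - num_recent, seq_len))
    return sinks + recent


print(kept_positions(seq_len=6, window_length=8))   # [0, 1, 2, 3, 4, 5]
print(kept_positions(seq_len=12, window_length=8))  # [0, 1, 2, 3, 8, 9, 10, 11]
```

In the second call, positions 4 to 7 have been dropped, which is why text that depends on that part of the context can no longer be generated faithfully. A real implementation also has to handle per-layer key/value tensors and positional information, which this sketch ignores.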