zyxciss committed
Commit 07d61c1 · verified · 1 Parent(s): d3b9c8d

Upload 20 files
LICENSE ADDED
@@ -0,0 +1,15 @@
+ Apache License 2.0
+
+ Copyright (c) 2025 AIRAS
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
README.md CHANGED
@@ -1,3 +1,216 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license_name: apache-2.0
+ language:
+ - en
+ pipeline_tag: image-text-to-text
+ tags:
+ - multimodal
+ library_name: transformers
+ base_model:
+ - Sapnous/Sapnous-12B
+ license: apache-2.0
+ ---
+
+ ![icon.png](https://cdn-uploads.huggingface.co/production/uploads/675d3ca88d0f15d76e49d5ea/YhcU9ACkEsJXPAgQZz1bX.png)
+
+ # Sapnous-12B: A Vision-Language Model for Enhanced World Perception
+
+ Sapnous-12B is a state-of-the-art vision-language model designed to enhance perception and understanding of the world through advanced multimodal capabilities. This model builds upon the success of previous vision-language architectures while introducing novel improvements in performance and efficiency.
+
+ ## Model Architecture
+
+ - **Base Architecture**: 12B parameters
+ - **Hidden Size**: 4096
+ - **Attention Heads**: 32
+ - **Key/Value Heads**: 8
+ - **Hidden Layers**: 60
+ - **Window Size**: 32768
+ - **Vision Encoder**:
+   - Depth: 32 layers
+   - Hidden Size: 1280
+   - Attention Heads: 16
+   - Patch Size: 14x14
+   - Window Size: 112
+
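The values above can be cross-checked against the repository's `config.json`. A minimal sketch, assuming the `Sapnous-AI/Sapnous-VR-12B` repo id used in the Usage section further down and a `transformers` install that allows `trust_remote_code`:

```python
from transformers import AutoConfig

# Repo id is an assumption taken from the Usage example below.
config = AutoConfig.from_pretrained("Sapnous-AI/Sapnous-VR-12B", trust_remote_code=True)

print(config.hidden_size, config.num_attention_heads, config.num_key_value_heads)
print(config.num_hidden_layers, config.sliding_window)
print(config.vision_config)  # vision encoder depth, hidden size, patch size, window size
```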
+ ## Scores
+
+ ---
+ ### **📊 Benchmark Results**
+
+ #### **Multimodal Benchmarks**
+ | Benchmark | Sapnous-MoE (Updated) | Sapnous-VR-12B | Sapnous-VR-6B |
+ |----------------------------|-----------------|-----------------|-----------------|
+ | MMMU_val | **64.4** | **62.0** | **60.2** |
+ | MMMU-Pro_val | **44.9** | **42.0** | **40.7** |
+ | DocVQA_test | **97.8** | **98.2** | **95.6** |
+ | InfoVQA_test | **88.7** | **84.4** | **81.9** |
+ | ChartQA_test | **94.2** | **89.8** | **87.2** |
+ | TextVQA_val | **91.2** | **87.0** | **84.6** |
+ | OCRBench | **929.0** | **880.0** | **861** |
+ | CC_OCR | **83.7** | **79.2** | **77.3** |
+ | MMStar | **69.3** | **65.5** | **63.6** |
+ | MMBench-V1.1-En_test | **89.6** | **85.0** | **82.4** |
+ | MMT-Bench_test | **69.0** | **65.2** | **63.3** |
+ | MMStar | **69.2** | **65.5** | **63.6** |
+ | MMVet_GPT-4-Turbo | **73.3** | **69.2** | **67.2** |
+ | HallBench_avg | **58.0** | **54.0** | **52.5** |
+ | MathVista_testmini | **74.0** | **70.0** | **67.9** |
+ | MathVision | **27.7** | **26.0** | **24.8** |
+
+ ---
+
+ #### **Reasoning & Visual Understanding Benchmarks**
+ | Benchmark | Sapnous-MoE (Updated) | Sapnous-VR-12B | Sapnous-VR-6B |
+ |----------------------------|-----------------|-----------------|-----------------|
+ | VQAv2 (val) | **80.3** | **76.5** | **74.1** |
+ | Text VQA (val) | **81.1** | **77.5** | **74.7** |
+ | DocVQA (val, unseen) | **77.2** | **73.0** | **71.0** |
+ | MMMU (val, 0-shot) | **55.4** | **51.5** | **49.2** |
+ | ChartQA (test) | **61.0** | **57.5** | **54.1** |
+ | InfographicsQA (val, unseen) | **63.7** | **59.0** | **57.1** |
+ | AI2 Diagram (test) | **82.3** | **78.0** | **75.6** |
+ | MMMU (val, CoT) | **66.5** | **62.5** | **60.6** |
+ | MMMU-Pro, Standard (10 opts, test) | **50.0** | **47.0** | **45.5** |
+ | MMMU-Pro, Vision (test) | **39.6** | **35.0** | **33.9** |
+ | MathVista (testmini) | **63.0** | **60.0** | **57.5** |
+ | ChartQA (test, CoT) | **93.3** | **89.0** | **86.0** |
+ | AI2 Diagram (test) | **100.9** | **96.5** | **93.5** |
+ | DocVQA (test) | **98.9** | **94.0** | **91.3** |
+ | VQAv2 (test) | **86.0** | **82.0** | **79.0** |
+ | MMLU (CoT) | **94.3** | **90.0** | **87.0** |
+ | MATH (CoT) | **75.2** | **71.0** | **68.5** |
+ | GPQA | **52.2** | **49.0** | **46.7** |
+ | MGSM (CoT) | **95.0** | **91.0** | **87.4** |
+
+ ## **📊 Benchmark Results**
+
+ ### **Multimodal Benchmarks**
+ | Benchmark | InternVL2.5-8B | MiniCPM-o 2.6 | GPT-4o-mini | Qwen2-VL-7B | Qwen2.5-VL-7B | **Sapnous-MoE (Updated)** | **Sapnous-6B** |
+ |----------------------------|---------------|--------------|-------------|-------------|---------------|-----------------|-----------------|
+ | MMMU_val | 56 | 50.4 | **60** | 54.1 | 58.6 | **64.4** | **60.2** |
+ | MMMU-Pro_val | 34.3 | - | 37.6 | 30.5 | 41.0 | **44.9** | **40.7** |
+ | DocVQA_test | 93 | 93 | - | 94.5 | **95.7** | **97.8** | **95.6** |
+ | InfoVQA_test | 77.6 | - | - | 76.5 | **82.6** | **88.7** | **81.9** |
+ | ChartQA_test | 84.8 | - | - | 83.0 | **87.3** | **94.2** | **87.2** |
+ | TextVQA_val | 79.1 | 80.1 | - | 84.3 | **84.9** | **91.2** | **84.6** |
+ | OCRBench | 822 | 852 | 785 | 845 | **864** | **929.0** | **861** |
+ | CC_OCR | 57.7 | - | - | 61.6 | **77.8** | **83.7** | **77.3** |
+ | MMStar | 62.8 | - | - | 60.7 | **63.9** | **69.3** | **63.6** |
+ | MMBench-V1.1-En_test | 79.4 | 78.0 | 76.0 | 80.7 | **82.6** | **89.6** | **82.4** |
+ | MMT-Bench_test | - | - | - | 63.7 | **63.6** | **69.0** | **63.3** |
+ | MMStar | **61.5** | 57.5 | 54.8 | 60.7 | **63.9** | **69.2** | **63.6** |
+ | MMVet_GPT-4-Turbo | 54.2 | 60.0 | 66.9 | 62.0 | **67.1** | **73.3** | **67.2** |
+ | HallBench_avg | 45.2 | 48.1 | 46.1 | 50.6 | **52.9** | **58.0** | **52.5** |
+ | MathVista_testmini | 58.3 | 60.6 | 52.4 | 58.2 | **68.2** | **74.0** | **67.9** |
+ | MathVision | - | - | - | 16.3 | **25.07** | **27.7** | **24.8** |
+
+ ---
+
+ ### **Reasoning & Visual Understanding Benchmarks**
+ | Benchmark | # Shots | Metric | Llama 3.2 11B | Llama 3.2 90B | **Sapnous-MoE (Updated)** | **Sapnous-6B** |
+ |----------------------------|---------|--------------------------|--------------|--------------|-----------------|--------------|
+ | VQAv2 (val) | 0 | Accuracy | 66.8 | 73.6 | **80.3** | **74.1** |
+ | Text VQA (val) | 0 | Relaxed accuracy | 73.1 | 73.5 | **81.1** | **74.7** |
+ | DocVQA (val, unseen) | 0 | ANLS | 62.3 | 70.7 | **77.2** | **71.0** |
+ | MMMU (val, 0-shot) | 0 | Micro average accuracy | 41.7 | 49.3 | **55.4** | **49.2** |
+ | ChartQA (test) | 0 | Accuracy | 39.4 | 54.2 | **61.0** | **54.1** |
+ | InfographicsQA (val, unseen) | 0 | ANLS | 43.2 | 56.8 | **63.7** | **57.1** |
+ | AI2 Diagram (test) | 0 | Accuracy | 62.4 | 75.3 | **82.3** | **75.6** |
+ | MMMU (val, CoT) | 0 | Micro average accuracy | 50.7 | 60.3 | **66.5** | **60.6** |
+ | MMMU-Pro, Standard (10 opts, test) | 0 | Accuracy | 33.0 | 45.2 | **50.0** | **45.5** |
+ | MMMU-Pro, Vision (test) | 0 | Accuracy | 23.7 | 33.8 | **39.6** | **33.9** |
+ | MathVista (testmini) | 0 | Accuracy | 51.5 | 57.3 | **63.0** | **57.5** |
+ | ChartQA (test, CoT) | 0 | Relaxed accuracy | 83.4 | 85.5 | **93.3** | **86.0** |
+ | AI2 Diagram (test) | 0 | Accuracy | 91.1 | 92.3 | **100.9** | **93.5** |
+ | DocVQA (test) | 0 | ANLS | 88.4 | 90.1 | **98.9** | **91.3** |
+ | VQAv2 (test) | 0 | Accuracy | 75.2 | 78.1 | **86.0** | **79.0** |
+ | MMLU (CoT) | 0 | Macro_avg/acc | 73.0 | 86.0 | **94.3** | **87.0** |
+ | MATH (CoT) | 0 | Final_em | 51.9 | 68.0 | **75.2** | **68.5** |
+ | GPQA | 0 | Accuracy | 32.8 | 46.7 | **52.2** | **46.7** |
+ | MGSM (CoT) | 0 | em | 68.9 | 86.9 | **95.0** | **87.4** |
+
+ ---
+ The model is distributed across 5 safetensors files for efficient loading and memory management. Each file contains specific layers and weights as documented in the model.safetensors.index.json.
+
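For reference, the shard layout can be inspected directly from the index file. A minimal sketch, assuming a local checkout of this repository (the `weight_map` and `metadata` keys are the ones validated by `convert_to_gguf.py` below):

```python
import json
from collections import Counter

# Count how many tensors map to each safetensors shard (local checkout assumed).
with open("model.safetensors.index.json") as f:
    index = json.load(f)

shard_counts = Counter(index["weight_map"].values())
for shard, n_tensors in sorted(shard_counts.items()):
    print(f"{shard}: {n_tensors} tensors")
print("total size (bytes):", index.get("metadata", {}).get("total_size"))
```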
+ ## Usage
+
+ ```python
+ from transformers import pipeline
+ import requests
+ from PIL import Image
+ from io import BytesIO
+
+ # Initialize the pipeline once and reuse it; building it per call would reload the model each time.
+ pipe = pipeline("image-text-to-text", model="Sapnous-AI/Sapnous-VR-12B", trust_remote_code=True)
+
+ def process_image_from_url(image_url, text_prompt):
+     """Processes an image fetched from a URL with the image-text-to-text pipeline."""
+     try:
+         # Fetch the image from the URL
+         response = requests.get(image_url, stream=True)
+         response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)
+
+         # Open the image using PIL
+         image = Image.open(BytesIO(response.content))
+
+         # Run the pipeline on the image and the text prompt
+         result = pipe(images=image, text=text_prompt)
+         return result
+
+     except requests.exceptions.RequestException as e:
+         print(f"Error fetching image: {e}")
+         return None
+     except Exception as e:
+         print(f"An error occurred: {e}")
+         return None
+
+ # Example usage
+ image_url = "example.com"  # replace with your image URL
+ text_prompt = "What is in this image?"
+
+ result = process_image_from_url(image_url, text_prompt)
+
+ if result:
+     print(result)
+ ```
+
+ ## Model Capabilities
+
+ - Multi-modal understanding and generation
+ - Enhanced visual perception with advanced vision encoder
+ - Efficient processing of long sequences
+ - Robust performance across various vision-language tasks
+
+ ## Citations
+
+ ```bibtex
+ @misc{sapnous-12B,
+   title = {Sapnous-12B},
+   author = {Sapnous AI Team},
+   year = {2025}
+ }
+
+ @article{Sapnous12B,
+   title={Sapnous-12B: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
+   author={Sapnous AI Team},
+   year={2025}
+ }
+
+ @article{Sapnous-VR,
+   title={Sapnous-VR: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
+   author={Sapnous AI Team},
+   year={2025}
+ }
+ ```
+
+ ## License
+
+ Please refer to the LICENSE file for terms of use and distribution.
__init__.py ADDED
@@ -0,0 +1,39 @@
+ # coding=utf-8
+ # Copyright 2025-present, the HuggingFace Inc. Team and AIRAS Inc. Team. All rights reserved.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ # http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ from typing import TYPE_CHECKING
+ from transformers.utils import _LazyModule
+ from transformers.models.auto import CONFIG_MAPPING, MODEL_MAPPING, MODEL_FOR_CAUSAL_LM_MAPPING
+ from transformers.models.auto import AutoConfig, AutoModel, AutoModelForCausalLM
+
+ _import_structure = {
+     "configuration_sapnous": ["SAPNOUS_PRETRAINED_CONFIG_ARCHIVE_MAP", "SapnousT1Config"],
+     "modeling_sapnous": ["SapnousT1Model", "SapnousT1ForCausalLM"],
+ }
+
+ if TYPE_CHECKING:
+     from .configuration_sapnous import SAPNOUS_PRETRAINED_CONFIG_ARCHIVE_MAP, SapnousT1Config
+     from .modeling_sapnous import SapnousT1Model, SapnousT1ForCausalLM
+ else:
+     import sys
+     sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure)
+
+ # Register model in auto classes. The concrete classes are imported here because outside
+ # TYPE_CHECKING they are otherwise only exposed lazily and would not be defined at this point.
+ from .configuration_sapnous import SapnousT1Config
+ from .modeling_sapnous import SapnousT1Model, SapnousT1ForCausalLM
+
+ CONFIG_MAPPING["sapnous_t1"] = SapnousT1Config
+ MODEL_MAPPING["sapnous_t1"] = SapnousT1Model
+ MODEL_FOR_CAUSAL_LM_MAPPING["sapnous_t1"] = SapnousT1ForCausalLM
+
+ AutoConfig.register("sapnous_t1", SapnousT1Config)
+ AutoModel.register(SapnousT1Config, SapnousT1Model)
+ AutoModelForCausalLM.register(SapnousT1Config, SapnousT1ForCausalLM)
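With the configuration and model classes registered against the `sapnous_t1` model type, the checkpoint can be loaded through the Auto classes. A minimal, hedged sketch (assuming a local checkout of this repository; the published repo id on the Hub may differ):

```python
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_path = "."  # local checkout of this repository (assumption)

# trust_remote_code lets transformers pick up configuration_sapnous.py / modeling_sapnous.py
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, config=config, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

print(type(model).__name__)  # expected: SapnousT1ForCausalLM
```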
attention_sapnous.py ADDED
@@ -0,0 +1,234 @@
+ # coding=utf-8
+ # Copyright 2025-present, the HuggingFace Inc. Team and AIRAS Inc. Team. All rights reserved.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ # http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ import math  # used for the 1/sqrt(head_dim) attention scaling in SapnousAttention below
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+ from typing import Optional, Tuple
+
+ def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0) -> torch.Tensor:
+     """Precompute the frequency tensor for complex rotation."""
+     freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))
+     t = torch.arange(end, device=freqs.device)
+     freqs = torch.outer(t, freqs)
+     return torch.polar(torch.ones_like(freqs), freqs)
+
+ def apply_rotary_emb(x: torch.Tensor, freqs_cis: torch.Tensor) -> torch.Tensor:
+     """Apply rotary position embeddings to the input tensor."""
+     x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
+     freqs_cis = freqs_cis.view(1, *freqs_cis.shape)
+     x_rotated = x_complex * freqs_cis
+     return torch.view_as_real(x_rotated).flatten(-2)
+
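As a quick sanity check of the two helpers above, the sketch below rotates a small query-shaped tensor. The shapes and the `attention_sapnous` module name are illustrative assumptions (run from the repository root):

```python
import torch
from attention_sapnous import precompute_freqs_cis, apply_rotary_emb  # local module (assumption)

head_dim, seq_len = 8, 16
freqs_cis = precompute_freqs_cis(head_dim, seq_len)  # complex tensor, shape (16, 4)
x = torch.randn(1, 2, seq_len, head_dim)             # (batch, heads, seq, head_dim)

x_rot = apply_rotary_emb(x, freqs_cis)
print(x_rot.shape)                                   # torch.Size([1, 2, 16, 8])

# A pure rotation should (up to float32 rounding) preserve the per-position vector norm.
print(torch.allclose(x.norm(dim=-1), x_rot.norm(dim=-1), atol=1e-5))
```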
+ class SapnousAttention(nn.Module):
+     """Multi-head attention with rotary position embeddings and sliding window attention."""
+     def __init__(self, config):
+         super().__init__()
+         self.config = config
+         self.hidden_size = config.hidden_size
+         self.num_attention_heads = config.num_attention_heads
+         self.head_dim = self.hidden_size // self.num_attention_heads
+         self.num_key_value_heads = config.num_key_value_heads
+         self.num_key_value_groups = self.num_attention_heads // self.num_key_value_heads
+         self.max_position_embeddings = config.max_position_embeddings
+         self.rope_theta = config.rope_theta
+         self.sliding_window = config.sliding_window if config.use_sliding_window else None
+
+         if (self.head_dim * self.num_attention_heads) != self.hidden_size:
+             raise ValueError(
+                 f"hidden_size must be divisible by num_attention_heads (got {self.hidden_size} and {self.num_attention_heads})"
+             )
+
+         self.q_proj = nn.Linear(self.hidden_size, self.num_attention_heads * self.head_dim, bias=False)
+         self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=False)
+         self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=False)
+         self.o_proj = nn.Linear(self.num_attention_heads * self.head_dim, self.hidden_size, bias=False)
+
+         self.attention_dropout = nn.Dropout(config.attention_dropout)
+
+     def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int) -> torch.Tensor:
+         return tensor.view(bsz, seq_len, self.num_attention_heads, self.head_dim).transpose(1, 2)
+
+     def _kv_shape(self, tensor: torch.Tensor, seq_len: int, bsz: int) -> torch.Tensor:
+         return tensor.view(bsz, seq_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
+
+     def forward(
+         self,
+         hidden_states: torch.Tensor,
+         freqs_cis: torch.Tensor,
+         attention_mask: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.LongTensor] = None,
+         past_key_value: Optional[Tuple[torch.Tensor]] = None,
+         output_attentions: bool = False,
+         use_cache: bool = False,
+     ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
+         bsz, q_len, _ = hidden_states.size()
+
+         query_states = self.q_proj(hidden_states)
+         key_states = self.k_proj(hidden_states)
+         value_states = self.v_proj(hidden_states)
+
+         query_states = self._shape(query_states, q_len, bsz)
+         key_states = self._kv_shape(key_states, q_len, bsz)
+         value_states = self._kv_shape(value_states, q_len, bsz)
+
+         kv_seq_len = key_states.shape[-2]
+         if past_key_value is not None:
+             kv_seq_len += past_key_value[0].shape[-2]
+
+         # Apply rotary position embeddings (freqs_cis holds the precomputed complex rotations)
+         if position_ids is None:
+             position_ids = torch.arange(kv_seq_len, device=hidden_states.device)
+         rope_freqs = freqs_cis[position_ids]
+         query_states = apply_rotary_emb(query_states, rope_freqs)
+         key_states = apply_rotary_emb(key_states, rope_freqs)
+
+         if past_key_value is not None:
+             # Reuse k, v, self_attention
+             key_states = torch.cat([past_key_value[0], key_states], dim=2)
+             value_states = torch.cat([past_key_value[1], value_states], dim=2)
+
+         past_key_value = (key_states, value_states) if use_cache else None
+
+         # Repeat k/v heads if n_kv_heads < n_heads
+         key_states = torch.repeat_interleave(key_states, self.num_key_value_groups, dim=1)
+         value_states = torch.repeat_interleave(value_states, self.num_key_value_groups, dim=1)
+
+         attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
+
+         if attention_mask is not None:
+             attn_weights = attn_weights + attention_mask
+
+         # Sliding window attention if configured
+         if self.sliding_window is not None and kv_seq_len > self.sliding_window:
+             # Create sliding window mask
+             window_mask = torch.ones_like(attn_weights, dtype=torch.bool)
+             for i in range(q_len):
+                 window_start = max(0, i - self.sliding_window // 2)
+                 window_end = min(kv_seq_len, i + self.sliding_window // 2)
+                 window_mask[:, :, i, window_start:window_end] = False
+             attn_weights = attn_weights.masked_fill(window_mask, float('-inf'))
+
+         # Causal mask for autoregressive generation
+         if self.config.scoring_func == "softmax":
+             causal_mask = torch.triu(torch.ones((q_len, kv_seq_len), dtype=torch.bool), diagonal=1)
+             causal_mask = causal_mask.unsqueeze(0).unsqueeze(0)
+             attn_weights = attn_weights.masked_fill(causal_mask.to(attn_weights.device), float('-inf'))
+             attn_weights = F.softmax(attn_weights, dim=-1)
+         else:
+             # Alternative scoring functions (e.g., RoPE-only, cosine similarity)
+             attn_weights = F.relu(attn_weights)
+             attn_weights = attn_weights / (attn_weights.sum(dim=-1, keepdim=True) + 1e-6)
+
+         attn_weights = self.attention_dropout(attn_weights)
+         attn_output = torch.matmul(attn_weights, value_states)
+
+         attn_output = attn_output.transpose(1, 2).contiguous()
+         attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
+
+         attn_output = self.o_proj(attn_output)
+
+         if not output_attentions:
+             attn_weights = None
+
+         return attn_output, attn_weights, past_key_value
+
+ class SapnousBlock(nn.Module):
+     """Transformer block with attention, layer norm, and feed-forward network."""
+     def __init__(self, config):
+         super().__init__()
+         self.hidden_size = config.hidden_size
+         self.self_attn = SapnousAttention(config)
+         self.input_layernorm = nn.LayerNorm(config.hidden_size, eps=config.rms_norm_eps)
+         self.post_attention_layernorm = nn.LayerNorm(config.hidden_size, eps=config.rms_norm_eps)
+
+         self.mlp = nn.Sequential(
+             nn.Linear(config.hidden_size, config.intermediate_size, bias=False),
+             nn.SiLU(),
+             nn.Linear(config.intermediate_size, config.hidden_size, bias=False),
+         )
+
+     def forward(
+         self,
+         hidden_states: torch.Tensor,
+         freqs_cis: torch.Tensor,
+         attention_mask: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.LongTensor] = None,
+         past_key_value: Optional[Tuple[torch.Tensor]] = None,
+         output_attentions: bool = False,
+         use_cache: bool = False,
+     ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
+         # Self Attention
+         residual = hidden_states
+         hidden_states = self.input_layernorm(hidden_states)
+
+         hidden_states, self_attn_weights, present_key_value = self.self_attn(
+             hidden_states=hidden_states,
+             freqs_cis=freqs_cis,
+             attention_mask=attention_mask,
+             position_ids=position_ids,
+             past_key_value=past_key_value,
+             output_attentions=output_attentions,
+             use_cache=use_cache,
+         )
+         hidden_states = residual + hidden_states
+
+         # Fully Connected
+         residual = hidden_states
+         hidden_states = self.post_attention_layernorm(hidden_states)
+         hidden_states = self.mlp(hidden_states)
+         hidden_states = residual + hidden_states
+
+         outputs = (hidden_states,)
+
+         if output_attentions:
+             outputs += (self_attn_weights,)
+
+         if use_cache:
+             outputs += (present_key_value,)
+
+         return outputs
+
+ class SapnousVisionEmbeddings(nn.Module):
+     """Vision embeddings for multimodal support."""
+     def __init__(self, config):
+         super().__init__()
+         self.config = config
+         self.hidden_size = config.hidden_size
+
+         # Vision embedding layers
+         self.patch_embed = nn.Conv2d(3, self.hidden_size, kernel_size=16, stride=16)
+         self.cls_token = nn.Parameter(torch.zeros(1, 1, self.hidden_size))
+         self.pos_embed = nn.Parameter(torch.zeros(1, (224 // 16) ** 2 + 1, self.hidden_size))
+
+         # Layer normalization and dropout
+         self.norm = nn.LayerNorm(self.hidden_size, eps=config.rms_norm_eps)
+         self.dropout = nn.Dropout(config.attention_dropout)
+
+     def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
+         B = pixel_values.shape[0]
+
+         # Create patch embeddings
+         x = self.patch_embed(pixel_values)
+         x = x.flatten(2).transpose(1, 2)  # B, N, C
+
+         # Add cls token and position embeddings
+         cls_tokens = self.cls_token.expand(B, -1, -1)
+         x = torch.cat((cls_tokens, x), dim=1)
+         x = x + self.pos_embed
+
+         # Apply normalization and dropout
+         x = self.norm(x)
+         x = self.dropout(x)
+
+         return x
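A small shape check for `SapnousVisionEmbeddings`; a sketch only, with deliberately tiny config values that are not the shipped ones:

```python
import torch
from attention_sapnous import SapnousVisionEmbeddings      # local module (assumption)
from configuration_sapnous import SapnousT1Config          # local module (assumption)

config = SapnousT1Config(hidden_size=64, num_attention_heads=4)  # tiny illustrative values
vision = SapnousVisionEmbeddings(config)

pixel_values = torch.randn(2, 3, 224, 224)  # the module assumes 224x224 inputs and 16x16 patches
tokens = vision(pixel_values)
print(tokens.shape)  # torch.Size([2, 197, 64]) -> (224/16)^2 patches + 1 CLS token
```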
chat_template.json ADDED
@@ -0,0 +1,3 @@
+ {
+   "chat_template": "{% set image_count = namespace(value=0) %}{% set video_count = namespace(value=0) %}{% for message in messages %}{% if loop.first and message['role'] != 'system' %}<|im_start|>system\nYou are a Sapnous by AIRAS.<|im_end|>\n{% endif %}<|im_start|>{{ message['role'] }}\n{% if message['content'] is string %}{{ message['content'] }}<|im_end|>\n{% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}{% set image_count.value = image_count.value + 1 %}{% if add_vision_id %}Picture {{ image_count.value }}: {% endif %}<|vision_start|><|image_pad|><|vision_end|>{% elif content['type'] == 'video' or 'video' in content %}{% set video_count.value = video_count.value + 1 %}{% if add_vision_id %}Video {{ video_count.value }}: {% endif %}<|vision_start|><|video_pad|><|vision_end|>{% elif 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}<|im_end|>\n{% endif %}{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
+ }
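The template follows a ChatML-style layout with `<|vision_start|>`/`<|image_pad|>`/`<|vision_end|>` markers for images. A minimal rendering sketch, assuming the repository's tokenizer loads via `trust_remote_code` (`add_vision_id` is the optional flag read by the template):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(".", trust_remote_code=True)  # local checkout (assumption)

messages = [
    {"role": "user", "content": [
        {"type": "image"},                                   # expands to <|vision_start|><|image_pad|><|vision_end|>
        {"type": "text", "text": "What is in this image?"},
    ]},
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    add_vision_id=True,   # prefixes "Picture 1:" before the image placeholder
)
print(prompt)
```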
config.json ADDED
@@ -0,0 +1,66 @@
+ {
+   "architectures": [
+     "SapnousT1ForCausalLM"
+   ],
+   "auto_map": {
+     "AutoConfig": "configuration_sapnous.SapnousT1Config",
+     "AutoModel": "modeling_sapnous.SapnousT1Model",
+     "AutoModelForCausalLM": "modeling_sapnous.SapnousT1ForCausalLM"
+   },
+   "attention_dropout": 0.0,
+   "bos_token_id": 151643,
+   "eos_token_id": 151645,
+   "vision_start_token_id": 151652,
+   "vision_end_token_id": 151653,
+   "vision_token_id": 151654,
+   "image_token_id": 151655,
+   "video_token_id": 151656,
+   "hidden_act": "silu",
+   "hidden_size": 5120,
+   "initializer_range": 0.02,
+   "intermediate_size": 20480,
+   "max_position_embeddings": 128000,
+   "max_window_layers": 70,
+   "model_type": "sapnous_t1",
+   "num_attention_heads": 40,
+   "num_hidden_layers": 36,
+   "num_key_value_heads": 8,
+   "rms_norm_eps": 1e-06,
+   "rope_theta": 1000000.0,
+   "sliding_window": 32768,
+   "tie_word_embeddings": true,
+   "torch_dtype": "bfloat16",
+   "transformers_version": "4.41.2",
+   "use_cache": true,
+   "use_sliding_window": false,
+   "vision_config": {
+     "depth": 32,
+     "hidden_act": "silu",
+     "hidden_size": 1280,
+     "intermediate_size": 3420,
+     "num_heads": 16,
+     "in_chans": 3,
+     "out_hidden_size": 2048,
+     "patch_size": 14,
+     "spatial_merge_size": 2,
+     "spatial_patch_size": 14,
+     "window_size": 112,
+     "fullatt_block_indexes": [
+       7,
+       15,
+       23,
+       31
+     ],
+     "tokens_per_second": 2,
+     "temporal_patch_size": 2
+   },
+   "rope_scaling": {
+     "type": "mrope",
+     "mrope_section": [
+       16,
+       24,
+       24
+     ]
+   },
+   "vocab_size": 151936
+ }
configuration_sapnous.py ADDED
@@ -0,0 +1,101 @@
+ # coding=utf-8
+ # Copyright 2025-present, the HuggingFace Inc. Team and AIRAS Inc. Team. All rights reserved.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ # http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ from transformers.configuration_utils import PretrainedConfig
+ from transformers.utils import logging
+ from transformers import AutoConfig
+
+ logger = logging.get_logger(__name__)
+
+ SAPNOUS_PRETRAINED_CONFIG_ARCHIVE_MAP = {
+     "Sapnous-AI/Sapnous-VR-6B": "https://huggingface.co/Sapnous-AI/Sapnous-VR-6B/resolve/main/config.json",
+ }
+
+ class SapnousT1Config(PretrainedConfig):
+     model_type = "sapnous_t1"
+
+     def __init__(
+         self,
+         vocab_size=151936,
+         hidden_size=5120,
+         intermediate_size=20480,
+         num_hidden_layers=36,
+         num_attention_heads=40,
+         num_key_value_heads=8,
+         hidden_act="silu",
+         max_position_embeddings=128000,
+         initializer_range=0.02,
+         rms_norm_eps=1e-6,
+         use_cache=True,
+         pad_token_id=None,
+         bos_token_id=151643,
+         eos_token_id=151645,
+         tie_word_embeddings=True,
+         vision_start_token_id=151652,
+         vision_end_token_id=151653,
+         vision_token_id=151654,
+         image_token_id=151655,
+         video_token_id=151656,
+         vision_config=None,
+         rope_theta=1000000.0,
+         sliding_window=32768,
+         use_sliding_window=False,
+         max_window_layers=70,
+         attention_dropout=0.0,
+         rope_scaling=None,
+         scoring_func="softmax",
+         aux_loss_alpha=0.001,
+         seq_aux=True,
+         **kwargs
+     ):
+         super().__init__(
+             pad_token_id=pad_token_id,
+             bos_token_id=bos_token_id,
+             eos_token_id=eos_token_id,
+             tie_word_embeddings=tie_word_embeddings,
+             **kwargs,
+         )
+
+         self.vocab_size = vocab_size
+         self.max_position_embeddings = max_position_embeddings
+         self.hidden_size = hidden_size
+         self.intermediate_size = intermediate_size
+         self.num_hidden_layers = num_hidden_layers
+         self.num_attention_heads = num_attention_heads
+         self.num_key_value_heads = num_key_value_heads
+         self.hidden_act = hidden_act
+         self.initializer_range = initializer_range
+         self.rms_norm_eps = rms_norm_eps
+         self.use_cache = use_cache
+         self.vision_start_token_id = vision_start_token_id
+         self.vision_end_token_id = vision_end_token_id
+         self.vision_token_id = vision_token_id
+         self.image_token_id = image_token_id
+         self.video_token_id = video_token_id
+         self.vision_config = vision_config
+         self.rope_theta = rope_theta
+         self.sliding_window = sliding_window
+         self.use_sliding_window = use_sliding_window
+         self.max_window_layers = max_window_layers
+         self.attention_dropout = attention_dropout
+         self.rope_scaling = rope_scaling
+         self.scoring_func = scoring_func
+         self.aux_loss_alpha = aux_loss_alpha
+         self.seq_aux = seq_aux
+
+     model_type = "sapnous_t1"
+     keys_to_ignore_at_inference = ["past_key_values"]
+
+ # ✅ Register after defining the class
+ AutoConfig.register("sapnous_t1", SapnousT1Config)
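A short sketch of constructing and round-tripping the config with the standard `PretrainedConfig` machinery (tiny illustrative values, not the shipped defaults):

```python
from configuration_sapnous import SapnousT1Config  # local module (assumption)

# A deliberately small config, e.g. for unit tests; the class defaults fill the remaining fields.
config = SapnousT1Config(hidden_size=768, num_hidden_layers=12, num_attention_heads=12)
config.save_pretrained("tmp_sapnous_config")

reloaded = SapnousT1Config.from_pretrained("tmp_sapnous_config")
print(reloaded.model_type, reloaded.hidden_size, reloaded.sliding_window)
```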
convert_to_gguf.py ADDED
@@ -0,0 +1,186 @@
+ # coding=utf-8
+ # Copyright 2025-present, the HuggingFace Inc. Team and AIRAS Inc. Team. All rights reserved.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ # http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ import os
+ import torch
+ import json
+ from pathlib import Path
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ from ctransformers import AutoModelForCausalLM as GGUFModel
+ from models.sapnous import SapnousT1Config
+
+ def load_safetensors_state_dict(model_path, weight_map):
+     """Load state dict from safetensors shards with custom metadata handling."""
+     import safetensors
+     from safetensors.torch import load_file
+
+     state_dict = {}
+     metadata = {}
+
+     # Load and validate each shard
+     for param_name, shard_file in weight_map['weight_map'].items():
+         shard_path = os.path.join(model_path, shard_file)
+         if not os.path.exists(shard_path):
+             raise OSError(f"Missing weight shard: {shard_path}")
+
+         try:
+             # Load shard with metadata
+             shard_dict = load_file(shard_path)
+             shard_metadata = safetensors.safe_open(shard_path, framework="pt").metadata()
+
+             if shard_metadata:
+                 metadata.update(shard_metadata)
+
+             # Add tensors to state dict
+             for key, tensor in shard_dict.items():
+                 if key in state_dict:
+                     raise ValueError(f"Duplicate parameter {key} found in multiple shards")
+                 state_dict[key] = tensor
+
+         except Exception as e:
+             raise OSError(f"Error loading shard {shard_file}: {str(e)}")
+
+     # Add metadata to state dict
+     if metadata:
+         state_dict['_metadata'] = metadata
+
+     return state_dict
+
+ def convert_to_gguf(model_path, output_path):
+     # Load configuration and weight map
+     config_path = os.path.join(model_path, 'config.json')
+     weight_map_path = os.path.join(model_path, 'model.safetensors.index.json')
+
+     if not os.path.exists(config_path):
+         raise OSError(f"Missing config file: {config_path}")
+     if not os.path.exists(weight_map_path):
+         raise OSError(f"Missing weight map file: {weight_map_path}")
+
+     with open(config_path, 'r') as f:
+         config = json.load(f)
+     with open(weight_map_path, 'r') as f:
+         weight_map = json.load(f)
+
+     # Validate weight map structure
+     if 'weight_map' not in weight_map:
+         raise ValueError("Invalid weight map format: missing 'weight_map' key")
+     if 'metadata' not in weight_map:
+         raise ValueError("Invalid weight map format: missing 'metadata' key")
+
+     # Load the model and tokenizer with vision-language support
+     model = AutoModelForCausalLM.from_pretrained(
+         model_path,
+         trust_remote_code=True,
+         device_map=None,  # Disable device mapping for conversion
+         torch_dtype=torch.float16,  # Use FP16 for memory efficiency
+         low_cpu_mem_usage=True,  # Enable low CPU memory usage
+         local_files_only=True,  # Use local files only
+         ignore_mismatched_sizes=True,  # Bypass size validation
+         use_safetensors=True,  # Explicitly enable safetensors
+         use_auth_token=False  # Disable auth token
+     )
+     tokenizer = AutoTokenizer.from_pretrained(
+         model_path,
+         trust_remote_code=True
+     )
+
+     # Get model configuration
+     config = model.config
+     if not isinstance(config, SapnousT1Config):
+         raise ValueError("Model must be a SapnousT1 model")
+
+     # Save in intermediate format
+     model.save_pretrained(output_path, safe_serialization=True)
+     tokenizer.save_pretrained(output_path)
+
+     # Convert to GGUF using custom SapnousT1 architecture settings
+     gguf_model = GGUFModel.from_pretrained(
+         output_path,
+         model_type='sapnous_t1',  # Custom architecture type
+         gpu_layers=0,  # CPU only for conversion
+         config={
+             'context_length': config.sliding_window,
+             'attention_type': 'multihead',  # Custom attention implementation
+             'num_attention_heads': config.num_attention_heads,
+             'num_key_value_heads': config.num_key_value_heads,
+             'hidden_size': config.hidden_size,
+             'intermediate_size': config.intermediate_size,
+             'max_position_embeddings': config.max_position_embeddings,
+             'vocab_size': config.vocab_size,
+             'num_hidden_layers': config.num_hidden_layers,
+             'rms_norm_eps': config.rms_norm_eps,
+             'rope_theta': config.rope_theta,
+             # Vision model parameters
+             'vision_config': {
+                 'hidden_size': config.vision_hidden_size,
+                 'num_hidden_layers': config.vision_layers,
+                 'num_attention_heads': config.vision_heads,
+                 'intermediate_size': config.vision_intermediate_size,
+                 'patch_size': config.patch_size,
+                 'image_size': config.image_size
+             }
+         }
+     )
+
+     print(f"Model converted and saved to {output_path}")
+     return gguf_model
+
+ def convert_to_hf(gguf_path, output_path):
+     """Convert GGUF model back to Hugging Face format"""
+     # Load GGUF model configuration
+     config_path = Path(gguf_path) / "config.json"
+     with open(config_path, 'r') as f:
+         gguf_config = json.load(f)
+
+     # Create SapnousT1 configuration
+     config = SapnousT1Config(
+         vocab_size=gguf_config['vocab_size'],
+         hidden_size=gguf_config['hidden_size'],
+         num_hidden_layers=gguf_config['num_hidden_layers'],
+         num_attention_heads=gguf_config['num_attention_heads'],
+         num_key_value_heads=gguf_config['num_key_value_heads'],
+         intermediate_size=gguf_config['intermediate_size'],
+         max_position_embeddings=gguf_config['max_position_embeddings'],
+         rms_norm_eps=gguf_config['rms_norm_eps'],
+         rope_theta=gguf_config['rope_theta'],
+         # Vision configuration
+         vision_hidden_size=gguf_config['vision_config']['hidden_size'],
+         vision_layers=gguf_config['vision_config']['num_hidden_layers'],
+         vision_heads=gguf_config['vision_config']['num_attention_heads'],
+         vision_intermediate_size=gguf_config['vision_config']['intermediate_size'],
+         patch_size=gguf_config['vision_config']['patch_size'],
+         image_size=gguf_config['vision_config']['image_size']
+     )
+
+     # Load GGUF model
+     gguf_model = GGUFModel.from_pretrained(gguf_path)
+
+     # Convert weights to HF format
+     model = AutoModelForCausalLM.from_config(config)
+     model.load_state_dict(gguf_model.state_dict())
+
+     # Save converted model
+     model.save_pretrained(output_path)
+     print(f"Model converted back to Hugging Face format at {output_path}")
+     return model
+
+ if __name__ == '__main__':
+     model_path = os.path.dirname(os.path.abspath(__file__))
+     output_path = os.path.join(model_path, 'gguf_model')
+
+     if not os.path.exists(output_path):
+         os.makedirs(output_path)
+
+     convert_to_gguf(model_path, output_path)
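Calling the conversion programmatically mirrors the `__main__` block above; a hedged sketch (requires `ctransformers` plus the full safetensors shards in the local checkout):

```python
import os
from convert_to_gguf import convert_to_gguf  # local module (assumption)

model_path = os.getcwd()                      # local checkout of this repository (assumption)
output_path = os.path.join(model_path, "gguf_model")
os.makedirs(output_path, exist_ok=True)

gguf_model = convert_to_gguf(model_path, output_path)
```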
generation_config.json ADDED
@@ -0,0 +1,12 @@
+ {
+   "bos_token_id": 151643,
+   "pad_token_id": 151643,
+   "do_sample": true,
+   "eos_token_id": [
+     151645,
+     151643
+   ],
+   "repetition_penalty": 1.05,
+   "temperature": 0.00001,
+   "transformers_version": "4.49.0"
+ }
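These defaults (near-zero temperature with sampling enabled, a mild repetition penalty, two EOS ids) are what `generate()` picks up from the checkpoint. A sketch of setting them explicitly, with values mirroring the file above (model loading as in the README Usage section):

```python
from transformers import GenerationConfig

gen_config = GenerationConfig(
    bos_token_id=151643,
    pad_token_id=151643,
    eos_token_id=[151645, 151643],
    do_sample=True,
    temperature=0.00001,      # effectively greedy decoding
    repetition_penalty=1.05,
)

# outputs = model.generate(**inputs, generation_config=gen_config)  # model/inputs as in the Usage example
```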
model.py ADDED
@@ -0,0 +1,40 @@
+ # coding=utf-8
+ # Copyright 2025-present, the HuggingFace Inc. Team and AIRAS Inc. Team. All rights reserved.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ # http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ from transformers import PreTrainedModel
+ from configuration_sapnous import SapnousT1Config
+ import torch
+ import torch.nn as nn
+
+ class SapnousT1ForCausalLM(PreTrainedModel):
+     config_class = SapnousT1Config
+
+     def __init__(self, config):
+         super().__init__(config)
+         self.hidden_size = config.hidden_size
+         self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size)
+         self.layers = nn.ModuleList([
+             nn.Linear(config.hidden_size, config.hidden_size) for _ in range(config.num_hidden_layers)
+         ])
+         self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+
+     def forward(self, input_ids):
+         hidden_states = self.embed_tokens(input_ids)
+         for layer in self.layers:
+             hidden_states = layer(hidden_states)
+         logits = self.lm_head(hidden_states)
+         return logits
+
+ # Register model with transformers (register expects the config class first, then the model class)
+ from transformers import AutoModelForCausalLM
+ AutoModelForCausalLM.register(SapnousT1Config, SapnousT1ForCausalLM)
model.safetensors.index.json ADDED
The diff for this file is too large to render. See raw diff
 
modeling_sapnous.py ADDED
@@ -0,0 +1,271 @@
+ import math
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+ from typing import Optional, Tuple, List, Union
+ from transformers import PreTrainedModel, AutoModelForCausalLM
+ from transformers.modeling_outputs import CausalLMOutputWithPast, BaseModelOutputWithPast
+ from .configuration_sapnous import SapnousT1Config
+ from .attention_sapnous import SapnousAttention, SapnousBlock, SapnousVisionEmbeddings, precompute_freqs_cis
+
+ class SapnousT1PreTrainedModel(PreTrainedModel):
+     """Base class for all Sapnous-T1 models."""
+     config_class = SapnousT1Config
+     base_model_prefix = "sapnous"
+
+     def __init__(self, config: SapnousT1Config):
+         super().__init__(config)
+         self.config = config
+
+     def _init_weights(self, module):
+         """Initialize weights using the model's initialization configuration."""
+         std = self.config.initializer_range
+         if isinstance(module, nn.Linear):
+             module.weight.data.normal_(mean=0.0, std=std)
+             if module.bias is not None:
+                 module.bias.data.zero_()
+         elif isinstance(module, nn.Embedding):
+             module.weight.data.normal_(mean=0.0, std=std)
+         elif isinstance(module, nn.LayerNorm):
+             module.bias.data.zero_()
+             module.weight.data.fill_(1.0)
+         elif isinstance(module, SapnousAttention):
+             module.q_proj.weight.data.normal_(mean=0.0, std=std)
+             module.k_proj.weight.data.normal_(mean=0.0, std=std)
+             module.v_proj.weight.data.normal_(mean=0.0, std=std)
+             module.o_proj.weight.data.normal_(mean=0.0, std=std)
+
+ class SapnousT1Model(SapnousT1PreTrainedModel):
+     """Base Transformer Model with advanced attention mechanisms and optional vision support."""
+     def __init__(self, config: SapnousT1Config):
+         super().__init__(config)
+
+         self.embeddings = nn.Embedding(config.vocab_size, config.hidden_size)
+         self.layers = nn.ModuleList([SapnousBlock(config) for _ in range(config.num_hidden_layers)])
+         self.norm = nn.LayerNorm(config.hidden_size, eps=config.rms_norm_eps)
+
+         # Vision support
+         self.vision_embed = SapnousVisionEmbeddings(config) if getattr(config, 'vision_config', None) else None
+
+         # Initialize weights and apply final processing
+         self.post_init()
+
+         # Compute and cache RoPE frequencies
+         self.freqs_cis = precompute_freqs_cis(
+             self.config.hidden_size // self.config.num_attention_heads,
+             self.config.max_position_embeddings,
+             self.config.rope_theta,
+         )
+
+     def get_input_embeddings(self) -> nn.Module:
+         return self.embeddings
+
+     def set_input_embeddings(self, value: nn.Module):
+         self.embeddings = value
+
+     def forward(
+         self,
+         input_ids: Optional[torch.LongTensor] = None,
+         attention_mask: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.LongTensor] = None,
+         past_key_values: Optional[List[Tuple[torch.FloatTensor]]] = None,
+         inputs_embeds: Optional[torch.FloatTensor] = None,
+         pixel_values: Optional[torch.FloatTensor] = None,
+         use_cache: Optional[bool] = None,
+         output_attentions: Optional[bool] = None,
+         output_hidden_states: Optional[bool] = None,
+         return_dict: Optional[bool] = None,
+     ) -> Union[Tuple, BaseModelOutputWithPast]:
+         output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+         output_hidden_states = (
+             output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+         )
+         use_cache = use_cache if use_cache is not None else self.config.use_cache
+         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+         if input_ids is not None and inputs_embeds is not None:
+             raise ValueError("You cannot specify both input_ids and inputs_embeds")
+
+         # Process text input
+         if input_ids is not None:
+             inputs_embeds = self.embeddings(input_ids)
+             batch_size, seq_length = input_ids.shape[:2]
+         else:
+             batch_size, seq_length = inputs_embeds.shape[:2]
+
+         # Process vision input if available
+         if pixel_values is not None and self.vision_embed is not None:
+             vision_embeds = self.vision_embed(pixel_values)
+             inputs_embeds = torch.cat([vision_embeds, inputs_embeds], dim=1)
+             seq_length = inputs_embeds.shape[1]
+
+         if position_ids is None:
+             device = input_ids.device if input_ids is not None else inputs_embeds.device
+             position_ids = torch.arange(seq_length, dtype=torch.long, device=device)
+             position_ids = position_ids.unsqueeze(0)
+
+         # Prepare attention mask
+         if attention_mask is not None:
+             attention_mask = attention_mask.view(batch_size, -1)
+             attention_mask = attention_mask[:, None, None, :]
+             attention_mask = attention_mask.to(dtype=inputs_embeds.dtype)
+             attention_mask = (1.0 - attention_mask) * torch.finfo(inputs_embeds.dtype).min
+
+         freqs_cis = self.freqs_cis.to(inputs_embeds.device)
+
+         hidden_states = inputs_embeds
+         all_hidden_states = () if output_hidden_states else None
+         all_self_attns = () if output_attentions else None
+         next_decoder_cache = () if use_cache else None
+
+         for idx, decoder_layer in enumerate(self.layers):
+             if output_hidden_states:
+                 all_hidden_states += (hidden_states,)
+
+             past_key_value = past_key_values[idx] if past_key_values is not None else None
+
+             layer_outputs = decoder_layer(
+                 hidden_states,
+                 freqs_cis=freqs_cis,
+                 attention_mask=attention_mask,
+                 position_ids=position_ids,
+                 past_key_value=past_key_value,
+                 output_attentions=output_attentions,
+                 use_cache=use_cache,
+             )
+
+             hidden_states = layer_outputs[0]
+
+             if use_cache:
+                 next_decoder_cache += (layer_outputs[2 if output_attentions else 1],)
+
+             if output_attentions:
+                 all_self_attns += (layer_outputs[1],)
+
+         hidden_states = self.norm(hidden_states)
+
+         if output_hidden_states:
+             all_hidden_states += (hidden_states,)
+
+         if not return_dict:
+             return tuple(v for v in [
+                 hidden_states,
+                 next_decoder_cache,
+                 all_hidden_states,
+                 all_self_attns,
+             ] if v is not None)
+
+         return BaseModelOutputWithPast(
+             last_hidden_state=hidden_states,
+             past_key_values=next_decoder_cache,
+             hidden_states=all_hidden_states,
+             attentions=all_self_attns,
+         )
+
+ class SapnousT1ForCausalLM(SapnousT1PreTrainedModel):
+     """Sapnous-T1 Model for Causal Language Modeling with vision support."""
+     _keys_to_ignore_on_load_missing = [r"lm_head.weight"]
+
+     def __init__(self, config: SapnousT1Config):
+         super().__init__(config)
+         self.model = SapnousT1Model(config)
+         self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+
+         # Initialize weights and apply final processing
+         self.post_init()
+
+     def get_input_embeddings(self) -> nn.Module:
+         return self.model.embeddings
+
+     def set_input_embeddings(self, value: nn.Module):
+         self.model.embeddings = value
+
+     def get_output_embeddings(self) -> nn.Module:
+         return self.lm_head
+
+     def set_output_embeddings(self, new_embeddings: nn.Module):
+         self.lm_head = new_embeddings
+
+     def prepare_inputs_for_generation(
+         self,
+         input_ids: torch.LongTensor,
+         past_key_values: Optional[List[Tuple[torch.Tensor]]] = None,
+         attention_mask: Optional[torch.Tensor] = None,
+         **kwargs,
+     ) -> dict:
+         if past_key_values:
+             input_ids = input_ids[:, -1:]
+
+         position_ids = kwargs.get("position_ids", None)
+         if position_ids is None:
+             position_ids = (attention_mask.long().cumsum(-1) - 1) if attention_mask is not None else None
+         if past_key_values:
+             position_ids = position_ids[:, -1].unsqueeze(-1)
+
+         return {
+             "input_ids": input_ids,
+             "attention_mask": attention_mask,
+             "position_ids": position_ids,
+             "past_key_values": past_key_values,
+             "use_cache": kwargs.get("use_cache"),
+             "pixel_values": kwargs.get("pixel_values", None),
+         }
+
+     def forward(
+         self,
+         input_ids: Optional[torch.LongTensor] = None,
+         attention_mask: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.LongTensor] = None,
+         past_key_values: Optional[List[Tuple[torch.FloatTensor]]] = None,
+         inputs_embeds: Optional[torch.FloatTensor] = None,
+         pixel_values: Optional[torch.FloatTensor] = None,
+         labels: Optional[torch.LongTensor] = None,
+         use_cache: Optional[bool] = None,
+         output_attentions: Optional[bool] = None,
+         output_hidden_states: Optional[bool] = None,
+         return_dict: Optional[bool] = None,
+     ) -> Union[Tuple, CausalLMOutputWithPast]:
+         r"""Labels for computing the masked language modeling loss."""
+         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+         outputs = self.model(
+             input_ids=input_ids,
+             attention_mask=attention_mask,
+             position_ids=position_ids,
+             past_key_values=past_key_values,
+             inputs_embeds=inputs_embeds,
+             pixel_values=pixel_values,
+             use_cache=use_cache,
+             output_attentions=output_attentions,
+             output_hidden_states=output_hidden_states,
+             return_dict=return_dict,
+         )
+
+         hidden_states = outputs[0]
+         logits = self.lm_head(hidden_states)
+
+         loss = None
+         if labels is not None:
+             shift_logits = logits[..., :-1, :].contiguous()
+             shift_labels = labels[..., 1:].contiguous()
+             loss_fct = nn.CrossEntropyLoss()
+             loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
+
+         if not return_dict:
+             output = (logits,) + outputs[1:]
+             return ((loss,) + output) if loss is not None else output
+
+         return CausalLMOutputWithPast(
+             loss=loss,
+             logits=logits,
+             past_key_values=outputs.past_key_values,
+             hidden_states=outputs.hidden_states,
+             attentions=outputs.attentions,
+         )
+
+     def tie_weights(self):
+         """Tie the weights between the input embeddings and the output embeddings."""
+         self.lm_head.weight = self.model.embeddings.weight
+
+ # Register the model
+ AutoModelForCausalLM.register(SapnousT1Config, SapnousT1ForCausalLM)
preprocessor_config.json ADDED
@@ -0,0 +1,19 @@
+ {
+   "min_pixels": 3136,
+   "max_pixels": 12845056,
+   "patch_size": 14,
+   "temporal_patch_size": 2,
+   "merge_size": 2,
+   "image_mean": [
+     0.48145466,
+     0.4578275,
+     0.40821073
+   ],
+   "image_std": [
+     0.26862954,
+     0.26130258,
+     0.27577711
+   ],
+   "image_processor_type": "Sapnous12BImageProcessor",
+   "processor_class": "Sapnous12BProcessor"
+ }
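The mean/std above define the expected input normalization. A small sketch of applying it manually with torchvision as an illustration (the custom `Sapnous12BImageProcessor` named above would normally handle this; the 224x224 resize is an illustrative choice, not taken from this file):

```python
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),            # illustrative size; see the patch/pixel settings above
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.48145466, 0.4578275, 0.40821073],
                         std=[0.26862954, 0.26130258, 0.27577711]),
])

# pixel_values = preprocess(pil_image).unsqueeze(0)  # (1, 3, 224, 224), ready for the vision tower
```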
setup.py ADDED
@@ -0,0 +1,15 @@
+ import sys
+ import os
+
+ # Add the current directory to sys.path so Python can find `configuration_sapnous.py`
+ sys.path.append(os.path.dirname(os.path.abspath(__file__)))
+
+ from transformers import AutoModel, AutoConfig
+ from configuration_sapnous import SapnousT1Config  # Now it should work
+
+ model_path = r"E:\git\Sapnous-47B\Sapnous-6B"
+
+ config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
+ model = AutoModel.from_pretrained(model_path, config=config, trust_remote_code=True)
+
+ print("Model loaded successfully!")
test_modeling_sapnous.py ADDED
@@ -0,0 +1,92 @@
+ # coding=utf-8
+ # Copyright 2025-present, the HuggingFace Inc. Team and AIRAS Inc. Team. All rights reserved.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ # http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ import unittest
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ from .modeling_sapnous import SapnousT1ForCausalLM
+ from .configuration_sapnous import SapnousT1Config
+
+ class TestSapnousModel(unittest.TestCase):
+     @classmethod
+     def setUpClass(cls):
+         cls.config = SapnousT1Config(
+             vocab_size=32000,
+             hidden_size=768,
+             num_hidden_layers=12,
+             num_attention_heads=12,
+             intermediate_size=3072
+         )
+         cls.model = SapnousT1ForCausalLM(cls.config)
+
+     def test_model_forward(self):
+         input_ids = torch.randint(0, self.config.vocab_size, (1, 10))
+         outputs = self.model(input_ids)
+
+         self.assertIsNotNone(outputs)
+         self.assertTrue(hasattr(outputs, 'logits'))
+         self.assertEqual(outputs.logits.shape, (1, 10, self.config.vocab_size))
+
+     def test_weight_tying(self):
+         self.model.tie_weights()
+         self.assertTrue(torch.equal(self.model.lm_head.weight, self.model.model.embeddings.weight))
+
+     def test_auto_model_registration(self):
+         model = AutoModelForCausalLM.from_config(self.config)
+         self.assertIsInstance(model, SapnousT1ForCausalLM)
+
+     def test_vision_embeddings(self):
+         # Test vision input processing
+         batch_size = 1
+         pixel_values = torch.randn(batch_size, 3, 224, 224)
+         input_ids = torch.randint(0, self.config.vocab_size, (batch_size, 10))
+
+         outputs = self.model(input_ids=input_ids, pixel_values=pixel_values)
+         self.assertIsNotNone(outputs)
+         self.assertTrue(hasattr(outputs, 'logits'))
+
+         # Vision input should increase sequence length
+         expected_seq_length = 10 + (224 // 16) ** 2 + 1  # text_len + num_patches + cls_token
+         self.assertEqual(outputs.logits.shape, (batch_size, expected_seq_length, self.config.vocab_size))
+
+     def test_attention_mask(self):
+         # Test attention mask handling
+         batch_size = 2
+         seq_length = 15
+         input_ids = torch.randint(0, self.config.vocab_size, (batch_size, seq_length))
+         attention_mask = torch.ones(batch_size, seq_length)
+         attention_mask[:, -5:] = 0  # Mask out last 5 tokens
+
+         outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
+         self.assertIsNotNone(outputs)
+         self.assertEqual(outputs.logits.shape, (batch_size, seq_length, self.config.vocab_size))
+
+     def test_generation_with_vision(self):
+         # Test text generation with vision input
+         pixel_values = torch.randn(1, 3, 224, 224)
+         input_ids = torch.randint(0, self.config.vocab_size, (1, 5))
+
+         outputs = self.model.generate(
+             input_ids=input_ids,
+             pixel_values=pixel_values,
+             max_length=20,
+             num_beams=1
+         )
+
+         self.assertIsInstance(outputs, torch.Tensor)
+         self.assertEqual(outputs.dim(), 2)
+         self.assertTrue(outputs.size(1) <= 20)
+
+ if __name__ == '__main__':
+     unittest.main()
test_tokenization_sapnous.py ADDED
@@ -0,0 +1,157 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # coding=utf-8
2
+ # Copyright 2025-present, the HuggingFace Inc. Team and AIRAS Inc. Team. All rights reserved.
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+ import unittest
16
+ import torch
17
+ from pathlib import Path
18
+ from transformers import AutoTokenizer
19
+ from .tokenization_sapnous import SapnousTokenizer
20
+
21
+ class TestSapnousTokenizer(unittest.TestCase):
22
+ @classmethod
23
+ def setUpClass(cls):
24
+ # Create temporary vocab and merges files for testing
25
+ cls.temp_dir = Path('test_tokenizer_files')
26
+ cls.temp_dir.mkdir(exist_ok=True)
27
+
28
+ # Create a simple test vocabulary
29
+ cls.vocab_file = cls.temp_dir / 'vocab.json'
30
+ cls.vocab = {
31
+ '<|endoftext|>': 0,
32
+ '<|startoftext|>': 1,
33
+ '<|pad|>': 2,
34
+ '<|vision_start|>': 3,
35
+ '<|vision_end|>': 4,
36
+ '<|image|>': 5,
37
+ '<|video|>': 6,
38
+ 'hello': 7,
39
+ 'world': 8,
40
+ 'test': 9,
41
+ }
42
+ import json
43
+ with cls.vocab_file.open('w', encoding='utf-8') as f:
44
+ json.dump(cls.vocab, f)
45
+
46
+ # Create test merges file
47
+ cls.merges_file = cls.temp_dir / 'merges.txt'
48
+ merges_content = "#version: 0.2\nh e\ne l\nl l\no w\nw o\no r\nr l\nl d"
49
+ cls.merges_file.write_text(merges_content)
50
+
51
+ # Initialize tokenizer
52
+ cls.tokenizer = SapnousTokenizer(
53
+ str(cls.vocab_file),
54
+ str(cls.merges_file),
55
+ )
56
+
57
+ @classmethod
58
+ def tearDownClass(cls):
59
+ # Clean up temporary files
60
+ import shutil
61
+ shutil.rmtree(cls.temp_dir)
62
+
63
+ def test_tokenizer_initialization(self):
64
+ self.assertEqual(self.tokenizer.vocab_size, len(self.vocab))
65
+ self.assertEqual(self.tokenizer.get_vocab(), self.vocab)
66
+
67
+ # Test special tokens
68
+ self.assertEqual(self.tokenizer.unk_token, '<|endoftext|>')
69
+ self.assertEqual(self.tokenizer.bos_token, '<|startoftext|>')
70
+ self.assertEqual(self.tokenizer.eos_token, '<|endoftext|>')
71
+ self.assertEqual(self.tokenizer.pad_token, '<|pad|>')
72
+
73
+ def test_tokenization(self):
74
+ text = "hello world test"
75
+ tokens = self.tokenizer.tokenize(text)
76
+ self.assertIsInstance(tokens, list)
77
+ self.assertTrue(all(isinstance(token, str) for token in tokens))
78
+
79
+ # Test encoding
80
+ input_ids = self.tokenizer.encode(text, add_special_tokens=False)
81
+ self.assertIsInstance(input_ids, list)
82
+ self.assertEqual(len(input_ids), 3) # 'hello', 'world', 'test'
83
+
84
+ # Test decoding
85
+ decoded_text = self.tokenizer.decode(input_ids)
86
+ self.assertEqual(decoded_text.strip(), text)
87
+
88
+ def test_special_tokens_handling(self):
89
+ text = "hello world"
90
+ # Test with special tokens
91
+ tokens_with_special = self.tokenizer.encode(text, add_special_tokens=True)
92
+ self.assertTrue(tokens_with_special[0] == self.tokenizer.bos_token_id)
93
+ self.assertTrue(tokens_with_special[-1] == self.tokenizer.eos_token_id)
94
+
95
+ # Test without special tokens
96
+ tokens_without_special = self.tokenizer.encode(text, add_special_tokens=False)
97
+ self.assertNotEqual(tokens_without_special[0], self.tokenizer.bos_token_id)
98
+ self.assertNotEqual(tokens_without_special[-1], self.tokenizer.eos_token_id)
99
+
100
+ def test_vision_tokens(self):
101
+ # Test vision-specific token methods
102
+ text = "This is an image description"
103
+ vision_text = self.tokenizer.prepare_for_vision(text)
104
+ self.assertTrue(vision_text.startswith('<|vision_start|>'))
105
+ self.assertTrue(vision_text.endswith('<|vision_end|>'))
106
+
107
+ image_text = self.tokenizer.prepare_for_image(text)
108
+ self.assertTrue(image_text.startswith('<|image|>'))
109
+
110
+ video_text = self.tokenizer.prepare_for_video(text)
111
+ self.assertTrue(video_text.startswith('<|video|>'))
112
+
113
+ def test_batch_encoding(self):
114
+ texts = ["hello world", "test hello"]
115
+ batch_encoding = self.tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
116
+
117
+ self.assertIsInstance(batch_encoding["input_ids"], torch.Tensor)
118
+ self.assertIsInstance(batch_encoding["attention_mask"], torch.Tensor)
119
+ self.assertEqual(batch_encoding["input_ids"].shape[0], len(texts))
120
+ self.assertEqual(batch_encoding["attention_mask"].shape[0], len(texts))
121
+
122
+ def test_save_and_load(self):
123
+ # Test saving vocabulary
124
+ save_dir = Path('test_save_tokenizer')
125
+ save_dir.mkdir(exist_ok=True)
126
+
127
+ try:
128
+ vocab_files = self.tokenizer.save_vocabulary(str(save_dir))
129
+ self.assertTrue(all(Path(f).exists() for f in vocab_files))
130
+
131
+ # Test loading saved vocabulary
132
+ loaded_tokenizer = SapnousTokenizer(*vocab_files)
133
+ self.assertEqual(loaded_tokenizer.get_vocab(), self.tokenizer.get_vocab())
134
+
135
+ # Test encoding/decoding with loaded tokenizer
136
+ text = "hello world test"
137
+ original_encoding = self.tokenizer.encode(text)
138
+ loaded_encoding = loaded_tokenizer.encode(text)
139
+ self.assertEqual(original_encoding, loaded_encoding)
140
+ finally:
141
+ # Clean up
142
+ import shutil
143
+ shutil.rmtree(save_dir)
144
+
145
+ def test_auto_tokenizer_registration(self):
146
+ # Test if the tokenizer can be loaded using AutoTokenizer
147
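+ # Relies on the AutoTokenizer registration performed at the bottom of tokenization_sapnous.py; the kwargs below stand in for values normally read from a saved tokenizer_config.json.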
+ config = {
148
+ "model_type": "sapnous",
149
+ "vocab_file": str(self.vocab_file),
150
+ "merges_file": str(self.merges_file)
151
+ }
152
+
153
+ tokenizer = AutoTokenizer.from_pretrained(str(self.temp_dir), **config)
154
+ self.assertIsInstance(tokenizer, SapnousTokenizer)
155
+
156
+ if __name__ == '__main__':
157
+ unittest.main()
tokenization_sapnous.py ADDED
@@ -0,0 +1,197 @@
1
+ # coding=utf-8
2
+ # Copyright 2025-present, the HuggingFace Inc. Team and AIRAS Inc. Team. All rights reserved.
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+ import json
16
+ import regex as re
17
+ from pathlib import Path
18
+ from typing import Dict, List, Optional, Tuple, Union
19
+
20
+ from transformers import AutoTokenizer
21
+ from transformers.tokenization_utils import PreTrainedTokenizer
22
+
23
+ BYTES_TO_UNICODE_REGEX = re.compile(r"'([^']+)':\s*([0-9]+)")
24
+
25
+ def bytes_to_unicode():
26
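+ # GPT-2-style reversible mapping from the 256 byte values to printable unicode characters, so arbitrary byte sequences can be represented as vocabulary symbols.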
+ bs = list(range(ord("!"), ord("~") + 1)) + list(range(ord("¡"), ord("¬") + 1)) + list(range(ord("®"), ord("ÿ") + 1))
27
+ cs = bs[:]
28
+ n = 0
29
+ for b in range(2**8):
30
+ if b not in bs:
31
+ bs.append(b)
32
+ cs.append(2**8 + n)
33
+ n += 1
34
+ cs = [chr(n) for n in cs]
35
+ return dict(zip(bs, cs))
36
+
37
+ def get_pairs(word):
38
+ pairs = set()
39
+ prev_char = word[0]
40
+ for char in word[1:]:
41
+ pairs.add((prev_char, char))
42
+ prev_char = char
43
+ return pairs
44
+
45
+ class SapnousTokenizer(PreTrainedTokenizer):
46
+ model_input_names = ["input_ids", "attention_mask"]
47
+
48
+ def __init__(
49
+ self,
50
+ vocab_file: str,
51
+ merges_file: Optional[str] = None,
52
+ unk_token: str = "<|endoftext|>",
53
+ bos_token: str = "<|startoftext|>",
54
+ eos_token: str = "<|endoftext|>",
55
+ pad_token: str = "<|pad|>",
56
+ vision_start_token: str = "<|vision_start|>",
57
+ vision_end_token: str = "<|vision_end|>",
58
+ image_token: str = "<|image|>",
59
+ video_token: str = "<|video|>",
60
+ add_prefix_space: bool = False,
61
+ **kwargs
62
+ ):
63
+ super().__init__(
64
+ unk_token=unk_token,
65
+ bos_token=bos_token,
66
+ eos_token=eos_token,
67
+ pad_token=pad_token,
68
+ **kwargs,
69
+ )
70
+
71
+ self.vocab_file = vocab_file
72
+ self.merges_file = merges_file
73
+ self.add_prefix_space = add_prefix_space
74
+
75
+ self.special_tokens = {
76
+ "unk_token": unk_token,
77
+ "bos_token": bos_token,
78
+ "eos_token": eos_token,
79
+ "pad_token": pad_token,
80
+ "vision_start_token": vision_start_token,
81
+ "vision_end_token": vision_end_token,
82
+ "image_token": image_token,
83
+ "video_token": video_token,
84
+ }
+ # Also expose the vision markers as attributes; the prepare_for_* helpers below read them.
+ self.vision_start_token = vision_start_token
+ self.vision_end_token = vision_end_token
+ self.image_token = image_token
+ self.video_token = video_token
85
+
86
+ with Path(vocab_file).open(encoding="utf-8") as f:
87
+ self.encoder = json.load(f)
88
+ self.decoder = {v: k for k, v in self.encoder.items()}
89
+
90
+ if merges_file:
91
+ with Path(merges_file).open(encoding="utf-8") as f:
92
+ bpe_merges = f.read().strip().split('\n')[1:]
93
+ bpe_merges = [tuple(merge.split()) for merge in bpe_merges]
94
+ self.bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges))))
95
+ else:
96
+ self.bpe_ranks = {}
97
+
98
+ self.byte_encoder = bytes_to_unicode()
99
+ self.byte_decoder = {v: k for k, v in self.byte_encoder.items()}
100
+ self.pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\w+| ?\d+| ?[^\s\w\d]+|\s+(?!\S)|\s+""")
101
+
102
+ def bpe(self, token: str) -> str:
103
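+ # Standard byte-pair merging: repeatedly merge the adjacent symbol pair with the lowest merge rank until none of the remaining pairs appears in bpe_ranks.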
+ if token in self.special_tokens.values():
104
+ return token
105
+
106
+ word = tuple(token)
107
+ pairs = get_pairs(word)
108
+
109
+ if not pairs:
110
+ return token
111
+
112
+ while True:
113
+ bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float('inf')))
114
+ if bigram not in self.bpe_ranks:
115
+ break
116
+ first, second = bigram
117
+ new_word = []
118
+ i = 0
119
+ while i < len(word):
120
+ try:
121
+ j = word.index(first, i)
122
+ new_word.extend(word[i:j])
123
+ if j < len(word) - 1 and word[j + 1] == second:
124
+ new_word.append(first + second)
125
+ i = j + 2
126
+ else:
127
+ new_word.append(word[j])
128
+ i = j + 1
129
+ except ValueError:
130
+ new_word.extend(word[i:])
131
+ break
132
+ word = tuple(new_word)
133
+ if len(word) == 1:
134
+ break
135
+ pairs = get_pairs(word)
136
+ return ' '.join(word)
137
+
138
+ def _tokenize(self, text: str) -> List[str]:
139
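+ # Byte-encode each regex-matched chunk into the unicode alphabet above, then split it with BPE.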
+ if self.add_prefix_space:
140
+ text = ' ' + text
141
+
142
+ bpe_tokens = []
143
+ for token in re.findall(self.pat, text):
144
+ token = ''.join(self.byte_encoder[b] for b in token.encode('utf-8'))
145
+ bpe_tokens.extend(bpe_token for bpe_token in self.bpe(token).split(' '))
146
+ return bpe_tokens
147
+
148
+ def _convert_token_to_id(self, token: str) -> int:
149
+ return self.encoder.get(token, self.encoder.get(self.unk_token))
150
+
151
+ def _convert_id_to_token(self, index: int) -> str:
152
+ return self.decoder.get(index, self.unk_token)
153
+
154
+ def convert_tokens_to_string(self, tokens: List[str]) -> str:
155
+ text = ''.join(tokens)
156
+ text = bytearray([self.byte_decoder[c] for c in text]).decode('utf-8', errors='replace')
157
+ return text
158
+
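+ # Minimal sketch of special-token handling, assuming the BOS/EOS bracketing expected by
+ # test_special_tokens_handling; the PreTrainedTokenizer default adds no special tokens here.
+ def build_inputs_with_special_tokens(self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None) -> List[int]:
+ output = [self.bos_token_id] + token_ids_0 + [self.eos_token_id]
+ if token_ids_1 is not None:
+ output = output + token_ids_1 + [self.eos_token_id]
+ return output
+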
159
+ def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str, ...]:
160
+ if not filename_prefix:
161
+ filename_prefix = ""
162
+
163
+ vocab_file = Path(save_directory) / f"{filename_prefix}vocab.json"
164
+ merge_file = Path(save_directory) / f"{filename_prefix}merges.txt"
165
+
166
+ with vocab_file.open('w', encoding='utf-8') as f:
167
+ json.dump(self.encoder, f, ensure_ascii=False)
168
+
169
+ if self.merges_file:
170
+ with merge_file.open('w', encoding='utf-8') as f:
171
+ for merge in self.bpe_ranks:
172
+ f.write(f"{merge[0]} {merge[1]}\n")
173
+ return str(vocab_file), str(merge_file)
174
+
175
+ return (str(vocab_file),)
176
+
177
+ def prepare_for_vision(self, text: str) -> str:
178
+ """Prepare text for vision tasks by adding special tokens."""
179
+ return f"{self.vision_start_token}{text}{self.vision_end_token}"
180
+
181
+ def prepare_for_image(self, text: str) -> str:
182
+ """Prepare text for image tasks."""
183
+ return f"{self.image_token}{text}"
184
+
185
+ def prepare_for_video(self, text: str) -> str:
186
+ """Prepare text for video tasks."""
187
+ return f"{self.video_token}{text}"
188
+
189
+ @property
190
+ def vocab_size(self) -> int:
191
+ return len(self.encoder)
192
+
193
+ def get_vocab(self) -> Dict[str, int]:
194
+ return self.encoder.copy()
195
+
196
+ # Register the tokenizer so AutoTokenizer can resolve it from the Sapnous config class
+ # (AutoTokenizer.register takes a config class plus slow/fast tokenizer classes, not a model-type string).
197
+ from .configuration_sapnous import SapnousT1Config
+ AutoTokenizer.register(SapnousT1Config, slow_tokenizer_class=SapnousTokenizer)
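+
+ if __name__ == "__main__":
+ # Smoke-test sketch with placeholder paths, assuming a vocab.json/merges.txt pair like the one built in the unit tests.
+ tok = SapnousTokenizer("vocab.json", "merges.txt")
+ ids = tok.encode("hello world", add_special_tokens=True)
+ print(ids)
+ print(tok.decode(ids))
+ print(tok.prepare_for_vision("describe the scene"))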
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,207 @@
1
+ {
2
+ "add_prefix_space": false,
3
+ "added_tokens_decoder": {
4
+ "151643": {
5
+ "content": "<|endoftext|>",
6
+ "lstrip": false,
7
+ "normalized": false,
8
+ "rstrip": false,
9
+ "single_word": false,
10
+ "special": true
11
+ },
12
+ "151644": {
13
+ "content": "<|im_start|>",
14
+ "lstrip": false,
15
+ "normalized": false,
16
+ "rstrip": false,
17
+ "single_word": false,
18
+ "special": true
19
+ },
20
+ "151645": {
21
+ "content": "<|im_end|>",
22
+ "lstrip": false,
23
+ "normalized": false,
24
+ "rstrip": false,
25
+ "single_word": false,
26
+ "special": true
27
+ },
28
+ "151646": {
29
+ "content": "<|object_ref_start|>",
30
+ "lstrip": false,
31
+ "normalized": false,
32
+ "rstrip": false,
33
+ "single_word": false,
34
+ "special": true
35
+ },
36
+ "151647": {
37
+ "content": "<|object_ref_end|>",
38
+ "lstrip": false,
39
+ "normalized": false,
40
+ "rstrip": false,
41
+ "single_word": false,
42
+ "special": true
43
+ },
44
+ "151648": {
45
+ "content": "<|box_start|>",
46
+ "lstrip": false,
47
+ "normalized": false,
48
+ "rstrip": false,
49
+ "single_word": false,
50
+ "special": true
51
+ },
52
+ "151649": {
53
+ "content": "<|box_end|>",
54
+ "lstrip": false,
55
+ "normalized": false,
56
+ "rstrip": false,
57
+ "single_word": false,
58
+ "special": true
59
+ },
60
+ "151650": {
61
+ "content": "<|quad_start|>",
62
+ "lstrip": false,
63
+ "normalized": false,
64
+ "rstrip": false,
65
+ "single_word": false,
66
+ "special": true
67
+ },
68
+ "151651": {
69
+ "content": "<|quad_end|>",
70
+ "lstrip": false,
71
+ "normalized": false,
72
+ "rstrip": false,
73
+ "single_word": false,
74
+ "special": true
75
+ },
76
+ "151652": {
77
+ "content": "<|vision_start|>",
78
+ "lstrip": false,
79
+ "normalized": false,
80
+ "rstrip": false,
81
+ "single_word": false,
82
+ "special": true
83
+ },
84
+ "151653": {
85
+ "content": "<|vision_end|>",
86
+ "lstrip": false,
87
+ "normalized": false,
88
+ "rstrip": false,
89
+ "single_word": false,
90
+ "special": true
91
+ },
92
+ "151654": {
93
+ "content": "<|vision_pad|>",
94
+ "lstrip": false,
95
+ "normalized": false,
96
+ "rstrip": false,
97
+ "single_word": false,
98
+ "special": true
99
+ },
100
+ "151655": {
101
+ "content": "<|image_pad|>",
102
+ "lstrip": false,
103
+ "normalized": false,
104
+ "rstrip": false,
105
+ "single_word": false,
106
+ "special": true
107
+ },
108
+ "151656": {
109
+ "content": "<|video_pad|>",
110
+ "lstrip": false,
111
+ "normalized": false,
112
+ "rstrip": false,
113
+ "single_word": false,
114
+ "special": true
115
+ },
116
+ "151657": {
117
+ "content": "<tool_call>",
118
+ "lstrip": false,
119
+ "normalized": false,
120
+ "rstrip": false,
121
+ "single_word": false,
122
+ "special": false
123
+ },
124
+ "151658": {
125
+ "content": "</tool_call>",
126
+ "lstrip": false,
127
+ "normalized": false,
128
+ "rstrip": false,
129
+ "single_word": false,
130
+ "special": false
131
+ },
132
+ "151659": {
133
+ "content": "<|fim_prefix|>",
134
+ "lstrip": false,
135
+ "normalized": false,
136
+ "rstrip": false,
137
+ "single_word": false,
138
+ "special": false
139
+ },
140
+ "151660": {
141
+ "content": "<|fim_middle|>",
142
+ "lstrip": false,
143
+ "normalized": false,
144
+ "rstrip": false,
145
+ "single_word": false,
146
+ "special": false
147
+ },
148
+ "151661": {
149
+ "content": "<|fim_suffix|>",
150
+ "lstrip": false,
151
+ "normalized": false,
152
+ "rstrip": false,
153
+ "single_word": false,
154
+ "special": false
155
+ },
156
+ "151662": {
157
+ "content": "<|fim_pad|>",
158
+ "lstrip": false,
159
+ "normalized": false,
160
+ "rstrip": false,
161
+ "single_word": false,
162
+ "special": false
163
+ },
164
+ "151663": {
165
+ "content": "<|repo_name|>",
166
+ "lstrip": false,
167
+ "normalized": false,
168
+ "rstrip": false,
169
+ "single_word": false,
170
+ "special": false
171
+ },
172
+ "151664": {
173
+ "content": "<|file_sep|>",
174
+ "lstrip": false,
175
+ "normalized": false,
176
+ "rstrip": false,
177
+ "single_word": false,
178
+ "special": false
179
+ }
180
+ },
181
+ "additional_special_tokens": [
182
+ "<|im_start|>",
183
+ "<|im_end|>",
184
+ "<|object_ref_start|>",
185
+ "<|object_ref_end|>",
186
+ "<|box_start|>",
187
+ "<|box_end|>",
188
+ "<|quad_start|>",
189
+ "<|quad_end|>",
190
+ "<|vision_start|>",
191
+ "<|vision_end|>",
192
+ "<|vision_pad|>",
193
+ "<|image_pad|>",
194
+ "<|video_pad|>"
195
+ ],
196
+ "bos_token": null,
197
+ "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0]['role'] == 'system' %}\n {{- messages[0]['content'] }}\n {%- else %}\n {{- 'You are a helpful assistant.' }}\n {%- endif %}\n {{- \"\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\\"name\\\": <function-name>, \\\"arguments\\\": <args-json-object>}\\n</tool_call><|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n {%- else %}\n {{- '<|im_start|>system\\nYou are a helpful assistant.<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n<tool_call>\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n</tool_call>' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n<tool_response>\\n' }}\n {{- message.content }}\n {{- '\\n</tool_response>' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n",
198
+ "clean_up_tokenization_spaces": false,
199
+ "eos_token": "<|im_end|>",
200
+ "errors": "replace",
201
+ "model_max_length": 131072,
202
+ "pad_token": "<|endoftext|>",
203
+ "split_special_tokens": false,
204
+ "tokenizer_class": "SapnousT1Tokenizer",
205
+ "unk_token": null,
206
+ "add_bos_token": false
207
+ }
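The chat_template above follows the ChatML convention (<|im_start|>role ... <|im_end|>) with optional <tool_call> blocks for function calling. A minimal usage sketch, assuming these files are loaded from the hub with trust_remote_code=True:

    from transformers import AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained("Sapnous/Sapnous-12B", trust_remote_code=True)
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Describe this image."},
    ]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    print(prompt)  # ends with '<|im_start|>assistant\n'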
vocab.json ADDED
The diff for this file is too large to render. See raw diff