zhiyuanhucs committed
Commit 64ff90d · verified · 1 Parent(s): ca238b6

Upload model files
README.md ADDED
@@ -0,0 +1,495 @@
1
+ ---
2
+ base_model:
3
+ - Qwen/Qwen2.5-VL-7B-Instruct
4
+ datasets:
5
+ - xlangai/AgentNet
6
+ - xlangai/aguvis-stage1
7
+ - smolagents/aguvis-stage-2
8
+ - osunlp/UGround-V1-Data
9
+ language:
10
+ - en
11
+ license: mit
12
+ metrics:
13
+ - accuracy
14
+ - code_eval
15
+ pipeline_tag: image-text-to-text
16
+ library_name: transformers
17
+ tags:
18
+ - VLM
19
+ - Computer-Use-Agent
20
+ - OS-Agent
21
+ - GUI
22
+ - Grounding
23
+ ---
24
+
25
+ <h1 style="
26
+ font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Helvetica,Arial,sans-serif;
27
+ font-size:48px;
28
+ font-weight:700;
29
+ line-height:1.25;
30
+ text-align:center;
31
+ margin:0 0 24px;">
32
+ OpenCUA: Open Foundations for Computer-Use Agents
33
+ </h1>
34
+
35
+ <div style="
36
+ display:flex;
37
+ justify-content:center;
38
+ gap:12px;
39
+ flex-wrap:wrap;
40
+ margin-bottom:28px;">
41
+
42
+ <a href="https://opencua.xlang.ai/" style="
43
+ display:inline-block;
44
+ padding:8px 24px;
45
+ background:#2b2b2b;
46
+ color:#ffffff;
47
+ border-radius:36px;
48
+ text-decoration:none;
49
+ font-weight:600;
50
+ font-size:16px;">
51
+ 🌐 Website
52
+ </a>
53
+
54
+ <a href="https://arxiv.org/abs/2508.09123" style="
55
+ display:inline-block;
56
+ padding:8px 24px;
57
+ background:#2b2b2b;
58
+ color:#ffffff;
59
+ border-radius:36px;
60
+ text-decoration:none;
61
+ font-weight:600;
62
+ font-size:16px;">
63
+ 📝 Paper
64
+ </a>
65
+
66
+ <a href="https://github.com/xlang-ai/OpenCUA" style="
67
+ display:inline-block;
68
+ padding:8px 24px;
69
+ background:#2b2b2b;
70
+ color:#ffffff;
71
+ border-radius:36px;
72
+ text-decoration:none;
73
+ font-weight:600;
74
+ font-size:16px;">
75
+ 💻 Code
76
+ </a>
77
+ </div>
78
+
79
+ <div style="max-width:900px;margin:0 auto;">
80
+
81
+ # Introduction
82
+ <div style="
83
+   max-width: 880px;          /* adjust the overall width as needed */
84
+   margin: 0 auto;            /* center the container */
85
+   text-align: justify;       /* key: justify both edges of the text */
86
+   text-justify: inter-word;  /* improves justification for English text */
87
+ line-height: 1.6;">
88
+
89
+ OpenCUA models (OpenCUA-7B and OpenCUA-32B) are end-to-end computer-use foundation models that can produce executable actions in computer environments. They are based on the weights of Qwen2.5-VL-7B-Instruct and Qwen2.5-VL-32B-Instruct.
90
+ They demonstrate superior performance across CUA benchmarks. In particular, <b>OpenCUA-32B</b> achieves an average success rate of **34.8%** on [OSWorld-Verified](https://os-world.github.io/),
91
+ establishing a new state-of-the-art (SOTA) among open-source models and surpassing OpenAI CUA (GPT-4o). Both models also show strong grounding performance: OpenCUA-32B achieves 59.6% on [OSWorld-G](https://osworld-grounding.github.io/) and 55.3% on [ScreenSpot-Pro](https://arxiv.org/abs/2504.07981).
92
+ </div>
93
+
94
+ ### Key Features
95
+
96
+ - **Superior Computer-Use Capability**: Able to execute multi-step computer-use actions with effective planning and reasoning
97
+ - **Multi-OS Support**: Trained on demonstrations across Ubuntu, Windows, and macOS
98
+ - **Visual Grounding**: Strong GUI element recognition and spatial reasoning capabilities
99
+ - **Multi-Image Context**: Processes a history of up to 3 screenshots for better context understanding
100
+ - **Reflective Reasoning**: Enhanced with reflective long Chain-of-Thought that identifies errors and provides corrective reasoning
101
+
102
+
103
+ # Performance
104
+
105
+ ### Online Agent Evaluation
106
+ OpenCUA models achieve strong performance on **[OSWorld-Verified](https://os-world.github.io/)**.
107
+ OpenCUA-32B achieves the best performance among all open-source models with an average success rate of 34.8%, outperforming prior baselines by large margins.
108
+ It also closes the gap to proprietary Claude models.
109
+ <div align="center">
110
+
111
+ | **Model** | **15 Steps** | **50 Steps** | **100 Steps** |
112
+ |-------------------------------|:--------:|:--------:|:---------:|
113
+ | **Proprietary** | | | |
114
+ | OpenAI CUA | 26.0 | 31.3 | 31.4 |
115
+ | Seed 1.5-VL | 27.9 | — | 34.1 |
116
+ | Claude 3.7 Sonnet | 27.1 | 35.8 | 35.9 |
117
+ | Claude 4 Sonnet | 31.2 | 43.9 | 41.5 |
118
+ | **Open-Source** | | | |
119
+ | Qwen 2.5-VL-32B-Instruct | 3.0 | — | 3.9 |
120
+ | Qwen 2.5-VL-72B-Instruct | 4.4 | — | 5.0 |
121
+ | Kimi-VL-A3B | 9.7 | — | 10.3 |
122
+ | UI-TARS-72B-DPO | 24.0 | 25.8 | 27.1 |
123
+ | UI-TARS-1.5-7B | 24.5 | 27.3 | 27.4 |
124
+ | OpenCUA-7B *(Ours)* | 24.3 | 27.9 | 26.6 |
125
+ | **OpenCUA-32B *(Ours)*** | **29.7** | **34.1** | **34.8** |
126
+ </div>
127
+
128
+ *OpenCUA scores are the mean of 3 independent runs.*
129
+
130
+ ### GUI Grounding Performance
131
+ <div align="center">
132
+
133
+ | **Model** | **OSWorld-G** | **ScreenSpot-V2** | **ScreenSpot-Pro** |
134
+ |-------|-----------|---------------|----------------|
135
+ | Qwen2.5-VL-7B | 31.4 | 88.8 | 27.6 |
136
+ | Qwen2.5-VL-32B | 46.5 | 87.0 | 39.4 |
137
+ | UI-TARS-72B | 57.1 | 90.3 | 38.1 |
138
+ | **OpenCUA-A3B** | 48.6 | 91.4 | 28.5 |
139
+ | **OpenCUA-Qwen2-7B** | 45.7 | 88.5 | 23.7 |
140
+ | **OpenCUA-7B** | 55.3 | 92.3 | 50.0 |
141
+ | **OpenCUA-32B** | **59.6** | **93.4** | **55.3** |
142
+ </div>
143
+
144
+
145
+ ### AgentNetBench (Offline Evaluation)
146
+ <div align="center">
147
+
148
+ | **Model** | **Coordinate Actions** | **Content Actions** | **Function Actions** | **Average** |
149
+ |-------|-------------------|-----------------|------------------|---------|
150
+ | Qwen2.5-VL-7B | 50.7 | 40.8 | 3.1 | 48.0 |
151
+ | Qwen2.5-VL-32B | 66.6 | 47.2 | 41.5 | 64.8 |
152
+ | Qwen2.5-VL-72B | 67.2 | 52.6 | 50.5 | 67.0 |
153
+ | OpenAI CUA | 71.7 | 57.3 | **80.0** | 73.1 |
154
+ | **OpenCUA-7B** | 79.0 | 62.0 | 44.3 | 75.2 |
155
+ | **OpenCUA-32B** | **81.9** | 66.1 | 55.7 | **79.1** |
156
+ </div>
157
+
158
+ # 🚀 Quick Start
159
+ <div style="border-left: 6px solid #f28c28; background: #fff8e6; padding: 12px 16px; margin: 16px 0;">
160
+ <strong>⚠️ Important for Qwen-based Models (OpenCUA-7B, OpenCUA-32B):</strong>
161
+
162
+ To align with our training infrastructure, we have modified the model in two places:
163
+ <ul style="margin-top: 8px;">
164
+ <li>1. Multimodal Rotary Position Embedding (M-RoPE) has been replaced with 1D RoPE.</li>
165
+ <li>2. The model uses the same tokenizer and chat template as Kimi-VL.</li>
166
+ <li>Do not load the model with the default Transformers or vLLM Qwen2.5-VL classes; load it with <code>trust_remote_code=True</code> as shown below. Keep the tokenizer and chat template aligned if you train the models.</li>
167
+ </ul>
168
+ </div>
169
+
170
+
171
+ ## Installation & Download
172
+
173
+ First, install the required transformers dependencies:
174
+
175
+ ```bash
176
+ conda create -n opencua python=3.10
177
+ conda activate opencua
178
+ pip install -r requirement.txt
179
+ ```
180
+
181
+ Download the model weights from Hugging Face:
182
+ ```python
183
+ from huggingface_hub import snapshot_download
184
+ snapshot_download(
185
+ repo_id="xlangai/OpenCUA-7B",
186
+ local_dir="OpenCUA-7B",
187
+ local_dir_use_symlinks=False
188
+ )
189
+ ```
190
+
191
+ ## 🎯 GUI Grounding
192
+
193
+ The following code demonstrates how to use OpenCUA models for GUI grounding tasks:
194
+
195
+ ```python
196
+ import base64
197
+ import torch
198
+ from transformers import AutoTokenizer, AutoModel, AutoImageProcessor
199
+ from PIL import Image
200
+ import json
201
+
202
+ def encode_image(image_path: str) -> str:
203
+ """Encode image to base64 string for model input."""
204
+ with open(image_path, "rb") as f:
205
+ return base64.b64encode(f.read()).decode()
206
+
207
+ def load_opencua_model(model_path: str):
208
+ """Load OpenCUA model, tokenizer, and image processor."""
209
+ tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
210
+ model = AutoModel.from_pretrained(
211
+ model_path,
212
+ torch_dtype="auto",
213
+ device_map="auto",
214
+ trust_remote_code=True
215
+ )
216
+ image_processor = AutoImageProcessor.from_pretrained(model_path, trust_remote_code=True)
217
+
218
+ return model, tokenizer, image_processor
219
+
220
+ def create_grounding_messages(image_path: str, instruction: str):
221
+ """Create chat messages for GUI grounding task."""
222
+ system_prompt = (
223
+ "You are a GUI agent. You are given a task and a screenshot of the screen. "
224
+ "You need to perform a series of pyautogui actions to complete the task."
225
+ )
226
+
227
+ messages = [
228
+ {"role": "system", "content": system_prompt},
229
+ {
230
+ "role": "user",
231
+ "content": [
232
+ {"type": "image", "image": f"data:image/png;base64,{encode_image(image_path)}"},
233
+ {"type": "text", "text": instruction},
234
+ ],
235
+ },
236
+ ]
237
+ return messages
238
+
239
+ def run_inference(model, tokenizer, image_processor, messages, image_path):
240
+ """Run inference on the model."""
241
+ # Prepare text input
242
+ input_ids = tokenizer.apply_chat_template(
243
+ messages, tokenize=True, add_generation_prompt=True
244
+ )
245
+ input_ids = torch.tensor([input_ids]).to(model.device)
246
+
247
+ # Prepare image input
248
+ image = Image.open(image_path).convert('RGB')
249
+ image_info = image_processor.preprocess(images=[image])
250
+ pixel_values = torch.tensor(image_info['pixel_values']).to(
251
+ dtype=torch.bfloat16, device=model.device
252
+ )
253
+ grid_thws = torch.tensor(image_info['image_grid_thw'])
254
+
255
+ # Generate response
256
+ with torch.no_grad():
257
+ generated_ids = model.generate(
258
+ input_ids,
259
+ pixel_values=pixel_values,
260
+ grid_thws=grid_thws,
261
+ max_new_tokens=512,
262
+ temperature=0
263
+ )
264
+
265
+ # Decode output
266
+ prompt_len = input_ids.shape[1]
267
+ generated_ids = generated_ids[:, prompt_len:]
268
+ output_text = tokenizer.batch_decode(
269
+ generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
270
+ )[0]
271
+
272
+ return output_text
273
+
274
+ # Example usage
275
+ model_path = "OpenCUA/OpenCUA-7B" # or other model variants
276
+ image_path = "screenshot.png"
277
+ instruction = "Click on the submit button"
278
+
279
+ # Load model
280
+ model, tokenizer, image_processor = load_opencua_model(model_path)
281
+
282
+ # Create messages and run inference
283
+ messages = create_grounding_messages(image_path, instruction)
284
+ result = run_inference(model, tokenizer, image_processor, messages, image_path)
285
+
286
+ print("Model output:", result)
287
+ ```
288
+
289
+ <div style="border-left: 6px solid #9ca3af; background: #f5f5f5; padding: 12px 16px; margin: 16px 0;">
290
+ <em>Expected result:</em>
+
+ ```python
291
+ pyautogui.click(x=1443, y=343)
292
+ ```
293
+ </div>
294
+
295
+ You can also run the five grounding examples in [OpenCUA/model/inference/huggingface_inference.py](https://github.com/xlang-ai/OpenCUA/blob/main/model/inference/huggingface_inference.py):
296
+ ```
297
+ cd ./model/inference/
298
+ python huggingface_inference.py
299
+ ```
300
+
301
+ ## 🖥️ Computer Use Agent
302
+ **[OpenCUAAgent](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/opencua_agent.py)** is developed in the [OSWorld](https://github.com/xlang-ai/OSWorld) environment based on OpenCUA models. It iteratively perceives the environment via screenshots, produces a reflective long CoT as its inner monologue, and predicts the next action to execute. By default, OpenCUAAgent uses 3 history screenshots and the L2 CoT format; a minimal sketch of this loop is shown below.
303
+
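+ The following is a purely illustrative sketch of that perceive-reason-act loop. The `env`, `predict_action`, and `execute` callables are hypothetical placeholders rather than part of the released code; the actual agent logic lives in OSWorld's `mm_agents/opencua_agent.py`.
+
+ ```python
+ from collections import deque
+
+ def run_agent_loop(env, model, task_instruction, predict_action, execute, max_steps=100):
+     """Illustrative only: keep the last 3 screenshots, reason with a long CoT,
+     then execute the predicted pyautogui action until the agent terminates."""
+     history = deque(maxlen=3)                  # up to 3 screenshots of history
+     for _ in range(max_steps):
+         screenshot = env.screenshot()          # perceive the current state
+         history.append(screenshot)
+         cot, action = predict_action(model, list(history), task_instruction)
+         if action.strip() == "terminate":      # the agent decides the task is done
+             break
+         execute(action)                        # act, e.g. pyautogui.click(x=..., y=...)
+ ```
+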
304
+ Command for running OpenCUA-7B and OpenCUA-32B in OSWorld:
305
+ ```
306
+ python run_multienv_opencua.py \
307
+ --headless \
308
+ --observation_type screenshot \
309
+ --model OpenCUA-32B \
310
+ --result_dir ./results --test_all_meta_path evaluation_examples/test_all_no_gdrive.json \
311
+ --max_steps 100 \
312
+ --num_envs 30 \
313
+ --coordinate_type qwen25
314
+ ```
315
+ <div style="border-left: 6px solid #9ca3af; background: #f5f5f5; padding: 12px 16px; margin: 16px 0;">
316
+ <em>Currently we only support Hugging Face inference. We are implementing vLLM support for the OpenCUA models. Please stay tuned.</em>
317
+ </div>
318
+
319
+ ---
320
+
321
+ # AgentNet Dataset - Large-Scale Computer-Use Dataset
322
+
323
+ <div align="center">
324
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/67b327cdd4665a0448eef7d5/dw5k183ucDSB2SZuS5f2V.png" width="400" alt="AgentNet Dataset Domain Distribution">
325
+ </div>
326
+
327
+ AgentNet is the first large-scale desktop computer-use agent trajectory dataset, containing 22.6K human-annotated computer-use tasks across Windows, macOS, and Ubuntu systems.
328
+
329
+ 👉 **[AgentNet Huggingface Dataset](https://huggingface.co/datasets/xlangai/AgentNet)**
330
+
331
+ Download the dataset here:
332
+ ```
333
+ pip install -U huggingface_hub
334
+ huggingface-cli download xlangai/AgentNet --repo-type dataset --local-dir ./AgentNet
335
+ ```
336
+
337
+ Collecting computer-use agent training data requires 3 steps:
338
+ - Demonstrate human computer-use tasks via [AgentNetTool](https://agentnet-tool.xlang.ai/);
339
+ - Preprocess the demonstration using [Action Reduction & State-Action Matching](./data/data-processor);
340
+ - For each step, [synthesize reflective long CoT](./data/cot-generator)
341
+
342
+
343
+ ## 1 AgentNetTool – Annotation & Verification Tool
344
+ <div align="center">
345
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/67b327cdd4665a0448eef7d5/ETjCOoIRR7f1YZCJ2kfiW.png" width="700" alt="AgentNet Tool">
346
+ </div>
347
+
348
+
349
+ Our **AgentNetTool** is a cross-platform GUI recorder that runs unobtrusively on annotators’ machines. It captures synchronized **screen video**, **mouse/keyboard events**, and **accessibility trees**, then provides an in-browser UI for reviewing, trimming, and submitting demonstrations. AgentNet Tool is available on Windows, macOS and Ubuntu.
350
+
351
+ 👉 **[AgentNetTool Document](https://agentnet-tool.xlang.ai/)**
352
+
353
+
354
+
355
+ ## 2 DataProcessor – Action Reduction & State–Action Matching
356
+ Raw demonstrations can contain thousands of low-level events that are too dense for model training.
357
+ The **DataProcessor** module (`./data/data-process/`) performs two key steps:
358
+
359
+ 1. **Action Reduction** — merges granular signals into concise, semantically meaningful PyAutoGUI actions (e.g., collapsing mouse moves → click, coalescing scrolls, grouping key-press sequences into text or hotkeys).
360
+ 2. **State–Action Matching** — aligns every reduced action with the *last visually distinct frame* **before** the action begins, avoiding future-information leakage and yielding compact state–action pairs.
361
+
362
+ These processed trajectories underlie all downstream training and evaluation.
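+
+ As a rough illustration of the Action Reduction step above, the sketch below collapses a raw event stream into PyAutoGUI-style actions. The event schema here is hypothetical; the real rules live in `./data/data-processor` and cover many more cases (scroll coalescing, hotkeys, drags, etc.).
+
+ ```python
+ # Minimal, hypothetical sketch of "Action Reduction" (not the official implementation).
+ def reduce_events(events):
+     """Collapse raw input events into concise PyAutoGUI-style actions."""
+     actions, text_buffer = [], ""
+     for ev in events:
+         if ev["type"] == "key" and len(ev["key"]) == 1:   # printable key presses
+             text_buffer += ev["key"]                      # accumulate into a write()
+             continue
+         if text_buffer:                                   # flush buffered typing
+             actions.append(f"pyautogui.write({text_buffer!r})")
+             text_buffer = ""
+         if ev["type"] == "mouse_up":                      # mouse_down + mouse_up -> click
+             actions.append(f"pyautogui.click(x={ev['x']}, y={ev['y']})")
+         elif ev["type"] == "scroll":
+             actions.append(f"pyautogui.scroll({ev['dy']})")
+     if text_buffer:
+         actions.append(f"pyautogui.write({text_buffer!r})")
+     return actions
+ ```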
363
+
364
+ ---
365
+
366
+ ## 3 CoTGenerator – Synthesizing Reflective Long Chain-of-Thought Inner Monologue
367
+ To boost robustness and interpretability, we augment each trajectory with **reflective long Chain-of-Thought (CoT) reasoning**.
368
+ The **CoTGenerator** pipeline (`./data/cot-generator/`) synthesizes step-level reflections that:
369
+
370
+ * reflect on the previous action,
371
+ * explain *why* an action is chosen given the current observation and history,
372
+ * note potential alternative actions, and
373
+ * forecast the expected next state.
374
+
375
+ Empirically, models trained with these rich CoTs scale better with data and generalize across unseen applications.
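+
+ For intuition, a synthesized step might carry fields like the ones below. The field names are illustrative only, not the exact schema used in `./data/cot-generator`.
+
+ ```python
+ # Hypothetical example of a single reflective CoT step (illustrative field names).
+ step = {
+     "reflection": "The previous click opened the File menu as intended.",
+     "reasoning": "To export the document as PDF, the next step is to choose "
+                  "'Save As' because the export formats are listed in that dialog.",
+     "alternatives": ["Use the Ctrl+Shift+S hotkey instead of the menu."],
+     "expected_next_state": "A 'Save As' dialog with a file-format dropdown.",
+     "action": "pyautogui.click(x=412, y=236)",
+ }
+ ```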
376
+
377
+
378
+ # Evaluation
379
+
380
+ <div align="center">
381
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/67b327cdd4665a0448eef7d5/emy1QCJwQj9KqHkVmtNH2.png" width="800" alt="AgentNetBench">
382
+ </div>
383
+
384
+
385
+ **AgentNetBench** (`./AgentNetBench/`) provides a realistic offline evaluator for OS agent trajectories. It compares model-predicted low-level actions (click, moveTo, write, press, scroll, terminate, etc.) against ground-truth human actions and reports detailed metrics.
386
+
387
+ 👉 See **[AgentNetBench/README.md](./evaluation/agentnetbench/README.md)** for usage instructions.
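+
+ As a rough illustration of what such an offline comparison involves, the check below accepts a predicted coordinate action when its type matches the ground truth and the point lands within a small normalized radius. This is a hypothetical simplification; the actual metrics and thresholds are defined in the AgentNetBench code.
+
+ ```python
+ # Hypothetical coordinate-action check (not the official AgentNetBench metric).
+ def coordinate_action_correct(pred, gold, radius=0.05, width=1920, height=1080):
+     """A click matches if the action type agrees and the predicted point falls
+     within a small normalized radius of the ground-truth location."""
+     if pred["type"] != gold["type"]:
+         return False
+     dx = (pred["x"] - gold["x"]) / width
+     dy = (pred["y"] - gold["y"]) / height
+     return (dx * dx + dy * dy) ** 0.5 <= radius
+ ```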
388
+
389
+ # TODO
390
+ ## vLLM Support
391
+ We are actively working with the vLLM team to add support for OpenCUA models.
392
+
393
+ **Workaround:** For now, please use the standard transformers library as shown in the examples above. We will update this section once vLLM support becomes available.
394
+
395
+ ## Training Code
396
+ OpenCUA models were developed on the training infrastructure of the Kimi Team. We are also developing the training pipeline on open-source infrastructure.
397
+
398
+ # Acknowledgments
399
+ <p>
400
+ We thank Su Yu, Caiming Xiong, Binyuan Hui, and the anonymous reviewers for their insightful discussions and valuable feedback.
401
+ We are grateful to Moonshot AI for providing training infrastructure and annotated data.
402
+ We also sincerely appreciate Calvin, Ziwei Chen, Jin Zhang, Ze Li, Zhengtao Wang, Yanxu Chen, and Qizheng Gu from the Kimi Team for their strong infrastructure support and helpful guidance.
403
+ The development of our tool is based on the open-source projects <a href="https://github.com/TheDuckAI/DuckTrack" target="_blank">DuckTrack</a> and <a href="https://github.com/OpenAdaptAI/OpenAdapt" target="_blank">OpenAdapt</a>.
404
+ We are very grateful for their commitment to the open-source community. Finally, we extend our deepest thanks to all annotators for their tremendous effort and contributions to this project.
405
+ </p>
406
+
407
+ # License
408
+
409
+ This project is licensed under the MIT License - see the LICENSE file in the root folder for details.
410
+
411
+ ## Research Use and Disclaimer
412
+
413
+ OpenCUA models are intended for **research and educational purposes only**.
414
+
415
+ ### Prohibited Uses
416
+ - The model may **not** be used for any purpose or activity that violates applicable laws or regulations in any jurisdiction
417
+ - Use for illegal, unethical, or harmful activities is strictly prohibited
418
+
419
+ ### Disclaimer
420
+ - The authors, contributors, and copyright holders are **not responsible** for any illegal, unethical, or harmful use of the Software, nor for any direct or indirect damages resulting from such use
421
+ - Use of the "OpenCUA" name, logo, or trademarks does **not** imply any endorsement or affiliation unless separate written permission is obtained
422
+ - Users are solely responsible for ensuring their use complies with applicable laws and regulations
423
+
424
+ ## Important Notes on Coordinate Systems
425
+ <div style="border-left: 6px solid #9ca3af; background: #f5f5f5; padding: 12px 16px; margin: 16px 0;">
426
+ <ul style="margin: 0;">
427
+ <li><strong><code>OpenCUA/OpenCUA-A3B</code></strong> – Relative coordinates <em>(not supported in this code)</em></li>
428
+ <li><strong><code>OpenCUA/OpenCUA-Qwen2-7B</code></strong> – Relative coordinates</li>
429
+ <li><strong><code>OpenCUA/OpenCUA-7B</code></strong> – Absolute coordinates</li>
430
+ <li><strong><code>OpenCUA/OpenCUA-32B</code></strong> – Absolute coordinates</li>
431
+ </ul>
432
+ </div>
433
+
434
+ **OpenCUA models use different coordinate systems depending on the base model:**
435
+
436
+ - **OpenCUA-Qwen2-7B**: Outputs **relative coordinates** (0.0 to 1.0 range)
437
+ ```python
438
+ # Example output: pyautogui.click(x=0.5, y=0.3)
439
+ # x=0.5 means 50% from left edge, y=0.3 means 30% from top edge
440
+
441
+ # Convert to absolute coordinates:
442
+ def qwen2_relative_to_absolute(rel_x, rel_y, original_width, original_height):
443
+ abs_x = int(rel_x * original_width)
444
+ abs_y = int(rel_y * original_height)
445
+ return abs_x, abs_y
446
+ ```
447
+
448
+ - **OpenCUA-7B and OpenCUA-32B** (Qwen2.5-based): Output **absolute coordinates** after smart resize
449
+ ```python
450
+ # Example output: pyautogui.click(x=960, y=324)
451
+ # These are coordinates on the smart-resized image, not the original image
452
+
453
+ # Convert to original image coordinates:
454
+ # Please refer to the smart_resize function in: https://github.com/huggingface/transformers/blob/67ddc82fbc7e52c6f42a395b4a6d278c55b77a39/src/transformers/models/qwen2_vl/image_processing_qwen2_vl.py#L55
455
+ def qwen25_smart_resize_to_absolute(model_x, model_y, original_width, original_height):
456
+ # First, calculate the smart-resized dimensions
457
+ resized_height, resized_width = smart_resize(original_height, original_width, factor = 28, min_pixels = 3136, max_pixels = 12845056)
458
+
459
+ # Convert model output to relative coordinates on original image
460
+ rel_x = model_x / resized_width
461
+ rel_y = model_y / resized_height
462
+
463
+ # Then convert to absolute coordinates on original image
464
+ abs_x = int(rel_x * original_width)
465
+ abs_y = int(rel_y * original_height)
466
+ return abs_x, abs_y
467
+ ```
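+
+ For example, a small helper (not part of the official repo) can parse the model's pyautogui string and map the point back to the original screenshot using `qwen25_smart_resize_to_absolute` from the snippet above:
+
+ ```python
+ import re
+
+ def parse_and_convert(output_text, original_width, original_height):
+     """Extract x/y from e.g. 'pyautogui.click(x=960, y=324)' and rescale them."""
+     match = re.search(r"x=(\d+),\s*y=(\d+)", output_text)
+     if match is None:
+         return None
+     model_x, model_y = int(match.group(1)), int(match.group(2))
+     return qwen25_smart_resize_to_absolute(model_x, model_y, original_width, original_height)
+ ```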
468
+
469
+ <div style="border-left: 6px solid #9ca3af; background: #f5f5f5; padding: 12px 16px; margin: 16px 0;">
470
+ <strong>Understanding Smart Resize for Qwen2.5-based Models:</strong>
471
+ <p style="margin: 8px 0 0;">
472
+ The Qwen2.5-VL models use a “smart resize” preprocessing that maintains aspect ratio while fitting within pixel constraints.
473
+ For coordinate conversion, you need the smart resize function from the
474
+ <a href="https://github.com/QwenLM/Qwen2.5-VL/blob/d2240f11656bfe404b9ba56db4e51cd09f522ff1/qwen-vl-utils/src/qwen_vl_utils/vision_process.py#L60">
475
+ official Qwen2.5-VL implementation</a>.
476
+ </p>
477
+ </div>
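+
+ If you prefer not to add a dependency, the following is a minimal sketch of the smart-resize computation with the default `factor`/`min_pixels`/`max_pixels` values used above; treat the linked official implementation as authoritative (it also validates extreme aspect ratios).
+
+ ```python
+ import math
+
+ def smart_resize(height, width, factor=28, min_pixels=3136, max_pixels=12845056):
+     """Round each side to a multiple of `factor`, then rescale so the total pixel
+     count stays within [min_pixels, max_pixels] while preserving aspect ratio."""
+     h_bar = max(factor, round(height / factor) * factor)
+     w_bar = max(factor, round(width / factor) * factor)
+     if h_bar * w_bar > max_pixels:
+         beta = math.sqrt((height * width) / max_pixels)
+         h_bar = math.floor(height / beta / factor) * factor
+         w_bar = math.floor(width / beta / factor) * factor
+     elif h_bar * w_bar < min_pixels:
+         beta = math.sqrt(min_pixels / (height * width))
+         h_bar = math.ceil(height * beta / factor) * factor
+         w_bar = math.ceil(width * beta / factor) * factor
+     return h_bar, w_bar
+ ```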
478
+
479
+ ## Citation
480
+
481
+ If you use OpenCUA models in your research, please cite our work:
482
+
483
+ ```bibtex
484
+ @misc{wang2025opencuaopenfoundationscomputeruse,
485
+ title={OpenCUA: Open Foundations for Computer-Use Agents},
486
+ author={Xinyuan Wang and Bowen Wang and Dunjie Lu and Junlin Yang and Tianbao Xie and Junli Wang and Jiaqi Deng and Xiaole Guo and Yiheng Xu and Chen Henry Wu and Zhennan Shen and Zhuokai Li and Ryan Li and Xiaochuan Li and Junda Chen and Boyuan Zheng and Peihang Li and Fangyu Lei and Ruisheng Cao and Yeqiao Fu and Dongchan Shin and Martin Shin and Jiarui Hu and Yuyan Wang and Jixuan Chen and Yuxiao Ye and Danyang Zhang and Dikang Du and Hao Hu and Huarong Chen and Zaida Zhou and Haotian Yao and Ziwei Chen and Qizheng Gu and Yipu Wang and Heng Wang and Diyi Yang and Victor Zhong and Flood Sung and Y. Charles and Zhilin Yang and Tao Yu},
487
+ year={2025},
488
+ eprint={2508.09123},
489
+ archivePrefix={arXiv},
490
+ primaryClass={cs.AI},
491
+ url={https://arxiv.org/abs/2508.09123},
492
+ }
493
+ ```
494
+
495
+ </div>
__pycache__/processing_opencua.cpython-312.pyc ADDED
Binary file (1.03 kB).
 
__pycache__/tokenization_opencua.cpython-312.pyc ADDED
Binary file (16.3 kB).
 
config.bak.json ADDED
@@ -0,0 +1,69 @@
1
+ {
2
+ "architectures": [
3
+ "OpenCUAForConditionalGeneration"
4
+ ],
5
+ "auto_map": {
6
+ "AutoConfig": "configuration_opencua.OpenCUAConfig",
7
+ "AutoModel": "modeling_opencua.OpenCUAForConditionalGeneration",
8
+ "AutoModelForCausalLM": "modeling_opencua.OpenCUAForConditionalGeneration"
9
+ },
10
+ "ignore_index": -100,
11
+ "media_placeholder_token_id": 151664,
12
+ "model_type": "opencua",
13
+ "pad_token_id": 0,
14
+ "text_config": {
15
+ "bos_token_id": 151643,
16
+ "eos_token_id": 151644,
17
+ "head_dim": 128,
18
+ "hidden_act": "silu",
19
+ "hidden_size": 3584,
20
+ "initializer_range": 0.02,
21
+ "intermediate_size": 18944,
22
+ "k_proj_bias": true,
23
+ "max_length": 20,
24
+ "min_length": 0,
25
+ "model_type": "qwen2",
26
+ "num_attention_heads": 28,
27
+ "num_beam_groups": 1,
28
+ "num_beams": 1,
29
+ "num_hidden_layers": 28,
30
+ "num_key_value_heads": 4,
31
+ "pad_token_id": 152063,
32
+ "pretraining_sequence_length": 128000,
33
+ "q_proj_bias": true,
34
+ "rms_norm_eps": 1e-05,
35
+ "rope_theta": 1000000.0,
36
+ "tie_word_embeddings": false,
37
+ "torch_dtype": "bfloat16",
38
+ "use_bfloat16": false,
39
+ "use_cache": true,
40
+ "v_proj_bias": true,
41
+ "vocab_size": 152064
42
+ },
43
+ "tie_word_embeddings": false,
44
+ "torch_dtype": "bfloat16",
45
+ "transformers_version": "4.48.3",
46
+ "vision_config": {
47
+ "depth": 32,
48
+ "fullatt_block_indexes": [
49
+ 7,
50
+ 15,
51
+ 23,
52
+ 31
53
+ ],
54
+ "hidden_act": "silu",
55
+ "hidden_size": 1280,
56
+ "num_heads": 16,
57
+ "in_chans": 3,
58
+ "intermediate_size": 3420,
59
+
60
+ "patch_size": 14,
61
+ "spatial_merge_size": 2,
62
+ "spatial_patch_size": 14,
63
+ "temporal_patch_size": 2,
64
+ "out_hidden_size": 3584,
65
+ "tokens_per_second": 2,
66
+ "window_size": 112
67
+ },
68
+ "vocab_size": 152064
69
+ }
config.json ADDED
@@ -0,0 +1,55 @@
1
+ {
2
+ "architectures": [
3
+ "Qwen2_5_VLForConditionalGeneration"
4
+ ],
5
+ "attention_dropout": 0.0,
6
+ "bos_token_id": 151643,
7
+ "eos_token_id": 151645,
8
+ "vision_start_token_id": 151652,
9
+ "vision_end_token_id": 151653,
10
+ "vision_token_id": 151654,
11
+ "image_token_id": 151655,
12
+ "video_token_id": 151656,
13
+ "hidden_act": "silu",
14
+ "hidden_size": 3584,
15
+ "initializer_range": 0.02,
16
+ "intermediate_size": 18944,
17
+ "max_position_embeddings": 128000,
18
+ "max_window_layers": 28,
19
+ "model_type": "qwen2_5_vl",
20
+ "num_attention_heads": 28,
21
+ "num_hidden_layers": 28,
22
+ "num_key_value_heads": 4,
23
+ "rms_norm_eps": 1e-06,
24
+ "rope_theta": 1000000.0,
25
+ "sliding_window": 32768,
26
+ "tie_word_embeddings": false,
27
+ "torch_dtype": "bfloat16",
28
+ "transformers_version": "4.41.2",
29
+ "use_cache": true,
30
+ "use_sliding_window": false,
31
+ "vision_config": {
32
+ "depth": 32,
33
+ "hidden_act": "silu",
34
+ "hidden_size": 1280,
35
+ "intermediate_size": 3420,
36
+ "num_heads": 16,
37
+ "in_chans": 3,
38
+ "out_hidden_size": 3584,
39
+ "patch_size": 14,
40
+ "spatial_merge_size": 2,
41
+ "spatial_patch_size": 14,
42
+ "window_size": 112,
43
+ "fullatt_block_indexes": [
44
+ 7,
45
+ 15,
46
+ 23,
47
+ 31
48
+ ],
49
+ "tokens_per_second": 2,
50
+ "temporal_patch_size": 2
51
+ },
52
+ "rope_scaling": {
53
+ "type": "default"},
54
+ "vocab_size": 152064
55
+ }
configuration_opencua.py ADDED
@@ -0,0 +1,38 @@
1
+ from transformers.configuration_utils import PretrainedConfig
2
+ from transformers.models.qwen2_5_vl.configuration_qwen2_5_vl import Qwen2_5_VLVisionConfig
3
+ from transformers.models.qwen2.configuration_qwen2 import Qwen2Config
4
+
5
+
6
+ class OpenCUAConfig(PretrainedConfig):
7
+ """OpenCUA-2.5-7B model configuration.
8
+
9
+ Args:
10
+        vision_config: Configuration for the vision model (Qwen2_5_VLVisionConfig).
11
+        text_config: Configuration for the text model (Qwen2Config).
12
+ pad_token_id: The token ID to use for padding.
13
+ """
14
+
15
+ model_type = "opencua"
16
+
17
+ def __init__(
18
+ self,
19
+ vision_config: dict | Qwen2_5_VLVisionConfig | None = None,
20
+ text_config: dict | Qwen2Config | None = None,
21
+ ignore_index: int = -100,
22
+ media_placeholder_token_id: int = 151664,
23
+ pad_token_id: int = 0,
24
+ **kwargs
25
+ ):
26
+ if isinstance(vision_config, dict):
27
+ vision_config = Qwen2_5_VLVisionConfig(**vision_config)
28
+ self.vision_config = vision_config
29
+
30
+ if isinstance(text_config, dict):
31
+ text_config = Qwen2Config(**text_config)
32
+ self.text_config = text_config
33
+
34
+ self.ignore_index = ignore_index
35
+ self.media_placeholder_token_id = media_placeholder_token_id
36
+
37
+ super().__init__(pad_token_id=pad_token_id, **kwargs)
38
+
generation_config.json ADDED
@@ -0,0 +1,4 @@
 
 
 
 
 
1
+ {
2
+ "max_length": 32768,
3
+ "eos_token_id": 151644
4
+ }
model-1-of-28.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:81b20357bf0a19193a26806b4fbcc0443c882fc694e637e02ebcd5670d0fd4eb
3
+ size 2909254736
model-10-of-28.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:5f0b033c515d92acb396d8bc9978cdcad561a27bca2dc17c7202838f3652d35b
3
+ size 466117168
model-11-of-28.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4043300169c14f8a1b1a1b9aead1234394d2d4b61db5d1c77222a90519a9dce8
3
+ size 466117176
model-12-of-28.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:23a0a37b8918a4c062cd7ed95834f7e2ed6d3801885c8ef4a66c0b681462ff19
3
+ size 466117176
model-13-of-28.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:708cae96ee482105895212a1856e0c5a71bbe40471540d34e2d4cbe7fd0ab76a
3
+ size 466117176
model-14-of-28.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:23490bd2d6389fc8ee01a0f808c865a3b1c6c927e7d21e844975a50041cff79c
3
+ size 466117176
model-15-of-28.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:104eee18d818525d3776e1d2bd5fe9c0e5d0d3298b8ab4c59382cd96222b0fc6
3
+ size 466117176
model-16-of-28.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e2f3cf3ec4d780acaae3d06139653af043502ad9200560a3dcf41a1ab589145a
3
+ size 466117176
model-17-of-28.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a9a7f1e09274bb6a0452c5d2c9280c99b268ceff6bfb33ec2c3b9838e4b0be43
3
+ size 466117176
model-18-of-28.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:112aaae346d51b8c75a2ceccaa53d78edbefb6ec4270a805ba38d4d4f5e83f64
3
+ size 466117176
model-19-of-28.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:2f17abaea5d9a632a63670de0241737d0fecb8b596c99ecf1207f501d900862e
3
+ size 466117176
model-2-of-28.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7c45530684019875a72e9cdaff5152f2193bb95c9d227e95880e3c1352ca2d89
3
+ size 466117168
model-20-of-28.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ab2ddeedf0122fbbb61da61334a4c2b05527170e86bd989492682c923d20e176
3
+ size 466117176
model-21-of-28.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:201213a71cd90492d7c16ef1a7bf1d7c509f3ba8de3878a3e0ad31aa337b9fd3
3
+ size 466117176
model-22-of-28.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f9d9a20bc4ab5f20f7cc0235308ca43c0e75b63d6a216828d9f0653a2d61fd09
3
+ size 466117176
model-23-of-28.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3e4c6f5b2de2d06d38a36ae83f4bac8f1ecb02b9728189fd20b1faa7220a7060
3
+ size 466117176
model-24-of-28.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:fa21ebc5c563ee16e07e89c04bea75beb33054232f527e91c1dcb2ab23b1e7c2
3
+ size 466117176
model-25-of-28.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f2f82e9856bebb354084dee10c1ea079e04a6e8d75c55f4635046933de332dc2
3
+ size 466117176
model-26-of-28.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:cc325bb0e58fd55dc98618c8535e840eef815d1afb50b5c13d7a53e6f5075d1f
3
+ size 466117176
model-27-of-28.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ff43cd1fe2bc5582aafcb76bd7369145df01d2950a43f30f40a08a619ff04808
3
+ size 466117176
model-28-of-28.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ff0f9999c077caa7920159984ae0116d844c8b1ae8e18215472c8cac8f86fd61
3
+ size 1556119320
model-3-of-28.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f102b49039d07ec5a410585b6a97cf749af04958188f25746b1292abb627b2f5
3
+ size 466117168
model-4-of-28.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:58d2b72b6e77c062cd798886a6b55a4a5fae73839c28608d6c9cf6f2fce7b483
3
+ size 466117168
model-5-of-28.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:288cbf94ad3f543a2a36b0fc4a8f1da7709379016bf55f18d77f58cf076ba154
3
+ size 466117168
model-6-of-28.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:663b8b73ea760cf04c68a696aa56bb31f553c72434c182176a49ed797acacbd7
3
+ size 466117168
model-7-of-28.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:25286a5c4ff9d4adb178f85c842e54328d8ca919ad5102832cc137eebcc77916
3
+ size 466117168
model-8-of-28.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:eda0f72734081d6c2c57e76cf9c1f191fd87d0cf117f5a95130928932df79f61
3
+ size 466117168
model-9-of-28.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c7b5d2e1dc2bf66e500b986a1be8eb5ba672e38f2ca201ef01e474b415314c8a
3
+ size 466117168
model.args.pt ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d2489a260658ec0235d9c92a059b58d9f188c5e29034572f61c02e346223508e
3
+ size 80428
model.safetensors.index.json ADDED
@@ -0,0 +1,764 @@
1
+ {
2
+ "metadata": {
3
+ "total_size": 16584336896
4
+ },
5
+ "weight_map": {
6
+ "model.layers.0.self_attn.q_proj.weight": "model-1-of-28.safetensors",
7
+ "model.layers.0.self_attn.q_proj.bias": "model-1-of-28.safetensors",
8
+ "model.layers.0.self_attn.k_proj.weight": "model-1-of-28.safetensors",
9
+ "model.layers.0.self_attn.k_proj.bias": "model-1-of-28.safetensors",
10
+ "model.layers.0.self_attn.v_proj.weight": "model-1-of-28.safetensors",
11
+ "model.layers.0.self_attn.v_proj.bias": "model-1-of-28.safetensors",
12
+ "model.layers.0.self_attn.o_proj.weight": "model-1-of-28.safetensors",
13
+ "model.layers.0.mlp.gate_proj.weight": "model-1-of-28.safetensors",
14
+ "model.layers.0.mlp.down_proj.weight": "model-1-of-28.safetensors",
15
+ "model.layers.0.mlp.up_proj.weight": "model-1-of-28.safetensors",
16
+ "model.layers.0.input_layernorm.weight": "model-1-of-28.safetensors",
17
+ "model.layers.0.post_attention_layernorm.weight": "model-1-of-28.safetensors",
18
+ "model.embed_tokens.weight": "model-1-of-28.safetensors",
19
+ "visual.blocks.0.attn.proj.bias": "model-1-of-28.safetensors",
20
+ "visual.blocks.0.attn.proj.weight": "model-1-of-28.safetensors",
21
+ "visual.blocks.0.attn.qkv.bias": "model-1-of-28.safetensors",
22
+ "visual.blocks.0.attn.qkv.weight": "model-1-of-28.safetensors",
23
+ "visual.blocks.0.mlp.down_proj.bias": "model-1-of-28.safetensors",
24
+ "visual.blocks.0.mlp.down_proj.weight": "model-1-of-28.safetensors",
25
+ "visual.blocks.0.mlp.gate_proj.bias": "model-1-of-28.safetensors",
26
+ "visual.blocks.0.mlp.gate_proj.weight": "model-1-of-28.safetensors",
27
+ "visual.blocks.0.mlp.up_proj.bias": "model-1-of-28.safetensors",
28
+ "visual.blocks.0.mlp.up_proj.weight": "model-1-of-28.safetensors",
29
+ "visual.blocks.0.norm1.weight": "model-1-of-28.safetensors",
30
+ "visual.blocks.0.norm2.weight": "model-1-of-28.safetensors",
31
+ "visual.blocks.1.attn.proj.bias": "model-1-of-28.safetensors",
32
+ "visual.blocks.1.attn.proj.weight": "model-1-of-28.safetensors",
33
+ "visual.blocks.1.attn.qkv.bias": "model-1-of-28.safetensors",
34
+ "visual.blocks.1.attn.qkv.weight": "model-1-of-28.safetensors",
35
+ "visual.blocks.1.mlp.down_proj.bias": "model-1-of-28.safetensors",
36
+ "visual.blocks.1.mlp.down_proj.weight": "model-1-of-28.safetensors",
37
+ "visual.blocks.1.mlp.gate_proj.bias": "model-1-of-28.safetensors",
38
+ "visual.blocks.1.mlp.gate_proj.weight": "model-1-of-28.safetensors",
39
+ "visual.blocks.1.mlp.up_proj.bias": "model-1-of-28.safetensors",
40
+ "visual.blocks.1.mlp.up_proj.weight": "model-1-of-28.safetensors",
41
+ "visual.blocks.1.norm1.weight": "model-1-of-28.safetensors",
42
+ "visual.blocks.1.norm2.weight": "model-1-of-28.safetensors",
43
+ "visual.blocks.10.attn.proj.bias": "model-1-of-28.safetensors",
44
+ "visual.blocks.10.attn.proj.weight": "model-1-of-28.safetensors",
45
+ "visual.blocks.10.attn.qkv.bias": "model-1-of-28.safetensors",
46
+ "visual.blocks.10.attn.qkv.weight": "model-1-of-28.safetensors",
47
+ "visual.blocks.10.mlp.down_proj.bias": "model-1-of-28.safetensors",
48
+ "visual.blocks.10.mlp.down_proj.weight": "model-1-of-28.safetensors",
49
+ "visual.blocks.10.mlp.gate_proj.bias": "model-1-of-28.safetensors",
50
+ "visual.blocks.10.mlp.gate_proj.weight": "model-1-of-28.safetensors",
51
+ "visual.blocks.10.mlp.up_proj.bias": "model-1-of-28.safetensors",
52
+ "visual.blocks.10.mlp.up_proj.weight": "model-1-of-28.safetensors",
53
+ "visual.blocks.10.norm1.weight": "model-1-of-28.safetensors",
54
+ "visual.blocks.10.norm2.weight": "model-1-of-28.safetensors",
55
+ "visual.blocks.11.attn.proj.bias": "model-1-of-28.safetensors",
56
+ "visual.blocks.11.attn.proj.weight": "model-1-of-28.safetensors",
57
+ "visual.blocks.11.attn.qkv.bias": "model-1-of-28.safetensors",
58
+ "visual.blocks.11.attn.qkv.weight": "model-1-of-28.safetensors",
59
+ "visual.blocks.11.mlp.down_proj.bias": "model-1-of-28.safetensors",
60
+ "visual.blocks.11.mlp.down_proj.weight": "model-1-of-28.safetensors",
61
+ "visual.blocks.11.mlp.gate_proj.bias": "model-1-of-28.safetensors",
62
+ "visual.blocks.11.mlp.gate_proj.weight": "model-1-of-28.safetensors",
63
+ "visual.blocks.11.mlp.up_proj.bias": "model-1-of-28.safetensors",
64
+ "visual.blocks.11.mlp.up_proj.weight": "model-1-of-28.safetensors",
65
+ "visual.blocks.11.norm1.weight": "model-1-of-28.safetensors",
66
+ "visual.blocks.11.norm2.weight": "model-1-of-28.safetensors",
67
+ "visual.blocks.12.attn.proj.bias": "model-1-of-28.safetensors",
68
+ "visual.blocks.12.attn.proj.weight": "model-1-of-28.safetensors",
69
+ "visual.blocks.12.attn.qkv.bias": "model-1-of-28.safetensors",
70
+ "visual.blocks.12.attn.qkv.weight": "model-1-of-28.safetensors",
71
+ "visual.blocks.12.mlp.down_proj.bias": "model-1-of-28.safetensors",
72
+ "visual.blocks.12.mlp.down_proj.weight": "model-1-of-28.safetensors",
73
+ "visual.blocks.12.mlp.gate_proj.bias": "model-1-of-28.safetensors",
74
+ "visual.blocks.12.mlp.gate_proj.weight": "model-1-of-28.safetensors",
75
+ "visual.blocks.12.mlp.up_proj.bias": "model-1-of-28.safetensors",
76
+ "visual.blocks.12.mlp.up_proj.weight": "model-1-of-28.safetensors",
77
+ "visual.blocks.12.norm1.weight": "model-1-of-28.safetensors",
78
+ "visual.blocks.12.norm2.weight": "model-1-of-28.safetensors",
79
+ "visual.blocks.13.attn.proj.bias": "model-1-of-28.safetensors",
80
+ "visual.blocks.13.attn.proj.weight": "model-1-of-28.safetensors",
81
+ "visual.blocks.13.attn.qkv.bias": "model-1-of-28.safetensors",
82
+ "visual.blocks.13.attn.qkv.weight": "model-1-of-28.safetensors",
83
+ "visual.blocks.13.mlp.down_proj.bias": "model-1-of-28.safetensors",
84
+ "visual.blocks.13.mlp.down_proj.weight": "model-1-of-28.safetensors",
85
+ "visual.blocks.13.mlp.gate_proj.bias": "model-1-of-28.safetensors",
86
+ "visual.blocks.13.mlp.gate_proj.weight": "model-1-of-28.safetensors",
87
+ "visual.blocks.13.mlp.up_proj.bias": "model-1-of-28.safetensors",
88
+ "visual.blocks.13.mlp.up_proj.weight": "model-1-of-28.safetensors",
89
+ "visual.blocks.13.norm1.weight": "model-1-of-28.safetensors",
90
+ "visual.blocks.13.norm2.weight": "model-1-of-28.safetensors",
91
+ "visual.blocks.14.attn.proj.bias": "model-1-of-28.safetensors",
92
+ "visual.blocks.14.attn.proj.weight": "model-1-of-28.safetensors",
93
+ "visual.blocks.14.attn.qkv.bias": "model-1-of-28.safetensors",
94
+ "visual.blocks.14.attn.qkv.weight": "model-1-of-28.safetensors",
95
+ "visual.blocks.14.mlp.down_proj.bias": "model-1-of-28.safetensors",
96
+ "visual.blocks.14.mlp.down_proj.weight": "model-1-of-28.safetensors",
97
+ "visual.blocks.14.mlp.gate_proj.bias": "model-1-of-28.safetensors",
98
+ "visual.blocks.14.mlp.gate_proj.weight": "model-1-of-28.safetensors",
99
+ "visual.blocks.14.mlp.up_proj.bias": "model-1-of-28.safetensors",
100
+ "visual.blocks.14.mlp.up_proj.weight": "model-1-of-28.safetensors",
101
+ "visual.blocks.14.norm1.weight": "model-1-of-28.safetensors",
102
+ "visual.blocks.14.norm2.weight": "model-1-of-28.safetensors",
103
+ "visual.blocks.15.attn.proj.bias": "model-1-of-28.safetensors",
104
+ "visual.blocks.15.attn.proj.weight": "model-1-of-28.safetensors",
105
+ "visual.blocks.15.attn.qkv.bias": "model-1-of-28.safetensors",
106
+ "visual.blocks.15.attn.qkv.weight": "model-1-of-28.safetensors",
107
+ "visual.blocks.15.mlp.down_proj.bias": "model-1-of-28.safetensors",
108
+ "visual.blocks.15.mlp.down_proj.weight": "model-1-of-28.safetensors",
109
+ "visual.blocks.15.mlp.gate_proj.bias": "model-1-of-28.safetensors",
110
+ "visual.blocks.15.mlp.gate_proj.weight": "model-1-of-28.safetensors",
111
+ "visual.blocks.15.mlp.up_proj.bias": "model-1-of-28.safetensors",
112
+ "visual.blocks.15.mlp.up_proj.weight": "model-1-of-28.safetensors",
113
+ "visual.blocks.15.norm1.weight": "model-1-of-28.safetensors",
114
+ "visual.blocks.15.norm2.weight": "model-1-of-28.safetensors",
115
+ "visual.blocks.16.attn.proj.bias": "model-1-of-28.safetensors",
116
+ "visual.blocks.16.attn.proj.weight": "model-1-of-28.safetensors",
117
+ "visual.blocks.16.attn.qkv.bias": "model-1-of-28.safetensors",
118
+ "visual.blocks.16.attn.qkv.weight": "model-1-of-28.safetensors",
119
+ "visual.blocks.16.mlp.down_proj.bias": "model-1-of-28.safetensors",
120
+ "visual.blocks.16.mlp.down_proj.weight": "model-1-of-28.safetensors",
121
+ "visual.blocks.16.mlp.gate_proj.bias": "model-1-of-28.safetensors",
122
+ "visual.blocks.16.mlp.gate_proj.weight": "model-1-of-28.safetensors",
123
+ "visual.blocks.16.mlp.up_proj.bias": "model-1-of-28.safetensors",
124
+ "visual.blocks.16.mlp.up_proj.weight": "model-1-of-28.safetensors",
125
+ "visual.blocks.16.norm1.weight": "model-1-of-28.safetensors",
126
+ "visual.blocks.16.norm2.weight": "model-1-of-28.safetensors",
127
+ "visual.blocks.17.attn.proj.bias": "model-1-of-28.safetensors",
128
+ "visual.blocks.17.attn.proj.weight": "model-1-of-28.safetensors",
129
+ "visual.blocks.17.attn.qkv.bias": "model-1-of-28.safetensors",
130
+ "visual.blocks.17.attn.qkv.weight": "model-1-of-28.safetensors",
131
+ "visual.blocks.17.mlp.down_proj.bias": "model-1-of-28.safetensors",
132
+ "visual.blocks.17.mlp.down_proj.weight": "model-1-of-28.safetensors",
133
+ "visual.blocks.17.mlp.gate_proj.bias": "model-1-of-28.safetensors",
134
+ "visual.blocks.17.mlp.gate_proj.weight": "model-1-of-28.safetensors",
135
+ "visual.blocks.17.mlp.up_proj.bias": "model-1-of-28.safetensors",
136
+ "visual.blocks.17.mlp.up_proj.weight": "model-1-of-28.safetensors",
137
+ "visual.blocks.17.norm1.weight": "model-1-of-28.safetensors",
138
+ "visual.blocks.17.norm2.weight": "model-1-of-28.safetensors",
139
+ "visual.blocks.18.attn.proj.bias": "model-1-of-28.safetensors",
140
+ "visual.blocks.18.attn.proj.weight": "model-1-of-28.safetensors",
141
+ "visual.blocks.18.attn.qkv.bias": "model-1-of-28.safetensors",
142
+ "visual.blocks.18.attn.qkv.weight": "model-1-of-28.safetensors",
143
+ "visual.blocks.18.mlp.down_proj.bias": "model-1-of-28.safetensors",
144
+ "visual.blocks.18.mlp.down_proj.weight": "model-1-of-28.safetensors",
145
+ "visual.blocks.18.mlp.gate_proj.bias": "model-1-of-28.safetensors",
146
+ "visual.blocks.18.mlp.gate_proj.weight": "model-1-of-28.safetensors",
147
+ "visual.blocks.18.mlp.up_proj.bias": "model-1-of-28.safetensors",
148
+ "visual.blocks.18.mlp.up_proj.weight": "model-1-of-28.safetensors",
149
+ "visual.blocks.18.norm1.weight": "model-1-of-28.safetensors",
150
+ "visual.blocks.18.norm2.weight": "model-1-of-28.safetensors",
151
+ "visual.blocks.19.attn.proj.bias": "model-1-of-28.safetensors",
152
+ "visual.blocks.19.attn.proj.weight": "model-1-of-28.safetensors",
153
+ "visual.blocks.19.attn.qkv.bias": "model-1-of-28.safetensors",
154
+ "visual.blocks.19.attn.qkv.weight": "model-1-of-28.safetensors",
155
+ "visual.blocks.19.mlp.down_proj.bias": "model-1-of-28.safetensors",
156
+ "visual.blocks.19.mlp.down_proj.weight": "model-1-of-28.safetensors",
157
+ "visual.blocks.19.mlp.gate_proj.bias": "model-1-of-28.safetensors",
158
+ "visual.blocks.19.mlp.gate_proj.weight": "model-1-of-28.safetensors",
159
+ "visual.blocks.19.mlp.up_proj.bias": "model-1-of-28.safetensors",
160
+ "visual.blocks.19.mlp.up_proj.weight": "model-1-of-28.safetensors",
161
+ "visual.blocks.19.norm1.weight": "model-1-of-28.safetensors",
162
+ "visual.blocks.19.norm2.weight": "model-1-of-28.safetensors",
163
+ "visual.blocks.2.attn.proj.bias": "model-1-of-28.safetensors",
164
+ "visual.blocks.2.attn.proj.weight": "model-1-of-28.safetensors",
165
+ "visual.blocks.2.attn.qkv.bias": "model-1-of-28.safetensors",
166
+ "visual.blocks.2.attn.qkv.weight": "model-1-of-28.safetensors",
167
+ "visual.blocks.2.mlp.down_proj.bias": "model-1-of-28.safetensors",
168
+ "visual.blocks.2.mlp.down_proj.weight": "model-1-of-28.safetensors",
169
+ "visual.blocks.2.mlp.gate_proj.bias": "model-1-of-28.safetensors",
170
+ "visual.blocks.2.mlp.gate_proj.weight": "model-1-of-28.safetensors",
171
+ "visual.blocks.2.mlp.up_proj.bias": "model-1-of-28.safetensors",
172
+ "visual.blocks.2.mlp.up_proj.weight": "model-1-of-28.safetensors",
173
+ "visual.blocks.2.norm1.weight": "model-1-of-28.safetensors",
174
+ "visual.blocks.2.norm2.weight": "model-1-of-28.safetensors",
175
+ "visual.blocks.20.attn.proj.bias": "model-1-of-28.safetensors",
176
+ "visual.blocks.20.attn.proj.weight": "model-1-of-28.safetensors",
177
+ "visual.blocks.20.attn.qkv.bias": "model-1-of-28.safetensors",
178
+ "visual.blocks.20.attn.qkv.weight": "model-1-of-28.safetensors",
179
+ "visual.blocks.20.mlp.down_proj.bias": "model-1-of-28.safetensors",
180
+ "visual.blocks.20.mlp.down_proj.weight": "model-1-of-28.safetensors",
181
+ "visual.blocks.20.mlp.gate_proj.bias": "model-1-of-28.safetensors",
182
+ "visual.blocks.20.mlp.gate_proj.weight": "model-1-of-28.safetensors",
183
+ "visual.blocks.20.mlp.up_proj.bias": "model-1-of-28.safetensors",
184
+ "visual.blocks.20.mlp.up_proj.weight": "model-1-of-28.safetensors",
185
+ "visual.blocks.20.norm1.weight": "model-1-of-28.safetensors",
186
+ "visual.blocks.20.norm2.weight": "model-1-of-28.safetensors",
187
+ "visual.blocks.21.attn.proj.bias": "model-1-of-28.safetensors",
188
+ "visual.blocks.21.attn.proj.weight": "model-1-of-28.safetensors",
189
+ "visual.blocks.21.attn.qkv.bias": "model-1-of-28.safetensors",
190
+ "visual.blocks.21.attn.qkv.weight": "model-1-of-28.safetensors",
191
+ "visual.blocks.21.mlp.down_proj.bias": "model-1-of-28.safetensors",
192
+ "visual.blocks.21.mlp.down_proj.weight": "model-1-of-28.safetensors",
193
+ "visual.blocks.21.mlp.gate_proj.bias": "model-1-of-28.safetensors",
194
+ "visual.blocks.21.mlp.gate_proj.weight": "model-1-of-28.safetensors",
195
+ "visual.blocks.21.mlp.up_proj.bias": "model-1-of-28.safetensors",
196
+ "visual.blocks.21.mlp.up_proj.weight": "model-1-of-28.safetensors",
197
+ "visual.blocks.21.norm1.weight": "model-1-of-28.safetensors",
198
+ "visual.blocks.21.norm2.weight": "model-1-of-28.safetensors",
199
+ "visual.blocks.22.attn.proj.bias": "model-1-of-28.safetensors",
200
+ "visual.blocks.22.attn.proj.weight": "model-1-of-28.safetensors",
201
+ "visual.blocks.22.attn.qkv.bias": "model-1-of-28.safetensors",
202
+ "visual.blocks.22.attn.qkv.weight": "model-1-of-28.safetensors",
203
+ "visual.blocks.22.mlp.down_proj.bias": "model-1-of-28.safetensors",
204
+ "visual.blocks.22.mlp.down_proj.weight": "model-1-of-28.safetensors",
205
+ "visual.blocks.22.mlp.gate_proj.bias": "model-1-of-28.safetensors",
206
+ "visual.blocks.22.mlp.gate_proj.weight": "model-1-of-28.safetensors",
207
+ "visual.blocks.22.mlp.up_proj.bias": "model-1-of-28.safetensors",
208
+ "visual.blocks.22.mlp.up_proj.weight": "model-1-of-28.safetensors",
209
+ "visual.blocks.22.norm1.weight": "model-1-of-28.safetensors",
210
+ "visual.blocks.22.norm2.weight": "model-1-of-28.safetensors",
211
+ "visual.blocks.23.attn.proj.bias": "model-1-of-28.safetensors",
212
+ "visual.blocks.23.attn.proj.weight": "model-1-of-28.safetensors",
213
+ "visual.blocks.23.attn.qkv.bias": "model-1-of-28.safetensors",
214
+ "visual.blocks.23.attn.qkv.weight": "model-1-of-28.safetensors",
215
+ "visual.blocks.23.mlp.down_proj.bias": "model-1-of-28.safetensors",
216
+ "visual.blocks.23.mlp.down_proj.weight": "model-1-of-28.safetensors",
217
+ "visual.blocks.23.mlp.gate_proj.bias": "model-1-of-28.safetensors",
218
+ "visual.blocks.23.mlp.gate_proj.weight": "model-1-of-28.safetensors",
219
+ "visual.blocks.23.mlp.up_proj.bias": "model-1-of-28.safetensors",
220
+ "visual.blocks.23.mlp.up_proj.weight": "model-1-of-28.safetensors",
221
+ "visual.blocks.23.norm1.weight": "model-1-of-28.safetensors",
222
+ "visual.blocks.23.norm2.weight": "model-1-of-28.safetensors",
223
+ "visual.blocks.24.attn.proj.bias": "model-1-of-28.safetensors",
224
+ "visual.blocks.24.attn.proj.weight": "model-1-of-28.safetensors",
225
+ "visual.blocks.24.attn.qkv.bias": "model-1-of-28.safetensors",
226
+ "visual.blocks.24.attn.qkv.weight": "model-1-of-28.safetensors",
227
+ "visual.blocks.24.mlp.down_proj.bias": "model-1-of-28.safetensors",
228
+ "visual.blocks.24.mlp.down_proj.weight": "model-1-of-28.safetensors",
229
+ "visual.blocks.24.mlp.gate_proj.bias": "model-1-of-28.safetensors",
230
+ "visual.blocks.24.mlp.gate_proj.weight": "model-1-of-28.safetensors",
231
+ "visual.blocks.24.mlp.up_proj.bias": "model-1-of-28.safetensors",
232
+ "visual.blocks.24.mlp.up_proj.weight": "model-1-of-28.safetensors",
233
+ "visual.blocks.24.norm1.weight": "model-1-of-28.safetensors",
234
+ "visual.blocks.24.norm2.weight": "model-1-of-28.safetensors",
235
+ "visual.blocks.25.attn.proj.bias": "model-1-of-28.safetensors",
236
+ "visual.blocks.25.attn.proj.weight": "model-1-of-28.safetensors",
237
+ "visual.blocks.25.attn.qkv.bias": "model-1-of-28.safetensors",
238
+ "visual.blocks.25.attn.qkv.weight": "model-1-of-28.safetensors",
239
+ "visual.blocks.25.mlp.down_proj.bias": "model-1-of-28.safetensors",
240
+ "visual.blocks.25.mlp.down_proj.weight": "model-1-of-28.safetensors",
241
+ "visual.blocks.25.mlp.gate_proj.bias": "model-1-of-28.safetensors",
242
+ "visual.blocks.25.mlp.gate_proj.weight": "model-1-of-28.safetensors",
243
+ "visual.blocks.25.mlp.up_proj.bias": "model-1-of-28.safetensors",
244
+ "visual.blocks.25.mlp.up_proj.weight": "model-1-of-28.safetensors",
245
+ "visual.blocks.25.norm1.weight": "model-1-of-28.safetensors",
246
+ "visual.blocks.25.norm2.weight": "model-1-of-28.safetensors",
247
+ "visual.blocks.26.attn.proj.bias": "model-1-of-28.safetensors",
248
+ "visual.blocks.26.attn.proj.weight": "model-1-of-28.safetensors",
249
+ "visual.blocks.26.attn.qkv.bias": "model-1-of-28.safetensors",
250
+ "visual.blocks.26.attn.qkv.weight": "model-1-of-28.safetensors",
251
+ "visual.blocks.26.mlp.down_proj.bias": "model-1-of-28.safetensors",
252
+ "visual.blocks.26.mlp.down_proj.weight": "model-1-of-28.safetensors",
253
+ "visual.blocks.26.mlp.gate_proj.bias": "model-1-of-28.safetensors",
254
+ "visual.blocks.26.mlp.gate_proj.weight": "model-1-of-28.safetensors",
255
+ "visual.blocks.26.mlp.up_proj.bias": "model-1-of-28.safetensors",
256
+ "visual.blocks.26.mlp.up_proj.weight": "model-1-of-28.safetensors",
257
+ "visual.blocks.26.norm1.weight": "model-1-of-28.safetensors",
258
+ "visual.blocks.26.norm2.weight": "model-1-of-28.safetensors",
259
+ "visual.blocks.27.attn.proj.bias": "model-1-of-28.safetensors",
260
+ "visual.blocks.27.attn.proj.weight": "model-1-of-28.safetensors",
261
+ "visual.blocks.27.attn.qkv.bias": "model-1-of-28.safetensors",
262
+ "visual.blocks.27.attn.qkv.weight": "model-1-of-28.safetensors",
263
+ "visual.blocks.27.mlp.down_proj.bias": "model-1-of-28.safetensors",
264
+ "visual.blocks.27.mlp.down_proj.weight": "model-1-of-28.safetensors",
265
+ "visual.blocks.27.mlp.gate_proj.bias": "model-1-of-28.safetensors",
266
+ "visual.blocks.27.mlp.gate_proj.weight": "model-1-of-28.safetensors",
267
+ "visual.blocks.27.mlp.up_proj.bias": "model-1-of-28.safetensors",
268
+ "visual.blocks.27.mlp.up_proj.weight": "model-1-of-28.safetensors",
269
+ "visual.blocks.27.norm1.weight": "model-1-of-28.safetensors",
270
+ "visual.blocks.27.norm2.weight": "model-1-of-28.safetensors",
271
+ "visual.blocks.28.attn.proj.bias": "model-1-of-28.safetensors",
272
+ "visual.blocks.28.attn.proj.weight": "model-1-of-28.safetensors",
273
+ "visual.blocks.28.attn.qkv.bias": "model-1-of-28.safetensors",
274
+ "visual.blocks.28.attn.qkv.weight": "model-1-of-28.safetensors",
275
+ "visual.blocks.28.mlp.down_proj.bias": "model-1-of-28.safetensors",
276
+ "visual.blocks.28.mlp.down_proj.weight": "model-1-of-28.safetensors",
277
+ "visual.blocks.28.mlp.gate_proj.bias": "model-1-of-28.safetensors",
278
+ "visual.blocks.28.mlp.gate_proj.weight": "model-1-of-28.safetensors",
279
+ "visual.blocks.28.mlp.up_proj.bias": "model-1-of-28.safetensors",
280
+ "visual.blocks.28.mlp.up_proj.weight": "model-1-of-28.safetensors",
281
+ "visual.blocks.28.norm1.weight": "model-1-of-28.safetensors",
282
+ "visual.blocks.28.norm2.weight": "model-1-of-28.safetensors",
283
+ "visual.blocks.29.attn.proj.bias": "model-1-of-28.safetensors",
284
+ "visual.blocks.29.attn.proj.weight": "model-1-of-28.safetensors",
285
+ "visual.blocks.29.attn.qkv.bias": "model-1-of-28.safetensors",
286
+ "visual.blocks.29.attn.qkv.weight": "model-1-of-28.safetensors",
287
+ "visual.blocks.29.mlp.down_proj.bias": "model-1-of-28.safetensors",
288
+ "visual.blocks.29.mlp.down_proj.weight": "model-1-of-28.safetensors",
289
+ "visual.blocks.29.mlp.gate_proj.bias": "model-1-of-28.safetensors",
290
+ "visual.blocks.29.mlp.gate_proj.weight": "model-1-of-28.safetensors",
291
+ "visual.blocks.29.mlp.up_proj.bias": "model-1-of-28.safetensors",
292
+ "visual.blocks.29.mlp.up_proj.weight": "model-1-of-28.safetensors",
293
+ "visual.blocks.29.norm1.weight": "model-1-of-28.safetensors",
294
+ "visual.blocks.29.norm2.weight": "model-1-of-28.safetensors",
295
+ "visual.blocks.3.attn.proj.bias": "model-1-of-28.safetensors",
296
+ "visual.blocks.3.attn.proj.weight": "model-1-of-28.safetensors",
297
+ "visual.blocks.3.attn.qkv.bias": "model-1-of-28.safetensors",
298
+ "visual.blocks.3.attn.qkv.weight": "model-1-of-28.safetensors",
299
+ "visual.blocks.3.mlp.down_proj.bias": "model-1-of-28.safetensors",
300
+ "visual.blocks.3.mlp.down_proj.weight": "model-1-of-28.safetensors",
301
+ "visual.blocks.3.mlp.gate_proj.bias": "model-1-of-28.safetensors",
302
+ "visual.blocks.3.mlp.gate_proj.weight": "model-1-of-28.safetensors",
303
+ "visual.blocks.3.mlp.up_proj.bias": "model-1-of-28.safetensors",
304
+ "visual.blocks.3.mlp.up_proj.weight": "model-1-of-28.safetensors",
305
+ "visual.blocks.3.norm1.weight": "model-1-of-28.safetensors",
306
+ "visual.blocks.3.norm2.weight": "model-1-of-28.safetensors",
307
+ "visual.blocks.30.attn.proj.bias": "model-1-of-28.safetensors",
308
+ "visual.blocks.30.attn.proj.weight": "model-1-of-28.safetensors",
309
+ "visual.blocks.30.attn.qkv.bias": "model-1-of-28.safetensors",
310
+ "visual.blocks.30.attn.qkv.weight": "model-1-of-28.safetensors",
311
+ "visual.blocks.30.mlp.down_proj.bias": "model-1-of-28.safetensors",
312
+ "visual.blocks.30.mlp.down_proj.weight": "model-1-of-28.safetensors",
313
+ "visual.blocks.30.mlp.gate_proj.bias": "model-1-of-28.safetensors",
314
+ "visual.blocks.30.mlp.gate_proj.weight": "model-1-of-28.safetensors",
315
+ "visual.blocks.30.mlp.up_proj.bias": "model-1-of-28.safetensors",
316
+ "visual.blocks.30.mlp.up_proj.weight": "model-1-of-28.safetensors",
317
+ "visual.blocks.30.norm1.weight": "model-1-of-28.safetensors",
318
+ "visual.blocks.30.norm2.weight": "model-1-of-28.safetensors",
319
+ "visual.blocks.31.attn.proj.bias": "model-1-of-28.safetensors",
320
+ "visual.blocks.31.attn.proj.weight": "model-1-of-28.safetensors",
321
+ "visual.blocks.31.attn.qkv.bias": "model-1-of-28.safetensors",
322
+ "visual.blocks.31.attn.qkv.weight": "model-1-of-28.safetensors",
323
+ "visual.blocks.31.mlp.down_proj.bias": "model-1-of-28.safetensors",
324
+ "visual.blocks.31.mlp.down_proj.weight": "model-1-of-28.safetensors",
325
+ "visual.blocks.31.mlp.gate_proj.bias": "model-1-of-28.safetensors",
326
+ "visual.blocks.31.mlp.gate_proj.weight": "model-1-of-28.safetensors",
327
+ "visual.blocks.31.mlp.up_proj.bias": "model-1-of-28.safetensors",
328
+ "visual.blocks.31.mlp.up_proj.weight": "model-1-of-28.safetensors",
329
+ "visual.blocks.31.norm1.weight": "model-1-of-28.safetensors",
330
+ "visual.blocks.31.norm2.weight": "model-1-of-28.safetensors",
331
+ "visual.blocks.4.attn.proj.bias": "model-1-of-28.safetensors",
332
+ "visual.blocks.4.attn.proj.weight": "model-1-of-28.safetensors",
333
+ "visual.blocks.4.attn.qkv.bias": "model-1-of-28.safetensors",
334
+ "visual.blocks.4.attn.qkv.weight": "model-1-of-28.safetensors",
335
+ "visual.blocks.4.mlp.down_proj.bias": "model-1-of-28.safetensors",
336
+ "visual.blocks.4.mlp.down_proj.weight": "model-1-of-28.safetensors",
337
+ "visual.blocks.4.mlp.gate_proj.bias": "model-1-of-28.safetensors",
338
+ "visual.blocks.4.mlp.gate_proj.weight": "model-1-of-28.safetensors",
339
+ "visual.blocks.4.mlp.up_proj.bias": "model-1-of-28.safetensors",
340
+ "visual.blocks.4.mlp.up_proj.weight": "model-1-of-28.safetensors",
341
+ "visual.blocks.4.norm1.weight": "model-1-of-28.safetensors",
342
+ "visual.blocks.4.norm2.weight": "model-1-of-28.safetensors",
343
+ "visual.blocks.5.attn.proj.bias": "model-1-of-28.safetensors",
344
+ "visual.blocks.5.attn.proj.weight": "model-1-of-28.safetensors",
345
+ "visual.blocks.5.attn.qkv.bias": "model-1-of-28.safetensors",
346
+ "visual.blocks.5.attn.qkv.weight": "model-1-of-28.safetensors",
347
+ "visual.blocks.5.mlp.down_proj.bias": "model-1-of-28.safetensors",
348
+ "visual.blocks.5.mlp.down_proj.weight": "model-1-of-28.safetensors",
349
+ "visual.blocks.5.mlp.gate_proj.bias": "model-1-of-28.safetensors",
350
+ "visual.blocks.5.mlp.gate_proj.weight": "model-1-of-28.safetensors",
351
+ "visual.blocks.5.mlp.up_proj.bias": "model-1-of-28.safetensors",
352
+ "visual.blocks.5.mlp.up_proj.weight": "model-1-of-28.safetensors",
353
+ "visual.blocks.5.norm1.weight": "model-1-of-28.safetensors",
354
+ "visual.blocks.5.norm2.weight": "model-1-of-28.safetensors",
355
+ "visual.blocks.6.attn.proj.bias": "model-1-of-28.safetensors",
356
+ "visual.blocks.6.attn.proj.weight": "model-1-of-28.safetensors",
357
+ "visual.blocks.6.attn.qkv.bias": "model-1-of-28.safetensors",
358
+ "visual.blocks.6.attn.qkv.weight": "model-1-of-28.safetensors",
359
+ "visual.blocks.6.mlp.down_proj.bias": "model-1-of-28.safetensors",
360
+ "visual.blocks.6.mlp.down_proj.weight": "model-1-of-28.safetensors",
361
+ "visual.blocks.6.mlp.gate_proj.bias": "model-1-of-28.safetensors",
362
+ "visual.blocks.6.mlp.gate_proj.weight": "model-1-of-28.safetensors",
363
+ "visual.blocks.6.mlp.up_proj.bias": "model-1-of-28.safetensors",
364
+ "visual.blocks.6.mlp.up_proj.weight": "model-1-of-28.safetensors",
365
+ "visual.blocks.6.norm1.weight": "model-1-of-28.safetensors",
366
+ "visual.blocks.6.norm2.weight": "model-1-of-28.safetensors",
367
+ "visual.blocks.7.attn.proj.bias": "model-1-of-28.safetensors",
368
+ "visual.blocks.7.attn.proj.weight": "model-1-of-28.safetensors",
369
+ "visual.blocks.7.attn.qkv.bias": "model-1-of-28.safetensors",
370
+ "visual.blocks.7.attn.qkv.weight": "model-1-of-28.safetensors",
371
+ "visual.blocks.7.mlp.down_proj.bias": "model-1-of-28.safetensors",
372
+ "visual.blocks.7.mlp.down_proj.weight": "model-1-of-28.safetensors",
373
+ "visual.blocks.7.mlp.gate_proj.bias": "model-1-of-28.safetensors",
374
+ "visual.blocks.7.mlp.gate_proj.weight": "model-1-of-28.safetensors",
375
+ "visual.blocks.7.mlp.up_proj.bias": "model-1-of-28.safetensors",
376
+ "visual.blocks.7.mlp.up_proj.weight": "model-1-of-28.safetensors",
377
+ "visual.blocks.7.norm1.weight": "model-1-of-28.safetensors",
378
+ "visual.blocks.7.norm2.weight": "model-1-of-28.safetensors",
379
+ "visual.blocks.8.attn.proj.bias": "model-1-of-28.safetensors",
380
+ "visual.blocks.8.attn.proj.weight": "model-1-of-28.safetensors",
381
+ "visual.blocks.8.attn.qkv.bias": "model-1-of-28.safetensors",
382
+ "visual.blocks.8.attn.qkv.weight": "model-1-of-28.safetensors",
383
+ "visual.blocks.8.mlp.down_proj.bias": "model-1-of-28.safetensors",
384
+ "visual.blocks.8.mlp.down_proj.weight": "model-1-of-28.safetensors",
385
+ "visual.blocks.8.mlp.gate_proj.bias": "model-1-of-28.safetensors",
386
+ "visual.blocks.8.mlp.gate_proj.weight": "model-1-of-28.safetensors",
387
+ "visual.blocks.8.mlp.up_proj.bias": "model-1-of-28.safetensors",
388
+ "visual.blocks.8.mlp.up_proj.weight": "model-1-of-28.safetensors",
389
+ "visual.blocks.8.norm1.weight": "model-1-of-28.safetensors",
390
+ "visual.blocks.8.norm2.weight": "model-1-of-28.safetensors",
391
+ "visual.blocks.9.attn.proj.bias": "model-1-of-28.safetensors",
392
+ "visual.blocks.9.attn.proj.weight": "model-1-of-28.safetensors",
393
+ "visual.blocks.9.attn.qkv.bias": "model-1-of-28.safetensors",
394
+ "visual.blocks.9.attn.qkv.weight": "model-1-of-28.safetensors",
395
+ "visual.blocks.9.mlp.down_proj.bias": "model-1-of-28.safetensors",
396
+ "visual.blocks.9.mlp.down_proj.weight": "model-1-of-28.safetensors",
397
+ "visual.blocks.9.mlp.gate_proj.bias": "model-1-of-28.safetensors",
398
+ "visual.blocks.9.mlp.gate_proj.weight": "model-1-of-28.safetensors",
399
+ "visual.blocks.9.mlp.up_proj.bias": "model-1-of-28.safetensors",
400
+ "visual.blocks.9.mlp.up_proj.weight": "model-1-of-28.safetensors",
401
+ "visual.blocks.9.norm1.weight": "model-1-of-28.safetensors",
402
+ "visual.blocks.9.norm2.weight": "model-1-of-28.safetensors",
403
+ "visual.merger.ln_q.weight": "model-1-of-28.safetensors",
404
+ "visual.merger.mlp.0.bias": "model-1-of-28.safetensors",
405
+ "visual.merger.mlp.0.weight": "model-1-of-28.safetensors",
406
+ "visual.merger.mlp.2.bias": "model-1-of-28.safetensors",
407
+ "visual.merger.mlp.2.weight": "model-1-of-28.safetensors",
408
+ "visual.patch_embed.proj.weight": "model-1-of-28.safetensors",
409
+ "model.layers.0.self_attn.rotary_emb.inv_freq": "model-1-of-28.safetensors",
410
+ "model.layers.27.self_attn.q_proj.weight": "model-28-of-28.safetensors",
411
+ "model.layers.27.self_attn.q_proj.bias": "model-28-of-28.safetensors",
412
+ "model.layers.27.self_attn.k_proj.weight": "model-28-of-28.safetensors",
413
+ "model.layers.27.self_attn.k_proj.bias": "model-28-of-28.safetensors",
414
+ "model.layers.27.self_attn.v_proj.weight": "model-28-of-28.safetensors",
415
+ "model.layers.27.self_attn.v_proj.bias": "model-28-of-28.safetensors",
416
+ "model.layers.27.self_attn.o_proj.weight": "model-28-of-28.safetensors",
417
+ "model.layers.27.mlp.gate_proj.weight": "model-28-of-28.safetensors",
418
+ "model.layers.27.mlp.down_proj.weight": "model-28-of-28.safetensors",
419
+ "model.layers.27.mlp.up_proj.weight": "model-28-of-28.safetensors",
420
+ "model.layers.27.input_layernorm.weight": "model-28-of-28.safetensors",
421
+ "model.layers.27.post_attention_layernorm.weight": "model-28-of-28.safetensors",
422
+ "model.norm.weight": "model-28-of-28.safetensors",
423
+ "lm_head.weight": "model-28-of-28.safetensors",
424
+ "model.layers.27.self_attn.rotary_emb.inv_freq": "model-28-of-28.safetensors",
425
+ "model.layers.21.self_attn.q_proj.weight": "model-22-of-28.safetensors",
426
+ "model.layers.21.self_attn.q_proj.bias": "model-22-of-28.safetensors",
427
+ "model.layers.21.self_attn.k_proj.weight": "model-22-of-28.safetensors",
428
+ "model.layers.21.self_attn.k_proj.bias": "model-22-of-28.safetensors",
429
+ "model.layers.21.self_attn.v_proj.weight": "model-22-of-28.safetensors",
430
+ "model.layers.21.self_attn.v_proj.bias": "model-22-of-28.safetensors",
431
+ "model.layers.21.self_attn.o_proj.weight": "model-22-of-28.safetensors",
432
+ "model.layers.21.mlp.gate_proj.weight": "model-22-of-28.safetensors",
433
+ "model.layers.21.mlp.down_proj.weight": "model-22-of-28.safetensors",
434
+ "model.layers.21.mlp.up_proj.weight": "model-22-of-28.safetensors",
435
+ "model.layers.21.input_layernorm.weight": "model-22-of-28.safetensors",
436
+ "model.layers.21.post_attention_layernorm.weight": "model-22-of-28.safetensors",
437
+ "model.layers.21.self_attn.rotary_emb.inv_freq": "model-22-of-28.safetensors",
438
+ "model.layers.19.self_attn.q_proj.weight": "model-20-of-28.safetensors",
439
+ "model.layers.19.self_attn.q_proj.bias": "model-20-of-28.safetensors",
440
+ "model.layers.19.self_attn.k_proj.weight": "model-20-of-28.safetensors",
441
+ "model.layers.19.self_attn.k_proj.bias": "model-20-of-28.safetensors",
442
+ "model.layers.19.self_attn.v_proj.weight": "model-20-of-28.safetensors",
443
+ "model.layers.19.self_attn.v_proj.bias": "model-20-of-28.safetensors",
444
+ "model.layers.19.self_attn.o_proj.weight": "model-20-of-28.safetensors",
445
+ "model.layers.19.mlp.gate_proj.weight": "model-20-of-28.safetensors",
446
+ "model.layers.19.mlp.down_proj.weight": "model-20-of-28.safetensors",
447
+ "model.layers.19.mlp.up_proj.weight": "model-20-of-28.safetensors",
448
+ "model.layers.19.input_layernorm.weight": "model-20-of-28.safetensors",
449
+ "model.layers.19.post_attention_layernorm.weight": "model-20-of-28.safetensors",
450
+ "model.layers.19.self_attn.rotary_emb.inv_freq": "model-20-of-28.safetensors",
451
+ "model.layers.25.self_attn.q_proj.weight": "model-26-of-28.safetensors",
452
+ "model.layers.25.self_attn.q_proj.bias": "model-26-of-28.safetensors",
453
+ "model.layers.25.self_attn.k_proj.weight": "model-26-of-28.safetensors",
454
+ "model.layers.25.self_attn.k_proj.bias": "model-26-of-28.safetensors",
455
+ "model.layers.25.self_attn.v_proj.weight": "model-26-of-28.safetensors",
456
+ "model.layers.25.self_attn.v_proj.bias": "model-26-of-28.safetensors",
457
+ "model.layers.25.self_attn.o_proj.weight": "model-26-of-28.safetensors",
458
+ "model.layers.25.mlp.gate_proj.weight": "model-26-of-28.safetensors",
459
+ "model.layers.25.mlp.down_proj.weight": "model-26-of-28.safetensors",
460
+ "model.layers.25.mlp.up_proj.weight": "model-26-of-28.safetensors",
461
+ "model.layers.25.input_layernorm.weight": "model-26-of-28.safetensors",
462
+ "model.layers.25.post_attention_layernorm.weight": "model-26-of-28.safetensors",
463
+ "model.layers.25.self_attn.rotary_emb.inv_freq": "model-26-of-28.safetensors",
464
+ "model.layers.24.self_attn.q_proj.weight": "model-25-of-28.safetensors",
465
+ "model.layers.24.self_attn.q_proj.bias": "model-25-of-28.safetensors",
466
+ "model.layers.24.self_attn.k_proj.weight": "model-25-of-28.safetensors",
467
+ "model.layers.24.self_attn.k_proj.bias": "model-25-of-28.safetensors",
468
+ "model.layers.24.self_attn.v_proj.weight": "model-25-of-28.safetensors",
469
+ "model.layers.24.self_attn.v_proj.bias": "model-25-of-28.safetensors",
470
+ "model.layers.24.self_attn.o_proj.weight": "model-25-of-28.safetensors",
471
+ "model.layers.24.mlp.gate_proj.weight": "model-25-of-28.safetensors",
472
+ "model.layers.24.mlp.down_proj.weight": "model-25-of-28.safetensors",
473
+ "model.layers.24.mlp.up_proj.weight": "model-25-of-28.safetensors",
474
+ "model.layers.24.input_layernorm.weight": "model-25-of-28.safetensors",
475
+ "model.layers.24.post_attention_layernorm.weight": "model-25-of-28.safetensors",
476
+ "model.layers.24.self_attn.rotary_emb.inv_freq": "model-25-of-28.safetensors",
477
+ "model.layers.13.self_attn.q_proj.weight": "model-14-of-28.safetensors",
478
+ "model.layers.13.self_attn.q_proj.bias": "model-14-of-28.safetensors",
479
+ "model.layers.13.self_attn.k_proj.weight": "model-14-of-28.safetensors",
480
+ "model.layers.13.self_attn.k_proj.bias": "model-14-of-28.safetensors",
481
+ "model.layers.13.self_attn.v_proj.weight": "model-14-of-28.safetensors",
482
+ "model.layers.13.self_attn.v_proj.bias": "model-14-of-28.safetensors",
483
+ "model.layers.13.self_attn.o_proj.weight": "model-14-of-28.safetensors",
484
+ "model.layers.13.mlp.gate_proj.weight": "model-14-of-28.safetensors",
485
+ "model.layers.13.mlp.down_proj.weight": "model-14-of-28.safetensors",
486
+ "model.layers.13.mlp.up_proj.weight": "model-14-of-28.safetensors",
487
+ "model.layers.13.input_layernorm.weight": "model-14-of-28.safetensors",
488
+ "model.layers.13.post_attention_layernorm.weight": "model-14-of-28.safetensors",
489
+ "model.layers.13.self_attn.rotary_emb.inv_freq": "model-14-of-28.safetensors",
490
+ "model.layers.15.self_attn.q_proj.weight": "model-16-of-28.safetensors",
491
+ "model.layers.15.self_attn.q_proj.bias": "model-16-of-28.safetensors",
492
+ "model.layers.15.self_attn.k_proj.weight": "model-16-of-28.safetensors",
493
+ "model.layers.15.self_attn.k_proj.bias": "model-16-of-28.safetensors",
494
+ "model.layers.15.self_attn.v_proj.weight": "model-16-of-28.safetensors",
495
+ "model.layers.15.self_attn.v_proj.bias": "model-16-of-28.safetensors",
496
+ "model.layers.15.self_attn.o_proj.weight": "model-16-of-28.safetensors",
497
+ "model.layers.15.mlp.gate_proj.weight": "model-16-of-28.safetensors",
498
+ "model.layers.15.mlp.down_proj.weight": "model-16-of-28.safetensors",
499
+ "model.layers.15.mlp.up_proj.weight": "model-16-of-28.safetensors",
500
+ "model.layers.15.input_layernorm.weight": "model-16-of-28.safetensors",
501
+ "model.layers.15.post_attention_layernorm.weight": "model-16-of-28.safetensors",
502
+ "model.layers.15.self_attn.rotary_emb.inv_freq": "model-16-of-28.safetensors",
503
+ "model.layers.17.self_attn.q_proj.weight": "model-18-of-28.safetensors",
504
+ "model.layers.17.self_attn.q_proj.bias": "model-18-of-28.safetensors",
505
+ "model.layers.17.self_attn.k_proj.weight": "model-18-of-28.safetensors",
506
+ "model.layers.17.self_attn.k_proj.bias": "model-18-of-28.safetensors",
507
+ "model.layers.17.self_attn.v_proj.weight": "model-18-of-28.safetensors",
508
+ "model.layers.17.self_attn.v_proj.bias": "model-18-of-28.safetensors",
509
+ "model.layers.17.self_attn.o_proj.weight": "model-18-of-28.safetensors",
510
+ "model.layers.17.mlp.gate_proj.weight": "model-18-of-28.safetensors",
511
+ "model.layers.17.mlp.down_proj.weight": "model-18-of-28.safetensors",
512
+ "model.layers.17.mlp.up_proj.weight": "model-18-of-28.safetensors",
513
+ "model.layers.17.input_layernorm.weight": "model-18-of-28.safetensors",
514
+ "model.layers.17.post_attention_layernorm.weight": "model-18-of-28.safetensors",
515
+ "model.layers.17.self_attn.rotary_emb.inv_freq": "model-18-of-28.safetensors",
516
+ "model.layers.20.self_attn.q_proj.weight": "model-21-of-28.safetensors",
517
+ "model.layers.20.self_attn.q_proj.bias": "model-21-of-28.safetensors",
518
+ "model.layers.20.self_attn.k_proj.weight": "model-21-of-28.safetensors",
519
+ "model.layers.20.self_attn.k_proj.bias": "model-21-of-28.safetensors",
520
+ "model.layers.20.self_attn.v_proj.weight": "model-21-of-28.safetensors",
521
+ "model.layers.20.self_attn.v_proj.bias": "model-21-of-28.safetensors",
522
+ "model.layers.20.self_attn.o_proj.weight": "model-21-of-28.safetensors",
523
+ "model.layers.20.mlp.gate_proj.weight": "model-21-of-28.safetensors",
524
+ "model.layers.20.mlp.down_proj.weight": "model-21-of-28.safetensors",
525
+ "model.layers.20.mlp.up_proj.weight": "model-21-of-28.safetensors",
526
+ "model.layers.20.input_layernorm.weight": "model-21-of-28.safetensors",
527
+ "model.layers.20.post_attention_layernorm.weight": "model-21-of-28.safetensors",
528
+ "model.layers.20.self_attn.rotary_emb.inv_freq": "model-21-of-28.safetensors",
529
+ "model.layers.26.self_attn.q_proj.weight": "model-27-of-28.safetensors",
530
+ "model.layers.26.self_attn.q_proj.bias": "model-27-of-28.safetensors",
531
+ "model.layers.26.self_attn.k_proj.weight": "model-27-of-28.safetensors",
532
+ "model.layers.26.self_attn.k_proj.bias": "model-27-of-28.safetensors",
533
+ "model.layers.26.self_attn.v_proj.weight": "model-27-of-28.safetensors",
534
+ "model.layers.26.self_attn.v_proj.bias": "model-27-of-28.safetensors",
535
+ "model.layers.26.self_attn.o_proj.weight": "model-27-of-28.safetensors",
536
+ "model.layers.26.mlp.gate_proj.weight": "model-27-of-28.safetensors",
537
+ "model.layers.26.mlp.down_proj.weight": "model-27-of-28.safetensors",
538
+ "model.layers.26.mlp.up_proj.weight": "model-27-of-28.safetensors",
539
+ "model.layers.26.input_layernorm.weight": "model-27-of-28.safetensors",
540
+ "model.layers.26.post_attention_layernorm.weight": "model-27-of-28.safetensors",
541
+ "model.layers.26.self_attn.rotary_emb.inv_freq": "model-27-of-28.safetensors",
542
+ "model.layers.8.self_attn.q_proj.weight": "model-9-of-28.safetensors",
543
+ "model.layers.8.self_attn.q_proj.bias": "model-9-of-28.safetensors",
544
+ "model.layers.8.self_attn.k_proj.weight": "model-9-of-28.safetensors",
545
+ "model.layers.8.self_attn.k_proj.bias": "model-9-of-28.safetensors",
546
+ "model.layers.8.self_attn.v_proj.weight": "model-9-of-28.safetensors",
547
+ "model.layers.8.self_attn.v_proj.bias": "model-9-of-28.safetensors",
548
+ "model.layers.8.self_attn.o_proj.weight": "model-9-of-28.safetensors",
549
+ "model.layers.8.mlp.gate_proj.weight": "model-9-of-28.safetensors",
550
+ "model.layers.8.mlp.down_proj.weight": "model-9-of-28.safetensors",
551
+ "model.layers.8.mlp.up_proj.weight": "model-9-of-28.safetensors",
552
+ "model.layers.8.input_layernorm.weight": "model-9-of-28.safetensors",
553
+ "model.layers.8.post_attention_layernorm.weight": "model-9-of-28.safetensors",
554
+ "model.layers.8.self_attn.rotary_emb.inv_freq": "model-9-of-28.safetensors",
555
+ "model.layers.22.self_attn.q_proj.weight": "model-23-of-28.safetensors",
556
+ "model.layers.22.self_attn.q_proj.bias": "model-23-of-28.safetensors",
557
+ "model.layers.22.self_attn.k_proj.weight": "model-23-of-28.safetensors",
558
+ "model.layers.22.self_attn.k_proj.bias": "model-23-of-28.safetensors",
559
+ "model.layers.22.self_attn.v_proj.weight": "model-23-of-28.safetensors",
560
+ "model.layers.22.self_attn.v_proj.bias": "model-23-of-28.safetensors",
561
+ "model.layers.22.self_attn.o_proj.weight": "model-23-of-28.safetensors",
562
+ "model.layers.22.mlp.gate_proj.weight": "model-23-of-28.safetensors",
563
+ "model.layers.22.mlp.down_proj.weight": "model-23-of-28.safetensors",
564
+ "model.layers.22.mlp.up_proj.weight": "model-23-of-28.safetensors",
565
+ "model.layers.22.input_layernorm.weight": "model-23-of-28.safetensors",
566
+ "model.layers.22.post_attention_layernorm.weight": "model-23-of-28.safetensors",
567
+ "model.layers.22.self_attn.rotary_emb.inv_freq": "model-23-of-28.safetensors",
568
+ "model.layers.16.self_attn.q_proj.weight": "model-17-of-28.safetensors",
569
+ "model.layers.16.self_attn.q_proj.bias": "model-17-of-28.safetensors",
570
+ "model.layers.16.self_attn.k_proj.weight": "model-17-of-28.safetensors",
571
+ "model.layers.16.self_attn.k_proj.bias": "model-17-of-28.safetensors",
572
+ "model.layers.16.self_attn.v_proj.weight": "model-17-of-28.safetensors",
573
+ "model.layers.16.self_attn.v_proj.bias": "model-17-of-28.safetensors",
574
+ "model.layers.16.self_attn.o_proj.weight": "model-17-of-28.safetensors",
575
+ "model.layers.16.mlp.gate_proj.weight": "model-17-of-28.safetensors",
576
+ "model.layers.16.mlp.down_proj.weight": "model-17-of-28.safetensors",
577
+ "model.layers.16.mlp.up_proj.weight": "model-17-of-28.safetensors",
578
+ "model.layers.16.input_layernorm.weight": "model-17-of-28.safetensors",
579
+ "model.layers.16.post_attention_layernorm.weight": "model-17-of-28.safetensors",
580
+ "model.layers.16.self_attn.rotary_emb.inv_freq": "model-17-of-28.safetensors",
581
+ "model.layers.18.self_attn.q_proj.weight": "model-19-of-28.safetensors",
582
+ "model.layers.18.self_attn.q_proj.bias": "model-19-of-28.safetensors",
583
+ "model.layers.18.self_attn.k_proj.weight": "model-19-of-28.safetensors",
584
+ "model.layers.18.self_attn.k_proj.bias": "model-19-of-28.safetensors",
585
+ "model.layers.18.self_attn.v_proj.weight": "model-19-of-28.safetensors",
586
+ "model.layers.18.self_attn.v_proj.bias": "model-19-of-28.safetensors",
587
+ "model.layers.18.self_attn.o_proj.weight": "model-19-of-28.safetensors",
588
+ "model.layers.18.mlp.gate_proj.weight": "model-19-of-28.safetensors",
589
+ "model.layers.18.mlp.down_proj.weight": "model-19-of-28.safetensors",
590
+ "model.layers.18.mlp.up_proj.weight": "model-19-of-28.safetensors",
591
+ "model.layers.18.input_layernorm.weight": "model-19-of-28.safetensors",
592
+ "model.layers.18.post_attention_layernorm.weight": "model-19-of-28.safetensors",
593
+ "model.layers.18.self_attn.rotary_emb.inv_freq": "model-19-of-28.safetensors",
594
+ "model.layers.12.self_attn.q_proj.weight": "model-13-of-28.safetensors",
595
+ "model.layers.12.self_attn.q_proj.bias": "model-13-of-28.safetensors",
596
+ "model.layers.12.self_attn.k_proj.weight": "model-13-of-28.safetensors",
597
+ "model.layers.12.self_attn.k_proj.bias": "model-13-of-28.safetensors",
598
+ "model.layers.12.self_attn.v_proj.weight": "model-13-of-28.safetensors",
599
+ "model.layers.12.self_attn.v_proj.bias": "model-13-of-28.safetensors",
600
+ "model.layers.12.self_attn.o_proj.weight": "model-13-of-28.safetensors",
601
+ "model.layers.12.mlp.gate_proj.weight": "model-13-of-28.safetensors",
602
+ "model.layers.12.mlp.down_proj.weight": "model-13-of-28.safetensors",
603
+ "model.layers.12.mlp.up_proj.weight": "model-13-of-28.safetensors",
604
+ "model.layers.12.input_layernorm.weight": "model-13-of-28.safetensors",
605
+ "model.layers.12.post_attention_layernorm.weight": "model-13-of-28.safetensors",
606
+ "model.layers.12.self_attn.rotary_emb.inv_freq": "model-13-of-28.safetensors",
607
+ "model.layers.10.self_attn.q_proj.weight": "model-11-of-28.safetensors",
608
+ "model.layers.10.self_attn.q_proj.bias": "model-11-of-28.safetensors",
609
+ "model.layers.10.self_attn.k_proj.weight": "model-11-of-28.safetensors",
610
+ "model.layers.10.self_attn.k_proj.bias": "model-11-of-28.safetensors",
611
+ "model.layers.10.self_attn.v_proj.weight": "model-11-of-28.safetensors",
612
+ "model.layers.10.self_attn.v_proj.bias": "model-11-of-28.safetensors",
613
+ "model.layers.10.self_attn.o_proj.weight": "model-11-of-28.safetensors",
614
+ "model.layers.10.mlp.gate_proj.weight": "model-11-of-28.safetensors",
615
+ "model.layers.10.mlp.down_proj.weight": "model-11-of-28.safetensors",
616
+ "model.layers.10.mlp.up_proj.weight": "model-11-of-28.safetensors",
617
+ "model.layers.10.input_layernorm.weight": "model-11-of-28.safetensors",
618
+ "model.layers.10.post_attention_layernorm.weight": "model-11-of-28.safetensors",
619
+ "model.layers.10.self_attn.rotary_emb.inv_freq": "model-11-of-28.safetensors",
620
+ "model.layers.23.self_attn.q_proj.weight": "model-24-of-28.safetensors",
621
+ "model.layers.23.self_attn.q_proj.bias": "model-24-of-28.safetensors",
622
+ "model.layers.23.self_attn.k_proj.weight": "model-24-of-28.safetensors",
623
+ "model.layers.23.self_attn.k_proj.bias": "model-24-of-28.safetensors",
624
+ "model.layers.23.self_attn.v_proj.weight": "model-24-of-28.safetensors",
625
+ "model.layers.23.self_attn.v_proj.bias": "model-24-of-28.safetensors",
626
+ "model.layers.23.self_attn.o_proj.weight": "model-24-of-28.safetensors",
627
+ "model.layers.23.mlp.gate_proj.weight": "model-24-of-28.safetensors",
628
+ "model.layers.23.mlp.down_proj.weight": "model-24-of-28.safetensors",
629
+ "model.layers.23.mlp.up_proj.weight": "model-24-of-28.safetensors",
630
+ "model.layers.23.input_layernorm.weight": "model-24-of-28.safetensors",
631
+ "model.layers.23.post_attention_layernorm.weight": "model-24-of-28.safetensors",
632
+ "model.layers.23.self_attn.rotary_emb.inv_freq": "model-24-of-28.safetensors",
633
+ "model.layers.11.self_attn.q_proj.weight": "model-12-of-28.safetensors",
634
+ "model.layers.11.self_attn.q_proj.bias": "model-12-of-28.safetensors",
635
+ "model.layers.11.self_attn.k_proj.weight": "model-12-of-28.safetensors",
636
+ "model.layers.11.self_attn.k_proj.bias": "model-12-of-28.safetensors",
637
+ "model.layers.11.self_attn.v_proj.weight": "model-12-of-28.safetensors",
638
+ "model.layers.11.self_attn.v_proj.bias": "model-12-of-28.safetensors",
639
+ "model.layers.11.self_attn.o_proj.weight": "model-12-of-28.safetensors",
640
+ "model.layers.11.mlp.gate_proj.weight": "model-12-of-28.safetensors",
641
+ "model.layers.11.mlp.down_proj.weight": "model-12-of-28.safetensors",
642
+ "model.layers.11.mlp.up_proj.weight": "model-12-of-28.safetensors",
643
+ "model.layers.11.input_layernorm.weight": "model-12-of-28.safetensors",
644
+ "model.layers.11.post_attention_layernorm.weight": "model-12-of-28.safetensors",
645
+ "model.layers.11.self_attn.rotary_emb.inv_freq": "model-12-of-28.safetensors",
646
+ "model.layers.3.self_attn.q_proj.weight": "model-4-of-28.safetensors",
647
+ "model.layers.3.self_attn.q_proj.bias": "model-4-of-28.safetensors",
648
+ "model.layers.3.self_attn.k_proj.weight": "model-4-of-28.safetensors",
649
+ "model.layers.3.self_attn.k_proj.bias": "model-4-of-28.safetensors",
650
+ "model.layers.3.self_attn.v_proj.weight": "model-4-of-28.safetensors",
651
+ "model.layers.3.self_attn.v_proj.bias": "model-4-of-28.safetensors",
652
+ "model.layers.3.self_attn.o_proj.weight": "model-4-of-28.safetensors",
653
+ "model.layers.3.mlp.gate_proj.weight": "model-4-of-28.safetensors",
654
+ "model.layers.3.mlp.down_proj.weight": "model-4-of-28.safetensors",
655
+ "model.layers.3.mlp.up_proj.weight": "model-4-of-28.safetensors",
656
+ "model.layers.3.input_layernorm.weight": "model-4-of-28.safetensors",
657
+ "model.layers.3.post_attention_layernorm.weight": "model-4-of-28.safetensors",
658
+ "model.layers.3.self_attn.rotary_emb.inv_freq": "model-4-of-28.safetensors",
659
+ "model.layers.14.self_attn.q_proj.weight": "model-15-of-28.safetensors",
660
+ "model.layers.14.self_attn.q_proj.bias": "model-15-of-28.safetensors",
661
+ "model.layers.14.self_attn.k_proj.weight": "model-15-of-28.safetensors",
662
+ "model.layers.14.self_attn.k_proj.bias": "model-15-of-28.safetensors",
663
+ "model.layers.14.self_attn.v_proj.weight": "model-15-of-28.safetensors",
664
+ "model.layers.14.self_attn.v_proj.bias": "model-15-of-28.safetensors",
665
+ "model.layers.14.self_attn.o_proj.weight": "model-15-of-28.safetensors",
666
+ "model.layers.14.mlp.gate_proj.weight": "model-15-of-28.safetensors",
667
+ "model.layers.14.mlp.down_proj.weight": "model-15-of-28.safetensors",
668
+ "model.layers.14.mlp.up_proj.weight": "model-15-of-28.safetensors",
669
+ "model.layers.14.input_layernorm.weight": "model-15-of-28.safetensors",
670
+ "model.layers.14.post_attention_layernorm.weight": "model-15-of-28.safetensors",
671
+ "model.layers.14.self_attn.rotary_emb.inv_freq": "model-15-of-28.safetensors",
672
+ "model.layers.6.self_attn.q_proj.weight": "model-7-of-28.safetensors",
673
+ "model.layers.6.self_attn.q_proj.bias": "model-7-of-28.safetensors",
674
+ "model.layers.6.self_attn.k_proj.weight": "model-7-of-28.safetensors",
675
+ "model.layers.6.self_attn.k_proj.bias": "model-7-of-28.safetensors",
676
+ "model.layers.6.self_attn.v_proj.weight": "model-7-of-28.safetensors",
677
+ "model.layers.6.self_attn.v_proj.bias": "model-7-of-28.safetensors",
678
+ "model.layers.6.self_attn.o_proj.weight": "model-7-of-28.safetensors",
679
+ "model.layers.6.mlp.gate_proj.weight": "model-7-of-28.safetensors",
680
+ "model.layers.6.mlp.down_proj.weight": "model-7-of-28.safetensors",
681
+ "model.layers.6.mlp.up_proj.weight": "model-7-of-28.safetensors",
682
+ "model.layers.6.input_layernorm.weight": "model-7-of-28.safetensors",
683
+ "model.layers.6.post_attention_layernorm.weight": "model-7-of-28.safetensors",
684
+ "model.layers.6.self_attn.rotary_emb.inv_freq": "model-7-of-28.safetensors",
685
+ "model.layers.5.self_attn.q_proj.weight": "model-6-of-28.safetensors",
686
+ "model.layers.5.self_attn.q_proj.bias": "model-6-of-28.safetensors",
687
+ "model.layers.5.self_attn.k_proj.weight": "model-6-of-28.safetensors",
688
+ "model.layers.5.self_attn.k_proj.bias": "model-6-of-28.safetensors",
689
+ "model.layers.5.self_attn.v_proj.weight": "model-6-of-28.safetensors",
690
+ "model.layers.5.self_attn.v_proj.bias": "model-6-of-28.safetensors",
691
+ "model.layers.5.self_attn.o_proj.weight": "model-6-of-28.safetensors",
692
+ "model.layers.5.mlp.gate_proj.weight": "model-6-of-28.safetensors",
693
+ "model.layers.5.mlp.down_proj.weight": "model-6-of-28.safetensors",
694
+ "model.layers.5.mlp.up_proj.weight": "model-6-of-28.safetensors",
695
+ "model.layers.5.input_layernorm.weight": "model-6-of-28.safetensors",
696
+ "model.layers.5.post_attention_layernorm.weight": "model-6-of-28.safetensors",
697
+ "model.layers.5.self_attn.rotary_emb.inv_freq": "model-6-of-28.safetensors",
698
+ "model.layers.4.self_attn.q_proj.weight": "model-5-of-28.safetensors",
699
+ "model.layers.4.self_attn.q_proj.bias": "model-5-of-28.safetensors",
700
+ "model.layers.4.self_attn.k_proj.weight": "model-5-of-28.safetensors",
701
+ "model.layers.4.self_attn.k_proj.bias": "model-5-of-28.safetensors",
702
+ "model.layers.4.self_attn.v_proj.weight": "model-5-of-28.safetensors",
703
+ "model.layers.4.self_attn.v_proj.bias": "model-5-of-28.safetensors",
704
+ "model.layers.4.self_attn.o_proj.weight": "model-5-of-28.safetensors",
705
+ "model.layers.4.mlp.gate_proj.weight": "model-5-of-28.safetensors",
706
+ "model.layers.4.mlp.down_proj.weight": "model-5-of-28.safetensors",
707
+ "model.layers.4.mlp.up_proj.weight": "model-5-of-28.safetensors",
708
+ "model.layers.4.input_layernorm.weight": "model-5-of-28.safetensors",
709
+ "model.layers.4.post_attention_layernorm.weight": "model-5-of-28.safetensors",
710
+ "model.layers.4.self_attn.rotary_emb.inv_freq": "model-5-of-28.safetensors",
711
+ "model.layers.2.self_attn.q_proj.weight": "model-3-of-28.safetensors",
712
+ "model.layers.2.self_attn.q_proj.bias": "model-3-of-28.safetensors",
713
+ "model.layers.2.self_attn.k_proj.weight": "model-3-of-28.safetensors",
714
+ "model.layers.2.self_attn.k_proj.bias": "model-3-of-28.safetensors",
715
+ "model.layers.2.self_attn.v_proj.weight": "model-3-of-28.safetensors",
716
+ "model.layers.2.self_attn.v_proj.bias": "model-3-of-28.safetensors",
717
+ "model.layers.2.self_attn.o_proj.weight": "model-3-of-28.safetensors",
718
+ "model.layers.2.mlp.gate_proj.weight": "model-3-of-28.safetensors",
719
+ "model.layers.2.mlp.down_proj.weight": "model-3-of-28.safetensors",
720
+ "model.layers.2.mlp.up_proj.weight": "model-3-of-28.safetensors",
721
+ "model.layers.2.input_layernorm.weight": "model-3-of-28.safetensors",
722
+ "model.layers.2.post_attention_layernorm.weight": "model-3-of-28.safetensors",
723
+ "model.layers.2.self_attn.rotary_emb.inv_freq": "model-3-of-28.safetensors",
724
+ "model.layers.7.self_attn.q_proj.weight": "model-8-of-28.safetensors",
725
+ "model.layers.7.self_attn.q_proj.bias": "model-8-of-28.safetensors",
726
+ "model.layers.7.self_attn.k_proj.weight": "model-8-of-28.safetensors",
727
+ "model.layers.7.self_attn.k_proj.bias": "model-8-of-28.safetensors",
728
+ "model.layers.7.self_attn.v_proj.weight": "model-8-of-28.safetensors",
729
+ "model.layers.7.self_attn.v_proj.bias": "model-8-of-28.safetensors",
730
+ "model.layers.7.self_attn.o_proj.weight": "model-8-of-28.safetensors",
731
+ "model.layers.7.mlp.gate_proj.weight": "model-8-of-28.safetensors",
732
+ "model.layers.7.mlp.down_proj.weight": "model-8-of-28.safetensors",
733
+ "model.layers.7.mlp.up_proj.weight": "model-8-of-28.safetensors",
734
+ "model.layers.7.input_layernorm.weight": "model-8-of-28.safetensors",
735
+ "model.layers.7.post_attention_layernorm.weight": "model-8-of-28.safetensors",
736
+ "model.layers.7.self_attn.rotary_emb.inv_freq": "model-8-of-28.safetensors",
737
+ "model.layers.9.self_attn.q_proj.weight": "model-10-of-28.safetensors",
738
+ "model.layers.9.self_attn.q_proj.bias": "model-10-of-28.safetensors",
739
+ "model.layers.9.self_attn.k_proj.weight": "model-10-of-28.safetensors",
740
+ "model.layers.9.self_attn.k_proj.bias": "model-10-of-28.safetensors",
741
+ "model.layers.9.self_attn.v_proj.weight": "model-10-of-28.safetensors",
742
+ "model.layers.9.self_attn.v_proj.bias": "model-10-of-28.safetensors",
743
+ "model.layers.9.self_attn.o_proj.weight": "model-10-of-28.safetensors",
744
+ "model.layers.9.mlp.gate_proj.weight": "model-10-of-28.safetensors",
745
+ "model.layers.9.mlp.down_proj.weight": "model-10-of-28.safetensors",
746
+ "model.layers.9.mlp.up_proj.weight": "model-10-of-28.safetensors",
747
+ "model.layers.9.input_layernorm.weight": "model-10-of-28.safetensors",
748
+ "model.layers.9.post_attention_layernorm.weight": "model-10-of-28.safetensors",
749
+ "model.layers.9.self_attn.rotary_emb.inv_freq": "model-10-of-28.safetensors",
750
+ "model.layers.1.self_attn.q_proj.weight": "model-2-of-28.safetensors",
751
+ "model.layers.1.self_attn.q_proj.bias": "model-2-of-28.safetensors",
752
+ "model.layers.1.self_attn.k_proj.weight": "model-2-of-28.safetensors",
753
+ "model.layers.1.self_attn.k_proj.bias": "model-2-of-28.safetensors",
754
+ "model.layers.1.self_attn.v_proj.weight": "model-2-of-28.safetensors",
755
+ "model.layers.1.self_attn.v_proj.bias": "model-2-of-28.safetensors",
756
+ "model.layers.1.self_attn.o_proj.weight": "model-2-of-28.safetensors",
757
+ "model.layers.1.mlp.gate_proj.weight": "model-2-of-28.safetensors",
758
+ "model.layers.1.mlp.down_proj.weight": "model-2-of-28.safetensors",
759
+ "model.layers.1.mlp.up_proj.weight": "model-2-of-28.safetensors",
760
+ "model.layers.1.input_layernorm.weight": "model-2-of-28.safetensors",
761
+ "model.layers.1.post_attention_layernorm.weight": "model-2-of-28.safetensors",
762
+ "model.layers.1.self_attn.rotary_emb.inv_freq": "model-2-of-28.safetensors"
763
+ }
764
+ }
modeling_opencua.py ADDED
@@ -0,0 +1,449 @@
1
+ # ------------------------------------------------------------------------------
2
+ # OpenCUA‑7B Model
3
+ #
4
+ # This implementation is adapted from the Qwen2.5-VL reference code in
5
+ # Hugging Face Transformers v4.53.0:
6
+ # https://github.com/huggingface/transformers/tree/v4.53.0/src/transformers/models/qwen2_5_vl
7
+ #
8
+ # Checkpoint used for weight initialisation:
9
+ # "Qwen/Qwen2.5-VL-7B-Instruct" – https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct
10
+ #
11
+ # Key modifications
12
+ # -----------------
13
+ # • Replaced Multimodal Rotary Position Embedding (M‑RoPE) with 1‑D RoPE for
14
+ # compatibility with OpenCUA training settings.
15
+ # • Wrapped vision encoder and language model into a single
16
+ # `OpenCUAForConditionalGeneration` class.
17
+ # • Simplified weight initialisation — this file targets inference / fine‑tuning,
18
+ # not training from scratch.
19
+ #
20
+ # Copyright (c) 2025 XLANG Lab, The University of Hong Kong
21
+ #
22
+ # Permission is hereby granted, free of charge, to any person obtaining a copy
23
+ # of this software and associated documentation files (the “Software”), to deal
24
+ # in the Software without restriction, including without limitation the rights
25
+ # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
26
+ # copies of the Software, and to permit persons to whom the Software is
27
+ # furnished to do so, subject to the following conditions:
28
+ #
29
+ # The above copyright notice and this permission notice shall be included in all
30
+ # copies or substantial portions of the Software.
31
+ #
32
+ # THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
33
+ # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
34
+ # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
35
+ # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
36
+ # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
37
+ # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
38
+ # SOFTWARE.
39
+ #
40
+ # ------------------------------------------------------------------------------
41
+ # Prohibited Uses & Additional Disclaimer
42
+ # ---------------------------------------
43
+ # • The Software may **not** be used for any purpose or activity that violates
44
+ # applicable laws or regulations in any jurisdiction.
45
+ # • The authors, contributors, and copyright holders are **not responsible**
46
+ # for any illegal, unethical, or harmful use of the Software, nor for any
47
+ # direct or indirect damages resulting from such use.
48
+ # • Use of the “OpenCUA” name, logo, or trademarks does **not** imply any
49
+ # endorsement or affiliation unless a separate written permission is obtained.
50
+
51
+ import torch
52
+ import torch.nn as nn
53
+ from transformers.cache_utils import Cache
54
+ from transformers.modeling_utils import PreTrainedModel
55
+ from transformers.models.llava.modeling_llava import LlavaCausalLMOutputWithPast
56
+
57
+ from .configuration_opencua import OpenCUAConfig
58
+ from transformers.models.qwen2_5_vl.modeling_qwen2_5_vl import Qwen2_5_VisionTransformerPretrainedModel
59
+ from transformers.models.qwen2.modeling_qwen2 import Qwen2ForCausalLM
60
+
61
+
62
+ class OpenCUAPreTrainedModel(PreTrainedModel):
63
+ config_class = OpenCUAConfig
64
+ base_model_prefix = "model"
65
+ _no_split_modules = ["Qwen2_5_VisionTransformerPretrainedModel"]
66
+ _skip_keys_device_placement = "past_key_values"
67
+ _supports_flash_attn_2 = True
68
+
69
+ def _init_weights(self, module):
70
+ # important: this ported version of Llava isn't meant for training from scratch - only
71
+ # inference and fine-tuning - so the proper init weights code has been removed - the original codebase
72
+ # https://github.com/haotian-liu/LLaVA/tree/main/llava should serve for that purpose
73
+ std = (
74
+ self.config.initializer_range
75
+ if hasattr(self.config, "initializer_range")
76
+ else self.config.text_config.initializer_range
77
+ )
78
+
79
+ if hasattr(module, "class_embedding"):
80
+ module.class_embedding.data.normal_(mean=0.0, std=std)
81
+
82
+ if isinstance(module, (nn.Linear, nn.Conv2d)):
83
+ module.weight.data.normal_(mean=0.0, std=std)
84
+ if module.bias is not None:
85
+ module.bias.data.zero_()
86
+ elif isinstance(module, nn.Embedding):
87
+ module.weight.data.normal_(mean=0.0, std=std)
88
+ if module.padding_idx is not None:
89
+ module.weight.data[module.padding_idx].zero_()
90
+
91
+ @property
92
+ def _supports_sdpa(self):
93
+ """
94
+ Retrieve language_model's attribute to check whether the model supports
95
+ SDPA or not.
96
+ """
97
+ return self.language_model._supports_sdpa
98
+
99
+
100
+ class OpenCUAForConditionalGeneration(OpenCUAPreTrainedModel):
101
+
102
+ def __init__(self, config: OpenCUAConfig):
103
+ super().__init__(config)
104
+ self.vision_tower = Qwen2_5_VisionTransformerPretrainedModel(config.vision_config)
105
+ self.language_model = Qwen2ForCausalLM(config.text_config)
106
+ self.post_init()
107
+
108
+ def get_input_embeddings(self):
109
+ return self.language_model.get_input_embeddings()
110
+
111
+ def set_input_embeddings(self, value):
112
+ self.language_model.set_input_embeddings(value)
113
+
114
+ def get_output_embeddings(self):
115
+ return self.language_model.get_output_embeddings()
116
+
117
+ def set_output_embeddings(self, new_embeddings):
118
+ self.language_model.set_output_embeddings(new_embeddings)
119
+
120
+ def set_decoder(self, decoder):
121
+ self.language_model.set_decoder(decoder)
122
+
123
+ def get_decoder(self):
124
+ return self.language_model.get_decoder()
125
+
126
+ def tie_weights(self):
127
+ return self.language_model.tie_weights()
128
+
129
+ def resize_token_embeddings(self, new_num_tokens: int | None = None, pad_to_multiple_of=None) -> nn.Embedding:
130
+ model_embeds = self.language_model.resize_token_embeddings(
131
+ new_num_tokens, pad_to_multiple_of)
132
+ # update vocab size
133
+ self.config.text_config.vocab_size = model_embeds.num_embeddings
134
+ self.vocab_size = model_embeds.num_embeddings
135
+ return model_embeds
136
+
137
+ def _merge_input_ids_with_image_features(
138
+ self,
139
+ image_features: torch.Tensor,
140
+ feature_lengths: list[int],
141
+ inputs_embeds: torch.Tensor,
142
+ input_ids: torch.Tensor,
143
+ attention_mask: torch.Tensor,
144
+ labels: torch.Tensor | None = None):
145
+ """
146
+ Args:
147
+ image_features (:obj:`torch.Tensor` of shape :obj:`(num_image_tokens, embed_dim)`):
148
+ The image features to merge with the input embeddings.
149
+ feature_lengths: the length of image feature.
150
+ inputs_embeds (:obj:`torch.Tensor` of shape :obj:`(batch_size, sequence_length, embed_dim)`):
151
+ The input embeddings.
152
+ input_ids (:obj:`torch.Tensor` of shape :obj:`(batch_size, sequence_length)`):
153
+ The input ids.
154
+ attention_mask (:obj:`torch.Tensor` of shape :obj:`(batch_size, sequence_length)`):
155
+ The attention mask.
156
+ labels (:obj:`torch.Tensor` of shape :obj:`(batch_size, sequence_length)`, *optional*):
157
+ The labels.
158
+ """
159
+
160
+ image_token_index: int = self.config.media_placeholder_token_id
161
+ pad_token_id: int = self.config.pad_token_id
162
+ ignore_index: int = self.config.ignore_index
163
+
164
+ _, embed_dim = image_features.shape
165
+
166
+ batch_size, sequence_length = input_ids.shape
167
+ left_padding = not torch.sum(
168
+ input_ids[:, -1] == torch.tensor(pad_token_id))
169
+
170
+ # 1. Create a mask to know where special image tokens are
171
+ _token_occupation_table = torch.ones_like(input_ids.flatten())
172
+ _token_occupation_table[input_ids.flatten() == image_token_index] = \
173
+ torch.tensor(feature_lengths,
174
+ dtype=torch.long, device=input_ids.device)
175
+ _token_occupation_table = _token_occupation_table.reshape(
176
+ input_ids.shape)
177
+
178
+ max_embed_dim = _token_occupation_table.sum(-1).max().item()
179
+ assert max_embed_dim >= sequence_length, (
180
+ f"The maximum embedding dimension ({max_embed_dim}) is less than the sequence length ({sequence_length})"
181
+ )
182
+ batch_indices, non_image_indices = torch.where(input_ids != image_token_index)
183
+
184
+ # 2. Compute the positions where text should be written
185
+ # Calculate new positions for text tokens in merged image-text sequence.
186
+ new_token_positions = torch.cumsum(_token_occupation_table, -1) - 1
187
+ nb_image_pad = max_embed_dim - 1 - new_token_positions[:, -1]
188
+ if left_padding:
189
+ new_token_positions += nb_image_pad[:, None] # offset for left padding
190
+ text_to_overwrite = new_token_positions[batch_indices, non_image_indices]
191
+
192
+ # 3. Create the full embedding, already padded to the maximum position
193
+ final_embedding = torch.zeros(
194
+ batch_size, max_embed_dim, embed_dim, dtype=inputs_embeds.dtype, device=inputs_embeds.device
195
+ )
196
+ final_attention_mask = torch.zeros(
197
+ batch_size, max_embed_dim, dtype=attention_mask.dtype, device=inputs_embeds.device
198
+ )
199
+ if labels is not None:
200
+ final_labels = torch.full(
201
+ (batch_size, max_embed_dim), ignore_index, dtype=input_ids.dtype, device=input_ids.device
202
+ )
203
+ # In case the Vision model or the Language model has been offloaded to CPU, we need to manually
204
+ # set the corresponding tensors into their correct target device.
205
+ target_device = inputs_embeds.device
206
+ batch_indices, non_image_indices, text_to_overwrite = (
207
+ batch_indices.to(target_device),
208
+ non_image_indices.to(target_device),
209
+ text_to_overwrite.to(target_device),
210
+ )
211
+ attention_mask = attention_mask.to(target_device)
212
+
213
+ # 4. Fill the embeddings based on the mask.
214
+ final_embedding[batch_indices, text_to_overwrite] = inputs_embeds[batch_indices, non_image_indices]
215
+ final_attention_mask[batch_indices, text_to_overwrite] = attention_mask[batch_indices, non_image_indices]
216
+ if labels is not None:
217
+ final_labels[batch_indices, text_to_overwrite] = labels[batch_indices, non_image_indices]
218
+
219
+ # 5. Fill the embeddings corresponding to the images. Anything that is not `text_positions` needs filling (#29835)
220
+ image_to_overwrite = torch.full(
221
+ (batch_size, max_embed_dim), True, dtype=torch.bool, device=inputs_embeds.device
222
+ )
223
+ image_to_overwrite[batch_indices, text_to_overwrite] = False
224
+ image_to_overwrite &= image_to_overwrite.cumsum(-1) - 1 >= nb_image_pad[:, None].to(target_device)
225
+
226
+ if image_to_overwrite.sum() != image_features.shape[:-1].numel():
227
+ raise ValueError(
228
+ f"The input provided to the model are wrong. The number of image tokens is {image_to_overwrite.sum()} while"
229
+ f" the number of image features given to the model is {image_features.shape[:-1].numel()}. "
230
+ "This prevents correct indexing and breaks batch generation."
231
+ )
232
+
233
+ final_embedding[image_to_overwrite] = image_features.contiguous().reshape(-1, embed_dim).to(target_device)
234
+ final_attention_mask |= image_to_overwrite
235
+ position_ids = (final_attention_mask.cumsum(-1) - 1).masked_fill_((final_attention_mask == 0), 1)
236
+
237
+ # 6. Mask out the embedding at padding positions, as we later use the past_key_value value to determine the non-attended tokens.
238
+ batch_indices, pad_indices = torch.where(input_ids == pad_token_id)
239
+ indices_to_mask = new_token_positions[batch_indices, pad_indices]
240
+
241
+ final_embedding[batch_indices, indices_to_mask] = 0
242
+
243
+ if labels is None:
244
+ final_labels = None
245
+
246
+ return final_embedding, final_attention_mask, final_labels, position_ids
247
+
248
+ def _extract_image_features(self,
249
+ pixel_values: torch.FloatTensor | list[torch.FloatTensor],
250
+ grid_thws: torch.FloatTensor,
251
+ ):
252
+ """
253
+ Args:
254
+ pixel_values (:obj:`torch.FloatTensor` of shape :obj:`(sum_num_image_tokens, channels)`):
255
+ The pixel values of the images processed by image processor.
256
+ grid_thws: (B,3)
257
+
258
+ Returns:
259
+ selected_image_feature (:obj:`torch.FloatTensor` of shape :obj:`(num_image_tokens, embed_dim)`):
260
+ The selected image features to use as input to the projector head.
261
+
262
+ """
263
+
264
+ assert len(grid_thws.shape)==2 and grid_thws.shape[1]==3, f"grid_thws must be a 2D tensor with shape (batched, 3), but got {grid_thws.shape}"
265
+ if isinstance(pixel_values, list):
266
+ pixel_values = torch.cat(pixel_values, dim=0)
267
+ image_features_ = self.vision_tower(pixel_values, grid_thw=grid_thws)
268
+ image_features_list = []
269
+ start_idx = 0
270
+ for i, grid_thw in enumerate(grid_thws):
271
+ end_idx = start_idx + (grid_thw[0] * grid_thw[1] * grid_thw[2]) // 4
272
+ image_features_list.append(image_features_[start_idx:end_idx, :])
273
+ start_idx = end_idx
274
+
275
+ selected_image_feature = torch.cat(image_features_list, dim=0)
276
+ feature_lengths = [x.size(0) for x in image_features_list]
277
+ return selected_image_feature, feature_lengths
278
+
279
+ def forward(
280
+ self,
281
+ input_ids: torch.LongTensor | None = None,
282
+ pixel_values: torch.FloatTensor | list[torch.FloatTensor] | None = None,
283
+ grid_thws: torch.Tensor = None,
284
+ attention_mask: torch.Tensor | None = None,
285
+ position_ids: torch.LongTensor | None = None,
286
+ past_key_values: list[torch.FloatTensor] | None = None,
287
+ inputs_embeds: torch.FloatTensor | None = None,
288
+ labels: torch.LongTensor | None = None,
289
+ use_cache: bool | None = None,
290
+ output_attentions: bool | None = None,
291
+ output_hidden_states: bool | None = None,
292
+ return_dict: bool | None = None,
293
+ ) -> tuple | LlavaCausalLMOutputWithPast:
294
+ r"""
295
+ Args:
296
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
297
+ Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
298
+ config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
299
+ (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
300
+
301
+ ```"""
302
+
303
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
304
+ output_hidden_states = (
305
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
306
+ )
307
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
308
+ if inputs_embeds is None:
309
+ # 1. Extract the input embeddings
310
+ inputs_embeds = self.get_input_embeddings()(input_ids)
311
+ # 2. Merge text and images
312
+ if pixel_values is not None and len(pixel_values) > 0 and input_ids.shape[1] != 1:
313
+ image_feature, feature_lengths = self._extract_image_features(
314
+ pixel_values, grid_thws)
315
+
316
+ inputs_embeds = inputs_embeds.to(image_feature.dtype) # num_tokens, embed_dim
317
+ inputs_embeds, attention_mask, labels, position_ids = \
318
+ self._merge_input_ids_with_image_features(image_feature, feature_lengths, inputs_embeds, input_ids, attention_mask, labels
319
+ )
320
+ # In case input_ids.shape[1] == 1 & pixel_values==None & past_key_values != None, we are in the case of
321
+ # generation with cache
322
+ elif past_key_values is not None and pixel_values is not None and input_ids.shape[1] == 1:
323
+ # Retrieve the first layer to inspect the logits and mask out the hidden states
324
+ # that are set to 0
325
+ first_layer_past_key_value = past_key_values[0][0][:, :, :, 0]
326
+
327
+ # Sum all dimensions of head_dim (-2) to avoid random errors such as: https://github.com/huggingface/transformers/pull/28032#issuecomment-1863691941
328
+ batch_index, non_attended_tokens = torch.where(first_layer_past_key_value.float().sum(-2) == 0)
329
+
330
+ # Get the target length
331
+ target_length = input_ids.shape[1]
332
+ past_length = first_layer_past_key_value.shape[-1]
333
+
334
+ extended_attention_mask = torch.ones(
335
+ (attention_mask.shape[0], past_length),
336
+ dtype=attention_mask.dtype,
337
+ device=attention_mask.device,
338
+ )
339
+
340
+ # Filter out only the tokens that can be un-attended, this can happen
341
+ # if one uses Llava + Fused modules where the cache on the
342
+ # first iteration is already big enough, or if one passes custom cache
343
+ valid_indices = non_attended_tokens < extended_attention_mask.size(-1)
344
+ new_batch_index = batch_index[valid_indices]
345
+ new_non_attended_tokens = non_attended_tokens[valid_indices]
346
+
347
+ # Zero-out the places where we don't need to attend
348
+ extended_attention_mask[new_batch_index, new_non_attended_tokens] = 0
349
+
350
+ attention_mask = torch.cat((extended_attention_mask, attention_mask[:, -target_length:]), dim=1)
351
+ position_ids = torch.sum(attention_mask, dim=1).unsqueeze(-1) - 1
352
+
353
+ outputs = self.language_model(
354
+ attention_mask=attention_mask,
355
+ position_ids=position_ids,
356
+ past_key_values=past_key_values,
357
+ inputs_embeds=inputs_embeds,
358
+ use_cache=use_cache,
359
+ output_attentions=output_attentions,
360
+ output_hidden_states=output_hidden_states,
361
+ return_dict=return_dict,
362
+ )
363
+
364
+ logits = outputs[0]
365
+
366
+ loss = None
367
+ if labels is not None:
368
+ # Shift so that tokens < n predict n
369
+ if attention_mask is not None:
370
+ shift_attention_mask = attention_mask[..., 1:]
371
+ shift_logits = logits[..., :-1, :][shift_attention_mask.to(logits.device) != 0].contiguous()
372
+ shift_labels = labels[..., 1:][shift_attention_mask.to(labels.device) != 0].contiguous()
373
+ else:
374
+ shift_logits = logits[..., :-1, :].contiguous()
375
+ shift_labels = labels[..., 1:].contiguous()
376
+ # Flatten the tokens
377
+ loss_fct = nn.CrossEntropyLoss()
378
+ loss = loss_fct(
379
+ shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1).to(shift_logits.device)
380
+ )
381
+
382
+ if not return_dict:
383
+ output = (logits,) + outputs[1:]
384
+ return (loss,) + output if loss is not None else output
385
+
386
+ return LlavaCausalLMOutputWithPast(
387
+ loss=loss,
388
+ logits=logits,
389
+ past_key_values=outputs.past_key_values,
390
+ hidden_states=outputs.hidden_states,
391
+ attentions=outputs.attentions,
392
+ )
393
+
394
+ def prepare_inputs_for_generation(
395
+ self, input_ids, past_key_values=None, inputs_embeds=None, pixel_values=None, grid_thws=None, attention_mask=None, **kwargs
396
+ ):
397
+ if past_key_values is not None:
398
+ if isinstance(past_key_values, Cache):
399
+ cache_length = past_key_values.get_seq_length()
400
+ past_length = past_key_values.seen_tokens
401
+ else:
402
+ cache_length = past_length = past_key_values[0][0].shape[2]
403
+
404
+ # Keep only the unprocessed tokens:
405
+ # 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where
406
+ # some of the inputs are exclusively passed as part of the cache (e.g. when passing input_embeds as
407
+ # input)
408
+ if attention_mask is not None and attention_mask.shape[1] > input_ids.shape[1]:
409
+ input_ids = input_ids[:, -(attention_mask.shape[1] - past_length) :]
410
+ # 2 - If the past_length is smaller than input_ids', then input_ids holds all input tokens. We can discard
411
+ # input_ids based on the past_length.
412
+ elif past_length < input_ids.shape[1]:
413
+ input_ids = input_ids[:, past_length:]
414
+ # 3 - Otherwise (past_length >= input_ids.shape[1]), let's assume input_ids only has unprocessed tokens.
415
+ elif self.config.media_placeholder_token_id in input_ids:
416
+ input_ids = input_ids[:, input_ids.shape[1] - 1 :]
417
+ # If the cache has seen more tokens than it can hold, then the cache has a size limit. Let's discard the
418
+ # older attention values, as their corresponding values are not part of the input.
419
+ if cache_length < past_length and attention_mask is not None:
420
+ attention_mask = attention_mask[:, -(cache_length + input_ids.shape[1]) :]
421
+
422
+ position_ids = kwargs.get("position_ids", None)
423
+ if attention_mask is not None and position_ids is None:
424
+ # create position_ids on the fly for batch generation
425
+ position_ids = attention_mask.long().cumsum(-1) - 1
426
+ position_ids.masked_fill_(attention_mask == 0, 1)
427
+ if past_key_values:
428
+ position_ids = position_ids[:, -input_ids.shape[1] :]
429
+
430
+ # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
431
+ if inputs_embeds is not None and past_key_values is None:
432
+ model_inputs = {"inputs_embeds": inputs_embeds}
433
+ else:
434
+ model_inputs = {"input_ids": input_ids}
435
+
436
+ model_inputs.update(
437
+ {
438
+ "position_ids": position_ids,
439
+ "past_key_values": past_key_values,
440
+ "use_cache": kwargs.get("use_cache"),
441
+ "attention_mask": attention_mask,
442
+ "pixel_values": pixel_values,
443
+ "grid_thws": grid_thws,
444
+ }
445
+ )
446
+ return model_inputs
447
+
448
+ def _reorder_cache(self, *args, **kwargs):
449
+ return self.language_model._reorder_cache(*args, **kwargs)
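
A note for readers of the forward pass above: the loss uses the standard causal shift, where the logit at step t is scored against the token at step t + 1, with padded positions dropped via the attention mask. The toy sketch below reproduces just that shift with made-up tensors; nothing in it is specific to this checkpoint.

```python
# Toy illustration of the causal shift used in the forward pass above:
# the logit at step t is scored against the token at step t + 1.
# Shapes and values are made up; nothing here is checkpoint-specific.
import torch
import torch.nn as nn

batch, seq_len, vocab = 1, 5, 8
logits = torch.randn(batch, seq_len, vocab)          # model outputs
labels = torch.randint(0, vocab, (batch, seq_len))   # target token ids

shift_logits = logits[..., :-1, :].contiguous()      # predictions for steps 0..3
shift_labels = labels[..., 1:].contiguous()          # targets are tokens 1..4

loss = nn.CrossEntropyLoss()(
    shift_logits.view(-1, vocab), shift_labels.view(-1)
)
print(loss.item())
```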
preprocessor_config.json ADDED
@@ -0,0 +1,18 @@
1
+ {
2
+ "min_pixels": 3136,
3
+ "max_pixels": 12845056,
4
+ "patch_size": 14,
5
+ "temporal_patch_size": 2,
6
+ "merge_size": 2,
7
+ "image_mean": [
8
+ 0.48145466,
9
+ 0.4578275,
10
+ 0.40821073
11
+ ],
12
+ "image_std": [
13
+ 0.26862954,
14
+ 0.26130258,
15
+ 0.27577711
16
+ ],
17
+ "image_processor_type": "Qwen2VLImageProcessor"
18
+ }
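
The fields above control how screenshots become visual tokens: images are rescaled so their pixel count stays within `min_pixels`..`max_pixels`, cut into `patch_size` x `patch_size` patches, and `merge_size` x `merge_size` groups of patches are merged into one token fed to the language model. The exact resizing lives in `Qwen2VLImageProcessor`; the sketch below is only back-of-the-envelope arithmetic, and the helper `approx_visual_tokens` is a made-up name for illustration.

```python
# Back-of-the-envelope arithmetic for the config above; the real resizing is
# implemented by Qwen2VLImageProcessor. `approx_visual_tokens` is a made-up
# helper for illustration, not part of this repository.
import math

patch_size = 14        # each 14x14 pixel patch becomes one raw vision token
merge_size = 2         # 2x2 raw tokens are merged into one LLM-facing token
min_pixels = 3136
max_pixels = 12845056  # images are rescaled so H*W stays within these bounds

def approx_visual_tokens(height: int, width: int) -> int:
    pixels = min(max(height * width, min_pixels), max_pixels)
    scale = math.sqrt(pixels / (height * width))
    h_patches = round(height * scale / patch_size)
    w_patches = round(width * scale / patch_size)
    return (h_patches * w_patches) // (merge_size ** 2)

print(approx_visual_tokens(1080, 1920))  # roughly 2,600 tokens for a full-HD screenshot
```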
processing_opencua.py ADDED
@@ -0,0 +1,26 @@
1
+ # processing_opencua.py
2
+ from transformers import Qwen2_5_VLProcessor, AutoTokenizer, AutoImageProcessor
3
+
4
+ class OpenCUAProcessor(Qwen2_5_VLProcessor):
5
+ # A string is fine here, but we load the tokenizer manually in from_pretrained to avoid string-based reflection
6
+ tokenizer_class = "TikTokenV3"
7
+
8
+ @classmethod
9
+ def from_pretrained(cls, pretrained_model_name_or_path, **kwargs):
10
+ # Make sure remote code is allowed
11
+ trust_remote_code = kwargs.get("trust_remote_code", False)
12
+
13
+ # 1) Manually load the tokenizer (resolved through the model directory's tokenizer_config.json -> TikTokenV3 + tokenization_opencua.py)
14
+ tokenizer = AutoTokenizer.from_pretrained(
15
+ pretrained_model_name_or_path,
16
+ trust_remote_code=trust_remote_code,
17
+ )
18
+
19
+ # 2) Manually load the image processor (keep Qwen2VLImageProcessor)
20
+ image_processor = AutoImageProcessor.from_pretrained(
21
+ pretrained_model_name_or_path,
22
+ trust_remote_code=trust_remote_code,
23
+ )
24
+
25
+ # 3) Build and return the Qwen2.5-VL processor instance
26
+ return cls(image_processor=image_processor, tokenizer=tokenizer)
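
A minimal, hedged usage sketch for the processor defined above. The local directory name is a placeholder assumed to contain this commit's files, so that `AutoTokenizer` resolves `TikTokenV3` and `AutoImageProcessor` resolves `Qwen2VLImageProcessor`.

```python
# Hedged usage sketch. Assumes processing_opencua.py is importable (e.g. the
# current working directory is the checkpoint directory) and that the
# placeholder path below contains the files added in this commit.
from processing_opencua import OpenCUAProcessor

processor = OpenCUAProcessor.from_pretrained(
    "./opencua-checkpoint",   # placeholder local path
    trust_remote_code=True,
)
print(type(processor.tokenizer).__name__)        # expected: TikTokenV3
print(type(processor.image_processor).__name__)  # expected: Qwen2VLImageProcessor
```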
processor_config.json ADDED
@@ -0,0 +1,4 @@
1
+ {
2
+ "processor_class": "processing_opencua.OpenCUAProcessor",
3
+ "image_processor_type": "Qwen2VLImageProcessor"
4
+ }
tiktoken.model ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b2b1b8dfb5cc5f024bafc373121c6aba3f66f9a5a0269e243470a1de16a33186
3
+ size 2561218
tokenization_opencua.py ADDED
@@ -0,0 +1,379 @@
1
+ import os
2
+ import tiktoken
3
+
4
+ from logging import getLogger
5
+ from pathlib import Path
6
+ from typing import (
7
+ cast,
8
+ Tuple,
9
+ Dict,
10
+ Iterator,
11
+ List,
12
+ Union,
13
+ Optional,
14
+ )
15
+ from shutil import copyfile
16
+ from tiktoken.load import load_tiktoken_bpe
17
+ from tokenizers import AddedToken
18
+ from transformers.tokenization_utils import PreTrainedTokenizer
19
+ from transformers.models.gpt2.tokenization_gpt2 import bytes_to_unicode
20
+
21
+ # Import Qwen2Tokenizer so it can be used as a base class
22
+ try:
23
+ from transformers.models.qwen2.tokenization_qwen2 import Qwen2Tokenizer
24
+ QWEN2_AVAILABLE = True
25
+ except ImportError:
26
+ QWEN2_AVAILABLE = False
27
+ Qwen2Tokenizer = PreTrainedTokenizer
28
+
29
+
30
+ logger = getLogger(__name__)
31
+ VOCAB_FILES_NAMES = {"vocab_file": "tiktoken.model"}
32
+
33
+ class TikTokenTokenizer(PreTrainedTokenizer):
34
+ """
35
+ Tokenizing and encoding/decoding text using the Tiktoken tokenizer. See megatron/tokenizer/tiktoken_tokenizer.py.
36
+
37
+ This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
38
+ this superclass for more information regarding those methods.
39
+
40
+ Args:
41
+ vocab_file (`str`):
42
+ The path to the Tiktoken model file.
43
+ bos_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"[BOS]"`):
44
+ The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token.
45
+ eos_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"[EOS]"`):
46
+ The end of sequence token.
47
+ unk_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"[UNK]"`):
48
+ The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
49
+ token instead. The second to last item in special_tokens.
50
+ pad_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"[PAD]"`):
51
+ The token used for padding, for example when batching sequences of different lengths.
52
+ additional_special_tokens (list of `str`, *optional*):
53
+ A tuple or a list of additional tokens, which will be marked as `special`, meaning that they will be
54
+ skipped when decoding if `skip_special_tokens` is set to `True`.
55
+ """
56
+
57
+ vocab_files_names = VOCAB_FILES_NAMES
58
+
59
+ model_input_names = ["input_ids", "attention_mask"]
60
+
61
+ special_tokens: Dict[str, int]
62
+
63
+ num_reserved_special_tokens = 256
64
+
65
+ pat_str = "|".join(
66
+ [
67
+ r"""[\p{Han}]+""",
68
+ r"""[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}&&[^\p{Han}]]*[\p{Ll}\p{Lm}\p{Lo}\p{M}&&[^\p{Han}]]+(?i:'s|'t|'re|'ve|'m|'ll|'d)?""",
69
+ r"""[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}&&[^\p{Han}]]+[\p{Ll}\p{Lm}\p{Lo}\p{M}&&[^\p{Han}]]*(?i:'s|'t|'re|'ve|'m|'ll|'d)?""",
70
+ r"""\p{N}{1,3}""",
71
+ r""" ?[^\s\p{L}\p{N}]+[\r\n]*""",
72
+ r"""\s*[\r\n]+""",
73
+ r"""\s+(?!\S)""",
74
+ r"""\s+""",
75
+ ]
76
+ )
77
+
78
+ def __init__(
79
+ self,
80
+ vocab_file,
81
+ bos_token: Union[str, AddedToken]="[BOS]",
82
+ eos_token: Union[str, AddedToken]="[EOS]",
83
+ unk_token: Union[str, AddedToken, None]=None,
84
+ pad_token: Union[str, AddedToken, None]=None,
85
+ additional_special_tokens: List[str]=None,
86
+ added_tokens_decoder: Optional[dict] = None,
87
+ **kwargs,
88
+ ):
89
+ assert os.path.isfile(vocab_file), vocab_file
90
+
91
+ if additional_special_tokens is None:
92
+ # dumping mode
93
+ used_special_tokens = [
94
+ "<|im_end|>",
95
+ "<|im_user|>",
96
+ "<|im_assistant|>",
97
+ "<|reserved_token_0|>",
98
+ "<|start_header_id|>",
99
+ "<|end_header_id|>",
100
+ "<|reserved_token_1|>",
101
+ "[EOT]",
102
+ "<|im_system|>",
103
+ "<|reserved_token_2|>",
104
+ "<|reserved_token_3|>",
105
+ "<|reserved_token_4|>",
106
+ "<|reserved_token_5|>",
107
+ "<|reserved_token_6|>",
108
+ "<|reserved_token_7|>",
109
+ "<|im_middle|>",
110
+ "<|media_begin|>",
111
+ "<|media_content|>",
112
+ "<|media_end|>",
113
+ "<|media_placeholder|>",
114
+ # Add the tokens required by standard Qwen2.5-VL
115
+ "<|vision_start|>",
116
+ "<|vision_end|>",
117
+ "<|image_pad|>",
118
+ "<|video_pad|>",
119
+ ]
120
+ used_reserved_tokens = 12 # originally 8, plus 4 new vision-related tokens
121
+ last_reserved_token_id = self.num_reserved_special_tokens - 4 - len(used_special_tokens) + used_reserved_tokens - 1
122
+ additional_special_tokens = used_special_tokens + [
123
+ f"<|reserved_token_{i}|>"
124
+ for i in range(used_reserved_tokens, last_reserved_token_id + 1)
125
+ ]
126
+ # num_reserved_special_tokens = additional_special_tokens + BOS + EOS + unk_token + pad_token
127
+ assert len(additional_special_tokens) + 4 == self.num_reserved_special_tokens, f"additional_special_tokens num: {len(additional_special_tokens)} is not correct"
128
+ # we assume that the instance is under initialization and unk_token and pad_token should be automatically inferred
129
+ if unk_token is not None:
130
+ raise ValueError("unk_token should not be set in dumping mode when additional_special_tokens is None")
131
+ if pad_token is not None:
132
+ raise ValueError("pad_token should not be set in dumping mode when additional_special_tokens is None")
133
+ # last two reserved tokens
134
+ unk_token = f"[UNK]"
135
+ pad_token = f"[PAD]"
136
+
137
+ logger.info(f"adding unk_token: {unk_token} and pad_token: {pad_token}")
138
+ self.additional_special_tokens = additional_special_tokens
139
+ special_tokens = [str(bos_token), str(eos_token)] + additional_special_tokens + [str(unk_token), str(pad_token)]
140
+
141
+ self.vocab_file = vocab_file
142
+ mergeable_ranks = load_tiktoken_bpe(vocab_file)
143
+ num_base_tokens = len(mergeable_ranks)
144
+ self.special_tokens = {
145
+ token: num_base_tokens + i for i, token in enumerate(special_tokens)
146
+ }
147
+ else:
148
+ self.additional_special_tokens = additional_special_tokens
149
+ special_tokens_mapping = {
150
+ i: added_tokens_decoder[i].content for i in added_tokens_decoder
151
+ }
152
+
153
+ self.vocab_file = vocab_file
154
+ mergeable_ranks = load_tiktoken_bpe(vocab_file)
155
+ num_base_tokens = len(mergeable_ranks)
156
+ self.special_tokens = {
157
+ special_tokens_mapping.get(i, f"<|reserved_token_{i}|>"): i
158
+ for i in range(
159
+ num_base_tokens, num_base_tokens + self.num_reserved_special_tokens + 2
160
+ )
161
+ }
162
+
163
+
164
+
165
+ self.model = tiktoken.Encoding(
166
+ name=Path(vocab_file).name,
167
+ pat_str=self.pat_str,
168
+ mergeable_ranks=mergeable_ranks,
169
+ special_tokens=self.special_tokens,
170
+ )
171
+ logger.info(f"Reloaded tiktoken model from {vocab_file}")
172
+
173
+ self.n_words: int = self.model.n_vocab
174
+ # BOS / EOS token IDs
175
+ self.bos_id: int = self.special_tokens[str(bos_token)]
176
+ self.eos_id: int = self.special_tokens[str(eos_token)]
177
+
178
+ logger.info(
179
+ f"#words: {self.n_words} - BOS ID: {self.bos_id} - EOS ID: {self.eos_id}"
180
+ )
181
+
182
+ self.pad_id: int = self.special_tokens[str(pad_token)]
183
+ self.unk_id: int = self.special_tokens[str(unk_token)]
184
+ self.byte_encoder = bytes_to_unicode()
185
+ self.byte_decoder = {v: k for k, v in self.byte_encoder.items()}
186
+
187
+ self.decoder = {}
188
+ for i in range(self.n_words):
189
+ # Taken from https://gist.github.com/xenova/a452a6474428de0182b17605a98631ee
190
+ decoding = ''.join([
191
+ self.byte_encoder[ord(char)] for char in
192
+ self.model.decode_single_token_bytes(i).decode('latin-1')
193
+ ])
194
+ self.decoder[i] = decoding
195
+
196
+ self.encoder = {}
197
+ for i in range(self.n_words):
198
+ if i in self.decoder:
199
+ self.encoder[self.decoder[i]] = i
200
+
201
+ super().__init__(
202
+ bos_token=bos_token,
203
+ eos_token=eos_token,
204
+ unk_token=unk_token,
205
+ pad_token=pad_token,
206
+ additional_special_tokens=self.additional_special_tokens,
207
+ **kwargs,
208
+ )
209
+ self.all_special_ids_set = set(self.all_special_ids)
210
+
211
+ def encode(
212
+ self,
213
+ text: str,
214
+ allow_special_tokens = True,
215
+ **kwargs
216
+ ) -> List[int]:
217
+ """
218
+ Encodes a string into a list of token IDs.
219
+
220
+ Args:
221
+ text (str): The input string to be encoded.
222
+
223
+ Returns:
224
+ list[int]: A list of token IDs.
225
+ """
226
+ # If there are other args, we should call super().encode because there is a lot of code
227
+ # to handle those args. super().encode finally will call _tokenize and _convert_token_to_id.
228
+ # NOTE: our encode method is not compatible with the super().encode method,
229
+ # e.g. split_special_tokens' default is True in our encode method.
230
+ if len(kwargs) > 0:
231
+ logger.warning( f"Calling super().encode with {kwargs}" )
232
+ return super().encode(text, **kwargs)
233
+
234
+ assert type(text) is str
235
+
236
+ # The tiktoken tokenizer can handle <=400k chars without
237
+ # pyo3_runtime.PanicException.
238
+ TIKTOKEN_MAX_ENCODE_CHARS = 400_000
239
+
240
+ # https://github.com/openai/tiktoken/issues/195
241
+ # Here we iterate over subsequences and split if we exceed the limit
242
+ # of max consecutive non-whitespace or whitespace characters.
243
+ MAX_NO_WHITESPACES_CHARS = 25_000
244
+
245
+ texts = self.pre_tokenizer_process(text)
246
+
247
+ all_substrs = []
248
+ for text in texts:
249
+ substrs = (
250
+ substr
251
+ for i in range(0, len(text), TIKTOKEN_MAX_ENCODE_CHARS)
252
+ for substr in self._split_whitespaces_or_nonwhitespaces(
253
+ text[i: i + TIKTOKEN_MAX_ENCODE_CHARS], MAX_NO_WHITESPACES_CHARS
254
+ )
255
+ )
256
+ all_substrs.extend(substrs)
257
+
258
+ t: List[int] = []
259
+ for substr in all_substrs:
260
+ if allow_special_tokens:
261
+ t.extend(
262
+ self.model.encode(
263
+ substr,
264
+ allowed_special="all",
265
+ )
266
+ )
267
+ else:
268
+ t.extend(
269
+ self.model.encode(
270
+ substr,
271
+ disallowed_special=(),
272
+ )
273
+ )
274
+
275
+ return t
276
+
277
+ def decode(
278
+ self,
279
+ token_ids: Union[int, List[int]],
280
+ **kwargs
281
+ ) -> str:
282
+ """
283
+ Decodes a list of token IDs into a string.
284
+
285
+ Args:
286
+ token_ids (List[int]): The list of token IDs to be decoded.
287
+
288
+ Returns:
289
+ str: The decoded string.
290
+ """
291
+ # If there are other args, we should call super().decode because there is a lot of code
292
+ # to handle those args. super().decode finally will call convert_tokens_to_string and _convert_id_to_token.
293
+ if len(kwargs) > 0:
294
+ return super().decode(token_ids, **kwargs)
295
+
296
+ if type(token_ids) is int:
297
+ token_ids = [token_ids]
298
+
299
+ return self.model.decode(cast(List[int], token_ids))
300
+
301
+ @staticmethod
302
+ def _split_whitespaces_or_nonwhitespaces(
303
+ s: str, max_consecutive_slice_len: int
304
+ ) -> Iterator[str]:
305
+ """
306
+ Splits the string `s` so that each substring contains no more than `max_consecutive_slice_len`
307
+ consecutive whitespaces or consecutive non-whitespaces.
308
+ """
309
+ current_slice_len = 0
310
+ current_slice_is_space = s[0].isspace() if len(s) > 0 else False
311
+ slice_start = 0
312
+
313
+ for i in range(len(s)):
314
+ is_now_space = s[i].isspace()
315
+
316
+ if current_slice_is_space ^ is_now_space:
317
+ current_slice_len = 1
318
+ current_slice_is_space = is_now_space
319
+ else:
320
+ current_slice_len += 1
321
+ if current_slice_len > max_consecutive_slice_len:
322
+ yield s[slice_start:i]
323
+ slice_start = i
324
+ current_slice_len = 1
325
+ yield s[slice_start:]
326
+
327
+ def pre_tokenizer_process(self, text: str) -> List[str]:
328
+ """
329
+ pre-tokenizes the input text into a list of tokens.
330
+ This method is used to split the input text into smaller chunks for internal processing.
331
+ """
332
+ return [text]
333
+
334
+
335
+ """ ----- Below are the abstract methods required by PreTrainedTokenizer ----- """
336
+ @property
337
+ def vocab_size(self) -> int:
338
+ return self.n_words
339
+
340
+ def get_vocab(self) -> Dict[str, int]:
341
+ return self.encoder
342
+
343
+ def _tokenize(self, text: str, **kwargs) -> List[str]:
344
+ return [
345
+ self.decoder[t]
346
+ for t in self.encode(text)
347
+ ]
348
+
349
+ def _convert_token_to_id(self, token: str) -> int:
350
+ return self.encoder.get(token, self.unk_id)
351
+
352
+ def _convert_id_to_token(self, index: int) -> str:
353
+ return self.decoder.get(index)
354
+
355
+ @staticmethod
356
+ def clean_up_tokenization(out_string: str) -> str:
357
+ return out_string
358
+
359
+ def convert_tokens_to_string(self, tokens: List[str]) -> str:
360
+ text = ''.join(tokens)
361
+ text = bytearray([self.byte_decoder[c] for c in text]).decode('utf-8', 'replace')
362
+ return text
363
+
364
+ def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
365
+ if not os.path.isdir(save_directory):
366
+ raise ValueError(f"vocabulary path ({save_directory}) should be a directory")
367
+ out_vocab_file = os.path.join(
368
+ save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
369
+ )
370
+
371
+ if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file) and os.path.isfile(self.vocab_file):
372
+ copyfile(self.vocab_file, out_vocab_file)
373
+
374
+ return (out_vocab_file,)
375
+
376
+
377
+ class TikTokenV3(TikTokenTokenizer):
378
+ num_reserved_special_tokens = 293 + 128
379
+ pat_str = "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
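
A short, hedged round-trip with the tokenizer above. The checkpoint directory is a placeholder path assumed to contain `tokenization_opencua.py`, `tiktoken.model`, and `tokenizer_config.json` from this commit.

```python
# Hedged round-trip sketch; the path is a placeholder directory assumed to
# hold tokenization_opencua.py, tiktoken.model and tokenizer_config.json.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("./opencua-checkpoint", trust_remote_code=True)

ids = tok.encode("Click the Submit button.")
print(ids)              # tiktoken BPE ids
print(tok.decode(ids))  # -> "Click the Submit button."

# Special tokens map to single reserved ids (see tokenizer_config.json):
print(tok.encode("<|im_user|>"))
```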
tokenizer_config.json ADDED
@@ -0,0 +1,270 @@
1
+ {
2
+ "added_tokens_decoder": {
3
+ "151643": {
4
+ "content": "[BOS]",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": true
10
+ },
11
+ "151644": {
12
+ "content": "[EOS]",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false,
17
+ "special": true
18
+ },
19
+ "151645": {
20
+ "content": "<|im_end|>",
21
+ "lstrip": false,
22
+ "normalized": false,
23
+ "rstrip": false,
24
+ "single_word": false,
25
+ "special": true
26
+ },
27
+ "151646": {
28
+ "content": "<|im_user|>",
29
+ "lstrip": false,
30
+ "normalized": false,
31
+ "rstrip": false,
32
+ "single_word": false,
33
+ "special": true
34
+ },
35
+ "151647": {
36
+ "content": "<|im_assistant|>",
37
+ "lstrip": false,
38
+ "normalized": false,
39
+ "rstrip": false,
40
+ "single_word": false,
41
+ "special": true
42
+ },
43
+ "151648": {
44
+ "content": "<|reserved_token_0|>",
45
+ "lstrip": false,
46
+ "normalized": false,
47
+ "rstrip": false,
48
+ "single_word": false,
49
+ "special": true
50
+ },
51
+ "151649": {
52
+ "content": "<|start_header_id|>",
53
+ "lstrip": false,
54
+ "normalized": false,
55
+ "rstrip": false,
56
+ "single_word": false,
57
+ "special": true
58
+ },
59
+ "151650": {
60
+ "content": "<|end_header_id|>",
61
+ "lstrip": false,
62
+ "normalized": false,
63
+ "rstrip": false,
64
+ "single_word": false,
65
+ "special": true
66
+ },
67
+ "151651": {
68
+ "content": "<|reserved_token_1|>",
69
+ "lstrip": false,
70
+ "normalized": false,
71
+ "rstrip": false,
72
+ "single_word": false,
73
+ "special": true
74
+ },
75
+ "151652": {
76
+ "content": "[EOT]",
77
+ "lstrip": false,
78
+ "normalized": false,
79
+ "rstrip": false,
80
+ "single_word": false,
81
+ "special": true
82
+ },
83
+ "151653": {
84
+ "content": "<|im_system|>",
85
+ "lstrip": false,
86
+ "normalized": false,
87
+ "rstrip": false,
88
+ "single_word": false,
89
+ "special": true
90
+ },
91
+ "151654": {
92
+ "content": "<|reserved_token_2|>",
93
+ "lstrip": false,
94
+ "normalized": false,
95
+ "rstrip": false,
96
+ "single_word": false,
97
+ "special": true
98
+ },
99
+ "151655": {
100
+ "content": "<|reserved_token_3|>",
101
+ "lstrip": false,
102
+ "normalized": false,
103
+ "rstrip": false,
104
+ "single_word": false,
105
+ "special": true
106
+ },
107
+ "151656": {
108
+ "content": "<|reserved_token_4|>",
109
+ "lstrip": false,
110
+ "normalized": false,
111
+ "rstrip": false,
112
+ "single_word": false,
113
+ "special": true
114
+ },
115
+ "151657": {
116
+ "content": "<|reserved_token_5|>",
117
+ "lstrip": false,
118
+ "normalized": false,
119
+ "rstrip": false,
120
+ "single_word": false,
121
+ "special": true
122
+ },
123
+ "151658": {
124
+ "content": "<|reserved_token_6|>",
125
+ "lstrip": false,
126
+ "normalized": false,
127
+ "rstrip": false,
128
+ "single_word": false,
129
+ "special": true
130
+ },
131
+ "151659": {
132
+ "content": "<|reserved_token_7|>",
133
+ "lstrip": false,
134
+ "normalized": false,
135
+ "rstrip": false,
136
+ "single_word": false,
137
+ "special": true
138
+ },
139
+ "151660": {
140
+ "content": "<|im_middle|>",
141
+ "lstrip": false,
142
+ "normalized": false,
143
+ "rstrip": false,
144
+ "single_word": false,
145
+ "special": true
146
+ },
147
+ "151661": {
148
+ "content": "<|media_begin|>",
149
+ "lstrip": false,
150
+ "normalized": false,
151
+ "rstrip": false,
152
+ "single_word": false,
153
+ "special": true
154
+ },
155
+ "151662": {
156
+ "content": "<|media_content|>",
157
+ "lstrip": false,
158
+ "normalized": false,
159
+ "rstrip": false,
160
+ "single_word": false,
161
+ "special": true
162
+ },
163
+ "151663": {
164
+ "content": "<|media_end|>",
165
+ "lstrip": false,
166
+ "normalized": false,
167
+ "rstrip": false,
168
+ "single_word": false,
169
+ "special": true
170
+ },
171
+ "151664": {
172
+ "content": "<|media_placeholder|>",
173
+ "lstrip": false,
174
+ "normalized": false,
175
+ "rstrip": false,
176
+ "single_word": false,
177
+ "special": true
178
+ },
179
+ "151665": {
180
+ "content": "<|vision_start|>",
181
+ "lstrip": false,
182
+ "normalized": false,
183
+ "rstrip": false,
184
+ "single_word": false,
185
+ "special": true
186
+ },
187
+ "151666": {
188
+ "content": "<|vision_end|>",
189
+ "lstrip": false,
190
+ "normalized": false,
191
+ "rstrip": false,
192
+ "single_word": false,
193
+ "special": true
194
+ },
195
+ "151667": {
196
+ "content": "<|image_pad|>",
197
+ "lstrip": false,
198
+ "normalized": false,
199
+ "rstrip": false,
200
+ "single_word": false,
201
+ "special": true
202
+ },
203
+ "151668": {
204
+ "content": "<|video_pad|>",
205
+ "lstrip": false,
206
+ "normalized": false,
207
+ "rstrip": false,
208
+ "single_word": false,
209
+ "special": true
210
+ },
211
+ "152062": {
212
+ "content": "[UNK]",
213
+ "lstrip": false,
214
+ "normalized": false,
215
+ "rstrip": false,
216
+ "single_word": false,
217
+ "special": true
218
+ },
219
+ "152063": {
220
+ "content": "[PAD]",
221
+ "lstrip": false,
222
+ "normalized": false,
223
+ "rstrip": false,
224
+ "single_word": false,
225
+ "special": true
226
+ }
227
+
228
+ },
229
+ "additional_special_tokens": [
230
+ "<|im_end|>",
231
+ "<|im_user|>",
232
+ "<|im_assistant|>",
233
+ "<|reserved_token_0|>",
234
+ "<|start_header_id|>",
235
+ "<|end_header_id|>",
236
+ "<|reserved_token_1|>",
237
+ "[EOT]",
238
+ "<|im_system|>",
239
+ "<|reserved_token_2|>",
240
+ "<|reserved_token_3|>",
241
+ "<|reserved_token_4|>",
242
+ "<|reserved_token_5|>",
243
+ "<|reserved_token_6|>",
244
+ "<|reserved_token_7|>",
245
+ "<|im_middle|>",
246
+ "<|media_begin|>",
247
+ "<|media_content|>",
248
+ "<|media_end|>",
249
+ "<|media_placeholder|>",
250
+ "<|vision_start|>",
251
+ "<|vision_end|>",
252
+ "<|image_pad|>",
253
+ "<|video_pad|>"
254
+ ],
255
+ "bos_token": "[BOS]",
256
+ "clean_up_tokenization_spaces": false,
257
+ "eos_token": "[EOS]",
258
+ "extra_special_tokens": {},
259
+ "chat_template": "{%- for message in messages -%}{%- if loop.first and messages[0]['role'] != 'system' -%}{{'<|im_system|>system<|im_middle|>You are a helpful assistant<|im_end|>'}}{%- endif -%}{%- if message['role'] == 'system' -%}{{'<|im_system|>'}}{%- endif -%}{%- if message['role'] == 'user' -%}{{'<|im_user|>'}}{%- endif -%}{%- if message['role'] == 'assistant' -%}{{'<|im_assistant|>'}}{%- endif -%}{{- message['role'] -}}{{'<|im_middle|>'}}{%- if message['content'] is string -%}{{- message['content'] + '<|im_end|>' -}}{%- else -%}{%- for content in message['content'] -%}{%- if content['type'] == 'image' or 'image' in content or 'image_url' in content -%}{{'<|media_begin|>image<|media_content|><|media_placeholder|><|media_end|>'}}{%- else -%}{{content['text']}}{%- endif -%}{%- endfor -%}{{'<|im_end|>'}}{%- endif -%}{%- endfor -%}{%- if add_generation_prompt -%}{{'<|im_assistant|>assistant<|im_middle|>'}}{%- endif -%}",
260
+ "model_max_length": 1000000000000000019884624838656,
261
+ "pad_token": "[PAD]",
262
+ "tokenizer_class": "TikTokenV3",
263
+ "unk_token": "[UNK]",
264
+ "auto_map": {
265
+ "AutoTokenizer": [
266
+ "tokenization_opencua.TikTokenV3",
267
+ null
268
+ ]
269
+ }
270
+ }
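
The `chat_template` above is easiest to read through an example. The sketch below applies it with `apply_chat_template`; the checkpoint path and image filename are placeholders, and the commented output is what the template should render (wrapped across lines here for readability, since the template itself emits no newlines).

```python
# Hedged sketch of the chat template; path and image filename are placeholders.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("./opencua-checkpoint", trust_remote_code=True)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "screenshot.png"},
            {"type": "text", "text": "Open the Settings app."},
        ],
    },
]
prompt = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
print(prompt)
# Expected rendering (wrapped here for readability; the template emits no newlines):
#   <|im_system|>system<|im_middle|>You are a helpful assistant<|im_end|>
#   <|im_user|>user<|im_middle|><|media_begin|>image<|media_content|><|media_placeholder|><|media_end|>
#   Open the Settings app.<|im_end|><|im_assistant|>assistant<|im_middle|>
```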