---
license: apache-2.0
datasets:
- yarenty/datafusion_QA
base_model:
- Qwen/Qwen2.5-3B-Instruct
tags:
- rust
- datafusion
- qwen
---

# Qwen2.5-3B-DataFusion-Instruct: Original Trained Model

## Model Overview

- **Model Name:** Qwen2.5-3B-DataFusion-Instruct
- **Model Type:** Fine-tuned large language model
- **Base Model:** Qwen/Qwen2.5-3B-Instruct
- **Specialization:** DataFusion SQL engine and Rust programming
- **Format:** Hugging Face Transformers (SafeTensors)
- **License:** Apache 2.0
- **Total Size:** ~11.5GB (distributed across 3 shards)

## Model Description

This is the original trained version of the Qwen2.5-3B-DataFusion-Instruct model, containing the complete fine-tuned weights and configuration files. The model was fine-tuned on comprehensive DataFusion ecosystem data to excel at Rust programming, DataFusion SQL queries, and data processing tasks.

## Model Architecture

### Base Architecture
- **Model Type:** Qwen2ForCausalLM
- **Architecture:** Transformer-based causal language model
- **Hidden Size:** 2,048 dimensions
- **Intermediate Size:** 11,008 dimensions
- **Number of Layers:** 36 transformer layers
- **Attention Heads:** 16 attention heads
- **Key-Value Heads:** 2 key-value heads (Grouped Query Attention)
- **Max Position Embeddings:** 32,768 tokens
- **Vocabulary Size:** 151,936 tokens

### Training Configuration
- **Attention Dropout:** 0.0 (no dropout during inference)
- **RMS Norm Epsilon:** 1e-06
- **Initializer Range:** 0.02
- **Hidden Activation:** SiLU (Swish)
- **Layer Types:** Full attention across all 36 layers
- **Sliding Window:** Disabled (full attention context)
- **RoPE Scaling:** None (standard rotary position encoding)
- **RoPE Theta:** 1,000,000

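The architecture fields above can be read back from the bundled `config.json` as a quick sanity check; a minimal sketch (the local path is a placeholder):

```python
from transformers import AutoConfig

# Load the architecture configuration shipped with the model weights
config = AutoConfig.from_pretrained("path/to/qwen2.5-3B-datafusion-instruct")

print(config.model_type)               # "qwen2"
print(config.hidden_size)              # 2048
print(config.num_hidden_layers)        # 36
print(config.num_attention_heads)      # 16
print(config.num_key_value_heads)      # 2 (Grouped Query Attention)
print(config.max_position_embeddings)  # 32768
print(config.vocab_size)               # 151936
```
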
## Model Files

### Core Model Weights
- **model-00001-of-00003.safetensors** (4.6GB) - First shard
- **model-00002-of-00003.safetensors** (4.6GB) - Second shard
- **model-00003-of-00003.safetensors** (2.3GB) - Third shard
- **model.safetensors.index.json** (35KB) - Shard index file (see the verification sketch below)

### Tokenizer and Vocabulary
- **tokenizer.json** (11MB) - Main tokenizer configuration
- **vocab.json** (2.6MB) - Vocabulary mapping
- **merges.txt** (1.6MB) - Byte-pair encoding merges
- **tokenizer_config.json** (4.6KB) - Tokenizer settings
- **special_tokens_map.json** (613B) - Special token definitions
- **added_tokens.json** (605B) - Additional tokens added during training

### Configuration Files
- **config.json** (1.5KB) - Model architecture configuration
- **generation_config.json** (243B) - Generation parameters
- **chat_template.jinja** (2.4KB) - Chat conversation template
- **training_args.bin** (5.7KB) - Training arguments and metadata

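To confirm a download is complete, the shard index can be cross-checked against the files on disk. A small sketch using the standard safetensors index layout (the path is a placeholder):

```python
import json
from collections import Counter

# The index maps every tensor name to the shard file that stores it
with open("path/to/qwen2.5-3B-datafusion-instruct/model.safetensors.index.json") as f:
    index = json.load(f)

# Count how many tensors live in each of the three shards
shard_counts = Counter(index["weight_map"].values())
for shard, n_tensors in sorted(shard_counts.items()):
    print(f"{shard}: {n_tensors} tensors")
```
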
## Training Data

### Dataset Composition
- **Total QA Pairs:** 265,180
- **Source Projects:** 36 different repositories
- **Content Types:** Code implementation, documentation, usage examples
- **Coverage:** Comprehensive DataFusion ecosystem

### Training Projects Covered
- **Core DataFusion:** datafusion, datafusion-ballista, datafusion-federation
- **DataFusion Extensions:** datafusion-functions-json, datafusion-postgres, datafusion-python
- **Arrow Ecosystem:** arrow-rs, arrow-zarr
- **Related Tools:** blaze, exon, feldera, greptimedb, horaedb, influxdb
- **Modern Data Stack:** iceberg-rust, LakeSoul, lance, openobserve, parseable

### Data Quality Features
- Structured JSONL format with source attribution (see the illustrative record below)
- Code examples with best practices and common pitfalls
- Error handling guidance and troubleshooting solutions
- Performance optimization tips and best practices

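For illustration, a record in a JSONL dataset of this shape might look like the following; the field names here are hypothetical, not the dataset's actual schema:

```python
import json

# Hypothetical example of one QA record with source attribution
# (illustrative only; consult yarenty/datafusion_QA for the real schema)
record = {
    "source": "datafusion",
    "question": "How do I register a Parquet file as a table in DataFusion?",
    "answer": "Call SessionContext::register_parquet with a table name and file path...",
}
print(json.dumps(record))  # one JSON object per line in the JSONL file
```
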
## Model Capabilities

### Primary Strengths
1. **Rust Programming Expertise**
   - Idiomatic Rust code generation
   - DataFusion API usage patterns
   - Error handling and testing best practices
   - Performance optimization techniques

2. **DataFusion SQL Mastery**
   - Complex SQL query construction
   - Table provider implementations
   - UDF (User-Defined Function) development
   - Query optimization and execution planning

3. **Data Processing Knowledge**
   - Arrow format operations
   - Parquet file handling
   - Data transformation pipelines
   - Streaming and batch processing

4. **System Architecture Understanding**
   - Distributed query execution
   - Federation and integration patterns
   - Observability and tracing
   - Performance monitoring

### Technical Domains
- **SQL Engine Internals:** Query planning, optimization, execution
- **Data Formats:** Arrow, Parquet, JSON, CSV, Avro
- **Storage Systems:** Object storage, databases, file systems
- **Distributed Computing:** Ray, Ballista, cluster management
- **Streaming:** Real-time data processing, windowing, aggregations

## Usage Instructions

### Direct Usage with Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "path/to/qwen2.5-3B-datafusion-instruct",
    device_map="auto",
    torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(
    "path/to/qwen2.5-3B-datafusion-instruct"
)

# Generate a response; move inputs onto the same device as the model
prompt = "How do I create a custom UDF in DataFusion?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,  # required for temperature/top_p to take effect
    temperature=0.7,
    top_p=0.8,
    repetition_penalty=1.05
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

### Chat Template Usage
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "path/to/qwen2.5-3B-datafusion-instruct"
)

# Prepare chat messages
messages = [
    {"role": "system", "content": "You are a DataFusion expert."},
    {"role": "user", "content": "How do I optimize a SQL query?"}
]

# Apply chat template
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
print(prompt)
```

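From here the rendered prompt feeds straight into generation; a short continuation sketch, reusing the model loaded in the previous example:

```python
# Tokenize the rendered chat prompt and generate a reply
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.8
)

# Slice off the prompt tokens so only the assistant's reply is decoded
reply = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(reply)
```
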
## Generation Parameters

### Default Configuration
- **Temperature:** 0.7 (balanced creativity vs. consistency)
- **Top-p:** 0.8 (nucleus sampling)
- **Top-k:** 20 (top-k sampling)
- **Repetition Penalty:** 1.05 (discourages repetitive output)
- **Do Sample:** True (enables sampling-based generation)

### Recommended Settings
- **For Code Generation:** temperature=0.3, top_p=0.9 (see the sketch below)
- **For Explanations:** temperature=0.7, top_p=0.8
- **For Debugging:** temperature=0.1, top_p=0.95
- **For Learning:** temperature=0.5, top_p=0.85

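One way to bundle a recommended profile is a reusable `GenerationConfig`; a sketch for the code-generation settings above, reusing `model` and `inputs` from the earlier examples:

```python
from transformers import GenerationConfig

# Code-generation profile: low temperature for more deterministic output
code_gen = GenerationConfig(
    do_sample=True,
    temperature=0.3,
    top_p=0.9,
    top_k=20,
    repetition_penalty=1.05,
    max_new_tokens=512,
)

outputs = model.generate(**inputs, generation_config=code_gen)
```
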
## Performance Characteristics

### Model Size and Memory
- **Total Parameters:** ~3 billion parameters
- **Model Size:** 11.5GB (distributed across 3 shards)
- **Memory Usage:** ~16-24GB RAM during inference
- **GPU Memory:** 12-16GB VRAM (depending on precision)

### Inference Performance
- **Context Length:** Up to 32,768 tokens
- **Generation Speed:** ~10-50 tokens/second (depending on hardware)
- **Memory Efficiency:** Optimized for large context windows
- **Batch Processing:** Supports batched inference (see the sketch below)

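Batched inference works with the standard Transformers APIs; a minimal sketch, noting that decoder-only models should be left-padded for generation:

```python
# Batch several prompts together (reusing model and tokenizer from above)
prompts = [
    "What is DataFusion?",
    "How do I read a CSV file with DataFusion?",
]

# Decoder-only models need left padding so generation starts cleanly
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

batch = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**batch, max_new_tokens=128)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```
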
## Installation and Setup

### Requirements
```bash
# Python dependencies
pip install torch transformers accelerate safetensors

# For GPU acceleration
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# For 8-bit/4-bit quantized loading (used below)
pip install bitsandbytes
```

### Model Loading
```python
from transformers import AutoModelForCausalLM

# Basic loading
model = AutoModelForCausalLM.from_pretrained("path/to/model")

# With device mapping
model = AutoModelForCausalLM.from_pretrained(
    "path/to/model",
    device_map="auto",
    torch_dtype="auto"
)

# With quantization for memory efficiency (requires bitsandbytes)
model = AutoModelForCausalLM.from_pretrained(
    "path/to/model",
    device_map="auto",
    load_in_8bit=True  # or load_in_4bit=True
)
```
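
Note that recent Transformers releases route these flags through `BitsAndBytesConfig` instead; an equivalent sketch:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Newer API: wrap the quantization options in a config object
quant_config = BitsAndBytesConfig(load_in_8bit=True)  # or load_in_4bit=True
model = AutoModelForCausalLM.from_pretrained(
    "path/to/model",
    device_map="auto",
    quantization_config=quant_config,
)
```
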
## Comparison with GGUF Versions

| Aspect | Original Model | GGUF Main | GGUF Quantized |
|--------|----------------|-----------|----------------|
| **Format** | SafeTensors | GGUF | GGUF (Quantized) |
| **Size** | 11.5GB | 5.8GB | 1.8GB |
| **Memory Usage** | Highest | High | Lower |
| **Accuracy** | Highest | High | High |
| **Flexibility** | Maximum | High | Standard |
| **Deployment** | Development/Research | Production | Production |
| **Hardware Requirements** | High | Medium | Low |

## Limitations and Considerations

### Technical Limitations
- **Context Window:** Limited to 32,768 tokens
- **Real-time Updates:** May not reflect the latest API changes
- **Complex Queries:** Very complex scenarios may require human review
- **Edge Cases:** Unusual configurations may need manual intervention

### Best Practices
- **Verify Output:** Always review generated code before deployment
- **Test Thoroughly:** Validate generated queries and functions
- **Stay Updated:** Check for newer model versions
- **Human Oversight:** Use as an assistant, not a replacement for expertise

## Resources
- **DataFusion Documentation:** https://docs.datafusion.org/
- **Apache Arrow:** https://arrow.apache.org/
- **Rust Programming Language:** https://www.rust-lang.org/
- **Training Dataset:** https://huggingface.co/datasets/yarenty/datafusion_QA
- **Hugging Face Model:** Available for download and use

## Citation

When using this model in research or publications, please cite:

```bibtex
@software{qwen2.5_3b_datafusion_instruct,
  title={Qwen2.5-3B-DataFusion-Instruct: A Specialized Model for the DataFusion Ecosystem},
  author={yarenty},
  year={2025},
  url={https://github.com/yarenty/trainer},
  note={Fine-tuned on the DataFusion ecosystem QA dataset},
  license={Apache-2.0}
}
```

## License

This model is licensed under the Apache 2.0 License. See the LICENSE file for full details.

---

*This original trained model is the foundation of specialized AI assistance for the DataFusion ecosystem, providing the highest-quality outputs for development, research, and production use. It also serves as the source from which the optimized GGUF versions are created for various deployment scenarios.*