Update README.md
README.md
@@ -1,3 +1,296 @@
----
-license: apache-2.0
-
---
license: apache-2.0
datasets:
- yarenty/datafusion_QA
base_model:
- Qwen/Qwen2.5-3B-Instruct
tags:
- rust
- datafusion
- qwen
---
# Qwen2.5-3B-DataFusion-Instruct: Original Trained Model

## Model Overview

- **Model Name:** Qwen2.5-3B-DataFusion-Instruct
- **Model Type:** Fine-tuned Large Language Model
- **Base Model:** Qwen2.5-3B-Instruct
- **Specialization:** DataFusion SQL Engine and Rust Programming
- **Format:** Hugging Face Transformers (SafeTensors)
- **License:** Apache 2.0
- **Total Size:** ~11.5GB (distributed across 3 shards)

## Model Description

This is the original trained version of the Qwen2.5-3B-DataFusion-Instruct model, containing the complete fine-tuned weights and configuration files. The model was fine-tuned on question-answer data drawn from across the DataFusion ecosystem and targets Rust programming, DataFusion SQL queries, and data processing tasks.

## Model Architecture

### Base Architecture
- **Model Type:** Qwen2ForCausalLM
- **Architecture:** Transformer-based causal language model
- **Hidden Size:** 2,048 dimensions
- **Intermediate Size:** 11,008 dimensions
- **Number of Layers:** 36 transformer layers
- **Attention Heads:** 16 attention heads
- **Key-Value Heads:** 2 key-value heads (Grouped Query Attention)
- **Max Position Embeddings:** 32,768 tokens
- **Vocabulary Size:** 151,936 tokens

### Training Configuration
- **Attention Dropout:** 0.0 (attention dropout disabled)
- **RMS Norm Epsilon:** 1e-06
- **Initializer Range:** 0.02
- **Hidden Activation:** SiLU (Swish)
- **Layer Types:** Full attention across all 36 layers
- **Sliding Window:** Disabled (full attention context)
- **RoPE Scaling:** None (standard rotary position encoding)
- **RoPE Theta:** 1,000,000

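
The attention figures above determine how much memory the key-value cache needs at long contexts. The following is a minimal sketch of that arithmetic, assuming 2-byte (fp16/bf16) cache entries and a per-head dimension equal to the hidden size divided by the number of attention heads. The function name `kv_cache_bytes` is illustrative and not part of any library.

```python
# Architecture values taken from the lists above.
HIDDEN_SIZE = 2048
NUM_LAYERS = 36
NUM_ATTENTION_HEADS = 16
NUM_KV_HEADS = 2


def kv_cache_bytes(num_tokens, num_kv_heads=NUM_KV_HEADS, bytes_per_value=2):
    """Estimate the KV-cache size for one sequence (assumes fp16/bf16 entries)."""
    head_dim = HIDDEN_SIZE // NUM_ATTENTION_HEADS  # 2048 / 16 = 128
    # Factor 2: one key vector and one value vector per KV head, layer, and token.
    per_token = 2 * NUM_LAYERS * num_kv_heads * head_dim * bytes_per_value
    return per_token * num_tokens


gqa = kv_cache_bytes(32_768)
mha = kv_cache_bytes(32_768, num_kv_heads=16)
print(f"2 KV heads (GQA): {gqa / 2**30:.3f} GiB")
print(f"16 KV heads:      {mha / 2**30:.3f} GiB")
```

Under these assumptions, a full 32,768-token context needs roughly 1.1 GiB of cache with 2 key-value heads, one eighth of what 16 key-value heads would require.
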
## Model Files

### Core Model Weights
- **model-00001-of-00003.safetensors** (4.6GB) - First shard
- **model-00002-of-00003.safetensors** (4.6GB) - Second shard
- **model-00003-of-00003.safetensors** (2.3GB) - Third shard
- **model.safetensors.index.json** (35KB) - Shard index file

### Tokenizer and Vocabulary
- **tokenizer.json** (11MB) - Main tokenizer configuration
- **vocab.json** (2.6MB) - Vocabulary mapping
- **merges.txt** (1.6MB) - Byte-pair encoding merges
- **tokenizer_config.json** (4.6KB) - Tokenizer settings
- **special_tokens_map.json** (613B) - Special token definitions
- **added_tokens.json** (605B) - Additional tokens added during training

### Configuration Files
- **config.json** (1.5KB) - Model architecture configuration
- **generation_config.json** (243B) - Generation parameters
- **chat_template.jinja** (2.4KB) - Chat conversation template
- **training_args.bin** (5.7KB) - Training arguments and metadata

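
Before loading, it can help to confirm that every shard referenced by the index file was actually downloaded. The sketch below reads `model.safetensors.index.json`, whose `weight_map` entry maps tensor names to shard file names in sharded SafeTensors checkpoints, and reports shards that are missing. The helper name `missing_shards` is hypothetical.

```python
import json
from pathlib import Path


def missing_shards(model_dir):
    """Return shard files named in the index that are absent from model_dir."""
    model_dir = Path(model_dir)
    index = json.loads((model_dir / "model.safetensors.index.json").read_text())
    # "weight_map" maps each tensor name to the shard file that stores it.
    shard_names = sorted(set(index["weight_map"].values()))
    return [name for name in shard_names if not (model_dir / name).exists()]
```

Calling `missing_shards("path/to/qwen2.5-3B-datafusion-instruct")` should return an empty list when all three shards are present.
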
## Training Data

### Dataset Composition
- **Total QA Pairs:** 265,180
- **Source Projects:** 36 repositories
- **Content Types:** Code implementation, documentation, usage examples
- **Coverage:** Comprehensive DataFusion ecosystem

### Training Projects Covered
- **Core DataFusion:** datafusion, datafusion-ballista, datafusion-federation
- **DataFusion Extensions:** datafusion-functions-json, datafusion-postgres, datafusion-python
- **Arrow Ecosystem:** arrow-rs, arrow-zarr
- **Related Tools:** blaze, exon, feldera, greptimedb, horaedb, influxdb
- **Modern Data Stack:** iceberg-rust, LakeSoul, lance, openobserve, parseable

### Data Quality Features
- Structured JSONL format with source attribution
- Code examples with best practices and common pitfalls
- Error handling guidance and troubleshooting solutions
- Performance optimization tips

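
The training data is described above as JSONL with source attribution. As a minimal sketch of working with such a file, the snippet below counts QA pairs per source project. The field names `question`, `answer`, and `source`, and the sample records, are assumptions made for illustration and are not the dataset's confirmed schema.

```python
import json
from collections import Counter


def count_pairs_by_source(lines):
    """Count QA pairs per source project in an iterable of JSONL lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # "source" is a hypothetical field name; adjust it to the real schema.
        counts[record.get("source", "unknown")] += 1
    return counts


# Hypothetical sample records, for illustration only.
sample = [
    '{"question": "q1", "answer": "a1", "source": "datafusion"}',
    '{"question": "q2", "answer": "a2", "source": "arrow-rs"}',
    '{"question": "q3", "answer": "a3", "source": "datafusion"}',
]
print(count_pairs_by_source(sample))
```
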
## Model Capabilities

### Primary Strengths
1. **Rust Programming Expertise**
   - Idiomatic Rust code generation
   - DataFusion API usage patterns
   - Error handling and testing best practices
   - Performance optimization techniques

2. **DataFusion SQL Mastery**
   - Complex SQL query construction
   - Table provider implementations
   - UDF (User-Defined Function) development
   - Query optimization and execution planning

3. **Data Processing Knowledge**
   - Arrow format operations
   - Parquet file handling
   - Data transformation pipelines
   - Streaming and batch processing

4. **System Architecture Understanding**
   - Distributed query execution
   - Federation and integration patterns
   - Observability and tracing
   - Performance monitoring

### Technical Domains
- **SQL Engine Internals:** Query planning, optimization, execution
- **Data Formats:** Arrow, Parquet, JSON, CSV, Avro
- **Storage Systems:** Object storage, databases, file systems
- **Distributed Computing:** Ray, Ballista, cluster management
- **Streaming:** Real-time data processing, windowing, aggregations

## Usage Instructions

### Direct Usage with Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "path/to/qwen2.5-3B-datafusion-instruct",
    device_map="auto",
    torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(
    "path/to/qwen2.5-3B-datafusion-instruct"
)

# Generate response
prompt = "How do I create a custom UDF in DataFusion?"
# Move the inputs to the device the model was placed on
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,  # temperature and top_p only take effect when sampling
    temperature=0.7,
    top_p=0.8,
    repetition_penalty=1.05
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

### Chat Template Usage
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "path/to/qwen2.5-3B-datafusion-instruct"
)

# Prepare chat messages
messages = [
    {"role": "system", "content": "You are a DataFusion expert."},
    {"role": "user", "content": "How do I optimize a SQL query?"}
]

# Apply chat template
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
print(prompt)
```

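
The chat template only builds the prompt string. A minimal sketch of going from a question to a decoded reply is shown here, assuming `model` and `tokenizer` were loaded as in the Direct Usage example. The helper names `build_messages` and `generate_reply` are illustrative, not part of Transformers.

```python
def build_messages(question, system_prompt="You are a DataFusion expert."):
    """Build the chat message list expected by apply_chat_template."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]


def generate_reply(model, tokenizer, question, max_new_tokens=512):
    """Generate a reply and return only the newly generated text."""
    prompt = tokenizer.apply_chat_template(
        build_messages(question),
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so the decoded text contains just the answer.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Slicing off the prompt tokens before decoding avoids echoing the system and user messages back in the response.
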
## Generation Parameters

### Default Configuration
- **Temperature:** 0.7 (balanced creativity vs consistency)
- **Top-p:** 0.8 (nucleus sampling)
- **Top-k:** 20 (top-k sampling)
- **Repetition Penalty:** 1.05 (discourages repetitive output)
- **Do Sample:** True (enables sampling-based generation)

### Recommended Settings
- **For Code Generation:** temperature=0.3, top_p=0.9
- **For Explanations:** temperature=0.7, top_p=0.8
- **For Debugging:** temperature=0.1, top_p=0.95
- **For Learning:** temperature=0.5, top_p=0.85

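
One way to apply the settings above is to keep them in a small lookup table and merge a task preset over the defaults. This is a sketch only: the preset keys (`"code"`, `"explanation"`, `"debugging"`, `"learning"`) and the helper `generation_kwargs` are names chosen here for illustration.

```python
# Defaults from the "Default Configuration" list above.
DEFAULTS = {
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "repetition_penalty": 1.05,
    "do_sample": True,
}

# Task presets from the "Recommended Settings" list; the keys are illustrative.
PRESETS = {
    "code": {"temperature": 0.3, "top_p": 0.9},
    "explanation": {"temperature": 0.7, "top_p": 0.8},
    "debugging": {"temperature": 0.1, "top_p": 0.95},
    "learning": {"temperature": 0.5, "top_p": 0.85},
}


def generation_kwargs(task=None):
    """Return the defaults, overridden by the preset for the given task."""
    kwargs = dict(DEFAULTS)
    if task is not None:
        kwargs.update(PRESETS[task])
    return kwargs


print(generation_kwargs("code"))
```

The resulting dictionary can be passed straight to generation, for example `model.generate(**inputs, max_new_tokens=512, **generation_kwargs("code"))`.
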
## Performance Characteristics

### Model Size and Memory
- **Total Parameters:** ~3 billion parameters
- **Model Size:** 11.5GB (distributed across 3 shards)
- **Memory Usage:** ~16-24GB RAM during inference
- **GPU Memory:** 12-16GB VRAM (depending on precision)

### Inference Performance
- **Context Length:** Up to 32,768 tokens
- **Generation Speed:** ~10-50 tokens/second (depending on hardware)
- **Memory Efficiency:** Grouped Query Attention (2 key-value heads) keeps the key-value cache small for large context windows
- **Batch Processing:** Supports batched inference

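
The memory figures above depend strongly on the numeric precision of the weights. As a rough sketch, weight memory can be estimated as parameter count times bytes per parameter. The nominal 3 billion parameter count is an approximation, and the estimate covers weights only, not activations or the key-value cache.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1, "int4": 0.5}


def weight_memory_gb(num_params, dtype):
    """Estimate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: ~{weight_memory_gb(3e9, dtype):.1f} GB")
```

The float32 estimate of about 12 GB is close to the 11.5GB checkpoint size listed above, while half precision roughly halves it.
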
## Installation and Setup

### Requirements
```bash
# Python dependencies
pip install torch transformers accelerate safetensors

# For GPU acceleration (CUDA 11.8 wheels)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Optional: needed only for 8-bit/4-bit quantized loading
pip install bitsandbytes
```

### Model Loading
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Basic loading
model = AutoModelForCausalLM.from_pretrained("path/to/model")

# With device mapping
model = AutoModelForCausalLM.from_pretrained(
    "path/to/model",
    device_map="auto",
    torch_dtype="auto"
)

# With quantization (for memory efficiency); requires the bitsandbytes package
model = AutoModelForCausalLM.from_pretrained(
    "path/to/model",
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True)  # or load_in_4bit=True
)
```

## Comparison with GGUF Versions

| Aspect | Original Model | GGUF Main | GGUF Quantized |
|--------|----------------|------------|-----------------|
| **Format** | SafeTensors | GGUF | GGUF (Quantized) |
| **Size** | 11.5GB | 5.8GB | 1.8GB |
| **Memory Usage** | Highest | High | Lower |
| **Accuracy** | Highest | High | High |
| **Flexibility** | Maximum | High | Standard |
| **Deployment** | Development/Research | Production | Production |
| **Hardware Requirements** | High | Medium | Low |

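
The sizes in the table can guide a first choice of variant for a given machine. The sketch below picks the largest variant whose file size, multiplied by a headroom factor, fits a memory budget. Treating file size as a proxy for memory use and the 1.2 headroom factor are both assumptions made here, and the helper name is hypothetical.

```python
# File sizes in GB, taken from the comparison table above.
VARIANT_SIZES_GB = {
    "Original Model": 11.5,
    "GGUF Main": 5.8,
    "GGUF Quantized": 1.8,
}


def largest_variant_that_fits(memory_budget_gb, headroom=1.2):
    """Pick the largest variant whose size times headroom fits the budget.

    File size is only a rough lower bound on runtime memory; the headroom
    factor is an assumed safety margin, not a measured value.
    """
    fitting = {
        name: size
        for name, size in VARIANT_SIZES_GB.items()
        if size * headroom <= memory_budget_gb
    }
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


print(largest_variant_that_fits(8))
```
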
## Limitations and Considerations

### Technical Limitations
- **Context Window:** Limited to 32,768 tokens
- **Real-time Updates:** Knowledge is fixed at training time and may not reflect the latest API changes
- **Complex Queries:** Very complex scenarios may require human review
- **Edge Cases:** Unusual configurations may need manual intervention

### Best Practices
- **Verify Output:** Always review generated code before deployment
- **Test Thoroughly:** Validate generated queries and functions
- **Stay Updated:** Check for newer model versions
- **Human Oversight:** Use as an assistant, not a replacement for expertise

## Resources
- **DataFusion Documentation:** https://docs.datafusion.org/
- **Apache Arrow:** https://arrow.apache.org/
- **Rust Programming Language:** https://www.rust-lang.org/
- **Training Dataset:** https://huggingface.co/datasets/yarenty/datafusion_QA
- **Hugging Face Model:** Available for download and use

## Citation

When using this model in research or publications, please cite:

```bibtex
@software{qwen2.5_3b_datafusion_instruct,
  title={Qwen2.5-3B-DataFusion-Instruct: A Specialized Model for DataFusion Ecosystem},
  author={Fine-tuned on DataFusion Ecosystem QA Dataset},
  year={2025},
  url={https://github.com/yarenty/trainer},
  license={Apache-2.0}
}
```

## License

This model is licensed under the Apache 2.0 License. See the LICENSE file for full details.

---

*This original trained model is the foundation for specialized AI assistance in the DataFusion ecosystem and the highest-accuracy variant for development and research. It also serves as the source for the optimized GGUF versions used in other deployment scenarios.*