Create README.md
README.md
ADDED
@@ -0,0 +1,232 @@
---
license: apache-2.0
datasets:
- yarenty/datafusion_QA
base_model:
- Qwen/Qwen2.5-3B-Instruct
tags:
- rust
- datafusion
- arrow
---
# Qwen2.5-3B-DataFusion-Instruct GGUF Model

## Model Overview

- **Model Name:** Qwen2.5-3B-DataFusion-Instruct
- **Model Type:** Fine-tuned Large Language Model
- **Base Model:** Qwen2.5-3B-Instruct
- **Specialization:** DataFusion SQL Engine and Rust Programming
- **Format:** GGUF (the model file format used by llama.cpp and other GGML-based tools)
- **License:** Apache 2.0

## Model Description

This is a fine-tuned version of Qwen2.5-3B-Instruct, trained on question-answer data drawn from the DataFusion ecosystem to improve its performance on Rust programming, DataFusion SQL queries, and data processing tasks. The fine-tuning targets accurate, idiomatic code examples and clear technical explanations.

## Model Files

### Main Model
- **File:** `model.gguf` (5.8GB)
- **Type:** Full precision GGUF model
- **Use Case:** Production environments, highest accuracy requirements
- **Recommended For:** Development, debugging, complex queries

### Quantized Model
- **File:** `qwen2.5-3B-datafusion.gguf` (1.8GB)
- **Type:** Quantized GGUF model (optimized for inference)
- **Use Case:** Resource-constrained environments, faster inference
- **Recommended For:** Deployment, testing, resource-limited scenarios
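
The two file sizes give a rough idea of how many bits each file spends per model weight. The sketch below is a back-of-the-envelope estimate, not a statement about how the files were produced: it assumes about 3.09 billion parameters (the figure from the upstream Qwen2.5-3B model card, not from this README), treats 1 GB as 10^9 bytes, and ignores metadata overhead.

```python
# Assumption: Qwen2.5-3B has roughly 3.09 billion parameters (taken from
# the upstream Qwen2.5 model card, not from this README).
APPROX_PARAMS = 3.09e9


def bits_per_weight(file_size_gb: float, params: float = APPROX_PARAMS) -> float:
    """Average number of bits stored per parameter for a file of the given size.

    Treats 1 GB as 10**9 bytes and ignores metadata overhead in the file.
    """
    return file_size_gb * 1e9 * 8 / params


if __name__ == "__main__":
    # File names and sizes as listed in this README.
    for name, size_gb in [("model.gguf", 5.8), ("qwen2.5-3B-datafusion.gguf", 1.8)]:
        print(f"{name}: ~{bits_per_weight(size_gb):.1f} bits per weight")
```

Under these assumptions the main file works out to about 15 bits per weight, which is consistent with a 16-bit export, and the quantized file to roughly 4.7 bits per weight, in the range of a 4-bit to 5-bit quantization. The README does not state the exact quantization type.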

## Training Data

### Dataset Composition
- **Total QA Pairs:** 265,180
- **Source Projects:** 36 different repositories
- **Content Types:** Code implementation, documentation, usage examples
- **Coverage:** Comprehensive DataFusion ecosystem

### Training Projects
- **Core DataFusion:** datafusion, datafusion-ballista, datafusion-federation
- **DataFusion Extensions:** datafusion-functions-json, datafusion-postgres, datafusion-python
- **Arrow Ecosystem:** arrow-rs, arrow-zarr
- **Related Tools:** blaze, exon, feldera, greptimedb, horaedb, influxdb
- **Modern Data Stack:** iceberg-rust, LakeSoul, lance, openobserve, parseable

### Data Quality Features
- Structured JSONL format with source attribution
- Code examples with best practices and common pitfalls
- Error handling guidance and troubleshooting solutions
- Performance optimization tips

## Model Capabilities

### Primary Strengths
1. **Rust Programming Expertise**
   - Idiomatic Rust code generation
   - DataFusion API usage patterns
   - Error handling and testing best practices
   - Performance optimization techniques

2. **DataFusion SQL Mastery**
   - Complex SQL query construction
   - Table provider implementations
   - UDF (User-Defined Function) development
   - Query optimization and execution planning

3. **Data Processing Knowledge**
   - Arrow format operations
   - Parquet file handling
   - Data transformation pipelines
   - Streaming and batch processing

4. **System Architecture Understanding**
   - Distributed query execution
   - Federation and integration patterns
   - Observability and tracing
   - Performance monitoring

### Technical Domains
- **SQL Engine Internals:** Query planning, optimization, execution
- **Data Formats:** Arrow, Parquet, JSON, CSV, Avro
- **Storage Systems:** Object storage, databases, file systems
- **Distributed Computing:** Ray, Ballista, cluster management
- **Streaming:** Real-time data processing, windowing, aggregations

## Usage Instructions

### System Prompt
The model is configured with a specialized system prompt:
```
You are a helpful, concise, and accurate coding assistant specialized in Rust and the DataFusion SQL engine. Always provide high-level, idiomatic Rust code, DataFusion SQL examples, clear documentation, and robust test cases. Your answers should be precise, actionable, and end with '### End'.
```

### Prompt Template
```
### Instruction:
{{ .Prompt }}

### Response:
```

### Stop Sequences
- `### Instruction:`
- `### Response:`
- `### End`
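
When the GGUF file is driven by a runtime that does not apply the template and stop sequences automatically, the same behavior can be approximated by hand. The following is a minimal sketch, assuming the template and stop strings listed above; the helper names `build_prompt` and `truncate_at_stop` are hypothetical and not part of any library.

```python
# Template and stop strings copied from this README; the helper names are
# hypothetical and exist only for this illustration.
PROMPT_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n"
STOP_SEQUENCES = ["### Instruction:", "### Response:", "### End"]


def build_prompt(instruction: str) -> str:
    """Wrap a user instruction in the instruction/response template."""
    return PROMPT_TEMPLATE.format(instruction=instruction.strip())


def truncate_at_stop(text: str, stops=STOP_SEQUENCES) -> str:
    """Cut generated text at the earliest stop sequence, if one occurs."""
    cut = len(text)
    for stop in stops:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut].rstrip()


if __name__ == "__main__":
    print(build_prompt("How do I register a Parquet file as a table?"))
    print(truncate_at_stop("Use `register_parquet`.\n### End\nignored text"))
```

Truncating on the client side is only needed when the runtime returns text past a stop marker; runtimes that accept a stop list will usually halt generation there themselves.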

### Generation Parameters
- **num_predict:** 1024 (maximum tokens to generate)
- **repeat_penalty:** 1.2 (prevents repetitive output)
- **temperature:** 0.7 (balanced creativity vs consistency)
- **top_p:** 0.9 (nucleus sampling for quality)
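
These parameters can also be passed explicitly when calling a local Ollama server over HTTP. The sketch below assumes Ollama is running on its default port (11434) and that the model has been pulled under the name used in the installation section; `build_request` and `generate` are hypothetical helper names.

```python
import json
import urllib.request

# Assumptions: a local Ollama server on its default port, and the model
# name from the installation section of this README.
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL_NAME = "jaro/qwen2.5-3B-datafusion-instruct"

# Values copied from the generation parameters and stop sequences above.
GENERATION_OPTIONS = {
    "num_predict": 1024,
    "repeat_penalty": 1.2,
    "temperature": 0.7,
    "top_p": 0.9,
    "stop": ["### Instruction:", "### Response:", "### End"],
}


def build_request(prompt: str) -> dict:
    """Build the JSON body for a non-streaming generate call."""
    return {
        "model": MODEL_NAME,
        "prompt": prompt,
        "stream": False,
        "options": dict(GENERATION_OPTIONS),
    }


def generate(prompt: str, url: str = OLLAMA_URL) -> str:
    """Send the request to the server and return the generated text."""
    body = json.dumps(build_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


# Example call (requires a running server):
# print(generate("How do I create a custom UDF in DataFusion?"))
```

Values sent in `options` override the defaults stored with the model, so this is mainly useful for experimenting with different settings without rebuilding the model.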

## Performance Characteristics

### Accuracy
- **Code Generation:** High accuracy for Rust and DataFusion patterns
- **SQL Queries:** Correct syntax and best practices
- **Documentation:** Clear, actionable explanations
- **Error Handling:** Comprehensive coverage of common issues

### Efficiency
- **Main Model:** Highest accuracy, larger memory footprint
- **Quantized Model:** Optimized inference, reduced memory usage
- **Response Time:** Fast generation with proper stop sequences
- **Memory Usage:** Efficient token management

## Use Cases

### Development
- **Code Generation:** Generate Rust functions and DataFusion queries
- **Debugging:** Identify and fix common issues
- **Documentation:** Create clear technical explanations
- **Testing:** Generate test cases and validation code

### Learning
- **Tutorial Creation:** Step-by-step learning materials
- **Best Practices:** Learn recommended approaches
- **Pattern Recognition:** Understand common design patterns
- **API Exploration:** Discover available functionality

### Production Support
- **Query Optimization:** Improve SQL performance
- **Troubleshooting:** Resolve runtime issues
- **Integration:** Connect different data sources
- **Monitoring:** Set up observability and tracing

## Limitations and Considerations

### Technical Limitations
- **Knowledge Scope:** Limited to the projects and material covered by the training data
- **Real-time Updates:** May not reflect the latest API changes
- **Complex Queries:** Very complex scenarios may require human review
- **Edge Cases:** Unusual configurations may need manual intervention

### Best Practices
- **Verify Output:** Always review generated code before deployment
- **Test Thoroughly:** Validate generated queries and functions
- **Stay Updated:** Check for newer model versions
- **Human Oversight:** Use as an assistant, not a replacement for expertise

## Installation and Setup

### Ollama (Recommended)
```bash
# Pull the model
ollama pull jaro/qwen2.5-3B-datafusion-instruct

# Run inference
ollama run jaro/qwen2.5-3B-datafusion-instruct
```

### Direct GGUF Usage
```bash
# Using llama.cpp (llama-cli) or compatible tools
./llama-cli -m model.gguf -p "How do I create a custom UDF in DataFusion?"
```
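
When the GGUF file is run directly, the runtime does not know about the prompt template or the generation parameters documented above, so they have to be supplied on the command line. The sketch below assembles such a command, assuming llama.cpp's `llama-cli` binary in the current directory; the function name `llama_cli_command` is hypothetical.

```python
import shlex

# Template copied from the "Prompt Template" section of this README.
PROMPT_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n"


def llama_cli_command(model_path: str, question: str, binary: str = "./llama-cli") -> list:
    """Build an argument list for llama.cpp's llama-cli.

    Applies the README's prompt template and generation parameters.
    The binary location is an assumption; adjust it to the local build.
    """
    prompt = PROMPT_TEMPLATE.format(instruction=question.strip())
    return [
        binary,
        "-m", model_path,           # path to the GGUF file
        "-p", prompt,               # templated prompt
        "-n", "1024",               # num_predict
        "--temp", "0.7",            # temperature
        "--top-p", "0.9",           # top_p
        "--repeat-penalty", "1.2",  # repeat_penalty
    ]


if __name__ == "__main__":
    cmd = llama_cli_command("model.gguf", "How do I create a custom UDF in DataFusion?")
    # Print a shell-safe version of the command; to execute it instead,
    # pass `cmd` to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Building the command as a list avoids shell-quoting problems with the multi-line prompt.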

## Model Comparison

| Aspect | Main Model (5.8GB) | Quantized Model (1.8GB) |
|--------|-------------------|-------------------------|
| **Accuracy** | Highest | High (slight degradation) |
| **Memory Usage** | Higher | Lower |
| **Inference Speed** | Standard | Faster |
| **Deployment** | Development/Production | Production/Resource-constrained |
| **Use Case** | Maximum quality | Balanced performance |

## Community and Support

### Contributing
- Report issues with model behavior
- Suggest improvements to training data
- Share use cases and success stories
- Contribute to the DataFusion ecosystem

### Resources
- **DataFusion Documentation:** https://docs.datafusion.org/
- **Apache Arrow:** https://arrow.apache.org/
- **Rust Programming Language:** https://www.rust-lang.org/
- **Training Dataset:** Available at https://huggingface.co/datasets/yarenty/datafusion_QA

## Citation

When using this model in research or publications, please cite:

```bibtex
@software{qwen2.5_3b_datafusion_instruct,
  title={Qwen2.5-3B-DataFusion-Instruct: A Specialized Model for DataFusion Ecosystem},
  author={Fine-tuned on DataFusion Ecosystem QA Dataset},
  year={2025},
  url={https://github.com/apache/datafusion},
  license={Apache-2.0}
}
```

## License

This model is licensed under the Apache 2.0 License. See the LICENSE file for full details.

---

*This model provides specialized AI assistance for the DataFusion ecosystem, combining a general-purpose large language model with domain-specific training data on data processing and Rust programming.*