Qwen2.5-3B-DataFusion-Instruct: Original Trained Model
Model Overview
Model Name: Qwen2.5-3B-DataFusion-Instruct
Model Type: Fine-tuned Large Language Model
Base Model: Qwen2.5-3B
Specialization: DataFusion SQL Engine and Rust Programming
Format: Hugging Face Transformers (SafeTensors)
License: Apache 2.0
Total Size: ~11.5GB (distributed across 3 shards)
Model Description
This is the original trained version of the Qwen2.5-3B-DataFusion-Instruct model, containing the complete fine-tuned weights and configuration files. This model has been specifically trained on comprehensive DataFusion ecosystem data to excel at Rust programming, DataFusion SQL queries, and data processing tasks.
Model Architecture
Base Architecture
- Model Type: Qwen2ForCausalLM
- Architecture: Transformer-based causal language model
- Hidden Size: 2,048 dimensions
- Intermediate Size: 11,008 dimensions
- Number of Layers: 36 transformer layers
- Attention Heads: 16 attention heads
- Key-Value Heads: 2 key-value heads (Grouped Query Attention; each KV head is shared by 8 query heads)
- Max Position Embeddings: 32,768 tokens
- Vocabulary Size: 151,936 tokens
Training Configuration
- Attention Dropout: 0.0 (attention dropout disabled)
- RMS Norm Epsilon: 1e-06
- Initializer Range: 0.02
- Hidden Activation: SiLU (Swish)
- Layer Types: Full attention across all 36 layers
- Sliding Window: Disabled (full attention context)
- RoPE Scaling: None (standard rotary position encoding)
- RoPE Theta: 1,000,000
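All of these values can be verified against the shipped config.json. A minimal sketch, assuming the local model path used in the usage examples below:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("path/to/qwen2.5-3B-datafusion-instruct")
print(config.hidden_size)               # 2048
print(config.num_hidden_layers)         # 36
print(config.num_attention_heads)       # 16
print(config.num_key_value_heads)       # 2
print(config.rope_theta)                # 1000000.0
print(config.max_position_embeddings)   # 32768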
Model Files
Core Model Weights
- model-00001-of-00003.safetensors (4.6GB) - First shard
- model-00002-of-00003.safetensors (4.6GB) - Second shard
- model-00003-of-00003.safetensors (2.3GB) - Third shard
- model.safetensors.index.json (35KB) - Shard index file
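The index file is what lets Transformers stitch the three shards back together: it maps every tensor name to the shard that stores it. A quick way to inspect it (the tensor name shown is a standard Qwen2-style embedding weight):

import json

with open("path/to/model/model.safetensors.index.json") as f:
    index = json.load(f)

print(index["metadata"]["total_size"])                  # total bytes across all shards
print(index["weight_map"]["model.embed_tokens.weight"]) # e.g. "model-00001-of-00003.safetensors"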
Tokenizer and Vocabulary
- tokenizer.json (11MB) - Main tokenizer configuration
- vocab.json (2.6MB) - Vocabulary mapping
- merges.txt (1.6MB) - Byte-pair encoding merges
- tokenizer_config.json (4.6KB) - Tokenizer settings
- special_tokens_map.json (613B) - Special token definitions
- added_tokens.json (605B) - Additional tokens added during training
Configuration Files
- config.json (1.5KB) - Model architecture configuration
- generation_config.json (243B) - Generation parameters
- chat_template.jinja (2.4KB) - Chat conversation template
- training_args.bin (5.7KB) - Training arguments and metadata
Training Data
Dataset Composition
- Total QA Pairs: 265,180
- Source Projects: 36 different repositories
- Content Types: Code implementation, documentation, usage examples
- Coverage: Comprehensive DataFusion ecosystem
Training Projects Covered
- Core DataFusion: datafusion, datafusion-ballista, datafusion-federation
- DataFusion Extensions: datafusion-functions-json, datafusion-postgres, datafusion-python
- Arrow Ecosystem: arrow-rs, arrow-zarr
- Related Tools: blaze, exon, feldera, greptimedb, horaedb, influxdb
- Modern Data Stack: iceberg-rust, LakeSoul, lance, openobserve, parseable
Data Quality Features
- Structured JSONL format with source attribution (illustrated after this list)
- Code examples with best practices and common pitfalls
- Error handling guidance and troubleshooting solutions
- Performance optimization tips and best practices
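The exact record schema isn't reproduced in this card; purely as an illustrative sketch, a JSONL record with source attribution could look like the following (all field names here are hypothetical):

import json

# Hypothetical record shape -- the real schema in yarenty/datafusion_QA may differ
record = {
    "question": "How do I register a CSV file as a table in DataFusion?",
    "answer": "Call SessionContext::register_csv with a table name and file path...",
    "source": "datafusion",  # source-project attribution
}
print(json.dumps(record))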
Model Capabilities
Primary Strengths
Rust Programming Expertise
- Idiomatic Rust code generation
- DataFusion API usage patterns
- Error handling and testing best practices
- Performance optimization techniques
DataFusion SQL Mastery
- Complex SQL query construction
- Table provider implementations
- UDF (User-Defined Function) development (see the sketch after this list)
- Query optimization and execution planning
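As a taste of this domain, here is a minimal scalar UDF sketch using the datafusion Python bindings; API details vary between datafusion-python releases, so treat this as an illustration rather than a reference:

import pyarrow as pa
import pyarrow.compute as pc
from datafusion import SessionContext, udf

# Scalar UDF operating on Arrow arrays
def is_positive(arr: pa.Array) -> pa.Array:
    return pc.greater(arr, 0)

is_positive_udf = udf(is_positive, [pa.int64()], pa.bool_(), "stable")

ctx = SessionContext()
ctx.register_udf(is_positive_udf)
# Now usable in SQL, e.g.: SELECT is_positive(col) FROM my_table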
Data Processing Knowledge
- Arrow format operations
- Parquet file handling
- Data transformation pipelines
- Streaming and batch processing
System Architecture Understanding
- Distributed query execution
- Federation and integration patterns
- Observability and tracing
- Performance monitoring
Technical Domains
- SQL Engine Internals: Query planning, optimization, execution
- Data Formats: Arrow, Parquet, JSON, CSV, Avro
- Storage Systems: Object storage, databases, file systems
- Distributed Computing: Ray, Ballista, cluster management
- Streaming: Real-time data processing, windowing, aggregations
Usage Instructions
Direct Usage with Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "path/to/qwen2.5-3B-datafusion-instruct",
    device_map="auto",
    torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(
    "path/to/qwen2.5-3B-datafusion-instruct"
)

# Generate a response
prompt = "How do I create a custom UDF in DataFusion?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,  # required for temperature/top_p to take effect
    temperature=0.7,
    top_p=0.8,
    repetition_penalty=1.05
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
Chat Template Usage
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "path/to/qwen2.5-3B-datafusion-instruct"
)

# Prepare chat messages
messages = [
    {"role": "system", "content": "You are a DataFusion expert."},
    {"role": "user", "content": "How do I optimize a SQL query?"}
]

# Apply the chat template
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
print(prompt)
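From here, the templated prompt is tokenized and passed to generate exactly as in the previous example. A minimal continuation, assuming the model object loaded above:

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.8
)
# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)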
Generation Parameters
Default Configuration
- Temperature: 0.7 (balanced creativity vs consistency)
- Top-p: 0.8 (nucleus sampling)
- Top-k: 20 (top-k sampling)
- Repetition Penalty: 1.05 (prevents repetitive output)
- Do Sample: True (enables sampling-based generation)
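These defaults live in the shipped generation_config.json and can be inspected directly; a minimal sketch, assuming the local model path used above:

from transformers import GenerationConfig

gen = GenerationConfig.from_pretrained("path/to/qwen2.5-3B-datafusion-instruct")
print(gen.temperature, gen.top_p, gen.top_k, gen.repetition_penalty)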
Recommended Settings
- For Code Generation: temperature=0.3, top_p=0.9
- For Explanations: temperature=0.7, top_p=0.8
- For Debugging: temperature=0.1, top_p=0.95
- For Learning: temperature=0.5, top_p=0.85
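One convenient way to switch between these presets is a small lookup table; PRESETS is a name introduced here for illustration, not something shipped with the model:

# Hypothetical preset table mirroring the recommendations above
PRESETS = {
    "code":    {"temperature": 0.3, "top_p": 0.90},
    "explain": {"temperature": 0.7, "top_p": 0.80},
    "debug":   {"temperature": 0.1, "top_p": 0.95},
    "learn":   {"temperature": 0.5, "top_p": 0.85},
}

outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, **PRESETS["code"])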
Performance Characteristics
Model Size and Memory
- Total Parameters: ~3 billion parameters
- Model Size: 11.5GB (distributed across 3 shards)
- Memory Usage: ~16-24GB RAM during inference
- GPU Memory: 12-16GB VRAM (depending on precision)
Inference Performance
- Context Length: Up to 32,768 tokens
- Generation Speed: ~10-50 tokens/second (depending on hardware)
- Memory Efficiency: Optimized for large context windows
- Batch Processing: Supports batched inference
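Batched generation with a decoder-only model like this one requires left padding; a minimal sketch, reusing the model and tokenizer loaded earlier:

# Decoder-only models should be left-padded for batched generation
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

prompts = [
    "Explain DataFusion's logical plan optimizer.",
    "Show how to read a Parquet file with DataFusion.",
]
batch = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**batch, max_new_tokens=256)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))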
Installation and Setup
Requirements
# Python dependencies
pip install torch transformers accelerate safetensors
# For GPU acceleration
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
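Before loading an ~11.5GB checkpoint, it is worth confirming that the CUDA build of PyTorch is actually active:

import torch

print(torch.__version__)
print(torch.cuda.is_available())  # should be True for GPU inference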
Model Loading
from transformers import AutoModelForCausalLM

# Basic loading
model = AutoModelForCausalLM.from_pretrained("path/to/model")

# With device mapping
model = AutoModelForCausalLM.from_pretrained(
    "path/to/model",
    device_map="auto",
    torch_dtype="auto"
)

# With quantization for memory efficiency (requires the bitsandbytes package)
model = AutoModelForCausalLM.from_pretrained(
    "path/to/model",
    device_map="auto",
    load_in_8bit=True  # or load_in_4bit=True
)
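Note that recent transformers releases route quantization through BitsAndBytesConfig rather than the bare load_in_8bit/load_in_4bit flags; a sketch of the newer style (also requires bitsandbytes):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "path/to/model",
    device_map="auto",
    quantization_config=quant_config,
)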
Comparison with GGUF Versions
| Aspect | Original Model | GGUF Main | GGUF Quantized |
|---|---|---|---|
| Format | SafeTensors | GGUF | GGUF (Quantized) |
| Size | 11.5GB | 5.8GB | 1.8GB |
| Memory Usage | Highest | High | Lower |
| Accuracy | Highest | High | High |
| Flexibility | Maximum | High | Standard |
| Deployment | Development/Research | Production | Production |
| Hardware Requirements | High | Medium | Low |
Limitations and Considerations
Technical Limitations
- Context Window: Limited to 32,768 tokens
- Knowledge Freshness: Trained on a fixed snapshot of the ecosystem, so it may not reflect the latest API changes
- Complex Queries: Very complex scenarios may require human review
- Edge Cases: Unusual configurations may need manual intervention
Best Practices
- Verify Output: Always review generated code before deployment
- Test Thoroughly: Validate generated queries and functions
- Stay Updated: Check for newer model versions
- Human Oversight: Use as assistant, not replacement for expertise
Resources
- DataFusion Documentation: https://docs.datafusion.org/
- Apache Arrow: https://arrow.apache.org/
- Rust Programming Language: https://www.rust-lang.org/
- Training Dataset: https://huggingface.co/datasets/yarenty/datafusion_QA
- Hugging Face Model: Available for download and use
Citation
When using this model in research or publications, please cite:
@software{qwen2.5_3b_datafusion_instruct,
  title={Qwen2.5-3B-DataFusion-Instruct: A Specialized Model for DataFusion Ecosystem},
  author={Fine-tuned on DataFusion Ecosystem QA Dataset},
  year={2025},
  url={https://github.com/yarenty/trainer},
  license={Apache-2.0}
}
License
This model is licensed under the Apache 2.0 License. See the LICENSE file for full details.
This original trained model represents the foundation of specialized AI assistance for the DataFusion ecosystem, providing the highest quality outputs for development, research, and production use cases. It serves as the source for creating optimized GGUF versions for various deployment scenarios.