---
license: apache-2.0
datasets:
  - yarenty/datafusion_QA
base_model:
  - Qwen/Qwen2.5-3B-Instruct
tags:
  - rust
  - datafusion
  - qwen
---

# Qwen2.5-3B-DataFusion-Instruct: Original Trained Model

## Model Overview

- **Model Name:** Qwen2.5-3B-DataFusion-Instruct
- **Model Type:** fine-tuned large language model
- **Base Model:** Qwen2.5-3B-Instruct
- **Specialization:** DataFusion SQL engine and Rust programming
- **Format:** Hugging Face Transformers (SafeTensors)
- **License:** Apache 2.0
- **Total Size:** ~11.5GB (distributed across 3 shards)

## Model Description

This is the original trained version of Qwen2.5-3B-DataFusion-Instruct, containing the complete fine-tuned weights and configuration files. It was fine-tuned on a comprehensive DataFusion ecosystem QA dataset to excel at Rust programming, DataFusion SQL queries, and data processing tasks.

## Model Architecture

### Base Architecture

- **Model Type:** Qwen2ForCausalLM
- **Architecture:** transformer-based causal language model
- **Hidden Size:** 2,048
- **Intermediate Size:** 11,008
- **Number of Layers:** 36 transformer layers
- **Attention Heads:** 16
- **Key-Value Heads:** 2 (Grouped Query Attention)
- **Max Position Embeddings:** 32,768 tokens
- **Vocabulary Size:** 151,936 tokens

### Training Configuration

- **Attention Dropout:** 0.0 (attention dropout disabled)
- **RMS Norm Epsilon:** 1e-06
- **Initializer Range:** 0.02
- **Hidden Activation:** SiLU (Swish)
- **Layer Types:** full attention across all 36 layers
- **Sliding Window:** disabled (full-attention context)
- **RoPE Scaling:** none (standard rotary position embeddings)
- **RoPE Theta:** 1,000,000

All of the above can be verified against the shipped `config.json`, as shown below.
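
A minimal check using the Transformers API (the local path is a placeholder, as in the usage examples later in this card):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("path/to/qwen2.5-3B-datafusion-instruct")

print(config.model_type)               # "qwen2"
print(config.hidden_size)              # 2048
print(config.intermediate_size)        # 11008
print(config.num_hidden_layers)        # 36
print(config.num_attention_heads)      # 16
print(config.num_key_value_heads)      # 2 (Grouped Query Attention)
print(config.max_position_embeddings)  # 32768
print(config.rms_norm_eps)             # 1e-06
print(config.rope_theta)               # 1000000.0
```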

## Model Files

### Core Model Weights

- `model-00001-of-00003.safetensors` (4.6GB) - first shard
- `model-00002-of-00003.safetensors` (4.6GB) - second shard
- `model-00003-of-00003.safetensors` (2.3GB) - third shard
- `model.safetensors.index.json` (35KB) - shard index file (inspected below)
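
The index file maps each tensor name to the shard that stores it; a quick inspection (tensor names follow the standard Qwen2 layout):

```python
import json

with open("path/to/qwen2.5-3B-datafusion-instruct/model.safetensors.index.json") as f:
    index = json.load(f)

print(index["metadata"]["total_size"])  # total checkpoint size in bytes
print(len(index["weight_map"]))         # number of tensors across all shards
print(index["weight_map"]["model.embed_tokens.weight"])  # shard holding this tensor
```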

### Tokenizer and Vocabulary

- `tokenizer.json` (11MB) - main tokenizer configuration
- `vocab.json` (2.6MB) - vocabulary mapping
- `merges.txt` (1.6MB) - byte-pair encoding merges
- `tokenizer_config.json` (4.6KB) - tokenizer settings
- `special_tokens_map.json` (613B) - special token definitions
- `added_tokens.json` (605B) - additional tokens added during training
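
These files are all consumed by `AutoTokenizer`; a quick sanity check:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/qwen2.5-3B-datafusion-instruct")

print(tokenizer.vocab_size)          # base vocabulary size
print(len(tokenizer))                # including entries from added_tokens.json
print(tokenizer.special_tokens_map)  # mirrors special_tokens_map.json
```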

### Configuration Files

- `config.json` (1.5KB) - model architecture configuration
- `generation_config.json` (243B) - generation parameters
- `chat_template.jinja` (2.4KB) - chat conversation template
- `training_args.bin` (5.7KB) - training arguments and metadata

## Training Data

### Dataset Composition

- **Total QA Pairs:** 265,180
- **Source Projects:** 36 different repositories
- **Content Types:** code implementation, documentation, usage examples
- **Coverage:** comprehensive DataFusion ecosystem

### Training Projects Covered

- **Core DataFusion:** datafusion, datafusion-ballista, datafusion-federation
- **DataFusion Extensions:** datafusion-functions-json, datafusion-postgres, datafusion-python
- **Arrow Ecosystem:** arrow-rs, arrow-zarr
- **Related Tools:** blaze, exon, feldera, greptimedb, horaedb, influxdb
- **Modern Data Stack:** iceberg-rust, LakeSoul, lance, openobserve, parseable

### Data Quality Features

- Structured JSONL format with source attribution
- Code examples with best practices and common pitfalls
- Error handling guidance and troubleshooting solutions
- Performance optimization tips

The underlying dataset is published as [`yarenty/datafusion_QA`](https://huggingface.co/datasets/yarenty/datafusion_QA); a loading sketch follows.
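
A minimal way to pull and inspect it (the split and column names are assumptions; check the dataset card for specifics):

```python
from datasets import load_dataset

# Split and column names are assumptions; see the dataset card for specifics
ds = load_dataset("yarenty/datafusion_QA", split="train")
print(ds.num_rows)      # ~265,180 QA pairs in total
print(ds.column_names)  # expect question/answer text plus source attribution
print(ds[0])
```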

## Model Capabilities

### Primary Strengths

1. **Rust Programming Expertise**
   - Idiomatic Rust code generation
   - DataFusion API usage patterns
   - Error handling and testing best practices
   - Performance optimization techniques
2. **DataFusion SQL Mastery**
   - Complex SQL query construction
   - Table provider implementations
   - UDF (user-defined function) development (see the sketch after this list)
   - Query optimization and execution planning
3. **Data Processing Knowledge**
   - Arrow format operations
   - Parquet file handling
   - Data transformation pipelines
   - Streaming and batch processing
4. **System Architecture Understanding**
   - Distributed query execution
   - Federation and integration patterns
   - Observability and tracing
   - Performance monitoring
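
For a taste of what the UDF guidance covers, here is a minimal scalar UDF written with the `datafusion` Python bindings. This is a sketch assuming a recent datafusion-python release; the exact `udf` signature has shifted across versions:

```python
import pyarrow as pa
import pyarrow.compute as pc
from datafusion import SessionContext, udf

def add_one(arr: pa.Array) -> pa.Array:
    # DataFusion scalar UDFs operate on whole Arrow arrays, not single values
    return pc.add(arr, 1)

# Register the Python function as a SQL-callable UDF
add_one_udf = udf(add_one, [pa.int64()], pa.int64(), "immutable", name="add_one")

ctx = SessionContext()
ctx.register_udf(add_one_udf)
ctx.from_pydict({"a": [1, 2, 3]}, name="t")
ctx.sql("SELECT add_one(a) AS b FROM t").show()
```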

### Technical Domains

- **SQL Engine Internals:** query planning, optimization, execution
- **Data Formats:** Arrow, Parquet, JSON, CSV, Avro
- **Storage Systems:** object storage, databases, file systems
- **Distributed Computing:** Ray, Ballista, cluster management
- **Streaming:** real-time data processing, windowing, aggregations

## Usage Instructions

### Direct Usage with Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "path/to/qwen2.5-3B-datafusion-instruct",
    device_map="auto",
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "path/to/qwen2.5-3B-datafusion-instruct"
)

# Generate a response (inputs must live on the same device as the model)
prompt = "How do I create a custom UDF in DataFusion?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,  # required for temperature/top_p to take effect
    temperature=0.7,
    top_p=0.8,
    repetition_penalty=1.05,
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

### Chat Template Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "path/to/qwen2.5-3B-datafusion-instruct"
)

# Prepare chat messages
messages = [
    {"role": "system", "content": "You are a DataFusion expert."},
    {"role": "user", "content": "How do I optimize a SQL query?"},
]

# Apply the chat template
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
```
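
The templated prompt can then be passed to `model.generate` exactly as in the direct-usage example (assuming `model` is loaded as shown above):

```python
# Tokenize the templated prompt, generate, then decode only the new tokens
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.8,
)
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```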

## Generation Parameters

### Default Configuration

- **Temperature:** 0.7 (balances creativity and consistency)
- **Top-p:** 0.8 (nucleus sampling)
- **Top-k:** 20 (top-k sampling)
- **Repetition Penalty:** 1.05 (discourages repetitive output)
- **Do Sample:** True (enables sampling-based generation)

### Recommended Settings

- **Code Generation:** `temperature=0.3`, `top_p=0.9`
- **Explanations:** `temperature=0.7`, `top_p=0.8`
- **Debugging:** `temperature=0.1`, `top_p=0.95`
- **Learning:** `temperature=0.5`, `top_p=0.85`
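
A small helper that applies these presets; `PRESETS` and `generate_with_preset` are our own illustrative names, not part of the model or library:

```python
# Presets mirror the recommended settings above
PRESETS = {
    "code":        {"temperature": 0.3, "top_p": 0.90},
    "explanation": {"temperature": 0.7, "top_p": 0.80},
    "debugging":   {"temperature": 0.1, "top_p": 0.95},
    "learning":    {"temperature": 0.5, "top_p": 0.85},
}

def generate_with_preset(model, tokenizer, prompt, task="code", max_new_tokens=512):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        repetition_penalty=1.05,
        **PRESETS[task],
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```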

## Performance Characteristics

### Model Size and Memory

- **Total Parameters:** ~3 billion
- **Model Size:** 11.5GB (distributed across 3 shards)
- **Memory Usage:** ~16-24GB RAM during inference
- **GPU Memory:** 12-16GB VRAM, depending on precision

### Inference Performance

- **Context Length:** up to 32,768 tokens
- **Generation Speed:** ~10-50 tokens/second, depending on hardware
- **Memory Efficiency:** optimized for large context windows
- **Batch Processing:** supports batched inference (see the sketch below)
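
A minimal batched-generation sketch. The left padding and pad-token fallback are standard practice for decoder-only models, not details taken from this model's tokenizer config:

```python
# Assumes `model` and `tokenizer` are loaded as in the usage examples above
prompts = [
    "How do I register a CSV file with DataFusion?",
    "What does a TableProvider implementation need?",
]

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # fallback if no pad token is set

tokenizer.padding_side = "left"  # keeps generated tokens contiguous for decoder-only models
batch = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

outputs = model.generate(**batch, max_new_tokens=256, do_sample=True, temperature=0.7)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```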

## Installation and Setup

### Requirements

```bash
# Python dependencies
pip install torch transformers accelerate safetensors

# For GPU acceleration (CUDA 11.8 wheels)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```
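
A quick sanity check that the stack is importable and CUDA is visible:

```python
import torch
import transformers

print("transformers", transformers.__version__)
print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
```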

### Model Loading

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Basic loading
model = AutoModelForCausalLM.from_pretrained("path/to/model")

# With automatic device mapping
model = AutoModelForCausalLM.from_pretrained(
    "path/to/model",
    device_map="auto",
    torch_dtype="auto",
)

# With quantization for memory efficiency (requires the bitsandbytes package;
# recent transformers versions expect a BitsAndBytesConfig rather than the
# deprecated bare load_in_8bit/load_in_4bit kwargs)
model = AutoModelForCausalLM.from_pretrained(
    "path/to/model",
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # or load_in_4bit=True
)
```

## Comparison with GGUF Versions

| Aspect | Original Model | GGUF Main | GGUF Quantized |
|--------|----------------|-----------|----------------|
| Format | SafeTensors | GGUF | GGUF (quantized) |
| Size | 11.5GB | 5.8GB | 1.8GB |
| Memory Usage | Highest | High | Lower |
| Accuracy | Highest | High | High |
| Flexibility | Maximum | High | Standard |
| Deployment | Development/research | Production | Production |
| Hardware Requirements | High | Medium | Low |

## Limitations and Considerations

### Technical Limitations

- **Context Window:** limited to 32,768 tokens
- **Real-time Updates:** may not reflect the latest API changes
- **Complex Queries:** very complex scenarios may require human review
- **Edge Cases:** unusual configurations may need manual intervention

### Best Practices

- **Verify Output:** always review generated code before deployment
- **Test Thoroughly:** validate generated queries and functions
- **Stay Updated:** check for newer model versions
- **Human Oversight:** use the model as an assistant, not a replacement for expertise

## Citation

When using this model in research or publications, please cite:

```bibtex
@software{qwen2.5_3b_datafusion_instruct,
  title   = {Qwen2.5-3B-DataFusion-Instruct: A Specialized Model for the DataFusion Ecosystem},
  author  = {yarenty},
  year    = {2025},
  url     = {https://github.com/yarenty/trainer},
  license = {Apache-2.0},
  note    = {Fine-tuned on the DataFusion ecosystem QA dataset}
}
```

## License

This model is licensed under the Apache 2.0 License. See the LICENSE file for full details.


This original trained model is the foundation of specialized AI assistance for the DataFusion ecosystem, providing the highest-quality outputs for development, research, and production use. It also serves as the source from which the optimized GGUF versions for various deployment scenarios are derived.