	---
	language: en
	license: apache-2.0
	base_model: Qwen/Qwen2.5-Coder-0.5B
	tags:
	- text-to-sql
	- spider-dataset
	- sql-generation
	- code-generation
	- master-thesis-research
	datasets:
	- spider
	metrics:
	- execution_accuracy
	pipeline_tag: text-generation
	library_name: transformers
	---

	# Qwen2.5-Coder-0.5B Fine-tuned on Spider Dataset

	This model is a fine-tuned version of [Qwen/Qwen2.5-Coder-0.5B](https://huggingface.co/Qwen/Qwen2.5-Coder-0.5B) on the Spider dataset for text-to-SQL generation, developed as part of academic thesis research.

	## Model Details

	### Model Description

	This model converts natural language questions into SQL queries using the Qwen2.5-Coder architecture fine-tuned on the Spider dataset. On the Spider development set it reaches 39.2% execution accuracy, up from 34.5% for the base model, and it can generate joins, aggregations, and nested queries.

	- Developed by: Ali Assi
	- Model type: Causal Language Model (Text-to-SQL)
	- Language(s): English
	- License: Apache 2.0
	- Finetuned from model: Qwen/Qwen2.5-Coder-0.5B
	- Research Context: Academic thesis research
	- University: Lebanese University
	- Contact: [email protected]

	### Model Sources

	- Repository: https://github.com/AliiAssi
	- Hugging Face: https://huggingface.co/alialialialaiali/qwen2.5-coder-spider-sql
	- Base Model: https://huggingface.co/Qwen/Qwen2.5-Coder-0.5B

	## Performance

	### Performance Comparison

	| Metric | Base Qwen Model | Fine-tuned Model |
	|--------|-----------------|------------------|
	| Total Queries | 1,034 | 1,034 |
	| Execution Success | 570 | 577 |
	| Execution Success Rate | 55.1% | 55.8% |
	| Correct Results | 357 | 405 |
	| Execution Accuracy | 34.5% | 39.2% |
	| Parse Errors | 464 | 457 |
	| Evaluation Time | 84.9 min | 38.7 min |

	Overall Performance Summary

	The model achieved a 39.17% execution accuracy on the Spider development set, correctly generating 405 out of 1,034 SQL queries. While this represents moderate performance, it demonstrates the model's capability to handle basic to intermediate SQL generation tasks across diverse database domains.

	Key Performance Metrics:

	- 🏆 Execution Accuracy: 39.17% (405/1,034 queries returned correct results)
	- Execution Success Rate: 55.80% (577/1,034 queries executed without errors)
	- Parse Error Rate: 44.20% (457 queries had syntax issues)
	- Database Error Rate: 0.00% (no database-related errors when queries parsed correctly)
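
These rates follow directly from the raw counts in the comparison table. The sketch below recomputes them; the function name `rates` is illustrative and not part of the original evaluation code, and it assumes every query that failed to execute is counted as a parse error, which matches the reported counts (1,034 - 577 = 457).

```python
def rates(total, executed, correct):
    """Return the headline metrics as percentages of all queries."""
    return {
        "execution_accuracy": 100 * correct / total,
        "execution_success_rate": 100 * executed / total,
        # Assumption: every non-executing query is a parse error.
        "parse_error_rate": 100 * (total - executed) / total,
        # Share of executable queries whose results were also correct.
        "accuracy_given_executed": 100 * correct / executed,
    }


base = rates(total=1034, executed=570, correct=357)
tuned = rates(total=1034, executed=577, correct=405)

for name in tuned:
    print(f"{name}: base {base[name]:.2f}%  fine-tuned {tuned[name]:.2f}%")
```

The last entry is not reported in the table but is derived from the same counts: it shows how often a query that executes also returns the correct result.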

	Key Findings:

	The evaluation reveals distinct performance characteristics:

	Execution Statistics:
	- Success Rate: 55.80% of queries executed successfully, indicating reasonable SQL syntax generation capability
	- Parse Errors: 44.2% of queries failed to parse, highlighting the primary challenge in SQL syntax generation
	- Database Validity: The 0% database error rate suggests that queries which parse correctly also reference valid tables and columns in the target schemas

	This pattern indicates that the model's main limitation is generating syntactically correct SQL rather than constructing query logic, which suggests room for improvement through syntax-focused training or post-processing validation. Semantic errors still matter, though: 172 of the 577 queries that executed (29.8%) returned incorrect results.

	## Uses

	### Direct Use

	The model is designed for converting natural language questions into SQL queries for database querying applications. It works best with:

	- Cross-domain database queries (Spider spans 200 databases across diverse domains)
	- Complex SQL generation (joins, aggregations, subqueries)
	- Academic research in semantic parsing and code generation
	- Educational applications for SQL learning and demonstration

	### Example Usage

	```python
	from transformers import AutoTokenizer, AutoModelForCausalLM
	import torch

	# Load model and tokenizer
	model_name = "alialialialaiali/qwen2.5-coder-spider-sql"
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForCausalLM.from_pretrained(model_name)

	# Example database schema
	schema = '''-- Table: students
	student_id (number)
	name (text)
	age (number)
	major (text)

	-- Table: courses
	course_id (number)
	course_name (text)
	credits (number)

	-- Table: enrollments
	student_id (number)
	course_id (number)
	grade (text)'''

	# Natural language question
	question = "What are the names of students enrolled in courses with more than 3 credits?"

	# Create prompt
	prompt = f'''-- Database Schema:
	{schema}

	-- Question: {question}
	-- SQL Query:'''

	# Generate SQL
	inputs = tokenizer(prompt, return_tensors="pt", max_length=512, truncation=True)

	with torch.no_grad():
	    outputs = model.generate(
	        **inputs,
	        max_new_tokens=150,
	        do_sample=False,  # greedy decoding; temperature has no effect when sampling is off
	        pad_token_id=tokenizer.eos_token_id,
	    )

	# Extract generated SQL
	generated_sql = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True).strip()
	print("Generated SQL:", generated_sql)
	```
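
The prompt in the example above is assembled by hand. For other schemas, the same layout could be produced programmatically. The helpers below (`format_schema` and `build_prompt`) are hypothetical names introduced here for illustration; they only reproduce the text format shown in the example and are not part of the released code.

```python
def format_schema(tables):
    """Render {table: [(column, type), ...]} in the schema layout used above."""
    blocks = []
    for table, columns in tables.items():
        lines = [f"-- Table: {table}"]
        lines += [f"{column} ({col_type})" for column, col_type in columns]
        blocks.append("\n".join(lines))
    # Tables are separated by a blank line, as in the example schema.
    return "\n\n".join(blocks)


def build_prompt(tables, question):
    """Combine schema and question into the prompt layout used above."""
    return (
        "-- Database Schema:\n"
        f"{format_schema(tables)}\n\n"
        f"-- Question: {question}\n"
        "-- SQL Query:"
    )


tables = {
    "students": [("student_id", "number"), ("name", "text"), ("age", "number")],
    "courses": [("course_id", "number"), ("course_name", "text")],
}
print(build_prompt(tables, "How many students are older than 20?"))
```

The resulting string can be passed to the tokenizer in place of the hand-written `prompt`.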

	### Out-of-Scope Use

	- Production database systems without thorough testing and validation
	- Non-English natural language queries
	- Database systems with significantly different SQL dialects
	- Queries requiring real-time execution guarantees

	## Training Details

	### Training Data

	The model was trained on the Spider dataset, a large-scale cross-domain semantic parsing dataset containing:

	- 10,181 questions with corresponding SQL queries
	- 200 databases across diverse domains (academic, business, government, etc.)
	- 5,693 unique complex SQL queries
	- Multiple table relationships and complex schema structures

	Training Split:
	- Training examples: 7,000
	- Validation examples: 1,034
	- Database schemas: 166

	### Training Procedure

	#### Training Hyperparameters

	- Training regime: Mixed precision (bfloat16 where supported)
	- Epochs: 2.29 (early stopping applied)
	- Batch size: 2 examples per device
	- Gradient accumulation steps: 4
	- Learning rate: 5e-5
	- Weight decay: 0.01
	- Warmup steps: 10% of total steps
	- Max sequence length: 512 tokens
	- Optimizer: AdamW
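
As a worked example of how these settings combine, the sketch below derives the effective batch size and an approximate number of optimizer steps. It assumes a single GPU and the 7,000 training examples listed under Training Data; the step count is an estimate from the reported 2.29 epochs, not a logged value.

```python
import math

per_device_batch_size = 2
gradient_accumulation_steps = 4
num_devices = 1          # assumption: a single GPU
train_examples = 7000    # from the Training Split list
epochs_completed = 2.29  # reported epoch count at early stopping

# Examples consumed per optimizer update.
effective_batch_size = (
    per_device_batch_size * gradient_accumulation_steps * num_devices
)
# Optimizer updates needed to see every training example once.
steps_per_epoch = math.ceil(train_examples / effective_batch_size)
# Rough number of updates before training stopped.
approx_steps_run = round(epochs_completed * steps_per_epoch)

print("effective batch size:", effective_batch_size)
print("optimizer steps per epoch:", steps_per_epoch)
print("approximate steps run:", approx_steps_run)
```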

	#### Infrastructure

	- Hardware: NVIDIA T4 GPU (Google Colab)
	- Training time: ~2.75 hours
	- Framework: Hugging Face Transformers 4.52.4
	- Early stopping: Patience of 3 evaluations on validation loss

	## Evaluation

	### Testing Data & Metrics

	Dataset: Full Spider development set (1,034 examples)

	Evaluation Method: Execution Accuracy - measuring whether generated SQL queries return the same results as ground truth when executed on actual Spider databases.

	Key Metrics:
	- Execution Accuracy: Percentage of queries producing correct results
	- Execution Success Rate: Percentage of generated queries that execute without errors
	- Parse Error Rate: Percentage of queries with SQL syntax errors
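
The execution-accuracy check can be sketched with Python's built-in `sqlite3` module. This is a simplified illustration, not the evaluation script used for the numbers above: the function names are hypothetical, the toy table stands in for a Spider database, and comparing rows as an unordered multiset is an assumption about how results are matched.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Return the result rows as a multiset, or None if the query fails."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except (sqlite3.Error, sqlite3.Warning):
        return None


def execution_match(conn, predicted_sql, gold_sql):
    """True if the predicted query runs and returns the same rows as gold."""
    predicted = run_query(conn, predicted_sql)
    gold = run_query(conn, gold_sql)
    return predicted is not None and predicted == gold


# Toy database standing in for a Spider SQLite file.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (student_id INTEGER, name TEXT, age INTEGER);
    INSERT INTO students VALUES (1, 'Ana', 20), (2, 'Ben', 23);
""")

gold = "SELECT name FROM students WHERE age > 21"
print(execution_match(conn, "SELECT name FROM students WHERE age >= 22", gold))
print(execution_match(conn, "SELECT name FROM students", gold))
print(execution_match(conn, "SELEC name FROM students", gold))
```

The first prediction differs textually from the gold query but returns the same rows, so it counts as correct; the second executes but returns different rows; the third fails to parse.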

	### Results Summary

	The model achieved 39.17% execution accuracy on the complete Spider development set, demonstrating competent handling of:
	- ✅ Multi-table joins with proper aliasing
	- ✅ Aggregate functions (COUNT, SUM, AVG) with GROUP BY
	- ✅ Set operations (INTERSECT, EXCEPT, UNION)
	- ✅ Subqueries and nested SELECT statements
	- ✅ Complex WHERE clauses with multiple conditions

	Performance Analysis:
	- Successfully parsed and executed 55.80% of generated queries
	- Primary challenge identified in SQL syntax generation (44.2% parse errors)
	- When syntactically correct, queries demonstrate strong semantic validity (0% database errors)

	## Limitations and Bias

	### Technical Limitations

	- Parse errors: 44.2% of generated queries contain syntax errors, representing the primary performance bottleneck
	- Semantic accuracy: Model may generate syntactically correct but semantically incorrect queries
	- Complex reasoning: Performance likely degrades on highly complex nested queries
	- Schema understanding: May have limited ability to infer implicit relationships

	### Recommendations

	- Validation required: Always validate generated SQL before execution
	- Human review: Recommend human oversight for production applications
	- Testing: Thoroughly test on your specific database schema and domain
	- Error handling: Implement robust error handling for parse failures
	- Syntax validation: Consider implementing SQL syntax validation as a post-processing step
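
One possible form of the validation step, assuming the target database is SQLite (the engine used by the Spider databases): compile the generated query against an empty in-memory copy of the schema before running it on real data. The function name `validate_sql` and the example DDL are hypothetical.

```python
import sqlite3


def validate_sql(schema_ddl, sql):
    """Check that `sql` compiles against the schema without touching real data.

    Returns (True, None) on success or (False, error_message) on failure.
    """
    conn = sqlite3.connect(":memory:")
    try:
        # Build empty tables so table and column names can be resolved.
        conn.executescript(schema_ddl)
        # EXPLAIN makes SQLite compile the statement instead of running it.
        conn.execute(f"EXPLAIN {sql}")
        return True, None
    except (sqlite3.Error, sqlite3.Warning) as exc:
        return False, str(exc)
    finally:
        conn.close()


ddl = "CREATE TABLE students (student_id INTEGER, name TEXT, age INTEGER);"

for candidate in (
    "SELECT name FROM students WHERE age > 21",  # valid
    "SELECT nme FROM students",                  # unknown column
    "SELEC name FROM students",                  # syntax error
):
    ok, error = validate_sql(ddl, candidate)
    print(ok, error)
```

A query that fails this check could be rejected or regenerated before it reaches the production database. Passing the check only means the query compiles; it says nothing about whether the results are correct.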

	## Environmental Impact

	Training was conducted on Google Colab infrastructure:
	- Hardware Type: NVIDIA T4 GPU
	- Training Hours: ~2.75 hours
	- Cloud Provider: Google Cloud Platform
	- Estimated Carbon Impact: Minimal due to short training duration

	## Citation

	BibTeX:
	```bibtex
	@misc{ali2025qwen-spider-sql,
	title={Qwen2.5-Coder Fine-tuned on Spider Dataset for Text-to-SQL Generation},
	author={ALI},
	year={2025},
	publisher={Hugging Face},
	howpublished={\url{https://huggingface.co/alialialialaiali/qwen2.5-coder-spider-sql}},
	note={Academic thesis research}
	}
	```

	Spider Dataset Citation:
	```bibtex
	@inproceedings{yu2018spider,
	title={Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task},
	author={Yu, Tao and Zhang, Rui and Yang, Kai and Yasunaga, Michihiro and Wang, Dongxu and Li, Zifan and Ma, James and Li, Irene and Li, Qingning and Roman, Shanelle and others},
	booktitle={Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
	pages={3911--3921},
	year={2018}
	}
	```

	## Model Card Authors

	ALI
	📧 [email protected]
	🔗 https://github.com/AliiAssi

	## Model Card Contact

	For questions about this model or research collaboration:
	- Email: [email protected]
	- GitHub: https://github.com/AliiAssi
	- Hugging Face: https://huggingface.co/alialialialaiali