---
library_name: transformers
tags:
- text-to-sql
- sql
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2.5-Coder-7B-Instruct
datasets:
- AlioLeuchtmann/BIRD_SPIDER_Qwen2.5-Coder-32B
---

BIRD-bench: https://bird-bench.github.io/

## 57% on the BIRD private test set (the Qwen2.5-Coder models may have been trained on BIRD dev, which could explain the performance drop)
## 60% on BIRD dev, one-shot
## 70% on BIRD dev, 4-shot

Official results on the BIRD test set:

| Metric      | Simple | Moderate | Challenging | Total |
| ----------- | -----: | -------: | ----------: | ----: |
| **Count**   |    949 |      555 |         285 |  1789 |
| **EX**      |  65.33 |    53.33 |       39.65 | 57.52 |
| **Soft F1** |  66.31 |    55.51 |       41.48 | 59.01 |
| **R-VES**   |  61.15 |    47.73 |       36.71 | 53.09 |

## Model Details

SFT of Qwen2.5-Coder-7B-Instruct. <br>
The dataset is a combination of BIRD and Spider; answers were created through knowledge distillation with chain-of-thought (CoT) prompting from Qwen2.5-Coder-32B and filtered for correctness. <br>
<br>
We increased BIRD dev set performance from 51.1% to 60%. <br>
Our model beats much larger general-purpose models such as GPT-4o and Gemini 1.5 Pro, demonstrating the effectiveness of black-box knowledge distillation for constrained domains. <br>
After training, this checkpoint even exceeds the performance of its teacher, Qwen2.5-Coder-32B; this is achieved by filtering the teacher's outputs for quality. <br>
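As an illustration of the correctness filter: a minimal sketch, assuming a teacher completion is kept only when its final SQL executes and returns the same result set as the gold SQL (function and field names below are illustrative, not the actual training code).

```python
import sqlite3


def execution_match(pred_sql, gold_sql, db_path):
    # Run both queries against the source database and compare result sets
    # (order-insensitive), mirroring BIRD-style execution accuracy.
    conn = sqlite3.connect(db_path)
    try:
        cursor = conn.cursor()
        cursor.execute(pred_sql)
        pred_rows = set(cursor.fetchall())
        cursor.execute(gold_sql)
        gold_rows = set(cursor.fetchall())
        return pred_rows == gold_rows
    except sqlite3.Error:
        # Teacher output that fails to execute is discarded.
        return False
    finally:
        conn.close()


# Keep only teacher (Qwen2.5-Coder-32B) completions whose extracted SQL matches the gold answer;
# `samples` is assumed to be a list of dicts with 'teacher_sql', 'gold_sql' and 'db_path' keys.
def filter_for_correctness(samples):
    return [s for s in samples if execution_match(s["teacher_sql"], s["gold_sql"], s["db_path"])]
```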

![image/png](Screenshot from 2025-03-05 17-17-44.png)

## Prompt:

A single prompt format was used throughout the dataset, so the model is best used with the same format (the literal `\n` sequences below mark newlines in the raw prompt string; the template used to assemble it follows):

```sql
CREATE TABLE geographic\n(\n city TEXT not null\n primary key,\n county TEXT null,\n region TEXT null\n) \n /* \n 2 example rows: \n SELECT * FROM geographic LIMIT 2; \n city county region \nalameda alameda county bay area \n alamo contra costa county bay area \n */\n\n
CREATE TABLE generalinfo\n(\n id_restaurant INTEGER not null\n primary key,\n label TEXT null,\n food_type TEXT null,\n city TEXT null,\n review REAL null,\n foreign key (city) references geographic(city)\n on update cascade on delete cascade\n) \n /* \n 2 example rows: \n SELECT * FROM generalinfo LIMIT 2; \n id_restaurant label food_type city review \n 1 sparky's diner 24 hour diner san francisco 2.3 \n 2 kabul afghan cuisine afghani san carlos 3.8 \n */\n\nCREATE TABLE location\n(\n id_restaurant INTEGER not null\n primary key,\n street_num INTEGER null,\n street_name TEXT null,\n city TEXT null,\n foreign key (city) references geographic (city)\n on update cascade on delete cascade,\n foreign key (id_restaurant) references generalinfo (id_restaurant)\n on update cascade on delete cascade\n) \n /* \n 2 example rows: \n SELECT * FROM location LIMIT 2; \n id_restaurant street_num street_name city \n 1 242 church st san francisco \n 2 135 el camino real san carlos \n */\n\n
-- External Knowledge: Atlantic Ave refers to street_name = 'atlantic ave'; rating refers to review\n
-- Using valid SQLite and understanding External Knowledge, answer the following question for the tables provided above.\n
-- What is the rating of each restaurant reviews on Atlantic Ave?\n
Generate the SQL after thinking step by step:\n
```

```python
def bird_gpt_template_no_format(question, commonsense, schema):
    return f"""{schema}

-- External Knowledge: {commonsense}
-- Using valid SQLite and understanding External Knowledge, answer the following question for the tables provided above.
-- {question}
Generate the SQL after thinking step by step:
"""
```
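For illustration, a minimal (hypothetical) call that assembles the full prompt from the example schema and question above; `generate_schema_prompt` is the helper defined in the next section, and the database path is a placeholder.

```python
# Placeholder path; point this at one of the BIRD/Spider SQLite databases.
schema = generate_schema_prompt("restaurant.sqlite", num_rows=2)

commonsense = "Atlantic Ave refers to street_name = 'atlantic ave'; rating refers to review"
question = "What is the rating of each restaurant reviews on Atlantic Ave?"

prompt = bird_gpt_template_no_format(question, commonsense, schema)
print(prompt)
```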

### Generate Schema:

```python
import sqlite3


def generate_schema_prompt(db_path, num_rows=None):
    # Extract the CREATE TABLE DDLs (plus optional example rows) from a SQLite database.
    '''
    :param db_path: path to the SQLite database file
    :param num_rows: if set, append this many example rows per table
    :return: schema prompt string
    '''
    full_schema_prompt_list = []
    conn = sqlite3.connect(db_path)
    # Create a cursor object
    cursor = conn.cursor()
    cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
    tables = cursor.fetchall()
    schemas = {}
    for table in tables:
        if table[0] == 'sqlite_sequence':
            continue
        cursor.execute("SELECT sql FROM sqlite_master WHERE type='table' AND name='{}';".format(table[0]))
        create_prompt = cursor.fetchone()[0]
        schemas[table[0]] = create_prompt
        if num_rows:
            cur_table = table[0]
            # Quote table names that are SQL keywords or contain spaces/hyphens
            if cur_table in ['order', 'by', 'group', 'transaction'] or ' ' in str(cur_table).strip() or '-' in str(cur_table).strip():
                cur_table = '"{}"'.format(cur_table)

            cursor.execute("SELECT * FROM {} LIMIT {}".format(cur_table, num_rows))
            column_names = [description[0] for description in cursor.description]
            values = cursor.fetchall()
            rows_prompt = nice_look_table(column_names=column_names, values=values)
            verbose_prompt = "/* \n {} example rows: \n SELECT * FROM {} LIMIT {}; \n {} \n */".format(num_rows,
                                                                                                       cur_table,
                                                                                                       num_rows,
                                                                                                       rows_prompt)
            schemas[table[0]] = "{} \n {}".format(create_prompt, verbose_prompt)

    for k, v in schemas.items():
        full_schema_prompt_list.append(v)

    schema_prompt = "\n\n".join(full_schema_prompt_list)

    return schema_prompt
```
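The helper `nice_look_table` used above is not shown in this card; a minimal sketch of a compatible implementation (an assumption on our part, the exact formatting used during training may differ slightly) could be:

```python
def nice_look_table(column_names, values):
    # Hypothetical re-implementation: pad every column to its widest entry and
    # render the header plus rows as whitespace-aligned lines, roughly matching
    # the example rows embedded in the prompt above.
    rows = [[str(c) for c in column_names]] + [[str(c) for c in row] for row in values]
    widths = [max(len(r[i]) for r in rows) for i in range(len(column_names))]
    lines = [" ".join(r[i].rjust(widths[i]) for i in range(len(column_names))) for r in rows]
    return "\n".join(lines)
```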

### System Prompt:

The default Qwen2.5 system prompt is used:

```python
def preprocess_prompt(prompt):
    return f'''<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
'''
```
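Equivalently, a sketch using the tokenizer's chat template (assuming the standard `transformers` API; the base model's tokenizer is used here as a stand-in for this checkpoint's):

```python
from transformers import AutoTokenizer

# Stand-in repo id: replace with this checkpoint's repo id.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B-Instruct")

messages = [{"role": "user", "content": "-- your text-to-SQL prompt here"}]
# Should render the same <|im_start|>/<|im_end|> structure as preprocess_prompt above;
# verify the resulting string once before relying on it.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(text)
```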

### Generation Config:
- Greedy decoding (no sampling) gives the best performance; see the sketch below.
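A minimal end-to-end inference sketch, assuming standard `transformers` APIs; the repo id and database path are placeholders, and the prompt helpers are the ones defined in the sections above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen2.5-Coder-7B-Instruct"  # placeholder: replace with this checkpoint's repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto")

# Build the prompt exactly as during training (helpers defined above).
schema = generate_schema_prompt("restaurant.sqlite", num_rows=2)  # placeholder database path
commonsense = "Atlantic Ave refers to street_name = 'atlantic ave'; rating refers to review"
question = "What is the rating of each restaurant reviews on Atlantic Ave?"
prompt = preprocess_prompt(bird_gpt_template_no_format(question, commonsense, schema))

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Greedy decoding: sampling disabled, as recommended above.
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
completion = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(completion)
```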

### Restrictions:
- Only trained on the SQLite dialect
- Only trained on English data
- No effort was made to preserve any skills other than text-to-SQL