+---
+language:
+- ko
+license: apache-2.0
+tags:
+- text2sql
+- spider
+- korean
+- llama
+- text-generation
+- table-question-answering
+datasets:
+- spider
+- huggingface-KREW/spider-ko
+base_model: unsloth/Meta-Llama-3.1-8B-Instruct
+model-index:
+- name: Llama-3.1-8B-Spider-SQL-Ko
+  results:
+  - task:
+      type: text2sql
+      name: Text to SQL
+    dataset:
+      name: Spider (Korean)
+      type: text2sql
+    metrics:
+    - type: exact_match
+      value: 42.65
+    - type: execution_accuracy
+      value: 65.47
+---
+# Llama-3.1-8B-Spider-SQL-Ko
+A Text-to-SQL model that converts Korean natural-language questions into SQL queries. 🤖
+It was fine-tuned on the train split of [spider-ko](https://huggingface.co/datasets/huggingface-KREW/spider-ko), a Korean translation of the [Spider](https://yale-lily.github.io/spider) dataset.
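+## 📊 Key Results
+Evaluation results on the Korean Spider validation set (1,034 examples):
+- **Exact match**: 42.65% (441/1034)
+- **Execution accuracy**: 65.47% (677/1034)
+> 💡 Execution accuracy is higher than exact match because a predicted query often returns the same result as the reference query even when its SQL is written differently.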
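To make the distinction between the two metrics concrete, here is a minimal sketch that scores one prediction both ways on an in-memory SQLite database. It is not the official Spider evaluation script: exact match is approximated by comparing case- and whitespace-normalized text, which is simpler than Spider's component-level matching. The table rows and the helper names `exact_match` and `execution_match` are made up for illustration.

```python
import sqlite3

def exact_match(pred: str, gold: str) -> bool:
    """Rough stand-in for exact match: compare case- and whitespace-normalized text."""
    def normalize(sql: str) -> str:
        return " ".join(sql.lower().split())
    return normalize(pred) == normalize(gold)

def execution_match(conn: sqlite3.Connection, pred: str, gold: str) -> bool:
    """Run both queries and compare their result rows, ignoring row order."""
    try:
        pred_rows = conn.execute(pred).fetchall()
    except sqlite3.Error:
        return False  # a prediction that fails to run counts as wrong
    gold_rows = conn.execute(gold).fetchall()
    return sorted(pred_rows, key=repr) == sorted(gold_rows, key=repr)

# Toy database (hypothetical rows, only for this illustration)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (singer_id INTEGER, name TEXT, country TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO singer VALUES (?, ?, ?, ?)",
    [(1, "A", "KR", 30), (2, "B", "US", 41), (3, "C", "KR", 25)],
)

gold = "SELECT count(*) FROM singer"
pred = "SELECT count(singer_id) FROM singer"  # different text, same result

print(exact_match(pred, gold))            # False
print(execution_match(conn, pred, gold))  # True
```

A prediction like this one would count toward execution accuracy but not toward exact match.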
+## 🚀 Quick Start
+```python
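+from unsloth import FastLanguageModel
+# Load the model and tokenizer (4-bit; dtype=None lets Unsloth pick the dtype)
+model, tokenizer = FastLanguageModel.from_pretrained(
+    model_name="huggingface-KREW/Llama-3.1-8B-Spider-SQL-Ko",
+    max_seq_length=2048,
+    dtype=None,
+    load_in_4bit=True,
+)
+FastLanguageModel.for_inference(model)  # switch Unsloth to inference mode
+# Korean question -> SQL
+question = "가수는 몇 명이 있나요?"  # "How many singers are there?"
+schema = """테이블: singer
+컬럼: singer_id, name, country, age"""
+prompt = f"""데이터베이스 스키마:
+{schema}
+질문: {question}
+SQL:"""
+# Generate with the chat template, as in generate_sql in the usage section
+inputs = tokenizer.apply_chat_template(
+    [{"role": "user", "content": prompt}],
+    tokenize=True, add_generation_prompt=True, return_dict=True, return_tensors="pt",
+).to(model.device)
+outputs = model.generate(**inputs, max_new_tokens=150)
+print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
+# Expected output: SELECT count(*) FROM singer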
+```
+## 📝 Model Overview
+- **Base model**: Llama 3.1 8B Instruct (4-bit quantized)
+- **Training data**: [spider-ko](https://huggingface.co/datasets/huggingface-KREW/spider-ko) (1 epoch)
+- **Supported databases**: 166 databases across diverse domains ([Spider dataset](https://yale-lily.github.io/spider))
+- **Training method**: LoRA (r=16, alpha=32)
+## 💬 Usage Examples
+### Basic Usage
+```python
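+def generate_sql(question, schema_info):
+    """Convert a Korean question into a SQL query for the given schema."""
+    # The prompt itself stays in Korean, since the model takes Korean input
+    prompt = f"""다음 데이터베이스 스키마를 참고하여 질문에 대한 SQL 쿼리를 생성하세요.
+### 데이터베이스 스키마:
+{schema_info}
+### 질문: {question}
+### SQL 쿼리:"""
+    messages = [{"role": "user", "content": prompt}]
+    # return_dict=True also returns the attention mask; move tensors to the model's device
+    inputs = tokenizer.apply_chat_template(
+        messages, tokenize=True, add_generation_prompt=True,
+        return_dict=True, return_tensors="pt",
+    ).to(model.device)
+    # temperature only takes effect when sampling is enabled
+    outputs = model.generate(**inputs, max_new_tokens=150, do_sample=True, temperature=0.1)
+    # Decode only the newly generated tokens, not the echoed prompt
+    new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
+    response = tokenizer.decode(new_tokens, skip_special_tokens=True)
+    return response.split("### SQL 쿼리:")[-1].strip()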
+```
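`generate_sql` expects `schema_info` as plain text. One way to produce it, sketched below, is to read table and column names from a SQLite database with `PRAGMA table_info`. The helper name `schema_from_sqlite` is hypothetical, and the `테이블:` / `컬럼:` layout simply mirrors the quick-start prompt above. The schema format used during fine-tuning is not stated in this card, so treat the layout as an assumption.

```python
import sqlite3

def schema_from_sqlite(conn: sqlite3.Connection) -> str:
    """Render every user table as '테이블: <name>' followed by '컬럼: <col>, <col>, ...'."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()
    blocks = []
    for (table,) in tables:
        # PRAGMA table_info yields one row per column; index 1 holds the column name
        columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
        blocks.append(f"테이블: {table}\n컬럼: {', '.join(columns)}")
    return "\n\n".join(blocks)

# Toy database with the quick-start table (for illustration only)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (singer_id INTEGER, name TEXT, country TEXT, age INTEGER)")

schema_info = schema_from_sqlite(conn)
print(schema_info)
```

The resulting string can be passed directly as the `schema_info` argument of `generate_sql`.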
+### Example Queries
+```python
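+# Example 1: aggregate function
+# "How many department heads are older than 56?"
+question = "부서장들 중 56세보다 나이가 많은 사람이 몇 명입니까?"
+# Output: SELECT count(*) FROM head WHERE age > 56
+# Example 2: join
+# "What is the status of the city that has hosted the most competitions?"
+question = "가장 많은 대회를 개최한 도시의 상태는 무엇인가요?"
+# Output: SELECT T1.Status FROM city AS T1 JOIN farm_competition AS T2 ON T1.City_ID = T2.Host_city_ID GROUP BY T2.Host_city_ID ORDER BY COUNT(*) DESC LIMIT 1
+# Example 3: subquery
+# "What are the names of people who are not entrepreneurs?"
+question = "기업가가 아닌 사람들의 이름은 무엇입니까?"
+# Output: SELECT Name FROM people WHERE People_ID NOT IN (SELECT People_ID FROM entrepreneur)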
+```
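+## ⚠️ Usage Notes
+### Limitations
+- ✅ Uses English table and column names (Korean question → English SQL)
+- ✅ Optimized for the domains covered by the Spider dataset
+- ❌ NoSQL and graph databases are not supported
+- ❌ Accuracy drops on very complex nested queries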
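Because generated queries are not always correct, it can help to check them before use. The sketch below is one possible guard and is not part of the released model: it accepts only a single `SELECT` statement and asks SQLite to compile it with `EXPLAIN`, which surfaces unknown tables or columns without running the query itself. `is_valid_select` is a hypothetical helper name, and the toy table is an assumption made for the example.

```python
import sqlite3

def is_valid_select(conn: sqlite3.Connection, sql: str) -> bool:
    """Return True if sql is a single SELECT statement that SQLite can compile."""
    statement = sql.strip().rstrip(";").strip()
    # Conservative filter: reject multi-statement input and anything that is not a SELECT
    # (this also rejects CTEs that start with WITH and literals containing ';')
    if ";" in statement or not statement.lower().startswith("select"):
        return False
    try:
        # EXPLAIN compiles the statement and reports errors such as unknown columns,
        # without running the SELECT itself
        conn.execute(f"EXPLAIN {statement}")
    except sqlite3.Error:
        return False
    return True

# Toy schema (hypothetical; only the columns needed for this illustration)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE head (head_id INTEGER, name TEXT, age REAL)")

print(is_valid_select(conn, "SELECT count(*) FROM head WHERE age > 56"))  # True
print(is_valid_select(conn, "SELECT salary FROM head"))                   # False
print(is_valid_select(conn, "DROP TABLE head"))                           # False
```

A check like this only confirms that a query compiles against the schema; it says nothing about whether the query answers the question.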
+## 🔧 Technical Specifications
+### Training Environment
+- **GPU**: NVIDIA Tesla T4 (16GB)
+- **Training time**: about 4 hours
+- **Memory usage**: up to 7.6GB VRAM
+### Hyperparameters
+```python
+training_args = {
+    "per_device_train_batch_size": 2,
+    "gradient_accumulation_steps": 4,
+    "learning_rate": 5e-4,
+    "num_train_epochs": 1,
+    "optimizer": "adamw_8bit",
+    "lr_scheduler_type": "cosine",
+    "warmup_ratio": 0.05
+}
+lora_config = {
+    "r": 16,
+    "lora_alpha": 32,
+    "lora_dropout": 0,
+    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
+                      "gate_proj", "up_proj", "down_proj"]
+}
+```
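Two derived quantities follow from these values by simple arithmetic: the effective batch size per optimizer step and the standard LoRA scaling factor `lora_alpha / r`. The snippet repeats only the relevant entries so that it runs on its own. `num_gpus = 1` is an assumption based on the single T4 listed under the training environment.

```python
# Values copied from the hyperparameters above
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1  # assumption: one T4, as listed under the training environment

r = 16
lora_alpha = 32

# Number of samples that contribute to each optimizer update
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus

# Standard LoRA scales the low-rank update by lora_alpha / r
lora_scaling = lora_alpha / r

print(effective_batch_size)  # 8
print(lora_scaling)          # 2.0
```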
+## 📚 References
+### Citation
+```bibtex
+@misc{llama31_spider_sql_ko_2025,
+  title={Llama-3.1-8B-Spider-SQL-Ko: Korean Text-to-SQL Model},
+  author={Sohyun Sim and Youngjun Cho and Seongwoo Choi},
+  year={2025},
+  publisher={Hugging Face KREW},
+  url={https://huggingface.co/huggingface-KREW/Llama-3.1-8B-Spider-SQL-Ko}
+}
+```
+### Related Papers
+- [Spider: A Large-Scale Human-Labeled Dataset](https://arxiv.org/abs/1809.08887) (Yu et al., 2018)
+## 🤝 Contributors
+[@sim-so](https://huggingface.co/sim-so), [@choincnp](https://huggingface.co/choincnp), [@nuatmochoi](https://huggingface.co/nuatmochoi)