---
base_model: unsloth/qwen2.5-0.5b-instruct-unsloth-bnb-4bit
library_name: transformers
model_name: QuadConnect2.5-0.5B-v0.0.9b
pipeline_tag: text-generation
tags:
- unsloth
- trl
- grpo
- connect4
- qwen
- RL
licence: license
datasets:
- Lyte/ConnectFour-T10
language:
- en
---
# Model Card for QuadConnect2.5-0.5B-v0.0.9b
## Model Details
### Model Description
- These are still early training experiments; the reward functions are still being revised.
- This model was created using GRPO and Unsloth. It was trained to reason about Connect Four and learn to play it strategically.
- It was made for a specific project task.
- **Developed by:** [Lyte](https://hf.co/Lyte)
- **Model type:** *Small Language Model*
- **Language(s) (NLP):** *English*
- **License:** *TBD*
- **Finetuned from model:** [unsloth/qwen2.5-0.5b-instruct-unsloth-bnb-4bit](https://huggingface.co/unsloth/qwen2.5-0.5b-instruct-unsloth-bnb-4bit)
- **Trained Using:** [TRL](https://github.com/huggingface/trl)'s GRPO.
## Demo
- Example output from the Hugging Face Space (version 0.0.6b):
![image/png](https://cdn-uploads.huggingface.co/production/uploads/62f847d692950415b63c6011/cV87vnDwFAPhOOZIT2tPp.png)
## Quick start
* Solution #1:
```python
from transformers import pipeline
SYSTEM_PROMPT = """You are a master Connect Four strategist whose goal is to win while preventing your opponent from winning. The game is played on a 6x7 grid (columns a–g, rows 1–6 with 1 at the bottom) where pieces drop to the lowest available spot.
Board:
- Represented as a list of occupied cells in the format: <column><row>(<piece>), e.g., 'a1(O)'.
- For example: 'a1(O), a2(X), b1(O)' indicates that cell a1 has an O, a2 has an X, and b1 has an O.
- An empty board is shown as 'Empty Board'.
- Win by connecting 4 pieces in any direction (horizontal, vertical, or diagonal).
Strategy:
1. Identify taken positions, and empty positions.
2. Find and execute winning moves.
3. If there isn't a winning move, then block your opponent’s potential wins.
4. Control the center and set up future moves.
Respond in XML:
<reasoning>
Explain your thought process, focusing on your winning move, how you block your opponent, and your strategic plans.
</reasoning>
<move>
Specify the column letter (a–g) for your next move.
</move>
"""
board = {
"empty": "Game State:\n- You are playing as: X\n- Your previous moves: \n- Opponent's moves: \n- Current board state: Empty Board\n- Next available position per column: \nColumn a: a1, a2, a3, a4, a5, a6 \nColumn b: b1, b2, b3, b4, b5, b6 \nColumn c: c1, c2, c3, c4, c5, c6 \nColumn d: d1, d2, d3, d4, d5, d6 \nColumn e: e1, e2, e3, e4, e5, e6 \nColumn f: f1, f2, f3, f4, f5, f6 \nColumn g: g1, g2, g3, g4, g5, g6\n\nMake your move.",
"one_move": "Game State:\n- You are playing as: X\n- Your previous moves: \n- Opponent's moves: b1\n- Current board state: b1(O)\n- Next available position per column: \nColumn a: a1, a2, a3, a4, a5, a6 \nColumn b: b2, b3, b4, b5, b6 \nColumn c: c1, c2, c3, c4, c5, c6 \nColumn d: d1, d2, d3, d4, d5, d6 \nColumn e: e1, e2, e3, e4, e5, e6 \nColumn f: f1, f2, f3, f4, f5, f6 \nColumn g: g1, g2, g3, g4, g5, g6\n\nMake your move.",
"four_moves": "Game State:\n- You are playing as: X\n- Your previous moves: a1, a2\n- Opponent's moves: d1, a3\n- Current board state: a1(X), d1(O), a2(X), a3(O)\n- Next available position per column: \nColumn a: a4, a5, a6 \nColumn b: b1, b2, b3, b4, b5, b6 \nColumn c: c1, c2, c3, c4, c5, c6 \nColumn d: d2, d3, d4, d5, d6 \nColumn e: e1, e2, e3, e4, e5, e6 \nColumn f: f1, f2, f3, f4, f5, f6 \nColumn g: g1, g2, g3, g4, g5, g6\n\nMake your move.",
}
generator = pipeline("text-generation", model="Lyte/QuadConnect2.5-0.5B-v0.0.9b", device="cuda")
# Choose a starting position: board['empty'], board['one_move'], or board['four_moves']
output = generator([{"role": "system", "content": SYSTEM_PROMPT}, {"role": "user", "content": board['empty']}], max_new_tokens=10245, return_full_text=False)[0]
print(output["generated_text"])
```
* Solution #2:
[GGUF Q8](https://hf.co/Lyte/QuadConnect2.5-0.5B-v0.0.9b/blob/main/quadconnect.Q8_0.gguf): Download the quantized GGUF and run it in your favorite GGUF inference engine (e.g. LM Studio); a Python sketch using `llama-cpp-python` follows this list.
* Solution #3:
[Huggingface Space](http://hf.co/spaces/Lyte/QuadConnect): You can duplicate the Space, or download its code and run it locally.
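For Solution #2, here is a minimal sketch of running the quantized model from Python with `llama-cpp-python` instead of a GUI. The local file name, context size, and sampling values are assumptions, and `SYSTEM_PROMPT` and `board` are the objects defined in Solution #1:
```python
from llama_cpp import Llama

# Assumes quadconnect.Q8_0.gguf has been downloaded into the working directory.
llm = Llama(model_path="quadconnect.Q8_0.gguf", n_ctx=4096)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": board["empty"]},
    ],
    temperature=0.6,   # sampling values chosen to mirror the evaluation settings below
    top_p=0.95,
    max_tokens=1024,
)
print(response["choices"][0]["message"]["content"])
```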
## Training procedure
This model was trained with GRPO, a method introduced in [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://huggingface.co/papers/2402.03300).
#### Preprocessing
- I searched for Connect Four datasets, found three candidates, and selected [Leon-LLM/Connect-Four-Datasets-Collection](https://huggingface.co/datasets/Leon-LLM/Connect-Four-Datasets-Collection). I filtered out empty or broken entries and uploaded the result as Lyte/ConnectFour-clean, then removed games longer than 10 turns and split the data into train and validation sets (the validation split was not used during training). A sketch of this pipeline is shown after this list.
- The final dataset is [Lyte/ConnectFour-T10](https://huggingface.co/datasets/Lyte/ConnectFour-T10).
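A minimal sketch of that filtering pipeline with the `datasets` library. The column name `moves`, the split ratio, and the seed are assumptions about the source data, not the exact code that was used:
```python
from datasets import load_dataset

# Load the cleaned dataset ("moves" is a hypothetical column holding the game's move sequence).
ds = load_dataset("Lyte/ConnectFour-clean", split="train")

# Drop empty or broken rows, then keep only games of at most 10 turns.
ds = ds.filter(lambda ex: ex["moves"] is not None and len(ex["moves"]) > 0)
ds = ds.filter(lambda ex: len(ex["moves"]) <= 10)

# Split into train/validation (the validation split was not used for training).
splits = ds.train_test_split(test_size=0.1, seed=42)
train_ds, val_ds = splits["train"], splits["test"]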
### Evaluation
* Evaluations were conducted on the validation split of [Lyte/ConnectFour-T10](https://huggingface.co/datasets/Lyte/ConnectFour-T10) to test whether the model can win when presented with a board in which only the winning move remains.
* Sampling parameters: temperature=0.6 (a second v0.0.9b run used temperature=0.8, reported separately below), top_p=0.95, max_tokens=1024.
* A sketch of the evaluation loop is shown below.
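A minimal sketch of that evaluation loop, reusing `SYSTEM_PROMPT` from the Quick start. The split name and the field names `prompt` and `winner_move` are hypothetical placeholders for whatever the validation data actually provides; the XML parsing mirrors the format requested in the system prompt:
```python
import re
from datasets import load_dataset
from transformers import pipeline

generator = pipeline("text-generation", model="Lyte/QuadConnect2.5-0.5B-v0.0.9b", device="cuda")
val_ds = load_dataset("Lyte/ConnectFour-T10", split="validation")  # split name assumed

def extract_move(text: str):
    """Pull the column letter (a-g) out of the <move>...</move> tag, if present."""
    m = re.search(r"<move>\s*([a-g])\s*</move>", text, re.IGNORECASE)
    return m.group(1).lower() if m else None

correct = 0
for example in val_ds:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": example["prompt"]},        # hypothetical field name
    ]
    out = generator(messages, max_new_tokens=1024, do_sample=True,
                    temperature=0.6, top_p=0.95, return_full_text=False)[0]
    if extract_move(out["generated_text"]) == example["winner_move"]:  # hypothetical field name
        correct += 1

print(f"Accuracy: {correct / len(val_ds):.2%}")
```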
#### Summary Metrics Comparison
| Metric | Lyte/QuadConnect2.5-0.5B-v0.0.6b | Lyte/QuadConnect2.5-0.5B-v0.0.8b | Lyte/QuadConnect2.5-0.5B-v0.0.9b (Temp 0.6) | Lyte/QuadConnect2.5-0.5B-v0.0.9b (Temp 0.8) |
|-----------------------|--------------------------------|--------------------------------|--------------------------------|--------------------------------|
| Total games evaluated | 5082 | 5082 | 5082 | 5082 |
| Correct predictions | 518 | 394 | 516 | **713** |
| Accuracy | 10.19% | 7.75% | 10.15% | **14.03%** |
| Most common move | d (41.14%) | d (67.61%) | a (38.72%) | **a (31.01%)** |
| Middle column usage | 75.05% | 99.53% | 29.08% | **35.43%** |
*(Middle column usage = c + d + e; for the Temp 0.8 run: 20.11% + 4.05% + 11.27% = 35.43%)*
#### Move Distribution Comparison
| Column | Lyte/QuadConnect2.5-0.5B-v0.0.6b (Count, %) | Lyte/QuadConnect2.5-0.5B-v0.0.8b (Count, %) | Lyte/QuadConnect2.5-0.5B-v0.0.9b (Temp 0.6) (Count, %) | Lyte/QuadConnect2.5-0.5B-v0.0.9b (Temp 0.8) (Count, %) |
|--------|-----------------------------------|-----------------------------------|------------------------------|------------------------------|
| a | 603 (19.02%) | 3 (0.12%) | 1447 (38.72%) | 1547 (31.01%) |
| b | 111 (3.50%) | 4 (0.16%) | 644 (17.23%) | 924 (18.52%) |
| c | 785 (24.76%) | 463 (17.96%) | 648 (17.34%) | 1003 (20.11%) |
| d | 1304 (41.14%) | 1743 (67.61%) | 101 (2.70%) | 202 (4.05%) |
| e | 290 (9.15%) | 360 (13.96%) | 338 (9.04%) | 562 (11.27%) |
| f | 50 (1.58%) | 3 (0.12%) | 310 (8.30%) | 408 (8.18%) |
| g | 27 (0.85%) | 2 (0.08%) | 249 (6.66%) | 342 (6.86%) |
### Framework versions
- TRL: 0.15.1
- Transformers: 4.49.0
- PyTorch: 2.5.1+cu121
- Datasets: 3.2.0
- Tokenizers: 0.21.0
## Citations
Cite GRPO as:
```bibtex
@article{zhihong2024deepseekmath,
title = {{DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models}},
author = {Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Mingchuan Zhang and Y. K. Li and Y. Wu and Daya Guo},
year = 2024,
eprint = {arXiv:2402.03300},
}
```
Cite TRL as:
```bibtex
@misc{vonwerra2022trl,
title = {{TRL: Transformer Reinforcement Learning}},
author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallouédec},
year = 2020,
journal = {GitHub repository},
publisher = {GitHub},
howpublished = {\url{https://github.com/huggingface/trl}}
}
```