|
--- |
|
base_model: unsloth/qwen2.5-0.5b-instruct-unsloth-bnb-4bit |
|
library_name: transformers |
|
model_name: QuadConnect2.5-0.5B-v0.0.9b |
|
pipeline_tag: text-generation |
|
tags: |
|
- unsloth |
|
- trl |
|
- grpo |
|
- connect4 |
|
- qwen |
|
- RL |
|
licence: license |
|
datasets: |
|
- Lyte/ConnectFour-T10 |
|
language: |
|
- en |
|
--- |
|
|
|
# Model Card for QuadConnect2.5-0.5B-v0.0.9b |
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
- These are still early training experiments; the reward functions are still evolving.

- This model was created using GRPO and Unsloth, and trained to reason about Connect Four and play it strategically.

- It was made for a specific project task.
|
|
|
- **Developed by:** [Lyte](https://hf.co/Lyte) |
|
- **Model type:** *Small Language Model* |
|
- **Language(s) (NLP):** *English* |
|
- **License:** *TBD* |
|
- **Finetuned from model:** [unsloth/qwen2.5-0.5b-instruct-unsloth-bnb-4bit](https://huggingface.co/unsloth/qwen2.5-0.5b-instruct-unsloth-bnb-4bit) |
|
- **Trained Using:** [TRL](https://github.com/huggingface/trl)'s GRPO. |
|
|
|
## Demo
|
|
|
- Example from the Hugging Face Space (version 0.0.6b):
|
 |
|
|
|
## Quick start |
|
|
|
* Solution #1: |
|
```python |
|
from transformers import pipeline |
|
|
|
SYSTEM_PROMPT = """You are a master Connect Four strategist whose goal is to win while preventing your opponent from winning. The game is played on a 6x7 grid (columns a–g, rows 1–6 with 1 at the bottom) where pieces drop to the lowest available spot. |
|
|
|
Board: |
|
- Represented as a list of occupied cells in the format: <column><row>(<piece>), e.g., 'a1(O)'. |
|
- For example: 'a1(O), a2(X), b1(O)' indicates that cell a1 has an O, a2 has an X, and b1 has an O. |
|
- An empty board is shown as 'Empty Board'. |
|
- Win by connecting 4 pieces in any direction (horizontal, vertical, or diagonal). |
|
|
|
Strategy: |
|
1. Identify occupied and empty positions.
|
2. Find and execute winning moves. |
|
3. If there isn't a winning move, block your opponent's potential wins.
|
4. Control the center and set up future moves. |
|
|
|
Respond in XML: |
|
<reasoning> |
|
Explain your thought process, focusing on your winning move, how you block your opponent, and your strategic plans. |
|
</reasoning> |
|
<move> |
|
Specify the column letter (a–g) for your next move. |
|
</move> |
|
""" |
|
|
|
board = { |
|
"empty": "Game State:\n- You are playing as: X\n- Your previous moves: \n- Opponent's moves: \n- Current board state: Empty Board\n- Next available position per column: \nColumn a: a1, a2, a3, a4, a5, a6 \nColumn b: b1, b2, b3, b4, b5, b6 \nColumn c: c1, c2, c3, c4, c5, c6 \nColumn d: d1, d2, d3, d4, d5, d6 \nColumn e: e1, e2, e3, e4, e5, e6 \nColumn f: f1, f2, f3, f4, f5, f6 \nColumn g: g1, g2, g3, g4, g5, g6\n\nMake your move.", |
|
"one_move": "Game State:\n- You are playing as: X\n- Your previous moves: \n- Opponent's moves: b1\n- Current board state: b1(O)\n- Next available position per column: \nColumn a: a1, a2, a3, a4, a5, a6 \nColumn b: b2, b3, b4, b5, b6 \nColumn c: c1, c2, c3, c4, c5, c6 \nColumn d: d1, d2, d3, d4, d5, d6 \nColumn e: e1, e2, e3, e4, e5, e6 \nColumn f: f1, f2, f3, f4, f5, f6 \nColumn g: g1, g2, g3, g4, g5, g6\n\nMake your move.", |
|
"four_moves": "Game State:\n- You are playing as: X\n- Your previous moves: a1, a2\n- Opponent's moves: d1, a3\n- Current board state: a1(X), d1(O), a2(X), a3(O)\n- Next available position per column: \nColumn a: a4, a5, a6 \nColumn b: b1, b2, b3, b4, b5, b6 \nColumn c: c1, c2, c3, c4, c5, c6 \nColumn d: d2, d3, d4, d5, d6 \nColumn e: e1, e2, e3, e4, e5, e6 \nColumn f: f1, f2, f3, f4, f5, f6 \nColumn g: g1, g2, g3, g4, g5, g6\n\nMake your move.", |
|
} |
|
|
|
generator = pipeline("text-generation", model="Lyte/QuadConnect2.5-0.5B-v0.0.9b", device="cuda") |
|
|
|
# Choose a scenario key: 'empty', 'one_move', or 'four_moves'

output = generator([{"role": "system", "content": SYSTEM_PROMPT}, {"role": "user", "content": board['empty']}], max_new_tokens=1024, return_full_text=False)[0]

print(output["generated_text"])
|
``` |
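The model answers in the XML format requested by the system prompt. A minimal sketch for pulling the chosen column out of such a reply (the sample reply string below is hypothetical, for illustration only):

```python
import re

def extract_move(reply: str):
    """Return the column letter (a-g) from a <move>...</move> tag, or None if absent."""
    match = re.search(r"<move>\s*([a-g])\s*</move>", reply, re.IGNORECASE)
    return match.group(1).lower() if match else None

# Hypothetical reply, shaped like the format the system prompt requests:
reply = "<reasoning>Center control maximizes winning lines.</reasoning>\n<move>\nd\n</move>"
print(extract_move(reply))  # → d
```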
|
* Solution #2: |
|
[GGUF Q8](https://hf.co/Lyte/QuadConnect2.5-0.5B-v0.0.9b/blob/main/quadconnect.Q8_0.gguf): Download the quantized GGUF and load it in your favorite GGUF inference engine (e.g., LM Studio).
|
|
|
* Solution #3: |
|
[Hugging Face Space](https://hf.co/spaces/Lyte/QuadConnect): Duplicate the Space, or download its code and run it locally.
|
|
|
## Training procedure |
|
|
|
This model was trained with GRPO, a method introduced in [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://huggingface.co/papers/2402.03300). |
|
|
|
#### Preprocessing |
|
|
|
- I searched for Connect Four datasets, found three candidates, and selected [Leon-LLM/Connect-Four-Datasets-Collection](https://huggingface.co/datasets/Leon-LLM/Connect-Four-Datasets-Collection). I filtered out empty or broken entries and uploaded the result as Lyte/ConnectFour-clean, then removed games longer than 10 turns and split the remainder into train and validation sets (the validation split was not used for training).

- The final dataset is [Lyte/ConnectFour-T10](https://huggingface.co/datasets/Lyte/ConnectFour-T10).
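The turn-count filter can be sketched as below; the `moves` field name and the move encoding are assumptions for illustration, not the dataset's actual schema:

```python
# Sketch of the cleaning/filtering step, assuming each game row carries a
# whitespace-separated "moves" string (hypothetical schema).
def keep_game(row):
    moves = (row.get("moves") or "").split()
    # Drop empty/broken entries and games longer than 10 turns.
    return 0 < len(moves) <= 10

games = [
    {"moves": "d1 d2 c1 e1 b1"},      # 5 turns: kept
    {"moves": ""},                     # broken entry: dropped
    {"moves": " ".join(["d1"] * 15)},  # 15 turns: dropped
]
filtered = [g for g in games if keep_game(g)]
print(len(filtered))  # → 1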
|
|
|
### Evaluation |
|
|
|
* Evaluations were conducted on the validation split of [Lyte/ConnectFour-T10](https://huggingface.co/datasets/Lyte/ConnectFour-T10) to test whether the model learns to win: each board is presented with only the winning move remaining.
|
|
|
* Evaluation sampling parameters:

  * temperature=0.6, top_p=0.95, max_tokens=1024
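Accuracy below is simply correct predictions over total games, with a prediction counted as correct when the column extracted from the model's `<move>` tag matches the winning column. A toy sketch of that comparison (the regex and example data are illustrative, not the actual eval harness):

```python
import re

def predicted_move(reply):
    m = re.search(r"<move>\s*([a-g])\s*</move>", reply, re.IGNORECASE)
    return m.group(1).lower() if m else None

# Toy (model_reply, winning_column) pairs standing in for the validation split.
results = [("<move>d</move>", "d"), ("<move>a</move>", "c"), ("no tag", "b")]
correct = sum(predicted_move(reply) == gold for reply, gold in results)
print(f"Accuracy: {correct / len(results):.2%}")  # → Accuracy: 33.33%
```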
|
|
|
#### Summary Metrics Comparison |
|
|
|
|
| Metric | Lyte/QuadConnect2.5-0.5B-v0.0.6b | Lyte/QuadConnect2.5-0.5B-v0.0.8b | Lyte/QuadConnect2.5-0.5B-v0.0.9b (Temp 0.6) | Lyte/QuadConnect2.5-0.5B-v0.0.9b (Temp 0.8) |
|
|-----------------------|--------------------------------|--------------------------------|--------------------------------|--------------------------------| |
|
| Total games evaluated | 5082 | 5082 | 5082 | 5082 | |
|
| Correct predictions | 518 | 394 | 516 | **713** | |
|
| Accuracy | 10.19% | 7.75% | 10.15% | **14.03%** | |
|
| Most common move | d (41.14%) | d (67.61%) | a (38.72%) | **a (31.01%)** | |
|
| Middle column usage | 75.05% | 99.53% | 29.08% | **35.43%** | |
|
|
|
*(Middle column usage = c + d + e; for v0.0.9b at Temp 0.8: 20.11% + 4.05% + 11.27% = 35.43%)*
|
|
|
#### Move Distribution Comparison |
|
|
|
| Column | Lyte/QuadConnect2.5-0.5B-v0.0.6b (Count, %) | Lyte/QuadConnect2.5-0.5B-v0.0.8b (Count, %) | Lyte/QuadConnect2.5-0.5B-v0.0.9b (Temp 0.6) (Count, %) | Lyte/QuadConnect2.5-0.5B-v0.0.9b (Temp 0.8) (Count, %) |
|
|--------|-----------------------------------|-----------------------------------|------------------------------|------------------------------| |
|
| a | 603 (19.02%) | 3 (0.12%) | 1447 (38.72%) | 1547 (31.01%) | |
|
| b | 111 (3.50%) | 4 (0.16%) | 644 (17.23%) | 924 (18.52%) | |
|
| c | 785 (24.76%) | 463 (17.96%) | 648 (17.34%) | 1003 (20.11%) | |
|
| d | 1304 (41.14%) | 1743 (67.61%) | 101 (2.70%) | 202 (4.05%) | |
|
| e | 290 (9.15%) | 360 (13.96%) | 338 (9.04%) | 562 (11.27%) | |
|
| f | 50 (1.58%) | 3 (0.12%) | 310 (8.30%) | 408 (8.18%) | |
|
| g | 27 (0.85%) | 2 (0.08%) | 249 (6.66%) | 342 (6.86%) | |
|
|
|
|
|
|
|
### Framework versions |
|
|
|
- TRL: 0.15.1 |
|
- Transformers: 4.49.0 |
|
- Pytorch: 2.5.1+cu121 |
|
- Datasets: 3.2.0 |
|
- Tokenizers: 0.21.0 |
|
|
|
## Citations |
|
|
|
Cite GRPO as: |
|
|
|
```bibtex |
|
@article{zhihong2024deepseekmath, |
|
title = {{DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models}}, |
|
author = {Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Mingchuan Zhang and Y. K. Li and Y. Wu and Daya Guo}, |
|
year = 2024, |
|
eprint = {arXiv:2402.03300}, |
|
} |
|
|
|
``` |
|
|
|
Cite TRL as: |
|
|
|
```bibtex |
|
@misc{vonwerra2022trl, |
|
title = {{TRL: Transformer Reinforcement Learning}}, |
|
author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallouédec}, |
|
year = 2020, |
|
journal = {GitHub repository}, |
|
publisher = {GitHub}, |
|
howpublished = {\url{https://github.com/huggingface/trl}} |
|
} |
|
``` |