---
license: mit
library_name: transformers
pipeline_tag: text-generation
---

# Seed-Coder-8B-Base

<div align="left" style="line-height: 1;">
  <a href="https://bytedance-seed-coder.github.io/" target="_blank" style="margin: 2px;">
    <img alt="Homepage" src="https://img.shields.io/badge/Seed--Coder-Homepage-a468fe?color=a468fe&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
  </a>

  <a href="https://github.com/ByteDance-Seed/Seed-Coder/blob/master/Seed-Coder.pdf" target="_blank" style="margin: 2px;">
    <img alt="Technical Report" src="https://img.shields.io/badge/(upcoming)-Technical%20Report-brightgreen?logo=arxiv&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
  </a>
  
  <a href="https://huggingface.co/ByteDance-Seed" target="_blank" style="margin: 2px;">
      <img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-ByteDance%20Seed-536af5?color=536af5&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
  </a>
  
  <a href="https://github.com/ByteDance-Seed/Seed-Coder/blob/master/LICENSE" style="margin: 2px;">
      <img alt="License" src="https://img.shields.io/badge/License-MIT-f5de53?color=f5de53&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
  </a>
</div>


## Introduction
We are thrilled to introduce Seed-Coder, a powerful, transparent, and parameter-efficient family of open-source code models at the 8B scale, featuring base, instruct, and reasoning variants. Seed-Coder helps advance open code models through the following highlights:

- **Model-centric:** Seed-Coder predominantly leverages LLMs instead of hand-crafted rules for code data filtering, minimizing manual effort in pretraining data construction.
- **Transparent:** We openly share detailed insights into our model-centric data pipeline, including methods for curating GitHub data, commit data, and code-related web data.
- **Powerful:** Seed-Coder achieves state-of-the-art performance among open-source models of comparable size across a diverse range of coding tasks.

<p align="center">
  <img width="100%" src="imgs/seed-coder_intro_performance.jpg">
</p>

This repo contains the **Seed-Coder-8B-Base** model, with the following features:
- Type: Causal language model
- Training Stage: Pretraining
- Data Source: GitHub data, code-related web data
- Training Tokens: 6 trillion
- Supports: Code completion, code infilling (Fill-in-the-Middle)
- Context Length: 32,768
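
As a rough illustration of what the context length means in practice, the sketch below computes how many new tokens can still be generated for a prompt of a given size. It assumes that the prompt and the generated tokens share the same 32,768-token window; the helper name `remaining_token_budget` is hypothetical and not part of any library.

```python
CONTEXT_LENGTH = 32_768  # context length listed in the feature list above


def remaining_token_budget(prompt_tokens: int, context_length: int = CONTEXT_LENGTH) -> int:
    """Return how many tokens can still be generated after the prompt.

    Assumption: prompt tokens and generated tokens must fit in one window.
    """
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    return max(0, context_length - prompt_tokens)


# In practice, the prompt size would come from the model's tokenizer,
# e.g. len(tokenizer(prompt)["input_ids"]).
print(remaining_token_budget(30_000))  # 2768
```

The result can be used as an upper bound for `max_new_tokens` when generating.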


## Model Downloads
| Model Name                  | Length | Download   |    Notes |
|---------------------------------------------------------|--------|------------------------------------|-----------------------|
| 👉 **Seed-Coder-8B-Base**           | 32K    | 🤗 [Model](https://huggingface.co/ByteDance-Seed/Seed-Coder-8B-Base)   |  Pretrained on our model-centric code data.  |
| Seed-Coder-8B-Instruct             | 32K    | 🤗 [Model](https://huggingface.co/ByteDance-Seed/Seed-Coder-8B-Instruct)   |  Instruction-tuned for alignment with user intent. |
| Seed-Coder-8B-Reasoning            | 32K    | 🤗 [Model](https://huggingface.co/ByteDance-Seed/Seed-Coder-8B-Reasoning)   |  RL trained to boost reasoning capabilities.  |


## Requirements
You will need to install the latest versions of `transformers` and `accelerate`:

```bash
pip install -U transformers accelerate
```

## Quickstart

Here is a simple example demonstrating how to load the model and perform code generation using the Hugging Face `pipeline` API:

```python
import transformers
import torch

model_id = "ByteDance-Seed/Seed-Coder-8B-Base"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

output = pipeline("def say_hello_world():", max_new_tokens=100)
print(output[0]["generated_text"])
```

### Fill-in-the-Middle (FIM) Example

Seed-Coder-8B-Base natively supports **Fill-in-the-Middle (FIM)** tasks, where the model is given a prefix and a suffix and asked to predict the missing middle content. This allows for code infilling scenarios such as completing a function body or inserting missing logic between two pieces of code.

A typical example:

```python
import transformers
import torch

model_id = "ByteDance-Seed/Seed-Coder-8B-Base"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

# Code before (prefix) and after (suffix) the gap the model should fill
prefix = "def add_numbers(a, b):\n    "
suffix = "\n    return result"

# Build the FIM prompt in suffix-first order: suffix, then prefix, then the middle marker
fim_input = '<[fim-suffix]>' + suffix + '<[fim-prefix]>' + prefix + '<[fim-middle]>'

output = pipeline(fim_input, max_new_tokens=512)
print(output[0]["generated_text"])
```
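
The prompt layout used above can be wrapped in a few small helpers. This is a minimal sketch, assuming the three special tokens and the suffix-first ordering shown in the example; the function names `build_fim_prompt`, `extract_middle`, and `splice_middle` are hypothetical and do not come from the model repository.

```python
# Special tokens as they appear in the FIM example above
FIM_SUFFIX = "<[fim-suffix]>"
FIM_PREFIX = "<[fim-prefix]>"
FIM_MIDDLE = "<[fim-middle]>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a FIM prompt: suffix first, then prefix, then the middle marker."""
    return f"{FIM_SUFFIX}{suffix}{FIM_PREFIX}{prefix}{FIM_MIDDLE}"


def extract_middle(generated_text: str, prompt: str) -> str:
    """Drop the echoed prompt from the pipeline output, if it is present."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    return generated_text


def splice_middle(prefix: str, middle: str, suffix: str) -> str:
    """Reassemble the completed code from its three parts."""
    return prefix + middle + suffix


prefix = "def add_numbers(a, b):\n    "
suffix = "\n    return result"
prompt = build_fim_prompt(prefix, suffix)

# Stand-in for a real model output; an actual completion may differ.
fake_output = prompt + "result = a + b"
print(splice_middle(prefix, extract_middle(fake_output, prompt), suffix))
```

By default the text-generation pipeline returns the prompt together with the completion, which is why `extract_middle` strips it. Passing `return_full_text=False` to the pipeline call returns only the newly generated text instead.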

## Evaluation

Seed-Coder-8B-Base has been evaluated on code generation, code completion, and code reasoning benchmarks. Among the compared open-source models at roughly the 8B scale, it achieves the best score on three of the four benchmarks below.

|            | DeepSeek-Coder-6.7B-Base | OpenCoder-8B-Base | Qwen2.5-Coder-7B | Seed-Coder-8B-Base |
|------------|:------------------------:|:-----------------:|:----------------:|:------------------:|
| HumanEval  |           47.6           |        66.5       |       72.0       |        77.4        |
| MBPP       |           70.2           |        79.9       |       79.4       |        82.0        |
| MultiPL-E  |           44.7           |        61.0       |       58.8       |        67.6        |
| CRUXEval-O |           41.0           |        43.9       |       56.0       |        48.4        |

For detailed benchmark performance, please refer to our [📑 Technical Report](https://github.com/ByteDance-Seed/Seed-Coder/blob/master/Seed-Coder.pdf).
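
To make the comparison concrete, the following sketch computes the margin between Seed-Coder-8B-Base and the strongest of the three baselines on each benchmark. The scores are copied from the table above; the variable names are illustrative only.

```python
# Scores from the table above, in column order:
# DeepSeek-Coder-6.7B-Base, OpenCoder-8B-Base, Qwen2.5-Coder-7B, Seed-Coder-8B-Base
scores = {
    "HumanEval": (47.6, 66.5, 72.0, 77.4),
    "MBPP": (70.2, 79.9, 79.4, 82.0),
    "MultiPL-E": (44.7, 61.0, 58.8, 67.6),
    "CRUXEval-O": (41.0, 43.9, 56.0, 48.4),
}

# Margin = Seed-Coder score minus the best baseline score, rounded to one decimal
margins = {
    name: round(row[-1] - max(row[:-1]), 1)
    for name, row in scores.items()
}

for name, margin in margins.items():
    print(f"{name}: {margin:+.1f}")
```

A positive margin means Seed-Coder-8B-Base leads on that benchmark; a negative margin means one of the baselines scores higher.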


## License

This project is licensed under the MIT License. See the [LICENSE file](https://github.com/ByteDance-Seed/Seed-Coder/blob/master/LICENSE) for details.

<!-- ## Citation

If you find Seed-Coder helpful, please consider citing our work:

```
@article{bytedance2025seedcoder,
    title={Seed-Coder: Let the Code Model Curate Data for Itself},
    author={Xxx},
    year={2025},
    eprint={xxxx.xxxxx},
    archivePrefix={arXiv},
    primaryClass={cs.AI},
    url={https://arxiv.org/abs/xxxx.xxxxx}, 
}
```  -->