license: apache-2.0

Seed-Coder-8B-Base

Introduction

Seed-Coder-8B-Base is an 8-billion-parameter foundation model for code understanding and generation. It is intended as a general-purpose code model that handles a wide range of coding tasks.
Key features:

Curated pre-training data: the corpus combines real-world code, text-code alignment data, and synthetic datasets, filtered with LLM-based techniques to give cleaner and more effective learning signals.
Code completion and Fill-in-the-Middle (FIM): the model completes code and can predict a missing span from the code before and after it.
Multi-language coverage: robust performance across programming languages and code reasoning scenarios, which makes it suitable for downstream fine-tuning or direct use in code generation systems.
Long context: supports up to 32K tokens, enough for large codebases, multi-file projects, and extended editing tasks.

Seed-Coder-8B-Base serves as the foundation for Seed-Coder-8B-Instruct and Seed-Coder-8B-Reasoning.
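
As a rough illustration of working within the 32K-token context, the sketch below greedily selects files from a multi-file project until the prompt plus the planned generation would no longer fit. The helper name `pick_files_within_budget`, the reading of "32K" as 32,768 tokens, and the `count_tokens` callable are all assumptions made for this example and are not part of the model's API.

```python
from typing import Callable, Dict, List

# Assumption: "32K" is read here as 32 * 1024 tokens.
CONTEXT_WINDOW = 32_768


def pick_files_within_budget(
    files: Dict[str, str],
    count_tokens: Callable[[str], int],
    max_new_tokens: int = 512,
    context_window: int = CONTEXT_WINDOW,
) -> List[str]:
    """Keep files in order while prompt tokens + max_new_tokens fit the window."""
    budget = context_window - max_new_tokens  # room left for the prompt
    kept: List[str] = []
    used = 0
    for name, text in files.items():
        cost = count_tokens(text)
        if used + cost > budget:
            break  # stop at the first file that would overflow the budget
        kept.append(name)
        used += cost
    return kept
```

With the real tokenizer loaded through `transformers.AutoTokenizer.from_pretrained(model_id)`, `count_tokens` could plausibly be `lambda s: len(tokenizer(s)["input_ids"])`; any other token counter with the same shape works for the sketch.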

Model Downloads

Model Name	Type	Length	Download
👉Seed-Coder-8B-Base	base	32k	🤗 Hugging Face
Seed-Coder-8B-Instruct	instruct	32k	🤗 Hugging Face
Seed-Coder-8B-Reasoning	reasoning	32k	🤗 Hugging Face

Requirements

You will need to install the latest versions of transformers and accelerate:

pip install -U transformers accelerate
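
To confirm which versions ended up in the environment after the install, a small check like the following may help. It is only a sketch using the standard library; the helper name `installed_versions` is made up for this example, and `torch` is included because the examples in this README import it.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("transformers", "accelerate", "torch")):
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


print(installed_versions())
```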

Quickstart

Here is a simple example demonstrating how to load the model and perform code generation using the Hugging Face pipeline API:

import transformers
import torch

model_id = "ByteDance-Seed/Seed-Coder-8B-Base"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

output = pipeline("def say_hello_world():", max_new_tokens=100)
print(output[0]["generated_text"])
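
Because this is a base model, it may keep generating past the end of the function it was asked to complete. One common post-processing step is to cut the completion at the first stop sequence. The sketch below is an illustration only: the helper `truncate_at_stop` and the stop list are assumptions chosen for Python completions, not something the model card prescribes.

```python
from typing import Sequence

# Hypothetical stop sequences: each marks the start of new top-level Python code.
DEFAULT_STOPS = ("\ndef ", "\nclass ", "\nif __name__", "\nprint(")


def truncate_at_stop(completion: str, stops: Sequence[str] = DEFAULT_STOPS) -> str:
    """Cut the completion at the earliest occurrence of any stop sequence."""
    cut = len(completion)
    for stop in stops:
        idx = completion.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return completion[:cut]
```

The helper expects only the newly generated text. Passing `return_full_text=False` when calling the text-generation pipeline should return the completion without the prompt, which can then be fed to `truncate_at_stop`.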

Fill-in-the-Middle (FIM) Example

Seed-Coder-8B-Base natively supports Fill-in-the-Middle (FIM) tasks, where the model is given a prefix and a suffix and asked to predict the missing middle content.
This allows for code infilling scenarios such as completing a function body or inserting missing logic between two pieces of code.

A typical usage flow:

import transformers
import torch

model_id = "ByteDance-Seed/Seed-Coder-8B-Base"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

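# The FIM prompt places the suffix first, then the prefix; each part is introduced by its own special token
prefix = "def add_numbers(a, b):\n    "
suffix = "\n    return result"

# Assemble the prompt as <[fim-suffix]> suffix <[fim-prefix]> prefix <[fim-middle]>; the model generates the missing middle after the final token
fim_input = '<[fim-suffix]>' + suffix + '<[fim-prefix]>' + prefix + '<[fim-middle]>'

output = pipeline(fim_input, max_new_tokens=512)
# By default, generated_text contains the prompt followed by the generated middle
print(output[0]["generated_text"])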
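
The prompt construction above can be wrapped in small helpers that also splice the generated middle back between the prefix and suffix. This is a minimal sketch: `build_fim_prompt` and `splice_completion` are hypothetical names, and the code assumes the pipeline returns the prompt followed by the generated middle (its default behavior), falling back to treating the whole string as the middle otherwise. It does not strip any end-of-sequence text the model might emit.

```python
# FIM special tokens as they appear in the example above.
FIM_SUFFIX = "<[fim-suffix]>"
FIM_PREFIX = "<[fim-prefix]>"
FIM_MIDDLE = "<[fim-middle]>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Suffix first, then prefix, then the marker after which the middle is generated."""
    return FIM_SUFFIX + suffix + FIM_PREFIX + prefix + FIM_MIDDLE


def splice_completion(prefix: str, suffix: str, generated_text: str) -> str:
    """Remove the echoed prompt, if present, and rebuild prefix + middle + suffix."""
    prompt = build_fim_prompt(prefix, suffix)
    if generated_text.startswith(prompt):
        middle = generated_text[len(prompt):]
    else:
        middle = generated_text  # assume only the middle was returned
    return prefix + middle + suffix
```

For example, `splice_completion(prefix, suffix, output[0]["generated_text"])` would give the full function source with the infilled body in place.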

Evaluation

Seed-Coder-8B-Base has been internally evaluated across a variety of code understanding and generation benchmarks.
It demonstrates strong capabilities in:

Fluent and contextually appropriate code completion.
Reasoning about code structure and inferring missing logic.
Generalizing across different programming languages, coding styles, and codebases.

Benchmark	DeepSeek-Coder-6.7B-Base	OpenCoder-8B-Base	Qwen2.5-Coder-7B	Seed-Coder-8B-Base
HumanEval	47.6	66.5	72.0	77.4
MBPP	70.2	79.9	79.4	82.0
MultiPL-E	44.7	61.0	58.8	67.6
CruxEval-O	41.0	43.9	56.0	48.4
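
To read the table at a glance, the short sketch below copies the scores above into a dictionary and picks the highest-scoring model per benchmark. It only restates the numbers in the table; the variable names are illustrative.

```python
# Scores copied from the table above (rows: benchmarks, columns: models).
scores = {
    "HumanEval": {"DeepSeek-Coder-6.7B-Base": 47.6, "OpenCoder-8B-Base": 66.5,
                  "Qwen2.5-Coder-7B": 72.0, "Seed-Coder-8B-Base": 77.4},
    "MBPP": {"DeepSeek-Coder-6.7B-Base": 70.2, "OpenCoder-8B-Base": 79.9,
             "Qwen2.5-Coder-7B": 79.4, "Seed-Coder-8B-Base": 82.0},
    "MultiPL-E": {"DeepSeek-Coder-6.7B-Base": 44.7, "OpenCoder-8B-Base": 61.0,
                  "Qwen2.5-Coder-7B": 58.8, "Seed-Coder-8B-Base": 67.6},
    "CruxEval-O": {"DeepSeek-Coder-6.7B-Base": 41.0, "OpenCoder-8B-Base": 43.9,
                   "Qwen2.5-Coder-7B": 56.0, "Seed-Coder-8B-Base": 48.4},
}

# Highest-scoring model for each benchmark.
leaders = {bench: max(row, key=row.get) for bench, row in scores.items()}

for bench, model in leaders.items():
    print(f"{bench}: {model} ({scores[bench][model]})")
```

Seed-Coder-8B-Base has the highest score on HumanEval, MBPP, and MultiPL-E, while Qwen2.5-Coder-7B has the highest score on CruxEval-O.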

For detailed benchmark results, please refer to our 📑 paper.

Citation

If you find Seed-Coder helpful, please consider citing our work:

@article{zhang2025seedcoder,
    title={Seed-Coder: Let the Code Model Curate Data for Itself},
    author={Xxx},
    year={2025},
    eprint={2504.xxxxx},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/xxxx.xxxxx}, 
}