license: apache-2.0

Seed-Coder-8B-Base

Introduction

Seed-Coder-8B-Base is an 8-billion-parameter foundation model tailored for code understanding and generation. It is designed to provide developers with a powerful, general-purpose code model capable of handling a wide range of coding tasks.
Key features:

  • Pre-training on a large, curated corpus filtered with LLM-based techniques, combining high-quality real-world code, text-code alignment data, and synthetic datasets for cleaner and more effective learning signals.
  • Strong code completion, including Fill-in-the-Middle (FIM) support for predicting a missing code span from the surrounding context.
  • Robust performance across programming languages and code reasoning scenarios, making it well suited to downstream fine-tuning or direct use in code generation systems.
  • Long-context support of up to 32K tokens for large codebases, multi-file projects, and extended editing tasks.

Seed-Coder-8B-Base serves as the foundation for Seed-Coder-8B-Instruct and Seed-Coder-8B-Reasoning.
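The 32K-token context limit mentioned above has to cover both the prompt and the tokens to be generated. The following is a minimal sketch of that budgeting, not part of the official usage. It assumes "32K" means 32 * 1024 = 32768 tokens (check the model configuration for the exact value), and the helper names `max_prompt_tokens` and `truncate_left` are hypothetical.

```python
# Assumption: "32K" is read as 32 * 1024 tokens; verify against the model config.
ASSUMED_CONTEXT_TOKENS = 32 * 1024


def max_prompt_tokens(max_new_tokens: int, context_tokens: int = ASSUMED_CONTEXT_TOKENS) -> int:
    """Return how many prompt tokens fit once room for generation is reserved."""
    if max_new_tokens < 0:
        raise ValueError("max_new_tokens must be non-negative")
    return max(context_tokens - max_new_tokens, 0)


def truncate_left(token_ids: list, budget: int) -> list:
    """Keep the most recent `budget` tokens and drop the oldest ones."""
    if budget <= 0:
        return []
    return token_ids[-budget:]


if __name__ == "__main__":
    budget = max_prompt_tokens(max_new_tokens=512)
    print(budget)  # 32256
    print(truncate_left([1, 2, 3, 4, 5], 3))  # [3, 4, 5]
```

Dropping tokens from the left keeps the code closest to the cursor, which is usually what matters for completion. Other truncation strategies may suit multi-file prompts better.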

Model Downloads

| Model Name | Type | Length | Download |
|---|---|---|---|
| 👉 Seed-Coder-8B-Base | base | 32k | 🤗 Hugging Face |
| Seed-Coder-8B-Instruct | instruct | 32k | 🤗 Hugging Face |
| Seed-Coder-8B-Reasoning | reasoning | 32k | 🤗 Hugging Face |

Requirements

You will need to install the latest versions of transformers and accelerate:

pip install -U transformers accelerate

Quickstart

Here is a simple example demonstrating how to load the model and perform code generation using the Hugging Face pipeline API:

import transformers
import torch

model_id = "ByteDance-Seed/Seed-Coder-8B-Base"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

output = pipeline("def say_hello_world():", max_new_tokens=100)
print(output[0]["generated_text"])
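A base model is not trained to stop after one function, so a completion may run on into unrelated code until `max_new_tokens` is reached. By default the pipeline returns the prompt followed by the continuation. The sketch below shows one common post-processing approach: cut the continuation at the first marker that looks like a new top-level definition. The stop markers and helper names are illustrative assumptions, not part of the model or the transformers API, and the sample text is a stand-in rather than real model output.

```python
# Hypothetical stop markers for Python completions; adjust for other languages.
STOP_SEQUENCES = ["\ndef ", "\nclass ", "\nif __name__"]


def split_continuation(generated_text: str, prompt: str) -> str:
    """Return only the newly generated part when the text starts with the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    return generated_text


def trim_at_stop(continuation: str, stops=STOP_SEQUENCES) -> str:
    """Cut the continuation at the earliest occurrence of any stop marker."""
    cut = len(continuation)
    for stop in stops:
        idx = continuation.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return continuation[:cut]


if __name__ == "__main__":
    prompt = "def say_hello_world():"
    # Stand-in for output[0]["generated_text"]; not real model output.
    sample = prompt + "\n    print('Hello, world!')\n\ndef another():\n    pass\n"
    body = trim_at_stop(split_continuation(sample, prompt))
    print(prompt + body)
```

Passing `return_full_text=False` to the pipeline call is an alternative way to receive only the continuation, in which case `split_continuation` is unnecessary.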

Fill-in-the-Middle (FIM) Example

Seed-Coder-8B-Base natively supports Fill-in-the-Middle (FIM) tasks, where the model is given a prefix and a suffix and asked to predict the missing middle content.
This allows for code infilling scenarios such as completing a function body or inserting missing logic between two pieces of code.

A typical usage flow:

import transformers
import torch

model_id = "ByteDance-Seed/Seed-Coder-8B-Base"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

# Define the code before (prefix) and after (suffix) the span to be filled in
prefix = "def add_numbers(a, b):\n    "
suffix = "\n    return result"

# Build the FIM prompt: suffix marker + suffix, then prefix marker + prefix,
# and finally the middle marker, after which the model generates the infill
fim_input = '<[fim-suffix]>' + suffix + '<[fim-prefix]>' + prefix + '<[fim-middle]>'

output = pipeline(fim_input, max_new_tokens=512)
print(output[0]["generated_text"])
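The prompt layout used above can be wrapped in small helpers so that the marker order is written in one place and the infill is easy to recover. This is a sketch under stated assumptions: the helper names are hypothetical, the marker strings are copied from the example above, and `extract_middle` assumes the returned text still contains the prompt (the pipeline default). The simulated output is a stand-in, not a real model response.

```python
# Marker strings copied from the FIM example above.
FIM_SUFFIX = "<[fim-suffix]>"
FIM_PREFIX = "<[fim-prefix]>"
FIM_MIDDLE = "<[fim-middle]>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Arrange suffix, prefix, and the middle marker in the order shown above."""
    return f"{FIM_SUFFIX}{suffix}{FIM_PREFIX}{prefix}{FIM_MIDDLE}"


def extract_middle(generated_text: str) -> str:
    """Return the text after the last middle marker, i.e. the generated infill."""
    _, marker, middle = generated_text.rpartition(FIM_MIDDLE)
    if not marker:
        raise ValueError("no FIM middle marker found in the generated text")
    return middle


def stitch(prefix: str, middle: str, suffix: str) -> str:
    """Reassemble the completed code from its three parts."""
    return prefix + middle + suffix


if __name__ == "__main__":
    prefix = "def add_numbers(a, b):\n    "
    suffix = "\n    return result"
    prompt = build_fim_prompt(prefix, suffix)
    # Stand-in for output[0]["generated_text"]; not real model output.
    simulated = prompt + "result = a + b"
    print(stitch(prefix, extract_middle(simulated), suffix))
```

If the pipeline is called with `return_full_text=False`, the returned text is already the infill and `extract_middle` is not needed.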

Evaluation

Seed-Coder-8B-Base has been internally evaluated across a variety of code understanding and generation benchmarks.
It demonstrates strong capabilities in:

  • Fluent and contextually appropriate code completion.
  • Reasoning about code structure and inferring missing logic.
  • Generalizing across different programming languages, coding styles, and codebases.

| Benchmark | DeepSeek-Coder-6.7B-Base | OpenCoder-8B-Base | Qwen2.5-Coder-7B | Seed-Coder-8B-Base |
|---|---|---|---|---|
| HumanEval | 47.6 | 66.5 | 72.0 | 77.4 |
| MBPP | 70.2 | 79.9 | 79.4 | 82.0 |
| MultiPL-E | 44.7 | 61.0 | 58.8 | 67.6 |
| CruxEval-O | 41.0 | 43.9 | 56.0 | 48.4 |

For detailed benchmark results, please refer to our 📑 paper.
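To read the scores above at a glance, the short sketch below computes the margin between Seed-Coder-8B-Base and the strongest of the other three models on each benchmark. The numbers are copied from the results above. The function name is hypothetical and the calculation is plain subtraction, not an official metric.

```python
# Scores copied from the benchmark results above.
SCORES = {
    "HumanEval": {"DeepSeek-Coder-6.7B-Base": 47.6, "OpenCoder-8B-Base": 66.5,
                  "Qwen2.5-Coder-7B": 72.0, "Seed-Coder-8B-Base": 77.4},
    "MBPP": {"DeepSeek-Coder-6.7B-Base": 70.2, "OpenCoder-8B-Base": 79.9,
             "Qwen2.5-Coder-7B": 79.4, "Seed-Coder-8B-Base": 82.0},
    "MultiPL-E": {"DeepSeek-Coder-6.7B-Base": 44.7, "OpenCoder-8B-Base": 61.0,
                  "Qwen2.5-Coder-7B": 58.8, "Seed-Coder-8B-Base": 67.6},
    "CruxEval-O": {"DeepSeek-Coder-6.7B-Base": 41.0, "OpenCoder-8B-Base": 43.9,
                   "Qwen2.5-Coder-7B": 56.0, "Seed-Coder-8B-Base": 48.4},
}


def margin_over_best_baseline(row: dict, model: str = "Seed-Coder-8B-Base"):
    """Return (best other model, score difference) for one benchmark row."""
    others = [(name, score) for name, score in row.items() if name != model]
    best_name, best_score = max(others, key=lambda item: item[1])
    return best_name, round(row[model] - best_score, 1)


if __name__ == "__main__":
    for benchmark, row in SCORES.items():
        best_name, margin = margin_over_best_baseline(row)
        print(f"{benchmark}: {margin:+.1f} vs {best_name}")
```

On these figures Seed-Coder-8B-Base leads on HumanEval, MBPP, and MultiPL-E, and trails Qwen2.5-Coder-7B on CruxEval-O.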

Citation

If you find Seed-Coder helpful, please consider citing our work:

@article{zhang2025seedcoder,
    title={Seed-Coder: Let the Code Model Curate Data for Itself},
    author={Xxx},
    year={2025},
    eprint={2504.xxxxx},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/xxxx.xxxxx}, 
}