---
license: other
license_name: inf
license_link: https://huggingface.co/infly/OpenCoder-1.5B-Base/blob/main/LICENSE
language:
- en
- zh
base_model: infly/OpenCoder-1.5B-Base
pipeline_tag: text-generation
library_name: transformers
tags:
- code
---
<h1 align="center">
<br>
OpenCoder-1.5B-Base-16K-via-4K
<br>
</h1>
<p align="center">
<a href="https://github.com/sapromak/adaptive-code-completion">Home Page</a> •
<a href="https://huggingface.co/collections/sapromak/repository-level-pre-trained-opencoder-684206bfc99d48a7e94c0789">Collection</a> •
<a href="https://openreview.net/forum?id=t9RN9WX4Ic">Paper</a> •
<a href="https://github.com/sapromak/adaptive-code-completion/blob/main/thesis.pdf">Thesis</a>
</p>
## Description
This model is derived from [OpenCoder-1.5B-Base](https://huggingface.co/infly/OpenCoder-1.5B-Base) by additional context-extension fine-tuning that adjusts the RoPE base frequency from 10,000 to 500,000. Training ran for 512 optimization steps with a batch size of 128 on sequences of up to 4,096 tokens. No repository context was used to obtain this checkpoint. More details on the training procedure and other aspects, including all code used, can be found on the project [Home Page](https://github.com/sapromak/adaptive-code-completion). Note that this model was created to answer specific research questions and __not__ to achieve the best possible performance in the repository-level code completion setup; consider it a baseline.
The associated research was initiated and conducted by the [JetBrains Research](https://huggingface.co/JetBrains-Research) association.
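As a quick sanity check of the context-extension setup described above, the adjusted RoPE base frequency can be read back from the model configuration. This is a minimal sketch; the `rope_theta` field name is an assumption about how the configuration exposes this value.
```python
from transformers import AutoConfig

# Minimal sketch: inspect the RoPE base frequency of this checkpoint.
# The field name `rope_theta` is an assumption about the config layout.
config = AutoConfig.from_pretrained(
    "sapromak/OpenCoder-1.5B-Base-16K-via-4K",
    trust_remote_code=True,
)
print(config.rope_theta)  # expected to reflect the adjusted base frequency of 500,000
```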
<div align="center">
<img src="https://github.com/sapromak/adaptive-code-completion/blob/main/paper/figures/compilation/beyond-training-window/beyond-training-window-inproject.svg?raw=true" width="100%" alt="Performance" />
  <p>Exact Match on the <em>inproject</em> lines of the <em>large-context</em> subset of the <a href="https://huggingface.co/datasets/JetBrains-Research/lca-project-level-code-completion">Project-Level Code Completion task</a> from the <a href="https://arxiv.org/abs/2406.11612">Long Code Arena benchmark</a>. This checkpoint (dashed orange curve) demonstrates its best performance at a context length of 16,384. "1K" refers to 1,024 tokens. The star markers denote the context length used during the repository-level pre-training stage.</p>
</div>
## Quickstart
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "sapromak/OpenCoder-1.5B-Base-16K-via-4K"
# The tokenizer is unchanged, so it is loaded from the original base model.
tokenizer_name = "infly/OpenCoder-1.5B-Base"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=True)

# Complete a code prompt.
inputs = tokenizer("# write a quick sort algorithm", return_tensors="pt")
outputs = model.generate(**inputs.to(model.device), max_new_tokens=256)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```
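Since this checkpoint performs best at a context length of 16,384 tokens (see the figure above), longer repository-level prompts can be left-truncated to that budget before generation. The following is a minimal sketch, reusing `model` and `tokenizer` from the quickstart; `long_context` is a hypothetical placeholder string, and only the 16,384-token context length is taken from this card.
```python
# Minimal sketch: cap the prompt at the 16,384-token context length at which
# this checkpoint performs best, keeping the most recent (rightmost) tokens.
# `long_context` is a placeholder for repository context plus the file prefix.
long_context = "..."  # placeholder

tokenizer.truncation_side = "left"
inputs = tokenizer(
    long_context,
    return_tensors="pt",
    truncation=True,
    max_length=16_384 - 256,  # reserve room for the generated tokens
)
outputs = model.generate(**inputs.to(model.device), max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```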