---
license: other
license_name: inf
license_link: https://huggingface.co/infly/OpenCoder-1.5B-Base/blob/main/LICENSE
language:
- en
- zh
base_model: infly/OpenCoder-1.5B-Base
pipeline_tag: text-generation
library_name: transformers
tags:
- code
---
<h1 align="center">
<br>
OpenCoder-1.5B-Base-16K-via-4K
<br>
</h1>
<p align="center">
<a href="https://github.com/sapromak/adaptive-code-completion">Home Page</a> •
<a href="https://huggingface.co/collections/sapromak/repository-level-pre-trained-opencoder-684206bfc99d48a7e94c0789">Collection</a> •
<a href="https://openreview.net/forum?id=t9RN9WX4Ic">Paper</a> •
<a href="https://github.com/sapromak/adaptive-code-completion/blob/main/thesis.pdf">Thesis</a>
</p>
## Description
This model is derived from [OpenCoder-1.5B-Base](https://huggingface.co/infly/OpenCoder-1.5B-Base) by additional context-extension fine-tuning that adjusts the RoPE base frequency from 10,000 to 500,000. Training ran for 512 optimization steps with a batch size of 128 on sequences of up to 4,096 tokens. No repository context was used to obtain this checkpoint. More details on the training procedure and other aspects, including all code used, can be found on the project [Home Page](https://github.com/sapromak/adaptive-code-completion). Note that this model was created to answer specific research questions and __not__ to achieve the best possible performance in the repository-level code completion setup; consider it a baseline.
The associated research was initiated and conducted by the [JetBrains Research](https://huggingface.co/JetBrains-Research) association.
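As a quick sanity check of the context-extension setup described above, the adjusted RoPE base frequency can be read back from the model configuration. This is a minimal sketch; the `rope_theta` field name is an assumption about how the configuration exposes this value.
```python
from transformers import AutoConfig

# Minimal sketch: inspect the RoPE base frequency of this checkpoint.
# The field name `rope_theta` is an assumption about the config layout.
config = AutoConfig.from_pretrained(
    "sapromak/OpenCoder-1.5B-Base-16K-via-4K",
    trust_remote_code=True,
)
print(config.rope_theta)  # expected to reflect the adjusted base frequency of 500,000
```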
<div align="center">
<img src="https://github.com/sapromak/adaptive-code-completion/blob/main/paper/figures/compilation/beyond-training-window/beyond-training-window-inproject.svg?raw=true" width="100%" alt="Performance" />
  <p>Exact Match on the <em>inproject</em> lines of the <em>large-context</em> subset of the <a href="https://huggingface.co/datasets/JetBrains-Research/lca-project-level-code-completion">Project-Level Code Completion task</a> from the <a href="https://arxiv.org/abs/2406.11612">Long Code Arena benchmark</a>. This checkpoint (dashed orange curve) demonstrates its best performance at a context length of 16,384. "1K" refers to 1,024 tokens. The star markers denote the context length used during the repository-level pre-training stage.</p>
</div>
## Quickstart
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "sapromak/OpenCoder-1.5B-Base-16K-via-4K"
# The tokenizer is unchanged, so it is loaded from the original base model.
tokenizer_name = "infly/OpenCoder-1.5B-Base"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=True)

# Complete a code prompt.
inputs = tokenizer("# write a quick sort algorithm", return_tensors="pt")
outputs = model.generate(**inputs.to(model.device), max_new_tokens=256)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```
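Since this checkpoint performs best at a context length of 16,384 tokens (see the figure above), longer repository-level prompts can be left-truncated to that budget before generation. The following is a minimal sketch, reusing `model` and `tokenizer` from the quickstart; `long_context` is a hypothetical placeholder string, and only the 16,384-token context length is taken from this card.
```python
# Minimal sketch: cap the prompt at the 16,384-token context length at which
# this checkpoint performs best, keeping the most recent (rightmost) tokens.
# `long_context` is a placeholder for repository context plus the file prefix.
long_context = "..."  # placeholder

tokenizer.truncation_side = "left"
inputs = tokenizer(
    long_context,
    return_tensors="pt",
    truncation=True,
    max_length=16_384 - 256,  # reserve room for the generated tokens
)
outputs = model.generate(**inputs.to(model.device), max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```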