---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: text-generation
inference: true
widget:
- text: "public class HelloWorld {\n    public static void main(String[] args) {"
  example_title: Hello world
  group: Java
---


# NT-Java


## Table of Contents

1. [Model Summary](#model-summary)
2. [Use](#use)
3. [Limitations](#limitations)
4. [Training](#training)
5. [License](#license)
6. [Citation](#citation)

## Model Summary

The Narrow Transformer (NT) model NT-Java-1.1B is an open-source, specialized code model built on StarCoderBase and designed for code completion tasks in Java. It is a decoder-only transformer with Multi-Query Attention and learned absolute positional embeddings, fine-tuned on the Java subset of the training data (starcoderdata), roughly 22B tokens, with a context length of 8,192 tokens.

- **Repository:** [bigcode/Megatron-LM](https://github.com/bigcode-project/Megatron-LM)
- **Project Website:** 
- **Paper:** 
- **Point of Contact:** 
- **Languages:** Java
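
The architectural details above can be checked from the released configuration. The sketch below is illustrative and assumes a standard `transformers` GPTBigCode-style config; the exact field names are an assumption.

```python
from transformers import AutoConfig

# Illustrative config inspection; field names assume a GPTBigCode-style config.
config = AutoConfig.from_pretrained("infosys/NT-Java-1.1B")

print(config.model_type)                       # e.g. "gpt_bigcode"
print(getattr(config, "multi_query", None))    # multi-query attention flag
print(getattr(config, "n_positions", None))    # context length (expected 8192)
print(getattr(config, "n_layer", None), getattr(config, "n_head", None))
```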

## Use

### Intended use

Large code models typically require specialized hardware such as GPUs for inference, which motivates research into small code models that can run on developer desktops. NT-Java-1.1B addresses this gap: it is a small Java code model, also released in a quantized version, that performs comparably to open 1.1B models on the MultiPL-E Java benchmark, making it well suited to desktop deployment.
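
If memory is tight, one hedged option (distinct from the released quantized artifact) is to load the full-precision checkpoint in 8-bit via `bitsandbytes`; the flags below are illustrative, not an official recipe.

```python
# pip install -q transformers accelerate bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

checkpoint = "infosys/NT-Java-1.1B"

# Illustrative 8-bit load; this is NOT the officially released quantized version.
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    quantization_config=bnb_config,
    device_map="auto",
)
```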

**Feel free to share your generations in the Community tab!**

### Generation
```python
# pip install -q transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "infosys/NT-Java-1.1B"
device = "cuda" # for GPU usage or "cpu" for CPU usage

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

inputs = tokenizer.encode("public class HelloWorld {\n    public static void main(String[] args) {", return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
```
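
With the defaults, `generate` returns only a short continuation. Continuing from the snippet above, the settings below are illustrative (not tuned recommendations for this model) and produce a longer, lightly sampled completion:

```python
# Parameter values are illustrative, not official recommendations.
outputs = model.generate(
    inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.2,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```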

### Fill-in-the-middle
Fill-in-the-middle uses special tokens to identify the prefix/middle/suffix part of the input and output:

```python
input_text = "<fim_prefix>public class HelloWorld {\n    public static void main(String[] args) {<fim_suffix>}\n}<fim_middle>"
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
```
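
The decoded output still contains the FIM sentinel tokens. A minimal post-processing sketch, assuming the infilled text follows `<fim_middle>` and is terminated by the tokenizer's `<|endoftext|>` marker:

```python
# Sketch: keep only the infilled middle segment from the decoded output.
decoded = tokenizer.decode(outputs[0])
middle = decoded.split("<fim_middle>")[-1]   # text generated after the sentinel
middle = middle.split("<|endoftext|>")[0]    # assumed end-of-text marker
print(middle)
```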

### Attribution & Other Requirements

The pretraining dataset of the model was filtered for permissively licensed code only. Nevertheless, the model can generate source code verbatim from the dataset, and that code's license might require attribution and/or impose other conditions that must be respected. We provide a [search index](https://huggingface.co/spaces/bigcode/starcoder-search) that lets you search the pretraining data to identify where generated code came from and apply the proper attribution.

## Limitations

The base model was trained on source code from 80+ programming languages, with English as the predominant natural language in the data, before being fine-tuned on Java. The model can therefore generate code snippets given some context, but the generated code is not guaranteed to work as intended: it can be inefficient and contain bugs or security vulnerabilities. See [the paper](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) for an in-depth discussion of the model's limitations.

## Training

### Model

- **Architecture:** GPT-2 model with multi-query attention and a Fill-in-the-Middle objective (see the sketch after this list)
- **Fine-tuning steps:** 50k
- **Fine-tuning tokens:** 22 billion
- **Precision:** bfloat16
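
As a rough illustration of the Fill-in-the-Middle objective, a training document can be split into prefix/middle/suffix and rearranged with the sentinel tokens before tokenization. This is a simplified, PSM-style sketch; the actual Megatron-LM pipeline, FIM rate, and mode mix are assumptions.

```python
import random

def fim_transform(document: str, fim_rate: float = 0.5) -> str:
    """Simplified PSM-style FIM formatting; real pipeline details will differ."""
    if random.random() >= fim_rate:
        return document  # leave the sample as ordinary left-to-right text
    # Pick two cut points defining prefix / middle / suffix spans (non-empty doc).
    i, j = sorted(random.sample(range(len(document) + 1), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"
```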

### Hardware

- **GPUs:** 6 NVIDIA A100 80GB 
- **Training time:**  4 days

### Software

- **Orchestration:** [Megatron-LM](https://github.com/bigcode-project/Megatron-LM)
- **Neural networks:** [PyTorch](https://github.com/pytorch/pytorch)
- **BF16 if applicable:** [apex](https://github.com/NVIDIA/apex)

## License

The model is licensed under the Apache License 2.0. You can find the full license text [here](https://www.apache.org/licenses/LICENSE-2.0).
## Citation

```
@article{li2023starcoder,
      title={StarCoder: may the source be with you!},
      author={Raymond Li and others},
      year={2023},
      eprint={2305.06161},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```