---
license: "cc-by-nc-4.0"
tags:
- code
- python
- javascript
---

# InCoder 1B

A 1B-parameter decoder-only Transformer model trained on code using a causal-masked objective, which allows inserting/infilling code as well as standard left-to-right generation.

The model was trained on public open-source repositories with a permissive, non-copyleft license (Apache 2.0, MIT, BSD-2, or BSD-3) from GitHub and GitLab, as well as on StackOverflow. The repositories primarily contained Python and JavaScript, but also included code from 28 languages.

For more information, see our:

- [Demo](https://huggingface.co/spaces/facebook/incoder-demo)
- [Project site](https://sites.google.com/view/incoder-code-models)
- [Examples](https://sites.google.com/view/incoder-code-models/home/examples)
- [Paper](https://arxiv.org/abs/2204.05999)

A larger 6B-parameter model is also available at [facebook/incoder-6B](https://huggingface.co/facebook/incoder-6B).

## Requirements

The model requires `pytorch`, `tokenizers`, and `transformers`. It needs HF's `tokenizers` >= 0.12.1 because of changes in the pretokenizer.

```
pip install torch
pip install "tokenizers>=0.12.1"
pip install transformers
```
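To confirm that an existing environment meets the `tokenizers` requirement, a small check along these lines may help. It is a minimal sketch using only the standard library; the helper names (`parse_release`, `meets_minimum`, `check_tokenizers`) are hypothetical and not part of any package, and the parsing deliberately ignores pre-release suffixes.

```python
from importlib.metadata import PackageNotFoundError, version

MIN_TOKENIZERS = "0.12.1"  # minimum version stated in this README


def parse_release(v: str) -> tuple:
    """Return the leading numeric components of a version string.

    '0.12.1' -> (0, 12, 1); '0.13.0.dev0' -> (0, 13, 0).
    Suffixes such as 'rc1' or 'dev0' are ignored (a simplification).
    """
    parts = []
    for piece in v.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = MIN_TOKENIZERS) -> bool:
    """True if `installed` is at least `minimum`, comparing numeric parts."""
    return parse_release(installed) >= parse_release(minimum)


def check_tokenizers() -> str:
    """Describe whether the installed `tokenizers` satisfies the minimum."""
    try:
        installed = version("tokenizers")
    except PackageNotFoundError:
        return "tokenizers is not installed"
    status = "OK" if meets_minimum(installed) else "too old"
    return f"tokenizers {installed}: {status} (need >= {MIN_TOKENIZERS})"
```

Calling `print(check_tokenizers())` reports the installed version and whether it meets the minimum.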

## Usage

See [https://github.com/dpfried/incoder](https://github.com/dpfried/incoder) for example code.

### Model
Load the model with:

`model = AutoModelForCausalLM.from_pretrained("facebook/incoder-1B")`

### Tokenizer
Load the tokenizer with:

`tokenizer = AutoTokenizer.from_pretrained("facebook/incoder-1B")`

Note: the incoder-1B and incoder-6B tokenizers are identical, so `facebook/incoder-6B` could also be used as the tokenizer name.
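The two loading calls can be wrapped in one helper, shown here with the imports they need. This is a minimal sketch: `load_incoder` and `MODEL_NAME` are hypothetical names chosen for illustration, and the import is deferred into the function so that defining the helper does not itself require `transformers` or trigger a download.

```python
MODEL_NAME = "facebook/incoder-1B"


def load_incoder(model_name: str = MODEL_NAME):
    """Load the model and its tokenizer from the Hugging Face Hub.

    The first call downloads the weights, so it needs network access
    and enough disk space and memory for a 1B-parameter model.
    """
    # Deferred import: only needed when the model is actually loaded.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()  # inference mode (disables dropout)
    return model, tokenizer
```

Typical use would be `model, tokenizer = load_incoder()`.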

When calling `tokenizer.decode`, it's important to pass `clean_up_tokenization_spaces=False` to avoid removing spaces after punctuation:

`tokenizer.decode(tokenizer.encode("from ."), clean_up_tokenization_spaces=False)`
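One way to avoid forgetting the flag is to route every decode through a small wrapper. The sketch below assumes a `model` and `tokenizer` already loaded as described above; `decode_exact` and `complete` are hypothetical helper names, and greedy decoding (`do_sample=False`) is an arbitrary choice for illustration rather than a recommended setting.

```python
def decode_exact(tokenizer, token_ids):
    """Decode token ids without the tokenizer's whitespace clean-up."""
    return tokenizer.decode(token_ids, clean_up_tokenization_spaces=False)


def complete(model, tokenizer, prompt: str, max_new_tokens: int = 32) -> str:
    """Left-to-right completion of `prompt`, decoded with spacing preserved."""
    # Encode the prompt as a batch of one sequence of PyTorch token ids.
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output = model.generate(
        input_ids=input_ids,
        max_new_tokens=max_new_tokens,
        do_sample=False,  # greedy decoding, for reproducibility
    )
    # output[0] holds the prompt tokens followed by the generated tokens.
    return decode_exact(tokenizer, output[0])
```

For example, `print(complete(model, tokenizer, "def hello_world():"))` prints the prompt followed by the model's continuation. The decoded string may include special tokens, since this sketch does not strip them.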

## License

CC-BY-NC 4.0

## Credits

The model was developed by Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Wen-tau Yih, Luke Zettlemoyer and Mike Lewis.

Thanks to Lucile Saulnier, Leandro von Werra, Nicolas Patry, Suraj Patil, Omar Sanseviero, and others at HuggingFace for help with the model release, and to Naman Goyal and Stephen Roller for the code our demo was based on!