Update README.md
README.md CHANGED
@@ -5,7 +5,7 @@ tags:
 - vision
 ---
 
-# Vision Transformer (base-sized model
+# Vision Transformer (base-sized model) trained using DINOv2
 
 Vision Transformer (ViT) model trained using the DINOv2 method. It was introduced in the paper [DINOv2: Learning Robust Visual Features without Supervision](https://arxiv.org/abs/2304.07193) by Oquab et al. and first released in [this repository](https://github.com/facebookresearch/dinov2).
 
@@ -15,7 +15,7 @@ Disclaimer: The team releasing DINOv2 did not write a model card for this model
 
 The Vision Transformer (ViT) is a transformer encoder model (BERT-like) pretrained on a large collection of images in a self-supervised fashion at a resolution of 224x224 pixels.
 
-Images are presented to the model as a sequence of fixed-size patches
+Images are presented to the model as a sequence of fixed-size patches, which are linearly embedded. One also adds a [CLS] token to the beginning of a sequence to use it for classification tasks. One also adds absolute position embeddings before feeding the sequence to the layers of the Transformer encoder.
 
 Note that this model does not include any fine-tuned heads.
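Since the updated card describes the input as a [CLS] token plus a sequence of linearly embedded patches, and notes that the checkpoint ships without any fine-tuned heads, the natural use is feature extraction. Below is a minimal sketch of that workflow; it assumes the checkpoint is published on the Hugging Face Hub as `facebook/dinov2-base` (the repository name is not stated in this diff) and that the installed `transformers` version includes the DINOv2 integration. DINOv2's base model uses 14x14 pixel patches, so a 224x224 input yields 16x16 = 256 patch tokens plus the [CLS] token.

```python
# Feature-extraction sketch for the DINOv2 base model (no fine-tuned head).
# Assumption: the checkpoint is available as "facebook/dinov2-base".
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model = AutoModel.from_pretrained("facebook/dinov2-base")

# The processor resizes, crops, and normalizes the image to the
# 224x224 resolution the model was pretrained at.
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Sequence = 1 [CLS] token + 256 patch tokens (224 / 14 = 16 patches per side).
features = outputs.last_hidden_state  # shape: (1, 257, 768) for the base model
cls_embedding = features[:, 0]        # global image descriptor
patch_embeddings = features[:, 1:]    # per-patch descriptors
```

Because there is no classification head, `cls_embedding` can be fed to a downstream linear probe or nearest-neighbor index, while `patch_embeddings` supports dense tasks such as segmentation or retrieval.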