# easy_ViTPose

<p align="center">

<img src="https://user-images.githubusercontent.com/24314647/236082274-b25a70c8-9267-4375-97b0-eddf60a7dfc6.png" width=375> easy_ViTPose
</p>

## Accurate 2D human and animal pose estimation

<a target="_blank" href="https://colab.research.google.com/github/JunkyByte/easy_ViTPose/blob/main/colab_demo.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

### Easy-to-use SOTA `ViTPose` [Y. Xu et al., 2022] models for fast inference.  
We provide all of the original ViTPose models, converted for inference, with a single dataset output format.

In addition, we provide a COCO-25 model trained on the original COCO dataset plus the foot keypoint dataset: https://cmu-perceptual-computing-lab.github.io/foot_keypoint_dataset/  
Finetuning is not currently supported; check de43d54cad87404cf0ad4a7b5da6bacf4240248b and earlier commits for a working version of `train.py`.

> [!WARNING]
> Ultralytics `yolov8` has an issue that produces wrong bounding boxes when using `mps`; upgrade to the latest version! (It works correctly on 8.2.48.)

## Results
![resimg](https://github.com/JunkyByte/easy_ViTPose/assets/24314647/51c0777f-b268-448a-af02-9a3537f288d8)

https://github.com/JunkyByte/easy_ViTPose/assets/24314647/e9a82c17-6e99-4111-8cc8-5257910cb87e



https://github.com/JunkyByte/easy_ViTPose/assets/24314647/63af44b1-7245-4703-8906-3f034a43f9e3

(Credits dance: https://www.youtube.com/watch?v=p-rSdt0aFuw )  
(Credits zebras: https://www.youtube.com/watch?v=y-vELRYS8Yk )

## Features
- Image / video / webcam support
- Video support using the SORT algorithm to track bounding boxes across frames
- Torch / ONNX / TensorRT inference
- Runs the original ViTPose checkpoints from [ViTAE-Transformer/ViTPose](https://github.com/ViTAE-Transformer/ViTPose)
- 4 ViTPose architectures with different sizes and performance (s: small, b: base, l: large, h: huge)
- Multiple skeletons and datasets: (AIC / MPII / COCO / COCO + FEET / COCO WHOLEBODY / APT36k / AP10k)
- Human / animal pose estimation
- CPU / GPU / Metal support
- Show and save images / videos and output to JSON

We run YOLOv8 for detection, but it does not cover every animal class. You can finetune a custom YOLO model to detect the animal you are interested in; if you do, please open an issue, as we might want to integrate other detection models.

### Benchmark:
You can expect real-time performance (>30 FPS) with modern NVIDIA GPUs and Apple Silicon (using Metal!).  

### Skeleton reference
There are multiple skeletons for the different datasets. Check the definitions in [visualization.py](https://github.com/JunkyByte/easy_ViTPose/blob/main/easy_ViTPose/vit_utils/visualization.py).

## Installation and Usage
> [!IMPORTANT]
> Install `torch>2.0` with CUDA / MPS support yourself.
> Also check `requirements_gpu.txt`.



```bash
git clone git@github.com:JunkyByte/easy_ViTPose.git
cd easy_ViTPose/
pip install -e .
pip install -r requirements.txt
```
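
As an optional sanity check after installing torch, you can confirm that an accelerated device is visible; this snippet only uses standard torch APIs:

```python
import torch

# Optional check: confirm CUDA (NVIDIA) or MPS (Apple Silicon) is available to torch
print('CUDA available:', torch.cuda.is_available())
print('MPS available:', torch.backends.mps.is_available())
```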



### Download models

- Download the models from [Huggingface](https://huggingface.co/JunkyByte/easy_ViTPose)

We provide torch models for every dataset and architecture.

If you want to run ONNX / TensorRT inference, download the appropriate torch checkpoint and use `export.py` to convert it.

You can use the `ultralytics` `yolo export` command to export YOLO to ONNX and TensorRT as well.



#### Export to onnx and tensorrt

```bash
$ python export.py --help

usage: export.py [-h] --model-ckpt MODEL_CKPT --model-name {s,b,l,h} [--output OUTPUT] [--dataset DATASET]

optional arguments:
  -h, --help            show this help message and exit
  --model-ckpt MODEL_CKPT
                        The torch model that shall be used for conversion
  --model-name {s,b,l,h}
                        [s: ViT-S, b: ViT-B, l: ViT-L, h: ViT-H]
  --output OUTPUT       File (without extension) or dir path for checkpoint output
  --dataset DATASET     Name of the dataset. If None it's extracted from the file name. ["coco", "coco_25",
                        "wholebody", "mpii", "ap10k", "apt36k", "aic"]
```
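
A minimal export run might look like the following sketch; the checkpoint path is a placeholder, use whichever torch model you downloaded:

```bash
# Placeholder checkpoint path; point it at the torch model downloaded from Huggingface
python export.py --model-ckpt ./ckpts/vitpose-b-coco.pth --model-name b --output ./ckpts/

# YOLO itself can be exported with the ultralytics CLI, e.g. to ONNX
yolo export model=yolov8s.pt format=onnx
```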


### Run inference
To run inference from command line you can use the `inference.py` script as follows:  
```bash
$ python inference.py --help

usage: inference.py [-h] [--input INPUT] [--output-path OUTPUT_PATH] --model MODEL [--yolo YOLO] [--dataset DATASET]
                    [--det-class DET_CLASS] [--model-name {s,b,l,h}] [--yolo-size YOLO_SIZE]
                    [--conf-threshold CONF_THRESHOLD] [--rotate {0,90,180,270}] [--yolo-step YOLO_STEP]
                    [--single-pose] [--show] [--show-yolo] [--show-raw-yolo] [--save-img] [--save-json]

optional arguments:
  -h, --help            show this help message and exit
  --input INPUT         path to image / video or webcam ID (=cv2)
  --output-path OUTPUT_PATH
                        output path, if the path provided is a directory output files are "input_name
                        + _result{extension}".
  --model MODEL         checkpoint path of the model
  --yolo YOLO           checkpoint path of the yolo model
  --dataset DATASET     Name of the dataset. If None it's extracted from the file name. ["coco", "coco_25",
                        "wholebody", "mpii", "ap10k", "apt36k", "aic"]
  --det-class DET_CLASS
                        ["human", "cat", "dog", "horse", "sheep", "cow", "elephant", "bear", "zebra", "giraffe",
                        "animals"]
  --model-name {s,b,l,h}
                        [s: ViT-S, b: ViT-B, l: ViT-L, h: ViT-H]
  --yolo-size YOLO_SIZE
                        YOLOv8 image size during inference
  --conf-threshold CONF_THRESHOLD
                        Minimum confidence for keypoints to be drawn. [0, 1] range
  --rotate {0,90,180,270}
                        Rotate the image by [90, 180, 270] degrees counterclockwise
  --yolo-step YOLO_STEP
                        The tracker can be used to predict the bboxes instead of yolo for performance, this flag
                        specifies how often yolo is applied (e.g. 1 applies yolo every frame). This does not have any
                        effect when is_video is False
  --single-pose         Do not use the SORT tracker because a single pose is expected in the video
  --show                preview result during inference
  --show-yolo           draw yolo results
  --show-raw-yolo       draw yolo results before SORT is applied for tracking (only valid during video inference)
  --save-img            save image results
  --save-json           save json results
```
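
For example, a typical run on a single image might look like this (all paths below are placeholders for your own files):

```bash
# Placeholder paths; substitute your own image and checkpoints
python inference.py --input ./examples/img1.jpg \
    --model ./ckpts/vitpose-s-coco_25.pth --model-name s \
    --yolo ./yolov8s.pt \
    --output-path ./results/ --show --save-img --save-json
```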

You can run inference from code as follows:
```python
import cv2

from easy_ViTPose import VitInference

# Image to run inference on, in RGB format
img = cv2.imread('./examples/img1.jpg')
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

# set is_video=True to enable tracking in video inference
# be sure to use the VitInference.reset() function to reset the tracker after each video
# There are a few flags that allow you to customize VitInference, be sure to check the class definition
model_path = './ckpts/vitpose-s-coco_25.pth'
yolo_path = './yolov8s.pt'

# If you want to use MPS (on new MacBooks) use the torch checkpoints for both ViTPose and YOLO
# If device is None it will try cuda -> mps -> cpu (otherwise specify 'cpu', 'mps' or 'cuda')
# The dataset and det_class parameters can be inferred from the ckpt name, but you can specify them.
model = VitInference(model_path, yolo_path, model_name='s', yolo_size=320, is_video=False, device=None)

# Infer keypoints, output is a dict where keys are person ids and values are keypoints (np.ndarray (25, 3): (y, x, score))
# If is_video=True the IDs will be consistent among the ordered video frames.
keypoints = model.inference(img)

# call model.reset() after each video

img = model.draw(show_yolo=True)  # Returns RGB image with drawings
cv2.imshow('image', cv2.cvtColor(img, cv2.COLOR_RGB2BGR)); cv2.waitKey(0)
```
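
For videos, a minimal sketch along these lines should work, assuming the same placeholder checkpoint paths as above and an example video path:

```python
import cv2
from easy_ViTPose import VitInference

# Placeholder checkpoint paths, same assumptions as the image example above
model = VitInference('./ckpts/vitpose-s-coco_25.pth', './yolov8s.pt',
                     model_name='s', yolo_size=320, is_video=True)

cap = cv2.VideoCapture('./examples/video.mp4')  # placeholder video path
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    # inference expects RGB; returns {track_id: keypoints array of (y, x, score)}
    keypoints = model.inference(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    out = model.draw(show_yolo=True)  # RGB image with drawings for the last frame
    cv2.imshow('pose', cv2.cvtColor(out, cv2.COLOR_RGB2BGR))
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
model.reset()  # reset the SORT tracker before processing another video
```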
> [!NOTE]
> If the input file is a video [SORT](https://github.com/abewley/sort) is used to track people IDs and output consistent identifications.

### OUTPUT json format
The output format of the json files:

```
{
    "keypoints":
    [  # The list of frames, len(json['keypoints']) == len(video)
        {  # For each frame a dict
            "0": [  # keys are the ids used to track people, values are the keypoints
                [121.19, 458.15, 0.99],  # Each keypoint is (y, x, score)
                [110.02, 469.43, 0.98],
                [110.86, 445.04, 0.99],
            ],
            "1": [
                ...
            ],
        },
        {
            "0": [
                [122.19, 458.15, 0.91],
                [105.02, 469.43, 0.95],
                [122.86, 445.04, 0.99],
            ],
            "1": [
                ...
            ]
        }
    ],
    "skeleton":
    {  # Skeleton reference, key is the idx, value the keypoint name
        "0": "nose",
        "1": "left_eye",
        "2": "right_eye",
        "3": "left_ear",
        "4": "right_ear",
        "5": "neck",
        ...
    }
}
```
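
As a quick sketch of how the file can be consumed (the path is a placeholder for a file produced with `--save-json`):

```python
import json

# Placeholder path for a file written by --save-json
with open('./results/video_result.json') as f:
    results = json.load(f)

skeleton = results['skeleton']               # e.g. {"0": "nose", "1": "left_eye", ...}
for frame_idx, frame in enumerate(results['keypoints']):
    for person_id, kpts in frame.items():    # person_id is the tracker id (a string)
        y, x, score = kpts[0]                # keypoint 0 is "nose" in the skeleton above
        print(frame_idx, person_id, skeleton['0'], y, x, score)
```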

## Finetuning
Finetuning is possible but not officially supported right now. If you would like to finetune and need help, open an issue.  
You can check `train.py`, `datasets/COCO.py` and `config.yaml` for details.

---

## Evaluation on COCO dataset
1. Download COCO dataset images and labels
    - 2017 Val images [5K/1GB]: http://images.cocodataset.org/zips/val2017.zip <br>
        The extracted directory looks like this:

        ```
        val2017/
        ├── 000000000139.jpg
        ├── 000000000285.jpg
        ├── 000000000632.jpg
        └── ...
        ```

    - 2017 Train/Val annotations [241MB]: http://images.cocodataset.org/annotations/annotations_trainval2017.zip <br>

        The extracted directory looks like this:

        ```
        annotations/
        ├── person_keypoints_val2017.json
        ├── person_keypoints_train2017.json
        └── ...
        ```


2. Run the following command:

    ```bash
    $ python evaluation_on_coco.py

    Command line arguments:
        --model_path: Path to the pretrained ViTPose model
        --yolo_path: Path to the YOLOv8 model
        --img_folder_path: Path to the directory containing the COCO val images (val2017/ extracted in step 1)
        --annFile: Path to the json file with COCO keypoints for the val set (annotations/person_keypoints_val2017.json extracted in step 1)
    ```
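
    Concretely, with the files from step 1, a run might look like this (the model checkpoint locations below are placeholders):

    ```bash
    # Placeholder checkpoint paths; point them at your downloaded models
    python evaluation_on_coco.py \
        --model_path ./ckpts/vitpose-b-coco.pth \
        --yolo_path ./yolov8s.pt \
        --img_folder_path ./val2017 \
        --annFile ./annotations/person_keypoints_val2017.json
    ```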


---


## Docker
The system can be built in a container using Docker. This is intended to demonstrate containerized inference; adapt it to your own needs by changing models and skeletons:

`docker build . -t easy_vitpose`

The image is based on NVIDIA's PyTorch image, which is about 20 GB.
If you have a compatible GPU set up with the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html),
ViTPose will run with hardware acceleration.

To test an example, create a folder called `cats` with a picture of a cat as `image.jpg`. 
Run `./models/download.sh` to fetch the large YOLOv8 and AP10k ViTPose models. Then run inference with the following command (replace the `cats` and `models` paths with your own):

`docker run --gpus all --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -v ./models:/models -v ~/cats:/cats easy_vitpose python inference.py --det-class cat --input /cats/image.jpg --output-path /cats --save-img --model /models/vitpose-l-ap10k.onnx --yolo /models/yolov8l.pt`

The result image may be viewed in your `cats` folder.

## TODO:
- refactor finetuning (currently not available)
- benchmark and check bottlenecks of the inference pipeline
- parallel batched inference
- other minor fixes
- yolo version for animal pose, check https://github.com/JunkyByte/easy_ViTPose/pull/18
- solve cuda exceptions on script exit when using tensorrt (no idea how)
- add info about inferred settings and better output of inference status (device, etc.) during inference
- check if it is possible to make colab work without a runtime restart



Feel free to open issues, pull requests and contribute on these TODOs.



## Reference

Thanks to the ViTPose authors and their official implementation [ViTAE-Transformer/ViTPose](https://github.com/ViTAE-Transformer/ViTPose).  
The SORT code is taken from [abewley/sort](https://github.com/abewley/sort).