# easy_ViTPose
<p align="center">
<img src="https://user-images.githubusercontent.com/24314647/236082274-b25a70c8-9267-4375-97b0-eddf60a7dfc6.png" width=375> easy_ViTPose
</p>
## Accurate 2d human and animal pose estimation
<a target="_blank" href="https://colab.research.google.com/github/JunkyByte/easy_ViTPose/blob/main/colab_demo.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
### Easy to use SOTA `ViTPose` [Y. Xu et al., 2022] models for fast inference.
We provide all the original ViTPose models, converted for inference, each with output in a single dataset format.
In addition, we provide a COCO-25 model, trained on the original COCO dataset plus feet keypoints: https://cmu-perceptual-computing-lab.github.io/foot_keypoint_dataset/
Finetuning is not currently supported; for a working state of `train.py`, check commit de43d54cad87404cf0ad4a7b5da6bacf4240248b and earlier commits.
> [!WARNING]
> Ultralytics `yolov8` has an issue with wrong bounding boxes when using `mps`; upgrade to the latest version! (Works correctly on 8.2.48)
## Results

https://github.com/JunkyByte/easy_ViTPose/assets/24314647/e9a82c17-6e99-4111-8cc8-5257910cb87e
https://github.com/JunkyByte/easy_ViTPose/assets/24314647/63af44b1-7245-4703-8906-3f034a43f9e3
(Credits dance: https://www.youtube.com/watch?v=p-rSdt0aFuw )
(Credits zebras: https://www.youtube.com/watch?v=y-vELRYS8Yk )
## Features
- Image / Video / Webcam support
- Video support using SORT algorithm to track bboxes between frames
- Torch / ONNX / TensorRT inference
- Runs the original ViTPose checkpoints from [ViTAE-Transformer/ViTPose](https://github.com/ViTAE-Transformer/ViTPose)
- 4 ViTPose architectures with different sizes and performances (s: small, b: base, l: large, h: huge)
- Multi skeleton and dataset: (AIC / MPII / COCO / COCO + FEET / COCO WHOLEBODY / APT36k / AP10k)
- Human / Animal pose estimation
- cpu / gpu / metal support
- show and save images / videos and output to json
We run YOLOv8 for detection, which does not provide complete animal detection. You can finetune a custom YOLO model to detect the animal you are interested in;
if you do, please open an issue, as we might want to integrate other models for detection.
### Benchmark:
You can expect real-time performance (>30 fps) on modern NVIDIA GPUs and Apple silicon (using Metal!).
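To check throughput on your own hardware, a minimal timing sketch like the one below may help. `measure_fps` is a hypothetical helper (not part of easy_ViTPose), and `infer` stands for any callable that processes one frame, for example a model's inference method.

```python
import time


def measure_fps(infer, frames, warmup=2):
    """Return the average frames per second of `infer` over `frames`.

    `infer` is any callable taking one frame (a hypothetical stand-in for
    a model's inference method). The first `warmup` frames are run but not
    timed, since first calls often include one-off setup costs.
    """
    frames = list(frames)
    for frame in frames[:warmup]:
        infer(frame)
    timed = frames[warmup:]
    if not timed:
        raise ValueError("need more frames than warmup iterations")
    start = time.perf_counter()
    for frame in timed:
        infer(frame)
    elapsed = time.perf_counter() - start
    return len(timed) / elapsed if elapsed > 0 else float("inf")
```

Timing a few hundred frames of a representative video usually gives a steadier number than timing a single image.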
### Skeleton reference
There are multiple skeletons for the different datasets. Check their definitions in [visualization.py](https://github.com/JunkyByte/easy_ViTPose/blob/main/easy_ViTPose/vit_utils/visualization.py).
## Installation and Usage
> [!IMPORTANT]
> Install `torch>2.0` with cuda / mps support yourself.
> Also check `requirements_gpu.txt`.
```bash
git clone git@github.com:JunkyByte/easy_ViTPose.git
cd easy_ViTPose/
pip install -e .
pip install -r requirements.txt
```
### Download models
- Download the models from [Huggingface](https://huggingface.co/JunkyByte/easy_ViTPose)
We provide torch models for every dataset and architecture.
If you want to run onnx / tensorrt inference, download the appropriate torch ckpt and use `export.py` to convert it.
You can also use the `ultralytics` `yolo export` command to export yolo to onnx and tensorrt.
#### Export to onnx and tensorrt
```bash
$ python export.py --help
usage: export.py [-h] --model-ckpt MODEL_CKPT --model-name {s,b,l,h} [--output OUTPUT] [--dataset DATASET]

optional arguments:
  -h, --help            show this help message and exit
  --model-ckpt MODEL_CKPT
                        The torch model that shall be used for conversion
  --model-name {s,b,l,h}
                        [s: ViT-S, b: ViT-B, l: ViT-L, h: ViT-H]
  --output OUTPUT       File (without extension) or dir path for checkpoint output
  --dataset DATASET     Name of the dataset. If None it"s extracted from the file name. ["coco", "coco_25",
                        "wholebody", "mpii", "ap10k", "apt36k", "aic"]
```
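The help text says the dataset is extracted from the file name when `--dataset` is omitted. The sketch below shows one way such a lookup could work; `guess_dataset` is a hypothetical helper and not necessarily how `export.py` implements it. Longer names are tried first so that `coco_25` is not mistaken for `coco`.

```python
from pathlib import Path

# Dataset names as listed in the --dataset help text.
DATASETS = ["coco", "coco_25", "wholebody", "mpii", "ap10k", "apt36k", "aic"]


def guess_dataset(ckpt_path):
    """Guess the dataset from a checkpoint file name (hypothetical helper).

    Candidates are checked from longest to shortest so that a name such as
    "coco_25" wins over its prefix "coco".
    """
    stem = Path(ckpt_path).stem.lower()
    for name in sorted(DATASETS, key=len, reverse=True):
        if name in stem:
            return name
    raise ValueError(f"cannot infer dataset from {ckpt_path!r}; pass --dataset")


print(guess_dataset("./ckpts/vitpose-s-coco_25.pth"))
print(guess_dataset("/models/vitpose-l-ap10k.onnx"))
```

If your checkpoint has been renamed so that no dataset name appears in it, passing `--dataset` explicitly avoids any guessing.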
### Run inference
To run inference from command line you can use the `inference.py` script as follows:
```bash
$ python inference.py --help
usage: inference.py [-h] [--input INPUT] [--output-path OUTPUT_PATH] --model MODEL [--yolo YOLO] [--dataset DATASET]
                    [--det-class DET_CLASS] [--model-name {s,b,l,h}] [--yolo-size YOLO_SIZE]
                    [--conf-threshold CONF_THRESHOLD] [--rotate {0,90,180,270}] [--yolo-step YOLO_STEP]
                    [--single-pose] [--show] [--show-yolo] [--show-raw-yolo] [--save-img] [--save-json]

optional arguments:
  -h, --help            show this help message and exit
  --input INPUT         path to image / video or webcam ID (=cv2)
  --output-path OUTPUT_PATH
                        output path, if the path provided is a directory output files are "input_name
                        +_result{extension}".
  --model MODEL         checkpoint path of the model
  --yolo YOLO           checkpoint path of the yolo model
  --dataset DATASET     Name of the dataset. If None it"s extracted from the file name. ["coco", "coco_25",
                        "wholebody", "mpii", "ap10k", "apt36k", "aic"]
  --det-class DET_CLASS
                        ["human", "cat", "dog", "horse", "sheep", "cow", "elephant", "bear", "zebra", "giraffe",
                        "animals"]
  --model-name {s,b,l,h}
                        [s: ViT-S, b: ViT-B, l: ViT-L, h: ViT-H]
  --yolo-size YOLO_SIZE
                        YOLOv8 image size during inference
  --conf-threshold CONF_THRESHOLD
                        Minimum confidence for keypoints to be drawn. [0, 1] range
  --rotate {0,90,180,270}
                        Rotate the image of [90, 180, 270] degress counterclockwise
  --yolo-step YOLO_STEP
                        The tracker can be used to predict the bboxes instead of yolo for performance, this flag
                        specifies how often yolo is applied (e.g. 1 applies yolo every frame). This does not have any
                        effect when is_video is False
  --single-pose         Do not use SORT tracker because single pose is expected in the video
  --show                preview result during inference
  --show-yolo           draw yolo results
  --show-raw-yolo       draw yolo result before that SORT is applied for tracking (only valid during video inference)
  --save-img            save image results
  --save-json           save json results
```
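To make the `--yolo-step` flag more concrete, the sketch below prints which frames would use the detector and which would rely on the tracker's predicted boxes. It assumes the detector runs on frame 0 and then on every `yolo_step`-th frame; `detection_schedule` is a hypothetical helper for illustration, not a function from this repository.

```python
def detection_schedule(num_frames, yolo_step):
    """Yield (frame_index, use_detector) pairs.

    Assumption for illustration: the detector runs on frame 0 and then on
    every `yolo_step`-th frame; on the remaining frames the tracker's
    predicted boxes would be used instead.
    """
    if yolo_step < 1:
        raise ValueError("yolo_step must be >= 1")
    for i in range(num_frames):
        yield i, i % yolo_step == 0


for i, use_detector in detection_schedule(6, yolo_step=3):
    source = "yolo" if use_detector else "tracker prediction"
    print(f"frame {i}: boxes from {source}")
```

A larger step trades detection freshness for speed, so fast-moving subjects may need a smaller value.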
You can run inference from code as follows:
```python
import cv2
from easy_ViTPose import VitInference
# Load the image and convert it from OpenCV's BGR to the RGB format used for inference
img = cv2.imread('./examples/img1.jpg')
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

# Set is_video=True to enable tracking in video inference,
# and call model.reset() after each video to reset the tracker.
# A few more flags allow customizing VitInference; check the class definition.
model_path = './ckpts/vitpose-s-coco_25.pth'
yolo_path = './yolov8s.pt'

# To use MPS (on recent MacBooks), use the torch checkpoints for both ViTPose and YOLO.
# If device is None, cuda -> mps -> cpu are tried in order (otherwise specify 'cpu', 'mps' or 'cuda').
# The dataset and det_class parameters can be inferred from the ckpt name, but you can specify them.
model = VitInference(model_path, yolo_path, model_name='s', yolo_size=320, is_video=False, device=None)

# Infer keypoints. The output is a dict where keys are person IDs and
# values are keypoints (np.ndarray of shape (25, 3): (y, x, score)).
# If is_video=True the IDs are consistent across the ordered video frames.
keypoints = model.inference(img)

# Draw the results; returns an RGB image with drawings
img = model.draw(show_yolo=True)
cv2.imshow('image', cv2.cvtColor(img, cv2.COLOR_RGB2BGR))
cv2.waitKey(0)
```
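Keypoints are returned in `(y, x, score)` order, while many drawing routines expect `(x, y)`. The sketch below filters by confidence and swaps the coordinates. `confident_points` and its default threshold of 0.5 are illustrative assumptions, and plain lists stand in for the `np.ndarray` values so the example has no dependencies.

```python
def confident_points(keypoints, conf_threshold=0.5):
    """Map person id -> list of (x, y) for keypoints scoring >= threshold.

    `keypoints` mimics the structure described above: a dict whose values
    are sequences of (y, x, score) rows. Plain lists are used here in place
    of np.ndarray; the threshold default is an arbitrary illustrative value.
    """
    result = {}
    for person_id, rows in keypoints.items():
        result[person_id] = [
            (x, y) for y, x, score in rows if score >= conf_threshold
        ]
    return result


sample = {0: [[121.19, 458.15, 0.99], [110.02, 469.43, 0.20]]}
# The low-score keypoint is dropped and (y, x) becomes (x, y).
print(confident_points(sample))
```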
> [!NOTE]
> If the input file is a video [SORT](https://github.com/abewley/sort) is used to track people IDs and output consistent identifications.
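SORT keeps IDs stable by associating each new detection with an existing track based on bounding-box overlap. The sketch below illustrates only that association idea, using intersection over union (IoU) and a greedy match. It is a simplification: the actual SORT code also uses a Kalman filter for motion prediction and the Hungarian algorithm for assignment. The `(x1, y1, x2, y2)` box format, the function names and the threshold value are assumptions made for this example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_match(tracks, detections, iou_threshold=0.3):
    """Pair each track id with its best unmatched detection index.

    `tracks` maps track id -> previous box; `detections` is a list of boxes.
    Pairs are taken in order of decreasing IoU, and pairs below the
    threshold are left unmatched.
    """
    pairs = sorted(
        ((iou(box, det), tid, j)
         for tid, box in tracks.items()
         for j, det in enumerate(detections)),
        key=lambda p: p[0],
        reverse=True,
    )
    matches, used = {}, set()
    for score, tid, j in pairs:
        if score < iou_threshold:
            break
        if tid in matches or j in used:
            continue
        matches[tid] = j
        used.add(j)
    return matches


tracks = {0: (0, 0, 10, 10), 1: (20, 20, 30, 30)}
detections = [(21, 21, 31, 31), (1, 1, 11, 11)]
print(greedy_match(tracks, detections))
```

Detections left unmatched would start new tracks, which is why IDs can change when a person is occluded for several frames.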
### Output JSON format
The output format of the json files:
```
{
    "keypoints":
    [  # The list of frames, len(json['keypoints']) == len(video)
        {  # For each frame a dict
            "0": [  # keys are id to track people and value the keypoints
                [121.19, 458.15, 0.99],  # Each keypoint is (y, x, score)
                [110.02, 469.43, 0.98],
                [110.86, 445.04, 0.99],
            ],
            "1": [
                ...
            ],
        },
        {
            "0": [
                [122.19, 458.15, 0.91],
                [105.02, 469.43, 0.95],
                [122.86, 445.04, 0.99],
            ],
            "1": [
                ...
            ]
        }
    ],
    "skeleton":
    {  # Skeleton reference, key the idx, value the name
        "0": "nose",
        "1": "left_eye",
        "2": "right_eye",
        "3": "left_ear",
        "4": "right_ear",
        "5": "neck",
        ...
    }
}
```
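A file in this format can be read with the standard `json` module. The sketch below attaches keypoint names from `skeleton` to each frame's keypoints. The embedded sample is a shortened, illustrative document (assuming real files contain no `#` annotations), and `named_keypoints` is a hypothetical helper.

```python
import json

# A tiny document following the format above (values are illustrative).
raw = """
{
  "keypoints": [
    {"0": [[121.19, 458.15, 0.99], [110.02, 469.43, 0.98]]},
    {"0": [[122.19, 458.15, 0.91], [105.02, 469.43, 0.95]]}
  ],
  "skeleton": {"0": "nose", "1": "left_eye"}
}
"""


def named_keypoints(data):
    """Return, per frame, {person_id: {keypoint_name: (y, x, score)}}."""
    names = data["skeleton"]
    frames = []
    for frame in data["keypoints"]:
        frames.append({
            pid: {names[str(i)]: tuple(kp) for i, kp in enumerate(kps)}
            for pid, kps in frame.items()
        })
    return frames


frames = named_keypoints(json.loads(raw))
print(frames[0]["0"]["nose"])
```

Note that person IDs and skeleton indices are JSON object keys, so they arrive as strings rather than integers.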
## Finetuning
Finetuning is possible but not officially supported right now. If you would like to finetune and need help, open an issue.
You can check `train.py`, `datasets/COCO.py` and `config.yaml` for details.
---
## Evaluation on COCO dataset
1. Download COCO dataset images and labels
    - 2017 Val images [5K/1GB]: http://images.cocodataset.org/zips/val2017.zip <br>
      The extracted directory looks like this:
      ```
      val2017/
      ├── 000000000139.jpg
      ├── 000000000285.jpg
      ├── 000000000632.jpg
      └── ...
      ```
    - 2017 Train/Val annotations [241MB]: http://images.cocodataset.org/annotations/annotations_trainval2017.zip <br>
      The extracted directory looks like this:
      ```
      annotations/
      ├── person_keypoints_val2017.json
      ├── person_keypoints_train2017.json
      └── ...
      ```
2. Run the following command:
```bash
$ python evaluation_on_coco.py
Command line arguments:
--model_path: Path to the pretrained ViT Pose model
--yolo_path: Path to the YOLOv8 model
--img_folder_path: Path to the directory containing COCO val images (/val2017 extracted in step 1).
--annFile: Path to json file for COCO keypoints for val set (annotations/person_keypoints_val2017.json extracted in step 1)
```
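Before starting a long evaluation run, it can be useful to confirm that the files from step 1 are where you expect them. The sketch below is a hypothetical pre-flight check, not part of `evaluation_on_coco.py`; it assumes `val2017/` and `annotations/` were extracted side by side under one root folder.

```python
from pathlib import Path


def check_coco_layout(root):
    """Return a list of problems found under `root` (empty if none).

    Assumes `val2017/` and `annotations/` were extracted side by side
    under `root`; adjust the paths if your layout differs.
    """
    root = Path(root)
    problems = []
    img_dir = root / "val2017"
    ann_file = root / "annotations" / "person_keypoints_val2017.json"
    if not img_dir.is_dir():
        problems.append(f"missing image folder: {img_dir}")
    elif not any(img_dir.glob("*.jpg")):
        problems.append(f"no .jpg images in: {img_dir}")
    if not ann_file.is_file():
        problems.append(f"missing annotation file: {ann_file}")
    return problems
```

An empty list means both paths are in place and can be passed as `--img_folder_path` and `--annFile`.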
---
## Docker
The system may be built in a container using Docker. This is intended to demonstrate container-wise inference; adapt it to your own needs by changing models and skeletons:
`docker build . -t easy_vitpose`
The image is based on NVIDIA's PyTorch image, which is 20 GB in size.
If you have a compatible GPU set up with [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html),
ViTPose will run with hardware acceleration.
To test an example, create a folder called `cats` with a picture of a cat as `image.jpg`.
Run `./models/download.sh` to fetch the large yolov8 and ap10k ViTPose models. Then run inference using the following command (replace with the correct `cats` and `models` paths):
`docker run --gpus all --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -v ./models:/models -v ~/cats:/cats easy_vitpose python inference.py --det-class cat --input /cats/image.jpg --output-path /cats --save-img --model /models/vitpose-l-ap10k.onnx --yolo /models/yolov8l.pt`
The result image may be viewed in your `cats` folder.
## TODO:
- refactor finetuning (currently not available)
- benchmark and check bottlenecks of inference pipeline
- parallel batched inference
- other minor fixes
- yolo version for animal pose, check https://github.com/JunkyByte/easy_ViTPose/pull/18
- solve cuda exceptions on script exit when using tensorrt (no idea how)
- add information about what is inferred during inference, better output of inference status (device etc)
- check if it is possible to make colab work without runtime restart
Feel free to open issues, pull requests and contribute on these TODOs.
## Reference
Thanks to the ViTPose authors and their official implementation [ViTAE-Transformer/ViTPose](https://github.com/ViTAE-Transformer/ViTPose).
The SORT code is taken from [abewley/sort](https://github.com/abewley/sort).