Sapiens2-1B

Sapiens2 is a family of high-resolution vision transformers pretrained on 1 billion human images and designed for human-centric tasks such as pose estimation, body-part segmentation, surface-normal estimation, and pointmap prediction.

This repository contains the 1B-parameter pretrained backbone, which produces dense per-patch features suitable for fine-tuning downstream task heads.

Model Details

  • Developed by: Meta
  • Model type: Vision Transformer
  • License: Sapiens2 License
  • Task: pretrain
  • Format: safetensors
  • File: sapiens2_1b_pretrain.safetensors

Quick Start

Install the Sapiens2 repo first by running pip install -e . from the root of a local clone.

import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from sapiens.backbones.standalone.sapiens2 import Sapiens2

# Build the model and load the pretrained checkpoint
model = Sapiens2(arch="sapiens2_1b", img_size=(1024, 768), patch_size=16).eval().cuda()  # img_size is (H, W)
ckpt_path = hf_hub_download(repo_id="facebook/sapiens2-pretrain-1b", filename="sapiens2_1b_pretrain.safetensors")
model.load_state_dict(load_file(ckpt_path))

# Forward pass on a dummy input; replace with a real RGB image (ImageNet normalization recommended)
x = torch.randn(1, 3, 1024, 768).cuda()
with torch.no_grad():
    features = model(x)[0]  # dense backbone features: (B, num_tokens, embed_dim)
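
The Quick Start feeds a random tensor, so it skips the preprocessing a real image needs. The sketch below shows one way to get from a decoded RGB image to a normalized (1, 3, 1024, 768) batch. The helper name `preprocess`, the plain bilinear resize, and the scaling of pixel values to [0, 1] before normalization are assumptions, not part of the Sapiens2 repo; only the mean and std values are the standard ImageNet statistics.

```python
import torch
import torch.nn.functional as F

# Standard ImageNet channel statistics (RGB order, for inputs scaled to [0, 1]).
IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)


def preprocess(image_uint8: torch.Tensor, size=(1024, 768)) -> torch.Tensor:
    """Hypothetical helper: (H, W, 3) uint8 RGB -> (1, 3, size[0], size[1]) float32.

    Assumption: a plain bilinear resize to the pretraining resolution, with no
    aspect-ratio-preserving crop or padding.
    """
    # Channels first, add a batch axis, scale to [0, 1].
    x = image_uint8.permute(2, 0, 1).unsqueeze(0).float() / 255.0
    x = F.interpolate(x, size=size, mode="bilinear", align_corners=False)
    return (x - IMAGENET_MEAN) / IMAGENET_STD


# Stand-in for a decoded photo; replace with a real RGB image tensor.
dummy = torch.randint(0, 256, (480, 360, 3), dtype=torch.uint8)
batch = preprocess(dummy)
print(batch.shape)  # torch.Size([1, 3, 1024, 768])
```

The resulting `batch` can take the place of `x` in the forward pass after moving it to the same device as the model. A resize that distorts the aspect ratio may hurt feature quality, so a crop or pad to 4:3 (H:W) before resizing is worth considering.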

Model Card

| Field | Value |
|---|---|
| Architecture | Sapiens2 ViT (RoPE, GQA, SwiGLU, RMSNorm, QK-norm) |
| Parameters | 1.462 B |
| FLOPs | 4.715 T |
| Embedding dim | 1536 |
| Layers | 40 |
| Attention heads | 24 |
| Pretraining resolution | 1024 × 768 (H × W) |
| Patch size | 16 |
| Pretraining data | 1B human images |
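
A few derived quantities follow directly from the model card values and are handy when sizing task heads or estimating memory. The short calculation below is a sketch under two assumptions: the token count covers patch tokens only (any class or register tokens the backbone may add are not counted), and the weights are stored in float32.

```python
# Values copied from the model card table above.
IMG_H, IMG_W = 1024, 768
PATCH = 16
EMBED_DIM = 1536
HEADS = 24
PARAMS = 1.462e9

grid_h, grid_w = IMG_H // PATCH, IMG_W // PATCH  # patch grid: 64 x 48
num_patches = grid_h * grid_w                    # 3072 patch tokens per image
head_dim = EMBED_DIM // HEADS                    # 64 channels per attention head

# Assumption: float32 weights, 4 bytes per parameter.
weights_gb = PARAMS * 4 / 1e9

print(grid_h, grid_w, num_patches, head_dim, round(weights_gb, 2))
# 64 48 3072 64 5.85
```

The weight estimate excludes activations, so actual GPU memory use during inference at 1024 × 768 will be higher.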

Sapiens2 Family

| Model | Params | FLOPs | Embed dim | Layers | Heads |
|---|---|---|---|---|---|
| Sapiens2-0.1B | 0.114 B | 0.342 T | 768 | 12 | 12 |
| Sapiens2-0.4B | 0.398 B | 1.260 T | 1024 | 24 | 16 |
| Sapiens2-0.8B | 0.818 B | 2.592 T | 1280 | 32 | 16 |
| Sapiens2-1B (this) | 1.462 B | 4.715 T | 1536 | 40 | 24 |
| Sapiens2-1B-4K | 1.607 B | | 1536 | 40 | 24 |
| Sapiens2-5B | 5.071 B | 15.722 T | 2432 | 56 | 32 |

See the Sapiens2 Collection for all variants and downstream task checkpoints (pose, segmentation, normals, pointmaps).

Intended Use

  • Feature extraction for human-centric downstream tasks
  • Initialization for fine-tuning task heads (pose, segmentation, normals, pointmap)
  • Research on human-centric vision
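
As an illustration of the task-head use case, here is a minimal sketch of a per-patch linear head applied to backbone tokens. The class `PatchLinearHead`, the class count of 20, and the token layout are all hypothetical: the sketch assumes features shaped (B, N, C) with N = 64 × 48 patch tokens in row-major order and no extra class or register tokens. It is not the head design used by the released downstream checkpoints.

```python
import torch
import torch.nn as nn


class PatchLinearHead(nn.Module):
    """Hypothetical per-patch classifier over backbone tokens.

    Assumes tokens shaped (B, N, C) with N == grid_h * grid_w, ordered
    row-major over the patch grid, with no class or register tokens.
    """

    def __init__(self, embed_dim=1536, num_classes=20, grid=(64, 48)):
        super().__init__()
        self.grid = grid
        self.proj = nn.Linear(embed_dim, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        b, n, _ = tokens.shape
        gh, gw = self.grid
        if n != gh * gw:
            raise ValueError(f"expected {gh * gw} tokens, got {n}")
        logits = self.proj(tokens)  # (B, N, num_classes)
        # Move classes to the channel axis and unflatten the patch grid.
        return logits.transpose(1, 2).reshape(b, -1, gh, gw)


head = PatchLinearHead()
fake_features = torch.randn(2, 64 * 48, 1536)  # stand-in for backbone output
out = head(fake_features)
print(out.shape)  # torch.Size([2, 20, 64, 48])
```

The output is a coarse 64 × 48 logit map; upsampling it by the patch size (16) would bring it back to the input resolution. To train only the head, the backbone parameters can be frozen with `requires_grad_(False)` before building the optimizer.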

License

Released under the Sapiens2 License.

Citation

@article{khirodkarsapiens2,
  title={Sapiens2},
  author={Khirodkar, Rawal and Wen, He and Martinez, Julieta and Dong, Yuan and Su, Zhaoen and Saito, Shunsuke},
  journal={arXiv preprint arXiv:2604.21681},
  year={2026}
}