sapiens

Maintained by: facebook

Sapiens Vision Model

Property      Value
Developer     Meta
Model Type    Vision Transformers
License       Creative Commons Attribution-NonCommercial 4.0
Paper         Research Paper

What is Sapiens?

Sapiens is Meta's family of vision models built for human-centric analysis. Trained on over 300 million in-the-wild human images, the models target four fundamental vision tasks: 2D pose estimation, body-part segmentation, depth estimation, and surface normal prediction. The family ranges from 0.3 billion to 2 billion parameters, and performance improves as model size increases.

Implementation Details

The architecture is based on Vision Transformers and comes in multiple variants, each optimized for a different task. Sapiens is available in three formats: original (for fine-tuning), TorchScript (inference only), and BFloat16 (for large-scale processing on A100 GPUs). It supports high-resolution 1K inference natively.

  • Pre-trained versions: 0.3B, 0.6B, 1B, and 2B parameter models
  • Task-specific variants for pose, segmentation, depth, and normal estimation
  • Easy adaptation through fine-tuning on specific tasks
  • Native support for high-resolution processing
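
The sizes, tasks, and formats listed above can be combined into a simple selection rule. The following is a minimal sketch, not an official API: the names `Variant`, `pick_format`, and `pick_variant` are hypothetical, the task labels are illustrative rather than real checkpoint names, and not every size/task/format combination is necessarily published. Preferring the largest size that fits is an assumption based on the card's statement that performance improves with model size.

```python
from dataclasses import dataclass

# Sizes (in billions of parameters) and tasks as listed in this card.
# These labels are illustrative, not official checkpoint identifiers.
SIZES_B = (0.3, 0.6, 1.0, 2.0)
TASKS = ("pose", "seg", "depth", "normal")


@dataclass(frozen=True)
class Variant:
    task: str
    size_b: float  # parameter count in billions
    fmt: str       # "original", "torchscript", or "bfloat16"


def pick_format(fine_tune: bool, a100_batch: bool) -> str:
    """Map an intended use to one of the three formats described above."""
    if fine_tune:
        return "original"     # the format intended for fine-tuning
    if a100_batch:
        return "bfloat16"     # large-scale processing on A100 GPUs
    return "torchscript"      # inference-only default


def pick_variant(task, max_size_b, fine_tune=False, a100_batch=False):
    """Choose the largest listed size that does not exceed max_size_b."""
    if task not in TASKS:
        raise ValueError(f"unknown task: {task!r}")
    fitting = [s for s in SIZES_B if s <= max_size_b]
    if not fitting:
        raise ValueError("no listed size fits the given budget")
    return Variant(task, max(fitting), pick_format(fine_tune, a100_batch))
```

For example, `pick_variant("depth", 1.5)` selects the 1B depth model in TorchScript format, since 2B exceeds the budget and neither fine-tuning nor A100 batch processing was requested.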

Core Capabilities

  • 2D Pose Estimation (17, 133, and 308 keypoints)
  • Body-part Segmentation (28 classes)
  • Depth Estimation
  • Surface Normal Prediction
  • Helper models for bounding box detection and background removal
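
As an illustration of how pose output might be consumed downstream, the sketch below decodes keypoints from per-keypoint heatmaps. This card does not specify the output format of the pose models, so the heatmap representation is an assumption, and `decode_keypoints` is a hypothetical helper written in plain Python for clarity rather than speed.

```python
def decode_keypoints(heatmaps, image_w, image_h):
    """Return one (x, y, score) tuple per keypoint heatmap.

    heatmaps: list of K 2D grids (lists of rows), one per keypoint.
    This layout is an assumption, not the documented Sapiens output.
    Coordinates are scaled from heatmap cells to image pixels.
    """
    keypoints = []
    for hm in heatmaps:
        rows, cols = len(hm), len(hm[0])
        # Locate the highest-scoring cell in this heatmap.
        best_r, best_c = max(
            ((r, c) for r in range(rows) for c in range(cols)),
            key=lambda rc: hm[rc[0]][rc[1]],
        )
        # Map the centre of that cell to image coordinates.
        x = (best_c + 0.5) * image_w / cols
        y = (best_r + 0.5) * image_h / rows
        keypoints.append((x, y, hm[best_r][best_c]))
    return keypoints


# A single 2x2 heatmap whose peak is in the top-right cell.
demo = [[[0.1, 0.9], [0.2, 0.3]]]
print(decode_keypoints(demo, image_w=4, image_h=4))  # [(3.0, 1.0, 0.9)]
```

With 17, 133, or 308 keypoints, the same function would simply receive that many heatmaps; low-scoring entries can then be filtered by the returned score.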

Frequently Asked Questions

Q: What makes this model unique?

Sapiens stands out for how well it generalizes to in-the-wild data, even when training data is limited or synthetic. Performance improves consistently across all four tasks as model size increases, and accuracy holds up at 1K resolution.

Q: What are the recommended use cases?

The model is ideal for applications requiring detailed human analysis, such as motion capture, virtual try-on systems, 3D avatar creation, augmented reality applications, and computer vision research. It's particularly valuable when high-resolution processing and accurate human body analysis are required.
