SAM-ViT-Huge

Property          Value
Parameter Count   641M
License           Apache 2.0
Tensor Type       F32
Author            Facebook

What is sam-vit-huge?

SAM-ViT-Huge is Facebook's flagship implementation of the Segment Anything Model (SAM), built on a ViT-Huge vision transformer backbone for image segmentation. Trained on the SA-1B dataset of 11 million images and 1.1 billion masks, it generates high-quality object masks from input prompts such as points or boxes.
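The prompted workflow maps onto the Hugging Face transformers API. Below is a minimal point-prompt sketch; the image path and point coordinates are placeholders, not values from the original card:

```python
# Minimal point-prompt sketch using the transformers SAM API.
# "example.jpg" and the (450, 600) point are placeholder values.
import torch
from PIL import Image
from transformers import SamModel, SamProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = SamProcessor.from_pretrained("facebook/sam-vit-huge")
model = SamModel.from_pretrained("facebook/sam-vit-huge").to(device)

image = Image.open("example.jpg").convert("RGB")
input_points = [[[450, 600]]]  # one (x, y) point on the object of interest

inputs = processor(image, input_points=input_points, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)

# Upscale the low-resolution mask logits back to the original image size
masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(),
    inputs["original_sizes"].cpu(),
    inputs["reshaped_input_sizes"].cpu(),
)
scores = outputs.iou_scores  # predicted quality score per candidate mask
```

By default SAM returns three candidate masks per prompt, with iou_scores estimating each mask's quality so the best candidate can be selected.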

Implementation Details

The model architecture consists of three primary components: a ViT-based vision encoder that computes image embeddings, a prompt encoder that handles point and box inputs, and a mask decoder built on a two-way transformer. The model uses relative positional embeddings and supports both prompted and automatic mask generation.
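In the transformers implementation, these three components are exposed as submodules of SamModel. A quick sketch to confirm the mapping (the submodule names reflect the current transformers code, not the original card):

```python
from transformers import SamModel

model = SamModel.from_pretrained("facebook/sam-vit-huge")

# The three components map directly onto named submodules:
print(type(model.vision_encoder).__name__)  # ViT-H image encoder
print(type(model.prompt_encoder).__name__)  # encodes point/box prompts
print(type(model.mask_decoder).__name__)    # two-way transformer decoder
```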

  • Zero-shot capability for various segmentation tasks
  • Supports point-based and box-based prompting
  • Efficient processing with batch support
  • Automatic mask generation over a 1024-point (32×32) grid, as sketched below
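For the prompt-free mode, the transformers mask-generation pipeline handles the grid prompting and mask filtering. A minimal sketch, with "image.jpg" as a placeholder path:

```python
from transformers import pipeline

# The pipeline prompts the model with a regular grid of points
# (32x32 = 1024 by default) and filters the resulting masks.
# device=0 assumes a GPU is available; omit it to run on CPU.
generator = pipeline("mask-generation", model="facebook/sam-vit-huge", device=0)

outputs = generator("image.jpg", points_per_batch=64)
masks = outputs["masks"]    # binary masks covering objects in the image
scores = outputs["scores"]  # predicted IoU score for each mask
```

Raising points_per_batch trades memory for throughput by scoring more grid points per forward pass.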

Core Capabilities

  • High-quality object mask generation
  • Flexible prompt handling (points and boxes; see the box-prompt sketch after this list)
  • Zero-shot transfer to new image distributions
  • Automatic mask generation for entire images
  • Real-time mask decoding once image embeddings are computed
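Box prompts go through the same processor as points. A minimal sketch, reusing processor, model, image, and device from the point-prompt example above (the box coordinates are placeholders):

```python
# Bounding-box prompt: [x_min, y_min, x_max, y_max] in pixels (placeholder values)
input_boxes = [[[75, 275, 1725, 850]]]

inputs = processor(image, input_boxes=input_boxes, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)

masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(),
    inputs["original_sizes"].cpu(),
    inputs["reshaped_input_sizes"].cpu(),
)
```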

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its scale (641M parameters), its training corpus (11M images, 1.1B masks), and its ability to perform zero-shot segmentation with results that are often competitive with, or superior to, prior fully supervised approaches.

Q: What are the recommended use cases?

The model is ideal for computer vision tasks requiring precise object segmentation, including automated image analysis, content editing, and research applications. It's particularly useful when dealing with unknown object types or when flexible prompting is needed.
