SAM-ViT-Large
| Property | Value |
|---|---|
| Parameter Count | 312M |
| License | Apache 2.0 |
| Author | Facebook (Meta AI) |
| Paper | Segment Anything (Kirillov et al., 2023) |
What is sam-vit-large?
SAM-ViT-Large is the large variant of the Segment Anything Model (SAM), a computer vision model developed by Facebook. Trained on a dataset of 11 million images and 1.1 billion masks, it produces high-quality object masks from input prompts such as points or bounding boxes.
Implementation Details
The model architecture consists of four main components: a ViT-based Vision Encoder for computing image embeddings, a Prompt Encoder for generating embeddings from points and boxes, a Mask Decoder utilizing a two-way transformer architecture, and a Neck module for final mask prediction.
- Vision Encoder: Processes image patches using attention mechanisms with Relative Positional Embedding
- Prompt Encoder: Handles point and bounding box inputs
- Mask Decoder: Performs cross-attention between image and point embeddings
- Tensor Type: F32
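The prompt-driven workflow described above can be exercised through the Hugging Face transformers library. The following is a minimal sketch of point-prompted inference, assuming the checkpoint name facebook/sam-vit-large on the Hub and using a placeholder image URL and point coordinates chosen purely for illustration.

```python
import torch
import requests
from PIL import Image
from transformers import SamModel, SamProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the large SAM checkpoint and its matching processor (checkpoint name assumed)
model = SamModel.from_pretrained("facebook/sam-vit-large").to(device)
processor = SamProcessor.from_pretrained("facebook/sam-vit-large")

# Any RGB image works; this URL is only an illustrative placeholder
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
raw_image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# A single 2D point prompt (x, y) marking the object to segment (coordinates assumed)
input_points = [[[450, 600]]]

inputs = processor(raw_image, input_points=input_points, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)

# Upscale the low-resolution mask logits back to the original image size
masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(),
    inputs["original_sizes"].cpu(),
    inputs["reshaped_input_sizes"].cpu(),
)
scores = outputs.iou_scores  # predicted mask quality, one score per candidate mask
```

Bounding-box prompts follow the same pattern, passing input_boxes to the processor instead of input_points.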
Core Capabilities
- Zero-shot performance on various segmentation tasks
- Automatic mask generation with grid-based prompting
- Point and bounding box-based segmentation
- Efficient processing of large-scale images
- Support for both prompted and automatic mask generation
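Automatic, grid-based mask generation is available through the transformers mask-generation pipeline, which prompts the model with a grid of points and collects a mask for everything it finds. A short sketch, again assuming the facebook/sam-vit-large checkpoint and a placeholder image:

```python
from transformers import pipeline

# Grid-based prompting: the pipeline covers the image with point prompts
# and returns one mask per detected object (checkpoint name assumed)
generator = pipeline("mask-generation", model="facebook/sam-vit-large")

image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
outputs = generator(image_url, points_per_batch=64)

print(len(outputs["masks"]))   # number of masks generated for the image
print(outputs["scores"][:5])   # IoU-based confidence for the first few masks
```

Raising points_per_batch trades memory for speed when many grid points are evaluated at once.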
Frequently Asked Questions
Q: What makes this model unique?
The model's ability to perform zero-shot segmentation, combined with training on 1.1 billion masks, makes it stand out. It can generate high-quality masks from simple prompts without additional training.
Q: What are the recommended use cases?
The model excels in various computer vision tasks including object segmentation, image analysis, and automated mask generation. It's particularly useful for applications requiring precise object isolation in images.