SAM-ViT-Large
| Property | Value |
|---|---|
| Parameter Count | 312M |
| License | Apache 2.0 |
| Author | Facebook (Meta AI) |
| Paper | Segment Anything (Kirillov et al., 2023) |
What is sam-vit-large?
SAM-ViT-Large is the large variant of the Segment Anything Model (SAM), a computer vision model developed by Facebook. Trained on a dataset of 11 million images and 1.1 billion masks, it produces high-quality object masks from input prompts such as points or bounding boxes.
Implementation Details
The model architecture consists of four main components: a ViT-based Vision Encoder for computing image embeddings, a Prompt Encoder for generating embeddings from points and boxes, a Mask Decoder utilizing a two-way transformer architecture, and a Neck module for final mask prediction.
- Vision Encoder: Processes image patches using attention mechanisms with Relative Positional Embedding
- Prompt Encoder: Handles point and bounding box inputs
- Mask Decoder: Performs cross-attention between image and point embeddings
- Tensor Type: F32
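The prompt-driven workflow described above can be exercised through the Hugging Face transformers library. The following is a minimal sketch of point-prompted inference, assuming the checkpoint name facebook/sam-vit-large on the Hub and using a placeholder image URL and point coordinates chosen purely for illustration.

```python
import torch
import requests
from PIL import Image
from transformers import SamModel, SamProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the large SAM checkpoint and its matching processor (checkpoint name assumed)
model = SamModel.from_pretrained("facebook/sam-vit-large").to(device)
processor = SamProcessor.from_pretrained("facebook/sam-vit-large")

# Any RGB image works; this URL is only an illustrative placeholder
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
raw_image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# A single 2D point prompt (x, y) marking the object to segment (coordinates assumed)
input_points = [[[450, 600]]]

inputs = processor(raw_image, input_points=input_points, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)

# Upscale the low-resolution mask logits back to the original image size
masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(),
    inputs["original_sizes"].cpu(),
    inputs["reshaped_input_sizes"].cpu(),
)
scores = outputs.iou_scores  # predicted mask quality, one score per candidate mask
```

Bounding-box prompts follow the same pattern, passing input_boxes to the processor instead of input_points.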
Core Capabilities
- Zero-shot performance on various segmentation tasks
- Automatic mask generation with grid-based prompting
- Point and bounding box-based segmentation
- Efficient processing of large-scale images
- Support for both prompted and automatic mask generation
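Automatic, grid-based mask generation is available through the transformers mask-generation pipeline, which prompts the model with a grid of points and collects a mask for everything it finds. A short sketch, again assuming the facebook/sam-vit-large checkpoint and a placeholder image:

```python
from transformers import pipeline

# Grid-based prompting: the pipeline covers the image with point prompts
# and returns one mask per detected object (checkpoint name assumed)
generator = pipeline("mask-generation", model="facebook/sam-vit-large")

image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
outputs = generator(image_url, points_per_batch=64)

print(len(outputs["masks"]))   # number of masks generated for the image
print(outputs["scores"][:5])   # IoU-based confidence for the first few masks
```

Raising points_per_batch trades memory for speed when many grid points are evaluated at once.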
Frequently Asked Questions
Q: What makes this model unique?
The model's ability to perform zero-shot segmentation, combined with training on 1.1 billion masks, makes it stand out. It can generate high-quality masks from simple prompts without additional training.
Q: What are the recommended use cases?
The model excels in various computer vision tasks including object segmentation, image analysis, and automated mask generation. It's particularly useful for applications requiring precise object isolation in images.