SAM-ViT-Large

  • Parameter Count: 312M
  • License: Apache 2.0
  • Author: Facebook
  • Paper: Segment Anything (Kirillov et al., 2023)

What is sam-vit-large?

SAM-ViT-Large is the large variant of the Segment Anything Model (SAM) family of computer vision models developed by Facebook. Trained on the SA-1B dataset of 11 million images and 1.1 billion masks, it produces high-quality object masks from input prompts such as points or bounding boxes.

Implementation Details

The model architecture consists of four main components: a ViT-based Vision Encoder that computes image embeddings, a Prompt Encoder that embeds points and boxes, a Mask Decoder built on a two-way transformer, and a Neck module that produces the final mask predictions. A minimal inference sketch follows the component list below.

  • Vision Encoder: Processes image patches with attention, using relative positional embeddings
  • Prompt Encoder: Embeds point and bounding-box inputs
  • Mask Decoder: Performs two-way cross-attention between image embeddings and prompt embeddings
  • Neck: Predicts the output masks from the decoder's contextualized embeddings
  • Tensor Type: F32
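
A minimal sketch of point-prompted inference using the transformers SamModel and SamProcessor classes; the image URL and point coordinates here are placeholder assumptions, not values from the model card:

```python
import torch
import requests
from PIL import Image
from transformers import SamModel, SamProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SamModel.from_pretrained("facebook/sam-vit-large").to(device)
processor = SamProcessor.from_pretrained("facebook/sam-vit-large")

# Placeholder image URL; substitute any RGB image you want to segment
img_url = "https://example.com/image.png"
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")

# One 2D point prompt (x, y) marking the object of interest (placeholder coordinates)
input_points = [[[450, 600]]]  # batch of 1 image, 1 point

inputs = processor(raw_image, input_points=input_points, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)

# Upscale the low-resolution mask logits back to the original image size
masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(),
    inputs["original_sizes"].cpu(),
    inputs["reshaped_input_sizes"].cpu(),
)
scores = outputs.iou_scores  # predicted quality score per candidate mask
```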

Core Capabilities

  • Zero-shot performance on various segmentation tasks
  • Automatic mask generation with grid-based prompting (see the pipeline sketch after this list)
  • Point and bounding box-based segmentation
  • Efficient processing of large-scale images
  • Support for both prompted and automatic mask generation
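
For automatic mask generation, the transformers mask-generation pipeline prompts the model with a grid of points and post-processes the results. A minimal sketch, again with a placeholder image URL:

```python
from transformers import pipeline

# The pipeline handles grid prompting, batching, and mask post-processing;
# points_per_batch trades memory for throughput
generator = pipeline("mask-generation", model="facebook/sam-vit-large")

# Placeholder image URL; a local path or PIL image also works
outputs = generator("https://example.com/image.png", points_per_batch=64)

print(len(outputs["masks"]))   # one binary mask per detected object
print(outputs["scores"][:5])   # predicted IoU score for each mask
```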

Frequently Asked Questions

Q: What makes this model unique?

Its zero-shot segmentation ability and its training at the unprecedented scale of 1.1 billion masks make it stand out. It generates high-quality masks from simple prompts without any task-specific fine-tuning.

Q: What are the recommended use cases?

The model excels at computer vision tasks such as object segmentation, image analysis, and automated mask generation. It is particularly useful for applications that require precise object isolation, as sketched below with a bounding-box prompt.
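
A sketch of box-prompted isolation of a single object; the image URL and box coordinates are placeholder assumptions:

```python
import torch
import requests
from PIL import Image
from transformers import SamModel, SamProcessor

model = SamModel.from_pretrained("facebook/sam-vit-large")
processor = SamProcessor.from_pretrained("facebook/sam-vit-large")

# Placeholder image URL; substitute the image containing your target object
img_url = "https://example.com/image.png"
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")

# Bounding box prompt as (x_min, y_min, x_max, y_max); placeholder coordinates
input_boxes = [[[75, 275, 1725, 850]]]  # batch of 1 image, 1 box

inputs = processor(raw_image, input_boxes=input_boxes, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Recover full-resolution binary masks for the boxed object
masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(),
    inputs["original_sizes"].cpu(),
    inputs["reshaped_input_sizes"].cpu(),
)
```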
