RobustSAM-ViT-Large

Maintained by: jadechoghari

Property          Value
----------------  ----------------
Parameter Count   312M
License           MIT
Paper             arXiv:2406.09627
Tensor Type       F32

What is robustsam-vit-large?

RobustSAM-ViT-Large is an enhanced version of the Segment Anything Model (SAM) designed to handle degraded images while preserving SAM's core capabilities. A CVPR 2024 Highlight, the model adds only a marginal number of parameters on top of SAM yet achieves robust segmentation on low-quality images, including blurred, hazy, and poorly lit content.

Implementation Details

The model architecture consists of four main components: a ViT-based vision encoder that generates image embeddings, a prompt encoder that embeds point and bounding-box prompts, a mask decoder that uses two-way transformers for cross-attention between image and prompt embeddings, and a neck module that produces the final mask predictions. Training is lightweight: the model can be trained in about 30 hours on eight GPUs, making it accessible for typical research environments.

  • Preserves SAM's promptability and zero-shot generalization capabilities
  • Introduces the Robust-Seg dataset with 688K image-mask pairs
  • Supports both prompted and automatic mask generation (a prompted-usage sketch follows this list)
  • Optimized for various degradation types including blur, haze, low-light, and rain
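
To make prompted usage concrete, here is a minimal sketch using the Hugging Face transformers SAM classes. It assumes the checkpoint is published on the Hub as jadechoghari/robustsam-vit-large (matching the maintainer listed above) and loads through the standard SamModel/SamProcessor path; check the model card for the exact loading instructions.

```python
import torch
import requests
from PIL import Image
from transformers import SamModel, SamProcessor

# Assumption: the RobustSAM checkpoint loads via the standard transformers SAM classes.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SamModel.from_pretrained("jadechoghari/robustsam-vit-large").to(device)
processor = SamProcessor.from_pretrained("jadechoghari/robustsam-vit-large")

# Any RGB image works; degraded (blurred, hazy, low-light) inputs are the target use case.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# One 2D point prompt (x, y) in pixel coordinates; box prompts are passed
# analogously via input_boxes=[[[x0, y0, x1, y1]]].
input_points = [[[450, 600]]]

inputs = processor(image, input_points=input_points, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)

# Upscale the low-resolution mask logits back to the original image size.
masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(),
    inputs["original_sizes"].cpu(),
    inputs["reshaped_input_sizes"].cpu(),
)
print(masks[0].shape, outputs.iou_scores)
```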

Core Capabilities

  • Zero-shot segmentation on degraded images
  • Flexible prompting system supporting points and bounding boxes
  • Automatic mask generation with customizable point batching (see the pipeline sketch after this list)
  • Enhanced performance in downstream tasks like dehazing and deblurring
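
For unprompted segmentation, the generic transformers mask-generation pipeline is the natural fit; points_per_batch is the point-batching knob mentioned in the list above. As before, the repo id is assumed from this card rather than confirmed here:

```python
from transformers import pipeline

# Assumption: the checkpoint works with the generic mask-generation pipeline.
generator = pipeline(
    "mask-generation",
    model="jadechoghari/robustsam-vit-large",
    device=0,              # set device=-1 to run on CPU
    points_per_batch=256,  # larger batches trade GPU memory for throughput
)

outputs = generator("http://images.cocodataset.org/val2017/000000039769.jpg")
print(len(outputs["masks"]), "masks generated")
```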

Frequently Asked Questions

Q: What makes this model unique?

RobustSAM's ability to maintain high-quality segmentation on degraded images while adding only minimal extra parameters sets it apart from the original SAM. It bridges a critical gap in real-world applications where image quality cannot be guaranteed.

Q: What are the recommended use cases?

The model is ideal for scenarios involving degraded-image segmentation, such as outdoor surveillance, autonomous driving in adverse conditions, and processing user-generated content of varying quality.
