RobustSAM-ViT-Large
| Property | Value |
|---|---|
| Parameter Count | 312M |
| License | MIT |
| Paper | arXiv:2406.09627 |
| Tensor Type | F32 |
What is RobustSAM-ViT-Large?
RobustSAM-ViT-Large is a variant of the Segment Anything Model (SAM) designed to segment degraded images while preserving SAM's core capabilities. Presented as a CVPR 2024 Highlight, it adds only a small number of parameters to SAM and targets low-quality inputs such as blurred, hazy, or poorly lit images.
Implementation Details
The model architecture consists of four main components: a ViT-based Vision Encoder that computes image embeddings, a Prompt Encoder that embeds points and bounding boxes, a Mask Decoder that uses a two-way transformer for cross-attention between image and prompt embeddings, and a Neck module that predicts the final masks. The additional parameters can be optimized within 30 hours on eight GPUs, which keeps training feasible for typical research laboratories.
- Preserves SAM's promptability and zero-shot generalization capabilities
- Introduces the Robust-Seg dataset with 688K image-mask pairs
- Supports both prompted and automatic mask generation
- Optimized for various degradation types including blur, haze, low-light, and rain
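For a quick robustness check, some of the degradation types listed above can be approximated synthetically. The following is a minimal sketch, assuming a grayscale image stored as nested lists of floats in [0, 1]. The function names and default values are hypothetical, and this is not the pipeline used to build Robust-Seg. Haze follows the standard atmospheric scattering model I = J * t + A * (1 - t).

```python
def add_haze(img, transmission=0.6, airlight=0.9):
    """Atmospheric scattering model: I = J * t + A * (1 - t)."""
    return [[p * transmission + airlight * (1.0 - transmission) for p in row]
            for row in img]


def darken(img, gain=0.3, gamma=1.5):
    """Simulate low light: raise to gamma > 1 to darken midtones, then scale."""
    return [[gain * (p ** gamma) for p in row] for row in img]


def box_blur(img, radius=1):
    """Mean filter over a (2r+1) x (2r+1) window, clipped at the borders."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            window = [img[j][i] for j in ys for i in xs]
            row.append(sum(window) / len(window))
        out.append(row)
    return out


# Toy 4x4 image: dark background with a bright 2x2 square in the middle.
clean = [[1.0 if 1 <= y <= 2 and 1 <= x <= 2 else 0.0 for x in range(4)]
         for y in range(4)]

# One degraded copy per degradation type (illustrative parameters only).
degraded = {
    "haze": add_haze(clean),
    "low_light": darken(clean),
    "blur": box_blur(clean),
}
```

Running the same prompts on `clean` and on each entry of `degraded` gives a rough picture of how much segmentation quality drops per degradation type.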
Core Capabilities
- Zero-shot segmentation on degraded images
- Flexible prompting system supporting points and bounding boxes
- Automatic mask generation with customizable point batching
- Improved performance of SAM-based downstream tasks such as single-image dehazing and deblurring
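Automatic mask generation in SAM-style models typically prompts the model with a regular grid of points and processes those points in batches to bound memory use. The sketch below illustrates that idea in plain Python. The helper names are hypothetical, and the parameters `points_per_side` and `points_per_batch` are named after the options commonly exposed by SAM tooling; they are assumptions here, not this model's confirmed API.

```python
def build_point_grid(points_per_side):
    """Return points_per_side**2 (x, y) prompts, normalized to [0, 1],
    placed at the centers of a regular grid of cells."""
    coords = [(i + 0.5) / points_per_side for i in range(points_per_side)]
    return [(x, y) for y in coords for x in coords]


def iter_batches(points, points_per_batch):
    """Yield consecutive slices of at most points_per_batch prompts."""
    for start in range(0, len(points), points_per_batch):
        yield points[start:start + points_per_batch]


def to_pixels(points, width, height):
    """Scale normalized (x, y) prompts to pixel coordinates."""
    return [(x * width, y * height) for x, y in points]


grid = build_point_grid(4)              # 16 point prompts
batches = list(iter_batches(grid, 6))   # batch sizes: 6, 6, 4
```

A smaller batch size lowers peak memory at the cost of more forward passes through the prompt encoder and mask decoder; the image embedding only needs to be computed once per image.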
Frequently Asked Questions
Q: What makes this model unique?
RobustSAM maintains high-quality segmentation on degraded images while adding only a small number of parameters, which sets it apart from the original SAM. It addresses a common gap in real-world applications, where image quality cannot be guaranteed.
Q: What are the recommended use cases?
The model is ideal for scenarios involving degraded image segmentation, such as outdoor surveillance, autonomous driving in adverse conditions, and processing user-generated content with varying quality levels. It's particularly valuable in applications requiring robust performance across different image conditions.