owlvit-large-patch14

Maintained By
google

OWL-ViT Large Patch14

License: Apache 2.0
Release Date: May 2022
Paper: Simple Open-Vocabulary Object Detection with Vision Transformers
Author: Google

What is owlvit-large-patch14?

OWL-ViT (Vision Transformer for Open-World Localization) is a zero-shot, text-conditioned object detection model that pairs CLIP's multimodal image-text representations with lightweight detection heads. This checkpoint uses a ViT-L/14 backbone and accepts multiple free-form text queries per image, locating the corresponding objects without any prior training on those specific object classes.
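As a minimal usage sketch via the Hugging Face Transformers library (assuming transformers, torch, Pillow, and requests are installed; the COCO image URL and the 0.1 threshold are just illustrative choices):

```python
import requests
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-large-patch14")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-large-patch14")

# Any RGB image works; this COCO validation image is a common demo picture.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# One list of free-form text queries per image in the batch.
texts = [["a photo of a cat", "a photo of a dog"]]
inputs = processor(text=texts, images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Rescale the normalized box predictions to the original image size and
# keep only detections above the confidence threshold.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs=outputs, target_sizes=target_sizes, threshold=0.1
)

for score, label, box in zip(
    results[0]["scores"], results[0]["labels"], results[0]["boxes"]
):
    box = [round(v, 2) for v in box.tolist()]
    print(f"{texts[0][int(label)]}: {score:.3f} at {box}")
```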

Implementation Details

The model architecture consists of two main components: a CLIP backbone with a ViT-L/14 Transformer for image encoding and a masked self-attention Transformer for text encoding. OWL-ViT removes CLIP's final token pooling layer and instead attaches lightweight classification and box heads to each transformer output token, which is what turns the encoder into a detector (see the sketch after the list below).

  • CLIP backbone trained from scratch on image-caption data
  • Fine-tuned on COCO and OpenImages datasets
  • Supports multiple text queries per image
  • Uses bipartite matching loss during training
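To make the per-token heads concrete, here is a minimal sketch of inspecting the raw head outputs (same Transformers setup as above; the exact number of patch tokens depends on the resolution the processor resizes to, and the normalized center-format box layout is an assumption based on the Transformers documentation):

```python
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-large-patch14")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-large-patch14")

# A blank image is enough to inspect the shapes of the head outputs.
image = Image.new("RGB", (640, 480))
inputs = processor(
    text=[["a cat", "a remote control"]], images=image, return_tensors="pt"
)

with torch.no_grad():
    outputs = model(**inputs)

# One classification logit per (patch token, text query) pair, and one
# predicted box per patch token:
print(outputs.logits.shape)      # (batch, num_patch_tokens, num_text_queries)
print(outputs.pred_boxes.shape)  # (batch, num_patch_tokens, 4), normalized boxes
```

The post-processing step shown in the earlier example is what reduces these dense per-token predictions to a thresholded list of detections.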

Core Capabilities

  • Zero-shot object detection without class-specific training
  • Open-vocabulary classification using text embeddings
  • Multiple object detection with confidence scores
  • Precise bounding box predictions
  • Text-conditioned search within images

Frequently Asked Questions

Q: What makes this model unique?

OWL-ViT's ability to perform zero-shot object detection using natural language queries sets it apart. It can identify objects it wasn't explicitly trained to detect, making it highly flexible for various applications.

Q: What are the recommended use cases?

The model is primarily intended for research purposes, particularly in studying zero-shot detection capabilities and exploring interdisciplinary applications where identifying previously unseen objects is crucial. It's especially useful for researchers investigating model robustness and generalization in computer vision.
