ColPali-v1.2-hf
Property | Value |
---|---|
License | Gemma license (backbone) + MIT (adapters) |
Paper | arXiv:2407.01449 |
Training Data | 127,460 query-page pairs |
Architecture | PaliGemma-3B with ColBERT strategy |
What is ColPali-v1.2-hf?
ColPali is a Vision Language Model (VLM) built for document retrieval. It extends PaliGemma-3B, which pairs a SigLIP vision encoder with a Gemma language model, to produce ColBERT-style multi-vector representations of text queries and document page images. Documents can therefore be indexed directly from their visual features rather than from extracted text.
Implementation Details
Image patch embeddings are passed through the language model, so that textual and visual content share a single latent space. The model is fine-tuned with LoRA adapters (alpha=32, r=32) on the transformer layers, in bfloat16 precision, using the paged_adamw_8bit optimizer.
- Training conducted on 8 GPUs with data parallelism
- Learning rate of 5e-5 with linear decay
- 2.5% warmup steps and batch size of 32
- Trained on English data only; use on other languages is zero-shot
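The hyperparameters above can be collected into a small sketch that also shows what "2.5% warmup with linear decay" means for the learning rate at a given step. This is an illustration under stated assumptions, not the authors' training code: the names `TrainConfig` and `lr_at_step` are hypothetical, the total step count is invented for the example, and decaying all the way to zero is an assumption (the source only says "linear decay"). The source also does not say whether the batch size of 32 is global or per GPU.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as listed in this card (hypothetical container)."""
    lora_r: int = 32
    lora_alpha: int = 32
    peak_lr: float = 5e-5
    warmup_fraction: float = 0.025
    batch_size: int = 32
    num_gpus: int = 8

    @property
    def lora_scale(self) -> float:
        # Standard LoRA scales the low-rank update by alpha / r.
        return self.lora_alpha / self.lora_r


def lr_at_step(step: int, total_steps: int, cfg: TrainConfig) -> float:
    """Linear warmup to peak_lr, then linear decay (assumed to reach zero)."""
    warmup_steps = round(total_steps * cfg.warmup_fraction)
    if step < warmup_steps:
        return cfg.peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return cfg.peak_lr * remaining / max(1, total_steps - warmup_steps)


cfg = TrainConfig()
total_steps = 4000  # hypothetical; the card does not state the step count
for step in (0, 50, 100, 2050, 4000):
    print(step, lr_at_step(step, total_steps, cfg))
```

With alpha equal to r, the LoRA scaling factor is 1, so the low-rank update is added to the frozen weights without rescaling.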
Core Capabilities
- Efficient document indexing from visual features
- Multi-vector representation generation
- Cross-modal retrieval between text and images
- Zero-shot generalization to non-English languages
- Retrieval over PDF-type documents, handled as page images
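Multi-vector indexing stores many vectors per page instead of one, which affects index size. The arithmetic can be sketched as below. The figures used (1,024 vectors per page, 128 dimensions, 2 bytes per value, 10,000 pages) are illustrative assumptions, not values stated in this card, and `index_size_bytes` is a hypothetical helper.

```python
def index_size_bytes(num_pages: int, vectors_per_page: int,
                     dim: int, bytes_per_value: int) -> int:
    """Raw storage for a dense multi-vector index, ignoring any overhead."""
    return num_pages * vectors_per_page * dim * bytes_per_value


# Illustrative values only (assumptions, see the lead-in).
per_page = index_size_bytes(1, 1024, 128, 2)
corpus = index_size_bytes(10_000, 1024, 128, 2)
single_vector_page = index_size_bytes(1, 1, 128, 2)

print(f"{per_page / 1024:.0f} KiB per page")                 # 256 KiB per page
print(f"{corpus / 1024**3:.2f} GiB for 10,000 pages")         # 2.44 GiB
print(f"{per_page // single_vector_page}x a single vector")  # 1024x
```

Under these assumptions, the multi-vector index is larger than a single-vector index by exactly the number of vectors kept per page.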
Frequently Asked Questions
Q: What makes this model unique?
ColPali maps image patch embeddings into a latent space similar to that of textual input, which enables ColBERT-style late interaction: each query token is compared against every image patch of a page. According to the paper (arXiv:2407.01449), this improves retrieval performance over conventional pipelines that rely on extracted text.
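The ColBERT-style interaction between text tokens and image patches can be sketched with the usual MaxSim rule: for each query token vector, take the highest dot product over all patch vectors of a page, then sum these maxima. The code below is a minimal pure-Python illustration on toy 2-dimensional vectors; the function names and data are hypothetical, and real embeddings would come from the model.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: Sequence[Vector], page_vecs: Sequence[Vector]) -> float:
    """Sum over query tokens of the best-matching page patch similarity."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def rank_pages(query_vecs: Sequence[Vector],
               pages: Dict[str, List[Vector]]) -> List[Tuple[str, float]]:
    """Score every page against the query and sort best-first."""
    scored = [(page_id, maxsim_score(query_vecs, vecs))
              for page_id, vecs in pages.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy data: 2 query token vectors, 2 patch vectors per page.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "page_a": [[0.9, 0.1], [0.2, 0.8]],
    "page_b": [[0.5, 0.5], [0.4, 0.3]],
}

ranking = rank_pages(query, pages)
print(ranking)  # page_a (about 1.7) ranks above page_b (about 1.0)
```

Because each query token picks its own best patch, a page scores well when different regions of it match different parts of the query.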
Q: What are the recommended use cases?
The model is best suited to document retrieval, particularly over PDF-type documents. It fits scenarios where text queries must be matched against visual document content, such as digital library systems, document management, and academic research.