colpali-v1.2-hf

Maintained By: vidore

Property       Value
License        Gemma license (backbone) + MIT (adapters)
Paper          arXiv:2407.01449
Training Data  127,460 query-page pairs
Architecture   PaliGemma-3B with ColBERT strategy

What is colpali-v1.2-hf?

ColPali is a Vision Language Model (VLM) designed for efficient document retrieval. Built on PaliGemma-3B, it generates ColBERT-style multi-vector representations of both text and images, so documents can be indexed and retrieved directly from their visual features. The model combines the SigLIP vision encoder with a language model to embed queries and document pages for retrieval.

Implementation Details

The model feeds image patch embeddings through a language model, which places textual and visual content in a shared latent space. It uses LoRA adapters with alpha=32 and r=32 on the transformer layers and was trained in bfloat16 precision with a paged_adamw_8bit optimizer.

  • Training conducted on 8 GPUs with data parallelism
  • Learning rate of 5e-5 with linear decay
  • 2.5% warmup steps and batch size of 32
  • Trained on an English-only dataset; generalization to other languages is zero-shot
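
The hyperparameters above can be collected into a small sketch. `TrainingConfig` and `learning_rate_at` are hypothetical names used for illustration and are not part of the ColPali codebase. The sketch assumes the warmup ramps linearly from zero and that the decay reaches zero at the final step; the source states only "linear decay" and "2.5% warmup steps". The step count is also invented. It additionally shows that with alpha=32 and r=32 the standard LoRA scaling factor alpha/r is 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameters quoted in this section (field names are illustrative)."""
    lora_r: int = 32
    lora_alpha: int = 32
    peak_lr: float = 5e-5
    warmup_fraction: float = 0.025  # 2.5% of all steps
    batch_size: int = 32
    num_gpus: int = 8  # data parallelism

    @property
    def lora_scaling(self) -> float:
        # Standard LoRA scales the low-rank update by alpha / r.
        return self.lora_alpha / self.lora_r


def learning_rate_at(step: int, total_steps: int, cfg: TrainingConfig) -> float:
    """Linear warmup to peak_lr, then linear decay to zero (assumed shape)."""
    warmup_steps = round(cfg.warmup_fraction * total_steps)
    if step < warmup_steps:
        return cfg.peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return cfg.peak_lr * remaining / max(1, total_steps - warmup_steps)


cfg = TrainingConfig()
total = 4000  # hypothetical step count; the source does not state one
for s in (0, 50, 100, 2050, 4000):
    print(s, learning_rate_at(s, total, cfg))
```

With these assumptions the rate peaks at 5e-5 after 100 of the 4000 steps and falls linearly afterwards.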

Core Capabilities

  • Efficient document indexing from visual features
  • Multi-vector representation generation
  • Cross-modal retrieval between text and images
  • Zero-shot generalization to non-English languages
  • Retrieval over PDF documents processed as page images

Frequently Asked Questions

Q: What makes this model unique?

ColPali maps image patch embeddings into a latent space similar to that of textual input, which enables ColBERT-style interactions between text tokens and image patches. Documents can therefore be matched against a query directly from their page images. The paper (arXiv:2407.01449) reports improved retrieval performance over text-based retrieval pipelines.
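
A ColBERT-style interaction is typically computed as a late-interaction (MaxSim) score: for each query token embedding, take the highest dot product over all image patch embeddings, then sum those maxima. The following is a minimal pure-Python sketch of that scoring rule, not the model's own implementation. The function name `maxsim_score` and the toy 2-dimensional vectors are illustrative assumptions; real embeddings would come from the model.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: List[Vector], page_patches: List[Vector]) -> float:
    """Sum, over query tokens, of the best-matching patch similarity."""
    return sum(max(dot(q, p) for p in page_patches) for q in query_tokens)


# Toy multi-vector embeddings (hypothetical values, one vector per token/patch).
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
page_b = [[0.5, 0.5], [0.5, 0.5]]

scores = {
    "page_a": maxsim_score(query, page_a),
    "page_b": maxsim_score(query, page_b),
}
# Rank pages from highest to lowest score.
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

Because each query token picks its own best patch, a page scores highly when every part of the query is matched somewhere on the page, which is why `page_a` ranks above `page_b` in this toy case.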

Q: What are the recommended use cases?

The model is well suited to document retrieval tasks, especially those involving PDF documents. It fits scenarios that require matching text queries against visual document content, such as digital library systems, document management, and academic research.
