colpali-v1.2

Maintained by: vidore

ColPali v1.2

Property | Value
License | MIT
Base Model | PaliGemma-3B
Paper | ColPali: Efficient Document Retrieval with Vision Language Models
Primary Language | English

What is colpali-v1.2?

ColPali v1.2 is a visual document retrieval model that applies the ColBERT late-interaction strategy to PaliGemma-3B, so that documents can be indexed directly from their visual features. Compared with its predecessor, this version uses right padding for queries and deterministic initialization of the projection layer. The model embeds both text and images as multi-vector representations, which are compared at retrieval time.

Implementation Details

The model pairs SigLIP's visual encoder with the PaliGemma-3B language model. It is trained with LoRA adapters (alpha=32, r=32) on the transformer layers and an 8-bit optimizer for efficient training. Training runs for 5 epochs, with an extended warmup phase intended to prevent non-English language collapse.
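To make the LoRA settings concrete, here is a minimal sketch of the standard LoRA weight update, W + (alpha / r) * B A, in plain Python. It is not the model's training code: the matrix sizes, the random initialization, and the helper names `matmul` and `lora_effective_weight` are assumptions for illustration. Only alpha=32 and r=32 come from the text above. With these values the scaling factor alpha / r is 1.0, so the low-rank update is added to the frozen weight unscaled.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, lora_a, lora_b, alpha, r):
    """Return W + (alpha / r) * (B @ A), the weight a LoRA-adapted layer applies."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)  # (out x r) @ (r x in) -> (out x in)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


ALPHA, R = 32, 32        # values stated in the model card
OUT_DIM, IN_DIM = 4, 6   # hypothetical toy sizes, far smaller than the real layers

rng = random.Random(0)
w = [[rng.gauss(0, 1) for _ in range(IN_DIM)] for _ in range(OUT_DIM)]   # frozen base weight
lora_a = [[rng.gauss(0, 0.01) for _ in range(IN_DIM)] for _ in range(R)]  # trainable, r x in
lora_b = [[0.0] * R for _ in range(OUT_DIM)]  # trainable, out x r; zero at start in standard LoRA

scale = ALPHA / R
w_eff = lora_effective_weight(w, lora_a, lora_b, ALPHA, R)

# Because B starts at zero, the adapted layer initially behaves like the base layer.
print("scale:", scale, "unchanged at init:", w_eff == w)
```

Only `lora_a` and `lora_b` would be updated during training in this formulation, and the base weight `w` stays frozen.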

  • Trained on 127,460 query-page pairs
  • Uses bfloat16 format for computation
  • Trained with data parallelism across 8 GPUs
  • Uses a learning rate of 5e-5 with linear decay
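The learning rate schedule described above can be sketched as a linear warmup followed by linear decay to zero. This is a minimal illustration, not the actual training configuration: the peak rate (5e-5), epoch count (5), and dataset size (127,460 pairs) come from the text, while `BATCH_SIZE`, `WARMUP_STEPS`, and the function name `linear_warmup_decay_lr` are hypothetical values chosen for the example.

```python
import math


def linear_warmup_decay_lr(step, peak_lr, warmup_steps, total_steps):
    """Ramp linearly from 0 to peak_lr, then decay linearly to 0 at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


PEAK_LR = 5e-5         # stated in the model card
EPOCHS = 5             # stated in the model card
NUM_PAIRS = 127_460    # stated in the model card
BATCH_SIZE = 32        # hypothetical global batch size
WARMUP_STEPS = 1_000   # hypothetical warmup length

STEPS_PER_EPOCH = math.ceil(NUM_PAIRS / BATCH_SIZE)
TOTAL_STEPS = STEPS_PER_EPOCH * EPOCHS

for step in (0, WARMUP_STEPS // 2, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    lr = linear_warmup_decay_lr(step, PEAK_LR, WARMUP_STEPS, TOTAL_STEPS)
    print(f"step {step:>6}: lr = {lr:.2e}")
```

A longer warmup keeps early updates small, which is one plausible reading of how the extended warmup protects behaviour inherited from the base model.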

Core Capabilities

  • Efficient document indexing from visual features
  • Multi-vector representations of text and images
  • Potential for zero-shot generalization to non-English languages
  • Compatible with colpali-engine>=0.2.0

Frequently Asked Questions

Q: What makes this model unique?

The model passes image patch embeddings through a language model, which maps them into the same latent space as text. Queries and document pages can then be compared with the ColBERT late-interaction mechanism, which matches each query token embedding against the page's patch embeddings rather than comparing a single vector per document.
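The ColBERT interaction mentioned here is commonly computed as a "MaxSim" score: for each query vector, take its highest dot product with any page vector, then sum those maxima over the query. The sketch below shows that scoring rule in plain Python on toy 2-dimensional vectors. The function names, page names, and vectors are invented for illustration, and real embeddings from the model are much higher-dimensional and would come from the model itself.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, page_vecs):
    """Late-interaction score: sum over query vectors of the best page-vector match."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def rank_pages(query_vecs, pages):
    """Return (page_name, score) pairs sorted from best to worst match."""
    scores = {name: maxsim_score(query_vecs, vecs) for name, vecs in pages.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data (hypothetical): one vector per query token, one vector per image patch.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "page_a": [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    "page_b": [[0.0, 1.0], [0.0, 0.5]],
}

ranking = rank_pages(query, pages)
print(ranking)
```

Because page vectors do not depend on the query, they can be computed once at indexing time, and only the query vectors and this scoring step are needed at search time.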

Q: What are the recommended use cases?

The model is best suited to PDF document retrieval, particularly in academic and professional settings where precise document matching matters. It is most effective on English-language content but shows potential for cross-lingual applications.
