ColPali v1.2

Property	Value
License	MIT
Base Model	PaliGemma-3B
Paper	ColPali: Efficient Document Retrieval with Vision Language Models
Primary Language	English

What is colpali-v1.2?

ColPali v1.2 is a visual document retrieval model that pairs PaliGemma-3B with the ColBERT late-interaction strategy to index documents directly from their visual features. Compared with its predecessor, this version uses right padding for queries and deterministic projection layer initialization. The model encodes both text and images as multi-vector representations, which are then compared at retrieval time.

Implementation Details

The architecture feeds SigLIP image patch embeddings through the PaliGemma-3B language model. Training uses LoRA adapters (alpha=32, r=32) on the transformer layers together with an 8-bit optimizer. The model is trained for 5 epochs with an extended warmup to reduce collapse on non-English languages.
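To make the adapter settings concrete, the following is a minimal NumPy sketch of a LoRA update with r=32 and alpha=32. The layer size (256 x 256), the initialization scale, and the name `lora_forward` are assumptions chosen for illustration; this is not the model's training code.

```python
import numpy as np

R = 32       # LoRA rank, as stated in the text
ALPHA = 32   # LoRA alpha, as stated in the text
D_IN, D_OUT = 256, 256  # hypothetical layer size, for illustration only

rng = np.random.default_rng(0)
W = rng.standard_normal((D_OUT, D_IN)).astype(np.float32)       # frozen base weight
A = (0.01 * rng.standard_normal((R, D_IN))).astype(np.float32)  # trainable down-projection
B = np.zeros((D_OUT, R), dtype=np.float32)                      # trainable up-projection, zero at start


def lora_forward(x, W, A, B, alpha=ALPHA, r=R):
    """Return the frozen layer output plus the scaled low-rank correction."""
    scale = alpha / r  # 1.0 when alpha == r
    return x @ W.T + scale * (x @ A.T) @ B.T


x = rng.standard_normal((4, D_IN)).astype(np.float32)
y = lora_forward(x, W, A, B)

trainable = A.size + B.size  # parameters updated during training
total = W.size               # parameters kept frozen
```

With alpha equal to r the scaling factor is 1, so the adapter output is added to the frozen layer output unscaled. Starting B at zero is the usual LoRA convention: the adapted layer initially behaves exactly like the base layer. In this toy layer the adapter holds 16,384 trainable values against 65,536 frozen ones.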

Trained on 127,460 query-page pairs
Uses bfloat16 format for computation
Implements data parallelism across 8 GPUs
Features a learning rate of 5e-5 with linear decay
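The learning-rate behavior can be sketched as a linear warmup followed by linear decay from the 5e-5 peak. The step counts below are hypothetical and only show the shape of the curve; the function name `linear_schedule` is likewise an assumption, not part of any library.

```python
PEAK_LR = 5e-5  # learning rate stated in the list above


def linear_schedule(step, total_steps, warmup_steps, peak_lr=PEAK_LR):
    """Linear warmup to peak_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical step counts, chosen only to illustrate the schedule.
TOTAL_STEPS = 10_000
WARMUP_STEPS = 1_000

lrs = [
    linear_schedule(s, TOTAL_STEPS, WARMUP_STEPS)
    for s in (0, 500, 1_000, 5_500, 10_000)
]
```

The rate starts at zero, reaches the peak at the end of warmup, and falls back to zero at the final step. Lengthening `WARMUP_STEPS` keeps early updates small for longer.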

Core Capabilities

Efficient document indexing from visual features
Multi-vector representations of text and images
Zero-shot generalization potential to non-English languages
Compatible with colpali-engine>=0.2.0

Frequently Asked Questions

Q: What makes this model unique?

The model maps image patch embeddings through a language model, placing text and visual content in a shared latent space. Queries and pages are then compared with the ColBERT interaction mechanism, which matches each query token embedding against the patch embeddings of a page.
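The ColBERT interaction mentioned above can be illustrated with a small NumPy sketch of late-interaction (MaxSim) scoring. The embeddings here are random stand-ins rather than model outputs, and the width of 128, the token and patch counts, and the helper names are assumptions made for the example.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row vector to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def maxsim_score(query_emb, page_emb):
    """For each query vector take its best dot product over page vectors, then sum."""
    sim = query_emb @ page_emb.T  # shape: (n_query_tokens, n_page_patches)
    return float(sim.max(axis=1).sum())


rng = np.random.default_rng(0)
DIM = 128  # hypothetical embedding width

# Random stand-ins for multi-vector embeddings: 6 query tokens, 3 pages of 40 patches.
query = l2_normalize(rng.standard_normal((6, DIM)))
pages = [l2_normalize(rng.standard_normal((40, DIM))) for _ in range(3)]

# Plant the query vectors in page 1 so that it is the expected best match.
pages[1][:6] = query

scores = [maxsim_score(query, p) for p in pages]
best = int(np.argmax(scores))
```

Because every query vector finds an exact match in page 1, that page scores close to 6.0 (one unit per query token) and ranks first. Each query token is matched independently, which is what lets a multi-vector representation capture fine-grained overlaps between a query and regions of a page.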

Q: What are the recommended use cases?

The model is suited to PDF document retrieval, particularly in academic and professional settings where finding the exact matching document matters. It is most effective on English-language content but shows potential for cross-lingual applications.
