Late Interaction (ColBERT)

Do they do Stage 1 and Stage 2 together?

Short Answer

Yes, late interaction models like ColBERT bridge the gap between Stage 1 and Stage 2.

They can serve as:

  • High-quality Stage 1: Retrieval from millions of documents (with optimized indexing)

  • Fine-grained Stage 2: Token-level matching that mimics cross-encoder precision

Long Answer: Why Late Interaction is Special

Standard Dense Retrieval (Stage 1)

  • Compresses entire document into single vector (768-d)

  • Fast but loses nuance

  • Can’t capture fine-grained matches

Cross-Encoders (Stage 2)

  • Full attention between every query-document token pair

  • Extremely accurate but prohibitively slow

  • Can’t pre-compute (needs query at inference time)

Late Interaction (ColBERT) - The Bridge

  • Stores vector for every token in document

  • Performs query-document interaction only after independent encoding (“late”)

  • Fast enough for Stage 1, accurate enough for Stage 2

The ColBERT Architecture

Standard Bi-Encoder:
┌─────────┐                    ┌──────────┐
│  Query  │ → BERT → [768-d] → │          │
│         │                    │ Dot Prod │ → Score
│Document │ → BERT → [768-d] → │          │
└─────────┘                    └──────────┘

ColBERT (Late Interaction):
┌──────────┐                          ┌────────┐
│  Query   │ → BERT → [32 x 128-d]  → │        │
│ (32 tok) │                          │ MaxSim │ → Score
│ Document │ → BERT → [200 x 128-d] → │        │
│ (200 tok)│                          └────────┘
└──────────┘

MaxSim Operation:
For each query token, find max similarity with any document token,
then sum across all query tokens.
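
Written as a formula (the same sum-of-max described above), with query token embeddings \(q_1, \dots, q_{|Q|}\) and document token embeddings \(d_1, \dots, d_{|D|}\):

\[ S(Q, D) = \sum_{i=1}^{|Q|} \max_{1 \le j \le |D|} \, q_i \cdot d_j \]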

ColBERT Literature

| Paper | Author | Venue | Code | Key Innovation |
| --- | --- | --- | --- | --- |
| ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT | Khattab & Zaharia | SIGIR 2020 | Code | Late interaction: retains token-level embeddings and computes MaxSim after encoding; captures fine-grained detail like a cross-encoder while remaining indexable like a bi-encoder. |
| ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction | Santhanam et al. | NAACL 2022 | Code | Compression + denoising: residual compression shrinks the index 6-10x; denoised supervision from a cross-encoder improves quality; enables billion-scale retrieval. |

Other Multi-Vector Methods

| Paper | Author | Venue | Code | Key Innovation |
| --- | --- | --- | --- | --- |
| Poly-encoders: Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring | Humeau et al. | ICLR 2020 | Code | Attention over codes: the long input is summarized by learned code vectors, and the candidate embedding attends over them; balances speed and accuracy, versatile for dialogue systems. |
| ME-BERT: Multi-Vector Encoding for Document Retrieval | Luan et al. | arXiv 2020 | NA | Multi-vector per doc: multiple vectors per document capture diverse topics; each vector represents a different aspect or sub-topic. |

How ColBERT Does Both Stages

ColBERTv2 with PLAID

The key innovation is the PLAID (Performance-optimized Late Interaction Driver) index:

Stage 1 Capability: Fast Retrieval

  1. Centroid-based pruning: Groups similar token embeddings into centroids (see the sketch after this list)

  2. Early termination: Stops scoring if document clearly won’t be in top-k

  3. Quantization: Compresses each 128-d float32 embedding down to 1-2 bits per dimension

  4. Result: Can search 10M documents in ~50-100ms
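
A minimal sketch of the centroid-pruning idea (illustrative only, not the PLAID API; it assumes a k-means codebook over all indexed token embeddings and an inverted list from each centroid to the documents whose tokens were assigned to it):

import numpy as np

def candidate_docs(query_embs, centroids, centroid_to_docs, n_probe=4):
    """
    query_embs:       (Q, d) query token embeddings
    centroids:        (C, d) k-means centroids over indexed token embeddings
    centroid_to_docs: dict centroid_id -> set of doc ids (inverted list)
    Returns doc ids worth scoring with full MaxSim; the rest are pruned.
    """
    sims = query_embs @ centroids.T                   # (Q, C) similarity to centroids
    nearest = np.argsort(-sims, axis=1)[:, :n_probe]  # top centroids per query token
    candidates = set()
    for row in nearest:
        for cid in row:
            candidates |= centroid_to_docs[int(cid)]
    return candidates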

Stage 2 Capability: Fine-grained Matching

  1. Token-level MaxSim: Each query token finds best matching document token

  2. Captures phrases: “capital of France” can match non-contiguous tokens

  3. Position awareness: Different tokens can match different parts

  4. Result: Accuracy approaching cross-encoders

Performance Comparison

Speed vs Accuracy Trade-off

| Method | Latency | Accuracy | Index Size | Use Case |
| --- | --- | --- | --- | --- |
| BM25 | ~1ms | Baseline | Small (GB) | Keywords |
| Bi-Encoder | ~10ms | +15% | Medium (10GB) | Semantic |
| ColBERT | ~50ms | +25% | Large (100GB) | Both stages |
| Cross-Encoder | ~1000ms/doc | +30% | None | Stage 2 only |

Key Insight: Going by the table, ColBERT is roughly 5x slower than a bi-encoder per query, yet one to two orders of magnitude faster than scoring every candidate with a cross-encoder, at accuracy much closer to the cross-encoder.

When to Use ColBERT

Use ColBERT As Primary Retriever When:

  • Accuracy is critical (medical, legal, research)

  • You can afford the larger index (100-300GB for 10M docs)

  • Latency budget allows 50-200ms

  • Want single-stage solution (no separate re-ranker needed)

Use ColBERT As Re-ranker (typical pipeline):

  • Bi-encoder retrieves top-1000

  • ColBERT MaxSim re-ranks to top-100

  • Cross-encoder (optional) produces final top-10

  • Best of all worlds: fast initial retrieval, precise final ranking

Don’t Use ColBERT When:

  • Index size is constrained (edge devices, mobile)

  • Need sub-10ms latency (real-time autocomplete)

  • Corpus is small enough for cross-encoder on everything (<10K docs)

Implementation Example

Basic ColBERT Retrieval

from colbert import Searcher
from colbert.infra import ColBERTConfig

# Initialize a searcher over a prebuilt index
config = ColBERTConfig(root="experiments/")
searcher = Searcher(index="my_index", config=config)

# Search: returns parallel lists (passage_ids, ranks, scores)
query = "What is the capital of France?"
results = searcher.search(query, k=10)

for passage_id, rank, score in zip(*results):
    print(f"Rank {rank}: {searcher.collection[passage_id]} (score: {score:.2f})")

ColBERT as Re-ranker

from sentence_transformers import SentenceTransformer, CrossEncoder, util

# Stage 1: Bi-encoder retrieves candidates (query and corpus assumed defined)
bi_encoder = SentenceTransformer('BAAI/bge-base-en-v1.5')
corpus_emb = bi_encoder.encode(corpus, convert_to_tensor=True)
query_emb = bi_encoder.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_emb, corpus_emb, top_k=1000)[0]
candidates = [corpus[hit['corpus_id']] for hit in hits]

# Stage 2: ColBERT re-ranks via token-level matching
# (schematic: colbert_score stands in for a ColBERT MaxSim scoring call)
colbert_scores = [colbert_score(query, doc) for doc in candidates]
ranked = sorted(zip(colbert_scores, candidates), key=lambda p: p[0], reverse=True)
top_100 = [doc for _, doc in ranked[:100]]

# Stage 3 (optional): Cross-encoder for the final top-10
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
ce_scores = cross_encoder.predict([(query, doc) for doc in top_100])
final_top_10 = [doc for _, doc in sorted(zip(ce_scores, top_100),
                                         key=lambda p: p[0], reverse=True)[:10]]

The Index Size Problem

Why ColBERT Indexes Are Large

Standard bi-encoder:

\[ \begin{aligned} \text{Index Size} &= N \times d \times 4 \text{ bytes} \\ \text{For 10M docs: } & 10^7 \times 768 \times 4 \text{ bytes} \approx 30\,\text{GB} \end{aligned} \]

ColBERT:

\[ \begin{aligned} \text{Index Size} &= N \times \text{avg\_tokens} \times d \times 4 \text{ bytes} \\ \text{For 10M docs: } & 10^7 \times 200 \times 128 \times 4 \text{ bytes} \approx 1\,\text{TB} \end{aligned} \]

ColBERTv2 Solutions:

  1. Compression: Quantization reduces the index to ~100GB (10x smaller; see the calculation below)

  2. Pruning: Remove low-importance tokens

  3. Residual encoding: Store deltas from centroids
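
To make these numbers concrete, a quick back-of-envelope script (assuming ~200 tokens per document and 2 bits per dimension for the compressed case; centroid-id and metadata overheads are ignored, which is why the text above quotes ~100GB rather than the raw figure):

def index_size_gb(num_docs, tokens_per_doc, dim, bytes_per_dim):
    """Raw embedding storage only, ignoring metadata."""
    return num_docs * tokens_per_doc * dim * bytes_per_dim / 1e9

# Single-vector bi-encoder: one 768-d float32 vector per document
print(index_size_gb(10_000_000, 1, 768, 4))        # ~30 GB

# Naive ColBERT: 128-d float32 per token (the ~1 TB figure above)
print(index_size_gb(10_000_000, 200, 128, 4))      # ~1000 GB

# ColBERTv2-style residual compression at 2 bits per dimension
print(index_size_gb(10_000_000, 200, 128, 2 / 8))  # ~64 GB before overheads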

Advanced Topic: MaxSim Operation

The MaxSim operation is what gives ColBERT its power:

import torch

def maxsim(query_embeddings: torch.Tensor, doc_embeddings: torch.Tensor) -> torch.Tensor:
    """
    query_embeddings: (num_query_tokens, 128), unit-normalized
    doc_embeddings:   (num_doc_tokens, 128), unit-normalized
    """
    # All pairwise token similarities (cosine, since inputs are normalized)
    similarities = query_embeddings @ doc_embeddings.T  # (Q, D)

    # For each query token, keep its best match over all doc tokens
    max_per_query_token = similarities.max(dim=1).values  # (Q,)

    # Sum the best matches across query tokens to score the document
    score = max_per_query_token.sum()

    return score
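
A toy usage of the function above (random embeddings, purely to illustrate shapes; ColBERT unit-normalizes token embeddings, so each per-token max is at most 1):

import torch
import torch.nn.functional as F

# 3 query tokens, 5 document tokens, 128-d, unit-normalized
q = F.normalize(torch.randn(3, 128), dim=1)
d = F.normalize(torch.randn(5, 128), dim=1)
print(maxsim(q, d))  # single scalar score, here at most 3.0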

Why This Works:

  • Query token “capital” matches doc token “capital” (exact)

  • Query token “France” matches doc tokens “French”, “Paris” (semantic)

  • Flexible matching while maintaining efficiency

Comparison with Cross-Encoders

ColBERT vs Cross-Encoder

| Dimension | ColBERT | Cross-Encoder |
| --- | --- | --- |
| Encoding | Independent (query, doc separate) | Joint ([CLS] query [SEP] doc [SEP]) |
| Interaction | Late (after encoding) | Early (full self-attention) |
| Pre-computation | Yes (doc embeddings offline) | No (must encode each pair) |
| Speed for 100 docs | ~50ms (MaxSim is cheap) | ~5000ms (100 forward passes) |
| Accuracy | 90-95% of cross-encoder | Best possible |
| Best Use | Stage 1 or Stage 2 | Stage 2 only |

Poly-Encoders: Another Middle Ground

Poly-encoders offer a different trade-off:

Architecture:

  1. Candidate document → BERT → single vector (pre-computed offline)

  2. Query/context → BERT → multiple learned “code” vectors (e.g., 64 codes)

  3. The candidate vector attends over the query’s code vectors

  4. The attention-weighted sum of codes is scored against the candidate vector (sketched below)
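
A minimal sketch of this scoring step (illustrative shapes only, not the paper’s code):

import torch
import torch.nn.functional as F

def poly_encoder_score(query_codes: torch.Tensor, candidate_vec: torch.Tensor) -> torch.Tensor:
    """
    query_codes:   (m, d) code vectors produced from the query/context
    candidate_vec: (d,)   precomputed single vector for the candidate document
    """
    # Candidate attends over the query's code vectors
    attn = F.softmax(query_codes @ candidate_vec, dim=0)    # (m,)
    # Attention-weighted sum of codes -> final context vector
    context = (attn.unsqueeze(1) * query_codes).sum(dim=0)  # (d,)
    # Score: dot product between context and candidate
    return context @ candidate_vec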

Advantage: Learned attention is a more flexible aggregation than ColBERT’s fixed MaxSim

Disadvantage: More complex and less interpretable; no token-level matching

Use Case: Dialogue systems, where candidates are short and interaction patterns are complex

Research Directions

Current research on late interaction focuses on:

  1. Reducing Index Size: Can we get ColBERT quality with bi-encoder size?

  2. Dynamic Pruning: Adaptively decide which tokens to keep

  3. Learned Aggregation: Learn better operations than MaxSim

  4. Multi-modal: Extend late interaction to images, video

  5. Long Documents: Handle documents with thousands of tokens

Next Steps