Building RAG Pipelines: A Practical Guide¶
This guide presents practical patterns for building Retrieval-Augmented Generation (RAG) pipelines, progressing from minimal viable implementations to production-ready systems. The content is inspired by Ben Clavié’s “Beyond the Basics of RAG” talk at the Mastering LLMs Conference (video) and reflects real-world best practices.
Note
Key Insight from Ben Clavié (Answer.AI):
“RAG is not a new paradigm, a framework, or an end-to-end system. RAG is the act of stitching together Retrieval and Generation to ground the latter. Good RAG is made up of good components: good retrieval pipeline, good generative model, good way of linking them up.”
Video: Beyond the Basics of RAG¶
Watch Ben Clavié’s full talk from the Mastering LLMs Conference:
Slides¶
Download the presentation slides (PDF)
You can also view the full slides and transcript on the Parlance Labs education page.
The Compact MVP: Start Simple¶
The most minimal dense retrieval pipeline is surprisingly simple. Before reaching for complex architectures, start here.
Minimal Implementation¶
from sentence_transformers import SentenceTransformer
import numpy as np
# Load embedding model
model = SentenceTransformer("BAAI/bge-base-en-v1.5")
# Embed your documents (do this once, store the results)
documents = ["Document 1 text...", "Document 2 text...", ...]
doc_embeddings = model.encode(documents, normalize_embeddings=True)
# At query time: embed query and find similar documents
query = "What is the capital of France?"
query_embedding = model.encode(query, normalize_embeddings=True)
# Compute similarities (this IS your "vector database" at small scale)
similarities = np.dot(doc_embeddings, query_embedding.T)
top_k_indices = np.argsort(similarities)[-3:][::-1]
results = [documents[i] for i in top_k_indices]
That’s it. This works for thousands of documents on any modern CPU.
When Do You Need a Vector Database?¶
Important
You don’t need a vector database for small-scale search.
A NumPy array IS your vector database at small scale. Any modern CPU can search through thousands of vectors in milliseconds.
Vector databases (FAISS, Milvus, Pinecone, etc.) become necessary when:
Scale: > 100K documents (need approximate nearest neighbor search)
Persistence: Need to store and reload indexes
Filtering: Need metadata-based pre-filtering
Updates: Frequent document additions/deletions
Distribution: Multi-node deployment
| Scale | Solution | Rationale |
|---|---|---|
| < 10K docs | NumPy array | Brute force is fast enough (~10ms) |
| 10K - 100K docs | FAISS (flat or IVF) | Need some optimization |
| 100K - 10M docs | FAISS HNSW / Milvus | Need ANN for sub-second latency |
| > 10M docs | Distributed (Milvus, Pinecone) | Need sharding and replication |
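When you do cross that threshold, the jump from NumPy to FAISS is small. A minimal sketch, assuming the documents, model, and doc_embeddings from the MVP above (FAISS expects float32 inputs; with normalized embeddings, inner product equals cosine similarity):
import faiss
import numpy as np
# Flat inner-product index: still exact search, no approximation yet
dim = doc_embeddings.shape[1]
index = faiss.IndexFlatIP(dim)
index.add(doc_embeddings.astype(np.float32))
# Query with the same embedding model; returns scores and indices for the top-3 hits
query_embedding = model.encode(["What is the capital of France?"], normalize_embeddings=True)
scores, indices = index.search(query_embedding.astype(np.float32), 3)
results = [documents[i] for i in indices[0]]
For the 100K+ range, the same add/search calls work with an approximate index such as faiss.IndexHNSWFlat(dim, 32) in place of the flat one.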
Why Bi-Encoders Work (and When They Don’t)¶
Bi-encoders encode queries and documents entirely separately. They are unaware of each other until the similarity computation.
Advantages:
Pre-compute all document embeddings (offline)
Only encode the query at inference time
Extremely fast retrieval via ANN indexes
Limitations:
Compressing hundreds of tokens to a single vector loses information
Training data never fully represents your domain
Humans use keywords that embeddings may not capture well
This is why we need the next component: reranking.
Adding Reranking: The Power of Cross-Encoders¶
Cross-encoders fix the “query-document unawareness” problem by processing them together.
How Cross-Encoders Work¶
Bi-Encoder (Stage 1):
┌─────────┐ ┌─────────┐
│ Query │ │ Doc │
│ Encoder │ │ Encoder │
└────┬────┘ └────┬────┘
│ │
▼ ▼
[768] [768] → Dot Product → Score
Cross-Encoder (Stage 2):
┌─────────────────────────────────┐
│ [CLS] Query [SEP] Document [SEP] │
│ Joint Encoder │
└───────────────┬─────────────────┘
│
▼
Score
The key difference: Cross-encoders see the full query-document interaction through self-attention. This is much more powerful but computationally expensive.
Adding Reranking to the Pipeline¶
from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np
# Stage 1: Fast retrieval with bi-encoder
bi_encoder = SentenceTransformer("BAAI/bge-base-en-v1.5")
doc_embeddings = bi_encoder.encode(documents, normalize_embeddings=True)
query = "What was Studio Ghibli's first film?"
query_embedding = bi_encoder.encode(query, normalize_embeddings=True)
# Get top-100 candidates (fast, ~10ms)
similarities = np.dot(doc_embeddings, query_embedding.T)
top_100_indices = np.argsort(similarities)[-100:][::-1]
candidates = [documents[i] for i in top_100_indices]
# Stage 2: Precise reranking with cross-encoder
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [[query, doc] for doc in candidates]
scores = cross_encoder.predict(pairs)
# Get final top-10 (slower but much more accurate, ~2-5s)
top_10_indices = np.argsort(scores)[-10:][::-1]
final_results = [candidates[i] for i in top_10_indices]
The World of Rerankers¶
Beyond basic cross-encoders, there are many reranking approaches:
| Type | Examples | Trade-off |
|---|---|---|
| Cross-Encoders | MiniLM, BGE-reranker | Best accuracy, moderate speed |
| T5-based | MonoT5, RankT5 | Good accuracy, slower |
| LLM-based | RankGPT, RankZephyr | Excellent zero-shot, expensive |
| API-based | Cohere, Jina, Voyage | Easy to use, cost per query |
Tip
Using the rerankers library (maintained by Ben Clavié):
from rerankers import Reranker
# Local cross-encoder
ranker = Reranker("cross-encoder/ms-marco-MiniLM-L-6-v2")
# Or API-based (Cohere)
ranker = Reranker("cohere", api_key="...")
# Same interface for all!
results = ranker.rank(query="...", docs=[...])
Keyword Search: The Old Legend Lives On¶
One of the most overlooked components in modern RAG systems is good old BM25.
Why BM25 Still Matters¶
Important
“An ongoing joke is that information retrieval has progressed slowly because BM25 is too strong a baseline.” — Ben Clavié
Semantic search via embeddings is powerful, but compressing hundreds of tokens to a single vector inevitably loses information:
Embeddings learn to represent whatever information was useful for the queries they were trained on
Training data is never fully representative of your domain
Humans love keywords: acronyms, domain-specific terms, product codes
BEIR Benchmark Evidence¶
From the BEIR benchmark (Thakur et al., 2021), BM25 outperforms many dense models on several datasets:
Dataset │ BM25 │ DPR │ ANCE │ ColBERT
────────────────┼────────┼────────┼────────┼────────
TREC-COVID │ 0.656 │ 0.332 │ 0.654 │ 0.677
NFCorpus │ 0.325 │ 0.189 │ 0.237 │ 0.319
Touché-2020 │ 0.367 │ 0.131 │ 0.240 │ 0.162
Robust04 │ 0.408 │ 0.252 │ 0.392 │ 0.427
Avg vs BM25 │ — │ -47.7% │ -7.4% │ -2.8%
BM25 is especially powerful for:
Longer documents (more term statistics to leverage)
Domain-specific jargon (medical, legal, technical)
Exact match requirements (product codes, statute numbers)
And its inference overhead is virtually unnoticeable — a near free-lunch addition.
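To see the keyword effect in isolation, here is a tiny sketch using the rank_bm25 package; the documents and the product code are invented for illustration:
from rank_bm25 import BM25Okapi
docs = [
    "Installation guide for the XR-750 controller firmware",
    "General overview of our controller product line",
    "Troubleshooting network latency in distributed systems",
]
# BM25 operates on tokenized text; whitespace tokenization is enough for the sketch
bm25 = BM25Okapi([d.lower().split() for d in docs])
# An exact product code is trivial for term matching, but easy to blur inside a single dense vector
query = "XR-750 firmware install"
print(bm25.get_top_n(query.lower().split(), docs, n=1))
# ['Installation guide for the XR-750 controller firmware']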
Hybrid Search: Best of Both Worlds¶
Combine BM25 and dense retrieval for robustness:
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import numpy as np
# Prepare BM25
tokenized_docs = [doc.lower().split() for doc in documents]
bm25 = BM25Okapi(tokenized_docs)
# Prepare dense
bi_encoder = SentenceTransformer("BAAI/bge-base-en-v1.5")
doc_embeddings = bi_encoder.encode(documents, normalize_embeddings=True)
def hybrid_search(query, top_k=100, alpha=0.5):
    """Combine BM25 and dense scores with weight alpha."""
    # BM25 scores
    tokenized_query = query.lower().split()
    bm25_scores = bm25.get_scores(tokenized_query)
    bm25_scores = (bm25_scores - bm25_scores.min()) / (bm25_scores.max() - bm25_scores.min() + 1e-6)
    # Dense scores
    query_emb = bi_encoder.encode(query, normalize_embeddings=True)
    dense_scores = np.dot(doc_embeddings, query_emb.T).flatten()
    dense_scores = (dense_scores - dense_scores.min()) / (dense_scores.max() - dense_scores.min() + 1e-6)
    # Combine
    hybrid_scores = alpha * dense_scores + (1 - alpha) * bm25_scores
    top_indices = np.argsort(hybrid_scores)[-top_k:][::-1]
    return [documents[i] for i in top_indices]
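Usage is a single call, assuming the same documents list as earlier; treat alpha as a knob to tune on your own queries (1.0 is pure dense, 0.0 is pure BM25) rather than a fixed constant:
top_docs = hybrid_search("Q4 2022 cruise division financial report", top_k=10, alpha=0.5)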
Metadata Filtering: Don’t Search What You Don’t Need¶
Outside of academic benchmarks, documents don’t exist in a vacuum. Metadata filtering is crucial for production systems.
The Problem¶
Consider this query:
“Get me the cruise division financial report for Q4 2022”
Vector search can fail here because:
The model must accurately represent “financial report” + “cruise division” + “Q4” + “2022” in a single vector
If top-k is too high, you’ll pass irrelevant financial reports to your LLM
The Solution: Pre-filtering¶
Use entity extraction to identify filterable attributes:
Query: "Get me the cruise division financial report for Q4 2022"
Extracted entities:
- DEPARTMENT: "cruise division"
- DOCUMENT_TYPE: "financial report"
- TIME_PERIOD: "Q4 2022"
Then filter before vector search:
# Instead of searching all documents...
results = vector_search(query, all_documents, top_k=100)

# Pre-filter to the relevant subset, then search only within it
# (vector_search stands in for the bi-encoder retrieval shown earlier)
filtered_docs = [d for d in all_documents
                 if d.department == "cruise"
                 and d.doc_type == "financial_report"
                 and d.period == "Q4_2022"]
results = vector_search(query, filtered_docs, top_k=10)
Entity Extraction with GliNER¶
from gliner import GLiNER
model = GLiNER.from_pretrained("urchade/gliner_base")
query = "Get me the cruise division financial report for Q4 2022"
labels = ["department", "document_type", "time_period"]
entities = model.predict_entities(query, labels)
# [{'text': 'cruise division', 'label': 'department'},
# {'text': 'financial report', 'label': 'document_type'},
# {'text': 'Q4 2022', 'label': 'time_period'}]
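Turning those predictions into the pre-filter is mostly bookkeeping. A hedged sketch, assuming documents with the same department / doc_type / period attributes as the earlier example; the label-to-field mapping and the loose matching are illustrative, not a fixed recipe:
# Hypothetical mapping from GLiNER labels to document metadata fields
LABEL_TO_FIELD = {
    "department": "department",
    "document_type": "doc_type",
    "time_period": "period",
}

def entities_to_filter(entities):
    """Turn GLiNER predictions into a field -> value filter."""
    return {
        LABEL_TO_FIELD[e["label"]]: e["text"].lower()
        for e in entities
        if e["label"] in LABEL_TO_FIELD
    }

def matches(doc, field, value):
    """Loose match: normalized metadata value and extracted text overlap either way."""
    field_value = str(getattr(doc, field, "")).lower().replace("_", " ")
    return value in field_value or field_value in value

filters = entities_to_filter(entities)
# {'department': 'cruise division', 'doc_type': 'financial report', 'time_period': 'q4 2022'}
filtered_docs = [
    d for d in all_documents
    if all(matches(d, field, value) for field, value in filters.items())
]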
The Final MVP++: Putting It All Together¶
Here’s the complete production-ready pipeline in ~30 lines:
import lancedb
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry
from lancedb.rerankers import CohereReranker
# Initialize embedding model
model = get_registry().get("sentence-transformers").create(
    name="BAAI/bge-small-en-v1.5"
)
# Define document schema with metadata
class Document(LanceModel):
    text: str = model.SourceField()
    vector: Vector(384) = model.VectorField()
    category: str  # Metadata for filtering
# Create database and table
db = lancedb.connect(".my_db")
tbl = db.create_table("my_table", schema=Document)
# Add documents (embedding happens automatically)
tbl.add(docs) # docs = [{"text": "...", "category": "..."}, ...]
# Create full-text search index for hybrid search
tbl.create_fts_index("text")
# Initialize reranker
reranker = CohereReranker()
# Query with all components
query = "What is Chihiro's new name given to her by the witch?"
results = (
    tbl.search(query, query_type="hybrid")       # Hybrid = BM25 + dense
    .where("category = 'film'", prefilter=True)  # Metadata filter
    .limit(100)                                  # First-pass retrieval
    .rerank(reranker=reranker)                   # Cross-encoder reranking
    .to_list()                                   # Materialize as a list of dicts
)
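Retrieval is only half of the stitching; the last step is to ground the generator with whatever the reranker kept. A minimal sketch using the OpenAI Python client (the model name and prompt wording are placeholders; any chat-completion API slots in the same way):
from openai import OpenAI

client = OpenAI()
# Build a grounded prompt from the top reranked passages
context = "\n\n".join(r["text"] for r in results[:10])
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {query}"
)
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)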
Pipeline Architecture Summary¶
┌─────────────────────────────────────────────────────────────────────┐
│ MVP++ RAG Pipeline │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────┐ ┌──────────────┐ ┌──────────────────────────┐ │
│ │ Query │ ──► │ Entity │ ──► │ Metadata Filtering │ │
│ │ │ │ Extraction │ │ (Pre-filter corpus) │ │
│ └─────────┘ └──────────────┘ └────────────┬─────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────┐ │
│ │ Hybrid Retrieval │ │
│ │ (BM25 + Dense) │ │
│ │ → Top-100 candidates │ │
│ └────────────┬─────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────┐ │
│ │ Cross-Encoder │ │
│ │ Reranking │ │
│ │ → Top-10 final │ │
│ └────────────┬─────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────┐ │
│ │ LLM Generation │ │
│ │ (with retrieved docs) │ │
│ └──────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
Component Checklist¶
| Component | Priority | Notes |
|---|---|---|
| Bi-encoder retrieval | Required | Start with BGE or E5 models |
| Cross-encoder reranking | Highly recommended | 10-30% accuracy improvement typical |
| BM25 / Hybrid search | Recommended | Near-zero overhead, helps with keywords |
| Metadata filtering | Situational | Essential when documents have clear attributes |
| Entity extraction | Optional | Automates metadata filtering from queries |
What’s Next?¶
This guide covers the “compact MVP++” — the foundation every RAG system should have. More advanced topics include:
Beyond Single Vectors:
ColBERT / Late Interaction: Multiple vectors per document for fine-grained matching
SPLADE: Learned sparse representations combining neural + keyword matching
Training and Optimization:
Hard negative mining: Improving retrieval with better training data
Knowledge distillation: Making cross-encoders faster
Domain adaptation: Fine-tuning for your specific use case
Evaluation:
Systematic evaluation is critical, but too large a topic to cover in passing here
See Benchmarks and Datasets for Retrieval and Re-ranking for evaluation metrics and datasets
References¶
Clavié, B. (2024). “Beyond Explaining the Basics of Retrieval (Augmented Generation).” Talk at Mastering LLMs Conference.
Video: YouTube
Slides & Transcript: Parlance Labs
Thakur, N., et al. (2021). “BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models.” NeurIPS 2021.
RAGatouille library: https://github.com/AnswerDotAI/RAGatouille
Rerankers library: https://github.com/AnswerDotAI/rerankers
LanceDB documentation: https://lancedb.github.io/lancedb/
GLiNER: Generalist Model for Named Entity Recognition (arXiv:2311.08526)