Building RAG Pipelines: A Practical Guide

This guide presents practical patterns for building Retrieval-Augmented Generation (RAG) pipelines, progressing from minimal viable implementations to production-ready systems. The content is inspired by Ben Clavié’s “Beyond the Basics of RAG” talk at the Mastering LLMs Conference and reflects real-world best practices.

Note

Key Insight from Ben Clavié (Answer.AI):

“RAG is not a new paradigm, a framework, or an end-to-end system. RAG is the act of stitching together Retrieval and Generation to ground the latter. Good RAG is made up of good components: good retrieval pipeline, good generative model, good way of linking them up.”

Video: Beyond the Basics of RAG

Watch Ben Clavié’s full talk from the Mastering LLMs Conference.

Slides

Download the presentation slides (PDF)

You can also view the full slides and transcript on the Parlance Labs education page.

The Compact MVP: Start Simple

The most minimal dense retrieval pipeline is surprisingly simple. Before reaching for complex architectures, start here.

Minimal Implementation

from sentence_transformers import SentenceTransformer
import numpy as np

# Load embedding model
model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# Embed your documents (do this once, store the results)
documents = ["Document 1 text...", "Document 2 text...", ...]
doc_embeddings = model.encode(documents, normalize_embeddings=True)

# At query time: embed query and find similar documents
query = "What is the capital of France?"
query_embedding = model.encode(query, normalize_embeddings=True)

# Compute similarities (this IS your "vector database" at small scale)
similarities = np.dot(doc_embeddings, query_embedding.T)
top_k_indices = np.argsort(similarities)[-3:][::-1]

results = [documents[i] for i in top_k_indices]

That’s it. This works for thousands of documents on any modern CPU.

When Do You Need a Vector Database?

Important

You don’t need a vector database for small-scale search.

A numpy array IS your vector database at small scale. Any modern CPU can search through thousands of vectors in milliseconds.

Vector databases (FAISS, Milvus, Pinecone, etc.) become necessary when:

  • Scale: > 100K documents (need approximate nearest neighbor search)

  • Persistence: Need to store and reload indexes

  • Filtering: Need metadata-based pre-filtering

  • Updates: Frequent document additions/deletions

  • Distribution: Multi-node deployment

When to Use What

Scale           │ Solution                       │ Rationale
────────────────┼────────────────────────────────┼────────────────────────────────────
< 10K docs      │ NumPy array                    │ Brute force is fast enough (~10ms)
10K - 100K docs │ FAISS (flat or IVF)            │ Need some optimization
100K - 10M docs │ FAISS HNSW / Milvus            │ Need ANN for sub-second latency
> 10M docs      │ Distributed (Milvus, Pinecone) │ Need sharding and replication
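
When you outgrow the NumPy approach, the smallest step up is an exact FAISS index. The sketch below is illustrative, assuming faiss-cpu is installed; it reuses doc_embeddings, query_embedding, and documents from the MVP above (vectors are normalized, so inner product equals cosine similarity).

import faiss
import numpy as np

# Exact (brute-force) inner-product index: same results as the NumPy dot
# product, with an interface that later scales to IVF / HNSW
index = faiss.IndexFlatIP(doc_embeddings.shape[1])
index.add(doc_embeddings.astype(np.float32))

# Persistence: write the index to disk and reload it between runs
faiss.write_index(index, "docs.index")
index = faiss.read_index("docs.index")

# FAISS expects a 2-D float32 array of queries; returns (scores, indices)
scores, indices = index.search(query_embedding.astype(np.float32).reshape(1, -1), 3)
results = [documents[i] for i in indices[0]]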

Why Bi-Encoders Work (and When They Don’t)

Bi-encoders encode queries and documents entirely separately. They are unaware of each other until the similarity computation.

Advantages:

  • Pre-compute all document embeddings (offline)

  • Only encode the query at inference time

  • Extremely fast retrieval via ANN indexes

Limitations:

  • Compressing hundreds of tokens to a single vector loses information

  • Training data never fully represents your domain

  • Humans use keywords that embeddings may not capture well

This is why we need the next component: reranking.

Adding Reranking: The Power of Cross-Encoders

Cross-encoders fix the “query-document unawareness” problem by processing them together.

How Cross-Encoders Work

Bi-Encoder (Stage 1):
┌─────────┐     ┌─────────┐
│  Query  │     │  Doc    │
│ Encoder │     │ Encoder │
└────┬────┘     └────┬────┘
     │               │
     ▼               ▼
   [768]           [768]      → Dot Product → Score

Cross-Encoder (Stage 2):
┌─────────────────────────────────┐
│  [CLS] Query [SEP] Document [SEP] │
│         Joint Encoder            │
└───────────────┬─────────────────┘
                │
                ▼
              Score

The key difference: Cross-encoders see the full query-document interaction through self-attention. This is much more powerful but computationally expensive.

Adding Reranking to the Pipeline

from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np

# Stage 1: Fast retrieval with bi-encoder
bi_encoder = SentenceTransformer("BAAI/bge-base-en-v1.5")
doc_embeddings = bi_encoder.encode(documents, normalize_embeddings=True)

query = "What was Studio Ghibli's first film?"
query_embedding = bi_encoder.encode(query, normalize_embeddings=True)

# Get top-100 candidates (fast, ~10ms)
similarities = np.dot(doc_embeddings, query_embedding.T)
top_100_indices = np.argsort(similarities)[-100:][::-1]
candidates = [documents[i] for i in top_100_indices]

# Stage 2: Precise reranking with cross-encoder
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [[query, doc] for doc in candidates]
scores = cross_encoder.predict(pairs)

# Get final top-10 (slower but much more accurate, ~2-5s)
top_10_indices = np.argsort(scores)[-10:][::-1]
final_results = [candidates[i] for i in top_10_indices]

The World of Rerankers

Beyond basic cross-encoders, there are many reranking approaches:

Type           │ Examples             │ Trade-off
───────────────┼──────────────────────┼────────────────────────────────
Cross-Encoders │ MiniLM, BGE-reranker │ Best accuracy, moderate speed
T5-based       │ MonoT5, RankT5       │ Good accuracy, slower
LLM-based      │ RankGPT, RankZephyr  │ Excellent zero-shot, expensive
API-based      │ Cohere, Jina, Voyage │ Easy to use, cost per query

Tip

Using the rerankers library (maintained by Ben Clavié):

from rerankers import Reranker

# Local cross-encoder
ranker = Reranker("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Or API-based (Cohere)
ranker = Reranker("cohere", api_key="...")

# Same interface for all!
results = ranker.rank(query="...", docs=[...])

Keyword Search: The Old Legend Lives On

One of the most overlooked components in modern RAG systems is good old BM25.

Why BM25 Still Matters

Important

“An ongoing joke is that information retrieval has progressed slowly because BM25 is too strong a baseline.” — Ben Clavié

Semantic search via embeddings is powerful, but compressing hundreds of tokens to a single vector inevitably loses information:

  • Embeddings learn to represent information useful to their training queries

  • Training data is never fully representative of your domain

  • Humans love keywords: acronyms, domain-specific terms, product codes

BEIR Benchmark Evidence

From the BEIR benchmark (Thakur et al., 2021), BM25 outperforms many dense models on several datasets:

Dataset         │ BM25   │ DPR    │ ANCE   │ ColBERT
────────────────┼────────┼────────┼────────┼────────
TREC-COVID      │ 0.656  │ 0.332  │ 0.654  │ 0.677
NFCorpus        │ 0.325  │ 0.189  │ 0.237  │ 0.319
Touché-2020     │ 0.367  │ 0.131  │ 0.240  │ 0.162
Robust04        │ 0.408  │ 0.252  │ 0.392  │ 0.427

Avg vs BM25     │   —    │ -47.7% │ -7.4%  │ -2.8%

BM25 is especially powerful for:

  • Longer documents (more term statistics to leverage)

  • Domain-specific jargon (medical, legal, technical)

  • Exact match requirements (product codes, statute numbers)

And its inference overhead is virtually unnoticeable — a near free-lunch addition.
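
To make the keyword point concrete, here is a minimal BM25-only sketch using the rank_bm25 package (also used in the hybrid example below). The tiny corpus and product code are illustrative placeholders.

from rank_bm25 import BM25Okapi

corpus = [
    "Invoice for the ATX-750W power supply unit, order #4411",
    "Quarterly revenue grew on strong cruise bookings",
    "ATX-750W user manual for installation and safety",
]
tokenized = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized)

# Exact tokens like product codes score highly even if an embedding
# model never saw them during training
query = "atx-750w manual"
scores = bm25.get_scores(query.lower().split())
print(corpus[scores.argmax()])  # the user manual wins on both terms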

Hybrid Search: Best of Both Worlds

Combine BM25 and dense retrieval for robustness:

from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import numpy as np

# Prepare BM25
tokenized_docs = [doc.lower().split() for doc in documents]
bm25 = BM25Okapi(tokenized_docs)

# Prepare dense
bi_encoder = SentenceTransformer("BAAI/bge-base-en-v1.5")
doc_embeddings = bi_encoder.encode(documents, normalize_embeddings=True)

def hybrid_search(query, top_k=100, alpha=0.5):
    """Combine BM25 and dense scores with weight alpha."""
    # BM25 scores
    tokenized_query = query.lower().split()
    bm25_scores = bm25.get_scores(tokenized_query)
    bm25_scores = (bm25_scores - bm25_scores.min()) / (bm25_scores.max() - bm25_scores.min() + 1e-6)

    # Dense scores
    query_emb = bi_encoder.encode(query, normalize_embeddings=True)
    dense_scores = np.dot(doc_embeddings, query_emb.T).flatten()
    dense_scores = (dense_scores - dense_scores.min()) / (dense_scores.max() - dense_scores.min() + 1e-6)

    # Combine
    hybrid_scores = alpha * dense_scores + (1 - alpha) * bm25_scores
    top_indices = np.argsort(hybrid_scores)[-top_k:][::-1]

    return [documents[i] for i in top_indices]
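
An illustrative usage sketch, assuming the cross_encoder from the reranking section above is still loaded: hybrid retrieval for the broad first pass, cross-encoder for the final ordering.

query = "Q4 2022 cruise division revenue"
candidates = hybrid_search(query, top_k=50, alpha=0.5)

# Rerank the hybrid candidates exactly as in the two-stage pipeline above
scores = cross_encoder.predict([[query, doc] for doc in candidates])
top_10 = [candidates[i] for i in np.argsort(scores)[-10:][::-1]]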

Metadata Filtering: Don’t Search What You Don’t Need

Outside of academic benchmarks, documents don’t exist in a vacuum. Metadata filtering is crucial for production systems.

The Problem

Consider this query:

“Get me the cruise division financial report for Q4 2022”

Vector search can fail here because:

  1. The model must accurately represent “financial report” + “cruise division” + “Q4” + “2022” in a single vector

  2. If top-k is too high, you’ll pass irrelevant financial reports to your LLM

The Solution: Pre-filtering

Use entity extraction to identify filterable attributes:

Query: "Get me the cruise division financial report for Q4 2022"

Extracted entities:
- DEPARTMENT: "cruise division"
- DOCUMENT_TYPE: "financial report"
- TIME_PERIOD: "Q4 2022"

Then filter before vector search:

# Instead of searching all documents...
results = vector_search(query, all_documents, top_k=100)

# Pre-filter to relevant subset
filtered_docs = [d for d in all_documents
                 if d.department == "cruise"
                 and d.doc_type == "financial_report"
                 and d.period == "Q4_2022"]
results = vector_search(query, filtered_docs, top_k=10)

Entity Extraction with GliNER

from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_base")

query = "Get me the cruise division financial report for Q4 2022"
labels = ["department", "document_type", "time_period"]

entities = model.predict_entities(query, labels)
# [{'text': 'cruise division', 'label': 'department'},
#  {'text': 'financial report', 'label': 'document_type'},
#  {'text': 'Q4 2022', 'label': 'time_period'}]
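
A sketch of wiring the extracted entities into a pre-filter. The alias table, document attributes, and matching logic are illustrative assumptions, not something GLiNER or the talk prescribes; the point is that filtering happens before embedding search.

# Turn GLiNER's output into a dict keyed by label
filters = {e["label"]: e["text"].lower() for e in entities}
# {'department': 'cruise division', 'document_type': 'financial report',
#  'time_period': 'q4 2022'}

# Mapping free-text mentions to canonical metadata values is domain-specific;
# a small lookup table is often enough
DEPARTMENT_ALIASES = {"cruise division": "cruise"}
department = DEPARTMENT_ALIASES.get(filters.get("department", ""))

# Pre-filter, then embed and search filtered_docs as in the compact MVP
filtered_docs = [
    d for d in all_documents
    if department is None or d.department == department
]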

The Final MVP++: Putting It All Together

Here’s the complete production-ready pipeline in ~30 lines:

import lancedb
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry
from lancedb.rerankers import CohereReranker

# Initialize embedding model
model = get_registry().get("sentence-transformers").create(
    name="BAAI/bge-small-en-v1.5"
)

# Define document schema with metadata
class Document(LanceModel):
    text: str = model.SourceField()
    vector: Vector(384) = model.VectorField()
    category: str  # Metadata for filtering

# Create database and table
db = lancedb.connect(".my_db")
tbl = db.create_table("my_table", schema=Document)

# Add documents (embedding happens automatically)
tbl.add(docs)  # docs = [{"text": "...", "category": "..."}, ...]

# Create full-text search index for hybrid search
tbl.create_fts_index("text")

# Initialize reranker
reranker = CohereReranker()

# Query with all components
query = "What is Chihiro's new name given to her by the witch?"
results = (
    tbl.search(query, query_type="hybrid")  # Hybrid = BM25 + dense
    .where("category = 'film'", prefilter=True)  # Metadata filter
    .limit(100)  # First-pass retrieval
    .rerank(reranker=reranker)  # Cross-encoder reranking
)
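
One detail worth noting, assuming LanceDB's lazy query-builder API: the chained call above builds the query, and rows come back when you materialize it.

# Materialize the reranked rows and assemble context for the LLM
rows = results.to_list()  # or results.to_pandas()
context = "\n\n".join(row["text"] for row in rows[:10])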

Pipeline Architecture Summary

┌─────────────────────────────────────────────────────────────────────┐
│                        MVP++ RAG Pipeline                           │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌─────────┐     ┌──────────────┐     ┌──────────────────────────┐ │
│  │  Query  │ ──► │   Entity     │ ──► │   Metadata Filtering     │ │
│  │         │     │  Extraction  │     │   (Pre-filter corpus)    │ │
│  └─────────┘     └──────────────┘     └────────────┬─────────────┘ │
│                                                     │               │
│                                                     ▼               │
│                                       ┌──────────────────────────┐ │
│                                       │   Hybrid Retrieval       │ │
│                                       │   (BM25 + Dense)         │ │
│                                       │   → Top-100 candidates   │ │
│                                       └────────────┬─────────────┘ │
│                                                     │               │
│                                                     ▼               │
│                                       ┌──────────────────────────┐ │
│                                       │   Cross-Encoder          │ │
│                                       │   Reranking              │ │
│                                       │   → Top-10 final         │ │
│                                       └────────────┬─────────────┘ │
│                                                     │               │
│                                                     ▼               │
│                                       ┌──────────────────────────┐ │
│                                       │   LLM Generation         │ │
│                                       │   (with retrieved docs)  │ │
│                                       └──────────────────────────┘ │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Component Checklist

Component               │ Priority           │ Notes
────────────────────────┼────────────────────┼───────────────────────────────────────────────
Bi-encoder retrieval    │ Required           │ Start with BGE or E5 models
Cross-encoder reranking │ Highly recommended │ 10-30% accuracy improvement typical
BM25 / Hybrid search    │ Recommended        │ Near-zero overhead, helps with keywords
Metadata filtering      │ Situational        │ Essential when documents have clear attributes
Entity extraction       │ Optional           │ Automates metadata filtering from queries

What’s Next?

This guide covers the “compact MVP++” — the foundation every RAG system should have. More advanced topics include:

Beyond Single Vectors:

  • ColBERT / Late Interaction: Multiple vectors per document for fine-grained matching (see the RAGatouille sketch after this list)

  • SPLADE: Learned sparse representations combining neural + keyword matching
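
As a pointer for the late-interaction route, here is a rough sketch with the RAGatouille library listed in the references; treat the exact arguments as indicative rather than authoritative.

from ragatouille import RAGPretrainedModel

# Index documents as ColBERTv2-style multi-vector representations
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
RAG.index(collection=documents, index_name="ghibli_docs")

# Late interaction: per-token matching instead of one vector per document
results = RAG.search("What was Studio Ghibli's first film?", k=10)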

Training and Optimization:

  • Hard negative mining: Improving retrieval with better training data

  • Knowledge distillation: Making cross-encoders faster

  • Domain adaptation: Fine-tuning for your specific use case

Evaluation:

  • Measuring retrieval quality on a domain-specific evaluation set with standard metrics such as recall@k, MRR, and nDCG

References

  1. Clavié, B. (2024). “Beyond Explaining the Basics of Retrieval (Augmented Generation).” Talk at Mastering LLMs Conference.

  2. Thakur, N., et al. (2021). “BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models.” NeurIPS 2021.

  3. RAGatouille library: https://github.com/AnswerDotAI/RAGatouille

  4. Rerankers library: https://github.com/AnswerDotAI/rerankers

  5. LanceDB documentation: https://lancedb.github.io/lancedb/

  6. GLiNER: Generalist Model for Named Entity Recognition (arXiv:2311.08526)