Computational Complexity Analysis of RAG Systems¶
This section provides a rigorous analysis of the time and space complexity of each component in a Retrieval-Augmented Generation (RAG) pipeline. While Overview of RAG and the Two-Stage Pipeline covers the conceptual architecture, this page dives into the mathematical foundations that determine system performance at scale.
Why Complexity Analysis Matters¶
Understanding computational complexity is essential for:
- **Capacity Planning**: Predicting infrastructure requirements as corpus size grows
- **Bottleneck Identification**: Knowing where optimization efforts yield the highest returns
- **Architecture Selection**: Choosing between HNSW vs. IVF, cross-encoder vs. bi-encoder
- **Cost Estimation**: Translating Big-O to actual latency and compute costs
Note
Latency Distribution in Production RAG (empirical data from Milvus, 2024):
- Retrieval: 41-47% of total latency
- LLM Generation: 50-60% of total latency
- Preprocessing/Postprocessing: <5%
Document Processing and Chunking¶
Complexity Analysis¶
| Operation | Time Complexity | Space Complexity |
|---|---|---|
| Text tokenization | \(O(n)\) | \(O(n)\) |
| Fixed-size chunking | \(O(n)\) | \(O(n/c)\) chunks |
| Semantic chunking | \(O(n \cdot d^2)\) (embedding-based) | \(O(n)\) |
| Overlap handling | \(O(n \cdot o/c)\) | \(O(n \cdot (1 + o/c))\) |
Where:

- \(n\) = document length in tokens
- \(c\) = chunk size
- \(o\) = overlap size
- \(d\) = embedding dimension
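A minimal sketch of fixed-size chunking with overlap, illustrating the linear time cost and the \(1 + o/c\) storage expansion (the `chunk_tokens` helper and its defaults are illustrative, not a specific library API):

```python
# Minimal sketch of fixed-size chunking with overlap; token-level splitting
# and the function name are assumptions for illustration.
from typing import List


def chunk_tokens(tokens: List[str], c: int = 512, o: int = 64) -> List[List[str]]:
    """Split a token list into chunks of size c with o tokens of overlap.

    Time is O(n); space is O(n * (1 + o/c)) because each chunk repeats
    the last o tokens of its predecessor.
    """
    if o >= c:
        raise ValueError("overlap must be smaller than chunk size")
    stride = c - o
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + c])
        if start + c >= len(tokens):
            break
    return chunks


if __name__ == "__main__":
    toks = [f"tok{i}" for i in range(2000)]
    chunks = chunk_tokens(toks, c=512, o=64)
    # ~ n / (c - o) chunks; stored tokens expand by roughly (1 + o/c)
    print(len(chunks), sum(len(ch) for ch in chunks) / len(toks))
```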
Chunk Size Trade-offs¶
The choice of chunk size \(c\) affects both retrieval quality and storage: a document of \(n\) tokens chunked with overlap \(o\) yields roughly \(n/(c - o)\) chunks, each stored as a \(d\)-dimensional embedding, so smaller chunks give more precise matches at the cost of a larger index.
Empirical Guidelines (from NVIDIA, 2025):
- **256-512 tokens**: Best for factoid QA, precise retrieval
- **512-1024 tokens**: Balanced for general RAG
- **1024+ tokens**: Better context but noisier embeddings
Embedding Generation¶
Transformer Encoder Complexity¶
For a transformer encoder with \(L\) layers, \(d\) hidden dimension, and sequence length \(n\):
Self-Attention Complexity:

\[O(n^2 \cdot d) \text{ per layer}\]

Feed-Forward Complexity:

\[O(n \cdot d^2) \text{ per layer}\]

Total Per Layer:

\[O(n^2 \cdot d + n \cdot d^2)\]

For typical embedding models where \(n < d\) (e.g., n=512, d=768), the feed-forward term dominates, so a full forward pass costs approximately \(O(L \cdot n \cdot d^2)\).
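A rough FLOP count makes the dominance argument concrete. The sketch below counts per-layer matrix-multiply FLOPs and converts them to an approximate latency; the 20 TFLOP/s effective throughput is an assumed placeholder, not a measured figure:

```python
# Rough FLOP count for one encoder forward pass (multiply-accumulate counted
# as 2 FLOPs). Order-of-magnitude estimate, not a profiler measurement.
def encoder_flops(n: int, d: int, num_layers: int, ffn_mult: int = 4) -> float:
    attn = 2 * 4 * n * d * d + 2 * 2 * n * n * d   # Q/K/V/O projections + attention matmuls
    ffn = 2 * 2 * n * d * (ffn_mult * d)           # feed-forward: d -> 4d -> d
    return num_layers * (attn + ffn)


flops = encoder_flops(n=512, d=768, num_layers=12)
effective_tflops = 20e12   # assumed sustained GPU throughput (placeholder)
print(f"{flops / 1e9:.1f} GFLOPs, ~{flops / effective_tflops * 1e3:.1f} ms per document")
```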
Practical Latency Comparison¶
| Model | Dimension | Latency | Storage (1M docs, fp32) |
|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | ~2ms | ~1.5 GB |
| BGE-base-en-v1.5 | 768 | ~10ms | ~3.0 GB |
| E5-large-v2 | 1024 | ~50ms | ~4.0 GB |
| BGE-M3 | 1024 | ~100ms | ~4.0 GB |
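The storage column follows directly from \(N \cdot d \cdot 4\) bytes for uncompressed fp32 vectors; a quick check:

```python
# Storage estimate for a dense index: N vectors x d dims x bytes per value.
def index_size_gb(num_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    return num_vectors * dim * bytes_per_value / 1e9


for dim in (384, 768, 1024):
    print(f"d={dim}: {index_size_gb(1_000_000, dim):.1f} GB for 1M docs (fp32)")
```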
Vector Indexing Algorithms¶
This is where the “magic” of sub-linear retrieval happens. Understanding these algorithms is crucial for production systems.
IVF (Inverted File Index)¶
Algorithm Overview:
IVF partitions the vector space into \(k\) clusters via k-means, then searches only the \(nprobe\) nearest clusters.
Complexity:
| Operation | Time | Space |
|---|---|---|
| Index Construction | \(O(N \cdot d \cdot k \cdot I)\) | \(O(N \cdot d + k \cdot d)\) |
| Query (Search) | \(O(k \cdot d + nprobe \cdot N/k \cdot d)\) | \(O(d)\) |
| Query with optimal \(k \approx \sqrt{N}\) | \(O(\sqrt{N} \cdot d)\) | — |
Where:

- \(k\) = number of clusters (typically \(\sqrt{N}\) to \(4\sqrt{N}\))
- \(I\) = k-means iterations
- \(nprobe\) = clusters to search
With Product Quantization (PQ):
PQ compresses vectors by splitting each into \(m\) subvectors and quantizing each subvector to one of \(2^{nbits}\) centroids, reducing storage per vector from \(4d\) bytes (fp32) to \(m \cdot nbits / 8\) bytes.

For d=768, m=96, nbits=8: 3,072 bytes → 96 bytes per vector, a 32× compression (which is why the IVF+PQ column in the table below sits at ~0.1 GB for 1M vectors).
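As a concrete sketch, an IVF+PQ index can be built with FAISS roughly as follows (the corpus size, `nlist`, `nprobe`, and random data are illustrative placeholders, not tuned recommendations):

```python
# Sketch of an IVF+PQ index with FAISS.
import faiss
import numpy as np

d, N = 768, 100_000
nlist = int(np.sqrt(N))          # k ~ sqrt(N) coarse clusters
m, nbits = 96, 8                 # 96 subvectors, 256 centroids each -> 96 bytes/vector

xb = np.random.rand(N, d).astype("float32")

quantizer = faiss.IndexFlatL2(d)               # coarse quantizer for the IVF lists
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)
index.train(xb)                                # k-means training: O(N * d * k * I)
index.add(xb)

index.nprobe = 16                              # clusters searched per query
xq = np.random.rand(5, d).astype("float32")
distances, ids = index.search(xq, 10)          # O(k*d + nprobe * N/k * d) per query
print(ids.shape)                               # (5, 10)
```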
HNSW vs. IVF Trade-offs¶
| Metric | HNSW | IVF | IVF+PQ | Brute Force |
|---|---|---|---|---|
| Query Time | \(O(\log N)\) | \(O(\sqrt{N})\) | \(O(\sqrt{N})\) | \(O(N)\) |
| Recall | 95-99% | 85-95% | 75-90% | 100% |
| Memory (1M, d=768) | ~4.1 GB | ~3.1 GB | ~0.1 GB | ~3.0 GB |
| Build Time | Slow | Medium | Medium | None |
| Dynamic Insert | Good | Poor | Poor | N/A |
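For comparison, a sketch of building and querying an HNSW index with `hnswlib`; the `M`, `ef_construction`, and `ef` values are illustrative starting points rather than tuned recommendations:

```python
# Sketch of an HNSW index with hnswlib.
import hnswlib
import numpy as np

d, N = 768, 100_000
data = np.random.rand(N, d).astype("float32")

index = hnswlib.Index(space="cosine", dim=d)
index.init_index(max_elements=N, ef_construction=200, M=16)  # build-time/recall knobs
index.add_items(data, np.arange(N))                          # graph construction

index.set_ef(64)                    # ef_search: higher -> better recall, slower queries
query = np.random.rand(1, d).astype("float32")
labels, distances = index.knn_query(query, k=10)             # ~O(log N) per query
print(labels)
```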
Cross-Encoder Reranking¶
Mathematical Formulation¶
A cross-encoder scores each query-document pair jointly:

\[s(q, d) = W \cdot \mathrm{Encoder}_{[\mathrm{CLS}]}\left([\mathrm{CLS}]\; q \;[\mathrm{SEP}]\; d \;[\mathrm{SEP}]\right)\]

Attention Complexity:

For input length \(L = |q| + |d| + 3\) (including special tokens), each layer's self-attention over the concatenated pair costs \(O(L^2 \cdot d)\), and this full forward pass must be repeated for every candidate document.
Comparison with Bi-Encoder:
| Architecture | Scoring Complexity | Total for k candidates |
|---|---|---|
| Bi-Encoder | \(O(d)\) (dot product) | \(O(L_q \cdot d^2 + k \cdot d)\) |
| Cross-Encoder | \(O(L^2 \cdot d)\) (full attention) | \(O(k \cdot L^2 \cdot d)\) |
For k=100 candidates, L=256:
- Bi-Encoder: ~10ms (1 query encoding + 100 dot products)
- Cross-Encoder: ~5,000ms (100 full forward passes)
Speedup Ratio: for k=100, L=256, the cross-encoder is roughly 25,000× slower in theory, and roughly 500× slower in practice once batching is accounted for.
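The two-stage pattern (cheap bi-encoder scoring, expensive cross-encoder reranking over the top-k) looks roughly like the sketch below using `sentence-transformers`; the model checkpoints are common public ones, chosen for illustration only:

```python
# Sketch of bi-encoder retrieval scoring followed by cross-encoder reranking.
from sentence_transformers import CrossEncoder, SentenceTransformer

query = "how does hnsw achieve logarithmic search time"
candidates = [
    "HNSW builds a multi-layer proximity graph over the vectors ...",
    "IVF partitions the vector space into k-means clusters ...",
    "Product quantization compresses vectors into short codes ...",
]

# Bi-encoder: one query encoding + k dot products -> O(L_q * d^2 + k * d)
bi = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
q_emb = bi.encode(query, normalize_embeddings=True)
c_emb = bi.encode(candidates, normalize_embeddings=True)
print("bi-encoder scores:", c_emb @ q_emb)

# Cross-encoder: one full forward pass per (query, doc) pair -> O(k * L^2 * d)
ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
ce_scores = ce.predict([(query, c) for c in candidates])
reranked = [c for _, c in sorted(zip(ce_scores, candidates), reverse=True)]
print("top candidate after reranking:", reranked[0])
```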
LLM Generation Complexity¶
This section analyzes the generation phase, which dominates RAG latency.
Transformer Decoder Analysis¶
Prefill Phase (Prompt Processing):

Processes the entire prompt of length \(L\) in parallel; attention over the full prompt costs

\[O(L^2 \cdot d)\]

Decode Phase (Token Generation):

Generates \(n\) tokens autoregressively. Without caching, every step recomputes attention over the whole sequence, costing

\[O(n \cdot L^2 \cdot d)\]

With KV caching, each new token only attends to cached keys/values:

\[O(L \cdot d) \text{ per token} \;\Rightarrow\; O(n \cdot L \cdot d) \text{ total}\]
KV Cache Memory Analysis¶
Memory Formula:

\[M_{KV} = 2 \cdot n_{layers} \cdot n_{heads} \cdot d_{head} \cdot L \cdot b \text{ bytes}\]

where the factor of 2 accounts for keys and values and \(b\) is bytes per element (2 for fp16).

For Llama-2-7B (32 layers, 32 heads, d_head=128, fp16):

- L=2048: 1.07 GB per sequence
- L=8192: 4.29 GB per sequence
- L=32768: 17.18 GB per sequence
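A small calculator matching the formula above reproduces these numbers (model dimensions are those quoted for Llama-2-7B; fp16 is assumed):

```python
# KV-cache size calculator: 2x (keys and values) per layer, per head, per position.
def kv_cache_bytes(n_layers: int, n_heads: int, d_head: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    return 2 * n_layers * n_heads * d_head * seq_len * bytes_per_elem


for L in (2048, 8192, 32768):
    gb = kv_cache_bytes(32, 32, 128, L) / 1e9
    print(f"L={L}: {gb:.2f} GB per sequence")
```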
KV Cache Trade-off:
| Strategy | Time (200 tokens) | Memory |
|---|---|---|
| No Cache | \(O(n \cdot L^2)\) → ~40s | \(O(L \cdot d)\) |
| Full Cache | \(O(n \cdot L)\) → ~4s | \(O(n_{layers} \cdot L \cdot d)\) |
| Sliding Window | \(O(n \cdot W)\) → ~2s | \(O(n_{layers} \cdot W \cdot d)\) |
End-to-End Latency Model¶
Putting it all together, end-to-end latency is the sum of the stage latencies:

\[T_{total} = T_{embed} + T_{search} + T_{fetch} + T_{rerank} + T_{prefill} + T_{decode}\]
Typical Values (MS MARCO scale, V100 GPU):
| Component | Complexity | Latency | % of Total |
|---|---|---|---|
| Query Embedding | \(O(L_q \cdot d^2)\) | 10-20ms | 1-2% |
| Vector Search (HNSW) | \(O(\log N)\) | 5-50ms | 3-5% |
| Document Fetch | \(O(k)\) | 10-30ms | 2-3% |
| Reranking (optional) | \(O(k \cdot L^2 \cdot d)\) | 500-2000ms | 30-40% |
| LLM Prefill | \(O(L^2 \cdot d)\) | 100-300ms | 10-15% |
| LLM Decode | \(O(n \cdot L \cdot d)\) | 2000-8000ms | 40-50% |
Total: 2.5-10 seconds depending on configuration.
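A toy latency model that sums the midpoints of the ranges above (illustrative only, not a benchmark) shows how dropping the optional reranker shifts the total:

```python
# Toy end-to-end latency model using midpoints of the ranges in the table above.
STAGES_MS = {
    "query_embedding": 15,
    "vector_search_hnsw": 25,
    "document_fetch": 20,
    "reranking": 1250,      # optional stage
    "llm_prefill": 200,
    "llm_decode": 5000,
}


def total_latency_ms(use_reranker: bool = True) -> float:
    return sum(ms for stage, ms in STAGES_MS.items()
               if use_reranker or stage != "reranking")


print(f"with reranker:    {total_latency_ms(True) / 1000:.1f} s")
print(f"without reranker: {total_latency_ms(False) / 1000:.1f} s")
```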
Optimization Strategies¶
Based on the complexity analysis, here are high-impact optimizations:
For Retrieval (41-47% of latency):

- **HNSW Parameter Tuning**: Increase `ef_search` for recall, decrease for speed
- **Quantization**: IVF+PQ for memory-constrained deployments
- **Caching**: Cache frequent queries (30% latency reduction); see the sketch after this list

For Reranking (30-40% when used):

- **Distillation**: Train smaller cross-encoders (MiniLM-L6 vs BERT-base)
- **Early Exit**: Stop reranking when confidence is high
- **Batching**: Process multiple candidates in parallel

For Generation (40-50% of latency):

- **KV Caching**: Essential for any production system
- **Speculative Decoding**: 2-3× speedup with draft models
- **Quantization**: INT8/INT4 for 2-4× memory reduction
- **Context Compression**: Reduce prompt length with summarization
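As one example of the caching strategy above, a minimal query-level LRU cache in front of the retriever might look like this sketch (the `retrieve` callable, cache key, and size limit are assumptions for illustration):

```python
# Minimal query-level LRU cache wrapped around a retrieval call.
from collections import OrderedDict


class QueryCache:
    def __init__(self, max_entries: int = 10_000):
        self.store = OrderedDict()
        self.max_entries = max_entries

    def get_or_retrieve(self, query: str, retrieve):
        if query in self.store:
            self.store.move_to_end(query)        # mark as most recently used
            return self.store[query]
        results = retrieve(query)                # full vector search (+ rerank)
        self.store[query] = results
        if len(self.store) > self.max_entries:
            self.store.popitem(last=False)       # evict least recently used
        return results
```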
Scaling Laws¶
How does performance scale with corpus size?
Retrieval Scaling:

\[T_{search}(N) = O(\log N)\]

For HNSW, doubling corpus size adds only a roughly constant amount of query time (logarithmic scaling).

Memory Scaling:

\[M_{index}(N) = O(N \cdot d)\]

Linear scaling: a 10× larger corpus needs roughly 10× the memory.

Throughput Scaling:

With batching, throughput scales sub-linearly due to memory bandwidth limits:

\[\text{Throughput}(B) = \frac{B}{\alpha + \beta \cdot B}\]

Where \(\alpha\) = fixed per-batch overhead, \(\beta\) = per-query cost, and \(B\) = batch size.
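A toy numeric example of this saturation, with arbitrary placeholder values for \(\alpha\) and \(\beta\):

```python
# Toy batched-throughput model: Throughput(B) = B / (alpha + beta * B).
alpha_ms = 20.0    # fixed per-batch overhead (scheduling, kernel launches)
beta_ms = 5.0      # incremental cost per query in the batch

for B in (1, 4, 16, 64, 256):
    latency_ms = alpha_ms + beta_ms * B
    throughput_qps = B / (latency_ms / 1000)
    print(f"B={B:4d}  latency={latency_ms:7.1f} ms  throughput={throughput_qps:6.1f} qps")
# Throughput approaches 1/beta (here 200 qps) as the batch size grows.
```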
Summary¶
Important
Key Takeaways:
- Retrieval is O(log N) with HNSW—corpus size matters less than you think
- Reranking is O(k × L²)—limit candidate count k, not corpus size
- LLM generation dominates—optimize here for biggest gains
- KV caching is mandatory—10× speedup, but watch memory
- Trade-offs are unavoidable—HNSW vs IVF, accuracy vs speed
References¶
Malkov, Y. A., & Yashunin, D. A. (2018). “Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs.” IEEE TPAMI.
Johnson, J., Douze, M., & Jégou, H. (2019). “Billion-scale Similarity Search with GPUs.” IEEE TBD.
Pope, R., et al. (2023). “Efficiently Scaling Transformer Inference.” MLSys 2023.
Milvus Documentation. “RAG Pipeline Latency Analysis.” 2024.
Gao, L., et al. (2024). “Retrieval-Augmented Generation for Large Language Models: A Survey.” arXiv:2312.10997.
Next Steps¶
- See Overview of RAG and the Two-Stage Pipeline for conceptual architecture and practical configurations
- See Benchmarks and Datasets for Retrieval and Re-ranking for evaluation metrics and benchmarks
- See Stage 1: Retrieval Methods for retrieval method details
- See Stage 2: Re-ranking Methods for reranking architectures