Computational Complexity Analysis of RAG Systems¶
This section provides a rigorous analysis of the time and space complexity of each component in a Retrieval-Augmented Generation (RAG) pipeline. While Overview of RAG and the Two-Stage Pipeline covers the conceptual architecture, this page dives into the mathematical foundations that determine system performance at scale.
Why Complexity Analysis Matters¶
Understanding computational complexity is essential for:
- **Capacity Planning**: Predicting infrastructure requirements as corpus size grows
- **Bottleneck Identification**: Knowing where optimization efforts yield the highest returns
- **Architecture Selection**: Choosing between HNSW vs. IVF, cross-encoder vs. bi-encoder
- **Cost Estimation**: Translating Big-O to actual latency and compute costs
Note
Latency Distribution in Production RAG (empirical data from Milvus, 2024):
- Retrieval: 41-47% of total latency
- LLM Generation: 50-60% of total latency
- Preprocessing/Postprocessing: <5%
Document Processing and Chunking¶
Complexity Analysis¶
| Operation | Time Complexity | Space Complexity |
|---|---|---|
| Text tokenization | \(O(n)\) | \(O(n)\) |
| Fixed-size chunking | \(O(n)\) | \(O(n/c)\) chunks |
| Semantic chunking | \(O(n \cdot d^2)\) (embedding-based) | \(O(n)\) |
| Overlap handling | \(O(n \cdot o/c)\) | \(O(n \cdot (1 + o/c))\) |
Where:

- \(n\) = document length in tokens
- \(c\) = chunk size
- \(o\) = overlap size
- \(d\) = embedding dimension
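A minimal sketch of fixed-size chunking with overlap, illustrating the linear time cost and the \(1 + o/c\) storage expansion (the `chunk_tokens` helper and its defaults are illustrative, not a specific library API):

```python
# Minimal sketch of fixed-size chunking with overlap; token-level splitting
# and the function name are assumptions for illustration.
from typing import List


def chunk_tokens(tokens: List[str], c: int = 512, o: int = 64) -> List[List[str]]:
    """Split a token list into chunks of size c with o tokens of overlap.

    Time is O(n); space is O(n * (1 + o/c)) because each chunk repeats
    the last o tokens of its predecessor.
    """
    if o >= c:
        raise ValueError("overlap must be smaller than chunk size")
    stride = c - o
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + c])
        if start + c >= len(tokens):
            break
    return chunks


if __name__ == "__main__":
    toks = [f"tok{i}" for i in range(2000)]
    chunks = chunk_tokens(toks, c=512, o=64)
    # ~ n / (c - o) chunks; stored tokens expand by roughly (1 + o/c)
    print(len(chunks), sum(len(ch) for ch in chunks) / len(toks))
```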
Chunk Size Trade-offs¶
The choice of chunk size \(c\) affects both retrieval quality and storage: a document of \(n\) tokens chunked with overlap \(o\) yields roughly \(n/(c - o)\) chunks, each stored as a \(d\)-dimensional embedding, so smaller chunks give more precise matches at the cost of a larger index.
Empirical Guidelines (from NVIDIA, 2025):
- **256-512 tokens**: Best for factoid QA, precise retrieval
- **512-1024 tokens**: Balanced for general RAG
- **1024+ tokens**: Better context but noisier embeddings
Embedding Generation¶
Transformer Encoder Complexity¶
For a transformer encoder with \(L\) layers, \(d\) hidden dimension, and sequence length \(n\):
Self-Attention Complexity:

\[O(n^2 \cdot d) \text{ per layer}\]

Feed-Forward Complexity:

\[O(n \cdot d^2) \text{ per layer}\]

Total Per Layer:

\[O(n^2 \cdot d + n \cdot d^2)\]

For typical embedding models where \(n < d\) (e.g., n=512, d=768), the feed-forward term dominates, so a full forward pass costs approximately \(O(L \cdot n \cdot d^2)\).
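A rough FLOP count makes the dominance argument concrete. The sketch below counts per-layer matrix-multiply FLOPs and converts them to an approximate latency; the 20 TFLOP/s effective throughput is an assumed placeholder, not a measured figure:

```python
# Rough FLOP count for one encoder forward pass (multiply-accumulate counted
# as 2 FLOPs). Order-of-magnitude estimate, not a profiler measurement.
def encoder_flops(n: int, d: int, num_layers: int, ffn_mult: int = 4) -> float:
    attn = 2 * 4 * n * d * d + 2 * 2 * n * n * d   # Q/K/V/O projections + attention matmuls
    ffn = 2 * 2 * n * d * (ffn_mult * d)           # feed-forward: d -> 4d -> d
    return num_layers * (attn + ffn)


flops = encoder_flops(n=512, d=768, num_layers=12)
effective_tflops = 20e12   # assumed sustained GPU throughput (placeholder)
print(f"{flops / 1e9:.1f} GFLOPs, ~{flops / effective_tflops * 1e3:.1f} ms per document")
```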
Practical Latency Comparison¶
| Model | Dimension | Latency | Storage (1M docs, fp32) |
|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | ~2ms | ~1.5 GB |
| BGE-base-en-v1.5 | 768 | ~10ms | ~3.0 GB |
| E5-large-v2 | 1024 | ~50ms | ~4.0 GB |
| BGE-M3 | 1024 | ~100ms | ~4.0 GB |
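The storage column follows directly from \(N \cdot d \cdot 4\) bytes for uncompressed fp32 vectors; a quick check:

```python
# Storage estimate for a dense index: N vectors x d dims x bytes per value.
def index_size_gb(num_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    return num_vectors * dim * bytes_per_value / 1e9


for dim in (384, 768, 1024):
    print(f"d={dim}: {index_size_gb(1_000_000, dim):.1f} GB for 1M docs (fp32)")
```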
Vector Indexing Algorithms¶
This is where the “magic” of sub-linear retrieval happens. Understanding these algorithms is crucial for production systems.
IVF (Inverted File Index)¶
Algorithm Overview:
IVF partitions the vector space into \(k\) clusters via k-means, then searches only the \(nprobe\) nearest clusters.
Complexity:
| Operation | Time | Space |
|---|---|---|
| Index Construction | \(O(N \cdot d \cdot k \cdot I)\) | \(O(N \cdot d + k \cdot d)\) |
| Query (Search) | \(O(k \cdot d + nprobe \cdot N/k \cdot d)\) | \(O(d)\) |
| Query with optimal \(k \approx \sqrt{N}\) | \(O(\sqrt{N} \cdot d)\) | — |
Where:

- \(k\) = number of clusters (typically \(\sqrt{N}\) to \(4\sqrt{N}\))
- \(I\) = k-means iterations
- \(nprobe\) = clusters to search
With Product Quantization (PQ):
PQ compresses vectors by splitting each into \(m\) subvectors and quantizing each subvector to one of \(2^{nbits}\) centroids, reducing storage per vector from \(4d\) bytes (fp32) to \(m \cdot nbits / 8\) bytes.

For d=768, m=96, nbits=8: 3,072 bytes → 96 bytes per vector, a 32× compression (which is why the IVF+PQ column in the table below sits at ~0.1 GB for 1M vectors).
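As a concrete sketch, an IVF+PQ index can be built with FAISS roughly as follows (the corpus size, `nlist`, `nprobe`, and random data are illustrative placeholders, not tuned recommendations):

```python
# Sketch of an IVF+PQ index with FAISS.
import faiss
import numpy as np

d, N = 768, 100_000
nlist = int(np.sqrt(N))          # k ~ sqrt(N) coarse clusters
m, nbits = 96, 8                 # 96 subvectors, 256 centroids each -> 96 bytes/vector

xb = np.random.rand(N, d).astype("float32")

quantizer = faiss.IndexFlatL2(d)               # coarse quantizer for the IVF lists
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)
index.train(xb)                                # k-means training: O(N * d * k * I)
index.add(xb)

index.nprobe = 16                              # clusters searched per query
xq = np.random.rand(5, d).astype("float32")
distances, ids = index.search(xq, 10)          # O(k*d + nprobe * N/k * d) per query
print(ids.shape)                               # (5, 10)
```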
HNSW vs. IVF Trade-offs¶
| Metric | HNSW | IVF | IVF+PQ | Brute Force |
|---|---|---|---|---|
| Query Time | \(O(\log N)\) | \(O(\sqrt{N})\) | \(O(\sqrt{N})\) | \(O(N)\) |
| Recall | 95-99% | 85-95% | 75-90% | 100% |
| Memory (1M, d=768) | ~4.1 GB | ~3.1 GB | ~0.1 GB | ~3.0 GB |
| Build Time | Slow | Medium | Medium | None |
| Dynamic Insert | Good | Poor | Poor | N/A |
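For comparison, a sketch of building and querying an HNSW index with `hnswlib`; the `M`, `ef_construction`, and `ef` values are illustrative starting points rather than tuned recommendations:

```python
# Sketch of an HNSW index with hnswlib.
import hnswlib
import numpy as np

d, N = 768, 100_000
data = np.random.rand(N, d).astype("float32")

index = hnswlib.Index(space="cosine", dim=d)
index.init_index(max_elements=N, ef_construction=200, M=16)  # build-time/recall knobs
index.add_items(data, np.arange(N))                          # graph construction

index.set_ef(64)                    # ef_search: higher -> better recall, slower queries
query = np.random.rand(1, d).astype("float32")
labels, distances = index.knn_query(query, k=10)             # ~O(log N) per query
print(labels)
```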
Cross-Encoder Reranking¶
Mathematical Formulation¶
A cross-encoder scores each query-document pair jointly:

\[s(q, d) = W \cdot \mathrm{Encoder}_{[\mathrm{CLS}]}\left([\mathrm{CLS}]\; q \;[\mathrm{SEP}]\; d \;[\mathrm{SEP}]\right)\]

Attention Complexity:

For input length \(L = |q| + |d| + 3\) (including special tokens), each layer's self-attention over the concatenated pair costs \(O(L^2 \cdot d)\), and this full forward pass must be repeated for every candidate document.
Comparison with Bi-Encoder:
| Architecture | Scoring Complexity | Total for k candidates |
|---|---|---|
| Bi-Encoder | \(O(d)\) (dot product) | \(O(L_q \cdot d^2 + k \cdot d)\) |
| Cross-Encoder | \(O(L^2 \cdot d)\) (full attention) | \(O(k \cdot L^2 \cdot d)\) |
For k=100 candidates, L=256:
- Bi-Encoder: ~10ms (1 query encoding + 100 dot products)
- Cross-Encoder: ~5,000ms (100 full forward passes)
Speedup Ratio: for k=100, L=256, the cross-encoder is roughly 25,000× slower in theory, and roughly 500× slower in practice once batching is accounted for.
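The two-stage pattern (cheap bi-encoder scoring, expensive cross-encoder reranking over the top-k) looks roughly like the sketch below using `sentence-transformers`; the model checkpoints are common public ones, chosen for illustration only:

```python
# Sketch of bi-encoder retrieval scoring followed by cross-encoder reranking.
from sentence_transformers import CrossEncoder, SentenceTransformer

query = "how does hnsw achieve logarithmic search time"
candidates = [
    "HNSW builds a multi-layer proximity graph over the vectors ...",
    "IVF partitions the vector space into k-means clusters ...",
    "Product quantization compresses vectors into short codes ...",
]

# Bi-encoder: one query encoding + k dot products -> O(L_q * d^2 + k * d)
bi = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
q_emb = bi.encode(query, normalize_embeddings=True)
c_emb = bi.encode(candidates, normalize_embeddings=True)
print("bi-encoder scores:", c_emb @ q_emb)

# Cross-encoder: one full forward pass per (query, doc) pair -> O(k * L^2 * d)
ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
ce_scores = ce.predict([(query, c) for c in candidates])
reranked = [c for _, c in sorted(zip(ce_scores, candidates), reverse=True)]
print("top candidate after reranking:", reranked[0])
```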
LLM Generation Complexity¶
This section analyzes the generation phase, which dominates RAG latency.
Transformer Decoder Analysis¶
Prefill Phase (Prompt Processing):

Processes the entire prompt of length \(L\) in parallel; attention over the full prompt costs

\[O(L^2 \cdot d)\]

Decode Phase (Token Generation):

Generates \(n\) tokens autoregressively. Without caching, every step recomputes attention over the whole sequence, costing

\[O(n \cdot L^2 \cdot d)\]

With KV caching, each new token only attends to cached keys/values:

\[O(L \cdot d) \text{ per token} \;\Rightarrow\; O(n \cdot L \cdot d) \text{ total}\]
KV Cache Memory Analysis¶
Memory Formula:

\[M_{KV} = 2 \cdot n_{layers} \cdot n_{heads} \cdot d_{head} \cdot L \cdot b \text{ bytes}\]

where the factor of 2 accounts for keys and values and \(b\) is bytes per element (2 for fp16).

For Llama-2-7B (32 layers, 32 heads, d_head=128, fp16):

- L=2048: 1.07 GB per sequence
- L=8192: 4.29 GB per sequence
- L=32768: 17.18 GB per sequence
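A small calculator matching the formula above reproduces these numbers (model dimensions are those quoted for Llama-2-7B; fp16 is assumed):

```python
# KV-cache size calculator: 2x (keys and values) per layer, per head, per position.
def kv_cache_bytes(n_layers: int, n_heads: int, d_head: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    return 2 * n_layers * n_heads * d_head * seq_len * bytes_per_elem


for L in (2048, 8192, 32768):
    gb = kv_cache_bytes(32, 32, 128, L) / 1e9
    print(f"L={L}: {gb:.2f} GB per sequence")
```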
KV Cache Trade-off:
| Strategy | Time (200 tokens) | Memory |
|---|---|---|
| No Cache | \(O(n \cdot L^2)\) → ~40s | \(O(L \cdot d)\) |
| Full Cache | \(O(n \cdot L)\) → ~4s | \(O(n_{layers} \cdot L \cdot d)\) |
| Sliding Window | \(O(n \cdot W)\) → ~2s | \(O(n_{layers} \cdot W \cdot d)\) |
End-to-End Latency Model¶
Putting it all together, end-to-end latency is the sum of the stage latencies:

\[T_{total} = T_{embed} + T_{search} + T_{fetch} + T_{rerank} + T_{prefill} + T_{decode}\]
Typical Values (MS MARCO scale, V100 GPU):
| Component | Complexity | Latency | % of Total |
|---|---|---|---|
| Query Embedding | \(O(L_q \cdot d^2)\) | 10-20ms | 1-2% |
| Vector Search (HNSW) | \(O(\log N)\) | 5-50ms | 3-5% |
| Document Fetch | \(O(k)\) | 10-30ms | 2-3% |
| Reranking (optional) | \(O(k \cdot L^2 \cdot d)\) | 500-2000ms | 30-40% |
| LLM Prefill | \(O(L^2 \cdot d)\) | 100-300ms | 10-15% |
| LLM Decode | \(O(n \cdot L \cdot d)\) | 2000-8000ms | 40-50% |
Total: 2.5-10 seconds depending on configuration.
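A toy latency model that sums the midpoints of the ranges above (illustrative only, not a benchmark) shows how dropping the optional reranker shifts the total:

```python
# Toy end-to-end latency model using midpoints of the ranges in the table above.
STAGES_MS = {
    "query_embedding": 15,
    "vector_search_hnsw": 25,
    "document_fetch": 20,
    "reranking": 1250,      # optional stage
    "llm_prefill": 200,
    "llm_decode": 5000,
}


def total_latency_ms(use_reranker: bool = True) -> float:
    return sum(ms for stage, ms in STAGES_MS.items()
               if use_reranker or stage != "reranking")


print(f"with reranker:    {total_latency_ms(True) / 1000:.1f} s")
print(f"without reranker: {total_latency_ms(False) / 1000:.1f} s")
```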
Optimization Strategies¶
Based on the complexity analysis, here are high-impact optimizations:
For Retrieval (41-47% of latency):

- **HNSW Parameter Tuning**: Increase `ef_search` for recall, decrease for speed
- **Quantization**: IVF+PQ for memory-constrained deployments
- **Caching**: Cache frequent queries (30% latency reduction); see the sketch after this list

For Reranking (30-40% when used):

- **Distillation**: Train smaller cross-encoders (MiniLM-L6 vs BERT-base)
- **Early Exit**: Stop reranking when confidence is high
- **Batching**: Process multiple candidates in parallel

For Generation (40-50% of latency):

- **KV Caching**: Essential for any production system
- **Speculative Decoding**: 2-3× speedup with draft models
- **Quantization**: INT8/INT4 for 2-4× memory reduction
- **Context Compression**: Reduce prompt length with summarization
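As one example of the caching strategy above, a minimal query-level LRU cache in front of the retriever might look like this sketch (the `retrieve` callable, cache key, and size limit are assumptions for illustration):

```python
# Minimal query-level LRU cache wrapped around a retrieval call.
from collections import OrderedDict


class QueryCache:
    def __init__(self, max_entries: int = 10_000):
        self.store = OrderedDict()
        self.max_entries = max_entries

    def get_or_retrieve(self, query: str, retrieve):
        if query in self.store:
            self.store.move_to_end(query)        # mark as most recently used
            return self.store[query]
        results = retrieve(query)                # full vector search (+ rerank)
        self.store[query] = results
        if len(self.store) > self.max_entries:
            self.store.popitem(last=False)       # evict least recently used
        return results
```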
Scaling Laws¶
How does performance scale with corpus size?
Retrieval Scaling:

\[T_{search}(N) = O(\log N)\]

For HNSW, doubling corpus size adds only a roughly constant amount of query time (logarithmic scaling).

Memory Scaling:

\[M_{index}(N) = O(N \cdot d)\]

Linear scaling: a 10× larger corpus needs roughly 10× the memory.

Throughput Scaling:

With batching, throughput scales sub-linearly due to memory bandwidth limits:

\[\text{Throughput}(B) = \frac{B}{\alpha + \beta \cdot B}\]

Where \(\alpha\) = fixed per-batch overhead, \(\beta\) = per-query cost, and \(B\) = batch size.
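A toy numeric example of this saturation, with arbitrary placeholder values for \(\alpha\) and \(\beta\):

```python
# Toy batched-throughput model: Throughput(B) = B / (alpha + beta * B).
alpha_ms = 20.0    # fixed per-batch overhead (scheduling, kernel launches)
beta_ms = 5.0      # incremental cost per query in the batch

for B in (1, 4, 16, 64, 256):
    latency_ms = alpha_ms + beta_ms * B
    throughput_qps = B / (latency_ms / 1000)
    print(f"B={B:4d}  latency={latency_ms:7.1f} ms  throughput={throughput_qps:6.1f} qps")
# Throughput approaches 1/beta (here 200 qps) as the batch size grows.
```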
Summary¶
Important
Key Takeaways:
- Retrieval is O(log N) with HNSW—corpus size matters less than you think
- Reranking is O(k × L²)—limit candidate count k, not corpus size
- LLM generation dominates—optimize here for biggest gains
- KV caching is mandatory—10× speedup, but watch memory
- Trade-offs are unavoidable—HNSW vs IVF, accuracy vs speed
References¶
Malkov, Y. A., & Yashunin, D. A. (2018). “Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs.” IEEE TPAMI.
Johnson, J., Douze, M., & Jégou, H. (2019). “Billion-scale Similarity Search with GPUs.” IEEE TBD.
Pope, R., et al. (2023). “Efficiently Scaling Transformer Inference.” MLSys 2023.
Milvus Documentation. “RAG Pipeline Latency Analysis.” 2024.
Gao, L., et al. (2024). “Retrieval-Augmented Generation for Large Language Models: A Survey.” arXiv:2312.10997.
Next Steps¶
- See Overview of RAG and the Two-Stage Pipeline for conceptual architecture and practical configurations
- See Benchmarks and Datasets for Retrieval and Re-ranking for evaluation metrics and benchmarks
- See Stage 1: Retrieval Methods for retrieval method details
- See Stage 2: Re-ranking Methods for reranking architectures