Computational Complexity Analysis of RAG Systems

This section provides a rigorous analysis of the time and space complexity of each component in a Retrieval-Augmented Generation (RAG) pipeline. While "Overview of RAG and the Two-Stage Pipeline" covers the conceptual architecture, this page dives into the mathematical foundations that determine system performance at scale.

Why Complexity Analysis Matters

Understanding computational complexity is essential for:

  1. Capacity Planning: Predicting infrastructure requirements as corpus size grows

  2. Bottleneck Identification: Knowing where optimization efforts yield the highest returns

  3. Architecture Selection: Choosing between HNSW vs. IVF, cross-encoder vs. bi-encoder

  4. Cost Estimation: Translating Big-O to actual latency and compute costs

Note

Latency Distribution in Production RAG (empirical data from Milvus, 2024):

  • Retrieval: 41-47% of total latency

  • LLM Generation: 50-60% of total latency

  • Preprocessing/Postprocessing: <5%

Document Processing and Chunking

Complexity Analysis

| Operation | Time Complexity | Space Complexity |
| --- | --- | --- |
| Text tokenization | \(O(n)\) | \(O(n)\) |
| Fixed-size chunking | \(O(n)\) | \(O(n/c)\) chunks |
| Semantic chunking | \(O(n \cdot d^2)\) (embedding-based) | \(O(n)\) |
| Overlap handling | \(O(n \cdot o/c)\) | \(O(n \cdot (1 + o/c))\) |

Where:

  • \(n\) = document length in tokens

  • \(c\) = chunk size

  • \(o\) = overlap size

  • \(d\) = embedding dimension

Chunk Size Trade-offs

The choice of chunk size \(c\) affects both retrieval quality and storage:

\[\text{Storage} = N_{chunks} \times d_{emb} \times 4 \text{ bytes} = \frac{n}{c - o} \times d_{emb} \times 4\]
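As a quick sanity check, the sketch below plugs illustrative values into this formula, assuming float32 embeddings at 4 bytes per dimension; the document length, chunk size, overlap, and dimension are made up for the example.

```python
# Rough storage estimate for chunk embeddings, following the formula above.
def embedding_storage_bytes(n_tokens: int, chunk_size: int, overlap: int, d_emb: int) -> int:
    """Bytes needed to store float32 embeddings for one document's chunks."""
    n_chunks = -(-n_tokens // (chunk_size - overlap))  # ceiling of n / (c - o)
    return n_chunks * d_emb * 4

# Illustrative values: 100k tokens, 512-token chunks, 64-token overlap, 768-dim embeddings.
print(embedding_storage_bytes(100_000, 512, 64, 768) / 1e6, "MB")  # ~0.69 MB (224 chunks)
```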

Empirical Guidelines (from NVIDIA, 2025):

  • 256-512 tokens: Best for factoid QA, precise retrieval

  • 512-1024 tokens: Balanced for general RAG

  • 1024+ tokens: Better context but noisier embeddings

Embedding Generation

Transformer Encoder Complexity

For a transformer encoder with \(L\) layers, \(d\) hidden dimension, and sequence length \(n\):

Self-Attention Complexity:

\[\text{Time}_{attention} = O(n^2 \cdot d)\]

Feed-Forward Complexity:

\[\text{Time}_{FFN} = O(n \cdot d^2)\]

Total Per Layer:

\[\text{Time}_{layer} = O(n^2 \cdot d + n \cdot d^2)\]

For typical embedding models where \(n < d\) (e.g., n=512, d=768), the feed-forward term \(n \cdot d^2\) dominates the attention term \(n^2 \cdot d\), so:

\[\text{Time}_{total} \approx O(L \cdot n \cdot d^2)\]
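The following back-of-the-envelope sketch, using the example values above, confirms that the feed-forward term dominates in this regime:

```python
# Per-layer cost terms for a transformer encoder (constant factors dropped).
n, d, n_layers = 512, 768, 12  # sequence length, hidden dim, layers (BERT-base-like)

attention = n ** 2 * d   # ~2.0e8  (self-attention term)
ffn = n * d ** 2         # ~3.0e8  (feed-forward term, dominant since n < d)

print(f"attention / FFN per layer: {attention / ffn:.2f}")          # ~0.67
print(f"total operations (all layers): {n_layers * (attention + ffn):.2e}")
```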

Practical Latency Comparison

Embedding Model Latencies (CPU, 100 tokens)

| Model | Dimension | Latency | Storage (1M docs) |
| --- | --- | --- | --- |
| all-MiniLM-L6-v2 | 384 | ~2ms | ~1.5 GB |
| BGE-base-en-v1.5 | 768 | ~10ms | ~3.0 GB |
| E5-large-v2 | 1024 | ~50ms | ~4.0 GB |
| BGE-M3 | 1024 | ~100ms | ~4.0 GB |
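To reproduce rough per-passage latencies like those above, a minimal sketch using the sentence-transformers package (an assumption; any embedding library works similarly), with actual numbers depending heavily on hardware:

```python
import time
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")       # 384-dim model from the table
passages = ["a roughly 100-token passage ..."] * 32   # placeholder texts

start = time.perf_counter()
for p in passages:
    model.encode(p)                                   # unbatched, one forward pass each
per_passage_ms = (time.perf_counter() - start) * 1000 / len(passages)
print(f"~{per_passage_ms:.1f} ms per passage")
```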

Vector Indexing Algorithms

This is where the “magic” of sub-linear retrieval happens. Understanding these algorithms is crucial for production systems.

HNSW (Hierarchical Navigable Small World)

Algorithm Overview:

HNSW constructs a multi-layer graph where:

  • Layer 0 contains all vectors

  • Higher layers contain exponentially fewer vectors (skip-list structure)

  • Each vector connects to \(M\) neighbors per layer

Complexity:

| Operation | Time | Space |
| --- | --- | --- |
| Index Construction | \(O(N \cdot \log N \cdot M)\) | \(O(N \cdot d + N \cdot M \cdot L_{max})\) |
| Query (Search) | \(O(\log N \cdot M \cdot ef)\) | \(O(ef)\) |
| Insert (Single) | \(O(\log N \cdot M)\) | \(O(d + M \cdot L_{max})\) |

Where:

  • \(N\) = number of vectors

  • \(M\) = connections per node (typically 16-64)

  • \(ef\) = search beam width (controls recall/speed trade-off)

  • \(L_{max}\) = maximum layer (typically \(\log N\))

Memory Formula:

\[\text{Memory}_{HNSW} = N \times (d \times 4 + M \times L_{avg} \times 8) \text{ bytes}\]

For 1M vectors with d=768, M=32, and \(L_{avg} = 4\):

\[\text{Memory} = 10^6 \times (768 \times 4 + 32 \times 4 \times 8) \approx 4.1 \text{ GB}\]
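A minimal HNSW sketch using the hnswlib package (an assumption; any HNSW implementation exposes equivalent knobs), with M and ef matching the symbols above:

```python
import numpy as np
import hnswlib

N, d = 100_000, 768
vectors = np.random.rand(N, d).astype(np.float32)        # stand-in corpus embeddings

index = hnswlib.Index(space="cosine", dim=d)
index.init_index(max_elements=N, M=32, ef_construction=200)  # build: O(N log N * M)
index.add_items(vectors, np.arange(N))

index.set_ef(64)                                          # search beam width ef
labels, distances = index.knn_query(vectors[:1], k=10)    # query: O(log N * M * ef)
```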

IVF (Inverted File Index)

Algorithm Overview:

IVF partitions the vector space into \(k\) clusters via k-means, then searches only the \(nprobe\) nearest clusters.

Complexity:

| Operation | Time | Space |
| --- | --- | --- |
| Index Construction | \(O(N \cdot d \cdot k \cdot I)\) | \(O(N \cdot d + k \cdot d)\) |
| Query (Search) | \(O(k \cdot d + nprobe \cdot N/k \cdot d)\) | \(O(d)\) |
| Query (optimal \(k \approx \sqrt{N}\)) | \(O(\sqrt{N} \cdot d)\) | \(O(d)\) |

Where:

  • \(k\) = number of clusters (typically \(\sqrt{N}\) to \(4\sqrt{N}\))

  • \(I\) = k-means iterations

  • \(nprobe\) = clusters to search

With Product Quantization (PQ):

PQ compresses vectors by splitting into \(m\) subvectors and quantizing each to \(2^{nbits}\) centroids:

\[\text{Compression Ratio} = \frac{d \times 32}{m \times nbits}\]

For d=768, m=96, nbits=8:

\[\text{Compression} = \frac{768 \times 32}{96 \times 8} = 32\times\]
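A minimal IVF+PQ sketch with the FAISS library (an assumption; Milvus and other vector databases expose the same parameters), where nlist, m, and nbits play the roles of k, m, and nbits above:

```python
import numpy as np
import faiss

N, d = 100_000, 768
vectors = np.random.rand(N, d).astype(np.float32)

nlist, m, nbits = 1024, 96, 8                  # clusters, subvectors, bits -> 32x compression
quantizer = faiss.IndexFlatL2(d)               # coarse quantizer holding cluster centroids
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

index.train(vectors)                           # k-means clustering: O(N * d * k * I)
index.add(vectors)
index.nprobe = 16                              # clusters visited per query
distances, labels = index.search(vectors[:1], 10)
```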

HNSW vs. IVF Trade-offs

| Metric | HNSW | IVF | IVF+PQ | Brute Force |
| --- | --- | --- | --- | --- |
| Query Time | \(O(\log N)\) | \(O(\sqrt{N})\) | \(O(\sqrt{N})\) | \(O(N)\) |
| Recall@10 | 95-99% | 85-95% | 75-90% | 100% |
| Memory (1M, d=768) | ~4.1 GB | ~3.1 GB | ~0.1 GB | ~3.0 GB |
| Build Time | Slow | Medium | Medium | None |
| Dynamic Insert | Good | Poor | Poor | N/A |

Cross-Encoder Reranking

Mathematical Formulation

A cross-encoder computes:

\[s(q, d) = \text{MLP}(\text{BERT}([CLS] \oplus q \oplus [SEP] \oplus d \oplus [SEP]))\]

Attention Complexity:

For input length \(L = |q| + |d| + 3\) (including special tokens):

\[\text{Time}_{cross-encoder} = O(L^2 \cdot d + L \cdot d^2)\]

Comparison with Bi-Encoder:

| Architecture | Scoring Complexity | Total for k candidates |
| --- | --- | --- |
| Bi-Encoder | \(O(d)\) (dot product) | \(O(L_q \cdot d^2 + k \cdot d)\) |
| Cross-Encoder | \(O(L^2 \cdot d)\) (full attention) | \(O(k \cdot L^2 \cdot d)\) |

For k=100 candidates, L=256:

  • Bi-Encoder: ~10ms (1 query encoding + 100 dot products)

  • Cross-Encoder: ~5,000ms (100 full forward passes)

Speedup Ratio:

Taking the totals above (and keeping both the attention and feed-forward terms for the cross-encoder):

\[\frac{\text{Time}_{cross}}{\text{Time}_{bi}} = \frac{k \cdot (L^2 \cdot d + L \cdot d^2)}{L_q \cdot d^2 + k \cdot d} \approx k \cdot \frac{L}{L_q}\]

For k=100, L=256, \(L_q \approx 32\): roughly 800-1,000× slower in theory, and ~500× in practice due to batching.
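A minimal reranking sketch with the sentence-transformers CrossEncoder class (the package and checkpoint name are assumptions, not prescribed by this page); each query-document pair costs one full forward pass, which is why the candidate count k must stay small:

```python
from sentence_transformers import CrossEncoder

query = "how does HNSW achieve logarithmic search time?"
candidates = [
    "HNSW builds a multi-layer graph with a skip-list-like structure ...",
    "IVF partitions the vector space into clusters via k-means ...",
]  # top-k passages from the first-stage retriever

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")   # illustrative checkpoint
scores = reranker.predict([(query, doc) for doc in candidates])   # one forward pass per pair
reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
```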

LLM Generation Complexity

This section analyzes the generation phase, which dominates RAG latency.

Transformer Decoder Analysis

Prefill Phase (Prompt Processing):

Processes the entire prompt in parallel:

\[\text{Time}_{prefill} = O(n_{layers} \cdot L_{prompt}^2 \cdot d)\]

Decode Phase (Token Generation):

Generates tokens autoregressively; without caching, each step re-processes the entire sequence of length \(L_{total}\):

\[\text{Time}_{decode} = O(n_{layers} \cdot n_{tokens} \cdot L_{total}^2 \cdot d)\]

With KV caching, each new token only attends to cached keys/values:

\[\text{Time}_{decode}^{cached} = O(n_{layers} \cdot n_{tokens} \cdot L_{total} \cdot d)\]

KV Cache Memory Analysis

Memory Formula:

\[\text{Memory}_{KV} = 2 \times n_{layers} \times n_{heads} \times L \times d_{head} \times \text{precision}\]

For Llama-2-7B (32 layers, 32 heads, d_head=128, fp16):

\[\text{Memory}_{KV} = 2 \times 32 \times 32 \times L \times 128 \times 2 = 524,288 \times L \text{ bytes}\]

  • L=2048: 1.07 GB per sequence

  • L=8192: 4.29 GB per sequence

  • L=32768: 17.18 GB per sequence
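The per-sequence figures above follow directly from the formula; a small helper makes it easy to plug in other model shapes (the Llama-2-7B shapes come from this page, any other values are the reader's assumptions):

```python
def kv_cache_bytes(n_layers: int, n_heads: int, seq_len: int,
                   d_head: int, bytes_per_value: int = 2) -> int:
    """KV cache per sequence: keys and values for every layer, head, and position (fp16 default)."""
    return 2 * n_layers * n_heads * seq_len * d_head * bytes_per_value

for L in (2048, 8192, 32768):
    print(f"L={L}: {kv_cache_bytes(32, 32, L, 128) / 1e9:.2f} GB")  # Llama-2-7B shapes
# L=2048: 1.07 GB, L=8192: 4.29 GB, L=32768: 17.18 GB
```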

KV Cache Trade-off:

| Strategy | Time (200 tokens) | Memory |
| --- | --- | --- |
| No Cache | \(O(n \cdot L^2)\) → ~40s | \(O(L \cdot d)\) |
| Full Cache | \(O(n \cdot L)\) → ~4s | \(O(n_{layers} \cdot L \cdot d)\) |
| Sliding Window | \(O(n \cdot W)\) → ~2s | \(O(n_{layers} \cdot W \cdot d)\) |

End-to-End Latency Model

Putting it all together:

\[T_{total} = T_{embed} + T_{search} + T_{rerank} + T_{prefill} + T_{decode}\]

Typical Values (MS MARCO scale, V100 GPU):

| Component | Complexity | Latency | % of Total |
| --- | --- | --- | --- |
| Query Embedding | \(O(L_q \cdot d^2)\) | 10-20ms | 1-2% |
| Vector Search (HNSW) | \(O(\log N)\) | 5-50ms | 3-5% |
| Document Fetch | \(O(k)\) | 10-30ms | 2-3% |
| Reranking (optional) | \(O(k \cdot L^2 \cdot d)\) | 500-2000ms | 30-40% |
| LLM Prefill | \(O(L^2 \cdot d)\) | 100-300ms | 10-15% |
| LLM Decode | \(O(n \cdot L \cdot d)\) | 2000-8000ms | 40-50% |

Total: 2.5-10 seconds depending on configuration.
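A toy latency model built from rough midpoints of the ranges in the table (illustrative numbers, not measurements) shows how the total shifts when the optional reranker is dropped:

```python
# Rough midpoints (ms) of the latency ranges above; all values are illustrative.
LATENCY_MS = {
    "query_embedding": 15,
    "vector_search": 25,
    "document_fetch": 20,
    "reranking": 1200,     # optional stage
    "llm_prefill": 200,
    "llm_decode": 5000,
}

def total_latency_s(use_reranker: bool = True) -> float:
    return sum(v for stage, v in LATENCY_MS.items()
               if use_reranker or stage != "reranking") / 1000

print(f"with reranking:    {total_latency_s(True):.1f} s")   # ~6.5 s
print(f"without reranking: {total_latency_s(False):.1f} s")  # ~5.3 s
```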

Optimization Strategies

Based on the complexity analysis, here are high-impact optimizations:

For Retrieval (41-47% of latency):

  1. HNSW Parameter Tuning: Increase ef_search for recall, decrease for speed

  2. Quantization: IVF+PQ for memory-constrained deployments

  3. Caching: Cache frequent queries (30% latency reduction)

For Reranking (30-40% when used):

  1. Distillation: Train smaller cross-encoders (MiniLM-L6 vs BERT-base)

  2. Early Exit: Stop reranking when confidence is high

  3. Batching: Process multiple candidates in parallel

For Generation (40-50% of latency):

  1. KV Caching: Essential for any production system

  2. Speculative Decoding: 2-3× speedup with draft models

  3. Quantization: INT8/INT4 for 2-4× memory reduction

  4. Context Compression: Reduce prompt length with summarization

Scaling Laws

How does performance scale with corpus size?

Retrieval Scaling:

\[T_{retrieval}(N) = c_1 \cdot \log N + c_2\]

For HNSW, doubling corpus size adds ~constant time (logarithmic scaling).

Memory Scaling:

\[M(N) = N \cdot (d \cdot 4 + \text{index overhead})\]

Linear scaling—10× corpus = 10× memory.

Throughput Scaling:

With batching, throughput scales sub-linearly due to memory bandwidth limits:

\[\text{Throughput}(batch) = \frac{batch}{\alpha + \beta \cdot batch}\]

Where \(\alpha\) = fixed overhead, \(\beta\) = per-query cost.
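The saturation point is \(1/\beta\): as the batch grows, throughput approaches the memory-bandwidth-bound ceiling. A quick illustration with made-up overhead values:

```python
alpha, beta = 50.0, 5.0   # ms fixed overhead per batch, ms per query (illustrative)

for batch in (1, 8, 32, 128):
    qps = batch / (alpha + beta * batch) * 1000   # queries per second
    print(f"batch={batch:<4d} -> {qps:.0f} queries/s")
# Approaches 1/beta = 200 queries/s as batch size grows
```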

Summary

Important

Key Takeaways:

  1. Retrieval is O(log N) with HNSW—corpus size matters less than you think

  2. Reranking is O(k × L²)—limit candidate count k, not corpus size

  3. LLM generation dominates—optimize here for biggest gains

  4. KV caching is mandatory—10× speedup, but watch memory

  5. Trade-offs are unavoidable—HNSW vs IVF, accuracy vs speed

References

  1. Malkov, Y. A., & Yashunin, D. A. (2018). “Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs.” IEEE TPAMI.

  2. Johnson, J., Douze, M., & Jégou, H. (2019). “Billion-scale Similarity Search with GPUs.” IEEE TBD.

  3. Pope, R., et al. (2023). “Efficiently Scaling Transformer Inference.” MLSys 2023.

  4. Milvus Documentation. “RAG Pipeline Latency Analysis.” 2024.

  5. Gao, L., et al. (2024). “Retrieval-Augmented Generation for Large Language Models: A Survey.” arXiv:2312.10997.

Next Steps