Computational Complexity Analysis of RAG Systems
=================================================

This section provides a rigorous analysis of the time and space complexity of
each component in a Retrieval-Augmented Generation (RAG) pipeline. While
:doc:`rag_overview` covers the conceptual architecture, this page dives into
the mathematical foundations that determine system performance at scale.

.. contents:: Table of Contents
   :local:
   :depth: 2

Why Complexity Analysis Matters
-------------------------------

Understanding computational complexity is essential for:

1. **Capacity Planning**: Predicting infrastructure requirements as corpus size grows
2. **Bottleneck Identification**: Knowing where optimization efforts yield the highest returns
3. **Architecture Selection**: Choosing between HNSW vs. IVF, cross-encoder vs. bi-encoder
4. **Cost Estimation**: Translating Big-O to actual latency and compute costs

.. note::

   **Latency Distribution in Production RAG** (empirical data from Milvus, 2024):

   * Retrieval: 41-47% of total latency
   * LLM Generation: 50-60% of total latency
   * Preprocessing/Postprocessing: <5%

Document Processing and Chunking
--------------------------------

Complexity Analysis
^^^^^^^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1
   :widths: 30 35 35

   * - Operation
     - Time Complexity
     - Space Complexity
   * - Text tokenization
     - :math:`O(n)`
     - :math:`O(n)`
   * - Fixed-size chunking
     - :math:`O(n)`
     - :math:`O(n/c)` chunks
   * - Semantic chunking
     - :math:`O(n \cdot d^2)` (embedding-based)
     - :math:`O(n)`
   * - Overlap handling
     - :math:`O(n \cdot o/c)`
     - :math:`O(n \cdot (1 + o/c))`

Where:

* :math:`n` = document length in tokens
* :math:`c` = chunk size
* :math:`o` = overlap size
* :math:`d` = embedding dimension

Chunk Size Trade-offs
^^^^^^^^^^^^^^^^^^^^^

The choice of chunk size :math:`c` affects both retrieval quality and storage:

.. math::

   \text{Storage} = N_{chunks} \times d_{emb} \times 4 \text{ bytes} = \frac{n}{c - o} \times d_{emb} \times 4

**Empirical Guidelines** (from NVIDIA, 2025):

* **256-512 tokens**: Best for factoid QA, precise retrieval
* **512-1024 tokens**: Balanced for general RAG
* **1024+ tokens**: Better context but noisier embeddings

Embedding Generation
--------------------

Transformer Encoder Complexity
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For a transformer encoder with :math:`L` layers, :math:`d` hidden dimension, and
sequence length :math:`n`:

**Self-Attention Complexity:**

.. math::

   \text{Time}_{attention} = O(n^2 \cdot d)

**Feed-Forward Complexity:**

.. math::

   \text{Time}_{FFN} = O(n \cdot d^2)

**Total Per Layer:**

.. math::

   \text{Time}_{layer} = O(n^2 \cdot d + n \cdot d^2)

For typical embedding models where :math:`n < d` (e.g., n=512, d=768), the
feed-forward term dominates:

.. math::

   \text{Time}_{total} \approx O(L \cdot n \cdot d^2)

Practical Latency Comparison
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. list-table:: Embedding Model Latencies (CPU, 100 tokens)
   :header-rows: 1
   :widths: 30 20 20 30

   * - Model
     - Dimension
     - Latency
     - Storage (1M docs)
   * - all-MiniLM-L6-v2
     - 384
     - ~2ms
     - ~1.5 GB
   * - BGE-base-en-v1.5
     - 768
     - ~10ms
     - ~3.0 GB
   * - E5-large-v2
     - 1024
     - ~50ms
     - ~4.0 GB
   * - BGE-M3
     - 1024
     - ~100ms
     - ~4.0 GB

Vector Indexing Algorithms
--------------------------

This is where the "magic" of sub-linear retrieval happens. Understanding these
algorithms is crucial for production systems.
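As a point of reference for the index structures below, exact (brute-force)
search compares the query against every vector, i.e. :math:`O(N \cdot d)` per
query with 100% recall. The following is a minimal NumPy sketch of that
baseline; the corpus size, dimension, and normalization assumption (dot
product = cosine similarity) are illustrative, not prescriptive.

.. code-block:: python

   import numpy as np

   def brute_force_search(query: np.ndarray, vectors: np.ndarray, top_k: int = 10) -> np.ndarray:
       """Exact nearest-neighbor search: O(N * d) per query, 100% recall.

       Assumes `query` (d,) and `vectors` (N, d) are L2-normalized, so the
       dot product equals cosine similarity.
       """
       scores = vectors @ query                       # O(N * d) scoring
       top = np.argpartition(-scores, top_k)[:top_k]  # O(N) top-k selection (unordered)
       return top[np.argsort(-scores[top])]           # O(k log k) final ordering

   # Illustrative sizes only: 100k vectors, d = 768 as in the tables above.
   rng = np.random.default_rng(0)
   vectors = rng.standard_normal((100_000, 768)).astype(np.float32)
   vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
   query = vectors[42]  # a vector from the corpus, so it should rank itself first
   print(brute_force_search(query, vectors)[:3])

HNSW and IVF exist to avoid this linear scan while giving up as little recall
as possible.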
HNSW (Hierarchical Navigable Small World)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**Algorithm Overview:**

HNSW constructs a multi-layer graph where:

* Layer 0 contains all vectors
* Higher layers contain exponentially fewer vectors (skip-list structure)
* Each vector connects to :math:`M` neighbors per layer

**Complexity:**

.. list-table::
   :header-rows: 1
   :widths: 25 40 35

   * - Operation
     - Time
     - Space
   * - Index Construction
     - :math:`O(N \cdot \log N \cdot M)`
     - :math:`O(N \cdot d + N \cdot M \cdot L_{max})`
   * - Query (Search)
     - :math:`O(\log N \cdot M \cdot ef)`
     - :math:`O(ef)`
   * - Insert (Single)
     - :math:`O(\log N \cdot M)`
     - :math:`O(d + M \cdot L_{max})`

Where:

* :math:`N` = number of vectors
* :math:`M` = connections per node (typically 16-64)
* :math:`ef` = search beam width (controls recall/speed trade-off)
* :math:`L_{max}` = maximum layer (typically :math:`\log N`)

**Memory Formula:**

.. math::

   \text{Memory}_{HNSW} = N \times (d \times 4 + M \times L_{avg} \times 8) \text{ bytes}

For 1M vectors with d=768, M=32 (taking :math:`L_{avg} = 4`):

.. math::

   \text{Memory} = 10^6 \times (768 \times 4 + 32 \times 4 \times 8) \approx 4.1 \text{ GB}

IVF (Inverted File Index)
^^^^^^^^^^^^^^^^^^^^^^^^^

**Algorithm Overview:**

IVF partitions the vector space into :math:`k` clusters via k-means, then
searches only the :math:`nprobe` nearest clusters.

**Complexity:**

.. list-table::
   :header-rows: 1
   :widths: 25 40 35

   * - Operation
     - Time
     - Space
   * - Index Construction
     - :math:`O(N \cdot d \cdot k \cdot I)`
     - :math:`O(N \cdot d + k \cdot d)`
   * - Query (Search)
     - :math:`O(k \cdot d + nprobe \cdot N/k \cdot d)`
     - :math:`O(d)`
   * - Query (optimal :math:`k \approx \sqrt{N}`)
     - :math:`O(\sqrt{N} \cdot d)`
     - —

Where:

* :math:`k` = number of clusters (typically :math:`\sqrt{N}` to :math:`4\sqrt{N}`)
* :math:`I` = k-means iterations
* :math:`nprobe` = clusters to search

**With Product Quantization (PQ):**

PQ compresses vectors by splitting into :math:`m` subvectors and quantizing each
to :math:`2^{nbits}` centroids:

.. math::

   \text{Compression Ratio} = \frac{d \times 32}{m \times nbits}

For d=768, m=96, nbits=8:

.. math::

   \text{Compression} = \frac{768 \times 32}{96 \times 8} = 32\times

HNSW vs. IVF Trade-offs
^^^^^^^^^^^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1
   :widths: 20 20 20 20 20

   * - Metric
     - HNSW
     - IVF
     - IVF+PQ
     - Brute Force
   * - Query Time
     - :math:`O(\log N)`
     - :math:`O(\sqrt{N})`
     - :math:`O(\sqrt{N})`
     - :math:`O(N)`
   * - Recall@10
     - 95-99%
     - 85-95%
     - 75-90%
     - 100%
   * - Memory (1M, d=768)
     - ~4.1 GB
     - ~3.1 GB
     - ~0.1 GB
     - ~3.0 GB
   * - Build Time
     - Slow
     - Medium
     - Medium
     - None
   * - Dynamic Insert
     - Good
     - Poor
     - Poor
     - N/A

Cross-Encoder Reranking
-----------------------

Mathematical Formulation
^^^^^^^^^^^^^^^^^^^^^^^^

A cross-encoder computes:

.. math::

   s(q, d) = \text{MLP}(\text{BERT}([CLS] \oplus q \oplus [SEP] \oplus d \oplus [SEP]))

**Attention Complexity:**

For input length :math:`L = |q| + |d| + 3` (including special tokens):

.. math::

   \text{Time}_{cross-encoder} = O(L^2 \cdot d + L \cdot d^2)

**Comparison with Bi-Encoder:**

.. list-table::
   :header-rows: 1
   :widths: 25 35 40

   * - Architecture
     - Scoring Complexity
     - Total for k candidates
   * - Bi-Encoder
     - :math:`O(d)` (dot product)
     - :math:`O(L_q \cdot d^2 + k \cdot d)`
   * - Cross-Encoder
     - :math:`O(L^2 \cdot d)` (full attention)
     - :math:`O(k \cdot L^2 \cdot d)`

For k=100 candidates, L=256:

* **Bi-Encoder**: ~10ms (1 query encoding + 100 dot products)
* **Cross-Encoder**: ~5,000ms (100 full forward passes)

**Speedup Ratio:**

.. math::

   \frac{\text{Time}_{cross}}{\text{Time}_{bi}} = \frac{k \cdot L^2}{L_q + k} \approx k \cdot L

For k=100, L=256: **~25,000× slower** (theoretical), ~500× in practice due to
batching.
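The gap can be made concrete by plugging the "Total for k candidates"
expressions from the table into a short script. This is a back-of-envelope
sketch, not a benchmark: it counts only the dominant terms, ignores constant
factors and batching, and the query length :math:`L_q = 16` is an illustrative
assumption for a short search query.

.. code-block:: python

   def bi_encoder_total_ops(k: int, l_q: int, d: int) -> float:
       """O(L_q * d^2 + k * d): encode the query once, then k dot products
       against precomputed document embeddings."""
       return l_q * d**2 + k * d

   def cross_encoder_total_ops(k: int, l: int, d: int) -> float:
       """O(k * L^2 * d): one full attention pass of length L per candidate."""
       return k * l**2 * d

   k, l, d = 100, 256, 768   # values used in the comparison above
   l_q = 16                  # assumed short search query (illustrative)

   bi = bi_encoder_total_ops(k, l_q, d)
   cross = cross_encoder_total_ops(k, l, d)
   print(f"bi-encoder total:    {bi:.2e} ops")
   print(f"cross-encoder total: {cross:.2e} ops")
   print(f"ratio:               ~{cross / bi:.0f}x")

With these assumptions the ratio comes out in the hundreds, in the same
ballpark as the in-practice figure quoted above.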
LLM Generation Complexity
-------------------------

This section analyzes the generation phase, which dominates RAG latency.

Transformer Decoder Analysis
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**Prefill Phase (Prompt Processing):**

Processes the entire prompt in parallel:

.. math::

   \text{Time}_{prefill} = O(n_{layers} \cdot L_{prompt}^2 \cdot d)

**Decode Phase (Token Generation):**

Generates tokens autoregressively. Without a cache, every step recomputes
attention over the full sequence:

.. math::

   \text{Time}_{decode} = O(n_{layers} \cdot n_{tokens} \cdot L_{total}^2 \cdot d)

With KV caching, each new token only attends to cached keys/values:

.. math::

   \text{Time}_{decode}^{cached} = O(n_{layers} \cdot n_{tokens} \cdot L_{total} \cdot d)

KV Cache Memory Analysis
^^^^^^^^^^^^^^^^^^^^^^^^

**Memory Formula:**

.. math::

   \text{Memory}_{KV} = 2 \times n_{layers} \times n_{heads} \times L \times d_{head} \times \text{precision}

For Llama-2-7B (32 layers, 32 heads, d_head=128, fp16):

.. math::

   \text{Memory}_{KV} = 2 \times 32 \times 32 \times L \times 128 \times 2 = 524,288 \times L \text{ bytes}

* L=2048: **1.07 GB** per sequence
* L=8192: **4.29 GB** per sequence
* L=32768: **17.18 GB** per sequence

**KV Cache Trade-off:**

.. list-table::
   :header-rows: 1
   :widths: 25 35 40

   * - Strategy
     - Time (200 tokens)
     - Memory
   * - No Cache
     - :math:`O(n \cdot L^2)` → ~40s
     - :math:`O(L \cdot d)`
   * - Full Cache
     - :math:`O(n \cdot L)` → ~4s
     - :math:`O(n_{layers} \cdot L \cdot d)`
   * - Sliding Window
     - :math:`O(n \cdot W)` → ~2s
     - :math:`O(n_{layers} \cdot W \cdot d)`
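The memory formula translates directly into a quick estimator. The sketch
below reproduces the Llama-2-7B figures quoted above, assuming fp16 (2 bytes
per value) and 1 GB = :math:`10^9` bytes.

.. code-block:: python

   def kv_cache_bytes(n_layers: int, n_heads: int, d_head: int,
                      seq_len: int, bytes_per_value: int = 2) -> int:
       """KV cache size = 2 (keys + values) * layers * heads * seq_len * d_head * precision."""
       return 2 * n_layers * n_heads * seq_len * d_head * bytes_per_value

   # Llama-2-7B: 32 layers, 32 heads, d_head = 128, fp16.
   for seq_len in (2048, 8192, 32768):
       size = kv_cache_bytes(n_layers=32, n_heads=32, d_head=128, seq_len=seq_len)
       print(f"L = {seq_len:6d}: {size / 1e9:.2f} GB per sequence")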
End-to-End Latency Model
------------------------

Putting it all together:

.. math::

   T_{total} = T_{embed} + T_{search} + T_{rerank} + T_{prefill} + T_{decode}

**Typical Values (MS MARCO scale, V100 GPU):**

.. list-table::
   :header-rows: 1
   :widths: 25 25 25 25

   * - Component
     - Complexity
     - Latency
     - % of Total
   * - Query Embedding
     - :math:`O(L_q \cdot d^2)`
     - 10-20ms
     - 1-2%
   * - Vector Search (HNSW)
     - :math:`O(\log N)`
     - 5-50ms
     - 3-5%
   * - Document Fetch
     - :math:`O(k)`
     - 10-30ms
     - 2-3%
   * - Reranking (optional)
     - :math:`O(k \cdot L^2 \cdot d)`
     - 500-2000ms
     - 30-40%
   * - LLM Prefill
     - :math:`O(L^2 \cdot d)`
     - 100-300ms
     - 10-15%
   * - LLM Decode
     - :math:`O(n \cdot L \cdot d)`
     - 2000-8000ms
     - 40-50%

**Total**: 2.5-10 seconds depending on configuration.

Optimization Strategies
-----------------------

Based on the complexity analysis, here are high-impact optimizations:

**For Retrieval (41-47% of latency):**

1. **HNSW Parameter Tuning**: Increase ef_search for recall, decrease for speed
2. **Quantization**: IVF+PQ for memory-constrained deployments
3. **Caching**: Cache frequent queries (30% latency reduction)

**For Reranking (30-40% when used):**

1. **Distillation**: Train smaller cross-encoders (MiniLM-L6 vs. BERT-base)
2. **Early Exit**: Stop reranking when confidence is high
3. **Batching**: Process multiple candidates in parallel

**For Generation (40-50% of latency):**

1. **KV Caching**: Essential for any production system
2. **Speculative Decoding**: 2-3× speedup with draft models
3. **Quantization**: INT8/INT4 for 2-4× memory reduction
4. **Context Compression**: Reduce prompt length with summarization

Scaling Laws
------------

How does performance scale with corpus size?

**Retrieval Scaling:**

.. math::

   T_{retrieval}(N) = c_1 \cdot \log N + c_2

For HNSW, doubling the corpus adds roughly constant time (logarithmic scaling).

**Memory Scaling:**

.. math::

   M(N) = N \cdot (d \cdot 4 + \text{index overhead})

Linear scaling: a 10× larger corpus needs 10× the memory.

**Throughput Scaling:**

With batching, throughput scales sub-linearly due to memory bandwidth limits:

.. math::

   \text{Throughput}(batch) = \frac{batch}{\alpha + \beta \cdot batch}

Where :math:`\alpha` = fixed overhead and :math:`\beta` = per-query cost.

Summary
-------

.. important::

   **Key Takeaways:**

   1. **Retrieval is O(log N)** with HNSW: corpus size matters less than you think
   2. **Reranking is O(k × L²)**: limit the candidate count k, not the corpus size
   3. **LLM generation dominates**: optimize here for the biggest gains
   4. **KV caching is mandatory**: ~10× speedup, but watch memory
   5. **Trade-offs are unavoidable**: HNSW vs. IVF, accuracy vs. speed

References
----------

1. Malkov, Y. A., & Yashunin, D. A. (2018). "Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs." *IEEE TPAMI*.
2. Johnson, J., Douze, M., & Jégou, H. (2019). "Billion-scale Similarity Search with GPUs." *IEEE Transactions on Big Data*.
3. Pope, R., et al. (2023). "Efficiently Scaling Transformer Inference." *MLSys 2023*.
4. Milvus Documentation. "RAG Pipeline Latency Analysis." 2024.
5. Gao, Y., et al. (2024). "Retrieval-Augmented Generation for Large Language Models: A Survey." *arXiv:2312.10997*.

Next Steps
----------

* See :doc:`rag_overview` for conceptual architecture and practical configurations
* See :doc:`benchmarks_and_datasets` for evaluation metrics and benchmarks
* See :doc:`stage1_retrieval/index` for retrieval method details
* See :doc:`stage2_reranking/index` for reranking architectures