Retrieval in RAG: From Vector Search to Hybrid Systems

A practical introduction to retrieval in RAG systems, covering dense vector search, sparse lexical retrieval, hybrid retrieval, and the trade-offs behind real-world retrieval design.


Introduction

In a Retrieval-Augmented Generation (RAG) system, retrieval determines what information is actually passed to the language model:

  • Chunking defines the unit of retrieval.
  • Embedding defines how text is represented.
  • Retrieval decides which pieces of information are selected.

Even if the language model is strong, it can only reason over the context it receives. If retrieval returns irrelevant, incomplete, or misleading context, the final answer will likely suffer.

Retrieval is not about finding everything relevant, but finding the right context under constraints.

In this post, I will discuss:

  • how vector search works in RAG
  • why sparse retrieval such as BM25 still matters
  • why hybrid retrieval is useful
  • how retrieval design affects recall, precision, and downstream answer quality

Where Retrieval Fits in the RAG Pipeline

A typical RAG pipeline looks like:

Query → Embedding → Retrieval → Reranking → LLM

Retrieval sits between representation and reasoning.

  • Embedding maps text into a vector space
  • Retrieval selects a small subset of candidates
  • Reranking refines their order
  • The LLM generates answers based on the selected context
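
To make the flow concrete, here is a minimal end-to-end sketch. Every function is a toy stand-in for a real component (embedding model, vector index, reranker, LLM), not any particular library's API:

```python
from typing import List

def embed(text: str) -> List[float]:
    return [float(len(text))]                # stand-in for an embedding model

def retrieve(query_vec: List[float], k: int) -> List[str]:
    corpus = ["doc about aspirin", "doc about travel", "doc about BM25"]
    return corpus[:k]                        # stand-in for a vector index lookup

def rerank(query: str, docs: List[str]) -> List[str]:
    # Stand-in for a reranker: prefer docs sharing words with the query.
    return sorted(docs, key=lambda d: -sum(w in d for w in query.split()))

def generate(query: str, context: List[str]) -> str:
    return f"answer to {query!r} using {len(context)} context chunks"  # stand-in for the LLM

def rag_answer(query: str, k: int = 20, final_k: int = 5) -> str:
    """Query -> Embedding -> Retrieval -> Reranking -> LLM."""
    candidates = retrieve(embed(query), k)          # high-recall candidate set
    context = rerank(query, candidates)[:final_k]   # high-precision ordering
    return generate(query, context)                 # reasoning over the selected context

print(rag_answer("side effects of aspirin"))
```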

Its role is to:

  • reduce the search space from millions of documents to a small candidate set
  • maximize recall under latency constraints

In other words:

Retrieval is the stage where the system decides what it knows.


Vector Search: The Dense Retrieval Paradigm

Vector search retrieves documents based on semantic similarity in an embedding space.

How vector search works

At a high level, vector search consists of three steps:

  • The query is converted into an embedding vector
  • All documents are represented as vectors
  • Retrieval is performed by finding the nearest neighbors

Similarity is typically measured using cosine similarity or dot product. This means retrieval is fundamentally a geometric operation:

documents are ranked based on distance in a learned vector space.
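
As a minimal sketch of this geometric view, here is brute-force top-k retrieval with cosine similarity in NumPy, using random vectors as stand-ins for real embeddings:

```python
import numpy as np

def top_k_cosine(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k documents closest to the query by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q                       # cosine similarity of every document vs. the query
    return np.argsort(-scores)[:k]       # highest-scoring documents first

rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 384))      # 1,000 documents, 384-dim embeddings (illustrative)
query = rng.normal(size=384)
print(top_k_cosine(query, docs, k=5))
```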

ANN (Approximate Nearest Neighbor)

In practice, exhaustive exact nearest neighbor search is often infeasible at scale:

  • embedding vectors are high-dimensional
  • corpora can contain millions of documents

Computing exact distances to all vectors would be too slow. To address this, most systems use Approximate Nearest Neighbor (ANN) algorithms.

Common approaches include:

  • HNSW (Hierarchical Navigable Small World graphs, used in vector databases such as Qdrant)
  • IVF (inverted file indexes, used in libraries such as FAISS)

These methods trade off a small amount of accuracy for significant gains in speed.
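
As a concrete example, a minimal FAISS IVF index can be built as follows (the dimension, cluster count, and random vectors are illustrative):

```python
import numpy as np
import faiss  # pip install faiss-cpu

d, nlist = 384, 100                        # embedding dimension, number of IVF clusters
xb = np.random.random((100_000, d)).astype("float32")  # document vectors (illustrative)

quantizer = faiss.IndexFlatL2(d)           # coarse quantizer assigns vectors to clusters
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb)                            # learn cluster centroids from the data
index.add(xb)

index.nprobe = 10                          # clusters searched per query: the speed/recall knob
xq = np.random.random((1, d)).astype("float32")
distances, ids = index.search(xq, 5)       # approximate top-5 neighbors
```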

Strengths

Vector search works well because it captures semantic similarity:

  • it can match paraphrases
  • it is robust to wording variation
  • it generalizes beyond exact keywords

For example:

  • “heart attack” can match “myocardial infarction”
  • “side effects” can match “adverse reactions”

This makes it especially useful in natural language queries.

Limitations

However, vector search is not a complete retrieval solution. It relies solely on semantic similarity, which introduces several limitations:

  • it may miss exact matches (e.g., numbers, IDs, drug names)
  • it struggles with rare or domain-specific terms
  • embedding biases propagate directly into retrieval results

More importantly:

similarity in embedding space is only an approximation of relevance.

As a result, vector search alone can produce results that are semantically related but not actually useful.

This limitation motivates the need for additional retrieval signals, which is discussed next.


Sparse Retrieval: The Lexical Baseline

Before dense retrieval became popular, most search systems relied on sparse (lexical) retrieval methods.

The most widely used approach is BM25, which ranks documents based on keyword matching.

BM25 intuition

BM25 builds on two simple ideas:

  • terms that appear frequently in a document are important
  • rare terms across the corpus are more informative

In practice, it scores documents based on how well their words match the query, weighted by term frequency and inverse document frequency.

This makes BM25 highly effective for queries where exact wording matters.
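
As a sketch, the third-party rank_bm25 package implements this scoring directly (whitespace tokenization here is a simplification):

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

corpus = [
    "aspirin 100mg once daily for cardiovascular prevention",
    "myocardial infarction treatment guidelines",
    "travel insurance terms and conditions",
]
tokenized = [doc.split() for doc in corpus]   # naive whitespace tokenization
bm25 = BM25Okapi(tokenized)

query = "aspirin 100mg".split()
print(bm25.get_scores(query))                 # one relevance score per document
print(bm25.get_top_n(query, corpus, n=1))     # the exact-dosage document wins
```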

Why sparse retrieval still matters

Despite the rise of vector search, sparse retrieval remains a strong baseline.

It provides signals that dense retrieval often misses:

  • exact matching (e.g., names, IDs, drug names, numbers)
  • deterministic behavior (results are predictable and interpretable)
  • robust baseline performance across many domains

For example:

  • “Aspirin 100mg” will reliably match documents containing the exact dosage
  • entity-heavy queries benefit from precise keyword matching

Failure modes

Sparse retrieval has clear limitations:

  • it cannot handle paraphrases (e.g., “heart attack” vs “myocardial infarction”)
  • it ignores semantic similarity
  • it fails when query and document use different wording

Hybrid Retrieval: Combining Dense and Sparse

Neither dense nor sparse retrieval is sufficient on its own.

Dense retrieval captures semantic similarity, but may miss exact matches.

Sparse retrieval captures exact matches, but cannot generalize across different expressions.

Hybrid retrieval combines both signals to improve robustness.

Why hybrid works

Dense and sparse methods model different aspects of relevance:

  • dense → meaning (semantic similarity)
  • sparse → exact match (lexical signals)

These signals are complementary rather than redundant.

In practice:

  • dense retrieval improves recall for paraphrased queries
  • sparse retrieval ensures precision for entities, numbers, and keywords

Hybrid retrieval works because it combines semantic generalization with lexical precision.

Common strategies

There are two common ways to combine dense and sparse retrieval.

Score fusion

Combine scores from both systems:

  • weighted sum of dense and sparse scores
  • requires normalization (scores are not directly comparable)

Challenges:

  • dense and BM25 scores have different distributions
  • requires tuning weights
  • less stable across datasets
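
A minimal sketch of weighted score fusion with min-max normalization (the weight alpha is exactly the tuning knob mentioned above):

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so dense and BM25 scores become comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def fuse(dense: dict[str, float], sparse: dict[str, float], alpha: float = 0.5):
    """Weighted sum of normalized scores; missing documents contribute 0."""
    dense, sparse = min_max(dense), min_max(sparse)
    docs = dense.keys() | sparse.keys()
    fused = {d: alpha * dense.get(d, 0.0) + (1 - alpha) * sparse.get(d, 0.0) for d in docs}
    return sorted(fused.items(), key=lambda kv: -kv[1])

dense_scores = {"doc1": 0.82, "doc2": 0.75, "doc3": 0.40}   # illustrative cosine scores
bm25_scores = {"doc2": 12.3, "doc4": 9.8}                   # illustrative BM25 scores
print(fuse(dense_scores, bm25_scores, alpha=0.6))           # doc2 ranks first: strong in both
```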

Rank fusion

Combine rankings instead of scores. A common approach is Reciprocal Rank Fusion (RRF):

  • each system produces a ranked list
  • each document is scored by summing 1 / (k + rank) across systems, where k is a smoothing constant (commonly 60)
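
Because RRF uses only rank positions, it requires no score normalization and tends to be stable across datasets. A minimal sketch:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each document scores sum(1 / (k + rank)) across systems."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_ranking = ["doc1", "doc3", "doc2"]     # from vector search
sparse_ranking = ["doc2", "doc1", "doc4"]    # from BM25
print(reciprocal_rank_fusion([dense_ranking, sparse_ranking]))
# ['doc1', 'doc2', 'doc3', 'doc4']: documents ranked highly by either system rise
```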

Example (OpenSearch)

In practice, hybrid retrieval is often implemented using search engines such as OpenSearch.

A typical setup includes:

  • BM25 for sparse retrieval
  • vector search for dense retrieval
  • a fusion strategy to combine results

This allows systems to leverage both lexical and semantic signals in a single query.
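
As an illustration, OpenSearch exposes this pattern through its hybrid query type, which runs several sub-queries and fuses them via a search pipeline. The sketch below shows only the query body; the index, field names, vector, and the normalization pipeline (configured separately) are assumptions based on the OpenSearch documentation:

```python
# Query body for OpenSearch's hybrid query, sent as
# GET /docs/_search?search_pipeline=hybrid-pipeline.
# "docs", "text", "embedding", and the vector are illustrative placeholders.
query_vector = [0.1] * 384   # placeholder for the embedded query

hybrid_query = {
    "query": {
        "hybrid": {
            "queries": [
                {"match": {"text": "aspirin 100mg side effects"}},           # sparse (BM25)
                {"knn": {"embedding": {"vector": query_vector, "k": 20}}},   # dense (ANN)
            ]
        }
    }
}
```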

When to use hybrid retrieval

Hybrid retrieval is particularly useful when:

  • the domain contains precise terminology (e.g., medical, legal)
  • queries include both intent and keywords
  • exact matches (IDs, drug names, numbers) are critical
  • pure vector search produces unstable or noisy results

Key takeaway

Hybrid retrieval is not an optimization, but a necessity in many real-world systems.

Effective retrieval is not about choosing one method, but combining complementary signals.


Retrieval Trade-offs

Retrieval design is fundamentally about trade-offs. There is no single “best” retrieval strategy — only choices that balance competing objectives.

Recall vs Precision

  • Recall: the fraction of relevant documents that are retrieved
  • Precision: the fraction of retrieved documents that are relevant

In RAG systems:

  • retrieval typically prioritizes recall
  • reranking is used to improve precision

Example:

  • top-3 → high precision, low recall
  • top-20 → high recall, lower precision

Key insight:

Retrieval should aim to avoid missing important information, even at the cost of introducing noise.
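
To make the trade-off concrete, here is a minimal sketch of precision@k and recall@k for a single query (the documents and relevance labels are made up):

```python
def precision_recall_at_k(retrieved: list[str], relevant: set[str], k: int):
    """Precision@k and recall@k for one query."""
    top_k = retrieved[:k]
    hits = sum(1 for doc in top_k if doc in relevant)
    return hits / k, hits / len(relevant)

retrieved = ["d1", "d7", "d3", "d9", "d2"] + [f"d{i}" for i in range(10, 25)]
relevant = {"d1", "d2", "d3", "d4"}
for k in (3, 20):
    p, r = precision_recall_at_k(retrieved, relevant, k)
    print(f"k={k}: precision={p:.2f}, recall={r:.2f}")
# k=3 gives higher precision; k=20 recovers more of the relevant set
```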

Latency vs Accuracy

More accurate retrieval often requires more computation:

  • exact search → more accurate but slower
  • ANN → faster but approximate
  • hybrid retrieval → more signals but higher cost

Trade-off:

  • lower latency improves responsiveness
  • higher accuracy improves answer quality

Cost vs Performance

Retrieval design also impacts system cost:

  • larger top-k → more tokens → higher LLM cost
  • more complex pipelines → higher infrastructure cost
  • reranking → additional model inference

Implication:

Better retrieval is not free — it shifts cost across the system.

Stability vs Flexibility

Different retrieval strategies behave differently:

  • sparse retrieval → stable and predictable
  • dense retrieval → flexible but less controllable
  • hybrid → more robust but more complex

Putting it together

In practice, modern RAG systems often adopt a multi-stage approach:

  • retrieval (high recall)
  • reranking (high precision)
  • generation (final reasoning)
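
One common way to implement the reranking stage is a cross-encoder, which scores each (query, document) pair jointly: slower than retrieval, but more precise. A sketch with the sentence-transformers library (the model name is one popular public checkpoint; documents are illustrative):

```python
from sentence_transformers import CrossEncoder  # pip install sentence-transformers

candidates = [
    "Aspirin can cause stomach upset and bleeding in some patients.",
    "Aspirin 100mg is commonly prescribed for cardiovascular prevention.",
    "Travel insurance does not cover pre-existing conditions.",
]
query = "what are the side effects of aspirin?"

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
print(reranked[0])   # the side-effects document should rank first
```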

Final takeaway:

Retrieval is not just a component, but a set of design decisions that balance recall, precision, latency, and cost.


Practical Design Guidelines

Start simple

  • use dense retrieval
  • retrieve top-k
  • inspect results
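
As a starting point, a dense-only retriever fits in a few lines with sentence-transformers (the model choice is an assumption; any embedding model works):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # small general-purpose embedding model

corpus = ["aspirin dosage guidelines", "myocardial infarction care", "holiday itinerary tips"]
doc_vecs = model.encode(corpus, normalize_embeddings=True)

query_vec = model.encode("heart attack treatment", normalize_embeddings=True)
scores = doc_vecs @ query_vec                     # cosine similarity (vectors are normalized)
for i in np.argsort(-scores)[:2]:
    print(scores[i], corpus[i])                   # inspect results before adding complexity
```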

Add sparse when needed

  • exact matches are missed
  • domain terminology matters

Tune top-k

  • small k → precise but incomplete
  • large k → complete but noisy

Ensure consistency

Align:

  • embedding model
  • similarity metric
  • normalization
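
One frequent inconsistency is ranking by dot product while assuming cosine similarity. The two agree only when vectors are L2-normalized, as this minimal check shows:

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 2.0])

cosine = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cosine, a @ b)      # ~0.98 vs 11.0: unnormalized dot product is on another scale

a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)
print(a_n @ b_n)          # equals the cosine similarity after normalization
```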

Use real queries

Test with realistic, domain-specific inputs.

Inspect results

Metrics help, but manual inspection reveals failure patterns.

Plan for multi-stage retrieval

  • retrieval → recall
  • reranking → precision

Key takeaway

Retrieval quality improves through iteration, not one-time design.


Common Failure Modes

Retrieval failures are rarely random. They are usually systematic and reproducible.

Missing obvious matches

Relevant documents exist but are not retrieved. Common causes:

  • embedding model mismatch
  • insufficient top-k
  • missing keyword signals

Semantically close but not useful

Retrieved documents are “close” in meaning but not useful. This is typical in dense retrieval when:

  • queries are broad or ambiguous
  • embedding space is too smooth

Over-reliance on keywords

Sparse retrieval dominates results, leading to:

  • exact matches without context
  • poor handling of paraphrase

Domain mismatch

Retrieval fails on:

  • rare entities
  • domain-specific terminology
  • new or unseen concepts

Noisy candidate sets

Too many partially relevant results:

  • large top-k without filtering
  • hybrid retrieval without proper fusion
  • lack of reranking

Inconsistent behavior across queries

The same system performs well on some queries but poorly on others. This is often caused by:

  • uneven data distribution
  • query–document mismatch
  • sensitivity to phrasing

Key insight

Retrieval failures reflect mismatches between signals, data, and task requirements — not just model weaknesses.

In practice, improving retrieval is less about replacing components and more about understanding these failure patterns and addressing them systematically.

Conclusion

Retrieval is a foundational component of RAG systems, but it is often oversimplified.

Vector search enables semantic matching, while sparse retrieval provides precise lexical signals. In practice, effective systems combine both, and further refine results through multi-stage pipelines.

The key challenge is not choosing a single method, but balancing trade-offs:

  • recall vs precision
  • latency vs accuracy
  • cost vs performance

Final takeaway:

Retrieval determines what your system knows.
Designing it well is essential for building reliable RAG applications.
