Chunking in RAG: Why It Matters More Than You Think

6 minute read

Published:

Introduction

In many Retrieval-Augmented Generation (RAG) systems, performance issues are often attributed to the choice of model or embedding. However, in practice, one of the most critical — and frequently overlooked — factors is how documents are chunked before indexing.

Chunking directly determines what the system can retrieve. If the segmentation is suboptimal, even the best embedding model and vector database will fail to return relevant context.

This post explores why chunking matters, common strategies, and practical trade-offs based on real-world RAG system design.


Where chunking fits in a RAG pipeline

A typical RAG pipeline consists of:

  1. Document ingestion
  2. Chunking
  3. Embedding
  4. Vector storage (e.g., Qdrant)
  5. Retrieval
  6. LLM generation

Chunking sits at the boundary between raw data and semantic representation. If chunking fails:

  • Relevant information may be split across chunks → recall loss
  • Irrelevant content may be grouped together → precision loss

Common chunking strategies

Core strategies (most commonly used)

1. Fixed-length chunking

The simplest approach is to split text into chunks of fixed size (e.g., 500 tokens). Advantages:

  • Easy to implement
  • Consistent chunk size
  • Works reasonably well for uniform text (text with consistent structure and relatively uniform semantic density)

Limitations:

  • Ignores semantic boundaries
  • May cut sentences or concepts in half
  • Can degrade retrieval quality

2. Recursive / rule-based chunking

Recursive chunking splits text using a hierarchy of rules, such as:

  • Paragraphs
  • Sentences
  • Tokens

If a chunk exceeds the target size, it is recursively split using finer-grained boundaries.

Advantages:

  • Preserves structure better than fixed-length splitting
  • More robust across different document types

Limitations:

  • Still heuristic-based
  • May not fully capture semantic coherence

3. Semantic chunking

Semantic chunking attempts to split documents based on semantic coherence, typically using techniques such as:

  • sentence embeddings
  • similarity between adjacent text segments
  • topic shift detection

Advantages:

  • Preserves context
  • Better alignment with human understanding

Limitations:

  • More complex
  • Requires heuristics or models
  • Can produce uneven chunk sizes

4. Sliding window / overlapping chunks

To mitigate boundary issues, overlapping chunks are often used.

Example:

  • Chunk size: 500 tokens
  • Overlap: 100 tokens

Advantages:

  • Reduces information loss at boundaries
  • Improves recall

Trade-offs:

  • Increased storage
  • Higher retrieval redundancy

Extended landscape (less commonly used but important)

Beyond the basic strategies, more advanced approaches have been explored:

  • Document-based chunking: splits text according to document structure such as sections, headers, or layout elements.
  • Hierarchical chunking: represents documents at multiple granularities (e.g., coarse and fine chunks) to support multi-level retrieval.
  • Late chunking: defers chunking until after retrieval, allowing the system to first retrieve larger contexts and split only when needed.
  • LLM-based chunking: uses language models to determine semantically meaningful boundaries instead of relying on fixed rules.
  • Adaptive / dynamic chunking: adjusts chunk size or boundaries based on content characteristics or downstream task requirements.

These approaches often emerge in more complex systems or research settings, where chunking is treated as an optimization problem rather than a preprocessing step.

Document structure-based chunking

While the previous strategies focus on how to split text, structure-based chunking emphasizes what constitutes a meaningful unit in structured documents.

In many real-world applications, documents are not plain text but structured formats such as PDFs, HTML pages, or Markdown files. In these cases, chunking should not rely solely on token length or sentence boundaries, but instead leverage the inherent document structure.

For example:

  • PDFs often contain sections, tables, and multi-column layouts
  • HTML documents include headings, lists, and semantic tags
  • Markdown files explicitly encode hierarchy through headings and formatting

Instead of treating these documents as flat text, structure-aware chunking uses elements such as:

  • section headers
  • paragraph boundaries
  • HTML tags (e.g., <h1>, <p>, <li>)
  • document hierarchy

to define chunk boundaries.

This approach offers several advantages:

  • Preserves logical grouping of information
  • Improves retrieval coherence
  • Reduces the risk of mixing unrelated content

In practice, structure-based chunking is often combined with recursive or semantic strategies, forming a hybrid approach that balances structure and meaning.

This is particularly important in domains where documents are complex and heterogeneous, such as web pages or medical documents, where naive chunking can significantly degrade retrieval quality.

The core trade-off: recall vs precision

Chunking fundamentally controls a key trade-off in RAG systems:

  • Large chunks → higher context completeness (recall)
  • Small chunks → higher relevance specificity (precision)

In practice:

  • Too small → fragmented information, weak answers
  • Too large → noisy retrieval, irrelevant context

The optimal chunk size depends on:

  • Document structure
  • Query type
  • LLM context window

Practical considerations (from real-world systems)

Based on production-like RAG setups, several patterns consistently emerge:

1. Chunking is often the first bottleneck

When retrieval quality is poor, the issue is frequently:

  • not embedding quality
  • not vector database performance

but improper chunking strategy


2. Metadata becomes more important as chunks get smaller

Smaller chunks lose global context. To compensate, metadata is critical:

  • source
  • section
  • document type

This enables filtering during retrieval (e.g., in Qdrant).


3. Chunking and embedding must be co-designed

Chunking defines what gets embedded.

If chunks are too long:

  • embeddings become less specific

If too short:

  • embeddings lose semantic completeness

How chunking interacts with vector databases

Vector databases such as Qdrant store embeddings at the chunk level.

This means:

  • Retrieval operates on chunks, not documents
  • Chunk design directly defines the retrieval granularity

In practice:

  • Better chunking → better nearest neighbor search
  • Poor chunking → irrelevant or incomplete results

This is why chunking should be considered a first-class design decision, not a preprocessing detail.


When to revisit your chunking strategy

You should reconsider chunking if you observe:

  • Relevant information not being retrieved
  • Answers missing key details
  • Retrieved chunks partially relevant but incomplete

These are often signals of:

  • chunk boundaries misaligned with semantics
  • chunk size not matching query patterns

Conclusion

Chunking is one of the most influential design choices in RAG systems.

It determines:

  • what gets embedded
  • what gets retrieved
  • and ultimately, what the LLM can reason over

Before tuning models or switching vector databases, it is often worth revisiting how documents are segmented.


References (optional)