Chunking in RAG: Why It Matters More Than You Think
Published:
Introduction
In many Retrieval-Augmented Generation (RAG) systems, performance issues are often attributed to the choice of model or embedding. However, in practice, one of the most critical — and frequently overlooked — factors is how documents are chunked before indexing.
Chunking directly determines what the system can retrieve. If the segmentation is suboptimal, even the best embedding model and vector database will fail to return relevant context.
This post explores why chunking matters, common strategies, and practical trade-offs based on real-world RAG system design.
Where chunking fits in a RAG pipeline
A typical RAG pipeline consists of:
- Document ingestion
- Chunking
- Embedding
- Vector storage (e.g., Qdrant)
- Retrieval
- LLM generation
Chunking sits at the boundary between raw data and semantic representation. If chunking fails:
- Relevant information may be split across chunks → recall loss
- Irrelevant content may be grouped together → precision loss
Common chunking strategies
Core strategies (most commonly used)
1. Fixed-length chunking
The simplest approach is to split text into chunks of fixed size (e.g., 500 tokens). Advantages:
- Easy to implement
- Consistent chunk size
- Works reasonably well for uniform text (text with consistent structure and relatively uniform semantic density)
Limitations:
- Ignores semantic boundaries
- May cut sentences or concepts in half
- Can degrade retrieval quality
2. Recursive / rule-based chunking
Recursive chunking splits text using a hierarchy of rules, such as:
- Paragraphs
- Sentences
- Tokens
If a chunk exceeds the target size, it is recursively split using finer-grained boundaries.
Advantages:
- Preserves structure better than fixed-length splitting
- More robust across different document types
Limitations:
- Still heuristic-based
- May not fully capture semantic coherence
3. Semantic chunking
Semantic chunking attempts to split documents based on semantic coherence, typically using techniques such as:
- sentence embeddings
- similarity between adjacent text segments
- topic shift detection
Advantages:
- Preserves context
- Better alignment with human understanding
Limitations:
- More complex
- Requires heuristics or models
- Can produce uneven chunk sizes
4. Sliding window / overlapping chunks
To mitigate boundary issues, overlapping chunks are often used.
Example:
- Chunk size: 500 tokens
- Overlap: 100 tokens
Advantages:
- Reduces information loss at boundaries
- Improves recall
Trade-offs:
- Increased storage
- Higher retrieval redundancy
Extended landscape (less commonly used but important)
Beyond the basic strategies, more advanced approaches have been explored:
- Document-based chunking: splits text according to document structure such as sections, headers, or layout elements.
- Hierarchical chunking: represents documents at multiple granularities (e.g., coarse and fine chunks) to support multi-level retrieval.
- Late chunking: defers chunking until after retrieval, allowing the system to first retrieve larger contexts and split only when needed.
- LLM-based chunking: uses language models to determine semantically meaningful boundaries instead of relying on fixed rules.
- Adaptive / dynamic chunking: adjusts chunk size or boundaries based on content characteristics or downstream task requirements.
These approaches often emerge in more complex systems or research settings, where chunking is treated as an optimization problem rather than a preprocessing step.
Document structure-based chunking
While the previous strategies focus on how to split text, structure-based chunking emphasizes what constitutes a meaningful unit in structured documents.
In many real-world applications, documents are not plain text but structured formats such as PDFs, HTML pages, or Markdown files. In these cases, chunking should not rely solely on token length or sentence boundaries, but instead leverage the inherent document structure.
For example:
- PDFs often contain sections, tables, and multi-column layouts
- HTML documents include headings, lists, and semantic tags
- Markdown files explicitly encode hierarchy through headings and formatting
Instead of treating these documents as flat text, structure-aware chunking uses elements such as:
- section headers
- paragraph boundaries
- HTML tags (e.g.,
<h1>,<p>,<li>) - document hierarchy
to define chunk boundaries.
This approach offers several advantages:
- Preserves logical grouping of information
- Improves retrieval coherence
- Reduces the risk of mixing unrelated content
In practice, structure-based chunking is often combined with recursive or semantic strategies, forming a hybrid approach that balances structure and meaning.
This is particularly important in domains where documents are complex and heterogeneous, such as web pages or medical documents, where naive chunking can significantly degrade retrieval quality.
The core trade-off: recall vs precision
Chunking fundamentally controls a key trade-off in RAG systems:
- Large chunks → higher context completeness (recall)
- Small chunks → higher relevance specificity (precision)
In practice:
- Too small → fragmented information, weak answers
- Too large → noisy retrieval, irrelevant context
The optimal chunk size depends on:
- Document structure
- Query type
- LLM context window
Practical considerations (from real-world systems)
Based on production-like RAG setups, several patterns consistently emerge:
1. Chunking is often the first bottleneck
When retrieval quality is poor, the issue is frequently:
- not embedding quality
- not vector database performance
but improper chunking strategy
2. Metadata becomes more important as chunks get smaller
Smaller chunks lose global context. To compensate, metadata is critical:
- source
- section
- document type
This enables filtering during retrieval (e.g., in Qdrant).
3. Chunking and embedding must be co-designed
Chunking defines what gets embedded.
If chunks are too long:
- embeddings become less specific
If too short:
- embeddings lose semantic completeness
How chunking interacts with vector databases
Vector databases such as Qdrant store embeddings at the chunk level.
This means:
- Retrieval operates on chunks, not documents
- Chunk design directly defines the retrieval granularity
In practice:
- Better chunking → better nearest neighbor search
- Poor chunking → irrelevant or incomplete results
This is why chunking should be considered a first-class design decision, not a preprocessing detail.
When to revisit your chunking strategy
You should reconsider chunking if you observe:
- Relevant information not being retrieved
- Answers missing key details
- Retrieved chunks partially relevant but incomplete
These are often signals of:
- chunk boundaries misaligned with semantics
- chunk size not matching query patterns
Conclusion
Chunking is one of the most influential design choices in RAG systems.
It determines:
- what gets embedded
- what gets retrieved
- and ultimately, what the LLM can reason over
Before tuning models or switching vector databases, it is often worth revisiting how documents are segmented.
References (optional)
- Lewis et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”, 2020
- LangChain documentation on text splitters
- Qdrant documentation on vector storage and payload filtering
- Pinecone Blog – Chunking Strategies for RAG Systems