Chunking Strategies

Chunking is the process of segmenting text into smaller, manageable portions based on length, structure, or semantic meaning. It lets vector search focus on precise passages rather than entire documents. Understanding the different chunking methods helps improve retrieval accuracy and model performance in Retrieval-Augmented Generation (RAG) pipelines.

Need for Chunking

Some of the reasons chunking is required in LangChain are:

  1. LLM Token Limitations: Long documents exceed model context windows, making direct processing impossible (see the token-counting sketch after this list).
  2. Improved Retrieval Accuracy: Smaller segments allow retrieval pipelines to match context more precisely.
  3. Better Performance: Chunking reduces computation overhead and speeds up embedding searches.
  4. Context Preservation: Keeps relevant text together, reducing hallucinations and incorrect reasoning.
  5. Efficient Knowledge Access: Enables document querying without loading entire files into memory.
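
To make the first point concrete, here is a minimal sketch that counts a document's tokens before deciding whether to chunk. It assumes the tiktoken library is installed; any tokenizer that matches the target model works.

Python
import tiktoken
# Count how many tokens the document occupies under the cl100k_base encoding.
enc = tiktoken.get_encoding("cl100k_base")
n_tokens = len(enc.encode(text))
# If this exceeds the model's context window, the text must be chunked.
print(f"Document length: {n_tokens} tokens")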

Chunking Strategies

Some of the chunking strategies are:

1. Fixed-Size Chunking: Splits text into equal-sized segments based on characters or tokens. This approach is easy to implement and performs well for plain text.

Python
from langchain.text_splitter import CharacterTextSplitter
# Merge pieces split on the default "\n\n" separator into chunks of roughly
# 300 characters, repeating 50 characters between consecutive chunks.
splitter = CharacterTextSplitter(chunk_size=300, chunk_overlap=50)
chunks = splitter.split_text(text)

2. Recursive Character Splitter: Preserves structure by trying a prioritized list of separators (paragraphs, then lines, then words) and only falling back to finer splits when a chunk is still too large. It avoids breaking sentences abruptly and keeps chunks readable.

Python
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Tries "\n\n", then "\n", then " " before ever splitting mid-word.
splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=80)
chunks = splitter.split_text(text)

3. Token-Based Chunking: Uses model tokenizers to ensure chunks stay within maximum token lengths. It reduces potential truncation errors during inference.

Python
from langchain.text_splitter import TokenTextSplitter
# Measures chunk length in tokens (via tiktoken) rather than characters.
splitter = TokenTextSplitter(chunk_size=256, chunk_overlap=32)
chunks = splitter.split_text(text)

4. Sentence or Semantic Chunking: Splits text into sentences and groups them by embedding similarity, so each chunk covers one coherent idea. This is helpful for descriptive or narrative-heavy content.

Python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
# SemanticChunker takes an embedding model rather than a fixed chunk size;
# it places breakpoints where similarity between adjacent sentences drops.
splitter = SemanticChunker(OpenAIEmbeddings())
chunks = splitter.split_text(text)
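
Because chunks end at semantic breakpoints rather than at a character count, their lengths vary with the content; the breakpoint_threshold_type parameter (percentile by default) controls how aggressively new chunks are started.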

5. Document-Based Chunking: Breaks large documents such as PDFs or web pages into sections or paragraphs using LangChain's document loaders, then splits the loaded pages with a text splitter.

Python
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Load the PDF page by page (requires the pypdf package), then split the
# resulting Document objects instead of raw strings.
docs = PyPDFLoader("sample.pdf").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
chunks = splitter.split_documents(docs)

Chunk Overlap

Chunk overlap is the technique of repeating a small portion of text from the end of one chunk at the beginning of the next. This maintains continuity between chunks and prevents important information from being lost when the text is split, which is especially useful when sentences or ideas span multiple chunks (the sketch after the list below makes the overlap visible). Some of the benefits of chunk overlap in LangChain are:

  1. Maintains Context Flow: Overlapping small portions of text ensures that important information crossing chunk boundaries is preserved.
  2. Reduces Context Loss: When a sentence spans two chunks, overlaps prevent missing meaning during retrieval.
  3. Improves Answer Accuracy: Retrieval models gain continuity, leading to clearer and more complete responses.
  4. Better Semantic Understanding: Overlaps enhance embeddings by preserving transitional phrases and linked ideas.
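
To make the overlap visible, here is a minimal sketch (the sample text and sizes are chosen purely for illustration) that splits a short passage word by word; the tail of each chunk reappears at the head of the next:

Python
from langchain.text_splitter import CharacterTextSplitter
sample = ("Machine learning systems learn from data. They improve over time "
          "without being explicitly programmed. Overlap keeps ideas that "
          "cross chunk boundaries intact.")
# Split on spaces so the 20-character overlap is visible in a short sample.
splitter = CharacterTextSplitter(separator=" ", chunk_size=60, chunk_overlap=20)
for i, chunk in enumerate(splitter.split_text(sample)):
    print(i, repr(chunk))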

Selecting Chunk Sizes

Choosing the right chunk size depends on the type of document and the use case. If chunks are too large, the model may take in unnecessary data; if they are too small, essential meaning may be lost. Some recommended chunk sizes in LangChain are listed below, followed by a sketch that turns them into splitter presets:

  1. 300–500 Tokens: Useful for most general documents where moderate context is needed.
  2. 600–900 Tokens: Ideal for technical guides and manuals requiring deeper reference context.
  3. 100–200 Tokens: Effective for short chats, logs or small knowledge fragments.
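
One way to encode these presets is sketched below; CHUNK_PRESETS and make_splitter are hypothetical names, not LangChain API, and the exact numbers can be tuned per corpus:

Python
from langchain.text_splitter import TokenTextSplitter
# Hypothetical presets drawn from the ranges above: (chunk size, overlap) in tokens.
CHUNK_PRESETS = {
    "general":   (400, 50),    # 300-500 tokens
    "technical": (750, 100),   # 600-900 tokens
    "short":     (150, 20),    # 100-200 tokens
}
def make_splitter(kind: str) -> TokenTextSplitter:
    size, overlap = CHUNK_PRESETS[kind]
    return TokenTextSplitter(chunk_size=size, chunk_overlap=overlap)
chunks = make_splitter("technical").split_text(text)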

Implementation

Stepwise implementation of Chunking:

Step 1: Install Required Libraries

Installing LangChain for chunking utilities.

Python
!pip install langchain

Step 2: Import Modules

Importing the character-based splitter.

Python
from langchain.text_splitter import CharacterTextSplitter

Step 3: Load the Document

Reading the input text file.

Python
with open("sample_doc.txt", "r") as f:
    text = f.read()

Step 4: Create a Fixed-Size Chunker

Defining chunk size and overlap values.

Python
splitter = CharacterTextSplitter(chunk_size=300, chunk_overlap=50)

Step 5: Split the Text

Generating multiple fixed-length text segments.

Python
chunks = splitter.split_text(text)

Step 6: Print the Output

Displaying the number of chunks and the first sample.

Python
print(f"Total Chunks Created: {len(chunks)}")
print(chunks[0])

Output:

Total Chunks Created: 4

Machine learning is a branch of artificial intelligence focused on building systems that learn from data. These systems improve their performance over time without being explicitly programmed. There are many applications of machine learning, such as image classification, speech recognition, recommendation systems, and autonomous driving.

Applications

Some of the applications of chunking in LangChain are:

  1. Question Answering: Chunking ensures that only the most relevant text segments are passed to the model, resulting in accurate and context-aware answers.
  2. Document Summarization: Long reports and research papers can be divided into sections, allowing LLMs to condense information more effectively.
  3. Semantic Search: By chunking text into context-rich pieces, search engines retrieve precise, meaningful results rather than broad document matches (see the sketch after this list).
  4. Chatbots: Segmented knowledge bases provide chatbots with localized context, improving reply quality and reducing hallucinations.
  5. Knowledge Graphs: Chunked text can be transformed into nodes and edges, enabling reasoning across distributed concepts and relationships.
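
As a concrete instance of the semantic-search application, here is a minimal sketch that embeds chunks into a FAISS index and queries it. It assumes the faiss-cpu and langchain-openai packages and an OpenAI API key; any embedding model and vector store would do:

Python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
# Chunk the document, embed every chunk and index the vectors.
chunks = RecursiveCharacterTextSplitter(
    chunk_size=400, chunk_overlap=80
).split_text(text)
store = FAISS.from_texts(chunks, OpenAIEmbeddings())
# Retrieval now matches the query against focused chunks, not whole documents.
for doc in store.similarity_search("What is machine learning?", k=2):
    print(doc.page_content)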
