Overview

Airweave automatically chunks documents before embedding to ensure:
  • Semantic coherence: Chunks represent complete ideas
  • Token limits: Chunks fit within embedding model limits (8192 tokens max)
  • Search quality: Granular chunks return precise results
Two specialized chunkers handle different content types:

Semantic Chunker

For natural language content (docs, emails, support tickets).
Uses embedding similarity to find topic boundaries.

Code Chunker

For source code files.
Uses AST parsing to chunk at function/class boundaries.

Semantic Chunker

The SemanticChunker uses local embedding models to detect semantic boundaries without external API calls.

How It Works

1. Sentence Splitting

Splits document into sentences using delimiters:
SENTENCE_DELIMITERS = [". ", "! ", "? ", "\n"]
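The split step can be sketched as a simple pass over these delimiters, keeping each delimiter with the previous sentence (an illustrative stand-in, not the actual implementation):

```python
# Illustrative delimiter-based sentence split; keeps each delimiter with
# the previous sentence (the "prev" behavior described below).
SENTENCE_DELIMITERS = [". ", "! ", "? ", "\n"]

def split_sentences(text):
    sentences = [text]
    for delim in SENTENCE_DELIMITERS:
        next_pass = []
        for piece in sentences:
            parts = piece.split(delim)
            for i, part in enumerate(parts):
                if i < len(parts) - 1:
                    part += delim  # keep delimiter with the previous sentence
                if part.strip():
                    next_pass.append(part)
        sentences = next_pass
    return sentences
```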

2. Embedding Similarity

Computes embeddings for each sentence using a lightweight local model:
# Default: minishlab/potion-base-8M (8M params, ~0.5s/doc)
EMBEDDING_MODEL = "minishlab/potion-base-8M"
Compares similarity in a sliding window:
SIMILARITY_WINDOW = 10  # Compare 10 consecutive sentences

3. Boundary Detection

Identifies topic shifts when similarity drops below threshold:
SIMILARITY_THRESHOLD = 0.01  # Lower = larger chunks
Creates semantic groups (chunks) of related sentences.
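The boundary test can be sketched as a threshold over adjacent-sentence similarities (a simplification of the windowed comparison; names here are illustrative, pure Python):

```python
# Sketch of threshold-based boundary detection over precomputed sentence
# embeddings (simplified: compares adjacent pairs, not a full window).
import math

SIMILARITY_THRESHOLD = 0.01

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def find_boundaries(embeddings, threshold=SIMILARITY_THRESHOLD):
    """Indices where similarity to the previous sentence drops below threshold."""
    return [
        i for i in range(1, len(embeddings))
        if cosine_similarity(embeddings[i - 1], embeddings[i]) < threshold
    ]
```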

4. Token Recounting

Recounts tokens using OpenAI’s tiktoken (cl100k_base) for accuracy:
chunk.token_count = len(
    tiktoken_tokenizer.encode(chunk.text, allowed_special="all")
)

5. Safety Net

Splits any oversized chunks (>8192 tokens) at exact token boundaries:
if chunk.token_count > MAX_TOKENS_PER_CHUNK:  # 8192
    # Use TokenChunker to force-split
    split_chunks = token_chunker.chunk_batch([chunk.text])
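The force-split can be pictured as a fixed-size slice over the token sequence (a toy stand-in for the TokenChunker, not its actual implementation):

```python
# Toy stand-in for the TokenChunker force-split: slice the token
# sequence into consecutive pieces of at most max_tokens each.
MAX_TOKENS_PER_CHUNK = 8192

def force_split(tokens, max_tokens=MAX_TOKENS_PER_CHUNK):
    """Split a token list into consecutive pieces of at most max_tokens."""
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]
```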

Configuration

All constants are defined in platform/chunkers/semantic.py:
MAX_TOKENS_PER_CHUNK
int
default:"8192"
Hard limit matching OpenAI’s text-embedding-3-small limit. Enforced by the TokenChunker safety net.
SEMANTIC_CHUNK_SIZE
int
default:"4096"
Target size for semantic groups (soft limit). Tradeoff:
  • Larger = more context per chunk, fewer API calls
  • Smaller = more precise search results, more chunks
OVERLAP_TOKENS
int
default:"128"
Token overlap between consecutive chunks (reserved for future use)
EMBEDDING_MODEL
string
default:"minishlab/potion-base-8M"
Local embedding model for chunking decisions. Available options (sorted by speed):
Model2Vec (included with chonkie[semantic]):
  • minishlab/potion-base-8M - 8M params, ~0.5s/doc, good quality ⭐
  • minishlab/potion-base-32M - 32M params, ~1s/doc, better quality
  • minishlab/potion-base-128M - 128M params, ~2-3s/doc, best Model2Vec
SentenceTransformer (requires: poetry add sentence-transformers):
  • all-MiniLM-L6-v2 - 33M params, ~1-2s/doc, good quality
  • all-MiniLM-L12-v2 - 66M params, ~2-3s/doc, better quality
  • all-mpnet-base-v2 - 110M params, ~3-5s/doc, best quality
This model is only for chunking (finding semantic boundaries). Final embeddings use your configured DENSE_EMBEDDER (OpenAI/Mistral/Local).
SIMILARITY_THRESHOLD
float
default:"0.01"
Threshold for detecting topic boundaries (0-1 range). Tradeoff:
  • Lower (0.001-0.01): Larger chunks, fewer splits, more context
  • Higher (0.05-0.1): Smaller chunks, more splits, precise boundaries
Default (0.01) balances context and granularity.
SIMILARITY_WINDOW
int
default:"10"
Number of consecutive sentences to compare for similarity. Larger window = smoother chunking, slower processing.
MIN_SENTENCES_PER_CHUNK
int
default:"1"
Minimum sentences per chunk (prevents tiny fragments)
MIN_CHARACTERS_PER_SENTENCE
int
default:"24"
Minimum characters to count as a sentence

Advanced Features

Similarity Smoothing

Smooths similarity scores to reduce noisy boundaries:
FILTER_WINDOW = 5         # Window length for filter
FILTER_POLYORDER = 3      # Polynomial order
FILTER_TOLERANCE = 0.2    # Boundary detection tolerance
Reduces over-segmentation from minor similarity fluctuations.
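The effect can be illustrated with a simple moving-average stand-in for the polynomial filter configured above (illustrative only, not the actual filter):

```python
# Simple moving-average smoothing as a stand-in for the polynomial
# filter configured above; averages each score over a centered window.
def smooth(scores, window=5):
    """Average each score over a centered window, clamped at the edges."""
    half = window // 2
    out = []
    for i in range(len(scores)):
        segment = scores[max(0, i - half): i + half + 1]
        out.append(sum(segment) / len(segment))
    return out
```

A single spiky score gets spread across its neighbors, so one noisy similarity dip no longer forces a chunk boundary.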
Skip Window

Merges non-consecutive similar groups:
SKIP_WINDOW = 0  # 0=disabled, >0=merge similar groups
Currently disabled (0). Enable to merge related sections separated by short transitions.
Delimiter Handling

Configures how sentence delimiters are preserved:
SENTENCE_DELIMITERS = [". ", "! ", "? ", "\n"]
INCLUDE_DELIMITER = "prev"  # Include with previous sentence
Options: "prev", "next", "none"

Two-Stage Pipeline

The semantic chunker uses a two-stage approach:
  • Stage 1: Semantic boundary detection (local embedding model)
  • Stage 1.5: Token recounting with tiktoken (OpenAI compatibility)
  • Stage 2: Safety net for oversized chunks (force-split at token boundaries)
The TokenChunker safety net guarantees all chunks are ≤8192 tokens, even if semantic chunking produces large groups.
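The end-to-end flow can be sketched with toy stand-ins (word counts in place of tiktoken, paragraph breaks in place of embedding-based boundaries):

```python
# Toy end-to-end sketch of the two-stage pipeline; stand-ins only.
MAX_TOKENS = 8  # toy stand-in for the real 8192 limit

def semantic_groups(text):
    """Stage 1: toy boundary detection (paragraph breaks)."""
    return [p for p in text.split("\n\n") if p]

def count_tokens(chunk):
    """Stage 1.5: toy token recount (word count instead of tiktoken)."""
    return len(chunk.split())

def chunk_document(text):
    """Stage 2: force-split any group that exceeds the token limit."""
    out = []
    for group in semantic_groups(text):
        if count_tokens(group) > MAX_TOKENS:
            words = group.split()
            out += [" ".join(words[i:i + MAX_TOKENS])
                    for i in range(0, len(words), MAX_TOKENS)]
        else:
            out.append(group)
    return out
```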

Example Workflow

from airweave.platform.chunkers.semantic import SemanticChunker

chunker = SemanticChunker()  # Singleton instance

# Batch processing
documents = [
    "Long document about machine learning...",
    "Technical guide to databases...",
    "Product requirements document..."
]

results = await chunker.chunk_batch(documents)

# results[0] = List of chunks for document 0
for chunk in results[0]:
    print(f"Chunk: {chunk['text'][:100]}...")
    print(f"Tokens: {chunk['token_count']}")
    print(f"Range: {chunk['start_index']}-{chunk['end_index']}")
    print()

Code Chunker

The CodeChunker uses AST (Abstract Syntax Tree) parsing to chunk at logical code boundaries.

How It Works

1. Language Detection

Auto-detects programming language using Magika:
language="auto"  # Supports Python, JS, Java, Go, etc.

2. AST Parsing

Parses code into syntax tree nodes:
  • Functions
  • Classes
  • Methods
  • Modules
Chunks at natural boundaries between nodes.
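For Python specifically, the idea of finding definition boundaries can be sketched with the stdlib ast module (the real chunker uses tree-sitter grammars across many languages):

```python
# Illustrative Python-only sketch: find top-level function/class
# boundaries with the stdlib ast module (the real chunker uses
# tree-sitter grammars, not this).
import ast

def top_level_spans(source):
    """Return (name, start_line, end_line) for top-level functions and classes."""
    tree = ast.parse(source)
    return [
        (node.name, node.lineno, node.end_lineno)
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]
```

Chunking at these spans keeps each function or class intact instead of cutting it mid-body.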

3. Token Recounting

Recounts tokens with tiktoken:
# Chonkie's CodeChunker underestimates tokens
# (counts AST nodes, not whitespace/gaps)
chunk.token_count = len(
    tiktoken_tokenizer.encode(chunk.text, allowed_special="all")
)

4. Safety Net

Splits oversized chunks (>8192 tokens) at token boundaries:
if chunk.token_count > MAX_TOKENS_PER_CHUNK:
    split_chunks = token_chunker.chunk_batch([chunk.text])

Configuration

MAX_TOKENS_PER_CHUNK
int
default:"8192"
Hard limit enforced by TokenChunker safety net
CHUNK_SIZE
int
default:"2048"
Target chunk size for AST groups. Note: can be exceeded by large AST nodes (e.g., a 3000-line function); the safety net handles this.
TOKENIZER
string
default:"cl100k_base"
OpenAI’s tiktoken encoding for accurate token counting

Supported Languages

The CodeChunker auto-detects and supports:
  • Python
  • JavaScript / TypeScript
  • Java
  • Go
  • C / C++
  • Rust
  • Ruby
  • PHP
  • And more via tree-sitter grammars
For unsupported languages, the chunker falls back to token-based splitting.

Example Workflow

from airweave.platform.chunkers.code import CodeChunker

chunker = CodeChunker()  # Singleton instance

# Batch processing
code_files = [
    "def calculate_total(items):\n    return sum(item.price for item in items)\n\nclass Order:\n    ...",
    "function processPayment(amount) {\n    // ...\n}\n\nclass PaymentProcessor {\n    ..."
]

results = await chunker.chunk_batch(code_files)

# results[0] = List of chunks for code_files[0]
for chunk in results[0]:
    print(f"Chunk: {chunk['text'][:100]}...")
    print(f"Tokens: {chunk['token_count']}")

Advantages Over Token-Based Chunking

AST-based (Code Chunker):
# Chunk 1: Complete function
def calculate_total(items):
    subtotal = sum(item.price for item in items)
    tax = subtotal * 0.08
    return subtotal + tax

# Chunk 2: Complete class
class Order:
    def __init__(self, items):
        self.items = items
Token-based (naive):
# Chunk 1: Incomplete function
def calculate_total(items):
    subtotal = sum(item.price for item in items)
    tax = subtotal * 0.08

# Chunk 2: Orphaned code
    return subtotal + tax

class Order:
    def __init__(self, items):

Token Counting

Both chunkers use tiktoken for accurate OpenAI token counting:
from airweave.platform.tokenizers import get_tokenizer

tokenizer = get_tokenizer("cl100k_base")

# Count tokens
token_count = len(tokenizer.encode(
    text, 
    allowed_special="all"  # Handle special tokens like <|endoftext|>
))

Performance Considerations

Singleton Pattern

Both chunkers use singletons to avoid reloading models:
# Models loaded once per pod, shared across all syncs
chunker = SemanticChunker()  # Returns same instance
Benefits:
  • No model reload overhead (~2-3s per sync)
  • Lower memory usage
  • Faster sync throughput
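A minimal singleton sketch via __new__ (illustrative; the real chunkers may manage their shared instance differently):

```python
# Minimal singleton sketch: repeated construction returns the same
# instance, so the expensive model load happens only once.
class SemanticChunkerSketch:
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.model_loaded = True  # expensive load runs once here
        return cls._instance
```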

Async Thread Pool

Chonkie chunkers are synchronous, so we use thread pools:
from airweave.platform.sync.async_helpers import run_in_thread_pool

# Prevents blocking the event loop
results = await run_in_thread_pool(
    self._semantic_chunker.chunk_batch, 
    texts
)
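An illustrative equivalent using asyncio.to_thread (Python 3.9+); the run_in_thread_pool helper is assumed to wrap something similar:

```python
# Illustrative: offload a synchronous call to a worker thread so the
# event loop stays responsive (asyncio.to_thread, Python 3.9+).
import asyncio

def blocking_chunk(texts):
    """Stand-in for the synchronous Chonkie chunk_batch call."""
    return [[t.upper()] for t in texts]

async def chunk_async(texts):
    # Runs the blocking call in a worker thread instead of the event loop
    return await asyncio.to_thread(blocking_chunk, texts)
```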

Batch Processing

# Process multiple documents in one call
results = await chunker.chunk_batch([
    document_1,
    document_2,
    document_3
])

# Returns: [
#   [chunk1_doc1, chunk2_doc1],  # Document 1 chunks
#   [chunk1_doc2],                # Document 2 chunks
#   [chunk1_doc3, chunk2_doc3, chunk3_doc3]  # Document 3 chunks
# ]

Troubleshooting

Symptom:
SyncFailureError: PROGRAMMING ERROR: Chunk has 9500 tokens after 
TokenChunker fallback (max: 8192). TokenChunker failed to enforce hard limit.
Cause: Bug in the TokenChunker safety net (should never happen)
Solution: This indicates a chunker bug. Report it to the Airweave team.
Symptom:
[CodeChunker] Skipping empty chunk - this may indicate a chunker bug
Cause: Edge case in AST parsing producing empty nodes
Solution: Empty chunks are automatically filtered. No action needed.
Symptom: Code chunked poorly as plain text
Cause: Unsupported language or ambiguous file extension
Solution:
  1. Check whether the language is supported (Python, JS, Java, Go, etc.)
  2. If it is not, the chunker falls back to token-based chunking (still works, but less optimal)
Symptom: First sync takes 2-3 seconds longer
Cause: Lazy model initialization on first use
Solution: This is expected. Subsequent syncs reuse the loaded model (singleton).
Symptom: Chunks are too granular or too large
Solution: Adjust the semantic chunker settings in platform/chunkers/semantic.py.
Fewer, larger chunks:
SEMANTIC_CHUNK_SIZE = 6144  # Increase from 4096
SIMILARITY_THRESHOLD = 0.005  # Decrease from 0.01
More, smaller chunks:
SEMANTIC_CHUNK_SIZE = 2048  # Decrease from 4096
SIMILARITY_THRESHOLD = 0.05  # Increase from 0.01

Next Steps

Embeddings

Configure embedding models for chunked content

Transformers

Transform entities before chunking

Search

Use chunks in hybrid search queries

Development

Learn about chunking internals