Overview
Airweave automatically chunks documents before embedding to ensure:

- Semantic coherence: Chunks represent complete ideas
- Token limits: Chunks fit within embedding model limits (8192 tokens max)
- Search quality: Granular chunks return precise results
Semantic Chunker

For natural language content (docs, emails, support tickets). Uses embedding similarity to find topic boundaries.

Code Chunker

For source code files. Uses AST parsing to chunk at function/class boundaries.
Semantic Chunker
The SemanticChunker uses local embedding models to detect semantic boundaries without external API calls.
How It Works
Embedding Similarity
Computes embeddings for each sentence using a lightweight local model, then compares similarity across a sliding window of sentences.
Boundary Detection
Identifies topic shifts when similarity drops below the threshold, and creates semantic groups (chunks) of related sentences.
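A minimal sketch of this boundary-detection idea, using toy 2-D vectors in place of real sentence embeddings (function names are illustrative, not Airweave's):

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def split_on_boundaries(embeddings, threshold=0.5):
    """Group sentence indices; start a new group whenever the
    similarity to the previous sentence drops below the threshold."""
    groups = [[0]]
    for i in range(1, len(embeddings)):
        if cosine(embeddings[i - 1], embeddings[i]) < threshold:
            groups.append([i])       # topic shift -> new chunk
        else:
            groups[-1].append(i)     # same topic -> extend chunk
    return groups

# Toy "embeddings": sentences 0-2 cluster together, 3-4 shift topic.
embs = [np.array(v, dtype=float) for v in
        [(1.0, 0.1), (0.9, 0.2), (1.0, 0.0), (0.1, 1.0), (0.2, 0.9)]]
print(split_on_boundaries(embs))  # [[0, 1, 2], [3, 4]]
```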
Configuration
All constants are defined in platform/chunkers/semantic.py:
Max chunk tokens: hard limit matching OpenAI's text-embedding-3-small limit (8192 tokens), enforced by the TokenChunker safety net.

Target chunk size: target size for semantic groups (soft limit). Tradeoff:
- Larger = more context per chunk, fewer API calls
- Smaller = more precise search results, more chunks

Chunk overlap: token overlap between consecutive chunks (reserved for future use).
Chunking model: local embedding model for chunking decisions. Available options (sorted by speed):

Model2Vec (included with chonkie[semantic]):
- minishlab/potion-base-8M - 8M params, ~0.5s/doc, good quality ⭐
- minishlab/potion-base-32M - 32M params, ~1s/doc, better quality
- minishlab/potion-base-128M - 128M params, ~2-3s/doc, best Model2Vec quality

Sentence Transformers (install with poetry add sentence-transformers):
- all-MiniLM-L6-v2 - 33M params, ~1-2s/doc, good quality
- all-MiniLM-L12-v2 - 66M params, ~2-3s/doc, better quality
- all-mpnet-base-v2 - 110M params, ~3-5s/doc, best quality
This model is only for chunking (finding semantic boundaries). Final embeddings use your configured DENSE_EMBEDDER (OpenAI/Mistral/Local).

Similarity threshold: threshold for detecting topic boundaries (0-1 range). Tradeoff:
- Lower (0.001-0.01): Larger chunks, fewer splits, more context
- Higher (0.05-0.1): Smaller chunks, more splits, precise boundaries
Similarity window: number of consecutive sentences to compare for similarity. Larger window = smoother chunking, slower processing.
Minimum sentences per chunk (prevents tiny fragments)
Minimum characters to count as a sentence
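Pulled together, the constants above might look like the following sketch. The names and the values not stated in this page are assumptions, not the actual contents of platform/chunkers/semantic.py:

```python
# Illustrative constants; names and unstated values are assumed.
MAX_CHUNK_TOKENS = 8192           # hard limit (text-embedding-3-small)
TARGET_CHUNK_TOKENS = 1024        # soft target for semantic groups (assumed value)
CHUNK_OVERLAP_TOKENS = 0          # reserved for future use
CHUNKING_MODEL = "minishlab/potion-base-8M"  # local model for boundary detection
SIMILARITY_THRESHOLD = 0.01       # topic-boundary threshold, 0-1 (assumed value)
SIMILARITY_WINDOW = 3             # consecutive sentences compared (assumed value)
MIN_SENTENCES_PER_CHUNK = 2       # prevents tiny fragments (assumed value)
MIN_CHARACTERS_PER_SENTENCE = 12  # shorter strings are not sentences (assumed value)
```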
Advanced Features
Savitzky-Golay Filter
Smooths similarity scores to reduce noisy boundaries, which cuts down on over-segmentation from minor similarity fluctuations.
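To illustrate the effect, here is a small hand-rolled Savitzky-Golay filter (a least-squares polynomial fit in each sliding window; the real chunker's implementation may differ) applied to a noisy similarity series with one genuine topic boundary:

```python
import numpy as np

def savgol_smooth(y, window=5, order=2):
    """Savitzky-Golay: fit a low-order polynomial in each sliding
    window and take its value at the window centre."""
    half = window // 2
    padded = np.pad(y, half, mode="edge")
    out = np.empty(len(y))
    x = np.arange(window)
    for i in range(len(y)):
        coeffs = np.polyfit(x, padded[i:i + window], order)
        out[i] = np.polyval(coeffs, half)
    return out

# Noisy similarity scores with one real topic boundary (the dip at index 5).
sims = np.array([0.82, 0.79, 0.84, 0.80, 0.81, 0.30, 0.78, 0.83, 0.80, 0.82])
smoothed = savgol_smooth(sims)
```

The dip survives as the minimum of the smoothed series, but minor wiggles elsewhere are flattened, so only the real boundary crosses the threshold.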
Skip Window
Merges non-consecutive similar groups. Currently disabled (0); enable it to merge related sections separated by short transitions.
Delimiter Handling
Configures how sentence delimiters are preserved. Options: "prev", "next", "none".

Two-Stage Pipeline
The semantic chunker uses a two-stage approach:

- Stage 1: semantic boundary detection (local embedding model)
- Stage 1.5: token recounting with tiktoken (OpenAI compatibility)
- Stage 2: safety net for oversized chunks (force-split at token boundaries)

The TokenChunker safety net guarantees all chunks are ≤8192 tokens, even if semantic chunking produces large groups.
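A stripped-down sketch of the pipeline's shape, with token lists standing in for real tokenized text (`force_split` and `two_stage` are hypothetical names, not Airweave's):

```python
def force_split(tokens, max_tokens):
    """Stage 2 safety net: force-split an oversized token list into
    pieces of at most max_tokens, at plain token boundaries."""
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]

def two_stage(semantic_groups, max_tokens=8192):
    """semantic_groups: token lists produced by Stage 1 boundary detection.
    Every returned chunk is guaranteed to be <= max_tokens tokens."""
    chunks = []
    for group in semantic_groups:
        if len(group) > max_tokens:
            chunks.extend(force_split(group, max_tokens))
        else:
            chunks.append(group)
    return chunks

# One oversized 20-token group next to a small one, with a toy limit of 8:
sizes = [len(c) for c in two_stage([list(range(20)), list(range(5))], max_tokens=8)]
print(sizes)  # [8, 8, 4, 5]
```

Note that the soft limit can be exceeded by Stage 1; only the Stage 2 pass makes the hard limit unconditional.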
Example Workflow
Code Chunker
The CodeChunker uses AST (Abstract Syntax Tree) parsing to chunk at logical code boundaries.
How It Works
AST Parsing
Parses code into syntax tree nodes:
- Functions
- Classes
- Methods
- Modules
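As an illustration of the idea, here is the same principle using Python's standard-library ast module rather than the tree-sitter machinery the CodeChunker actually uses: each top-level function or class becomes one intact chunk.

```python
import ast

source = '''\
def add(a, b):
    return a + b

class Greeter:
    def hello(self, name):
        return "hello " + name
'''

def chunk_by_ast(src):
    """Split a module at top-level function/class boundaries,
    keeping each definition whole in a single chunk."""
    tree = ast.parse(src)
    lines = src.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append("\n".join(lines[node.lineno - 1:node.end_lineno]))
    return chunks

chunks = chunk_by_ast(source)
print(len(chunks))  # 2: one chunk for add(), one for Greeter
```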
Configuration
Max chunk tokens: hard limit enforced by the TokenChunker safety net.

Target chunk size: target size for AST groups. Note: this can be exceeded by large AST nodes (e.g., a 3000-line function); the safety net handles this.

Tokenizer: OpenAI's tiktoken encoding for accurate token counting.
Supported Languages
The CodeChunker auto-detects and supports:

- Python
- JavaScript / TypeScript
- Java
- Go
- C / C++
- Rust
- Ruby
- PHP
- And more via tree-sitter grammars
For unsupported languages, the chunker falls back to token-based splitting.
Example Workflow
Advantages Over Token-Based Chunking
- Semantic Coherence
- Search Quality
- Context Preservation
AST-based chunking (Code Chunker) keeps each function or class intact as a single chunk; token-based (naive) splitting can cut a definition in half mid-body.
Token Counting
Both chunkers use tiktoken for accurate OpenAI token counting.

Performance Considerations
Singleton Pattern
Both chunkers use singletons to avoid reloading models:

- No model reload overhead (~2-3s per sync)
- Lower memory usage
- Faster sync throughput
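The pattern can be sketched as follows (the class name and the `get()` accessor are illustrative, not Airweave's actual API):

```python
class ChunkerSingleton:
    """Holds one chunker instance per process so the embedding model
    is loaded once and reused across syncs."""
    _instance = None

    def __init__(self):
        # Stand-in for the expensive model load (~2-3 s in the real chunker).
        self.model = object()

    @classmethod
    def get(cls):
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

# Both calls return the very same instance; the model is loaded only once.
assert ChunkerSingleton.get() is ChunkerSingleton.get()
```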
Async Thread Pool
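A minimal sketch of the pattern with asyncio.to_thread, where `chunk_sync` is a hypothetical stand-in for a blocking Chonkie chunker call:

```python
import asyncio

def chunk_sync(text):
    # Stand-in for a blocking Chonkie chunker call.
    return text.split(". ")

async def chunk_async(text):
    # Offload the blocking call to the default thread pool
    # so the event loop stays responsive during a sync.
    return await asyncio.to_thread(chunk_sync, text)

chunks = asyncio.run(chunk_async("First topic. Second topic"))
print(chunks)  # ['First topic', 'Second topic']
```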
Chonkie chunkers are synchronous, so we use thread pools.

Batch Processing
Troubleshooting
Chunks exceeding 8192 tokens
Symptom: chunks larger than 8192 tokens reach the embedder.

Cause: bug in the TokenChunker safety net (should never happen).

Solution: this indicates a chunker bug. Report it to the Airweave team.
Empty chunks after chunking
Symptom: empty chunks appear after chunking.

Cause: edge case in AST parsing producing empty nodes.

Solution: empty chunks are automatically filtered. No action needed.
CodeChunker fails to detect language
Symptom: code is chunked poorly, as plain text.

Cause: unsupported language or ambiguous file extension.

Solution:
- Check if language is supported (Python, JS, Java, Go, etc.)
- Otherwise the chunker falls back to token-based chunking (still works, less optimal)
SemanticChunker model loading slow
Symptom: the first sync takes 2-3 seconds longer.

Cause: lazy model initialization on first use.

Solution: this is expected. Subsequent syncs reuse the loaded model (singleton).
Too many/too few chunks
Symptom: chunks are too granular or too large.

Solution: adjust the semantic chunker settings in platform/chunkers/semantic.py. Per the threshold tradeoff above, a higher similarity threshold yields more, smaller chunks; a lower threshold yields fewer, larger chunks.

Next Steps
Embeddings
Configure embedding models for chunked content
Transformers
Transform entities before chunking
Search
Use chunks in hybrid search queries
Development
Learn about chunking internals