Overview

Airweave automatically chunks documents before embedding to ensure:
  • Semantic coherence: Chunks represent complete ideas
  • Token limits: Chunks fit within embedding model limits (8192 tokens max)
  • Search quality: Granular chunks return precise results
Two specialized chunkers handle different content types:

Semantic Chunker

For natural language content (docs, emails, support tickets).
Uses embedding similarity to find topic boundaries.

Code Chunker

For source code files.
Uses AST parsing to chunk at function/class boundaries.

Semantic Chunker

The SemanticChunker uses local embedding models to detect semantic boundaries without external API calls.

How It Works

1. Sentence Splitting

Splits document into sentences using delimiters:
SENTENCE_DELIMITERS = [". ", "! ", "? ", "\n"]
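The split step can be sketched as a simple pass over these delimiters, keeping each delimiter with the previous sentence (an illustrative stand-in, not the actual implementation):

```python
# Illustrative delimiter-based sentence split; keeps each delimiter with
# the previous sentence (the "prev" behavior described below).
SENTENCE_DELIMITERS = [". ", "! ", "? ", "\n"]

def split_sentences(text):
    sentences = [text]
    for delim in SENTENCE_DELIMITERS:
        next_pass = []
        for piece in sentences:
            parts = piece.split(delim)
            for i, part in enumerate(parts):
                if i < len(parts) - 1:
                    part += delim  # keep delimiter with the previous sentence
                if part.strip():
                    next_pass.append(part)
        sentences = next_pass
    return sentences
```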

2. Embedding Similarity

Computes embeddings for each sentence using a lightweight local model:
# Default: minishlab/potion-base-8M (8M params, ~0.5s/doc)
EMBEDDING_MODEL = "minishlab/potion-base-8M"
Compares similarity in a sliding window:
SIMILARITY_WINDOW = 10  # Compare 10 consecutive sentences

3. Boundary Detection

Identifies topic shifts when similarity drops below threshold:
SIMILARITY_THRESHOLD = 0.01  # Lower = larger chunks
Creates semantic groups (chunks) of related sentences.
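The boundary test can be sketched as a threshold over adjacent-sentence similarities (a simplification of the windowed comparison; names here are illustrative, pure Python):

```python
# Sketch of threshold-based boundary detection over precomputed sentence
# embeddings (simplified: compares adjacent pairs, not a full window).
import math

SIMILARITY_THRESHOLD = 0.01

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def find_boundaries(embeddings, threshold=SIMILARITY_THRESHOLD):
    """Indices where similarity to the previous sentence drops below threshold."""
    return [
        i for i in range(1, len(embeddings))
        if cosine_similarity(embeddings[i - 1], embeddings[i]) < threshold
    ]
```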

4. Token Recounting

Recounts tokens using OpenAI’s tiktoken (cl100k_base) for accuracy:
chunk.token_count = len(
    tiktoken_tokenizer.encode(chunk.text, allowed_special="all")
)

5. Safety Net

Splits any oversized chunks (>8192 tokens) at exact token boundaries:
if chunk.token_count > MAX_TOKENS_PER_CHUNK:  # 8192
    # Use TokenChunker to force-split
    split_chunks = token_chunker.chunk_batch([chunk.text])
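The force-split can be pictured as a fixed-size slice over the token sequence (a toy stand-in for the TokenChunker, not its actual implementation):

```python
# Toy stand-in for the TokenChunker force-split: slice the token
# sequence into consecutive pieces of at most max_tokens each.
MAX_TOKENS_PER_CHUNK = 8192

def force_split(tokens, max_tokens=MAX_TOKENS_PER_CHUNK):
    """Split a token list into consecutive pieces of at most max_tokens."""
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]
```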

Configuration

All constants are defined in platform/chunkers/semantic.py:
MAX_TOKENS_PER_CHUNK
int
default:"8192"
Hard limit matching OpenAI’s text-embedding-3-small limit. Enforced by the TokenChunker safety net.
SEMANTIC_CHUNK_SIZE
int
default:"4096"
Target size for semantic groups (soft limit). Tradeoff:
  • Larger = more context per chunk, fewer API calls
  • Smaller = more precise search results, more chunks
OVERLAP_TOKENS
int
default:"128"
Token overlap between consecutive chunks (reserved for future use)
EMBEDDING_MODEL
string
default:"minishlab/potion-base-8M"
Local embedding model for chunking decisions. Available options (sorted by speed):
Model2Vec (included with chonkie[semantic]):
  • minishlab/potion-base-8M - 8M params, ~0.5s/doc, good quality ⭐
  • minishlab/potion-base-32M - 32M params, ~1s/doc, better quality
  • minishlab/potion-base-128M - 128M params, ~2-3s/doc, best Model2Vec
SentenceTransformer (requires: poetry add sentence-transformers):
  • all-MiniLM-L6-v2 - 33M params, ~1-2s/doc, good quality
  • all-MiniLM-L12-v2 - 66M params, ~2-3s/doc, better quality
  • all-mpnet-base-v2 - 110M params, ~3-5s/doc, best quality
This model is only for chunking (finding semantic boundaries). Final embeddings use your configured DENSE_EMBEDDER (OpenAI/Mistral/Local).
SIMILARITY_THRESHOLD
float
default:"0.01"
Threshold for detecting topic boundaries (0-1 range). Tradeoff:
  • Lower (0.001-0.01): Larger chunks, fewer splits, more context
  • Higher (0.05-0.1): Smaller chunks, more splits, precise boundaries
Default (0.01) balances context and granularity.
SIMILARITY_WINDOW
int
default:"10"
Number of consecutive sentences to compare for similarity. Larger window = smoother chunking, slower processing.
MIN_SENTENCES_PER_CHUNK
int
default:"1"
Minimum sentences per chunk (prevents tiny fragments)
MIN_CHARACTERS_PER_SENTENCE
int
default:"24"
Minimum characters to count as a sentence

Advanced Features

Similarity Smoothing

Smooths similarity scores to reduce noisy boundaries:
FILTER_WINDOW = 5         # Window length for filter
FILTER_POLYORDER = 3      # Polynomial order
FILTER_TOLERANCE = 0.2    # Boundary detection tolerance
Reduces over-segmentation from minor similarity fluctuations.
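The effect can be illustrated with a simple moving-average stand-in for the polynomial filter configured above (illustrative only, not the actual filter):

```python
# Simple moving-average smoothing as a stand-in for the polynomial
# filter configured above; averages each score over a centered window.
def smooth(scores, window=5):
    """Average each score over a centered window, clamped at the edges."""
    half = window // 2
    out = []
    for i in range(len(scores)):
        segment = scores[max(0, i - half): i + half + 1]
        out.append(sum(segment) / len(segment))
    return out
```

A single spiky score gets spread across its neighbors, so one noisy similarity dip no longer forces a chunk boundary.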
Skip Window

Merges non-consecutive similar groups:
SKIP_WINDOW = 0  # 0=disabled, >0=merge similar groups
Currently disabled (0). Enable to merge related sections separated by short transitions.
Delimiter Handling

Configures how sentence delimiters are preserved:
SENTENCE_DELIMITERS = [". ", "! ", "? ", "\n"]
INCLUDE_DELIMITER = "prev"  # Include with previous sentence
Options: "prev", "next", "none"

Two-Stage Pipeline

The semantic chunker uses a two-stage approach:
  • Stage 1: Semantic boundary detection (local embedding model)
  • Stage 1.5: Token recounting with tiktoken (OpenAI compatibility)
  • Stage 2: Safety net for oversized chunks (force-split at token boundaries)
The TokenChunker safety net guarantees all chunks are ≤8192 tokens, even if semantic chunking produces large groups.
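The end-to-end flow can be sketched with toy stand-ins (word counts in place of tiktoken, paragraph breaks in place of embedding-based boundaries):

```python
# Toy end-to-end sketch of the two-stage pipeline; stand-ins only.
MAX_TOKENS = 8  # toy stand-in for the real 8192 limit

def semantic_groups(text):
    """Stage 1: toy boundary detection (paragraph breaks)."""
    return [p for p in text.split("\n\n") if p]

def count_tokens(chunk):
    """Stage 1.5: toy token recount (word count instead of tiktoken)."""
    return len(chunk.split())

def chunk_document(text):
    """Stage 2: force-split any group that exceeds the token limit."""
    out = []
    for group in semantic_groups(text):
        if count_tokens(group) > MAX_TOKENS:
            words = group.split()
            out += [" ".join(words[i:i + MAX_TOKENS])
                    for i in range(0, len(words), MAX_TOKENS)]
        else:
            out.append(group)
    return out
```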

Example Workflow

from airweave.platform.chunkers.semantic import SemanticChunker

chunker = SemanticChunker()  # Singleton instance

# Batch processing
documents = [
    "Long document about machine learning...",
    "Technical guide to databases...",
    "Product requirements document..."
]

results = await chunker.chunk_batch(documents)

# results[0] = List of chunks for document 0
for chunk in results[0]:
    print(f"Chunk: {chunk['text'][:100]}...")
    print(f"Tokens: {chunk['token_count']}")
    print(f"Range: {chunk['start_index']}-{chunk['end_index']}")
    print()

Code Chunker

The CodeChunker uses AST (Abstract Syntax Tree) parsing to chunk at logical code boundaries.

How It Works

1. Language Detection

Auto-detects programming language using Magika:
language="auto"  # Supports Python, JS, Java, Go, etc.

2. AST Parsing

Parses code into syntax tree nodes:
  • Functions
  • Classes
  • Methods
  • Modules
Chunks at natural boundaries between nodes.
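For Python specifically, the idea of finding definition boundaries can be sketched with the stdlib ast module (the real chunker uses tree-sitter grammars across many languages):

```python
# Illustrative Python-only sketch: find top-level function/class
# boundaries with the stdlib ast module (the real chunker uses
# tree-sitter grammars, not this).
import ast

def top_level_spans(source):
    """Return (name, start_line, end_line) for top-level functions and classes."""
    tree = ast.parse(source)
    return [
        (node.name, node.lineno, node.end_lineno)
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]
```

Chunking at these spans keeps each function or class intact instead of cutting it mid-body.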

3. Token Recounting

Recounts tokens with tiktoken:
# Chonkie's CodeChunker underestimates tokens
# (counts AST nodes, not whitespace/gaps)
chunk.token_count = len(
    tiktoken_tokenizer.encode(chunk.text, allowed_special="all")
)

4. Safety Net

Splits oversized chunks (>8192 tokens) at token boundaries:
if chunk.token_count > MAX_TOKENS_PER_CHUNK:
    split_chunks = token_chunker.chunk_batch([chunk.text])

Configuration

MAX_TOKENS_PER_CHUNK
int
default:"8192"
Hard limit enforced by TokenChunker safety net
CHUNK_SIZE
int
default:"2048"
Target chunk size for AST groups. Note: can be exceeded by large AST nodes (e.g., a 3000-line function); the safety net handles this.
TOKENIZER
string
default:"cl100k_base"
OpenAI’s tiktoken encoding for accurate token counting

Supported Languages

The CodeChunker auto-detects and supports:
  • Python
  • JavaScript / TypeScript
  • Java
  • Go
  • C / C++
  • Rust
  • Ruby
  • PHP
  • And more via tree-sitter grammars
For unsupported languages, the chunker falls back to token-based splitting.

Example Workflow

from airweave.platform.chunkers.code import CodeChunker

chunker = CodeChunker()  # Singleton instance

# Batch processing
code_files = [
    "def calculate_total(items):\n    return sum(item.price for item in items)\n\nclass Order:\n    ...",
    "function processPayment(amount) {\n    // ...\n}\n\nclass PaymentProcessor {\n    ..."
]

results = await chunker.chunk_batch(code_files)

# results[0] = List of chunks for code_files[0]
for chunk in results[0]:
    print(f"Chunk: {chunk['text'][:100]}...")
    print(f"Tokens: {chunk['token_count']}")

Advantages Over Token-Based Chunking

AST-based (Code Chunker):
# Chunk 1: Complete function
def calculate_total(items):
    subtotal = sum(item.price for item in items)
    tax = subtotal * 0.08
    return subtotal + tax

# Chunk 2: Complete class
class Order:
    def __init__(self, items):
        self.items = items
Token-based (naive):
# Chunk 1: Incomplete function
def calculate_total(items):
    subtotal = sum(item.price for item in items)
    tax = subtotal * 0.08

# Chunk 2: Orphaned code
    return subtotal + tax

class Order:
    def __init__(self, items):

Token Counting

Both chunkers use tiktoken for accurate OpenAI token counting:
from airweave.platform.tokenizers import get_tokenizer

tokenizer = get_tokenizer("cl100k_base")

# Count tokens
token_count = len(tokenizer.encode(
    text, 
    allowed_special="all"  # Handle special tokens like <|endoftext|>
))

Performance Considerations

Singleton Pattern

Both chunkers use singletons to avoid reloading models:
# Models loaded once per pod, shared across all syncs
chunker = SemanticChunker()  # Returns same instance
Benefits:
  • No model reload overhead (~2-3s per sync)
  • Lower memory usage
  • Faster sync throughput
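A minimal singleton sketch via __new__ (illustrative; the real chunkers may manage their shared instance differently):

```python
# Minimal singleton sketch: repeated construction returns the same
# instance, so the expensive model load happens only once.
class SemanticChunkerSketch:
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.model_loaded = True  # expensive load runs once here
        return cls._instance
```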

Async Thread Pool

Chonkie chunkers are synchronous, so we use thread pools:
from airweave.platform.sync.async_helpers import run_in_thread_pool

# Prevents blocking the event loop
results = await run_in_thread_pool(
    self._semantic_chunker.chunk_batch, 
    texts
)
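An illustrative equivalent using asyncio.to_thread (Python 3.9+); the run_in_thread_pool helper is assumed to wrap something similar:

```python
# Illustrative: offload a synchronous call to a worker thread so the
# event loop stays responsive (asyncio.to_thread, Python 3.9+).
import asyncio

def blocking_chunk(texts):
    """Stand-in for the synchronous Chonkie chunk_batch call."""
    return [[t.upper()] for t in texts]

async def chunk_async(texts):
    # Runs the blocking call in a worker thread instead of the event loop
    return await asyncio.to_thread(blocking_chunk, texts)
```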

Batch Processing

# Process multiple documents in one call
results = await chunker.chunk_batch([
    document_1,
    document_2,
    document_3
])

# Returns: [
#   [chunk1_doc1, chunk2_doc1],  # Document 1 chunks
#   [chunk1_doc2],                # Document 2 chunks
#   [chunk1_doc3, chunk2_doc3, chunk3_doc3]  # Document 3 chunks
# ]

Troubleshooting

Symptom:
SyncFailureError: PROGRAMMING ERROR: Chunk has 9500 tokens after 
TokenChunker fallback (max: 8192). TokenChunker failed to enforce hard limit.
Cause: Bug in the TokenChunker safety net (should never happen)
Solution: This indicates a chunker bug. Report it to the Airweave team.
Symptom:
[CodeChunker] Skipping empty chunk - this may indicate a chunker bug
Cause: Edge case in AST parsing producing empty nodes
Solution: Empty chunks are automatically filtered. No action needed.
Symptom: Code chunked poorly as plain text
Cause: Unsupported language or ambiguous file extension
Solution:
  1. Check whether the language is supported (Python, JS, Java, Go, etc.)
  2. If it is not, the chunker falls back to token-based chunking (still works, but less optimal)
Symptom: First sync takes 2-3 seconds longer
Cause: Lazy model initialization on first use
Solution: This is expected. Subsequent syncs reuse the loaded model (singleton).
Symptom: Chunks are too granular or too large
Solution: Adjust the semantic chunker settings in platform/chunkers/semantic.py.
Fewer, larger chunks:
SEMANTIC_CHUNK_SIZE = 6144  # Increase from 4096
SIMILARITY_THRESHOLD = 0.005  # Decrease from 0.01
More, smaller chunks:
SEMANTIC_CHUNK_SIZE = 2048  # Decrease from 4096
SIMILARITY_THRESHOLD = 0.05  # Increase from 0.01

Next Steps

Embeddings

Configure embedding models for chunked content

Transformers

Transform entities before chunking

Search

Use chunks in hybrid search queries

Development

Learn about chunking internals