Overview

Airweave uses both dense and sparse embeddings for hybrid search:
  • Dense embeddings: Semantic similarity via neural networks (384-3072 dimensions)
  • Sparse embeddings: Keyword matching via BM25 (traditional search)
Combining both provides better search quality than either alone.
All three embedding variables are required in .env:
  • DENSE_EMBEDDER
  • EMBEDDING_DIMENSIONS
  • SPARSE_EMBEDDER
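Combining the dense and sparse result sets requires a fusion step. A minimal sketch using Reciprocal Rank Fusion (a hypothetical standalone illustration; Airweave's actual fusion may happen server-side in the vector database):

```python
# Reciprocal Rank Fusion (RRF): merge dense and sparse result lists,
# rewarding documents that rank well in both. Hypothetical illustration,
# not Airweave's actual fusion code.

def rrf_fuse(dense_ids: list[str], sparse_ids: list[str], k: int = 60) -> list[str]:
    """Combine two ranked ID lists into one fused ranking."""
    scores: dict[str, float] = {}
    for ranking in (dense_ids, sparse_ids):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc_a", "doc_b", "doc_c"]   # semantic nearest neighbours
sparse = ["doc_b", "doc_d", "doc_a"]  # BM25 keyword matches
print(rrf_fuse(dense, sparse))        # -> ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Documents found by both rankings (`doc_a`, `doc_b`) accumulate score from each list, which is why hybrid retrieval tends to beat either method alone.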

Available Dense Embedders

Airweave supports four dense embedding models out of the box:
  • openai_text_embedding_3_small: OpenAI text-embedding-3-small (up to 1536 dimensions, Matryoshka support)
  • openai_text_embedding_3_large: OpenAI text-embedding-3-large (up to 3072 dimensions, Matryoshka support)
  • mistral_embed: Mistral (fixed 1024 dimensions)
  • local_minilm: local MiniLM via the text2vec-transformers container (fixed 384 dimensions)

Available Sparse Embedder

fastembed_bm25
Provider: FastEmbed (Qdrant)
Algorithm: BM25 (Best Matching 25)
Model: Qdrant/bm25
Configuration:
.env
SPARSE_EMBEDDER=fastembed_bm25
Features:
  • Traditional keyword search
  • No API key required
  • Fast, deterministic
  • Complements dense embeddings
Note: Currently the only sparse embedder supported. More coming soon.
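Unlike dense vectors, a sparse embedding only stores the non-zero entries as index/value pairs. A toy sketch of that representation (the real fastembed_bm25 model does its own token hashing and BM25 weighting, not this raw term-frequency scorer):

```python
# Toy sparse embedding: map terms to vocabulary indices with raw
# term-frequency weights, stored as (indices, values) pairs.
# Hypothetical illustration -- fastembed_bm25 uses BM25 weighting instead.

from collections import Counter

def toy_sparse_embedding(text: str, vocab: dict[str, int]) -> tuple[list[int], list[float]]:
    """Return (indices, values) for terms present in the vocabulary."""
    counts = Counter(t for t in text.lower().split() if t in vocab)
    terms = sorted(counts, key=vocab.get)          # order by vocab index
    indices = [vocab[t] for t in terms]
    values = [float(counts[t]) for t in terms]
    return indices, values

vocab = {"search": 0, "hybrid": 1, "embedding": 2}
print(toy_sparse_embedding("hybrid search beats plain search", vocab))
# -> ([0, 1], [2.0, 1.0])
```

Only two of the three vocabulary slots are populated, so only those two indices are stored, which is what keeps sparse vectors compact even over huge vocabularies.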

Matryoshka Embeddings

OpenAI’s text-embedding-3-small and text-embedding-3-large support Matryoshka Representation Learning, allowing you to use fewer dimensions:
Matryoshka embeddings encode information hierarchically:
  • Most important information in early dimensions
  • Less critical information in later dimensions
  • Can truncate to fewer dimensions with minimal quality loss
Example: Use 512 dimensions instead of 1536 for 3x faster search and 67% less storage.
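The truncation idea can be sketched in a few lines: keep the leading dimensions, then re-normalize so cosine similarity still behaves (OpenAI's embeddings API can also return reduced dimensions directly via its `dimensions` parameter):

```python
# Matryoshka truncation: keep the first N dimensions of a full embedding
# and L2-normalize the result. Illustrative sketch with a tiny fake vector.

import math

def truncate_embedding(vector: list[float], dims: int) -> list[float]:
    """Keep the first `dims` components and re-normalize to unit length."""
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]

full = [0.6, 0.8, 0.0, 0.0]          # pretend 4-dim "full" embedding
small = truncate_embedding(full, 2)  # 2-dim Matryoshka slice, unit length
```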
Set EMBEDDING_DIMENSIONS to any value up to the model’s maximum.

text-embedding-3-small (max 1536):
EMBEDDING_DIMENSIONS=512   # Fast, lower quality
EMBEDDING_DIMENSIONS=1024  # Balanced
EMBEDDING_DIMENSIONS=1536  # Maximum quality
text-embedding-3-large (max 3072):
EMBEDDING_DIMENSIONS=768   # Fast, lower quality
EMBEDDING_DIMENSIONS=1536  # Balanced
EMBEDDING_DIMENSIONS=3072  # Maximum quality
| Dimensions | Search Speed | Storage  | Quality   |
|------------|--------------|----------|-----------|
| 256        | 6x faster    | 83% less | Good      |
| 512        | 3x faster    | 67% less | Better    |
| 1024       | 1.5x faster  | 33% less | Very Good |
| 1536       | Baseline     | Baseline | Excellent |
Recommendation: Start with 1536, reduce to 1024 if you need better performance.
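The storage figures follow directly from 4 bytes per float32 component; a quick back-of-envelope check:

```python
# Back-of-envelope storage for float32 dense vectors: 4 bytes per
# component. The percentages match the trade-off table for text-embedding-3-small.

def storage_gb(dimensions: int, num_vectors: int = 1_000_000) -> float:
    """Raw vector storage in GB for num_vectors float32 embeddings."""
    return dimensions * 4 * num_vectors / 1e9

for dims in (256, 512, 1024, 1536):
    saving = 1 - dims / 1536
    print(f"{dims:>4} dims: {storage_gb(dims):.2f} GB per 1M vectors "
          f"({saving:.0%} less than 1536)")
```

At 1536 dimensions, a million vectors take roughly 6.1 GB of raw vector storage; at 512, about 2 GB.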
Changing EMBEDDING_DIMENSIONS after indexing data requires complete re-indexing. All documents must be re-synced.
Airweave validates dimensions at startup against the database:
EmbeddingConfigError: Embedding config mismatch: 
embedding_dimensions: code=1024, db=1536. 
Changing embedding model or dimensions makes all synced 
data unsearchable — you would have to delete all data and resync.

Embedding Configuration Validation

Airweave performs strict validation at startup:
1. Environment Variable Check

Ensures all three required variables are set:
DENSE_EMBEDDER: str = settings.DENSE_EMBEDDER or ""
EMBEDDING_DIMENSIONS: int = settings.EMBEDDING_DIMENSIONS or 0
SPARSE_EMBEDDER: str = settings.SPARSE_EMBEDDER or ""
Error if missing:
EmbeddingConfigError: Required environment variable 'DENSE_EMBEDDER' 
is not set. Add it to your .env file.
Available options: openai_text_embedding_3_small, 
openai_text_embedding_3_large, mistral_embed, local_minilm
2. Registry Lookup

Validates embedder names exist in the registry:
dense_spec = dense_registry.get(DENSE_EMBEDDER)
sparse_spec = sparse_registry.get(SPARSE_EMBEDDER)
3. Dimension Validation

For Matryoshka models (OpenAI):
if EMBEDDING_DIMENSIONS > dense_spec.max_dimensions:
    raise EmbeddingConfigError(
        f"EMBEDDING_DIMENSIONS={EMBEDDING_DIMENSIONS} exceeds "
        f"max_dimensions={dense_spec.max_dimensions}"
    )
For fixed-dimension models (Mistral, Local):
if EMBEDDING_DIMENSIONS != dense_spec.max_dimensions:
    raise EmbeddingConfigError(
        f"Dense embedder '{DENSE_EMBEDDER}' does not support "
        f"Matryoshka dimensions — EMBEDDING_DIMENSIONS must be "
        f"exactly {dense_spec.max_dimensions}"
    )
4. Credential Check

Verifies required API keys are present:
if dense_spec.required_setting:  # e.g., "OPENAI_API_KEY"
    value = getattr(settings, dense_spec.required_setting, None)
    if not value:
        raise EmbeddingConfigError(
            f"Dense embedder '{DENSE_EMBEDDER}' requires setting "
            f"'{dense_spec.required_setting}' but it is not set."
        )
5. Database Reconciliation

Checks configuration against existing deployment metadata:
# First deployment: Create metadata row
if row is None:
    row = VectorDbDeploymentMetadata(
        dense_embedder=DENSE_EMBEDDER,
        embedding_dimensions=EMBEDDING_DIMENSIONS,
        sparse_embedder=SPARSE_EMBEDDER,
    )

# Existing deployment: Validate match
if row.dense_embedder != DENSE_EMBEDDER:
    raise EmbeddingConfigError(
        "Changing embedding model makes all synced data unsearchable"
    )

Embedding Implementation Details

OpenAI Embedder

class OpenAIDenseEmbedder:
    _MAX_TOKENS_PER_TEXT: int = 8192
    _MAX_TEXTS_PER_SUB_BATCH: int = 100
    _MAX_TOKENS_PER_REQUEST: int = 300_000
    _MAX_CONCURRENT_REQUESTS: int = 10
    
    async def embed_many(self, texts: list[str]) -> list[DenseEmbedding]:
        # Validate inputs and count tokens per text
        token_counts = self._validate_inputs(texts)
        
        # Split into sub-batches of at most _MAX_TEXTS_PER_SUB_BATCH texts
        n = self._MAX_TEXTS_PER_SUB_BATCH
        sub_batches = [
            (texts[i:i + n], token_counts[i:i + n])
            for i in range(0, len(texts), n)
        ]
        
        # Embed sub-batches concurrently, then flatten results in order
        tasks = [self._embed_sub_batch(batch, counts) 
                 for batch, counts in sub_batches]
        nested_results = await asyncio.gather(*tasks)
        
        return [emb for batch in nested_results for emb in batch]

Error Handling

All embedders translate provider-specific errors to common exceptions:
except openai.AuthenticationError as e:
    raise EmbedderAuthError(
        f"OpenAI authentication failed: {e}",
        provider="openai",
    ) from e
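At the call site, these common exceptions let you react to each failure mode uniformly, whatever the provider. A sketch (the exception names appear on this page; the stub class definitions and `embed_with_reporting` helper are hypothetical stand-ins for illustration):

```python
# Handling common embedder exceptions at the call site.
# The stub exception classes below are stand-ins for illustration;
# the real classes live in Airweave's embedders domain.

import logging

logger = logging.getLogger("embedders")

class EmbedderAuthError(Exception): ...
class EmbedderRateLimitError(Exception): ...
class EmbedderConnectionError(Exception): ...

async def embed_with_reporting(embedder, texts: list[str]):
    """Call embed_many, logging a clear action for each failure mode."""
    try:
        return await embedder.embed_many(texts)
    except EmbedderAuthError:
        raise                      # misconfiguration: fail fast, fix the API key
    except EmbedderRateLimitError as e:
        logger.warning("rate limited, caller should back off: %s", e)
        raise
    except EmbedderConnectionError as e:
        logger.error("embedding backend unreachable: %s", e)
        raise
```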

Adding Custom Embedders

To add a new embedding model:
1. Implement DenseEmbedderProtocol

Create a new embedder class in backend/airweave/domains/embedders/dense/:
custom_embedder.py
from airweave.domains.embedders.protocols import DenseEmbedderProtocol
from airweave.domains.embedders.types import DenseEmbedding

class CustomDenseEmbedder(DenseEmbedderProtocol):
    def __init__(self, *, api_key: str, model: str, dimensions: int):
        self._model = model
        self._dimensions = dimensions
        # Initialize client
    
    @property
    def model_name(self) -> str:
        return self._model
    
    @property
    def dimensions(self) -> int:
        return self._dimensions
    
    async def embed(self, text: str) -> DenseEmbedding:
        # Implement single-text embedding
        pass
    
    async def embed_many(self, texts: list[str]) -> list[DenseEmbedding]:
        # Implement batch embedding with validation
        pass
    
    async def close(self) -> None:
        # Release resources
        pass
2. Register in registry_data.py

Add spec to DENSE_EMBEDDERS list:
registry_data.py
from airweave.domains.embedders.dense.custom import CustomDenseEmbedder

DENSE_EMBEDDERS.append(
    DenseEmbedderSpec(
        short_name="custom_embedder",
        name="Custom Embedder",
        description="Custom embedding model",
        provider="custom",
        api_model_name="custom-model-v1",
        max_dimensions=768,
        max_tokens=4096,
        supports_matryoshka=False,
        embedder_class=CustomDenseEmbedder,
        required_setting="CUSTOM_API_KEY",
    )
)
3. Configure and use

.env
DENSE_EMBEDDER=custom_embedder
EMBEDDING_DIMENSIONS=768
CUSTOM_API_KEY=...

Performance Optimization

Concurrency Limits

Each embedder has tuned concurrency limits:
| Embedder           | Max Concurrent | Batch Size | Tokens/Request |
|--------------------|----------------|------------|----------------|
| OpenAI Small/Large | 10             | 100 texts  | 300,000        |
| Mistral            | 5              | 128 texts  | 8,000          |
| Local              | 10             | 64 texts   | N/A            |
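A per-embedder concurrency cap is typically enforced with a semaphore around the in-flight requests. A self-contained sketch (illustrative; the `bounded_gather` helper and fake embedder are assumptions, not Airweave internals):

```python
# Bounding concurrent embedding requests with asyncio.Semaphore:
# at most MAX_CONCURRENT_REQUESTS batches are in flight at once.
# Illustrative sketch, not Airweave's internal implementation.

import asyncio

MAX_CONCURRENT_REQUESTS = 10

async def bounded_gather(batches, embed_batch):
    """Run embed_batch over all batches with limited concurrency."""
    sem = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)

    async def run(batch):
        async with sem:                  # blocks when the limit is reached
            return await embed_batch(batch)

    return await asyncio.gather(*(run(b) for b in batches))

# Usage with a dummy "embedder" that returns text lengths:
async def fake_embed(batch):
    await asyncio.sleep(0)
    return [len(t) for t in batch]

results = asyncio.run(bounded_gather([["hi"], ["hello", "hey"]], fake_embed))
print(results)  # -> [[2], [5, 3]]
```

`asyncio.gather` preserves input order, so results line up with their source batches even though requests complete concurrently.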

Batching Strategy

Airweave automatically batches embedding requests during sync jobs to maximize throughput while respecting API limits.
# Sync processor batches chunks before embedding
async def _embed_chunks(self, chunks: list[Chunk]):
    # Extract text from all chunks
    texts = [chunk.text for chunk in chunks]
    
    # Single batched call to embedder
    dense_embeddings = await self.dense_embedder.embed_many(texts)
    sparse_embeddings = await self.sparse_embedder.embed_many(texts)
    
    # Pair results with chunks
    for chunk, dense, sparse in zip(chunks, dense_embeddings, sparse_embeddings):
        chunk.dense_embedding = dense.vector
        chunk.sparse_embedding = sparse.indices_and_values

Troubleshooting

Symptom:
EmbeddingConfigError: Required environment variable 'DENSE_EMBEDDER' is not set.
Solution: Add all three variables to .env:
DENSE_EMBEDDER=openai_text_embedding_3_small
EMBEDDING_DIMENSIONS=1536
SPARSE_EMBEDDER=fastembed_bm25
Symptom:
EmbeddingConfigError: EMBEDDING_DIMENSIONS=1024 exceeds max_dimensions=384 
for dense embedder 'local_minilm'.
Solution: Use correct dimensions for your model:
  • OpenAI small: up to 1536
  • OpenAI large: up to 3072
  • Mistral: exactly 1024
  • Local: exactly 384
Symptom:
EmbeddingConfigError: Embedding config mismatch: embedding_dimensions: 
code=1024, db=1536. Changing embedding model or dimensions makes all 
synced data unsearchable.
Solution: You have two options:
  1. Revert: Change .env back to original dimensions (1536)
  2. Re-index: Delete all data and re-sync:
    docker compose down --volumes
    ./start.sh
    # Re-create collections and sync
    
Symptom:
EmbedderConnectionError: Local embedding connection failed: 
[Errno 111] Connection refused
Solution:
  1. Ensure local embeddings container is running:
    docker ps | grep text2vec
    
  2. Check health:
    curl http://localhost:9878/health
    
  3. Restart if needed:
    docker compose restart text2vec-transformers
    
Symptom:
EmbedderRateLimitError: OpenAI rate limit exceeded, retry after 30.0s
Solutions:
  • Reduce concurrency: Lower _MAX_CONCURRENT_REQUESTS in embedder
  • Upgrade tier: Increase OpenAI rate limits
  • Switch models: Use Mistral or local embeddings
  • Wait: Airweave automatically retries with backoff
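The automatic retry mentioned above follows the standard exponential-backoff-with-jitter pattern; a minimal sketch (illustrative only, not Airweave's actual retry code):

```python
# Exponential backoff with jitter: wait base_delay * 2**attempt (plus a
# small random offset) between retries, re-raising after the last attempt.
# Illustrative sketch, not Airweave's actual retry implementation.

import asyncio
import random

async def retry_with_backoff(fn, max_attempts: int = 5, base_delay: float = 1.0):
    """Retry an async callable, doubling the delay after each failure."""
    for attempt in range(max_attempts):
        try:
            return await fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise                    # out of attempts: surface the error
            delay = base_delay * 2 ** attempt + random.uniform(0, 0.1)
            await asyncio.sleep(delay)
```

The jitter term spreads out retries from concurrent workers so they do not all hit the rate limit again at the same instant.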

Next Steps

  • Chunking: Configure document chunking before embedding
  • Search API: Use embeddings in hybrid search queries
  • Configuration: See all embedding environment variables
  • Rate Limits: Configure source API rate limiting