Airweave provides AI-powered search capabilities that combine semantic understanding with keyword precision. Search across all your connected data sources through a unified interface with full control over the search pipeline.
Overview
When you query a collection, Airweave runs a multi-step search pipeline:
Query expansion: Generate variations to capture synonyms and related terms
Retrieval: Use keyword, neural, or hybrid methods to fetch candidates
Filtering: Apply structured metadata filters
Reranking: AI-powered reordering for higher precision
Answer generation: Return raw documents or synthesize a natural language response
All search parameters have sensible defaults. You can start with a simple query and add complexity as needed.
Quick Start
The simplest search requires only a query string:
from airweave import AirweaveSDK

client = AirweaveSDK(api_key="YOUR_API_KEY")

# Basic search
results = client.collections.search(
    readable_id="customer-support-x7k9m",
    query="How do I reset my password?"
)

# Access results
for result in results.results:
    print(f"Score: {result['score']:.3f}")
    print(f"Source: {result['source_name']}")
    print(f"Content: {result['md_content'][:200]}...")
    print(f"URL: {result.get('url', 'N/A')}")
    print("---")

# Access the AI-generated answer (present when generate_answer is True)
if results.completion:
    print(f"Answer: {results.completion}")
Search Strategies
Choose how Airweave searches your data with the retrieval_strategy parameter.
Hybrid (Default)
Combines semantic understanding with keyword matching for best results.
results = client.collections.search(
    readable_id="my-collection",
    query="authentication security vulnerabilities",
    retrieval_strategy="hybrid"  # default
)
Use when: You want the best of both worlds: results matched by meaning and by exact keywords.
Neural
Pure semantic search using embeddings. Understands meaning, not just keywords.
results = client.collections.search(
    readable_id="my-collection",
    query="How to secure user login?",
    retrieval_strategy="neural"
)
Use when: The query is conversational or when you want conceptually similar results even if exact terms don’t match.
Keyword
Traditional BM25 keyword search. Fast and precise for exact term matching.
results = client.collections.search(
    readable_id="my-collection",
    query="JWT token expiration",
    retrieval_strategy="keyword"
)
Use when: You need exact keyword matches or technical terms that shouldn’t be interpreted semantically.
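To build intuition for what a hybrid strategy does, here is a minimal sketch of reciprocal rank fusion (RRF), one common way to merge a keyword ranking with a neural ranking. This page does not say how Airweave combines the two result lists, so RRF itself, the function name `rrf_fuse`, the document IDs, and the constant `k=60` are all assumptions for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings: list, k: int = 60) -> list:
    """Merge ranked ID lists with reciprocal rank fusion (illustrative only)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it contains
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so the order is deterministic
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical rankings from a keyword retriever and a neural retriever
keyword_hits = ["doc_jwt", "doc_oauth", "doc_sso"]
neural_hits = ["doc_login", "doc_jwt", "doc_oauth"]

fused = rrf_fuse([keyword_hits, neural_hits])
print(fused)
```

Documents that appear in both lists accumulate score from each, so they tend to rise above documents found by only one retriever.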
Filtering Results
Apply structured metadata filters to narrow your search. Filters use a boolean logic structure with must (AND), should (OR), and must_not (NOT) conditions.
Filter by Source
# Filter to a specific source
results = client.collections.search(
    readable_id="my-collection",
    query="deployment issues",
    filter={
        "must": [{
            "key": "source_name",
            "match": {"value": "GitHub"}  # Case-sensitive!
        }]
    }
)

# Multiple sources (OR)
results = client.collections.search(
    readable_id="my-collection",
    query="customer feedback",
    filter={
        "must": [{
            "key": "source_name",
            "match": {"any": ["Zendesk", "Intercom", "Slack"]}
        }]
    }
)
Date Range Filters
from datetime import datetime, timezone, timedelta

# Last 7 days
results = client.collections.search(
    readable_id="my-collection",
    query="bug reports",
    filter={
        "must": [{
            "key": "created_at",
            "range": {
                "gte": (datetime.now(timezone.utc) - timedelta(days=7)).isoformat()
            }
        }]
    }
)

# Specific date range
results = client.collections.search(
    readable_id="my-collection",
    query="Q1 analytics",
    filter={
        "must": [{
            "key": "updated_at",
            "range": {
                "gte": "2024-01-01T00:00:00Z",
                "lt": "2024-04-01T00:00:00Z"
            }
        }]
    }
)
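If you build relative date filters often, a small helper keeps the timestamp arithmetic in one place. The sketch below is not part of the SDK: `last_n_days_filter` and its `now` parameter are hypothetical names, and it only produces the same `must`/`range` dictionary shape shown in the examples above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def last_n_days_filter(days: int, key: str = "created_at",
                       now: Optional[datetime] = None) -> dict:
    """Build a `must` range filter covering the last `days` days (hypothetical helper)."""
    if days <= 0:
        raise ValueError("days must be positive")
    # Accept an injected `now` so the result is reproducible in tests
    now = now or datetime.now(timezone.utc)
    start = now - timedelta(days=days)
    return {"must": [{"key": key, "range": {"gte": start.isoformat()}}]}


# A fixed timestamp makes the output deterministic for this demo
fixed_now = datetime(2024, 3, 15, 12, 0, tzinfo=timezone.utc)
week_filter = last_n_days_filter(7, now=fixed_now)
print(week_filter)
```

The returned dictionary can be passed as the `filter` argument in the same way as the hand-written filters above.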
Exclude Results
# Exclude resolved items
results = client.collections.search(
    readable_id="my-collection",
    query="open tickets",
    filter={
        "must_not": [{
            "key": "status",
            "match": {"any": ["resolved", "closed", "done"]}
        }]
    }
)
Complex Filters
Combine multiple conditions:
results = client.collections.search(
    readable_id="my-collection",
    query="critical bugs",
    filter={
        "must": [
            # Only from GitHub
            {
                "key": "source_name",
                "match": {"value": "GitHub"}
            },
            # From last 30 days
            {
                "key": "created_at",
                "range": {
                    "gte": (datetime.now(timezone.utc) - timedelta(days=30)).isoformat()
                }
            }
        ],
        # NOT resolved
        "must_not": [{
            "key": "status",
            "match": {"value": "resolved"}
        }]
    }
)
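The boolean semantics of `must`, `should`, and `must_not` can be sketched as a small local evaluator that checks a filter against one record's metadata. This is only a mental model: filtering actually happens on the server, edge cases there may differ, and the names `condition_matches` and `passes_filter` as well as the `priority` field are made up for this example. It also shows a `should` clause, which the examples above do not use.

```python
import operator

# Range operators used in the examples above, mapped to Python comparisons
_RANGE_OPS = {"gt": operator.gt, "gte": operator.ge,
              "lt": operator.lt, "lte": operator.le}


def condition_matches(cond: dict, metadata: dict) -> bool:
    """Check one {"key", "match" | "range"} condition against a metadata dict."""
    if cond["key"] not in metadata:
        return False
    value = metadata[cond["key"]]
    if "match" in cond:
        match = cond["match"]
        if "value" in match:
            return value == match["value"]  # exact and case-sensitive
        return value in match.get("any", [])
    if "range" in cond:
        # Assumption: values compare directly (e.g. ISO timestamps in one format)
        return all(_RANGE_OPS[op](value, bound)
                   for op, bound in cond["range"].items())
    return False


def passes_filter(flt: dict, metadata: dict) -> bool:
    """must = all match, should = at least one matches, must_not = none match."""
    all_must = all(condition_matches(c, metadata) for c in flt.get("must", []))
    should = flt.get("should", [])
    any_should = not should or any(condition_matches(c, metadata) for c in should)
    none_excluded = not any(condition_matches(c, metadata)
                            for c in flt.get("must_not", []))
    return all_must and any_should and none_excluded


flt = {
    "must": [{"key": "source_name", "match": {"value": "GitHub"}}],
    "should": [{"key": "priority", "match": {"any": ["critical", "high"]}}],
    "must_not": [{"key": "status", "match": {"value": "resolved"}}],
}
open_bug = {"source_name": "GitHub", "priority": "critical", "status": "open"}
resolved_bug = {"source_name": "GitHub", "priority": "critical", "status": "resolved"}
print(passes_filter(flt, open_bug), passes_filter(flt, resolved_bug))
```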
AI Features
Query Expansion
Generate query variations to improve recall. Enabled by default.
# With expansion (default)
results = client.collections.search(
    readable_id="my-collection",
    query="customer churn analysis",
    expand_query=True  # default
)

# Without expansion (faster, exact query only)
results = client.collections.search(
    readable_id="my-collection",
    query="customer churn analysis",
    expand_query=False
)
Reranking
LLM-based reordering for improved relevance. Enabled by default.
# With reranking (default, more accurate)
results = client.collections.search(
    readable_id="my-collection",
    query="authentication methods",
    rerank=True  # default
)

# Without reranking (faster)
results = client.collections.search(
    readable_id="my-collection",
    query="authentication methods",
    rerank=False
)
Note: Reranking adds about 10 seconds to each search. Disable it for interactive applications that need fast results.
Answer Generation
Generate AI-synthesized answers from search results. Enabled by default.
# Generate answer (default)
results = client.collections.search(
    readable_id="my-collection",
    query="What are our refund policies?",
    generate_answer=True  # default
)
print(f"Answer: {results.completion}")
# Answer: According to the customer support documentation,
# refunds are processed within 5-7 business days...

# Raw results only (faster)
results = client.collections.search(
    readable_id="my-collection",
    query="refund policies",
    generate_answer=False
)
for result in results.results:
    print(result['md_content'])
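Query expansion, reranking, and answer generation each trade speed for quality, so it can help to switch them together. The sketch below groups the three flags into two presets; `search_options` is a hypothetical helper, not an SDK function, and the choice of which flags belong in a "fast" preset is an assumption you should adapt to your own latency budget.

```python
def search_options(interactive: bool) -> dict:
    """Return keyword arguments for a search call (hypothetical helper)."""
    if interactive:
        # Skip the LLM-backed steps to return raw results as fast as possible
        return {"expand_query": False, "rerank": False, "generate_answer": False}
    # Same values as the documented defaults, spelled out for clarity
    return {"expand_query": True, "rerank": True, "generate_answer": True}


fast = search_options(interactive=True)
thorough = search_options(interactive=False)
print(fast)
print(thorough)
```

The result can be unpacked into a search call, for example `client.collections.search(readable_id="my-collection", query="refund policies", **fast)`.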
Filter Interpretation (Beta)
Beta feature: Filter interpretation can occasionally filter too narrowly. Verify result counts.
Automatically extract structured filters from natural language queries.
# AI interprets "last week" and "Asana" as filters
results = client.collections.search(
    readable_id="my-collection",
    query="open Asana tickets from last week",
    interpret_filters=True
)
# AI understands: Asana source, open status, last 7 days

# Another example
results = client.collections.search(
    readable_id="my-collection",
    query="critical bugs from GitHub this month",
    interpret_filters=True
)
# AI extracts: GitHub source, critical priority, current month date range
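One way to act on the advice to verify result counts is to run the same query with and without interpretation and compare how many results come back. The sketch below assumes a `search` callable that you supply (for instance a thin wrapper around `client.collections.search` that returns the result list); `compare_counts`, the `min_ratio` threshold, and `fake_search` are hypothetical and exist only for this illustration.

```python
from typing import Callable, Tuple


def compare_counts(search: Callable[..., list], query: str,
                   min_ratio: float = 0.1) -> Tuple[int, int, bool]:
    """Return (baseline count, interpreted count, too_narrow flag)."""
    baseline = len(search(query=query, interpret_filters=False))
    interpreted = len(search(query=query, interpret_filters=True))
    # Flag the query when interpretation keeps less than `min_ratio` of the baseline
    too_narrow = baseline > 0 and interpreted < baseline * min_ratio
    return baseline, interpreted, too_narrow


def fake_search(query: str, interpret_filters: bool) -> list:
    """Stand-in for a real search call so the sketch runs offline."""
    return ["hit"] * (2 if interpret_filters else 40)


report = compare_counts(fake_search, "open Asana tickets from last week")
print(report)
```

Keep in mind that both counts are capped by `limit`, so a comparison like this is only meaningful when the baseline returns fewer results than the limit.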
Pagination
Navigate through large result sets with limit and offset.
# First 50 results
results = client.collections.search(
    readable_id="my-collection",
    query="documentation",
    limit=50,
    offset=0
)

# Next 50 results
results = client.collections.search(
    readable_id="my-collection",
    query="documentation",
    limit=50,
    offset=50
)
limit: Maximum number of results to return (1-1000)
offset: Number of results to skip for pagination
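Paging by hand gets repetitive, so the `limit`/`offset` pattern can be wrapped in a generator. This is a minimal sketch under assumptions: `iter_all_results` is a hypothetical helper, and `search_page` is a callable you provide that takes `(limit, offset)` and returns one page of results as a list.

```python
from typing import Callable, Iterator


def iter_all_results(search_page: Callable[[int, int], list],
                     page_size: int = 50,
                     max_results: int = 1000) -> Iterator:
    """Yield results page by page until a short page or `max_results` is reached."""
    offset = 0
    while offset < max_results:
        # Never request more than the remaining budget
        limit = min(page_size, max_results - offset)
        page = search_page(limit, offset)
        yield from page
        if len(page) < page_size:
            break  # a short (or empty) page means there is nothing left to fetch
        offset += len(page)


# Offline stand-in for a real search: 120 fake results sliced by limit/offset
data = list(range(120))
all_items = list(iter_all_results(lambda limit, offset: data[offset:offset + limit]))
print(len(all_items))
```

With the SDK, `search_page` could be something like `lambda limit, offset: client.collections.search(readable_id="my-collection", query="documentation", limit=limit, offset=offset).results`.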
Streaming Search
For real-time results, use the streaming endpoint with Server-Sent Events (SSE).
import asyncio

# Assumes `client` was created as shown in Quick Start
async def stream_search():
    async for event in client.collections.search_stream(
        readable_id="my-collection",
        query="deployment procedures"
    ):
        if event["type"] == "result":
            print(f"Result: {event['data']}")
        elif event["type"] == "completion":
            print(f"Answer: {event['data']}")
        elif event["type"] == "done":
            print("Search complete")
            break

asyncio.run(stream_search())
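If you want to collect streamed events instead of printing them, the event handling can be separated from the network call. The sketch below assumes only the event shape used in the example above (a `type` of `result`, `completion`, or `done`, plus a `data` field); `collect_stream` and `fake_stream` are hypothetical names, and the fake stream stands in for the real endpoint so the code runs offline.

```python
import asyncio
from typing import AsyncIterator, Optional, Tuple


async def collect_stream(events: AsyncIterator[dict]) -> Tuple[list, Optional[str]]:
    """Gather streamed results and the final answer until a "done" event arrives."""
    results, answer = [], None
    async for event in events:
        kind = event.get("type")
        if kind == "result":
            results.append(event["data"])
        elif kind == "completion":
            answer = event["data"]
        elif kind == "done":
            break  # stop reading; anything after "done" is ignored
    return results, answer


async def fake_stream():
    """Offline stand-in for the streaming endpoint."""
    yield {"type": "result", "data": {"score": 0.9}}
    yield {"type": "completion", "data": "Deploy with the release script."}
    yield {"type": "done"}
    yield {"type": "result", "data": {"score": 0.1}}  # never consumed


stream_results, stream_answer = asyncio.run(collect_stream(fake_stream()))
print(stream_results, stream_answer)
```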
Complete Example
Here’s a comprehensive search that combines a retrieval strategy, filters, AI features, and pagination:
from airweave import AirweaveSDK
from datetime import datetime, timezone, timedelta

client = AirweaveSDK(api_key="YOUR_API_KEY")

results = client.collections.search(
    readable_id="customer-support-x7k9m",
    query="customer feedback about pricing",
    # Search strategy
    retrieval_strategy="hybrid",
    # Filters
    filter={
        "must": [
            {"key": "source_name", "match": {"any": ["Zendesk", "Slack"]}},
            {"key": "created_at", "range": {
                "gte": (datetime.now(timezone.utc) - timedelta(days=30)).isoformat()
            }}
        ],
        "must_not": [
            {"key": "status", "match": {"value": "resolved"}}
        ]
    },
    # AI features
    expand_query=True,
    rerank=True,
    generate_answer=True,
    # Pagination
    limit=50,
    offset=0
)

# Process results
print(f"Found {len(results.results)} results\n")
if results.completion:
    print(f"AI Answer: {results.completion}\n")
for i, result in enumerate(results.results[:5], 1):
    print(f"Result {i}:")
    print(f"  Score: {result['score']:.3f}")
    print(f"  Source: {result['source_name']}")
    print(f"  Content: {result['md_content'][:150]}...")
    print(f"  URL: {result.get('url', 'N/A')}")
    print()
Search Parameters Reference
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| query | string | required | Search query text (max 2048 tokens) |
| retrieval_strategy | string | "hybrid" | "hybrid", "neural", or "keyword" |
| filter | object | null | Structured metadata filters |
| expand_query | boolean | true | Generate query variations |
| interpret_filters | boolean | false | Extract filters from natural language |
| rerank | boolean | true | LLM-based reranking |
| generate_answer | boolean | true | Generate AI answer |
| limit | integer | 1000 | Max results (1-1000) |
| offset | integer | 0 | Skip results for pagination |
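The constraints in the reference above can be checked on the client before a request is sent. The following is a sketch of such a check, not SDK behavior: `validate_search_params` is a hypothetical helper, it covers only the strategy names, the 1-1000 `limit` range, and a non-negative `offset`, and it does not attempt to count query tokens.

```python
VALID_STRATEGIES = {"hybrid", "neural", "keyword"}


def validate_search_params(query: str, retrieval_strategy: str = "hybrid",
                           limit: int = 1000, offset: int = 0) -> dict:
    """Validate a few documented constraints and return the cleaned parameters."""
    if not query or not query.strip():
        raise ValueError("query is required")
    if retrieval_strategy not in VALID_STRATEGIES:
        raise ValueError(f"retrieval_strategy must be one of {sorted(VALID_STRATEGIES)}")
    if not 1 <= limit <= 1000:
        raise ValueError("limit must be between 1 and 1000")
    if offset < 0:
        raise ValueError("offset must be zero or greater")
    return {"query": query, "retrieval_strategy": retrieval_strategy,
            "limit": limit, "offset": offset}


params = validate_search_params("refund policy", limit=50)
print(params)
```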
Response Structure
{
  "results": [
    {
      "entity_id": "abc123-def456-789012",
      "source_name": "GitHub",
      "md_content": "# Password Reset Guide\n\nTo reset your password...",
      "metadata": {
        "file_path": "docs/auth/password-reset.md",
        "last_modified": "2024-03-15T09:30:00Z"
      },
      "score": 0.92,
      "breadcrumbs": ["docs", "auth", "password-reset.md"],
      "url": "https://github.com/company/docs/blob/main/docs/auth/password-reset.md"
    }
  ],
  "completion": "To reset your password, navigate to the login page..."
}
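When working with the raw JSON response (for example from a direct HTTP call), it can be convenient to load it into typed objects. The sketch below mirrors the fields shown above; the `SearchHit` dataclass and `parse_response` function are hypothetical, and treating `url`, `breadcrumbs`, and `metadata` as optional is an assumption rather than a documented guarantee.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class SearchHit:
    entity_id: str
    source_name: str
    md_content: str
    score: float
    url: Optional[str] = None
    breadcrumbs: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


def parse_response(raw: str) -> Tuple[List[SearchHit], Optional[str]]:
    """Turn a JSON response body into SearchHit objects plus the optional answer."""
    payload = json.loads(raw)
    hits = [
        SearchHit(
            entity_id=item["entity_id"],
            source_name=item["source_name"],
            md_content=item["md_content"],
            score=item["score"],
            url=item.get("url"),
            breadcrumbs=item.get("breadcrumbs", []),
            metadata=item.get("metadata", {}),
        )
        for item in payload.get("results", [])
    ]
    return hits, payload.get("completion")


# Sample body shaped like the response structure above
raw = json.dumps({
    "results": [{
        "entity_id": "abc123-def456-789012",
        "source_name": "GitHub",
        "md_content": "# Password Reset Guide",
        "score": 0.92,
        "breadcrumbs": ["docs", "auth", "password-reset.md"],
    }],
    "completion": "To reset your password, navigate to the login page...",
})
hits, completion = parse_response(raw)
print(hits[0].source_name, hits[0].score, completion)
```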
Next Steps
Collections: Learn about organizing data with collections
Webhooks: Get notified when syncs complete