FastEmbed
A local ONNX-based embedding pipeline using Qdrant's FastEmbed library with real BAAI/bge-small-en-v1.5 model inference. Demonstrates CPU-first local embedding generation with zero-cost attribution, real cosine similarity search, and adaptive top-k selection based on score distribution.
Environment variables
This example requires OPENAI_API_KEY (for LLM synthesis only), WAXELL_API_KEY, and WAXELL_API_URL. Use --dry-run to run without any API keys. Embedding generation is always local and free.
Architecture
The pipeline is a single agent that runs entirely on the local CPU except for the final synthesis step: FastEmbed loads the BAAI/bge-small-en-v1.5 ONNX model, embeds the document corpus and the query, retrieves passages by cosine similarity, adaptively chooses top-k from the score distribution, evaluates coverage, and (in live mode) calls OpenAI to synthesize an answer.
Key Code
ONNX-based local embeddings with @tool(embedding)
FastEmbed runs inference through ONNX Runtime for CPU-optimized performance; each embedding call is recorded as an embedding tool call.
import numpy as np
import waxell

@waxell.tool(tool_type="embedding", name="init_fastembed_model")
def init_fastembed_model(model_name: str = "BAAI/bge-small-en-v1.5") -> dict:
    # Lazy import: fastembed loads only when the tool actually runs
    from fastembed import TextEmbedding
    model = TextEmbedding(model_name=model_name)
    return {"model": model_name, "backend": "onnx", "_model_obj": model}

@waxell.tool(tool_type="embedding", name="embed_passages")
def embed_passages(model, texts: list[str]) -> dict:
    # passage_embed applies BGE's passage-side encoding (asymmetric to query_embed)
    embeddings = list(model.passage_embed(texts))
    embeddings_array = np.array(embeddings)
    return {"count": len(texts), "dimensions": embeddings_array.shape[1],
            "_embeddings": embeddings_array}

@waxell.tool(tool_type="embedding", name="embed_query")
def embed_query(model, query: str) -> dict:
    embeddings = list(model.query_embed(query))
    return {"dimensions": len(np.array(embeddings[0])), "_embedding": np.array(embeddings[0])}
Adaptive top-k decision and coverage evaluation
Score distribution drives how many results to return.
@waxell.decision(name="choose_top_k", options=["narrow_3", "broad_5"])
def choose_top_k(scores: list[float]) -> dict:
    score_spread = max(scores) - min(scores)
    if score_spread > 0.15:
        return {"chosen": "narrow_3", "reasoning": f"Large spread ({score_spread:.3f})"}
    return {"chosen": "broad_5", "reasoning": f"Tight cluster ({score_spread:.3f})"}

@waxell.reasoning_dec(step="evaluate_coverage")
def evaluate_coverage(results: list[dict], total_docs: int) -> dict:
    coverage_ratio = len(results) / total_docs if total_docs else 0
    avg = sum(r["score"] for r in results) / len(results) if results else 0
    return {
        "thought": f"Checking coverage: {len(results)}/{total_docs} docs",
        "evidence": [f"Coverage ratio: {coverage_ratio:.0%}", f"Average score: {avg:.3f}"],
        "conclusion": "Good coverage" if coverage_ratio >= 0.4 and avg > 0.5 else "Limited",
    }
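For intuition, the 0.15 spread threshold plays out like this on two invented score lists (assuming the decorators pass through each function's return value):

spread_out = [0.82, 0.71, 0.64, 0.58, 0.55]  # spread 0.27 > 0.15
clustered = [0.62, 0.60, 0.58, 0.56, 0.55]   # spread 0.07 <= 0.15

print(choose_top_k(spread_out)["chosen"])  # narrow_3: focus on the clear winners
print(choose_top_k(clustered)["chosen"])   # broad_5: no standout, return more context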
What this demonstrates
- @waxell.observe -- single-agent pipeline with full decorator coverage
- @waxell.tool -- ONNX model init, passage embedding, and query embedding with tool_type="embedding"
- @waxell.retrieval -- real cosine similarity search with source="fastembed"
- @waxell.decision -- adaptive top-k selection based on score distribution
- @waxell.reasoning_dec -- coverage evaluation across retrieved documents
- waxell.score() -- average cosine similarity and coverage ratio scores
- Zero-cost ONNX inference -- CPU-optimized embedding with no API costs
- FastEmbed-specific -- passage_embed vs query_embed asymmetric encoding
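Putting the pieces together, a hedged end-to-end sketch: the @waxell.observe and waxell.score() signatures below are assumptions, and search_corpus is the illustrative retrieval function sketched earlier, so consult the demo source for the real wiring.

@waxell.observe(name="fastembed_pipeline")  # signature assumed
def run_pipeline(query: str, docs: list[str]) -> dict:
    model = init_fastembed_model()["_model_obj"]
    passage_embs = embed_passages(model, docs)["_embeddings"]
    query_emb = embed_query(model, query)["_embedding"]

    hits = search_corpus(query_emb, passage_embs, docs)["results"]
    scores = [h["score"] for h in hits]
    k = 3 if choose_top_k(scores)["chosen"] == "narrow_3" else 5
    results = hits[:k]

    evaluate_coverage(results, total_docs=len(docs))
    # waxell.score() call shape is an assumption
    waxell.score(name="avg_cosine_similarity",
                 value=sum(h["score"] for h in results) / max(len(results), 1))
    return {"retrieved": results}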
Run it
# Dry-run (no API keys needed)
cd dev/waxell-dev
python -m app.demos.fastembed_agent --dry-run
# Live (real OpenAI for synthesis only)
export OPENAI_API_KEY="sk-..."
python -m app.demos.fastembed_agent