FastEmbed

A local ONNX-based embedding pipeline using Qdrant's FastEmbed library with real BAAI/bge-small-en-v1.5 model inference. Demonstrates CPU-first local embedding generation with zero-cost attribution, real cosine similarity search, and adaptive top-k selection based on score distribution.

Environment variables

This example requires OPENAI_API_KEY (for LLM synthesis only), WAXELL_API_KEY, and WAXELL_API_URL. Use --dry-run to run without any API keys. Embedding generation is always local and free.

Architecture

Key Code

ONNX-based local embeddings with @tool(embedding)

FastEmbed runs inference through ONNX Runtime, optimized for CPU; each model call is recorded as an embedding tool call.

import numpy as np

@waxell.tool(tool_type="embedding", name="init_fastembed_model")
def init_fastembed_model(model_name: str = "BAAI/bge-small-en-v1.5") -> dict:
    from fastembed import TextEmbedding
    model = TextEmbedding(model_name=model_name)
    return {"model": model_name, "backend": "onnx", "_model_obj": model}

@waxell.tool(tool_type="embedding", name="embed_passages")
def embed_passages(model, texts: list[str]) -> dict:
    embeddings_array = np.array(list(model.passage_embed(texts)))
    return {"count": len(texts), "dimensions": embeddings_array.shape[1],
            "_embeddings": embeddings_array}

@waxell.tool(tool_type="embedding", name="embed_query")
def embed_query(model, query: str) -> dict:
    embedding = np.array(list(model.query_embed(query))[0])
    return {"dimensions": len(embedding), "_embedding": embedding}

Adaptive top-k decision and coverage evaluation

Score distribution drives how many results to return.

@waxell.decision(name="choose_top_k", options=["narrow_3", "broad_5"])
def choose_top_k(scores: list[float]) -> dict:
    score_spread = max(scores) - min(scores)
    if score_spread > 0.15:
        return {"chosen": "narrow_3", "reasoning": f"Large spread ({score_spread:.3f})"}
    return {"chosen": "broad_5", "reasoning": f"Tight cluster ({score_spread:.3f})"}

@waxell.reasoning_dec(step="evaluate_coverage")
def evaluate_coverage(results: list[dict], total_docs: int) -> dict:
    coverage_ratio = len(results) / total_docs if total_docs else 0
    avg = sum(r["score"] for r in results) / len(results) if results else 0
    return {
        "thought": f"Checking coverage: {len(results)}/{total_docs} docs",
        "evidence": [f"Coverage ratio: {coverage_ratio:.0%}", f"Average score: {avg:.3f}"],
        "conclusion": "Good coverage" if coverage_ratio >= 0.4 and avg > 0.5 else "Limited",
    }
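Stripped of the decorator, the top-k branching is easy to exercise directly. A sketch with hypothetical score lists, showing which branch each distribution takes:

```python
def choose_top_k(scores: list[float]) -> dict:
    # Same branching as the decorated version: a wide spread means a few
    # documents clearly dominate, so return fewer results.
    score_spread = max(scores) - min(scores)
    if score_spread > 0.15:
        return {"chosen": "narrow_3", "reasoning": f"Large spread ({score_spread:.3f})"}
    return {"chosen": "broad_5", "reasoning": f"Tight cluster ({score_spread:.3f})"}

print(choose_top_k([0.82, 0.61, 0.55])["chosen"])  # spread 0.270 -> narrow_3
print(choose_top_k([0.62, 0.58, 0.55])["chosen"])  # spread 0.070 -> broad_5
```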

What this demonstrates

  • @waxell.observe -- single-agent pipeline with full decorator coverage
  • @waxell.tool -- ONNX model init, passage embedding, and query embedding with tool_type="embedding"
  • @waxell.retrieval -- real cosine similarity search with source="fastembed"
  • @waxell.decision -- adaptive top-k selection based on score distribution
  • @waxell.reasoning_dec -- coverage evaluation across retrieved documents
  • waxell.score() -- average cosine similarity and coverage ratio scores
  • Zero-cost ONNX inference -- CPU-optimized embedding with no API costs
  • FastEmbed-specific -- passage_embed vs query_embed asymmetric encoding

Run it

# Dry-run (no API keys needed)
cd dev/waxell-dev
python -m app.demos.fastembed_agent --dry-run

# Live (real OpenAI for synthesis only)
export OPENAI_API_KEY="sk-..."
python -m app.demos.fastembed_agent

Source

dev/waxell-dev/app/demos/fastembed_agent.py