If you’ve been building anything with embeddings (RAG, semantic search, recommendations), you know the dirty secret: vector search gets expensive and slow at scale.
So when turbopuffer published “ANN v3: 200ms p99 query latency over 100 billion vectors”, people paid attention.
They’re claiming:
- Up to 100B vectors in a single search index
- ~200ms p99 latency targets
- High throughput (they mention >1k QPS as the kind of scale they’re designing for)
- And they do it by treating vector search as a bandwidth problem, not just a “fast CPU” problem
Source (original deep dive): https://turbopuffer.com/blog/ann-v3
First: what does “100 billion vectors” even mean?
A vector is just a list of numbers that represents something (text, image, user profile, product, etc.). If you embed a document into a 1024-dimensional vector, you get 1024 numbers.
Now scale that:
- 100,000,000,000 vectors
- 1024 dimensions per vector
- 2 bytes per dimension (fp16)
That’s roughly 200 TB of dense vector data.
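Quick back-of-the-envelope in Python (assuming fp16 storage and ignoring index overhead or replication):

```python
# Raw storage for 100B x 1024-dim fp16 vectors.
num_vectors = 100_000_000_000
dims = 1024
bytes_per_dim = 2  # fp16

total_bytes = num_vectors * dims * bytes_per_dim
print(f"{total_bytes / 1e12:.1f} TB")  # -> 204.8 TB
```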
At that point, your challenge is not “how do I compute a dot product quickly?” It’s “how do I move enough data through the system fast enough?”
The core insight: vector search is usually memory-bandwidth bound
A lot of people assume vector search is compute-heavy.
But turbopuffer makes a key point: the “core kernel” of vector search is essentially a dot product (or a similar distance metric), and each dimension of a candidate vector is touched basically once per comparison.
That means the workload has low arithmetic intensity:
- you do a small amount of math per byte fetched
- so performance is dominated by how fast you can fetch vectors (from SSD, RAM, cache), not how clever your CPU instructions are
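Here’s a rough way to see it (a sketch, not turbopuffer’s numbers): a dot product does about 2 floating-point operations per dimension, while the candidate vector costs 2 bytes per dimension to fetch in fp16. That’s roughly 1 FLOP per byte, far below what a modern CPU can execute per byte of memory bandwidth.

```python
# Rough arithmetic-intensity estimate for one query/candidate dot product.
# The query stays in cache, so the bytes that matter are the candidate's.
dims = 1024
flops_per_dot = 2 * dims     # one multiply + one add per dimension
bytes_fetched = dims * 2     # fp16 candidate vector streamed from memory

print(f"{flops_per_dot / bytes_fetched:.1f} FLOPs per byte")  # -> 1.0
```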
How they approach it: “approximation + refinement”
To make “100B vectors” feasible, you must avoid reading most vectors for most queries.
The general strategy is:
- Quickly narrow down where the answer likely is (cheap approximation)
- Then do more exact scoring on a small candidate set (refinement)
1) Hierarchical clustering (to narrow the search space)
Instead of comparing your query vector against everything, you compare it against cluster centroids first, find the most promising clusters, then search within those clusters.
Why it matters:
- Reduces the “search space” dramatically
- Improves locality (vectors near each other stored together)
- Helps with cold-start / object storage access patterns (fewer round-trips)
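A minimal sketch of that two-level idea in numpy (an IVF-style toy, not turbopuffer’s actual index; `cluster_vectors` and `cluster_ids` are assumed to be per-cluster arrays you’ve built offline):

```python
import numpy as np

def cluster_search(query, centroids, cluster_vectors, cluster_ids, n_probe=4, k=10):
    """Toy two-level search: score centroids first, then scan only the probed clusters."""
    # Cheap pass: compare the query against every cluster centroid.
    probe = np.argsort(-(centroids @ query))[:n_probe]

    # Expensive pass, but only over vectors inside the n_probe closest clusters.
    all_ids, all_scores = [], []
    for c in probe:
        vecs = cluster_vectors[c]          # (n_c, dims), stored contiguously
        all_scores.append(vecs @ query)    # exact dot products on a small subset
        all_ids.append(cluster_ids[c])

    scores = np.concatenate(all_scores)
    ids = np.concatenate(all_ids)
    top = np.argsort(-scores)[:k]
    return ids[top], scores[top]
```

The knob that matters is `n_probe`: probing more clusters improves recall but reads more bytes.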
2) Quantization (to reduce bandwidth)
Even if you only search a subset, reading full-precision vectors is expensive. Quantization compresses vectors so you can do a “cheap pass” over compressed data first, then refine the best candidates against the full-precision vectors.
At scale, compression isn’t optional: you trade tiny accuracy loss for huge throughput wins.
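As a sketch of the pattern (simple per-vector int8 scalar quantization here; turbopuffer’s actual scheme may differ), the cheap pass scores compressed codes and only a small shortlist ever touches full precision:

```python
import numpy as np

def quantize_int8(vectors):
    """Per-vector symmetric scalar quantization: fp32 -> int8 codes plus a scale."""
    scale = np.abs(vectors).max(axis=1, keepdims=True) / 127.0
    scale[scale == 0] = 1.0                 # guard against all-zero vectors
    codes = np.round(vectors / scale).astype(np.int8)
    return codes, scale

def search_with_refinement(query, vectors_fp32, codes, scale, k=10, shortlist=100):
    # Cheap pass: approximate scores from int8 codes (4x fewer bytes than fp32).
    approx = (codes.astype(np.float32) * scale) @ query
    candidates = np.argsort(-approx)[:shortlist]
    # Refinement: exact fp32 scores, but only for the shortlist.
    exact = vectors_fp32[candidates] @ query
    return candidates[np.argsort(-exact)[:k]]
```

In a real system the full-precision copies would sit on slower (cheaper) storage, and only the shortlist ever gets fetched — that’s the bandwidth win.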
Why founders and builders should care (practical takeaways)
Takeaway 1: RAG works at small scale — the real game is retrieval efficiency
Most teams prototype RAG with a small corpus and naive settings. Then reality hits: more data, more users, multi-tenancy, and latency SLOs.
If your product depends on retrieval, retrieval is your product. You need to think in terms of caching hierarchy, bandwidth budgets, and tail latency (p95/p99).
Takeaway 2: your bottleneck is often NOT the model
In many production AI apps, data plumbing becomes the bottleneck: retrieval latency, DB costs, network egress, cache miss penalties.
Takeaway 3: benchmark by p99, not “average”
If your app is interactive, p99 is what users feel. That’s why the 200ms p99 target is a big deal.
What to try next (actionable)
- Measure retrieval latency end-to-end (include network + DB + rerank)
- Set explicit targets: p50 / p95 / p99
- Test approximate-search settings (candidate pool size, reranking, quantization options)
- Add caching intelligently (hot queries, hot tenants)
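A minimal harness for the first two bullets, assuming a `retrieve(query)` function that wraps your whole retrieval path (embed + search + rerank):

```python
import time
import numpy as np

def measure_latency(retrieve, queries, warmup=10):
    """Time end-to-end retrieval and report tail latencies, not the average."""
    for q in queries[:warmup]:              # warm caches and connections first
        retrieve(q)

    latencies_ms = []
    for q in queries[warmup:]:
        start = time.perf_counter()
        retrieve(q)
        latencies_ms.append((time.perf_counter() - start) * 1000)

    for pct in (50, 95, 99):
        print(f"p{pct}: {np.percentile(latencies_ms, pct):.1f} ms")
```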

