If you’ve been building anything with embeddings (RAG, semantic search, recommendations), you know the dirty secret: vector search gets expensive and slow at scale.
So when turbopuffer published “ANN v3: 200ms p99 query latency over 100 billion vectors”, people paid attention.
They’re claiming:
- Up to 100B vectors in a single search index
- ~200ms p99 latency targets
- High throughput (they mention >1k QPS as the kind of scale they’re designing for)
- And they do it by treating vector search as a bandwidth problem, not just a “fast CPU” problem
Source (original deep dive): https://turbopuffer.com/blog/ann-v3
First: what does “100 billion vectors” even mean?
A vector is just a list of numbers that represents something (text, image, user profile, product, etc.). If you embed a document into a 1024-dimensional vector, you get 1024 numbers.
Now scale that:
- 100,000,000,000 vectors
- 1024 dimensions per vector
- 2 bytes per dimension (fp16)
That’s roughly 200 TB of dense vector data.
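Quick back-of-the-envelope in Python (assuming fp16 storage and ignoring index overhead or replication):

```python
# Raw storage for 100B x 1024-dim fp16 vectors.
num_vectors = 100_000_000_000
dims = 1024
bytes_per_dim = 2  # fp16

total_bytes = num_vectors * dims * bytes_per_dim
print(f"{total_bytes / 1e12:.1f} TB")  # -> 204.8 TB
```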
At that point, your challenge is not “how do I compute a dot product quickly?” It’s “how do I move enough data through the system fast enough?”
The core insight: vector search is usually memory-bandwidth bound
A lot of people assume vector search is compute-heavy.
But turbopuffer makes a key point: the “core kernel” of vector search is essentially a dot product (or a similar distance metric), and each dimension of a candidate vector is touched basically once per comparison.
That means the workload has low arithmetic intensity:
- you do a small amount of math per byte fetched
- so performance is dominated by how fast you can fetch vectors (from SSD, RAM, cache), not how clever your CPU instructions are
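Here’s a rough way to see it (a sketch, not turbopuffer’s numbers): a dot product does about 2 floating-point operations per dimension, while the candidate vector costs 2 bytes per dimension to fetch in fp16. That’s roughly 1 FLOP per byte, far below what a modern CPU can execute per byte of memory bandwidth.

```python
# Rough arithmetic-intensity estimate for one query/candidate dot product.
# The query stays in cache, so the bytes that matter are the candidate's.
dims = 1024
flops_per_dot = 2 * dims     # one multiply + one add per dimension
bytes_fetched = dims * 2     # fp16 candidate vector streamed from memory

print(f"{flops_per_dot / bytes_fetched:.1f} FLOPs per byte")  # -> 1.0
```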
How they approach it: “approximation + refinement”
To make “100B vectors” feasible, you must avoid reading most vectors for most queries.
The general strategy is:
- Quickly narrow down where the answer likely is (cheap approximation)
- Then do more exact scoring on a small candidate set (refinement)
1) Hierarchical clustering (to narrow the search space)
Instead of comparing your query vector against everything, you compare it against cluster centroids first, find the most promising clusters, then search within those clusters.
Why it matters:
- Reduces the “search space” dramatically
- Improves locality (vectors near each other stored together)
- Helps with cold-start / object storage access patterns (fewer round-trips)
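A minimal sketch of that two-level idea in numpy (an IVF-style toy, not turbopuffer’s actual index; `cluster_vectors` and `cluster_ids` are assumed to be per-cluster arrays you’ve built offline):

```python
import numpy as np

def cluster_search(query, centroids, cluster_vectors, cluster_ids, n_probe=4, k=10):
    """Toy two-level search: score centroids first, then scan only the probed clusters."""
    # Cheap pass: compare the query against every cluster centroid.
    probe = np.argsort(-(centroids @ query))[:n_probe]

    # Expensive pass, but only over vectors inside the n_probe closest clusters.
    all_ids, all_scores = [], []
    for c in probe:
        vecs = cluster_vectors[c]          # (n_c, dims), stored contiguously
        all_scores.append(vecs @ query)    # exact dot products on a small subset
        all_ids.append(cluster_ids[c])

    scores = np.concatenate(all_scores)
    ids = np.concatenate(all_ids)
    top = np.argsort(-scores)[:k]
    return ids[top], scores[top]
```

The knob that matters is `n_probe`: probing more clusters improves recall but reads more bytes.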
2) Quantization (to reduce bandwidth)
Even if you only search a subset, reading full-precision vectors is expensive. Quantization compresses vectors so you can do a “cheap pass” over compressed data first, then refine the best candidates against the full-precision vectors.
At scale, compression isn’t optional: you trade tiny accuracy loss for huge throughput wins.
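As a sketch of the pattern (simple per-vector int8 scalar quantization here; turbopuffer’s actual scheme may differ), the cheap pass scores compressed codes and only a small shortlist ever touches full precision:

```python
import numpy as np

def quantize_int8(vectors):
    """Per-vector symmetric scalar quantization: fp32 -> int8 codes plus a scale."""
    scale = np.abs(vectors).max(axis=1, keepdims=True) / 127.0
    scale[scale == 0] = 1.0                 # guard against all-zero vectors
    codes = np.round(vectors / scale).astype(np.int8)
    return codes, scale

def search_with_refinement(query, vectors_fp32, codes, scale, k=10, shortlist=100):
    # Cheap pass: approximate scores from int8 codes (4x fewer bytes than fp32).
    approx = (codes.astype(np.float32) * scale) @ query
    candidates = np.argsort(-approx)[:shortlist]
    # Refinement: exact fp32 scores, but only for the shortlist.
    exact = vectors_fp32[candidates] @ query
    return candidates[np.argsort(-exact)[:k]]
```

In a real system the full-precision copies would sit on slower (cheaper) storage, and only the shortlist ever gets fetched — that’s the bandwidth win.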
Why founders and builders should care (practical takeaways)
Takeaway 1: RAG works at small scale — the real game is retrieval efficiency
Most teams prototype RAG with a small corpus and naive settings. Then reality hits: more data, more users, multi-tenancy, and latency SLOs.
If your product depends on retrieval, retrieval is your product. You need to think in terms of caching hierarchy, bandwidth budgets, and tail latency (p95/p99).
Takeaway 2: your bottleneck is often NOT the model
In many production AI apps, data plumbing becomes the bottleneck: retrieval latency, DB costs, network egress, cache miss penalties.
Takeaway 3: benchmark by p99, not “average”
If your app is interactive, p99 is what users feel. That’s why the 200ms p99 target is a big deal.
What to try next (actionable)
- Measure retrieval latency end-to-end (include network + DB + rerank)
- Set explicit targets: p50 / p95 / p99
- Test approximate-search settings (candidate pool size, reranking, quantization options)
- Add caching intelligently (hot queries, hot tenants)
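A minimal harness for the first two bullets, assuming a `retrieve(query)` function that wraps your whole retrieval path (embed + search + rerank):

```python
import time
import numpy as np

def measure_latency(retrieve, queries, warmup=10):
    """Time end-to-end retrieval and report tail latencies, not the average."""
    for q in queries[:warmup]:              # warm caches and connections first
        retrieve(q)

    latencies_ms = []
    for q in queries[warmup:]:
        start = time.perf_counter()
        retrieve(q)
        latencies_ms.append((time.perf_counter() - start) * 1000)

    for pct in (50, 95, 99):
        print(f"p{pct}: {np.percentile(latencies_ms, pct):.1f} ms")
```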

