TurboVec on 100M RAG Vectors

A benchmark-driven look at what happens when vector compression moves from 10M scale to 100M scale.

TurboVec is getting attention for a simple reason:

A 10M-document float32 index can take around 31GB.

TurboVec can fit it into about 4GB.

I wanted to test the next question:

What happens at 100M RAG vectors?

Not a toy dataset.

Not a small local benchmark.

I used the MS MARCO Web Search vector dataset and built TurboVec indexes at 1M, 10M, and full 100M scale.

The full float32 payload was 310GB.

The 4-bit TurboVec index was 37GB.

Setup

I ran the benchmark on AWS EC2.

| Detail | Value |
| --- | --- |
| Instance | AWS EC2 i7i.8xlarge |
| Dataset | MS MARCO Web Search vectors |
| Vectors | 101,070,374 |
| Dimensions | 768 |
| Raw float32 size | 310.5GB |
| Checkpoints | 1M, 10M, full |
| Indexes tested | TurboVec 2-bit, TurboVec 4-bit |
| Metric | Recall@10 against ground truth |
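The raw float32 size follows directly from the vector count and dimensionality. A minimal sketch of that arithmetic, assuming 4 bytes per float32 value, decimal gigabytes, and no index overhead (the function name is illustrative, and applying 768 dimensions to a 10M-vector corpus is my assumption):

```python
# Back-of-envelope size of an uncompressed float32 vector payload.
BYTES_PER_FLOAT32 = 4


def float32_payload_gb(num_vectors: int, dims: int) -> float:
    """Raw payload in decimal gigabytes (1 GB = 10**9 bytes), no index overhead."""
    return num_vectors * dims * BYTES_PER_FLOAT32 / 1e9


# Full corpus from the setup table.
full_gb = float32_payload_gb(101_070_374, 768)
# A 10M-vector corpus, assuming the same 768 dimensions.
ten_million_gb = float32_payload_gb(10_000_000, 768)

print(f"{full_gb:.1f} GB")         # 310.5 GB
print(f"{ten_million_gb:.1f} GB")  # 30.7 GB
```

Both values line up with the figures quoted in this post: about 31GB at 10M vectors and 310.5GB for the full corpus.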

The goal was not to create a perfect vector database benchmark.

The goal was narrower:

How much can TurboVec compress a 100M-vector corpus, and what tradeoff shows up in recall and search latency?

Result

This was the full-scale result:

| Method | Index size | Compression | Recall@10 | p50 search |
| --- | --- | --- | --- | --- |
| float32 | 310GB | 1.0x | exact | - |
| TurboVec 2-bit | 18.9GB | 15.7x | 0.608 | 3.19s |
| TurboVec 4-bit | 37.4GB | 7.9x | 0.914 | 5.67s |

The storage result is strong.

310GB of float32 vectors became a 37GB 4-bit index.

That is the number I would pay attention to for RAG systems.

The 2-bit result is more eye-catching:

18.9GB for 100M vectors

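But the recall drop is large.

At 0.608 recall@10, roughly 4 of the 10 true nearest neighbors are missing from each result list on average, so for most retrieval pipelines I would not treat the 2-bit number as the main result.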

The 4-bit index is the more practical one in this run.

It kept the full 100M-vector index under 40GB and still reached 91.4% recall@10.

Scaling

I also measured the same setup at 1M and 10M checkpoints.

For 4-bit TurboVec:

| Vectors | Index size | Recall@10 | p50 search |
| --- | --- | --- | --- |
| 1M | 370MB | 0.899 | 53.7ms |
| 10M | 3.7GB | 0.904 | 562ms |
| 101M | 37.4GB | 0.914 | 5.67s |

The compression scaled cleanly.

Latency also scaled with corpus size: p50 grew roughly 10x for each 10x increase in vectors.

That part matters.

At 100M vectors, the index was much smaller than float32, but each query was still searching a corpus about 10 times larger than the 10M checkpoint.

The storage win was clear.

The latency still reminded me I was searching 100M vectors.
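One way to read the scaling numbers is to normalize p50 latency by corpus size. The sketch below does that with the 4-bit figures from this section. It assumes the 1M and 10M checkpoints hold exactly 1,000,000 and 10,000,000 vectors, which the post does not state.

```python
# p50 latencies for the 4-bit index, in milliseconds.
# The 1M and 10M vector counts are assumed to be exact round numbers.
checkpoints = [
    ("1M", 1_000_000, 53.7),
    ("10M", 10_000_000, 562.0),
    ("101M", 101_070_374, 5670.0),
]


def ms_per_million(num_vectors: int, p50_ms: float) -> float:
    """p50 latency divided by corpus size in millions of vectors."""
    return p50_ms / (num_vectors / 1_000_000)


per_million = {label: ms_per_million(n, ms) for label, n, ms in checkpoints}
for label, value in per_million.items():
    print(f"{label}: {value:.1f} ms per million vectors")
# 1M: 53.7 ms per million vectors
# 10M: 56.2 ms per million vectors
# 101M: 56.1 ms per million vectors
```

The per-million cost stays in a narrow band, which is consistent with search time growing roughly linearly with the number of vectors in this run.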

Measurement Notes

The latency numbers were measured with 300 sampled queries, repeated 3 times.
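For readers who want to reproduce a similar measurement, here is a minimal sketch of a p50 timing loop. It is not the benchmark's actual harness: `measure_p50_ms` and `fake_search` are hypothetical names, and pooling all repeats into one median is an assumption about how the repeats were combined.

```python
import statistics
import time


def measure_p50_ms(search_fn, queries, repeats=3):
    """Time every query `repeats` times; return (median latency in ms, sample count)."""
    timings_ms = []
    for _ in range(repeats):
        for query in queries:
            start = time.perf_counter()
            search_fn(query)
            timings_ms.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings_ms), len(timings_ms)


# Hypothetical stand-in for a real index search call; swap in the real one.
def fake_search(query):
    return sorted(query)[:10]


# 300 synthetic "queries", mirroring the sample size used in this post.
queries = [[(i * 7919 + j) % 1000 for j in range(256)] for i in range(300)]
p50_ms, num_samples = measure_p50_ms(fake_search, queries)
print(f"p50 over {num_samples} timed calls: {p50_ms:.4f} ms")
```

Using `time.perf_counter` keeps the timing monotonic, and the median is less sensitive to a few slow outliers than the mean.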

Recall@10 used 1000 sampled queries against the provided ground truth.
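Recall@10 here means the fraction of the true top 10 neighbors that show up in the returned top 10, averaged over queries. A minimal sketch of that metric follows. The function name and the toy ids are illustrative, and the benchmark's own implementation may differ in details.

```python
def recall_at_k(retrieved, ground_truth, k=10):
    """Mean fraction of the true top-k ids that appear in the retrieved top-k."""
    if len(retrieved) != len(ground_truth):
        raise ValueError("need one ground-truth list per query")
    total = 0.0
    for got, truth in zip(retrieved, ground_truth):
        truth_top = set(truth[:k])
        total += len(truth_top.intersection(got[:k])) / len(truth_top)
    return total / len(retrieved)


# Tiny made-up example with k=3: the first query misses one true neighbor.
truth = [[1, 2, 3], [4, 5, 6]]
found = [[1, 2, 9], [4, 5, 6]]
example_recall = recall_at_k(found, truth, k=3)
print(round(example_recall, 3))  # 0.833
```

In the toy example the first query recovers 2 of 3 true neighbors and the second recovers all 3, so the mean is 5/6.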

The full 4-bit index took about 66 minutes to build.

After loading and preparing the full 4-bit index, the benchmark process used about 73 GiB RSS.

That memory number is query-side prepared RSS, not the persisted index size.

The persisted 4-bit index file was 37.4GB.

2-Bit vs 4-Bit

The useful comparison was not just "smaller is better."

It was:

| Bit width | What happened |
| --- | --- |
| 2-bit | Very small index, large recall drop |
| 4-bit | Larger index, much better recall |

2-bit compressed the full corpus to 18.9GB.

That is hard to ignore.

But for RAG, the 4-bit result looked more realistic.

It used about twice the storage of 2-bit, but recall@10 moved from 0.608 to 0.914.

That is the tradeoff I would rather start from.

Code

The benchmark code is available on GitHub.

What This Does Not Claim

This benchmark does not claim:

  • TurboVec is faster than FAISS
  • every RAG workload will get the same recall
  • 4-bit is always the right setting
  • p50 latency here represents a production serving system
  • compression removes the need to think about retrieval architecture

I did not benchmark FAISS in this run.

I also did not test a full RAG application with reranking, filtering, batching, caching, or query-specific routing.

This benchmark measured one thing:

What happens when the same 100M-vector corpus is indexed with 2-bit and 4-bit TurboVec?

Takeaway

The best result was not the smallest index.

The best result was the practical tradeoff.

For this dataset, 4-bit TurboVec compressed 310GB of float32 vectors into a 37GB index while keeping recall@10 at 0.914.

That is the part I found useful.

The index got much smaller.

The search cost did not disappear.

For large RAG systems, that is probably the right way to think about compression:

not as magic,

but as another knob for trading off storage, recall, and latency.