TL;DR — We've made the DOJ's release of the Epstein files semantically searchable with one-off data pipelines per query. You can watch the full demo (~3 min) here. We think this is a powerful tool and we want to share it with responsible journalists, researchers and government officials. If you or someone you know should have access, please reach out directly at founders@overstandlabs.com.
We took everything we know about large-scale data systems and applied it to the DOJ's public release of the Epstein files. The result: a system that can query the entire corpus — including messy scans, exhibits, and OCR-heavy documents — in minutes. And no, it's not built using traditional RAG.
The rest of this post walks through how the system works under the hood; it's moderately technical.
Why we didn't use traditional embedding-based RAG
Most of us have been there — setting up vector embeddings, experimenting with chunking strategies, maybe getting something that works well enough in a demo. We've been through that too. And in our experience, these systems sorta work... until they really don't.
The issue we kept running into wasn't just about retrieval. It was about what happens before retrieval — the "compression" step, when you take a large, messy document and collapse it down into a vector. At some point, you have to ask: does that representation actually preserve what matters? We're not convinced it does, especially at scale.
Now throw in a corpus like the Epstein files — scanned exhibits, heavy redactions, inconsistent naming conventions, OCR artifacts, fragmented references. In that environment, embedding-based retrieval gets brittle fast.
Specifically, three things tend to go wrong:
1. Context fragments. A meaningful signal might span several sections of a document, but retrieval only surfaces pieces of it.
2. Entities drift. "HRH Prince Andrew," "Prince Andrew," and "The Duke of York" don't reliably resolve to the same person.
3. False confidence. The system returns plausible-sounding snippets and acts certain — even when it hasn't actually reasoned across the full corpus.
For casual or exploratory search, that's probably fine. For investigative work, it isn't. We needed a system that could evaluate the corpus much more deterministically when the question called for it — not one that samples it and hopes for the best.
The pre-processing pipeline
Rather than starting with retrieval, we built a structured data layer underneath everything. The pre-processing pipeline runs in four stages:
Ingestion
Provenance tracking and deduplication from day one.
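To make the idea concrete, here's a minimal sketch of content-hash deduplication with provenance tracking. All of the names and the record schema here are hypothetical illustrations, not our actual implementation:

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class SourceRecord:
    sha256: str       # content hash; doubles as the dedup key
    source_path: str  # where the file sits in the release
    batch: str        # which release batch it arrived in

def ingest(files, batch="release-1", seen=None):
    """Deduplicate raw files by content hash while recording provenance."""
    seen = {} if seen is None else seen
    for path, raw in files:
        digest = hashlib.sha256(raw).hexdigest()
        if digest not in seen:  # first sighting wins; later copies are dupes
            seen[digest] = SourceRecord(digest, path, batch)
    return seen

corpus = ingest([
    ("exhibits/a/doc1.pdf", b"%PDF-1.4 ..."),
    ("exhibits/b/doc1_copy.pdf", b"%PDF-1.4 ..."),  # byte-identical duplicate
])
print(len(corpus))  # 1
```

Hashing content rather than filenames means the same exhibit released twice under different paths collapses to one record, while the provenance fields keep a pointer back to where each file first appeared.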
Transcription
OCR and structured descriptions so every document becomes searchable text.
Extraction
People, organizations, locations, dates, and relationships pulled into structured form.
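As a toy illustration of the structured output this stage produces (the real extractor relies on model judgment, not a single regex; the date pattern and field names below are hypothetical):

```python
import re
from datetime import date

# Hypothetical pass: pull US-style dates into structured records.
DATE_RE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")

def extract_dates(doc_id, text):
    records = []
    for m in DATE_RE.finditer(text):
        month, day, year = map(int, m.groups())
        records.append({
            "doc_id": doc_id,             # provenance: which document this came from
            "entity_type": "date",
            "value": date(year, month, day).isoformat(),
            "span": m.span(),             # character offsets, for traceability
        })
    return records

rows = extract_dates("EX-0042", "Flight logged on 3/9/2002, arriving 3/10/2002.")
print([r["value"] for r in rows])  # ['2002-03-09', '2002-03-10']
```

The point is the shape of the output: every extracted fact carries its source document and character span, which is what makes downstream answers traceable.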
Resolution
Obvious aliases normalized so that references actually cohere across documents.
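For the obvious cases, resolution can be as simple as a normalized alias table; a minimal sketch (the table and canonical ids are illustrative, and the real step is more involved):

```python
# Hypothetical alias table mapping surface forms to one canonical entity id.
ALIASES = {
    "hrh prince andrew": "person:prince_andrew",
    "prince andrew": "person:prince_andrew",
    "the duke of york": "person:prince_andrew",
}

def resolve(mention: str) -> str:
    """Normalize a mention to a canonical id, or pass it through unchanged."""
    key = " ".join(mention.lower().split())  # fold case and extra whitespace
    return ALIASES.get(key, mention)

print(resolve("HRH  Prince Andrew"))  # person:prince_andrew
print(resolve("The Duke of York"))    # person:prince_andrew
```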
The end result is a curated, queryable dataset — not just a pile of embedded chunks. That dataset is the input to everything that happens when a user submits a query.
From intent to post-query pipeline
When a user submits a query, we don't retrieve snippets and summarize them. Instead, we translate the intent behind that query into a one-off pipeline that runs across the full corpus. You read that right: the full corpus. No sampling, no random subset; the query runs against every document, individually.
A lot of the work in this one-off pipeline is deterministic (e.g. filters, joins, aggregations). We bring in LLM calls only where genuine judgment is needed (interpretation, synthesis, that kind of thing), and we parallelize those calls heavily.
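Because each LLM call is independent and I/O-bound, fanning them out over a thread pool is the natural parallelization. A minimal sketch, with a stub standing in for the model call (the function names and record shapes are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

def judge_relevance(doc):
    """Stand-in for an LLM call; the real version would hit a model API."""
    return {"doc_id": doc["doc_id"], "relevant": "flight" in doc["text"].lower()}

docs = [
    {"doc_id": "EX-0001", "text": "Flight manifest, March 2002"},
    {"doc_id": "EX-0002", "text": "Property deed, unrelated"},
]

# Each judgment is independent, so a thread pool parallelizes them cheaply.
with ThreadPoolExecutor(max_workers=8) as pool:
    verdicts = list(pool.map(judge_relevance, docs))

print([v["doc_id"] for v in verdicts if v["relevant"]])  # ['EX-0001']
```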
This gives us:
- Speed — most queries resolve in minutes, across the full corpus. Deterministic operations run on frameworks built for scale (e.g. Polars or PySpark, depending on the data); LLM-based operations are heavily parallelized.
- Traceability — every answer links back to a source document.
- Reliability — less drift, fewer hallucinations.
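To show what the deterministic side looks like in miniature (toy data and field names; the real system runs this on Polars or PySpark, not plain Python):

```python
from collections import Counter

# Toy structured layer: entity mentions joined back to their source docs.
mentions = [
    {"doc_id": "EX-0001", "entity": "person:a"},
    {"doc_id": "EX-0003", "entity": "person:a"},
    {"doc_id": "EX-0003", "entity": "org:b"},
]
doc_meta = {"EX-0001": {"year": 2001}, "EX-0003": {"year": 2002}}

# Filter + join + aggregate, all deterministic; no model call involved,
# and every count is traceable back to specific doc_ids.
counts = Counter(
    m["entity"] for m in mentions if doc_meta[m["doc_id"]]["year"] >= 2002
)
print(dict(counts))
```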
The best way to think about it: this behaves less like a chatbot sitting on top of a pile of PDFs, and more like a data pipelining system in which inference is one specific transformation among many deterministic ones.
And the icing on the cake: composing the pipeline is itself an act of inference, so we can go from user intent to a written pipeline to an executed pipeline in just a few minutes.
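One way to picture "intent to pipeline" is a small plan interpreter: the model emits a declarative plan, and deterministic operators execute it. The plan format and operators below are hypothetical illustrations, with the plan hard-coded where the real system would have an LLM compose it from the user's query:

```python
# Hypothetical plan an LLM might emit for "which entities appear from 2002 on?"
plan = [
    {"op": "filter", "field": "year", "gte": 2002},
    {"op": "count_by", "field": "entity"},
]

# Deterministic operator library the plan is interpreted against.
OPS = {
    "filter": lambda rows, step: [r for r in rows if r[step["field"]] >= step["gte"]],
    "count_by": lambda rows, step: {
        v: sum(1 for r in rows if r[step["field"]] == v)
        for v in {r[step["field"]] for r in rows}
    },
}

def run(plan, rows):
    """Execute each plan step in order, threading the data through."""
    data = rows
    for step in plan:
        data = OPS[step["op"]](data, step)
    return data

rows = [
    {"entity": "person:a", "year": 2001},
    {"entity": "person:a", "year": 2003},
    {"entity": "person:b", "year": 2003},
]
print(run(plan, rows))  # {'person:a': 1, 'person:b': 1} (key order may vary)
```

Keeping the plan declarative is what makes "pipeline composition as inference" tractable: the model only has to produce a small, checkable description, not executable code.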
For the public interest — please reach out
There are people out there fighting the good fight with respect to Jeffrey Epstein and his co-conspirators. We want them to have a tool like this — to do deep research through these files, pull evidence out of documents reliably, and build a compelling, fact-based case against an individual or organization very quickly.
If you or someone you know is in the press, in law enforcement, or in government, please reach out at founders@overstandlabs.com, or request access here.