
The Epstein Files — We Made Them Rigorously Searchable With One-Off Data Pipelines

Overstand Team · February 26, 2026

Key Learnings

  • Traditional embedding-based RAG breaks down on messy, large-scale document corpora like the Epstein files.
  • A structured pre-processing pipeline — ingestion, transcription, extraction, resolution — creates a queryable dataset, not just embedded chunks.
  • Each user query generates a one-off pipeline that runs against the full corpus, not a sampled subset.
  • Deterministic operations handle filtering and aggregation; LLM calls are reserved for genuine judgment — interpretation and synthesis.
  • The result is speed (minutes, not hours), traceability (every answer links to source documents), and reliability (fewer hallucinations).

TL;DR — We've made the DOJ's release of the Epstein files semantically searchable with one-off data pipelines per query. You can watch the full demo (~3 min) here. We think this is a powerful tool and we want to share it with responsible journalists, researchers and government officials. If you or someone you know should have access, please reach out directly at founders@overstandlabs.com.

Epstein Files search interface demo showing query input and document reasoning

We took everything we know about large-scale data systems and applied it to the DOJ's public release of the Epstein files. The result: a system that can query the entire corpus — including messy scans, exhibits, and OCR-heavy documents — in minutes. And no, it's not built using traditional RAG.

The rest of this post walks through how we did this under the hood, and is moderately technical.

Why we didn't use traditional embedding-based RAG

Most of us have been there — setting up vector embeddings, experimenting with chunking strategies, maybe getting something that works well enough in a demo. We've been through that too. And in our experience, these systems sorta work... until they really don't.

The issue we kept running into wasn't just about retrieval. It was about what happens before retrieval — the "compression" step, when you take a large, messy document and collapse it down into a vector. At some point, you have to ask: does that representation actually preserve what matters? We're not convinced it does, especially at scale.

Now throw in a corpus like the Epstein files — scanned exhibits, heavy redactions, inconsistent naming conventions, OCR artifacts, fragmented references. In that environment, embedding-based retrieval gets brittle fast.

Specifically, three things tend to go wrong:

  • Compression loses meaning. Collapsing a long, messy document into a single vector discards details that matter for investigation.
  • Noisy text breaks matching. OCR artifacts, redactions, and inconsistent naming push related documents apart in embedding space.
  • Retrieval samples the corpus. Top-k similarity search returns a handful of chunks and hopes they are the right ones.

For casual or exploratory search, that's probably fine. For investigative work, it isn't. We needed a system that could evaluate the corpus much more deterministically when the question called for it, not one that samples it and hopes for the best.

The pre-processing pipeline

Rather than starting with retrieval, we built a structured data layer underneath everything. The pre-processing pipeline runs in four stages:

  1. Ingestion: Provenance tracking and deduplication from day one.
  2. Transcription: OCR and structured descriptions so every document becomes searchable text.
  3. Extraction: People, organizations, locations, dates, and relationships pulled into structured form.
  4. Resolution: Obvious aliases normalized so that references actually cohere across documents.

The end result is a curated, queryable dataset, not just a pile of embedded chunks. That dataset is the input to the next stage: handling the user query.
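As a rough sketch, the four stages could look like the following in Python. Everything here is illustrative: the function names are ours, and the OCR and entity-extraction calls are stubbed placeholders, not our actual implementation.

```python
from dataclasses import dataclass, field
import hashlib

@dataclass
class Document:
    source_path: str                          # provenance: where the file came from
    text: str = ""                            # transcription output
    entities: dict = field(default_factory=dict)

def run_ocr(path):
    # Placeholder: swap in a real OCR engine.
    return f"<text of {path}>"

def run_entity_extraction(text):
    # Placeholder: swap in a real NER / LLM extraction step.
    return {"people": [], "orgs": [], "locations": [], "dates": []}

def ingest(raw_files):
    """Stage 1: provenance tracking plus deduplication by content hash."""
    seen, docs = set(), []
    for path, blob in raw_files:
        digest = hashlib.sha256(blob).hexdigest()
        if digest in seen:
            continue                          # drop exact duplicates
        seen.add(digest)
        docs.append(Document(source_path=path))
    return docs

def transcribe(doc):
    """Stage 2: OCR so the document becomes searchable text."""
    doc.text = run_ocr(doc.source_path)
    return doc

def extract(doc):
    """Stage 3: pull people, orgs, locations, and dates into structured form."""
    doc.entities = run_entity_extraction(doc.text)
    return doc

def resolve(docs, alias_map):
    """Stage 4: normalize obvious aliases so references cohere across documents."""
    for doc in docs:
        doc.entities["people"] = [
            alias_map.get(name, name) for name in doc.entities.get("people", [])
        ]
    return docs
```

The key property is that each stage emits structured data the next stage can rely on, so the corpus ends up queryable rather than merely embedded.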

From intent to post-query pipeline

When a user submits a query, we don't retrieve snippets and summarize them. Instead, we translate the intent behind that query into a one-off pipeline that runs across the full corpus. You read that right: the full corpus. There is no sampling and no random subset; the query runs against every document, individually.

Within this one-off pipeline, most of the work is deterministic (filters, joins, aggregations). We bring in LLM calls only where genuine judgment is needed, such as interpretation and synthesis, and we parallelize them heavily.
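A simplified sketch of what such a one-off pipeline might look like, assuming documents carry pre-extracted fields from the pre-processing stage and `judge_fn` stands in for the parallelized LLM call (all names here are illustrative, not our production code):

```python
from concurrent.futures import ThreadPoolExecutor

def run_query_pipeline(docs, year, person, judge_fn, max_workers=32):
    """One-off pipeline: deterministic filters first, LLM judgment last."""
    # Deterministic stage: cheap structured filters run over the full corpus.
    candidates = [
        d for d in docs
        if d["year"] == year and person in d["people"]
    ]
    # Inference stage: only genuine judgment goes to the LLM,
    # fanned out in parallel across the filtered candidates.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        verdicts = list(pool.map(judge_fn, candidates))
    # Aggregate deterministically, keeping document-level traceability.
    return [
        {"source": d["id"], "finding": v}
        for d, v in zip(candidates, verdicts)
        if v
    ]
```

Because the expensive model calls only see documents that survived the deterministic filters, the pipeline stays fast while every finding remains linked to its source document.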

This gives us:

  • Speed: answers in minutes, not hours.
  • Traceability: every answer links back to its source documents.
  • Reliability: fewer hallucinations, because deterministic operations do the heavy lifting.

The best way to think about it: this behaves more like a data pipelining system, with inference as one transformation among many deterministic ones, than a chatbot sitting on top of a pile of PDFs.

And the icing on the cake: composing the pipeline is itself an act of inference, so we can go from user intent to written pipeline to executed pipeline in just a few minutes.
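Conceptually, the model emits a plan that a small interpreter then executes. The sketch below is a toy version of that idea (the op names, fields, and `llm_call` hook are illustrative assumptions, not our actual plan format):

```python
def execute_plan(plan, docs, llm_call):
    """Run an LLM-written plan: deterministic ops execute directly,
    'llm_judge' steps fan out to the model via llm_call."""
    for step in plan:
        if step["op"] == "filter_eq":
            docs = [d for d in docs if d[step["field"]] == step["value"]]
        elif step["op"] == "filter_contains":
            docs = [d for d in docs if step["value"] in d[step["field"]]]
        elif step["op"] == "llm_judge":
            docs = [d for d in docs if llm_call(step["prompt"], d)]
        else:
            raise ValueError(f"unknown op: {step['op']}")
    return docs

# A toy plan an LLM might emit for "payments involving X in 2002":
PLAN = [
    {"op": "filter_eq", "field": "year", "value": 2002},
    {"op": "llm_judge", "prompt": "Does this document describe a payment?"},
]
```

Because the plan is plain data, it can be inspected, logged, and replayed, which is what keeps an inference-composed pipeline auditable.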

For the public interest — please reach out

There are people out there fighting the good fight with respect to Jeffrey Epstein and his co-conspirators. We want them to have a tool like this — to do deep research through these files, pull out instances from documents reliably, and make a compelling, fact-based case against an individual or organization very quickly.

If you or someone you know is in the press, in law enforcement, or in government, please reach out at founders@overstandlabs.com, or request access here.


See the Epstein Files demo, or learn more about how Overstand works across industries.