Issue 03 · 14 min read · May 2026

RAG THAT WON'T LIE

I kept seeing the same problem statement online. Build a retrieval system over ten million documents that doesn't make things up. I had never built one, so over a weekend I gave it a shot, on my laptop, over all of Wikipedia. Here is how far I got, the real bugs, and the part where it learned to say "I don't know."

I kept seeing the same problem statement online: design a RAG system for ten million documents with near zero hallucination. Ten steps, lots of confident arrows, no code. I had never built one end to end, and I wanted to know whether I could. So over a weekend I gave it a shot, on my own laptop, over all of Wikipedia, to find out what those confident arrows actually cost.

The rules I gave myself: run it all on my own machine, no hosted vector database, no API key, no framework doing the hard part behind a function call. Just the real pipeline, from a raw Wikipedia dump to a system that answers questions and, more importantly, refuses to answer when it shouldn't. Here is the shape of the whole thing before we walk through it.

Figure: RAG data flow over 10M docs. The Wikipedia dump (25 GB, 10M docs) enters at 1 INGEST (normalize), which writes chunks and embeddings to three stores: VECTOR INDEX (embeddings), BM25 INDEX (keywords), and METADATA (text, title, url). A user question goes to 2 HYBRID RETRIEVE (top-100), then 3 RERANK, 4 SCORE (top-5), and 7 GATE on confidence. YES leads to 5 GENERATE (context only) and 6 CITED ANSWER (grounded and cited). NO returns INSUFFICIENT, with no answer. Supporting layers: 8 EVAL SET (test queries), 9 CACHE (query results), 10 TRACE LOG (every query).

The same ten layers as a data-flow diagram. A question flows through retrieval, scoring, and the gate to a cited answer, or stops at insufficient evidence. The stores underneath (eval, cache, trace) never stop running.

Ten layers. Seven of them get a question to an answer. The other three keep that answer honest, fast, and explainable once real traffic hits it. The problem statement drew all ten as clean arrows. When I tried to build them, each one taught me something, and most broke in a way the arrows never warned me about. Let me walk through what I hit.

The 25 gigabytes you can't open

Wikipedia gives you everything in one file. It is about 25 gigabytes compressed, and it expands to over a hundred. You cannot load that into memory and you cannot open it in an editor. It is also not clean text. Every article is wrapped in markup: templates, infoboxes, citation tags, tables, link syntax. The first job is turning that mess into plain readable text, one article at a time, without ever holding the whole thing in memory.

raw wikitext → clean text

{{Infobox scientist | name = Albert Einstein | birth_date = {{birth date|1879|3|14}} }} '''Einstein''' was a [[theoretical physicist]] who developed <ref>{{cite book|...}}</ref>

Einstein was a theoretical physicist who developed the theory of relativity.
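A minimal sketch of the "one article at a time" memory pattern, using only the Python standard library. It is not the parser described below and not what wikiextractor does internally, and it does no markup cleaning at all. The function name `iter_pages` and the file name in the usage note are assumptions for illustration. The only point is the shape: decompress as a stream, yield one page, then drop it.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace: '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(path):
    """Yield (title, raw_wikitext) one page at a time from a .xml.bz2 dump."""
    with bz2.open(path, "rb") as stream:
        events = iter(ET.iterparse(stream, events=("start", "end")))
        _, root = next(events)  # the first event is the start of the root element
        for event, elem in events:
            if event == "end" and local_name(elem.tag) == "page":
                title, text = "", ""
                for child in elem.iter():
                    name = local_name(child.tag)
                    if name == "title":
                        title = child.text or ""
                    elif name == "text":
                        text = child.text or ""
                yield title, text
                root.clear()  # drop finished pages so memory stays flat
```

Usage would look like `for title, text in iter_pages("enwiki-dump.xml.bz2"): ...`, where the file name is hypothetical. The text that comes out is still raw wikitext; turning it into clean prose is the slow part.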

I tried to write my own parser first. It worked, and it was correct, and after half an hour it had processed about twenty thousand pages out of twenty-two million. At that rate it was going to finish in roughly twenty-three days. The library I was leaning on was careful and thorough, which is exactly why it was slow.

So I threw it away and used wikiextractor, a tool built for this one job that runs across all ten of my CPU cores at once. Same machine, same dump. It finished in about three hours.

Parsing speed · same dump, same laptop
custom parser    ~23 days
wikiextractor    ~3 hours

Both produce clean text. Only one of them finishes this month.

Lesson one: someone already wrote the boring part. The win was deleting my code, not writing more of it.

Cutting Wikipedia into index cards

A full article is too big to be a unit of search. If you ask "how do cells divide" and the system hands you the entire biology article, you still have to find the one paragraph that answers you. So you cut every article into small pieces, roughly 384 tokens each, which is about three hundred words. Each piece overlaps the next one by a little, so a sentence that lands on a boundary does not get split clean in half and lost.

Every piece keeps a label: which article it came from, the title, the link. That label is what lets the system cite a source later instead of just asserting things. Wikipedia's dump unpacked into 18.8 million pages, but most of those are redirects, stubs, disambiguation pages, and bare lists. After dropping all of that and slicing what was left, I had 13.98 million chunks of real prose.

One article becomes many overlapping chunks
Figure: one long article split into chunk 1, chunk 2, and chunk 3, each 384 tokens, with a 50-token overlap between neighbouring chunks.

The overlap means a sentence on a boundary lands whole in at least one chunk.
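The slicing can be sketched in a few lines. This is a minimal version under stated assumptions: tokens are whatever list you pass in (a whitespace split is a rough stand-in for the embedding model's tokenizer), and the function name `chunk_article` and the dictionary keys are hypothetical. The chunk ID and URL shapes copy the ones visible in the output later in this piece.

```python
def chunk_article(page_id, title, tokens, size=384, overlap=50):
    """Split one article's tokens into overlapping chunks, each with its source label."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    stride = size - overlap  # each chunk starts 334 tokens after the previous one
    chunks = []
    for index, start in enumerate(range(0, len(tokens), stride)):
        window = tokens[start:start + size]
        chunks.append({
            "chunk_id": f"{page_id}_{index:04d}",
            "title": title,
            "url": f"https://en.wikipedia.org/wiki?curid={page_id}",
            "tokens": window,
        })
        if start + size >= len(tokens):
            break  # this window already reached the end of the article
    return chunks


tokens = "some cleaned article text goes here".split()  # stand-in tokenizer
print(len(chunk_article(30001, "Theory of relativity", tokens)))
```

Because the label travels with every chunk, nothing downstream has to look up where a passage came from.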

You don't search books. You search pages. The chunk is the page.

Turning words into coordinates

This is the step that makes the whole thing possible, and it is the one that sounds like magic until you say it plainly. A computer cannot tell that "magma erupting from a mountain" and "volcanic eruption" mean the same thing. They share no words. So you run every chunk through a small model that reads the text and outputs a list of 384 numbers. Think of it as a coordinate. Chunks that mean similar things land near each other. Chunks about different things land far apart.

Once everything is a coordinate, "find me text about volcanoes" becomes "find the coordinates closest to this one." Meaning turns into distance, and distance is something a machine is very good at.
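Here is "meaning turns into distance" as a toy sketch. The three-number vectors are made up by hand, not real model output (the real ones have 384 numbers), and cosine similarity is assumed as the closeness measure. The function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, near 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query_vec, items, k=2):
    """Return the k texts whose vectors are closest to the query vector."""
    ranked = sorted(items, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Hand-made vectors for illustration only, NOT real embeddings.
corpus = [
    ("volcanic eruption", [0.9, 0.1, 0.0]),
    ("magma bursts from a mountain", [0.8, 0.2, 0.1]),
    ("the cat slept on the mat", [0.0, 0.1, 0.9]),
]
query = [0.85, 0.15, 0.05]  # pretend embedding of "find me text about volcanoes"
print(nearest(query, corpus))
```

The two volcano sentences come back and the cat does not, even though the search never compared a single word.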

I used a small open model called bge-small, ran it on the M4's GPU, and stored the results at half precision to save space. Here is what the build actually weighed.

Why coordinates work · close means similar
Figure: "volcanic eruption" and "magma bursts from a mountain" sit close together; "the cat slept on the mat" sits far apart from both.

Same meaning, nearby coordinates. Different meaning, far away. The search never compares the words, only the distance.

Source dump                     ~25 GB
Articles extracted              18.8M
Usable chunks                   13.98M
Chunks embedded (this build)    750K
Vector size                     384 numbers
Query speed                     under 1 second
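The half-precision saving is simple arithmetic, sketched below. It counts raw vector bytes only; whatever the index structure adds on top is not included and not something the numbers above tell us. The helper names are hypothetical.

```python
import struct

DIM = 384            # numbers per vector, from the table above
BYTES_PER_HALF = 2   # IEEE 754 half precision, versus 4 for float32


def pack_half(vector):
    """Serialize one vector as little-endian half-precision floats."""
    return struct.pack(f"<{len(vector)}e", *vector)


def raw_vector_bytes(n_chunks, dim=DIM):
    """Bytes needed for the vectors alone, with no index overhead."""
    return n_chunks * dim * BYTES_PER_HALF


print(len(pack_half([0.0] * DIM)))         # bytes for one chunk's vector
print(raw_vector_bytes(750_000) / 1e6)     # MB for this build
print(raw_vector_bytes(13_980_000) / 1e9)  # GB for the full corpus
```

That works out to 768 bytes per chunk, about 576 MB for 750 thousand chunks, and about 10.7 GB for all 13.98 million. At full precision each of those figures doubles.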

The embedding job kept dying overnight. I would start it, go to sleep, and wake up to a half-finished run. The laptop was going to sleep and killing the process, and the usual command to keep it awake does not stop the lid-close sleep. I eventually stopped at 750 thousand chunks, which is far more than enough for a system that actually works. The pipeline scales to all 14 million. That part is just hours of patience I decided not to spend.

Meaning became math. The laptop slept through half of it anyway.

Two librarians, not one

Here is a thing I did not expect to matter so much. Searching by meaning is not enough on its own. If you ask for "Python 3.12" or a specific name or an error code, meaning-search can drift, because the embedding smooths over exact tokens. So the system runs two searches at once.

One is the meaning search I just described. The other is old-fashioned keyword search, the kind that matches exact words and weighs rare words more heavily. Then it merges the two ranked lists, and anything both searches agree on rises to the top. Finally a slower, sharper model re-reads the question against each top candidate and reorders them properly.

The clearest example: I asked "what is the capital of France" and the top result was the Paris article, found by the meaning search, not the keyword one. The chunk never used the word "capital." Keyword search alone would have missed it. Meaning search alone can drift on an exact-token query like "Python 3.12." Together they cover each other.
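One common way to merge two ranked lists so that agreement rises to the top is reciprocal rank fusion. The text above does not name the merge method, so treat RRF, the constant `k=60`, and the function name as assumptions. The chunk IDs are the ones from the trace shown later.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists: each list adds 1 / (k + rank) to a document's score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


keyword_hits = ["30001_0000", "13758_0004", "14909_0007"]
meaning_hits = ["30001_0000", "736_0011", "13758_0002"]
fused = reciprocal_rank_fusion([keyword_hits, meaning_hits])
print(fused[0])  # the one chunk both searches returned
```

A chunk found by both searches collects two contributions, so it outranks anything found by only one. RRF also needs only ranks, which matters because keyword scores and vector distances are not on comparable scales.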

Ask two librarians who disagree, then trust what they agree on.

The part that won't lie

This is the whole reason the project exists. A normal language model, asked a question, will answer. That is the problem. It will answer even when it has nothing real to go on, and it will sound just as confident either way. I wanted the opposite. I wanted a system that would rather say "I don't know" than invent something.

There are two guards. The first one runs before the model is even called. After reranking, if the best match scores below a threshold, the system stops right there and refuses. No model, no answer, no chance to make something up. The second guard is the prompt itself: the model is told to answer only from the passages it is given, and to say "insufficient evidence" if the answer is not there.

The first guard · rerank score vs the gate (threshold 0.5)
Query                                      Score   Verdict
Who developed the theory of relativity?     9.07   PASS
photosynthesis in plants                    8.73   PASS
What did Einstein eat for breakfast?        0.31   REFUSE
asdfghjkl qwerty nonsense                  −0.16   REFUSE

Real reranker scores. Above the line the model answers. Below it, the system never even asks.
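The first guard is small enough to sketch whole. The threshold and the refusal wording come from this article; the function names and the `generate` callback are hypothetical stand-ins for the real pipeline.

```python
REFUSAL = "I don't have enough reliable information to answer that."


def gate(rerank_scores, threshold=0.5):
    """Return (passed, best_score). An empty candidate list never passes."""
    if not rerank_scores:
        return False, None
    best = max(rerank_scores)
    return best >= threshold, best


def answer_or_refuse(question, rerank_scores, generate, threshold=0.5):
    """Call the model only when the best reranked passage clears the gate."""
    passed, _ = gate(rerank_scores, threshold)
    if not passed:
        return REFUSAL  # the model is never called
    return generate(question)


print(gate([9.07, 8.04, 7.11]))  # relativity question
print(gate([0.31]))              # breakfast question
```

The important property is that a refusal returns before `generate` runs, so there is no text for the model to embellish.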

Here is what that looks like in practice. Two questions, both about Einstein, both find Einstein in the corpus. Only one gets an answer.

RAG-that-wont-lie · ask

Who developed the theory of relativity?

Albert Einstein developed the theory of relativity.

source: Theory of relativity · en.wikipedia.org

What did Albert Einstein eat for breakfast?

I don't have enough reliable information to answer that.

The second answer is the one I am proud of. Einstein is everywhere in the corpus, so retrieval happily pulls back pages about him. But none of them mention breakfast. A system optimized to be helpful would stitch together a plausible sentence. This one notices the passages do not actually contain the answer, and stops.
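The second guard lives in the prompt. A minimal sketch of what such a prompt builder might look like follows; the exact wording used in the project is not shown here, so every instruction string and the name `build_prompt` are assumptions.

```python
def build_prompt(question, passages):
    """passages: list of (title, text). Numbers them so the answer can cite [n]."""
    numbered = "\n\n".join(
        f"[{i}] {title}\n{text}"
        for i, (title, text) in enumerate(passages, start=1)
    )
    return (
        "Answer the question using ONLY the passages below.\n"
        "Cite the passage number like [1] after each claim.\n"
        'If the passages do not contain the answer, reply exactly: "insufficient evidence".\n\n'
        f"Passages:\n{numbered}\n\n"
        f"Question: {question}\nAnswer:"
    )


print(build_prompt(
    "Who developed the theory of relativity?",
    [("Theory of relativity", "passage text would go here")],
))
```

Numbering the passages is what makes an inline marker like [1] checkable against a real source afterwards.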

A system that says "I don't know" is worth more than one that's confidently wrong.

The bug that wasn't where I looked

Then it started crashing. Every time I ran more than one question in a row, the whole thing fell over with a macOS dialog: "Python quit unexpectedly." No error message, no traceback, just a hard exit.

I guessed. I blamed the GPU first, because the GPU had caused me trouble earlier, so I moved the work to the CPU. Still crashed. Then I blamed the number format, switched it, dropped to a smaller model. Still crashed. I was two wrong theories deep and getting nowhere.

Three theories. One was right.
1. blame the GPU (wrong) · 2. blame float16 (wrong) · 3. read the crash report: libomp thread clash (right)

Two guesses cost me an evening. The crash report had the answer the whole time.

So I stopped guessing and read the crash report, which I should have done first. It named the exact library and the exact function.

EXC_BAD_ACCESS (SIGSEGV) in libomp.dylib
  __kmp_suspend_64
  __kmp_fork_barrier
  __kmp_launch_worker

The two libraries I was using, one for vector search and one for the model, each ship their own copy of the same threading runtime. Loaded together in one process, the two copies collide on a shared thread barrier and the process dies. It had nothing to do with the GPU or the number format. Those were red herrings. The crashes only looked random because it was a race between threads.

The fix was forcing everything to a single thread and telling the runtime to tolerate the duplicate. Three lines of configuration at the top of the file. Finding it took the one thing I had skipped.
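The exact lines are not reproduced in this piece, so the following is a sketch of what a fix of that shape usually looks like rather than a copy of the project's file. `KMP_DUPLICATE_LIB_OK` and `OMP_NUM_THREADS` are real OpenMP runtime environment variables; the commented import is a hypothetical placeholder.

```python
import os

# These must be set before the vector-search and model libraries are imported,
# because each OpenMP runtime reads its settings when it loads.
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"  # tolerate a second copy of libomp
os.environ["OMP_NUM_THREADS"] = "1"          # one OpenMP thread, so no barrier to race on

# import faiss  # hypothetical: the heavy imports go below this point
```

Tolerating a duplicate runtime is a workaround, not a cure. It is a reasonable trade on a laptop build, but the cleaner long-term fix is making both libraries link against the same runtime.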

The fix took three lines. Finding it took the crash report I should've read first.

Zero hallucination is not a model setting. It is a decision, made at every stage, to ground the answer in something real and to refuse when you can't.

What it actually returns

Here is the real thing running. Not a mockup, the actual output. The answer, the citation marker dropped inline, and the exact source it pulled from. Every claim traces back to a passage you can open.

rag-that-wont-lie · ask
$ ask "Who developed the theory of relativity?"
Albert Einstein developed the theory of relativity. [1]
[1] Theory of relativity
    https://en.wikipedia.org/wiki?curid=30001
confidence 0.90 · retrieval 816ms · generation 7.1s · model Qwen2.5

An answer with a citation is where most demos stop. A system you would actually run needs three more layers. They are layers 8, 9 and 10 of the blueprint, not extras, and they are the difference between a thing that works in a screen recording and a thing that holds up when real questions hit it. I built all three. Here is each one.

Evals: knowing it still behaves

Every change I made could silently break the system. A different chunk size, a smaller model, a new confidence threshold. The answer to "did that just make it worse?" cannot be a vibe. It has to be a number I can re-run.

The usual move is to grab a public benchmark like Natural Questions or TriviaQA. That does not work here, because my corpus is only the first 750 thousand chunks of Wikipedia. Most of those benchmarks' gold answers live in articles I never embedded, so the system would "fail" questions it was never given the material for, and the score would be measuring my sampling, not my pipeline. So I wrote an honest set instead, in three buckets: questions whose answer is genuinely in the corpus, real-topic questions whose answer is not (Einstein is in there, his breakfast is not), and pure gibberish. The first bucket has to answer with a citation. The other two have to refuse. The number I actually watch is the refusal rate on the last two buckets, because that is "won't lie" expressed as a percentage.

The eval set · three buckets, expected behaviour
Bucket                  What it contains                                                    Expected
Answerable              fact is in the corpus (relativity, photosynthesis, anarchism)       ANSWER + CITE
Unanswerable on topic   real entity, fact absent ("Einstein's breakfast")                   REFUSE
Adversarial             pure gibberish                                                      REFUSE

Real run: 100% answered and cited on the first bucket, 100% refused on the other two. The suite runs end to end with one command.
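A three-bucket harness fits in one function. This sketch assumes the pipeline exposes an `ask(question)` call returning a dict with `refused` and `sources`; that interface, the bucket keys, and `stub_ask` are all hypothetical, and the stub exists only so the example runs on its own.

```python
def run_eval(ask, cases):
    """cases: list of (bucket, question). Returns the pass rate per bucket."""
    totals, correct = {}, {}
    for bucket, question in cases:
        result = ask(question)
        if bucket == "answerable":
            ok = not result["refused"] and len(result["sources"]) > 0
        else:  # "unanswerable" and "adversarial" must both refuse
            ok = result["refused"]
        totals[bucket] = totals.get(bucket, 0) + 1
        correct[bucket] = correct.get(bucket, 0) + int(ok)
    return {bucket: correct[bucket] / totals[bucket] for bucket in totals}


CASES = [
    ("answerable", "Who developed the theory of relativity?"),
    ("unanswerable", "What did Einstein eat for breakfast?"),
    ("adversarial", "asdfghjkl qwerty nonsense"),
]


def stub_ask(question):
    # Stand-in for the real pipeline, not a real result.
    if "relativity" in question:
        return {"refused": False, "sources": ["Theory of relativity"]}
    return {"refused": True, "sources": []}


print(run_eval(stub_ask, CASES))
```

Scoring the refusal buckets separately matters: a system that answers everything would still look fine on the first bucket alone.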

A demo proves it worked once. An eval proves it still works after you touched it.

Caching: not paying twice

One answer costs around ten seconds of model time on my laptop. In the real world the same questions come back constantly, the popular ones over and over. Paying full price for each repeat is wasteful and slow. So before any query touches retrieval or the model, it checks a cache.

The key is a normalized hash of the question: I lowercase it and collapse the whitespace before hashing, so "Who developed relativity?" and "who  developed   relativity?" land on the same entry. A hit returns the stored answer and its sources straight from a small local database, in about a millisecond, with a one-day expiry so stale answers do not live forever. No retrieval, no rerank, no generation. At scale, this is most of your latency and most of your bill, gone, for the slice of traffic that repeats.

Caching · the same question, second time
first ask          ~10 s
repeat (cached)    0.001 s

A normalized-hash lookup in SQLite. The second ask never reaches the model.
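A sketch of that lookup, under stated assumptions: SHA-256 as the hash, a single-table schema, and the function names are all hypothetical. It normalizes only case and whitespace, as described, and leaves punctuation alone. For brevity it stores just the answer text; sources could sit alongside as a JSON column.

```python
import hashlib
import sqlite3
import time

TTL_SECONDS = 24 * 60 * 60  # the one-day expiry from the text


def cache_key(question):
    normalized = " ".join(question.lower().split())  # lowercase, collapse whitespace
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def open_cache(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS answers "
        "(key TEXT PRIMARY KEY, answer TEXT, created REAL)"
    )
    return db


def cache_put(db, question, answer, now=None):
    now = time.time() if now is None else now
    db.execute(
        "INSERT OR REPLACE INTO answers VALUES (?, ?, ?)",
        (cache_key(question), answer, now),
    )
    db.commit()


def cache_get(db, question, now=None):
    now = time.time() if now is None else now
    row = db.execute(
        "SELECT answer, created FROM answers WHERE key = ?",
        (cache_key(question),),
    ).fetchone()
    if row is None or now - row[1] > TTL_SECONDS:
        return None  # miss or expired: fall through to the full pipeline
    return row[0]
```

One design note: refusals are worth caching too, since a repeated unanswerable question otherwise pays for retrieval and reranking every time.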

The fastest model call is the one you never have to make.

Observability: no black boxes

When an answer looks wrong, "the model did it" is not an explanation. I wanted to be able to open up any single answer, weeks later, and see exactly why it came out the way it did. So every query writes one trace line as it runs.

The trace records the whole path: the top hits from the keyword search, the top hits from the meaning search, the order after they were fused, the reranked top five with their actual scores, the gate's verdict and the score it gated on, and the time each stage took. It is written as one JSON object per query, so I can tail the log or grep it later. When the gate refuses something it should have answered, or answers something it should have refused, the trace tells me whether retrieval missed it, the reranker misjudged it, or the threshold is wrong. That is the difference between guessing and knowing.

logs/traces.jsonl · one query
query       "Who developed the theory of relativity?"
bm25 top    30001_0000 · 13758_0004 · 14909_0007
faiss top   30001_0000 · 736_0011 · 13758_0002
reranked    30001 (9.07) · 13758 (8.04) · 736 (7.11)
gate        PASS (best 9.07 ≥ 0.5)
latency     retrieval 816ms · generation 7.1s
every answer reconstructable, after the fact
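Writing such a trace is mostly a matter of discipline. A minimal sketch, assuming a JSON Lines file with one object per query; the field names and the `write_trace` signature are hypothetical, chosen to mirror the fields displayed above.

```python
import json
import time


def write_trace(path, query, bm25_top, faiss_top, fused, reranked, threshold, latency_ms):
    """Append one JSON object per query. reranked: list of (doc_id, score)."""
    best = max((score for _, score in reranked), default=None)
    record = {
        "ts": time.time(),
        "query": query,
        "bm25_top": bm25_top,
        "faiss_top": faiss_top,
        "fused": fused,
        "reranked": [{"id": doc_id, "score": score} for doc_id, score in reranked],
        "gate": {
            "verdict": "PASS" if best is not None and best >= threshold else "REFUSE",
            "best": best,
            "threshold": threshold,
        },
        "latency_ms": latency_ms,
    }
    with open(path, "a", encoding="utf-8") as log:
        log.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record
```

Because each line is a complete JSON object, the log can be tailed live or filtered later with ordinary text tools, and a refusal leaves exactly as much evidence behind as an answer.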

If you can't explain why it answered, you can't promise it won't lie.

The problem statement made it look like ten clean arrows. What I actually got was ten working layers, three crashes, a parser I threw away, and a laptop that kept falling asleep. All ten are built and tested, not just the ones that make a good screenshot. None of that mess showed up in the original ten-step answer, and all of it is where the actual learning was.

What I would do next: embed all 14 million chunks instead of stopping at 750 thousand, and swap in a bigger model so the citations point to the exact sentence rather than the passage. The whole thing is vibe-coded with Claude, every script and every step is public, and you can read or run it here: github.com/Ajinkya259/RAG-that-wont-lie.

The next time a tool answers you instantly and confidently, ask yourself what it would have done if it didn't actually know. Would you be able to tell?

© 2026 Ajinkya Sambare