I've been building a self-correcting RAG system from scratch for a while now. No LangChain. No LangGraph. Just raw Python, FAISS, and the OpenAI SDK.
This isn't a tutorial. It's a log of what I've actually hit: the gaps between "load PDFs, chunk them, embed them" and what a system that handles failure actually looks like.
Most RAG tutorials stop at the happy path. You get the 10-line version: load document, embed chunks, retrieve top-K, pass to LLM, get answer. It works on the demo PDF. It looks clean.
Then you ask it something that isn't covered in the document, and it confidently makes something up. Or you ask something slightly ambiguous and the retrieval returns five chunks that are all adjacent to the answer but none of them contain it. Or the LLM cites [Source 3] for a fact that doesn't appear in Source 3 at all.
The quality of your retrieval is significantly more important than the specific LLM used for generation.
That's the thing nobody says early enough. You can swap gpt-4o-mini for gpt-4o and get marginally better answers. But if you're feeding the LLM irrelevant context, better generation quality doesn't fix garbage input. The problem is upstream.
That's what drove the design of ReflectRAG - two reflection loops that catch the two most common failure modes before the user sees the result. One loop handles bad retrieval. The other handles hallucination.
The first thing I built was the ingestion pipeline: PDF loading, chunking, embedding, FAISS index construction. It's the part that feels like plumbing. But it's where every downstream problem either gets set up or avoided.
Chunking isn't just splitting text. I built a recursive splitter that tries paragraph boundaries first (\n\n), then sentence boundaries, then word boundaries. The 800-character limit is a ceiling, not a hard cut. A sentence won't get broken mid-thought just because we hit the threshold.
Overlap is necessary, not optional. Chunk 2 starts with the last 100 characters of Chunk 1. Without this, a fact spanning a chunk boundary disappears from both chunks. You won't know it's missing until someone asks about it and the system confidently returns nothing.
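For concreteness, here's a minimal sketch of that split-then-overlap strategy. It is not the exact ReflectRAG splitter; the function name, separator list, and constants are illustrative:

```python
# Minimal recursive splitter sketch (illustrative, not the real implementation).
# Tries coarse separators first, falls back to finer ones, and carries the
# last OVERLAP characters of each chunk into the next one.
SEPARATORS = ["\n\n", ". ", " "]   # paragraph -> sentence -> word boundaries
CHUNK_SIZE = 800                   # a ceiling, not a hard cut
OVERLAP = 100

def split_text(text: str, separators: list[str] = SEPARATORS) -> list[str]:
    if len(text) <= CHUNK_SIZE or not separators:
        return [text]
    sep, finer = separators[0], separators[1:]
    parts = text.split(sep)
    if len(parts) == 1:                      # separator absent, try a finer one
        return split_text(text, finer)
    chunks: list[str] = []
    current = ""
    for part in parts:
        candidate = f"{current}{sep}{part}" if current else part
        if len(candidate) > CHUNK_SIZE and current:
            chunks.append(current)
            # the new chunk starts with the tail of the previous one
            current = current[-OVERLAP:] + sep + part
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```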
Metadata is the citation chain. Every chunk carries source, page, and chunk_index from the moment it's created. When the system cites [Source 2], that reference traces back to a real page in a real file, not a guess.
The chunk_index field isn't for humans, it's a positional guarantee. During index construction, chunk #42 in the Document list must be the same chunk whose vector sits at position #42 in the FAISS index. Lose that alignment and retrieval silently returns wrong text.
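A minimal shape for that metadata, assuming a plain dataclass (the real project may structure it differently):

```python
from dataclasses import dataclass

@dataclass
class Document:
    """One chunk plus the metadata that makes citations traceable."""
    text: str
    source: str       # originating file, e.g. the PDF's name
    page: int         # page the chunk was extracted from
    chunk_index: int  # must equal this chunk's row in the FAISS index
```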
FAISS stores vectors. Only vectors. It has no concept of strings, page numbers, or anything human-readable. When you call index.search(query_vector, k=5), you get back distances and integer indices. Indices into the list you maintain separately, a pickled array of Document objects that maps position-for-position to the FAISS index.
```python
# Retrieval — the positional contract in practice
distances, indices = _INDEX.search(query_vector, TOP_K)

retrieved_docs = []
for idx in indices[0]:
    if idx != -1 and idx < len(_CHUNKS):     # FAISS returns -1 for missing neighbors
        retrieved_docs.append(_CHUNKS[idx])  # position match: FAISS row -> chunk
```
That _CHUNKS[idx] lookup is the entire link between "this vector is near your query" and "here is the actual text that was embedded." Break the alignment, add vectors to FAISS without adding the corresponding Document to the list, and retrieval silently returns the wrong chunk. No error. No warning. Just wrong text.
Adding new documents to an existing FAISS index without re-pickling the chunk list is a silent correctness bug. FAISS will happily return indices that point to stale or completely wrong chunks. Always treat the index and the chunk list as a single unit.
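One way to enforce that single-unit discipline is to make every write touch both stores and persist them together. A sketch, assuming an in-memory FAISS index and a pickled chunk list; the function and parameter names here are mine:

```python
import pickle

import faiss
import numpy as np

def add_documents(index: faiss.Index, chunks: list, new_docs: list,
                  new_vectors: np.ndarray,
                  index_path: str, chunks_path: str) -> None:
    """Append to FAISS and the chunk list as one unit, then persist both."""
    assert index.ntotal == len(chunks), "stores are already out of sync"
    assert len(new_docs) == new_vectors.shape[0], "docs and vectors must pair up"
    index.add(new_vectors.astype("float32"))  # occupies rows len(chunks)..N-1
    chunks.extend(new_docs)                   # the same positions in the list
    faiss.write_index(index, index_path)      # persist both, together
    with open(chunks_path, "wb") as f:
        pickle.dump(chunks, f)
```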
## FAISS Returns k Results - Even When Nothing Is Relevant

This is the retrieval gap that makes naive RAG unreliable, and the tutorials treat it as a footnote.
You ask for 5 nearest neighbors. You get 5 nearest neighbors. Cosine similarity has no concept of "nothing here is close enough." The returned chunks go straight into the context window, the LLM sees them, and if they're not relevant it either fabricates an answer from noise or confidently answers the wrong question with plausible text.
The fix is a relevance grading step. In ReflectRAG, every retrieved chunk goes through grade_relevance() — a call to gpt-4o-mini with a structured prompt:
```python
from pydantic import BaseModel, Field

class GradeResult(BaseModel):
    """Structured output for the relevance grader."""
    relevant: str = Field(description="Must be exactly 'yes' or 'no'.")

# Per-chunk grading with Pydantic-enforced output
result = call_llm_json(
    messages=messages,
    model=LLM_MODEL_GRADING,
    response_model=GradeResult,
    temperature=0.0,
)
```
Not a free-text response that you parse with string matching - a structured output that OpenAI's API guarantees will match the schema. The grader is intentionally fail-open: if the API call fails for a specific chunk, that chunk passes through. Better to have noise in the context than to silently drop something that might have been useful.
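Wired together, the per-chunk loop looks something like this sketch. grade_relevance is the grader described above; its exact signature is an assumption on my part:

```python
def grade_chunks(question: str, docs: list) -> list:
    """Keep only chunks the grader marks relevant; fail open on API errors."""
    kept = []
    for doc in docs:
        try:
            result = grade_relevance(question, doc)  # returns a GradeResult
            if result.relevant == "yes":
                kept.append(doc)
        except Exception:
            kept.append(doc)  # fail open: noise beats silently dropping context
    return kept
```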
Relevance grading is the quality gate that separates functional RAG from naive RAG. Without it, the context window is at the mercy of cosine similarity, which will always return something, regardless of whether that something is actually relevant.
The cost structure of the pipeline is deliberate. Not everything needs gpt-4o.
| Task | Model | Why |
|---|---|---|
| Query rewriting | gpt-4o-mini | Simple reformulation, no reasoning, temperature 0 |
| Relevance grading | gpt-4o-mini | Binary yes/no; you're paying for judgment, not prose quality |
| Hallucination checking | gpt-4o-mini | Binary verdict, cheaper than generation |
| Answer generation | gpt-4o | Output a human reads and trusts, quality matters |
Grading 5 chunks at 400 characters each costs a few dozen tokens. Running that 5 times per query on gpt-4o-mini is negligible. The generation call on gpt-4o is where the real cost sits. You spend the expensive tokens exactly once, on output that matters.
Set temperature=0.0 on grading and rewriting nodes. You want deterministic, consistent judgment - not creative variation. Save non-zero temperatures for the generation step where some flexibility in phrasing actually improves the answer.
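In code, the tiering reduces to a handful of constants. LLM_MODEL_GRADING appears in the grader above; the other names and the generation temperature are illustrative:

```python
LLM_MODEL_REWRITING = "gpt-4o-mini"   # simple reformulation
LLM_MODEL_GRADING = "gpt-4o-mini"     # binary yes/no judgments
LLM_MODEL_CHECKING = "gpt-4o-mini"    # hallucination verdicts
LLM_MODEL_GENERATION = "gpt-4o"       # the answer a human actually reads

TEMPERATURE_JUDGMENT = 0.0            # deterministic grading and rewriting
TEMPERATURE_GENERATION = 0.3          # slight flexibility in phrasing
```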
This is what separates naive RAG from something genuinely reliable, and it's what I'm building next in the orchestrator.
The architecture has two reflection loops:
The amber loop (relevance re-routing): After grading, if zero chunks pass, the system routes back to query rewriting. The query wasn't specific enough; try again with a better formulation. Max 2 rewrites before giving up, otherwise a bad document corpus leads to infinite rewrites.
The red loop (hallucination retry): After generation, a hallucination checker compares every claim in the answer against the context. The prompt: "Does every claim in this answer appear in the context? yes/no." If hallucinated claims are detected, regenerate. Max 3 retries before accepting the best available answer.
The LLM-as-judge pattern is interesting in practice. You're using GPT to critique GPT. Judging whether a specific claim appears in a provided text passage is considerably easier than generating a coherent cited answer. The cognitive load is different. The grading model looks at a claim, looks at the context, asks a binary question. That binary task is where gpt-4o-mini is reliable enough.
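The checker can reuse the same structured-output machinery as the grader. A sketch; the model class, prompt wording, and function signature are assumptions:

```python
class HallucinationVerdict(BaseModel):
    """Structured verdict for the hallucination checker."""
    supported: str = Field(description="Must be exactly 'yes' or 'no'.")

def check_hallucination(answer: str, docs: list) -> str:
    """Return 'yes' if every claim in the answer appears in the context."""
    context = "\n\n".join(doc.text for doc in docs)
    messages = [
        {"role": "system",
         "content": "Does every claim in this answer appear in the context? "
                    "Answer yes or no."},
        {"role": "user", "content": f"Context:\n{context}\n\nAnswer:\n{answer}"},
    ]
    result = call_llm_json(
        messages=messages,
        model=LLM_MODEL_GRADING,
        response_model=HallucinationVerdict,
        temperature=0.0,
    )
    return result.supported
```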
Without loop guards, a model that consistently hallucinates, perhaps because the context genuinely doesn't support a good answer, will retry indefinitely, burning API budget with no improvement. Always cap retries. After 3 attempts, return the best generation you have and surface the hallucination signal to the caller.
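Putting both loops and their caps together, the orchestrator's control flow might look like this sketch. retrieve, rewrite_query, and generate are assumed helpers; grade_chunks and check_hallucination are the sketches above:

```python
MAX_REWRITES = 2   # amber loop cap
MAX_RETRIES = 3    # red loop cap

def answer(question: str) -> dict:
    # Amber loop: re-route to query rewriting while nothing survives grading.
    query = question
    for attempt in range(MAX_REWRITES + 1):
        docs = grade_chunks(query, retrieve(query))
        if docs:
            break
        if attempt < MAX_REWRITES:
            query = rewrite_query(question)
    else:  # every attempt came back empty
        return {"answer": None, "grounded": False,
                "reason": "no relevant context found"}

    # Red loop: regenerate while the checker flags unsupported claims.
    best = None
    for _ in range(MAX_RETRIES):
        best = generate(query, docs)
        if check_hallucination(best, docs) == "yes":
            return {"answer": best, "grounded": True}
    # Cap reached: return the best we have and surface the signal.
    return {"answer": best, "grounded": False}
```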
The industry has moved toward more sophisticated approaches. Self-RAG embeds reflection tokens directly in generation: the model emits signals like [IsSup] mid-output. CRAG adds a decision gate before generation that can trigger a web-search fallback if context quality is low. Those architectures have merit, but they require fine-tuned models or significantly more complex orchestration. The two-loop approach in ReflectRAG is implementable in plain Python, which is the point.
---
Once you build with raw FAISS, the limitations become obvious. It handles vector search extremely well. Everything else is your problem.
| Feature | FAISS (raw library) | Managed Vector DB (Qdrant, pgvector) |
|---|---|---|
| Vector search speed | Excellent | Competitive |
| Real-time updates | Requires full reindex | Native upserts |
| Metadata filtering | Manual, post-search | Pre-filtered during search |
| Persistence | DIY (pickle, files) | Built-in |
| Concurrent writes | Unsafe without locks | Handled |
| Horizontal scaling | Manual sharding | Automatic |
| Operational overhead | High | Low to medium |
For ReflectRAG - a local project that indexes one PDF at a time - none of FAISS's gaps matter yet. But in a production system with multiple users and frequently updated documents, you'd either build a substantial wrapper around FAISS or switch to something that ships those properties out of the box. FAISS is excellent for learning and for constrained environments where you want raw control; for production RAG with real update requirements, a proper vector database is usually the more practical choice.
The tradeoff is operational complexity vs. engineering control. Neither is universally correct. It depends on what you're optimizing for.
Every LLM call in ReflectRAG is wrapped in try/except. Not because I've seen every failure mode, but because LLM APIs are network calls to systems I don't control. Rate limits happen. Timeout errors happen. Malformed responses happen. The fail-open pattern, where grading failures default to "include the document", is a specific design choice, not an accident.
The ingestion-to-generation path is done and working. Data pipeline, retrieval, grading, answer generation: these nodes exist, they're tested independently, and they return what they should.
What's still untested at scale: the retry loops, the orchestrator connecting everything, the FastAPI layer, the frontend. I haven't seen how the system behaves when retrieval consistently fails, or when the LLM keeps hallucinating past the retry limit. Those are questions you can't answer in isolation; you can only answer them by wiring up the full loop and watching it run.
That's the thing about building node by node. Each piece feels solid in isolation. You mock the LLM responses, the tests pass, everything looks clean. Then you wire the nodes together in a real pipeline and find that state transitions you thought were obvious are subtly wrong, that the retry counter doesn't reset where you assumed, that the fail-open behavior in grading interacts with the hallucination loop in a way you didn't anticipate.
The most valuable thing so far hasn't been the code. It's the forced understanding of how each layer connects to the next. Why FAISS returns wrong chunks. Why relevance grading is necessary. Why hallucination checking is harder than it looks. None of that comes from calling RetrievalQA.from_chain_type(). It comes from building the seams yourself and watching what falls through.
The second half is where the architecture either holds together or it doesn't.
If you're building a RAG pipeline and want it to actually work under real conditions, these are the things worth getting right from the start:

- Chunk with overlap, and attach source, page, and chunk_index metadata at ingestion.
- Treat the vector index and the chunk list as a single unit; never update one without the other.
- Grade retrieved chunks for relevance before they reach the context window.
- Cap every retry loop.
- temperature=0.0 for grading and rewriting; gpt-4o-mini for binary judgments; gpt-4o for output that a human actually reads.
If this post was useful, consider supporting my open source work and independent writing.