I've been building a self-correcting RAG system from scratch for a while now. No LangChain. No LangGraph. Just raw Python, FAISS, and the OpenAI SDK.
This isn't a tutorial. It's a log of what I've actually hit: the gaps between "load PDFs, chunk them, embed them" and what a system that handles failure actually looks like.
Most RAG tutorials stop at the happy path. You get the 10-line version: load document, embed chunks, retrieve top-K, pass to LLM, get answer. It works on the demo PDF. It looks clean.
Then you ask it something that isn't covered in the document, and it confidently makes something up. Or you ask something slightly ambiguous and the retrieval returns five chunks that are all adjacent to the answer but none of them contain it. Or the LLM cites [Source 3] for a fact that doesn't appear in Source 3 at all.
The quality of your retrieval is significantly more important than the specific LLM used for generation.
That's the thing nobody says early enough. You can swap gpt-4o-mini for gpt-4o and get marginally better answers. But if you're feeding the LLM irrelevant context, better generation quality doesn't fix garbage input. The problem is upstream.
That's what drove the design of ReflectRAG - two reflection loops that catch the two most common failure modes before the user sees the result. One loop handles bad retrieval. The other handles hallucination.
The first thing I built was the ingestion pipeline: PDF loading, chunking, embedding, FAISS index construction. It's the part that feels like plumbing. But it's where every downstream problem either gets set up or avoided.
Chunking isn't just splitting text. I built a recursive splitter that tries paragraph boundaries first (\n\n), then sentence boundaries, then word boundaries. The 800-character limit is a ceiling, not a hard cut. A sentence won't get broken mid-thought just because we hit the threshold.
Overlap is necessary, not optional. Chunk 2 starts with the last 100 characters of Chunk 1. Without this, a fact spanning a chunk boundary disappears from both chunks. You won't know it's missing until someone asks about it and the system confidently returns nothing.
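For concreteness, here's a minimal sketch of that split-then-overlap strategy. It is not the exact ReflectRAG splitter; the function name, separator list, and constants are illustrative:

```python
# Minimal recursive splitter sketch (illustrative, not the real implementation).
# Tries coarse separators first, falls back to finer ones, and carries the
# last OVERLAP characters of each chunk into the next one.
SEPARATORS = ["\n\n", ". ", " "]   # paragraph -> sentence -> word boundaries
CHUNK_SIZE = 800                   # a ceiling, not a hard cut
OVERLAP = 100

def split_text(text: str, separators: list[str] = SEPARATORS) -> list[str]:
    if len(text) <= CHUNK_SIZE or not separators:
        return [text]
    sep, finer = separators[0], separators[1:]
    parts = text.split(sep)
    if len(parts) == 1:                      # separator absent, try a finer one
        return split_text(text, finer)
    chunks: list[str] = []
    current = ""
    for part in parts:
        candidate = f"{current}{sep}{part}" if current else part
        if len(candidate) > CHUNK_SIZE and current:
            chunks.append(current)
            # the new chunk starts with the tail of the previous one
            current = current[-OVERLAP:] + sep + part
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```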
Metadata is the citation chain. Every chunk carries source, page, and chunk_index from the moment it's created. When the system cites [Source 2], that reference traces back to a real page in a real file, not a guess.
The chunk_index field isn't for humans, it's a positional guarantee. During index construction, chunk #42 in the Document list must be the same chunk whose vector sits at position #42 in the FAISS index. Lose that alignment and retrieval silently returns wrong text.
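A minimal shape for that metadata, assuming a plain dataclass (the real project may structure it differently):

```python
from dataclasses import dataclass

@dataclass
class Document:
    """One chunk plus the metadata that makes citations traceable."""
    text: str
    source: str       # originating file, e.g. the PDF's name
    page: int         # page the chunk was extracted from
    chunk_index: int  # must equal this chunk's row in the FAISS index
```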
FAISS stores vectors. Only vectors. It has no concept of strings, page numbers, or anything human-readable. When you call index.search(query_vector, k=5), you get back distances and integer indices. Indices into the list you maintain separately, a pickled array of Document objects that maps position-for-position to the FAISS index.
```python
# Retrieval — the positional contract in practice
distances, indices = _INDEX.search(query_vector, TOP_K)

retrieved_docs = []
for idx in indices[0]:
    if idx != -1 and idx < len(_CHUNKS):     # FAISS returns -1 for missing neighbors
        retrieved_docs.append(_CHUNKS[idx])  # position match: FAISS row -> chunk
```
That _CHUNKS[idx] lookup is the entire link between "this vector is near your query" and "here is the actual text that was embedded." Break the alignment, add vectors to FAISS without adding the corresponding Document to the list, and retrieval silently returns the wrong chunk. No error. No warning. Just wrong text.
Adding new documents to an existing FAISS index without re-pickling the chunk list is a silent correctness bug. FAISS will happily return indices that point to stale or completely wrong chunks. Always treat the index and the chunk list as a single unit.
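One way to enforce that single-unit discipline is to make every write touch both stores and persist them together. A sketch, assuming an in-memory FAISS index and a pickled chunk list; the function and parameter names here are mine:

```python
import pickle

import faiss
import numpy as np

def add_documents(index: faiss.Index, chunks: list, new_docs: list,
                  new_vectors: np.ndarray,
                  index_path: str, chunks_path: str) -> None:
    """Append to FAISS and the chunk list as one unit, then persist both."""
    assert index.ntotal == len(chunks), "stores are already out of sync"
    assert len(new_docs) == new_vectors.shape[0], "docs and vectors must pair up"
    index.add(new_vectors.astype("float32"))  # occupies rows len(chunks)..N-1
    chunks.extend(new_docs)                   # the same positions in the list
    faiss.write_index(index, index_path)      # persist both, together
    with open(chunks_path, "wb") as f:
        pickle.dump(chunks, f)
```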
## FAISS Returns k Results - Even When Nothing Is Relevant

This is the retrieval gap that makes naive RAG unreliable, and the tutorials treat it as a footnote.
You ask for 5 nearest neighbors. You get 5 nearest neighbors. Cosine similarity has no concept of "nothing here is close enough." The returned chunks go straight into the context window, the LLM sees them, and if they're not relevant it either fabricates an answer from noise or confidently answers the wrong question with plausible text.
The fix is a relevance grading step. In ReflectRAG, every retrieved chunk goes through grade_relevance() — a call to gpt-4o-mini with a structured prompt:
```python
from pydantic import BaseModel, Field

class GradeResult(BaseModel):
    """Structured output for the relevance grader."""
    relevant: str = Field(description="Must be exactly 'yes' or 'no'.")

# Per-chunk grading with Pydantic-enforced output
result = call_llm_json(
    messages=messages,
    model=LLM_MODEL_GRADING,
    response_model=GradeResult,
    temperature=0.0,
)
```
Not a free-text response that you parse with string matching - a structured output that OpenAI's API guarantees will match the schema. The grader is intentionally fail-open: if the API call fails for a specific chunk, that chunk passes through. Better to have noise in the context than to silently drop something that might have been useful.
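Wired together, the per-chunk loop looks something like this sketch. grade_relevance is the grader described above; its exact signature is an assumption on my part:

```python
def grade_chunks(question: str, docs: list) -> list:
    """Keep only chunks the grader marks relevant; fail open on API errors."""
    kept = []
    for doc in docs:
        try:
            result = grade_relevance(question, doc)  # returns a GradeResult
            if result.relevant == "yes":
                kept.append(doc)
        except Exception:
            kept.append(doc)  # fail open: noise beats silently dropping context
    return kept
```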
Relevance grading is the quality gate that separates functional RAG from naive RAG. Without it, the context window is at the mercy of cosine similarity, which will always return something, regardless of whether that something is actually relevant.
The cost structure of the pipeline is deliberate. Not everything needs gpt-4o.
| Task | Model | Why |
|---|---|---|
| Query rewriting | gpt-4o-mini | Simple reformulation, no reasoning, temperature 0 |
| Relevance grading | gpt-4o-mini | Binary yes/no; you're paying for judgment, not prose quality |
| Hallucination checking | gpt-4o-mini | Binary verdict, cheaper than generation |
| Answer generation | gpt-4o | Output a human reads and trusts, quality matters |
Grading 5 chunks at 400 characters each costs a few dozen tokens. Running that 5 times per query on gpt-4o-mini is negligible. The generation call on gpt-4o is where the real cost sits. You spend the expensive tokens exactly once, on output that matters.
Set temperature=0.0 on grading and rewriting nodes. You want deterministic, consistent judgment - not creative variation. Save non-zero temperatures for the generation step where some flexibility in phrasing actually improves the answer.
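In code, the tiering reduces to a handful of constants. LLM_MODEL_GRADING appears in the grader above; the other names and the generation temperature are illustrative:

```python
LLM_MODEL_REWRITING = "gpt-4o-mini"   # simple reformulation
LLM_MODEL_GRADING = "gpt-4o-mini"     # binary yes/no judgments
LLM_MODEL_CHECKING = "gpt-4o-mini"    # hallucination verdicts
LLM_MODEL_GENERATION = "gpt-4o"       # the answer a human actually reads

TEMPERATURE_JUDGMENT = 0.0            # deterministic grading and rewriting
TEMPERATURE_GENERATION = 0.3          # slight flexibility in phrasing
```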
This is what separates naive RAG from something genuinely reliable, and it's what I'm building next in the orchestrator.
The architecture has two reflection loops:
The amber loop (relevance re-routing): After grading, if zero chunks pass, the system routes back to query rewriting. The query wasn't specific enough; try again with a better formulation. Max 2 rewrites before giving up, otherwise a bad document corpus leads to infinite rewrites.
The red loop (hallucination retry): After generation, a hallucination checker compares every claim in the answer against the context. The prompt: "Does every claim in this answer appear in the context? yes/no." If hallucinated claims are detected, regenerate. Max 3 retries before accepting the best available answer.
The LLM-as-judge pattern is interesting in practice. You're using GPT to critique GPT. Judging whether a specific claim appears in a provided text passage is considerably easier than generating a coherent cited answer. The cognitive load is different. The grading model looks at a claim, looks at the context, asks a binary question. That binary task is where gpt-4o-mini is reliable enough.
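The checker can reuse the same structured-output machinery as the grader. A sketch; the model class, prompt wording, and function signature are assumptions:

```python
class HallucinationVerdict(BaseModel):
    """Structured verdict for the hallucination checker."""
    supported: str = Field(description="Must be exactly 'yes' or 'no'.")

def check_hallucination(answer: str, docs: list) -> str:
    """Return 'yes' if every claim in the answer appears in the context."""
    context = "\n\n".join(doc.text for doc in docs)
    messages = [
        {"role": "system",
         "content": "Does every claim in this answer appear in the context? "
                    "Answer yes or no."},
        {"role": "user", "content": f"Context:\n{context}\n\nAnswer:\n{answer}"},
    ]
    result = call_llm_json(
        messages=messages,
        model=LLM_MODEL_GRADING,
        response_model=HallucinationVerdict,
        temperature=0.0,
    )
    return result.supported
```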
Without loop guards, a model that consistently hallucinates, perhaps because the context genuinely doesn't support a good answer, will retry indefinitely, burning API budget with no improvement. Always cap retries. After 3 attempts, return the best generation you have and surface the hallucination signal to the caller.
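Putting both loops and their caps together, the orchestrator's control flow might look like this sketch. retrieve, rewrite_query, and generate are assumed helpers; grade_chunks and check_hallucination are the sketches above:

```python
MAX_REWRITES = 2   # amber loop cap
MAX_RETRIES = 3    # red loop cap

def answer(question: str) -> dict:
    # Amber loop: re-route to query rewriting while nothing survives grading.
    query = question
    for attempt in range(MAX_REWRITES + 1):
        docs = grade_chunks(query, retrieve(query))
        if docs:
            break
        if attempt < MAX_REWRITES:
            query = rewrite_query(question)
    else:  # every attempt came back empty
        return {"answer": None, "grounded": False,
                "reason": "no relevant context found"}

    # Red loop: regenerate while the checker flags unsupported claims.
    best = None
    for _ in range(MAX_RETRIES):
        best = generate(query, docs)
        if check_hallucination(best, docs) == "yes":
            return {"answer": best, "grounded": True}
    # Cap reached: return the best we have and surface the signal.
    return {"answer": best, "grounded": False}
```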
The industry has moved toward more sophisticated approaches. Self-RAG embeds reflection tokens directly in generation: the model emits signals like [IsSup] mid-output. CRAG adds a decision gate before generation that can trigger a web-search fallback if context quality is low. Those architectures have merit, but they require fine-tuned models or significantly more complex orchestration. The two-loop approach in ReflectRAG is implementable in plain Python, which is the point.
---
Once you build with raw FAISS, the limitations become obvious. It handles vector search extremely well. Everything else is your problem.
| Feature | FAISS (raw library) | Managed Vector DB (Qdrant, pgvector) |
|---|---|---|
| Vector search speed | Excellent | Competitive |
| Real-time updates | Requires full reindex | Native upserts |
| Metadata filtering | Manual, post-search | Pre-filtered during search |
| Persistence | DIY (pickle, files) | Built-in |
| Concurrent writes | Unsafe without locks | Handled |
| Horizontal scaling | Manual sharding | Automatic |
| Operational overhead | High | Low to medium |
For ReflectRAG - a local project that indexes one PDF at a time - none of FAISS's gaps matter yet. But in a production system with multiple users and frequently updated documents, you'd either build a substantial wrapper around FAISS or switch to something that ships those properties out of the box. FAISS is excellent for learning and for constrained environments where you want raw control; for production RAG with real update requirements, a proper vector database is usually the more practical choice.
The tradeoff is operational complexity vs. engineering control. Neither is universally correct. It depends on what you're optimizing for.
Every LLM call in ReflectRAG is wrapped in try/except. Not because I've seen every failure mode, but because LLM APIs are network calls to systems I don't control. Rate limits happen. Timeout errors happen. Malformed responses happen. The fail-open pattern, where grading failures default to "include the document", is a specific design choice, not an accident.
The ingestion-to-generation path is done and working. Data pipeline, retrieval, grading, answer generation: these nodes exist, they're tested independently, and they return what they should.
What's still untested at scale: the retry loops, the orchestrator connecting everything, the FastAPI layer, the frontend. I haven't seen how the system behaves when retrieval consistently fails, or when the LLM keeps hallucinating past the retry limit. Those are questions you can't answer in isolation; you can only answer them by wiring up the full loop and watching it run.
That's the thing about building node by node. Each piece feels solid in isolation. You mock the LLM responses, the tests pass, everything looks clean. Then you wire the nodes together in a real pipeline and find that state transitions you thought were obvious are subtly wrong, that the retry counter doesn't reset where you assumed, that the fail-open behavior in grading interacts with the hallucination loop in a way you didn't anticipate.
The most valuable thing so far hasn't been the code. It's the forced understanding of how each layer connects to the next. Why FAISS returns wrong chunks. Why relevance grading is necessary. Why hallucination checking is harder than it looks. None of that comes from calling RetrievalQA.from_chain_type(). It comes from building the seams yourself and watching what falls through.
The second half is where the architecture either holds together or it doesn't.
If you're building a RAG pipeline and want it to actually work under real conditions, these are the things worth getting right from the start:

- Chunk with overlap, and attach source, page, and chunk_index metadata at ingestion.
- Treat the vector index and the chunk list as a single unit; never update one without the other.
- Grade retrieved chunks for relevance before they reach the context window.
- Cap every retry loop.
- temperature=0.0 for grading and rewriting; gpt-4o-mini for binary judgments; gpt-4o for output that a human actually reads.
If this post was useful, consider supporting my open source work and independent writing.