I did not want to open this series with a static technical article full of terms, architecture diagrams, and pipeline boxes.
Because the most interesting thing about RAG is not how many diagrams you have seen, or how many new labels you can repeat back. The interesting part is this: when a real question comes in, how does the system actually move, where does it make decisions, where does it stay conservative, and where does it spend extra effort for quality?
So for this first piece, I wanted to invert the usual order.
Not definition first.
Not architecture first.
Not lessons first.
Instead, I wanted to put a real interactive RAG workbench in front of you.
You can ask it a live question.
You can switch query_mode.
You can decide whether this run should turn on workflow, tool routing, or retry loop.
And when you send the request, you will not only get an answer. You will also see:
- how the planner rewrote the question
- which retrieval path it took
- whether workflow actually entered the run
- whether tool routing was activated
- whether the retry loop really retried
- what the citation checks, runtime budget, and observability layers recorded along the way
That is also the attitude I want to establish for this entire series:
Production RAG is not just a system that answers questions. It is a system whose path, constraints, and reasoning surface can be inspected.
Before we judge the answer, we should look at the path
If you think of RAG as “retrieve some documents, send them to a model, and return the answer”, you will hit a limit very quickly.
You might know whether the answer sounds correct.
But you still do not know how the system arrived there.
That becomes a real problem the moment the system starts moving toward production.
You begin to care about questions like:
- Was this request running in
fastorcreativemode? - Why did the planner rewrite the query?
- Why was workflow not used?
- Why did the retry loop stop without retrying?
- Why did citation checking fail even though an answer was still returned?
- Why did the runtime budget prevent a more expensive route?
That is where a production RAG system really diverges from a toy one.
What this Part 1 is doing inside the larger series
This is not the overview article, and it is not the foundational teaching article either.
It is better understood as the entry workbench for the entire series.
You can think of it as:
- a live demo
- a RAG trace viewer
- a production controls showcase
- a shared reference point that later articles can keep returning to
The later pieces will unpack the design behind these capabilities:
- why we need multiple
query_modepaths - why
workflowshould not simply be on by default - when
tool routingis actually worth introducing - how
retry loopchanges both answer quality and latency - why
citation check,faithfulness, andruntime budgetare not optional if you care about production RAG
But before we get there, I would rather let you watch the system once with your own eyes.
Watch what a real request actually does on its way through.
How to use this workbench
There are three ways to read this page.
1. As a normal conversational interface
Just ask questions and feel the difference between modes.
2. As a runtime control panel
Switch:
query_modeuse_workflowuse_tool_routinguse_retry_loop
Then observe how the request path changes.
3. As a trace viewer
Do not stop at the final answer.
Focus instead on:
- how the planner rewrites the question
- how many sources retrieval actually returned
- whether workflow, tool routing, and retry loop were active
- how citation checking and runtime budget shaped the result
If this piece helps you develop an intuition for those layers, then this series has started in the right place.
In the next article, I will return to the more basic — and more often misunderstood — question:
is RAG really still just chunking plus embeddings?