Breaking the Deterministic Trap: Why I Abandoned the 'Hybrid' RAG Compromise
My 'Hybrid' RAG approach was supposed to save tokens and latency. Instead, it created a deterministic trap that frustrated users. Here is why I moved to Pure Agentic RAG.
I used to be a firm advocate for the "Hybrid" RAG compromise.
When I first transitioned my meal planner from a static "Push" model to an autonomous "Pull" model (as I wrote in From Push to Pull), I was worried about the "cold start" problem. I didn't want my agents to sit there "thinking" for an extra turn every single time they needed a recipe. So, I implemented a compromise: I would Push a small, high-confidence set of recipes (10 items) based on the user's initial query, and only let the agent Pull more if those weren't enough.
It felt like a smart engineering trade-off. It saved latency. It saved tokens. And it was a complete failure for my users.
Here is how a single bug report exposed the "Deterministic Vector Trap" and why I finally decided to go "Pure Agentic."
The User Report: "It's Always the Same Meals"
The feedback was consistent and frustrating: "Every time I ask for a 'plan for the week,' the tool gives me the exact same five dishes. It doesn't matter how many new recipes I add to my database."
I checked the logs. I verified the exclusion logic (which prevents using recipes from the last three weeks). The code was working perfectly. The "Push" logic was fetching the 10 most "relevant" recipes, and the agent was picking 5 of them.
Then I looked at the search terms.
When a user asks for something specific like "spicy Korean chicken," semantic search (Vector RAG) is a miracle. It finds exactly what you want. But when a user uses a generic, exploratory query like "generate a plan for the week," semantic search becomes a deterministic trap.
The Deterministic Vector Trap
In a Vector DB, a query is just a coordinate in a high-dimensional space. The phrase "plan for the week" doesn't have a culinary meaning; it has a mathematical location.
Every time the user sends that generic request, the embedding model generates the exact same vector. My "Hybrid" logic then fetches the exact same 10 nearest neighbors.
Even if the user has 500 recipes, the top 10 results for that specific meaningless vector stay the same. By "Pushing" these 10 recipes into the agent's context upfront, I was effectively blindfolding the AI. It couldn't "think" of other recipes because I had already told it: "Here are the most relevant items."
Solving the Cold Start with Agency
To break this trap, I had to stop trying to be "helpful" at the orchestrator level. I had to embrace Pure Agentic RAG.
Instead of pre-fetching recipes, I now send the agent into the loop with an empty pool and a set of specialized tools. This moves the "intelligence" from a static Go function to the LLM itself.
Before: The Hybrid (Push + Pull)
The orchestrator made a guess, often poisoning the well with irrelevant or repetitive data.
After: The Pure Agentic (Pull Only)
The agent analyzes the query first. If it's generic, the agent knows it needs to explore.
graph TD
A[User: 'Plan for the week'] --> B[Analyst Agent]
B -->|Intent: Generic| C[Tool: search_recipes_random]
C -->|Result: Diverse Pool| B
B -->|Final Decision| D[Meal Plan]
E[User: 'Spicy Chicken'] --> F[Analyst Agent]
F -->|Intent: Specific| G[Tool: search_recipes_semantic]
G -->|Result: Targeted Pool| F
F -->|Final Decision| H[Meal Plan]
Splitting the Search: Semantic vs. Random
The key to making Pure Agentic RAG work isn't just giving the agent a search tool; it's giving it choices.
I split my single search_recipes tool into two distinct capabilities:
- search_recipes_semantic(query) (The Scalpel): Uses vector embeddings. Great for specific dietary needs, cuisines, or ingredients (e.g., "spicy chicken", "low-carb").
- search_recipes_random(limit) (The Shotgun): Uses a simple SQL ORDER BY RANDOM() query with an exclusion list. This is the antidote to the deterministic trap, used for broad requests like "plan for the week".
I updated the Analyst's prompt to explain the difference. Now, when the user says "plan for the week," the agent realizes the query lacks semantic direction and chooses the Random tool to gather a diverse base of inspiration.
The Hidden Dependency Trap (Don't Forget the Reviewer)
When I first mapped out this migration, I was hyper-focused on the Analyst agent—the one responsible for generating the initial plan. I was ready to rip out the old search_recipes tool and replace it with the split tools.
Then I realized I was about to hit a compile-time wall.
In my architecture, I use a generic ExecuteAgentLoop (which I wrote about in Refactoring for Autonomy). The engine itself is beautifully decoupled—it just takes a list of tools and handlers as parameters. Technically, I could have given the Analyst new tools while leaving the PlanReviewer on the old ones.
Even though the agents are independent, they share a critical piece of infrastructure: the RecipeSearcher interface.
When I started updating the interface to support both semantic and random searches, I hit a massive compile-time wall: Import Cycles. My shared package (which defined the interface) needed to return recipe.Recipe structs. But the recipe package needed to import shared to access AgentMeta data for logging. By moving to a Pure Agentic model where multiple domains need to interact with search, my shared contracts became a tangled mess.
I couldn't just "split a tool." I had to perform a domain-driven refactor. I had to extract the core data structure into a new, isolated value package (value.Recipe) so both shared and recipe could depend on it without depending on each other. Moving to Pure Agentic RAG isn't just about changing one prompt; it forces you to enforce strict boundary layers in your application architecture.
Trusting the Context Window
The final step in going "Pure Agentic" was letting go of my own code's state management.
Previously, my Go orchestrator was maintaining a recentlyUsed array during the multi-turn loop to ensure the agent didn't fetch the same recipe twice in a single conversation. It looked like this:
// Before: Babysitting the state with closures
recentlyUsed := recipesRecentlyUsed
searchHandler := func(ctx context.Context, toolCall llm.ToolCall) (llm.Message, []value.Recipe, error) {
recipes, msg, err := HandleRecipeSearch(ctx, searcher, toolCall, recentlyUsed)
if err != nil {
return llm.Message{}, nil, err
}
// Go-side tracking to prevent the LLM from seeing duplicates
for _, r := range recipes {
recentlyUsed = append(recentlyUsed, r.ID)
}
return msg, recipes, nil
}
Once I split the tools, I realized this was redundant babysitting. A high-reasoning model can read its own tool-call history. If it calls search_recipes_random and sees a meal it already evaluated, it simply ignores it.
By removing the Go-side closure state, the engine code became significantly cleaner. We only pass the "Historical" exclusions (meals eaten last week) to the database layer, and we let the LLM manage its immediate thoughts.
// After: Trusting the LLM's context window
handlers := map[string]ToolHandler[[]value.Recipe]{
searchRecipesSemanticTool.Name: func(ctx context.Context, toolCall llm.ToolCall) (llm.Message, []value.Recipe, error) {
// No local state appending. Just straight pass-through.
return HandleRecipeSemanticSearch(ctx, a.searcher, toolCall, recipesRecentlyUsed)
},
// ... random handler ...
}
The Cost of Purity
Is it slower? Yes. Every request now takes at least one extra LLM turn. To accommodate this, I had to increase the safety ceiling of my agent loop from 5 turns to 15 turns.
Giving the agent more "slack" allows it to be exhaustive—it can now perform 5-7 searches, reject bad results, and re-fetch until it finds exactly what the user needs. It's the difference between a "rushed" agent that gives up early and an "exhaustive" one that truly explores the database.
But as soon as I ran my first live test with the new 15-turn ceiling, the application crashed with a 413 Rate Limit Exceeded error.
The Token Bloat Reality
The error message was brutal: Limit 12000, Requested 12302. I had hit the strict Tokens Per Minute (TPM) limit on Groq's free tier.
Because the agent was now executing multiple search loops to explore its options, the entire conversation history (including the massive JSON outputs of previous searches) was being passed back to the LLM on every turn. Each search appended another payload to that history, so every subsequent request was larger than the last — until one of them burst through the limit.
I couldn't lower the 15-turn limit, but I could reduce the size of the payload. I realized the Analyst agent doesn't need to know the exact Ingredients or step-by-step Instructions to select a meal plan—it only needs the Title, Tags, PrepTime, and Servings.
I had to introduce a strict filtering helper in my Go handlers to strip the full value.Recipe struct before marshaling it into JSON for the tool response:
// The structural tax for Pure Agentic RAG
func simplifyForTool(recipes []value.Recipe) []value.Recipe {
var content []value.Recipe
for _, r := range recipes {
content = append(content, value.Recipe{
ID: r.ID,
Title: r.Title,
PrepTime: r.PrepTime,
Tags: r.Tags,
Servings: r.Servings,
})
}
return content
}
Crucially, I also had to update the struct tags in value/recipe.go to use omitempty, so the stripped fields didn't still consume tokens as null or empty values ("ingredients": null).
This is the hidden cost of purity. If you trust the model's agency, you must be ruthlessly disciplined with your data structures to avoid blowing up your context window.
Lessons in Intellectual Honesty
In my previous posts, I advocated for the Hybrid model as a pragmatic middle ground. I was wrong. Pragmatism in AI architecture often means "trying to hide the LLM's nature from itself."
By trying to avoid the "cold start," I accidentally built a system that was consistently mediocre for the most common user behavior. Moving to Pure Agentic RAG was a lesson in trusting the model's ability to reason about its own data needs, even if it requires a strict diet of tokens to pull it off.
If you are building RAG systems and your users are complaining about "repetitive" or "stale" results, before you spend days tuning your re-rankers or tweaking embedding dimensions, you might want to look at your orchestrator. Consider whether your "helpful" pre-fetching logic is actually pushing them into a trap.
References & Resources
- Taming the Pull: Infinite Loops and Context Bloat - How I managed the side effects of giving agents more autonomy.
- From Push to Pull: Giving My AI Agents Agency - The original (and now corrected) advocacy for the Hybrid approach.
- Agentic RAG Migration Plan - The internal design document for this transition.