Taming the Pull: Infinite Loops, Context Bloat, and the Blessing of Free Tiers

Moving to an autonomous 'Pull' architecture sounded great until my agent hit an infinite loop. Here is how I solved context bloat, fixed the loop, and why the free tier saved my wallet.


In my previous post about moving from a "Push" to a "Pull" architecture, I wrote about the joy of finally giving an LLM agent the steering wheel. Instead of statically injecting data into a prompt, I built a generic ExecuteAgentLoop that let the agent iteratively search for its own data using tools.

The theory was sound. The agent was autonomous. And then, as soon as it hit production, the pipeline violently crashed:

{
  "error": {
    "message": "Request too large... on tokens per minute (TPM): Limit 12000, Requested 13073",
    "code": "rate_limit_exceeded"
  }
}

Groq's 12,000 Tokens Per Minute (TPM) limit had slammed the brakes on my autonomous agent. But as I dug into the logs, I realized something important: hitting the free tier limit was the best thing that could have happened.

Had I been using a paid tier with generous limits, this agent would have silently run up a massive bill. The hard constraint forced me to stop, look at the architecture, and uncover two massive traps I had inadvertently set for myself.

Here is how I investigated the crash, the "ah-ha" moment that solved the context bloat, and the mechanical guardrails required to keep autonomous agents from spinning out of control.

The Context Bloat and the DTO Trap

Because the agent was in a conversational loop, every new turn meant sending the entire history of the previous turns back to the API.

I was returning up to 10 Recipe structs in a single tool call. But these weren't just titles and ingredients; they included massive, HTML-extracted cooking instructions. At 150-200 tokens per recipe, Turn 1 was fine. By Turn 3, the compounding history was pushing payloads over 10,000 tokens.
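To make the compounding concrete, here is a toy model of the growth (the base prompt size and per-turn payload are rough assumptions, not measurements): because the full history is re-sent on every turn, the per-turn payload grows linearly, and the cumulative tokens sent per minute grow quadratically.

```go
package main

import "fmt"

// payloadForTurn models the request size on a given 1-indexed turn,
// assuming a fixed base prompt plus a fixed chunk of tool results
// appended to the history after each turn. (Illustrative numbers only.)
func payloadForTurn(turn, base, perTurn int) int {
	return base + (turn-1)*perTurn
}

func main() {
	base, perTurn := 500, 1750 // assumed: ~10 recipes at ~175 tokens each
	for turn := 1; turn <= 3; turn++ {
		fmt.Printf("turn %d payload: %d tokens\n", turn, payloadForTurn(turn, base, perTurn))
	}
}
```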

My first instinct was the classic enterprise developer reflex: create Data Transfer Objects (DTOs). I started drafting RecipeSummary and ChefRecipe structs to strip out the instructions before passing the data to the LLM.

// The initial, over-engineered fix
type RecipeSummary struct {
    ID    string   `json:"id"`
    Title string   `json:"title"`
    Tags  []string `json:"tags"`
    // Look at me, I'm hiding data from the LLM!
}

But then I had an "ah-ha" moment: YAGNI (You Aren't Gonna Need It).

The Analyst and Chef agents never actually used the instructions for meal planning or generating shopping lists. I was carrying around a heavy, expensive piece of data that no feature required.

Because my project uses a document-store architecture—storing recipes as raw JSON blobs inside an SQLite TEXT column—I didn't need complex DTOs or a painful database migration. I just deleted the Instructions field from the primary Go domain model entirely.

// Before
type Recipe struct {
	ID           string   `json:"id"`
	Title        string   `json:"title"`
	Ingredients  []string `json:"ingredients"`
	Instructions []string `json:"instructions"` // The culprit
}

// After
type Recipe struct {
	ID          string   `json:"id"`
	Title       string   `json:"title"`
	Ingredients []string `json:"ingredients"`
}

Go's json.Unmarshal silently ignores the instructions key when reading old records from the database. The result was an instant, zero-effort, backwards-compatible schema migration: context bloat slashed in half, no DTOs required.

The Infinite Loop

I deployed the fix, confident the TPM errors were gone.

They weren't.

The payloads were smaller, but the agent was still hitting the ceiling. It wasn't the size of the request anymore; it was the frequency. The agent was caught in an infinite loop.

When I tracked the tool calls, I found a subtle but fatal bug in my RecipeService. To prevent the agent from picking the same meals twice, I pass an excludeIDs list to the semantic search. If the database ran low on unseen recipes, my service was trying to be "helpful":

// The fatal flaw: trying to be too helpful
if len(recipeIds) < searchLimit {
    log.Printf("Warning: Recipe pool exhausted. Dropping exclusions.")
    // Let's just return the same recipes they already rejected!
    recipeIds, err = s.vectorRepo.FindSimilar(ctx, queryEmbedding, searchLimit, nil)
}

This created a catastrophic feedback loop:

  1. The agent searches for 5 unique recipes.
  2. The service runs out of fresh options, drops exclusions, and returns 5 recipes the agent has already seen.
  3. The agent analyzes them, realizes they violate its strict "5 unique recipes" system prompt, rejects them, and queries the tool again.
  4. The service returns the exact same duplicates.
  5. The agent rejects them again.

The LLM and the Go service were locked in a fight to the death, looping infinitely until Groq's TPM limit stepped in as the executioner.

Adding Mechanical Guardrails

Autonomy is powerful, but software still needs circuit breakers.

First, I removed the "helpful" exclusion dropping. It is much better for the service to return an empty array and let the LLM know the search genuinely failed, rather than lying to it and feeding it duplicates.
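In a simplified, in-memory form (the real service queries a vector store; findRecipes here is a hypothetical stand-in), the corrected behavior looks roughly like this:

```go
package main

import "fmt"

// findRecipes sketches the corrected branch: if the pool of un-excluded
// recipes runs dry, return what we have (possibly nothing) instead of
// silently dropping the exclusion list and re-serving duplicates.
func findRecipes(all []string, exclude map[string]bool, limit int) []string {
	var out []string
	for _, id := range all {
		if len(out) == limit {
			break
		}
		if !exclude[id] {
			out = append(out, id)
		}
	}
	// Deliberately NO fallback re-query without exclusions. An empty or
	// short result tells the agent the search genuinely failed.
	return out
}

func main() {
	all := []string{"a", "b", "c"}
	seen := map[string]bool{"a": true, "b": true, "c": true}
	fmt.Println(len(findRecipes(all, seen, 5))) // 0
}
```

An empty slice is an honest answer the LLM can reason about; a list of already-rejected duplicates is not.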

Second, I added a mechanical guardrail to the generic engine. An LLM loop should never be trusted to terminate itself.

func ExecuteAgentLoop[T any](/*...*/) (llm.ContentResponse, []T, []shared.ToolCallMeta, error) {
	// ...
	const maxTurns = 5
	turnCount := 0

	for {
		if turnCount >= maxTurns {
			return llm.ContentResponse{}, nil, nil, fmt.Errorf("agent exceeded maximum tool execution turns (%d)", maxTurns)
		}
		turnCount++

		// ... execute LLM and tools ...
	}
}

If the agent gets confused or the tools fail to provide the right data, the loop halts after five turns, returns a Go error, and exits gracefully instead of spinning forever.
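A stripped-down, runnable version of the guardrail (with the LLM-and-tools step stubbed out as a callback) shows what happens when the agent never converges:

```go
package main

import "fmt"

// runLoop is a toy stand-in for ExecuteAgentLoop: step() represents one
// LLM/tool turn and reports whether a final answer was produced. Only the
// turn cap saves us when the "agent" never finishes.
func runLoop(maxTurns int, step func() bool) (int, error) {
	turnCount := 0
	for {
		if turnCount >= maxTurns {
			return turnCount, fmt.Errorf("agent exceeded maximum tool execution turns (%d)", maxTurns)
		}
		turnCount++
		if step() {
			return turnCount, nil
		}
	}
}

func main() {
	// An agent stuck rejecting duplicate recipes forever.
	_, err := runLoop(5, func() bool { return false })
	fmt.Println(err) // agent exceeded maximum tool execution turns (5)
}
```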

The Tool Hallucination Trap

With the loops capped and the context bloat resolved, the pipeline was finally stable. Until I hit another crash entirely.

This time, it wasn't a rate limit. It was a 400 Bad Request from Groq with a highly specific error: tool_use_failed.

When the Analyst agent successfully finished its tool loops and tried to output the final JSON plan, it started its response with the requested top-level key: "selected_recipes_audit".

Because Llama 3 on Groq's backend uses a strict, internal XML-like syntax for function calling (e.g., <function="name">), the LLM sometimes hallucinates that JSON keys are actually function names if tools are still present in its context window. It tried to invoke a tool called selected_recipes_audit. Groq's backend intercepted it, realized that tool didn't exist in my registered schema, and instantly killed the generation.

The fix wasn't in Go; it was in the prompt. I had to build a mechanical switch to force the LLM out of "tool execution mode".

I updated the output instructions with a bold, explicit directive:

### Output Format

When you are ready to provide your final plan, **DO NOT call any tools**. 
Instead, reply with a standard message containing ONLY a raw JSON object...

Refined Alerting for the "Pull" Era

Finally, I had to update my own internal monitoring. I had a "Context Bloat Alert" set to fire whenever a prompt exceeded 4,000 tokens.

In the "Push" era, 4,000 tokens was a massive anomaly. In the "Pull" era, 4,000 tokens is just a Tuesday. Between the base system prompts, the rolling conversation history, and the recipe payloads, hitting 4,000 tokens is now a standard part of a multi-turn tool loop.

To reduce noise while still maintaining a safety margin for Groq's 12,000 TPM limit, I increased my alert threshold to 8,000 tokens. I also took a final pass at my agent prompts—specifically the PlanReviewer—stripping out redundant examples and analysis steps to keep the base overhead as lean as possible.
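The alert check itself is a one-liner; choosing the threshold was the real work. A minimal sketch (checkPromptSize and the status strings are my shorthand here, not production code):

```go
package main

import "fmt"

const (
	tpmLimit       = 12000 // Groq free-tier tokens per minute
	alertThreshold = 8000  // raised from 4000 for the "Pull" era
)

// checkPromptSize flags prompts that eat most of the TPM budget while
// still leaving a safety margin before the hard limit.
func checkPromptSize(tokens int) string {
	switch {
	case tokens >= tpmLimit:
		return "blocked"
	case tokens >= alertThreshold:
		return "alert"
	default:
		return "ok"
	}
}

func main() {
	fmt.Println(checkPromptSize(4000)) // ok: just a Tuesday now
	fmt.Println(checkPromptSize(9000)) // alert
}
```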

Constraints Breed Better Architecture

When I started building this meal planner, the free API tier felt like a limitation. Today, I view it as an architectural blessing.

Without that strict 12,000 TPM limit, my infinite loop would have deployed to production, happily spinning CPU cycles and burning API credits until I noticed a spike in a billing dashboard days later.

Moving to a "Pull" architecture shifts your complexity from orchestration logic to context management and loop safety. Give your agents the steering wheel, but never forget to install the brakes.

References & Resources