AI/ML · 15 January 2026

LLM Agents Fail Silently on Structured Data. Here's What I Found While Building a Production Pipeline

The hardest problem in agentic systems isn't capability. It's knowing when the agent is confidently wrong. Notes from building a validation interception layer.


When I started building Valentiz, a multi-agent outreach automation system, I assumed the hard problems would be orchestration: routing outputs between agents, managing state across workflow steps, handling API rate limits.

Those were all solvable in a day or two.

The problem that took weeks to properly fix was simpler and more fundamental: LLM agents fail silently, and on structured data, that failure propagates downstream before you notice it.

The Setup

The pipeline takes a contact (name, company, role, LinkedIn URL) and produces a personalised outreach email. Claude handles reasoning: deciding whether the contact is worth reaching, what angle to take, and drafting the message. n8n handles orchestration: triggering Claude API calls at each decision point, routing outputs between steps, and writing results to Airtable.

The structured data problem appears at the boundary between Claude’s output and n8n’s routing logic. Claude returns a JSON object. n8n reads specific fields from that object and decides what to do next. If the JSON is wrong, malformed, or missing expected fields, the downstream step either errors out or (worse) silently uses a default value and continues.

How It Fails

LLMs are good at following structured output instructions most of the time. “Return a JSON object with these fields” works reliably, until it doesn’t.

The failure modes I encountered, roughly in order of frequency:

Extra text around the JSON. Claude occasionally returns a preamble like “Here is the JSON as requested:” before the object. JSON.parse() throws. n8n catches the error, logs it, and the workflow stops. This is the visible failure and the easy one to fix.

Wrong field names. Asked for contact_score, returned contactScore or score. The downstream step reads contact_score, gets undefined, treats it as falsy, and routes the contact to the “skip” branch. No error is logged. The contact is silently discarded.

Nested when flat was expected. Asked for { "decision": "yes", "reason": "..." }, got { "decision": { "value": "yes" }, "reason": "..." }. The downstream check reads decision === "yes", which evaluates to false because it’s comparing a string to an object. The contact is silently skipped.

Truncated output. Long reasoning chains occasionally hit token limits mid-JSON. The output is syntactically invalid. JSON.parse() throws and the workflow stops.
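The two silent modes are easy to reproduce in a few lines. This is an illustrative sketch, not code from the pipeline; the point is that neither case throws:

```javascript
// Wrong field name: asked for contact_score, got contactScore.
const out1 = JSON.parse('{"contactScore": 8}');
const score = out1.contact_score;      // undefined — no error thrown
const skip1 = !score;                  // true: contact routed to "skip" silently

// Nested when flat was expected.
const out2 = JSON.parse('{"decision": {"value": "yes"}, "reason": "good fit"}');
const skip2 = out2.decision !== "yes"; // true: comparing an object to a string
                                       // is simply false, never an error
```

Both executions complete cleanly, which is exactly why the log shows nothing.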

The visible failures (parse errors) are annoying but easy to catch. The silent failures are the real problem, because by the time you notice that your pipeline is discarding contacts it should be handling, you’ve lost a week of data and have no idea which contacts were affected.

The Silent Part Is the Problem

Here’s what makes silent failures particularly bad in agentic systems: they don’t look like failures.

The workflow runs. Steps complete. Airtable gets updated. Everything looks healthy in the n8n execution log. But the output is wrong because a silent failure several steps earlier caused a mis-route, and the system has been confidently wrong ever since.

This is different from a database query returning null. That’s a defined, catchable state. A JSON field with the wrong name looks exactly like a JSON field that was intentionally absent, and your code has no way to distinguish them without explicit validation.

The Fix: Validation Interception

The solution I settled on was a validation step between every Claude API call and every downstream routing decision.

The validator takes three inputs: the raw Claude output, the expected schema (field names, types, required fields), and a fallback action. It does three things:

  1. Parse the output, stripping any preamble text and extracting the first valid JSON block
  2. Validate the parsed object against the expected schema, checking for required fields, correct types, and reasonable value ranges
  3. If validation passes, return the object. If it fails, log the failure with the raw output and trigger the fallback action
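A minimal sketch of those three steps, with illustrative names rather than the production code. It assumes the schema is expressed as a map of required field name to expected `typeof` string, and that `fallback` receives a diagnostic and decides what to do (e.g. queue the item for manual review):

```javascript
// Validation interception: parse, check against schema, pass or fall back.
function validateLLMOutput(raw, schema, fallback) {
  // 1. Strip preamble/postamble: naively take the first '{' through the
  //    last '}' as the JSON block.
  const match = raw.match(/\{[\s\S]*\}/);
  if (!match) {
    return fallback({ reason: 'no JSON object found', raw });
  }

  let parsed;
  try {
    parsed = JSON.parse(match[0]);
  } catch (err) {
    return fallback({ reason: 'invalid JSON: ' + err.message, raw });
  }

  // 2. Check required fields and their types against the schema.
  for (const [field, type] of Object.entries(schema)) {
    if (!(field in parsed)) {
      return fallback({ reason: 'missing required field: ' + field, raw });
    }
    if (typeof parsed[field] !== type) {
      return fallback({
        reason: 'field ' + field + ' is ' + typeof parsed[field] +
                ', expected ' + type,
        raw,
      });
    }
  }

  // 3. Validation passed: hand the object to the next routing step.
  return { ok: true, value: parsed };
}
```

A real version would also check value ranges, not just types, and log every failure alongside the raw output before triggering the fallback.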

The fallback is almost always “send to manual review queue” rather than “discard” or “retry”. Retry logic for LLM outputs is tricky because you don’t always know whether a bad output is a transient error or a systematic prompt failure. Manual review lets me inspect the pattern.

In n8n, I implemented this as a Code node that runs between every Claude API call and the next routing decision. It’s about 40 lines of JavaScript and it runs on every execution.
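The wrapper in the Code node looks roughly like this. In n8n, `$input` is provided by the runtime and the node returns an array of `{ json }` items; the stub below exists only so the sketch runs standalone, and the `text` field name is an assumption about where the Claude node's response is stored:

```javascript
// Stub of n8n's $input global — inside a real Code node this line is
// dropped and the runtime provides it.
const $input = {
  first: () => ({
    json: { text: 'Here is the JSON: {"decision":"yes","reason":"strong fit"}' },
  }),
};

const raw = $input.first().json.text;
const match = raw.match(/\{[\s\S]*\}/);
let result;
try {
  const parsed = JSON.parse(match ? match[0] : '');
  const ok = typeof parsed.decision === 'string' &&
             typeof parsed.reason === 'string';
  result = ok
    ? { status: 'valid', ...parsed }
    : { status: 'review', reason: 'schema mismatch', raw };
} catch (err) {
  result = { status: 'review', reason: 'parse error: ' + err.message, raw };
}

// The next node (an IF node) routes on `status`: "valid" continues the
// pipeline, "review" goes to the manual queue. In n8n this would be
// `return [{ json: result }];`.
const items = [{ json: result }];
```

Routing on an explicit `status` field, rather than on the presence or absence of data, is what turns the silent failures into visible ones.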

What the Validation Catches

After running with validation for two weeks, the failure distribution looked like this:

  • Extra text around JSON: 3% of calls. All caught and cleaned automatically.
  • Wrong field names (camelCase vs snake_case): 1.5% of calls. Caught, logged, sent to review.
  • Structural mismatches: 0.4% of calls. Caught, logged, sent to review.
  • Truncated output: 0.2% of calls. Caught, retried once with a higher token limit.

About 5% of calls had some form of output issue. Without validation, those 5% would have been silent failures scattered through the results.

The other finding: the failure rate was not evenly distributed. Certain prompt structures produced malformed output at 10-15% rates while others were nearly perfect. The validation data let me identify and fix those prompts, which brought the overall failure rate down to under 1%.

What This Means for Agentic Systems

A few principles I’d apply to any production LLM pipeline based on this:

Treat every LLM output as untrusted input. The same validation discipline you’d apply to user input from a web form applies to LLM output. It’s generated, not typed, but it’s still external data whose shape and content your system can’t guarantee.

Structured output APIs help but don’t eliminate the problem. Claude’s JSON mode and similar features from other providers significantly reduce malformed output. They don’t eliminate it, and they don’t protect against semantic errors (correct JSON structure, wrong values).
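The semantic case is worth spelling out, because it passes every structural check. A hypothetical example, assuming scores are defined on a 0-10 scale (that convention is illustrative, not from the source):

```javascript
// Right fields, right types — JSON mode is satisfied — but the value
// is out of range.
const out = JSON.parse('{"contact_score": 87, "decision": "yes"}');

const structurallyValid =
  typeof out.contact_score === 'number' &&
  typeof out.decision === 'string';                // true
const semanticallyValid =
  out.contact_score >= 0 && out.contact_score <= 10; // false
```

Only a range check in the validation layer catches this; no amount of schema enforcement at the API level can.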

Silent failures are worse than loud ones. Design your pipeline to fail loudly. An error that stops the workflow is better than a mis-route that silently corrupts your results.

Log raw outputs. When validation fails, log the raw LLM output alongside the failure. Six weeks later, when you’re diagnosing why a particular class of contacts was systematically mis-handled, that raw output log is the only thing that tells you what actually happened.

Building the validation layer took about three days and was the highest-leverage engineering work in the entire project. It turned a pipeline that was 95% reliable (not good enough for production) into one that’s 99%+ reliable with clear failure visibility for the remaining cases.

#LLM #AgenticAI #ClaudeAPI #n8n #StructuredOutput #Validation #ProductionAI #MultiAgentSystems
