What a Pipeline Actually Is
An AgentDyne pipeline is a Directed Acyclic Graph (DAG) of agents. Each node is an agent. Each edge passes output from one agent as input to the next.
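To make the graph structure concrete, here is a minimal sketch in plain Python. It does not use the AgentDyne API. The node names and the dictionary layout are assumptions made for illustration, and the ordering relies on the standard library's `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical pipeline: each key is a node, each value is the set of
# upstream nodes whose output that node consumes.
PIPELINE = {
    "researcher": set(),
    "fact_checker": {"researcher"},
    "summary_generator": {"fact_checker"},
}

def execution_order(graph):
    """Return node names so that every node comes after all of its inputs."""
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        # args[1] holds the list of nodes that form the detected cycle.
        raise ValueError(f"pipeline is not acyclic: {exc.args[1]}") from exc

order = execution_order(PIPELINE)
print(order)
```

A graph with a cycle has no valid execution order, which is why the sketch rejects it up front instead of starting a run.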
Failure Mode 1: Timeout Cascades (31% of failures)
This is the most common failure. A pipeline with a 5-minute timeout shared across 6 nodes works fine 90% of the time. In the other 10%, one node runs long and the delay cascades: the pipeline times out before the remaining nodes are ever scheduled.
Fix: Set pipeline timeout generously.
pipeline_timeout = (sum of expected node latencies) × 2.5
For a 6-node pipeline with 45-second median per node: timeout = (6 × 45) × 2.5 = 675 seconds.
Also: enable continue_on_failure: true on non-critical nodes.
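The timeout rule above can be sketched as a small helper. The function name and the `headroom` parameter are hypothetical. Only the 2.5 multiplier and the 6 × 45-second example come from the text.

```python
def pipeline_timeout(expected_latencies_s, headroom=2.5):
    """Return a pipeline timeout in seconds: summed node latencies times headroom."""
    if not expected_latencies_s:
        raise ValueError("at least one node latency is required")
    return sum(expected_latencies_s) * headroom

# Six nodes at a 45-second median each, as in the example above.
timeout_s = pipeline_timeout([45] * 6)
print(timeout_s)  # 675.0
```

Passing per-node latencies instead of a single average keeps the estimate honest when one node is known to be much slower than the rest.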
Failure Mode 2: Output Schema Mismatch (28% of failures)
Node A produces JSON that Node B cannot parse. Example: Fact Checker outputs {"claims": [...], "verified_count": 2}, but Summary Generator expects {"verified_claims": [...]}. The key names differ, so Node B never finds the input it expects and hallucinates.
Fix: Declare output schemas for every agent node. When an agent's output is validated against its declared schema before being passed to the next node, mismatches surface immediately.
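As a rough illustration of validating at the node boundary, the sketch below checks a payload against a declared schema before it is handed on. The schema format (required key mapped to a Python type), the `SchemaMismatch` exception, and the `validate` helper are all assumptions for this example, not AgentDyne's actual schema mechanism.

```python
# Hypothetical contract for the Fact Checker's output:
# required key -> expected Python type.
FACT_CHECKER_OUTPUT_SCHEMA = {"verified_claims": list}

class SchemaMismatch(ValueError):
    """Raised when a node's output does not match its declared schema."""

def validate(payload, schema):
    """Return payload unchanged if it satisfies schema, else raise SchemaMismatch."""
    errors = []
    for key, expected_type in schema.items():
        if key not in payload:
            errors.append(f"missing key {key!r}")
        elif not isinstance(payload[key], expected_type):
            errors.append(
                f"key {key!r} should be {expected_type.__name__}, "
                f"got {type(payload[key]).__name__}"
            )
    if errors:
        raise SchemaMismatch("; ".join(errors))
    return payload

# The mismatched output from the example above fails at the boundary.
bad_output = {"claims": [], "verified_count": 2}
try:
    validate(bad_output, FACT_CHECKER_OUTPUT_SCHEMA)
    mismatch_caught = False
except SchemaMismatch as exc:
    mismatch_caught = True
    print(f"rejected before Summary Generator ran: {exc}")
```

The point of the design is where the error appears: at the producing node, with a message naming the missing key, not as a plausible-looking but wrong summary downstream.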
Failure Mode 3: Non-Idempotent Nodes (17% of failures)
Pipelines retry on transient failures. If Node B writes to a database and then retries, you get duplicate records.
Fix: Design every node for idempotency. Pass an execution_id through the pipeline and use it as a deduplication key.
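One way to use execution_id as a deduplication key is to make it part of a uniqueness constraint, so a retried write becomes a no-op. The sketch below assumes a SQLite store. The table layout, node name, and helper names are hypothetical.

```python
import sqlite3

def open_store():
    """Create an in-memory store keyed by (execution_id, node)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE results ("
        " execution_id TEXT NOT NULL,"
        " node TEXT NOT NULL,"
        " payload TEXT NOT NULL,"
        " PRIMARY KEY (execution_id, node))"
    )
    return conn

def write_result(conn, execution_id, node, payload):
    """Insert once per (execution_id, node); return False if the row already existed."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO results VALUES (?, ?, ?)",
            (execution_id, node, payload),
        )
    return cur.rowcount == 1

conn = open_store()
first = write_result(conn, "exec-123", "node_b", '{"status": "ok"}')
retry = write_result(conn, "exec-123", "node_b", '{"status": "ok"}')
rows = conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]
print(first, retry, rows)
```

Letting the database enforce uniqueness avoids a separate "check, then write" step, which could itself race under concurrent retries.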
Output Schemas Matter More Than System Prompts
Counter-intuitive finding: improving output schemas improved pipeline reliability more than improving system prompts.
A system prompt change requires re-prompting and re-evaluating quality. An output schema change forces the model to conform to a structure, and models are surprisingly good at this even with mediocre system prompts.
Rule of thumb: Spend 20% of iteration time on system prompts and 80% on output schemas, data contracts, and error handling.
Monitoring Your Pipeline
| Metric | Healthy | Warning | Alert |
|---|---|---|---|
| Success rate | >95% | 85-95% | <85% |
| P95 latency | <120% of baseline | 120-200% | >200% |
| Node failure rate | <5% | 5-15% | >15% |
| continue_on_failure activations | <2% | 2-10% | >10% |
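The thresholds in the table can be turned into a simple classifier. This is a sketch under assumptions: the metric keys and the `classify` function are made up for illustration, all values are percentages, and a value that sits exactly on a boundary (for example a 95% success rate) is treated as a warning.

```python
# metric -> (healthy bound, alert bound, higher_is_better), all in percent.
THRESHOLDS = {
    "success_rate": (95.0, 85.0, True),
    "p95_latency_pct_of_baseline": (120.0, 200.0, False),
    "node_failure_rate": (5.0, 15.0, False),
    "continue_on_failure_rate": (2.0, 10.0, False),
}

def classify(metric, value):
    """Map a metric value to 'healthy', 'warning', or 'alert' per the table."""
    healthy, alert, higher_is_better = THRESHOLDS[metric]
    if higher_is_better:
        if value > healthy:
            return "healthy"
        if value < alert:
            return "alert"
    else:
        if value < healthy:
            return "healthy"
        if value > alert:
            return "alert"
    # Everything between the two bounds, boundaries included.
    return "warning"

print(classify("success_rate", 97.0))
print(classify("p95_latency_pct_of_baseline", 250.0))
```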