Scenario Testing with .test.json

AI pipelines are non-deterministic -- LLM outputs vary between runs. But the structure of your pipeline is deterministic: given a particular outcome at each node, the same path should always be followed. Dippin's scenario testing lets you inject context values and assert on execution paths, giving you deterministic tests for non-deterministic systems.

The Core Idea

A .test.json file sits next to your .dip file. It holds an array of test scenarios, each declaring what context values to inject and what execution behavior to expect (visited nodes, path ordering, status).

The simulator walks the workflow graph, uses your injected values to evaluate edge conditions, and reports which nodes were visited and in what order. Your test asserts the result matches expectations.
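To make the mechanics concrete, here is a minimal sketch in Python of how such a simulator might pick edges. This is an illustration only: the node and edge shapes are assumptions, not Dippin's actual implementation.

```python
# Illustrative edge-picking loop; data shapes are assumptions, not Dippin's.
# An edge is (source, target, condition); condition is a (ctx_key, value)
# pair, or None for an unconditional fallback. Edge order matters: the
# first matching edge wins.

def simulate(edges, start, exit_node, ctx):
    path, node = [start], start
    while node != exit_node:
        for src, dst, cond in edges:
            if src == node and (cond is None or ctx.get(cond[0]) == cond[1]):
                node = dst
                path.append(dst)
                break
        else:
            break  # no outgoing edge matched: the simulation stops here
    return path

edges = [
    ("QualityGate", "Done",       ("outcome", "success")),
    ("QualityGate", "Synthesize", ("outcome", "fail")),
    ("QualityGate", "Done",       None),  # unconditional fallback
]
print(simulate(edges, "QualityGate", "Done", {"outcome": "fail"}))
# -> ['QualityGate', 'Synthesize']
```

Each scenario in the .test.json supplies the ctx dict; the assertions then compare the resulting path against your expectations.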

Auto-discovery

dippin test pipeline.dip automatically looks for pipeline.test.json in the same directory. No configuration needed.

A Workflow to Test

Take a real example from the Dippin repository: code_quality_sweep.dip. This workflow runs three LLM providers in parallel to analyze a codebase, synthesizes findings, fans out into three work streams (fix bugs, write docs, write tests), then finishes with a quality gate that can restart the whole process.

The key structural elements:

workflow CodeQualitySweep
  goal: "Analyze the dippin-lang codebase with three LLM providers in parallel..."
  start: ScanCodebase
  exit: Done

  # Phase 1: Scan
  agent ScanCodebase
    label: "Map the codebase"
    ...

  # Phase 2: Three-provider parallel analysis
  parallel AnalysisFan -> AnalyzeAnthropic, AnalyzeGemini, AnalyzeOpenAI
  fan_in AnalysisJoin <- AnalyzeAnthropic, AnalyzeGemini, AnalyzeOpenAI

  # Phase 3: Synthesize findings
  agent Synthesize
    ...

  # Phase 4: Three parallel work streams
  parallel WorkFan -> FixBugs, WriteDocs, WriteTests
  fan_in WorkJoin <- FixBugs, WriteDocs, WriteTests

  # Phase 5: Quality gate with retry
  agent QualityGate
    goal_gate: true
    ...

  edges
    ...
    QualityGate -> Done         when ctx.outcome = success
    QualityGate -> Synthesize   when ctx.outcome = fail  restart: true
    QualityGate -> Done

The quality gate has three outgoing edges: success goes to Done, failure restarts from Synthesize, and an unconditional fallback also goes to Done. Three distinct execution paths, three test scenarios.

The Test File

Here's the code_quality_sweep.test.json from the repository:

{
  "tests": [
    {
      "name": "quality gate passes -- all branches traversed",
      "scenario": {"outcome": "success"},
      "expect": {
        "status": "success",
        "visited": [
          "ScanCodebase",
          "AnalyzeAnthropic", "AnalyzeGemini", "AnalyzeOpenAI",
          "Synthesize",
          "FixBugs", "WriteDocs", "WriteTests",
          "QualityGate", "Done"
        ],
        "path_contains": ["ScanCodebase", "Synthesize", "QualityGate", "Done"]
      }
    },
    {
      "name": "quality gate fails -- restarts from Synthesize",
      "scenario": {"outcome": "fail"},
      "expect": {
        "visited": ["QualityGate", "Synthesize"],
        "path_contains": ["QualityGate", "Synthesize"]
      }
    },
    {
      "name": "all three analysis providers run",
      "scenario": {"outcome": "success"},
      "expect": {
        "path_contains": [
          "AnalyzeAnthropic", "AnalyzeGemini",
          "AnalyzeOpenAI", "AnalysisJoin"
        ]
      }
    },
    {
      "name": "no outcome -- unconditional fallback to Done",
      "scenario": {},
      "expect": {
        "status": "success",
        "visited": ["QualityGate", "Done"]
      }
    },
    {
      "name": "branch filter -- only Gemini analysis",
      "scenario": {"outcome": "success"},
      "branch": ["AnalyzeGemini"],
      "expect": {
        "status": "success",
        "visited": ["AnalyzeGemini"],
        "not_visited": ["AnalyzeAnthropic", "AnalyzeOpenAI"]
      }
    }
  ]
}

Anatomy of a Test Case

Each test case has three core fields, plus an optional branch filter:

Field      Purpose
name       Human-readable description shown in test output
scenario   Context values to inject. {"outcome": "success"} sets ctx.outcome to "success" at every node.
expect     Assertions about the simulation result
branch     Optional. Filters parallel fan-out to only these branches.

Expectation Fields

The expect object supports five assertion types:

Field              What it checks
status             Overall simulation status: "success" or "fail"
visited            Node names that must appear in the execution path
not_visited        Node names that must NOT appear in the execution path
path_contains      Node names that must appear in order (not necessarily adjacent)
immediately_after  Object mapping node names: {"A": "B"} asserts B appears right after A
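To pin down the semantics, here is a hypothetical re-implementation of the five checks in Python -- a sketch of the expected behavior, not Dippin's actual code:

```python
# Hypothetical assertion checks; behavior inferred from the table above.

def is_subsequence(needles, path):
    """path_contains: names appear in this order, not necessarily adjacent."""
    it = iter(path)
    return all(name in it for name in needles)

def check(expect, path, status):
    ok = True
    if "status" in expect:
        ok = ok and status == expect["status"]
    if "visited" in expect:
        ok = ok and all(n in path for n in expect["visited"])
    if "not_visited" in expect:
        ok = ok and not any(n in path for n in expect["not_visited"])
    if "path_contains" in expect:
        ok = ok and is_subsequence(expect["path_contains"], path)
    for a, b in expect.get("immediately_after", {}).items():
        if a not in path:
            ok = False
        else:
            i = path.index(a)
            ok = ok and i + 1 < len(path) and path[i + 1] == b
    return ok

path = ["ScanCodebase", "Synthesize", "QualityGate", "Done"]
print(check({"path_contains": ["ScanCodebase", "QualityGate"],
             "not_visited": ["FixBugs"]}, path, "success"))  # -> True
```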

Running Tests

$ dippin test examples/code_quality_sweep.dip
PASS  quality gate passes -- all branches traversed
PASS  quality gate fails -- restarts from Synthesize
PASS  all three analysis providers run
PASS  no outcome -- unconditional fallback to Done
PASS  branch filter -- only Gemini analysis

5/5 passed  examples/code_quality_sweep.dip

Verbose Mode

Add --verbose to see the full execution path for each scenario. This is invaluable when debugging a failing test:

$ dippin test --verbose examples/code_quality_sweep.dip
PASS  quality gate passes -- all branches traversed
  path: ScanCodebase -> AnalysisFan -> AnalyzeAnthropic -> AnalysisJoin
        -> AnalyzeGemini -> AnalysisJoin -> AnalyzeOpenAI -> AnalysisJoin
        -> Synthesize -> WorkFan -> FixBugs -> WorkJoin -> WriteDocs
        -> WorkJoin -> WriteTests -> WorkJoin -> QualityGate -> Done
...

Writing Your Own Tests

Step 1: Identify the branches

Look at your edge conditions. Each when clause creates a branch. You need at least one test scenario per branch. Think about:

  • The success path (context values satisfy every happy-path condition)
  • The failure path (what happens when a node fails)
  • The fallback path (what happens when no condition matches)
  • Parallel branch isolation (individual fan-out branches)

Step 2: Write the scenario injection

The scenario object sets context values the simulator uses when evaluating edge conditions. Keys correspond to variable names in when clauses, without the ctx. prefix:

// Edge condition in .dip file:
QualityGate -> Done  when ctx.outcome = success

// Corresponding scenario injection in .test.json:
"scenario": {"outcome": "success"}
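A sketch of how that mapping might be evaluated. The parser here is hypothetical and handles only the simple ctx.key = value form shown above:

```python
# Hypothetical condition evaluator: "ctx.outcome = success" vs scenario values.
def condition_holds(cond: str, scenario: dict) -> bool:
    lhs, rhs = (part.strip() for part in cond.split("=", 1))
    key = lhs.removeprefix("ctx.")  # scenario keys omit the ctx. prefix
    return scenario.get(key) == rhs

print(condition_holds("ctx.outcome = success", {"outcome": "success"}))  # -> True
print(condition_holds("ctx.outcome = success", {}))                      # -> False
```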

Step 3: Assert on structure, not content

The key insight: you assert on which nodes were visited and in what order, never on LLM response content. The tests stay deterministic because you're testing the graph's routing logic, not the models' output.[1]

Pitfall: not_visited fragility

Be careful with not_visited. If your workflow has retry loops with restart: true edges, the simulator's loop-breaking may visit nodes you don't expect. Prefer positive assertions (visited, path_contains) when possible. Use not_visited sparingly.[2]
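A toy model makes the fragility visible. The MAX_RESTARTS constant and the loop-breaking algorithm below are assumptions for illustration, not Dippin's documented behavior:

```python
# Toy model of bounded loop-breaking on a restart edge (assumed behavior).
# After MAX_RESTARTS trips through the fail branch, the simulator gives up
# retrying and falls through -- so "Done" ends up visited even on failure.
MAX_RESTARTS = 2

def simulate_with_restart(ctx):
    path, restarts = [], 0
    node = "Synthesize"
    while True:
        path.append(node)
        if node == "Synthesize":
            node = "QualityGate"
        elif node == "QualityGate":
            if ctx.get("outcome") == "fail" and restarts < MAX_RESTARTS:
                restarts += 1
                node = "Synthesize"  # the restart: true edge
            else:
                path.append("Done")
                return path

print(simulate_with_restart({"outcome": "fail"}))
# -> ['Synthesize', 'QualityGate', 'Synthesize', 'QualityGate',
#     'Synthesize', 'QualityGate', 'Done']
```

Note that even the fail scenario ultimately reaches Done once the loop budget is exhausted, so a not_visited: ["Done"] assertion would fail here -- exactly the kind of surprise described above.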

The branch Filter

For workflows with parallel fan-out, you can test individual branches in isolation with the branch field:

{
  "name": "branch filter -- only Gemini analysis",
  "scenario": {"outcome": "success"},
  "branch": ["AnalyzeGemini"],
  "expect": {
    "status": "success",
    "visited": ["AnalyzeGemini"],
    "not_visited": ["AnalyzeAnthropic", "AnalyzeOpenAI"]
  }
}

The branch array lists the parallel targets to include. All other fan-out branches get skipped, letting you test branch-specific behavior without noise from other parallel paths.
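The filtering itself is a simple intersection; a sketch of the assumed semantics:

```python
# Hypothetical branch filter: keep only the listed fan-out targets.
def apply_branch_filter(fan_out_targets, branch=None):
    if branch is None:  # no filter: run every parallel branch
        return list(fan_out_targets)
    return [t for t in fan_out_targets if t in branch]

targets = ["AnalyzeAnthropic", "AnalyzeGemini", "AnalyzeOpenAI"]
print(apply_branch_filter(targets, ["AnalyzeGemini"]))  # -> ['AnalyzeGemini']
```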

Edge Coverage

Add --coverage to see which edges your test suite covers. Suppose the fail-restart scenario were removed from the suite above:

$ dippin test --coverage examples/code_quality_sweep.dip
4/4 passed  examples/code_quality_sweep.dip

Edge coverage: 24/25 edges covered (96.0%)
Uncovered edges:
  QualityGate -> Synthesize  when ctx.outcome = fail  restart: true

The uncovered edge tells you exactly what test case to add. Write a scenario that injects {"outcome": "fail"} to trigger that condition and reach 100%.
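The coverage arithmetic itself is straightforward. A sketch, assuming edges are recorded as (source, target) pairs:

```python
# Hypothetical coverage computation: edges traversed across all scenarios
# versus all edges declared in the workflow.
def edge_coverage(all_edges, traversed):
    covered = set(traversed) & set(all_edges)
    uncovered = [e for e in all_edges if e not in covered]
    pct = 100.0 * len(covered) / len(all_edges)
    return pct, uncovered

all_edges = [("QualityGate", "Done"), ("QualityGate", "Synthesize")]
pct, uncovered = edge_coverage(all_edges, [("QualityGate", "Done")])
print(f"{pct:.1f}%", uncovered)  # -> 50.0% [('QualityGate', 'Synthesize')]
```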

JSON Output for CI

For CI integration, use --format json to get machine-readable results:

$ dippin test --format json examples/code_quality_sweep.dip
{
  "file": "examples/code_quality_sweep.dip",
  "total": 5,
  "passed": 5,
  "failed": 0,
  "results": [
    {"name": "quality gate passes -- all branches traversed", "status": "pass"},
    {"name": "quality gate fails -- restarts from Synthesize", "status": "pass"},
    ...
  ]
}
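A minimal consumer for that output might look like this. The field names follow the example above; the capture step in the comment is an assumption about how you would wire it into CI:

```python
import json

def failing_scenarios(report: dict) -> list:
    """Names of scenarios whose status is not 'pass'."""
    return [r["name"] for r in report["results"] if r["status"] != "pass"]

# In CI you might capture the command's stdout to a file first, e.g.:
#   dippin test --format json pipeline.dip > results.json
report = json.loads('''{"file": "pipeline.dip", "total": 2, "passed": 1,
  "failed": 1, "results": [
    {"name": "happy path", "status": "pass"},
    {"name": "retry loop", "status": "fail"}]}''')
print(failing_scenarios(report))  # -> ['retry loop']
```

Exit non-zero from your CI step whenever the list is non-empty.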

What's Next?

You know how to write scenario tests that verify your pipeline's routing logic deterministically. Related topics:

Notes

  1. The simulator doesn't call any LLMs. It walks the graph using your injected context values to pick edges, which is why tests run in milliseconds and cost nothing. The simulation implementation is in simulate/.
  2. The not_visited fragility was discovered during field testing with the Tracker team. Retry loops with restart: true create cycles, and the simulator breaks them after a bounded number of iterations -- but the nodes visited during those iterations can surprise you. See the testing reference for details on loop-breaking behavior.