Test-run scoring

How BugBrain decides a test result verdict — PASS, FAIL, or INCONCLUSIVE — using outcome oracles and a confidence layer that enforces honesty, so INCONCLUSIVE never counts as a pass and flaky cases are flagged.

When a run finishes, the most important question is simple: did the thing work? BugBrain answers that with a deliberately small, honest vocabulary of verdicts — and a layer of checks designed to make sure a verdict means what it says. The guiding principle is that "we couldn't tell" is never disguised as "it passed."

The three verdicts

Every check resolves to one of three outcomes:

PASS — the expected outcome was confirmed. The agent reached the situation it set out to test and saw the right result.
FAIL — the expected outcome was contradicted. Something that should have worked didn't, and there's evidence to back it up.
INCONCLUSIVE — BugBrain could not confidently determine pass or fail. For example, the flow was blocked, a page didn't load, a precondition was missing, or the evidence was too weak to make a call.

That third verdict is the one that makes BugBrain trustworthy. A lot of automated testing quietly treats "I didn't observe a problem" as a pass. BugBrain refuses to: INCONCLUSIVE is its own outcome and never counts as a pass. If the agent couldn't actually verify the thing, it says so.
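As a rough illustration of what "never counts as a pass" means when tallying a run, the sketch below keeps the three verdicts as separate counts. The names `Verdict`, `summarize`, and `run_is_green` are hypothetical and are not BugBrain's API. The "green only if every check passed" gate is an assumed policy chosen for the example.

```python
from collections import Counter
from enum import Enum


class Verdict(Enum):
    PASS = "PASS"
    FAIL = "FAIL"
    INCONCLUSIVE = "INCONCLUSIVE"


def summarize(verdicts):
    """Count each verdict separately; INCONCLUSIVE is never folded into PASS."""
    counts = Counter(verdicts)
    return {v.value: counts.get(v, 0) for v in Verdict}


def run_is_green(verdicts):
    """Hypothetical gate: green only when every check was positively confirmed."""
    return len(verdicts) > 0 and all(v is Verdict.PASS for v in verdicts)


results = [Verdict.PASS, Verdict.INCONCLUSIVE, Verdict.PASS]
print(summarize(results))
print(run_is_green(results))
```

Under this assumed gate, a run with two passes and one INCONCLUSIVE is not green, because one check was never actually verified.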

How a verdict is decided

Two mechanisms produce and police the verdict:

Oracles are the judgment functions that decide whether an outcome is correct. Different oracles look at different signals — did the data stay consistent, did the page meet accessibility expectations, did the screen match what it should look like, did the behavior match what the app claims to do. Each contributes to whether a check should pass or fail.
A trust and confidence layer then enforces honesty on top of those judgments. It applies a confidence floor — a verdict has to clear a minimum bar of certainty to be reported as a definitive PASS or FAIL — and it actively controls the rate of false passes (calling something good when it isn't). When a result can't clear that bar, it's reported as INCONCLUSIVE rather than nudged to a pass.

The net effect: a PASS from BugBrain is a confident, evidence-backed statement, not a default.
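One way to picture a confidence floor sitting on top of oracle judgments is the minimal sketch below. Everything in it is an assumption made for illustration: the `OracleJudgment` shape, the oracle names, the `0.9` floor, and the rule for combining several oracles are not taken from BugBrain. It also does not model the false-pass rate control described above, only the floor itself.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    PASS = "PASS"
    FAIL = "FAIL"
    INCONCLUSIVE = "INCONCLUSIVE"


@dataclass(frozen=True)
class OracleJudgment:
    oracle: str        # illustrative name, e.g. "data-consistency"
    passed: bool       # what this oracle concluded
    confidence: float  # 0.0 to 1.0, how sure the oracle is


CONFIDENCE_FLOOR = 0.9  # assumed value, not BugBrain's real threshold


def decide(judgments, floor=CONFIDENCE_FLOOR):
    """Return a definitive verdict only when the evidence clears the floor."""
    if not judgments:
        # Nothing was observed, so nothing can be confirmed.
        return Verdict.INCONCLUSIVE
    if any(not j.passed and j.confidence >= floor for j in judgments):
        # At least one oracle confidently contradicted the expected outcome.
        return Verdict.FAIL
    if all(j.passed and j.confidence >= floor for j in judgments):
        # Every oracle confidently confirmed the expected outcome.
        return Verdict.PASS
    # Weak or mixed evidence is reported honestly instead of nudged to a pass.
    return Verdict.INCONCLUSIVE
```

In this sketch a low-confidence pass and a low-confidence fail both come back as INCONCLUSIVE, so neither a PASS nor a FAIL is ever reported as a default.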

What it means for you

When you read results, treat the three verdicts differently:

FAIL is a real signal worth acting on — go to the issue detail for the evidence and root cause.
INCONCLUSIVE is a prompt to look at why the agent couldn't tell. Often it points at an environment or setup problem — login didn't succeed, the start URL was wrong, a dependency was down — rather than a bug in the feature itself. The run viewer's action log shows exactly where it stalled.
PASS means the check was genuinely confirmed, so you can rely on it.

Flaky cases

A check that swings between outcomes across runs — passing one time and failing the next without the app changing — is flaky. BugBrain flags these so a one-off blip doesn't get treated as a hard regression, and so you can fix the underlying instability rather than chasing ghosts.
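The definition above can be sketched as a simple history scan. This is an assumption-laden illustration, not BugBrain's detection logic: `find_flaky_checks` is a hypothetical helper, "the app did not change" is approximated by an identical `app_version` string, and INCONCLUSIVE runs are ignored when looking for swings.

```python
from collections import defaultdict


def find_flaky_checks(history):
    """Return check ids that both passed and failed on the same app version.

    history: iterable of (check_id, app_version, verdict) tuples,
    where verdict is "PASS", "FAIL", or "INCONCLUSIVE".
    """
    seen = defaultdict(set)
    for check_id, app_version, verdict in history:
        if verdict in ("PASS", "FAIL"):
            seen[(check_id, app_version)].add(verdict)
    # A check is flaky if any single version saw both definitive outcomes.
    flaky = {check_id for (check_id, _), verdicts in seen.items() if len(verdicts) == 2}
    return sorted(flaky)


history = [
    ("login", "v1", "PASS"),
    ("login", "v1", "FAIL"),            # same version, different outcome
    ("checkout", "v1", "PASS"),
    ("checkout", "v2", "FAIL"),         # app changed, so possibly a real regression
    ("search", "v1", "INCONCLUSIVE"),
    ("search", "v1", "PASS"),
]
print(find_flaky_checks(history))
```

Here only `login` is flagged: `checkout` changed outcome across different versions, and `search` never produced both a PASS and a FAIL.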

Honest by design

The whole scoring model is built around one rule: never claim a pass you can't back up. That's why INCONCLUSIVE exists as a first-class result and why the confidence floor sits between the agent and the verdict.

Related pages

Issue confidence & dedup: How findings are scored and deduplicated into single issues.
Test runs: Read verdicts and the action log in the run viewer.

Frequently asked questions

What are the possible verdicts?

Each check resolves to PASS, FAIL, or INCONCLUSIVE. PASS means the expected outcome was confirmed, FAIL means it was contradicted, and INCONCLUSIVE means BugBrain couldn't confidently determine either way.

Does INCONCLUSIVE count as a pass?

No — deliberately. INCONCLUSIVE is its own outcome and is never folded into the pass count. Counting "couldn't tell" as a pass would hide real problems, which is exactly what BugBrain is built to avoid.

Why would a check come back INCONCLUSIVE?

Usually because the agent couldn't reach the situation it needed to judge — a blocked flow, a page that didn't load, a missing precondition, or evidence too weak to call. It's an honest "not enough to say," not a failure of your app.

How does BugBrain avoid false passes?

Outcome oracles judge each result, and a trust layer enforces a confidence floor and controls the rate of false passes. A result the system isn't sure about is reported as INCONCLUSIVE rather than quietly passed.