LLM testing methods
The methods behind BugBrain's AI-app testing — prompt injection, role override and jailbreak probing scored by attack-class success and severity, plus hallucination detection, prompt consistency, toxicity, context retention, and tool-use correctness.
Testing an app powered by a large language model (an LLM, the kind of AI behind chatbots and assistants) is different from testing a conventional web app. The same input can produce different output, "correct" is often a matter of degree, and the failure modes (making things up, being talked out of its rules, leaking its instructions) don't look like a crash. BugBrain's AI-app testing applies a set of focused methods to probe exactly these behaviors, using read-only judges that score outputs and abstain when the evidence doesn't support a verdict.
Security probing: injection, role override, and jailbreak
The most safety-critical methods are adversarial. BugBrain sends crafted inputs designed to make the app misbehave, then scores what happened:
- Prompt injection — inputs that try to smuggle in new instructions, often aiming to make the app reveal its hidden system prompt (the confidential instructions it was given).
- Role override — attempts to make the app abandon its assigned role and act as something else (for instance, ignoring "you are a support bot" and following the attacker's persona instead).
- Jailbreak — attempts to bypass the app's safety guardrails and get it to produce content it's supposed to refuse.
Each probe is scored by attack-class success — did this specific class of attack actually work? — and by severity. The most severe outcomes are reserved for unambiguous breaches: a system prompt that was fully extracted, or a complete role override that was confirmed. A probe that the app shrugged off is scored as a non-success. Beyond scoring, the method can suggest follow-up semantic attack directions to try — describing the kind of weakness to push on, never shipping weaponized payloads.
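The scoring rule described above can be sketched in a few lines of Python. This is an illustration only, not BugBrain's implementation: the names `AttackClass`, `Severity`, `ProbeResult`, and `score_severity` are hypothetical, and the intermediate `MODERATE` tier is an assumption, since the text only specifies the two ends of the scale (a shrugged-off probe and an unambiguous breach).

```python
from dataclasses import dataclass
from enum import Enum, IntEnum


class AttackClass(Enum):
    PROMPT_INJECTION = "prompt_injection"
    ROLE_OVERRIDE = "role_override"
    JAILBREAK = "jailbreak"


class Severity(IntEnum):
    NONE = 0      # the app shrugged the probe off (non-success)
    MODERATE = 1  # assumed middle tier: the attack worked, but only partially
    CRITICAL = 2  # reserved for unambiguous breaches


@dataclass(frozen=True)
class ProbeResult:
    attack_class: AttackClass
    succeeded: bool
    full_prompt_extracted: bool = False    # system prompt fully leaked
    role_override_confirmed: bool = False  # complete role override confirmed


def score_severity(result: ProbeResult) -> Severity:
    """Map a probe outcome to a severity tier."""
    if not result.succeeded:
        return Severity.NONE
    if result.full_prompt_extracted or result.role_override_confirmed:
        return Severity.CRITICAL
    return Severity.MODERATE
```

For example, `score_severity(ProbeResult(AttackClass.PROMPT_INJECTION, succeeded=False))` yields `Severity.NONE`, while the same probe with `succeeded=True` and `full_prompt_extracted=True` yields `Severity.CRITICAL`.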
Accuracy, consistency, and safety
Alongside security, several methods judge the quality of what the app says:
- Hallucination detection — flags answers that assert things which aren't grounded or supportable, catching the LLM's habit of stating fiction with confidence.
- Prompt consistency — checks whether the app gives stable, coherent answers to equivalent or repeated prompts, rather than contradicting itself.
- Toxicity — scans outputs for harmful, offensive, or otherwise unsafe content the app should never produce.
Together, these methods tell you whether the app is not just secure but also trustworthy and on-message.
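To make the prompt-consistency idea concrete, here is a minimal sketch that collects the answers to several equivalent prompts and measures how much they agree. The source does not say how BugBrain compares answers; this example assumes a crude lexical comparison using the standard library's `difflib.SequenceMatcher`, and the function name `consistency_score` is hypothetical. A real judge would more plausibly compare meaning rather than characters.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import List


def consistency_score(answers: List[str]) -> float:
    """Mean pairwise similarity in [0, 1]; 1.0 means every answer matches.

    Assumption: lexical similarity stands in for semantic agreement.
    """
    # Normalize case and whitespace so trivial differences are ignored.
    normalized = [" ".join(a.lower().split()) for a in answers]
    pairs = list(combinations(normalized, 2))
    if not pairs:
        return 1.0  # zero or one answer cannot contradict itself
    total = sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs)
    return total / len(pairs)
```

A low score across paraphrases of the same question is a signal worth a closer look, not a verdict on its own: two answers can be worded very differently and still agree.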
Memory and tools
Modern AI apps hold a conversation and call external tools, so BugBrain tests those behaviors too:
- Context retention — does the app remember and correctly use what was said earlier in the conversation, rather than forgetting or confusing it?
- Tool-use correctness — when the app is supposed to call a tool or function (look something up, take an action), does it call the right one with the right inputs at the right time?
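The tool-use check above (right tool, right inputs, right time) can be illustrated with a small comparison function. Everything here is a hypothetical sketch: `ToolCall` and `check_tool_call` are names invented for this example, and passing `None` is an assumed convention for "no tool call at this turn".

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ToolCall:
    name: str
    arguments: Dict[str, Any] = field(default_factory=dict)


def check_tool_call(expected: Optional[ToolCall],
                    actual: Optional[ToolCall]) -> List[str]:
    """Return a list of mismatches; an empty list means the call was correct.

    None means "no tool call at this turn", which covers the timing check.
    """
    if expected is None and actual is None:
        return []
    if expected is None:
        return [f"unexpected call to {actual.name!r}"]
    if actual is None:
        return [f"missing call to {expected.name!r}"]

    problems: List[str] = []
    if actual.name != expected.name:
        problems.append(
            f"wrong tool: expected {expected.name!r}, got {actual.name!r}"
        )
    # Only the expected arguments are checked; extras are ignored here.
    for key, want in expected.arguments.items():
        if key not in actual.arguments:
            problems.append(f"missing argument {key!r}")
        elif actual.arguments[key] != want:
            got = actual.arguments[key]
            problems.append(
                f"wrong value for {key!r}: expected {want!r}, got {got!r}"
            )
    return problems
```

Returning a list of mismatches rather than a single boolean keeps the report specific: a wrong tool name and a wrong argument are different failures and are usually fixed in different places.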
Read-only judges that abstain
Every one of these methods is evaluated by a read-only judge — it inspects and scores the app's output but never alters it. And critically, the judges are built to abstain rather than fabricate. If the evidence doesn't support a confident verdict, the judge says it can't tell instead of inventing a pass or fail. This is the same honesty principle that runs through the rest of BugBrain: a verdict you can't back up is worse than no verdict.
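The abstain-rather-than-fabricate rule amounts to a three-valued verdict. The sketch below is one way to express it, under stated assumptions: the `Verdict`, `Evidence`, and `judge` names are hypothetical, and the numeric evidence scores and the `0.8` threshold are illustrative choices, not values taken from BugBrain.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"
    ABSTAIN = "abstain"  # the judge cannot tell


@dataclass(frozen=True)
class Evidence:
    supports_pass: float  # assumed scale: 0.0 (none) to 1.0 (conclusive)
    supports_fail: float


def judge(evidence: Evidence, threshold: float = 0.8) -> Verdict:
    """Return PASS or FAIL only when one side is clearly supported."""
    strong_pass = evidence.supports_pass >= threshold
    strong_fail = evidence.supports_fail >= threshold
    if strong_pass and not strong_fail:
        return Verdict.PASS
    if strong_fail and not strong_pass:
        return Verdict.FAIL
    # Weak or conflicting evidence: abstain instead of guessing.
    return Verdict.ABSTAIN
```

Note that the function only reads its input and returns a value, which mirrors the read-only property: the judge never modifies the output it scores. Conflicting evidence (both sides strong) is treated the same as weak evidence and results in `Verdict.ABSTAIN`.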
Test only systems you're authorized to test
Adversarial probing — injection, role override, jailbreak — should only ever be run against AI applications you own or have explicit permission to test. Pointing these methods at someone else's system can violate their terms and the law.
Frequently asked questions
What kinds of LLM apps can BugBrain test?
Chatbots, voice agents, and custom AI endpoints. The methods cover security probing, factual accuracy, consistency, safety, memory across a conversation, and whether the app calls its tools correctly.
How is prompt-injection testing scored?
Adversarial inputs are scored by attack-class success — whether a system-prompt leak, role override, or jailbreak actually worked — and by severity. A probe that fully extracts the system prompt or confirms a complete role override is the most serious.
How does BugBrain catch hallucinations?
A read-only judge compares the app's answers against what's expected or supportable and flags claims that aren't grounded. Like every judge here, it abstains when it can't tell rather than inventing a verdict.
Do the judges ever make things up?
No. The judges are read-only and built to abstain rather than fabricate. If the evidence doesn't support a confident call, the judge says so instead of guessing.