Every score is earned, not assumed. Here is exactly how each metric is calculated — every deduction is traceable to a specific browser action, screenshot, or bug found during your test run.
Measures whether each test step actually did something. A step that leaves the page unchanged is not a test — it's observation.
| Condition | Deduction |
|---|---|
| Before/after screenshots identical on an interactive step | −20 per step |
| Step marked PASS but only passive inspection occurred | −15 per step |
| Navigation action failed or timed out | −25 per step |
| Step description contains vague language | −10 per step |
**Full score:** Clicked "Sign Up", page navigated to /dashboard, screenshots differ, URL changed.

**Low score:** 2 of 4 steps had identical screenshots. Nothing visibly changed after interaction.
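The deduction rules above can be sketched as simple subtractions from a starting score of 100. The step fields below (`screenshots_identical`, `navigation_failed`, and so on) are illustrative assumptions, not the tool's actual data model:

```python
# Sketch of the interaction-score deductions, assuming a per-step dict.
def flow_score(steps):
    """Start at 100, subtract per-step deductions, floor at 0."""
    score = 100
    for step in steps:
        if step.get("interactive") and step.get("screenshots_identical"):
            score -= 20  # no visible change on an interactive step
        if step.get("status") == "PASS" and step.get("passive_only"):
            score -= 15  # marked PASS but only passive inspection occurred
        if step.get("navigation_failed"):
            score -= 25  # navigation action failed or timed out
        if step.get("vague_description"):
            score -= 10  # vague step description
    return max(score, 0)

# The low-score example above: 2 of 4 interactive steps left the page unchanged.
steps = [
    {"interactive": True, "screenshots_identical": True},
    {"interactive": True, "screenshots_identical": True},
    {"interactive": True, "screenshots_identical": False},
    {"interactive": True, "screenshots_identical": False},
]
print(flow_score(steps))  # 100 - 2*20 = 60
```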
Counts real bugs found: broken forms, missing elements, crashes, 404s, and validation failures. A zero-bug run earns 100 only if the interactions were genuinely real.
| Condition | Deduction |
|---|---|
| Critical bug (broken form, 404, crash, hard fail) | −30 per bug |
| High bug (broken feature, missing required element) | −20 per bug |
| Medium bug (validation not working, missing feedback) | −15 per bug |
| Low/minor bug (UI glitch, slow response) | −10 per bug |
| Zero bugs but no real interactions occurred | Capped at 70 |
**Full score:** Zero bugs found. All interactions were real with confirmed assertions.

**Low score:** 3 critical bugs found: form crash, 404 on submit, missing required field.
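The severity deductions and the zero-bug cap can be sketched as follows. The severity labels and function shape are illustrative assumptions, not the tool's real API:

```python
# Hypothetical severity labels mapped to the deductions in the table above.
SEVERITY_DEDUCTION = {"critical": 30, "high": 20, "medium": 15, "low": 10}

def bug_score(bugs, real_interactions):
    """Subtract per-bug deductions; cap at 70 if zero bugs came from a passive run."""
    score = 100 - sum(SEVERITY_DEDUCTION[b] for b in bugs)
    if not bugs and not real_interactions:
        score = min(score, 70)  # zero bugs means little if nothing was exercised
    return max(score, 0)

print(bug_score(["critical", "critical", "critical"], True))  # 100 - 3*30 = 10
print(bug_score([], False))  # capped at 70: no bugs, but no real interactions
```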
Detects browser-level UI problems: console errors, elements that could not be found, and elements that were blocked or non-interactable.
| Condition | Deduction |
|---|---|
| Console error detected during step execution | −10 per error |
| Expected element not found on the page | −15 per element |
| Element found but not interactable (disabled, overlay) | −10 per element |
**Full score:** Zero console errors. All elements found and clickable on first attempt.

**Low score:** 3 console errors (−30), 1 missing element (−15). Score: 55.
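A minimal sketch of the UI-health arithmetic, with counts passed in directly (the parameter names are assumptions for illustration):

```python
def ui_health_score(console_errors, missing_elements, blocked_elements):
    """Apply the per-item deductions from the table above; floor at 0."""
    score = 100
    score -= 10 * console_errors     # console error during step execution
    score -= 15 * missing_elements   # expected element not found
    score -= 10 * blocked_elements   # found but disabled or covered by an overlay
    return max(score, 0)

# The example above: 3 console errors and 1 missing element.
print(ui_health_score(3, 1, 0))  # 100 - 30 - 15 = 55
```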
The percentage of steps that genuinely passed — not just steps that did not error out. Three conditions must all be true for a step to count as a real pass.
| A step is a genuine PASS only if ALL are true | Requirement |
|---|---|
| Real interaction occurred (click, fill, submit) | Required |
| Before/after screenshots are visually different | Required |
| Step status is PASS (no error returned) | Required |
**Full score:** All 5 steps: clicked element, page changed, status PASS. 5/5 genuine passes.

**Low score:** 4 steps all PASS but screenshots identical every time. Zero genuine passes.
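The three-condition check can be sketched as a filter over the step list. The field names are illustrative assumptions:

```python
def genuine_pass_rate(steps):
    """Percentage of steps where all three pass conditions hold."""
    genuine = [
        s for s in steps
        if s["real_interaction"]          # a click, fill, or submit happened
        and s["screenshots_differ"]       # the page visibly changed
        and s["status"] == "PASS"         # no error was returned
    ]
    return 100 * len(genuine) // len(steps) if steps else 0

# The zero-pass example above: 4 steps report PASS, but nothing ever changed.
steps = [{"real_interaction": True, "screenshots_differ": False,
          "status": "PASS"}] * 4
print(genuine_pass_rate(steps))  # 0: PASS status alone is not enough
```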
Tracks whether forms were fully exercised end-to-end: filled with real data, submitted, and the response confirmed. Skipped if the page has no forms.
| Condition | Score |
|---|---|
| Form filled + submitted + success/error confirmed | 100 |
| Submitted but response not checked | 70 |
| Filled but never submitted | 20 |
| Form page detected but never interacted with | 0 |
| No form on this page | N/A |
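The tiers above form a simple ladder; each level requires everything below it. The `form` dict and its keys are illustrative assumptions, not the tool's real schema:

```python
def form_reliability(form):
    """Map form-exercise depth to the tiered scores above."""
    if form is None:
        return None  # N/A: page has no form; this metric's weight is redistributed
    if not form["filled"]:
        return 0     # form page detected but never interacted with
    if not form["submitted"]:
        return 20    # filled but never submitted
    if not form["response_confirmed"]:
        return 70    # submitted, but the success/error response was not checked
    return 100       # filled, submitted, and response confirmed

print(form_reliability(
    {"filled": True, "submitted": True, "response_confirmed": False}
))  # 70
```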
Measures how smoothly the automation engine ran: timeouts, missing elements, fallback recoveries, and retries all indicate a harder-to-test page.
| Condition | Deduction |
|---|---|
| Step timed out waiting for element or page load | −20 per step |
| Target element could not be found on the page | −20 per step |
| Fallback recovery strategy was used instead of primary action | −15 per step |
| Retry was required to complete a step | −10 per retry |
**Full score:** Every element found on first try, no timeouts, no retries needed.

**Low score:** 2 timeouts (−40), 1 fallback used (−15), 1 retry needed (−10). Score: 35.
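The stability arithmetic reduces to counting each kind of recovery event (parameter names are assumptions for illustration):

```python
def stability_score(timeouts, not_found, fallbacks, retries):
    """Apply the per-event deductions from the table above; floor at 0."""
    score = (100
             - 20 * timeouts    # waited out an element or page load
             - 20 * not_found   # target element never located
             - 15 * fallbacks   # fallback strategy replaced the primary action
             - 10 * retries)    # step needed another attempt
    return max(score, 0)

# The example above: 2 timeouts, 1 fallback, 1 retry.
print(stability_score(2, 0, 1, 1))  # 100 - 40 - 15 - 10 = 35
```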
Each metric contributes a fixed percentage of the overall score. The weights reflect how much each metric matters for real-world website quality.
If Form Reliability is N/A (no form on the tested page), its 10% is redistributed equally — 5% to Flow Reliability and 5% to Success Rate.
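The redistribution rule can be sketched as below. Only Form Reliability's 10% and the 5%/5% split are taken from the text; every other weight is a placeholder assumption, as are the metric key names:

```python
def effective_weights(weights, form_is_na):
    """Drop Form Reliability when N/A and split its weight as described above."""
    w = dict(weights)
    if form_is_na:
        freed = w.pop("form_reliability")
        w["flow_reliability"] += freed / 2  # 5% to Flow Reliability
        w["success_rate"] += freed / 2      # 5% to Success Rate
    return w

weights = {
    "flow_reliability": 0.20,   # placeholder
    "bug_detection": 0.25,      # placeholder
    "ui_health": 0.15,          # placeholder
    "success_rate": 0.20,       # placeholder
    "form_reliability": 0.10,   # stated in the text
    "stability": 0.10,          # placeholder
}
print(effective_weights(weights, form_is_na=True))
```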
Every step had a real interaction, a confirmed assertion, and visually different screenshots. Zero deductions across all metrics.
Minor issues only. Most steps interacted genuinely. A few retries or low-severity bugs were found.
Several steps produced no visible change, or medium-severity bugs were found. The site works but has meaningful quality gaps.
Critical bugs, repeated timeouts, broken navigation, or the test never produced genuine interactions. Immediate attention needed.