The anti-black-box

Criterion — the AI interview coach that shows its work

Practice the interview. See the standard. Challenge the score.

Most interview tools hand you a number. A score you can't trace and can't contest isn't an assessment — it's a verdict. Criterion is built the other way around: every score links to the exact words you said that earned it, and if you disagree, you can challenge it and watch it re-evaluate in the open — revise up, hold, lower, or escalate to human review. The whole product is one argument made in interface form: an AI judgment people can read, trust, and talk back to.

Fully interactive — run the whole flow: home → setup → interview → score → debrief, then try the signature challenge on the scorecard. Opens in desktop by default; use the toggle for the mobile layout. Interview video is shown as a placeholder here; the live build embeds it.

The problem: an unexplained verdict teaches nothing

AI assessment usually arrives as a black box — an opaque number with no traceability and no recourse.

For something as high-stakes and personal as interview prep, a bare score does the opposite of help. "6.2 — Communication" tells you nothing you can act on. You can't see which sentence cost you the point, you can't tell whether the system even heard the strongest thing you said, and you certainly can't argue with it. Worse, an unexplained verdict quietly trains the wrong lesson: people start performing for an invisible grader instead of getting better at the actual thing. The first design decision was to treat opacity itself as the bug to fix.

The black box · what most tools ship

Communication 6.2

No evidence. No confidence level. No way to ask why, and no way to disagree. A verdict.

The glass box · what Criterion ships

Communication 8.0
scored Confidence: High weight 20%

Evidence matched to the exact words you said — and a challenge button if you think it's wrong.

The thesis: make the AI's judgment legible and accountable

Two non-negotiable properties. Every score traces to evidence. Every score is contestable.

Everything in Criterion descends from one product thesis: the anti-black-box. If a score can't be traced and can't be argued with, it isn't an assessment — it's a verdict, and verdicts don't help people improve. So design becomes the mechanism of accountability, not decoration on top of a model. The interface is where the judgment is forced into the open: shown with the evidence behind it, the confidence level attached to it, and a live path to challenge it. Trust here isn't a tone of voice. It's a structural property you can poke at.

Answer real questions

A live interviewer adapts to what you say — typed or spoken. No trick questions, no hidden rules.

See the standard

Every score links to the exact words you said that earned it, and to the evaluation criteria behind it. Nothing is scored against something you can't see.

Challenge the score

Disagree? Make your case and watch the score re-evaluate in the open — up, down, hold, or sent for human review.

The core decision: a rubric you can argue with

"Challenge score" isn't a feedback gimmick. It's the interaction that turns a graded subject into a participant in the evaluation.

When you challenge a score, Criterion re-reads your transcript against the same rubric, in the open: "Re-evaluating against the transcript & rubric…", then it tells you what it found. Crucially, the outcomes are honest — a challenge can revise the score up, hold it where it is, lower it, or return human review recommended. Roughly one in four challenges hold or lower the score. That asymmetry is the whole point: if challenging only ever flattered you, it would be theatre. Because it can also disagree with you, a revision actually means something.

Handling ambiguity · before

challenged 5/10

Confidence: Medium · weight 15%

"You think it's too low. You named the assumption and a test to resolve it — make your case."

Re-evaluating against the rubric → criterion updated

5 7 score revised

Score revised: 5 → 7. In Q5 you named the assumption and a test to resolve it. That evidence is now evidence matched to this criterion, and confidence rose to High. The scorecard total recomputes live from the weighted rubric — headline 6.7 → 7.0.

revise up hold human review recommended

This is the signature interaction — run it yourself in the prototype above. The challenge state resets to a clean slate each visit, with per-criterion "Restore original" as in-session undo, so the flow is always replayable.

Honesty about uncertainty is the trust-builder

Confidence levels and the "re-evaluating against rubric" moment are shown on purpose. The system performs its thinking rather than hiding it.

Most products treat a loading state as dead time to disguise. Criterion uses it as a chance to be honest. When a score is still thin, it says so — "evidence gap detected: this criterion is scoring on one short answer" — and marks the score provisional rather than final. When you challenge, the re-evaluation runs visibly: reading your case, re-reading the transcript for missed evidence, then committing to a result. Confidence is a first-class label (Low, Medium, High), not a hidden parameter. Admitting what it isn't sure of is exactly what makes the confident scores believable.

State = shape first, colour second

scored — evidence matched provisional — thin evidence awaiting — not yet asked challenged — under review human review recommended

Evidence gap detected

Handling ambiguity provisional

"Scoring on one short answer — the next question is steering to close the gap, not penalise you." A follow-up generated from the weak criterion, shown rather than hidden.

The hardest decision: live video that never undercuts the thesis

Video exists to practise presence and nerves. You are scored only on what you say — never how you look, sound, your accent, or your background.

A real interview is partly a performance under pressure, so the prototype puts you on camera with a live interviewer to practise that. But the moment a product scores a face, the glass-box thesis is dead — and so is the trust. The load-bearing decision was drawing a hard line: the camera is for rehearsal, the rubric only ever reads the transcript. The concrete expression of that boundary is the delivery-notes panel — opt-in coaching on pace and filler words, framed in indigo, explicitly never scored, never written into the transcript, and never shown to a human reviewer. The boundary is stated in the UI, not buried in a policy page.

Scored

The substance of your answers.

Reasoning, structure, evidence — only the words in your transcript. Every point traces back to something you can read.

Never scored

Your face, expressions, accent, attractiveness, background, or body language.

The camera stays on your device. None of it reaches the rubric — by design, not by setting.

Delivery notes Optional · never scored

Practise presence and nerves. Coaching cues from your camera — separate from the rubric, never in the transcript, never shown to a reviewer.

"Shows its work," made literal: score → evidence → the actual moment

Every evidence timestamp on the scorecard is a real link that opens a bounded replay of the exact moment you said it.

Tracing a score to a quote is good. Tracing it to the actual seconds is better. Each piece of matched evidence carries a play control — a real ▶ link — that opens a short, bounded replay of precisely that part of your answer. Score, to the evidence behind it, to the words in your own voice, in two clicks. There's nowhere for an unexplained number to hide: if Criterion claims you named a tradeoff, you can watch yourself name it. That round trip is the most concrete answer to "what does shows its work mean?"

Product sense · evidence matched

"Names the user segment before proposing a fix, then ties the metric back to it." — matched to the moment you said it.

04:12 — replay the exact words

This is a static specimen — the ▶ links are live on the scorecard in the prototype above.

v1 → v7: sharpening one thesis, not stacking features

Each version defends the same idea — make the AI's judgment legible and contestable — against the easier, less honest version of every feature.

The arc isn't a changelog. Early versions found the product identity and put the challenge interaction at the centre. The middle versions made the interview a real, practised experience and added the trust layer that keeps video honest. The late work is where the thesis got defended: honest defaults, a replayable challenge, a warm dark theme that keeps "Criterion at night," and accessibility done as real contrast math rather than a checkbox.

  • v1

    Component dump.

    A grayscale admin shell. It looked like a tool, not a product — and there was no thesis yet.

  • v2

    Consumer rework.

    A real product identity (Fraunces / Plus Jakarta, warm palette). No video yet; the challenge interaction takes centre stage.

  • v3

    Live video.

    The interview becomes a practised experience, not a quiz.

  • v4

    Contestable replay, real math, trust layer.

    Timestamp → bounded moment replay; the Maya Osei candidate persona; a sticky picture-in-picture interviewer; a weighted total that recomputes live when a challenge revises a score; and the delivery-notes panel that makes the presence-vs-score boundary explicit.

  • v5

    Polish and honest defaults.

    A loop-seam crossfade with a reduced-motion override; voice mode drops the interviewer to an audio-only orb instead of faking a feed; and the challenge state resets to a clean slate each visit so anyone can replay it — a documented decision, not an oversight.

  • v6

    Resolved open decisions.

    Held a warm near-black dark theme over true black to keep the "Criterion at night" warmth; surfaced the Maya Osei name lightly while keeping the live self-label as "You"; added a discreet stock-footage disclosure to the review-only strip.

  • v7

    Accessibility, done properly.

    The real dark-mode failures were two accent-as-text colours under the 4.5:1 AA line — not the background. Fixed with dedicated text-only variables that brighten in dark mode (to 7.9:1 and 8.6:1) without touching button fills, plus a light-mode jade nudge to 5.3:1. Collapsed panels and decorative overlays got aria-hidden; the picture-in-picture inert toggle and replay dialog labelling were confirmed correct.

Explore more work

More explorations from the AI Product Design Lab — each a different facet of making AI products people can direct, verify, supervise, and trust.

steer exploration cover
Steer — intent before generation

Turn an under-specified prompt into a negotiated brief: the model surfaces what it inferred and flags ambiguity before it commits.

View exploration
ground exploration cover
Ground — verify what AI claims

Every claim traceable to a source with confidence and freshness; unsupported claims flagged; source conflicts shown, not smoothed over.

View exploration
recall exploration cover
Recall — legible AI memory

A memory layer you can see, attribute, edit, scope, and revoke — personalization as a negotiated, inspectable thing, not a black box.

View exploration