Live in production at criterionscore.com. New, and early on purpose.

Criterion: the AI interview coach that shows its work

Practice the interview. See the standard. Challenge the score.

Most interview tools hand you a number. A score you cannot trace and cannot contest is not an assessment. It is a verdict. Criterion is built the other way around. Every score links to the exact words you said that earned it. If you disagree, you contest it and watch the model re-evaluate in the open: raise, hold, lower, or escalate to a human. The whole product is one argument made operable: an AI judgment a person can read, trust, and talk back to.

Run the live app

Interactive playground

These are static prototype artifacts, framed in place. For the real product, run the live app.

Set up · Interview · Score · Debrief

The four rooms a candidate moves through

The thesis

Make the AI's judgment legible and accountable

An unexplained verdict teaches nothing. If a tool tells you that you scored a 6, you learn a digit. You do not learn what you said, why it counted, or how to do better next time. So the design problem was not "score an interview." It was "make a machine's judgment something a person can inspect and argue with."

Five components carry that argument. Each one below is the real, shipped component, rendered inline from the production markup, not a mockup. Each is paired with the decision behind it: what I tried, what broke, and why the shipped version won. The sample content is the app's own demo interview for a Design Engineer screen, so what you see is what a hiring manager sees if they land on the scorecard cold.

Component 01 · Score Card, Confidence Meter

A score you can read at a glance, and one worth arguing with

The contrast with a bare "6.5" is the entire pitch. A score arrives with its state, its confidence, the weight it carries, and the reason it landed where it did. The most contestable score wears a coral frame, so the eye goes to the number most worth challenging.

The real shipped component, rendered inline from the production markup.

Component 02 · Evidence Drawer

Shows its work, made literal: the exact words behind the score

Open any score and it tells you the words that earned it. Each piece of evidence is a quote, the reason it moved the number, and a timestamp that replays the moment it was said. Direction is a glyph and a written note, never color on its own. Hover a quote and the same words light up in the transcript below, so the rationale and the source are unmistakably the same thing.

The decision: replay the words, not a face

Tried · A video clip on replay: The prototype played back a recorded clip of the candidate at the moment a score anchored. It felt impressive in a demo and it was wrong. The self view is local only and never leaves the device, and the product's whole claim is that you are judged on what you said.
Broke · The proof contradicted the promise: Showing a face on replay quietly reintroduced exactly the thing the product exists to remove. If the score is about words, the evidence has to be words.
Shipped · The honest moment: Replay shows the exact quote a score anchored to, the question it answered, when it was said, and what it matched. The honesty is stated in the frame. If a real self view clip is ever captured it can ride in the same stage. Until then there is nothing to fake.
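
One way to picture the data behind an evidence item, as a minimal sketch and not the production schema. Every name here (Evidence, Turn, locateQuote, and each field) is a hypothetical chosen for illustration. The point it shows is that a quote is only usable as evidence if it can be found verbatim in the answer it cites, which is also what would let a hovered quote highlight the same words in the transcript.

```typescript
// Hypothetical shapes for illustration only; not Criterion's actual schema.
type Direction = "helped" | "hurt";

interface Evidence {
  criterionId: string;   // which score this evidence feeds
  turnId: string;        // which transcript answer the quote came from
  quote: string;         // the exact words, copied verbatim from the answer
  reason: string;        // why the quote moved the number
  direction: Direction;  // shown as a glyph plus a written note, never color alone
  saidAtSeconds: number; // timestamp used to replay the moment
}

interface Turn {
  id: string;
  question: string;
  answer: string;
}

interface QuoteRange {
  turnId: string;
  start: number; // index of the first quoted character in the answer
  end: number;   // index just past the last quoted character
}

// Find the quoted words inside the answer the evidence points at.
// Returns null when the turn is missing or the quote is not verbatim,
// so evidence that cannot be traced to real words is never rendered.
function locateQuote(evidence: Evidence, turns: Turn[]): QuoteRange | null {
  for (const turn of turns) {
    if (turn.id !== evidence.turnId) continue;
    const start = turn.answer.indexOf(evidence.quote);
    if (start === -1) return null;
    return { turnId: turn.id, start, end: start + evidence.quote.length };
  }
  return null;
}
```

The returned range is enough to drive both the drawer and the transcript highlight from one lookup, so the rationale and the source cannot drift apart.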

Component 03 · Challenge Panel, Challenge Thread

The signature move: contest a score, watch it re-evaluate in the open

This is the product's whole argument made operable. If you disagree, you say why, and the model re-evaluates against the current value. It can raise, hold, lower, or escalate to a human. The full contest record stays on the page. Nothing is rewritten in the dark.

The decision: make disagreement a first class path

Tried · A feedback form: The obvious version was a "report this score" link that filed a complaint into the void. It let a person vent without changing anything, which is worse than no path at all. It promises a voice and delivers a suggestion box.
Broke · Trust needs a real consequence: A contestable score that never actually moves is theater. For the argument to mean anything, the model has to re-evaluate live and be willing to change its mind on the record.
Shipped · A bounded, open re-evaluation: Every contest re-scores against the current value, shows its reasoning, and appends to a visible chain. It can raise, hold, lower, or hand off to a human. The cap and the escalation are the two honest limits that keep it from becoming a haggle.
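
The contest loop described above can be sketched as an append-only thread. This is an illustration under stated assumptions, not the shipped logic: the names (ContestedScore, ChallengeEntry, Reevaluation, contest) are hypothetical, the cap is assumed to be a limit on rounds per score with a made-up value of 3, and reaching the cap is assumed to hand off to a human. The source does not specify any of those details.

```typescript
// Hypothetical model of a contest chain; names and the cap are assumptions.
type Outcome = "raise" | "hold" | "lower" | "escalate";

interface ChallengeEntry {
  argument: string;    // what the candidate said when contesting
  outcome: Outcome;
  scoreBefore: number;
  scoreAfter: number;
  reasoning: string;   // shown on the page next to the outcome
}

interface ContestedScore {
  value: number;
  thread: ReadonlyArray<ChallengeEntry>; // earlier entries are never rewritten
}

// What a re-evaluation may return: a new value, or a hand-off to a human.
type Reevaluation =
  | { kind: "rescored"; newValue: number; reasoning: string }
  | { kind: "escalate"; reasoning: string };

const MAX_ROUNDS = 3; // assumed cap; the real limit is not stated in the source

function contest(
  score: ContestedScore,
  argument: string,
  reevaluate: (current: number, argument: string) => Reevaluation
): ContestedScore {
  const before = score.value;
  let entry: ChallengeEntry;

  if (score.thread.length >= MAX_ROUNDS) {
    // Cap reached: stop re-scoring and hand off instead of haggling.
    entry = {
      argument,
      outcome: "escalate",
      scoreBefore: before,
      scoreAfter: before,
      reasoning: "Round cap reached; handed to a human reviewer.",
    };
  } else {
    // Always re-score against the current value, not the original one.
    const result = reevaluate(before, argument);
    if (result.kind === "escalate") {
      entry = {
        argument,
        outcome: "escalate",
        scoreBefore: before,
        scoreAfter: before,
        reasoning: result.reasoning,
      };
    } else {
      // Derive the label from the numbers so the two can never disagree.
      const outcome: Outcome =
        result.newValue > before ? "raise" : result.newValue < before ? "lower" : "hold";
      entry = {
        argument,
        outcome,
        scoreBefore: before,
        scoreAfter: result.newValue,
        reasoning: result.reasoning,
      };
    }
  }

  // Return a new record with the entry appended; the old thread is untouched.
  return { value: entry.scoreAfter, thread: score.thread.concat([entry]) };
}
```

Deriving the outcome label from the before and after values, instead of trusting a label the model supplies, keeps the visible chain internally consistent.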

Component 04 · Interview Stage, Live Read

A live interview that shows its read while you are still in it

The interview does not hide the machine until the end. You answer in text, and a running read updates on submit, scored against the same criteria the final card uses. It is labeled provisional on purpose, and it says so plainly: the read is reconfirmed at the end against the full transcript. The feed keeps the whole exchange in view, so nothing about how you are being read is happening off screen.

Component 05 · Debrief Timeline

The interview replayed, wired to the scores

The scorecard answers "what did each criterion earn." The debrief answers the more useful question: "what did each thing I said drive." Every answer carries chips for the criteria it fed, marked as helped or hurt, and each chip jumps to that score above. It is the same evidence graph the scorecard renders, walked from the transcript side, so an answer can never claim to have driven a score the cards do not also show.

The decision: one evidence graph, two directions

Tried · A separate debrief summary: Early on the debrief was its own narrative, written after the fact. It read well and it could drift. A summary that is authored separately can claim an answer mattered in ways the scores never reflected.
Broke · Two stories that could disagree: The moment the debrief and the scorecard are generated independently, they can contradict each other, and the whole legibility claim falls apart.
Shipped · One graph, walked both ways: The debrief is the exact evidence graph the scorecard uses, traversed from the transcript side. A turn's chips are derived, not written, so they can only point at scores the cards already show. The two views are the same truth, read from opposite ends.
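
The "derived, not written" idea can be sketched with a single list of edges read from either end. This is a minimal illustration, assuming a flat edge list; EvidenceEdge, Chip, evidenceForCriterion, and chipsForTurn are hypothetical names, not the production code.

```typescript
// Hypothetical edge in the evidence graph: one answer fed one criterion.
type Direction = "helped" | "hurt";

interface EvidenceEdge {
  criterionId: string;
  turnId: string;
  direction: Direction;
}

interface Chip {
  criterionId: string;
  direction: Direction;
}

// Scorecard side: start at a criterion, collect the answers that fed it.
function evidenceForCriterion(edges: EvidenceEdge[], criterionId: string): EvidenceEdge[] {
  return edges.filter((edge) => edge.criterionId === criterionId);
}

// Debrief side: start at an answer, collect the criteria it fed.
// Chips are computed from the same edges, so a chip can only exist
// for a score the cards also show.
function chipsForTurn(edges: EvidenceEdge[], turnId: string): Chip[] {
  const chips: Chip[] = [];
  for (const edge of edges) {
    if (edge.turnId !== turnId) continue;
    const alreadyListed = chips.some(
      (chip) => chip.criterionId === edge.criterionId && chip.direction === edge.direction
    );
    if (!alreadyListed) {
      chips.push({ criterionId: edge.criterionId, direction: edge.direction });
    }
  }
  return chips;
}
```

Because neither view stores its own copy of the relationships, there is no second story to keep in sync.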

Honest about a new product

How I will know it is working

Criterion is live and early. I would rather state what I am watching for than borrow a number that does not exist yet. These are the signals that tell me the thesis holds in real use, not in a demo. This is the measurement plan, written before the data, because what you choose to measure is itself a design decision.

Legibility

Do people open the evidence?

If the words behind a score go unread, the glass box is decorative. I want to see evidence drawers opened on a real share of scores, not just the low ones.

Accountability

Do contested scores change on their merits?

A challenge path only earns trust if it sometimes moves the number for a good reason. I am watching the split of raised, held, lowered, and escalated, and reading the arguments behind each.

Completion

Do people reach the debrief?

The value lands in the debrief, where answers connect to scores. If people stop at the number, the most useful part of the product is not being seen.

Trust

Do people act on the result?

The real test is whether someone trusts the verdict enough to change how they prepare, or to save the scorecard and hand it on. That is the outcome the whole design is built to earn.
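
For the accountability signal, the split of outcomes is simple to compute once contests are logged. The sketch below assumes each resolved contest is recorded as one of four labels; the names (Outcome, OutcomeSplit, outcomeSplit) are hypothetical and the function is illustrative, not an existing analytics pipeline.

```typescript
// Hypothetical labels for a resolved contest, mirroring the four outcomes.
type Outcome = "raised" | "held" | "lowered" | "escalated";

const OUTCOMES: Outcome[] = ["raised", "held", "lowered", "escalated"];

interface OutcomeSplit {
  total: number;
  counts: Record<Outcome, number>;
  shares: Record<Outcome, number>; // fraction of all contests, 0 when there are none
}

// Tally how contests resolved and what share each outcome holds.
function outcomeSplit(resolved: Outcome[]): OutcomeSplit {
  const counts: Record<Outcome, number> = { raised: 0, held: 0, lowered: 0, escalated: 0 };
  for (const outcome of resolved) {
    counts[outcome] += 1;
  }

  const total = resolved.length;
  const shares: Record<Outcome, number> = { raised: 0, held: 0, lowered: 0, escalated: 0 };
  for (const outcome of OUTCOMES) {
    // Guard against division by zero before any contests exist.
    shares[outcome] = total === 0 ? 0 : counts[outcome] / total;
  }

  return { total, counts, shares };
}
```

The numbers only say how often the score moved. Whether it moved for a good reason still takes reading the arguments behind each contest.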

v1 to v12

Sharpening one thesis, not stacking features

Every version answered the same question more honestly than the last: can a person read this judgment and argue with it? The work was subtraction as often as addition. Each step below maps to a component you just saw.

early

The bare number. A score and a one line reason. Clean, and useless for learning. This is the version the rest of the product argues against.

mid

Evidence on tap. Scores started linking to the exact words. The Evidence Drawer and the transcript highlight arrived together. The replay was a video clip here, which I later cut.

mid

Contest, for real. The Challenge Panel turned disagreement into a live re-evaluation with an open chain. The cap and the human escalation came in here, to keep it honest.

now

Honest presence and one graph. The interviewer became an orb, the video replay became a transcript moment, and the debrief was rebuilt as the same evidence graph walked from the transcript side. Less faked, more legible.

Explore more work

More explorations from the AI Product Design Lab, each a different facet of making AI products people can direct, verify, supervise, and trust.

Steer, intent before generation

Turn an underspecified prompt into a negotiated brief: the model surfaces what it inferred and flags ambiguity before it commits.

Ground, verify what AI claims

Every claim traceable to a source with confidence and freshness. Unsupported claims flagged. Source conflicts shown, not smoothed over.

Recall, legible AI memory

A memory layer you can see, attribute, edit, scope, and revoke. Personalization as a negotiated, inspectable thing, not a black box.

Explore the AI Product Design Lab