Independent AI-tool verdicts

We test the robots so you don't get burned.

A review site for AI automation tools, and a study in building a score nobody can bribe.

Most "best AI tool" lists are just affiliate links in a lab coat. I built the opposite. Every tool runs the same ten real jobs, earns a Dread Score from 0 to 100, and gets called Slop, Mixed, or Certified. Lie to a customer, fake your testimonials, or trap people on the way out, and you get capped no matter how slick the demo. And because the whole thing runs on affiliate money, I made the honesty impossible to quietly switch off. It lives in the build, not the masthead.


5 axes · scored 0 to 100 · 10 real jobs · 3 verdict bands

Verdict Console: pick a tool

live bench data · 10-lead battery

0 · SLOP · 40 · MIXED · 70 · CERTIFIED · 100

CERTIFIED Lead Automation

Apten

Best of the bunch for SMS lead response. It read messy intent and answered like it actually understood, with almost no setup.

Real verdicts, straight off the bench. Open the full app ↗

5 scoring axes, equally weighted

100 point Dread Score scale

10 real jobs in every battery

4 categories live on the bench

01 · The problem

The "best AI tools" web is a paid endorsement in a lab coat

Search any task, say "best AI receptionist" or "best lead response bot," and the first page is a wall of listicles that all somehow rank the product with the fattest commission first. The scores are vibes. The "we tested 40 tools" claim is really just a spreadsheet of pricing pages. And the disclosure, if there is one, is a grey line buried in the footer.

So buyers can't tell a tool that replaces the work from one that just produces confident slop and a monthly invoice. The opening here wasn't a prettier listicle. It was a review product whose credibility a reader could actually verify, and that an owner with an affiliate incentive couldn't quietly compromise. Those two constraints drove every design and engineering decision that follows.

02 · The idea

Snark is cheap. A score you can check is the hard part.

Anyone can dunk on a bad AI tool. The hard part, and the thing that makes the dunk worth trusting, is a standard you can hold me to. The voice is blunt and a little mean on purpose, but it only lands because there is a real method underneath it. Three rules keep me honest.

Run the actual task

No reviews off a spec sheet. Every tool gets a real job to do, and I watch whether it succeeds or fails on the same job, every time.

Score what matters

Five axes: speed, reply quality, handoff, setup, and price against value. The things that actually decide whether a tool earns its keep.

Can't be bought

Affiliate money never moves a score. The tools that fail get named, and the no-selling rule is enforced in the codebase, not just the masthead.

STEP 01

Run

The same practical battery of ten real jobs, fed to every tool in a category.

STEP 02

Score

Five axes, each scored 0 to 10. Summed and doubled into a Dread Score from 0 to 100.

STEP 03

Cap

Disqualifiers override the arithmetic. A tool that lies can't out-run honesty.

STEP 04

Verdict

Slop, Mixed, or Certified. Published as typed data, never as a paid favour.

03 · The methodology

How a verdict is actually made

Most review sites wave their hands here. I do the opposite. The rubric is public, the math shows up on every review, and the axis names on a tool's scorecard are the exact same words as the ones on the methodology page. The verdict and the standard speak the same language on purpose, so you can check my work instead of taking my word for it.

The bench: one battery, run against everyone

A category isn't scored on marketing claims. It's scored on a fixed battery of ten realistic jobs. For lead response tools, that means ten inbound leads with messy, real intent, fed identically to every contender. Same inputs, same rubric, same reviewer. Holding the test steady is what lets two scores actually be compared, and it's why the homepage stamps every verdict with where it came from: benched · 10-lead battery, not "based on our research."

The five axes

Each axis is scored 0 to 10 against an explicit anchor that spells out what earns a zero and what earns a ten, so a score is a judgement against a definition rather than a mood. The five carry equal weight, because for an automation tool a failure on any single one is enough to make it not worth running.

Speed to first reply 0 to 10 · ×2

How fast the tool gets a real answer in front of the lead. A 0 means minutes late or never. A 10 means it is on it before a human could open the tab.

Quality of replies 0 to 10 · ×2

Does it read messy intent and reply like it understood? A 0 is generic template spam. A 10 is correct, specific, and on brand.

Setup & onboarding 0 to 10 · ×2

How much pain it takes to reach first value. A 0 needs a sales call and a week of config. A 10 is useful inside an afternoon, on your own.

Handoff to humans 0 to 10 · ×2

Does it know when it is out of its depth and pass a hot lead cleanly? A 0 stonewalls a buyer. A 10 escalates with full context.

Price vs. value 0 to 10 · ×2

Is the monthly price worth the work it actually removes? A 0 is priced like a person it can't replace. A 10 is obvious ROI.

The arithmetic

Five axes, ten points each, add up to a maximum of 50. Double that for a familiar scale of 0 to 100. Equal weighting, no secret multipliers. The formula on the methodology page is the formula on every single review.

A worked scorecard

The math, made visible. This is the same scorecard that runs on a real review, redrawn here. The numbers below stand in for a top-of-category result.

Scorecard: illustrative Certified result

5 axes · 0 to 10 each

Speed to first reply: 10 / 10

Quality of replies: 9 / 10

Setup & onboarding: 9 / 10

Handoff to humans: 8 / 10

Price vs. value: 8 / 10

10 + 9 + 9 + 8 + 8 = 44 → 44 × 2 = 88 / 100 → CERTIFIED
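As a rough sketch, the arithmetic above could be written in TypeScript like this. The type and function names are hypothetical and not taken from the site's codebase; only the rule itself (five axes, 0 to 10 each, summed and doubled) comes from the methodology.

```typescript
// Hypothetical names; only the sum-and-double rule comes from the write-up.
type AxisKey = "speed" | "quality" | "setup" | "handoff" | "priceValue";

type AxisScores = Record<AxisKey, number>;

const AXES: readonly AxisKey[] = ["speed", "quality", "setup", "handoff", "priceValue"];

function dreadScore(scores: AxisScores): number {
  let sum = 0;
  for (const axis of AXES) {
    const value = scores[axis];
    // Every axis is scored against its anchor on a 0 to 10 scale.
    if (!Number.isFinite(value) || value < 0 || value > 10) {
      throw new RangeError(`Axis "${axis}" must be between 0 and 10, got ${value}`);
    }
    sum += value;
  }
  // Five axes of ten points sum to at most 50; doubling gives 0 to 100.
  return sum * 2;
}

// The illustrative scorecard: 10 + 9 + 9 + 8 + 8 = 44, doubled to 88.
const illustrativeCard: AxisScores = { speed: 10, quality: 9, setup: 9, handoff: 8, priceValue: 8 };
const illustrativeScore = dreadScore(illustrativeCard);
```

Equal weighting shows up here as the absence of any per-axis multiplier: every axis goes through the same loop.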

The bands

The 0 to 100 number resolves into one of three verdicts. The thresholds are fixed and public, so a score never gets nudged across a band just to be polite.

0 · 40 · 70 · 100

SLOP · 0 to 39 Overpriced, oversold, or just worse than the free thing it replaces.

MIXED · 40 to 69 Does a real job, but with caveats you need to know before you pay.

CERTIFIED · 70 to 100 Actually replaces the work. We'd run it ourselves.
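The fixed thresholds translate into a small lookup. This is a minimal sketch with hypothetical names (`Band`, `bandFor`); the cut-offs at 40 and 70 are the ones published above.

```typescript
type Band = "SLOP" | "MIXED" | "CERTIFIED";

// Thresholds as published: 0 to 39 Slop, 40 to 69 Mixed, 70 to 100 Certified.
function bandFor(score: number): Band {
  if (!Number.isFinite(score) || score < 0 || score > 100) {
    throw new RangeError(`Dread Score must be between 0 and 100, got ${score}`);
  }
  if (score >= 70) return "CERTIFIED";
  if (score >= 40) return "MIXED";
  return "SLOP";
}
```

Because the thresholds are hard-coded comparisons rather than editable settings, a 69 stays Mixed however close it sits to the line.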

04 · Hard caps

When the arithmetic is too kind

A plain sum across five axes has a blind spot. A tool can be fast, cheap, and slick while doing something genuinely disqualifying. So the model carries a short list of hard caps: behaviours that, seen even once, override the raw math and drop the verdict straight to Slop. No amount of speed buys these back.

Invents facts, prices, or availability and states them as true. A bot that lies to a lead is worse than no bot.

Traps a ready-to-buy customer in a loop with no path to a person. Lost revenue, dressed up as automation.

Easy to start, deliberately painful to leave. A red flag about how the company treats you once it has your card.

Fabricated reviews or invented logos. If the marketing is fake, the product claims can't be trusted either.

When a cap fires, the review doesn't hide it. It shows the raw score, strikes it through, and prints the reason, so you see both the math and the override. For example:

Raw score: 62 · Capped: 30 · SLOP · Hard cap: confidently fabricated answers
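One way the override could be modelled is sketched below. The cap identifiers and the `CAP_CEILING` constant are assumptions: the write-up shows only a single example (62 capped to 30), not a general ceiling, so 30 is borrowed from that example and may not be the real rule.

```typescript
type Band = "SLOP" | "MIXED" | "CERTIFIED";

// Hypothetical identifiers for the four disqualifiers listed above.
type HardCap =
  | "fabricated-answers"
  | "no-path-to-human"
  | "painful-to-leave"
  | "fake-social-proof";

interface Verdict {
  rawScore: number;                 // kept so the review can show it struck through
  finalScore: number;
  band: Band;
  capReasons: readonly HardCap[];   // printed next to the override
}

// Assumption: taken from the single 62 -> 30 example, not a documented constant.
const CAP_CEILING = 30;

function bandFor(score: number): Band {
  if (score >= 70) return "CERTIFIED";
  if (score >= 40) return "MIXED";
  return "SLOP";
}

function applyHardCaps(rawScore: number, caps: readonly HardCap[]): Verdict {
  // A single observed disqualifier is enough; a cap never raises a score.
  const finalScore = caps.length > 0 ? Math.min(rawScore, CAP_CEILING) : rawScore;
  return { rawScore, finalScore, band: bandFor(finalScore), capReasons: caps };
}
```

Returning the raw score alongside the capped one is what lets a review show both the math and the override instead of quietly replacing one with the other.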

05 · The signature element

The Verdict Console

One element does the heavy lifting on every page. It is a diagnostic-style meter that animates the Dread Score, names the band, and shows where it lands on the scale from 0 to 100. It is the first thing a visitor sees on the homepage, and the anchor of every review. The goal was to read the verdict instantly, before any prose, without giving up accessibility.

Hand-built, with no charting library. Just an SVG arc and a CSS-driven fill, sized in tokens.
The number and band are real text, not baked into the graphic, so a screen reader announces "88 out of 100, Certified" instead of a mystery image.
Honors prefers-reduced-motion: the meter renders at its final value instead of sweeping for anyone who's asked motion to stop.
Band colour comes from the same Slop, Mixed, and Certified tokens used everywhere, so it stays consistent and readable to AA contrast in both themes.
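The points above can be sketched as a pure function that works out what the meter should render. This assumes the arc is filled with the common `stroke-dasharray` / `stroke-dashoffset` technique, which the write-up does not confirm, and every name here is hypothetical. In a browser, the `reducedMotion` argument would typically come from `window.matchMedia("(prefers-reduced-motion: reduce)").matches`.

```typescript
interface MeterView {
  label: string;        // real text, so assistive tech reads the verdict
  dashArray: number;    // full length of the arc path
  dashOffset: number;   // unfilled remainder at the final value
  startOffset: number;  // offset the arc starts from before any sweep
}

function meterView(
  score: number,
  bandLabel: string,
  arcLength: number,
  reducedMotion: boolean,
): MeterView {
  if (!Number.isFinite(score) || score < 0 || score > 100) {
    throw new RangeError(`Score must be between 0 and 100, got ${score}`);
  }
  // A score of 100 leaves no offset (fully filled); 0 leaves the whole arc empty.
  const finalOffset = (arcLength * (100 - score)) / 100;
  return {
    label: `${score} out of 100, ${bandLabel}`,
    dashArray: arcLength,
    dashOffset: finalOffset,
    // With reduced motion there is nothing to sweep: start at the final value.
    startOffset: reducedMotion ? finalOffset : arcLength,
  };
}
```

Keeping the label as a plain string, separate from the arc geometry, is what allows the number and band to live in the DOM as text rather than inside the graphic.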

Dread Meter

benched

0 · SLOP · 40 · MIXED · 70 · CERTIFIED · 100

The same meter that fronts every review, and the one up top in the hero, where you can actually poke it.

06 · Integrity as engineering

A method you can't quietly break

Here is the hard part nobody designs for. The person with the most reason to fudge a score is the owner, and the owner has commit access. An editorial promise in the footer is only as strong as the next late-night temptation to mark a high-commission tool as "Certified." So the no-fabrication rule isn't a policy document. It is an invariant enforced at build time.

Reviews are typed data, not free prose. Every record carries explicit flags, isExample and sample, and a guardrail runs as the data module loads. Anything seeded, illustrative, or unverified gets stripped from the published set, and if a placeholder ever reaches a live route, the build throws. You can't ship a fake verdict without breaking the deploy. A reviewer's good intentions are no longer holding the line. The build is.

next build

$ next build
  ▲ Next.js 14.2 · compiling…
  ✓ Linting and checking validity of types
  Collecting page data …
  ✗ Error: [launch guardrail] Review "demo-tool" reached the
    published set with sample !== false.
      at assertPublishable (lib/reviews.ts)
  Build failed. No unverified verdict was published.
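A guardrail of this kind might look roughly like the sketch below. Only the `isExample` and `sample` flags, the `assertPublishable` name, and the error wording appear in the write-up; the record shape, the helper functions, and the way they are combined are assumptions for illustration.

```typescript
// Hypothetical record shape; the real content model is not shown in the write-up.
interface ReviewRecord {
  slug: string;
  dreadScore: number;
  isExample?: boolean;
  sample?: boolean;
}

// Verified only when both flags are explicitly false, so a record that
// forgot to set them is treated as unverified rather than waved through.
function isVerified(review: ReviewRecord): boolean {
  return review.isExample === false && review.sample === false;
}

// Step 1: strip anything seeded, illustrative, or unverified.
function publishedSet(all: readonly ReviewRecord[]): ReviewRecord[] {
  return all.filter(isVerified);
}

// Step 2: a last check for any code path that bypasses the filter.
function assertPublishable(review: ReviewRecord): void {
  if (review.sample !== false) {
    throw new Error(
      `[launch guardrail] Review "${review.slug}" reached the published set with sample !== false.`,
    );
  }
  if (review.isExample !== false) {
    throw new Error(
      `[launch guardrail] Review "${review.slug}" reached the published set with isExample !== false.`,
    );
  }
}

// Running both as the data module loads means a bad record fails the
// build during page-data collection instead of rendering.
function loadReviews(all: readonly ReviewRecord[]): ReviewRecord[] {
  const published = publishedSet(all);
  published.forEach(assertPublishable);
  return published;
}
```

The strict `!== false` comparison is the point of the design in this sketch: a missing flag counts as a failure, so forgetting to mark a record can only block a publish, never cause one.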

The same discipline runs quietly elsewhere. An unconfirmed bench date is normalized to null rather than shipping a fake one, and the review template simply won't render prose, pros, or cons it doesn't actually have. There are no placeholder gaps to backfill under pressure. The whole architecture makes honesty the path of least resistance.
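The date handling could be sketched as follows. This assumes "unconfirmed" means missing, placeholder, or not a real calendar date, and that dates are stored as `YYYY-MM-DD` strings; neither detail is stated in the write-up, and `normalizeBenchDate` is a hypothetical name.

```typescript
// Returns a confirmed ISO calendar date, or null. Never a guessed date.
function normalizeBenchDate(value: unknown): string | null {
  if (typeof value !== "string") return null;
  const trimmed = value.trim();
  // Accept only full calendar dates such as "2025-01-31".
  if (!/^\d{4}-\d{2}-\d{2}$/.test(trimmed)) return null;
  const parsed = new Date(`${trimmed}T00:00:00Z`);
  if (Number.isNaN(parsed.getTime())) return null;
  // Some engines roll impossible dates forward (Feb 30 becomes Mar 2),
  // so require the round trip to reproduce the input exactly.
  return parsed.toISOString().slice(0, 10) === trimmed ? trimmed : null;
}
```

A template that receives `null` here can simply omit the bench date line, which matches the rule of never rendering content the record doesn't have.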

07 · Architecture & reach

Typed verdicts that earn their search traffic

The same decision that makes the method enforceable, reviews as typed data, is also what makes the site rank. A single content model drives the on-page scorecard, the category hub pages, and the structured data, so the visible verdict and the machine-readable verdict can never drift apart.

Typed Review

One source of truth per tool, validated at build.

Static page

One fast, indexable route generated per verdict.

Review schema

JSON-LD that mirrors exactly what's on the page.

Rich snippet

The Dread Score maps to stars in the result.

The Dread Score from 0 to 100 maps to a 0 to 5 rating in Review structured data, so an honest, hands-on verdict can surface as a star rating in search, the same real estate the listicles fake. Category hubs, the dreaded tasks, are the programmatic layer. Each one becomes a "best AI tools for [task]" page, built from the same verdict data.
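The score-to-stars mapping is a division by 20. Here is a minimal sketch of building schema.org `Review` JSON-LD from a verdict; the function name, the choice of `SoftwareApplication` for `itemReviewed`, and rounding to one decimal are assumptions, not the site's confirmed markup.

```typescript
interface ReviewJsonLd {
  "@context": "https://schema.org";
  "@type": "Review";
  itemReviewed: { "@type": "SoftwareApplication"; name: string };
  reviewRating: { "@type": "Rating"; ratingValue: number; bestRating: number; worstRating: number };
  author: { "@type": "Organization"; name: string };
}

function reviewJsonLd(toolName: string, dreadScore: number, publisher: string): ReviewJsonLd {
  if (!Number.isFinite(dreadScore) || dreadScore < 0 || dreadScore > 100) {
    throw new RangeError(`Dread Score must be between 0 and 100, got ${dreadScore}`);
  }
  return {
    "@context": "https://schema.org",
    "@type": "Review",
    itemReviewed: { "@type": "SoftwareApplication", name: toolName },
    reviewRating: {
      "@type": "Rating",
      // 0 to 100 becomes 0 to 5, rounded to one decimal: 88 -> 4.4.
      ratingValue: Math.round(dreadScore / 2) / 10,
      bestRating: 5,
      worstRating: 0,
    },
    author: { "@type": "Organization", name: publisher },
  };
}

// Serialized for a <script type="application/ld+json"> tag.
const exampleJsonLd = JSON.stringify(reviewJsonLd("Example Tool", 88, "DreadRobot"));
```

Deriving `ratingValue` from the same score that drives the on-page scorecard is what keeps the visible verdict and the structured data from drifting apart.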

dreadrobot.com › reviews › apten

Apten review: DreadRobot Certified

★★★★☆ Rating 4.4/5 · DreadRobot

Top overall SMS lead agent. Parsed complex intent with zero friction. Benched on the 10-lead battery.

How a verdict can show up in search, built from the Review schema. Whether Google actually draws the stars is always up to Google.

08 · Monetization, honestly

Affiliate revenue without selling the score

DreadRobot makes its money on affiliate links, which is exactly the conflict of interest that makes other review sites untrustworthy. My answer was to make the money loud and keep the incentive structurally separate from the verdict.

The only sanctioned outbound link

Affiliate links live in one audited component. Every one of them renders rel="sponsored nofollow", so the honesty reaches search engines, not just readers.

Disclosure that can't be forgotten

The FTC affiliate disclosure renders automatically whenever a review has an affiliate relationship. A data flag drives it, so it can't be left off by accident.

The rule, in one line

A tool either earns its band on the bench or it doesn't. An affiliate deal changes the disclosure you see, never the score. The Slop verdicts stay Slop, affiliate link or not.
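The separation between money and score can be sketched in data terms. In this hypothetical model, the affiliate fields drive the outbound link and the disclosure, and the scoring code has no access to them. The field names and the disclosure wording are illustrative assumptions; only the `rel="sponsored nofollow"` value and the flag-driven disclosure come from the text above.

```typescript
// Hypothetical affiliate fields, kept apart from any scoring input.
interface AffiliateInfo {
  toolName: string;
  hasAffiliate: boolean;
  affiliateUrl?: string;
}

interface OutboundLink {
  href: string;
  rel: "sponsored nofollow";
}

// The single place an outbound affiliate link is built.
function affiliateLink(info: AffiliateInfo): OutboundLink | null {
  if (!info.hasAffiliate || info.affiliateUrl === undefined) return null;
  return { href: info.affiliateUrl, rel: "sponsored nofollow" };
}

// Driven by the data flag, so it cannot be left off by accident.
function disclosureFor(info: AffiliateInfo): string | null {
  if (!info.hasAffiliate) return null;
  // Illustrative wording only, not the site's actual disclosure text.
  return `Disclosure: we may earn a commission if you buy ${info.toolName} through our link. It never changes the score.`;
}
```

Neither function takes or returns a score, which is the structural version of the rule: an affiliate deal can change what is disclosed, but nothing here can touch the verdict.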

09 · The look

Cold steel, one hot signal

The look carries the voice. Cool blue-black steel surfaces sit on a faint blueprint grid, with one hot orange accent saved for the things that matter: the verdict, the CTA, the warning. It feels like an instrument panel that says "this was measured," not "this was vibed."

steel / void · hazard #FF5A2C · certified · mixed · slop · Space Grotesk / IBM Plex

Dark by default, themed

Dark is the home turf, and a light theme is a first-class override. Every colour is a token that resolves per theme.

Tokens that survive opacity

Each colour also ships as raw RGB channels, so translucent tints follow the theme instead of hard-coding a second palette.

A diagnostic voice

Mono eyebrows and labels everywhere, the readout texture that makes a verdict feel like an instrument reading.
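The "raw RGB channels" idea from the tokens card can be illustrated with two small helpers. The `-rgb` variable suffix and both function names are assumptions; the hex value is the hazard accent listed above. The output format (space-separated channels consumed as `rgb(var(--token) / alpha)`) is standard CSS colour syntax.

```typescript
// Converts "#FF5A2C" into space-separated channels: "255 90 44".
function hexToChannels(hex: string): string {
  const match = /^#?([0-9a-f]{2})([0-9a-f]{2})([0-9a-f]{2})$/i.exec(hex);
  if (match === null) {
    throw new Error(`Expected a 6-digit hex colour, got "${hex}"`);
  }
  return match
    .slice(1)
    .map((pair) => parseInt(pair, 16))
    .join(" ");
}

// Builds a translucent tint that resolves against the active theme's
// channel variable, e.g. --hazard-rgb (hypothetical naming).
function tint(tokenName: string, alpha: number): string {
  if (!Number.isFinite(alpha) || alpha < 0 || alpha > 1) {
    throw new RangeError(`Alpha must be between 0 and 1, got ${alpha}`);
  }
  return `rgb(var(--${tokenName}-rgb) / ${alpha})`;
}

const hazardChannels = hexToChannels("#FF5A2C"); // "255 90 44"
const hazardTint = tint("hazard", 0.2);
```

Because the tint references the variable rather than a fixed colour, switching themes changes the channels underneath and every translucent use follows without a second palette.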

10 · What I owned

A solo build: design, engineering, and the standard itself

DreadRobot is a solo project: the brand and voice, the scoring method, the design system, and the Next.js build. The interesting work wasn't any one of those. It was making them reinforce each other, so the credibility a reader feels is the same credibility the codebase enforces.

Designed the five-axis Dread Score, the bands, and the hard-cap model

Built the Verdict Console and Dread Meter as accessible SVG and CSS, no libraries

Modelled reviews as typed data with an integrity guardrail at build time

Wired the JSON-LD Review schema and the programmatic category hub architecture

Designed honest monetization: one audited affiliate component plus an auto-rendered disclosure

Authored the brand voice, the token system, and the dark-by-default theme

Status: live

It is up at dreadrobot.com on Next.js, TypeScript, Tailwind, and Vercel, with the first tools on the bench across four categories. Four came back Certified, one got named Slop. The same discipline this case study describes is wired into the deploy: nothing ships as a verdict unless it has been earned.
