We test the robots so you don't get burned.
A review site for AI automation tools, and a study in building a score nobody can bribe.
Most "best AI tool" lists are just affiliate links in a lab coat. I built the opposite. Every tool runs the same ten real jobs, earns a Dread Score from 0 to 100, and gets called Slop, Mixed, or Certified. Lie to a customer, fake your testimonials, or trap people on the way out, and you get capped no matter how slick the demo. And because the whole thing runs on affiliate money, I made the honesty impossible to quietly switch off. It lives in the build, not the masthead.
Verdict Console: pick a tool
Apten
Best of the bunch for SMS lead response. It read messy intent and answered like it actually understood, with almost no setup.
Real verdicts, straight off the bench.
The "best AI tools" web is a paid endorsement in a lab coat
Search any task, say "best AI receptionist" or "best lead response bot," and the first page is a wall of listicles that all somehow rank the product with the fattest commission first. The scores are vibes. The "we tested 40 tools" claim is really just a spreadsheet of pricing pages. And the disclosure, if there is one, is a grey line buried in the footer.
So buyers can't tell a tool that replaces the work from one that just produces confident slop and a monthly invoice. The opening here wasn't a prettier listicle. It was a review product whose credibility a reader could actually verify, and that an owner with an affiliate incentive couldn't quietly compromise. Those two constraints drove every design and engineering decision that follows.
Snark is cheap. A score you can check is the hard part.
Anyone can dunk on a bad AI tool. The hard part, and the thing that makes the dunk worth trusting, is a standard you can hold me to. The voice is blunt and a little mean on purpose, but it only lands because there is a real method underneath it. Three rules keep me honest.
Run the actual task
No reviews off a spec sheet. Every tool gets a real job to do, and I watch whether it succeeds or fails on the same job, every time.
Score what matters
Five axes: speed, reply quality, setup, handoff, and price against value. These are the things that actually decide whether a tool earns its keep.
Can't be bought
Affiliate money never moves a score. The tools that fail get named, and the no-selling rule is enforced in the codebase, not just the masthead.
Run
The same practical battery of ten real jobs, fed to every tool in a category.
Score
Five axes, each scored 0 to 10. Summed and doubled into a Dread Score from 0 to 100.
Cap
Disqualifiers override the arithmetic. A tool that lies can't out-run honesty.
Verdict
Slop, Mixed, or Certified. Published as typed data, never as a paid favour.
How a verdict is actually made
Most review sites wave their hands here. I do the opposite. The rubric is public, the math shows up on every review, and the axis names on a tool's scorecard are the exact same words as the ones on the methodology page. The verdict and the standard speak the same language on purpose, so you can check my work instead of taking my word for it.
The bench: one battery, run against everyone
A category isn't scored on marketing claims. It's scored on a fixed battery of ten realistic jobs. For lead response tools, that means ten inbound leads with messy, real intent, fed identically to every contender. Same inputs, same rubric, same reviewer. Holding the test steady is what lets two scores actually be compared, and it's why the homepage stamps every verdict with where it came from: benched · 10-lead battery, not "based on our research."
The five axes
Each axis is scored 0 to 10 against an explicit anchor that spells out what earns a zero and what earns a ten, so a score is a judgement against a definition rather than a mood. The five carry equal weight, because for an automation tool a failure on any single one is enough to make it not worth running.
Speed to first reply 0 to 10 · ×2
How fast the tool gets a real answer in front of the lead. A 0 means minutes late or never. A 10 means it is on it before a human could open the tab.
Quality of replies 0 to 10 · ×2
Does it read messy intent and reply like it understood? A 0 is generic template spam. A 10 is correct, specific, and on brand.
Setup & onboarding 0 to 10 · ×2
How much pain it takes to reach first value. A 0 needs a sales call and a week of config. A 10 is useful inside an afternoon, on your own.
Handoff to humans 0 to 10 · ×2
Does it know when it is out of its depth and pass a hot lead cleanly? A 0 stonewalls a buyer. A 10 escalates with full context.
Price vs. value 0 to 10 · ×2
Is the monthly price worth the work it actually removes? A 0 is priced like a person it can't replace. A 10 is obvious ROI.
The arithmetic
Five axes, ten points each, add up to a maximum of 50. Double that for a familiar scale of 0 to 100. Equal weighting, no secret multipliers. The formula on the methodology page is the formula on every single review.
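A minimal sketch of that arithmetic, assuming each axis arrives as a number from 0 to 10. The property names and the per-axis values in the example are illustrative, not lifted from a real review.

```typescript
// Property names are assumptions that mirror the five axes in the rubric.
type AxisScores = {
  speed: number;
  quality: number;
  setup: number;
  handoff: number;
  priceValue: number;
};

// Sum five 0-to-10 axes (max 50) and double onto the 0-to-100 scale.
// Equal weighting: no axis gets its own multiplier.
function dreadScore(axes: AxisScores): number {
  const values = [axes.speed, axes.quality, axes.setup, axes.handoff, axes.priceValue];
  for (const v of values) {
    if (!isFinite(v) || v < 0 || v > 10) {
      throw new RangeError(`Axis score out of range: ${v}`);
    }
  }
  const sum = values.reduce((total, v) => total + v, 0);
  return sum * 2;
}

// Illustrative numbers only: 9 + 9 + 9 + 8 + 9 = 44, doubled to 88.
const example: AxisScores = { speed: 9, quality: 9, setup: 9, handoff: 8, priceValue: 9 };
console.log(dreadScore(example)); // 88
```

Rejecting out-of-range input rather than clamping it is a choice made for this sketch: a bad axis value should stop the build, not quietly become a 10.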
A worked scorecard
The math, made visible. This is the same scorecard that runs on a real review, redrawn here. The numbers below stand in for a top-of-category result.
Scorecard: illustrative Certified result
The bands
The 0 to 100 number resolves into one of three verdicts. The thresholds are fixed and public, so a score never gets nudged across a band just to be polite.
When the arithmetic is too kind
A plain average across five axes has a blind spot. A tool can be fast, cheap, and slick while doing something genuinely disqualifying. So the model carries a short list of hard caps: behaviours that, seen even once, override the raw math and drop the verdict straight to Slop. No amount of speed buys these back.
When a cap fires, the review doesn't hide it. It shows the raw score, strikes it through, and prints the reason, so you see both the math and the override.
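One way the override could be modelled, as a sketch rather than the site's actual code. The type names, the function names, and above all the threshold numbers are hypothetical: the real cut-offs are fixed on the methodology page and are not reproduced here.

```typescript
type Band = "Slop" | "Mixed" | "Certified";

// Hypothetical shape: the lowest score that earns each of the upper two bands.
type BandThresholds = { mixedFrom: number; certifiedFrom: number };

type Verdict = {
  rawScore: number;     // kept so the page can show it struck through
  band: Band;
  capped: boolean;
  capReasons: string[]; // printed next to the struck-through score
};

// Plain banding from the arithmetic alone.
function bandFor(score: number, t: BandThresholds): Band {
  if (score >= t.certifiedFrom) return "Certified";
  if (score >= t.mixedFrom) return "Mixed";
  return "Slop";
}

// Any disqualifier, seen even once, overrides the raw math and lands on Slop.
function resolveVerdict(rawScore: number, capReasons: string[], t: BandThresholds): Verdict {
  const capped = capReasons.length > 0;
  return {
    rawScore,
    band: capped ? "Slop" : bandFor(rawScore, t),
    capped,
    capReasons,
  };
}

// Placeholder thresholds for illustration only, NOT the published ones.
const hypotheticalThresholds: BandThresholds = { mixedFrom: 40, certifiedFrom: 70 };
const cappedVerdict = resolveVerdict(82, ["Fabricated customer testimonials"], hypotheticalThresholds);
console.log(cappedVerdict.band, cappedVerdict.rawScore); // Slop 82
```

Keeping `rawScore` on the verdict instead of overwriting it is what lets a review show both the math and the override.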
The Verdict Console
One element does the heavy lifting on every page. It is a diagnostic-style meter that animates the Dread Score, names the band, and shows where it lands on the scale from 0 to 100. It is the first thing a visitor sees on the homepage, and the anchor of every review. The goal was to read the verdict instantly, before any prose, without giving up accessibility.
- Hand-built, with no charting library. Just an SVG arc and a CSS-driven fill, sized in tokens.
- The number and band are real text, not baked into the graphic, so a screen reader announces "88 out of 100, Certified" instead of a mystery image.
- Honors prefers-reduced-motion: for anyone who has asked for motion to stop, the meter renders at its final value instead of sweeping.
- Band colour comes from the same Slop, Mixed, and Certified tokens used everywhere, so it stays consistent and meets AA contrast in both themes.
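The two accessibility behaviours in that list reduce to a pair of small pure functions. This is a sketch with hypothetical names; the real component's internals aren't shown in this write-up.

```typescript
type Band = "Slop" | "Mixed" | "Certified";

// Text a screen reader can announce. It lives in the DOM as real text,
// next to the SVG arc rather than inside it.
function meterLabel(score: number, band: Band): string {
  return `${score} out of 100, ${band}`;
}

// Value the arc starts from: 0 if it is going to sweep up to the score,
// or the final score straight away when the visitor has asked for reduced motion.
function initialMeterValue(score: number, prefersReducedMotion: boolean): number {
  return prefersReducedMotion ? score : 0;
}

// In a browser the flag would typically come from:
//   window.matchMedia("(prefers-reduced-motion: reduce)").matches
console.log(meterLabel(88, "Certified")); // 88 out of 100, Certified
console.log(initialMeterValue(88, true)); // 88
```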
Dread Meter
The same meter that fronts every review, and the one up top in the hero, where you can actually poke it.
A method you can't quietly break
Here is the hard part nobody designs for. The person with the most reason to fudge a score is the owner, and the owner has commit access. An editorial promise in the footer is only as strong as the next late-night temptation to mark a high-commission tool as "Certified." So the no-fabrication rule isn't a policy document. It is an invariant enforced at build time.
Reviews are typed data, not free prose. Every record carries explicit flags, isExample and sample, and a guardrail runs as the data module loads. Anything seeded, illustrative, or unverified gets stripped from the published set, and if a placeholder ever reaches a live route, the build throws. You can't ship a fake verdict without breaking the deploy. A reviewer's good intentions are no longer holding the line. The build is.
$ next build
▲ Next.js 14.2 · compiling…
✓ Linting and checking validity of types
Collecting page data …
✗ Error: [launch guardrail] Review "demo-tool" reached the published set with sample !== false.
    at assertPublishable (lib/reviews.ts)
Build failed. No unverified verdict was published.
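A guardrail of this kind can be sketched in a few lines. The flag names and `assertPublishable` come from the description and the log; the record shape, the `publishedReviews` helper, and the exact checks are assumptions, not the contents of lib/reviews.ts.

```typescript
// Assumed record shape. The flags are optional on purpose: a record that
// forgets to set them must not pass as publishable.
type Review = {
  slug: string;
  isExample?: boolean;
  sample?: boolean;
};

// Strip anything seeded or illustrative from the set that feeds live routes.
function publishedReviews(all: Review[]): Review[] {
  return all.filter((r) => r.isExample === false && r.sample === false);
}

// Throw if a placeholder still reaches the published set. Because this runs
// when the data module loads, a violation fails the build instead of shipping.
function assertPublishable(reviews: Review[]): void {
  for (const r of reviews) {
    if (r.sample !== false || r.isExample !== false) {
      const flag = r.sample !== false ? "sample" : "isExample";
      throw new Error(
        `[launch guardrail] Review "${r.slug}" reached the published set with ${flag} !== false.`
      );
    }
  }
}

// Hypothetical data: one real record, one demo placeholder.
const all: Review[] = [
  { slug: "apten", isExample: false, sample: false },
  { slug: "demo-tool", isExample: true, sample: true },
];
const live = publishedReviews(all);
assertPublishable(live); // passes: the placeholder was stripped first
console.log(live.map((r) => r.slug)); // [ 'apten' ]
```

Checking `!== false` rather than `=== true` means a missing flag counts as unverified, so the safe state has to be declared explicitly.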
The same discipline runs quietly elsewhere. An unconfirmed bench date is normalized to null rather than shipping a fake one, and the review template simply won't render prose, pros, or cons it doesn't actually have. There are no placeholder gaps to backfill under pressure. The whole architecture makes honesty the path of least resistance.
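The bench-date rule might look something like this. It is a sketch that treats "unconfirmed" as "missing or not a parseable YYYY-MM-DD string", which is an assumption: the real check could be stricter.

```typescript
// Return the date only when one is present and parseable; otherwise null,
// so the template renders no date rather than an invented one.
function normalizeBenchDate(value: string | null | undefined): string | null {
  if (value === null || value === undefined) return null;
  const trimmed = value.trim();
  // Require the plain ISO calendar form before trusting the parser.
  if (!/^\d{4}-\d{2}-\d{2}$/.test(trimmed)) return null;
  return isNaN(Date.parse(trimmed)) ? null : trimmed;
}

console.log(normalizeBenchDate("TBD"));     // null
console.log(normalizeBenchDate(undefined)); // null
```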
Typed verdicts that earn their search traffic
The same decision that makes the method enforceable, reviews as typed data, is also what makes the site rank. A single content model drives the on-page scorecard, the category hub pages, and the structured data, so the visible verdict and the machine-readable verdict can never drift apart.
Typed Review
One source of truth per tool, validated at build.
Static page
One fast, indexable route generated per verdict.
Review schema
JSON-LD that mirrors exactly what's on the page.
Rich snippet
The Dread Score maps to stars in the result.
The Dread Score from 0 to 100 maps to a 0 to 5 rating in Review structured data, so an honest, hands-on verdict can surface as a star rating in search, the same real estate the listicles fake. Category hubs, the dreaded tasks, are the programmatic layer. Each one becomes a "best AI tools for [task]" page, built from the same verdict data.
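A sketch of that mapping, under stated assumptions. The `Review` and `Rating` types and the `ratingValue`, `bestRating`, and `worstRating` properties are standard schema.org vocabulary. The function names, the choice of `SoftwareApplication` for `itemReviewed`, and the one-decimal rounding are guesses for illustration.

```typescript
// Map 0-to-100 onto 0-to-5, rounded to one decimal (88 becomes 4.4).
function toStarRating(dreadScore: number): number {
  return Math.round(dreadScore / 2) / 10;
}

// Build the JSON-LD object from the same values the page renders,
// so the visible verdict and the structured data cannot disagree.
function reviewJsonLd(toolName: string, dreadScore: number) {
  return {
    "@context": "https://schema.org",
    "@type": "Review",
    itemReviewed: { "@type": "SoftwareApplication", name: toolName }, // assumed item type
    author: { "@type": "Organization", name: "DreadRobot" },
    reviewRating: {
      "@type": "Rating",
      ratingValue: toStarRating(dreadScore),
      bestRating: 5,
      worstRating: 0,
    },
  };
}

// Serialized into a <script type="application/ld+json"> tag on the review page.
console.log(JSON.stringify(reviewJsonLd("Apten", 88), null, 2));
```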
dreadrobot.com › reviews › apten
Apten review: DreadRobot Certified
★★★★☆ Rating 4.4/5 · DreadRobot
Top overall SMS lead agent. Parsed complex intent with zero friction. Benched on the 10-lead battery.
How a verdict can show up in search, built from the Review schema. Whether Google actually draws the stars is always up to Google.
Affiliate revenue without selling the score
DreadRobot makes its money on affiliate links, which is exactly the conflict of interest that makes other review sites untrustworthy. My answer was to make the money loud and keep the incentive structurally separate from the verdict.
The only sanctioned outbound link
Affiliate links live in one audited component. Every one of them renders rel="sponsored nofollow", so the honesty reaches search engines, not just readers.
Disclosure that can't be forgotten
The FTC affiliate disclosure renders automatically whenever a review has an affiliate relationship. A data flag drives it, so it can't be left off by accident.
A tool either earns its band on the bench or it doesn't. An affiliate deal changes the disclosure you see, never the score. The Slop verdicts stay Slop, affiliate link or not.
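The separation can be sketched as data in, markup attributes out. The field and function names here are hypothetical. The point is only that the link attributes and the disclosure both derive from one flag, and that neither function ever sees the score.

```typescript
// Assumed fields on a review record. Note there is no score in here.
type AffiliateInfo = {
  hasAffiliate: boolean;
  affiliateUrl?: string;
};

// The single sanctioned way to produce outbound affiliate link attributes.
function affiliateLinkProps(url: string): { href: string; rel: string } {
  return { href: url, rel: "sponsored nofollow" };
}

// Disclosure is derived from the flag, never toggled by hand.
function needsDisclosure(info: AffiliateInfo): boolean {
  return info.hasAffiliate;
}

// Returns what the review template needs, or null when there is no deal.
function affiliateBlock(info: AffiliateInfo) {
  if (!info.hasAffiliate || info.affiliateUrl === undefined) return null;
  return { link: affiliateLinkProps(info.affiliateUrl), showDisclosure: needsDisclosure(info) };
}

// Placeholder URL for illustration.
console.log(affiliateBlock({ hasAffiliate: true, affiliateUrl: "https://example.com/go" }));
```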
Cold steel, one hot signal
The look carries the voice. Cool blue-black steel surfaces sit on a faint blueprint grid, with one hot orange accent saved for the things that matter: the verdict, the CTA, the warning. It feels like an instrument panel that says "this was measured," not "this was vibed."
Dark by default, themed
Dark is the home turf, and a light theme is a first-class override. Every colour is a token that resolves per theme.
Tokens that survive opacity
Each colour also ships as raw RGB channels, so translucent tints follow the theme instead of hard-coding a second palette.
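The channel trick, sketched as a helper. It assumes each token has a companion CSS custom property holding three space-separated channel numbers, which the theme redefines. The token names and the `-rgb` suffix are hypothetical.

```typescript
// Hypothetical token names. Each is assumed to have a custom property such as
// --accent-rgb whose value is three space-separated RGB channel numbers.
type ColorToken = "surface" | "accent" | "slop" | "mixed" | "certified";

// Build a translucent colour that follows the active theme, because the
// channels are read from the variable when the style is resolved.
function tint(token: ColorToken, alpha: number): string {
  if (alpha < 0 || alpha > 1) {
    throw new RangeError(`alpha must be between 0 and 1, got ${alpha}`);
  }
  return `rgb(var(--${token}-rgb) / ${alpha})`;
}

console.log(tint("accent", 0.15)); // rgb(var(--accent-rgb) / 0.15)
```

A token stored as a finished colour can't take a new alpha without restating the colour. Storing the channels separately is what lets one definition serve every opacity.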
A diagnostic voice
Mono eyebrows and labels everywhere, the readout texture that makes a verdict feel like an instrument reading.
A solo build: design, engineering, and the standard itself
DreadRobot is a solo project: the brand and voice, the scoring method, the design system, and the Next.js build. The interesting work wasn't any one of those. It was making them reinforce each other, so the credibility a reader feels is the same credibility the codebase enforces.
It is up at dreadrobot.com on Next.js, TypeScript, Tailwind, and Vercel, with the first tools on the bench across four categories. Four came back Certified, one got named Slop. The same discipline this case study describes is wired into the deploy: nothing ships as a verdict unless it has been earned.