Observability

Vantage: turn a sea of signals into time to insight

An incident-investigation console built for the question that actually matters at 3 a.m.: where is the cause?

When something breaks across a distributed system, the problem isn't a shortage of data. It's a flood of it. One root cause sets off hundreds of downstream alerts, and the operator becomes a detective digging through logs, dashboards, and traces under pressure. Vantage is designed around a different unit of value than the dashboard: time to insight. It collapses the cascade to the originating signal, draws the one hot path through the system, and points at the slowest span. It then offers an AI reading of the evidence that suggests a fix but never acts, handing execution to a human-gated control surface.

Fully interactive. Click a service or a span, run the diagnosis, then approve the suggested fix via Oversee. Everything is keyboard-navigable; the data is synthetic.

01: The problem

slow MTTR

Alert fatigue, and the cost of a slow MTTR

When a distributed system fails, the bottleneck is rarely missing data. It's that one root cause detonates into hundreds of downstream alerts, and a human has to find the signal inside the noise, fast, and usually at the worst possible hour.

Every minute of mean-time-to-resolution is paid in revenue, reliability budget, and the operator's sleep. Yet the tools often make it worse: a pager that fires on every symptom trains people to ignore it, and a wall of dashboards asks the responder to be the correlation engine. The design problem isn't visualizing more. It's cutting the cascade down to the one thing worth looking at first.
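One way to picture "cutting the cascade down" is a dependency walk: an alerting service is a likely origin if nothing it depends on is also alerting. The sketch below is a minimal illustration of that heuristic, not Vantage's actual logic; the function name, the service names, and the topology are all hypothetical.

```python
def originating_services(alerting, depends_on):
    """Return alerting services none of whose dependencies are alerting.

    alerting: set of service names currently firing alerts.
    depends_on: dict mapping a service to the services it calls.
    """
    return sorted(
        svc for svc in alerting
        if not any(dep in alerting for dep in depends_on.get(svc, ()))
    )

# Hypothetical topology: web calls checkout, which calls payments,
# which calls the database.
depends_on = {
    "web": ["checkout"],
    "checkout": ["payments"],
    "payments": ["orders-db"],
    "orders-db": [],
}
alerting = {"web", "checkout", "payments", "orders-db"}

print(originating_services(alerting, depends_on))
```

Four alerting services collapse to the one at the bottom of the chain. The heuristic assumes the dependency graph has no cycles among alerting services; a cycle would leave no candidate, and a real system would need a tie-breaker such as alert start time.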

Live specimen · Basalt

synthetic

147 alerts versus one signal, rendered live in the Basalt design system, the same dark UI as the console above.

Grounded in practice

Alert fatigue is a named, well-documented failure mode in incident response. The industry on-call literature (e.g. PagerDuty's operations guidance) treats noisy, low-signal alerting as a primary driver of slow response and burnout. Netflix has described its own engineers' troubleshooting as detective work across scattered logs and dashboards, the exact experience their tracing tool Edgar (written up by Elizabeth Carretto) was built to compress.

02: The thesis

time to insight

Design for time to insight, not dashboards

A dashboard answers "what are all the numbers?" An investigation needs the opposite: "which one number explains this, and what does it point to?"

The unit Vantage optimizes is the seconds between the moment an incident starts and the moment a human understands why. That reframes every screen. Instead of a grid of charts to scan, the console leads with a small, stable vocabulary of signals, ranks what's anomalous, and keeps the path from symptom to cause short. Coverage is table stakes; what's scarce, and what the design protects, is the responder's attention.

Specimen · timeline

recovered

The whole incident in one line: spike at 14:02, root cause in 41 seconds, mitigated under two minutes.

Specimen · golden signals

small multiples

The four golden signals as small multiples. Status is carried by shape and word, never color alone.

Standing on established frameworks

The signal vocabulary isn't invented from scratch. It stands on three well-known operability frameworks: Google's SRE practice and its four golden signals (latency, traffic, errors, saturation); Brendan Gregg's USE method (utilization, saturation, errors) for resources; and the RED method (rate, errors, duration) for services, popularized by Tom Wilkie. Vantage's job is to make these legible under pressure, not to reinvent them.
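As a concrete reading of the RED method, the sketch below reduces one window of request records to rate, error ratio, and a p99 duration. It is an illustration under stated assumptions: the record shape (duration in milliseconds plus a success flag), the function name, and the sample data are hypothetical, and saturation, which belongs to the golden signals and USE rather than RED, would need a separate resource metric.

```python
def red_summary(requests, window_seconds):
    """Summarize rate, errors, and duration for one service over one window.

    requests: list of (duration_ms, ok) tuples.
    """
    total = len(requests)
    if total == 0:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p99_ms": None}
    durations = sorted(d for d, _ in requests)
    errors = sum(1 for _, ok in requests if not ok)
    # Nearest-rank p99: the smallest sample with at least 99% of samples
    # at or below it. Integer arithmetic avoids float rounding in the rank.
    rank = (99 * total + 99) // 100
    return {
        "rate_per_s": total / window_seconds,
        "error_ratio": errors / total,
        "p99_ms": durations[rank - 1],
    }

# Synthetic window: 98 fast successes and 2 slow failures in 50 seconds.
sample = [(10, True)] * 98 + [(500, False), (900, False)]
print(red_summary(sample, window_seconds=50))
```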

03: Signature craft

the heart of the piece

High-density visualization that stays legible at scale

This is the heart of the piece: dense operator visuals that a tired human can still read at a glance, hand-built in accessible SVG, with no charting library.

Two views carry the investigation. A service topology encodes structure in shape and dependency in position, then draws a single bright traced path so the eye lands on the cause rather than counting nodes. A flame graph maps width to time, so the widest bar literally is the answer. Both are fully keyboard-navigable: nodes and spans are focusable, arrow-key reachable, and announce their selection through an ARIA live region. High information density and accessibility are not in tension when the visual is built deliberately.
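The "width is time" encoding can be sketched as a small layout function that turns spans into SVG rectangles, each focusable and labeled for assistive technology. This is a minimal sketch, not the prototype's code: the span fields, the pixel sizes, the span names, and the millisecond values are assumptions chosen so the pool wait comes out at 79% of the request, and role="img" with aria-label is one plausible labeling choice rather than the only one.

```python
from html import escape

def flame_rects(spans, total_ms, width_px=800, row_px=24):
    """Lay spans out as SVG <rect> strings where width encodes duration."""
    scale = width_px / total_ms
    rects = []
    for span in spans:
        x = span["start_ms"] * scale
        w = span["duration_ms"] * scale
        y = span["depth"] * row_px
        share = round(100 * span["duration_ms"] / total_ms)
        label = escape(
            f'{span["name"]}, {span["duration_ms"]} ms, {share}% of request',
            quote=True,
        )
        # tabindex makes the bar reachable by keyboard; aria-label gives
        # a screen reader the same fact the width gives the eye.
        rects.append(
            f'<rect x="{x:.1f}" y="{y}" width="{w:.1f}" height="{row_px - 2}" '
            f'tabindex="0" role="img" aria-label="{label}"/>'
        )
    return rects

# Hypothetical trace of a 1000 ms checkout request.
spans = [
    {"name": "checkout", "start_ms": 0, "duration_ms": 1000, "depth": 0},
    {"name": "db.pool.wait", "start_ms": 120, "duration_ms": 790, "depth": 1},
    {"name": "db.query", "start_ms": 910, "duration_ms": 60, "depth": 1},
]
for rect in flame_rects(spans, total_ms=1000):
    print(rect)
```

Arrow-key movement between bars and the live-region announcement would sit in the page's script and are not shown here.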

hand-built SVG · keyboard + ARIA live · zero charting deps

The worked example · synthetic

read it back

41s · root cause in 41 seconds, from a checkout p99 spike at 14:02

79% of the request is a single connection-pool wait

147 → 1 · 147 alerts versus one signal

Specimen · topology

traced path

Topology: structure in shape, the story in one bright traced path.

Specimen · flame graph

width is time

Flame graph: width is time; the widest bar is the cause.

Specimen · contribution

same answer, bar form

The same answer as a bar chart: 79% of the request is a single connection-pool wait.

Proven visual idioms

Both idioms are battle-tested in this domain. The flame graph was created by Brendan Gregg to make profiling legible at a glance. Distributed tracing and service maps are now standardized through OpenTelemetry, and Netflix's Edgar is built on distributed tracing to reconstruct a request's journey across services. Vantage's contribution isn't the idiom. It's making these keyboard-and-screen-reader accessible without a charting dependency.

04: Guided analysis

paved road

Surface the one relevant signal; don't make responders hunt

Exploration tools are powerful and pitiless: infinite dimensions, and no opinion about which one matters right now.

Vantage layers guidance over raw power. Smart defaults open on the anomalous service, not the homepage. Progressive disclosure keeps the first screen to the few facts that move the investigation, with depth one interaction away. It's the "paved road" idea applied to debugging: the common path is the easy path, and the responder is steered toward the signal that explains the others, instead of being handed a query builder and wished luck.
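"Open on the anomalous service" implies some ranking of how far each service has drifted from normal. A simple stand-in is a z-score of the current reading against a recent baseline, sketched below. The scoring method, function names, and sample numbers are assumptions for illustration; the source does not say how Vantage scores anomalies.

```python
from statistics import mean, pstdev

def anomaly_score(baseline, current):
    """Z-score of the current value against a baseline window."""
    spread = pstdev(baseline)
    if spread == 0:
        # A flat baseline: any deviation at all is maximally surprising.
        return 0.0 if current == mean(baseline) else float("inf")
    return (current - mean(baseline)) / spread

def default_landing_service(services):
    """Pick the service to open on: the one with the highest score.

    services: dict of name -> (baseline_values, current_value).
    """
    return max(services, key=lambda name: anomaly_score(*services[name]))

# Hypothetical p99 latencies in milliseconds.
services = {
    "checkout": ([200, 210, 190, 205, 195], 1400),
    "search": ([80, 85, 75, 82, 78], 84),
}
print(default_landing_service(services))
```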

Echoes of guided exploration at scale

Guided, high-cardinality exploration is the direction the field has moved: Honeycomb's BubbleUp-style analysis automatically surfaces which dimensions differ inside an anomaly. At Netflix scale, Atlas carries dimensional time-series telemetry for macro error-trend analysis, and Telltale provides application health and intelligent alerting across a large fleet of applications: guidance, not just a bigger query box.

05: Trust signals

error budget

SLOs and error budgets, made decidable

"Is it up?" is the wrong question. The useful one is "how much room is left before we should stop shipping features and start protecting reliability?"

A service-level objective plus an error budget turns reliability from a vibe into a number a team can act on. Vantage makes that legible: a burn-down that shows the budget being consumed against the safe pace, and a gauge for what remains. When an incident takes a visible bite out of the budget, the trade-off becomes a shared, explicit decision, not an argument about feelings.
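The arithmetic behind the gauge and the burn-down is short enough to show. The sketch below computes the share of budget remaining and a burn rate relative to the safe pace. The SLO target, request counts, and elapsed fraction are invented, picked only so the result matches a gauge reading of 38% left; the function name is hypothetical.

```python
def error_budget(slo_target, total_requests, failed_requests, elapsed_fraction):
    """Return remaining budget share and burn rate for one SLO window.

    slo_target: e.g. 0.999 for "99.9% of requests succeed".
    elapsed_fraction: how much of the window has passed, from 0 to 1.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if not 0 < elapsed_fraction <= 1:
        raise ValueError("elapsed_fraction must be in (0, 1]")
    allowed_failures = (1 - slo_target) * total_requests
    consumed = failed_requests / allowed_failures
    return {
        "remaining": 1 - consumed,
        # 1.0 means the budget is being spent exactly at the safe pace;
        # above 1.0 it runs out before the window ends.
        "burn_rate": consumed / elapsed_fraction,
    }

# Halfway through the window: 620 failures against 1,000 allowed.
budget = error_budget(0.999, 1_000_000, 620, elapsed_fraction=0.5)
print(f'{budget["remaining"]:.0%} left, burn rate {budget["burn_rate"]:.2f}')
```

In this example 62% of the budget is gone with half the window left, so the burn rate sits above the safe pace of 1.0.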

Specimen · error budget

38% left

Error-budget burn-down and remaining-budget gauge: reliability as a decision, not a guess.

Straight from SRE practice

SLOs and error budgets are core to Google's Site Reliability Engineering discipline: define the target, measure against it, and let the remaining budget govern how aggressively a team ships. Vantage's role is purely presentational, making that budget visible at the moment of decision.

06: AI in the read lane

suggests, never acts

An AI that diagnoses, and a deliberate handoff before it acts

The AI panel reads the evidence and proposes a cause. It never touches production. Diagnosis and execution are split into two lanes on purpose.

Ask a plain-language question and Vantage surfaces anomalies, ranks hypotheses with honest confidence, and shows the metric, trace, and deploy evidence behind each. Then it suggests a reversible remediation, and stops. The console operates in the read lane; anything that writes to production is handed to a human-gated control surface. That boundary is the whole safety argument: an AI's best guess earns a recommendation, never an unattended action.
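The lane boundary can be expressed in types: the read lane produces immutable, ranked hypotheses that only describe a fix, and the write lane refuses to run anything without a named approver. This is a sketch of the idea, not the interface of Vantage or Oversee; every class, function, field, and sample value here is hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hypothesis:
    cause: str
    confidence: float   # 0.0 to 1.0
    evidence: tuple     # metric, trace, and deploy references
    suggested_fix: str  # a description only; nothing in this class runs it

def rank(hypotheses):
    """Read lane: order hypotheses from most to least confident."""
    return sorted(hypotheses, key=lambda h: h.confidence, reverse=True)

class ApprovalRequired(Exception):
    """Raised when a write is attempted without a human approver."""

def execute(fix, approved_by=None):
    """Write lane: stand-in for the human-gated control surface."""
    if not approved_by:
        raise ApprovalRequired(f"refusing to run without approval: {fix}")
    return f"{fix} (approved by {approved_by})"

hypotheses = [
    Hypothesis("cache miss storm", 0.22, ("metric:cache.hit_ratio",),
               "warm the cache"),
    Hypothesis("connection pool exhausted", 0.81,
               ("metric:db.pool.wait", "trace:checkout", "deploy:latest"),
               "roll back the latest deploy"),
]
top = rank(hypotheses)[0]
print(top.cause, top.confidence)
```

Calling execute(top.suggested_fix) without an approver raises ApprovalRequired, which is the point: a suggestion can be handed across the boundary, but it cannot act on its own.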

Read lane → write lane

human-gated

Vantage · diagnose (read) → Oversee · approve & execute (write)

The suggested fix doesn't run here. It's previewed and approved through the four-state control surface designed in a separate piece: scope, preview/dry-run, interrupt, and reversible rollback. Vantage points; Oversee gates.

Anchored in real anomaly work

Automated anomaly detection and metric correlation are active areas at operational scale. Netflix's Atlas ecosystem (including streaming evaluation of high-cardinality data and correlation between service-level indicators and custom metrics) is exactly the kind of telemetry that makes ranked, evidence-backed hypotheses feasible. Vantage is the human-facing layer over that work, deliberately confined to suggesting.

07: Coherence

UI · CLI · API

One insight across UI, CLI, and API

Engineers don't live in one window. They live in terminals, IDEs, and scripts, and the answer has to read the same wherever it shows up.

The same query, "what's the slowest span in checkout?", returns the same truth as a console panel, a CLI line, and a JSON payload. That coherence is developer empathy made concrete: it respects Git- and CLI-first mental models, lets the insight flow into automation, and means a finding never gets lost in translation between surfaces.
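One way to keep the three surfaces honest is to derive all of them from a single record, so the UI row, the CLI line, and the JSON payload cannot drift apart. The sketch below assumes a hypothetical record shape and span name; only the 79% figure comes from the worked example.

```python
import json

# The single source of truth for "what's the slowest span in checkout?"
insight = {"service": "checkout", "span": "db.pool.wait", "share_pct": 79}

def as_ui_row(i):
    """Cells for a console table row."""
    return [i["service"], i["span"], f'{i["share_pct"]}%']

def as_cli_line(i):
    """One line for a terminal."""
    return f'{i["service"]}  slowest span: {i["span"]}  {i["share_pct"]}% of request'

def as_json(i):
    """Payload for scripts and automation."""
    return json.dumps(i, sort_keys=True)

print(as_ui_row(insight))
print(as_cli_line(insight))
print(as_json(insight))
```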

Specimen · three surfaces

one truth

One insight, three surfaces: UI row, CLI line, JSON payload.

How platform teams actually unify the experience

Large engineering orgs converge their tools into a coherent operator experience. Netflix, for example, has invested in a unified, federated platform console built on its Hawkins design system. Meeting engineers in the terminal and the API, not only the dashboard, is the difference between a tool they tolerate and one they reach for.

The incident, end to end

four screens, in order

One incident, from the 3 a.m. page to the postmortem

The specimens above are stills. This is the same incident as a sequence you can click through. The page lands on the lock screen, the alert storm hands you noise with no origin, the console turns that into one cause and one suggested fix, and the postmortem drafts itself the moment a human approves. Same synthetic incident, four moments, in the order a responder actually lives them. Each frame below is the real prototype, fully interactive.

Screen 1 of 4 · lock

Sev-2 page

It starts on the glass at 3:01 a.m. 412 alerts, no clear origin, and one thing to tap.

Screen 2 of 4 · storm

412 alerts

This is the storm you wake up to. Every service is screaming and none of them is obviously the cause.

Screen 3 of 4 · console

time to insight

This is the beat where Vantage does its work. The console runs live at the top of this page.

Vantage collapses the flood to one origin, draws the hot path, and stops at a fix a human has to approve.

Screen 4 of 4 · postmortem

resolved

Closed in under two minutes, with the postmortem already drafted from what actually happened.

How it got here

v1 → v7

Seven versions, one direction: shrink the time to insight

The design didn't arrive whole. Each version removed something the responder had to do in their head and moved it into the interface, until the console's whole job was to get from incident to cause as fast as a human can read.

v1
Raw dashboards

A wall of every metric. Complete, and useless under pressure. The responder is the correlation engine.
v2
Golden signals first

Cut the grid to latency, traffic, errors, and saturation per service. Less to scan; faster orientation.
v3
Service topology

Added the dependency graph and made the hot call-path visible, so structure stopped living only in people's memory.
v4
Flame graph & tracing

Width-as-time made the slowest span surface itself. The connection-pool wait stopped hiding.
v5
Guided defaults

Open on the anomaly, progressive disclosure, paved-road analysis: power without making people hunt for it.
v6
SLOs & error budgets

Reliability became a decidable number, so "keep shipping or start protecting?" stopped being an argument.
v7
Time to insight + AI assist + Oversee handoff

Ranked, evidence-backed hypotheses in the read lane; a suggested reversible fix; execution handed to the human-gated control surface. The console's job is now exactly its name.

Considered and passed on

the road not taken

A chart-dense dashboard, strong on its own, wrong in the console

One version packed the console with charts: golden signal cards, an annotated p99 timeline, and contribution bars. On its own it was genuinely strong. Folded into a live on-call console, it added density and reading load to a tool whose entire job is speed. So the call was to keep the shipped console lean and let the richer chart language live where there is time to actually read it, in the case study specimens and the drafted postmortem. Showing the version that lost is the honest part.

Explored concept · dashboard

not shipped

The parked build, preserved as its own artifact. Good charts, wrong altitude for a screen measured in seconds.

Honest framing: Vantage is a portfolio design concept with a working, synthetic-data prototype, not a production system, and not a claim to have built Netflix's Atlas, Edgar, Telltale, or Hawkins. The named tools are cited as the real industry practice this design reasons from.

Explore more work

The closest neighbors to this one: where read-lane diagnosis hands off to write-lane safety, where the components come from, and where dense operator UI shows up again.

Oversee, the write-lane safety gate

Where Vantage stops. Diagnosis is read-only; execution gets a scope, a preview/dry-run, an interrupt, and a reversible rollback before anything touches production.

Kernel, a design system for dense tooling

The discipline behind Basalt: tokens, components, and contrast rules built so high-density operator UI stays legible and accessible.

Signal Inbox, ranking what matters

The same alert-fatigue problem in a different domain: separating the one signal that needs a human from the noise around it.
