SLOs Are a Reliability Question, Not a Number
A service-level objective is not a number on a dashboard. It is an answer to the question 'how reliable is reliable enough — and to whom?' Without that framing, the math becomes reliability theater.
Most introductions to SLOs start with the math. Set a target, measure achievement, compute an error budget, alert on burn rate. This is the wrong starting point. An SLO is not a number on a dashboard. It is the answer to a question that has to be asked first: how reliable is reliable enough — and to whom? Without that question, the math becomes reliability theater — a precise quantity attached to a fuzzy promise nobody has agreed to.
The framing matters more than the number. A 99.9% SLO and a 99.99% SLO produce identical math when the user, the failure mode, and the consequence of breaching are all undefined. Both numbers look authoritative. Neither tells anyone whether the service is meeting its purpose. Read SLOs this way and the discipline becomes coherent again. The number is the conclusion of a conversation about users, work, and risk — not the starting point.
The framing question
A service-level objective is a target rate at which a defined service-level indicator stays within a defined acceptable range, measured over a defined window, with an explicit consequence when it does not. Every word in that definition matters. Service: which one — the product feature, the API endpoint, the background job? Indicator: which property is being measured — availability, latency, freshness, correctness? Acceptable range: what counts as good — the 95th percentile under 200ms, the success rate above 99.9%, the staleness under 60 seconds? Window: across what time horizon — a rolling 28 days, a quarter, a single business hour during peak load? Consequence: what happens when the budget is exhausted — a feature freeze, an incident review, a customer credit, an organizational reflex of any kind?
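One way to keep those five questions from being skipped is to write the SLO down as a record with a field for each term. The sketch below is only an illustration: the class name, the field names, and the checkout example are hypothetical, not a standard schema.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class SLO:
    """One field per term in the definition. All example values are hypothetical."""
    service: str           # which service: a feature, an endpoint, a job
    indicator: str         # which property: availability, latency, freshness, correctness
    acceptable_range: str  # what counts as good for a single measurement
    target: float          # fraction of measurements that must be good, e.g. 0.999
    window: timedelta      # time horizon the target is evaluated over
    consequence: str       # what happens when the budget is exhausted

    def missing_fields(self) -> list[str]:
        """Names of the text fields left blank, i.e. questions nobody answered."""
        text_fields = ("service", "indicator", "acceptable_range", "consequence")
        return [name for name in text_fields if not getattr(self, name).strip()]


checkout = SLO(
    service="checkout API",
    indicator="latency",
    acceptable_range="request completes in under 200 ms",
    target=0.95,
    window=timedelta(days=28),
    consequence="",  # a number with no agreed consequence
)
print(checkout.missing_fields())
```

A definition that reports a non-empty list here is a metric, not yet an objective.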
Most SLOs in practice answer none of these questions. They report a number and stop. The number then drifts into one of two failure modes. Either the team chases it like a vanity metric — making the dashboard green through measurement tricks rather than through reliability work — or the team ignores it because nobody ever defined what should happen when it breaks. Both are reliability theater. The metric exists; the objective does not.
The user dimension is the load-bearing piece. Reliable enough to whom is a question with a specific answer for every service. A read-heavy product surface that powers a customer’s daily workflow is reliable enough at one threshold. A write-heavy backend that posts billing events is reliable enough at a much higher threshold, because the failure mode is silent data loss, not a refreshed page. A speculative recommendation engine is reliable enough at a much lower threshold, because the failure mode is a missing recommendation, not a missing product. The SLO target follows from the user and the failure mode. Pick those first.
The math, made specific
After the framing comes the indicator. An SLI is a measurable property of the service whose value can be classified as good or bad against an objective. Picking a good SLI is harder than picking a number. A good SLI is observable in production with low overhead and reflects what the user actually experiences, not what the system reports about itself. It cannot be gamed without doing the underlying reliability work. It survives a year of architectural changes.
The classic SLI families are availability (the fraction of requests that complete successfully), latency (the request duration at a defined percentile), correctness (the fraction of responses that match an oracle), and freshness (the staleness of data the service returns). Each family has well-understood traps. Availability measured as HTTP 200 rates ignores semantic failures that return 200 with the wrong body. Latency measured as a mean obscures the tail that the user actually experiences. Correctness without a defined oracle is unmeasurable. Freshness depends on a wall clock the service may not control.
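Two of these traps are easy to see with made-up numbers. The sketch below uses hypothetical samples and a simple nearest-rank percentile (one of several common percentile definitions) to compare a mean against a tail percentile, and a status-code success rate against one that also checks the response body.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical latency sample: 95 fast requests and 5 slow ones, in milliseconds.
latencies_ms = [100] * 95 + [3000] * 5
mean_ms = statistics.mean(latencies_ms)
p99_ms = percentile(latencies_ms, 99)

# Hypothetical responses as (status code, body was semantically correct).
responses = [(200, True)] * 97 + [(200, False)] * 2 + [(500, False)]
by_status = sum(status == 200 for status, _ in responses) / len(responses)
by_semantics = sum(status == 200 and ok for status, ok in responses) / len(responses)

print(f"mean={mean_ms:.0f} ms, p99={p99_ms} ms")
print(f"HTTP 200 rate={by_status:.2%}, correct-response rate={by_semantics:.2%}")
```

The mean sits well below the latency that one request in twenty actually experiences, and the status-code rate counts the two wrong-body responses as successes.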
The right SLI is the one that, when it goes red, the team knows what to do, and when it stays green, the team trusts the service is meeting its obligation. Anything else is a metric pretending to be an objective. The hardest part of SLO work is not setting targets but designing indicators that earn their place.
Error budgets are policy, not math
Once an SLO is set, its error budget — the allowed fraction of time or events that may fall outside the acceptable range — becomes a policy artifact. The math is trivial: one minus the target, applied to the window. A 99.9% SLO over 28 days allows about 40 minutes of unavailability. That number is uninteresting on its own. What matters is the policy around it.
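The arithmetic above can be checked in a few lines. This is a minimal sketch; the function names are illustrative, and the event-based variant assumes the SLI is a count of good events out of total events.

```python
from datetime import timedelta


def error_budget(target: float, window: timedelta) -> timedelta:
    """Time-based budget: (1 - target) applied to the window."""
    return window * (1 - target)


def allowed_bad_events(target: float, total_events: int) -> int:
    """Event-based budget: how many events may be bad out of total_events."""
    return round((1 - target) * total_events)


budget = error_budget(0.999, timedelta(days=28))
print(f"{budget.total_seconds() / 60:.1f} minutes")   # about 40 minutes
print(allowed_bad_events(0.999, 1_000_000), "bad events allowed per million")
```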
A team without error-budget policy treats the budget as headroom. Burn it on whatever ships. Recover quietly. The budget becomes a Slack post after an incident, then a footnote. A team with error-budget policy treats the budget as a contract. The contract spells out what happens when the budget is healthy (ship faster, take more risk, run more experiments) and what happens when the budget is exhausted (freeze feature work, fix reliability, sometimes reduce the service to a known-good subset). The policy is the consequence in the SLO definition. Without it, the budget is just arithmetic.
The cultural artifact this produces is the most important output of SLO work. Error-budget policies make explicit a trade-off most organizations leave implicit: reliability work competes with feature work for engineering attention. When the budget is healthy, the trade-off favors features. When the budget is exhausted, it favors reliability. The team does not need to relitigate the trade-off every sprint. The budget answers for them.
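The healthy-versus-exhausted contract can be written as a small decision function so that the budget, not a meeting, gives the answer. The sketch below assumes an event-based SLI; the function names and the two-state policy are illustrative, and a real policy would be whatever the team has agreed to.

```python
def budget_remaining(target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget left: 1.0 is untouched, 0.0 or below is exhausted."""
    allowed_bad = (1 - target) * total_events
    if allowed_bad <= 0:
        raise ValueError("no budget to spend: target is 100% or there were no events")
    actual_bad = total_events - good_events
    return 1 - actual_bad / allowed_bad


def policy_action(remaining: float) -> str:
    """Hypothetical two-state policy mirroring the contract described above."""
    if remaining <= 0:
        return "freeze feature work and fix reliability"
    return "ship faster, take more risk, run more experiments"


healthy = budget_remaining(0.999, good_events=999_400, total_events=1_000_000)
exhausted = budget_remaining(0.999, good_events=998_500, total_events=1_000_000)
print(f"{healthy:.0%} left:", policy_action(healthy))
print(f"{exhausted:.0%} left:", policy_action(exhausted))
```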
Picking SLIs that survive a year
A common failure mode is to pick SLIs that work for the system as it exists today and break the moment the system is refactored. Availability measured at a load balancer breaks when the service moves behind a CDN. Latency measured client-side breaks when the client changes from a web SPA to a mobile app. Freshness measured against a hardcoded TTL breaks when the data pipeline is rebuilt.
SLIs survive a year when they are anchored to user-visible properties rather than implementation details. The fraction of dashboard loads that render with all panels populated within three seconds survives a backend rewrite, a frontend rewrite, and a deployment topology change, because the property being measured is the user’s experience. The 99th percentile latency of GET /api/v2/dashboards measured at the application server does not survive any of those changes.
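The dashboard example can be expressed without naming any endpoint, server, or client. In the sketch below the record type and its fields are hypothetical; the only assumption is that each load reports how many panels were expected, how many rendered, and how long rendering took, from whatever layer happens to collect it.

```python
from dataclasses import dataclass


@dataclass
class DashboardLoad:
    panels_expected: int
    panels_rendered: int
    render_seconds: float


def is_good(load: DashboardLoad, deadline_seconds: float = 3.0) -> bool:
    """Good means every panel populated and rendered within the deadline."""
    all_panels = load.panels_rendered == load.panels_expected
    return all_panels and load.render_seconds <= deadline_seconds


def dashboard_sli(loads: list) -> float:
    """Fraction of loads that were good from the user's point of view."""
    if not loads:
        raise ValueError("no loads observed; the SLI is undefined")
    return sum(is_good(load) for load in loads) / len(loads)


loads = [
    DashboardLoad(6, 6, 1.8),  # good
    DashboardLoad(6, 6, 3.4),  # all panels, but too slow
    DashboardLoad(6, 5, 1.2),  # fast, but one panel missing
    DashboardLoad(6, 6, 3.0),  # good, exactly at the deadline
]
print(dashboard_sli(loads))
```

Nothing in the classification depends on where the measurement is taken, which is what lets the indicator outlive a rewrite.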
The discipline is to pick SLIs from the customer’s vantage point and engineer their measurement to be robust. Synthetic probes that exercise the user journey. Sampled real-user metrics that capture the same property at scale. Server-side measurements that correlate with both. The combination outlasts any single layer of the stack.
When the SLO is wrong
SLOs are written before the data exists to validate them. They get the target wrong. They pick the wrong window. They pick an indicator that turns out to be gameable. The first version of an SLO is almost always at least partly incorrect.
The mature response is to plan for that. The SLO definition includes a review cadence (quarterly is common) at which the team examines the indicator against the reliability history since the last review and the user feedback received. Numbers are tuned. Indicators are sometimes replaced. The error-budget policy is sometimes rewritten. None of this is a failure of discipline. It is the discipline working as designed. SLOs are hypotheses about what reliable enough means; they get tested against reality and revised when reality disagrees.
The failure mode to avoid is treating the SLO as immutable doctrine. Teams that defend a wrong number end up gaming the measurement to keep it green. Teams that update the number when it stops reflecting the obligation keep the SLO honest. The reliability comes from the revision practice as much as from the number itself.
Multi-window practice
Modern SLO practice rarely uses a single window. The standard pattern is a short-window burn-rate alert (one hour) paired with a long compliance window (28 or 30 days). The short window catches fast-burning incidents, such as a deployment that spikes errors or a downstream outage that drains the budget in minutes, before they exhaust the long-window budget. The long window measures whether the service is honoring the objective over a period users actually feel.
The short window is for paging. The long window is for policy. Conflating them produces either slow detection (paging only on the long window catches incidents too late) or budget blindness (watching only the short window leaves slow degradation invisible). Two windows. Two purposes.
The implementation detail that matters: burn rate is the ratio of the observed error rate to the error rate the SLO allows, so a burn rate of 1 exhausts the budget exactly at the end of the long window. Alert thresholds are chosen by how much of the budget a given rate consumes within the alert window. A 14.4× burn rate sustained for an hour consumes 2% of a 30-day budget (about 2.1% of a 28-day one), the standard threshold for a page. Higher burn rates page faster; lower burn rates wait for confirmation. The math is well understood and worth borrowing from existing SLO references rather than reinventing.
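The burn-rate arithmetic is short enough to verify by hand or in code. In the sketch below the function names are illustrative, and burn rate is taken to mean the observed error rate divided by the error rate the SLO allows; the 14.4× figure works out to 2% for a 30-day window and slightly more for a 28-day one.

```python
def burn_rate(observed_error_rate: float, target: float) -> float:
    """How many times faster than 'exactly on budget' the service is failing."""
    return observed_error_rate / (1 - target)


def budget_consumed(rate: float, alert_window_hours: float, slo_window_hours: float) -> float:
    """Fraction of the long-window budget spent if `rate` holds for the alert window."""
    return rate * alert_window_hours / slo_window_hours


def threshold_for(budget_fraction: float, alert_window_hours: float, slo_window_hours: float) -> float:
    """Burn rate that spends `budget_fraction` of the budget within the alert window."""
    return budget_fraction * slo_window_hours / alert_window_hours


print(f"{budget_consumed(14.4, 1, 30 * 24):.2%} of a 30-day budget")
print(f"{budget_consumed(14.4, 1, 28 * 24):.2%} of a 28-day budget")
print(f"{threshold_for(0.02, 1, 28 * 24):.2f}x spends 2% of a 28-day budget in an hour")
print(f"{burn_rate(0.0144, 0.999):.1f}x when 1.44% of requests fail against a 99.9% target")
```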
The reward structure
None of this works if the organization punishes the reliability function when the budget is healthy and punishes the product function when it is not. Error-budget policy assumes that both functions own the same number. The product side gets to ship faster when reliability is in good shape. The reliability side gets to demand fixes when it is not. Both directions have a short-term cost and a long-term payoff.
Organizations that wire incentives only one way break the discipline. A function rewarded only for shipping will burn the budget and resist reliability work. A function rewarded only for reliability will block all shipping and never run experiments. The error-budget policy works because it reframes the trade-off as a shared resource rather than a competition between disciplines. Teams that get this right find that engineering velocity and reliability move together; teams that get it wrong find them moving in opposition.
The cultural artifact is the same shape as the andon cord. Pulling the cord costs throughput in the short term and earns better quality in the long term. Freezing features to repay error budget costs feature velocity in the short term and earns reliability in the long term. Both depend on a reward structure that survives executive impatience. Neither happens automatically.
What it adds up to
An SLO is a target rate at which a defined indicator stays within a defined range, measured over a defined window, with an explicit policy when it does not. Every word earns its place. The framing comes first — who is the user, what is the failure mode, what is the consequence. The indicator comes second — what property reflects the obligation and survives the system’s evolution. The number comes last, as the conclusion of the conversation, not the start.
Error budgets are policy artifacts that make the velocity-versus-reliability trade-off explicit and let the organization operate without relitigating it every sprint. SLIs survive a year when they are anchored to user-visible properties. SLOs are hypotheses, not doctrine — they get revised when reality disagrees. Multi-window practice splits paging from policy. The reward structure is what makes the discipline hold.
Teams that pick a number first end up with reliability theater. Teams that pick a user first end up with reliability.