Site Reliability Engineering Maturity Model

After studying the Google SRE books at length, I came up with this SRE framework.

The idea is that you can judge a development team's reliability maturity against this framework. As a team adopts more of the framework's characteristics, we would expect its reliability to go up.

I hope this is useful to you as you start your journey in applying these principles! 

  1. Do a self-assessment of your Product Team's current status for each of the identified capabilities.
  2. Define the desired end-point for the end of the next improvement cycle. A cycle can be a month, a quarter, a semester, and so on. Every team can define its own improvement cycles, although quarterly targets are a good starting point because they leave enough room to define meaningful actions.
  3. Identify the actions you will need to achieve the desired end-point.
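
The three steps above can be sketched in a few lines of Python. This is only an illustration, not part of the framework itself: the `Level` enum, the `improvement_gaps` helper, and the sample ratings are all hypothetical names and values chosen for the example.

```python
from enum import IntEnum


class Level(IntEnum):
    """Maturity levels used in the table (hypothetical encoding)."""
    CRAWL = 1
    WALK = 2
    RUN = 3


def improvement_gaps(current, target):
    """Return {capability: (current, target)} for every capability
    whose target level is higher than its current level."""
    gaps = {}
    for capability, now in current.items():
        goal = target.get(capability, now)  # no target means "stay put"
        if goal > now:
            gaps[capability] = (now, goal)
    return gaps


# Step 1: self-assessment (sample values, not real data).
current = {"Risk": Level.CRAWL, "Toil": Level.CRAWL, "Alerting": Level.WALK}
# Step 2: desired end-point for the next improvement cycle.
target = {"Risk": Level.WALK, "Toil": Level.CRAWL, "Alerting": Level.RUN}

# Step 3: the gaps are where actions need to be identified.
for capability, (now, goal) in improvement_gaps(current, target).items():
    print(f"{capability}: {now.name} -> {goal.name}")
```

Capabilities whose target equals their current level drop out of the result, which keeps the action list focused on what the team agreed to change this cycle.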

| Characteristic | Crawl | Walk | Run |
|---|---|---|---|
| Risk | Observe it/Measure it (SLIs) | Set a target (SLO), and regularly review this target for fit | Manage work around performance vs the target (Established & agreed upon Error Budget Policies) |
| Toil | Measure it by tracking work in JIRA | Set a Target (50% Ops Work) | Actively manage it with proactive development designed to improve performance vs SLO and reduce toil |
| Monitoring/Observability | Inventory apps, inventory apps' SLIs | Know the key service metrics, and establish target levels or ranges (SLOs) | Measure by error budget and error budget burn rate |
| Automation | Small scripts used by individuals to restore service, reduce toil, etc. | Grouping around some common tooling. Reusable automation by anyone within the group | Mature Automation Frameworks & Platforms for rapid prototyping, able to be used by anyone in the company. |
| Release Response | Use the CR reports to know what changed during the previous period. Ad hoc rollback response. | Regular cadence of releases prepares team for changes, manual decision to roll back | Automated alerting of releases & rollbacks. Human supervised. |
| Simplicity | Inventory and catalog all systems, configs, etc. | Reduce/Deprecate old, unnecessary systems & tooling | Only necessary, profitable & useful applications are supported, tooling is lean and widely used. |
| Alerting | Reduce noise, improve signal. | High signal to noise ratio, have clear targets for alerts other than the NOC, clear up the actions for response. | Alerts come when automation has recovered something automatically. |
| On Call Rotation | Ad hoc, no spiff | Established schedule, including spiffs internationally | On call is not overwhelming, each shift can do post mortems for each event, which yields great backlog. (Antifragile) |
| Troubleshooting | Ad hoc, person-based | Structured, algorithmic | Systematic approach, regularly drilled |
| Emergency Response | Human (NOC) based | SRE based, NOC supported | Automation led, human supervised |
| Incident Management | Ad hoc, person-based | Structured | Automated communication |
| Post Mortems | Ad hoc, done by outside group | Blameless, done by team | Blameless, cataloged & searchable, regularly reviewed and replayed in game-day/chaos testing |
| Outage Tracking | Tracked in a system with low visibility. | Published metrics and tags for events | Published publicly |
| Reliability Testing | Our customers do the chaos testing for us | Regular game day practice sessions on a non-production system to practice outages | Regular chaos engineering in production systems to burn accumulating error budgets, etc. |
| Software Engineering | Learn the software development principles and practices (scope, design, triage, build, test, release, maintain, iterate) | Practice good design patterns in our own software development, including VCS, CI/CD, short circuit, Blue/Green, Canary, rapid release | Are leaders of software development craftsmanship and design from the perspective of reliability. Can instruct on how to create resilient software from experience. |
| App Failure Modes (Break down by workstream) | Fragile - multiple parts can take down an entire workflow. These parts are identified and measured. | Resilient - only certain parts of applications can cause full outages. The rest cause degradation, but not outage. | Antifragile - systems learn from overloads, overflows, scaling events, cascades |
| Onboarding SREs | Ad hoc program, assignments, feel-based | Structured onboarding - training program, merit/test based achievement | Automatic: new team members on day 1 know what they need to do and are systematically given opportunities to achieve these requirements. |
| Operational Overload | Unmanaged, unknown, survey based | Structured - KPI based (number of events/on call shift, number of post mortems/shift, capitalizable work over 50%, etc.) | Regular reviews for workload, participative leadership, reflection |
| SRE Engagement Model | Attached SRE to Workstream - No Responsibility | Embedded SREs - Limited Responsibilities | Consumable Reliability Frameworks |
| Buy in of Organization | Leery of SRE discipline, do not understand it, shiny new thing, what's in it for me? | Aware of program's objectives, interested in learning of the goals and costs of team, how they can help | Full partners in the goals of reliability, adapt their work to achieve reliability jointly with SRE team. |

*Loosely based on Adidas' DevOps Maturity Framework
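
The Risk and Monitoring/Observability rows lean on error budgets and burn rate, so here is a minimal worked sketch of that arithmetic. It assumes a simple request-based SLI (good requests out of total requests). The function names and the sample numbers are made up for illustration and do not come from any particular monitoring tool.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of events allowed to fail, e.g. about 0.001 for a 99.9% SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means the budget is being spent exactly as fast as
    the SLO permits; above 1.0 it will run out before the window ends.
    """
    if total_events == 0:
        return 0.0  # no traffic, nothing burned
    observed_error_rate = bad_events / total_events
    return observed_error_rate / error_budget(slo_target)


def budget_remaining(bad_events: int, total_events: int, slo_target: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_bad = error_budget(slo_target) * total_events
    return 1.0 - bad_events / allowed_bad


# Sample numbers only: 10,000 requests, 50 failures, 99.9% SLO.
print(round(burn_rate(50, 10_000, 0.999), 2))
print(round(budget_remaining(5, 10_000, 0.999), 2))
```

An Error Budget Policy of the kind mentioned in the Run column would then attach actions to these numbers, for example pausing feature releases once the remaining budget goes negative. The exact thresholds are for each team to agree on.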
