Site Reliability Engineering Maturity Model
After studying the Google SRE books at length, I put together this SRE maturity framework.
The idea is that you can judge a development team's reliability maturity against this framework. As teams adopt more of its characteristics, we would expect their reliability to improve.
I hope this is useful to you as you start your journey in applying these principles!
- First, do a self-assessment of your product team's current status for each of the identified capabilities.
- Define the desired end state for the next improvement cycle. A cycle can be a month, a quarter, a semester, and so on; every team can define its own cycle length, though quarterly targets are a good starting point for defining meaningful actions.
- Identify the actions you will need to take to reach the desired end state.
| Characteristic | Crawl | Walk | Run |
|---|---|---|---|
| Risk | Observe it/Measure it (SLIs) | Set a target (SLO), and regularly review this target for fit | Manage work based on performance vs. the target (established & agreed-upon error budget policies) |
| Toil | Measure it by tracking work in JIRA | Set a target (e.g., cap ops work at 50%) | Actively manage it with proactive development designed to improve performance vs. SLO and reduce toil |
| Monitoring/Observability | Inventory apps, inventory apps' SLIs | Know the key service metrics, and establish target levels or ranges (SLOs) | Measure by error budget and error budget burn rate |
| Automation | Small scripts used by individuals to restore service, reduce toil, etc. | Grouping around some common tooling. Reusable automation by any within the group | Mature Automation Frameworks & Platforms for rapid prototyping, able to be used by anyone in the company. |
| Release Response | Use the CR reports to know what changed during the previous period. Ad hoc rollback response. | Regular cadence of releases prepares team for changes, manual decision to rollback | Automated alerting of releases & rollbacks. Human supervised. |
| Simplicity | Inventory and catalog all systems, configs, etc. | Reduce/deprecate old, unnecessary systems & tooling | Only necessary, profitable & useful applications are supported; tooling is lean and widely used. |
| Alerting | Reduce noise, improve signal. | High signal-to-noise ratio; alerts have clear targets other than the NOC, with clear response actions. | Alerts fire when automation has already recovered something automatically. |
| On Call Rotation | Ad hoc, no spiff | Established schedule, including spiffs for international coverage | On call is not overwhelming; each shift has time for a post mortem on each event, which feeds a healthy backlog. (Antifragile) |
| Troubleshooting | Ad hoc, person-based | Structured, algorithmic | Systematic approach, regularly drilled |
| Emergency Response | Human (NOC) based | SRE based, NOC supported | Automation led, human supervised |
| Incident Management | Ad hoc, person-based | Structured | Automated communication |
| Post Mortems | Ad hoc, done by outside group | Blameless, done by team | Blameless, cataloged & searchable, regularly reviewed and replayed in game-day/chaos testing |
| Outage Tracking | Tracked in a system with low visibility. | Published metrics and tags for events | Published publicly |
| Reliability Testing | Our customers do the chaos testing for us | Regular game day practice sessions on a non-production system to practice outages | Regular chaos engineering in production systems to burn accumulating error budgets, etc. |
| Software Engineering | Learn the software development principles and practices (scope, design, triage, build, test, release, maintain, iterate) | Practice good design patterns in our own software development, including VCS, CICD, short circuit, Blue/Green, Canary, rapid release | Are leaders of software development craftsmanship and design from the perspective of reliability. Can instruct on how to create resilient software from experience. |
| App Failure Modes (broken down by workstream) | Fragile: multiple parts can take down an entire workflow. These parts are identified and measured. | Resilient: only certain parts of applications can cause full outages; the rest cause degradation, but not outage. | Antifragile: systems learn from overloads, overflows, scaling events, and cascades |
| Onboarding SREs | Ad hoc program, assignments, feel-based | Structured onboarding - training program, merit/test based achievement | Automatic: new team members on day 1 know what they need to do and are systematically given opportunities to achieve these requirements. |
| Operational Overload | Unmanaged, unknown, survey based | Structured - KPI based (number of events/on call shift, number of post mortems/shift, capitalizable work over 50%, etc) | Regular reviews for workload, participative leadership, reflection |
| SRE Engagement Model | Attached SRE to Workstream - No Responsibility | Embedded SREs - Limited Responsibilities | Consumable Reliability Frameworks |
| Buy in of Organization | Leery of SRE discipline, do not understand it, shiny new thing, what's in it for me? | Aware of program's objectives, interested in learning of the goals and costs of team, how they can help | Full partners in the goals of reliability, adapt their work to achieve reliability jointly with SRE team. |
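To make the Risk and Monitoring rows concrete, here is a minimal sketch of the error budget and burn-rate arithmetic they refer to. The function names and the 99.9% SLO figure are illustrative assumptions, not part of the framework above.

```python
def error_budget_minutes(slo: float, window_minutes: float) -> float:
    """Allowed 'bad' minutes in the window for a given SLO (e.g., 0.999)."""
    return (1 - slo) * window_minutes

def burn_rate(bad_minutes: float, elapsed_minutes: float, slo: float) -> float:
    """How fast the budget is being consumed relative to the sustainable rate.
    1.0 means burning exactly on pace to exhaust the budget at window end;
    higher values mean the budget will run out early."""
    allowed_so_far = (1 - slo) * elapsed_minutes
    return bad_minutes / allowed_so_far if allowed_so_far else float("inf")

# A 99.9% SLO over a 30-day window allows 43.2 minutes of unavailability.
budget = error_budget_minutes(0.999, 30 * 24 * 60)
# 10 bad minutes in the first day burns at roughly 6.9x the sustainable rate,
# the kind of signal a Run-level team would alert on.
rate = burn_rate(10, 24 * 60, 0.999)
```

A team at the Crawl stage would only be measuring the SLI inputs here; Walk means the SLO is set and reviewed; Run means burn-rate thresholds like this drive alerting and error budget policy decisions.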
*Loosely based on Adidas' DevOps Maturity Framework