Site Reliability Engineering Maturity Model
After studying the Google SRE books at length, I put together this SRE maturity framework.
The idea is that you can judge a development team's reliability maturity against this framework. As teams adopt more of its characteristics, we would expect their reliability to improve.
I hope this is useful to you as you start your journey in applying these principles!
- First, do a self-assessment of your product team's current status for each of the identified capabilities.
- Define the desired end state for the next improvement cycle. A cycle can be a month, a quarter, a semester, and so on; every team can define its own cycle length, though quarterly targets are a good starting point for defining meaningful actions.
- Identify the actions you will need to take to reach the desired end state.
| Characteristic | Crawl | Walk | Run |
|---|---|---|---|
| Risk | Observe it/Measure it (SLIs) | Set a target (SLO), and regularly review this target for fit | Manage work based on performance vs. the target (established & agreed-upon error budget policies) |
| Toil | Measure it by tracking work in JIRA | Set a target (e.g., cap ops work at 50%) | Actively manage it with proactive development designed to improve performance vs. SLO and reduce toil |
| Monitoring/Observability | Inventory apps, inventory apps' SLIs | Know the key service metrics, and establish target levels or ranges (SLOs) | Measure by error budget and error budget burn rate |
| Automation | Small scripts used by individuals to restore service, reduce toil, etc. | Grouping around some common tooling. Reusable automation by any within the group | Mature Automation Frameworks & Platforms for rapid prototyping, able to be used by anyone in the company. |
| Release Response | Use the CR reports to know what changed during the previous period. Ad hoc rollback response. | Regular cadence of releases prepares team for changes, manual decision to rollback | Automated alerting of releases & rollbacks. Human supervised. |
| Simplicity | Inventory and catalog all systems, configs, etc. | Reduce/deprecate old, unnecessary systems & tooling | Only necessary, profitable & useful applications are supported; tooling is lean and widely used. |
| Alerting | Reduce noise, improve signal. | High signal-to-noise ratio; alerts have clear targets other than the NOC, with clear response actions. | Alerts fire when automation has already recovered something automatically. |
| On Call Rotation | Ad hoc, no spiff | Established schedule, including spiffs for international coverage | On call is not overwhelming; each shift has time for a post mortem on each event, which feeds a healthy backlog. (Antifragile) |
| Troubleshooting | Ad hoc, person-based | Structured, algorithmic | Systematic approach, regularly drilled |
| Emergency Response | Human (NOC) based | SRE based, NOC supported | Automation led, human supervised |
| Incident Management | Ad hoc, person-based | Structured | Automated communication |
| Post Mortems | Ad hoc, done by outside group | Blameless, done by team | Blameless, cataloged & searchable, regularly reviewed and replayed in game-day/chaos testing |
| Outage Tracking | Tracked in a system with low visibility. | Published metrics and tags for events | Published publicly |
| Reliability Testing | Our customers do the chaos testing for us | Regular game day practice sessions on a non-production system to practice outages | Regular chaos engineering in production systems to burn accumulating error budgets, etc. |
| Software Engineering | Learn the software development principles and practices (scope, design, triage, build, test, release, maintain, iterate) | Practice good design patterns in our own software development, including VCS, CICD, short circuit, Blue/Green, Canary, rapid release | Are leaders of software development craftsmanship and design from the perspective of reliability. Can instruct on how to create resilient software from experience. |
| App Failure Modes (broken down by workstream) | Fragile: multiple parts can take down an entire workflow. These parts are identified and measured. | Resilient: only certain parts of applications can cause full outages; the rest cause degradation, but not outage. | Antifragile: systems learn from overloads, overflows, scaling events, and cascades |
| Onboarding SREs | Ad hoc program, assignments, feel-based | Structured onboarding - training program, merit/test based achievement | Automatic: new team members on day 1 know what they need to do and are systematically given opportunities to achieve these requirements. |
| Operational Overload | Unmanaged, unknown, survey based | Structured - KPI based (number of events/on call shift, number of post mortems/shift, capitalizable work over 50%, etc) | Regular reviews for workload, participative leadership, reflection |
| SRE Engagement Model | Attached SRE to Workstream - No Responsibility | Embedded SREs - Limited Responsibilities | Consumable Reliability Frameworks |
| Buy in of Organization | Leery of SRE discipline, do not understand it, shiny new thing, what's in it for me? | Aware of program's objectives, interested in learning of the goals and costs of team, how they can help | Full partners in the goals of reliability, adapt their work to achieve reliability jointly with SRE team. |
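To make the Risk and Monitoring rows concrete, here is a minimal sketch of the error budget and burn-rate arithmetic they refer to. The function names and the 99.9% SLO figure are illustrative assumptions, not part of the framework above.

```python
def error_budget_minutes(slo: float, window_minutes: float) -> float:
    """Allowed 'bad' minutes in the window for a given SLO (e.g., 0.999)."""
    return (1 - slo) * window_minutes

def burn_rate(bad_minutes: float, elapsed_minutes: float, slo: float) -> float:
    """How fast the budget is being consumed relative to the sustainable rate.
    1.0 means burning exactly on pace to exhaust the budget at window end;
    higher values mean the budget will run out early."""
    allowed_so_far = (1 - slo) * elapsed_minutes
    return bad_minutes / allowed_so_far if allowed_so_far else float("inf")

# A 99.9% SLO over a 30-day window allows 43.2 minutes of unavailability.
budget = error_budget_minutes(0.999, 30 * 24 * 60)
# 10 bad minutes in the first day burns at roughly 6.9x the sustainable rate,
# the kind of signal a Run-level team would alert on.
rate = burn_rate(10, 24 * 60, 0.999)
```

A team at the Crawl stage would only be measuring the SLI inputs here; Walk means the SLO is set and reviewed; Run means burn-rate thresholds like this drive alerting and error budget policy decisions.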
*Loosely based on Adidas' DevOps Maturity Framework