What You Don't Know Can Hurt Your Uptime

Image Source: Buckminster Fuller Institute

Distributed systems are fascinating things. Like organisms, they are composed of many different and interdependent parts, so understanding how likely such a system is to fail takes some interesting math.

No component is perfect, nor is any full system, but given the number of dependencies and redundancies, we can estimate a system's uptime.

Here is the math:

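Given the set $M = \{n_0, n_1, \dots, n_t\}$ of components for a system, where $n_i$ is the estimated uptime of component $i$:

$$\text{System Uptime } P = \prod_{i=0}^{t} n_i$$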


That is, for a system whose parts all depend on one another, the system uptime is the product of the estimated component uptimes.

For instance: 

P = 99.9% x 99.9% = 99.8001%

This is because the total system will fail if ANY of the parts fail. 
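The product rule is easy to check in code. Here is a minimal Python sketch; the function name `serial_uptime` is my own for illustration, and it assumes that component failures are independent and that uptimes are given as fractions (0.999 rather than 99.9%).

```python
from math import prod


def serial_uptime(component_uptimes):
    """Estimated uptime of a system that fails if ANY component fails.

    Assumes independent failures and uptimes expressed as fractions
    (0.999 means 99.9%).
    """
    return prod(component_uptimes)


two_parts = serial_uptime([0.999, 0.999])
print(f"{two_parts:.4%}")  # 99.8001%
```

Adding more uptimes to the list models a longer dependency chain, and each one can only lower the result.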

The conventional wisdom that "a chain is only as strong as its weakest link" is actually an overestimate: the chain is weaker than its weakest link!
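To see why, note that every uptime is at most 1, so multiplying the weakest component by all the others can only pull the product down. Using the same symbols as the formula above, with $n_k$ as a label introduced here for the weakest component:

```latex
P = \prod_{i=0}^{t} n_i
  = n_k \prod_{i \neq k} n_i
  \le n_k = \min_i n_i,
\qquad \text{since } 0 \le n_i \le 1 \text{ for every } i.
```

In the two-component example, 99.8001% is already below the 99.9% of either link.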


And this isn't even all the bad news. 

Lurking underneath our assumptions is a potentially nasty truth.
Take another example:

We assume System Z consists of Components A and B, each with an uptime of 99.9%. This means our total System Uptime is estimated at (99.9% x 99.9%) = 99.8001%, which translates to the following expected downtime:

  • Daily: 2m 52s
  • Weekly: 20m 8s
  • Monthly: 1h 27m 36s
  • Quarterly: 4h 22m 50s
  • Yearly: 17h 31m 22s
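Downtime figures like these come from multiplying the downtime fraction (1 minus uptime) by the length of each period. The sketch below is one way to do that in Python. The names `PERIODS`, `format_duration`, and `expected_downtime` are hypothetical, and the month, quarter, and year lengths assume a mean Gregorian year of 365.2425 days with fractional seconds dropped, which is my guess at the convention behind the list above.

```python
# Period lengths in seconds. Month, quarter, and year are derived from the
# mean Gregorian year (365.2425 days); this is an assumption.
DAY = 86400
YEAR = 365.2425 * DAY
PERIODS = {
    "Daily": DAY,
    "Weekly": 7 * DAY,
    "Monthly": YEAR / 12,
    "Quarterly": YEAR / 4,
    "Yearly": YEAR,
}


def format_duration(seconds):
    """Render a duration as e.g. '1d 2h 16m 16s', dropping fractional seconds."""
    remaining = int(seconds)  # truncate, do not round
    parts = []
    for label, size in (("d", 86400), ("h", 3600), ("m", 60)):
        count, remaining = divmod(remaining, size)
        if count or parts:  # skip leading zero units
            parts.append(f"{count}{label}")
    parts.append(f"{remaining}s")
    return " ".join(parts)


def expected_downtime(uptime):
    """Map each period name to the downtime implied by an uptime fraction."""
    return {
        name: format_duration((1 - uptime) * length)
        for name, length in PERIODS.items()
    }


for name, downtime in expected_downtime(0.999 * 0.999).items():
    print(f"{name}: {downtime}")
```

Passing a different uptime fraction, such as `0.999 ** 3`, gives the budget for a longer dependency chain.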

But upon further investigation, our system also includes, and depends upon, a Component C with an uptime of 99.9%.

Therefore, the real uptime is (99.9% x 99.9% x 99.9%) = 99.7002999%, which means this much expected downtime:

  • Daily: 4m 18s
  • Weekly: 30m 12s
  • Monthly: 2h 11m 21s
  • Quarterly: 6h 34m 4s
  • Yearly: 1d 2h 16m 16s

Omitting that dependency is the same as assuming that Component C would be up 100% of the time:

(99.9% x 99.9% x 100%) = 99.8001%. The real downtime is almost 50% greater than that estimate (4m 18s per day, compared to the assumed 2m 52s).
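The size of that gap can be checked directly by comparing downtime fractions rather than uptimes. This is a small Python sketch under the same independence assumption as the product formula; `downtime_fraction` is a name made up for the example.

```python
def downtime_fraction(component_uptimes):
    """Fraction of time a serial system is down, assuming independent failures."""
    uptime = 1.0
    for component_uptime in component_uptimes:
        uptime *= component_uptime
    return 1 - uptime


assumed = downtime_fraction([0.999, 0.999])        # Component C forgotten
actual = downtime_fraction([0.999, 0.999, 0.999])  # Component C included

# How much larger the real downtime is than the estimate.
underestimate = actual / assumed - 1
print(f"Real downtime exceeds the estimate by {underestimate:.1%}")  # 49.9%
```

The uptime percentages differ by less than a tenth of a point, which is why the error is easy to miss until it is expressed as downtime.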

So, not accounting for a dependency is the same as assuming it will be up 100% of the time. That can cost someone quite a bit depending upon their cost of downtime. 

So, what does one do to increase the total system uptime in a system of many dependent parts? In a word: Redundancy. But that is a topic for another post. 

Want to see the math for yourself? Use my Distributed Systems SLO calculator.
