What You Don't Know Can Hurt Your Uptime
Image Source: Buckminster Fuller Institute

Distributed systems are fascinating things. Like organisms, they are composed of many different, interdependent parts, and if we want to understand how likely such a system is to fail, we need to do some interesting math. No component is perfect, nor is any full system, but given the number of dependencies and redundancies, we can estimate the system's expected uptime.

Here is the math. Given a set $M$ of component uptimes for a system,

$$M = \{ n_0, n_1, \ldots, n_t \}$$

the system uptime is

$$P = \prod_{i=0}^{t} n_i$$

That is, for a system whose components all depend on one another, we take the product of the estimated component uptimes. For instance:

$$P = 99.9\% \times 99.9\% = 99.8001\%$$

This is because the total system will fail if ANY of the parts fail. The conventional wisdom that "a chain is only as strong as its weakest link" is actually an overestimation! And this isn't even all the bad news. Lurking underneath our assumptions is some potentially nasty...
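The product rule above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive model: the function name `series_uptime` is hypothetical, uptimes are assumed to be expressed as fractions between 0 and 1, and component failures are assumed to be independent of one another.

```python
from math import prod


def series_uptime(component_uptimes):
    """Estimated uptime of a system that fails if ANY component fails.

    Assumes each uptime is a fraction in [0, 1] and that components
    fail independently (an assumption made for this sketch).
    """
    uptimes = list(component_uptimes)
    for u in uptimes:
        if not 0.0 <= u <= 1.0:
            raise ValueError(f"uptime must be between 0 and 1, got {u!r}")
    # P = n_0 * n_1 * ... * n_t
    return prod(uptimes)


# The example from the text: two components at 99.9% each.
two_parts = series_uptime([0.999, 0.999])
print(f"{two_parts:.4%}")  # 99.8001%

# The chain is weaker than its weakest link: the product is lower
# than the smallest individual uptime.
mixed = [0.999, 0.99]
print(series_uptime(mixed) < min(mixed))  # True
```

Adding more dependent components to the list can only keep the result the same or lower it, which is why the "weakest link" figure overestimates the uptime of the whole system.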