Ground Up Reliability - An Example


Introduction:

I have recently been reading Alex Hidalgo's SLO book, and one quote struck me, which I will paraphrase below:

In a perfect world, an SLO would be built from the ground up

This implies that to build an SLO for our application, system or infrastructure, we would need to know the reliability of our power delivery, our internet infrastructure and so on, and this gave me the idea for building out an example.

Warning: This post will contain a bit more maths than usual.

Firstly, I will briefly define what an SLO is and why it matters in the context of reliability. An SLO is a crucial part of the SRE approach: it lets us declare a system reliable when the objective is met. Generally, an SLO is given in two distinct parts:

  1. An SLI (a metric)
  2. An objective which we wish to meet

An example being:

95% of requests will complete successfully within 400ms

Why does an SLO matter in this context? SLOs capture the user's "expected level of reliability", and they also give us a way to know whether our service is performing well enough, and doing so often enough.
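
To make this concrete, here is a toy Python sketch (the request data and field names are purely illustrative) that checks a batch of requests against the example objective above, i.e. that at least 95% of requests are both successful and under 400ms:

```python
# Toy sketch: checking a latency/success SLO against a batch of requests.
# The request records and threshold values are made up for illustration.

requests = [
    {"latency_ms": 120, "success": True},
    {"latency_ms": 380, "success": True},
    {"latency_ms": 450, "success": True},   # too slow
    {"latency_ms": 200, "success": False},  # failed
    {"latency_ms": 90,  "success": True},
]

# SLI: fraction of requests that both succeeded and finished within 400ms.
good = sum(1 for r in requests if r["success"] and r["latency_ms"] <= 400)
sli = good / len(requests)

# SLO: at least 95% of requests must be "good".
objective = 0.95
print(f"SLI = {sli:.2%}, SLO met: {sli >= objective}")
```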

From the ground up:

If we are to build an SLO, we need to understand the dependencies of our services. We have two main types of dependencies:

  1. Hard dependencies - which the service relies on to function at all
  2. Soft dependencies - which are not strictly required, but are needed for optimal functioning

For our example, we're going to imagine that we have built a service that needs to run on bare-metal infrastructure, and that all the software components of our service are at 99.99% reliability. How can we design our bare-metal infrastructure to theoretically provide at least 99.99% reliability?

First, I'll start with the hard dependencies. What are the infrastructure requirements?

  • Power
  • Internet connection
  • CPU
  • Memory
  • Storage
  • Power Supply Unit
  • Motherboard

Power:

There are a few key metrics for power grid reliability, but the main two are the SAIDI (System Average Interruption Duration Index) and the SAIFI (System Average Interruption Frequency Index).

The SAIDI measures the total duration of interruptions experienced over a period of time, typically a year, and the SAIFI measures the average number of interruptions that a customer of the power network would experience in a given period.

For my example, I will be using AusGrid, with data retrieved from here

| Reliability | FY2022 | FY2021 | FY2020 | FY2019 | FY2018 | FY2017 |
| --- | --- | --- | --- | --- | --- | --- |
| SAIDI - System Average Interruption Duration Index (average minutes a customer is without electricity) | 74.8 | 70.7 | 92.2 | 74.7 | 69.0 | 79.0 |
| SAIFI - System Average Interruption Frequency Index (average number of service interruptions per customer) | 0.61 | 0.56 | 0.68 | 0.66 | 0.68 | 0.71 |

Averaging out the implied reliability from the statistics above, we get a mean power network reliability of 99.9855%.
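
For clarity, the arithmetic behind that figure is the mean SAIDI converted into a fraction of the year (assuming roughly 525,960 minutes in a year):

$$Mean\ SAIDI = \frac{74.8+70.7+92.2+74.7+69.0+79.0}{6} ≈ 76.7\ minutes$$
$$Reliability ≈ 1-\frac{76.7}{525,960} ≈ 0.99985$$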

Internet

ISPs provide their own SLOs for their networks, although in my experience they rarely come close to their published SLOs in actual reliability (at least for the base consumer). I have gathered my own statistics from my server's HealthChecks.io monitoring, which I mentioned in Going Mobile: Part 4.

The mean reliability from those checks is 98.455%.

Hardware:

The method for deriving reliability for hardware follows two main paths in this post:

  1. Calculating the failure probability from the FIT rate
  2. Calculating the failure probability from the MTBF (Mean Time Between Failures)

The FIT rate (Failure in Time) is defined as the number of failures per billion hours of operation.

The Mean Time Between Failures is defined as the average operating time expected between failures of a component.
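
Before applying these to individual components, here is a minimal Python sketch of both conversions as they are used in the rest of this post, assuming 8,766 hours in a year and the simple linear approximation of the failure probability:

```python
# Minimal sketch of the two conversions used throughout this post.
# Assumes 8,766 hours in a year (365.25 days) and the linear
# approximation: failure probability ≈ operating hours / MTBF.

HOURS_PER_YEAR = 8766

def reliability_from_fit(fit_rate: float, hours: float = HOURS_PER_YEAR) -> float:
    """Reliability over `hours` for a component with the given FIT rate
    (failures per billion device-hours)."""
    failure_probability = fit_rate * hours / 1_000_000_000
    return 1 - failure_probability

def reliability_from_mtbf(mtbf_hours: float, hours: float = HOURS_PER_YEAR) -> float:
    """Reliability over `hours` for a component with the given MTBF."""
    failure_probability = hours / mtbf_hours
    return 1 - failure_probability

print(f"CPU  (FIT 50):         {reliability_from_fit(50):.6f}")       # ~0.999562
print(f"RAM  (FIT 200):        {reliability_from_fit(200):.6f}")      # ~0.998247
print(f"PSU  (MTBF 100,000h):  {reliability_from_mtbf(100_000):.6f}") # ~0.912340
```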

CPU

Researching FIT rates for CPUs, I found that an average FIT rate for a given CPU is around 50. Assuming we are calculating our reliability for a year:

The formula for the failure probability over a given period is:

$$Failure\ Probability=FIT\ rate×\frac{Total\ hours}{10^9}$$

For a FIT rate of 50 and 8,766 hours in a year:
$$Failure\ Probability=50×\frac{8,766}{1,000,000,000}=0.0004383$$

So our reliability percentage could be given for a year as:
$$Reliability = 1 - Failure\ Probability$$
or:
$$Reliability_{1\ year} = 1-0.0004383$$
$$Reliability_{1\ year} = 0.9995617$$

Or as a percentage:
$$99.96\%$$

Memory / RAM

FIT rates are higher for RAM, where you can see a range of 50 to 1,000 depending on the type of RAM selected. Typically DRAM has a higher FIT rate, but a good rough median is 200.

Assuming a FIT rate of 200:

$$Failure\ Probability=200×\frac{8,766}{1,000,000,000}=0.0017532$$

This gives a reliability of:
$$Reliability = 1 - 0.0017532$$
$$Reliability = 0.9982468$$

or:
$$99.83\%$$

Power Supply Unit:

Let's use the Corsair AX850 as an example for this: here

The MTBF for the AX850 is 100,000 hours; from research, this is a fairly standard MTBF for PSUs.

We can use that in a similar formula to calculate the failure probability from the MTBF:


$$Failure\ Probability = \frac{Total\ Hours\ in\ a\ year}{MTBF} = \frac{8766}{100,000} = 0.08766$$

Which provides us with the reliability of:
$$Reliability = 1 - 0.08766 = 0.91234$$
Or:
$$91.234\%$$

Motherboard

For the motherboard, I am going to use the calculated MTBF estimates from Intel for the S1200V3RP Server board here

This provides us a mean MTBF of 377,400 hours.

We can use that in the same formula as before to calculate the failure probability from the MTBF:
$$Failure\ Probability = \frac{Total\ Hours\ in\ a\ year}{MTBF} = \frac{8766}{377,400} = 0.02322734499$$

Which provides us with the reliability of:
$$Reliability = 1 - 0.02322734499 = 0.97677$$
Or:
$$97.677\%$$

Storage

For storage, we will introduce another measure of reliability: the AFR, or Annualised Failure Rate.

We have two different types of storage to consider:

  1. HDD
  2. SSD

HDD
For HDDs we will use the AFR; Backblaze provides quarterly reporting of their HDD failure rates, which we will extrapolate from.

The most recent annualised failure rate is 1.41%, or a reliability of:
$$Reliability = 1-0.0141$$
$$Reliability = 0.9859$$
or:
$$98.59\%$$

SSD
A good measure for calculating SSD reliability is the MTBF; from research, enterprise SSDs have a rough average MTBF of 2,500,000 hours.

We can therefore use it in the same MTBF equation as earlier:


$$Failure\ Probability = \frac{Total\ Hours\ in\ a\ year}{MTBF} = \frac{8,766}{2,500,000} = 0.0035064$$

Which gives a reliability of:
$$Reliability = 1 - 0.0035064 = 0.9964936$$
or:
$$99.65\%$$

But one other aspect of storage reliability is data striping, or RAID. For this example we will use RAID6, a variant of RAID which is widely used in enterprise settings.

RAID6 tolerates up to two simultaneous drive failures, so the array survives in three cases:

  • No drives fail
  • 1 drive fails
  • 2 drives fail

So to calculate the reliability of an 8-drive RAID6 array, we need the probability of each of the above cases:

Probability of any 1 drive failing:
$$= 1 - 0.9965 = 0.0035$$

No Failure:
$$= 0.9965^8 ≈ 0.972341$$

1 Failure:
$$= \binom{8}{1} × 0.9965^7 × 0.0035 = 8 × 0.9965^7 × 0.0035 ≈ 0.027321$$

2 Drive Failures:
$$= \binom{8}{2} × 0.9965^6 × 0.0035^2 = 28 × 0.9965^6 × 0.0035^2 ≈ 0.000336$$

Finally we combine the probabilities:
$$= 0.972341 + 0.027321 + 0.000336 ≈ 0.999998$$
$$≈ 99.9998\%$$
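
As a sanity check, the same result can be reached with a binomial sum in Python, assuming independent drive failures and the per-drive annual failure probability derived above:

```python
# Sketch: probability that an 8-drive RAID6 array survives a year,
# i.e. at most 2 of the 8 drives fail, assuming independent failures.
from math import comb

n = 8        # drives in the array
p = 0.0035   # annual failure probability of a single drive

survival = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(3))
print(f"RAID6 survival probability: {survival:.6f}")  # ~0.999998
```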

Putting it together:

We can calculate the overall reliability of a system of components in series by taking the product of the individual component reliabilities.

$$= 0.999855 × 0.98455 × 0.9995617 × 0.9982468 × 0.91234 × 0.999998 × 0.97677$$
$$≈ 0.8753$$

A reliability of roughly 87.53% for a single node isn't very close to the four 9s we want for our application, so what can we do?
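
As a quick sanity check, here is the same series product in Python, using the per-component reliabilities calculated above:

```python
# Sketch: series reliability of a single node, i.e. the product of the
# reliabilities of its hard dependencies (values from the sections above).
from math import prod

components = {
    "power":       0.999855,
    "internet":    0.98455,
    "cpu":         0.9995617,
    "memory":      0.9982468,
    "psu":         0.91234,
    "storage":     0.999998,   # 8-drive RAID6 array from the previous section
    "motherboard": 0.97677,
}

node_reliability = prod(components.values())
print(f"Single node reliability: {node_reliability:.4%}")  # ~87.53%
```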

We can list our dependencies in ascending order of reliability to find where we can make the most impact first:

  1. Power Supply Unit (91.234%)
  2. Motherboard (97.68%)
  3. Internet (98.455%)
  4. Memory (99.82%)
  5. CPU (99.96%)
  6. Power (99.99%)
  7. Storage (99.9998%)

What can we do?

One way to improve our reliability is to add redundancy. The formula for the reliability of n redundant components is:

$$R_{system}=1−(1−R)^n$$

Where:

  • %%(1−R)%% is the probability of a single component failing.
  • %%(1−R)^n%% is the probability of all components failing simultaneously.
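
Here is a small sketch of that formula, again assuming independent failures (the 0.95204 node figure is the per-node reliability calculated a little further below):

```python
# Sketch: reliability of n redundant copies of a component or node,
# assuming their failures are independent: R_system = 1 - (1 - R)^n.

def redundant(r: float, n: int) -> float:
    return 1 - (1 - r) ** n

print(f"2x PSU in one chassis:  {redundant(0.91234, 2):.6f}")  # ~0.992316
print(f"3x geographic nodes:    {redundant(0.95204, 3):.5f}")  # ~0.99989
```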

Adding a redundant Power supply unit:
$$Probability\ of\ PSU\ Failure = 0.08766$$
$$Redundant\ PSUs\ (2) = 0.08766^{2}$$
$$Probability\ of\ both\ failing\ = 0.0076842756$$
providing a reliability of:
$$=0.9923157244 = 99.23\%$$

Now, the total reliability is:
$$= 0.999855 × 0.98455 × 0.9995617 × 0.9982468 × 0.9923 × 0.999998 × 0.97677$$
$$≈ 0.95204$$
$$≈ 95.2\%$$

Now we can scale horizontally. Let's run 3 separate geographic nodes:
$$=1-(1-0.999855 × 0.98455 × 0.9995617 × 0.9982468 × 0.9923 × 0.999998 × 0.97677)^3$$
$$≈ 0.99989$$
$$≈ 99.99\%$$

Wrapping it up:

We have shown that, by calculating the reliability of each dependency and implementing redundancy where it matters most, we can build bare-metal infrastructure with a theoretical reliability of 99.99%, which can host our workloads without compromising the service's reliability.