Friday, August 12, 2011

Measuring Availability of Cloud Systems

The analysts at Saugatuck Technology recently wrote a note on "Cloud IT Failures Emphasize Need for Expectation Management". One comment caught my attention:

"Recall that the availability of a group of components is the product of all of the individual component availabilities. For example, the overall availability of 5 components, each with 99 percent availability, is: 0.99 X 0.99 X 0.99 X 0.99 X 0.99 = 95 percent."

I understand their math - but it strikes me as odd that they would use this thinking when discussing cloud computing. In cloud environments, the components are often available as virtualized n+1 highly available pairs. If one is down, the other takes over. In a non-cloud world, this architecture is typically reserved for the most critical components (e.g., load balancers or other single points of failure). It's also common to create a complete replica of the environment in a disaster recovery area (e.g., AWS availability zones). In theory, this leads to very high up-time.

Let me put this another way... I currently have 2 cars in my driveway. Let's say each of them has 99% up-time. If one car doesn't start, I'll try the other car. If neither car starts, I'll most likely walk over to my neighbor's house and ask to borrow one of their two cars (my DR plan). You can picture the math... in the 1% chance that car A fails, there's a 99% chance that car B will succeed, and so on. However, experience in both cars and in computing tells us that this math doesn't work either. For instance, if car A didn't start because it was 20 degrees below zero outside, there's a good chance that car B won't start - and for that matter, my neighbor's cars won't start either. Structural or natural problems tend to infect the mass.
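To make the two calculations concrete, here is a minimal Python sketch of the "old math". It assumes every component fails independently of the others, which is exactly the assumption the cold-weather example calls into question. The function names are my own, chosen for illustration.

```python
from math import prod

def series_availability(availabilities):
    """All components must be up: multiply the individual availabilities."""
    return prod(availabilities)

def parallel_availability(availabilities):
    """At least one component must be up: 1 minus the chance that all fail."""
    return 1 - prod(1 - a for a in availabilities)

# The analysts' example: 5 components, each 99% available, all required.
five_in_series = series_availability([0.99] * 5)

# The driveway example: 2 cars, either one will do.
two_cars = parallel_availability([0.99, 0.99])

# Adding the neighbor's 2 cars as the DR plan.
four_cars = parallel_availability([0.99] * 4)

print(f"5 components in series: {five_in_series:.4f}")
print(f"2 cars in parallel:     {two_cars:.6f}")
print(f"4 cars in parallel:     {four_cars:.8f}")
```

Under the independence assumption, redundancy looks almost too good: two 99% cars give roughly 99.99%. That number is only as trustworthy as the assumption behind it.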

I wish I could show you the new math for calculating availability in cloud systems - but it's beyond my pay grade. What I know is that the old math isn't accurate. Anyone have suggestions on a more modern approach?

4 comments:

Alex said...

If N1 (0.<=N1<=1.) is the availability of your first car and N2 (0.<=N2<=1.) is the availability of your second car then the availability of at least one of them is:
1-(1-N1)*(1-N2)

Thanks,
AS

jeff said...

Alexandar,
Your formula doesn't take into account the idea of an 'infectious agent' that would increase the likelihood of multiple items being unavailable at the same time. Said another way, your formula seems to say that the likelihood of the cars starting has nothing to do with the weather outside (that the likelihood of any one car starting is a constant).

I disagree with this position. Thoughts?
Jeff

jeff said...

I've since learned that one way of dealing with the 'infectious agent' is through N Version Programming: http://en.wikipedia.org/wiki/N-version_programming

shankar said...

I have a solution based on a probability model with hidden state "Theta" that connects the observations X1 and X2:

Theta -> X1 ; Theta->X2.

Both the observations (X1 and X2) are dependent on Theta. But they are CONDITIONALLY independent of each other, GIVEN Theta.

Further Info can be found at:
http://www.eecs.qmul.ac.uk/~norman/BBNs/Independence_and_conditional_independence.htm
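
One way to see what the hidden-state model above implies for the two-car example is to let Theta be the weather. The sketch below is only an illustration: the probabilities are hypothetical numbers I picked so that each car still has 99% up-time overall, and none of them come from the post or from real data.

```python
# Hypothetical numbers, chosen only so each car is still 99% available overall.
P_COLD = 0.005             # assumed chance of an extreme-cold morning (Theta = cold)
P_FAIL_GIVEN_COLD = 0.9    # assumed chance a car won't start when it is that cold
MARGINAL_FAIL = 0.01       # each car's overall failure rate (99% up-time)

# Solve for the normal-day failure rate that keeps the overall rate at 1%.
P_FAIL_GIVEN_NORMAL = (MARGINAL_FAIL - P_COLD * P_FAIL_GIVEN_COLD) / (1 - P_COLD)

def both_fail_given(p_fail):
    """Given Theta, the two cars fail independently of each other."""
    return p_fail * p_fail

# Sum over the hidden state: P(both fail) = sum over theta of P(theta) * P(fail | theta)^2
both_fail = (P_COLD * both_fail_given(P_FAIL_GIVEN_COLD)
             + (1 - P_COLD) * both_fail_given(P_FAIL_GIVEN_NORMAL))

# The "old math" treats the cars as unconditionally independent.
naive_both_fail = MARGINAL_FAIL ** 2

print(f"P(both fail), shared weather: {both_fail:.6f}")
print(f"P(both fail), independent:    {naive_both_fail:.6f}")
```

With these made-up inputs, each car is still 99% available on its own, yet the chance that both fail together is about 40 times higher than the independent formula predicts. The exact ratio depends entirely on the assumed numbers; the point is only that a shared cause dominates the joint failure probability.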