Over the last few days, a few of the engineers at TranscendComputing have been discussing what we could have done to have helped Netflix avoid their Christmas outage. For those of you who aren’t aware, AWS suffered an outage in the Elastic Load Balancer (ELB) service in the East Region.
In the middle of our discussions on creating massively scalable, highly available, clustered load balancers with feature parity to ELB, I caught a post by Diane Mueller at Active State. The gist of her post is that ‘Netflix went down because of AWS but her personal app (which leveraged FeedHenry and Stackato’ was revived after 10 minutes. The post seems to imply that if you use PaaS (like Stackato), one can switch clouds easily, like she did when she moved to her application to the HP Cloud.
I’ll avoid the overly dramatic retort but let’s just say that I disagree with Dianne’s implication. Here’s my position: if core Netflix applications were negatively affected by any core service (such as ELB), it would be extremely difficult to quickly switch to another cloud. Here are some specifics:
- No disrespect to my friends on the HP Cloud team but I honestly believe that if Netflix were to have done a sudden switch from AWS to HP it would have brought HP Cloud to its knees. ELB’s (if they had them) would have been crushed and Internet gateways would have been overloaded. Finding a very large number of idle servers may have also been a challenge.
- In this imaginary scenario, I guess we’ll assume that Netflix decided to keep their movie library and all application services running on multiple clouds. Sure this would be expensive but it wouldn’t have been realistic for them to do a just-in-time copy of the data from one location to the other.
- Netflix has done a great job of publishing their technical architecture: EMR, ELB, EIP, VPC, SQS, Autoscale, etc. None of these are available in the solution Dianne prescribed (Stackato), nor does HP Cloud offer them natively. There is a complete mismatch of services between the clouds. CloudFoundry offers some things that are ‘similar’ but I’m concerned that they wouldn’t have offered performance at scale.
- Netflix has also created tools specific to the AWS cloud (Asgard, Astyanax, etc.) as well as tuned off-the-shelf tools for AWS like Cassandra. These would have to be refined to work on each target cloud.
In summary, there’s little-to-no chance that Netflix could have quickly moved to ANY other cloud provider (including Rackspace or Azure) and there’s not a thing that Stackato would have done to alleviate the problem. All medium and large customers have real needs that are service dependent. I’ve joked that CloudFoundry is a toy. It is, but it’s a toy that is maturing and eventually may help with ‘real’ problems – but let’s be clear – that day isn’t today. Any suggestion that it is ready for a ‘Netflix-like-outage’ is either naïve or intentionally misleading.
I’ve spent the last three years working on solving the AWS portability problem – and it’s a bitch. Like Dianne, if you have a simple app my solution, TopStack, will work. It replicates core AWS services for workload portability. As proud as I am as what the team at Transcend Computing has done, I’m also quick to note that cloning any of the AWS services at massive scale with minimal down-time, across heterogeneous cloud platforms and providers is an incredibly tough problem.
Here’s my belief: Running the Transcend Computing ELB service on HP Cloud would not have worked for Netflix in their time of need. Our software would have been crushed. HP’s cloud would have been crushed. Netflix’s homegrown software wouldn’t have had ‘practically portable’. It would not have worked.
I’m happy to acknowledge where we suck. We’ll continue to listen to the unfortunate incidents that AWS, Netflix and others encounter. My 2013 prediction for Transcend Computing is this: we’ll suck less. Acknowledging reality is the first step.