Tuesday, December 31, 2013

Looking back on my 2013 predictions

Last year, I made some predictions on cloud computing. Here's my self-analysis:

===================
1. OpenStack continues to gain traction but many early adopters bypass Folsom in anticipation of Grizzly.
>> Correct. This was a gimme. 

2. Amazon's push to the enterprise means we will see more hosted, packaged apps from Microsoft, SAP and other large ISV's. Their IaaS/PaaS introductions will be lackluster compared to previous years.
>> Correct. It's interesting that the press failed to notice the lack of interesting stuff coming out of AWS. Has the law of diminishing returns already hit Amazon?

3. BMC and CA will acquire their way into the cloud.
>> Incorrect. CA picked up Nolio (and Layer 7), BMC acquired Partnerpedia. These acquisitions are pieces to the puzzle - but are not large enough to serve as anchors for a cloud portfolio. 

4. SAP Hana will quickly determine that Teradata isn't their primary competitor as the rise of OSS solutions matures.
>> Incorrect. SAP Hana continued to kick butt in 2013 and the buyers of it have probably never heard of the large open source databases. What was I thinking?

5. Data service layers (think Netflix/Cassandra) become common in large cloud deployments.
>> Partially Correct. We're seeing the cloud-savvy companies implement cross-region data replication strategies - but the average enterprise is nowhere near this. 

6. Rackspace, the "Open Cloud Company" continues to gain traction but users find more and more of their services 'not open'.
>> Correct. Rackspace continues to push a 'partially open' agenda - but users seem to be more than happy with their strategy. 

7. IBM goes another year without a cohesive cloud strategy.
>> Correct. The acquisition of SoftLayer was a huge step forward in having a strategy - but from the outside looking in, they still look like a mess. 

8. Puppet and Chef continue to grow presence but Cfengine gets a resurgence in mindshare.
>> Partially Correct. Puppet and Chef did grow their presence, especially in the large enterprise. I could be wrong, but I personally didn't see Cfengine get traction. That said, Ansible and Salt came out strong. 

9. Cloud Bees, Rightscale, Canonical, Inktank, Enstratus, Piston Cloud, PagerDuty, Nebula and Gigaspaces are all acquired.
>> Incorrect. I was right about Enstratus but some of these predictions were stupid (like Canonical). The others remain strong candidates for acquisition. 

10. Eucalyptus sunsets native storage solutions and adopts OpenStack solutions.
>> Unsure; I don't keep track of Eucalyptus. 

11. VMware solution dominates over other CloudFoundry vendors.
>> Correct. I was referring to what is now called Pivotal. 

12. Cloud 'cost control' vendors (Newvem, Cloudyn, Cloud Cruiser, Amysta, Cloudability, Raveld, CloudCheckR, Teevity, etc.) find the space too crowded and begin shifting focus.
>> Correct. Some of them have moved into adjacent spaces like governance, billing, etc. 

13. PaaS solutions begin to look more and more like orchestration solutions with capabilities to leverage SDN, provisioned IOPS, IAM and autonomic features. Middleware vendors that don't offer open source solutions lose significant market share in cloud.
>> Incorrect. I believe this is still coming but for the most part the vendors aren't there.

14. Microsoft's server-side OS refresh opens the door to more HyperV and private cloud.
>> Unsure. This should have happened but I have no data. 

15. Microsoft, Amazon and Google pull away from the pack in the public cloud while Dell, HP, AT&T and others grow their footprint but suffer growing pains (aka, outages).
>> Correct. Well - at least the part where AWS, Azure and Google pull away from the pack. Dell continues to frustrate me; I need to have a sit-down with Michael Dell.

16. Netflix funds and spins out a cloud automation company.
>> Incorrect. Perhaps this was wishful thinking. I'm a Netflix OSS fanboy - but think that they're starting to fall into the same trap as OpenStack (aka, open sourcing the kitchen sink without strong product/portfolio management). 

17. Red Hat focuses on the basics, mainly integrating/extending existing product lines with a continued emphasis on OpenStack.
>> Correct. Red Hat appears to be taking a risk averse strategy... slow but methodical movement. 

18. Accenture remains largely absent from the cloud, leaving Capgemini and major off-shore companies to take the revenue lead.
>> Unsure. I'm unaware of any large movements that Accenture made in the cloud. The big move in the SI space was CSC acquiring ServiceMesh. 

19. EMC will continue to thrive: it's even easier to be sloppy with storage usage in the cloud and users realize it isn't 'all commodity hardware'.
>> Correct. That said, we're starting to see companies implement multi-petabyte storage archival projects with cloud companies. 

20. In 2013, we'll see another talent war. It won't be as bad as dot-com, but talent will be tight.
>> Correct. And it will get worse in 2014.

Thursday, April 25, 2013

New Presentations: SOA, DevOps and Technical Debt

MomentumSI recently published a series of presentations on hot topics in I.T.

DevOps in 2013 covers the current state of I.T. operations automation and the issues in the SDLC that need to be addressed in order to achieve continuous delivery:

By now, most I.T. professionals are familiar with "technical debt". This presentation encourages practitioners to think about the structural issues that slow us down:

A lot has changed in the SOA world over the last few years. However, we continue to see many organizations adopting techniques that don't promote agility:


Thursday, January 03, 2013

ITIL and DevOps: Inbreeding?

The 2012 Christmas Eve outage at Amazon has people talking. The fuss isn't about what broke; it's about what Amazon said they're going to do to fix it. If you aren't familiar with their report, it's worth a quick read. If it's tl;dr, I'll sum it up: a developer whacked some data in a production database that made the load balancing service go haywire, and it took longer than it should have to identify the problem and restore it. (Did you see how I avoided the technical jargon?)

If you're Amazon, you have to start thinking about how to make sure it never happens again. Restore confidence... and fast. Here's what they said:

We have made a number of changes to protect the ELB service from this sort of disruption in the future. First, we have modified the access controls on our production ELB state data to prevent inadvertent modification without specific Change Management (CM) approval. Normally, we protect our production service data with non-permissive access control policies that prevent all access to production data. The ELB service had authorized additional access for a small number of developers to allow them to execute operational processes that are currently being automated. This access was incorrectly set to be persistent rather than requiring a per access approval. We have reverted this incorrect configuration and all access to production ELB data will require a per-incident CM approval. This would have prevented the ELB state data from being deleted in this event. This is a protection that we use across all of our services that has prevented this sort of problem in the past, but was not appropriately enabled for this ELB state data. We have also modified our data recovery process to reflect the learning we went through in this event. We are confident that we could recover ELB state data in a similar event significantly faster (if necessary) for any future operational event. We will also incorporate our learning from this event into our service architecture. We believe that we can reprogram our ELB control plane workflows to more thoughtfully reconcile the central service data with the current load balancer state. This would allow the service to recover automatically from logical data loss or corruption without needing manual data restoration.

Here's my question: If ITIL Service Transition (thoughtful change management) and DevOps (agile processes with infrastructure-as-code) were to mate, what would the outcome be?
A) A child that wanted to run fast but couldn't because of too many manual/approval steps
B) A child that ran fast but only after the change board approved it
C) Mate multiple times; some children will run fast (with scissors) others will move carefully
D) No mating required; just fix the architecture (service recovery)

This is the discussion that I'm having with my colleagues. And to be clear, we aren't talking about what Amazon could/should do, we're talking about what WE should do with our own projects.

Although there's no unanimous agreement, there have been some common beliefs:
1. Fix the architecture. I like to say that "cloud providers make their architecture highly available so we don't have to." This is an exaggeration, but if the cloud provider does their job right, we will have to focus less on making our application components HA and more on correctly using the provider's HA components. There's little disagreement on this topic. AWS screwed up the MTTR on the ELB. We've all screwed up things before... just fix it.

2. Rescind dev-team access. So this is where it gets interesting. Remember all that Kumbaya between developers and operators? Gone. Oh shit - maybe we should have called the movement "DevTestOps"! One simple mistake and you pulled my access to production?? LOL - hell, yea. The fact is all services aren't created equal. I have no visibility into Amazon's internal target SLA's - but I'm going to guess that there are a few services that are five-9's (or 5.26 minutes of down-time per year). Certain BUSINESS CRITICAL services shouldn't be working in DevOps time. They should be thoughtfully planned out with Change Advisory Boards with Change Records and Release Windows by pre-approved Change Roles. Yes - if it's BUSINESS CRITICAL - pull out your ITIL manuals and follow the !*@$ing steps!

Again - there's little disagreement here. People who run highly available architectures know that re-releasing something critical requires special attention to detail. Run the playbook like you're launching a nuclear missile: focus on the details.

To be clear, I love infrastructure-as-code. I think everything can be automated and it kills me to think about putting manual steps into tasks that we all know should run human-free. If your application is two-9's (3.65 days of down-time), automate it! Hell, give the developers access to production data - you can fix it later! What about 99.9% uptime (8.76 hours)? Hmm... not so sure. What about 99.99% up-time (52.56 minutes)? Well, that's not a lot of time to fix things if they go wrong. But wait - if I did DevOps automation correctly, shouldn't I be able to back out quickly? The answer is Yes - you SHOULD be able to run your SaveMyAss.py script and it MIGHT work.

Ponder this:
Dev-to-Test = Use traditional DevOps & IaC (Infrastructure as Code)
Test-to-Stage = (same as above)
Stage-to-Prod (version 1) = (same as above)
Patch-Prod (99% up-time or less) = (same as above)
Patch-Prod (99.9% or greater up-time) = Run your ITIL checklist. Use your IaC scripts if you've got 'em.
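Those nines translate into hard numbers. Here's a quick sanity check in Python (a throwaway sketch, not part of any toolchain) that reproduces the downtime figures quoted above, assuming a 365-day year:

```python
# Allowed downtime per year for a given availability percentage.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes(availability_pct):
    """Minutes of allowed downtime per year at a given availability %."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100.0)

for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% uptime -> {downtime_minutes(pct):.2f} minutes/year down")
```

Running this confirms the spread: 99% buys you 3.65 days of slack per year, while five-9's leaves barely five minutes - which is exactly why the two ends of that range deserve different change processes.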

For me, it's not an either/or choice between ITIL Transition Management and DevOps. IMHO, both have a time and a place. That said, I don't think that the answer is to inbreed the two - DevOps will get fat and be the loser in that battle. Keep agile agile. Use structure when you need it.

Monday, December 31, 2012

2013 Cloud Predictions

Here's my quick cloud predictions for 2013:


1. OpenStack continues to gain traction but many early adopters bypass Folsom in anticipation of Grizzly.

2. Amazon's push to the enterprise means we will see more hosted, packaged apps from Microsoft, SAP and other large ISV's. Their IaaS/PaaS introductions will be lackluster compared to previous years.

3. BMC and CA will acquire their way into the cloud.

4. SAP Hana will quickly determine that Teradata isn't their primary competitor as the rise of OSS solutions matures.

5. Data service layers (think Netflix/Cassandra) become common in large cloud deployments.

6. Rackspace, the "Open Cloud Company" continues to gain traction but users find more and more of their services 'not open'.

7. IBM goes another year without a cohesive cloud strategy.

8. Puppet and Chef continue to grow presence but Cfengine gets a resurgence in mindshare.

9. Cloud Bees, Rightscale, Canonical, Inktank, Enstratus, Piston Cloud, PagerDuty, Nebula and Gigaspaces are all acquired.

10. Eucalyptus sunsets native storage solutions and adopts OpenStack solutions.

11. VMware solution dominates over other CloudFoundry vendors.

12. Cloud 'cost control' vendors (Newvem, Cloudyn, Cloud Cruiser, Amysta, Cloudability, Raveld, CloudCheckR, Teevity, etc.) find the space too crowded and begin shifting focus.

13. PaaS solutions begin to look more and more like orchestration solutions with capabilities to leverage SDN, provisioned IOPS, IAM and autonomic features. Middleware vendors that don't offer open source solutions lose significant market share in cloud.

14. Microsoft's server-side OS refresh opens the door to more HyperV and private cloud.

15. Microsoft, Amazon and Google pull away from the pack in the public cloud while Dell, HP, AT&T and others grow their footprint but suffer growing pains (aka, outages).

16. Netflix funds and spins out a cloud automation company.

17. Red Hat focuses on the basics, mainly integrating/extending existing product lines with a continued emphasis on OpenStack.

18. Accenture remains largely absent from the cloud, leaving Capgemini and major off-shore companies to take the revenue lead.

19. EMC will continue to thrive: it's even easier to be sloppy with storage usage in the cloud and users realize it isn't 'all commodity hardware'.

20. In 2013, we'll see another talent war. It won't be as bad as dot-com, but talent will be tight.

I try to keep my predictions upbeat and avoid the forecasts on who will meet their demise - but yes, I anticipate a few companies will close doors or do asset sales. It's all part of the journey.

Enjoy your 2013!
Jeff

Saturday, December 29, 2012

AWS Outage: Netflix and Stackato

Over the last few days, a few of the engineers at TranscendComputing have been discussing what we could have done to have helped Netflix avoid their Christmas outage. For those of you who aren’t aware, AWS suffered an outage in the Elastic Load Balancer (ELB) service in the East Region.

In the middle of our discussions on creating massively scalable, highly available, clustered load balancers with feature parity to ELB, I caught a post by Diane Mueller at ActiveState. The gist of her post is that Netflix went down because of AWS, but her personal app (which leveraged FeedHenry and Stackato) was revived after 10 minutes. The post seems to imply that if you use a PaaS (like Stackato), you can switch clouds easily, like she did when she moved her application to the HP Cloud.

I’ll avoid the overly dramatic retort but let’s just say that I disagree with Diane’s implication. Here’s my position: if core Netflix applications were negatively affected by any core service (such as ELB), it would be extremely difficult to quickly switch to another cloud. Here are some specifics:
  1. No disrespect to my friends on the HP Cloud team but I honestly believe that if Netflix were to have done a sudden switch from AWS to HP it would have brought HP Cloud to its knees. ELB’s (if they had them) would have been crushed and Internet gateways would have been overloaded. Finding a very large number of idle servers may have also been a challenge.
  2. In this imaginary scenario, I guess we’ll assume that Netflix decided to keep their movie library and all application services running on multiple clouds. Sure this would be expensive but it wouldn’t have been realistic for them to do a just-in-time copy of the data from one location to the other.
  3. Netflix has done a great job of publishing their technical architecture: EMR, ELB, EIP, VPC, SQS, Autoscale, etc. None of these are available in the solution Diane prescribed (Stackato), nor does HP Cloud offer them natively. There is a complete mismatch of services between the clouds. CloudFoundry offers some things that are ‘similar’ but I’m concerned that they wouldn’t have offered performance at scale.
  4. Netflix has also created tools specific to the AWS cloud (Asgard, Astyanax, etc.) as well as tuned off-the-shelf tools for AWS like Cassandra. These would have to be refined to work on each target cloud.

In summary, there’s little-to-no chance that Netflix could have quickly moved to ANY other cloud provider (including Rackspace or Azure) and there’s not a thing that Stackato would have done to alleviate the problem. All medium and large customers have real needs that are service dependent. I’ve joked that CloudFoundry is a toy. It is, but it’s a toy that is maturing and eventually may help with ‘real’ problems – but let’s be clear – that day isn’t today. Any suggestion that it is ready for a ‘Netflix-like-outage’ is either naïve or intentionally misleading.

I’ve spent the last three years working on solving the AWS portability problem – and it’s a bitch. Like Diane, if you have a simple app, my solution, TopStack, will work. It replicates core AWS services for workload portability. As proud as I am of what the team at Transcend Computing has done, I’m also quick to note that cloning any of the AWS services at massive scale with minimal down-time, across heterogeneous cloud platforms and providers, is an incredibly tough problem.

Here’s my belief: running the Transcend Computing ELB service on HP Cloud would not have worked for Netflix in their time of need. Our software would have been crushed. HP’s cloud would have been crushed. Netflix’s homegrown software wouldn’t have been ‘practically portable’. It would not have worked.

I’m happy to acknowledge where we suck. We’ll continue to listen to the unfortunate incidents that AWS, Netflix and others encounter. My 2013 prediction for Transcend Computing is this: we’ll suck less. Acknowledging reality is the first step.

Saturday, November 24, 2012

What's After Cloud?


As an advisor to some of the world’s largest companies, it’s my job to keep up with advances in technology.  I’m paid to answer questions like, “what’s after cloud?” I’ve thought a lot about this very question and I’ve formed my answer: “More Cloud”. I believe that many new innovations will be packaged as 'cloud' and the combined ecosystem of innovation will outweigh other non-server side contenders. 

Clouds promote increased automation, computing efficiency and increased service levels. Public clouds add the outsourcing model, while private clouds leverage existing infrastructure. Despite the value clouds offer, investments made in cloud computing by both vendors and buyers have been insignificant relative to the size of the opportunity. I believe that the next several decades will be dominated by a single computing paradigm: cloud.

From Structured Programming to Cloud Elasticity
The magic of cloud is the ability of a service to provision additional computing capacity to solve the problem without the user being aware. Cloud offerings are divided into sub-systems that perform a specific function and can be called over a network via a well-defined interface. For the uninitiated, we call this a service-oriented architecture (SOA). Cloud offers a variety of services such as compute-as-a-service and database-as-a-service. The service-oriented approach allows an implementer to swap out the internals of a service without impacting the users. This concept is borrowed from prior art (structured programming, OOD, CBD, etc.) While SOA extends prior paradigms to embrace distributed computing, cloud extends SOA to solve issues related to quality attributes (non-functional concerns) such as scalability and availability. Cloud services respond to requests from various users/consumers, where each request varies in complexity to the point where the amount of computational power needed to satisfy a request will vary over time.

Encapsulated Innovation
The as-a-service model encapsulates (or hides) new innovations behind the service interface. For example, when Solid State Drives began delivering fast IO access at competitive prices, cloud storage services began using them under the covers. When new patterns and algorithms are invented we see them turned into as-a-Service offerings:
  • Map reduce becomes the AWS Elastic MapReduce Service
  • Dynamo and eventual consistency become AWS DynamoDB / MongoDB-aaS
  • Dremel becomes Google BigQuery
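For the uninitiated, the map/reduce pattern that services like Elastic MapReduce package up at scale is simple enough to sketch in a few lines. This is a toy word count illustrating the pattern itself, not the EMR API:

```python
from collections import defaultdict

def map_phase(docs):
    # Emit a (word, 1) pair for every word in every document.
    for doc in docs:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    # Group intermediate pairs by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the cloud", "the cloud scales"]
print(reduce_phase(shuffle(map_phase(docs))))
# {'the': 2, 'cloud': 2, 'scales': 1}
```

The as-a-Service version of this hides the hard parts - distributing the map and reduce phases across a fleet and surviving node failures - behind the same simple interface.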


Significant innovations will continue to unfold but the vehicle for delivering those innovations will be as-a-Service (SOA) with elastic infrastructure (cloud). Said another way, cloud will be awarded the credit for innovation because it is the delivery vehicle of the innovation.  This might seem like an inappropriate assignment of credit but in many cases the cloud model may be the only practical means of delivering highly complex, infrastructure intensive solutions. For example, setting up a large Hadoop farm is impractical for many users, but using one that is already in place (e.g., AWS EMR) brings the innovation to the masses. In this sense, the cloud isn’t the innovation but it is the agent that ignites its viability.

Metcalfe’s Law
A cloud is a collection of nodes that interact across multiple layers (e.g., security, recovery, etc.) As the collection of nodes grows, so does the value of the cloud. If this sounds familiar, it’s rooted in network theory (Metcalfe’s Law, Reed’s Law, etc.) To liberally paraphrase, these laws state that the value of a network increases as more nodes, users and content are added to it. I’d argue that the same model holds true for cloud: as the size of a cloud grows (machines, users, as-a-Service offerings), the value of the cloud grows non-linearly. Any solution that is able to accumulate value in a non-linear fashion becomes very difficult to replace. The traditional killer of network value propositions is when a new innovation kills the original, or when the network gets dirty (too costly, too complicated, etc.). In theory, SOA and the cloud delivery model exhibit inherent properties that counter these concerns.
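To make the non-linear growth concrete, here's a rough Python sketch of Metcalfe-style value (potential pairwise connections among nodes) versus the linear node count. It illustrates the shape of the Law, not a valuation model:

```python
# Metcalfe's Law counts potential pairwise connections, n*(n-1)/2,
# while node count (roughly, cost) grows only linearly in n.

def metcalfe_value(n):
    """Potential pairwise connections among n nodes."""
    return n * (n - 1) // 2

for n in (10, 100, 1000):
    print(f"{n} nodes -> {metcalfe_value(n)} potential connections")
# 10 nodes -> 45, 100 nodes -> 4950, 1000 nodes -> 499500
```

A 100x increase in nodes yields roughly a 10,000x increase in potential connections, which is why a cloud that keeps accumulating machines, users and services gets progressively harder to displace.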

Incremental Funding
A significant attribute of cloud is that it grows ‘horizontally’. This means that a cloud operator can add another server or storage system incrementally. Unlike the mainframe, you can grow a cloud by using small, inexpensive units. This characteristic encourages long-term growth.  Anyone who has had to fight for I.T. budget will recognize the importance of being able to leverage agile funding models. It’s more than a nicety; it’s a Darwinian survival method during depressed times. Cloud, like a cockroach, will be able to survive the harshest of environments.

Data Gravity (Before and After)
Dave McCrory suggested the concept of Data Gravity: “Data Gravity is a theory around which data has mass. As data (mass) accumulates, it begins to have gravity. This Data Gravity pulls services and applications closer to the data. This attraction (gravitational force) is caused by the need for services and applications to have higher bandwidth and/or lower latency access to the data.” McCrory’s concept suggests an initial barrier to cloud adoption (moving data to the cloud), but also suggests that once it has been moved, more data will be accumulated, increasing the difficulty of moving off of the cloud. This model jibes with the modern engineering belief that it’s better to move the application logic to the data, rather than the reverse. As clouds accumulate data, Data Gravity suggests that even more data (and logic) will accumulate.

The Centralization-Decentralization Debate
One of my first managers told me that I.T. goes through cycles of centralization and decentralization. At the time he mentioned it, we were moving from mainframes to client/server. He noted that when control moved too far away, there would be a natural reaction to take power back from the central authority - enough to solve your own problem. Of course, cloud attempts to balance this concern. The cloud is usually considered a centralized model due to the homogeneous nature of the data centers, servers, etc. However, the self-service aspect of cloud attempts to push power to the end user. Cloud is designed to be the happy medium between centralized and decentralized; only time will tell if it satisfies this issue.

In summary, I believe that multiple large innovations are coming but many, if not most, will be buried behind an as-a-Service interface and we’ll call them cloud. When I watch TV, I’m rarely aware of the innovations in the cameras, editing machines, satellites or other key elements of the ecosystem. From my perspective, TV just keeps getting better (it’s magic). The cloud encapsulates innovation in a similar manner. In some ways, it is unfortunate that new innovations will be buried by the delivery model, but fundamentally, it’s this very abstraction that will ensure cloud’s survival and growth.

Monday, November 19, 2012

Amazon’s Cloud: Five Layers of Competition

Most people would agree: Amazon Web Services is crushing their competition. Their innovation is leading edge, their rate of introducing new products is furious and their pricing is bargain-basement low.

This is a tough combination to beat! How do they do it?

The Power of Permutations
Amazon’s offering takes a layered approach. New solutions are introduced at any of the Five Layers and are then combined with the other layers. By creating solutions with interchangeable parts, they’ve harvested the power of permutations via configurable systems.

Platform
Take an example starting with a new platform. Let’s imagine that Amazon were to offer a new Data Analytics service. They’d likely consider the offering from two angles: 1) How do we support current analytics platforms (legacy)? and 2) How do we reinvent the platform to take advantage of scale-out, commodity architectures? Amazon typically releases new platforms in a way that supports current customer needs (e.g., support for MySQL, Oracle, etc.) and then rolls out a second way that is proprietary (e.g., SimpleDB, DynamoDB) but arguably a better solution for a cloud-based architecture.

Data Center: When Amazon releases a new offering they rarely release it to all of their data centers at the same time. We’d expect them to launch it first in their largest center, the AWS East Region, across multiple availability zones. After some stabilization period, the offering would likely be delivered in all US regions, or even globally. Later, it would be added to restricted centers like GovCloud. Amazon is careful to release a new offering in a limited geography for evaluation purposes. Over time, the service is expanded geographically.

Virtualized Infrastructure: The new service would likely use hardware and storage devices best suited for the job (large memory, high CPU, fast network). It’s common to see Amazon introduce new compute configurations that were driven by the needs of their platform offerings. Over time, the offerings are extended to use additional support services. This might include things like ways to back up the data or patch the service. Naturally, we’d expect that as even newer infrastructure offerings became available, we’d be able to insert them into our platform configuration.

Cross-Cutting Services: For every service introduced, there are a number of “crosscutting services” that intersect all of the offerings. Amazon’s first priority is usually to update their UI console, which enables convenient administration of the service. Later, we’d expect the service to be added to their monitoring system (CloudWatch) and their orchestration service (CloudFormation), and to be secured via their permissions system (IAM). These three crosscutting services are key enablers to the automation story that Amazon offers.

Economics: Perhaps the only thing Amazon enjoys more than creating new cloud services is finding interesting ways to price them. For any new offering, we would expect Amazon to have multiple ways to price it. If it was for a legacy platform, we’d expect to be billed by the size of the machines and the number of hours that they ran, and the disk and network that they used. If it was a next-generation platform, we’d expect to be billed on some new concept – perhaps the number of rows analyzed, or rows returned on a query. Either way, we’d expect that the price of the offering will come down over time due to Amazon’s economies of scale and efficiency.
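The legacy machine-hour model is easy to sketch. The rates below are made-up placeholders for illustration, not actual AWS prices:

```python
# Back-of-the-envelope for instance-hour billing plus storage.
# All rates here are hypothetical placeholders, not AWS price-list numbers.

def monthly_cost(instances, hourly_rate, hours=730, storage_gb=0,
                 gb_month_rate=0.0):
    """Instance-hours plus storage: the 'legacy platform' billing model."""
    return instances * hourly_rate * hours + storage_gb * gb_month_rate

# e.g., 4 instances at a hypothetical $0.10/hour, plus 500 GB at a
# hypothetical $0.05/GB-month, over a ~730-hour month:
print(f"${monthly_cost(4, 0.10, storage_gb=500, gb_month_rate=0.05):.2f}")
```

The next-generation pricing models simply swap the unit being metered (rows analyzed, requests served) into the same multiply-and-sum structure.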

The Amazon advantage isn’t about any one service or offering. It’s a combinatorial solution. They have found a formula for decoupling their offering in a way that enables rapid new product introduction and perhaps more importantly it offers the ability to upgrade their offerings in a predictable and leveraged manner over time. Their ability to combine two or more products to create a new offering gives them ‘economies of scope’. This is a fundamental enabler of product diversification and leads to lower average cost per unit across the portfolio. Amazon’s ability to independently control the Five Layers has given them a repeatable formula for success. Next time you read about Amazon introducing XYZ platform, in the East Region, using Clustered Compute Boxes, hooked into CloudWatch, CloudFormation and IAM, with Reserved Instance and Spot Instance pricing – just remember, it’s no accident. Service providers who aren’t able to pivot at the Five Layers may find themselves obsolete.

Saturday, June 02, 2012

Why You Really, Truly Want a Private Cloud

Jason Bloomberg wrote a thought provoking article on, "Why You Really, Truly Don't Want a Private Cloud". The article reviews the benefits of public cloud and then challenges the ability for a private cloud to bring the same benefits. Unfortunately, I think Jason's conclusions are wrong. I want to be clear about two things: my day-to-day experience and my potential conflict of interests.

Conflicts of Interest: MomentumSI consults with organizations on how to select private clouds, install/configure them, monitor, manage, govern and secure them. Transcend Computing provides software that makes the private cloud run more like Amazon. Each company does a significant amount of work in public cloud. We love both public and private clouds.

My day-to-day Experience: I have teams of consultants and engineers who use private cloud and public cloud on every assignment. They have done so for years. Cloud is their default deployment model. Several of my younger team members have never worked with physical servers/disks/network devices - they only know IaaS/PaaS. Team members switch between public and private clouds like they're switching from a pen to a pencil. They don't think twice about it - they just do it. The reasons why they select one over the other are commonsense to them (and anyone who has access to both):

  • They run elastic/bursty jobs in the public cloud
  • Most new production applications are run in the public cloud because there is built-in disaster recovery, elastic scaling and a global footprint: Availability Zones, Regions, etc.
  • Pre-production staging environments are done in the public cloud because we want it to mirror the production architecture. 
  • Most of our legacy COTS applications have been moved to the private cloud. We watch them closely, optimize their environments when needed and avoid violating the license agreements which often prohibit their execution in a public cloud. 
  • Most dev/test is done in our private clouds. We run Eucalytpus, OpenStack and CloudStack. Most companies wouldn't do this, but we do given the nature of our consulting. Developers prefer private clouds when they want:
    • Low latency access to their cloud (for themselves or other applications)
    • Low level probing that filters out multi-tenant noisy-neighbors 
    • Constant booting of a machine (fast and cheap)
    • More choice in the cloud hardware configuration (Amazon is getting there, but still has a long way to go...)
    • We see more experimentation being done on the private cloud (fixed/sunk costs). Most team members are keenly aware of the large public cloud bills that they've generated.

The reasons why a person might use one or the other are in some ways irrelevant. The fact is, they do. I'm proposing the following:
"When I.T. staff are given access to a public and a private cloud, they will use both. Either way, they will get their work done faster and ultimately save their employer money in labor and asset costs."
I had a really hard time swallowing Jason's analysis that clouds are only good when you rent them from a third party like Amazon or Rackspace. Using the vehicle analogy, I believe it's OK to own your car and to rent other vehicles when needed (e.g., boats, RVs, taxi-cabs, vacation bikes, jet skis, limos, and so on). It's not an all-or-nothing proposition.

The one thing I want to leave you with is this: I.T. staff will use both - and they'll figure out when to use each. They're not dumb. They don't blindly listen to bloggers, authors, analysts or tweeters. Give them access and let them do what you pay them to do. Empower them.

Wednesday, April 04, 2012

Formation of Transcend Computing


I’m excited to announce that today, April 4th, our new company Transcend Computing will be emerging from stealth mode. In short, we are launching:

StackStudio is a visual, drag-and-drop online development environment for assembling multi-tier application topologies using the Amazon CloudFormation format. Application stacks assembled with StackStudio are ready to run on Amazon Web Services (AWS) and on other public and private ACE platforms.

These stacks can then be shared with other developers in StackPlace, which was also launched today as an open social architecture community sponsored by Transcend Computing. StackPlace allows developers to create, contribute, consume and collaborate on ACE-compatible application topologies.

To learn more, please visit: http://www.TranscendComputing.com

This is an exciting time for the Momentum family. As most of you know, we’ve been incubating this program for the last couple of years. The initial offering is a SaaS solution used to create multi-part applications on AWS. In the coming months, we'll be introducing additional on-premise services. 

It's been fun to watch the transition from SOA to 'as-a-Service'. There's little doubt that cloud (IaaS/PaaS) is the new model for application development and deployment. This is one of those few areas where both engineers and executives can agree on a new paradigm. This recipe for success will unfold over the coming years - and we're excited to be leaders in this new movement. 

Saturday, December 03, 2011

Will Amazon Support Linux Containers?

Early on, Amazon EC2 was recognized as the leading IaaS provider because of their ability to easily provision new virtual machines with a variety of configurations (size, speed, attachments, etc.) Virtual machines are a powerful, yet simple tool for engineers to use but they come at a price (a performance hit). At MomentumSI, we've been pondering if Amazon would ever support Linux Containers in their cloud. 


When asked, "Will Amazon Support Linux Containers?" Raj comments, "Would love it. We may see a type of instance which allows containers on it. You will have to take the whole machine and not just a container on it. That way AWS will not have to bother about maintaining the host OS. Given the complexities I think it will be a lower priority for Amazon and as it may be financially counterproductive; they may never do it."



Tom comments, "I doubt it. While I'm one of, if not *the*, biggest proponent of linux containers, the business reasoning still lags the technical reasoning. Intel, for instance, would *hate* such a move. Why? They spent a ton of money on virtualization at a chip level, which becomes a non-issue in containers (no hardware gets shared at the metal, rather, it's all one kernel for all containers). So, while it would be a great thing to see, the business market simply doesn't support this at this point, other than for folks like Pixar or other compute heavy folks.

What I *would* bet on is that AWS internally switches to some container based systems. For instance, ElasticMapReduce is far better off in a container world than in a VM world. Easier to maintain, direct access to 'cpu speed' and no need to virtualize access to disks -- it's all just there (even ISCSI ends up better in containers -- no 'vm to hypervisor' network translations)."


Amazon will likely be forced into one of three positions: 
1. Delivering sub-optimal platform performance on VM's (current state)
2. Supporting Linux Containers behind the scenes but not giving customer access to it. 
3. Delivering Linux Containers to customers and dealing with a whole new set of technical headaches. 


I'm more optimistic than my counterparts on the likelihood of #3. My reasons are simple: First, Amazon has done what they needed to do to satisfy customer needs.  Second, I think they'll need to do it to remain competitive with companies like Rackspace. As developers move from "needing a vm" to "needing a platform" (database, app server, etc.), Amazon will be pressed to expose a more highly performant layer to platform developers. One thing my associates and I agreed on is that we will not likely see containers in 2012... perhaps 2013?

Tuesday, November 22, 2011

Is Cloud Foundry a PaaS?

I've been asking some people in the industry a real simple question, "Is Cloud Foundry a Platform as a Service"?

The obvious answer would seem to be "yes" - after all, VMware told us it's a PaaS.

That should be the end of it, right? For some reason, when I hear "as-a-Service", I expect a "service" - as in Service Oriented. I don't think that's too much to ask. For example, when Amazon released their relational data service, they offered me a published service interface:
https://rds.amazonaws.com/doc/2010-07-28/AmazonRDSv4.wsdl

I know there are people who hate SOAP, WS-*, WSDL, etc. - that's cool, to each their own. If you prefer, use the RESTful API: http://docs.amazonwebservices.com/AmazonRDS/latest/APIReference/

Note that the service interface IS NOT the same as the interface of the underlying component (MySQL, Oracle, etc.), as those are exposed separately.
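To make the distinction concrete, here's a hedged Python sketch contrasting the two interfaces. The endpoint URL and parameter names mirror the RDS Query API cited above, but request signing is omitted and the MySQL hostname is an invented placeholder:

```python
# Sketch: the *service* interface (control plane) manages the database
# instance, while the *component* interface (data plane) is the database
# protocol itself. The MySQL host below is an illustrative assumption.
from urllib.parse import urlencode

def control_plane_request(action, params):
    """Build an RDS-style Query API URL (unsigned, for illustration only)."""
    query = {"Action": action, "Version": "2010-07-28", **params}
    return "https://rds.amazonaws.com/?" + urlencode(sorted(query.items()))

def data_plane_dsn(host, db, user):
    """The underlying component is reached separately, e.g. as a MySQL DSN."""
    return f"mysql://{user}@{host}:3306/{db}"

# Two distinct interfaces: one provisions/manages, one stores data.
mgmt = control_plane_request("DescribeDBInstances",
                             {"DBInstanceIdentifier": "mydb"})
data = data_plane_dsn("mydb.example.rds.amazonaws.com", "app", "admin")
```

The point stands on its own: a "service" implies a published, managed interface over and above the component it wraps.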

Back to my question - is Cloud Foundry a PaaS?

If so, can someone point me to the WSDL's, RESTful interfaces, etc?

Will those interfaces be submitted to DMTF, OASIS or another standards body?

Alternatively, is it merely a platform substrate that ties together multiple server-side technologies (similar to JBoss or WebSphere)?

Will cultural pushback kill private clouds?

Derick Harris asks the question, "Will cultural pushback kill private clouds?" His questioning comes from a piece provided by Lydia Leong where she notes that many enterprises have fat management structures and aren't organized like many of the leaner cloud providers.

I tend to agree with the premise that the enterprise will have difficulties in adopting private cloud but not for the reasons the authors noted. The IaaS & PaaS software is available. Vendors are now offering to manage your private cloud in an outsourced manner. More often than not, companies are educated on cloud and "get it". They have one group of people who create, extend and support the cloud(s). They have another group who use it to create business solutions. It's a simple consumer & provider relationship.

Traditionally, there are three ways things get done in Enterprise IT:
1. The CIO says "get'er done" (and writes a check)
2. A smart business/IT person uses program funds to sneak in a new technology (and shows success)
3. Geeks on the floor just go and do it.

With the number of downloads of open source stacks like OpenStack and Eucalyptus, it is apparent that model #3 is getting some traction. My gut tells me that the #2 guys are just pushing their stuff to the public cloud (begging forgiveness rather than asking permission). On #1, many CIO's are hopeful that they can just extend their VMware play - while more aggressive CIO's are looking to the next-generation cloud vendors to provide something that matches the public cloud features more directly.

There are adoption issues in the enterprise. However, it's the same old reasons. Fat org-charts aren't going away and will not be the life or death of private cloud. In my opinion, we need the CIO's to make bold statements on switching to an internal/external cloud operating model. Transformation isn't easy. And telling the CIO that they need to fire a bunch of managers in order to look more like a cloud provider is silly advice and a complete non-starter.

Friday, August 12, 2011

Measuring Availability of Cloud Systems

The analysts at Saugatuck Technology recently wrote a note on "Cloud IT Failures Emphasize Need for Expectation Management". One comment caught my attention:

"Recall that the availability of a group of components is the product of all of the individual component availabilities. For example, the overall availability of 5 components, each with 99 percent availability, is: 0.99 X 0.99 X 0.99 X 0.99 X 0.99 = 95 percent."

I understand their math - but it strikes me as odd that they would use this thinking when discussing cloud computing. In cloud environments, components are often deployed as virtualized n+1 highly available pairs. If one is down, the other takes over. In a non-cloud world, this architecture is typically reserved for only the most critical components (e.g., load balancers or other single points of failure). It's also common to create a complete replica of the environment in a disaster recovery area (e.g., AWS availability zones). In theory, this leads to very high up-time.

Let me put this another way... I currently have two cars in my driveway. Let's say each of them has 99% up-time. If one car doesn't start, I'll try the other car. If neither car starts, I'll most likely walk over to my neighbor's house and ask to borrow one of their two cars (my DR plan). You can picture the math... in the 1% chance that car A fails, there's a 99% chance that car B will start, and so on. However, experience with both cars and computing tells us that this math doesn't work either. For instance, if car A didn't start because it was 20 degrees below zero outside, there's a good chance that car B won't start - and for that matter, my neighbors' cars won't start either. Structural or natural problems tend to infect the mass.
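The serial and redundant cases can be sketched in a few lines of Python. The correlated-failure term at the end is my own illustrative assumption (a single common-mode failure probability), not an established formula:

```python
# Availability math sketch. The correlation adjustment is an illustrative
# assumption, not a standard model.

def series_availability(a, n):
    """n components in series, each with availability a (the 'old math')."""
    return a ** n

def redundant_pair(a):
    """Two independent components where either one suffices."""
    return 1 - (1 - a) ** 2

def correlated_pair(a, common_mode):
    """Redundant pair capped by a shared failure cause (cold snap, region outage)."""
    return (1 - common_mode) * redundant_pair(a)

print(round(series_availability(0.99, 5), 3))  # 0.951 - Saugatuck's example
print(round(redundant_pair(0.99), 4))          # ~'four nines' if independent
print(correlated_pair(0.99, 0.001))            # correlation drags it back down
```

The independent-pair math is what makes the cars-in-the-driveway story look so good; the common-mode term is the "20 below zero" effect.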

I wish I could show you the new math for calculating availability in cloud systems - but it's beyond my pay grade. What I know is that the old math isn't accurate. Anyone have suggestions on a more modern approach?

Thursday, August 11, 2011

OpenShift: Is it really PaaS?

Redhat recently announced an upgraded version of OpenShift with exciting new features including support for Java EE6, Membase, MongoDB and more. See details at:

As I dug through the descriptions, I found myself with more questions than answers. When you say Membase or MongoDB are available as part of the PaaS, what does this really mean? For example:
  • They're pre-installed in clustered or replicated manner?
  • They're monitored out of the box?
  • Will it auto-scale based on the monitoring data and predefined thresholds? (both up and down?)
  • They have a data backup / restore facility as part of the as-a-service offering?
  • The backup / restore are as-a-service?
  • The backup / restore use a job scheduling system that's available as-a-service?
  • The backup / restore use an object storage system that has cross data center replication?
Ok, you get the idea. Let me be clear - I'm not suggesting that OpenShift does or doesn't do these things. Arguments can be made that, in some cases, it doesn't need to do them. My point is that several new "PaaS offerings" are coming to market and they smell like the same-ole-sh!t. If nothing else, the product marketing teams will need to do a better job of explaining what they currently have. Old architects need details.

It's no secret that I'm a fan of Amazon's approach of releasing their full API's (AWS Query, WSDL, Java & Ruby API's, etc.) along with some great documentation. They've built a layered architecture whereby the upper layers (PaaS) leverage lower layers (Automation & IaaS) to do things like monitoring, deployment & configuration of both the platforms and the infrastructure elements (block storage, virtual compute, etc.) The bar has been set for what makes something PaaS - and going forward, products will be measured against it. It's ok if your offering doesn't do all the sophisticated things you find in AWS - but it's better to be up front about it. Old architects will understand.

Tuesday, April 26, 2011

Private Cloud Provisioning Templates

One of the primary benefits of a cloud computing environment is the increased automation. The Provisioning Service is perhaps the core mechanism to deliver this. To better understand the kinds of things we might orchestrate, take a look at the following template. You'll notice that it takes on the same format as Amazon's CloudFormation. This example launches a load balancer as part of our LB-aaS solution for a Eucalyptus cloud:

{
  "ToughTemplateFormatVersion" : "2011-03-01",

  "Description" : "Launch Load Balancer instance and install LB software.",

  "Parameters" : {
    "AvailabilityZone" : {
      "Description" : "AvailabilityZone in which an instance should be created",
      "Type" : "String"
    },
    "AccountId" : {
      "Description" : "Account Id",
      "Type" : "String"
    },
    "LoadBalancerName" : {
      "Description" : "Load Balancer Name",
      "Type" : "String"
    }
  },

  "Mappings" : {
    "AvailabilityZoneMap" : {
      "msicluster" : {
        "SecurityGroups" : "default",
        "ImageId" : "emi-FF070BFE",
        "KeyName" : "rarora",
        "EKI" : "eki-3A4A0D5A",
        "ERI" : "eri-B2C7101A",
        "InstanceType" : "c1.medium",
        "UserData" : "80"
      }
    }
  },

  "Resources" : {
    "LoadBalancerLaunchConfig" : {
      "Type" : "TOUGH::LaunchConfiguration",
      "Properties" : {
        "AccountId" : { "Ref" : "AccountId" },
        "SecurityGroups" : { "Fn::FindInMap" : [ "AvailabilityZoneMap", { "Ref" : "AvailabilityZone" }, "SecurityGroups" ] },
        "ImageId" : { "Fn::FindInMap" : [ "AvailabilityZoneMap", { "Ref" : "AvailabilityZone" }, "ImageId" ] },
        "KeyName" : { "Fn::FindInMap" : [ "AvailabilityZoneMap", { "Ref" : "AvailabilityZone" }, "KeyName" ] },
        "InstanceType" : { "Fn::FindInMap" : [ "AvailabilityZoneMap", { "Ref" : "AvailabilityZone" }, "InstanceType" ] },
        "EKI" : { "Fn::FindInMap" : [ "AvailabilityZoneMap", { "Ref" : "AvailabilityZone" }, "EKI" ] },
        "ERI" : { "Fn::FindInMap" : [ "AvailabilityZoneMap", { "Ref" : "AvailabilityZone" }, "ERI" ] }
      }
    },
    "LoadBalancerInstance" : {
      "Type" : "TOUGH::EUCA::LaunchInstance",
      "Properties" : {
        "AccountId" : { "Ref" : "AccountId" },
        "AvailabilityZone" : { "Ref" : "AvailabilityZone" },
        "LaunchConfig" : { "Ref" : "LoadBalancerLaunchConfig" },
        "Setup" : {
        }
      }
    },
    "RegisterLoadBalancerInstance" : {
      "Type" : "TOUGH::ElasticLoadBalancing::RegisterLoadBalancerInstance",
      "Properties" : {
        "AccountId" : { "Ref" : "AccountId" },
        "LoadBalancerName" : { "Ref" : "LoadBalancerName" },
        "Instance" : { "Ref" : "LoadBalancerInstance" }
      }
    },
    "Setup" : {
      "Type" : "TOUGH::EUCA::Parallel",
      "Operations" : {
        "TrackLoadBalancerInstance" : {
          "Type" : "TOUGH::EUCA::TrackInstance",
          "Name" : "LoadBalancerInstance",
          "Properties" : {
            "AccountId" : { "Ref" : "AccountId" },
            "InstanceId" : { "Fn::GetAtt" : [ "LoadBalancerInstance", "InstanceId" ] }
          }
        },
        "InstallLoadBalancerSoftware" : {
          "Type" : "TOUGH::ElasticLoadBalancing::InstallLoadBalancerSoftware",
          "Properties" : {
            "AccountId" : { "Ref" : "AccountId" },
            "IP" : { "Fn::GetAtt" : [ "LoadBalancerInstance", "PublicIp" ] }
          }
        }
      }
    }
  },

  "Outputs" : {
    "PublicIP" : {
      "Description" : "PublicIP address of the LoadBalancer",
      "Value" : { "Fn::GetAtt" : [ "LoadBalancerInstance", "PublicIp" ] }
    }
  }
}

The JSON format can be a bit difficult to read if you're not familiar with it. Amazon and others now have UI's that facilitate the creation of the templates. In this example, there are a few items worth noting:
1. The template accepts input variables and returns information at the end of execution
2. The orchestration automates a series of tasks (launches a bare image, installs LB software, tracks the progress, configures the software, registers the newly launched instance, etc.)
3. The templates treat cloud concepts (availability zones, cloud services, etc.) as first-order constructs in the syntax.

Keep in mind that the orchestration scripts can be multiple levels deep. This example was a simple one just to launch a load balancer. A more complicated orchestration would initiate multiple orchestration templates.
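As a rough illustration of how a provisioner interprets such a template, here's a hedged Python sketch of the Ref / Fn::FindInMap resolution step. The TOUGH resource types themselves are out of scope; this only shows parameter and mapping substitution, which mirrors CloudFormation's intrinsic functions:

```python
# Minimal sketch of resolving template intrinsics ({"Ref": ...} and
# {"Fn::FindInMap": ...}) against supplied parameters and the Mappings
# section. Real engines (CloudFormation, TOUGH) do far more.

def resolve(node, params, mappings):
    """Recursively replace intrinsic-function dicts with concrete values."""
    if isinstance(node, dict):
        if "Ref" in node and len(node) == 1:
            return params[node["Ref"]]
        if "Fn::FindInMap" in node and len(node) == 1:
            map_name, top_key, second_key = (resolve(x, params, mappings)
                                             for x in node["Fn::FindInMap"])
            return mappings[map_name][top_key][second_key]
        return {k: resolve(v, params, mappings) for k, v in node.items()}
    if isinstance(node, list):
        return [resolve(x, params, mappings) for x in node]
    return node  # plain string/number

# A fragment of the template above, resolved against one parameter set:
mappings = {"AvailabilityZoneMap": {"msicluster": {"ImageId": "emi-FF070BFE"}}}
params = {"AvailabilityZone": "msicluster"}
prop = {"ImageId": {"Fn::FindInMap": ["AvailabilityZoneMap",
                                      {"Ref": "AvailabilityZone"}, "ImageId"]}}
print(resolve(prop, params, mappings))  # {'ImageId': 'emi-FF070BFE'}
```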

In the coming months, we'll be releasing a series of templates designed to orchestrate the provisioning of many common applications. The provisioning templates will fully leverage the power of the cloud (auto scale, auto recover, auto-snapshot, auto balance, etc.)

Sunday, April 24, 2011

Private Cloud Provisioning & Configuration

Cloud provisioning has focused on the rapid acquisition and initialization of a new server, disk or some other piece of infrastructure. Provisioning a single piece of infrastructure is now quite easy. Provisioning an entire set is much more complicated. In addition to the setup of each piece of equipment, it's necessary to understand the dependencies between elements. In some cases, certain infrastructure components must be launched before others, or configuration data from one element must be passed to another. Getting it all right is a difficult task, and getting it wrong is a major cause of system failures. One approach to solving the problem is to consider Deployment Fidelity - the degree to which a deployment is able to fully describe its architecture and configuration in a digitally precise manner.

Historically, application architects have used Word documents and Visio diagrams to depict the relationship between their software modules and the hardware infrastructure that would host them. Deployment Fidelity deals with accurately describing a set of computing resources and their relationship to each other. Organizations that embrace high fidelity will digitally describe their software and hardware topology: what type of hardware, operating systems, memory, infrastructure services, platform services, etc. and pass the digital description to the cloud provisioner for execution. The business value is two-fold. First, the high fidelity description reduces the chances of manual error, especially during hand-off. Second, the automation of the provisioning task reduces the deployment time and associated costs (e.g., sysadmins running individual scripts, testers waiting for new environments, etc.)

To increase the Deployment Fidelity, the relationships between elements must be captured. For instance, if an application server uses a relational database, the link between the two is recorded and configuration variables (such as IP addresses) are noted. If the server has an outage, a replacement can be auto-launched with the same configuration information. As the complexity of an application increases (load balancers, web servers, app servers, multiple databases, message queues, pub/sub, etc.) the need to keep a digital description becomes extremely important in order to reduce the chance of errors during deployment.
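The launch-ordering problem described above is essentially a dependency graph. A hedged sketch using Python's standard graphlib - the resource names and dependency map are invented for illustration:

```python
# Sketch: derive a safe launch order from a digitally-described topology.
# Resource names and the dependency map are illustrative assumptions.
from graphlib import TopologicalSorter

# "X depends on Y" means Y must be launched (and its config, such as its
# IP address, captured) before X.
topology = {
    "app_server": {"database", "message_queue"},
    "web_server": {"app_server"},
    "load_balancer": {"web_server"},
    "database": set(),
    "message_queue": set(),
}

launch_order = list(TopologicalSorter(topology).static_order())
print(launch_order)  # dependencies always appear before their dependents
```

If an element fails, a replacement can be relaunched at its position in this order, re-using the configuration recorded for its dependencies.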

From an organizational perspective, there are two highlights:
1. The deployment architect can describe their proposed solution with complete fidelity - no misinterpretation. In addition, if there is an issue, the changes to the architecture can be captured in version control, just as if it were another piece of software code.
2. The sysadmin or release engineer can take the provisioning script and easily create a new environment (i.e., replicating Dev to Test, etc.)

Today, MomentumSI is announcing the release of two new services that orchestrate the provisioning of complex application topologies and then provide the configuration information:
The Tough Provisioning Service provides equivalent functionality found in Amazon's CloudFormation and is API/Syntax compatible with their offering.

The Tough Configuration Service integrates the most popular configuration management systems into the private cloud. Use your choice of Chef or Puppet to create configuration scripts and then expose them as enterprise grade services (secure access, multiple node delivery, guaranteed transmission, closed loop feedback, etc.)

Our solution brings this functionality to your private cloud by complementing your existing investment in VMware or Eucalyptus.

For more information, see Tough Solutions.

Tuesday, April 05, 2011

Are Enterprise Architects Intimidated by the Cloud?

EA's are often the champion of large change initiatives that span multiple business units. If they're not on board - we've got problems.

Here's why I ask the question:
1. It's my perception (perhaps incorrect) that the EA leadership typically doesn't come from a background in infrastructure architecture. It's been my observation that the EA's who tend to get promoted usually have a background in business or application architecture. These people are often hesitant to enter deep discussions on CPU power consumption, DNS propagation, VLAN decisions, storage protocols, hypervisor trade-offs, etc.

2. Most people have agreed that the cloud can be viewed as a series of layers. You can attack it from the top (SaaS) or the bottom (IaaS). Quite frankly, there isn't *that much* architecture in SaaS (other than the secure connection and integration). That leaves IaaS as the starting point - which takes me back to point #1: IaaS intimidates the EA team, meaning that they're relying on the I.T. data center operations team (and localized infrastructure architects) to define the foundational IaaS layers which will serve PaaS, Dev/Test, disaster recovery, Hadoop clusters, etc.

Any truth here? Leave a comment (moderated) or send me an email either way: jschneider AT MomentumSI DOT com

Monday, April 04, 2011

Cloud.com offers Amazon API

The most recent version of Cloud.com is now offering a 'bridge' for the core AWS EC2 services:

"CloudBridge provides a compatibility layer for CloudStack cloud computing software that lets tools designed for Amazon Web Services work with CloudStack.

The CloudBridge is a server process that runs as an adjunct to the CloudStack. The CloudBridge provides an Amazon EC2 compatible API via both SOAP and REST web services."
The functions they support include:

Addresses
  AllocateAddress
  AssociateAddress
  DescribeAddresses
  DisassociateAddress
  ReleaseAddress

Availability Zones
  DescribeAvailabilityZones

Images
  CreateImage
  DeregisterImage
  DescribeImages
  RegisterImage

Image Attributes
  DescribeImageAttribute
  ModifyImageAttribute
  ResetImageAttribute

Instances
  DescribeInstances
  RunInstances
  RebootInstances
  StartInstances
  StopInstances
  TerminateInstances

Instance Attributes
  DescribeInstanceAttribute

Keypairs
  CreateKeyPair
  DeleteKeyPair
  DescribeKeyPairs
  ImportKeyPair

Passwords
  GetPasswordData

Security Groups
  AuthorizeSecurityGroupIngress
  CreateSecurityGroup
  DeleteSecurityGroup
  DescribeSecurityGroups
  RevokeSecurityGroupIngress

Snapshots
  CreateSnapshot
  DeleteSnapshot
  DescribeSnapshots

Volumes
  AttachVolume
  CreateVolume
  DeleteVolume
  DescribeVolumes
  DetachVolume
Although this list represents the core features of EC2, it doesn't yet cover the upper layers (CloudWatch, Auto Scale, etc.) or the PaaS offering (SNS, SQS, etc.) Regardless, I'm excited to see more emphasis being placed on supporting the AWS standard. It's easy for people to say that IaaS standards don't matter. However, if you're the guy building software on top of IaaS, they matter a WHOLE lot.
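Why compatibility matters to the "guy building software on top of IaaS" can be sketched in Python: a compatible API means the same request works against either cloud, with only the endpoint swapped. The CloudBridge host and port below are invented placeholders, and request signing is omitted:

```python
# Sketch: the same EC2 Query API request aimed at two different endpoints.
# 'cloudbridge.example.com:7080' is a placeholder, not a documented default.
from urllib.parse import urlencode

def ec2_query_url(endpoint, action, **params):
    """Build an (unsigned) EC2 Query API URL; real clients add AWS signatures."""
    query = {"Action": action, "Version": "2010-11-15", **params}
    return f"{endpoint}/?{urlencode(sorted(query.items()))}"

aws = ec2_query_url("https://ec2.amazonaws.com", "DescribeInstances")
bridge = ec2_query_url("http://cloudbridge.example.com:7080", "DescribeInstances")
# Same Action, same parameters - only the endpoint differs.
```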

Cloud.com is a solid piece of software that has achieved success in the service provider market. To date, they haven't pushed too hard in the enterprise. Their decision to embrace the AWS API is a good one - and is complemented with their decision to use the pieces of OpenStack in their software where appropriate. This idea seems to be getting more traction. I'm hearing more and more people talking about OpenStack like it's a drawer that you reach into and grab out the components that you want - - rather than a holistic platform. I'm not sure if that's what the OpenStack team was shooting for but it's interesting to see guys like Cloud.com being open to leveraging the bits and pieces that they find useful.


Saturday, April 02, 2011

The commoditization of scalability

Last week, I had an interesting discussion with a product owner at an ISV. We discussed his offering; it was core plumbing-middleware-kind-of-stuff. When I asked about how he differentiated his offering from others on the market the answer was that they scale better. Our discussion moved from what he was doing to what I was up to and without trying to be coy I said, "We enable the commoditization of scalability". What I mean by this is that we help our customers adopt public and private clouds that know how to auto scale applications (and much more).

Of course, ISV's have always used non-functional attributes like availability, scalability and security as competitive differentiators in their offerings. These capabilities are now being provided as features of the IaaS fabric. ISV's will need to redesign their next-generation products on top of cloud infrastructures like Amazon, Eucalyptus, vCloud Director, Cloud.com, OpenStack and Nimbula. It will no longer be acceptable for an ISV to march into a customer and demand a block of servers to run their proprietary clusters. They will be expected to allocate compute resources from the IaaS common pool. In addition, ISV's will need to differentiate on attributes other than those provided by the IaaS fabric.

This change will affect the corporate I.T. software development department as well. I've witnessed several I.T. groups attempt to design highly scalable architectures. Usually, the I.T. personnel aren't trained to perform this kind of work, and either the project fails or delivery costs are very high. I believe that the I.T. departments that invest in IaaS will be able to significantly reduce the cost to design, deploy and operate highly scalable systems. It might be premature to declare the commoditization of scalability, but I truly believe we are witnessing the most significant step toward that goal in my 20-year career.

Wednesday, March 16, 2011

Providing Cloud Service Tiers

In the early days of cloud computing, emphasis was placed on 'one size fits all'. However, as our delivery capabilities have increased, we're now able to deliver product variations that provide the same function (e.g., storage) but offer better performance, availability, recovery, etc., at a higher price. I.T. must assume that some applications are business critical while others are not. Forcing users to pay for the same class of service across the spectrum is not a viable option. We've spent a good deal of time analyzing various cloud configurations, and can now deliver tiered classes of services in our private clouds.

Reviewing trials, tribulations and successes in implementing cloud solutions, one can separate tiers of cloud services into two categories: 1) higher throughput elastic networking; or 2) higher throughput storage. We leave the third (more CPU) out of this discussion because it generally boils down to 'more machines,' whereas storage and networking span all machines.

Higher network throughput raises complex issues regarding how one structures networks - VLAN or L2 isolation, shared segments and others. Those complexities, and related costs, increase dramatically when adding multi-speed NICs and switches, for instance 10GBase-T, NIC teaming and other such facilities. We will delve into all of that in a future post.

Tiered Storage on Private Cloud

Where tiered storage classes are at issue, cost and complexity are not such a dramatic barrier, unless we include a mix of network and storage (i.e., iSCSI storage tiers). For the sake of simplicity, let's ignore that and break the areas of tiered interest into: 1) elastic block storage ("EBS"); 2) cached virtual machine images; and 3) running virtual machine ("VM") images. In the MomentumSI private cloud, we've implemented multiple tiers of storage services by adding solid state drives (SSDs) to each of these areas, but doing so requires matching the nature of the storage usage with the location of the physical drives.

Consider implementing EBS via higher speed SSD drives. Because EBS volumes are exposed over network channels so that they remain attachable to various VMs, the dramatic speed improvements normally associated with SSDs are lost unless a very high speed network carries the drive signaling and data. Whether one uses ATA over Ethernet (AoE), iSCSI, NFS, or other models to project storage across the VM fabric, even standard SATA II drives under load could saturate a one-gigabit Ethernet segment. However, by exposing EBS volumes on their own 10GbE network segments, EBS traffic stands a much better chance of not overloading the network. For instance, at MSI we create a second tier of EBS service by mounting SSDs on the mount points under which volumes will exist - e.g., /var/lib/eucalyptus/volumes, by default, on a Eucalyptus storage controller. Doing so gives users of EBS volumes the option of paying more for 'faster drives.'
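The back-of-the-envelope math behind the saturation claim can be sketched as follows. The throughput figures are rough, commonly cited rules of thumb for the era, not measurements:

```python
# Rough bandwidth comparison; all figures are approximate rules of thumb.
GBE_1 = 1_000 / 8      # 1 GbE payload ceiling, ~125 MB/s
GBE_10 = 10_000 / 8    # 10 GbE, ~1250 MB/s
SATA2_SUSTAINED = 120  # a typical 7200rpm SATA II disk, sustained MB/s
SSD_SUSTAINED = 250    # a circa-2011 SATA SSD, sustained MB/s

# One busy spinning disk nearly fills a 1 GbE segment on its own...
assert SATA2_SUSTAINED > 0.9 * GBE_1
# ...so an SSD behind 1 GbE is wasted; behind 10 GbE it is not.
assert SSD_SUSTAINED > GBE_1 and SSD_SUSTAINED < GBE_10
```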

While EBS gives users of cloud storage a higher tier of user storage, the cloud's own operations also represent a point of optimization, and thus a tiered service. The goal is to optimize the creation of images, and to spin them up faster. Two particular operations generate significant disk activity in a cloud implementation. The first is caching VM images on hypervisor mount points. Consider Eucalyptus, which stores copies of kernels, ramdisks (initrd), and Eucalyptus Machine Image ("EMI") files on a (usually) local drive at the Node Controllers ("NC"). One could also store EMIs on iSCSI, AoE or NFS, but the same discussion as that regarding EBS applies (pair fast networking with fast drives). The second is copying from that cache: for each running instance of an EMI (i.e., a VM), the NC creates a copy of the cached EMI, and uses that copy for spinning up the VM. The key to the EMI cache is therefore not so much fast writes as rapid reads, paired with very fast writes to the running EMI store. Clearly that does not happen if the same drive spindle and head carry both operations.

In our labs, we use two drives to support the higher speed cloud tier operations: one for the cache and one for the running VM store. However, to get a Eucalyptus NC, for instance, to use those drives in the most optimal fashion, we must direct the reads and writes to different disks - one drive (disk1) dedicated to cache, and one drive (disk2) dedicated to writing/running VM images. Continuing with Eucalyptus as the example setup (though other cloud controllers show similar traits), the NC will, by default, store the EMI cache and VM images on the same drive -- precisely what we don't want for higher tiers of service.

By default, Eucalyptus NCs store running VMs on the mount point /usr/local/eucalyptus/???, where ??? represents a cloud user name. The NC also stores cached EMI files on /usr/local/eucalyptus/eucalyptus/cache -- clearly within the same directory tree. Therefore, unless one mounts another drive (partition, AoE or iSCSI drive, etc.) on /usr/local/eucalyptus/eucalyptus/cache, the NC will create all running images by copying from the EMI cache to the run-space area (/usr/local/eucalyptus/???) on the same drive. That causes significant delays in creating and spinning up VMs. The simple solution: mount one SSD drive on /usr/local/eucalyptus, and then mount a second SSD drive on /usr/local/eucalyptus/eucalyptus/cache. A cluster of Eucalyptus NCs could share the entire SSD 'cache' drive by exposing it as an NFS mount that all NCs mount at /usr/local/eucalyptus/eucalyptus/cache. Consider that the cloud may write an EMI to the cache, due to a request to start a new VM on one node controller, yet another NC might attempt to read that EMI before the cached write completes, due to a second request to spin up that EMI (not an uncommon scenario). There exist a number of ways to solve that problem.
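A quick way to verify a layout like this is to compare the device IDs of the two mount points. A hedged Python sketch - the Eucalyptus paths are the defaults cited above, and the check is meant to be run on the NC itself:

```python
# Sketch: check whether the EMI cache and the run-space live on different
# underlying devices, per the layout described above.
import os

def on_same_device(path_a, path_b):
    """True if both paths resolve to the same underlying block device."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev

# On a correctly tiered NC this should report False:
# print(on_same_device("/usr/local/eucalyptus",
#                      "/usr/local/eucalyptus/eucalyptus/cache"))

# Sanity check on any machine: a path shares a device with itself.
assert on_same_device("/", "/")
```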

The gist here: by placing SSD drives at strategic points in a cloud, we can create two forms of higher tiered storage services: 1) higher speed EBS volumes; and 2) faster spin-up time. Both create valid billing points, and both can exist together, or separately in different hypervisor clusters. This capability is now available via our Eucalyptus Consulting Services and will soon be available for vCloud Director.

Next up – VLAN, L2, and others for tiered network services.