Tuesday, March 13, 2012
The Azure Outage: Time Is a SPOF, Leap Day Doubly So
The recent Windows Azure outage offers some interesting lessons in running a reliable service. I've posted a discussion at the Scalyr blog:
http://blog.scalyr.com/2012/03/13/the-azure-outage-time-is-a-spof-leap-day-doubly-so/
Wednesday, January 11, 2012
Introducing Scalyr
I'm very excited to announce that I've recently launched a new venture: Scalyr. At Scalyr, I'll be working full time to attack the problems of reliability, scale, and complexity I've been discussing in this blog. I'll still post here from time to time, but most of my blogging -- which will be much more frequent -- will now appear on the Scalyr blog at blog.scalyr.com. I encourage you to subscribe to the Scalyr blog; the content should be of interest to readers here.
Rather than say more about Scalyr, I'll just reproduce the initial post from the Scalyr blog.
Welcome to the Scalyr blog. Today we’re announcing our first service, Knobs.
What’s a Knob, you may ask? Or perhaps, what’s a Scalyr?
First, a little background. I’ve spent a good chunk of my career developing “in the cloud”. (Building Writely, for instance — aka Google Docs.) It can be an amazing experience. With the variety and sophistication of services available today, I sometimes feel like I’m programming with seven-league boots. One day, you wake up to find that thousands or millions of people are using your work.
However, building on cloud services can also be frustrating. Performance can be unpredictable, error messages unhelpful, protocols confusing. Sometimes they go down. As you scramble to cope, you can’t help but picture those thousands of people glaring at an error page and silently cursing. Cursing you, probably, even if they don’t know who you are. Sometimes you can work around the problem; sometimes all you can do is glare at the error page and add your own curse to the silent chorus.
At Scalyr, we’re building a new breed of cloud services. Services architected for reliability, so you can depend on them. For transparency, so you know what kind of performance and behavior to expect. For simplicity and practicality, so you can integrate quickly and get on with your work. You’ll hear more about all of these themes in future posts.
On to Knobs. For almost as long as there has been code, there have been knobs to tweak. These take many forms — configuration files, command-line parameters, constants, “magic cookies”. If you’ve written server code, you’ve wrestled with this. You need to specify a threadpool size, or a server address, or some other little constant. You know it might need tweaking, so you put it in a configuration file. And write code to parse the file. And a little script to copy the file to the server. And another script to restart all your servers so they can pick up the change. Oops — let’s tweak that script to only restart one server at a time! OK, problem solved… until all those copied files inevitably get out of sync, or you get tired of waiting for a rolling server restart every time you tweak a parameter.
Knobs is a simple service to address this problem. We store configuration files for you; you edit them in a web page, or via our API. We give you a library that lets you read values with a single call. We take care of the rest, managing files and instantly copying updates to all of your servers.
For reliability, we run servers in multiple facilities (of course). Furthermore, the Knobs library maintains a persistent cache of your configuration files on each server. So even if we were to have an outage, you won’t: the library will use the local cache.
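To make the cache-fallback idea concrete, here is a minimal sketch in Python. It is not the Knobs library or its API: the class name `CachedConfig`, the injected `fetch` callable, and the JSON cache file are all assumptions made for illustration. It shows only the general pattern described above: read values with a single call, keep a persistent local copy, and use that copy when the remote source is unavailable.

```python
import json
import os
import tempfile
from typing import Any, Callable, Dict


class CachedConfig:
    """Hypothetical config reader with a persistent local cache.

    Illustration only; this is not the Knobs library API.
    """

    def __init__(self, fetch: Callable[[], Dict[str, Any]], cache_path: str):
        self._fetch = fetch            # assumed callable that returns the latest config
        self._cache_path = cache_path  # where the persistent copy lives on disk
        self._values: Dict[str, Any] = {}

    def refresh(self) -> bool:
        """Fetch fresh values; if that fails, fall back to what we already have.

        Returns True when fresh values were fetched, False when the fallback was used.
        """
        try:
            values = self._fetch()
        except Exception:
            # Remote source is down: keep in-memory values if present,
            # otherwise load the copy persisted by an earlier run.
            if not self._values:
                self._values = self._load_cache()
            return False
        self._values = values
        self._save_cache(values)
        return True

    def get(self, key: str, default: Any = None) -> Any:
        """Read one value with a single call."""
        return self._values.get(key, default)

    def _load_cache(self) -> Dict[str, Any]:
        try:
            with open(self._cache_path, "r", encoding="utf-8") as f:
                return json.load(f)
        except (OSError, ValueError):
            return {}  # no usable cache yet

    def _save_cache(self, values: Dict[str, Any]) -> None:
        # Write to a temporary file, then rename it into place, so a crash
        # mid-write does not leave a half-written cache file behind.
        directory = os.path.dirname(self._cache_path) or "."
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(values, f)
        os.replace(tmp_path, self._cache_path)
```

A caller would construct one `CachedConfig`, call `refresh()` periodically, and read settings with something like `config.get("threadpool_size", 8)`. A real service would also need to push updates to servers, which this sketch leaves out.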
If this sounds interesting, learn more or just dive in. And if you love the idea of building services that people can really depend on, drop us a line – we’re hiring!
Monday, December 5, 2011
Survey on Cloud Services (Results)
As readers of this blog will know, I’ve had a long-time interest in cloud services, particularly backend services used as building blocks for other systems. Back in October (my, how time flies), I posted a brief survey on people’s use of such services. This is a quick post to present the results.
Thank you to the 18 people who responded. (Per the promise on the survey page, I’ve made an extra $2.50 x 18 = $45 donation to the EFF.) 13 of the responders agreed to make their responses public; their raw responses appear at the bottom of this post.
While this sample is not large enough for serious quantitative analysis, a few themes emerge:
Price matters. 16 of the 18 responses listed price as a consideration. Quotes: “Almost all the offerings are SAN based storage, which is insanely expensive”, “We are using the cheapest options currently”, “Obligations still too quantized”, “It's mostly around the price”.
Ease of use matters. This tied with price at 16 responses... though the essay responses didn’t mention it much, so it may be less deeply felt.
Trust is a huge issue. “Stability” and “Track record” were the next most common considerations, with 13 and 12 responses (respectively). More telling, trust issues came up over and over in the essay responses. “Can I trust my data with so many different businesses?”, “When the majority of pieces require other pieces to work, it becomes a liability when any one part comes down”, “Unquantifiable fear of data loss”, “Shared hosting is sometimes down”, “Security and integrity concerns”, “We were affected by the [AWS] outage - it really caught us off-guard. Four days of downtime”, “Requirement of reliable bandwidth between data center and users”, “The general complain is ‘what's going on?’ and ‘did they break?’. We've had to build a bit of infrastructure to verify that services are up”.
Performance was less of an issue. 10 responses listed performance as a consideration. Though it was a major issue for a few people: “SimpleDB is abysmal when it comes to performance”, “poor performance on Google”.
(The other considerations offered in the survey were “Recommendation” and “Other”, with six and two responses, respectively.)
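For readers who prefer percentages, the tallies above can be restated as shares of the 18 responses. The snippet below is only arithmetic over the counts reported in this post; the names `counts` and `share` are mine, and with a sample this small the percentages should be read as rough indications.

```python
TOTAL_RESPONSES = 18

# Counts as reported in this post.
counts = {
    "Price": 16,
    "Ease of use": 16,
    "Stability": 13,
    "Track record": 12,
    "Performance": 10,
    "Recommendation": 6,
    "Other": 2,
}


def share(count: int, total: int = TOTAL_RESPONSES) -> float:
    """Return the percentage of responses that listed a consideration."""
    return round(100 * count / total, 1)


# Print considerations from most to least frequently chosen.
for name, count in sorted(counts.items(), key=lambda item: -item[1]):
    print(f"{name:<15}{count:>3}  {share(count):5.1f}%")
```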
I’ll have more to say on this topic in future posts. In the meantime, here are the 13 public responses.
What backend services do you currently use? | None. I work with Salesforce as an analyst for my day job, but currently no foundation-type cloud thingies. |
What criteria do you use to choose services? |
|
What backend services do you currently use? | Amazon AWS, Heroku, SendGrid |
Do you have any problems or complaints with these services? | Decent for what's necessary. |
What criteria do you use to choose services? |
|
What are you coding / running yourself that you'd rather buy as a service? | I'm in the process of building it =] |
Other comments | Decentralization is good to an extent, however, when the majority of pieces require other pieces to work, it becomes a liability when any one part comes down. In some cases, one piece going down means everything falls apart. Also, security. Can I trust my data with so many different businesses? |
What backend services do you currently use? | rackspace.com, icontact.com, authorize.net, cdgcommerce.com, paypal.com, postmarkapp.com, linktrack.info |
Do you have any problems or complaints with these services? | Unquantifiable fear of data loss. |
What criteria do you use to choose services? |
|
What criteria do you use to choose services? [Other] | Attractiveness of web site |
What are you coding / running yourself that you'd rather buy as a service? | Anything I can think of I'll code myself. You'll have to come up with your own ideas ;) |
Other comments | I like roll my own services when possible, so I'm interested in reducing reliance on backend services. |
What backend services do you currently use? | Azure Google APIs |
Do you have any problems or complaints with these services? | Azure has limits on database size |
What criteria do you use to choose services? |
|
What are you coding / running yourself that you'd rather buy as a service? | web scraper NLP |
What backend services do you currently use? | Heroku. AppHarbor. Shared hosting for wordpress blogs. |
Do you have any problems or complaints with these services? | Shared hosting is sometimes down. |
What criteria do you use to choose services? |
|
What are you coding / running yourself that you'd rather buy as a service? | none |
What backend services do you currently use? | Ec2, s3, and ebs. |
Do you have any problems or complaints with these services? | Inaccurate Amazon health dashboard. Usually outages are not posted on it until much later. |
What criteria do you use to choose services? |
|
What are you coding / running yourself that you'd rather buy as a service? | The ability to roll your own Linux ami through a browser. |
Other comments | Security and integrity concerns of hosting services on servers physically located outside of my organization's control |
What backend services do you currently use? | Amazon AWS S3 and EC2, and Google App Engine |
Do you have any problems or complaints with these services? | poor performance on Google, particularly in moving data and security |
What criteria do you use to choose services? |
|
What are you coding / running yourself that you'd rather buy as a service? | web security tools and APIs |
Other comments | moving data to the cloud for processing - as in very large datasets - e.g. geospatial imagery |
What backend services do you currently use? | Amazon AWS |
Do you have any problems or complaints with these services? | We were affected by the outage - it really caught us off-guard. Four days of downtime. |
What criteria do you use to choose services? |
|
What are you coding / running yourself that you'd rather buy as a service? | It would be nice if something like SQL Azure existed on AWS. The ability to have a custom Windows machine and a turn-key SQL Server-like storage repository would be huge. It just doesn't exist today. We've built a backup and recovery service for internal use. No one else seems to have rolled one out, so we're going to package it as a service for others. |
Other comments | SimpleDB is abysmal when it comes to performance. The need to create an entire new account with its own billing every time you need a new "database" is ridiculous as well. |
What backend services do you currently use? | EC2, Azure, Joyent |
Do you have any problems or complaints with these services? | Oh boy howdy! The biggest one is easily the lack of transparency. I'm fine with the fact that I might not have access to the all the resources on a system and the load on it is variable. I'd just like to know when that load is impacting my system's performance so I can avoid worrying about problems with my code/configuration and focus on compensating for it. The second biggest problem is nobody is going with the kind of configuration that Map/Reduce made so powerful: lots of big, slow drives hooked up directly to the processor. Almost all the offerings are SAN based storage, which is insanely expensive. |
What criteria do you use to choose services? |
|
What criteria do you use to choose services? [Other] | Tranparency |
What are you coding / running yourself that you'd rather buy as a service? | Hadoop Hadoop + HBase Cassandra |
Other comments | Aside from amazon, nobody else has a spot market, which I think is one of their better innovations. Also, it tends to be a pain to combine Akamai's load balancing with these highly variable services. |
What backend services do you currently use? | Amazon EC2, S3 |
Do you have any problems or complaints with these services? | We are using the cheapest options currently (don't all startups love that!) but even then, we feel it's a difficult toss up between the value and transparency. We may soon be moving out to Rackspace to address the transparency/control part. Something that you didn't ask but is worth understanding is the "what works" aspect. Amazon is working pretty hard at making it easier to move various pieces of different solutions work well together. For us we made an early decision to trust Amazon only with the EC2 and S3 services. However now this may be/should be reassessed since their offering for a number of other services (big data, monitoring, etc.) is growing and well integrated. |
What criteria do you use to choose services? |
|
What are you coding / running yourself that you'd rather buy as a service? | At this stage pretty much everything we do is a service available in various forms. The ability to integrate from one service to another is a big design-level decision that's often easily traded for doing stuff yourself. So while we do have a combination of such services and code that we build/run ourselves- we'd eventually like to move all of this to the cloud. As a startup- the decision to trust the cloud completely and invest more time/effort in integrating these various solutions is a critical one. |
Other comments | It's mostly around the price. A close second is finding the right expertise that makes using the backends feasible. As an example, between learning to deal with EC2 images and building something ourselves, the (exGoog!) engineering mindset tends to often favor the latter rather than investing the resources in learning a system you don't have control over. |
What backend services do you currently use? | AWS, Sendgrid, GAE, Beanstalk, a host of geo-api and data providers |
Do you have any problems or complaints with these services? | - Obligations still too quantized, should be a much smoother function of your usage of services. |
What criteria do you use to choose services? |
|
What criteria do you use to choose services? [Other] | Observable ethics of provider |
What are you coding / running yourself that you'd rather buy as a service? | - One that neatly managed all the cloud services we use. Even if only the payment/accounting part. Each item is small, but there's a lot of leakage of underutilize/undermanaged subscriptions. |
Other comments | - Cost - Management complexity growth - Inter-service bandwidth/performance |
What backend services do you currently use? | Linode S3 Sendgrid Mailchimp Getclicky Google Analytics |
Do you have any problems or complaints with these services? | Linode: - Poor reliability, 3 major outages in the last few months - Lacking some features like HTTPS termination in load balancer Sendgrid: - Very expensive, but don't have time to use something else right now - The interface is a bit whacky Mailchimp: - Piece of sh*t interface. Can't figure out anything on it. Getclicky: - Doesn't do cohorts and such retention analysis Google Analytics: - Can't identify individual users |
What criteria do you use to choose services? |
|
What are you coding / running yourself that you'd rather buy as a service? | Internal Analytics - something that combines web analytics + database metrics MySQL inside of Linode A good python hosting provider (similar to Heroku) |
Other comments | I try VERY hard to not use anything proprietary from the platforms. I always want the ability to switch the providers without too much extra effort. |
What backend services do you currently use? | AppEngine, aws |
Do you have any problems or complaints with these services? | Love AppEngine! It's a lot of work to learn it though; we're not in Kansas any more, Toto! I've only touched on AWS lightly, but ridiculously easy for what it does. |
What criteria do you use to choose services? |
|