Wednesday, March 24, 2010

Benchmarking the Cloud

Is this thing on?  Yes?  OK, I'll get started then.

I should begin by explaining the blog name.  You've probably heard this saying:

That which does not kill us makes us stronger.
  -- Friedrich Nietzsche

I'd always thought it was "that which does not kill me makes me stronger", and I didn't know it was from Nietzsche.  But I just Googled it, and a site with a name like "brainyquote.com" must be accurate -- look, they use typography and everything! -- so there you are.  In any case, I interpret this as another way of saying "we learn from our mistakes", and over the years I've had time and opportunity to make some dillies.  Maybe I've learned something.

I've never blogged before, so I'm not sure what direction this will go.  My notion is to post occasionally, on technical topics relating to whatever I happen to be working on that week.  I recently went on indefinite leave (from my day job at Google), so "whatever I happen to be working on" may be fairly haphazard, but hey -- that's the fun, right?

I've been reading up on computer vision, a topic I know little about but have always been interested in.  So far I don't have much to say about it other than "I'd always assumed this would be hard, and gee, it is hard".  So for today I'll be talking about something else.  :)

Another subject dear to my heart is the design of fast, reliable server systems.  One thing my past mistakes have taught me, loud and clear, is that anything you build is not likely to be fast or reliable unless you have a deep understanding of your building blocks.  This post will be the first in a series on the topic of "understand your building blocks".

What do I mean by "fast and reliable"?  It seems straightforward enough.  People typically measure speed in terms of mean latency, and reliability in terms of "availability" -- the fraction of the time that a service responds to requests.  However, speed and reliability can't really be decoupled.  A service is not fast and reliable unless it's reliably fast -- i.e. unless it responds to requests, correctly and quickly, close to 100% of the time.

Amazon is apparently known for focusing on 99.9th percentile latencies instead of means (e.g. search for "99.9" in http://the-paper-trail.org/blog/?p=51), and this does a better job of capturing "reliably fast".  But to fully understand a system, you need to measure it over a long period of time.  This is especially true for multi-tenant services, which are becoming more prevalent -- Google App Engine and Amazon Web Services, for instance.  How does the service behave when it experiences an internal hardware failure?  When another tenant suddenly quintuples the load it's sending?  During a software upgrade?  Operator error?  These events occur only occasionally, so you're not likely to see them in a short benchmarking run, but if you really care about reliability then you can't ignore them.
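
To make the percentile point concrete, here's a toy sketch in Python (the numbers are invented for illustration): two hypothetical services with essentially the same mean latency, except that one of them stalls badly on one request in 500.  The mean can't tell them apart; the 99.9th percentile can.

    import math
    import random

    def percentile(samples, pct):
        """Nearest-rank percentile: the smallest sample such that
        pct% of all samples are <= it."""
        ordered = sorted(samples)
        k = int(math.ceil(pct / 100.0 * len(ordered))) - 1
        return ordered[max(0, min(k, len(ordered) - 1))]

    random.seed(1)
    N = 100000

    # "steady" always answers in ~50ms; "spiky" usually answers in ~45ms,
    # but one request in 500 stalls for 2.5 seconds (a GC pause, a failover,
    # a noisy neighbor...).  Their means come out nearly identical.
    steady = [random.gauss(50, 5) for _ in range(N)]
    spiky = [2500.0 if random.random() < 0.002 else random.gauss(45, 5)
             for _ in range(N)]

    for name, samples in (("steady", steady), ("spiky", spiky)):
        mean = sum(samples) / len(samples)
        print("%-6s  mean = %5.1f ms   p99.9 = %6.1f ms"
              % (name, mean, percentile(samples, 99.9)))

Note, though, that a tail event rarer than one in a thousand would hide even from the 99.9th percentile in a sample like this -- which is exactly why you have to measure for a long time.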

To that end, as a hobby, I'm going to set up continuous monitoring for a number of widely known services, probably focusing on the two mentioned above.  This has been done before, but (to my knowledge) not very deeply.  For instance, api-status.com currently lists 31 services, but provides only minimal information about each.  My plan is to dive much more deeply into a small number of services.  I suspect that, over time, this will yield material for quite a few posts.  Among other things, I'm hoping to get some insight into how well these shiny new scalable-hosting services isolate their clients from one another and from vagaries in the underlying hardware.  Do all EC2 "Small Instances" behave the same?  How consistent is Google App Engine performance?

To start with, I'll choose whichever of GAE or AWS appears more expedient, set up a process to continually invoke a few of its services, and dump the results into a database.  I plan to make all of the data available live.  My next post will likely be a progress report.
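
For the curious, here's roughly the shape of the thing -- a minimal Python sketch with a placeholder URL and schema (assume the real version will grow multiple probe types, retries, and a way to serve the data live):

    import sqlite3
    import time
    import urllib.request

    PROBE_URL = "http://example.com/ping"  # placeholder; real probes will hit GAE/AWS
    INTERVAL_SECS = 60                     # one probe per minute

    db = sqlite3.connect("probes.db")
    db.execute("""CREATE TABLE IF NOT EXISTS probes (
                    started REAL,      -- time.time() when the probe began
                    latency_ms REAL,   -- round-trip time; NULL if the probe failed
                    error TEXT         -- failure description, if any
                  )""")

    while True:
        started = time.time()
        latency_ms, error = None, None
        try:
            urllib.request.urlopen(PROBE_URL, timeout=30).read()
            latency_ms = (time.time() - started) * 1000.0
        except Exception as e:
            error = str(e)  # record the failure instead of crashing the probe
        db.execute("INSERT INTO probes VALUES (?, ?, ?)",
                   (started, latency_ms, error))
        db.commit()
        time.sleep(INTERVAL_SECS)

SQLite is just the path of least resistance here; anything that can soak up a few rows a minute for months will do.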