Wednesday, April 28, 2010

Three Latency Anomalies

Last week, I discussed various choices for cloud data storage.  Subsequently, I've been digging into latency anomalies in some of those systems.  In this post, I'll talk about what I've found.

App Engine Memcache
I'm running a number of benchmarks for App Engine's Memcache service, but they all give similar results, so I'll focus on the simplest: reading a single value, with a random key.  Here's a graph:

Aside from the jump on April 23rd, these latencies are pretty stable, probably the most stable of any network-based service I'm measuring.  There's not much variation over time, and the gap between the 10th and 99th percentile is fairly small.  In absolute terms, it's the fastest of the storage services I discussed last week, and much faster than the other network-based services.

So what's the anomaly?  Well, this service is fast, but not as fast as it should be.  Remember, the "mem" in Memcache refers to RAM.  A Memcache read amounts to a hashtable lookup.  Server-side execution time should be trivial, and client-perceived latency (as I'm measuring) should be dominated by network latency.  13 ms median is a long time for an intra-cluster round trip.  Equally suspicious, across 125,000 samples to date, the minimum time recorded was 5 ms.  It's hard to explain that with network latency unless the App Engine servers are hundreds of miles away from the memcache servers -- which makes no sense.  (By way of comparison, I've been measuring ping times within and between Amazon EC2 availability zones, and the median latencies are 450 microseconds and 2 ms, respectively.  I don't have a way of directly measuring App Engine's internal network, but there's no reason to believe it would be vastly slower.)

In the end, someone at Google was kind enough to explain this for me.  It turns out to be a known issue with the RPC implementation in the Java version of App Engine, with the effect of imposing a floor on RPC latency.  Operations which take longer than this floor -- say, Datastore access -- are not affected, but fast operations like Memcache are.  It will be fixed in a future App Engine update.

As a cross-check, it was suggested that I re-implement the benchmark in Python.  The results are indeed much faster, with median latency of 4.15 ms (compared to 22 ms from Java, after last week's jump).

One lesson here is that it's important to have an intuition of what's reasonable.  Memcache was one of the fastest services I measured, but it was still slower than it should have been.  If you don't have a sense of how your system ought to behave, you'll miss problems.

Another lesson is that no story ever ends; there are always loose ends you don't have time to tie up.  In this case, there's that mysterious jump on April 23rd.  There's also a fairly long tail even in the Python latencies, which could have any number of causes, from network congestion to garbage collection to CPU contention.  I'm not going to bother looking into any of that, but if I were building a system that depended on ultrafast memcache latencies, I'd probably have to.

EC2 networking

On April 24th, I added a handful of network benchmarks.  Each benchmark measures ping times, from the data gathering server (an EC2 "small" instance in availability zone us-east-1b) to a particular destination.
I'm only measuring one instance of each category (e.g. one pair of servers in us-east-1b), so take the results with a grain of salt.  But, hopefully, sampling over a long period of time compensates somewhat.  Eventually I may get fancier, launching a collection of servers for an hour now and then and measuring ping times across many server pairs.

Network latencies are important.  Any complex system depends on communication among its components.  If you want to run a reliably fast service, it helps a lot to have reliably fast networking.  At a minimum, you need a deep understanding of your network latency; otherwise, you can't hope to understand how your service will behave.

After running the benchmark for a few hours, the latency histogram for intra-zone pings looked like this:

I was dismayed to see that 99th percentile latency was 48 ms.  That would make it very difficult to meet the 50 ms goal I set for storage read latencies in my previous post -- just getting to the storage server would use up the entire latency budget.  As with memcache, this result struck me as fishy, so I decided to dig further.  I suspected that Amazon might use a short DNS TTL for the internal server names in their data centers, and that DNS latency might be part of the problem.  So I added two variants of the intra-AWS ping benchmarks.  One variant uses a numeric IP address instead of a DNS name, and the other variant issues two pings in a row and measures latency of the second.  (All of this is done using the "ping" command-line tool.  I'm exec'ing it from my Java process, and parsing the ping time from the command-line output.)
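
For reference, here's roughly what that exec-and-parse step looks like.  This is a Python sketch of the Java original, and the regex assumes Linux iputils-style ping output:

```python
import re
import subprocess

# Lines in ping output look like:
# "64 bytes from 10.0.0.1: icmp_seq=1 ttl=64 time=0.45 ms"
TIME_RE = re.compile(r"time=([\d.]+) ms")

def parse_ping_times(output: str):
    """Extract per-packet round-trip times (in ms) from ping's output."""
    return [float(m) for m in TIME_RE.findall(output)]

def ping(host: str, count: int = 1):
    """Run the system ping tool and return per-packet times in ms.

    For the ping-twice-and-measure-the-latter variant, pass count=2
    and keep only the second element of the result.
    """
    out = subprocess.run(["ping", "-c", str(count), host],
                         capture_output=True, text=True, timeout=30).stdout
    return parse_ping_times(out)
```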

Both variants were, indeed, much faster.  (More precisely, they had shorter tails.  All variants are fast at median.)  But the really interesting thing is that, when I launched the new variants, the original benchmark sped up as well!  Here's a graph over time of the original intra-zone ping benchmark:

What happened?  I don't know, but I have a strong hypothesis.  The benchmark server wakes up once every 10 seconds, and runs all of its ever-growing collection of benchmarks in a burst.  The order in which the benchmarks execute is randomized every time.  So, half of the time, when the original intra-zone single-ping benchmark executes, it's happening shortly after the intra-zone ping-twice-and-measure-the-latter benchmark.  Thus, the DNS cache is more likely to be warm.  ARP lookup, or other arcana above my pay grade, might be involved as well.
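
The scheduling scheme just described amounts to the following (a hypothetical simplification of my Java harness; the caller sleeps 10 seconds between cycles):

```python
import random
import time

def run_cycle(benchmarks, results):
    """One burst: shuffle the benchmark list, run each once, and record
    its latency in ms.

    Because the order is shuffled each cycle, any given benchmark has
    roughly a 50% chance of running after a sibling that warmed shared
    state (DNS cache, ARP cache, connections) -- which can skew its
    measured latency, as described above.
    """
    order = list(benchmarks.items())
    random.shuffle(order)
    for name, fn in order:
        start = time.perf_counter()
        fn()
        results.setdefault(name, []).append((time.perf_counter() - start) * 1e3)
```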

Considering only the time period in which all three benchmarks were running, they still have distinct latency histograms:

Original benchmark (single ping, using DNS name):

Same benchmark, but using a numeric IP address instead of dns name:

Back to using a DNS name, but now telling the ping command to issue two pings, and recording the latency of the second one:

Median latency for each variant is around 0.45 milliseconds.  The means are 3.0, 2.45, and 2.13 ms, respectively.  Eyeballing the histograms, you can see that the pronounced spikes at 10 and 20 ms are smaller in each subsequent benchmark.  The large peak around 0.45 milliseconds seems to be shifted somewhat to the left in the third benchmark; I wonder if that is sampling bias: during momentary network congestion, the first ping might not complete until the congestion clears, giving the second ping a better chance at a smooth run.

These figures also gave me a nice example of the danger of reading too much into percentiles.  90th percentile latency for these three benchmarks is 12 ms, 9 ms, and 0.77 ms respectively.  Huh?  Obviously the third benchmark is not ten times faster than the others.  This is an artifact of the trimodal distribution.  In the first two benchmarks, a bit more than 10% of requests fall in the 10ms and 20ms spikes.  In the third benchmark, a bit less than 10% of those requests fall in the spikes, so the 90th percentile latency jumps down into the primary peak.
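
To make the artifact concrete, here's a small illustration with synthetic, made-up numbers, using a simple nearest-rank percentile:

```python
def percentile(samples, p):
    """Nearest-rank percentile: the value at the p% position of the
    sorted samples."""
    ordered = sorted(samples)
    return ordered[min(len(ordered) - 1, int(p / 100.0 * len(ordered)))]

# Two synthetic trimodal latency distributions (values in ms, invented
# for illustration): in the first, 12% of samples sit in the 10/20 ms
# spikes; in the second, only 9%.
spiky    = [0.45] * 88 + [12.0] * 6 + [20.0] * 6
smoother = [0.45] * 91 + [12.0] * 5 + [20.0] * 4

# The medians are identical...
assert percentile(spiky, 50) == percentile(smoother, 50) == 0.45
# ...but the 90th percentiles differ by more than an order of magnitude,
# because they fall on opposite sides of the gap between the modes.
assert percentile(spiky, 90) == 12.0
assert percentile(smoother, 90) == 0.45
```

A small shift in the weight of the spikes moves the percentile across the gap, even though the distributions are nearly the same.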

I'll explore network latency further in subsequent posts, where I'll try to identify (or, if necessary, invent) a reliably-fast Java RPC mechanism.  For now, a few things are worth noting:
  • Median intra-cluster network latency is very fast, relative to the systems I've been benchmarking.
  • Network latency histograms are complicated.  (This should not surprise anyone.)  You can't characterize network performance with a single number like "mean ping time" or "rate of packet loss".
  • If you care about the latency tail, you have to worry about things like DNS lookup, even within your data center.
  • (Hypothesis) It helps to keep your connections hot, i.e. to exchange messages frequently.  "Warm" -- say, communicating every 10 seconds -- isn't good enough.  This has implications for large clusters: if a server communicates with many other servers, it's hard to keep all of those pairs hot.  (Unclear whether this is an issue if you avoid DNS and use numeric IPs.)
While I'm on the topic, I'll provide histograms for cross-zone latency, and latency from EC2 to Google.  First, AWS us-east-1b to us-east-1a (numeric IP):

Second, us-east-1b to Google (DNS):

Median latencies are 2.05 and 4.56 ms, respectively.  Means are 5.62 and 11 ms; 99th percentiles are 30.9 and 51.6 ms.

I'm not sure what I should have expected, but these seem nicely fast to me.  Google must have presence quite close to Amazon's us-east-1b facility.  (The Yahoo ping times were much slower, but given that I'm only measuring from a single location, it would be a mistake to read much into that.)


Amazon RDS

RDS, recall, is Amazon's hosted MySQL service.  The benchmark I'll discuss here is the read benchmark I used in the last post, i.e. fetching one small record chosen randomly from a table of one million records.  Full data for this benchmark is on the dashboard.

When I first launched this benchmark, the latency histogram looked great:

When I checked later, median latency had stayed put, but the tail had stretched way out.  Here's the cumulative histogram over the entire week-long history of this benchmark:

You can see that the two histograms are very different.  Both medians are around 1 ms, but the mean jumps from 1.74 ms to 16.3 ms!  It turns out this operation exhibits periodic fluctuations in latency:

Remember that my load is absolutely uniform, a handful of simple operations repeated every 10 seconds.  What is up with that oscillation?

The short answer is, I do not yet know what is up with that oscillation.  I've been digging, started a thread on the AWS forum, and opened a case with AWS support, but there's no resolution yet.  Some things I've tentatively ruled out:
  • Client issues (e.g. CPU contention, Java GC delays, or other problems in the client).  The same client is running a wide variety of other benchmarks, and none of them show anything similar.
  • Network issues.  A fair number of those other benchmarks depend on the same us-east-1b network.  (It is conceivable that it is an issue specific to the network path between my client and the RDS instance.)
  • Disk latency.  I've implemented a few variants on the benchmark, one of which reads the same record over and over again; that record ought to get pinned in MySQL's cache.  The single-record benchmark exhibits the same periodic latency tail.
My best guess is some sort of CPU contention problem on the RDS server.  (People who work with me know I blame everything on CPU contention.)  However, I don't have much evidence.  Amazon's CloudWatch service shows that CPU utilization on the server is steady at about 12%, but it has one-minute resolution, so could be hiding some sort of bursty usage.  I'll keep digging.

A fourth anomaly: EBS

While pulling together data for this post, I noticed that EBS read latencies (remember, that's Amazon's network block device) jumped way up about a week ago:

As often seems to be the case, the median remained steady (around 13 ms), but mean, 90th percentile, and higher latencies all jumped by 1.5x to 2x, and have been elevated ever since.  There's also a nice diurnal cycle in the latency, which hadn't been visible before.  (In general, I haven't seen a lot of diurnal variation in these benchmarks; I'd expected more.)  The usual caveat: my benchmark is limited to a single EBS volume, so don't extrapolate to EBS as a whole.

I'm not going to dig into this one right now.  It is a bit of a cautionary tale, however.  Suppose you'd built a latency-sensitive system on EBS.  You even did your homework and spent a week benchmarking it.  If that week began on March 12th, you'd have determined that 99th percentile read latency was around 125 ms.  You might be pretty unhappy when it jumped to the current 300 ms.  Perhaps you could complain to Amazon, but as far as I can tell they provide no latency SLA of any sort; they don't even say anything about what sort of latency you should expect.  (I don't mean to single out EBS or Amazon.  I haven't encountered much latency guidance, let alone an SLA, for any of the storage services I've been benchmarking.)

Thursday, April 22, 2010

Cloud data storage: many choices, all bad

[Update: I apologize for the atrocious formatting in this post.  Some combination of Blogger's editor, copy/paste from Google Docs, and the contentEditable implementation in Chrome on Mac OS X seems to have rendered the formatting unsalvageable.  It looks much better in the Compose window...]

What I want

I recently set up a MySQL instance using Amazon's Relational Database Service (RDS), and added a couple of SQL operations to my microbenchmark suite.  That completes a tour of the data storage services available on Google App Engine and Amazon AWS.  (Aside from bulk storage services -- S3 and Blobstore.  I'll get to those eventually, but they serve a different purpose.)

Imagine that you're building a web service, and you need a simple object store -- a place to write small objects and retrieve them later.  These might be user records, forum posts, whatever.  Let's ignore all questions of indexing, queries, etc. and just consider simple update and fetch operations -- write(objectId, data) and read(objectId).  How well do the various storage services fill this need?  Well, here are some attributes we might want from an object store:

  • 50ms reads.  Suppose you want to limit your server time to 200ms per request.  It might be hard to avoid doing at least two rounds of object fetches -- i.e. you fetch one or more objects in parallel, and from those objects you get IDs that lead you to another round of fetches.  If you don't want to spend more than half your latency budget here, then each fetch needs to complete in 50ms.
  • 500ms writes.  Writes are less common than reads, and write latency can often be hidden in a background AJAX request.  But sometimes the user will have to wait while you update your object store, and you don't want them to wait too long.
  • Transactionality and consistency.  It should be possible to atomically modify an object's state, without interference from competing writers; and once a write has completed, all subsequent reads must observe it.  (The latter property sounds obvious, but is actually difficult to achieve in a distributed system.)
  • Durability.  Once you've written an object, it should stay written, even in the event of machine failures and other disasters.
  • Availability.  An hour of downtime is liable to get you razzed on TechCrunch; if you don't want this to happen more than once/year, then you need to aim for about 99.99% availability.
  • Geographic distribution.  You may have users all over the world.  Ideally, you'd like to have servers all over the world, and get your 50ms reads and 500ms writes from any server.
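
In code, the interface I'm imagining is tiny.  Here's a toy in-memory sketch -- illustrative only, since it trivially fails the durability, availability, and distribution criteria above, but it pins down the API:

```python
import threading

class ObjectStore:
    """The minimal write(objectId, data) / read(objectId) interface.

    A locked in-memory dict: it gives atomic updates and
    read-your-writes consistency on a single machine, and nothing else.
    """

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def write(self, object_id, data):
        with self._lock:
            self._data[object_id] = data

    def read(self, object_id):
        with self._lock:
            return self._data.get(object_id)

    def update(self, object_id, fn):
        """Atomically modify an object's state, without interference
        from competing writers (the transactionality requirement)."""
        with self._lock:
            self._data[object_id] = fn(self._data.get(object_id))
```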

Another potential consideration is cost, but it seems to me that all of the services I discuss are pretty cheap for most purposes.  Say, less than $1/GB/month (sometimes much less), including both storage and access fees, unless the ratio of access rate to storage size is high.  (One exception: RDS is expensive if your needs are very small, since you can't rent less than one "small instance" -- around $1000/year.)

Google's Ryan Barrett gave a nice talk on this subject in 2009.  Someone has posted a discussion / summary of the talk as well.

The options

Database options available on AWS are SimpleDB (non-relational, replicated database) and RDS (hosted MySQL).  App Engine has its Datastore.  You could build an object store on top of a filesystem, so I'll also discuss the two read/write file storage options on AWS -- EBS (network block store) and the local disk on an EC2 server.  Finally, just for kicks, I'll include App Engine's Memcache service.

They're all slow

Here is a table of read and write latencies for the six services.  For Datastore (App Engine), I present results for both transactional and nontransactional reads.  SimpleDB has a similar distinction ("consistent" and "inconsistent" reads), but I've observed near-identical latency for both, so I'll ignore the inconsistent variant.

Latencies are taken from my benchmarking site.  As I've discussed in previous posts, this site performs a suite of operations every 10 seconds, and records the latencies.  Sampling started a week or two ago in most cases, and yesterday morning for RDS and the nontransactional option in Datastore.  Data continues to accumulate, and you can always see the latest figures, with lots of additional detail, on the linked page.

Here, I report 99th percentile latency.  It's more common to discuss mean latency, but this has several drawbacks.  For one thing, a handful of outliers can throw off the result.  For another, it can hide issues that affect a small but still significant fraction of requests.  Consider also that a single web request may touch multiple objects; if you touch 10 objects, then roughly speaking, your 90th percentile user-observed latency is dictated by your 99th percentile object store latency.  In that light, I'd really prefer to report 99.9th percentile latencies, but it would be too cruel.  (Click through to the dashboard if you're morbidly curious.)  The 99th percentile is bad enough:

Backend                        Read latency (ms)   Write latency (ms)
Datastore (transactional)      805                 1100
Datastore (nontransactional)   437                 1100
GAE Memcache                   24                  25
EC2 disk                       31                  23

The only services that meet our read latency goals are Memcache (which hardly counts) and EC2 local disk.  For write latency, Amazon's database services squeak through or nearly so, and Memcache and both disk options do very well.  (Incidentally, none of the services -- not even Memcache -- meet the 50ms read goal at 99.9th percentile.)
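
As an aside, the earlier claim -- that touching 10 objects per request pushes your 99th percentile store latency into roughly the 90th percentile of user-observed latency -- is just independence arithmetic:

```python
# If each of 10 independent object fetches beats some latency threshold
# with probability 0.99 (the store's 99th percentile), the whole request
# beats it with probability:
p_all_fast = 0.99 ** 10
print(round(p_all_fast, 3))  # 0.904 -- i.e. roughly the user's 90th percentile
```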

The RDS and EC2 disk latencies may be unfairly good, because in both cases I had to reserve an entire (virtual) machine, which is very lightly loaded.  The other services presumably commingle requests from many clients, and so should not benefit much from the fact that I'm presenting a light load.  RDS further benefits from having a small data set (one million small records) on a virtual server with 1.7GB of RAM, hence it may be caching the entire database in memory.  All of the other benchmarks are either on shared servers where there is competition for cache space, or have data sets too large for effective caching.  At some point I may add additional benchmarks with heavier load and/or larger data sets.

I'll dig deeper into the latency graphs and histograms in a subsequent post.

At least they're durable, right?

This turns out to be a difficult question to answer.  Let's consider each backend in turn.

Memcache: obviously, no durability guarantees at all.  Data may vanish at any time.

EC2 local disk: also no guarantees, as an EC2 instance could vanish without warning.  In practice, data is likely to survive for days at a time, and you at least will know when data loss occurs (i.e. when an instance vanishes), so perhaps you could implement a durable system using EC2 instances as building blocks.

EBS: the Amazon documentation has some interesting things to say about EBS durability.  Two relevant quotes:
"Each storage volume is automatically replicated within the same Availability Zone. This prevents data loss due to failure of any single hardware component."
"Amazon EBS volumes are designed to be highly available and reliable. Amazon EBS volume data is replicated across multiple servers in an Availability Zone to prevent the loss of data from the failure of any single component. The durability of your volume depends both on the size of your volume and the percentage of the data that has changed since your last snapshot. As an example, volumes that operate with 20 GB or less of modified data since their most recent Amazon EBS snapshot can expect an annual failure rate (AFR) of between 0.1% – 0.5%, where failure refers to a complete loss of the volume. This compares with commodity hard disks that will typically fail with an AFR of around 4%, making EBS volumes 10 times more reliable than typical commodity disk drives."
So, EBS data is replicated across machines, but not across Availability Zones.  An outage in a single zone will take your EBS volume offline, and conceivably it could be lost forever in a catastrophe.  Short of a complete zone outage, you should expect to lose an EBS volume once every few hundred years (more or less, depending on how often you snapshot and how rapidly you write).  We aren't given a figure for frequency of data loss events smaller than an entire volume; arguably it's implied that there are no such events (or that they're included in the AFR figure), but we don't know for sure.  In fact, given that we don't know how EBS works, or precisely how the underlying machines are managed, we don't know whether we can rely on the AFR estimate at all.

There's also the question of whether a write ever really got into EBS in the first place.  EBS is presented as a block device, and there are probably at least three levels of buffering involved -- one in the client, one in the EBS server, and one in the disk controller.  It's notoriously difficult to ensure that a write has penetrated all of those buffers and been physically written to the disk platter.  (Put another way: in practice, you can't ensure it; don't pretend that you can.)
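
To illustrate, here's as far as a portable program can push a write on its own (a Python sketch; the real benchmarks do the analogous thing in Java).  The flush/fsync pair clears the client-side buffers, but nothing here can reach the EBS server's or the disk controller's caches:

```python
import os

def durable_write(path: str, offset: int, block: bytes):
    """Write a block and push it through the buffer layers we control:
    os.pwrite bypasses user-space buffering, and os.fsync asks the
    kernel to flush its buffers to the device.  Whatever the device (or
    a remote block server) does after that is out of our hands."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.pwrite(fd, block, offset)
        os.fsync(fd)  # flush kernel buffers toward the device
    finally:
        os.close(fd)
```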

RDS: I haven't been able to find any hard statements about RDS durability.  I suspect it is backed by EBS and will have the same durability properties.

SimpleDB: Here's what Amazon has to say:
"Behind the scenes, Amazon SimpleDB creates and manages multiple geographically distributed replicas of your data automatically to enable high availability and data durability."
This sounds reassuring, but doesn't include much detail.  How far apart are these "geographically distributed" replicas?  They appear to be different "availability zones" in the same AWS region.  How independent are availability zones?  For instance, do they have independent long-haul backbone connections?  The Amazon FAQ "How isolated are Availability Zones from one another?" doesn't address this.

We also don't know exactly how the replication works.  When I issue a write and SimpleDB reports it as having completed, has the data already been replicated?  How widely?  In a given replica, is it sitting in a buffer or has it gotten onto the actual disk?  In short, exactly how sure are we that the write will not be lost?  Pretty sure, probably; but it's hard to know.

Datastore: this is backed by Bigtable, which in turn is built on GFS.  In GFS, by the time a write completes, it has already been replicated to multiple storage servers within one data center.  However, per the talk linked above, it may not yet have been replicated outside of the data center (similar to an Amazon "availability zone").  Datastore uses asynchronous replication to copy writes out of the data center.  In the event of a data center outage, a (usually small) number of writes could be lost.

What about availability?

Memcache, presumably, is highly available.

EC2 is subject to a brief outage whenever an instance fails; this probably does not endanger our 99.99% availability target.  Much more seriously, a given EC2 instance becomes unavailable whenever its availability zone fails.  Such failures are not unknown (google "EC2 outage"), so we cannot consider an individual EC2 instance to be highly available.

EBS availability is not affected by machine failures, but is affected by failure of an availability zone, so it's not much more available than EC2.

RDS availability is presumably similar to EC2, with brief outages when a machine fails.  RDS is also subject to occasional downtime for maintenance.

SimpleDB availability should in principle be very good, since it's replicated across availability zones.  In practice, as noted above, it's hard to evaluate this.

Datastore availability is bounded above by App Engine availability.  Unfortunately, that's well short of our 99.99% target -- google "app engine outage".

In short, none of these services fully guarantees durability and availability.  As Google's Ryan Barrett explains in the linked talk, such a guarantee would require synchronous replication across a significant geographic distance, at a high cost in write latency.  But some of the services may be good enough for most everyday uses.  Let me summarize:

Backend    Durability                        Availability
EC2 disk   Weak (lost on instance failure)   Fair
RDS        Same as EBS?                      Same as EBS?
SimpleDB   Excellent?                        Very good?

The only service that really scores well is SimpleDB, and that only if we make some assumptions about synchronicity of replication and independence of availability zones.  Datastore also seems "good enough", if you're already using App Engine and thus are limited by its availability anyway.  With EC2, EBS, and RDS, if you're running a live service then you'll need some way of replicating your data across zones, which can be very complex.

No one distributes your data geographically.

The Amazon services are available in multiple regions -- currently "N. Virginia", "N. California" and Ireland.  No Asian presence, but it's a start, at least.  However, none of these services replicate data across regions.  Any given object resides in a single region, so access to that object from other regions will be slow.  App Engine is even more limited, apparently operating from a single location (at a time).

At least they're transactional.

RDS, SimpleDB, and Datastore all provide transactions and consistent reads.  EBS and EC2 disk should also provide both, though this is of limited utility since you're limited to accessing a given volume from a single machine.  Memcache presumably provides consistent reads, but not transactions AFAIK.

The fact that support for transactions and consistency is so widespread speaks to the importance of these features to developers.

A handy, if depressing, chart.

Here's a summary of the properties we've discussed.  Services which meet one of my object storage goals are green, or yellow-green if caveats exist.  Services which miss the goal (or have major caveats) are yellow, orange, or red.

Backend                        Read latency (ms)   Write latency (ms)   Durability   Availability   Transactionality   Geographic distribution
SimpleDB                       140                 514                  excellent?   very good?     excellent          fair
Datastore (transactional)      805                 1100                 good         fair           excellent          none
Datastore (nontransactional)   437                 1100                 good         fair           weak               none
GAE Memcache                   24                  25                   none         excellent      fair               none
EC2 disk                       31                  23                   weak         fair           good               fair

All of the options have a fair amount of yellow (or worse).  In other words, there is no off-the-shelf solution that supports what I would call a professional grade object store.  What, then, is a professional to do?  I'll say more about that in subsequent posts.

The least bad option, on this chart, is probably SimpleDB.  SimpleDB, by the standards I've defined, is slow and lacks replication across regions; but it's at least not abysmally slow, and it scores well on the other criteria.

Appendix: benchmark details.

Here are the precise operations being benchmarked.

SimpleDB reads: fetch one record from a domain of one million small records, using AmazonSimpleDB.getAttributes(new GetAttributesRequest().withItemName(...)).  Consistent and inconsistent reads seem to have identical latencies except at the very tip of the long tail, so I've reported on consistent reads.  Writes: update one record in the same domain, using AmazonSimpleDB.putAttributes(...).

RDS reads: fetch one record from a table of one million small records, using select * from data where id = ....  Writes: update one record in the same table, using update data set value='...' where id = ....  This is on an RDS "small DB instance".
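
For illustration, here's the same pair of operations sketched in Python against SQLite as a stand-in for MySQL (the real benchmark runs over JDBC against the RDS instance; the table name and queries match the description above, and the table is shrunk to 1,000 rows here):

```python
import random
import sqlite3

# In-memory stand-in for the RDS database; the real table holds one
# million small records.
ROWS = 1000
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (id INTEGER PRIMARY KEY, value TEXT)")
conn.executemany("INSERT INTO data VALUES (?, ?)",
                 [(i, "v%d" % i) for i in range(ROWS)])
conn.commit()

def read_one():
    """Fetch one record chosen randomly -- the read benchmark."""
    rid = random.randrange(ROWS)
    return conn.execute("SELECT * FROM data WHERE id = ?", (rid,)).fetchone()

def write_one():
    """Update one randomly chosen record -- the write benchmark."""
    rid = random.randrange(ROWS)
    conn.execute("UPDATE data SET value = ? WHERE id = ?", ("updated", rid))
    conn.commit()
```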

Datastore reads: fetch one record from a pool of one million small records, divided into 1000 entity groups of 1000 records each.  The transactional version is DatastoreService.beginTransaction(); DatastoreService.get(...singleton keyset...); Transaction.commit();.  The non-transactional version includes only the get().  Writes update one of the records, using DatastoreService.put(...) inside a transaction.

GAE Memcache reads: fetch one record from the cache, using CacheManager.getInstance().getCacheFactory().createCache(Collections.emptyMap()).get(...).  The keys are selected randomly from a range of one million keys, but there is no way to force the memcache to remain populated, so unlike the previous benchmarks, most invocations return no data.  Writes update one record, using a similar sequence but ultimately invoking Cache.put(...).

EBS reads: read an aligned 4096 byte block from an 8GB file stored in Amazon's EBS.  Writes write one aligned block, using RandomAccessFile.write() on a file opened with mode rwd.

EC2 local disk reads: read an aligned 4096 byte block from an 8GB segment of a raw local disk on Amazon's EC2 (no filesystem involved).  Writes are as for EBS.  This is on an EC2 "small" instance.
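
A sketch of the disk-read benchmark itself -- read one aligned 4096-byte block at a random offset and time it.  This is a Python stand-in for the Java RandomAccessFile version, with the file size as a parameter (the real benchmarks use 8GB):

```python
import os
import random
import time

BLOCK = 4096

def read_random_block(path: str, file_size: int) -> float:
    """Read one aligned 4096-byte block at a random offset and return
    the observed latency in milliseconds."""
    block_count = file_size // BLOCK
    offset = random.randrange(block_count) * BLOCK  # keep reads aligned
    fd = os.open(path, os.O_RDONLY)
    try:
        start = time.perf_counter()
        data = os.pread(fd, BLOCK, offset)
        assert len(data) == BLOCK
        return (time.perf_counter() - start) * 1e3
    finally:
        os.close(fd)
```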

Saturday, April 17, 2010

App Engine is MacApp

I've been spending a lot of time lately digging into Google App Engine and Amazon Web Services.  I'd been thinking of them as equivalents -- the two main competitors in the "scalable hosted computation" market.  However, it dawned on me yesterday that they are not actually members of the same category.  To make an analogy, AWS is like the old Macintosh Toolbox -- the APIs for application development in the original Mac OS.  And App Engine is like MacApp -- an early framework for application development.

From a distance, the Toolbox and MacApp served the same purpose.  Both allowed you to create applications.  Both had facilities for creating windows, menus, and controls, provisions for event handling, and so forth.  However, the styles were very different.  The Toolbox exposed raw capabilities like "retrieve the next input event" and "draw a rectangle at a specified location in a specified window".  It was up to the developer to stitch everything together into a coherent application.  MacApp was an application, fully stitched; the developer simply added components to the framework it provided.

There were good and bad aspects to each approach.  Writing a Toolbox app was a lot of work.  A simple "Hello, World" program that fully implemented the user interface guidelines (About box, File, Edit, and Windows menus, etc.) could be hundreds of lines of code.  But because you were accessing the raw capabilities of the windowing system, and managing control flow yourself, you had a lot of power and flexibility.

MacApp, conversely, made it embarrassingly easy to write a simple program.  "Hello, World" was just a few lines of code.  But it had limitations, especially in the early days.  When a new OS release added a new feature -- say, popup menus -- MacApp didn't always support it right away.  And if you wanted to implement some unusual control flow that didn't fit the MacApp model, you were stuck.  You could only work within the confines of the model MacApp provided.

App Engine has limitations of its own.  A few examples: you can only execute code in response to an incoming HTTP request, and the request must complete within 30 seconds, so there's no way of performing a very long computation.  You can't create threads.  You can't control the amount of RAM available to your process.  (For one of my benchmarks, I want a 256MB array, but in App Engine I can't seem to allocate even a 64MB array without triggering an out-of-memory exception).  [Update: originally I claimed the execution limit was 5 seconds, which Ade pointed out is incorrect.  Google's documentation states that the limit can vary but is "typically around 30 seconds".]

I dislike limitations of this sort, because you never know when you're going to trip over them.  You might decide that App Engine is sufficient for your application, and then while working on version 3 you realize you absolutely need threads for a new feature.  This is a bad situation to be in; you either have to find an (often ugly) workaround, port your entire application to a different platform, or abandon the feature.  So I had been planning to focus on AWS, and indeed I'm using it for my "benchmarking the cloud" project.  (As currently implemented, this project would trip over all of the AWS limitations mentioned above.)

Then, yesterday, I was contemplating a new project.  My eight-year-old son recently asked me to teach him programming.  We've done a bit of this before, using MIT's Scratch -- a simple environment where children drag and drop program elements to create sprite animations and games.  However, the drag-and-drop approach quickly becomes cumbersome.  So this time around, I decided to teach him JavaScript.  I spent 10 minutes writing a static HTML file which contains a <textarea> for code input, a Run button, and a <canvas> for display.  When he clicks the Run button, I eval() his code in a context that defines simple graphics functions like circle(x,y,r).

This worked well, but I wanted to turn it into a server-based site.  It would be nice to have access from any computer, and to save the programs somewhere.  A first version of this might only be a few hours' work.  When I contemplated writing it on AWS, my first reaction was a small inward groan at all the steps involved: I'd have to create a new server instance, configure it properly, give it its own EBS backing volume so AWS can restart it after a machine failure, allocate an Elastic IP address and attach a DNS name to it... on and on.  My understanding of how to configure Fedora, Tomcat, and the other tools involved is still evolving, so if I decide to change my setup I'd have two servers to reconfigure (the programming server and the benchmark server).  Pushing new releases is a hassle; I haven't yet managed to script the process.  And then there's the cost; 8.5 cents an hour is $744/year.

Hosting this on App Engine, conversely, would be a few minutes' work.  Launch Eclipse, create a new application, add a few files to it, invoke the Publish to App Engine plugin, and I'm done.  Yes, there would be limitations -- I'm already scheming to work around the lack of threads.  (I want the Run button to send a request to the server, which saves the program and performs a syntax check.  Ideally, the database write and the syntax check would run in parallel.  To accomplish this in App Engine, I'll probably have to send two separate requests from the client.)  But the administrative overhead is so much lower that, for this project, I'll gladly accept the limitations.

So, which is better -- App Engine or AWS?  As is so often the case, the answer is "it depends".  For small projects, App Engine seems like the obvious choice.  For larger projects, it depends on how much you care about flexibility vs. ease of administration.  I would not use App Engine for a really major undertaking, but that leaves a lot of middle ground.

If we look back at desktop applications, eventually the frameworks came to dominate.  OS X uses a framework approach.  Last time I did Windows development, everyone used the MFC framework or .NET's Windows Forms (also a framework).  Newer platforms like Android don't even expose a Toolbox-style API.  These frameworks still have limitations, but they're far more powerful than the early versions of MacApp, and the limitations are much less burdensome.  Virtually all desktop applications nowadays are built on some provided framework.

Will hosted computation follow the same trajectory?  I suspect so, but it will take a while to get there.  One place where the desktop analogy breaks down is the ability to escape from the framework.  In MacApp, you could always talk directly to the Toolbox when you had to, which made it safer to adopt in the early days.  In App Engine, there's no equivalent way to escape when you need, say, threads.  App Engine makes simple things simple; AWS makes complex things possible.  Eventually someone will give us both.

Wednesday, April 14, 2010

You can learn a lot from a histogram

There's now lots of data accumulating, though not yet much depth (in time).  In this post, I'm going to dig into a few operations that show interesting behavior even with a brief sampling period.  All results discussed here are running on an EC2 "small instance".

Here's a histogram for the null microbenchmark:

The code behind this measurement boils down to:

   long startTime = System.nanoTime();
   emptyMethod(); // the empty method being measured
   long duration = System.nanoTime() - startTime;

The measured duration is pinned to a minimum of 1 nanosecond, for the benefit of my histogram class (which uses logarithmic buckets).  So the 1 ns bucket, in this case, represents a measured time of zero.
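The histogram class itself isn't shown in this post, but the logarithmic-bucket idea can be sketched like this (an illustration, not the actual implementation; here each bucket covers a power of two in nanoseconds, rather than whatever bucket widths the real class uses):

```java
public class LogHistogram {
    // One bucket per power of two: bucket i holds durations in [2^i, 2^(i+1)) ns.
    private final int[] buckets = new int[64];

    public void record(long nanos) {
        if (nanos < 1)
            nanos = 1; // pin to a minimum of 1 ns, since zero has no logarithmic bucket
        buckets[63 - Long.numberOfLeadingZeros(nanos)]++;
    }

    public int countInBucket(int i) { return buckets[i]; }

    public int totalCount() {
        int total = 0;
        for (int count : buckets) total += count;
        return total;
    }

    public static void main(String[] args) {
        LogHistogram h = new LogHistogram();
        h.record(0);
        h.record(1500);
        System.out.println(h.totalCount() + " samples");
    }
}
```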

The first thing I notice is that all of the measurements are a precise multiple of 1 microsecond.  The obvious conclusion is that System.nanoTime(), on this machine, is based on a microsecond clock.
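That conclusion is easy to check directly: sample the clock in a tight loop and take the greatest common divisor of the deltas between distinct readings (a quick probe of my own devising; the result will vary by platform):

```java
public class ClockGranularity {
    // Greatest common divisor, used to find the common spacing of clock deltas.
    static long gcd(long a, long b) {
        while (b != 0) { long t = b; b = a % b; a = t; }
        return a;
    }

    public static void main(String[] args) {
        long spacing = 0;
        long last = System.nanoTime();
        for (int i = 0; i < 100000; i++) {
            long now = System.nanoTime();
            if (now != last)
                spacing = gcd(spacing, now - last);
            last = now;
        }
        // On a clock backed by a microsecond timer, this prints 1000.
        System.out.println("apparent tick: " + spacing + " ns");
    }
}
```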

The second thing I notice is that the majority of samples, around 27K of 43K total, were measured at 1 microsecond.  Most of the rest were evenly distributed between 0 ("1 ns") and 2 microseconds.  Then there is a tail of longer timings.

I'm somewhat at a loss to explain the trimodal distribution.  Presumably System.nanoTime() is taking some time to execute.  Suppose it takes 800 nanoseconds.  Then, given the 1 microsecond resolution, we'd expect 80% of the samples to measure at 1 microsecond, and 20% at 0.  That would fit the observed peaks at 1ns and 1µs, but wouldn't explain the peak at 2µs.  Conversely, if System.nanoTime() takes 1200 nanoseconds, that would explain the peaks at 1µs and 2µs but not the peak at 1ns.  Something more complex must be occurring.  Perhaps it takes right about 1 microsecond, but with a significant amount of jitter.

Then there's the long tail; there's one sample way out at 15 µs.  This might have something to do with the System.nanoTime() implementation, or it might involve some external event, such as a CPU interrupt.

I don't have any pressing need to understand how System.nanoTime() works, so I'm not going to dig further into this example.  However, it serves to illustrate that you can learn a lot about an operation simply by looking at its latency histogram.  At least as important, the act of explaining the histogram to yourself highlights any behavior you don't understand, identifying areas for further investigation.  (It helps if the histogram has high resolution and is based on a large number of samples, as in this case.)

Real-world histograms aren't always complicated.  Here's the histogram for a multiply-add loop:

The code:

   int d = 1;
   for (int i = 1; i <= repetitions; i++)
      d *= i;
   dummy += d;

where "dummy" is an external variable which prevents the compiler from treating this as dead code.  I measure System.nanoTime() before and after, and divide by "repetitions", which is set to 1 million for this test.  Virtually all of the samples fall into a single histogram bucket at about 1.15 ns.  There's a tail, but it accounts for a very small fraction of the samples.  The loop (all million iterations) takes about a millisecond, so it's not surprising that we occasionally get interrupted, explaining the tail.  No mysteries in this histogram.

A similar loop which invokes Math.sin(), surprisingly, yields a more complex histogram:

Here the code is:

   double d = 1;
   for (int i = 0; i < repetitions; i++)
      d *= Math.sin(i);
   dummy += (int)d;

with repetitions set to 100,000.  I'm hard-pressed to explain either the fact that there are two major peaks, or that one of them is broad.  The answer does not have to do with changes in behavior over time -- I generated a series of histograms for each one-hour time period, and they all look the same.  Again, if this operation were important to something I was doing, I'd have a good starting point for investigation.
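For the record, the measurement pattern wrapped around all of these loops boils down to the same thing (a sketch; the method name is mine, not the benchmark framework's):

```java
public class LoopTimer {
    static int dummy; // prevents the compiler from treating the loop as dead code

    // Times the multiply loop and returns the average cost per iteration, in ns.
    static double timeMultiplyLoop(int repetitions) {
        long start = System.nanoTime();
        int d = 1;
        for (int i = 1; i <= repetitions; i++)
            d *= i;
        dummy += d;
        long duration = System.nanoTime() - start;
        return (double) duration / repetitions;
    }

    public static void main(String[] args) {
        System.out.println(timeMultiplyLoop(1000000) + " ns/iteration");
    }
}
```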

To finish, let's look at a couple of histograms measuring filesystem access times on EC2.  Here's the histogram for reading 4KB of data, at a random 4K-aligned position, from an 8GB file in a filesystem mounted on Amazon's Elastic Block Store:

(click for full-size image)

Unsurprisingly, the histogram is dominated by a broad peak around 13ms.  This presumably reflects disk seek time; 13ms is a reasonable figure.  The long tail could represent any number of things, such as queuing on a busy disk.
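The operation being measured here amounts to something like the following (a sketch, not the actual benchmark code; the main method substitutes a small temp file for the 8GB test file):

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Random;

public class RandomReadBenchmark {
    // Pick a random 4K-aligned offset within the first fileSize bytes.
    static long alignedOffset(Random random, long fileSize) {
        long blockCount = fileSize / 4096;
        return (long) random.nextInt((int) blockCount) * 4096;
    }

    // Read 4KB at a random aligned offset, returning the elapsed time in ns.
    static long timeOneRead(RandomAccessFile file, long fileSize, Random random)
            throws IOException {
        byte[] buffer = new byte[4096];
        long start = System.nanoTime();
        file.seek(alignedOffset(random, fileSize));
        file.readFully(buffer);
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws IOException {
        File scratch = File.createTempFile("readbench", ".dat");
        scratch.deleteOnExit();
        RandomAccessFile file = new RandomAccessFile(scratch, "rw");
        file.setLength(1 << 20); // 1MB stand-in for the 8GB benchmark file
        System.out.println(timeOneRead(file, 1 << 20, new Random()) + " ns");
        file.close();
    }
}
```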

More interesting is the peak around 15 µs, and the double peaks at 500/800 µs.  At a guess, the first represents a hit in the disk block cache on the local machine, and the double peak represents a hit in the cache on the EBS server.  If these guesses are correct, we can deduce many things, such as the approximate size of "our share" of each block cache, and the RPC time to the EBS server.  The doubled nature of the 500µs peak doubtless hides something interesting as well.  (I thought it might represent variation over time, but even when narrowing down to a single hour's data, the double peak remains.)

We can double-check some of this analysis by varying test parameters.  Here is a histogram of the same operation on the same file, but selecting offsets only in the first 512K of the file.  We invoke each operation every 10 seconds, and other tests involving the same file touch this 512K every 20 seconds or so, effectively pinning it in the local block cache:

A test using the first 256MB of the file yields a profile similar to that for the 8GB test, but the peak at 15 µs is slightly more pronounced, suggesting a slightly higher cache hit rate.

As an exercise, let's use the 8GB graph to estimate the size of the local block cache.  The peak around 15µs contains roughly 1400 samples (eyeballed), or about 6% of the total of ~23,000.  So at any time, the cache holds 6% of that 8GB file, or about 500MB.  The same machine is performing an identical test on an 8GB file stored on local disk.  We'd expect that file to consume a similar amount of cache, and a glance at its histogram suggests that this is indeed the case.  So far we've accounted for 1GB of block cache, and I don't expect much other filesystem usage going on.  As a cross-check, here's the output of "free -m" on the machine:

         total     used     free   shared   buffers    cached
Mem:      1707     1699        7        0       590       607

I'm not sure exactly what the distinction is between "buffers" and "cached", but my understanding is that they are both used mostly for disk blocks.  590MB + 607MB == 1.17GB, which agrees nicely with our guesstimate of 1GB cache usage for the microbenchmark data files.
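The arithmetic behind this guesstimate is simple enough to write down (the 1,400 and 23,000 sample counts are eyeballed from the graph, as noted above):

```java
public class CacheEstimate {
    // If h of t random reads hit the local cache, roughly h/t of the file is cached.
    static long estimateCacheBytes(long hitSamples, long totalSamples, long fileBytes) {
        return hitSamples * fileBytes / totalSamples;
    }

    public static void main(String[] args) {
        long fileBytes = 8L * 1024 * 1024 * 1024; // the 8GB test file
        long estimate = estimateCacheBytes(1400, 23000, fileBytes);
        System.out.println(estimate / (1024 * 1024) + " MB estimated"); // roughly 500MB
        System.out.println((590 + 607) + " MB reported by free -m");    // i.e. about 1.17GB
    }
}
```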

In conclusion:
  • You can learn a lot about an operation by looking at its latency histogram.
  • It's a useful exercise to explain each feature of the histogram to yourself.  If there's something you can't explain, it means you don't fully understand the system.
  • If you want to understand the behavior of a production system, you need to look at a histogram from the production system.
  • High-resolution histograms are important.  For instance, the 1µs quantization of System.nanoTime() wouldn't have been visible at lower resolution.  Neither would the doubled peak at around 500µs in the EBS read time.

Thursday, April 8, 2010

Beginnings of a robust data collector

Since my last posting, I've completely rewritten the data collection system.  The first implementation was a quick hack, serving primarily to get my feet wet. The new version is built on more robust lines. I'll spare you the details, but briefly:
  • Data is summarized into hourly histogram buckets, enabling efficient reporting over long time periods. (Eventually I'll add coarser buckets, to support very long time periods.)
  • There are the beginnings of a reporting engine.
  • It's easy to add microbenchmarks.
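Hourly buckets make long-period reporting cheap because histograms merge trivially: the summary for any time span is just the bucket-wise sum of the hourly histograms it covers.  A sketch (my actual storage format differs):

```java
public class HistogramMerge {
    // Merge hourly histograms (each an array of bucket counts) into one summary.
    static long[] merge(long[][] hourly) {
        long[] total = new long[hourly[0].length];
        for (long[] hour : hourly)
            for (int i = 0; i < hour.length; i++)
                total[i] += hour[i];
        return total;
    }

    public static void main(String[] args) {
        long[][] hours = { {5, 2, 0}, {3, 1, 4} };
        long[] day = merge(hours);
        System.out.println(day[0] + " " + day[1] + " " + day[2]); // prints "8 3 4"
    }
}
```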
A rudimentary dashboard is available (updated -- the original post linked to a clumsier URL, which no longer functions).  It shows a latency summary for each microbenchmark, and a link per benchmark to a complete histogram.  It's been collecting data for a few hours now, sampling each operation every 10 seconds.  Here are the operations I'm benchmarking at the moment:

  • {read, write} a randomly selected entry from an int[] of size {16K, 16MB, 256MB}.  One thing I hope to probe here is lifetime of data in the processor cache.  If performance varies over time, that may suggest cache pollution from other VMs sharing a physical machine with us.  (On reflection, the parameters I'm using probably need to be tweaked.  Microbenchmarking is of course tricky, and I'm not an expert.  RAM benchmarks might form a topic for a later post.)
  • {read, write} 4K bytes from an int[] of size 16KB.
  • Invoke Math.sin() one million times.  (This was the "CPU" test from the original prototype.)
  • A simple multiply-and-add loop.
  • Read one small entry from a SimpleDB database, with or without consistency.  (Same as original prototype.)
  • Write one small entry to a SimpleDB database.  (Again, same as original prototype.)
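The array read tests boil down to something like this (a sketch of my own; the random indices are precomputed so that random-number generation stays outside the timed loop, and the checksum keeps the JIT from treating the reads as dead code):

```java
import java.util.Random;

public class ArrayReadBenchmark {
    // XOR the values at the given indices; the returned checksum makes the
    // reads observable, so the compiler can't eliminate them.
    static int readEntries(int[] array, int[] indices) {
        int checksum = 0;
        for (int index : indices)
            checksum ^= array[index];
        return checksum;
    }

    public static void main(String[] args) {
        int[] array = new int[16 * 1024 / 4]; // the 16KB case; 16MB and 256MB are analogous
        int[] indices = new int[1000000];     // precomputed random positions
        Random random = new Random(42);
        for (int i = 0; i < indices.length; i++)
            indices[i] = random.nextInt(array.length);

        long start = System.nanoTime();
        int checksum = readEntries(array, indices);
        long duration = System.nanoTime() - start;
        System.out.println((double) duration / indices.length
                + " ns/read (checksum " + checksum + ")");
    }
}
```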
Here's a snapshot as of this writing (all times in milliseconds):

Operation | # samples | Min | Median | Mean | 90th %ile | 99th %ile | 99.9th %ile | Max
Read 4 bytes from a 16KB buffer (10,000,000 times) | 6050 | 21.9 | 54.9 | 56 | 76.1 | 126.9 | 667.5 | 707
Write 4 bytes to a 16KB buffer (10,000,000 times) | 6050 | 20 | 53 | 51.7 | 69.2 | 115.5 | 414.5 | 423.3
Read 4K bytes from a 16KB buffer (100,000 times) | 6050 | 479 | 594.5 | 823.6 | 1589.9 | 1985.5 | 2095 | 2032.8
Write 4K bytes to a 16KB buffer (100,000 times) | 6050 | 260 | 317.3 | 449.6 | 843.5 | 1877.3 | 2003.1 | 2005.6
Read 4 bytes from a 16MB buffer (1,000,000 times) | 6050 | 25.9 | 64.7 | 62.5 | 84.6 | 169 | 311.4 | 311.7
Write 4 bytes to a 16MB buffer (1,000,000 times) | 6050 | 59.1 | 123.3 | 119.5 | 152.3 | 398.2 | 1182.6 | 1245.8
Read 4 bytes from a 256MB buffer (1,000,000 times) | 6050 | 32.3 | 79.3 | 80.5 | 99.2 | 271.3 | 1075.1 | 1051.8
Write 4 bytes to a 256MB buffer (1,000,000 times) | 6050 | 93 | 135.9 | 149.9 | 164.8 | 774.2 | 1574 | 1562.6
1000000 repetitions of Math.sin | 6050 | 341 | 395.8 | 597.3 | 1143.8 | 1954 | 2011.2 | 1987.9
10000000 repetitions of integer multiplication | 6050 | 11 | 22.8 | 26.4 | 44.5 | 78.6 | 376.8 | 365.6
Read (inconsistent) | 6050 | 21.6 | 36.2 | 43 | 71.9 | 112.8 | 376.8 | 367.8
Read (consistent) | 6050 | 22.9 | 37.5 | 45.1 | 73.5 | 134.5 | 606.9 | 622.4

[Update: removed broken links from the table above.  If you click on the dashboard link above, you'll see a table similar to this one, but with histogram links included.]

I'll wait for more data, and a better reporting tool (in particular, the ability to graph changes over time), before discussing these results.  I plan to add the following microbenchmarks in the near future:
  • Disk performance: {read, write} {small, large} blocks of data at random offsets in a file.  A very large file tests cache-miss performance; a smaller file could test cross-VM cache pollution.
  • Network: ping another EC2 instance with {small, large} requests.
  • Simple tests of Amazon's RDS (hosted MySQL) service, similar to the SimpleDB tests.
  • App Engine tests -- as many of the AWS tests as are applicable.  (Local-disk tests are not applicable under App Engine.  A form of network test is possible, but it would not be directly comparable to the EC2 test, as I don't believe App Engine supports socket-level network access.)
  • Tests for App Engine's Memcache service.
Suggestions for additional microbenchmarks are welcomed.

Saturday, April 3, 2010

Transparency in tools

I've been dealing with a bunch of new tools and services lately, and running into the inevitable little problems along the way.  This got me thinking about the subject of transparency.  I like to say that any product must meet at least one of the following two goals:

1. Work more-or-less perfectly, virtually all of the time.

2. Be sufficiently transparent that, when the product fails, the user has some idea what to do about it.

Developers usually focus on #1 and ignore #2.  However, perfection is rarely achieved, so users are often left in a bind -- the thing they are trying to use isn't working, but they don't know how to fix it.  If any sufficiently advanced technology is indistinguishable from magic, then insufficiently advanced technologies should not pretend to be magic.

Examples abound.  My (anti-)favorite is a cell phone that can't connect.  My old Blackberry had a habit of dropping calls, with the message "call failed".  Why did the call fail, you might ask?  Was the signal too weak?  (Perhaps I should move to a less occluded spot.)  Was the tower handling too many calls?  (Perhaps I should move far enough to get in range of another tower.)  Was the phone's software glitching?  (Perhaps I should reboot.)  Was it a problem at the remote end?  (Perhaps I should wait for them to call back, or try another number.)  Of course, the phone gives me no information about any of this.  Just "call failed".

(Often, retrying the call succeeds.  It didn't occur to the phone designers to do this automatically, of course, or even prompt me with the option to redial.  The phone just sits there, staring blankly at me, as if nothing had happened.)

My Nexus One, which I otherwise love, has a similar habit with data connections.  It's pretty reliable when I'm on wifi, but otherwise attempts to access the net have a habit of failing.  Sometimes the phone provides no diagnostic information, just "connection failed" or some such.  Those are the good cases.  In other situations, it doesn't even tell me the connection failed, it just never gets around to doing the thing I'd asked it to do.  (One test I've found is invoking the Refresh command in Gmail.  If my connection is working, it spins for a few seconds.  If my connection is down, it spins for a fraction of a second.  No message is displayed in either case.)

The cases that have been troubling me lately all involve software systems.  I've been using Eclipse to develop a simple servlet-based web site on Tomcat and running it on my Mac and Amazon's EC2.  Sometimes it works, but often it can't find my servlet, or doesn't get the latest version.  When it doesn't work, I have a hard time figuring out why, because none of these systems are transparent:
  • Often, Tomcat can't find the compiled version of my servlet -- ClassNotFoundException.  Where is this supposed to be?  (Somewhere under WEB-INF/lib, I think, but I'm not sure and neither the error message nor the documentation of any relevant tool talks about it.)  Which Eclipse command is supposed to create it?  (Build?  Publish?  Something I haven't found yet?)  Did it even get built?  Where would Eclipse be putting it prior to copying it to the Tomcat server?  (No idea.)  When Eclipse publishes the site to Tomcat, where the heck in my filesystem does it put the site?  (Complete mystery.  It's not under Tomcat's webapps folder; at least, not when I run locally.  Tomcat's admin tool displays a list of configured applications, but doesn't show their filesystem path.)
  • When I'm running under EC2, things get even more complicated.  Amazon's Eclipse plugin can launch an EC2 instance and copy my site to it.  This often takes several minutes.  What is going on during that time?  There are some status messages, but I don't know what they mean.  I don't know what actual commands or operations are being performed.  When it fails -- and it often does -- I don't know what operation failed, what context it was invoked in, or how it failed.  I had to dig around for 20 minutes to even find where the heck, in the filesystem of the EC2 image the plugin uses by default, Amazon placed Tomcat.  (/env/tomcat.)
Imagine how much easier all this would have been if the tools involved were more transparent about their own operation.  When Tomcat can't find a class file, the error message could tell me where it looked.  When Eclipse publishes a site to a server, it could tell me what files it's copying, where they came from, and where they went to.  Commands like "Build", "Publish", and "Clean" could tell me what files they affected.  Amazon's EC2 plugin documentation could include a page on "what the plugin actually does when you ask it to launch a server".  None of this would be terribly hard, and it would make the tools much more transparent and debuggable.  This wouldn't just make life better for users, it would make the developers' lives easier, by reducing support requests, enabling better bug reports, and (for open-source projects) giving outsiders a better shot at fixing bugs directly.