Thursday, April 1, 2010

Early results on GAE and AWS

Last time, I proposed to institute regular monitoring of latencies in GAE (Google App Engine) and/or AWS (Amazon Web Services).  I now have prototype data collectors running on both.  In this post, I'll discuss my experience working with each platform, and give a peek at the early data.  Implementing the same starter project on these two platforms has been an interesting experience: App Engine has been much easier to work with, but is also more limited.

Implementation on App Engine

I started with App Engine, having used it before for one very simple project, a math drill game which I'll describe someday (quick instructions: click some of the checkboxes, type a phrase in the text box, click Generate Code, click Print, and hand the result to your elementary school child).

It was really quite easy to throw together a simple data collector.  I wrote the following servlets:

1. Perform three operations (database read, database write, CPU loop), measure the time needed for each, and write the result to a database.

2. Read a specified range of entries from the database, and display them.

3. Read a specified range of entries from the database, and display a histogram.
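The timing pattern in the first servlet is straightforward. Here's a minimal sketch, with the datastore operations reduced to stubs; the class and method names are my own, and the real servlet writes its results to the database rather than printing them:

```java
// Hypothetical sketch of the per-sample measurement logic.  In the real
// servlet, the stubbed operations are a datastore read, a datastore
// write, and a CPU-bound loop.
public class LatencySample {
    // Time one operation, in milliseconds, using the nanosecond clock.
    static double timeMillis(Runnable op) {
        long start = System.nanoTime();
        op.run();
        return (System.nanoTime() - start) / 1e6;
    }

    public static void main(String[] args) {
        double readMs  = timeMillis(() -> { /* datastore read goes here */ });
        double writeMs = timeMillis(() -> { /* datastore write goes here */ });
        System.out.println("read latency = " + readMs
            + "ms, write latency = " + writeMs + "ms");
    }
}
```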

I then configured GAE's Scheduled Tasks (cron) feature to invoke the first servlet once per minute.  I ran into one or two glitches; for instance, Google's Eclipse plugin makes it easy to deploy a project to GAE -- except that it doesn't copy dependent projects, yielding a ClassNotFoundException that took me a little while to sort out.  (You have to manually build the dependent project and copy the jar file into WEB-INF/lib.  This turns out to be a known issue which will be fixed in a later release of the plugin.)  But really the whole thing was pretty easy, when you consider that the code and data are now reliably ensconced in a multi-homed production environment. 
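For reference, App Engine scheduled tasks are declared in a cron.xml file under WEB-INF. A minimal configuration for the once-per-minute trigger might look like this (the /collect URL is a hypothetical mapping for the first servlet):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<cronentries>
  <cron>
    <url>/collect</url>
    <description>Collect one latency sample</description>
    <schedule>every 1 minutes</schedule>
  </cron>
</cronentries>
```

The schedule syntax expresses intervals rather than traditional cron expressions, and "every 1 minutes" is the floor -- which is exactly the limitation discussed below.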

However, two limitations of GAE were quickly revealed:

1. I want to gather data samples every 10 seconds, but GAE Scheduled Tasks can run at most once/minute, and the operations they invoke are not supposed to run for more than 30 seconds each.  I don't see any workaround that doesn't involve gross hackery, or reliance on some computer outside of GAE.

2. There's no simple way to do large-scale data crunching (e.g. to analyze a mass of historical data).  Your code only runs in the context of a "request", and requests are interrupted after 30 seconds.

Implementation on AWS

I next ported the system to Amazon EC2.  I'm using the default system image supported by Amazon's Eclipse plugin, which includes Tomcat, so the code didn't change much -- it's a servlet environment, just like App Engine.  The database code had to be rewritten for Amazon's SimpleDB, but SimpleDB is conceptually very similar to the App Engine data store.

Getting started with AWS was much more involved than for App Engine.  There are a lot of moving parts to be aware of: your AWS account, EC2 instances, system images, Tomcat, SimpleDB, security domains, four different authentication methods (your Amazon account plus "access keys", "X.509 certificates", and "key pairs"), and so forth.  It took a day or two of flailing around to get a very simple app deployed successfully.  The Eclipse plugin fails to hide the seams often enough that I had to learn quite a bit about Tomcat, ssh into the EC2 instance and figure out where things were deployed, and learn more than I'd previously needed to understand about Eclipse web projects.  I'm still tripping over fresh problems, and I had to embed all my code in a .jsp file because I haven't figured out how to get Eclipse to successfully run a servlet under Tomcat (the .class file doesn't seem to get deployed).

Some of these issues are undoubtedly just fit-and-finish, and may improve in subsequent releases of Amazon's Eclipse plugin.  But some seem to be inherent to the design.  For instance, EC2 and SimpleDB are in separate trust domains, so it's necessary to authenticate your requests to SimpleDB; there is no equivalent need to authenticate data store operations in App Engine.  And while Amazon's Eclipse plugin theoretically provides push-button deployment for simple apps, this only applies to development: the deployment is tied to a particular EC2 instance, and (as I understand things) won't survive a machine failure.  [EC2 would restart the instance, but with a fresh copy of the system image, so the Tomcat app that Eclipse has deployed will be lost.]  Robust production deployment on EC2 requires a more complex manual solution, such as building your own system image.

For the moment, I'm not worrying about this.  My .jsp file is ephemerally deployed to a single EC2 instance, and has been running happily for 5 days.  A background thread triggers data collection every 10 seconds; I'm not limited to one-minute intervals as I am on App Engine.  And when the time comes for data crunching, I'll have no limitations -- there's even MapReduce support, though I haven't looked at it.
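Since EC2 imposes no cron-style minimum interval, the 10-second trigger can be an ordinary background thread. A minimal sketch using a scheduled executor (the class and method names are my own, and the task body is a stand-in for the real measurement code):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the 10-second collection trigger; the real
// task measures read/write/CPU latencies and stores them in SimpleDB.
public class BackgroundCollector {
    static final AtomicInteger sampleCount = new AtomicInteger();

    static void collectOneSample() {
        // Stand-in for the real measurement code.
        sampleCount.incrementAndGet();
    }

    static ScheduledExecutorService start() {
        ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
        // Fire immediately, then every 10 seconds thereafter.
        scheduler.scheduleAtFixedRate(
            BackgroundCollector::collectOneSample, 0, 10, TimeUnit.SECONDS);
        return scheduler;
    }
}
```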

Preliminary Results

Here is a sample of the raw data from App Engine:

+63757ms, read latency = 56.087ms, write latency = 116.082ms, cpu latency = 169.271ms
+56196ms, read latency = 52.436ms, write latency = 90.867ms, cpu latency = 169.404ms
+61261ms, read latency = 49.787ms, write latency = 92.084ms, cpu latency = 170.163ms
+58766ms, read latency = 52.859ms, write latency = 69.427ms, cpu latency = 171.168ms
+60014ms, read latency = 51.714ms, write latency = 86.897ms, cpu latency = 166.677ms
+61209ms, read latency = 47.03ms, write latency = 77.575ms, cpu latency = 168.59ms
+58919ms, read latency = 62.297ms, write latency = 101.389ms, cpu latency = 167.873ms
+59906ms, read latency = 108.582ms, write latency = 95.255ms, cpu latency = 167.214ms
+60126ms, read latency = 50.703ms, write latency = 88.301ms, cpu latency = 172.228ms
+59873ms, read latency = 53.44ms, write latency = 130.192ms, cpu latency = 167.152ms
+61244ms, read latency = 45.855ms, write latency = 91.702ms, cpu latency = 170.425ms
+59126ms, read latency = 43.619ms, write latency = 96.122ms, cpu latency = 170.043ms

The first column shows the interval between invocations of the collect-one-sample URL; these are generally close to the requested 60 seconds.  The other three columns show the measured time to execute my three operations:

1. Read one small record from the database, randomly selecting one of 1000 possible row keys from one of 1000 possible entity groups (transaction domains).  I did not bother to pre-populate the database, so most of these reads are returning no data; the database will gradually fill in as writes occur.  The full sequence being timed is "open transaction, read one row, close transaction".

2. Write one small record to the database, selecting a row in the same fashion as for reads.  The full sequence is "open transaction, write one row, commit transaction".

3. Invoke Math.sin() one million times.
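The CPU loop (operation 3) can be sketched as follows. The one subtlety is keeping, and printing, the accumulated sum so the JIT can't optimize the loop away; the accumulator is my own addition, and the class and method names are hypothetical:

```java
// Hypothetical sketch of the CPU benchmark: one million Math.sin() calls.
// The running sum is retained and printed so the loop can't be eliminated
// as dead code.
public class CpuLoop {
    static double burnCpu() {
        double sum = 0;
        for (int i = 0; i < 1_000_000; i++) {
            sum += Math.sin(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        double sum = burnCpu();
        double elapsedMs = (System.nanoTime() - start) / 1e6;
        System.out.println("cpu latency = " + elapsedMs + "ms (sum=" + sum + ")");
    }
}
```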

For AWS, the benchmarks are similar, but SimpleDB provides the option for "consistent" or "inconsistent" reads, and I measure each.  SimpleDB is currently limited to 100 "domains" (the seeming equivalent of App Engine entity groups), so I'm using 1 million distinct row keys in a single domain.  Here's some raw data:

+10037ms, read latency = 22.541ms / 28.196ms, write latency = 66.617ms, cpu latency = 379.424ms
+9945ms, read latency = 48.362ms / 23.199ms, write latency = 80.882ms, cpu latency = 427.459ms
+10004ms, read latency = 43.896ms / 28.407ms, write latency = 56.014ms, cpu latency = 373.689ms
+10033ms, read latency = 18.509ms / 20.82ms, write latency = 62.999ms, cpu latency = 393.314ms
+10018ms, read latency = 31.931ms / 16.278ms, write latency = 85.663ms, cpu latency = 432.111ms
+9979ms, read latency = 25.375ms / 22.718ms, write latency = 58.732ms, cpu latency = 349.028ms
+10004ms, read latency = 26.88ms / 17.355ms, write latency = 83.115ms, cpu latency = 378.508ms
+10021ms, read latency = 20.897ms / 36.619ms, write latency = 577.392ms, cpu latency = 412.048ms

We can see that, as suspected, both systems exhibit substantial variability.  Here is a summary:

                       mean  10th  50th  90th  99th 99.9th
                       ====  ====  ====  ====  ==== ======
GAE Read               77.4  39.3  46.6 126.9 413.9 2148.4
GAE Write             116.7  68.5  84.3 172.9 526.7 2478.6
GAE CPU               166.3 157.8 164.1 170.6 181.8  188.7

AWS Read Inconsistent  52.2  22.1  30.9  57.8 265.3 3195.8
AWS Read Consistent    30.9  16.0  22.4  41.5 235.2  276.8
AWS Write              71.6  48.1  57.2  86.0 348.9  868.7
AWS CPU               392.0 346.8 393.1 435.4 445.7  485.9

The App Engine figures reflect 5 days of data; the AWS figures cover only a few hours, due to performance problems when aggregating more data (see below).  All figures are in milliseconds; the first column shows mean latency, and the rest show percentiles.

I'll hold off on serious analysis until I've gathered more data.  However, a few things jump out here:
  • The CPU in an EC2 "small instance" has much slower floating point performance than the CPUs behind App Engine.  (I'll eventually add tests for other aspects of CPU performance, and other EC2 instance types.)
  • CPU timings are quite consistent.  This suggests that neither Google nor Amazon is regularly oversubscribing its CPUs.
  • Database operations show wide variance in both GAE and AWS.  (Remember that the AWS figures reflect a shorter time period.)
  • The simple database operations tested here are faster in AWS.
  • AWS "read consistent" is faster than "read inconsistent".  This of course is very surprising.  It's possible that I crossed the streams somewhere -- if so, that should show up as I clean up the code (see Next Steps).
A fuller presentation of the data, including histograms, appears on separate pages -- one for App Engine and one for AWS.

Slow batch reads on SimpleDB

I mentioned that I've only been able to aggregate a few hours' worth of data on SimpleDB.  For some reason, when I query my results database to extract data for aggregation, the execution time is insanely slow: on the order of 60ms per record retrieved.  Here's the code:

    SelectResult result = db.select(new SelectRequest()
        .withConsistentRead(true)
        .withSelectExpression(
            "select * from samples where startTime >= '" + rangeStartKey
            + "' and startTime <= '09' order by startTime limit " + limit));
    List<Item> items = result.getItems();

If limit is 1, this executes quickly.  Execution time increases more or less linearly with limit, with the enormous coefficient mentioned above (60ms/record).  (Note: the query would be simpler if I could figure out how, in SimpleDB, to filter based on row key instead of a property.  But, given that execution is fast when limit=1, I don't think query complexity is the problem.)

So far, this is a mystery.

Next steps

I need to get Eclipse, Tomcat, and EC2 to play more nicely together, and solve the slow-batch-reads problem.  Once that's done, I plan to rewrite the current prototype along more robust lines.  Since App Engine does not support sampling more than once per minute, nor provide batch processing, I'll do all the collection triggering, data storage, and analysis in EC2.  In App Engine I'll just deploy a stub that performs each operation once and returns the timings to EC2.  I'll also profile a bunch of additional operations, implement a more robust reporting engine, and make it publicly accessible.  More on all this in a later post.

Footnote: EC2 overloaded?

I stumbled onto an interesting pair of blog posts suggesting that EC2 is "oversubscribed"; more properly, that EC2 instances are collectively saturating the physical machines Amazon is running them on, leading to performance problems.

Within a few weeks, I should have enough data to determine whether my application is experiencing the same effect.
