Thursday, April 22, 2010

Cloud data storage: many choices, all bad

[Update: I apologize for the atrocious formatting in this post.  Some combination of Blogger's editor, copy/paste from Google Docs, and the contentEditable implementation in Chrome on Mac OS X seems to have rendered the formatting unsalvageable.  It looks much better in the Compose window...]

What I want

I recently set up a MySQL instance using Amazon's Relational Database Service (RDS), and added a couple of SQL operations to my microbenchmark suite.  That completes a tour of the data storage services available on Google App Engine and Amazon AWS.  (Aside from bulk storage services -- S3 and Blobstore.  I'll get to those eventually, but they serve a different purpose.)

Imagine that you're building a web service, and you need a simple object store -- a place to write small objects and retrieve them later.  These might be user records, forum posts, whatever.  Let's ignore all questions of indexing, queries, etc. and just consider simple update and fetch operations -- write(objectId, data) and read(objectId).  How well do the various storage services fill this need?  Well, here are some attributes we might want from an object store:

  • 50ms reads.  Suppose you want to limit your server time to 200ms per request.  It might be hard to avoid doing at least two rounds of object fetches -- i.e. you fetch one or more objects in parallel, and from those objects you get IDs that lead you to another round of fetches.  If you don't want to spend more than half your latency budget here, then each fetch needs to complete in 50ms.
  • 500ms writes.  Writes are less common than reads, and write latency can often be hidden in a background AJAX request.  But sometimes the user will have to wait while you update your object store, and you don't want them to wait too long.
  • Transactionality and consistency.  It should be possible to atomically modify an object's state, without interference from competing writers; and once a write has completed, all subsequent reads must observe it.  (The latter property sounds obvious, but is actually difficult to achieve in a distributed system.)
  • Durability.  Once you've written an object, it should stay written, even in the event of machine failures and other disasters.
  • Availability.  An hour of downtime is liable to get you razzed on TechCrunch; if you don't want this to happen more than once/year, then you need to aim for about 99.99% availability.
  • Geographic distribution.  You may have users all over the world.  Ideally, you'd like to have servers all over the world, and get your 50ms reads and 500ms writes from any server.
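
To pin down what I mean by "simple object store", here's the interface I have in mind -- a minimal sketch, not the API of any actual service:

    // Hypothetical object store interface.  The name and types are
    // illustrative; none of the services discussed below expose exactly this.
    public interface ObjectStore {
        // Atomically replace the object's contents.  Once this returns, all
        // subsequent reads must observe the new data, and the data must
        // survive machine failures.
        void write(String objectId, byte[] data);

        // Fetch the object's current contents, or null if never written.
        byte[] read(String objectId);
    }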

Another potential consideration is cost, but it seems to me that all of the services I discuss are pretty cheap for most purposes -- say, less than $1/GB/month (sometimes much less), including both storage and access fees, unless your ratio of access rate to storage size is high.  (One exception: RDS is expensive if your needs are very small, since you can't rent less than one "small instance" -- around $1000/year.)

Google's Ryan Barrett gave a nice talk on this subject in 2009, available at http://code.google.com/events/io/2009/sessions/TransactionsAcrossDatacenters.html.  Someone has posted a discussion/summary of the talk at http://highscalability.com/how-google-serves-data-multiple-datacenters.

The options

Database options available on AWS are SimpleDB (non-relational, replicated database) and RDS (hosted MySQL).  App Engine has its Datastore.  You could build an object store on top of a filesystem, so I'll also discuss the two read/write file storage options on AWS -- EBS (network block store) and the local disk on an EC2 server.  Finally, just for kicks, I'll include App Engine's Memcache service.

They're all slow

Here is a table of read and write latencies for the six services.  For Datastore (App Engine), I present results for both transactional and nontransactional reads.  SimpleDB has a similar distinction ("consistent" and "inconsistent" reads), but I've observed near-identical latency for both, so I'll ignore the inconsistent variant.

Latencies are taken from http://amistrongeryet.com/dashboard.jsp.  As I've discussed in previous posts, this site performs a suite of operations every 10 seconds, and records the latencies.  Sampling started a week or two ago in most cases, and yesterday morning for RDS and the nontransactional option in Datastore.  Data continues to accumulate, and you can always see the latest figures, with lots of additional detail, on the linked page.

Here, I report 99th percentile latency.  It's more common to discuss mean latency, but that has several drawbacks.  For one thing, a handful of outliers can throw off the result.  For another, it can hide issues that affect a small but still significant fraction of requests.  Consider also that a single web request may touch multiple objects; if you touch 10 objects, then roughly speaking, your 90th percentile user-observed latency is dictated by your 99th percentile object store latency (if each access independently has a 1% chance of being slow, the chance that all 10 are fast is 0.99^10 ≈ 90%).  In that light, I'd really prefer to report 99.9th percentile latencies, but that would be too cruel.  (Click through to the dashboard if you're morbidly curious.)  The 99th percentile is bad enough:

    Backend                         Read latency (ms)   Write latency (ms)
    SimpleDB                              140                  514
    RDS                                   338                  382
    Datastore (transactional)             805                 1100
    Datastore (nontransactional)          437                 1100
    GAE Memcache                           24                   25
    EBS                                   150                   50
    EC2 disk                               31                   23


The only services that meet our read latency goals are Memcache (which hardly counts) and EC2 local disk.  For write latency, Amazon's database services squeak through or nearly so, and Memcache and both disk options do very well.  (Incidentally, none of the services -- not even Memcache -- meet the 50ms read goal at 99.9th percentile.)

The RDS and EC2 disk latencies may be unfairly good, because in both cases I had to reserve an entire (virtual) machine, which is very lightly loaded.  The other services presumably commingle requests from many clients, and so should not benefit much from the fact that I'm presenting a light load.  RDS further benefits from having a small data set (one million small records) on a virtual server with 1.7GB of RAM, hence it may be caching the entire database in memory.  All of the other benchmarks are either on shared servers where there is competition for cache space, or have data sets too large for effective caching.  At some point I may add additional benchmarks with heavier load and/or larger data sets.

I'll dig deeper into the latency graphs and histograms in a subsequent post.

At least they're durable, right?

This turns out to be a difficult question to answer.  Let's consider each backend in turn.

Memcache: obviously, no durability guarantees at all.  Data may vanish at any time.

EC2 local disk: also no guarantees, as an EC2 instance could vanish without warning.  In practice, data is likely to survive for days at a time, and you will at least know when data loss occurs (i.e. when an instance vanishes), so perhaps you could implement a durable system using EC2 instances as building blocks.

EBS: the Amazon documentation has some interesting things to say about EBS durability.  Two relevant quotes:
"Each storage volume is automatically replicated within the same Availability Zone. This prevents data loss due to failure of any single hardware component."
"Amazon EBS volumes are designed to be highly available and reliable. Amazon EBS volume data is replicated across multiple servers in an Availability Zone to prevent the loss of data from the failure of any single component. The durability of your volume depends both on the size of your volume and the percentage of the data that has changed since your last snapshot. As an example, volumes that operate with 20 GB or less of modified data since their most recent Amazon EBS snapshot can expect an annual failure rate (AFR) of between 0.1% – 0.5%, where failure refers to a complete loss of the volume. This compares with commodity hard disks that will typically fail with an AFR of around 4%, making EBS volumes 10 times more reliable than typical commodity disk drives." [from http://aws.amazon.com/ebs/]
So, EBS data is replicated across machines, but not across Availability Zones.  An outage in a single zone will take your EBS volume offline, and conceivably it could be lost forever in a catastrophe.  Short of a complete zone outage, you should expect to lose an EBS volume once every few hundred years (more or less, depending on how often you snapshot and how rapidly you write).  We aren't given a figure for frequency of data loss events smaller than an entire volume; arguably it's implied that there are no such events (or that they're included in the AFR figure), but we don't know for sure.  In fact, given that we don't know how EBS works, or precisely how the underlying machines are managed, we don't know whether we can rely on the AFR estimate at all.

There's also the question of whether a write ever really got into EBS in the first place.  EBS is presented as a block device, and there are probably at least three levels of buffering involved -- one in the client, one in the EBS server, and one in the disk controller.  It's notoriously difficult to ensure that a write has penetrated all of those buffers and been physically written to the disk platter.  (Put another way: in practice, you can't ensure it; don't pretend that you can.)
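
To illustrate how far Java will let you go, here's a sketch (not taken from my benchmark code): FileChannel.force() pushes data through the client and OS buffers, but the disk controller's cache remains out of reach.

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;

    public class FlushExample {
        public static void main(String[] args) throws IOException {
            FileOutputStream out = new FileOutputStream("test.dat");
            FileChannel channel = out.getChannel();
            try {
                // At this point the data may be sitting in any of several buffers.
                channel.write(ByteBuffer.wrap(new byte[4096]));

                // force(true) asks the OS to push both data and metadata to the
                // device.  Whether the disk controller's own cache gets flushed
                // is outside Java's control.
                channel.force(true);
            } finally {
                out.close();
            }
        }
    }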

RDS: I haven't been able to find any hard statements about RDS durability.  I suspect it is backed by EBS and will have the same durability properties.

SimpleDB: Here's what Amazon has to say:
"Behind the scenes, Amazon SimpleDB creates and manages multiple geographically distributed replicas of your data automatically to enable high availability and data durability."
This sounds reassuring, but doesn't include much detail.  How far apart are these "geographically distributed" replicas?  They appear to be different "availability zones" in the same AWS region.  How independent are availability zones?  For instance, do they have independent long-haul backbone connections?  The Amazon FAQ entry "How isolated are Availability Zones from one another?" doesn't address this.

We also don't know exactly how the replication works.  When I issue a write and SimpleDB reports it as having completed, has the data already been replicated?  How widely?  In a given replica, is it sitting in a buffer or has it gotten onto the actual disk?  In short, exactly how sure are we that the write will not be lost?  Pretty sure, probably; but it's hard to know.

Datastore: this is backed by Bigtable, which in turn is built on GFS.  In GFS, by the time a write completes, it has already been replicated to multiple storage servers within one data center.  However, per the talk linked above, it may not yet have been replicated outside of the data center (similar to an Amazon "availability zone").  Datastore uses asynchronous replication to copy writes out of the data center.  In the event of a data center outage, a (usually small) number of writes could be lost.

What about availability?

Memcache, presumably, is highly available.

EC2 is subject to a brief outage whenever an instance fails; this probably does not endanger our 99.99% availability target.  Much more seriously, a given EC2 instance becomes unavailable whenever its availability zone fails.  Such failures are not unknown (google "EC2 outage"), so we cannot consider an individual EC2 instance to be highly available.

EBS availability is not affected by machine failures, but is affected by failure of an availability zone, so it's not much more available than EC2.

RDS availability is presumably similar to EC2, with brief outages when a machine fails.  RDS is also subject to occasional downtime for maintenance -- see http://aws.amazon.com/rds/faqs/#12.

SimpleDB availability should in principle be very good, since it's replicated across availability zones.  In practice, as noted above, it's hard to evaluate this.

Datastore availability is bounded above by App Engine availability.  Unfortunately, that's well short of our 99.99% target -- google "app engine outage".

In short, none of these services fully guarantees durability and availability.  As Google's Ryan Barrett explains in the linked talk, such a guarantee would require synchronous replication across a significant geographic distance, at a high cost in write latency.  But some of the services may be good enough for most everyday uses.  Let me summarize:

    Service      Durability                        Availability
    Memcache     none                              excellent
    EC2 disk     weak (lost on instance failure)   fair
    EBS          good                              fair
    RDS          same as EBS?                      same as EBS?
    SimpleDB     excellent?                        very good?
    Datastore    good                              fair

The only service that really scores well is SimpleDB, and that only if we make some assumptions about synchronicity of replication and independence of availability zones.  Datastore also seems "good enough", if you're already using App Engine and thus are limited by its availability anyway.  With EC2, EBS, and RDS, if you're running a live service then you'll need some way of replicating your data across zones, which can be very complex.

No one distributes your data geographically.

The Amazon services are available in multiple regions -- currently "N. Virginia", "N. California" and Ireland.  No Asian presence, but it's a start, at least.  However, none of these services replicate data across regions.  Any given object resides in a single region, so access to that object from other regions will be slow.  App Engine is even more limited, apparently operating from a single location (at a time).

At least they're transactional.

RDS, SimpleDB, and Datastore all provide transactions and consistent reads.  EBS and EC2 disk should also provide both, though this is of limited utility, since a given volume can only be accessed from a single machine.  Memcache presumably provides consistent reads, but not transactions AFAIK.

The fact that support for transactions and consistency is so widespread speaks to the importance of these features to developers.

A handy, if depressing, chart.

Here's a summary of the properties we've discussed.  Services which meet one of my object storage goals are green, or yellow-green if caveats exist.  Services which miss the goal (or have major caveats) are yellow, orange, or red.


    Backend                        Read (ms)   Write (ms)   Durability   Availability   Transactionality   Geographic distribution
    SimpleDB                          140         514       excellent?   very good?     excellent          fair
    RDS                               338         382       good?        fair?          excellent          fair
    Datastore (transactional)         805        1100       good         fair           excellent          none
    Datastore (nontransactional)      437        1100       good         fair           weak               none
    GAE Memcache                       24          25       none         excellent      fair               none
    EBS                               150          50       good         fair           good               fair
    EC2 disk                           31          23       weak         fair           good               fair

All of the options have a fair amount of yellow (or worse).  In other words, there is no off-the-shelf solution that supports what I would call a professional grade object store.  What, then, is a professional to do?  I'll say more about that in subsequent posts.

The least bad option, on this chart, is probably SimpleDB.  SimpleDB, by the standards I've defined, is slow and lacks replication across regions; but it's at least not abysmally slow, and it scores well on the other criteria.

Appendix: benchmark details.

Here are the precise operations being benchmarked.

SimpleDB reads: fetch one record from a domain of one million small records, using AmazonSimpleDB.getAttributes(new GetAttributesRequest().withItemName(...)).  Consistent and inconsistent reads seem to have identical latencies except at the very tip of the long tail, so I've reported on consistent reads.  Writes: update one record in the same domain, using AmazonSimpleDB.putAttributes(...).
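
For concreteness, here's roughly what those calls expand to with the AWS SDK for Java.  This is a sketch: the domain name, item name, attribute, and credentials are all placeholders, and the actual benchmark code differs in detail.

    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.simpledb.AmazonSimpleDB;
    import com.amazonaws.services.simpledb.AmazonSimpleDBClient;
    import com.amazonaws.services.simpledb.model.GetAttributesRequest;
    import com.amazonaws.services.simpledb.model.GetAttributesResult;
    import com.amazonaws.services.simpledb.model.PutAttributesRequest;
    import com.amazonaws.services.simpledb.model.ReplaceableAttribute;

    public class SimpleDBExample {
        public static void main(String[] args) {
            AmazonSimpleDB sdb = new AmazonSimpleDBClient(
                new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));

            // Read: fetch one item's attributes, using a consistent read.
            GetAttributesResult result = sdb.getAttributes(
                new GetAttributesRequest()
                    .withDomainName("benchmark")    // placeholder domain name
                    .withItemName("item12345")
                    .withConsistentRead(true));

            // Write: replace the "value" attribute of the same item.
            sdb.putAttributes(
                new PutAttributesRequest()
                    .withDomainName("benchmark")
                    .withItemName("item12345")
                    .withAttributes(
                        new ReplaceableAttribute("value", "new data", true)));
        }
    }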

RDS reads: fetch one record from a table of one million small records, using select * from data where id = ....  Writes: update one record in the same table, using update data set value='...' where id = ....  This is on an RDS "small DB instance".
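
In JDBC terms, that boils down to something like the following (the endpoint, credentials, and id are placeholders):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class RdsExample {
        public static void main(String[] args) throws SQLException {
            // The endpoint and credentials are placeholders for a real RDS instance.
            Connection conn = DriverManager.getConnection(
                "jdbc:mysql://myinstance.rds.amazonaws.com:3306/benchmark",
                "user", "password");
            try {
                // Read: fetch one record by primary key.
                PreparedStatement read =
                    conn.prepareStatement("select * from data where id = ?");
                read.setInt(1, 12345);
                ResultSet rs = read.executeQuery();
                rs.next();  // position on the single matching row

                // Write: update the same record.
                PreparedStatement write =
                    conn.prepareStatement("update data set value = ? where id = ?");
                write.setString(1, "new data");
                write.setInt(2, 12345);
                write.executeUpdate();
            } finally {
                conn.close();
            }
        }
    }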

Datastore reads: fetch one record from a pool of one million small records, divided into 1000 entity groups of 1000 records each.  The transactional version is DatastoreService.beginTransaction(); DatastoreService.get(...singleton keyset...); Transaction.commit();.  The non-transactional version includes only the get().  Writes update one of the records, using DatastoreService.put(...) inside a transaction.
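
With App Engine's low-level Datastore API, those variants look roughly like this (the kind, property, and key values are made up; my actual benchmark code differs in detail):

    import com.google.appengine.api.datastore.DatastoreService;
    import com.google.appengine.api.datastore.DatastoreServiceFactory;
    import com.google.appengine.api.datastore.Entity;
    import com.google.appengine.api.datastore.EntityNotFoundException;
    import com.google.appengine.api.datastore.Key;
    import com.google.appengine.api.datastore.KeyFactory;
    import com.google.appengine.api.datastore.Transaction;

    public class DatastoreExample {
        public void readAndWrite() throws EntityNotFoundException {
            DatastoreService ds = DatastoreServiceFactory.getDatastoreService();

            // A key with a parent, so the record falls into one of the entity groups.
            Key group = KeyFactory.createKey("Group", 42);
            Key key = KeyFactory.createKey(group, "Record", 17);

            // Transactional read: begin / get / commit.
            Transaction txn = ds.beginTransaction();
            Entity entity = ds.get(txn, key);
            txn.commit();

            // Nontransactional read: just the bare get().
            entity = ds.get(key);

            // Write: read, modify, and put one record inside a transaction.
            txn = ds.beginTransaction();
            entity = ds.get(txn, key);
            entity.setProperty("value", "new data");
            ds.put(txn, entity);
            txn.commit();
        }
    }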

GAE Memcache reads: fetch one record from the cache, using CacheManager.getInstance().getCacheFactory().createCache(Collections.emptyMap()).get(...).  The keys are selected randomly from a range of one million keys, but there is no way to force the memcache to remain populated, so unlike the previous benchmarks, most invocations return no data.  Writes update one record, using a similar sequence but ultimately invoking Cache.put(...).
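
Spelled out, the JCache boilerplate looks like this (key and value are placeholders):

    import java.util.Collections;
    import javax.cache.Cache;
    import javax.cache.CacheException;
    import javax.cache.CacheFactory;
    import javax.cache.CacheManager;

    public class MemcacheExample {
        public void readAndWrite() throws CacheException {
            // App Engine exposes memcache through the JCache (JSR 107) interface.
            CacheFactory factory = CacheManager.getInstance().getCacheFactory();
            Cache cache = factory.createCache(Collections.emptyMap());

            // Write one record.
            cache.put("key12345", "some data");

            // Read it back; unlike the other backends, this may return null,
            // since the cache can evict entries at any time.
            Object value = cache.get("key12345");
        }
    }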

EBS reads: read an aligned 4096 byte block from an 8GB file stored in Amazon's EBS.  Writes write one aligned block, using RandomAccessFile.write() on a file opened with mode rwd.  (See http://www.docjar.com/docs/api/java/io/RandomAccessFile.html#mode.)
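
Putting the pieces together, one iteration of the disk benchmark looks something like this sketch (the file path is a placeholder, and the EC2 variant operates on a raw device rather than a file):

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.Random;

    public class DiskBenchmark {
        private static final int BLOCK_SIZE = 4096;
        private static final long FILE_SIZE = 8L * 1024 * 1024 * 1024;  // 8GB

        public static void main(String[] args) throws IOException {
            // Mode "rwd" makes each write synchronous, as discussed above.
            RandomAccessFile file = new RandomAccessFile("/mnt/ebs/benchmark.dat", "rwd");
            try {
                // Pick a random block-aligned offset within the file.
                int blockCount = (int) (FILE_SIZE / BLOCK_SIZE);
                long offset = (long) new Random().nextInt(blockCount) * BLOCK_SIZE;
                byte[] block = new byte[BLOCK_SIZE];

                // Read one aligned 4096-byte block...
                file.seek(offset);
                file.readFully(block);

                // ...and write one aligned block back.
                file.seek(offset);
                file.write(block);
            } finally {
                file.close();
            }
        }
    }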

EC2 local disk reads: read an aligned 4096 byte block from an 8GB segment of a raw local disk on Amazon's EC2 (no filesystem involved).  Writes are as for EBS.  This is on an EC2 "small" instance.
