Wednesday, July 13, 2011


When evaluating a cloud service, it’s critical to know how reliable it is. “Reliability” is a multifaceted concept, encompassing availability, performance, stability, etc. All of these factors can impact your life as a hosting customer.

Unfortunately, solid data on reliability is hard to come by. I touched on this in my recent post SLAs Considered Harmful, but it’s hardly news. If you spend time on the relevant discussion groups, you’ll see any number of discussions of reliability, but the supporting data rarely gets much more concrete than “we’ve been using service X for a while and it’s worked pretty well”.

The discussion following the April 2011 AWS outage brought this home to me. Here was a major event, with significant implications for cloud hosting providers and customers. Critical to our collective understanding of the event is a quantitative analysis of the impact. For instance, what fraction of EBS volumes were affected, and for how long? How does this compare to “typical” EBS availability? How does EC2 local disk compare? What about comparable offerings from other service providers? There were extensive and, for the most part, well-considered discussions all across the net, but very little hard data, making the whole discussion somewhat academic.

For instance: one strategy that would have protected you from the AWS event would be to spread your service across regions. However, inter-region network access is slower and less reliable. How much so? Enough to negate the reliability benefit of not relying on a single region? We simply don’t have the data to make this determination.

Once upon a time, weather data was similarly anecdotal. “I used to have a farm up north, the storms we got there were awful.” But organized record-keeping, and (more recently) satellite data, have changed all that. There’s a wealth of weather data available, and if you need to make a decision based on the probability of severe storms in location X vs. location Y, you can get that information.

It’s time we did the same for cloud services. I call the idea Cloudsat (by analogy to Landsat).

What is Cloudsat?

The goal, remember, is to provide hard information about the availability, performance, and stability of cloud services. A straightforward way to accomplish this is to invoke some basic operation at regular intervals, and record the success/failure and latency of each invocation. The results, aggregated over time, provide a good picture of each aspect of service reliability.
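As a rough illustration of that approach, here is a minimal Python sketch. The names `Sample`, `probe`, and `summarize` are invented for this example and are not part of any existing tool; the operation being measured is passed in as an ordinary callable.

```python
import time
from typing import Callable, List, NamedTuple, Optional


class Sample(NamedTuple):
    timestamp: float            # wall-clock time when the probe started
    ok: bool                    # True if the operation finished without raising
    latency_s: Optional[float]  # elapsed seconds, or None on failure


def probe(operation: Callable[[], object]) -> Sample:
    """Invoke one operation once; record success/failure and latency."""
    started_at = time.time()
    t0 = time.perf_counter()
    try:
        operation()
    except Exception:
        return Sample(started_at, False, None)
    return Sample(started_at, True, time.perf_counter() - t0)


def summarize(samples: List[Sample]) -> dict:
    """Aggregate samples into an availability fraction and a median latency."""
    latencies = sorted(s.latency_s for s in samples if s.ok)
    return {
        "availability": len(latencies) / len(samples) if samples else None,
        # Upper median when the number of successful samples is even.
        "median_latency_s": latencies[len(latencies) // 2] if latencies else None,
    }
```

A real collector would call `probe` on a timer (every 10 seconds, say) and keep the samples for later aggregation.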

I’ve been doing exactly this for about a year now. I’m running a few benchmark servers on AWS and Google App Engine, performing simple operations like “read one record from SimpleDB” every 10 seconds, and logging the results. For examples, see most of my posts from 2010; the live data is also published online. The results have been interesting, but they’re confined to a small handful of servers. Using dedicated benchmark servers to sample the full spectrum of cloud providers, regions, zones, server types, etc. would be prohibitively expensive. What we really need is crowdsourced benchmarks -- a collective, volunteer effort to aggregate performance data from across the cloud. The result would be a continuous, global survey of cloud service performance: Cloudsat.

Needless to say, a simple benchmark can’t tell you everything. For instance, my SimpleDB benchmark doesn’t tell us anything about how performance varies with the amount of data transferred, use of indexes, or many other variables. However, even a simple benchmark, repeated over time and across servers, reveals quite a bit about how robust a service is. Load distribution problems will manifest as latency variation between servers; stability problems will manifest as occasional error spikes; and so on.

Think of this as a complement to traditional benchmarking. A traditional benchmark gives you a snapshot of performance for a workload of your choosing. Cloudsat data will, hopefully, tell you whether you can expect that performance to remain stable over time. In the next few sections, I describe how Cloudsat could be implemented.

Data model

The Cloudsat database is a large collection of benchmark results. Each benchmark result consists of:

  • A timestamp (when the benchmark was executed)
  • Which operation was benchmarked (e.g. “execute Math.sin() 1,000,000 times”)
  • The outcome (operation succeeded with latency X, or operation failed)
  • Where the operation took place
    • Service provider (Amazon, Google, RackSpace, …)
    • Region / zone
    • Server type
    • Server ID
    • Owner ID (who owns/leases this particular server)
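One possible encoding of such a record, sketched in Python. The field names, the millisecond unit, and the JSON-lines serialization are all assumptions made for illustration; the post does not specify a format, and the example values are made up.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class BenchmarkResult:
    timestamp: float             # when the benchmark was executed (Unix seconds)
    operation: str               # which operation was benchmarked
    succeeded: bool              # outcome: success or failure
    latency_ms: Optional[float]  # latency on success, None on failure
    provider: str                # service provider
    region: str                  # region / zone
    server_type: str
    server_id: str
    owner_id: str                # who owns/leases this particular server

    def to_json_line(self) -> str:
        """Serialize to one line of JSON, suitable for appending to a log."""
        return json.dumps(asdict(self), sort_keys=True)


# Illustrative values only.
example = BenchmarkResult(
    timestamp=1310515200.0,
    operation="simpledb_read_one_record",
    succeeded=True,
    latency_ms=23.5,
    provider="Amazon",
    region="us-east-1a",
    server_type="m1.small",
    server_id="server-0001",
    owner_id="owner-42",
)
```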

Data collection

To crowdsource data collection, we need a benchmarking tool that volunteers can install on their servers. This tool must:

  • Impose minimal impact on the server.
  • Be easy to install in a wide variety of server environments.
  • Require little or no work to maintain.

An obvious approach is to write a simple demon that wakes up at regular intervals, say every 10 seconds, and executes a series of cheap and simple benchmarks. The results would be uploaded to a central repository. To minimize load on the repository, the demon would probably accumulate results in a local log file, and upload them in small batches.
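The spool-and-batch idea might look roughly like the following Python sketch. `Spool` is a hypothetical class, and the `upload` callable stands in for whatever transport the repository would actually expose. If an upload fails, the local log is kept so the batch can be retried later.

```python
import json
import os
import tempfile
from typing import Callable, List


class Spool:
    """Append results to a local log file and hand them off in batches."""

    def __init__(self, path: str, batch_size: int,
                 upload: Callable[[List[dict]], None]):
        self.path = path
        self.batch_size = batch_size
        self.upload = upload  # hypothetical uploader; expected to raise on failure
        self.pending = 0
        if os.path.exists(path):
            # Pick up results left over from a previous run.
            with open(path) as f:
                self.pending = sum(1 for _ in f)

    def record(self, result: dict) -> None:
        with open(self.path, "a") as f:
            f.write(json.dumps(result) + "\n")
        self.pending += 1
        if self.pending >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if self.pending == 0:
            return
        with open(self.path) as f:
            batch = [json.loads(line) for line in f]
        try:
            self.upload(batch)
        except Exception:
            return  # keep the log file; the batch is retried on the next flush
        os.remove(self.path)
        self.pending = 0


# Example: batches of two, "uploading" into an in-memory list.
uploaded = []
spool = Spool(os.path.join(tempfile.mkdtemp(), "results.log"), 2, uploaded.append)
for i in range(3):
    spool.record({"seq": i})
# One batch of two results has been uploaded; the third is still spooled locally.
```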


Repository

The repository receives data batches from the collectors, and stores them for later retrieval and analysis. Desired features of the repository:

  • Accept data uploads with high reliability.
  • Provide real-time display (e.g. graphing, histograms) of the uploaded data.
  • Support large-scale aggregations and analysis, e.g. compute a latency histogram for some service across multiple servers and long time periods.

I envision the repository as consisting of log batches and histogram batches. A log batch contains raw data: all measurements from a single server for some period of time, say one hour. A histogram batch aggregates multiple log batches, and collapses individual measurements across one or more of the dimensions defined in the data model (time, server ID, etc.).

So, for instance, one particular histogram batch might store latency for SimpleDB reads, aggregated across all servers in the us-east-1a availability zone, for a one-week period, in 5-minute buckets. This batch would contain a total of 2016 histograms (one week divided by 5 minutes). The same data might also be aggregated at one-minute resolution, and stored in a set of seven batch files, one per day (to keep the size of each batch manageable). Coarse aggregation supports broad analysis, e.g. graphing latency over a long time period. Fine aggregation supports detailed analysis, e.g. graphing over smaller time periods. The repository would store a suite of histogram batches at various levels of aggregation in time and space.
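To make the bucket arithmetic concrete, here is a small Python sketch. The function names and the 10 ms latency bin width are assumptions for illustration. It checks the bucket counts mentioned above and shows one way to collapse raw samples into per-bucket latency histograms.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

WEEK_S = 7 * 24 * 3600
DAY_S = 24 * 3600


def bucket_count(period_s: int, bucket_s: int) -> int:
    """Number of time buckets of bucket_s seconds in a period of period_s seconds."""
    return period_s // bucket_s


def build_histograms(samples: Iterable[Tuple[float, float]],
                     bucket_s: int,
                     latency_step_ms: int = 10) -> Dict[int, Counter]:
    """Group (timestamp_s, latency_ms) samples into one histogram per time bucket.

    Returns {bucket_start_s: Counter({latency_bin_ms: count})}.
    """
    histograms: Dict[int, Counter] = defaultdict(Counter)
    for ts, latency_ms in samples:
        bucket_start = int(ts // bucket_s) * bucket_s
        latency_bin = int(latency_ms // latency_step_ms) * latency_step_ms
        histograms[bucket_start][latency_bin] += 1
    return histograms


week_at_5_min = bucket_count(WEEK_S, 5 * 60)  # 2016 histograms per weekly batch
day_at_1_min = bucket_count(DAY_S, 60)        # 1440 histograms per daily batch
```

Merging the log batches from several servers is then just a matter of feeding all of their samples through `build_histograms` with the same bucket size.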

Log batches and histogram batches would be stored as files in Amazon S3, and be on the order of 100 KB to 1 MB in size.


Challenges

In principle CloudSat can be a fairly simple project. However, it does raise some challenges:

  • Robustness of data gathering. For instance, suppose an EC2 instance that is running the CloudSat benchmark demon experiences network problems. If those network problems render the instance unable to post to the CloudSat repository, then a sampling bias results; the problematic period will be excluded from the CloudSat data. It would be a shame if data gathering stops just when things get interesting.
  • Data trustworthiness.
    • Disentangling general service performance from client-induced overload -- is a problem the service’s fault, or the client’s fault? For instance, again suppose that an EC2 instance is experiencing network problems. That could be due to a problem with the EC2 network, but it could also be caused by the instance attempting to use an unreasonable amount of bandwidth. The two cases are hard to distinguish, but only one is relevant when evaluating the reliability of the EC2 network.
      • Server failures are a particularly vexing instance of this problem. When a server shuts down or restarts, how do we know whether it’s due to an infrastructure failure, or a deliberate action?
    • A misconfigured CloudSat demon might report incorrect data, or tag the data incorrectly (e.g. misreport a RackSpace server as being on EC2).
    • CloudSat demons run on untrusted servers, so someone could deliberately feed bad data into the system.
  • Universality: designing a benchmark demon that can run on all mainstream server types, operating systems, etc.
    • PaaS (Platform As A Service) systems, such as Google’s App Engine, pose a particular challenge because they often don’t support conventional demon processes.
  • Constructing useful, stable benchmarks that don’t impose significant load on the host server. For instance:
    • A meaningful benchmark of random-access disk I/O requires a multi-GB data set, to defeat caching. This can use a nontrivial fraction of a server’s disk space, and take significant time to initialize.
    • Measuring network performance requires a second server to talk to. It’s not obvious how this partner server should be chosen.
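The sampling-bias problem in the first bullet cannot be solved by the repository alone, but missing periods can at least be made visible rather than silently dropped. A minimal Python sketch, assuming samples are expected every 10 seconds; `find_gaps` and its tolerance factor are hypothetical choices for this example.

```python
from typing import Iterable, List, Tuple


def find_gaps(timestamps: Iterable[float],
              interval_s: float = 10.0,
              tolerance: float = 1.5) -> List[Tuple[float, float]]:
    """Return (last_seen, next_seen) pairs where samples are missing.

    A gap is reported when two consecutive samples are more than
    interval_s * tolerance seconds apart.
    """
    ordered = sorted(timestamps)
    gaps = []
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > interval_s * tolerance:
            gaps.append((prev, cur))
    return gaps
```

An analysis could then treat each gap as "unknown" instead of folding it into the availability figure. That does not say why the data is missing, only that it is.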

First steps

What’s the minimal version of CloudSat that would do enough to be interesting? One possible target is to gather information about CPU and network performance, across conventional Linux servers on a variety of hosting providers. This would yield enough data to be interesting, while minimizing implementation challenges. It would require:

  • A simple Linux-only implementation of the benchmark demon.
    • Periodically executes a simple CPU benchmark. (Note, the goal here is to monitor CPU availability, not absolute performance on a particular workload.)
    • Periodically pings a set of known servers.
    • Spools the results to a log file.
    • Periodically uploads the log file to the CloudSat repository and then discards the local copy.
  • A scalable implementation of the repository.
    • Log batches and histogram batches stored in Amazon S3, with metadata in SimpleDB.
    • Acceptor service receives log batches from benchmark demon instances.
    • Cron job creates new histogram batches according to hard-coded rules.
    • Reporting service returns time-series data, using a simple query planner to identify the appropriate set of log batches and/or histogram batches.
  • A Web UI for generating graphs and histograms from the reporting service.
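The CPU benchmark in the list above could be as simple as timing a fixed amount of work and comparing each run against the fastest run seen on the same server. The Python sketch below assumes that approach; the function names and the iteration count are arbitrary choices for illustration.

```python
import math
import time
from typing import List


def cpu_probe(iterations: int = 200_000) -> float:
    """Time a fixed CPU-bound workload; returns elapsed wall-clock seconds."""
    t0 = time.perf_counter()
    acc = 0.0
    for i in range(iterations):
        acc += math.sin(i)
    return time.perf_counter() - t0


def relative_availability(durations: List[float]) -> List[float]:
    """Express each run as best/duration: 1.0 means as fast as the best run."""
    best = min(durations)
    return [best / d for d in durations]
```

Because each server is only compared against itself, the result says little about absolute speed, which matches the stated goal of monitoring CPU availability rather than performance on a particular workload.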

Call for participation

To get CloudSat off the ground, I could use help in the following areas:

  • Validating the need. Would CloudSat data be of interest to you? What would you like to see? Let me know!
  • Gathering data. Would you be willing to run a CloudSat demon on your servers? If so, what sort of servers (OS, hosting provider, etc.) do you use? If not... why not?
  • Implementing the benchmark demon. The demon should be fairly simple, but I’ve never written a packaged Linux program before. I could use advice on an implementation approach to maximize portability and ease of installation.
  • Spreading the word. Please forward this post!

If you’re interested, comment on this blog post or contact me at


  1. I really like your ideas, but I noticed this post is from 2011! How is the CloudSat idea coming along?

    I would love to help in any way possible.

    Recently I set out to benchmark the disk I/O performance on some Amazon instances.
    In my testing, I was finding that the "Standard" volume was getting better I/O performance than any of the "IOPS" volumes.
    I tried between 500 and 2400 IOPS and found the "Standard" volume consistently beat the supposedly faster volumes.
    I found this confusing, and haven't really gotten an explanation.
    Something like CloudSat would really come in handy here.

    Thanks so much for your post, and look forward to hearing about any updates!

  2. Thanks for the response! CloudSat is still just an idea, but I'm hoping to turn it into reality soon, in the context of my current startup. Send me an e-mail if you'd like to talk more about it -- steve@ that domain.

    As for IOPS performance: I haven't tested this, but my understanding is that reserved IOPS is both a reservation and a cap. If you reserve 500 IOPS, then you're (somewhat) guaranteed to be able to get a rapid response to 500 operations per second, but if you issue more than 500 operations per second I believe you'll be throttled. Regular EBS volumes are best-effort, but also aren't throttled in the same way. So if you reserved 500 IOPS but issued 600 operations/second, you might get better performance on a regular volume. If you were issuing fewer operations than you'd reserved, then it's interesting / surprising that you saw worse performance with reserved IOPS.

    I published a much more detailed look at EBS performance last year. It is now somewhat dated, and doesn't cover reserved IOPS at all, but you might find it interesting.