Friday, May 28, 2010

More on the RDS latency oscillation

[Sorry for the long gap in posts; I've been working on other things.  Posts are likely to be somewhat sporadic for the foreseeable future, especially as I have quite a bit of travel coming up this summer.]

In Three Latency Anomalies, I mentioned a periodic oscillation in latency for simple read and write operations in Amazon's RDS (hosted MySQL) service.  I opened a support ticket with Amazon, and eventually received this response:

Hi Steve,

The latency you're seeing may be related to the DB Instance class (db.m1.small) that you're running. The I/O performance on the db.m1.small is Moderate, vs. High for the Large and Extra Large DB Instance classes. We are working on publishing guidelines around this variance on the Amazon RDS product page.

You might try running your test on one of the "High" I/O performing systems and see if that improves the latency issues that you're experiencing.

Regards,
The Amazon RDS Team

The I/O explanation didn't sound right to me, because the variance appears even when repeatedly re-reading a single record, which ought to be cached.  But I decided to try their advice and experiment with a "large" RDS instance.  While I was at it, I added a couple more small instances and an additional client.  This gave me 8 client/server pairs to test -- 2 clients, each issuing requests to 4 servers.  The operations benchmarked are (sketched in code after the list):
  • Fetch one small record (always the same record)
  • Fetch a small record randomly selected from a table of 1,000,000 records
  • Update one small record (always the same record)
  • Update a small record randomly selected from a table of 1,000,000 records
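In rough code, the four operations look something like this (just a sketch to show the query shapes and how each one is timed -- the table name, schema, and the use of Node.js with the npm mysql package are placeholders of mine, not the actual harness):

```javascript
// Sketch of the four benchmarked operations, with a placeholder schema
// ("records" table with an integer "id" and a "value" column) and a
// placeholder RDS endpoint.  Not the actual benchmark harness.
var mysql = require('mysql');

var conn = mysql.createConnection({
  host: 'my-instance.rds.amazonaws.com',   // placeholder endpoint
  user: 'bench', password: '...', database: 'bench'
});

var TABLE_SIZE = 1000000;

function randomId() {
  return 1 + Math.floor(Math.random() * TABLE_SIZE);
}

// Fixed-key read, random-key read, fixed-key update, random-key update.
var operations = {
  fetchFixed:   function (cb) { conn.query('SELECT * FROM records WHERE id = 1', cb); },
  fetchRandom:  function (cb) { conn.query('SELECT * FROM records WHERE id = ?', [randomId()], cb); },
  updateFixed:  function (cb) { conn.query('UPDATE records SET value = value + 1 WHERE id = 1', cb); },
  updateRandom: function (cb) { conn.query('UPDATE records SET value = value + 1 WHERE id = ?', [randomId()], cb); }
};

// Time one operation and hand the latency (in ms) to a recording callback.
function timeOperation(name, record) {
  var start = Date.now();
  operations[name](function (err) {
    record(name, Date.now() - start, err);
  });
}
```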
All operations use the same table, which is shared across clients (but, obviously, not across servers).  Here is the last 20 hours' worth of latency data for the first operation, on the original client/server pair:

Here is the same graph, for the same client but a second "small" RDS instance:

Not only does it show the same oscillation, the oscillations are aligned.  Uh oh: could this be a client issue after all?  Well, that's why I launched a second client (on a different EC2 instance).  Here are the results for the second client, talking to the first server:

Very similar to the graph from the first client.  Here's the second client talking to the second server:

So, two different clients talking to two different servers, all presumably on different machines, showing latency oscillations that not only have the same frequency, but are precisely aligned.  I won't bother including graphs here for the third small server, but they look very similar to the above.

In theory, the only common element in these tests should be the network; and even there, I'd expect independent network paths to be involved.  In any case, my us-east-1b network measurements show no such oscillation.  So this additional data merely confirms the mystery.

I haven't shown data yet for the "large" server.  Here is the graph for client 2 fetching a single record:

This has the same periodicity as the other graphs, but the absolute latency values are much lower: 99th percentile latency oscillates between 5 and 20 ms, as opposed to something like 25-250 ms for the small servers.  Finally, just to make things really fun, client 1 talking to the large server does not show periodicity:

The periodicity is present in the 90th percentile plot, but it's not apparent in the 99th percentile.  I happen to have several days' worth of data for this pair, which confirms that only the 90th percentile plot has much periodicity:

One explanation would be that, for this particular server pair, there is an additional component to the latency tail that prevents 99th percentile latency from dropping below 20 ms.  Any client or network issue that delayed somewhere between 1 and 10 percent of operations by at least 20 ms would fit: the 99th percentile could never drop below 20 ms, while the 90th percentile would be unaffected.

Conclusion: I still have no idea what causes the periodicity, but I'm even more convinced that something interesting is involved.  I'll pass the new data back to the RDS team and see what they can make of it.

Thursday, May 6, 2010

Browser network latencies

As an experiment, I spent part of last week writing a simple HTML5-based multiplayer video game (a clone of the original Maze War).  It works surprisingly well; browsers have come a long way.

A critical parameter for this sort of game is communication latency.  When I shoot at you, how soon does the incoming missile appear on your screen?  If the latency is not very short, the game feels "laggy".  In my playtesting, lag wasn't a major problem, but occasionally it seemed like things would hiccup a bit.  I decided to do some more scientific testing.  I wrote a simple web page that issues an XMLHttpRequest once per second, and then reports the latency back to my microbenchmark server.  Here is the resulting histogram:



| Operation | # samples | Min | 10th %ile | Median | Mean | 90th %ile | 99th %ile | 99.9th %ile | Max |
|---|---|---|---|---|---|---|---|---|---|
| XMLHttpRequest (OS X Chrome N. Calif. -> EC2 US East) | 13929 | 104.0 ms | 98.9 ms | 105.0 ms | 123.0 ms | 132.0 ms | 417.0 ms | 713.0 ms | 8.64 sec |

(This reports latency for an XMLHttpRequest from Google Chrome 5.0.375.29 beta, on a MacBook Pro running OS X 10.6.3, over not-very-good 1500Kbps/500Kbps DSL in the S.F. Bay Area, to an Amazon EC2 server in Virginia.  All benchmarks discussed in this post use the same setup, except as noted.  All benchmarks also ran for the same time period, and so experienced the same network conditions, again except as noted.)
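The measurement page boils down to a loop like the following (a simplified sketch: the request path and the reporting function are stand-ins for my actual benchmark endpoints):

```javascript
// Issue one XMLHttpRequest per second and measure the round-trip latency.
// '/ping' and reportLatency() are placeholders for the real endpoints.
function reportLatency(label, latencyMs) {
  // In the real page this ships the sample back to the histogram server;
  // here we just log it.
  console.log(label + ': ' + latencyMs + ' ms');
}

function measureOnce() {
  var start = Date.now();
  var xhr = new XMLHttpRequest();
  xhr.open('GET', '/ping?nocache=' + Math.random(), true);
  xhr.onreadystatechange = function () {
    if (xhr.readyState === 4) {
      reportLatency('XMLHttpRequest', Date.now() - start);
    }
  };
  xhr.send(null);
}

setInterval(measureOnce, 1000);
```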

These figures aren't too shabby; 100ms or so strikes me as a reasonable response time.  However, we have a fairly long tail, so players will see occasional lag.  Here is the same data, graphed over time:

There was a spike around 9 PM, but if I exclude that time period, the histogram doesn't change much.  The jittery latency seems to be present consistently over time.

Surprise!  It's not the network

An obvious explanation for this latency tail is delays in the network.  To verify that, I measured ping times between the same pair of machines.  Surprisingly, the ping data shows a much shorter tail:


| Operation | # samples | Min | 10th %ile | Median | Mean | 90th %ile | 99th %ile | 99.9th %ile | Max |
|---|---|---|---|---|---|---|---|---|---|
| Ping (N. Calif. -> EC2 US East) | 41717 | 91.9 ms | 89.6 ms | 93.5 ms | 97.2 ms | 101.0 ms | 140.0 ms | 414.0 ms | 907.0 ms |

And most of the tail turns out to be localized to the 9 PM event:


Excluding that time period, most of the tail vanishes from the ping histogram: 99.9th percentile ping latency is only 154ms!  Maybe my DSL isn't quite as bad as I'd thought.

It's partly the server

The tail in the XMLHttpRequest data must originate elsewhere.  Perhaps on the server?  My EC2 server is juggling quite a few tasks by now: it's running dozens of benchmarks, pulling additional benchmark data from App Engine, accumulating histograms, and serving the XMLHttpRequests for this latest benchmark.  Overall CPU utilization is still fairly low, and CPU-sensitive benchmarks running on the server aren't showing much of a tail, but perhaps there is still some effect.  So I fired up a second server, configured identically to the first but with no workload, and ran the same test there.  The results are faster than for the original server, but only slightly:

| Operation | # samples | Min | 10th %ile | Median | Mean | 90th %ile | 99th %ile | 99.9th %ile | Max |
|---|---|---|---|---|---|---|---|---|---|
| XMLHttpRequest (OS X Chrome N. Calif. -> EC2 US East) | 13929 | 104.0 ms | 98.9 ms | 105.0 ms | 123.0 ms | 132.0 ms | 417.0 ms | 713.0 ms | 8.64 sec |
| XMLHttpRequest (OS X Chrome N. Calif. -> EC2 US East) B | 10446 | 104.0 ms | 99.4 ms | 108.0 ms | 114.0 ms | 117.0 ms | 323.0 ms | 629.0 ms | 791.0 ms |
(I've repeated the original data for comparison.  The lightly-loaded "B" server is shown in the last row.)

It's partly HTTP

So, server CPU contention (or Java GC) may be playing a minor role, but to fully explain the latency tail we have to keep digging.  Perhaps Web Sockets, by eliminating most of the overhead of the HTTP protocol, would help?  Happily, Chrome now has Web Socket support.  I decided to try several connection methods: XMLHttpRequest, Web Sockets, and JSONP (a trick wherein you issue a request by dynamically creating a <script src=...> tag, and the server returns a JavaScript file whose execution delivers the response).  JSONP has the useful property of not being bound by the "same origin" security policy, enabling one additional connection method: "best of two".  In this method, I have the browser issue a Web Socket request to one server, and a simultaneous JSONP request to the other.  The latency is measured as elapsed time from when we begin sending the first of these two requests, until either response is received.  Here are the results:

| Operation | # samples | 10th %ile | Median | Mean | 90th %ile | 99th %ile | 99.9th %ile |
|---|---|---|---|---|---|---|---|
| JSONP (OS X Chrome N. Calif. -> EC2 US East) | 13929 | 98.7 ms | 104.0 ms | 123.0 ms | 155.0 ms | 416.0 ms | 688.0 ms |
| XMLHttpRequest (OS X Chrome N. Calif. -> EC2 US East) | 13929 | 98.9 ms | 105.0 ms | 123.0 ms | 132.0 ms | 417.0 ms | 713.0 ms |
| Websocket (OS X Chrome N. Calif. -> EC2 US East) | 13930 | 89.6 ms | 93.8 ms | 108.0 ms | 122.0 ms | 375.0 ms | 582.0 ms |
| JSONP (OS X Chrome N. Calif. -> EC2 US East) B | 10447 | 98.6 ms | 104.0 ms | 116.0 ms | 132.0 ms | 341.0 ms | 652.0 ms |
| XMLHttpRequest (OS X Chrome N. Calif. -> EC2 US East) B | 10446 | 99.4 ms | 108.0 ms | 114.0 ms | 117.0 ms | 323.0 ms | 629.0 ms |
| Websocket (OS X Chrome N. Calif. -> EC2 US East) B | 10447 | 89.5 ms | 93.1 ms | 96.3 ms | 96.8 ms | 119.0 ms | 556.0 ms |
| Websocket best-of-2 (OS X Chrome N. Calif. -> EC2 US East) B | 10446 | 89.6 ms | 93.6 ms | 98.4 ms | 101.0 ms | 219.0 ms | 551.0 ms |
| Ping (N. Calif. -> EC2 US East) | 41717 | 89.6 ms | 93.5 ms | 97.2 ms | 101.0 ms | 140.0 ms | 414.0 ms |
| Ping (N. Calif. -> EC2 US East) B | 39146 | 89.5 ms | 93.1 ms | 94.8 ms | 96.9 ms | 107.0 ms | 424.0 ms |

For each server, Web Sockets are clearly the fastest of the three connection techniques.  On the lightly loaded "B" server, the Web Socket latencies are almost as good as the ping latencies.  However, if I exclude the 9PM spike, there is still a noticeable difference in the tails: 99th and 99.9th percentile latencies for Web Socket requests are then 127 and 396 milliseconds respectively, while the equivalent ping latencies are 104 and 113 milliseconds.

It's interesting that the best-of-2 technique does not perform well.  To the extent that the latency tail is caused by server CPU contention, best-of-2 should be a big improvement.  It's unclear how much it would help with network issues, and it definitely can't help with client issues.  The poor performance suggests that client issues contribute significantly to the latency tail.
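To make the connection methods concrete, here's roughly how each request is issued (again just a sketch: the URLs are placeholders for my two servers, error handling is omitted, and the reporting plumbing isn't shown):

```javascript
// JSONP: inject a <script> tag; the server responds with JavaScript that
// invokes the named callback, which tells us the response has arrived.
// Because <script> tags aren't subject to the same-origin policy, this
// request can go to a different server than the one hosting the page.
function jsonpRequest(url, callback) {
  var name = 'cb' + Math.floor(Math.random() * 1e9);
  var script = document.createElement('script');
  window[name] = function () {
    document.body.removeChild(script);
    delete window[name];
    callback();
  };
  script.src = url + '?callback=' + name + '&nocache=' + Math.random();
  document.body.appendChild(script);
}

// Web Socket: one long-lived connection; a request is just a small message,
// and the reply is the next message back.  (Assumes the socket is already
// open and the server echoes each message.)
var socket = new WebSocket('ws://server-a.example.com/bench');  // placeholder
function websocketRequest(callback) {
  socket.onmessage = function () { callback(); };
  socket.send('ping');
}

// Best-of-two: a Web Socket request to server A and a simultaneous JSONP
// request to server B; we take whichever response arrives first.
function bestOfTwoRequest(callback) {
  var done = false;
  function finish() {
    if (!done) { done = true; callback(); }
  }
  websocketRequest(finish);
  jsonpRequest('http://server-b.example.com/bench', finish);  // placeholder
}

// Timing wrapper used for all of the methods, e.g.:
//   timeRequest(bestOfTwoRequest, function (ms) { /* record the sample */ });
function timeRequest(makeRequest, record) {
  var start = Date.now();
  makeRequest(function () { record(Date.now() - start); });
}
```

The appeal of best-of-2 is that a single slow (or lost) response on either path shouldn't matter, as long as the other path answers promptly -- which is exactly the property the outage described below ends up demonstrating.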

And it's partly client load

Now, we've more or less controlled for network latency and server CPU.  What factor could explain the remaining difference between ping and Web Socket latencies?  Perhaps it's something in the browser.  During the preceding tests, I had at least a dozen Chrome tabs open, some to complex sites such as Gmail.  I restarted Chrome with only one tab, open to the page that executes the benchmark:

| Operation | # samples | 10th %ile | Median | Mean | 90th %ile | 99th %ile | 99.9th %ile |
|---|---|---|---|---|---|---|---|
| JSONP (OS X Chrome N. Calif. -> EC2 US East) B | 1038 | 98.6 ms | 103.0 ms | 115.0 ms | 134.0 ms | 248.0 ms | 492.0 ms |
| XMLHttpRequest (OS X Chrome N. Calif. -> EC2 US East) B | 1038 | 106.0 ms | 112.0 ms | 112.0 ms | 117.0 ms | 211.0 ms | 446.0 ms |
| Websocket (OS X Chrome N. Calif. -> EC2 US East) B | 1037 | 89.5 ms | 93.0 ms | 94.8 ms | 96.6 ms | 100.0 ms | 118.0 ms |
| Websocket best-of-2 (OS X Chrome N. Calif. -> EC2 US East) B | 1037 | 89.5 ms | 93.1 ms | 95.9 ms | 96.9 ms | 107.0 ms | 369.0 ms |

These results are from a short run (only about an hour), at a different time period than the others, and so should be taken with a grain of salt.  However, the Websocket figures look really good, comparable to raw ping times.  The HTTP-based connection techniques still show a significant tail.  (It's worth noting that this might not be entirely due to HTTP overhead; it could also reflect overhead in the browser implementation of the respective connection techniques.  JSONP in particular is a rather baroque approach and requires DOM manipulation and dynamic compilation.)

A convenient outage -- score one for best-of-two

During an earlier benchmark run (not reported here), my "B" server went offline.  Amazon's explanation:

5:39 PM PDT We are investigating instance connectivity in the US-EAST-1 region.
5:55 PM PDT A subset of instances in a single Availability Zone became unavailable due to a localized power distribution failure. We are in the process of restoring power now.

No doubt this was inconvenient for some people, but for me it provided a handy test of the best-of-2 connection method.  Indeed, all of the other "B" server benchmarks began failing, but the best-of-2 results were unperturbed.

This event also provided an interesting peek into Amazon's handling of outages.  The event began at around 5:15, at least for my machine.  Amazon's service dashboard was first updated at 5:39, roughly 24 minutes later.  Hence, when I first detected the problem, I had no information as to whether it was something I'd done, an Amazon problem specific to my machine, or a more widespread Amazon problem.  It would be nice if the AWS dashboard included up-to-the-minute service health metrics based on automated monitoring.  It would also be nice if health information were propagated to the status pages for individual EC2 instances.  The whole time the machine was offline, the dashboard page for that instance continued to indicate that it was up and running.

One other quibble: the status reports never actually indicate which Availability Zone was impacted (it was us-east-1b).  Seems silly not to provide that information.

Stress test

Just for kicks, I decided to repeat the tests while saturating my DSL line (by uploading a bunch of photos to Picasa).  The results are not pretty:

| Operation | # samples | 10th %ile | Median | Mean | 90th %ile | 99th %ile | 99.9th %ile |
|---|---|---|---|---|---|---|---|
| JSONP (OS X Chrome N. Calif. -> EC2 US East) B | 613 | 101.0 ms | 419.0 ms | 559.0 ms | 1.35 sec | 1.75 sec | 2.35 sec |
| XMLHttpRequest (OS X Chrome N. Calif. -> EC2 US East) B | 612 | 110.0 ms | 388.0 ms | 543.0 ms | 1.35 sec | 1.69 sec | 5.03 sec |
| Websocket (OS X Chrome N. Calif. -> EC2 US East) B | 612 | 91.1 ms | 398.0 ms | 524.0 ms | 1.31 sec | 1.69 sec | 2.35 sec |
| Websocket best-of-2 (OS X Chrome N. Calif. -> EC2 US East) B | 612 | 91.2 ms | 372.0 ms | 524.0 ms | 1.46 sec | 1.69 sec | 2.14 sec |

Here, all of the connection methods were equally bad.  (It would have been interesting to run ping tests at the same time, to see whether TCP congestion control was part of the problem, but I neglected to do so.)

Conclusion: good luck, you'll need it

With Web Sockets, a lightly loaded client, a lightly loaded server, a lightly loaded DSL line, and favorable winds, it's possible to get uniformly fast communication between a browser and server.  Disturb any of those conditions, and a latency tail rears up.  Where does that leave us?

In principle, you can control server load.  As for Web Sockets: as far as I know, Chrome is so far the only major browser to implement them.  (Flash provides similar functionality, which I may benchmark for a later post.)  Client load and DSL load are entirely out of our control; the best one might manage is to give feedback to the user.  And, while Google is rumored to have some influence over the weather, when it comes to favorable winds the rest of us just have to hope for the best.