As an experiment, I spent part of last week writing a simple HTML5-based multiplayer video game (a clone of the original Maze War). It works surprisingly well; browsers have come a long way.
A critical parameter for this sort of game is communication latency. When I shoot at you, how soon does the incoming missile appear on your screen? If the latency is not very short, the game feels "laggy". In my playtesting, lag wasn't a major problem, but occasionally things seemed to hiccup a bit. I decided to do some more scientific testing. I wrote a simple web page that issues an XMLHttpRequest once per second and reports the latency back to my microbenchmark server. Here is the resulting histogram:
| Operation | # samples | Min | 10th %ile | Median | Mean | 90th %ile | 99th %ile | 99.9th %ile | Max | 
| XMLHttpRequest (OS X Chrome N. Calif. -> EC2 US East) | 13929 | 104.0 ms | 98.9 ms | 105.0 ms | 123.0 ms | 132.0 ms | 417.0 ms | 713.0 ms | 8.64 sec | 
(This reports latency for an XMLHttpRequest from Google Chrome 5.0.375.29 beta, on a MacBook Pro running OS X 10.6.3, over not-very-good 1500Kbps/500Kbps DSL in the S.F. Bay Area, to an Amazon EC2 server in Virginia. All benchmarks discussed in this post use the same setup, except as noted. All benchmarks also ran for the same time period, and so experienced the same network conditions, again except as noted.)
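For the curious, the benchmark page boils down to a timing loop plus percentile bookkeeping. Here is a minimal sketch; `sendProbe` and the percentile helper are illustrative stand-ins, not the code I actually ran:

```javascript
// Minimal sketch of the benchmark page's measurement loop.
// `sendProbe` is a hypothetical stand-in for whatever transport is being
// measured (an XMLHttpRequest here; Web Sockets and JSONP later in the post).
async function measureOnce(sendProbe) {
  const start = Date.now();
  await sendProbe();          // resolves when the response arrives
  return Date.now() - start;  // round-trip latency in milliseconds
}

// Percentile over a list of latency samples, as in the tables in this post.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.floor(p * sorted.length));
  return sorted[idx];
}

// In the browser, the page would do roughly:
//   setInterval(() => measureOnce(xhrProbe).then(recordSample), 1000);
```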
These figures aren't too shabby; 100ms or so strikes me as a reasonable response time. However, we have a fairly long tail, so players will see occasional lag. Here is the same data, graphed over time:
There was a spike around 9 PM, but if I exclude that time period, the histogram doesn't change much. The jittery latency seems to be present consistently over time.
Surprise! It's not the network
An obvious explanation for this latency tail is delays in the network. To verify that, I measured ping times between the same pair of machines. Surprisingly, the ping data shows a much shorter tail:
| Operation | # samples | Min | 10th %ile | Median | Mean | 90th %ile | 99th %ile | 99.9th %ile | Max | 
| Ping (N. Calif. -> EC2 US East) | 41717 | 91.9 ms | 89.6 ms | 93.5 ms | 97.2 ms | 101.0 ms | 140.0 ms | 414.0 ms | 907.0 ms | 
And most of the tail turns out to be localized to the 9 PM event:
Excluding that time period, most of the tail vanishes from the ping histogram: 99.9th percentile ping latency is only 154ms! Maybe my DSL isn't quite as bad as I'd thought.
It's partly the server
The tail in the XMLHttpRequest data must originate elsewhere. Perhaps on the server? My EC2 server is juggling quite a few tasks by now: it's running dozens of benchmarks, pulling additional benchmark data from App Engine, accumulating histograms, and serving the XMLHttpRequests for this latest benchmark. Overall CPU utilization is still fairly low, and CPU-sensitive benchmarks running on the server aren't showing much of a tail, but perhaps there is still some effect. So I fired up a second server, configured identically to the first but with no other workload, and ran the same test there. The results are faster than for the original server, but only slightly:
| Operation | # samples | Min | 10th %ile | Median | Mean | 90th %ile | 99th %ile | 99.9th %ile | Max | 
| XMLHttpRequest (OS X Chrome N. Calif. -> EC2 US East) | 13929 | 104.0 ms | 98.9 ms | 105.0 ms | 123.0 ms | 132.0 ms | 417.0 ms | 713.0 ms | 8.64 sec | 
| XMLHttpRequest (OS X Chrome N. Calif. -> EC2 US East) B | 10446 | 104.0 ms | 99.4 ms | 108.0 ms | 114.0 ms | 117.0 ms | 323.0 ms | 629.0 ms | 791.0 ms | 
(I've repeated the original data for comparison. The lightly-loaded "B" server is shown in the last row.)
It's partly HTTP
So, server CPU contention (or Java GC) may be playing a minor role, but to fully explain the latency tail we have to keep digging. Perhaps Web Sockets, by eliminating most of the overhead of the HTTP protocol, would help? Happily, Chrome now has Web Socket support. I decided to try several connection methods: XMLHttpRequest, Web Sockets, and JSONP (a trick wherein you issue a request by dynamically creating a <script src=...> tag, and the server returns a JavaScript file whose execution delivers the response). JSONP has the useful property of not being bound by the "same origin" security policy, enabling one additional connection method: "best of two". In this method, I have the browser issue a Web Socket request to one server and a simultaneous JSONP request to the other. The latency is measured as the elapsed time from when we begin sending the first of these two requests until either response is received. Here are the results:
| Operation | # samples | 10th %ile | Median | Mean | 90th %ile | 99th %ile | 99.9th %ile | 
| JSONP (OS X Chrome N. Calif. -> EC2 US East) | 13929 | 98.7 ms | 104.0 ms | 123.0 ms | 155.0 ms | 416.0 ms | 688.0 ms | 
| XMLHttpRequest (OS X Chrome N. Calif. -> EC2 US East) | 13929 | 98.9 ms | 105.0 ms | 123.0 ms | 132.0 ms | 417.0 ms | 713.0 ms | 
| Websocket (OS X Chrome N. Calif. -> EC2 US East) | 13930 | 89.6 ms | 93.8 ms | 108.0 ms | 122.0 ms | 375.0 ms | 582.0 ms | 
| JSONP (OS X Chrome N. Calif. -> EC2 US East) B | 10447 | 98.6 ms | 104.0 ms | 116.0 ms | 132.0 ms | 341.0 ms | 652.0 ms | 
| XMLHttpRequest (OS X Chrome N. Calif. -> EC2 US East) B | 10446 | 99.4 ms | 108.0 ms | 114.0 ms | 117.0 ms | 323.0 ms | 629.0 ms | 
| Websocket (OS X Chrome N. Calif. -> EC2 US East) B | 10447 | 89.5 ms | 93.1 ms | 96.3 ms | 96.8 ms | 119.0 ms | 556.0 ms | 
| Websocket best-of-2 (OS X Chrome N. Calif. -> EC2 US East) B | 10446 | 89.6 ms | 93.6 ms | 98.4 ms | 101.0 ms | 219.0 ms | 551.0 ms | 
| Ping (N. Calif. -> EC2 US East) | 41717 | 89.6 ms | 93.5 ms | 97.2 ms | 101.0 ms | 140.0 ms | 414.0 ms | 
| Ping (N. Calif. -> EC2 US East) B | 39146 | 89.5 ms | 93.1 ms | 94.8 ms | 96.9 ms | 107.0 ms | 424.0 ms | 
For each server, Web Sockets are clearly the fastest of the three connection techniques. On the lightly loaded "B" server, the Web Socket latencies are almost as good as the ping latencies. However, if I exclude the 9 PM spike, there is still a noticeable difference in the tails: 99th and 99.9th percentile latencies for Web Socket requests are then 127 and 396 milliseconds respectively, while the equivalent ping latencies are 104 and 113 milliseconds.
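For reference, the Web Socket probe amounts to timing a single echo round trip. A sketch; the endpoint URL is made up, and any object with a send() method and an onmessage hook will do in place of a real WebSocket:

```javascript
// One Web Socket round trip, timed. Works with a browser WebSocket or any
// object exposing send() and an onmessage property.
function wsPing(socket) {
  return new Promise((resolve) => {
    const start = Date.now();
    socket.onmessage = () => resolve(Date.now() - start);
    socket.send('ping');
  });
}

// Browser usage (hypothetical echo endpoint):
//   const s = new WebSocket('ws://server.example/echo');
//   s.onopen = () => setInterval(() => wsPing(s).then(recordSample), 1000);
```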
It's interesting that the best-of-2 technique does not perform well. To the extent that the latency tail is caused by server CPU contention, best-of-2 should be a big improvement. It's unclear how much it would help with network issues, and it definitely can't help with client issues. The poor performance suggests that client issues contribute significantly to the latency tail.
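To make the method concrete: best-of-2 is just a race between two transports, taking whichever response arrives first. A sketch, with hypothetical request functions; note that a channel that fails outright simply loses the race, which is why this method can also ride through a server outage:

```javascript
// Best-of-2: issue the same request over two channels and take the first
// response. Promise.any ignores a channel that rejects, so one dead server
// doesn't sink the request.
function bestOfTwo(sendA, sendB) {
  return Promise.any([sendA(), sendB()]);
}

// In the benchmark, sendA would be a Web Socket request to one server and
// sendB a JSONP request to the other; latency is timed from the first send.
```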
And it's partly client load
Now, we've more or less controlled for network latency and server CPU. What factor could explain the remaining difference between ping and Web Socket latencies? Perhaps it's something in the browser. During the preceding tests, I had at least a dozen Chrome tabs open, some to complex sites such as Gmail. I restarted Chrome with only one tab, open to the page that executes the benchmark:
| Operation | # samples | 10th %ile | Median | Mean | 90th %ile | 99th %ile | 99.9th %ile | 
| JSONP (OS X Chrome N. Calif. -> EC2 US East) B | 1038 | 98.6 ms | 103.0 ms | 115.0 ms | 134.0 ms | 248.0 ms | 492.0 ms | 
| XMLHttpRequest (OS X Chrome N. Calif. -> EC2 US East) B | 1038 | 106.0 ms | 112.0 ms | 112.0 ms | 117.0 ms | 211.0 ms | 446.0 ms | 
| Websocket (OS X Chrome N. Calif. -> EC2 US East) B | 1037 | 89.5 ms | 93.0 ms | 94.8 ms | 96.6 ms | 100.0 ms | 118.0 ms | 
| Websocket best-of-2 (OS X Chrome N. Calif. -> EC2 US East) B | 1037 | 89.5 ms | 93.1 ms | 95.9 ms | 96.9 ms | 107.0 ms | 369.0 ms | 
These results are from a short run (only about an hour), at a different time period than the others, and so should be taken with a grain of salt. However, the Web Socket figures look really good, comparable to raw ping times. The HTTP-based connection techniques still show a significant tail. (It's worth noting that this might not be entirely due to HTTP overhead; it could also reflect overhead in the browser implementation of the respective connection techniques. JSONP in particular is a rather baroque approach, requiring DOM manipulation and dynamic compilation.)
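The JSONP mechanics look roughly like this; the endpoint and the `callback` parameter name are hypothetical (real servers vary), and the DOM insertion is guarded so the callback bookkeeping can run anywhere:

```javascript
// JSONP sketch: register a uniquely named global callback, then load a
// <script> whose execution calls it, delivering the response.
let jsonpSeq = 0;
function jsonpRequest(url, onResponse) {
  const cbName = 'jsonp_cb_' + jsonpSeq++;
  globalThis[cbName] = (data) => {
    delete globalThis[cbName];  // one-shot: clean up the global namespace
    onResponse(data);
  };
  const sep = url.includes('?') ? '&' : '?';
  if (typeof document !== 'undefined') {  // browser only
    const script = document.createElement('script');
    script.src = url + sep + 'callback=' + cbName;
    document.head.appendChild(script);
  }
  return cbName;  // returned so a caller (or test) can find the callback
}
```

The dynamic script creation, global-namespace juggling, and script compilation are exactly the "baroque" overhead mentioned above.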
A convenient outage -- score one for best-of-two
During an earlier benchmark run (not reported here), my "B" server went offline. Amazon's explanation:
5:39 PM PDT We are investigating instance connectivity in the US-EAST-1 region.
5:55 PM PDT A subset of instances in a single Availability Zone became unavailable due to a localized power distribution failure. We are in the process of restoring power now.
No doubt this was inconvenient for some people, but for me it provided a handy test of the best-of-2 connection method. Indeed, all of the other "B" server benchmarks began failing, but the best-of-2 results were unperturbed.
This event also provided an interesting peek into Amazon's handling of outages. The event began at around 5:15, at least for my machine. Amazon's service dashboard was first updated at 5:39, roughly 24 minutes later. Hence, when I first detected the problem, I had no information as to whether it was something I'd done, an Amazon problem specific to my machine, or a more widespread Amazon problem. It would be nice if the AWS dashboard included up-to-the-minute service health metrics based on automated monitoring. It would also be nice if health information were propagated to the status pages for individual EC2 instances. The whole time the machine was offline, the dashboard page for that instance continued to indicate that it was up and running.
One other quibble: the status reports never actually indicate which Availability Zone was impacted (it was us-east-1b). It seems silly not to provide that information.
Stress test
Just for kicks, I decided to repeat the tests while saturating my DSL line (by uploading a bunch of photos to Picasa). The results are not pretty:
| Operation | # samples | 10th %ile | Median | Mean | 90th %ile | 99th %ile | 99.9th %ile | 
| JSONP (OS X Chrome N. Calif. -> EC2 US East) B | 613 | 101.0 ms | 419.0 ms | 559.0 ms | 1.35 sec | 1.75 sec | 2.35 sec | 
| XMLHttpRequest (OS X Chrome N. Calif.  -> EC2 US East) B | 612 | 110.0 ms | 388.0 ms | 543.0 ms | 1.35 sec | 1.69 sec | 5.03 sec | 
| Websocket (OS X Chrome N. Calif. ->  EC2 US East) B | 612 | 91.1 ms | 398.0 ms | 524.0 ms | 1.31 sec | 1.69 sec | 2.35 sec | 
| Websocket best-of-2 (OS X Chrome N.  Calif. -> EC2 US East) B | 612 | 91.2 ms | 372.0 ms | 524.0 ms | 1.46 sec | 1.69 sec | 2.14 sec | 
Here, all of the connection methods were equally bad. (It would have been interesting to run ping tests at the same time, to see whether TCP congestion control was part of the problem, but I neglected to do so.)
Conclusion: good luck, you'll need it
With Web Sockets, a lightly loaded client, a lightly loaded server, a lightly loaded DSL line, and favorable winds, it's possible to get uniformly fast communication between a browser and server. Disturb any of those conditions, and a latency tail rears up. Where does that leave us?
In principle, you can control server load. Of the major browsers, as far as I know only Chrome implements Web Sockets so far. (Flash provides similar functionality, which I may benchmark for a later post.) Client load and DSL load are entirely out of our control; the best one might manage is to give feedback to the user. And, while Google is rumored to have some influence over the weather, when it comes to favorable winds the rest of us just have to hope for the best.