Saturday, October 16, 2010

Practical Web Performance Optimization

In web development nowadays, the importance of minimizing page load times is widely acknowledged, and there are all manner of best practices for reducing load time.  However, many of these practices make day-to-day development more complicated.  In this post, I'll discuss some techniques we're using to reconcile performance with ease of development.

Consider the "primordial" approach to web serving: a site is a collection of files (static resources and/or dynamic templates) sitting in a directory.  JavaScript and CSS code is written in a modular fashion, grouping code into files in whatever organization makes it easiest to locate and manage.  Templating aside, files are served raw, always using the latest copy in the filesystem.  This is quite nice for development:
  • You can edit using whatever tools you like.
  • Changes are immediately visible in the browser.  No need to restart the server, or invoke any sort of recompilation tool.
  • Comments, formatting, and file organization are preserved, making it easy to inspect and debug your JavaScript, CSS, and HTML code.
However, the resulting site is likely to break many rules of web performance.  Notably:

1. Each script module, CSS module, and image requires a separate HTTP request.

2. No CDN (content distribution network) is used -- all requests are served directly from the application server.

3. Content is not cacheable.  (Or, if you do mark some files as cacheable, you'll run into trouble when deploying a new version.)

4. JavaScript, CSS, and HTML code are not minified.

Development mode vs. production mode

The usual approach to this convenience / efficiency tradeoff is to implement two serving modes.  In development mode, the "primordial" model is used.  In production mode, a preprocessing step is used to minify script code, combine scripts, push content to a CDN, and otherwise prepare the site for efficient serving.

This approach can get the job done, but it is a pain to work with.  It's hard to debug problems on the production site, because the code is minified and otherwise rearranged.  It's hard to test performance fixes, because the performance characteristics of development mode are very different from those of production.  Developers may have to manually invoke the preprocessing step, and problems arise if they fail to do so.  Web development is hard enough without this added complexity.

The preprocessing approach is also limited to static content.  For instance, it is difficult to minify dynamic pages with this approach.

Solution: optimize on the fly

In our site, optimizations are applied on the fly, in a servlet filter that postprocesses each response.  (Of course, the server is free to cache the optimized version of static files.)  Current optimizations include:
  • Minify JavaScript and CSS files.
  • Minify HTML pages, including embedded JavaScript and CSS.
  • Rewrite script/style/image URLs to support caching (described below).
Forthcoming extensions:
  • Asset bundling
  • Rule checking
  • Asset repository to avoid version skew (described below)
Because optimization is done on the fly, it is never "stale"; developers don't need to remember to re-run the optimizer.  At least as important, it can be enabled or disabled on the fly.  When investigating a problem on the production site, a developer can disable minification and debug the code directly.  Or they can turn on minification in a development build and observe the effects.

Optimization is enabled and disabled using a simple configuration system we've built.  Configuration flags can be specified at the server, session, or request level.  The session mechanism is particularly convenient: simply by invoking a special form, a developer can disable minification for themselves on the production server, without affecting the way content is served to other users.  And toggling minification in a development build doesn't require a server restart.
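The layered lookup described above can be sketched as follows.  This is a hypothetical, much-simplified version of our configuration system; the class and method names are my own invention, not our actual API.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of layered configuration lookup: request-level flags override
// session-level flags, which override server-level defaults.
public class ConfigFlags {
    private final Map<String, String> serverFlags = new HashMap<>();
    private final Map<String, String> sessionFlags = new HashMap<>();
    private final Map<String, String> requestFlags = new HashMap<>();

    public void setServer(String key, String value)  { serverFlags.put(key, value); }
    public void setSession(String key, String value) { sessionFlags.put(key, value); }
    public void setRequest(String key, String value) { requestFlags.put(key, value); }

    // The most specific level wins.
    public String get(String key, String defaultValue) {
        if (requestFlags.containsKey(key)) return requestFlags.get(key);
        if (sessionFlags.containsKey(key)) return sessionFlags.get(key);
        if (serverFlags.containsKey(key))  return serverFlags.get(key);
        return defaultValue;
    }

    public boolean isEnabled(String key) {
        return Boolean.parseBoolean(get(key, "false"));
    }
}
```

In a real servlet container, the session map would live in the HttpSession and the request map would be populated from a trusted form or header; the point is only that one `isEnabled("minify")` call can be answered at any of the three levels without a restart.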

URL rewriting for cacheability and CDNs

For performance, it's critical to allow browsers to cache asset files (e.g. images and scripts).  However, this causes problems when those files are updated.  The best solution I'm aware of is to change the filename whenever an asset is modified.  This allows you to mark the files as permanently cacheable.  However, it's a hassle to implement: whenever you modify a file, you have to update all links referencing it.

Our servlet filter scans HTML content for asset references (e.g. <script src=...> or <img src=...>).  For any such reference, if the reference refers to a static file in our site, the filter rewrites the reference to include a fingerprint of the file contents.  Thus, references automatically adjust when an asset file is changed.
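A stripped-down sketch of that rewriting step might look like the following.  The class name, the regex, and the in-memory file map are all illustrative assumptions; the real filter reads (and caches) files from the webapp directory and handles many more reference forms.

```java
import java.security.MessageDigest;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: rewrite src/href attributes that point at local static
// files so that the filename carries a fingerprint of the file contents.
public class AssetRewriter {
    private static final Pattern ASSET_REF =
        Pattern.compile("(src|href)=\"(/[^\"]+)\\.(js|css|png|gif|jpg)\"");

    // Maps asset path -> file contents; a stand-in for the filesystem.
    private final Map<String, byte[]> files;

    public AssetRewriter(Map<String, byte[]> files) { this.files = files; }

    public String rewrite(String html) {
        Matcher m = ASSET_REF.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String path = m.group(2) + "." + m.group(3);
            byte[] contents = files.get(path);
            if (contents == null) {
                // Not one of our static files; leave the reference alone.
                m.appendReplacement(out, Matcher.quoteReplacement(m.group()));
                continue;
            }
            String tagged = m.group(1) + "=\"" + m.group(2) + "-"
                + fingerprint(contents) + "." + m.group(3) + "\"";
            m.appendReplacement(out, Matcher.quoteReplacement(tagged));
        }
        m.appendTail(out);
        return out.toString();
    }

    static String fingerprint(byte[] contents) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(contents);
            StringBuilder hex = new StringBuilder();
            // 8 hex characters is plenty for cache busting.
            for (int i = 0; i < 4; i++) hex.append(String.format("%02x", digest[i]));
            return hex.toString();
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new AssertionError(e);
        }
    }
}
```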

When our server receives a request for an asset file, it looks for a fingerprint in the filename.  If the fingerprint is present, we set the response headers to allow indefinite caching.  Otherwise, we mark the response as noncacheable.
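The serving-side decision is simple enough to show in full.  This is a sketch under the assumption that fingerprints look like the eight-hex-digit suffix produced above; the actual header values are the standard Cache-Control directives.

```java
import java.util.regex.Pattern;

// If the requested filename carries a fingerprint, the content at that URL
// can never change, so it may be cached (effectively) forever.
public class CacheHeaders {
    // Matches names like /js/app-3f2a9c1b.js produced by the rewriting filter.
    private static final Pattern FINGERPRINTED =
        Pattern.compile(".*-[0-9a-f]{8}\\.[a-z]+$");

    public static String cacheControlFor(String path) {
        if (FINGERPRINTED.matcher(path).matches()) {
            return "public, max-age=31536000";  // one year, i.e. indefinite
        }
        return "no-cache";  // unfingerprinted assets must be revalidated
    }
}
```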

When we adopt a CDN, we'll use this same mechanism to rewrite asset references to point to the CDN.

One issue we haven't yet tackled is rewriting asset references that appear in JavaScript code.  Fortunately, the mechanism is failsafe: if our filter doesn't rewrite a reference, that asset will be served as noncacheable.

File bundling

A page often references many asset files.  Performance can be improved by bundling these into a smaller number of larger files.  For JavaScript and CSS files, this is fairly straightforward; for images, it requires CSS sprites.  In both cases, we run into another tradeoff between convenience and performance -- development is easier with unbundled files.

Once again, on-the-fly rewriting comes to the rescue.  When the servlet filter sees multiple <script> or <style> references in a row, it can substitute a single reference to a bundled file.  When it sees multiple CSS image references, it can substitute references to a single sprite image.  The configuration system can be used to toggle this on a server, session, or request level.
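A sketch of the script-bundling substitution, for the simplest case of adjacent <script src> tags.  The `/bundle?files=` URL scheme is an assumption for illustration; a real implementation would more likely map the member list to a short fingerprinted bundle name, and would of course need a server-side handler that concatenates the files.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Collapse a run of two or more adjacent <script src> tags into a single
// reference to a bundle resource that concatenates them server-side.
public class ScriptBundler {
    private static final Pattern RUN = Pattern.compile(
        "(<script src=\"[^\"]+\"></script>\\s*){2,}");
    private static final Pattern SRC = Pattern.compile("src=\"([^\"]+)\"");

    public static String bundle(String html) {
        Matcher run = RUN.matcher(html);
        StringBuffer out = new StringBuffer();
        while (run.find()) {
            List<String> paths = new ArrayList<>();
            Matcher src = SRC.matcher(run.group());
            while (src.find()) paths.add(src.group(1));
            // The bundle URL encodes the member list; the server expands it.
            String tag = "<script src=\"/bundle?files="
                + String.join(",", paths) + "\"></script>";
            run.appendReplacement(out, Matcher.quoteReplacement(tag));
        }
        run.appendTail(out);
        return out.toString();
    }
}
```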

Rule checking

Some web performance rules require nonlocal changes to a page or site.  For instance:
  • Put stylesheets at the top
  • Put scripts at the bottom
  • Avoid CSS expressions
Over time, we might extend our servlet filter to implement some of these optimizations by rewriting the response.  But in the meantime, it is relatively straightforward for the filter to detect and report violations.  It is easier to perform this detection at runtime than in a static lint tool, because we can observe the final, post-template-engine version of the page.  And unlike tools like YSlow, we don't have to manually test each release of the site.

Consistency across server updates

When you push a new version of your site to the production server, a "version skew" problem can occur.  Suppose a user opens a page just before the new version is activated.  They might wind up receiving the old version of the page, but the new version of the scripts.  This can cause script errors.  If your site is distributed across multiple servers, the window of vulnerability can be many minutes (depending on your push process).  This is not a performance issue, but it involves references from a page to its assets, so it ties into the mechanisms involved in asset caching.

Happily, the URL rewriting technique described earlier offers a solution to version skew.  When a server receives a request for an asset file, it simply serves the file version that matches the fingerprint in the URL.  This ensures that all asset files come from the same site version as the main page.

This solution does assume that all servers have access to all versions of the site.  This can be addressed by copying asset files to a central repository, such as Amazon S3, during the push process.


A surprisingly large number of web performance optimization rules can be implemented with little or no impact on day-to-day site development.  Equally important, optimizations can be implemented with session-based disable flags, allowing problems to be debugged directly on the production site.  We've been using this approach in our site development, with good results so far.

Saturday, October 2, 2010

Practical XSS defense

This blog has been quiet lately.  I've been working on the nuts and bolts of what we[1] hope will eventually be a major web site, and that hasn't lent itself to the sort of material I want here.  I'm trying to post only when I have something at least mildly novel and meaty to talk about -- as opposed to, say, photos from my three-day trip to the land of Tomcat configuration.  We're starting to get past the boilerplate into more interesting work, so I hope to start posting more often again.

One thing I set out to tackle this week is XSS defense.  Traditionally, it's a tedious and error-prone task.  In this post, I'll present an attempt to improve on this.

Wikipedia provides a good introduction to the subject.  As a brief refresher, XSS (Cross-Site Scripting) is an attack where a malicious user enters deliberately malformed data into your system.  If your site is not properly protected, it may display that data on a web page in such a way that the browser interprets it as JavaScript code, allowing the attacker to take control of the browser of any victim who views the page.


There are a variety of defenses against XSS.  The most common is escaping -- transforming the data in such a way that the browser will not interpret it as code.  For instance, when including a user-supplied value in a web page, characters like "<" and "&" should be rewritten as HTML entities -- "&lt;" and "&amp;".  Old hat, probably, to most of you reading this.
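For concreteness, here is what that escaping looks like in code.  This matches the `Escaper.entityEncode` name used below, which is our own hypothetical helper; in practice one should prefer a vetted library over a hand-rolled method.

```java
// Minimal sketch of HTML entity escaping for the five characters that can
// change parsing context in HTML text and quoted attributes.
public class Escaper {
    public static String entityEncode(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }
}
```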

Escaping, if done correctly, is a solid defense against XSS.  However, getting it right is notoriously difficult.  You must apply the escaping in every single place where your code inserts user-supplied data into a web page.  That can easily be thousands of locations.  Miss a single one, and your site is vulnerable.  It's sometimes necessary to perform the encoding at different levels in the code, making it hard to keep track of which strings have already been encoded.  Worse yet, different sorts of encoding are needed depending on context: HTML entity encoding, URL encoding, backslash encoding, etc.  I've even seen cases where two levels of encoding were needed -- for instance, when a value is included in a JavaScript string literal, and the JavaScript code later places that string in the innerHTML of some DOM node.  If you get the encoding wrong, you may again be vulnerable.

Escaping also tends to uglify your code.  In JSP, <%= userName %> might have to become <%= Escaper.entityEncode(userName) %>.  The impact on code readability and maintainability is nontrivial.  Some templating systems handle this better than others; we're using plain ol' JSP, which offers no special support.  So, we weren't very enthusiastic about this approach.

Validation and filtering

Another well-known defense is to validate and/or filter your input: disallow users from entering special characters like <, or filter out those characters.  If these characters never enter your system, you don't have to worry about escaping them at output time.

Most web sites have fewer inputs than outputs, so airtight input filtering is easier to achieve than airtight output escaping.  Also, a single filter can render data safe for inclusion in a variety of contexts, unlike output escaping where you have to be careful to use the correct escaping mode according to context.  (Though this requires a broadminded definition of "special characters" -- see this link for a discussion.)

Input filtering does have some drawbacks.  Most notably, it is visible to users.  Some dangerous characters, like ' and &, appear frequently in ordinary situations, and users will be annoyed if they can't use them.  For this reason, filtering is most commonly used for specific data types, such as phone numbers, rather than free-form text such as a message subject.

Another drawback of input filtering is that, if you find a bug in your input filter, you have to re-validate all existing data.  That can be a huge burden in practice, especially if you're under time pressure to close a security hole.  With output escaping, as soon as you fix the bug and push a new build, you're protected.

Design criteria

We're building an actual web site; rubber is meeting road.  Theory aside, we need a concrete plan for XSS defense.  Ideally, the solution would satisfy the following criteria:
  • Easy to use -- no complex rules to be remembered while coding.
  • Minimal impact on code readability and maintainability.
  • Works with JSP (our current templating system), and portable to other templating systems.
  • Little or no user-visible impact.
  • Auditability -- it should be possible to scan the code with an automated tool and identify any possible XSS holes.
  • Defense in depth -- multiple, independent / redundant mechanisms preventing XSS exploits.
Also, I'd like a pony.  (Well, not really.  One of these, maybe.)  Output escaping fails on ease of use, and is difficult to audit; input filtering fails on user impact, and by itself does not provide defense in depth.  Time to get creative.

Input transformation

Again, one problem with input filtering is that it can cause serious annoyance to users.  As noted in one of the pages linked above, imagine poor Mr. O'Malley's frustration when he can't type the ' in his name.

What if, instead of forbidding dangerous characters, we replace them with safer substitutes?  The most common offender, the single quote, has a very acceptable substitute -- the curved single quotes, ‘ and ’.  When we process a form submission, we could perform this substitution automatically.  Mr. O'Malley might not even notice that he's now Mr. O’Malley, and if he did notice, he probably wouldn’t mind.

This appeals to me.  The main objection to input filtering is the impact on users, and this mitigates that impact.  The impact is not eliminated completely, so this won't work in all situations.  But in our application, it should be usable for almost all inputs.

The mapping I'm currently envisioning is as follows:

  '  ->  ‘ or ’ (depending on context)
  "  ->  “ or ” (depending on context)
  <  ->  (
  >  ->  )
  &  ->  +
  \  ->  /

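The mapping above is easy to sketch in code.  The `transform` method below is a hypothetical helper (the `XssUtil.getSafe` wrapper mentioned later would call something like it); the quote-direction heuristic -- opening quote after whitespace or at the start of the string, closing quote elsewhere -- is my assumption about a reasonable "depending on context" rule.

```java
// Hypothetical input transformation implementing the mapping table above.
public class XssUtil {
    public static String transform(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            // Crude context rule: a quote opens if it follows whitespace
            // (or starts the string), and closes otherwise.
            boolean opening = (i == 0) || Character.isWhitespace(input.charAt(i - 1));
            switch (c) {
                case '\'': out.append(opening ? '\u2018' : '\u2019'); break;
                case '"':  out.append(opening ? '\u201C' : '\u201D'); break;
                case '<':  out.append('('); break;
                case '>':  out.append(')'); break;
                case '&':  out.append('+'); break;
                case '\\': out.append('/'); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }
}
```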
This isn't sufficient in every situation.  For instance, URLs need a different input transformation -- the mapping above will break some URLs, and doesn't rule out "javascript:" or other dangerous links.  And no reasonable input transformation will suffice if you include user-supplied values in an unquoted tag attribute -- quoting is essential.  But if you're good about quoting, this transformation suffices for most common situations.  And it scores pretty well on my "I want a pony" design criteria:
  • Ease of use: simply replace every instance of request.getParameter("name") with something like XssUtil.getSafe(request, "name").  (With exceptions only for those unusual cases where the transformation is not acceptable.)
  • Impact on code readability: the new form is not especially bulkier than the old.
  • Template compatibility: input transformation has no impact on the templating system.
  • Auditability: it's easy to grep your codebase for unprotected calls to getParameter.
That leaves only user impact, which I've discussed; and defense in depth.  Defense in depth brings me to my next topic.

Script tagging

Most approaches to XSS defense involve enumerating, and protecting, every spot on a page where scripts might creep in.  The idea is for the page to be "clean" as it emerges from the templating system.  As we've seen above, that's difficult.

Instead, let's accept that a malicious script might sneak into the generated page.  If we have a whitelist of places where scripts are supposed to appear, we could filter out any unwanted ones.  This would work as follows:

1. Tag all "good" scripts -- scripts that we're deliberately including in the page -- with a special marker.  In JSP, the coding idiom might be something like this:

  <script language="javascript">
    <%= scriptToken() %>
    ...regular JavaScript code here...
  </script>
It's important that an attacker not be able to guess the special marker.  This is easily ensured by generating a random marker for each request.

2. After the templating system executes, run the generated page through an HTML parser to identify all scripts on the page.  (For this to be robust, we'll actually need an HTML "cleaner" that rewrites the page in a way that all browsers can be trusted to parse properly.)  Here, "all scripts" means all constructs that could trigger undesired actions in the browser: <script> tags, <style> tags, onclick handlers, URLs with protocols other than http/https/mailto, etc.  Like any HTML cleaner, the system should be based on a whitelist -- any tag or attribute not in the whitelist is considered dangerous.

3. Whenever we see a script with the special marker, remove the marker and leave the script alone.  If we see any scripts without the marker, strip them out, and flag the page for manual inspection.
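The three steps above can be sketched as follows.  This is a deliberately oversimplified illustration: as noted in step 2, a real implementation must run a whitelisting HTML cleaner, not a regex, and must also cover event-handler attributes, <style> tags, and dangerous URL protocols.  All names here are hypothetical.

```java
import java.security.SecureRandom;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of script tagging: keep scripts that begin with the per-request
// token (removing the token itself); strip any script that lacks it.
public class ScriptStripper {
    private static final Pattern SCRIPT =
        Pattern.compile("<script[^>]*>(.*?)</script>", Pattern.DOTALL);

    // Step 1's marker: unguessable because it is random per request.
    public static String newToken() {
        return Long.toHexString(new SecureRandom().nextLong());
    }

    public static String strip(String html, String token) {
        Matcher m = SCRIPT.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = m.group(1).trim();
            if (body.startsWith(token)) {
                String kept = m.group().replace(token, "");
                m.appendReplacement(out, Matcher.quoteReplacement(kept));
            } else {
                // Unmarked script: remove it (and, in real life, flag the
                // page for manual inspection).
                m.appendReplacement(out, "");
            }
        }
        m.appendTail(out);
        return out.toString();
    }
}
```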

I don't recall seeing this approach suggested before, but at first blush it seems sound to me.  Of course, browsers are complicated beasts, and I may be missing something.  If you can poke a hole, please let me know!

This approach does impose processing overhead, to parse each generated page.  However, cycles are cheap nowadays; security holes are expensive, as is programmer time.  Also, the same processing pass can perform other useful functions, such as HTML minification.  How does the approach stack up on my design criteria?
  • Ease of use: pretty good.  It only requires adding a bit of boilerplate at the beginning of every script block; in practice, it might look something like <%= scriptToken() %>.
  • Code impact: the token is not a big deal for a script block.  It will be more annoying in something small like an onclick handler.
  • Template compatibility: inserting a token should be easy in any templating system.
  • User impact: none.
  • Auditability: excellent.  Runtime reporting of marker-less scripts makes the system self-auditing.
  • Defense in depth: this approach is completely independent of input transformation, so combining the two achieves a layered defense.  It could also be combined with traditional output escaping.

The combination of input transformation and script tagging yields a layered, auditable defense against XSS attacks, with less programming burden than traditional output escaping.  Output escaping will still be needed in a few places where input transformation is impractical, such as when embedding a user-supplied value in a URL.

If there is interest, I might take the time to open-source this code once it's completed.


[1] My old partners in crime from Writely, Sam Schillace and Claudia Carpenter

Monday, July 26, 2010

Three-month review

On a whim, this afternoon I decided to take a look at the long-term graphs for some of my benchmarks.  Most of them now have at least three months' worth of data, and I was curious to see whether there have been changes over time.

Most of the graphs look pretty uneventful.  This computational benchmark from EC2 (measuring Math.sin) is typical:

Basically unchanged.  (The early blip in median latency might be a statistical artifact: the histogram shows a large peak around 130ns, and a smaller peak at 250ns; perhaps the large peak comprises more or less precisely half of the samples, so as it wanders from 49% to 51%, the median could flit back and forth between the two peaks.)

The Math.sin benchmark running on App Engine shows a little more variation:

This presumably follows from the fact that App Engine allows clients to share a CPU.

Shared storage services show much more variation.  Here is the latency to read a single record from an Amazon SimpleDB database (with an eventually consistent read):

Here's the same graph, with the vertical scale capped at 100ms to make the lower percentiles more visible:

The lower percentiles look steadier, as one might expect -- intermittent issues are likely to be fringe issues, and thus to disproportionately affect the higher percentiles.

App Engine Datastore read and write latencies wander around in a similar manner -- here's the graph for (non-transactional) reads:

EBS read and write latencies are comparatively stable, with some variation at the 99th percentile.  EC2 local-disk latencies are even more stable.  Amazon RDS latency is another story -- look at this graph for reads (of a single record from a pool of 1M records):

There have been a couple of distinct improvements in 99th percentile latency. Expanding the Y axis shows more detail for the 90th percentile:

It has wandered around, though it improved sharply a few weeks ago along with the 99th percentile, and has been quite steady since then.  Further expansion (not shown here) reveals the median and 10th percentile latencies to have been rock-solid all along.

Here is the graph to write one record to RDS:

Again we see a sharp improvement around July 9th.  Zooming in shows that the 10th percentile and median values have actually gotten worse over time:

While RDS read latencies have improved substantially, the bizarre 6-hour cycle is still present.  This graph, showing a couple of days on either side of the July 9th improvement, makes that clear:

(Note, in this graph the Y axis has been clipped at 300ms; many of the peaks are actually as high as 750ms.)

Of course, I have no way of knowing what caused any of these variations.  I haven't gone near these servers in quite some time, so the RDS latency improvements in June and July can't be anything I did.  Just another reminder that you can't take the performance of a shared service for granted.  I did receive a communication in late June from AWS support that "we are currently working on adjustments that would reduce latency in the db.m1.small instances"; this may explain the drop on July 9th.

I'll revisit this topic every few months.  Until then, back to the logging project...

Saturday, July 17, 2010

Instrumentation: what's going on in that server?

As part of their investigation into the RDS latency anomaly, Amazon asked me for a log of RDS request latencies.  Despite the fact that the entire purpose of the benchmark program is to measure, record, and analyze request latencies, there was no way to obtain an explicit log.  I had to modify the program and push a new build.

This spurred me to think more systematically about my approach to data gathering.  When developing and maintaining any sort of online service, it's important to be able to analyze the past and present behavior of your servers.  The primary tool for this purpose is usually some sort of server log (e.g. stdout or syslog), though many other tools are used as well.  In this post, I'll use the generic term "instrumentation" to cover all mechanisms that enable inspection of the past behavior of a server system.

The questions one might have about a system's behavior are virtually unbounded, and a plethora of instrumentation mechanisms have been invented to answer them.  Too many, really: real-world systems often use a hodgepodge of instrumentation mechanisms.  This is difficult to learn and use, and yields only patchwork functionality.  In this post, I look for a more organized approach.

How do we use instrumentation?

What sorts of questions might one want to answer using server instrumentation?  It's a broad space.  Here are some representative examples:

Why was that slow?  You've loaded a page from your own service, and you notice that it took several seconds to appear.  It's supposed to be faster.  What went wrong?  Where was the time spent?  Note that you might not know, a priori, specifically which server handled your request.  Also note that the request might have invoked nested requests (e.g. to a database server); if the nested requests were slow, you might have to look into the database server to understand the problem.

Why didn't that work?  A user reports that they got an error message while using your site.  They don't remember the details, but they can tell you their account name, and roughly when the error occurred.  What happened exactly?  You'd like to dig up the request that triggered the error, error details (e.g. stack crawl, relevant parameters and server state), and the user's interaction history leading up to the error.  Again, you might not know which specific server was involved, and the error might have bubbled up from a deeper server tier.

Are my servers healthy right now?  What is the average request latency?  Are we serving [m]any errors?  For a variety of server metrics, are the current values in line with past behavior?

I just tweaked something, what was the impact?  How did server metrics change when the tweak was made?  Any speedup or slowdown?  Are any new errors occurring?

What is this user's history on the site?  Perhaps you've noticed what looks like suspicious behavior from a particular user -- you suspect they may be a spammer.  How have they been using your service?

What is this server up to?  Suppose CPU usage spikes on one of your servers.  What is that server up to -- what operations is it performing, is there any unusual activity?

How many unique daily users does my site have?  This leads into more complex "analytics" questions -- breakdown of visitors by geographic region, referral source, trajectory through the site, etc.

Which records were modified by module X during a certain time period?  You might need to know this to clean up after a bug.

On what occasions has a certain operation been performed, and how long did it take on each occasion -- the question Amazon asked me about RDS requests.

Why instrumentation is needed

Logging, event counters, and other "historical record" instrumentation mechanisms are not the only tools for understanding the behavior of a software system.  There are also debuggers, black-box monitoring, scripted tests, and many other approaches.  However, proactively recording the internal behavior of live production servers can often be invaluable:
  • It allows you to investigate problems that you can't reproduce.  ("This user says X happened to her yesterday morning.")
    • Corollary: for problems that you can reproduce, it can save you the often-considerable hassle of reproducing them.
  • It allows you to investigate problems that only arise in the live production environment.  ("Performance degrades every day during the morning traffic peak.  Our artificial load generator doesn't reproduce the pattern.")
  • It allows you to detect patterns over time.  ("The site seems slower lately... has something changed?")
Instrument everything

Given that you are going to instrument your program, what should you instrument?  Which events are worth logging?  Which variables should be recorded?  It's hard to know in advance.  If you instrument too much, you will have a flood of data that may be hard to interpret, and you may even overload some server resource such as disk bandwidth.  If you instrument too little, you may not get the information you need.

We have at least two tools available to avoid this dilemma.  One is configurable instrumentation -- the ability to dynamically disable some instrumentation points at runtime, perhaps via a configuration file.  (Configurable "logging levels" -- e.g. discard all log messages at FINE and below -- are a simple example of this.)  If you liberally instrument your program, configurable instrumentation allows you to throttle the output down to a manageable level, while retaining the ability to enable detailed output on a moment's notice.

The second tool is temporal winnowing -- retain detailed data for a short period, then thin it out.  You might retain FINER logging in RAM for 100 seconds, FINE on local disk for a day, and INFO in a central repository for a month.  This provides detailed information for the recent past (the period you're most often interested in), while minimizing overall storage costs.

It's also worth recalling that disks have gotten very cheap.  Suppose we allow ourselves 1MB/sec of (moderately) compressed log output.  That allows for a lot of logging per request, even on a server handling hundreds or thousands of requests per second.  And yet retaining that output for a week requires only 600GB of disk space -- a fraction of a single modern hard disk.  So: "when in doubt, write it out".

A unified data model

To answer the sorts of questions mentioned above, we need several kinds of information:
  1. Traditional logging -- a textual record of events of interest.
  2. Rate counters -- how often do various events occur?
  3. Latency measurements -- how long does each operation of interest take to complete?
  4. Latency breakdowns -- how does the time spent in an operation break down into sub-operations?
  5. Time-varying values -- what is the history of, say, heap size or CPU utilization?
Here is an attempt to unify these under a single mechanism.  I use the traditional logging model as my starting point; it's simple, easy to understand, and has stood the test of time.  Imagine that we are writing a servlet to display some sort of dynamic content.  We can sprinkle in a few log calls (using a hypothetical "Log" class whose precise API I leave unspecified):

  Log.info("rendering item % on behalf of user %", itemId, userId);
  if (...) {
     Log.fine("case 1; % subrecords", count);
  } else {
     Log.fine("case 2; % subrecords", count);
  }
  Log.fine("finished rendering item");

Nothing novel here.  Clearly this provides our first category of information -- a textual log of events of interest.  With a little pattern matching, we can get rate counters as well.  For instance, to determine the number of item pages that have been displayed, we can count the number of log messages that begin with "rendering item".  (If this is too fragile for your taste, you could adopt the convention of beginning each log message with a stable identifier -- e.g. Log.info("pages/render/item % on behalf of user %", itemId, userId).)
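That pattern-matching step is trivial to sketch.  The timestamp format and the stable-identifier convention are assumptions carried over from the example output later in this post.

```java
import java.util.List;
import java.util.regex.Pattern;

// Derive an event counter from log lines by matching a stable message
// identifier at the start of each message.
public class LogCounter {
    // Strips a leading timestamp like "05:01:37.109 " if present.
    private static final Pattern TIMESTAMP =
        Pattern.compile("^\\d{2}:\\d{2}:\\d{2}\\.\\d{3} ");

    public static long count(List<String> logLines, String identifier) {
        return logLines.stream()
            .map(line -> TIMESTAMP.matcher(line).replaceFirst(""))
            .filter(line -> line.startsWith(identifier))
            .count();
    }
}
```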

For latency measurements, we need to identify the start and end of an operation of interest.  This can be done by indicating which Log calls mark an operation start or end.  Here's the same code fragment again, with two slight tweaks (highlighted in bold):

  Log.startInfo("rendering item % on behalf of user %", itemId, userId);
  if (...) {
     Log.fine("case 1; % subrecords", count);
  } else {
     Log.fine("case 2; % subrecords", count);
  }
  Log.endInfo();

The log calls at the start and end of the render operation have been explicitly identified.  (This naming convention -- "startInfo" and so forth -- is clumsy, especially when the start and end are logged at different severity levels.  For an actual implementation, I would try to come up with something better.  I've also glossed over many other details, such as what happens if an exception is thrown between the start and end calls.)  And the message text has been omitted for the "end" call, as the Log class now has enough information to generate an appropriate message automatically.  The output from this code might look something like the following:

  05:01:37.109 rendering item 8461 on behalf of user 196584
  05:01:37.110 | case 1; 9 subrecords
  05:01:37.254 | [completed in 0.145 seconds]

If Log.startXXX / Log.endXXX calls are nested, we can easily extract the fourth data category, hierarchical latency breakdowns.  (That leads to the interesting question of how to track sub-operations that execute on a different thread, or even in a different server; Google's Dapper project shows an approach.)
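To make the start/end convention concrete, here is a toy Log class that produces output in the format shown above.  Everything here is a sketch under stated assumptions: the real class would need thread-local operation stacks, severity filtering, exception safety, and a proper output sink rather than a StringBuilder.

```java
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayDeque;
import java.util.Deque;

// Toy Log class supporting the start/end convention.  Nested operations
// indent their messages with "| ", so latency breakdowns can be
// reconstructed from the log text alone.
public class Log {
    private static final DateTimeFormatter TIME =
        DateTimeFormatter.ofPattern("HH:mm:ss.SSS");
    private static final Deque<Long> starts = new ArrayDeque<>();
    static final StringBuilder output = new StringBuilder();  // stand-in for a real sink

    public static void startInfo(String message, Object... args) {
        emit(format(message, args));  // start line logs at the outer depth
        starts.push(System.nanoTime());
    }

    public static void endInfo() {
        long elapsed = System.nanoTime() - starts.peek();
        emit(String.format("[completed in %.3f seconds]", elapsed / 1e9));
        starts.pop();
    }

    public static void fine(String message, Object... args) {
        emit(format(message, args));
    }

    // Substitutes each % with the next argument.
    private static String format(String message, Object... args) {
        StringBuilder out = new StringBuilder();
        int arg = 0;
        for (char c : message.toCharArray()) {
            if (c == '%' && arg < args.length) out.append(args[arg++]);
            else out.append(c);
        }
        return out.toString();
    }

    private static void emit(String message) {
        StringBuilder line = new StringBuilder(LocalTime.now().format(TIME));
        for (int i = 0; i < starts.size(); i++) line.append(" |");
        output.append(line).append(' ').append(message).append('\n');
    }
}
```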

Finally, we have time-varying values.  These can be recorded by periodically emitting them into the log:

  Log.info("free memory currently %", Runtime.getRuntime().freeMemory());

The history can then be obtained using a simple pattern match.  Again, for more reliable pattern matching, we could use a stable identifier instead of the prose string "free memory currently".

This unified "everything in the log" approach provides many benefits.  For instance:
  • A single Log call enables multiple forms of analysis -- you get traditional logging, rate counting, and latency graphing with a single hook.
  • There's only one instrumentation system to learn and deploy.
  • "Joins" become possible (e.g. graph the latency of operation X, considering only invocations made during requests from user Y).
  • The log becomes comprehensive; you can see all available data in one place, rather than jumping around between different instrumentation systems.
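As a sketch of the "join" idea: if each completion message carries the operation name, user, and elapsed time, extracting operation X's latencies for user Y is a single pass over the log.  The line format and function below are invented for illustration:

```javascript
// Join two facts recorded in the same log line: pull out latencies for a
// given operation, restricted to requests from a given user.
function latenciesFor(logLines, op, userId) {
  const re = new RegExp(`^${op} .*user ${userId}.*completed in ([0-9.]+)`);
  return logLines
    .map(line => re.exec(line))
    .filter(m => m !== null)
    .map(m => Number(m[1]));
}

const exampleLog = [
  "pages/render/item 8461 user 196584 completed in 0.145",
  "pages/render/item 129 user 207 completed in 0.012",
  "pages/render/home user 196584 completed in 0.031",
];
console.log(latenciesFor(exampleLog, "pages/render/item", "196584"));  // → [ 0.145 ]
```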
Infrastructure requirements

Answering questions about system behavior using nothing more than log output would require significant infrastructure.  Logs would need to be stored, rotated, cataloged, summarized or indexed, and analyzed in fairly sophisticated ways.  In effect, I'm proposing a data warehousing approach to log management (despite knowing nothing about the field of data warehousing).  I'll dive into this in a later post.

Though the infrastructure required to support this approach is significant, it is important not to let the infrastructure drive the overall design.  The infrastructure only has to be written once, but it will be used every day.  This "use" falls into two categories:

1. Instrumenting a program -- adding Log calls while writing code.
2. Analyzing a system -- pulling up logs and graphs to understand what's going on.

The primary goal should be to make these two activities as cheap and easy as possible; let the infrastructure bend itself toward that goal.

Existing work?

There must be literature about logging and system forensics, but I'm not especially familiar with it.  If you're reading this and you can point me to any interesting papers, I'd love to see them.  (You can comment on this post, or e-mail me at steve at  I'm especially interested in unified approaches to system instrumentation (encompassing logging, performance measurement, etc.), and real-world solutions for multi-server systems.

Friday, June 11, 2010

A quick survey of client latencies

In an earlier post, I examined network latencies as measured by Google Chrome.  As a follow-up, I've repeated the experiment using a variety of browsers, hardware devices, and network links.  Here is a quick overview of the results:

| Connection | # samples | 10th %ile | Median | Mean | 90th %ile | 99th %ile | 99.9th %ile |
|---|---|---|---|---|---|---|---|
| Chrome Websocket / DSL | — | 89.6 ms | 94.0 ms | 105.0 ms | 105.0 ms | 273.0 ms | 881.0 ms |
| Chrome / DSL | — | 102.0 ms | 111.0 ms | 121.0 ms | 118.0 ms | 337.0 ms | 1.07 sec |
| Firefox / DSL | — | 98.9 ms | 105.0 ms | 116.0 ms | 116.0 ms | 273.0 ms | 1.07 sec |
| Safari / DSL | — | 98.5 ms | 103.0 ms | 111.0 ms | 107.0 ms | 270.0 ms | 1.19 sec |
| iPhone / DSL | — | 145.0 ms | 156.0 ms | 232.0 ms | 294.0 ms | 1.03 sec | 3.94 sec |
| Nexus One / DSL | — | 132.0 ms | 147.0 ms | 589.0 ms | 233.0 ms | 1.91 sec | 68.4 sec |
| iPhone / AT&T 2G | — | 487.0 ms | 629.0 ms | 875.0 ms | 1.22 sec | 5.53 sec | 15.9 sec |
| Nexus One / T-Mobile 3G | — | 286.0 ms | 320.0 ms | 345.0 ms | 381.0 ms | 915.0 ms | 2.27 sec |
| Nexus One / AT&T 2G | — | 351.0 ms | 438.0 ms | 601.0 ms | 668.0 ms | 5.49 sec | 6.31 sec |
| Nexus One / T-Mobile 2G | 510 | 378.0 ms | 409.0 ms | 429.0 ms | 479.0 ms | 790.0 ms | 2.61 sec |

[Fine print: All of these measurements involve a client in the S.F. Bay Area talking to an Amazon EC2 server in Virginia.  In light of the previous post, both the client and server were minimally loaded.  The first result uses Web Sockets; the others all use XMLHttpRequest (only one of my test browsers supports Web Sockets).  The client device, if not mentioned, is a MacBook Pro; the browser, if not mentioned, is the device's built-in browser.  The iPhone is a first-generation model (shocking, I know).  All cellular network tests involved a full-strength signal (according to the phone's status display), except for "iPhone / AT&T 2G", which had 3 or 4 out of 5 bars.  The DSL samples were all taken at roughly the same period of time, from the same location, and thus experienced the same DSL conditions.  Your mileage may vary.  Past performance is no guarantee of future results.]
This is a very un-scientific study: the sample counts are small, and each test was made from a single location over a brief period.  Still, some trends are clear:
  • As in my previous experiments, median latencies from the MacBook are quite good, and the tail is not terrible either.
  • There are no glaring differences between browsers.
  • Web sockets, currently supported only by Chrome (among the browsers I tested), continue to look like a good idea.
  • Even over the DSL line (via WiFi), both cell phones were substantially slower than the MacBook, with a very long tail.
  • All of the cellular connections were much slower than DSL.  3G is much faster than 2G, and not all 2G connections are created equal.
Mostly no surprises here, though I wasn't sure what to expect from the cellular connections, and it's perhaps surprising that they're so much slower than DSL in this test.  The amount of data exchanged is small -- around 600 bytes, most of which is request headers.  I don't know the nominal bandwidth of these 2G connections, but let's suppose it's 128 Kbps.  600 bytes is 4,800 bits, which would take roughly 37 milliseconds to transmit at 128 Kbps.  Yet even the 10th-percentile figures are at least 200 milliseconds slower than the same device communicating via WiFi + DSL.
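That back-of-envelope number, as a formula (nominal bandwidth only; this deliberately ignores round-trip latency, TCP setup, and radio wake-up):

```javascript
// Time to clock `bytes` onto a link of the given nominal bandwidth,
// in milliseconds.  Pure serialization delay, nothing else.
function transmitMs(bytes, bitsPerSecond) {
  return (bytes * 8 * 1000) / bitsPerSecond;
}

console.log(transmitMs(600, 128000));  // → 37.5
```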

Impact of request size

I decided to directly test the impact of request size on latency.  According to Firebug, when I run this benchmark in Firefox, the request headers are as follows:

User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv: Gecko/20100401 Firefox/3.6.3
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 115
Connection: keep-alive
Cookie: sid=s1; browserId=dummy; sid=s1

for a total of 475 bytes.  (The other components of the request and response are small by comparison, so I don't bother to detail them here.)  I was able to easily suppress the User-Agent, Accept, and Accept-Language headers using XMLHttpRequest.setRequestHeader (e.g. request.setRequestHeader("User-Agent", "")), reducing the request headers to 263 bytes.  Repeating the Firefox / DSL test, this yielded a small but definite improvement:

| Configuration | # samples | 10th %ile | Median | Mean | 90th %ile | 99th %ile | 99.9th %ile |
|---|---|---|---|---|---|---|---|
| Regular headers | 642 | 100.0 ms | 110.0 ms | 116.0 ms | 117.0 ms | 458.0 ms | 510.0 ms |
| Reduced headers | 641 | 98.8 ms | 104.0 ms | 113.0 ms | 116.0 ms | 399.0 ms | 488.0 ms |

With additional effort, it should be possible to further shrink both the request and response, but on DSL the savings don't appear to be large.  The setRequestHeader trick doesn't seem to work on the Nexus browser, so I wasn't able to directly reproduce this comparison on a cellular link.  Instead, I did a side-by-side test of the regular request headers vs. "padded" headers (adding an extra header containing 500 bytes of text).  Here are those results:

| Configuration | # samples | 10th %ile | Median | Mean | 90th %ile | 99th %ile | 99.9th %ile |
|---|---|---|---|---|---|---|---|
| T-Mobile 3G | 310 | 257.0 ms | 282.0 ms | 299.0 ms | 357.0 ms | 527.0 ms | 579.0 ms |
| T-Mobile 3G (padded) | 309 | 285.0 ms | 318.0 ms | 338.0 ms | 400.0 ms | 653.0 ms | 1.5 sec |
| T-Mobile 2G | 426 | 380.0 ms | 426.0 ms | 454.0 ms | 549.0 ms | 1.01 sec | 1.25 sec |
| T-Mobile 2G (padded) | 426 | 421.0 ms | 471.0 ms | 499.0 ms | 621.0 ms | 948.0 ms | 1.12 sec |

All tests were on a Nexus One with a full-strength signal.  Once again, request size has only a moderate impact.
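For what it's worth, percentile figures like those in these tables are easy to compute from raw samples.  Here's a nearest-rank version (a generic helper; I'm not claiming the benchmark harness does exactly this):

```javascript
// Nearest-rank percentile: sort the samples, take the value at rank
// ceil(p/100 * n).  Simple and adequate for latency tables like these.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank - 1, 0)];
}

const samples = [380, 426, 454, 549, 1010, 1250];
console.log(percentile(samples, 50));  // → 454
```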

Impact of WiFi

Another variable to be isolated is a wired connection vs. WiFi.  Here are figures for Chrome running on two different Macs, both connected to the same DSL line, one via WiFi and one via Ethernet:

| Configuration | # samples | 10th %ile | Median | Mean | 90th %ile | 99th %ile | 99.9th %ile |
|---|---|---|---|---|---|---|---|
| XMLHttpRequest / Ethernet | 2227 | 108.0 ms | 113.0 ms | 119.0 ms | 120.0 ms | 326.0 ms | 706.0 ms |
| XMLHttpRequest / WiFi | 2069 | 108.0 ms | 113.0 ms | 124.0 ms | 124.0 ms | 397.0 ms | 516.0 ms |
| Websocket / Ethernet | 2226 | 90.4 ms | 97.9 ms | 102.0 ms | 106.0 ms | 198.0 ms | 894.0 ms |
| Websocket / WiFi | 2069 | 92.7 ms | 101.0 ms | 110.0 ms | 107.0 ms | 385.0 ms | 540.0 ms |

There's a difference, but it's fairly minor, at least through the 90th percentile (and this sample is too small to place much weight in the 99th percentile figures).  Clearly, WiFi does not explain the speed difference between a smartphone and a conventional computer when both are connecting to a DSL line.

The conclusion seems to be that phones are just plain fundamentally slower.  This could be a property of raw processing speed, the browser implementation, competition from background tasks, or all of the above; but it does not seem to relate to factors I can control in an experiment.  Encouragingly, the newer phone (Nexus One) and newer network (3G) both perform better than the original iPhone and 2G, so perhaps we can look forward to a general trend of improvement.

Appendix: Histograms

There are no surprises in any of the histograms for these benchmarks, but as histograms seem to be a theme for this blog, I'll toss in a few.  Here's the "Chrome / DSL" XMLHttpRequest benchmark:

Many of the histograms look like that -- a sharp spike around 100ms, and a tail out to 1 second or so, with not much structure in the tail.  (More structure might be visible with a larger sample count.)  The iPhone and Nexus DSL tests had a broader peak -- here's the iPhone:

The block of timings in the 200-300 ms range doesn't seem to be explained by WiFi or other network issues, so perhaps it reflects garbage collection, CPU contention, or some other environmental factor on the phone.

The cellular links had much broader peaks; here's the iPhone again (this is the same AT&T 2G test reported above):

Other cellular links have histograms similar to this one, though not as wide.

Friday, May 28, 2010

More on the RDS latency oscillation

[Sorry for the long gap in posts; I've been working on other things.  Posts are likely to be somewhat sporadic for the foreseeable future, especially as I have quite a bit of travel coming up this summer.]

In Three Latency Anomalies, I mentioned a periodic oscillation in latency for simple read and write operations in Amazon's RDS (hosted MySQL) service.  I opened a support ticket with Amazon, and eventually received this response:

Hi Steve,

The latency you're seeing may be related to the DB Instance class (db.m1.small) that you're running. The I/O performance on the db.m1.small is Moderate, vs. High for the Large and Extra Large DB Instance classes. We are working on publishing guidelines around this variance on the Amazon RDS product page.

You might try running your test on one of the "High" I/O performing systems and see if that improves the latency issues that you're experiencing.

The Amazon RDS Team

The I/O explanation didn't sound right to me, because the variance appears even when repeatedly re-reading a single record, which ought to be cached.  But I decided to try their advice and experiment with a "large" RDS instance.  While I was at it, I added a couple more small instances and an additional client.  This gave me 8 client/server pairs to test -- 2 clients, each issuing requests to 4 servers.  The operations benchmarked are:
  • Fetch one small record (always the same record)
  • Fetch a small record randomly selected from a table of 1,000,000 records
  • Update one small record (always the same record)
  • Update a small record randomly selected from a table of 1,000,000 records
All operations use the same table, which is shared across clients (but, obviously, not across servers).  Here is the last 20 hours worth of latency data for the first operation, on the original client / server pair:

Here is the same graph, for the same client but a second "small" RDS instance:

Not only does it show the same oscillation, the oscillations are aligned.  Uh oh: could this be a client issue after all?  Well, that's why I launched a second client (on a different EC2 instance).  Here are the results for the second client, talking to the first server:

Very similar to the graph from the first client.  Here's the second client talking to the second server:

So, two different clients talking to two different servers, all presumably on different machines, showing latency oscillations that not only have the same frequency, but are precisely aligned.  I won't bother including graphs here for the third small server, but they look very similar to the above.

In theory, the only common element in these tests should be the network; and even there, I'd expect independent network paths to be involved.  In any case, my us-east-1b network measurements show no such oscillation.  So this additional data merely confirms the mystery.

I haven't shown data yet for the "large" server.  Here is the graph for client 2 fetching a single record:

This has the same periodicity as the other graphs, but the absolute latency values are much lower: 99th percentile latency oscillates between 5 and 20 ms, as opposed to something like 25 to 250 ms for the small servers.  Finally, just to make things really fun, client 1 talking to the large server does not show periodicity:

The periodicity is present in the 90th percentile plot, but it's not apparent in the 99th percentile.  I happen to have several days' worth of data for this pair, which confirms that only the 90th percentile plot has much periodicity:

One explanation would be that, for this particular server pair, there is an additional component to the latency tail that prevents 99th percentile latency from dropping below 20ms.  That could be explained by any client or network issue that imposed a 20ms delay for 1 to 10 percent of operations.

Conclusion: I still have no idea what causes the periodicity, but I'm even more convinced that something interesting is involved.  I'll pass the new data back to the RDS team and see what they can make of it.

Thursday, May 6, 2010

Browser network latencies

As an experiment, I spent part of last week writing a simple HTML5-based multiplayer video game (a clone of the original Maze War).  It works surprisingly well; browsers have come a long way.

A critical parameter for this sort of game is communication latency.  When I shoot at you, how soon does the incoming missile appear on your screen?  If the latency is not very short, the game feels "laggy".  In my playtesting, lag wasn't a major problem, but occasionally it seemed like things would hiccup a bit.  I decided to do some more scientific testing.  I wrote a simple web page that issues an XMLHttpRequest once per second, and then reports the latency back to my microbenchmark server.  Here is the resulting histogram:

| Operation | # samples | Min | 10th %ile | Median | Mean | 90th %ile | 99th %ile | 99.9th %ile | Max |
|---|---|---|---|---|---|---|---|---|---|
| XMLHttpRequest (OS X Chrome N. Calif. -> EC2 US East) | 13929 | 104.0 ms | 98.9 ms | 105.0 ms | 123.0 ms | 132.0 ms | 417.0 ms | 713.0 ms | 8.64 sec |

(This reports latency for an XMLHttpRequest from Google Chrome 5.0.375.29 beta, on a MacBook Pro running OS X 10.6.3, over not-very-good 1500Kbps/500Kbps DSL in the S.F. Bay Area, to an Amazon EC2 server in Virginia.  All benchmarks discussed in this post use the same setup, except as noted.  All benchmarks also ran for the same time period, and so experienced the same network conditions, again except as noted.)
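The probe itself is simple.  A sketch of its core loop (in the browser, `request` would wrap the actual XMLHttpRequest and `report` would post the sample back to the benchmark server; both are placeholders here):

```javascript
// Time a single round trip in milliseconds.
async function measureOnce(request) {
  const start = Date.now();
  await request();
  return Date.now() - start;
}

// Repeat once per second, handing each sample to the reporting callback.
function startProbe(request, report, intervalMs = 1000) {
  return setInterval(async () => report(await measureOnce(request)), intervalMs);
}
```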

These figures aren't too shabby; 100ms or so strikes me as a reasonable response time.  However, we have a fairly long tail, so players will see occasional lag.  Here is the same data, graphed over time:

There was a spike around 9 PM, but if I exclude that time period, the histogram doesn't change much.  The jittery latency seems to be present consistently over time.

Surprise!  It's not the network

An obvious explanation for this latency tail is delays in the network.  To verify that, I measured ping times between the same pair of machines.  Surprisingly, the ping data shows a much shorter tail:

| Operation | # samples | Min | 10th %ile | Median | Mean | 90th %ile | 99th %ile | 99.9th %ile | Max |
|---|---|---|---|---|---|---|---|---|---|
| Ping (N. Calif. -> EC2 US East) | 41717 | 91.9 ms | 89.6 ms | 93.5 ms | 97.2 ms | 101.0 ms | 140.0 ms | 414.0 ms | 907.0 ms |

And most of the tail turns out to be localized to the 9 PM event:

Excluding that time period, most of the tail vanishes from the ping histogram: 99.9th percentile ping latency is only 154ms!  Maybe my DSL isn't quite as bad as I'd thought.

It's partly the server

The tail in the XMLHttpRequest data must originate elsewhere.  Perhaps on the server?  My EC2 server is juggling quite a few tasks by now: it's running dozens of benchmarks, pulling additional benchmark data from App Engine, accumulating histograms, and serving the XMLHttpRequests for this latest benchmark.  Overall CPU utilization is still fairly low, and CPU-sensitive benchmarks running on the server aren't showing much of a tail, but still perhaps there is some effect.  So I fired up a second server, configured identically to the first but with no workload, and ran the same test there.  The results are faster than for the original server, but only slightly:

| Operation | # samples | Min | 10th %ile | Median | Mean | 90th %ile | 99th %ile | 99.9th %ile | Max |
|---|---|---|---|---|---|---|---|---|---|
| XMLHttpRequest (OS X Chrome N. Calif. -> EC2 US East) | 13929 | 104.0 ms | 98.9 ms | 105.0 ms | 123.0 ms | 132.0 ms | 417.0 ms | 713.0 ms | 8.64 sec |
| XMLHttpRequest (OS X Chrome N. Calif. -> EC2 US East) B | 10446 | 104.0 ms | 99.4 ms | 108.0 ms | 114.0 ms | 117.0 ms | 323.0 ms | 629.0 ms | 791.0 ms |
(I've repeated the original data for comparison.  The lightly-loaded "B" server is shown in the last row.)

It's partly HTTP

So, server CPU contention (or Java GC) may be playing a minor role, but to fully explain the latency tail we have to keep digging.  Perhaps Web Sockets, by eliminating most of the overhead of the HTTP protocol, would help?  Happily, Chrome now has Web Socket support.  I decided to try several connection methods: XMLHttpRequest, Web Sockets, and JSONP (a trick wherein you issue a request by dynamically creating a <script src=...> tag, and the server returns a JavaScript file whose execution delivers the response).  JSONP has the useful property of not being bound by the "same origin" security policy, enabling one additional connection method: "best of two".  In this method, I have the browser issue a Web Socket request to one server, and a simultaneous JSONP request to the other.  The latency is measured as elapsed time from when we begin sending the first of these two requests, until either response is received.  Here are the results:

| Operation | # samples | 10th %ile | Median | Mean | 90th %ile | 99th %ile | 99.9th %ile |
|---|---|---|---|---|---|---|---|
| JSONP (OS X Chrome N. Calif. -> EC2 US East) | 13929 | 98.7 ms | 104.0 ms | 123.0 ms | 155.0 ms | 416.0 ms | 688.0 ms |
| XMLHttpRequest (OS X Chrome N. Calif. -> EC2 US East) | 13929 | 98.9 ms | 105.0 ms | 123.0 ms | 132.0 ms | 417.0 ms | 713.0 ms |
| Websocket (OS X Chrome N. Calif. -> EC2 US East) | 13930 | 89.6 ms | 93.8 ms | 108.0 ms | 122.0 ms | 375.0 ms | 582.0 ms |
| JSONP (OS X Chrome N. Calif. -> EC2 US East) B | 10447 | 98.6 ms | 104.0 ms | 116.0 ms | 132.0 ms | 341.0 ms | 652.0 ms |
| XMLHttpRequest (OS X Chrome N. Calif. -> EC2 US East) B | 10446 | 99.4 ms | 108.0 ms | 114.0 ms | 117.0 ms | 323.0 ms | 629.0 ms |
| Websocket (OS X Chrome N. Calif. -> EC2 US East) B | 10447 | 89.5 ms | 93.1 ms | 96.3 ms | 96.8 ms | 119.0 ms | 556.0 ms |
| Websocket best-of-2 (OS X Chrome N. Calif. -> EC2 US East) B | 10446 | 89.6 ms | 93.6 ms | 98.4 ms | 101.0 ms | 219.0 ms | 551.0 ms |
| Ping (N. Calif. -> EC2 US East) | 41717 | 89.6 ms | 93.5 ms | 97.2 ms | 101.0 ms | 140.0 ms | 414.0 ms |
| Ping (N. Calif. -> EC2 US East) B | 39146 | 89.5 ms | 93.1 ms | 94.8 ms | 96.9 ms | 107.0 ms | 424.0 ms |

For each server, Web Sockets are clearly the fastest of the three connection techniques.  On the lightly loaded "B" server, the Web Socket latencies are almost as good as the ping latencies.  However, if I exclude the 9PM spike, there is still a noticeable difference in the tails: 99th and 99.9th percentile latencies for Web Socket requests are then 127 and 396 milliseconds respectively, while the equivalent ping latencies are 104 and 113 milliseconds.

It's interesting that the best-of-2 technique does not perform well.  To the extent that the latency tail is caused by server CPU contention, best-of-2 should be a big improvement.  It's unclear how much it would help with network issues, and it definitely can't help with client issues.  The poor performance suggests that client issues contribute significantly to the latency tail.
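For concreteness, the best-of-2 method boils down to racing two transports and taking the first success.  A sketch using Promise.any (which didn't exist in 2010; the actual page wired up a Web Socket to one server and a JSONP request to the other by hand, and the request functions below are placeholders for those transports):

```javascript
// Issue the same logical request over two channels; resolve with the
// first response to arrive.  Promise.any also tolerates one channel
// failing outright -- only if both fail does it reject (with an
// AggregateError) -- which is what lets the scheme ride out an outage
// of either server.
function bestOfTwo(requestA, requestB) {
  return Promise.any([requestA(), requestB()]);
}
```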

And it's partly client load

Now, we've more or less controlled for network latency and server CPU.  What factor could explain the remaining difference between ping and Web Socket latencies?  Perhaps it's something in the browser.  During the preceding tests, I had at least a dozen Chrome tabs open, some to complex sites such as Gmail.  I restarted Chrome with only one tab, open to the page that executes the benchmark:

| Operation | # samples | 10th %ile | Median | Mean | 90th %ile | 99th %ile | 99.9th %ile |
|---|---|---|---|---|---|---|---|
| JSONP (OS X Chrome N. Calif. -> EC2 US East) B | 1038 | 98.6 ms | 103.0 ms | 115.0 ms | 134.0 ms | 248.0 ms | 492.0 ms |
| XMLHttpRequest (OS X Chrome N. Calif. -> EC2 US East) B | 1038 | 106.0 ms | 112.0 ms | 112.0 ms | 117.0 ms | 211.0 ms | 446.0 ms |
| Websocket (OS X Chrome N. Calif. -> EC2 US East) B | 1037 | 89.5 ms | 93.0 ms | 94.8 ms | 96.6 ms | 100.0 ms | 118.0 ms |
| Websocket best-of-2 (OS X Chrome N. Calif. -> EC2 US East) B | 1037 | 89.5 ms | 93.1 ms | 95.9 ms | 96.9 ms | 107.0 ms | 369.0 ms |

These results are from a short run (only about an hour), at a different time period than the others, and so should be taken with a grain of salt.  However, the Websocket figures look really good, comparable to raw ping times.  The HTTP-based connection techniques still show a significant tail.  (It's worth noting that this might not be entirely due to HTTP overhead; it could also reflect overhead in the browser implementation of the respective connection techniques.  JSONP in particular is a rather baroque approach and requires DOM manipulation and dynamic compilation.)

A convenient outage -- score one for best-of-two

During an earlier benchmark run (not reported here), my "B" server went offline.  Amazon's explanation:

5:39 PM PDT We are investigating instance connectivity in the US-EAST-1 region.
5:55 PM PDT A subset of instances in a single Availability Zone became unavailable due to a localized power distribution failure. We are in the process of restoring power now.

No doubt this was inconvenient for some people, but for me it provided a handy test of the best-of-2 connection method.  Indeed, all of the other "B" server benchmarks began failing, but the best-of-2 results were unperturbed.

This event also provided an interesting peek into Amazon's handling of outages.  The event began at around 5:15, at least for my machine.  Amazon's service dashboard was first updated at 5:39, roughly 24 minutes later.  Hence, when I first detected the problem, I had no information as to whether it was something I'd done, an Amazon problem specific to my machine, or a more widespread Amazon problem.  It would be nice if the AWS dashboard included up-to-the-minute service health metrics based on automated monitoring.  It would also be nice if health information was propagated to the status page for individual EC2 instances.  All the while that the machine was offline, the dashboard page for that instance continued to indicate that the instance was up and running.

One other quibble: the status reports never actually indicate which Availability Zone was impacted (it was us-east-1b).  Seems silly not to provide that information.

Stress test

Just for kicks, I decided to repeat the tests while saturating my DSL line (by uploading a bunch of photos to Picasa).  The results are not pretty:

| Operation | # samples | 10th %ile | Median | Mean | 90th %ile | 99th %ile | 99.9th %ile |
|---|---|---|---|---|---|---|---|
| JSONP (OS X Chrome N. Calif. -> EC2 US East) B | 613 | 101.0 ms | 419.0 ms | 559.0 ms | 1.35 sec | 1.75 sec | 2.35 sec |
| XMLHttpRequest (OS X Chrome N. Calif. -> EC2 US East) B | 612 | 110.0 ms | 388.0 ms | 543.0 ms | 1.35 sec | 1.69 sec | 5.03 sec |
| Websocket (OS X Chrome N. Calif. -> EC2 US East) B | 612 | 91.1 ms | 398.0 ms | 524.0 ms | 1.31 sec | 1.69 sec | 2.35 sec |
| Websocket best-of-2 (OS X Chrome N. Calif. -> EC2 US East) B | 612 | 91.2 ms | 372.0 ms | 524.0 ms | 1.46 sec | 1.69 sec | 2.14 sec |

Here, all of the connection methods were equally bad.  (It would have been interesting to run ping tests at the same time, to see whether TCP congestion control was part of the problem, but I neglected to do so.)

Conclusion: good luck, you'll need it

With Web Sockets, a lightly loaded client, a lightly loaded server, a lightly loaded DSL line, and favorable winds, it's possible to get uniformly fast communication between a browser and server.  Disturb any of those conditions, and a latency tail rears up.  Where does that leave us?

In principle, you can control server load.  Browser support is harder: of the major browsers, only Chrome implements Web Sockets as far as I know.  (Flash provides similar functionality, which I may benchmark for a later post.)  Client load and DSL load are entirely out of our control; the best one might manage is to give feedback to the user.  And, while Google is rumored to have some influence over the weather, when it comes to favorable winds the rest of us just have to hope for the best.