Friday, May 28, 2010

More on the RDS latency oscillation

[Sorry for the long gap in posts; I've been working on other things.  Posts are likely to be somewhat sporadic for the foreseeable future, especially as I have quite a bit of travel coming up this summer.]

In Three Latency Anomalies, I mentioned a periodic oscillation in latency for simple read and write operations in Amazon's RDS (hosted MySQL) service.  I opened a support ticket with Amazon, and eventually received this response:

Hi Steve,

The latency you're seeing may be related to the DB Instance class (db.m1.small) that you're running. The I/O performance on the db.m1.small is Moderate, vs. High for the Large and Extra Large DB Instance classes. We are working on publishing guidelines around this variance on the Amazon RDS product page.

You might try running your test on one of the "High" I/O performing systems and see if that improves the latency issues that you're experiencing.

Regards,
The Amazon RDS Team

The I/O explanation didn't sound right to me, because the variance appears even when repeatedly re-reading a single record, which ought to be cached.  But I decided to try their advice and experiment with a "large" RDS instance.  While I was at it, I added a couple more small instances and an additional client.  This gave me 8 client/server pairs to test -- 2 clients, each issuing requests to 4 servers.  The operations benchmarked are listed below, followed by a rough sketch of the measurement loop:
  • Fetch one small record (always the same record)
  • Fetch a small record randomly selected from a table of 1,000,000 records
  • Update one small record (always the same record)
  • Update a small record randomly selected from a table of 1,000,000 records
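Concretely, each client's loop looks roughly like the sketch below.  This is an illustrative reconstruction, not the actual benchmark code: it assumes Python with the mysql-connector-python driver, and the table name, schema, endpoint, credentials, and 1-second pacing are all placeholders.

    # Illustrative sketch of the benchmark loop (assumed names throughout).
    import random
    import time
    import mysql.connector

    TABLE_SIZE = 1000000  # rows in the randomly-accessed table

    conn = mysql.connector.connect(
        host="example.xxxxxxxx.us-east-1.rds.amazonaws.com",  # placeholder RDS endpoint
        user="bench", password="secret", database="benchdb")  # placeholder credentials
    cursor = conn.cursor()

    def timed(sql, params):
        # End-to-end latency of one statement, in milliseconds.
        start = time.time()
        cursor.execute(sql, params)
        if cursor.with_rows:      # SELECTs return rows; UPDATEs don't
            cursor.fetchall()
        conn.commit()             # include the commit in the measured time
        return (time.time() - start) * 1000.0

    while True:
        rand_id = random.randint(1, TABLE_SIZE)
        sample = {
            "read fixed":    timed("SELECT payload FROM records WHERE id = %s", (1,)),
            "read random":   timed("SELECT payload FROM records WHERE id = %s", (rand_id,)),
            "update fixed":  timed("UPDATE records SET payload = 'x' WHERE id = %s", (1,)),
            "update random": timed("UPDATE records SET payload = 'x' WHERE id = %s", (rand_id,)),
        }
        print(sample)
        time.sleep(1)  # pacing is a guess; the real sampling interval isn't stated here

The updates write a constant value; the point is just to exercise the write path, not to do meaningful work.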
All operations use the same table, which is shared across clients (but, obviously, not across servers).  Here is the last 20 hours' worth of latency data for the first operation, on the original client/server pair:

Here is the same graph for the same client, but talking to a second "small" RDS instance:

Not only does it show the same oscillation, but the oscillations are aligned.  Uh oh: could this be a client issue after all?  Well, that's why I launched a second client (on a different EC2 instance).  Here are the results for the second client, talking to the first server:

Very similar to the graph from the first client.  Here's the second client talking to the second server:

So, two different clients talking to two different servers, all presumably on different machines, showing latency oscillations that not only have the same frequency, but are precisely aligned.  I won't bother including graphs here for the third small server, but they look very similar to the above.

In theory, the only common element in these tests should be the network; and even there, I'd expect independent network paths to be involved.  In any case, my us-east-1b network measurements show no such oscillation.  So this additional data merely confirms the mystery.

I haven't shown data yet for the "large" server.  Here is the graph for client 2 fetching a single record:

This has the same periodicity as the other graphs, but the absolute latency values are much lower: 99th percentile latency oscillates between 5 and 20 ms, as opposed to something like 25 to 250 ms for the small servers.  Finally, just to make things really fun, client 1 talking to the large server does not show periodicity:

The periodicity is present in the 90th percentile plot, but it's not apparent in the 99th percentile.  I happen to have several days' worth of data for this pair, which confirms that only the 90th percentile plot has much periodicity:

One explanation would be that, for this particular server pair, there is an additional component to the latency tail that prevents 99th percentile latency from dropping below 20 ms.  Any client or network issue that imposed an extra ~20 ms delay on somewhere between 1 and 10 percent of operations would do it: common enough to dominate the 99th percentile, but rare enough to leave the 90th percentile tracking the fast path.
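A quick synthetic example makes that arithmetic concrete.  The numbers below are made up, not measurements: a fast path of a few milliseconds, plus an extra ~20 ms stall on some fraction of operations.

    # Synthetic illustration of how a rare fixed stall pins the tail percentiles.
    import random

    def percentile(samples, p):
        ordered = sorted(samples)
        return ordered[int(p / 100.0 * (len(ordered) - 1))]

    random.seed(0)
    for slow_fraction in (0.005, 0.02, 0.08):
        samples = []
        for _ in range(100000):
            latency = random.uniform(2.0, 5.0)   # fast path, milliseconds
            if random.random() < slow_fraction:
                latency += 20.0                  # occasional ~20 ms stall
            samples.append(latency)
        print("stall on %4.1f%% of ops: p90 = %5.1f ms, p99 = %5.1f ms" % (
            slow_fraction * 100, percentile(samples, 90), percentile(samples, 99)))

Once the stall hits more than about 1 percent of operations, p99 stays above 20 ms regardless of what the fast path does, while p90 keeps tracking the fast path as long as the stall rate stays well under 10 percent -- consistent with periodicity showing up in the 90th percentile plot but not the 99th.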

Conclusion: I still have no idea what causes the periodicity, but I'm even more convinced that something interesting is involved.  I'll pass the new data back to the RDS team and see what they can make of it.

5 comments:

  1. Did you ever hear anything back from the RDS team?

  2. Yes, there was a little back-and-forth -- just requests on their end for more information. Sadly, we never got to the bottom of it; it got to the point where I was spending more time on the investigation than I could justify. I may come back to this at some point, but for now it remains a mystery.

  3. Hi there, I'm doing some research on Amazon RDS services and trying to pinpoint latency issues. How did you measure latency to the AWS servers? Plain ping gets no response for me... so if you could help me out, it would be great...

    thx :D

  4. I just executed a simple SQL statement and measured the end-to-end execution time, e.g. a SELECT from a table that contains only one row, so the query execution time itself should be minimal.

    If the SQL server is busy with other workloads, then this is not a good way to measure network latency, but it is a reasonable way to measure overall server performance.
