Here's a histogram for the null microbenchmark:
The code behind this measurement boils down to:
long startTime = System.nanoTime();
benchmark.run(); // an empty method
long duration = System.nanoTime() - startTime;
The measured duration is pinned to a minimum of 1 nanosecond, for the benefit of my histogram class (which uses logarithmic buckets). So the 1 ns bucket, in this case, represents 0 measured time.
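For concreteness, here's a simplified sketch of a histogram class along these lines -- the 1.1 bucket ratio is mentioned in the comments below, but the rest (names, bucket count) is illustrative, not the actual implementation:

// Simplified sketch of a logarithmic-bucket latency histogram (not the actual class).
// Each bucket spans the same ratio (1.1x), starting at 1 ns, and durations are
// pinned to a minimum of 1 ns so that a measured 0 lands in the first bucket.
public class LogHistogram {
    private static final double BUCKET_RATIO = 1.1;
    private final int[] counts = new int[400];  // 1.1^400 ns is over a year -- plenty of range

    public void record(long durationNs) {
        double pinned = Math.max(durationNs, 1);
        int bucket = (int) (Math.log(pinned) / Math.log(BUCKET_RATIO));
        counts[Math.min(bucket, counts.length - 1)]++;
    }
}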
The first thing I notice is that all of the measurements are exact multiples of 1 microsecond. The obvious conclusion is that System.nanoTime(), on this machine, is based on a microsecond clock.
The second thing I notice is that the majority of samples, around 27K of 43K total, were measured at 1 microsecond. Most of the rest were split roughly evenly between 0 ("1 ns") and 2 microseconds. Then there is a tail of longer timings.
I'm somewhat at a loss to explain the trimodal distribution. Presumably System.nanoTime() is taking some time to execute. Suppose it takes 800 nanoseconds. Then, given the 1 microsecond resolution, we'd expect 80% of the samples to measure at 1 microsecond, and 20% at 0. That would fit the observed peaks at 1ns and 1µs, but wouldn't explain the peak at 2µs. Conversely, if System.nanoTime() takes 1200 nanoseconds, that would explain the peaks at 1µs and 2µs but not the peak at 1ns. Something more complex must be occurring. Perhaps it takes just about 1 microsecond, but with a significant amount of jitter.
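To make that reasoning concrete, here's a quick Monte-Carlo sketch of what a quantized clock does to the measurements. The 800 ns figure and the uniform-phase model are assumptions for illustration, not measurements:

import java.util.Random;

public class QuantizationSim {
    static final long TICK_NS = 1000;  // assume the clock advances in whole microseconds

    // Model of a clock that only reports whole ticks.
    static long quantize(double trueTimeNs) {
        return (long) (trueTimeNs / TICK_NS) * TICK_NS;
    }

    public static void main(String[] args) {
        Random rng = new Random();
        double trueDurationNs = 800;   // hypothetical true cost of the timed region
        int[] counts = new int[3];     // measured 0 µs, 1 µs, 2 µs

        for (int i = 0; i < 100000; i++) {
            double start = rng.nextDouble() * TICK_NS;  // random phase within a tick
            long measured = quantize(start + trueDurationNs) - quantize(start);
            counts[(int) (measured / TICK_NS)]++;
        }
        for (int b = 0; b < counts.length; b++)
            System.out.println(b + " us: " + counts[b]);
    }
}

Running this with a true duration of 800 ns gives roughly an 80/20 split between 1 µs and 0; with 1200 ns, a split between 1 µs and 2 µs. Neither reproduces all three peaks, which is exactly what makes the measured histogram puzzling.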
Then there's the long tail, including one sample way out at 15 µs. This might have something to do with the System.nanoTime() implementation, or it might involve some external event, such as a CPU interrupt.
I don't have any pressing need to understand how System.nanoTime() works, so I'm not going to dig further into this example. However, it serves to illustrate that you can learn a lot about an operation simply by looking at its latency histogram. At least as important, the act of explaining the histogram to yourself highlights any behavior you don't understand, identifying areas for further investigation. (It helps if the histogram has high resolution and is based on a large number of samples, as in this case.)
Real-world histograms aren't always complicated. Here's the histogram for a multiply-add loop:
The code:
int d = 1;
for (int i = 1; i <= repetitions; i++)
    d *= i;
dummy += d;
where "dummy' is an external variable which prevents the compiler from treating this as dead code. I measure System.nanotime() before and after, and divide by "repetitions", which is set to 1 million for this test. Virtually all of the samples fall into a single histogram bucket at about 1.15 ns. There's a tail, but it accounts for a very small fraction of the samples. The loop (all million iterations) takes about a millisecond, so it's not surprising that we occasionally get interrupted, explaining the tail. No mysteries in this histogram. A similar loop which invokes Math.sin(), surprisingly, yields a more complex histogram:
Here the code is:
double d = 1;
for (int i = 0; i < repetitions; i++)
    d *= Math.sin(i);
dummy += (int)d;
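Both loops are timed the same way: System.nanoTime() around the whole loop, divided by the repetition count. Here's a rough sketch of that harness (hypothetical names; not the actual benchmark code):

static volatile int dummy;  // externally visible, so the compiler can't discard the loop

// Returns the average cost of one iteration, in nanoseconds.
static double measureSinLoop(int repetitions) {  // repetitions is 1 million in these tests
    long start = System.nanoTime();
    double d = 1;
    for (int i = 0; i < repetitions; i++)
        d *= Math.sin(i);
    dummy += (int) d;
    long elapsed = System.nanoTime() - start;
    return (double) elapsed / repetitions;  // this per-iteration figure feeds the histogram
}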
To finish, let's look at a couple of histograms measuring filesystem access times on EC2. Here's the histogram for reading 4KB of data, at a random 4K-aligned position, from an 8GB file in a filesystem mounted on Amazon's Elastic Block Store:
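Before getting to the histogram itself, here's roughly what one of these timed reads amounts to (hypothetical code, not the actual benchmark):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Random;

public class RandomReadSketch {
    static final long FILE_SIZE = 8L << 30;   // 8 GB
    static final int BLOCK_SIZE = 4096;

    // Read 4KB at a random 4K-aligned offset within the file; return the elapsed time in ns.
    static long timeRandomRead(RandomAccessFile file, Random rng) throws IOException {
        long offset = ((rng.nextLong() >>> 1) % (FILE_SIZE / BLOCK_SIZE)) * BLOCK_SIZE;
        byte[] buf = new byte[BLOCK_SIZE];
        long start = System.nanoTime();
        file.seek(offset);
        file.readFully(buf);
        return System.nanoTime() - start;     // this duration feeds the histogram
    }
}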
Unsurprisingly, the histogram is dominated by a broad peak around 13ms, presumably reflecting disk seek time; 13ms is a reasonable figure for that. The long tail could represent any number of things, such as queuing on a busy disk.
More interesting are the peak around 15 µs and the double peak at 500/800 µs. At a guess, the first represents a hit in the disk block cache on the local machine, and the double peak represents a hit in the cache on the EBS server. If these guesses are correct, we can deduce many things, such as the approximate size of "our share" of each block cache, and the RPC time to the EBS server. The doubled nature of the 500µs peak doubtless hides something interesting as well. (I thought it might represent variation over time, but even when narrowing down to a single hour's data, the double peak remains.)
We can double-check some of this analysis by varying test parameters. Here is a histogram of the same operation on the same file, but selecting offsets only in the first 512K of the file. We invoke each operation every 10 seconds, and other tests involving the same file touch this 512K every 20 seconds or so, effectively pinning it in the local block cache:
A test using the first 256MB of the file yields a profile similar to that for the 8GB test, but the peak at 15 µs is slightly more pronounced, suggesting a slightly higher cache hit rate.
As an exercise, let's use the 8GB graph to estimate the size of the local block cache. The peak around 15µs contains roughly 1400 samples (eyeballed), or about 6% of the total of ~23,000. So at any time, the cache holds 6% of that 8GB file, or about 500MB. The same machine is performing an identical test on an 8GB file stored on local disk. We'd expect that file to consume a similar amount of cache, and a glance at its histogram suggests that this is indeed the case. So far we've accounted for 1GB of block cache, and I don't expect much other filesystem usage going on. As a cross-check, here's the output of "free -m" on the machine:
             total       used       free     shared    buffers     cached
Mem:          1707       1699          7          0        590        607
I'm not sure exactly what the distinction is between "buffers" and "cached", but my understanding is that they are both used mostly for disk blocks. 590MB + 607MB == 1.17GB, which agrees nicely with our guesstimate of 1GB cache usage for the microbenchmark data files.
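Spelling out that back-of-envelope arithmetic (the sample counts are eyeballed from the histograms, so this is rough):

double hitFraction = 1400.0 / 23000.0;        // ~6% of EBS reads hit the local block cache
double cachedMB = hitFraction * 8 * 1024;     // ~500 MB of the 8GB file resident in cache
double bothFilesMB = 2 * cachedMB;            // EBS file + local-disk file, ~1 GB total
double reportedMB = 590 + 607;                // "buffers" + "cached" from free -m, ~1.17 GB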
In conclusion:
- You can learn a lot about an operation by looking at its latency histogram.
- It's a useful exercise to explain each feature of the histogram to yourself. If there's something you can't explain, it means you don't fully understand the system.
- If you want to understand the behavior of a production system, you need to look at a histogram from the production system.
- High-resolution histograms are important. For instance, the 1µs quantization of System.nanoTime() wouldn't have been visible at lower resolution. Neither would the doubled peak at around 500µs in the EBS read time.
Great post! Regarding the double peaks at 800us, I wonder if you can learn more by disabling your VM's buffer cache. Also, I wonder if there is caching on the host machine.
Regarding the broad peak around 13ms on the EBS read, I think that might be 3 peaks smashed together. You might try plotting it on a linear scale.
Ken: thanks! Yes, disabling (or resizing) the buffer cache would be one way to learn more.
Maulik: my histogram code uses logarithmic buckets -- each bucket is of the same logarithmic "width", i.e. there is a bucket for [1ns, 1.1ns], one for [1.1ns, 1.21ns], etc. And I don't permanently record the raw execution times, only the histograms. So there's no data available beyond what's visible in the histogram as currently rendered -- each bar represents a single bucket.
I could adjust the code to capture more detail, but my guess is that the system is noisy enough to scramble any finer detail anyway. In a week or two we'll have more data and might be able to see better.
If you want to dig around, I'm constantly tweaking the reporting site with additional tools (poorly documented for now). For instance, http://amistrongeryet.com/op_detail.jsp?op=file3_read_4K_of_8G_ebs&hoursAgo=50&hours=&buckets=99 will show a per-hour histogram for each of the 50 most recent hours. (I currently store histograms by the hour, so this is absolutely all of the data available for that time period.) If variations over time are masking details -- for instance, if our volume is moving from server to server and different servers have different loads -- this might help. There's lots of interesting structure in these hourly plots, but each represents only ~360 data points so it could all be noise.