Reproduce, Measure, and Automate
The symptom our support engineer noticed was a "sluggish" system. More accurately, we saw that query latencies increased and overall system throughput dropped. So our first step was to reproduce the problem under some approximation of the customer's workload. It was a bit tricky to get all of the prerequisite conditions just right, but we finally landed on the right set of circumstances.
Our most commonly used profiling tool is based on the Linux oprofile infrastructure. But we had to modify it to better suit our system. Why? Because substantial portions of our codebase are written in an event driven/continuation passing style programming model. Call graph reporting in ofprofile is highly dependent on the stack, and the C stack often does not contain enough information to provide a valuable analysis in our system.
To get around this problem, we modified oprofile to gather additional information from the system. The key concept was adding a method for tagging a series of continuation calls and getting oprofile to sample these tags. The tag itself includes enough information to map the executing code to a logical system module hierarchy. The screen shot below shows an example output from one of our reporting tools.
Coming back to our original problem, the screen shot shows the output of the profile analysis tool during the dip in performance. The tool allows us to isolate the report to a specific processor. In this case we're looking at the output for our management core.
Interpreting the results we see that:
- The execution engine dominates performance, and most of that is in the storage engine
- The storage engine spends 63% of all its cycles on CRC32 calculations
- Our particular maintenance task was erroneously bound to the management core
- The task should have a lower priority
The root cause of the performance degradation was a scheduling issue. However, it caused me to take a closer look at our crc algorithm. We use the crc implementation from zlib, and it should be reasonably well optimized. Whipping up a quick benchmark, I we see that the zlib implementation takes about 41us to checksum a 32k block. In the test, we were reading about 224MB/s per node. That's 293ms of checksums in a second. No wonder we saw a performance drop.
I looked around and found a paper from Intel describing their slice-by-eight crc approach. A quick benchmark revealed that the same 32k checksum could be done in 27us -- a 37% percent improvement over zlib's version.
With Nehalem-based processors, Intel introduced an SSE instruction for a hardware based crc computation. I haven't had a chance to conduct my own benchmarks, but the following results suggest that we could get a 2-3x speedup over slice by eight.
I'm releasing the source for the tools referenced in this post. The user space oprofile daemon replacement requires our database runtime which I am not making public.