We kept on tweaking our Haskell RTS after we reached “stable enough” in part 1 of this series, trying to address two main things:
- Handle bursts of high concurrency more efficiently
- Avoid throttling due to CFS quotas in Kubernetes
  - If you’re unfamiliar with CFS quotas, here’s a comprehensive article from Omio
We also learned some interesting properties of the parallel garbage collector in the process.
TL;DR
- Profile your app in production
- ThreadScope is your friend
- Disabling the parallel garbage collector is probably a good idea
- Increasing -A is probably a good idea
- -N⟨x⟩ == available_cores is bad
We ran into this problem in the previous post of our series, where we tried to set:
-N3
--cpu-manager-policy=static
requests.cpu: 3
limits.cpu: 3
We tried this configuration because we were hoping to disable Kubernetes CFS quotas and only rely on CPU affinity to prevent noisy neighbours and worker nodes overload.
Trying this out, I saw p99 response times rise from 16ms to 29ms, enough to affect the stability of our upstream services.
Confused, I reached out for help on the Haskell Discourse.
ThreadScope
Folks on Discourse were quick to help me drill down into GC as a possible cause of the slowness, but I had no idea how to go about tuning it other than at random.
The first piece of helpful advice I got was to use ThreadScope, a graphical viewer for thread profile information generated by GHC.
Capturing event logs for ThreadScope
The first thing I had to do to be able to use ThreadScope was to build a version of our app with the -eventlog flag in our package.yaml:
executables:
  quiz-engine-http:
    dependencies:
      ...
    ghc-options:
      - -threaded
+     - -eventlog
    main: Main.hs
    ...
This makes it so our app ships with the necessary instrumentation, which we can turn on and off at launch.
Then I had to enable event logging by launching our app with the -l RTS flag, like so:
$ quiz-engine-http +RTS -N3 -M5.8g -l -RTS
This makes it so Haskell logs events to a file while it’s running. I decided to make a single Pod use these settings alongside the rest of our fleet, taking production traffic.
Last, I had to grab the event log, which gets dumped to a file like your-executable-name.eventlog. That could be done with kubectl cp.
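For reference, pulling the log down looks something like this; the namespace, Pod name, and in-container path below are placeholders:
# Copy the event log from the profiling Pod to the local machine (names are hypothetical)
$ kubectl cp production/quiz-engine-http-abc123:/app/quiz-engine-http.eventlog ./quiz-engine-http.eventlog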
The log grew at around 1.2MB/s, and ThreadScope takes a while to load large event logs, so I went for short recording sessions of around 3 minutes.
Launching ThreadScope
With the event log in hand, I could finally launch ThreadScope:
$ threadscope quiz-engine-http.eventlog
ThreadScope showed me a chart of app execution vs GC execution and a bunch of metrics.
Interesting metrics
Productivity was the first interesting metric I saw. It tells you what percentage of the time your actual application code is running; the remainder is taken up by GC.
In our case we had 88.2% productivity, so 11.8% of the time our app was paused, waiting for the garbage collector to finish.
Our average GC pause was 200μs long, or 0.0002s. Really fast.
GHC made 103,060 Gen 0 collections in the 210s period, which is a bit ridiculous. This means we did 490 pauses per second, or one pause every 2ms. Our app’s average response time is 1.8ms, so with 3 capabilities, we were running GC on average once every 6 requests.
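As a quick sanity check, those numbers hang together; here is a back-of-the-envelope version using bc:
# Back-of-the-envelope check of the Gen 0 numbers from the 210s session
$ echo "scale=2; 103060 / 210" | bc    # ~490 collections per second
$ echo "scale=2; 1000 / 490" | bc      # ~2ms between pauses
$ echo "scale=4; 490 * 0.0002" | bc    # ~0.098, i.e. roughly 10% of the time in Gen 0 GC alone
So Gen 0 alone accounts for most of the 11.8% we were losing to GC.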
In comparison, we made 243 Gen 1 collections, a little over one per second. Gen 1 was OK.
Is the parallel GC helping me?
Another quick suggestion I got on Discourse was disabling the parallel garbage collector, so I went on to test that, with ThreadScope by my side.
I used the -qg RTS flag to disable parallel GC, and -qn⟨x⟩ for keeping it enabled but only using ⟨x⟩ threads.
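Concretely, the two variants looked roughly like this, reusing the launch command from earlier:
# Parallel GC disabled entirely
$ quiz-engine-http +RTS -N3 -M5.8g -l -qg -RTS
# Parallel GC still on, but limited to 2 threads
$ quiz-engine-http +RTS -N3 -M5.8g -l -qn2 -RTS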
This is how ThreadScope metrics and our 99th percentile response times were affected by the different settings:
RTS setting | p99 | productivity | g0 pauses/s | avg g0 pause |
---|---|---|---|---|
-N3 -qn3 | 29ms | 88.2% | 488 | 0.2ms |
-N3 -qn2 | 21ms | 89.8% | 558 | 0.1ms |
-N3 -qg | 17ms | 88.9% | 593 | 0.1ms |
Pause times seemed to improve, but we didn’t have enough resolution in ThreadScope to see whether it was a 0.01ms improvement or a full 0.1ms improvement.
- Collections got more frequent, for reasons unknown
- Productivity dropped when we went down from 2 GC threads to 1
- p99 response time was the best when the parallel GC was disabled
Conclusion: the parallel GC wasn’t helping us at all
Is our allocation area the right size?
The last of the helpful suggestions we got on Discourse was tweaking -A, which controls the size of Gen 0.
The docs warn:
Increasing the allocation area size may or may not give better performance (a bigger allocation area means worse cache behaviour but fewer garbage collections and less promotion).
What does cache behavior mean here? Googling led me to this StackOverflow answer by Simon Marlow explaining that using -A higher than the CPU’s L2 cache size means we lower the L2 hit rate.
Our AWS instances are running Intel Xeon Platinum 8124M, which has 1MB of L2 cache per core, and the default -A is 1MB, so any increase would already spell a reduced L2 hit rate for us.
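If you want to check this on your own nodes and experiment, the L2 size is easy to query and -A is just another RTS flag; a quick sketch (how lscpu reports L2 varies between versions):
# One way to inspect the CPU's cache sizes on the node
$ lscpu | grep -i 'cache'
# Example launch with a 128MB allocation area; other flags as in the earlier command.
# Each capability gets its own allocation area, so with -N3 this is roughly 3 x 128MB of nursery.
$ quiz-engine-http +RTS -N3 -qn3 -A128m -RTS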
We compared 3 different scenarios:
RTS setting | p99 | productivity | g0 pauses/s | avg g0 pause |
---|---|---|---|---|
-N3 -qn3 -A1m | 29ms | 88.2% | 488 | 0.2ms |
-N3 -qn3 -A3m | 18ms | 95.6% | 144 | 0.2ms |
-N3 -qn3 -A128m | 16ms | 99.6% | 1.2 | 2ms |
The L2 cache hit rate penalty didn’t seem to affect the sorts of computations we run, as -A128m still gave the fastest p99 response time.
-A128m seemed a bit ridiculous, but we had memory to spare, so we went with it. The 2ms average pause was close to our p75 response time, so it seemed fine to stop the world once per second for about the duration of one slow-ish request to take out the trash.
Unlocking higher values for -N
Our app had been having hiccups in production. Every so often a database would get slow for a second, causing our Haskell processes, which usually handle around 2-4 in-flight requests at a time, to be flooded with 20-40 of them.
Eating through this pile of requests would often take less than a minute, but the backlog would cascade upstream into request queueing and high-latency alerts, informing us that a high percentage of our users were having a frustrating experience with our website.
Whenever this happened, we did not see CPU saturation; CPU usage remained around 65-70%. It made me think that our Haskell threads were not being used effectively, and that higher parallelism could help us leverage our cores better, even at the cost of some context switching.
I was eager to try a higher -N than the taskset core count I gave to our processes, but had been unable to until now, because setting -N higher than the core count would quickly bring the productivity metric down and increase p99 response times.
With our new findings, and a close eye on nonvoluntary_ctxt_switches in /proc/[pid]/status, I managed to get us to -N6, which seemed enough to reduce the frequency of our hiccups to a few times a month, versus daily.
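For reference, watching that counter is a one-liner; the pidof lookup is just one way to find the process ID:
# Show voluntary and nonvoluntary context switches for the running process
$ grep ctxt_switches /proc/$(pidof quiz-engine-http)/status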
These were our final RTS settings, with -N6, compared to what we started with:
RTS setting | p99 | productivity | g0 pauses/s | avg g0 pause |
---|---|---|---|---|
-N3 -qn3 -A1m | 29ms | 88.2% | 488 | 0.2ms |
-N6 -qg -A128m | 13ms | 99.5% | 0.8 | 4.2ms |
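Spelled out as a launch command, the final configuration looks roughly like this (keeping the -M cap from earlier; -l only when we want an event log):
$ quiz-engine-http +RTS -N6 -qg -A128m -M5.8g -RTS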
These numbers were captured on GHC 8.8.4. We did upgrade to GHC 8.10.6 to try the new non-moving garbage collector, but saw no improvement.
Conclusion
Haskell has pretty good instrumentation to help you tune garbage collection. I was intimidated by the prospect of trying to tune it without building a mental model of all the settings available first, but profiling our workload in production proved easy to set up and quick to iterate on.
Juliano Solanho @julianobs Engineer at NoRedInk
Thank you Brian Hicks and Ju Liu for draft reviews and feedback! 💜