Tuning Haskell RTS for Kubernetes, Part 2


We kept on tweaking our Haskell RTS after we reached “stable enough” in part 1 of this series, trying to address two main things:

  1. Handle bursts of high concurrency more efficiently
  2. Avoid throttling due to CFS quotas in Kubernetes

We also learned some interesting properties of the parallel garbage collector in the process.

TL;DR

  • Profile your app in production
  • Threadscope is your friend
  • Disabling the parallel garbage collector is probably a good idea
  • Increasing -A is probably a good idea

-N⟨x⟩ == available_cores is bad

We ran into this problem in the previous post of our series, when we tried setting -N equal to the number of available cores.

We tried this configuration because we were hoping to disable Kubernetes CFS quotas and rely only on CPU affinity to prevent noisy neighbours and worker node overload.
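
As a rough sketch of what that looks like (the core list and counts here are illustrative, not our exact production values), you pin the process to a fixed set of cores with taskset and pass -N the same number:

$ taskset -c 0-2 quiz-engine-http +RTS -N3 -RTS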

Trying this out, I saw p99 response times rise from 16ms to 29ms, enough to affect the stability of our upstream services.

Confused, I reached out for help on the Haskell Discourse.

ThreadScope

Folks on Discourse were quick to help me drill down into GC as a possible cause of the slowness, but I had no idea how to go about tuning it other than at random.

The first piece of helpful advice I got was to use ThreadScope, a graphical viewer for thread profile information generated by GHC.

Capturing event logs for ThreadScope

The first thing I had to do to be able to use ThreadScope was to build a version of our app with the -eventlog flag in our package.yaml:

 executables:
   quiz-engine-http:
     dependencies:
       ...
     ghc-options:
       - -threaded
+      - -eventlog
     main: Main.hs
     ...

This makes it so our app ships with the necessary instrumentation, which we can turn on and off at launch.

Then I had to enable event logging by launching our app with the -l RTS flag, like so:

$ quiz-engine-http +RTS -N3 -M5.8g -l -RTS

This makes it so Haskell logs events to a file while it’s running. I decided to have a single Pod use these settings while running alongside the rest of our fleet and taking production traffic.

Last, I had to grab the event log, which gets dumped to a file like your-executable-name.eventlog. That could be done with kubectl cp.
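
Something along these lines does the trick (the namespace, pod name, and in-container path here are placeholders for whatever your deployment uses):

$ kubectl cp production/quiz-engine-http-abc123:/app/quiz-engine-http.eventlog ./quiz-engine-http.eventlog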

The log grew by around 1.2MB/s, and ThreadScope takes a while to load large event logs, so I went for short recording sessions of around 3 minutes.

Launching ThreadScope

With the event log in hand, I could finally launch ThreadScope:

$ threadscope quiz-engine-http.eventlog

ThreadScope showed me a chart of app execution vs GC execution and a bunch of metrics.

[ThreadScope screenshot: timeline view with three horizontal bars, one per capability, showing CPU time spent on app code in green and GC time in orange.]

[ThreadScope screenshot: statistics tab with a table of collection counts, in total and per garbage collector generation. The relevant numbers are summarized in the next section.]

Interesting metrics

Productivity was the first interesting metric I saw. It tells you what percentage of the time your actual application code is running; the remainder is taken up by GC.

In our case, we had 88.2% productivity, so 11.8% of the time our app was doing nothing, waiting for the garbage collector to run.

Our average GC pause was 200μs long, or 0.0002s. Really fast.

GHC made 103,060 Gen 0 collections in the 210s period, which is a bit ridiculous. This means we did 490 pauses per second, or one pause every 2ms. Our app’s average response time is 1.8ms, so with 3 capabilities, we were running GC on average once every 6 requests.

In comparison, we made 243 Gen 1 collections, a little over one per second. Gen 1 was OK.

Is the parallel GC helping me?

Another quick suggestion I got on Discourse was disabling the parallel garbage collector, so I went on to test that, with ThreadScope by my side.

I used the -qg RTS flag to disable the parallel GC entirely, and -qn⟨x⟩ to keep it enabled but restrict it to ⟨x⟩ threads.
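
Concretely, both variants go in the +RTS section at launch, keeping -l on so event logs are still captured. Roughly like this (sketches, not our exact production command lines):

$ quiz-engine-http +RTS -N3 -qg -l -RTS    # parallel GC disabled
$ quiz-engine-http +RTS -N3 -qn2 -l -RTS   # parallel GC limited to 2 threads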

This is how ThreadScope metrics and our 99th percentile response times were affected by the different settings:

RTS setting | p99  | productivity | g0 pauses/s | avg g0 pause
-N3 -qn3    | 29ms | 88.2%        | 488         | 0.2ms
-N3 -qn2    | 21ms | 89.8%        | 558         | 0.1ms
-N3 -qg     | 17ms | 88.9%        | 593         | 0.1ms

Pause times seemed to improve, but we didn’t have enough resolution in ThreadScope to see whether it was a 0.01ms improvement or a full 0.1ms improvement.

  • Collections got more frequent, for reasons unknown
  • Productivity dropped slightly when we went from 2 GC threads down to 1
  • p99 response time was the best when the parallel GC was disabled

Conclusion: the parallel GC wasn’t helping us at all

Is our allocation area the right size?

The last of the helpful suggestions we got on Discourse was tweaking -A, which controls the size of Gen 0.

The docs warn:

Increasing the allocation area size may or may not give better performance (a bigger allocation area means worse cache behaviour but fewer garbage collections and less promotion).

What does cache behavior mean here? Googling led me to a StackOverflow answer by Simon Marlow explaining that setting -A higher than the CPU’s L2 cache size lowers the L2 hit rate.

Our AWS instances are running Intel Xeon Platinum 8124M, which has 1MB of L2 cache per core, and the default -A is 1MB, so any increase would already spell a reduced L2 hit rate for us.
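
If you want to check the cache sizes on your own nodes, standard Linux tooling will report them (this is just a convenient way to look them up, not something from our original investigation):

$ lscpu | grep "L2 cache"
$ getconf LEVEL2_CACHE_SIZE    # L2 cache size, in bytes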

We compared 3 different scenarios:

RTS setting     | p99  | productivity | g0 pauses/s | avg g0 pause
-N3 -qn3 -A1m   | 29ms | 88.2%        | 488         | 0.2ms
-N3 -qn3 -A3m   | 18ms | 95.6%        | 144         | 0.2ms
-N3 -qn3 -A128m | 16ms | 99.6%        | 1.2         | 2ms

The L2 cache hit rate penalty didn’t seem to matter for the sorts of computations we run: -A128m still had the fastest p99 response time.

-A128m seemed a bit ridiculous, but we had memory to spare, so we went with it. The 2ms average pause was close to our p75 response time, so it seemed fine to stop the world about once per second for the duration of one slow-ish request to take out the trash.

Unlocking higher values for -N

Our app had been having hiccups in production. Every so often a database would get slow for a second, causing our Haskell processes, which usually handle around 2-4 in-flight requests at a time, to be flooded with 20-40 of them.

Eating through this pile of requests would often take less than a minute, but the backlog would cascade upstream into request queueing and high-latency alerts, informing us that a high percentage of our users were having a frustrating experience with our website.

Whenever this happened, we did not see CPU saturation. CPU usage remained around 65-70%. It made me think our Haskell threads were not being used effectively, and that higher parallelism could help us leverage our cores better, even at the cost of some context switching.

I had been eager to try an -N higher than the taskset core count I gave to our processes, but hadn’t been able to until now, because setting -N higher than the core count would quickly bring the productivity metric down and increase p99 response times.

With our new findings, and a close eye on nonvoluntary_ctxt_switches in /proc/[pid]/status, I managed to get us to -N6, which seemed enough to reduce the frequency of our hiccups from daily to a few times a month.
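
For reference, this is roughly how you can keep an eye on that counter (assuming a single matching process; pgrep -f matches against the full command line):

$ grep ctxt_switches /proc/$(pgrep -f quiz-engine-http)/status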

These were our final RTS settings, with -N6, compared to what we started with:

RTS setting    | p99  | productivity | g0 pauses/s | avg g0 pause
-N3 -qn3 -A1m  | 29ms | 88.2%        | 488         | 0.2ms
-N6 -qg -A128m | 13ms | 99.5%        | 0.8         | 4.2ms
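
Spelled out as a launch command, and keeping the -M cap from the earlier example (add -l back if you still want event logs), that amounts to something like:

$ quiz-engine-http +RTS -N6 -qg -A128m -M5.8g -RTS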

These numbers were captured on GHC 8.8.4. We did upgrade to GHC 8.10.6 to try the new non-moving garbage collector, but saw no improvement.

Conclusion

Haskell has pretty good instrumentation to help you tune garbage collection. I was intimidated by the prospect of trying to tune it without building a mental model of all the settings available first, but profiling our workload in production proved easy to set up and quick to iterate on.


Juliano Solanho @julianobs Engineer at NoRedInk

Thank you Brian Hicks and Ju Liu for draft reviews and feedback! 💜

