We kept on tweaking our Haskell RTS after we reached “stable enough” in part 1 of this series, trying to address two main things:
- Handle bursts of high concurrency more efficiently
- Avoid throttling due to CFS quotas in Kubernetes
  - If you’re unfamiliar with CFS quotas, here’s a comprehensive article from Omio
We also learned some interesting properties of the parallel garbage collector in the process.
TL;DR
- Profile your app in production
- ThreadScope is your friend
- Disabling the parallel garbage collector is probably a good idea
- Increasing -A is probably a good idea
- -N⟨x⟩ == available_cores is bad
We ran into this problem in the previous post of our series, where we tried to set:
-N3
--cpu-manager-policy=static
requests.cpu: 3
limits.cpu: 3
We tried this configuration because we were hoping to disable Kubernetes CFS quotas and only rely on CPU affinity to prevent noisy neighbours and worker nodes overload.
Trying this out, I saw p99 response times rise from 16ms to 29ms, enough to affect the stability of our upstream services.
Confused, I reached out for help on the Haskell Discourse.
ThreadScope
Folks on Discourse were quick to help me drill down into GC as a possible cause of the slowness, but I had no idea how to go about tuning it other than at random.
The first piece of helpful advice I got was to use ThreadScope, a graphical viewer for thread profile information generated by GHC.
Capturing event logs for ThreadScope
The first thing I had to do to be able to use ThreadScope was to build a version of our app with the -eventlog flag in our package.yaml:
executables:
  quiz-engine-http:
    dependencies:
      ...
    ghc-options:
      - -threaded
+     - -eventlog
    main: Main.hs
    ...
This makes it so our app ships with the necessary instrumentation, which we can turn on and off at launch.
Then I had to enable event logging by launching our app with the -l RTS flag, like so:
$ quiz-engine-http +RTS -N3 -M5.8g -l -RTS
This makes it so Haskell logs events to a file while it’s running. I decided to make a single Pod use these settings alongside the rest of our fleet, taking production traffic.
Last, I had to grab the event log, which gets dumped to a file like your-executable-name.eventlog. That could be done with kubectl cp.
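For reference, pulling the log down looks something like this; the namespace, Pod name, and in-container path below are placeholders:
# Copy the event log from the profiling Pod to the local machine (names are hypothetical)
$ kubectl cp production/quiz-engine-http-abc123:/app/quiz-engine-http.eventlog ./quiz-engine-http.eventlog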
The log grew at around 1.2MB/s, and ThreadScope takes a while to load large event logs, so I went for short recording sessions of around 3 minutes.
Launching ThreadScope
With the event log in hand, I could finally launch ThreadScope:
$ threadscope quiz-engine-http.eventlog
ThreadScope showed me a chart of app execution vs GC execution and a bunch of metrics.
Interesting metrics
Productivity was the first interesting metric I saw. It tells you what percentage of the time your actual application code is running; the remainder is taken up by GC.
In our case we had 88.2% productivity, so 11.8% of the time our app was paused, waiting for the garbage collector to finish.
Our average GC pause was 200μs long, or 0.0002s. Really fast.
GHC made 103,060 Gen 0 collections in the 210s period, which is a bit ridiculous. This means we did 490 pauses per second, or one pause every 2ms. Our app’s average response time is 1.8ms, so with 3 capabilities, we were running GC on average once every 6 requests.
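As a quick sanity check, those numbers hang together; here is a back-of-the-envelope version using bc:
# Back-of-the-envelope check of the Gen 0 numbers from the 210s session
$ echo "scale=2; 103060 / 210" | bc    # ~490 collections per second
$ echo "scale=2; 1000 / 490" | bc      # ~2ms between pauses
$ echo "scale=4; 490 * 0.0002" | bc    # ~0.098, i.e. roughly 10% of the time in Gen 0 GC alone
So Gen 0 alone accounts for most of the 11.8% we were losing to GC.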
In comparison, we made 243 Gen 1 collections, a little over one per second. Gen 1 was OK.
Is the parallel GC helping me?
Another quick suggestion I got on Discourse was disabling the parallel garbage collector, so I went on to test that, with ThreadScope by my side.
I used the -qg RTS flag to disable parallel GC, and -qn⟨x⟩ for keeping it enabled but only using ⟨x⟩ threads.
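Concretely, the two variants looked roughly like this, reusing the launch command from earlier:
# Parallel GC disabled entirely
$ quiz-engine-http +RTS -N3 -M5.8g -l -qg -RTS
# Parallel GC still on, but limited to 2 threads
$ quiz-engine-http +RTS -N3 -M5.8g -l -qn2 -RTS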
This is how ThreadScope metrics and our 99th percentile response times were affected by the different settings:
RTS setting | p99 | productivity | g0 pauses/s | avg g0 pause |
---|---|---|---|---|
-N3 -qn3 | 29ms | 88.2% | 488 | 0.2ms |
-N3 -qn2 | 21ms | 89.8% | 558 | 0.1ms |
-N3 -qg | 17ms | 88.9% | 593 | 0.1ms |
Pause times seemed to improve, but we didn’t have enough resolution in ThreadScope to see whether it was a 0.01ms improvement or a full 0.1ms improvement.
- Collections got more frequent, for reasons unknown
- Productivity dropped when we went down from 2 GC threads to 1
- p99 response time was the best when the parallel GC was disabled
Conclusion: the parallel GC wasn’t helping us at all
Is our allocation area the right size?
The last of the helpful suggestions we got on Discourse was tweaking -A, which controls the size of Gen 0.
The docs warn:
Increasing the allocation area size may or may not give better performance (a bigger allocation area means worse cache behaviour but fewer garbage collections and less promotion).
What does cache behavior mean here? Googling led me to this StackOverflow answer by Simon Marlow explaining that using -A higher than the CPU’s L2 cache size means we lower the L2 hit rate.
Our AWS instances are running Intel Xeon Platinum 8124M, which has 1MB of L2 cache per core, and the default -A is 1MB, so any increase would already spell a reduced L2 hit rate for us.
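If you want to check this on your own nodes and experiment, the L2 size is easy to query and -A is just another RTS flag; a quick sketch (how lscpu reports L2 varies between versions):
# One way to inspect the CPU's cache sizes on the node
$ lscpu | grep -i 'cache'
# Example launch with a 128MB allocation area; other flags as in the earlier command.
# Each capability gets its own allocation area, so with -N3 this is roughly 3 x 128MB of nursery.
$ quiz-engine-http +RTS -N3 -qn3 -A128m -RTS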
We compared 3 different scenarios:
RTS setting | p99 | productivity | g0 pauses/s | avg g0 pause |
---|---|---|---|---|
-N3 -qn3 -A1m | 29ms | 88.2% | 488 | 0.2ms |
-N3 -qn3 -A3m | 18ms | 95.6% | 144 | 0.2ms |
-N3 -qn3 -A128m | 16ms | 99.6% | 1.2 | 2ms |
The L2 cache hit rate penalty didn’t seem to affect the sorts of computations we run, as -A128m still gave the fastest p99 response time.
-A128m seemed a bit ridiculous, but we had memory to spare, so we went with it. The 2ms average pause was close to our p75 response time, so it seemed fine to stop the world once per second for about the duration of one slow-ish request to take out the trash.
Unlocking higher values for -N
Our app had been having hiccups in production. Every so often a database would get slow for a second, causing our Haskell processes, which usually handle around 2-4 in-flight requests at a time, to be flooded with 20-40 of them.
Eating through this pile of requests would often take less than a minute, but the backlog would cascade upstream into request queueing and high-latency alerts, informing us that a high percentage of our users were having a frustrating experience with our website.
Whenever this happened, we did not see CPU saturation; CPU usage remained around 65-70%. It made me think that our Haskell threads were not being used effectively, and that higher parallelism could help us leverage our cores better, even at the cost of some context switching.
I was eager to try a higher -N than the taskset core count I gave to our processes, but had been unable to until now, because setting -N higher than the core count would quickly bring the productivity metric down and increase p99 response times.
With our new findings, and a close eye on nonvoluntary_ctxt_switches in /proc/[pid]/status, I managed to get us to -N6, which seemed enough to reduce the frequency of our hiccups to a few times a month, versus daily.
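For reference, watching that counter is a one-liner; the pidof lookup is just one way to find the process ID:
# Show voluntary and nonvoluntary context switches for the running process
$ grep ctxt_switches /proc/$(pidof quiz-engine-http)/status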
These were our final RTS settings, with -N6, compared to what we started with:
RTS setting | p99 | productivity | g0 pauses/s | avg g0 pause |
---|---|---|---|---|
-N3 -qn3 -A1m | 29ms | 88.2% | 488 | 0.2ms |
-N6 -qg -A128m | 13ms | 99.5% | 0.8 | 4.2ms |
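Spelled out as a launch command, the final configuration looks roughly like this (keeping the -M cap from earlier; -l only when we want an event log):
$ quiz-engine-http +RTS -N6 -qg -A128m -M5.8g -RTS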
These numbers were captured on GHC 8.8.4. We did upgrade to GHC 8.10.6 to try the new non-moving garbage collector, but saw no improvement.
Conclusion
Haskell has pretty good instrumentation to help you tune garbage collection. I was intimidated by the prospect of trying to tune it without building a mental model of all the settings available first, but profiling our workload in production proved easy to set up and quick to iterate on.
Juliano Solanho @julianobs Engineer at NoRedInk
Thank you Brian Hicks and Ju Liu for draft reviews and feedback! 💜