We deal with Haskell resource leaks the same way you would in C++ or Java.
We have production monitors on every host that show basic metrics like memory, disk, and CPU utilization. Atop that, we added a tracker for the number of suspended Haskell threads. (that is, threads which are not blocked on I/O, but are also not running)
We found that the machines are usually able to handle requests as soon as they come in, so if the number of Haskell threads goes above 0 for any length of time, the machine is about an hour away from melting down.
We can restart the process without losing any connections, so this leaves us a very comfortable margin of error.
Once we know we have a problem, it's usually pretty simple to run the heap profiler on the process and look at recent commits. We continuously deploy, so there's only about a 10 minute delay before a particular commit is running in front of customers. This makes tracking regressions down really fast.
Even in cases where we can't figure out why a bit of code is leaking, we can almost always identify it and revert it until we understand what's going on.
> We can restart the process without losing any connections
Would you mind expanding on this a bit? I'm not too familiar with Haskell, but I am familiar with various was of blocking new connections while allowing existing connections to complete, either at the load-balancer level or built-in each individual process.
What Haskell stack are you using, and how are graceful restarts accomplished?
Good to know. Does it work for everything that's an fd in Linux? I know you've got to treat sockets and files differently in some cases (or at least did once)...
einhorn [1] implements this model and is pretty effective. Used in production at Stripe and other places. (It's written in Ruby, but can run application processes in any language.)
Basically, catch SIGINT, then stop listening to a socket/port. Finish all current requests and exit. The "watcher" parent process will restart the process with the new executable. Repeat for all other processes listening to the socket/port.
"we added a tracker for the number of suspended Haskell threads" - would you mind sharing how you did that? I couldn't see any obvious GHC APIs for it.
We rolled our own implementation. Our WAI application action increments a counter and decrements it again whenever an HTTP request is received and completed.
It doesn't track threads created as a part of HTTP request handling, but we don't allow those actions to forkIO anyway. There hasn't been any demand for it.
We have production monitors on every host that show basic metrics like memory, disk, and CPU utilization. Atop that, we added a tracker for the number of suspended Haskell threads. (that is, threads which are not blocked on I/O, but are also not running)
We found that the machines are usually able to handle requests as soon as they come in, so if the number of Haskell threads goes above 0 for any length of time, the machine is about an hour away from melting down.
We can restart the process without losing any connections, so this leaves us a very comfortable margin of error.
Once we know we have a problem, it's usually pretty simple to run the heap profiler on the process and look at recent commits. We continuously deploy, so there's only about a 10 minute delay before a particular commit is running in front of customers. This makes tracking regressions down really fast.
Even in cases where we can't figure out why a bit of code is leaking, we can almost always identify it and revert it until we understand what's going on.