Avoid these OTP Supervision performance pitfalls

By: Derek Kraan / 2019-01-17

As powerful as Elixir/Erlang’s OTP is, it’s also easy to nuke your performance by accidentally introducing a bottleneck into the system. Here are two ways you can turn a Supervisor into a bottleneck, and how to fix each of them.

Doing time consuming work in init/1

This is perhaps the classic example of how to torpedo supervisor performance. A supervisor starts a child by calling its start_link function, and start_link does not return until the child’s init/1 callback has completed.

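defmodule MyGenServer do
  use GenServer

  @impl true
  def init(_arg) do
    # start_link, and therefore the supervisor, waits here until this returns.
    state = expensive_operation()
    {:ok, state}
  end
end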

For the entire time your init/1 callback is running, the supervisor is unable to respond to any other messages, which includes restarting processes that may have terminated.
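To see the effect for yourself, here is a minimal sketch. SlowInit and SlowInitDemo are hypothetical names, and Process.sleep/1 stands in for expensive_operation(). It starts a slow child from a separate task and measures how long an unrelated request to the same supervisor has to wait:

```elixir
defmodule SlowInit do
  use GenServer

  def start_link(delay_ms), do: GenServer.start_link(__MODULE__, delay_ms)

  @impl true
  def init(delay_ms) do
    # Stand-in for expensive_operation(): block init/1 for delay_ms.
    Process.sleep(delay_ms)
    {:ok, delay_ms}
  end
end

defmodule SlowInitDemo do
  # Returns how long (in ms) count_children/1 waited while the
  # supervisor was busy starting the slow child.
  def measure(delay_ms) do
    {:ok, sup} = DynamicSupervisor.start_link(strategy: :one_for_one)

    # Start the slow child from another process so that this process
    # stays free to talk to the supervisor in the meantime.
    task = Task.async(fn -> DynamicSupervisor.start_child(sup, {SlowInit, delay_ms}) end)

    # Give the start_child request time to reach the supervisor first.
    Process.sleep(20)

    {micros, _counts} = :timer.tc(fn -> DynamicSupervisor.count_children(sup) end)
    {:ok, _pid} = Task.await(task)
    div(micros, 1000)
  end
end
```

Calling SlowInitDemo.measure(300) should report a wait close to the full 300 ms, even though counting children has nothing to do with the slow child.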

To fix this, we should avoid doing any work in init/1 that will not complete immediately. Any other work that we need to do to initialize the process needs to be done after init/1 returns but before our process handles any other messages. The fix for this one is pretty easy (in Elixir 1.7+, running on OTP 21 or later):

defmodule MyGenServer do
  use GenServer

  @impl true
  def init(_arg) do
    {:ok, nil, {:continue, :expensive_operation}}
  end

  @impl true
  def handle_continue(:expensive_operation, nil) do
    {:noreply, expensive_operation()}
  end
end

Use handle_continue/2 to do expensive initialization out of band.

This callback will be called before any other messages are processed, so there is no danger of another message sneaking in before your process is initialized.

Bonus: handle_continue/2 is also useful any time you want to do some additional work after replying to a message, for example in a handle_call/3 callback.
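As a rough sketch of that bonus use case (AckThenWork is a hypothetical module, and the sleep stands in for slow work), a handle_call/3 clause can reply immediately and hand the rest of the job to handle_continue/2:

```elixir
defmodule AckThenWork do
  use GenServer

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, :ok, opts)
  def place(pid, order), do: GenServer.call(pid, {:place, order})
  def processed(pid), do: GenServer.call(pid, :processed)

  @impl true
  def init(:ok), do: {:ok, []}

  @impl true
  def handle_call({:place, order}, _from, processed) do
    # The caller gets :accepted right away; the slow part runs afterwards.
    {:reply, :accepted, processed, {:continue, {:process, order}}}
  end

  def handle_call(:processed, _from, processed) do
    {:reply, Enum.reverse(processed), processed}
  end

  @impl true
  def handle_continue({:process, order}, processed) do
    # Stand-in for slow work done after the reply has been sent.
    Process.sleep(50)
    {:noreply, [order | processed]}
  end
end
```

Because the continue callback runs before the next message is taken from the mailbox, a processed/1 call made right after place/2 already sees the order.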

Using Supervisor.terminate_child/2

Yes, you read that right. Just using Supervisor.terminate_child/2 is asking for trouble.

Let’s talk about how Supervisor.terminate_child/2 works.

  1. It sends an exit signal with reason :shutdown to the child pid.
  2. It waits up to the shutdown timeout for the child to exit (specified in the child spec, default 5000ms for workers).
  3. If the process has not yet shut down, it sends an exit signal with reason :kill to the child pid, which cannot be trapped.
  4. It waits again for the process to exit.
  5. When the process has exited, it removes it from its list of supervised processes.
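The steps above can be sketched from the outside with plain process primitives. This is an illustration only, not the actual supervisor source; ShutdownSketch and its return values are made up for this example, and step 5 (bookkeeping inside the supervisor) is left out:

```elixir
defmodule ShutdownSketch do
  # Mirrors steps 1 to 4 above from the caller's point of view.
  def stop(pid, shutdown_ms \\ 5_000) do
    ref = Process.monitor(pid)

    # Step 1: ask the process to shut down.
    Process.exit(pid, :shutdown)

    receive do
      # Step 2: the process exited within the shutdown timeout.
      {:DOWN, ^ref, :process, ^pid, reason} ->
        {:ok, reason}
    after
      shutdown_ms ->
        # Step 3: the timeout expired, so kill the process unconditionally.
        Process.exit(pid, :kill)

        # Step 4: wait again, this time without a timeout.
        receive do
          {:DOWN, ^ref, :process, ^pid, reason} -> {:killed, reason}
        end
    end
  end
end
```

Note that the calling process sits inside a receive for the whole duration, which is exactly why the caller cannot do anything else in the meantime.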

All this waiting occupies the supervisor and prevents it from processing any other messages, which means that it cannot restart any dead processes, cannot start any new processes, and cannot terminate any other processes. In short, the supervisor will become a bottleneck.

If you have not specified otherwise, the supervisor could be taken hostage by terminate_child for 5 entire seconds, which is basically an eternity in computer-time. If you have specified a higher shutdown in the child spec of a process, then bad news, it’s going to take even longer.
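Here is a small sketch that makes the stall measurable. SlowCleanup and TerminateDemo are hypothetical names, and the sleep in terminate/2 stands in for slow cleanup. One task asks the supervisor to terminate the child while the main process times an unrelated request to the same supervisor:

```elixir
defmodule SlowCleanup do
  use GenServer

  def start_link(cleanup_ms), do: GenServer.start_link(__MODULE__, cleanup_ms)

  @impl true
  def init(cleanup_ms) do
    # Trapping exits makes terminate/2 run when the supervisor sends :shutdown.
    Process.flag(:trap_exit, true)
    {:ok, cleanup_ms}
  end

  @impl true
  def terminate(_reason, cleanup_ms) do
    # Stand-in for slow cleanup; stays below the default 5000 ms shutdown.
    Process.sleep(cleanup_ms)
  end
end

defmodule TerminateDemo do
  # Returns how long (in ms) count_children/1 waited while the
  # supervisor was busy inside terminate_child/2.
  def measure(cleanup_ms) do
    {:ok, sup} = Supervisor.start_link([{SlowCleanup, cleanup_ms}], strategy: :one_for_one)

    # The child id defaults to the module name.
    task = Task.async(fn -> Supervisor.terminate_child(sup, SlowCleanup) end)

    # Give the terminate_child request time to reach the supervisor first.
    Process.sleep(20)

    {micros, _counts} = :timer.tc(fn -> Supervisor.count_children(sup) end)
    :ok = Task.await(task)
    div(micros, 1000)
  end
end
```

With a cleanup time of 300 ms, TerminateDemo.measure(300) should report a wait of nearly 300 ms for a request that is otherwise answered almost instantly.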

How can we mitigate this? This was completely non-obvious to me, but thanks to this helpful tip from José, I know the answer and will share it with you here:

Don’t use Supervisor.terminate_child/2.

If we shouldn’t use terminate_child/2, then what should we use? We should stop the child from within, by returning {:stop, :normal, state} from one of its callbacks (or {:stop, :normal, reply, state} from handle_call/3). We also need to make sure that the restart option is set correctly in the child spec, otherwise our process will simply be restarted by the supervisor when we do this. To tell the supervisor not to restart a process if it has exited normally, we need to set restart: :transient.
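One way to set the restart option, sketched here with a hypothetical TransientWorker module, is to pass it to use GenServer so that it ends up in the generated child spec:

```elixir
defmodule TransientWorker do
  # Options given to `use GenServer` are merged into the child_spec/1
  # that the macro generates for this module.
  use GenServer, restart: :transient

  def start_link(arg), do: GenServer.start_link(__MODULE__, arg)

  @impl true
  def init(arg), do: {:ok, arg}
end
```

TransientWorker.child_spec(:arg) then returns a map that includes restart: :transient. If you would rather decide at the call site, Supervisor.child_spec({TransientWorker, :arg}, restart: :transient) overrides the option for a single child.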

You can either exit from within a GenServer when it has detected that it’s done its work, or trigger the exit from without by sending it a message asking it to exit.

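defmodule MyGenServer do
  use GenServer
  # init/1 and the other callbacks are omitted here.

  @impl true
  def handle_call(:do_exit, _from, state) do
    # Reply :ok to the caller, then stop with reason :normal.
    {:stop, :normal, :ok, state}
  end
end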
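The other option, exiting once the process has detected that its work is done, could look roughly like this. SelfStopping and SelfStoppingDemo are hypothetical names, and the sleep stands in for the actual work:

```elixir
defmodule SelfStopping do
  use GenServer, restart: :transient

  def start_link(work_ms), do: GenServer.start_link(__MODULE__, work_ms)

  @impl true
  def init(work_ms), do: {:ok, work_ms, {:continue, :run}}

  @impl true
  def handle_continue(:run, work_ms) do
    # Stand-in for the real work.
    Process.sleep(work_ms)
    # Work is done: stop with reason :normal so that a :transient
    # child is not restarted by its supervisor.
    {:stop, :normal, work_ms}
  end
end

defmodule SelfStoppingDemo do
  # Returns {exit_reason, active_children_afterwards}, or :timeout.
  def run do
    {:ok, sup} = DynamicSupervisor.start_link(strategy: :one_for_one)
    {:ok, pid} = DynamicSupervisor.start_child(sup, {SelfStopping, 100})
    ref = Process.monitor(pid)

    receive do
      {:DOWN, ^ref, :process, ^pid, reason} ->
        # Give the supervisor a moment to process the child's exit.
        Process.sleep(50)
        {reason, DynamicSupervisor.count_children(sup).active}
    after
      1_000 -> :timeout
    end
  end
end
```

The supervisor is never asked to wait for anything here. It only receives the exit notification after the child is already gone, and because the reason is :normal it has no restart to perform.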

I hope this blog post has inspired you to learn more about the Supervisor’s internals! I want to end with a link to the source code of DynamicSupervisor. If you haven’t taken a look yet, consider this your invitation to do so. Everyone can benefit from having looked at its internals to understand better how it works.

Are you aware of any other Supervision performance pitfalls? Write a short blog post about it and I’ll update this post to link to it.
