Understanding Supervisors in Gleam


In my last post, I mentioned that supervision was one of the harder concepts to wrap my head around in Gleam. Coming from Elixir, where GenServers and supervision trees feel almost magical, Gleam’s approach felt… overly explicit at first. Why do I need to pass around Subjects? Where is the via_tuple? Why can’t I just name my process and be done with it?

After spending more time building with it, I’ve come to realize the explicitness isn’t a limitation—it’s the point. Let me walk you through what I learned.

The “Let It Crash” Mindset

If you’re coming from Java or Python, supervisors feel weird. You’re used to wrapping everything in try/catch, logging the error, and hoping for the best. In the OTP world, you do the opposite—you let things crash and let the supervisor handle the fallout.

A supervisor is just a process that watches other processes and restarts them when they die. That’s it. The magic is in how you configure the restarts.

In Gleam, you build a supervisor like this:

import gleam/otp/static_supervisor
import gleam/otp/supervision

let assert Ok(sup) =
  static_supervisor.new(static_supervisor.OneForOne)
  |> static_supervisor.add(supervision.worker(my_actor))
  |> static_supervisor.start()

static_supervisor.new() takes a restart strategy. Each add() call registers a child, where supervision.worker() wraps the function that starts it (my_actor here). Then start() launches the supervisor and its children.
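To make the placeholder concrete, here is a minimal sketch of what such a start function could look like. The counter actor, its CounterMessage type, and the name start_counter are made up for illustration; only the actor and supervisor calls are the ones used elsewhere in this post.

```gleam
import gleam/erlang/process
import gleam/otp/actor
import gleam/otp/static_supervisor
import gleam/otp/supervision

// Hypothetical message type for a tiny counter actor.
pub type CounterMessage {
  Increment
}

// The start function the supervisor calls on first start and on every restart.
// Each call builds the actor from scratch, so a restarted counter begins at 0.
fn start_counter() {
  actor.new(0)
  |> actor.on_message(fn(count, msg: CounterMessage) {
    case msg {
      Increment -> actor.continue(count + 1)
    }
  })
  |> actor.start
}

pub fn main() {
  let assert Ok(_sup) =
    static_supervisor.new(static_supervisor.OneForOne)
    |> static_supervisor.add(supervision.worker(start_counter))
    |> static_supervisor.start()

  process.sleep_forever()
}
```

The point to notice is that the supervisor is handed a function, not a running process. That is what lets it start the same child again from a clean state.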

Choosing the Right Restart Strategy

This is where I made my first mistake. I started with OneForOne for everything because it sounded simplest. And for most cases, it is. But understanding the differences saved me from some subtle bugs.

OneForOne

If a child crashes, only that child restarts. Use this when your processes are independent.

static_supervisor.new(static_supervisor.OneForOne)

If I have a web server and a background worker, I don’t want the web server to restart just because the worker hit a bad input. OneForOne keeps them isolated.

OneForAll

If one child dies, everyone dies and restarts together. This sounds dramatic, but it’s essential when your processes depend on each other’s state.

static_supervisor.new(static_supervisor.OneForAll)

Imagine a connection pool and a query executor—if the pool gets corrupted, the executor is useless. You want both to restart cleanly.

RestForOne

This one tripped me up. If a child crashes, it and every child started after it are terminated and restarted.

static_supervisor.new(static_supervisor.RestForOne)

I haven’t needed this yet, but I can see it being useful when you have a pipeline where later stages depend on earlier ones.
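A rough sketch of that pipeline idea, assuming children are started in the order they are added (as with Erlang child lists). The stage names and the start_stage helper are hypothetical; the stages do no real work and only announce when they start.

```gleam
import gleam/erlang/process
import gleam/io
import gleam/otp/actor
import gleam/otp/static_supervisor
import gleam/otp/supervision

// Hypothetical pipeline stage: an actor that ignores its messages and
// only prints a line each time it is (re)started.
fn start_stage(label: String) {
  io.println("starting " <> label)
  actor.new(Nil)
  |> actor.on_message(fn(state, _msg: Nil) { actor.continue(state) })
  |> actor.start
}

pub fn main() {
  // Assumed start order: fetch, then parse, then store.
  // Under RestForOne, a crash in "parse" should restart "parse" and "store"
  // while "fetch" keeps running.
  let assert Ok(_sup) =
    static_supervisor.new(static_supervisor.RestForOne)
    |> static_supervisor.add(supervision.worker(fn() { start_stage("fetch") }))
    |> static_supervisor.add(supervision.worker(fn() { start_stage("parse") }))
    |> static_supervisor.add(supervision.worker(fn() { start_stage("store") }))
    |> static_supervisor.start()

  process.sleep_forever()
}
```

With this strategy the order of the add() calls carries meaning, so it is worth putting the children that others depend on first.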

The Part That Confused Me: Child Specs

When you add a child to a supervisor, you need a child specification. Most of the time, it’s just supervision.worker(my_start_function). But there are three restart types that control when a child gets restarted:

  • Permanent (default): Always restarted, no matter how it died.
  • Transient: Only restarted if it crashes abnormally (not a normal exit).
  • Temporary: Never restarted. Run once and done.

I found Transient useful for one-off tasks. If a batch job completes successfully, it exits normally and shouldn’t be restarted. If it crashes, you probably want it to retry.

supervision.worker(start_batch_job)
|> supervision.restart(supervision.Transient)
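For a fuller picture, here is a sketch of what start_batch_job might look like. The JobMessage type is invented for the example, and nothing in this snippet actually sends those messages; it only shows the two ways the actor can stop and how a Transient child treats each.

```gleam
import gleam/erlang/process
import gleam/otp/actor
import gleam/otp/static_supervisor
import gleam/otp/supervision

// Hypothetical messages for a one-off batch job.
pub type JobMessage {
  Finished
  Failed(reason: String)
}

fn start_batch_job() {
  actor.new(Nil)
  |> actor.on_message(fn(_state, msg: JobMessage) {
    case msg {
      // Normal exit: a Transient child stays stopped.
      Finished -> actor.stop()
      // Abnormal exit: a Transient child is restarted.
      Failed(reason) -> actor.stop_abnormal(reason)
    }
  })
  |> actor.start
}

pub fn main() {
  let assert Ok(_sup) =
    static_supervisor.new(static_supervisor.OneForOne)
    |> static_supervisor.add(
      supervision.worker(start_batch_job)
      |> supervision.restart(supervision.Transient),
    )
    |> static_supervisor.start()

  process.sleep_forever()
}
```

A panic inside the message handler counts as an abnormal exit too, so it would trigger a restart just like stop_abnormal.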

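You can also chain other options. timeout sets how many milliseconds the child gets to shut down before the supervisor kills it. significant marks a child whose exit can trigger the supervisor’s own automatic shutdown; as I understand it, this is only valid when the supervisor is configured with an auto_shutdown mode other than the default Never.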

supervision.worker(start_my_actor)
|> supervision.restart(supervision.Transient)
|> supervision.timeout(ms: 10000)
|> supervision.significant(True)

Restart Tolerance: The Silent Killer

Here’s something that bit me: supervisors have a restart intensity limit. By default, a supervisor allows 2 restarts within 5 seconds. If its children crash more often than that, the supervisor gives up, stops its remaining children, and shuts down itself.

During development, I was constantly hitting this limit while testing crash scenarios. The fix is simple:

static_supervisor.new(static_supervisor.OneForOne)
|> static_supervisor.restart_tolerance(intensity: 10, period: 5)

Set it higher during development. Tune it down for production once you know your processes are stable.
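To get a feel for the numbers, here is a small pure-Gleam model of the restart window. This is my own simplification for reasoning about intensity and period, not the code the supervisor actually runs, and the function name survives is made up.

```gleam
import gleam/list

/// Simplified model of restart tolerance.
/// `restarts` are restart timestamps in seconds, oldest first.
/// Returns True if the supervisor would still be running after all of them.
pub fn survives(restarts: List(Int), intensity: Int, period: Int) -> Bool {
  let #(_window, alive) =
    list.fold(restarts, #([], True), fn(acc, now) {
      let #(window, alive) = acc
      // Keep only earlier restarts that are still inside the period,
      // then record the current one.
      let window = [now, ..list.filter(window, fn(t) { t >= now - period })]
      // The supervisor gives up once the window holds more than `intensity`.
      #(window, alive && list.length(window) <= intensity)
    })
  alive
}
```

In this model, survives([0, 1, 2], 2, 5) is False because the third restart lands inside the same 5 seconds, while survives([0, 1, 7], 2, 5) is True because the first two restarts have aged out by second 7. Raising the intensity to 10 makes the first case True as well.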

Putting It All Together

Here’s the example I wish I had when I started. Two actors—a sender and a receiver—communicating through named processes, both supervised with OneForOne. The twist: both can crash, and the supervisor restarts them independently.

import gleam/erlang/process
import gleam/int
import gleam/io
import gleam/otp/actor
import gleam/otp/static_supervisor
import gleam/otp/supervision

pub const receiver_name_prefix = "receiver"

pub type ReceiverMessage {
  Print(Int)
}

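// Starts the receiver and registers it under `name`. The supervisor calls this
// on every (re)start, so a restarted receiver is reachable under the same name.
fn start_receiver_actor(name: process.Name(ReceiverMessage)) {
  // The receiver keeps no state, so Nil is enough.
  actor.new(Nil)
  |> actor.named(name)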
  |> actor.on_message(fn(state, msg: ReceiverMessage) {
    case msg {
      Print(number) -> {
        case number {
          5 -> panic as "crash receiver on 5"
          _ -> {
            io.println("Receiver Actor got: " <> int.to_string(number))
            actor.continue(state)
          }
        }
      }
    }
  })
  |> actor.start
}

pub type SenderMessage {
  Tick
}

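// Starts the sender. It sends itself a Tick once a second and forwards the
// running count to the receiver.
fn start_sender_actor(receiver_name: process.Name(ReceiverMessage)) {
  // A Subject built from the name, so no receiver PID is stored here.
  let receiver = process.named_subject(receiver_name)

  // 1000 is the time in ms the initialiser may take before start fails.
  actor.new_with_initialiser(1000, fn(self: process.Subject(SenderMessage)) {
    // Kick off the first tick; later ticks are scheduled in on_message.
    process.send(self, Tick)
    Ok(actor.initialised(#(receiver, 0, self)))
  })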
  |> actor.on_message(fn(state, _msg: SenderMessage) {
    let #(receiver, count, self) = state
    let next_count = count + 1

    io.println("Sender Actor sending: " <> int.to_string(next_count))

    case int.modulo(next_count, 10) {
      Ok(9) -> panic as "Intentional crash on number 9!"
      _ -> {
        process.send(receiver, Print(next_count))
        process.send_after(self, 1000, Tick)
        actor.continue(#(receiver, next_count, self))
      }
    }
  })
  |> actor.start
}

pub fn main() {
  io.println("Starting supervisor with one-for-one strategy...")

  let receiver_name = process.new_name(receiver_name_prefix)

  let assert Ok(_sup) =
    static_supervisor.new(static_supervisor.OneForOne)
    |> static_supervisor.restart_tolerance(intensity: 10, period: 5)
    |> static_supervisor.add(supervision.worker(fn() {
      start_receiver_actor(receiver_name)
    }))
    |> static_supervisor.add(supervision.worker(fn() {
      start_sender_actor(receiver_name)
    }))
    |> static_supervisor.start()

  process.sleep_forever()
}

The key insight here is named processes. The receiver registers itself with actor.named(name), and when it crashes and restarts, it re-registers under the same name. The sender builds its Subject from that name with process.named_subject() inside its start function. As far as I can tell, that Subject holds the name rather than a PID, so each send goes to whichever process currently has the name, and a restarted sender simply builds the Subject again. Without named processes, a restarted actor gets a new PID and everyone holding the old reference is out of luck.
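Here is a smaller sketch that isolates just that behaviour: one named actor, and a caller that keeps using the same Subject across a crash. The Greeting type and start_greeter are invented for the example, and the 100 ms sleeps are a crude assumption about how long the restart takes. My understanding is that sending to a name with no registered process fails, so a real program would want something sturdier than a sleep.

```gleam
import gleam/erlang/process
import gleam/io
import gleam/otp/actor
import gleam/otp/static_supervisor
import gleam/otp/supervision

// Hypothetical message type for the example.
pub type Greeting {
  Hello(String)
}

fn start_greeter(name: process.Name(Greeting)) {
  actor.new(Nil)
  |> actor.named(name)
  |> actor.on_message(fn(state, msg: Greeting) {
    case msg {
      // Crash on demand so the supervisor has something to restart.
      Hello("crash") -> panic as "asked to crash"
      Hello(who) -> {
        io.println("Hello, " <> who)
        actor.continue(state)
      }
    }
  })
  |> actor.start
}

pub fn main() {
  let name = process.new_name("greeter")

  let assert Ok(_sup) =
    static_supervisor.new(static_supervisor.OneForOne)
    |> static_supervisor.add(supervision.worker(fn() { start_greeter(name) }))
    |> static_supervisor.start()

  // Built once from the name and reused for every send below.
  let greeter = process.named_subject(name)

  process.send(greeter, Hello("before the crash"))
  process.send(greeter, Hello("crash"))
  // Assumption: 100 ms is enough for the supervisor to restart the greeter.
  process.sleep(100)
  process.send(greeter, Hello("after the restart"))
  // Give the greeter time to print before main returns.
  process.sleep(100)
}
```

The caller never learns the new PID and never needs to. That is the whole appeal of routing through the name.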

What I’d Tell My Past Self

Supervisors clicked for me when I stopped thinking of crashes as failures. A crash is just a signal saying “something went wrong, start over from a clean state.” It’s not an error to be caught—it’s a state transition to be managed.

The hardest part was getting over the reflex to handle everything defensively. In Gleam, you write the happy path, panic when things are unrecoverable, and let the supervisor deal with the consequences. Your code ends up simpler, and your system ends up more resilient.

If you’re just starting with Gleam’s OTP, start with OneForOne and named processes. That combination covers most of what you’ll need. The rest you’ll figure out when you need it, and the compiler has your back.