Skip to main content

Command Palette

Search for a command to run...

Postmortem on PacoPlay crash 2023-01-18

Updated
3 min read
Postmortem on PacoPlay crash 2023-01-18

On January 18th, 2023, at 19:00 the server for PacoPlay broke and you were no longer able to play.

The last game was https://pacoplay.com/game/6558 played between Yorick and Raimond. No moves were possible anymore and the timer broke. Yorick quickly reported that over Discord so I was able to take a look at the issue half an hour later.

After taking a look, it turned out that all the games were broken and you were no longer able to play at all. Restarting the Server would help, but every time anyone opened that game the server would crash again.

By removing the offending game from the database I was able to restart the server and this time you were able to play without issue. That was at 20:00, so we had an outage of one hour. For now, I was able to leave and play some Ticket to Ride with friends.

What was causing the issue?

To make sure we won't be hitting that same issue again, I made a copy of the database with the "evil" game inside to try things out in my development environment. Opening the game thankfully reproduced the issue and showed that the timer was indeed at fault:

thread '' panicked at 'called Result::unwrap() on an Err value: OutOfRangeError(())', src/ws/wake_up_queue.rs:27:64

The wake-up queue is the part of the server that allows it to wake up when the timer runs out and this forcefully ends the game.

The offending function is a helper method that allows me to pass a chrono::DateTime<Utc> into a function where I would usually expect a std::time::Instant:

pub fn put_utc(key: impl Into<String>, wake_up: chrono::DateTime<Utc>) {
    let wake_up = wake_up - Utc::now();
    let wake_up = std::time::Instant::now()
        + wake_up.to_std().unwrap();

    put(key, wake_up);
}

pub fn put(key: impl Into<String>, wake_up: Instant) { .. }

I am calling .unwrap() in here, so it is no surprise that this crashes. I should have handled that better. So what is the condition under which .to_std() doesn't return a proper value? Let's look at the documentation:

Creates a std::time::Duration object from time::Duration

This function errors when duration is less than zero. As standard library implementation is limited to non-negative values.

This method call asks the timer to please ring in the past. That we only see this happening now is probably the result of some rare timing condition. We try to set a timer a little bit in the future - but then wait too long and it is already in the past.

A fix to this is avoiding the unwrap. For example by just setting a 0-second timer instead.

pub fn put_utc(key: impl Into<String>, wake_up: chrono::DateTime<Utc>) {
  let wake_up = wake_up - Utc::now();
  let wake_up = std::time::Instant::now()
      + wake_up.to_std().unwrap_or(Duration::from_secs(0));

  put(key, wake_up);
}

I restarted the local server another time and lo and behold:

Since the "Timer sometimes goes into negative" issue is still unresolved, I can also see that it took me 3 hours and 45 minutes this time to locate the bug.