Application Disaster Recovery with Incus Containers

By chimo

Recently, one of my self-hosted services suffered a partial upgrade failure. These are the steps I followed to get back to a working version, then troubleshoot the issue.

I wrote earlier this year about my Incus ("LXD" back then) backup and patching strategies. Things mostly work the same today. I've been lucky enough that the software I'm using has been stable to the point where I don't recall having to restore anything from backup due to breakage. Up until recently, that is.

One of the applications running in a container failed to upgrade properly and partially broke. I say "partially" because things mostly worked, except for a few specific features. That distinction matters because it meant I didn't notice right away that something was wrong. I do my patching on Saturdays at 10am, with a backup taken just before. It wasn't until Wednesday that I noticed some URLs were returning HTTP 500 errors.

The approach I took is as follows:

  1. Take a snapshot of the current state before doing anything, in case I break it more than it already is:
    `incus snapshot create $container broken-snapshot`
  2. Try reverting to the pre-upgrade snapshot:
    `incus snapshot restore $container pre-patch-20241116`
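The two steps above can be wrapped in a small shell function so the "save the broken state first" step is never skipped. This is only a sketch: `rollback` is a hypothetical helper name, and the default snapshot name `broken-snapshot` is simply the one used above.

```shell
# rollback CONTAINER GOOD_SNAPSHOT [BROKEN_NAME]
# Hypothetical helper: keeps a snapshot of the broken state, then restores
# a known-good snapshot. It refuses to restore if the first step fails.
rollback() {
    container=$1
    good=$2
    broken=${3:-broken-snapshot}

    # Preserve the current (broken) state so it can be inspected later.
    incus snapshot create "$container" "$broken" || return 1

    # Revert the container to the known-good snapshot.
    incus snapshot restore "$container" "$good"
}
```

With the function defined, `rollback "$container" pre-patch-20241116` performs both steps in order.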

Things started working again. I was relieved that I didn't also need to restore the postgres container to an earlier version and then try to reconcile the data/structure between Saturday, November 16 and Wednesday, November 20.

A couple of days later, I had some free time to troubleshoot the upgrade issue. Since I'm the only user of this self-hosted service, I figured I'd just try the upgrade again on the "live" container. If it broke again, I could troubleshoot it, and worst case just restore a working version. The upgrade failed again, but I managed to work around the issue and got it to upgrade successfully in the end.

As a side-note, if restoring the "live" container to a previous version is not an option, copying the container (or a snapshot) to a new container is also possible:

`incus copy $container/snapshot-name new-container`
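As a rough sketch of that alternative, the copy and the start of the new container can be combined in one function. `clone_snapshot` is a hypothetical helper name, and the sketch assumes the copy is created stopped and needs an explicit `incus start`.

```shell
# clone_snapshot CONTAINER SNAPSHOT NEW_NAME
# Hypothetical helper: copies a snapshot into a new container and starts it,
# leaving the original container untouched.
clone_snapshot() {
    src=$1
    snap=$2
    new=$3

    # Create the new container from the snapshot; stop here if the copy fails.
    incus copy "$src/$snap" "$new" || return 1

    # Assumption: the copy is not running yet, so start it explicitly.
    incus start "$new"
}
```

For example, `clone_snapshot "$container" snapshot-name new-container` would give a separate container to experiment on.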

So, do I have any "lessons learned" from this? Well, one thing I want to add now is error reporting for those unattended, scheduled upgrades. For this particular case, it would be as simple as flagging when `apk upgrade` returns a failure. It might be trickier for some of the other software, but I'll start with that and see how it goes.
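To illustrate the idea, here is a minimal sketch of flagging a failed `apk upgrade`. It is not the mechanism I ended up with: `run_upgrade` and `notify_failure` are hypothetical names, the "notification" is just a line on stderr, and the sketch assumes `incus exec` passes along the exit status of the command it runs.

```shell
# notify_failure CONTAINER EXIT_STATUS
# Hypothetical notifier: writes to stderr. Swap in mail, a webhook, etc.
notify_failure() {
    printf 'upgrade failed in %s (exit status %s)\n' "$1" "$2" >&2
}

# run_upgrade CONTAINER
# Runs `apk upgrade` inside the container and flags a non-zero exit status.
run_upgrade() {
    container=$1
    if incus exec "$container" -- apk upgrade; then
        return 0
    else
        rc=$?    # exit status of the failed `incus exec` command
        notify_failure "$container" "$rc"
        return "$rc"
    fi
}
```

A scheduled job could then call `run_upgrade "$container"` and rely on both the message and the non-zero return value.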

2024-11-26 Update: I've recently added a quick & simple notification mechanism for failed upgrades.