Geert Van Hecke | IIoT

IIoT & intelligent OTA firmware updates: design for failure

As Werner Vogels, Amazon’s Vice President and CTO, famously said, “Everything fails all the time.”

Let’s face it, he’s right. No matter how hard you try to make your firmware update process fault-tolerant, failure is not a matter of ‘if’ but ‘when’.

What Vogels meant is that it essentially doesn’t matter how hard you try to avoid individual failures. They will happen, no matter what.

You can invest huge sums in the very best hardware. But who’s to say that a power cut or network outage won’t occur? However good your hardware, your processes and your team are, the occasional failure is unavoidable.

So Vogels’ message (and ours too!) is that instead of pouring all your energy and time into trying to prevent failure from happening, plan for it.

Invert the logic. Embrace failure and approach everything you do with the goal of recovering as quickly as possible from that failure.

Overcome connectivity issues

A device needs to stay online, without interruption, for the whole duration of the download. Inside your factory this will rarely be an issue. However, if your controllers use a mobile network, connectivity can be a real problem.

To overcome these connectivity issues, we should split updates into multiple small files. As a consequence, you’ll probably need multiple temporary links, one for each file.
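The splitting step could look something like the following minimal Python sketch. The function name `split_update`, the manifest layout and the 64 KiB default chunk size are assumptions made for illustration; they are not part of any specific OTA product. The manifest records a SHA-256 hash per file, plus one for the complete image, so the device can later check what it received.

```python
import hashlib


def split_update(image: bytes, chunk_size: int = 64 * 1024):
    """Split a firmware image into small files and describe them in a manifest.

    Returns (chunks, manifest). Both the manifest layout and the default
    chunk size are illustrative assumptions.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")

    # Cut the image into consecutive slices; the last one may be shorter.
    chunks = [image[i:i + chunk_size] for i in range(0, len(image), chunk_size)]

    manifest = {
        "total_size": len(image),
        # Hash of the complete image, used to verify the reassembled update.
        "image_sha256": hashlib.sha256(image).hexdigest(),
        # One entry per small file, each with its own hash.
        "chunks": [
            {"index": i, "size": len(c), "sha256": hashlib.sha256(c).hexdigest()}
            for i, c in enumerate(chunks)
        ],
    }
    return chunks, manifest
```

Each chunk would then be published behind its own temporary link, with the manifest telling the device how many files to expect.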

The IIoT device, in turn, downloads all these update files. If the download of one of these files gets interrupted, the device downloads only that file again.

Downloading one small file again is much less of an issue than downloading a complete, bigger update again. Of course, it is the responsibility of the device to check that it has correctly downloaded all update files. The device must then reassemble them into one single update.
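On the device side, the retry-and-reassemble logic might be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: `fetch` is a hypothetical callable that downloads one file by index and raises `ConnectionError` when the link drops, and the manifest is assumed to carry a SHA-256 hash per file and one for the whole image.

```python
import hashlib
from typing import Callable


class UpdateError(Exception):
    """Raised when the update cannot be downloaded or verified."""


def download_update(manifest: dict,
                    fetch: Callable[[int], bytes],
                    max_attempts: int = 3) -> bytes:
    """Download every file in the manifest, retrying only the one that fails."""
    parts = []
    for entry in manifest["chunks"]:
        for _ in range(max_attempts):
            try:
                data = fetch(entry["index"])  # hypothetical transport call
            except ConnectionError:
                continue  # interrupted: retry this file, keep the earlier ones
            if hashlib.sha256(data).hexdigest() == entry["sha256"]:
                parts.append(data)
                break  # file is complete and intact, move to the next one
        else:
            # The inner loop never hit "break": every attempt failed.
            raise UpdateError(f"file {entry['index']} failed after {max_attempts} attempts")

    # Reassemble into one single update and verify it as a whole.
    image = b"".join(parts)
    if hashlib.sha256(image).hexdigest() != manifest["image_sha256"]:
        raise UpdateError("reassembled update does not match the manifest")
    return image
```

A corrupted file is treated the same way as an interrupted one: it is discarded and fetched again, and nothing is installed unless the reassembled image matches the manifest.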

Automatic recovery

We have already covered the case where the download of the new firmware gets interrupted. That’s great. But the update process itself might get interrupted as well, because of a power interruption or another unexpected cause.

So we must still protect the device against these kinds of problems. In the event of a failed update, the device needs to roll back to its previous state.

Let’s see how we can design a process that deals with these kinds of failures: the A/B update model.

The A/B update model

Figure: OTA partitions
  1. The device boots from its boot partition, runs all apps from “app partition A”, and stores and reads all data on the “data partition”.
  2. The system performs an OTA firmware update: it installs the update on “app partition B” and updates the boot partition configuration.
  3. The device reboots, this time running all apps from “app partition B” and still storing and reading all data on the “data partition”.
  4. If “app partition B” does not load correctly, the system falls back to “app partition A” and executes the update process again.
  5. If “app partition B” loads correctly, “app partition A” synchronises with “app partition B”.
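The five steps above can be sketched as a small simulation in Python. All names here (`Device`, `apply_update`, the `boots_ok` health check) are hypothetical and only illustrate the control flow; a real device would implement this in its bootloader and update agent. The data partition is left out because the update never touches it.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Device:
    # Firmware version currently stored on each app partition.
    slots: dict = field(default_factory=lambda: {"A": "v1", "B": "v1"})
    # Partition the boot configuration currently points at (step 1).
    active: str = "A"

    def inactive(self) -> str:
        return "B" if self.active == "A" else "A"

    def apply_update(self, version: str, boots_ok: Callable[[str], bool]) -> bool:
        """Install on the inactive partition, reboot into it, fall back on failure."""
        previous, target = self.active, self.inactive()

        self.slots[target] = version   # step 2: install on the other partition
        self.active = target           # step 2: update the boot configuration

        # Step 3: reboot into the new partition (simulated by a health check).
        if not boots_ok(self.slots[target]):
            self.active = previous     # step 4: fall back to the old partition
            return False               # the caller may run the update again

        # Step 5: the old partition synchronises with the new one.
        self.slots[previous] = self.slots[target]
        return True
```

The running partition is never overwritten while it is in use, so an interruption at any point during installation leaves the device with a partition it can still boot from.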

This blog is part of our blog series IIoT Intelligent Firmware Updates.
