Another Day In The Code Mines: What your Internet Of Things startup can learn from LockState

The company LockState has been in the news recently for sending an over-the-air update to one of their smart lock products which "bricked" over 500 of these locks. This is a pretty spectacular failure on their part, and it's the kind of thing that ought to be impossible in any kind of well-run software development organization, so I think it's worthwhile to go through a couple of the common-sense processes that you can use to avoid being the next company in the news for something like this.

The first couple of these are specific to the problem of shipping the wrong firmware to a particular model. The others apply equally well to an update that's for the right target but is fundamentally broken, which is probably the more likely scenario for most folks.

Mark your updates with the product they go to
The root cause of this incident was apparently that LockState had produced an update intended for one model of their smart locks, and somehow managed to send that to a bunch of locks that were a different model. Once the update was installed, those locks were unable to connect to the Internet (one presumes they don't even boot), and so there was no way for them to update again to replace the botched update.

It's trivially easy to avoid this issue, using a variety of different techniques. Something as simple as using a different file name for each device's firmware would suffice. If not that, you can have a "magic number" at a known offset in the file, or a digital signature that uses a key unique to the device model. Digitally signed firmware updates are a good idea for a variety of other reasons, especially for a security product, so I'm not sure how they managed to screw this up.
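
To make the "magic number at a known offset" idea concrete, here's a minimal sketch in Python. Everything specific in it is my own invention for illustration: the header layout, the `LKFW` magic value, the model IDs, and the function names. I have no idea what LockState's actual firmware format looks like.

```python
import struct

# Hypothetical header at offset 0 of every image (little-endian):
# 4-byte magic, 2-byte model ID, 2-byte header version.
HEADER = struct.Struct("<4sHH")
MAGIC = b"LKFW"  # made-up magic number for this sketch


def wrap_firmware(payload: bytes, model_id: int, header_version: int = 1) -> bytes:
    """Prefix a firmware payload with a header naming its target model."""
    return HEADER.pack(MAGIC, model_id, header_version) + payload


def check_firmware(image: bytes, my_model_id: int) -> bytes:
    """Return the payload if the image targets this model, else raise ValueError."""
    if len(image) < HEADER.size:
        raise ValueError("image too short to contain a header")
    magic, model_id, _version = HEADER.unpack_from(image, 0)
    if magic != MAGIC:
        raise ValueError("not a firmware image")
    if model_id != my_model_id:
        raise ValueError(f"image is for model {model_id}, this device is model {my_model_id}")
    return image[HEADER.size:]
```

The device-side updater would call something like `check_firmware` before writing anything to flash, so an image built for the wrong model gets rejected instead of installed. A header like this only guards against mistakes, not attackers; that's what the digital signature is for.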

Have an automated build & deployment process
Even if you've got a good system for marking updates as being for a particular device, that doesn't help if there are manual steps that require someone to explicitly set them. You should have a "one button" build process which allows you to say "I want to build a firmware update for *this model* of our device", and at the end you get a build that was compiled for the right device, and is marked as being for that device.
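
One way to get there is to make the model name the only input, and derive everything else from a single table. Here's a rough sketch of that idea; the model names, IDs, toolchain entry, and file naming scheme are all placeholders I made up, and a real build script would go on to actually invoke the compiler with these settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    model_id: int   # value stamped into the update's metadata
    toolchain: str  # compiler the build should invoke


# Single source of truth. Every name and value here is invented for the sketch.
TARGETS = {
    "model-a": Target(model_id=1, toolchain="arm-none-eabi-gcc"),
    "model-b": Target(model_id=2, toolchain="arm-none-eabi-gcc"),
}


def plan_build(model: str, version: str) -> dict:
    """Derive every model-specific build setting from the model name alone."""
    try:
        target = TARGETS[model]
    except KeyError:
        raise ValueError(f"unknown model {model!r}") from None
    return {
        "toolchain": target.toolchain,
        "defines": [f"MODEL_ID={target.model_id}"],
        "model_id": target.model_id,
        "output": f"firmware-{model}-{version}.bin",
    }
```

Because the compile-time define, the stamped model ID, and the output file name all come from the same lookup, there's no step where a person can pair model A's binary with model B's label.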

Have a staging environment
Every remote firmware update process should have the ability to be tested internally via the same process end-users would use, but from a staging environment. Ideally, this staging environment would be as similar as possible to what customers use, but available in-company only. Installing the bad update on a single lock in-house before shipping it to customers would have helped LockState avoid bricking any customer devices. And, again - this process should be automated.

Do customer rollouts incrementally
LockState might have actually done this, since they say only 10% of their locks were affected by this problem. Or they possibly just got lucky, and their update process is just naturally slow. Or maybe this model doesn't make up much of the installed base. In any case, rolling out updates to a small fraction of the installed base, then gradually ramping it up over time, is a great way to ensure that you don't inconvenience a huge slice of your user base all at once.
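
A common way to do this on the server side is to hash each device's ID into a stable bucket, and only offer the update to buckets below the current rollout percentage. This is a sketch under my own assumptions (string device IDs, a per-update identifier, 100 buckets), not a description of anyone's real update server.

```python
import hashlib


def rollout_bucket(device_id: str, update_id: str) -> int:
    """Map a device to a stable bucket in 0..99 for a given update."""
    digest = hashlib.sha256(f"{update_id}:{device_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def should_offer_update(device_id: str, update_id: str, percent: int) -> bool:
    """Offer the update only to devices whose bucket is below the rollout percentage."""
    return rollout_bucket(device_id, update_id) < percent
```

Including the update ID in the hash means a different set of devices goes first for each release, so the same unlucky customers aren't always the guinea pigs. Raising `percent` only ever adds devices; nobody who was already offered the update drops back out.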

Have good telemetry built into your product
Tying into the previous point, wouldn't it be great if you could measure the percentage of systems that were successfully updating, and automatically throttle the update process based on that feedback? This eliminates another potential human-in-the-loop situation, and could have reduced the damage in this case by detecting automatically that the updated systems were not coming back up properly.
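
The throttling decision itself can be very simple. Here's a sketch that assumes your telemetry can tell you two numbers: how many devices installed the update, and how many of those checked in afterwards running the new version. The thresholds and the doubling schedule are arbitrary choices of mine.

```python
def next_rollout_percent(current: int, updated: int, checked_in: int,
                         min_sample: int = 50, min_success: float = 0.98) -> int:
    """Pick the next rollout percentage from update telemetry.

    updated:    devices that installed the update so far
    checked_in: how many of those reported back on the new version
    """
    if updated < min_sample:
        return current           # not enough data yet; hold steady
    if checked_in / updated < min_success:
        return 0                 # too many devices went silent; halt the rollout
    return min(100, current * 2)  # healthy; widen the rollout
```

Note that once this returns 0 it stays at 0, since doubling zero gets you nowhere. That's deliberate: restarting a halted rollout seems like the one place where you do want a human to look at what happened first.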

Have an easy way to revert firmware updates
Not everybody has the storage budget for this, but these days, it seems like practically every embedded system is running Linux off of a massive Flash storage device. If you can, have two operating system partitions, one for the "current" firmware, and one for the "previous" firmware. At startup, have a key combination that swaps the active install. That way, if there is a catastrophic failure, you can get customers back up and running without having them disassemble their products and send them in to you, which is apparently how LockState is handling this.

If your software/hardware allows for it, you can even potentially automate this entirely - have a reset watchdog timer that gets disabled at the end of boot, and if the system reboots more than once without checking in with the watchdog, switch back to the previous firmware.
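
The fallback logic is easier to see in code than in prose. This is a Python simulation of what a bootloader might do, not real bootloader code; the two-slot layout, the field names, and the threshold of two failed boots are assumptions I picked for illustration.

```python
from dataclasses import dataclass

MAX_FAILED_BOOTS = 2  # assumed threshold before falling back


@dataclass
class BootState:
    """State a bootloader would keep in non-volatile storage (hypothetical layout)."""
    active: str = "A"      # slot to boot from
    previous: str = "B"    # slot holding the older firmware
    failed_boots: int = 0  # boot attempts on the active slot that never confirmed


def select_slot(state: BootState) -> str:
    """Run on every reset: pick the slot to boot and count the attempt."""
    if state.failed_boots >= MAX_FAILED_BOOTS:
        # The active slot keeps failing; swap back to the other firmware.
        state.active, state.previous = state.previous, state.active
        state.failed_boots = 0
    state.failed_boots += 1
    return state.active


def confirm_boot(state: BootState) -> None:
    """Called by the firmware once it has finished booting successfully."""
    state.failed_boots = 0
```

If the new firmware hangs during boot, the watchdog resets the device before `confirm_boot` is ever reached, the counter keeps climbing, and after two failed attempts the bootloader switches slots on its own. With a freshly updated `BootState(active="B", previous="A")` and a broken slot B, three calls to `select_slot` return B, B, then A.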

Don't update unnecessarily
No matter how careful you are, there are always going to be some cases where a firmware update goes bad. This can happen for reasons entirely out of your control, like defective hardware that just happens to work with version A of the software, but crashes hard on version B. Every update you push carries some of that risk, so only ship one when it fixes something that matters.

And of course the easiest ways to avoid having to ship lots of updates are sufficient testing (so you have fewer critical product defects to fix), and reducing the attack surface of your product (so you have fewer critical security issues that you need to address on a short deadline).

Monday, August 21, 2017