Monday, August 28, 2017

A brief history of the Future

Lessons learned from API design under pressure

It was August of 2009, and the WebOS project was in a bit of trouble. The decision had been made to change the underlying structure of the OS from using a mixture of JavaScript for applications, and Java for the system services, to using JavaScript for both the UI and the system services. This decision was made for a variety of reasons, primarily in a bid to simplify and unify the programming models used for application development and services development. It was hoped that a more-familiar service development environment would be helpful in getting more developers on board with the platform. It was also hoped that by having only one virtual machine running, we'd save on memory.

Initially, this was all built on top of a customized standalone V8 JavaScript engine, with a few hooks to system services. Eventually, we migrated over to Node.js, when Node was far enough along that it looked like an obvious win, and after working with the Node developers to improve performance on our memory-limited platform.

The problem with going from Java to JavaScript

As you probably already know, despite the similarity in the names, Java and JavaScript are very different languages. In fact, the superficial similarities in syntax were only making things harder for the application and system services authors trying to translate things from Java to JavaScript.

In particular, the Java developers were used to a multi-threaded environment, where they could spin off threads to do background tasks, and have them call blocking APIs in a straightforward fashion. Transitioning from that to JavaScript's single-threaded, events and callbacks model was proving to be quite a challenge. Our code was rapidly starting to look like a bunch of "callback hell" spaghetti.

The proposed solution

As one of the most-recent additions to the architecture group, I was asked to look into this problem, and see if there was something we could do to make it easier for the application and service developers to write readable and maintainable code. I went away and did some research, and came back with an idea, which we called a Future. The Future was a construct based on the idea of a value that would be available "in the future". You could write your code in a more-or-less straight-line fashion, and as soon as the data was available, it'd flow right through the queued operations.

If you're an experienced JavaScript developer, you might be thinking at this point "this sounds a lot like a Promise", and you'd be right. So, why didn't we use Promises? At this point in history, the Promises/A spec was still in active discussion amongst the CommonJS folks, and it was not at all obvious that it'd become a popular standard (and in fact, it took Promises/A+ for that to happen). The Node.js core had in fact just removed their own Promises API in favor of a callback-based API (this would have been around Node.js v0.1, I think).

The design of the Future

Future was based on ideas from Smalltalk (promise), Java (future/promise), Dojo.js (Deferred), and a number of other places. The primary design goals were:
  • Make it easy to read through a piece of asynchronous code, and understand how it was supposed to flow, in the "happy path" case
  • Simplify error handling - in particular, make it easy to bail out of an operation if errors occur along the way
  • To the extent possible, use Future for all asynchronous control flow
You can see the code for Future for yourself, since it was released along with the rest of the WebOS Foundations library as open source, as part of the Open WebOS project.

My design brief looked something like this:
A Future is an object with these properties and methods:
  • .result: The current value of the Future. If the Future does not yet have a value, accessing the result property raises an exception. Setting the result of the Future causes it to execute the next "then" function in the Future's pipeline.
  • .then(next, error): Adds a stage to the Future's pipeline of steps. The Future is passed as a parameter to the function "next". The "next" function is invoked when a value is assigned to the Future's result, and the (optional) "error" function is invoked if the previous stage threw an exception. If the "next" function throws an exception, the exception is stored in the Future, and will be re-thrown if the result of the Future is accessed.
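
To make the brief concrete, here is a minimal sketch of how an object matching that description might be put together. This is not the actual Foundations implementation (which handled many more cases); apart from the .result and .then names above, the details are my reconstruction, just to show the contract in action:

    // A minimal sketch of the design brief above -- not the real Foundations code.
    // One Future object holds the entire pipeline as a queue of stages.
    function Future() {
        this._stages = [];          // queued { next, error } pairs
        this._hasValue = false;
        this._value = undefined;
        this._error = undefined;    // a stored exception, awaiting a handler
    }

    Object.defineProperty(Future.prototype, "result", {
        get: function () {
            if (this._error) { throw this._error; }   // re-throw stored exceptions
            if (!this._hasValue) { throw new Error("Future has no result yet"); }
            return this._value;
        },
        set: function (value) {
            this._value = value;
            this._hasValue = true;
            this._error = undefined;
            this._dispatch();       // setting the result runs the next "then" stage
        }
    });

    Future.prototype.then = function (next, error) {
        this._stages.push({ next: next, error: error });
        if (this._hasValue || this._error) { this._dispatch(); }
        return this;                // the same Future is the whole pipeline
    };

    Future.prototype._dispatch = function () {
        var stage = this._stages.shift();
        if (!stage) { return; }
        try {
            if (this._error && stage.error) {
                var err = this._error;
                this._error = undefined;
                stage.error(this, err);
            } else {
                // Run the next stage; if an error is pending, reading .result
                // inside the stage re-throws it, which keeps it propagating.
                stage.next(this);
            }
        } catch (e) {
            this._error = e;        // stored; re-thrown on the next .result access
            this._dispatch();       // give a later "error" handler its chance
        }
    };

    // Typical use: each stage reads .result, does its work, and sets a new
    // .result to hand a value to the next stage.
    var greeting = new Future();
    greeting.then(function (future) {
        future.result = future.result + ", world";
    }).then(function (future) {
        console.log(future.result);          // -> "hello, world"
    });
    greeting.result = "hello";               // kicks off the pipeline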
This is more-or-less what we ended up implementing, but the API did get more-complicated along the way. Some of this was an attempt to simplify common cases that didn't match the initial design well. Some of it was to make it easier to weld Futures into callback-based code, which was ultimately a waste of time, in that Future pretty much wiped out all competing approaches to flow control. And one particular set of changes was thrown in at the last minute to satisfy a request that should just have been denied (see What went wrong, below).

What went right

We shipped a "minimal viable product" extremely quickly

Working from the initial API design document, Tim got an initial version of Future, with all of the basics working, out to the development team in just a couple of days. We continued to iterate for quite a while afterwards, but we were able to start the process of bringing people up to speed quickly.

We did, in fact, eliminate "callback hell" from our code base

After the predictable learning curve, the former Java developers really took to the new asynchronous programming model. We went from "it sometimes kind of works", to "it mostly works" in an impressively-short time. Generally speaking, the Future-based code was shorter, clearer, and much easier to read. We did suffer a bit in ease of debugging, but that was as much due to the primitive debugging tools on Node as it was to the new asynchronous model.

We doubled-down on our one big abstraction

Somewhat surprisingly to me, the application teams also embraced Futures. They actually re-wrote significant parts of their code to switch over to Future-based APIs at a deeper level, and to allow much more code sharing between the front end and back end of the Mail application, for example. This code re-use was on the "potential benefits" list, but it was much more of a win than anyone originally expected.

We wrote a bunch of additional libraries on top of Future for all sorts of asynchronous tasks: file I/O, database access, network and telecoms, the system bus (dbus) interface. Basically anything that you might have wanted to access on the platform was available as a Future-based API.

The Future-based code was very easy to reason about in the "happy path" case

One of the best things about all this is that, with persistent use of Futures everywhere, you could write code that looked like this:
downloadContacts().then(mergeContacts).then(writeNewContacts).then(success, error)
Most cases were a bit more-complicated than that (often using inline functions), but the pattern of  only handling the success case, and just letting errors propagate, was very common. And in fact, the "error" case was, as often as not, logging a message and rescheduling the task for later.
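
A slightly more realistic (and entirely hypothetical) version of that pattern looked something like the sketch below: inline "then" stages for the happy path, plus one catch-all error handler that just logs and reschedules. The helper functions and the error-handler signature follow the minimal sketch above, not necessarily the real API:

    // Hypothetical sync task in the common style: happy-path stages inline, one
    // catch-all error handler at the end. downloadContacts, mergeContacts,
    // writeNewContacts, and scheduleRetry are invented helpers.
    function syncContacts() {
        downloadContacts()                                    // returns a Future
            .then(function (future) {
                var downloaded = future.result;               // re-throws if the download failed
                future.result = mergeContacts(downloaded);
            })
            .then(function (future) {
                writeNewContacts(future.result);
                future.result = true;
            })
            .then(function (future) {
                console.log("contact sync finished");
            }, function (future, err) {
                // The usual "error handling": log it, leave the local data alone,
                // and try the whole thing again later.
                console.warn("contact sync failed, will retry:", err);
                scheduleRetry(syncContacts, 15 * 60 * 1000);
            });
    }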

The all-or-nothing error propagation technique fit (most of) our use cases really well

The initial use case of the Future was for a WebOS feature called "Synergy". This was a framework for combining data from multiple sources into a single uniform format for the applications. So, for example, you could combine your contacts from Facebook, Google, and Yahoo into a single address book list, and WebOS would automatically de-duplicate and link related contacts, and sync changes made on the phone to the proper remote service that the data originally came from. Similarly, all of your incoming e-mail went into the same "Mail" database on-device.

In a multi-stage synchronization process like this, there are all sorts of ways that the operation can fail - the remote server might be down, or the network might be flaky, or the user might decide to put the phone into airplane mode in the middle of a sync operation. In the vast majority of cases, we didn't actually care what the error was, just that an error had occurred. When an error happened, the usual response was to leave the on-phone data the way it was, and try again later. In those cases where "fuck it, I give up" was not the right error handling strategy, the rough edges of the error-handling model were a bit easier to see.

What went wrong

The API could have been cleaner/simpler

It didn't take long before we were adding convenience features to make some of the use cases simpler. Hence, the "whilst" function on Future, which was intended to make it easier to iterate over a function that returned Futures. There were a couple of other additions that also got a very small amount of use, and could have easily been replaced by documentation of the "right" way to do things.
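
I no longer remember the exact signature of "whilst", but the general shape was something like the sketch below (reusing the minimal Future from earlier): keep calling a Future-returning body for as long as a condition holds, and complete the returned Future when the loop ends. Error propagation out of the loop is elided to keep it short:

    // A guess at the shape of "whilst": call a Future-returning body repeatedly
    // while condition() stays true, finishing the returned Future when done.
    function whilst(condition, body) {
        var loopFuture = new Future();
        function iterate() {
            if (!condition()) {
                loopFuture.result = true;     // loop finished
                return;
            }
            body().then(function (future) {
                future.result;                // throws here if the body failed
                iterate();                    // otherwise, go around again
            });
        }
        iterate();
        return loopFuture;
    }

    // Hypothetical use: drain a work queue one asynchronous job at a time,
    // where processJob returns a Future.
    var queue = [/* ... jobs ... */];
    whilst(function () { return queue.length > 0; },
           function () { return processJob(queue.shift()); })
        .then(function (future) { console.log("queue drained:", future.result); });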

Future had more-complex internal state than was strictly needed

If you look at Promises, they've really only got the minimal amount of state, and you chain functions together by returning a Promise from each stage. Instead of having lots and lots of Futures linked together to make a pipeline of operations, Future was the pipeline. I think that at some level this both decreased heap churn, by not creating a bunch of small objects, and made it somewhat easier to debug broken pipelines (since all of the stages were visible). Obviously, if we'd known that Promises were going to become a big thing in JavaScript, we would have stayed a lot closer to the Promises/A spec.
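
The difference in shape is roughly this (using the minimal Future sketched earlier, and modern Promise syntax that obviously didn't exist at the time):

    // Promises: each .then() returns a *new* promise; the pipeline is a chain
    // of small linked objects.
    Promise.resolve(41)
        .then(function (n) { return n + 1; })
        .then(function (n) { console.log("promise chain:", n); });       // -> 42

    // Future, as sketched above: .then() returns the *same* object, so the one
    // Future carries the entire queue of stages -- the Future is the pipeline.
    var f = new Future();
    f.then(function (future) { future.result = future.result + 1; })
     .then(function (future) { console.log("future pipeline:", future.result); });  // -> 42
    f.result = 41;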

Error handling was still a bit touchy, for non-transactional cases

If you had to write code that actually cared about handling errors, then the "error" function was actually located in a pretty terrible place: you'd have all these happy-path "then" functions, and one error handler in the middle. Using named functions instead of anonymous inline functions helped a bit with this, but I would still occasionally get called in to help debug a thrown exception that the developer couldn't find the source for.

It would have been really nice to have a complete stack trace for the exception that was getting re-thrown, but we unfortunately didn't have stack traces available in both the application context and the service context. In the end, "thou shalt not throw an exception unless it's uniquely identifiable" was almost sufficient to resolve this.
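
In practice, "uniquely identifiable" mostly just meant making sure that every thrown exception carried a message (or tag) that appeared exactly once in the code base, so that a log line could be traced straight back to its source. A hypothetical stage, with invented names:

    // Hypothetical pipeline stage: the error text doubles as a unique tag,
    // since we couldn't count on a cross-context stack trace to find the
    // original throw site.
    function mergeStage(future) {
        var contacts = future.result;
        if (!Array.isArray(contacts)) {
            // "SYNC-0042" appears exactly once in the code base, so any log
            // line containing it points straight at this check.
            throw new Error("mergeStage: expected an array of contacts (SYNC-0042)");
        }
        future.result = mergeContacts(contacts);   // invented helper
    }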

I caved on a change to the API that I should have rejected

Fairly late in the process, someone came to me and said "I don't like the 'magic' in the way the result property works. People don't expect that accessing a property will throw an exception, so you should provide an API to access the state of the Future via function calls, rather than property access".  At this point, we had dozens of people successfully using the .result API, and very little in the way of complaints about that part of the design.

I agreed to make the addition, so we could "try it out" and see whether the functional API was really easier or clearer to use. Nobody seemed to think so, except for the person who asked for it. Since they were using it, it ended up having to stay in the implementation, and since it was in the implementation, it got documented, which just confused later users (especially third parties), who didn't understand why there were two different ways to accomplish the same tasks.


How do I feel about this, 8 years later?

Pretty good, actually. Absent a way to see into the future, I think we made a pretty reasonable decision with the information we had available. The Bedlam team did an amazing bit of work, and WebOS got rapidly better after the big re-architecting. In the end, it was never quite enough to displace any of the major mobile OSes, but I still miss some features of Synergy, even today. After all the work Apple has done over the years to improve contact sync, it's still not quite as good (and not nearly as open to third parties) as our solution was.

Monday, August 21, 2017

What your Internet Of Things startup can learn from LockState

The company LockState has been in the news recently for sending an over-the-air update to one of their smart lock products which "bricked" over 500 of these locks. This is a pretty spectacular failure on their part, and it's the kind of thing that ought to be impossible in any kind of well-run software development organization, so I think it's worthwhile to go through a couple of the common-sense processes that you can use to avoid being the next company in the news for something like this.

The first couple of these are specific to the problem of shipping the wrong firmware to a particular model, but the others apply equally well to an update that's for the right target, but is fundamentally broken, which is probably the more-likely scenario for most folks.

Mark your updates with the product they go to
The root cause of this incident was apparently that LockState had produced an update intended for one model of their smart locks, and somehow managed to send that to a bunch of locks that were a different model. Once the update was installed, those locks were unable to connect to the Internet (one presumes they don't even boot), and so there was no way for them to update again to replace the botched update.

It's trivially-easy to avoid this issue, using a variety of different techniques. Something as simple as using a different file name for firmware for different devices would suffice. If not that, you can have a "magic number" at a known offset in the file, or a digital signature that uses a key unique to the device model. Digitally-signed firmware updates are a good idea for a variety of other reasons, especially for a security product, so I'm not sure how they managed to screw this up.
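
Any of those checks comes down to a few lines of code on the device side. As a purely illustrative sketch (the header layout, magic value, and model name here are all made up), the updater simply refuses any image whose header doesn't name the model it's running on:

    // Illustrative only: refuse to flash an image whose header doesn't match
    // this device. The 16-byte header (8-byte magic + 8-byte model string) is
    // an invented layout, not anything LockState actually uses.
    const fs = require("fs");

    const MAGIC = Buffer.from("LOCKFW1\0", "ascii");   // hypothetical magic number
    const THIS_MODEL = "RL-4000";                      // hypothetical model ID

    function updateIsForThisDevice(imagePath) {
        const header = Buffer.alloc(16);
        const fd = fs.openSync(imagePath, "r");
        fs.readSync(fd, header, 0, 16, 0);             // read the first 16 bytes
        fs.closeSync(fd);

        const magicOk = header.slice(0, 8).equals(MAGIC);
        const model = header.slice(8, 16).toString("ascii").replace(/\0+$/, "");
        return magicOk && model === THIS_MODEL;
    }

    if (!updateIsForThisDevice("/tmp/update.img")) {
        console.error("update rejected: wrong or unrecognized device model");
        process.exit(1);                               // leave the current firmware alone
    }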

Have an automated build & deployment process
Even if you've got a good system for marking updates as being for a particular device, that doesn't help if there are manual steps that require someone to explicitly set them. You should have a "one button" build process which allows you to say "I want to build a firmware update for *this model* of our device", and at the end you get a build that was compiled for the right device, and is marked as being for that device.

Have a staging environment
Every remote firmware update process should have the ability to be tested internally via the same process end-users would use, but from a staging environment. Ideally, this staging environment would be as similar as possible to what customers use, but available in-company only. Installing the bad update on a single lock in-house before shipping it to customers would have helped LockState avoid bricking any customer devices. And, again - this process should be automated.
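
One cheap way to get this, assuming your devices fetch updates from a configurable endpoint (an assumption on my part, along with all the names and URLs below), is to make the update channel a build-time setting, so that exactly the same client code can be pointed at staging first:

    // Illustrative: the same update-check code, aimed at staging or production
    // by a build-time setting. The channel names and URLs are invented.
    const UPDATE_CHANNELS = {
        staging:    "https://updates-staging.example.com/firmware",
        production: "https://updates.example.com/firmware"
    };

    // Baked in at build time; internal test builds get "staging".
    const CHANNEL = process.env.UPDATE_CHANNEL || "production";

    function updateUrl(model, currentVersion) {
        return UPDATE_CHANNELS[CHANNEL] +
            "?model=" + encodeURIComponent(model) +
            "&version=" + encodeURIComponent(currentVersion);
    }

    console.log(updateUrl("RL-4000", "1.2.3"));        // same hypothetical model as above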

Do customer rollouts incrementally
LockState might have actually done this, since they say only 10% of their locks were affected by this problem. Or possibly they just got lucky, and their update process is naturally slow. Or maybe this model doesn't make up much of the installed base. In any case, rolling out updates to a small fraction of the installed base, then gradually ramping it up over time, is a great way to ensure that you don't inconvenience a huge slice of your user base all at once.
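
The usual trick here (a generic sketch, not anything LockState-specific) is to derive a stable number from each device's ID and compare it against the current rollout percentage, so the eligible population grows as you turn the dial up:

    // Generic sketch of percentage-based rollout gating: hash the device ID into
    // a stable bucket from 0 to 99, and only offer the update to buckets below
    // the current rollout percentage.
    const crypto = require("crypto");

    function rolloutBucket(deviceId, updateId) {
        const digest = crypto.createHash("sha256")
            .update(deviceId + ":" + updateId)     // per-update, so buckets reshuffle
            .digest();
        return digest.readUInt16BE(0) % 100;       // stable bucket in [0, 99]
    }

    function isEligible(deviceId, updateId, rolloutPercent) {
        return rolloutBucket(deviceId, updateId) < rolloutPercent;
    }

    // Example: start by offering the update to roughly 10% of devices.
    console.log(isEligible("lock-00017", "fw-2.3.1", 10));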

Have good telemetry built into your product
Tying into the previous point, wouldn't it be great if you could measure the percentage of systems that were successfully updating, and automatically throttle the update process based on that feedback? This eliminates another potential human-in-the-loop situation, and could have reduced the damage in this case by detecting automatically that the updated systems were not coming back up properly.
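
A server-side sketch of that feedback loop might be as simple as comparing healthy check-ins against installs before widening the rollout. The thresholds and field names below are invented:

    // Illustrative auto-throttle: only keep widening the rollout while updated
    // devices are actually checking back in as healthy.
    function nextRolloutPercent(stats, currentPercent) {
        // stats = { installed, checkedInHealthy } for the current update
        if (stats.installed < 50) {
            return currentPercent;                 // too little data yet; hold steady
        }
        const successRate = stats.checkedInHealthy / stats.installed;
        if (successRate < 0.95) {
            console.warn("success rate " + successRate.toFixed(3) + "; halting rollout");
            return 0;                              // stop offering the update entirely
        }
        return Math.min(100, currentPercent * 2);  // healthy: double the audience
    }

    console.log(nextRolloutPercent({ installed: 900, checkedInHealthy: 880 }, 10));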

Have an easy way to revert firmware updates
Not everybody has the storage budget for this, but these days, it seems like practically every embedded system is running Linux off of a massive Flash storage device. If you can, have two operating system partitions, one for the "current" firmware, and one for the "previous" firmware. At startup, have a key combination that swaps the active install. That way, if there is a catastrophic failure, you can get customers back up and running without having them disassemble their products and send them in to you, which is apparently how LockState is handling this.

If your software/hardware allows for it, you can even potentially automate this entirely - have a reset watchdog timer that gets disabled at the end of boot, and if the system reboots more than once without checking in with the watchdog, switch back to the previous firmware.
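
A sketch of the boot-side half of that idea is below. On real hardware this logic usually lives in the bootloader or an early init script rather than in application code, and the file paths and partition-switching mechanism here are invented:

    // Illustrative boot-time check: count how many times we've tried to boot the
    // current firmware, and fall back to the previous partition if we keep failing.
    const fs = require("fs");

    const ATTEMPTS_FILE = "/var/boot-attempts";
    const MAX_ATTEMPTS = 2;

    function readAttempts() {
        try { return parseInt(fs.readFileSync(ATTEMPTS_FILE, "utf8"), 10) || 0; }
        catch (e) { return 0; }
    }

    function switchActivePartition(which) {
        // Stub: on a real device this would flip a bootloader flag so the next
        // boot uses the other OS partition, and then trigger a reboot.
        console.log("would switch the active firmware partition to:", which);
    }

    // Called once the system is fully up and has checked in with the update
    // service, so that a successful boot resets the counter.
    function markBootSucceeded() {
        fs.writeFileSync(ATTEMPTS_FILE, "0");
    }

    const attempts = readAttempts() + 1;
    fs.writeFileSync(ATTEMPTS_FILE, String(attempts));

    if (attempts > MAX_ATTEMPTS) {
        console.error("firmware looks broken after " + attempts + " boot attempts; reverting");
        switchActivePartition("previous");
    }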

Don't update unnecessarily
No matter how careful you are, there are always going to be some cases where a firmware update goes bad. This can happen for reasons entirely out of your control, like defective hardware that just happens to work with version A of the software, but crashes hard on version B.

And of course the easiest way to avoid having to ship lots of updates is sufficient testing (so you have fewer critical product defects to fix), and reducing the attack surface of your product (so you have fewer critical security issues that you need to address on a short deadline).


Sunday, August 13, 2017

Why I hate the MacBook Pro Touchbar

The Touchbar that Apple added to the MacBook Pro is one of those relatively-rare instances in which they have apparently struck the wrong balance between form and function. The line between “elegant design” and “design for its own sake” is one that they frequently dance on the edge of, and occasionally fall over. But they get it right often enough that it’s worth sitting with things for a while to see if the initial gut reaction is really accurate.

I hated the Touchbar pretty much right away, and I generally haven’t warmed up to it at all over the last 6 months. Even though I’ve been living with it for a while, I have only recently really figured out why I don’t like it.

What does the Touchbar do?

One of the functions of the Touchbar, of course, is serving as a replacement for the mechanical function keys at the top of the keyboard. It can also do other things, like acting as a slider control for brightness, or allowing you to quickly select elements from a list. Interestingly, it’s the “replacement for function keys” part of the functionality that gets most of the ire, and I think this is useful for figuring out where the design fails.

What is a “button” on a screen, anyway?

Back in the dark ages of the 1980s, when the world was just getting used to the idea of the Graphical User Interface, the industry gradually settled on a series of interactions, largely based on the conventions of the Macintosh UI. Among other things, this means “push buttons” that highlight when you click the mouse button on them, but don’t actually take an action until you release the mouse button. If you’ve used a GUI that takes actions immediately on mouse-down, you might have noticed that it feels a bit “jumpy”, and one reason the Mac, Windows, and iOS (mostly) perform actions on release is exactly because action on mouse-down feels “bad”.

Why action on release is good:

Feedback — When you mouse-down, or press with your finger, you can see what control is being activated. This is really important to give your brain context for what happens next. If there’s any kind of delay before the action completes, you will see that the button is “pressed”, and know that your input was accepted. This reduces both user anxiety, and double-presses.

Cancelability — In a mouse-and-pointer GUI, you can press a button, change your mind, and move the mouse off before releasing to avoid the action. Similar functionality exists on iOS, by moving your finger before lifting it. Even if you hardly ever use this gesture, it’s there, and it promotes a feeling of being in control.

Both of these interaction choices were made to make the on-screen “buttons” feel and act more like real buttons in the physical world. In the case of physical buttons or keyswitches, the feedback and the cancelability are mostly provided by the mechanical motion of the switch. You can rest your finger on a switch, look at which switch it is, and then decide that you’d rather do something else and take your finger off, with no harm done. The interaction with a GUI button isn’t exactly the same, but it provides for “breathing space” in your interaction with the machine, which is the important thing.
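
For anyone who hasn’t built one of these, the whole “highlight on press, bail out by sliding off, act on release” behavior is only a handful of lines of event handling. A rough browser-flavored sketch:

    // Rough sketch of the classic push-button interaction: highlight on press,
    // cancel if the pointer leaves, and only fire the action on release.
    function makeButton(element, action) {
        var armed = false;                         // pressed, and still over the button

        element.addEventListener("mousedown", function () {
            armed = true;
            element.classList.add("pressed");      // feedback: you can see what you hit
        });

        element.addEventListener("mouseleave", function () {
            armed = false;                         // cancelability: slide off to bail out
            element.classList.remove("pressed");
        });

        element.addEventListener("mouseup", function () {
            element.classList.remove("pressed");
            if (armed) { action(); }               // the action only happens on release
            armed = false;
        });
    }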

The Touchbar is (mostly) action on finger-down

With a very few exceptions, the Touchbar is designed to launch actions on finger-down. This is inconsistent with the rest of the user experience, and it puts a very high price on having your finger slip off of a key at the top of the keyboard. This is exacerbated by bad decisions made by third-party developers like Microsoft, who ridiculously put the “send” function in Outlook on the Touchbar, because if there was ever anything I wanted to make easier, it’s sending an email before I’m quite finished with it.

How did it end up working that way?

I’m not sure why the designers at Apple decided to make things work that way, given their previous experience with GUI design on both the Mac and iOS. If I had to take a guess, the logic might have gone something like this:

The Touchbar is, essentially, a replacement for the top row of keys on the keyboard. Given that the majority of computer users are touch-typists, then it makes sense to have the Touchbar buttons take effect immediately, in the same way that the physical keyswitches do. Since the user won’t be looking at the Touchbar anyway, there’s no need to provide the kind of selection feedback and cancelability that the main UI does.

There are a couple of places where this logic goes horribly wrong. First off, a whole lot of people are not touch typists, so that’s not necessarily the right angle to come at this from. Even if they were, part of the whole selling point of the Touchbar is that it can change, in response to the requirements of the app. So you’re going to have to look at it some of the time, unless you’ve decided to lock it into “function key only” mode. In which case, it’s a strictly-worse replacement for the keys that used to be there, and you’re not getting the benefits of the reconfigurability.

Even if you were going to use the Touchbar strictly as an F-key replacement, it still doesn’t have the key edges to let you know whether you’re about to hit one key or two, so you’ll want to look at what you’re doing anyway. I know there are people out there who use the function keys without looking at them, but the functions there are rarely-enough used that I suspect the vast majority of users end up having to look at the keyboard anyway, in order to figure out which one is the “volume up” key, or at least the keyboard-brightness controls.

How can Apple fix this?

First, and primarily, make it less-likely for users to accidentally activate functions on the Touchbar. I think that some amount of vertical relief could make this failure mode less-likely, though I’m not sure if the Touchbar should be higher or lower than it is now. I have even considered trying to fabricate a thin divider to stick to the keyboard to keep my finger from accidentally activating the “escape” key, which is my personal pet-peeve with the Touchbar.

A better solution to that problem is probably to include some amount of pressure-sensitivity and haptic feedback. The haptic home button on the iPhone 7 does a really good job of providing satisfying feedback without any visuals, so we know this can work well. Failing that, at least some way to bail out of hitting the Touchbar icons would be worth pursuing - possibly making them less sensitive to grazing contact, though that would increase the cases where you didn’t activate a button while actually trying to.

Another option would be bringing back the physical function keys. This doesn’t necessarily mean killing the Touchbar, but maybe just moving it away from the rest of the keyboard a bit. This pretty much kills the “you can use it without taking your eyes off the screen or your hands off the home row” argument, but I’m really not convinced that’s at all realistic, unless you only ever use one application.

So, is Apple going to do anything to address these issues?

Honestly? You can never tell with them. Apple’s history of pushing customers forward against their will (see above) argues that they’ll pursue this for a good while, even if it’s clearly a bad idea. On the other hand, the pressure-sensitivity option seems like the kind of thing they might easily decide to add all by themselves. In the meantime, I’ll be working out my Stick On Escape Key Guard…