Tuesday, September 26, 2006

Another thread on . . . threads

Thanks for reading...

First, I want to thank everybody who read Part I, especially those of you who made comments on it. I'm going to address a couple of those comments and questions first, then proceed to my philosophy of How not to shoot yourself in the foot when writing multi-threaded code in C-like languages.

In a completely non-technical aside, one of my previous articles somehow got listed on both digg and reddit, and now random people on the Internet are making cogent, well-reasoned responses to it, and to my previous posts. I feel like a "real blogger" now. Thanks, and I'll try not to let it go to my head. It's a bit ironic, in that the original purpose of this blog was to help me get over my fear of writing, and now that I know that I have an audience, it's even harder...

Okay, back to threads...

Graham Lee pointed out that Mach threads can in fact be configured to conform to something like the no-state shared model. All you have to do is create a new task, use vm_inherit() to disallow any sharing of memory regions with the old task, and Bob's your Uncle. That's a good point, and something that I might have glossed over. In many cases, you can get a separation of state between threads by doing a little additional work outside the pthreads-style interface.

Reimer Mellin mentioned that the CSP model had been around for quite some time before Occam was invented. That's true - the initial paper describing CSP was apparently published in 1978, whereas Occam didn't hit the scene until 1983 or so, when the Transputer first started to become available. Apparently, Tony Hoare (the inventor of CSP) wrote a book on a more formalized version of CSP in 1985. It's available online, but if you're not a mathematician, it might be rough going. Personally, I find that the more funky symbols used in a piece of writing, the harder it is to read. Hoare's book uses lots of symbols - there's even a six page long "glossary of symbols".

Some Dos and Don'ts

These are in no particular order, and simply represent some different ways of slicing the multi-programming pie. One or more of them may apply to your next project...

Do consider whether you need to use threads at all

Sometimes what you actually want is a separate process, in the heavy-weight, OS-level process sense. If you think about it, one program doing two things at once isn't fundamentally all that different from two programs doing one thing each. Yeah, I know, all that overhead, spawning a whole new process, setting up IPC with named pipes or whatever... But have you ever actually measured the overhead of creating a process, or transferring a few megabytes of data between two processes on the same machine?

I've done a couple of simple, two-process (GUI and background server) applications on both Mac OS and Windows, and you might well be surprised by how well this design works in practice. Of course, if your 'background' process just ends up spinning its wheels inside some hideously-complex calculation, or you actually need to send a lot of data between the GUI and the calculation engine, then you haven't actually solved your problem, and you'll have to do something more sophisticated.

Don't use threads to avoid blocking on I/O

Unless you're programming on some seriously old, backwater OS, you should have other options for your file and network I/O that don't involve waiting for the I/O to complete. This is very dependent on what platform you're using. Try hitting your favorite search engine with the terms "async I/O" or "nonblocking I/O" to read about the various options available. The complexity of these async I/O approaches can seem a little daunting, until you realize that in the simple-seeming "create a thread for background I/O" model, the complexity is all still there, it's just not as easy to see.

Do know what each thread in your program is for

You need to have an identified set of responsibilities for each thread in your system. Without a clear idea of what each thread is responsible for, you'll never be able to figure out what your data-sharing strategy needs to be. If you use UML or CRC cards to model your system, or even if your "design" is a bunch of clouds and arrows on a whiteboard, you need to be able to determine which parts of the system can run concurrently, and what information they need to share. Otherwise, you're doomed.

Don't reinvent the wheel

It's harder than you might think to write code that's truly thread-safe. You'd be well advised to see what's been done already for your language & environment of choice. If someone has already gone to the effort of creating thread-safe data structures for you to use, then use them, don't create your own.

For example, if you're already running your "main" GUI thread in an event-processing loop, consider using that message queue as your communication channel between threads. The .NET 2.0 framework provides a class called BackgroundWorker specifically to address the "trivial background calculation in a GUI app". The design of BackgroundWorker is worth reading about (Google it), even if you're on another platform. It's a nice, simple way to manage a second thread for background processing in a GUI application.

Do consider developing a strategy for detecting and/or avoiding deadlocks

Let's get this out of the way - in any non-trivial shared-memory system with conventional locking semantics, you'll never be able to predict ahead of time whether on not a deadlock will occur. I'm told there's a proof that in the general case, predicting deadlocks is equivalent to the infamous Halting Problem, which you've perhaps heard of before. If you have a reference to a research paper on this, let me know - I'd like to beat some people over the head with it. Despite all that, it's relatively easy to detect when the system is deadlocked.

Don't spawn threads in response to external events

This is really just a special case of know what each thread in your program is for. It's hard enough to coordinate all the concurrency in your program with a static set of threads. Adding in the additional complication of unknown numbers of active threads at any given time is sheer insanity.

Also, given that there's some amount of overhead involved for each thread that you create or have active, scaling up the number of threads as load increases will often have the perverse effect of decreasing throughput by attempting to improve it..

Do consider a message-passing design

I mentioned this in Part I, but you might want to consider using the message passing model, even if you're working in a shared-memory world. The basic rule here is to avoid modifying any global state from within more than one thread. When you send a message from one thread to another, you pass in all the data it'll need to access in order to complete its job. Then, you don't touch those data structures from anywhere else until the other thread is done working with them.

The only real hurdle in implementing this strategy is in keeping up the separation between threads, despite not having any language-level support for the desired partitioning. You need to be really careful to not accidentally start sharing data between threads without intending to (and without having a plan).

Don't hold a lock or semaphore any longer than actually necessary

In particular, never hold a lock across a function call. Now, this might seem a bit extreme, but remember, we're trying to manage complexity here. If you can see all the places where a lock can be acquired and released all at once, it's easier to verify that it's actually acquired and released in the right places. Holding locks for the shortest time practical also shortens the window in which you can experience a deadlock, if you've made some other mistake in your locking strategy.

Do stay on the well-trodden path

The producer-consumer model, thread pools and work queues all exist for a reason. There's a solid theoretical underpinning for these designs, and you can find robust, well tested implementations for most any environment you might be working in. Find out what's been done, and understand how it was done, before you go off half-cocked, inventing you own inter-thread communication and locking mechanisms. If you don't understand the very low-level details of how (and when) to use the "volatile" qualifier on a variable, or you haven't heard of a memory barrier, then you shouldn't be trying to implement your own unique thread-safe data structures.

Do use multiple threads to get better performance on multi-processor systems

If your program is running on a multi-processor or multi-core computer (and chances are that it will be, eventually) you'll want to use multiple threads to get the best possible performance.

Moore's Law, and what the future holds

Welcome to the multi-core era

I can't find the excellent blog post I was reading on this subject just yesterday, but here's an article by Herb Sutter that hits the high points. The bottom line is that you're not going to see much improvement in the performance of single-threaded code on microprocessors in the near future. In order to make any kind of performance headway with the next couple generations of processors, your code needs to be able to distribute load over multiple processes or threads.

The future is now

Desktop PCs with 4 processors are already readily available. Sun's UltraSparc T1 has 8 cores on one chip, and can execute 32 threads "simultaneously", under ideal conditions. Even Intel's Itanium is going multi-core, a dramatic departure from the instruction-level parallelism that was supposed to be the hallmark of the EPIC architecture (but that's a story for another time).

Some time in the very near future, the programs that you're writing will be executing on systems with 8, 16, or more processors. If you want to get anything near the peak level of performance the hardware is capable of, you're going to need to be comfortable with multi-processor programming.

Everything old is NUMA again

It's perhaps a trite observation that yesterday's supercomputer is tomorrow's desktop processor. Actually, I think it's more like there is a tide in processor design, that hits the supercomputer world, then hits the mainstream a couple decades or so later, when the high-performance folks have moved on to something else.

In the 1980's, supercomputers were all about high clock-speed vector (SIMD) processing, which is where the current generation of desktop chips have stalled out. Clock speeds aren't going to massively increase, and the vector capabilities of the Pentium and PowerPC processors, while impressive, are still limited in the kinds of calculations they can accelerate. And the processor designs are so very complex, that it's hard to imagine that there are many more tricks available to get more performance-pre-clock out of the existing designs.

When the supercomputer folks hit their own megahertz and design complexity wall, they went through their own muti-core CPU era, then in rapid succession to massively parallel MIMD systems, then to the super-cluster computers we see these days. It seems reasonable to expect an explosion of processors in desktop systems too, and for much the same reason - the standard SMP shared memory model doesn't scale well.

In particular, cache coherency becomes a major performance issue in shared-memory multi-processor systems as the number of processors increases. The conventional wisdom says that a design where all memory is shared can scale to 4-8 processors. This is obviously dependent on memory performance, cache architecture, and a number of other factors. Perhaps worryingly, this means we're not only at the start of the multi-core era in the desktop world, we're also about one processor generation away from the end of it. Gee, that went by pretty fast, didn't it?

So, what's next?

Going by the "20 years behind supercomputers" model, the Next Big Thing in desktop processors would be massively-parallel architectures, with locally-attached memory. You'd expect to see something like the Connection Machine, or the Transputer-based systems of the 90's. Given the advances in process technology, you might even be able to fit hundreds of simple processors on a single chip (actually, some folks have already done that for the DSp market).

However, the desktop computer market has shown a remarkable reluctance to embrace new instruction sets. So a design using hundreds or thousands of very simple processors with fast locally-attached memory isn't likely to succeed the currently ascendant IA32/IA64 Intel architecture. So where do we go from here? I think Intel is going to keep trying to wring as much performance out of their now-standard two chip, multiple cores per chip design. They can certainly do some more clever work with the processor caches, and with a little help from the OS, they can try to minimize thread migration.

Ultimately that approach is going to run out of steam though, and when that happens, there's going to be a major shift in the way these systems are designed and programmed. Through the multi-core era, and even into the beginning of the massively parallel era which will inevitably follow, you ought to be able to get away with following the pthreads model. You might need to think about processor affinity and cache sharing in ways you don't have to now, but it'll at least be familiar territory.

When really massively-parallel systems start to become more common, the programming model will have to change. The simplicity of implementation of the shared-memory model will inevitably give way to more explicitly compartmentalized models. What languages you'll likely use to program these beasts is an interesting question - most likely, it'll be a functional language, something like Haskell, or Erlang. I've been lax in getting up to speed on functional programming, and I'm going to make an effort to do better. I recommend that you do the same.

5 comments:

Graham Lee said...

"When you send a message from one thread to another, you pass in all the data it'll need to access in order to complete its job."


Actually, I'd recommend against that ;-). I'm still in favour of by-value passing between threads, but if there's a large amount of data to be sent, then it's worth considering a model in which you send a message along the lines of "this work is ready for you to do, and when you send me a reply message then I'll blit you the data" which defers passing huge gobbets of data around until (and of course iff) they're actually going to be used. Although I suppose a copy-on-write memory model might reduce the effects of that.

Mark Bessey said...

Yeah, that's a good point. You don't want to duplicate a bunch of data just to enforce partitioning between threads. There ought to be a better way to do this in a high-level language, like some kind of smart reference that automatically handles the hand-off between threads...

Anonymous said...
This comment has been removed by a blog administrator.
Anonymous said...

I have a question if you don't mind.

I am about to begin a C++ programming project that will involve some parallel processing. There are two reasons for the parallel processing; the first is to protect the master process from the inability of the slave process to complete its task (which seems to be unavoidable), the second is take advantage of multiple processors when available on a given machine.

The architecture that I am considering is similar to what you discuss in "Do consider whether you need to use threads at all" in that I intend to have multiple separate processes running. This way the slave process(s) will have their own memory, and if they blow up, the master process should be unaffected. I don't believe that the slave process needs to have access to the master process data structures because the instructions to the slave process are simple and each processed item is independent.

The master process will parse a large task from an input file and distribute the subunits to whatever slave process is currently available with no load balancing. The master process collects data sent back by the slave process and creates output to the disk. I think this is no different than a PVM, just run on one machine.

I know very little about this kind of programing, such as how the slave processes are launched by the master process and how the processes communicate with each other, since they do not share address space. Is this kind of architecture handled with MPI stuff, or is there a standard method for setting this kind of thing up?

In the past I have been able to keep essentially the same code base for Windows and Linux. This is in part because the is no GUI and the software is run from the command line.

Any suggestions that you could make to point me in the wright direction (and away from the wrong direction), would be greatly appreciated.

ninjalj said...

The complexity of these async I/O approaches can seem a little daunting, until you realize that in the simple-seeming "create a thread for background I/O" model, the complexity is all still there, it's just not as easy to see.

Indeed, it could probably be argued that pthread-style threads (i.e. threads with preemptive multitasking) introduce more complexity than non-blocking/async IO.

The argument for non-blocking IO vs threads would go as follows: non-blocking IO usually leads to event-driven systems. Event driven systems are duals of cooperative threading (cf. On the Duality of Operating System Structures). Any complexity delta between event-driven systems and cooperative threading is merely due to representational complexity. Preemptive multithreading adds the fundamental complexity of preemptiveness.

The case for async IO should be similar. While async IO has the OS doing IO behind the process' back, that's strictly limited to IO, and is probably dual to a cooperative threading system where the IO threads can only do IO.