Monday, November 22, 2010

Intel talks Kilocore processors, coherency wall

At least somebody took notes at the talk given by Intel (specifically Timothy Mattson) at the SC'10 conference this year, which speculates on what computer architecture will be required to make a thousand on-chip cores useful.  The message is that these systems are do-able, but that the current method of shared memory programming will not survive the transition.

Shared memory has been a problem in supercomputing since the beginning.  In the shared memory programming paradigm, all of the different processors and cores can see each other's data and do not need to send messages with complete data structures, but can instead send addresses to each other.  This is similar to emailing YouTube links instead of attaching entire mpeg files, which is of course much more practical for long HD videos that might not fit in the recipient's inbox.

The issue that arises is that the various processors in a supercomputer are different distances from the hardware that is actually holding the data in memory, and this leads to different amounts of delay and bandwidth pressure when accessing the data.  The term "Non-Uniform Memory Access" (NUMA) is applicable in this case, and in the shared memory model the programmer doesn't have a lot of control over which processors are near or far from the data, and therefore they have less control over the performance.  This is the price for creating and using the abstraction that all of the processors share memory when in reality they are all on a network with a variable number of hops between nodes and memory.

To date, the solution has been to lash many computers together in a network cluster, called distributed computing, and obliterate the illusion of shared memory completely, using a distributed programming paradigm where programmers send messages between the processors explicitly.  For example, "Send( Processor_1, "This is a message to processor 1")" is a type of command that programmers would use in this environment to communicate between processors.  This is the method used at Google exclusively before MapReduce, and is still used when MapReduce is not applicable.

Intel is now allowing its researchers to talk about how the concept of shared memory won't even extend within a single server node much longer (metaphorically, all emails will have to send entire videos, not YouTube links).  The issue is that even when all of the cores in a node reside on the same chip, they are different distances from each other, creating a NUMA effect, and the current method of hiding this effect with separate memory caches that talk to each other in order to present a single unified memory, called cache coherency, is not sustainable into the Kilocore realm.  It's not new knowledge that this model cannot be sustained in the future, but it is new that Intel is allowing its researchers to admit to the existence of the "coherency wall".  The statements are couched in the condition that the talk is discussing thousand-core systems, Kilocore processors, which are a long way off for Intel, who's current strategy is to build fewer fast processors rather than many simple processors.

An interesting subtext is that, not only is shared memory programming not viable in the Kilocore future, but that even within the alternative, message passing, Intel is predicting a "synchronous messages only" constraint that allows small on-core buffers to satisfactorily hold the communication data.  In synchronous message passing, the "send" command does not complete until the receiver performs a corresponding "recv" command, to clear the buffer.  The proposed RCCE protocol is a little excessively constrained in that it should be possible for the sender to proceed, but be limited to not sending an additional send command until the previous send has had a corresponding recv.  In the YouTube email example, this is akin to an email server only allowing new video attachments to be sent if the recipient has already deleted all previous emails from that sender.

I think the more strict synchrony is implemented for simplicity's sake, and an option to remove this restriction is available, though only in the "gory" implementation of the RCCE message passing library.  It should be noted that using a "gory" build mode option on an experimental library is, in a way, its own reward, since it gives experience points in the programming demigod class ;-D

1 comment:

  1. Clued-up post! Your accepted wisdom is great. Thanks for keep me notify. For more information I will be in touch.
    broadcast audio processor for FM & HD