Monday, November 7, 2011

128-bit SIMD is dead. Long live 256-bit SIMD.

Linus recently made known his preference for bigger cores, and the death of 128-bit SIMD is definitely a step in that direction.  As announced back in 2008, Intel now considers 256-bit SIMD to be standard, and AMD has followed with their all-purpose Bulldozer line of processor models.

History: The top-500 supercomputer list was the first consistently updated ranking of supercomputers, and because they were first they are considered the standard.  Although high performance computing can now be considered as a particular configuration of cloud computing resources, it was originally dedicated almost exclusively to scientific pursuits.  As is suggested by scientific notation, it is necessary for science to represent numbers where the placement of the decimal point is flexible, and controllable through the use of the secondary "exponent" number.  Thus the number, its exponent, and the sign (positive or negative) combine to represent numbers used in science using the finite resources (bits) available in computers.  Although representations with 3 (half precision, 16-bit) and 6 (32-bit single precision) significant digits are available, science has almost exclusively used the double precision representation which provides a whopping 15 digits of accuracy.  The top-500 list therefore decided to rank supercomputers by their double-precision performance (Floating point operations per second, or FLOPs), which is measured using the Linpack benchmark, whose Double-precision General Matrix Multiply library, or DGEMM, causes a performance bottleneck dependent upon carrying out the inner-product of matrix fragments.  (Interestingly, it can in fact be more efficient to carry out an outer-product function within the registers in order to reduce the number of memory reads/writes required to feed the register file a sufficient number of operands.. but that is for another post).

Using what are now old-style 128-bit SIMD systems it is possible to hold two 64-bit double precision numbers in each register.  The typical operation that is performed on these registers is A = A + B*C (the so-called multiply accumulate, or MAC, although accumulation suggests that the operands being summed are of the same sign, which significantly simplifies the design of the adder), and the addition and multiplication are each considered a floating point operation, therefore the performance of a MAC on 128-bit SIMD registers results in four double-precision floating point operations.  The register file is designed to be sufficiently large, and the cache bandwidth and secondary SIMD memory operation issue port sufficient to allow each core to initiate a SIMD MAC each cycle, storing the results in registers that will not be read for several cycles, in order to allow the multi-cycle latency of the MAC unit time to fully generate the MAC results.  This results in four FLOPs per core per cycle (each core typically has one SIMD register file and ALU).

With 256-bit SIMD a total of four 64-bit operands can be held in each register, allowing the initiation of a total of eight FLOPs per core per cycle, thereby doubling potential performance.  This new wider SIMD unit became standard with the addition of Intel's AVX instruction to the x86 standard.  Doing a little math, we can see that Fujitsu's announcement of a 23.2 Petaflop system using 16-core processors at 1.848ghz delivering 236.5 GFlops each is using a 256-bit SIMD unit.

The extension of general purpose processors to include 256-bit SIMD units as standard narrows the gap between GPU and CPU at SIMD operations.  China's fastest supercomputer, which previously held the world title, is a heterogeneous supercomputer which uses GPUs with 512-bit wide SIMD.  Intel's Many-integrated-core architecture also touts 512-bit wide SIMD.

At just 2x, the difference between general purpose SIMD units and vector processors (GPUs) has never been so small.  The main difference between these architectures is no longer the SIMD units, but the performance of the cache and the cache coherency simulating a shared memory between the cores.  Intel continues to attempt to push the coherency boulder up the hill, and if third tries are truly a charm, then the cache coherent 512-bit SIMD cores of the Larrabee 3 hybrid, aka Knight's Corner, will be something to see.

Saturday, November 5, 2011

Linus on big vs small cores

There's a discussion over at involving Linus Torvalds, David Kanter, and Michael Schuette where they argue the merits of big cores vs small cores.  If Carmack, Sharky, Anand, and Stokes were to get in on it then it would be a most epic cpu discussion (hint hint).

Thursday, November 3, 2011

The MOST EFFICIENT pseudo random number generator

I exumed Marsaglia's Complementary Multiply with Carry (CMWC) algorithm years ago in pursuit of implementing a really efficient random number generator on a 32-bit system that has a full 64-bit result from its 32x32 multiplier.  It so happens that in this case CMWC is really the most efficient algorithm that can  be thought up, and about a million times more efficient than Mersenne Twister, while Marsaglia has shown in multiple articles that its randomness appears to be as good as any existing psuedo-random number generator.

It was not easy to find all of the necessary information to implement it - I had to follow several threads from old newsgroups, and find copies of the referenced papers in particular in order to find a suitable value for "a", which is one of the factors in the repetitive multiplication.  I implemented it using four 32-bit seeds, which creates a very low likelihood of repetition (before the development of our Sun into a Red Giant).  I thought I was so clever to have unearthed such a great algorithm, so well suited to the hardware I was programming.

Well it turns out that Wikipedia now includes everything you need to know about CMWC.  They include an example C implementation with an apparently proper sample "a" value, and even use a default implementation of four 32-bit seeds.  This must be how it feels to finish your dissertation's bibliography just before Google Scholar comes out and electronic copies of journal articles are ubiquitous.  That Wikipedia page would have saved me A LOT of time...

Anyway, for everyone trying to generate random numbers quickly, check out CMWC - from my experience it is faster than anything else out there, and plenty random.

Edit: The Wikipedia implementation is for 4096 seeds of 32-bits, so not very compact relative to the four seed version - although changing to four seeds is not difficult, generating an "a" value that works for four seeds still requires a little bit of research.