Monday, November 7, 2011

128-bit SIMD is dead. Long live 256-bit SIMD.

Linus recently made known his preference for bigger cores, and the death of 128-bit SIMD is definitely a step in that direction.  As announced back in 2008, Intel now considers 256-bit SIMD to be standard, and AMD has followed with its all-purpose Bulldozer line of processors.

History: The TOP500 list was the first consistently updated ranking of supercomputers, and because it was first it became the standard.  Although high-performance computing can now be viewed as a particular configuration of cloud computing resources, it was originally dedicated almost exclusively to scientific workloads.  As scientific notation suggests, science needs to represent numbers where the placement of the decimal point is flexible, controlled by a secondary "exponent" field.  A sign bit, a significand, and an exponent thus combine to represent the numbers used in science within the finite resources (bits) available in computers.  Although formats with roughly 3 significant decimal digits (half precision, 16-bit) and roughly 7 (single precision, 32-bit) are available, science has almost exclusively used the double-precision (64-bit) representation, which provides a whopping 15-16 digits.  The TOP500 list therefore ranks supercomputers by their double-precision performance (floating-point operations per second, or FLOPS), measured using the LINPACK benchmark, whose performance bottleneck is the double-precision general matrix multiply routine, or DGEMM, which spends its time carrying out inner products of matrix fragments.  (Interestingly, it can in fact be more efficient to carry out an outer-product formulation within the registers in order to reduce the number of memory reads/writes required to feed the register file a sufficient number of operands... but that is for another post.)

Using what are now old-style 128-bit SIMD units, each register holds two 64-bit double-precision numbers.  The typical operation performed on these registers is A = A + B*C, the so-called multiply-accumulate, or MAC (although "accumulate" suggests that the summands share a sign, which significantly simplifies the design of the adder).  The addition and the multiplication each count as a floating-point operation, so one MAC on 128-bit SIMD registers yields four double-precision floating-point operations.  The register file is sized, and the cache bandwidth and secondary SIMD memory-issue port provisioned, so that each core can initiate a SIMD MAC every cycle, storing the results in registers that will not be read for several cycles, giving the multi-cycle MAC unit time to fully generate its results.  The result is four FLOPs per core per cycle (each core typically has one SIMD register file and ALU).

With 256-bit SIMD, each register holds four 64-bit operands, allowing the initiation of eight FLOPs per core per cycle and thereby doubling potential performance.  This new wider SIMD unit became standard with the addition of Intel's AVX instructions to the x86 architecture.  Doing a little math on Fujitsu's announcement of a 23.2-petaflop system built from 16-core processors at 1.848 GHz delivering 236.5 GFLOPS each, we get eight FLOPs per core per cycle, consistent with a 256-bit SIMD MAC unit.

The extension of general-purpose processors to include 256-bit SIMD units as standard narrows the gap between GPU and CPU at SIMD operations.  China's fastest supercomputer, which previously held the world title, is a heterogeneous machine that pairs CPUs with GPUs featuring 512-bit-wide SIMD.  Intel's Many Integrated Core architecture also touts 512-bit-wide SIMD.

At just 2x, the difference between general-purpose SIMD units and vector processors (GPUs) has never been so small.  The main difference between these architectures is no longer the SIMD units, but the performance of the cache, and of the cache coherency that presents a shared memory across the cores.  Intel continues to attempt to push the coherency boulder up the hill, and if the third try is truly the charm, then the cache-coherent 512-bit SIMD cores of the Larrabee 3 hybrid, a.k.a. Knights Corner, will be something to see.
