Monday, October 18, 2010

CELL University Challenge

Ashok Chandrashekar is an amazing guy.  During his first few months of graduate school at Dartmouth he taught himself how to program the CELL processor, and even how to debug it, which was much harder.  Errors that show up only in hardware were particularly vicious, and the catch-all "bus error" actually gives no information about what went wrong.  Still, in about one month he wrote every line of code that went into our entry for the CELL University Challenge, which resulted in winning the grand prize (thanks also to Jay Moorkanikara, who originally had the idea to submit the entry).

For that month, working past midnight day after day with Ashok was one of the best experiences of my life, and our workarounds for the problems we encountered are why I think we won.  The biggest and most strategic workaround was discovered while we were trying to reduce our "bit-vector dot-product" (dot-product where all the elements are 0 or 1) to 3 cycles.  This was possible because the CELL processor ingeniously implemented the "pop-count" instruction which counts all of the 1's in a binary integer (e.g. popcount(1000100100001) = 4).  One of the claims-to-fame for architectures like Itanium was the hardware pop-count instruction, which required significant dedicated hardware in the architecture design.  Itanium and other architectures count the bits in the entire integer, but counting 32 or 64 bits requires a lot of logic to complete in a single clock cycle.  Someone at IBM had the notion to count bits in smaller fields, namely each 8-bit field, separately.  For large integers, it is much easier to count multiple 8-bit fields separately and store the totals in separate 8-bit regions, which allows the results of multiple pop-counts to be summed (on the CELL, summing 31 or fewer pop-counts has no overflow danger).

With the CELL pop-count instruction, it is possible to perform a 128-bit bit-vector dot-product in just 3 cycles: AND, popcount, ADD, and repeat along the entire vector's length.  Before the sums outgrow their 8-bit boundaries they must be aggregated, e.g. to 16-bits, but that is pretty simple to do.  And of course both input vectors must be loaded into registers as well, but those loads are hidden by the CELL's second execution port which can handle simultaneous loads/stores to/from the local store memory.

During the last stages of our implementation we encountered a throughput problem: much more than 3-cycles was required per 128-bits, and the reason for this was not obvious.  It turns out the CELL processor does not contain a bypass network in the traditional sense, meaning that values that exit the ALU for register writeback are not immediately available in the next cycle as inputs (a capability provided in most modern architectures by their bypass network - in fact the Pentium 4 had a half-cycle throughput and latency for simple instructions).  The CELL is designed this way because the bypass network is an expensive piece of hardware in terms of silicon area and power consumption, and as clock cycles scale (3.2 ghz in the CELL, which was a very high clock for 90nm technology) the turnaround time of the bypass network must decrease to achieve single-cycle latency (there is a similar requirement for branch prediction which we will cover in a future post).  Furthermore, the ability of typical modern processors to process out-of-order (OOO) allows other instructions to be scheduled whose inputs are in fact available, but the CELL uses an in-order design instead of OOO to reduce the silicon area and power requirements of the CPU (thereby increasing the number of cores that can fit in each processor, increasing overall throughput).

Instead of a full latency-hiding bypass network and OOO execution, IBM relies on the compiler to intelligently schedule instructions, but this didn't work in our case (something we are ironically thankful for, since it made our contest submission more impressive, hehe).  This may have been due to limitations in the compiler's scheduler, such as the size of the window in which rearrangements are looked for.  Our workaround was to unroll the inner loop and then syncopate the operations of several loop iterations like this:

AND A1, B1 -> C1
AND A2, B2 -> C2
AND A3, B3 -> C3
AND A4, B4 -> C4
AND A5, B5 -> C5
AND A6, B6 -> C6


POPCOUNT C1 -> D1
POPCOUNT C2 -> D2
POPCOUNT C3 -> D3
POPCOUNT C4 -> D4
POPCOUNT C5 -> D5
POPCOUNT C6 -> D6

ADD D1, E1 -> E1
ADD D2, E2 -> E2
ADD D3, E3 -> E3
ADD D4, E4 -> E4
ADD D5, E5 -> E5
ADD D6, E6 -> E6

The very large (though multi-cycle latency) register file in the CELL processor was able to simultaneously hold all of the temporary values without issue.  This bit-vector dot-product sequence works on vector chunks of 768 bits, which evenly divided our input vectors.  This scheduling allows the AND, POPCOUNT, and ADD instructions to have 6 cycles of latency headroom before their outputs are needed as inputs.  It also amortizes the cost of the looping branch over 18 instructions instead of just 3, further increasing throughput.

With the $10k in prize money our team went to Vegas and had a very crazy week that was eventually ripped off by Rockstar Games and incorporated into Grand Theft Auto 4.  :-D just kidding, I think I spent my portion paying off a credit card.  Oh well!

-Andrew

No comments:

Post a Comment