Tuesday, November 23, 2010

Intel releases x86-FPGA hybrid - app store to follow?

The guys over at Altera have gotta be busting out the champagne - they're officially in bed with Intel, and if history is any indication (Microsoft), that is a good place to be.  Intel is integrating an Altera FPGA with an Atom processor in the same package and calling it the Atom E600C series.  To do this, the package's small metal lid (the part that gets squished by the heat-sink-fan (HSF) unit) houses two chips, an Intel Atom processor and an Altera FPGA, connected by very small wires, presenting the appearance of a single physical chip.  Without cutting open the metal case it would be impossible to tell this apart from an FPGA and a processor on the same die.  This is the same technique Intel used to unexpectedly reach quad-core processors before AMD (even though AMD's SRQ and HyperTransport fabric were demonstrated to be superior for multiprocessing).

The E600C-series Atom processor is a single-core Tunnel Creek model, but it is no slouch.  It tops out at 1.6ghz and supports hyperthreading, whereas the highest performing Atom overall is a 1.83ghz dual-core Pineview, also hyperthreaded (Atom hyperthreading provides a huge performance improvement, as we will show in a future post, on the order of 66% of what a second core would provide).

Despite marketing hype, this is not the first time a processor has been tightly integrated with an FPGA - RISC cores have come embedded in FPGAs since Xilinx released the Virtex-II Pro with on-chip PowerPC cores back in 2002.  What is new is that the processor is x86, allowing execution of code that can't be recompiled (i.e. proprietary libraries) very close to the FPGA.  This closeness should not only increase bandwidth and decrease latency of processor-FPGA communication, it should also result in much lower power than a separate FPGA board.  For example, at Cognitive Electronics we have used an Altera Cyclone III 65nm FPGA on Altera's starter kit, and the power consumption measured for the FPGA chip was approximately 200 milliwatts whereas the entire board consumed over 3 watts.  Indeed it is hard to think of an add-on card that comes in at less than 2 watts.  So, instead of doubling the power consumption of the Atom processor (TDP 2.7 watts @ 600mhz), the in-package FPGA should increase it by only about 10%.  Even though this improvement becomes less dramatic in the context of total system power, it is still quite significant.
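
To make the comparison concrete, here is the back-of-the-envelope arithmetic (a rough Python sketch using the approximate figures above):

    # Rough power-overhead comparison, using the approximate figures above
    atom_tdp_w   = 2.7   # Atom TDP at 600mhz
    fpga_chip_w  = 0.2   # Cyclone III chip measured on the starter kit
    fpga_board_w = 3.0   # whole starter-kit board, standing in for an add-on card

    print(f"in-package FPGA chip: +{fpga_chip_w / atom_tdp_w:.0%}")   # ~7%, i.e. "about 10%"
    print(f"separate FPGA board:  +{fpga_board_w / atom_tdp_w:.0%}")  # ~111%, i.e. more than doubling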

Intel could have designed their own FPGA ASIC pretty easily - FPGAs are essentially millions of very small SRAMs connected together with programmable wiring.  The hardest part of creating a good FPGA product is the Electronic Design Automation (EDA) software, which takes in a hardware design written in a hardware description language (HDL) and processes it through many stages: conversion to a gate-level netlist ("Synthesis"), mapping the gates onto the resources of the FPGA ("Map"), and determining the location of those components on the FPGA and routing programmable wires to connect them up ("Place-and-Route").  Although Intel uses EDA software to design its processor ASICs, after the synthesis stage FPGA EDA software and ASIC EDA software are quite different.  Furthermore, Intel's internal EDA software might not actually be that good, because Intel is known to hand-craft most of the components of its processors (a labor-intensive "Full Custom" ASIC design process).

One of the nicest aspects of this new Intel system is its potential to serve as a standard platform.  Currently, the small number of open cores available for FPGAs can't be installed without recompilation for the specific target device - so Intel's new Atom-FPGA platform may be the first where you can just download binaries of FPGA programs and run them.  Maybe this will be the start of the first FPGA App store?  I can see the commercials now: "Want to perform Fast Fourier Transforms?  There's an app for that."

Monday, November 22, 2010

Intel talks Kilocore processors, coherency wall

At least somebody took notes at the talk given by Intel (specifically Timothy Mattson) at the SC'10 conference this year, which speculated on what computer architecture will be required to make a thousand on-chip cores useful.  The message is that these systems are doable, but that the current method of shared memory programming will not survive the transition.

Shared memory has been a problem in supercomputing since the beginning.  In the shared memory programming paradigm, all of the different processors and cores can see each other's data and do not need to send messages with complete data structures, but can instead send addresses to each other.  This is similar to emailing YouTube links instead of attaching entire mpeg files, which is of course much more practical for long HD videos that might not fit in the recipient's inbox.

The issue that arises is that the various processors in a supercomputer are different distances from the hardware that is actually holding the data in memory, and this leads to different amounts of delay and bandwidth pressure when accessing the data.  The term "Non-Uniform Memory Access" (NUMA) is applicable in this case, and in the shared memory model the programmer doesn't have a lot of control over which processors are near or far from the data, and therefore they have less control over the performance.  This is the price for creating and using the abstraction that all of the processors share memory when in reality they are all on a network with a variable number of hops between nodes and memory.

To date, the solution has been to lash many computers together in a network cluster, called distributed computing, and obliterate the illusion of shared memory completely, using a distributed programming paradigm where programmers send messages between the processors explicitly.  For example, "Send( Processor_1, "This is a message to processor 1")" is a type of command that programmers would use in this environment to communicate between processors.  This is the method used at Google exclusively before MapReduce, and is still used when MapReduce is not applicable.
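
For readers who haven't seen it, here is roughly what explicit message passing looks like in practice - a minimal sketch using the mpi4py bindings (assuming an MPI runtime is installed), not the exact command from the example above:

    # Run with something like: mpiexec -n 2 python send_example.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    if rank == 0:
        # The entire data structure is shipped to processor 1, not just an address
        comm.send("This is a message to processor 1", dest=1, tag=0)
    elif rank == 1:
        message = comm.recv(source=0, tag=0)
        print(message)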

Intel is now allowing its researchers to talk about how the concept of shared memory won't even extend within a single server node much longer (metaphorically, all emails will have to include entire videos, not YouTube links).  The issue is that even when all of the cores in a node reside on the same chip, they are different distances from each other, creating a NUMA effect, and the current method of hiding this effect - separate memory caches that talk to each other in order to present a single unified memory, called cache coherency - is not sustainable into the Kilocore realm.  It's not new knowledge that this model cannot be sustained, but it is new that Intel is allowing its researchers to admit to the existence of the "coherency wall".  The statements are couched in the condition that the talk is discussing thousand-core systems, Kilocore processors, which are a long way off for Intel, whose current strategy is to build fewer fast processors rather than many simple processors.

An interesting subtext is that not only is shared memory programming not viable in the Kilocore future, but that even within the alternative, message passing, Intel is predicting a "synchronous messages only" constraint that allows small on-core buffers to satisfactorily hold the communication data.  In synchronous message passing, the "send" command does not complete until the receiver performs a corresponding "recv" command to clear the buffer.  The proposed RCCE protocol is a little excessively constrained here: it should be possible for the sender to proceed after a send, as long as it does not issue another send until the previous one has been matched by a recv.  In the YouTube email example, this is akin to an email server only allowing new video attachments to be sent if the recipient has already deleted all previous emails from that sender.
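
To make the semantics concrete, here is a toy model of a synchronous send/recv pair (a Python sketch built on a one-slot channel, not the actual RCCE API):

    # send() does not return until recv() has taken the message, so a one-slot
    # buffer per channel is always enough
    import threading, queue

    channel = queue.Queue(maxsize=1)     # tiny "on-core" buffer

    def send(msg):
        channel.put(msg)     # blocks if a previous message is still unclaimed
        channel.join()       # blocks until the receiver calls task_done()

    def recv():
        msg = channel.get()
        channel.task_done()  # releases the matching send()
        return msg

    receiver = threading.Thread(target=lambda: print("got:", recv()))
    receiver.start()
    send("hello, core 1")    # returns only after the receiver has the message
    receiver.join()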

I think the stricter synchrony is implemented for simplicity's sake, and an option to remove this restriction is available, though only in the "gory" implementation of the RCCE message passing library.  It should be noted that using a "gory" build option on an experimental library is, in a way, its own reward, since it gives experience points in the programming demigod class ;-D

Friday, November 19, 2010

China takes top supercomputer spot

It is now well known that China has taken the top supercomputer spot with a gigantic GPU cluster.  It's not surprising that GPUs are able to power the Linpack benchmark to such great heights.  Linpack benefits from a division of bigger tasks into smaller tasks known as "blocking".  This is not an embarrassingly parallel breakdown, which means the result doesn't necessarily represent the performance that could be expected on data-parallel benchmarks, because network bandwidth and latency, on-server bandwidth and latency, and on-chip bandwidth and latency are all put to the test.  Where Nvidia really delivers value is in memory bandwidth: roughly 200 GBytes/sec per GPU, whereas a commodity server node gets about 8-12 GBytes/sec.  With such high bandwidth it is possible for the GPUs to approach their maximum SIMD capabilities, and for double-precision SIMD floating point operations they just scream relative to commodity processors.

Linpack is very close to the type of application GPUs were originally invented for.  With Nvidia being slowly pushed out of the discrete graphics processor business by Intel and AMD integrating increasingly capable graphics processors directly onto the CPU die, Nvidia had to branch out and add supercomputer-like capabilities to their graphics cards in order to capture market share in the HPC space.  They added double-precision, ECC, caching, and some other features.  I wouldn't have thought the result would be a migration of the top supercomputer to China, but that is indeed what has happened.

The Renaissance of IT

You have to appreciate Songnian Zhou, CEO of Platform Computing.  Here he shines in an interview that in fact merits a post just on his fantastic quotes:

"Cloud?  What is cloud?  I don't know.  Everybody's trying to get what cloud is.  Dark cloud, white cloud, big cloud, floating cloud... but... is that hype?  I believe there's some substance behind it.  I think this is probably the biggest invention in computing, and the business model of computing, in the last 30 years... one can say the cloud is the endpoint of distributed computing, where things are so distributed, so accessible through the Internet, that it represents the democratization, and popularization, and mainstreaming of HPC."

"So if you think about the word HPC, it gives the concept of very high performance, esoteric [applications] maybe in the cloud. In fact this methodology of using computing to study nature, to design products, to optimize businesses, to have entertaining Internet games, all these things are very broadly applicable."

"If you look at computing, at the IT industry, we are in adolescence at best, and we have been operating in the stone age... You run your own computers, you run your own servers, you run your own applications... you must be crazy.  You don't have to own the place... the point is that the IT industry is entering into a mature stage, it's a lot like the auto industry, just like the transportation industry in the sense that computing and applications will deliver services at low cost, more accessible, doing more interesting things... the range of applications because of the enablement of cloud computing, is going to grow ten... a hundred times.  And this is now the renaissance of IT."

Zhou has some interesting points, like that every mature industry eventually becomes service based, and there is other great stuff in that 10 minute interview, including a great metaphor likening the cloud to the airline industry (which happens to fly airplanes through clouds... nice).

Thursday, November 18, 2010

The age of "good enough" computing

In the last few years, consumers have been very clear in showing that they care about better CPU performance only insofar as it enables new features they want.  Hennessy agrees (yes, the Hennessy), claiming that one of the big demands for CPU performance now comes from "the Googles of the world", meaning people want more products and features from cloud computing.  The other big demand for CPU performance comes from users who want a better user interface - and a failure to innovate on the user-interface side of things has resulted in a lack of demand for CPU performance.

AMD would agree, and they have taken on the corollary that people want the qualitatively best user experience available today at the lowest power (i.e. best battery life) and lowest cost possible.  Enter AMD's hybrid CPU/GPU processors, called accelerated processing units (APUs), which deliver the best graphical experience possible within 9w (Ontario) and 18w (Zacate) for netbooks and laptops respectively.  AMD has given up chasing Intel's single-threaded performance, which, as long as Moore's Law continues and Intel maintains a process technology lead, will arguably never be beaten again.

"Good enough" can also be applied to operating systems, and the strong user base of Microsoft's 9-year-old OS is evidence that WindowsXP was indeed good enough.  I remember when Bill Gates told me and a small crowd of interns that Microsoft's biggest competitor is free software.  Not open source software or Linux, but free software of the type that users have already bought from Microsoft and is now free to operate for all time.  So originally it was not free, but it doesn't bring any money into Microsoft and Microsoft must compete against it to produce more sales.  That means Bill thought that Microsoft had been fighting "good enough" for quite some time, and foretold of the longevity of their greatest OS.

An interesting side-effect of x86 processors' progression toward multi-core is that it is now possible for an outsider to throw away multicore processing in favor of one really fast core.  It is interesting to think of what might have been, or what could still be, if the design decisions that resulted in the 3.8ghz Pentium 4 were put into action today, at 22nm, four process technologies beyond the highest clock speed processor ever released by Intel.  Although it wouldn't deliver 16x the performance, it would still run at 7ghz+ with 24MB+ of cache and would beat today's 3.3ghz processors at single-threaded applications (i.e. almost every desktop application) by a good margin.  But that processor would consume 130 watts, which is not "good enough" for the mobile computing that users are trending toward today.  Nor is it good enough for cloud computing, which requires thousands of cores to execute its parallel applications.

Progress in the world of caches

I was researching the Sun Niagara III, aka SPARC T3, and came across an interesting aspect: there is no L3 cache.  Each core gets an L2 cache near the increasingly popular size of 256KB (Tilera, Intel Nehalem, Sandy Bridge), though slightly larger at 384KB.  This is coupled with L1 caches that are surprisingly small relative to the number of threads (8) supported per core: an 8KB L1 data cache and a 16KB L1 instruction cache - that's 1KB and 2KB respectively per thread in the case where the memory is not usefully shared, though realistically all of the instruction cache can be shared between the 8 threads on each core.  Furthermore, by supporting 8 threads per core and topping out at just 1.65ghz, it is realistic for the local L2 cache to have a latency under 8 cycles, so there is close to no penalty for hitting the L2 cache for data fairly frequently.  This suggests the L1 data cache is just there to reduce the number of accesses to L2, freeing up L2 bandwidth and reducing power by replacing higher-power L2 accesses with lower-power L1 accesses.

Although they both lack L3 and use distributed and similarly sized L2 caches, there are some interesting architectural divergences between the Niagara 3 and Tilera's to-be-released-some-time-in-2011 Gx-100.  Tilera doesn't multithread, so their in-order cores will take a performance hit whenever they access memory, even on an L1 hit, which suggests that programmers will need to use asynchronous memory transfers or that the Tilera tools somehow insert these calls automatically (with the same caveat all such automated program analysis tools have: they sometimes work, and sometimes don't).  In contrast, the Niagara 3's hardware multithreading is naturally tolerant of the latency of accessing memories belonging to other cores on the same chip.

It really is an open question whether tiled routing like Tilera's will take off - Intel's cloud research processor also used the method, as did their Terascale research chip.  In contrast, the mainstream Sandy Bridge and CELL processors use ring buses, making rings a relatively proven architecture, though without modification cross-chip latency will scale linearly rather than with the square root of the number of cores.  This linear latency scaling has less impact right now since, in the existing ring bus examples, the caches themselves have latency similar to or greater than the latency resulting from core-to-core data passing.

AMD and Intel continue forward with their large L3 caches, with Intel having transitioned to a ring-bus style L3 from Nehalem to Sandy Bridge, which has reportedly had a favorable impact on L3 latency (though that seems somewhat counterintuitive).  The IBM BlueGene/P uses cross-bar access to an 8MB L3 cache with relatively high latency at 35 cycles (that is a really great review of the architecture, btw), but this is not unusual since a cross-bar is a standard method for 4 cores or fewer.

So the jury is not just out on whether giant high-latency L3 caches will continue to prevail, but also on whether the largest on-chip cache will use a ring bus, tiled mesh, hierarchical, or other topology as chips continue to progress to ever larger core counts.

Tuesday, November 16, 2010

Technicalities of "x" and "%"

When something is "2x faster", does that mean it is 200% faster?  I have seen "x" and "%" sometimes used in such a way that the answer would be yes, and sometimes no.  My impression from working in the field of computer architecture and hardware acceleration is that there is a general rule that is followed, so since the terminology is somewhat ambiguous, Mac and I formalized how we use them some time back.

1) We have a formal understanding of what it means to be Y% faster: it means the new performance is 100% + Y% of the original.  Slower works in the same way but with subtraction, so Y% slower means the new performance is 100% - Y%.  Therefore if A is 50% faster than B, B is 33% slower than A.

2) Higher-performance x's work similarly but without adding the 100%.  So 3x faster means the final performance is 300% of the original.  Lower-performance x's work quite differently, by turning the x into a division sign "/": 5x slower means the new performance is 1/5, or 20%, of the original performance.  Therefore if A is 30x faster than B, B is 30x slower than A.  (See the sketch below.)
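
Here are the two conventions written out as code (a minimal Python sketch with hypothetical helper names, just to pin down the definitions):

    def performance_from_pct_faster(y_pct):
        # "A is Y% faster than B" -> A's performance as a multiple of B's
        return 1.0 + y_pct / 100.0

    def performance_from_x_faster(x):
        # "A is Xx faster than B" -> A's performance as a multiple of B's
        return float(x)

    def pct_slower_given_pct_faster(y_pct):
        # If A is Y% faster than B, how many percent slower is B than A?
        return (1.0 - 1.0 / performance_from_pct_faster(y_pct)) * 100.0

    print(performance_from_x_faster(3))        # 3.0  -> 300% of the original
    print(pct_slower_given_pct_faster(50))     # 33.3 -> 50% faster <-> 33% slower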

Here's an example of Mozilla using the same definition for higher performance x's.  The reference is regarding the performance of Firefox's new beta - one nice thing is that there is an instance where they round up from 2.94x to 3x.  They also round from 3.49x down to 3x.  In these cases they are rounding to one significant digit.  At Cognitive we typically use two or three significant digits and round down, or whichever direction is a more conservative estimate for Cognitive's performance.

Wednesday, November 10, 2010

Where MapReduce fault tolerance comes from

Google's MapReduce is a great programming paradigm.  It takes data parallelism for granted, runs on however many processors are available, and keeps running even if some of the computers crash, get unplugged, or catch fire.

MapReduce is able to tolerate faults for an interesting reason.  During the Map phase, the Master Controller (not sure if that's the official label, but if not this is better anyway :) assigns data chunks to each map worker node, and the death of a map worker just means that the Master Controller must reassign those Map inputs to a different node.

Here's the interesting part: Each mapper caches the outputs produced during the map phase and separates them according to which reducer node will receive them as inputs.  If a reducer node crashes, gets unplugged, or gets too cold, the Master Controller tells a different reducer node to use the mapper caches to finish up the missing processing.

It works this way because clever MapReduce designers realized they could cache the map results in *memory* until a sufficiently large chunk could be sent to disk as an efficient sequential write.  The hard drive caching never became a performance bottleneck because the sequential writes and sequential reads are faster than the Gigabit Ethernet connecting the nodes.
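
Here is a toy sketch of that partition-and-spill idea (plain Python, not Google's implementation; the file names and buffer limit are made up):

    import json

    NUM_REDUCERS = 4
    BUFFER_LIMIT = 64 * 1024 * 1024          # spill after ~64MB buffered (arbitrary)

    buffers = {r: [] for r in range(NUM_REDUCERS)}
    buffered_bytes = 0

    def partition(key):
        return hash(key) % NUM_REDUCERS      # which reducer will want this key

    def emit(key, value):
        """Called by the user's map() function for each intermediate pair."""
        global buffered_bytes
        record = json.dumps([key, value])
        buffers[partition(key)].append(record)
        buffered_bytes += len(record)
        if buffered_bytes >= BUFFER_LIMIT:
            spill()

    def spill():
        """One big sequential write per reducer partition; a replacement reducer
        can re-read these files if the original reducer dies."""
        global buffered_bytes
        for r, records in buffers.items():
            if records:
                with open(f"map_output_part_{r}.jsonl", "a") as f:
                    f.write("\n".join(records) + "\n")
                records.clear()
        buffered_bytes = 0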

This should have caught up to Google by now, because 10GigE is getting cheap and hard drives didn't get any faster.  But as history has shown, it's not a good idea to bet against Google.  SSDs arrived just in time to save the day, and Google will be able to transition just fine to 10GigE by coupling a few of them with each server (SATA 3 reaches 6 gbps per port, and 3-4 SSD drives or so can keep a 10GigE link busy).  Google's main concern about servers is their power consumption, and those SSDs sip power relative to the other big power-drawing PC components.  For IO-bound MapReduce tasks, 10GigE + SSD means life is good.

Tuesday, November 9, 2010

7Gbps wireless a step backwards?

A lot of noise is being made over the new 7Gbps WiGig standard, which, at first glance, is about a thousand times faster than my laptop gets at home.  This has gotta be good, right?

Well, as is typical of anything having to do with wireless, the real story is a lot more complicated than just peak bandwidth.  The range is just terrible compared to even 802.11B, with full performance only achieved within a 15-foot radius with line-of-sight - and don't think about transmitting through walls, because the bandwidth will be awful if the connection doesn't drop completely.  This is played up as a good thing, because it prevents neighbors from stepping on each other's bandwidth, but it also makes WiGig unsuitable for most in-home purposes.

The practical uses of such bandwidth are also hard to come by.  Most hard drives won't read much faster than 60MB/sec in sequential mode, which is 480mbps, or about 1/15th the bandwidth for which WiGig has been designed.  Sequential read speeds for blazing fast SSD hard drives are only 2gbps, still not pushing the limit.  A favorite scenario that is often cited is the ability to transfer a Blu-ray in under a minute - but, ignoring the fact that movies would have to be decrypted, the fastest Blu-ray drives are 12x, where 1x is equal to 4.5MB/sec, so 12x is 54MB/sec, or 432mbps; i.e. also less than 1/15th the optimal WiGig speed.  And let's not get started on the abysmal "broadband" speeds in the U.S., for which existing WiFi is already overpowered - I pay $60/month for 20mbps and in reality get 5mbps (a data rate that transfers just fine on 7-year-old 802.11B cards) - or about 1,000x slower than WiGig.
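
For the skeptical, here is the arithmetic behind those ratios (a quick Python check using the rough figures above):

    WIGIG_MBPS = 7000.0                  # WiGig peak rate

    sources_mbps = {
        "HDD sequential read":  60 * 8,        # ~60MB/sec -> 480mbps
        "fast SSD sequential":  2000,          # ~2gbps
        "12x Blu-ray drive":    12 * 4.5 * 8,  # 1x = 4.5MB/sec -> 432mbps
        "my real broadband":    5,             # 5mbps on a good day
    }

    for name, mbps in sources_mbps.items():
        print(f"{name}: {mbps:.0f} mbps, about 1/{WIGIG_MBPS / mbps:.0f}th of WiGig")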

I'm not saying WiGig won't be great for connecting external screens (in the same room) wirelessly, because it will be, but that is a fairly niche purpose.  In fact, people don't plug in devices to screens all that often (a friend bringing over the new PlayStation would be an exception, but that only happens once every ten years, right?), so WiGig isn't really replacing wires, it's just making things wireless that most of us rarely do to begin with.

Monday, November 8, 2010

Solid State Drives mature, leave home for new housing

Most of us buying laptops with SSD hard drives love them for their speed and ruggedness.  They're expensive, but that's the price we pay for the latest and greatest, right?  It may be surprising, then, that your new bleeding edge Flash-based hard drive is housed in the standard 2.5" laptop form factor (69.85 mm × 7–15 mm × 100 mm) originally designed for spinning disks in... 1988.  That's pretty old, and a company has finally taken a stand, said enough is enough, and brought SSD to a form factor worthy of the new millennium.

Toshiba announced from Tokyo today a new form factor that is not just smaller than the current standard, but literally 1/10th the size (24mm x 2.2-3.7mm x 108.9mm).  Called the Blade X-gale, it's basically the same length, one third the width and about one third the height.  In fact it's almost identical to the DIMM form factor used for memory, which we discussed yesterday - (a size that seems to have been tried back in August but didn't catch on).

One might expect the capacity-to-weight ratio, or gigabytes per gram, not to have improved much over 2.5" magnetic disk drives because, while heavier than Flash, they also have greater capacity.  The current biggest laptop drive is the Seagate Momentus 640GB, which weighs in at 120 grams, achieving 5.3 GB/gram.  The highest capacity 3.5" drive is currently a 3TB drive from Western Digital's Caviar Green series, which weighs 730 grams, yielding just 4.1 GB/gram.  This is where it gets interesting, as the new 256GB Blade X-gale from Toshiba weighs just 13.2 grams, achieving 19.4 GB/gram - besting the legacy form factors by 4x-5x.
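
The GB-per-gram numbers above, double-checked in a few lines of Python:

    drives_gb_grams = {
        "Seagate Momentus 2.5\" (640GB)":   (640, 120),
        "WD Caviar Green 3.5\" (3TB)":      (3000, 730),
        "Toshiba Blade X-gale (256GB)":     (256, 13.2),
    }

    for name, (gb, grams) in drives_gb_grams.items():
        print(f"{name}: {gb / grams:.1f} GB/gram")
    # prints roughly 5.3, 4.1, and 19.4 GB/gram respectively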

With such big improvements in size and weight, Toshiba's new product line is a good reminder that many components of the personal computer are mired in their own legacy, just waiting to be updated.  The PC BIOS, which first debuted back in 1981, is another example, and the industry has been so slow to move on that the newest hard drives are no longer fully functional with it.  Indeed, the three terabyte drive discussed above can't serve as a boot drive, and is therefore limited to secondary data storage roles until motherboard manufacturers implement the newer Unified Extensible Firmware Interface (UEFI) more broadly.

This leads to the question of what other parts of the PC may be stuck in the past, with order-of-magnitude improvements still waiting to be unleashed...

Friday, November 5, 2010

Flash vs DRAM

DRAM and Flash store bits in fundamentally the same way: charge (on-bit) or a lack of charge (off-bit) is stored in a capacitor (a very small battery) which will be tested later to detect whether the charge is present.  MLC Flash uses the same technique but varies the level of charge stored in the capacitor in order to get more than 1-bit per capacitor.

Flash chips hold a lot more data than DRAM.  This sounds intuitive when you remember that Flash goes in hard drives, which are much bigger than the memory (DRAM) of a typical computer.  But the difference is more striking when the two chips are placed side-by-side, because the Flash chip is basically the same size as the DRAM chip (here's a neat picture showing how two Flash chips fit in one SD card).  Case in point, I was browsing Flash integrated circuits (ICs) and stumbled upon a monster at Micron.  The Micron catalog shows the MT29F256G08CUCBBH3-12 coming in at 32 gigabytes.  For a frame of reference, the best DRAM chips (Micron catalog) hold 512 megabytes.  That's a difference of 64x!

The transistors are similar sizes for the latest DRAM and Flash, so the capacity difference is achieved by stacking multiple Flash layers on top of each other (this is on its way to achieving 128GB Flash chips).  3D chip stacking traditionally has a problem with overheating, because each layer consumes power and the layers insulate each other, causing internal temperatures to escalate.  Stacking Flash solves this problem because the capacitors don't consume power to hold their charge, and Flash doesn't lose its charge for years, so most of the layers are not in use at a given moment.  Thus, Flash chips are utilizing their dark silicon to achieve extreme densities.

On the other hand, Flash is slow and only supports low data transfer speeds.  For the example above, the Flash chip runs at 166mhz at one byte per cycle whereas DDR3 DRAM achieves an effective 1600mhz (800mhz DDR) at one byte per cycle.  So DRAM chips allow about 10x the transfer rates of Flash chips.

Lastly, there is not much benefit in fetching data from Flash in chunks much smaller than 4KB, meaning about 4,000 cycles over an 8-bit bus.  This is why SSD speed is sometimes measured in IOPs, where a 166mhz Flash operating on 4KB blocks with an 8-bit interconnect provides about 40,000 IOPs (with a really good controller).   In contrast, DDR does generally have a minimum of 8 transfers per access, which, over an 8-bit connection, is 8 bytes per access.  At an effective 1600mhz this is 200,000,000 IOPs, or about 5000x as many as the Flash chip.  Thus, DRAM allows the memory bandwidth to be dedicated to many accesses of smaller chunks of data instead of only very large chunks like Flash.
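
Spelled out, the per-chip arithmetic looks like this (a rough sketch that ignores controller and protocol overheads):

    flash_hz = 166e6      # Flash bus clock
    dram_hz  = 1600e6     # DDR3 effective transfer rate

    flash_iops = flash_hz / 4096   # ~4,000 cycles to move a 4KB block over an 8-bit bus
    dram_iops  = dram_hz / 8       # minimum burst of 8 transfers = 8 bytes per access

    print(f"Flash: {flash_iops:,.0f} IOPs")             # ~40,000
    print(f"DRAM:  {dram_iops:,.0f} IOPs")              # 200,000,000
    print(f"ratio: {dram_iops / flash_iops:,.0f}x")     # ~5,000x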

The differences between DRAM and Flash are indeed striking, with Flash providing about two orders of magnitude greater density and DRAM providing about 4 orders of magnitude more operations per second.

Edit: You can see a follow-up post here.

Thursday, November 4, 2010

The Nano Duo

For 100 points, what company has been the second largest manufacturer of motherboard chipsets and recently settled with Intel to extend its x86 license to 2018?  Although AMD probably comes to mind first, the answer is VIA Technologies, the scrappiest company to ever make computer chips.  Here's a great interview of VIA execs over at bit-tech that helps explain why you haven't heard from them in a while.

Around the time Intel released the Atom, VIA introduced the Nano processor (codenamed Isaiah, here's the whitepaper), which achieved double the performance of Intel's Atom while consuming about the same amount of power.  The Nano even consumed less power at idle, a typical state for the netbooks it was being designed to drive.  More benchmarks from 2009 also showed Intel being handed its own posterior.

But nobody used it.  Well, IBM and Samsung put it in netbooks, Dell put it in mini servers for physicalization (the first gigabit is always free), and maybe there were some other devices (I seem to remember trying to put it into a robot at one point).  Intel continued advancing the Atom, multiplying its cores, adding out-of-order execution, and consolidating the north and south bridge so that, as a platform, Atom consumed less power than before.  This has gone far enough that you can't find Nano on newegg anymore (it should be on the shelf next to the ARM netbook, shouldn't it?).

Well things may turn around as the Nano has gotten a facelift, jumped one-and-a-half Moore's Law cycles, gone dual core, and integrated a good graphics processor.  Perhaps unsurprisingly, it is back to handing out posteriors, playing PC desktop games that were previously impossible on a netbook, and coming very close to the performance of a Core 2 Duo.  Hey, I remember that processor - it's the one powering this laptop as I type.  Hmm...

Maybe VIA can do something with their new demon this time.

Wednesday, November 3, 2010

Golden ages of technology never to return. Part2: WindowsXP

I first became a fanboi of Windows for the games.  Doom and Quake changed the way I thought about computers.  They made me want to learn how to program them and, when I could first afford my own, to know why one computer was better than another.  This latter motivation coincided with the original dual celeron hack, which raised issues that are still at the forefront of computing (what's the difference between on-chip and off-chip cache?  Why do some processors overclock more than others?  How does higher voltage increase clock headroom?  Why was overclocking the bus important?  How does a graphics card offload CPU work?).

Back on topic, Windows 95 had crashing problems.  Windows 98 had fewer crashing problems but only supported single processors.  Windows NT crashed even less and allowed dual processors, but had software and driver incompatibilities.  WindowsXP was the first of these that supported multiple processors, crashed very little, and was compatible with all first-release computer games.  Another great feature is that Linux and WindowsXP could dual-boot with a little care, and at the time it was fun (for me at least) to learn what ruined a dual-boot installation and how it could be done properly.  Those were the first reasons I latched onto WindowsXP.

Then something expected happened :) - Mac market share continued to dwindle, reaching an all-time low around 2002, coinciding with the time most kinks were ironed out in WindowsXP with Service Pack 1.  For example, market share at universities like Cornell (a traditional haven for Mac fans) had fallen from 41% in 1994 to a sustained 5% from 2000 to 2002.  This meant that every piece of software released in 2002 came out for WindowsXP (please post exceptions in the comments, as well as whether those companies are still in business).  Put another way, there was no software released in 2002 that you couldn't run on your WindowsXP computer.

This period also coincided with an all-time peak in Internet Explorer adoption and the release of Internet Explorer 6.  This meant that there were no browser incompatibilities for WindowsXP users in 2002 - everything worked with IE6 or died.  In addition, any hardware that came out got very cheap very fast as the hardware manufacturers all competed on basically a single platform (elongating the lull in Mac usage, since Mac hardware benefited less from economies of scale).  Finally, all this software automatically got twice as fast (as it had for the previous ~20 years) as clock speeds and instruction-level parallelism continued to scale without the need for dual cores or multithreaded programming.

During its golden age, WindowsXP created the most compatible computers of all time.  This period was eventually followed by an increase in Mac usage (a healthy thing from many perspectives), which is now between 50% and 70% for incoming college students.  In conjunction, there was an increase in browser diversity, and an operating system from Microsoft that intentionally irritated users (notice that dominance inspiring hubris, which leads to bad products, is a consistent theme for a golden age).  Software also stopped getting 2x faster automatically with Moore's Law, with subsequent improvements requiring downloads and reinstallation.

It is sad that this compatibility came at the expense of Apple etc., and some will see it as a dark ages of sorts, but in terms of compatibility (both applications and web surfing) it is hard to argue that computers were ever more compatible before, or will ever be more compatible than they were during the golden age of WindowsXP.


Tuesday, November 2, 2010

Intel hedging against Moore's Law?

Intel's delivery of the first 16 Moore's Law cycles is widely admired across the industry as being on-time and on-budget.  This unique reputation, 40 years in the making, strikes a fear of falling behind into the hearts of competing manufacturers.  Photons not traveling straight enough?  No problem, just immerse the whole process in liquid to straighten that out.  Need the performance of a 4-atom-thick insulating layer without the defect rate?  No problem, just change the way transistors have been made since CMOS was invented.  Up to now, Intel has achieved these milestones without any outside help, and has lately accrued a substantial lead in process technology.  Competitors are running to each other in the hopes of not falling further behind.

That's what makes this story about Intel partnering with Toshiba and Samsung for the next two Moore's Law cycles so surprising.  Who would have thought Intel could use some help pushing Moore's Law along?

Now, it is possible to downplay this - I mean, it is only for flash memory technology, and Intel is not confirming the story either, so it may not happen at all.  But let's suppose it is happening and think it through for a second.  Many steps in the process of making flash memory are also used to make microprocessors (e.g. both require fabricating a type of transistor), so Toshiba and Samsung should get a serious leg up on their way to producing non-flash devices at 10nm as well.  This could potentially concede part of Intel's lead in fabrication technology - that's a big downside.  Why risk it?

One answer is that Intel foresees real struggles and the potential for long delays before achieving 10 nanometer parts. By partnering, Intel trades the increased risk of losing its technology lead for a decreased risk of reaching 10 nanometers slowly or not at all.  That is some serious doubt coming out of the company that should be most confident about its future.

Let's hope the story is wrong and that Intel is indeed as confident as can be about their timely achievement of 10 nanometers and beyond.

Monday, November 1, 2010

Intel follows AMD's lead, spins off foundry business

</sensational headline> Well, the spinoff is yet to be announced :-P, but Intel is indeed opening their fabs to an outsider for the first time.  Achronix, a relatively new FPGA company (background), must have seriously impressed some execs to win the keys to Intel city and to the 22nm fabrication facilities coming online next year.  With Intel now maintaining roughly a half-node advantage over all other fabs, Achronix will be releasing production 22nm FPGAs by the time Xilinx and Altera are at full production with their 28nm FPGAs from TSMC.  A 22nm chip will hold about 60% more logic than a 28nm chip of the same size, and when fabricating the same design at the same speed, Intel's 22nm will probably consume about 30%-50% less power.

The current method of circumventing the power wall, ILP wall, and memory wall by adding more cores to each processor die may not maintain the Moore's Law rate of 2x per cycle.  A nice aspect of FPGAs is that they still achieve 2x per tech node, or slightly better.  Designs on even the biggest and fastest FPGAs are still not near the ~150-watt power wall, and off-chip communication bandwidth continues to roughly double as the onboard high-speed transceivers keep getting faster and more numerous.  Part of the reason for this continued ability to scale is that FPGAs are programmed in Hardware Description Languages (HDLs), in which huge amounts of parallelism must be declared directly by the programmer (hard problems like timing closure and clock domain crossing also fall on the programmer's shoulders).

Thus, with Intel's 22nm tech node not expected to deliver a big improvement in serial processing speed, and only (optimistically) doubling x86 core count to 12 (a count AMD's 45nm Magny-Cours already reaches at 2.2ghz), the greatest capabilities reaped from the timely arrival of Intel's 22nm tech node (potentially 1-2 years ahead of Global Foundries, TSMC, UMC, etc.) may come in the form of FPGAs with the highest speed and capacity on the planet (by a large margin).

More importantly, these new best-on-the-planet FPGAs may be offered at prices like $400.  That is 25x less than today!  Even if you can afford ~$10k to get the best right now, you still need something like 3-6 months of lead time.  Taking all this into account, what's most surprising is that stock prices didn't plummet for companies that will soon be competing on unlevel ground (Xilinx up 0.1% and Altera down 0.1% on the day).

Friday, October 29, 2010

Golden ages of technology never to return. Part1: Intel 440BX

The mere mention of the Intel 440BX motherboard chipset still sends chills down my spine (as it should for all self-respecting nerds).  It serves as a great example of a piece of hardware that lasted way, way longer than its builder intended.  Its compatibility started with the 0.35 micron Pentium II 233mhz processor of 1997, and ended with the 0.13 micron Pentium III-S 1.4ghz Tualatin processor of 2001.  That is four generations of Moore's Law!  In addition, Tualatin benefited from an unusually elegant design that allowed it to outperform 2ghz processors sold years later.  (The 0.18 and 0.13 micron processor generations required slocket adapters like this, and in a future post we will also note another great role the slocket served ;-)  From a system builder's perspective, the 440BX represented the epitome of upgradeability, and reinforced in us the value of building our own computers.

The amazing dominance of the 440BX chipset may have partially inspired hubris at Intel that led to a series of bad decisions, like RDRAM (Rambus) and the 31-stage pipeline of the Pentium 4.  Once pride was swallowed, Intel backtracked into better products that used standard memory and efficient pipelines.  In order to increase sales, however, they began a practice of requiring new motherboard chipsets for each processor generation, and this continues today.

In this way the 440BX represents the greatest period of computer upgrading in history, a golden age of technology that will seemingly never be repeated.  In a future post we will see similarities with Microsoft's WindowsXP product, which was also followed by bad decision making, a poor product cycle, and continuous incompatibility.

Thursday, October 28, 2010

Hard Drives eye SSD - friend or foe?

This is what it's like when worlds collide (song) - well, maybe not worlds, more like data storage technologies with different substrates.  As reported at TomsHardware, Steve Luczo, CEO of magnetic storage hard drive manufacturer Seagate, made some interesting comments regarding up-and-coming Solid State Drive (SSD) technology (full transcript).

First, on the product most famous for its SSD option, the MacBook Air (discussed here by Mac himself :), Luczo states the percentage of those units sold with SSD is very low, so it's not a threat.  Second, SSD drives are too small (low capacity) and too costly.  Third, SSD slows down over time, so their performance is not as great as you've heard.  And last, "Seagate introduced hybrid drive last quarter, you get basically the features and function of SSD at more like disc drive cost and capacity ... with the hybrid there is things that you can do to alleviate that [performance degradation] so your boot times are actually as compelling one and two, three and four years down the road."

There you have it, resistance is futile, prepare for assimilation.  Well, that is certainly one point of view, but in reality things could turn out less comfortable for Big Magnetic Storage (yeah, I said it :P) than they would like to admit.

As reported at RealWorldTech, SSD is a classic disruptive technology.  There is a certain amount of storage each user needs, and as the cost per bit of SSD continues to follow Moore's Law, SSD will continue to meet an increasing percentage of users' needs.  A possible counterargument is that users' needs are also increasing exponentially, but I wonder if this is really happening.  Most large data requirements come from audio and video libraries, where storage demands increase when the libraries get bigger or gain quality.  In my case I have about two hundred albums in my music library, collected over the last decade, and it is really unlikely this will double to 400 albums in the next two years.  Furthermore, high definition video is pretty close to the resolution of the human retina, limiting demand for further improvements (and their requisite bitrate increases).

To hard drive manufacturers, SSDs represent the wild west.  In the vertically integrated magnetic storage business, manufacturers build the entire hard drive themselves in factories that cost billions of dollars.  In contrast, the flash memory chips that store data in SSDs are a commodity, similar to DRAM memory chips, and anyone can buy them wholesale and integrate them into SSDs.  This allows anyone to become an SSD OEM and ushers in a competitive environment that previously didn't exist.  It will be interesting to see how this new west is won.

Wednesday, October 27, 2010

SiCortex from the outside

SiCortex was an inspiring company - they had a novel chip-to-chip networking architecture that boasted improved latency and bandwidth for MPI programs.  They packed their processors densely, and instead of making rack-mount servers they optimized their own custom chassis for cooling.  Their largest system unit had over 5,000 cores and could fit within the power budget of the typical office without any power or cooling retrofitting - a very exciting proposition and I don't think I was the only one who dreamed of bringing one home.

The system was marketed as a power efficient supercomputer, a niche that seems like a good one to target since, with the right architecture, there is such a large margin by which commodity servers can be beaten in that arena.  Low-power cores coming out of ARM and Tensilica inspire thoughts of how computer systems could incorporate such efficient cores in a useful way.

In 2007, about a year after SiCortex installed their first systems, the best SiCortex system was "only arguably" more power efficient than the latest Intel servers - meaning a significant advantage on some common cluster workloads wasn't obvious.  For example, in the double-precision GFlops arena (oftentimes not a representative benchmark, but used here for simplicity), SiCortex provided 5832 cores, each capable of 1 GFlops, yielding 5.832 TFlops.  The power consumption of the system was about 18 kilowatts, resulting in 324 GFlops per kilowatt.  The 3ghz Core 2 Quads that could fit in 150-watt servers in 2007 were putting out 48 GFlops (4 ops/cycle in SIMD, 4 cores, 3ghz).  That's 320 GFlops per kilowatt, reducing the SiCortex advantage to a rounding error.
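
The comparison spelled out (a quick Python check of the figures above):

    sicortex_gflops = 5832 * 1.0     # 5,832 cores at 1 GFlops each
    sicortex_kw     = 18.0
    core2_gflops    = 4 * 4 * 3.0    # 4 SIMD ops/cycle x 4 cores x 3ghz
    core2_kw        = 0.150          # one 150-watt server

    print(sicortex_gflops / sicortex_kw)   # 324 GFlops per kilowatt
    print(core2_gflops / core2_kw)         # 320 GFlops per kilowatt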

The GFlops comparison is not fair - the SiCortex architecture had a lot of advantages outside of GFlops, like much lower penalties for cache misses, higher memory bandwidth per compute cycle, somewhat lower penalties for branch misprediction, etc.  The comparison above also does not take network power consumption into account, and the PC network would have delivered lower performance for latency- or bandwidth-bound problems.  These advantages would have been more compelling if SiCortex's floating point power efficiency had held at least a 2x-4x advantage over Intel at the time, which could be perceived as a minimum 2-year to 4-year lead over commodity servers.  As it was, it was easy to think of workloads (e.g. SIMD floating-point bound workloads) that gained no power efficiency advantage on the SiCortex hardware, which started the power efficiency story on the wrong foot.

Tuesday, October 26, 2010

QX9650 TDP, power measurement, and the Q9505S in robots.

This is an update to the previous post on the Q9505S (and includes a correction).

When the first 45nm quad-core processor arrived, the 3ghz QX9650 with 12MB cache, it was labeled with a 130-watt TDP.  TDP, which stands for "Thermal Design Power", indicates the maximum amount of heat that a cooler would need to dissipate when the processor is under load.  TDPs are known not to be the best estimate of power consumption - in fact they are necessarily overestimates - but I had figured they were a fair estimate for processors with the highest clock speed in their class, i.e. the ones closest to actually consuming the TDP power.

Measuring the power consumption of the processor in a PC is non-trivial - using a Kill-a-Watt results in measuring total system power (including power supply overhead).  Even Kill-a-Watts get it quite wrong if the power consumption is changing quickly between different levels - but for measuring constant loads they work fine.  Another way of measuring power consumption, and the one xbitlabs uses, is to wrap the DC power cables that run to the CPU through an ammeter ring, which measures amps, and then measure voltage elsewhere, allowing calculation of watts as amps times volts.  This method adds the inefficiency of the voltage regulator module, or VRM, to the CPU power consumption.  The VRM is responsible for bringing the 12v down to the ~1.1v required by the processor, and its efficiency can vary from 75% to 95% (ASUS has achieved 96%).  It is possible to use multiple motherboards with the same processor and use some statistical techniques to work out what the efficiencies of the different motherboards must be close to, but to my knowledge that has never been tried, since it is just too much work.

Another technique for measuring power is to change the voltage of the processor several times, each time measuring the system power consumption.  This can be hairy because over-volting can break the processor, but when it works you end up with several data points at several different voltages.  You can overclock and underclock the processor to different levels as well.  Power consumption will be passive power + dynamic power.  Assuming passive power is pretty low (claimed to have decreased by 10x in Intel's 45nm process), estimating the dynamic power is enough, and dynamic power scales with the product of frequency and the square of voltage (P ≈ α·C·V²·f).  By holding the other factors constant and modifying just voltage and frequency, it is possible to calculate the total dynamic power by solving for the missing factor.  System power = non-CPU power + CPU dynamic power, or s = a + b·x²·y, where many (s, x, y) points (x being voltage, y frequency) can be collected to solve for a and b (ignoring CPU passive power).  I haven't seen this technique used, but it should work - it would be interesting to compare it experimentally against the other techniques to see how close it comes to them.  One nice aspect of this technique is that the only hardware required is a Kill-a-Watt, since frequency and voltage can be measured using SpeedFan or other software tools.
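
Here's a minimal sketch of that fitting idea (Python with numpy; the voltage/frequency/wattage readings are made up for illustration):

    import numpy as np

    # (voltage in volts, frequency in ghz, measured system watts) - made-up readings
    samples = [
        (1.10, 2.00,  78.0),
        (1.15, 2.40,  86.0),
        (1.20, 2.83,  95.0),
        (1.25, 3.20, 105.0),
    ]

    A = np.array([[1.0, v * v * f] for v, f, _ in samples])   # columns: [1, V^2 * f]
    s = np.array([watts for _, _, watts in samples])

    (a, b), *_ = np.linalg.lstsq(A, s, rcond=None)            # fit s = a + b * V^2 * f
    print(f"non-CPU power ~ {a:.1f} W")
    print(f"CPU dynamic power at 1.20v, 2.83ghz ~ {b * 1.20**2 * 2.83:.1f} W")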

A last technique for measuring power consumption, and the one used in the previous article, is described by Anandtech as: "requires nothing more that the processor's specified TDP and then scales this value based on a given overclocked core frequency and voltage".  This method is particularly terrible when the TDP is inaccurate, and the QX9650 was a special case of an extremely inaccurate TDP, estimated to be about double the actual power consumption.

This correction makes a theme of the previous post a little less exciting: the power consumption of the 45nm process probably decreased by a significant amount over the lifetime of the process, but not by half.

Still, the Q9505S is an amazing processor.  While working in the brain engineering lab at Dartmouth we put it in a small ~15-pound mobile robot called Brainbot to run speech recognition (Dragon NaturallySpeaking), speech production (AT&T Natural Voices), and visual feature extraction (RoboRealm) simultaneously.  Each application ran on a different core quite smoothly, while the entire robot got around 1.5 hours of battery life on 200 watt-hours.  That's better life than my old Alienware Pentium 4 laptop got when it was new.  Brainbot is now sold for $30k, and the only robot available with more onboard processing power is the Willow Garage PR2, with two PCs built in, consuming 6x as much power and selling for $400k.  The Q9505S helped Brainbot get close to the PR2's level of performance for less than one-tenth the cost, which makes it a true marvel.

Monday, October 25, 2010

Intel's 45 nanometer Q9505S

In the Intel lineup, the Q9505S is a freak.  If I were to tell you that the performance of Intel's flagship 45nm quad-core processor (the QX9650) would be delivered two years later at one third the cost and one half the power consumption, you might conclude that Moore's Law had cycled again and the new processor benefited from a 32nm die shrink.  This is what makes the Q9505S such a strange creature - these benefits were surprisingly reaped all within the same 45nm tech node.  What gives?

For starters, it was released quite recently, two years after the first 45nm Core 2 quad-core processor, the QX9650, and consumes half the power at 65-watt TDP vs 130-watt TDP, and achieves approximately the same clock speed at 2.83ghz vs 3ghz.  Because the processor architecture is the same (Core 2 quad), there are only a few possible sources for the power efficiency gains.

One possibility is an improved layout, i.e. placement-and-routing of the transistors and wires that implement the Core 2 architecture.  It makes sense that only slight improvements would have been made here because Intel spends many man-hours hand-crafting the circuitry in the first place, and the Q9505S is seemingly not a high volume product that would merit follow-on hand-crafting.  Another source of efficiency gains is obtaining a sweet spot of 6MB for the L2 cache, which is 50% less than the 130 watt 12MB Core-2-quads, and 50% more than the 4MB Core-2-quads.

Perhaps most interesting is the implication that the Q9505S is a beneficiary of improved fabrication technology within the 45nm tech node, long after the original 45nm debut.  Additional evidence for this is that mobile quad-core 2.53ghz parts (e.g. the QX9300) with power efficiency similar to the Q9505S (45-watt TDP) were available a full year prior, but cost much more (~$1k), indicating that the yield of processors with those specs was quite low.  Given the additional year of developing the 45nm process, the yield for such processor specs must have improved by a large margin to allow their release at much lower prices - these types of improvements usually come from Moore's Law die shrinks, but in this case it all happened within the 45nm process.

45nm also stands out as an unusual tech node, having been credited as the greatest advancement in semiconductors in 40 years by Gordon Moore himself.  This was due to overcoming difficulties in fabricating 45nm transistors by changing the elements that make up the gate and its insulation (so-called "high-k metal gate").  This same technology is also being used at 32nm, raising the question of whether 32nm will see similar delayed improvements.  That would be just what Intel needs in order to deliver 3ghz 8-core Sandy Bridge processors before Ivy Bridge's 22nm fabrication technology is ready.

Friday, October 22, 2010

Supercomputing didn't happen

I was fortunate enough to attend Bill Gates's backyard barbecue twice.  In the summers of 2003 and 2004 I interned at Microsoft, definitely two of the best summers of my life, and each time some interns were treated to meeting Bill Gates for dinner on his home turf.  What fantastic evenings - I don't remember the food, but the unlimited free beer and ice cream sandwiches were just awesome.  Even the bathroom was amazing: the paper towels used to dry your hands after washing were like real towels, super thick and yet super soft.

Bill would arrive fashionably late on his back lawn that touches Lake Washington, just when the sun was setting but still so bright that you had to squint hard, eyes almost closed, to see anything when looking west.  A crowd of 50+ interns would immediately surround him at very close proximity, and at that point he would answer questions for about an hour and a half before security would usher us out and back home.  Nobody wanted to leave Bill; he really has an electric personality in personal settings.  I think he wanted to inspire us interns, and he did.

I have some experience getting to the front of crowds, having practiced getting to the stage at Tool and Rage Against the Machine concerts - and I was able to in this case too, but there were always certain interns with more hubris, who would ask questions quicker and louder than me.  I was able to interject a couple times, and one of my questions pertained to supercomputing and what he thought about it.

He said "Supercomputing didn't happen, it never happened, ask anybody.  The only company that even tried is right over there [points across Lake Washington, referring to Cray Inc.] and they have only barely survived." (not a word for word quote)

I like people who get to the point and don't give wishy-washy answers (who doesn't?), and I liked his statements.  Bill was definitely right from certain perspectives, and in that historical context, but today the destinies of PCs and supercomputers are deeply intertwined.  In the modern context, most of supercomputing is the collection of networked personal computers, personal computers that he invented.  Many supercomputers are built around hardware acceleration on graphics cards that originated in, and can't run without, the PC; and a standard way to build a supercomputer is by extending PC clusters with accelerator cards like Nvidia's Fermi or IBM's CELL PCI Express cards.  The dependency between supercomputing and the PC is not a one-way street either.  PCs owe many of their features, like multi-core processing, 64-bit addressing, and SIMD instructions, to supercomputers.  Only a fraction of the performance of modern desktop PCs would be possible without parallel programming techniques previously used only for supercomputers, like MPI and OpenMP.

Besides the insight Bill bestowed upon the crowd of interns, he also gave us great stories that we can tell and retell, anytime there is an excuse, to anyone who will listen... blog readers not exempted!  :-D

Thursday, October 21, 2010

Hey 32 nanometer, where's my 16 cores?

Four score and seven years ago... no wait... Four years and two tech nodes ago, Intel bestowed upon us its first generation of quad-core x86 processors, crowned by the QX6800.  (Well, ok, three-and-a-half years ago, but that doesn't really roll off the tongue the same way, does it?)  It was a beast at 2.93 ghz, capable of issuing four instructions per cycle.

Fast forward two Moore's Law cycles (a 4x increase in transistors, arriving even faster than before) and we should have 16 cores of at least the same performance, right?  Or maybe 4 cores that are four times as fast?  Or 8 cores that are twice as fast?  Hell, we could even settle for four cores that issue a max of 16 instructions per clock.  That is the life to which we have grown accustomed.

Well, it turns out we've been spoiled, the chickens have come home to roost, and Intel is to blame only if you think they should be able to tweak the laws of physics (well, maybe for getting our hopes up, but do you really want them not to be optimists?).  The best x86 processor today is a 6-core Gulftown (Westmere) with the same 4-issue rate and a measly 13% increase in clock rate.  There are some feature improvements, like extra threads, better branch prediction, and an increased likelihood of actually issuing all four instructions, but dammit, I want my cores, clocks, and issue rate ;-)

We were promised, and actually got, 8 cores at 45nm in the Nehalem-EX, but it ran way too hot to function at 3ghz (a speed originally introduced back in 2002), so it shipped clocked down to 2.26ghz.  It also arrived _after_ Westmere and costs about 2x-3x as much :-(

None of this is Intel's fault.  They are hitting the memory wall, ILP wall, power wall, and all sorts of walls.  In the past, when Intel ran into walls, like having to shrink an insulating layer that was already 5 atoms thick, they steamrolled right over them, laughing all the way to the bank.  Here's hoping they bust out that steamroller again.

Wednesday, October 20, 2010

Hardware in politics

Barack Obama and I have at least one thing in common: we both like to read ForeignPolicy and Sports Illustrated.  I usually don't find much hardware news in my foreign policy morning brief (having a "morning brief" makes me feel real important), but today I find that China has stopped exporting rare earth elements to the U.S.  These elements are required to manufacture all of our favorite technologies (cell phones, the fiber-optic Internet backbone, etc.), and until recently China had been supplying them to the entire world (95% of global supply).

Undeterred, Intel has announced that its first 22nm fab will be located in the U.S., at a cost of something like $8B - a surprising move in some respects, since historically many of Intel's facilities have been outside the U.S., in places like Israel, Malaysia, and Costa Rica.  The increasing cost of new fabs (yes, really, $8B!) has necessarily driven consolidation in the semiconductor industry: only one player (Intel) has been capable of production at 32nm (or better) for about 10 months, and previously competitive companies are banding together to avoid falling further behind.  In fact it is quite arguable that "half nodes" like 40nm and 28nm are an admission that the traditional nodes cannot be delivered in lockstep with Intel, and that the missing months are costly.  All this leads to the conclusion that, with the possibility of escalating trade wars, a state-of-the-art domestic fab is of key strategic importance.

In response to China's increasing dependence on imported computers, the Chinese national processor "Godson" was developed; it can be fabricated by STMicroelectronics within China's borders.  As a lower bound on the processor performance that can be achieved without imports, Godson could be considered a huge success and a security blanket of sorts.  Intellectual property issues did arise early in Godson's development because its instruction set was based on MIPS, minus the patented instructions.  Licensing agreements were eventually worked out with MIPS Technologies (founded in the U.S.); these were arguably unnecessary, but they certainly put a stop to any ongoing controversy.

It will be interesting to find out how the world responds to China's reluctance to export rare earth elements, and where future fabs and processor architectures will emerge in the context of their increasing political importance...

Tuesday, October 19, 2010

2.7 Petaflops for $12.80

Until the HPC Cluster instance debuted back in July, there was a lot to complain about in terms of CPU performance per dollar in the Amazon EC2 cloud.  Single-threaded performance of their next-best instance, the so-called "High CPU" instance, is less than a tenth that of a modern desktop PC (2.5/8 = 0.3125 vs. 33.5/8 = 4.1875 EC2 compute units for a 2.93ghz Nehalem).  Indeed it is well known that Amazon slices up their real cores into many virtual cores that include only a fraction of the computing resources.  This was the norm until the HPC Cluster instance, which is the first to provide a 1:1 real:virtual core ratio.

Without special arrangements only 8 HPC Cluster instances can be recruited, at a cost of $1.60 each per hour, or $12.80 for all 8.  The theoretical max double-precision GFlops (an imperfect and often misleading metric, but fine for how we use it here) is 93.76 GFlops/instance * 8 instances = 750 GFlops (only about half this rate was achieved in their poster benchmark, but we will give the benefit of the doubt).  An hour's worth of processing (the smallest unit that can be purchased) therefore delivers 750 GFlops * 3600 seconds = 2.7 quadrillion (peta) floating point operations.
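
For the skeptical, here is a tiny C sketch that simply re-runs the arithmetic above; the numbers are taken straight from this post, and nothing here actually queries EC2:

#include <stdio.h>

/* Back-of-the-envelope check of the numbers above. */
int main(void)
{
    double gflops_per_instance = 93.76;  /* theoretical peak, double precision */
    int    instances           = 8;      /* default cluster-instance limit */
    double peak_gflops   = gflops_per_instance * instances;   /* ~750 GFlops */
    double flop_per_hour = peak_gflops * 1e9 * 3600.0;        /* ~2.7e15 operations */
    printf("peak: %.0f GFlops, work per hour: %.2f peta-FLOPs\n",
           peak_gflops, flop_per_hour / 1e15);
    return 0;
}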

Some applications are able to scale performance on N processors to be O(N), meaning linear scaling minus some overhead that does not grow out of proportion.  "Embarrassingly parallel" algorithms are good examples of this, such as Monte Carlo algorithms and algorithms that process large amounts of data, like web search.

Suppose a scalable algorithm that bottlenecks on the SIMD DP FPU takes an hour to complete a task on Amazon's 8 available HPC instances (any faster and performance per dollar is reduced due to the 1-hour minimum).  Ignoring initialization time (which we will discuss in a future posting, and which today is not charged to EC2 users) scalable algorithms can do massive amounts of work in almost no time by recruiting tons of hardware for very brief periods.  In this example, if 28,800 instances are available and there is no "1-hour minimum", the task finishes in about 1 second for the same price as the 1-hour scenario, utilizing 2.7 Petaflops (quadrillion floating point operations per second).  At the current cost per compute-second, err.. compute-hour, the total cost would be $12.80.
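
Here is the same sanity check for the burst scenario, again as a C sketch; the 28,800-instance figure and the per-second billing are assumptions of this thought experiment, not anything Amazon actually offers:

#include <stdio.h>

/* Hypothetical burst: same total work and cost as the 1-hour run, in ~1 second. */
int main(void)
{
    double price_per_instance_hour = 1.60;
    int    burst_instances         = 28800;  /* assumed available, no 1-hour minimum */
    double instance_hours = burst_instances / 3600.0;                 /* 1 second each -> 8 */
    double cost           = instance_hours * price_per_instance_hour; /* $12.80 */
    double rate_pflops    = 93.76e9 * burst_instances / 1e15;         /* ~2.7 Petaflops */
    printf("cost: $%.2f, sustained rate: %.2f Petaflops\n", cost, rate_pflops);
    return 0;
}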

Conventional commodity-server based systems will probably never be capable of delivering this type of performance because of initialization time (currently 5-20 minutes on EC2, depending on OS), but it is easy to envision custom cloud architectures that would confer nearly instant execution on scalable algorithms.

Monday, October 18, 2010

CELL University Challenge

Ashok Chandrashekar is an amazing guy.  During his first few months of graduate school at Dartmouth he taught himself how to program the CELL processor, and even how to debug it, which was much harder.  Errors that show up only in hardware were particularly vicious, and the catch-all "bus error" gave no information about what went wrong.  Still, in about one month he wrote every line of code that went into our entry for the CELL University Challenge, which resulted in our winning the grand prize (thanks also to Jay Moorkanikara, who originally had the idea to submit the entry).

For that month, working past midnight day after day with Ashok was one of the best experiences of my life, and our workarounds for the problems we encountered are why I think we won.  The biggest and most strategic workaround was discovered while we were trying to reduce our "bit-vector dot-product" (dot-product where all the elements are 0 or 1) to 3 cycles.  This was possible because the CELL processor ingeniously implemented the "pop-count" instruction which counts all of the 1's in a binary integer (e.g. popcount(1000100100001) = 4).  One of the claims-to-fame for architectures like Itanium was the hardware pop-count instruction, which required significant dedicated hardware in the architecture design.  Itanium and other architectures count the bits in the entire integer, but counting 32 or 64 bits requires a lot of logic to complete in a single clock cycle.  Someone at IBM had the notion to count bits in smaller fields, namely each 8-bit field, separately.  For large integers, it is much easier to count multiple 8-bit fields separately and store the totals in separate 8-bit regions, which allows the results of multiple pop-counts to be summed (on the CELL, summing 31 or fewer pop-counts has no overflow danger).
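
To make the per-byte idea concrete, here is a rough C sketch using plain 64-bit words; the function name is mine, and this is the standard bit-twiddling ("SWAR") version of the trick rather than the CELL's actual instruction, which operates on 128-bit registers:

#include <stdint.h>

/* Per-byte population count: byte i of the result holds popcount(byte i of x).
   Each step widens the partial sums: 2-bit fields, then 4-bit, then 8-bit. */
static uint64_t popcount_per_byte(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    return x;  /* every byte now holds a count in the range 0..8 */
}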

With the CELL pop-count instruction, it is possible to perform a 128-bit bit-vector dot-product in just 3 cycles: AND, popcount, ADD, and repeat along the entire vector's length.  Before the sums outgrow their 8-bit boundaries they must be aggregated, e.g. to 16-bits, but that is pretty simple to do.  And of course both input vectors must be loaded into registers as well, but those loads are hidden by the CELL's second execution port which can handle simultaneous loads/stores to/from the local store memory.
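
In portable C the AND / pop-count / ADD pattern looks roughly like this, using the popcount_per_byte helper sketched above and 64-bit words in place of the CELL's 128-bit registers (a sketch of the technique, not our contest code):

#include <stddef.h>
#include <stdint.h>

/* Bit-vector dot product: sum of popcount(a[i] & b[i]) over the whole vector.
   Per-byte counts are accumulated in packed form and folded out before any
   byte can overflow (31 * 8 = 248 still fits in 8 bits). */
static uint32_t bitvec_dot(const uint64_t *a, const uint64_t *b, size_t nwords)
{
    uint32_t total   = 0;
    uint64_t acc     = 0;   /* eight packed per-byte accumulators */
    size_t   pending = 0;   /* pop-counts added since the last fold */

    for (size_t i = 0; i < nwords; i++) {
        acc += popcount_per_byte(a[i] & b[i]);   /* AND, popcount, ADD */
        if (++pending == 31) {                   /* aggregate before overflow */
            for (int s = 0; s < 64; s += 8)
                total += (uint32_t)((acc >> s) & 0xFF);
            acc = 0;
            pending = 0;
        }
    }
    for (int s = 0; s < 64; s += 8)              /* final fold */
        total += (uint32_t)((acc >> s) & 0xFF);
    return total;
}

On a CPU with a native full-width popcount you would just use that, but the per-byte structure above is what maps onto the three CELL instructions.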

During the last stages of our implementation we encountered a throughput problem: much more than 3 cycles was required per 128 bits, and the reason for this was not obvious.  It turns out the CELL processor does not contain a bypass network in the traditional sense, meaning that values that exit the ALU for register writeback are not immediately available as inputs in the next cycle (a capability provided in most modern architectures by their bypass network - in fact the Pentium 4 had a half-cycle throughput and latency for simple instructions).  The CELL is designed this way because the bypass network is an expensive piece of hardware in terms of silicon area and power consumption, and as clock speeds scale (3.2 ghz in the CELL, a very high clock for 90nm technology) the turnaround time of the bypass network must shrink to achieve single-cycle latency (there is a similar requirement for branch prediction which we will cover in a future post).  Furthermore, the ability of typical modern processors to execute out-of-order (OOO) allows other instructions to be scheduled whose inputs are in fact available, but the CELL uses an in-order design instead of OOO to reduce the silicon area and power requirements of the CPU (thereby increasing the number of cores that can fit in each processor, increasing overall throughput).

Instead of a full latency-hiding bypass network and OOO execution, IBM relies on the compiler to intelligently schedule instructions, but this didn't work in our case (something we are ironically thankful for, since it made our contest submission more impressive, hehe).  This may have been due to limitations in the compiler's scheduler, such as the size of the window in which it looks for rearrangements.  Our workaround was to unroll the inner loop and then syncopate the operations of several loop iterations, like this:

AND A1, B1 -> C1
AND A2, B2 -> C2
AND A3, B3 -> C3
AND A4, B4 -> C4
AND A5, B5 -> C5
AND A6, B6 -> C6


POPCOUNT C1 -> D1
POPCOUNT C2 -> D2
POPCOUNT C3 -> D3
POPCOUNT C4 -> D4
POPCOUNT C5 -> D5
POPCOUNT C6 -> D6

ADD D1, E1 -> E1
ADD D2, E2 -> E2
ADD D3, E3 -> E3
ADD D4, E4 -> E4
ADD D5, E5 -> E5
ADD D6, E6 -> E6

The very large (though multi-cycle latency) register file in the CELL processor was able to simultaneously hold all of the temporary values without issue.  This bit-vector dot-product sequence works on vector chunks of 768 bits, which evenly divided our input vectors.  This scheduling allows the AND, POPCOUNT, and ADD instructions to have 6 cycles of latency headroom before their outputs are needed as inputs.  It also amortizes the cost of the looping branch over 18 instructions instead of just 3, further increasing throughput.
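
For readers who prefer C to pseudo-assembly, here is a rough analogue of that schedule: six independent accumulator chains, so each chain's result is not needed again until five other independent operations have issued.  It reuses the hypothetical popcount_per_byte helper from above and omits the aggregation step for brevity, so it only handles short vectors:

#include <stddef.h>
#include <stdint.h>

/* Six-way unrolled bit-vector dot product, mirroring the E1..E6 accumulators
   in the schedule above.  nwords must be a multiple of 6 and at most 186
   (31 iterations per chain) so that no packed byte accumulator can overflow. */
static uint32_t bitvec_dot_unrolled6(const uint64_t *a, const uint64_t *b, size_t nwords)
{
    uint64_t e[6] = {0, 0, 0, 0, 0, 0};
    for (size_t i = 0; i < nwords; i += 6) {
        e[0] += popcount_per_byte(a[i + 0] & b[i + 0]);
        e[1] += popcount_per_byte(a[i + 1] & b[i + 1]);
        e[2] += popcount_per_byte(a[i + 2] & b[i + 2]);
        e[3] += popcount_per_byte(a[i + 3] & b[i + 3]);
        e[4] += popcount_per_byte(a[i + 4] & b[i + 4]);
        e[5] += popcount_per_byte(a[i + 5] & b[i + 5]);
    }
    uint32_t total = 0;
    for (int c = 0; c < 6; c++)                  /* fold each chain separately */
        for (int s = 0; s < 64; s += 8)
            total += (uint32_t)((e[c] >> s) & 0xFF);
    return total;
}

Whether this buys anything on a modern out-of-order x86 is a separate question; the point is that on an in-order machine like the CELL SPE, the extra independent chains are what keep the pipeline busy.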

With the $10k in prize money our team went to Vegas and had a very crazy week that was eventually ripped off by Rockstar Games and incorporated into Grand Theft Auto 4.  :-D just kidding, I think I spent my portion paying off a credit card.  Oh well!

-Andrew

Friday, October 15, 2010

Hello World!

I was discussing some new developments in computer architecture with my friend (and CEO) Mac Dougherty one day when he suggested that I might start a blog on the subject.  This intrigued me because, ever since building my first computer (Dual Celeron 300a with modded slockets!  (the forum that started it all)), I have appreciated reading commentary on computer architecture.  I still get a rush when reading great articles by the likes of Jon Stokes, David Kanter, and Michael Schuette, and I also love checking tech news sites like HardOCP, which are updated at a much higher frequency.  What I never found was computer architecture analysis and commentary updated more frequently than is possible for in-depth articles, and so I will endeavor to do something about it.