Tuesday, November 23, 2010

Intel releases x86-FPGA hybrid - app store to follow?

The guys over at Altera have gotta be busting out the champagne - they're officially in bed with Intel, and if history is any indication (Microsoft), that is a good place to be.  Intel is integrating an Altera FPGA with an Atom processor in the same package, and calling it the Atom 600C series.  To do this, the package's small metal box (that gets squished by the heat-sink-fan (HSF) unit) houses two chips, an Intel Atom processor and an Altera FPGA, and connects them with very small wires, presenting the appearance of a single physical chip.  Without cutting open the metal case it would be impossible to tell this apart from an FPGA and processor on the same chip die.  This is the same technique used by Intel to unexpectedly achieve quad-core processors before AMD (even though AMD's SRQ and Hypertransport fabric were demonstrated as superior for multiprocessing).

The 600C-series Atom processor is a single-core Tunnel Creek model, but it is no slouch.  It comes in at a maximum speed of 1.6ghz and supports hyperthreading, whereas the highest performing Atom processor is a 1.83ghz dual-core Pineview Atom processor, also hyperthreaded (this Atom hyperthreading provides a huge performance improvement, as we will show in a future post, with improvements on the order of 66% of a second core).

Despite marketing hype, this is not the first time a processor has been tightly integrated with an FPGA, with RISC-cores coming embedded in FPGAs since Xilinx released the Virtex-II Pro model with on-chip PowerPC cores back in 2002.  What is new is that the processor is x86, allowing execution of code that can't be recompiled (i.e. proprietary libraries) very close to the FPGA.  This closeness should not only increase bandwidth and decrease latency of processor-FPGA intercommunication, but it also definitely results in lower power.  For example, we have used an Altera Cyclone III 65nm FPGA at Cognitive Electronics using Altera's starter kit and the power consumption measured for the FPGA chip was approximately 200 milliwatts whereas the entire board's power consumption was over 3 watts.  Indeed it is hard to think of an add-on card that comes in at less than 2 watts.  So, instead of doubling the power consumption of the Atom processor (TDP 2.7 watts @ 600mhz) the FPGA will increase it by only about 10%.  Even though this improvement becomes less dramatic in the context of total system power, it is still quite significant.
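As a rough check on that comparison, here is a small Python sketch using only the figures quoted above.  The 2 watt add-on card value is the floor suggested above, not a measurement, and the variable names are just for illustration.

```python
# Figures quoted in the post; the add-on card value is an assumed lower bound.
ATOM_TDP_W = 2.7      # Atom TDP at 600 MHz
FPGA_CHIP_W = 0.2     # approximate measured power of the Cyclone III chip alone
ADDON_CARD_W = 2.0    # assumed minimum for a separate add-on card

def relative_increase(base_w, extra_w):
    """Return the extra power as a fraction of the base power."""
    return extra_w / base_w

in_package = relative_increase(ATOM_TDP_W, FPGA_CHIP_W)
add_on = relative_increase(ATOM_TDP_W, ADDON_CARD_W)
print(f"in-package FPGA: +{in_package:.0%}, add-on card: +{add_on:.0%}")
```

With these inputs the in-package FPGA adds well under a tenth of the Atom's TDP, while even the cheapest plausible add-on card adds most of a second Atom's worth, and a 3 watt board more than doubles it.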

Intel could have designed their own FPGA ASIC pretty easily - FPGAs are just millions of very small SRAMs connected together with programmable wiring.  The hardest part of creating a good FPGA product is designing the Electronic Design Automation (EDA) software, which takes in a hardware design written in a hardware description language (HDL) and processes it through many stages, from conversion to a gate-level netlist ("Synthesis"), to mapping the gates onto the resources of the FPGA ("Map"), to determining the location of those components on the FPGA and routing programmable wires to connect them up ("Place-and-Route").  Although Intel uses EDA software to design its processor ASICs, after the synthesis stage the FPGA EDA software and ASIC EDA software are quite different.  Furthermore, Intel's internal EDA software might not actually be that good, because they are known to hand-craft most of the components of their processors (a labor-intensive "Full Custom" ASIC design process).

One of the nicest aspects of this new Intel system is its potential to serve as a standard platform.  Currently, the small number of open cores available for FPGAs can't be installed without recompilation and similar steps - so Intel's new Atom-FPGA platform may be the first where you can just download binaries of FPGA programs and run them.  Maybe this will be the start of the first FPGA App store?  I can see the commercials now: "Want to perform Fast Fourier Transforms?  There's an app for that."

Monday, November 22, 2010

Intel talks Kilocore processors, coherency wall

At least somebody took notes at the talk given by Intel (specifically Timothy Mattson) at the SC'10 conference this year, which speculates on what computer architecture will be required to make a thousand on-chip cores useful.  The message is that these systems are do-able, but that the current method of shared memory programming will not survive the transition.

Shared memory has been a problem in supercomputing since the beginning.  In the shared memory programming paradigm, all of the different processors and cores can see each other's data and do not need to send messages with complete data structures, but can instead send addresses to each other.  This is similar to emailing YouTube links instead of attaching entire mpeg files, which is of course much more practical for long HD videos that might not fit in the recipient's inbox.

The issue that arises is that the various processors in a supercomputer are different distances from the hardware that is actually holding the data in memory, and this leads to different amounts of delay and bandwidth pressure when accessing the data.  The term "Non-Uniform Memory Access" (NUMA) is applicable in this case, and in the shared memory model the programmer doesn't have a lot of control over which processors are near or far from the data, and therefore they have less control over the performance.  This is the price for creating and using the abstraction that all of the processors share memory when in reality they are all on a network with a variable number of hops between nodes and memory.

To date, the solution has been to lash many computers together in a network cluster, called distributed computing, and obliterate the illusion of shared memory completely, using a distributed programming paradigm where programmers send messages between the processors explicitly.  For example, "Send( Processor_1, "This is a message to processor 1")" is a type of command that programmers would use in this environment to communicate between processors.  This is the method used at Google exclusively before MapReduce, and is still used when MapReduce is not applicable.
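As a minimal sketch of that style, here is the Send example in Python, with threads standing in for processors and one mailbox per processor.  The send and recv helpers and the processor names are made up for illustration and are not any particular library's API.

```python
import queue
import threading

# One mailbox per "processor"; nothing is shared except these mailboxes.
mailboxes = {"processor_0": queue.Queue(), "processor_1": queue.Queue()}

def send(dest, message):
    """Place a complete message in the destination's mailbox."""
    mailboxes[dest].put(message)

def recv(me):
    """Block until a message arrives for this processor, then return it."""
    return mailboxes[me].get()

received = []

def processor_1():
    # The receiver only ever sees what was explicitly sent to it.
    received.append(recv("processor_1"))

worker = threading.Thread(target=processor_1)
worker.start()
send("processor_1", "This is a message to processor 1")
worker.join()
print(received)
```

The point of the sketch is that the whole payload travels in the message, the "attach the entire video" side of the email analogy, rather than an address into memory that both sides can see.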

Intel is now allowing its researchers to talk about how the concept of shared memory won't even extend within a single server node much longer (metaphorically, all emails will have to send entire videos, not YouTube links).  The issue is that even when all of the cores in a node reside on the same chip, they are different distances from each other, creating a NUMA effect, and the current method of hiding this effect with separate memory caches that talk to each other in order to present a single unified memory, called cache coherency, is not sustainable into the Kilocore realm.  It's not new knowledge that this model cannot be sustained in the future, but it is new that Intel is allowing its researchers to admit to the existence of the "coherency wall".  The statements are couched in the condition that the talk is discussing thousand-core systems, Kilocore processors, which are a long way off for Intel, whose current strategy is to build fewer fast processors rather than many simple processors.

An interesting subtext is that, not only is shared memory programming not viable in the Kilocore future, but that even within the alternative, message passing, Intel is predicting a "synchronous messages only" constraint that allows small on-core buffers to satisfactorily hold the communication data.  In synchronous message passing, the "send" command does not complete until the receiver performs a corresponding "recv" command, to clear the buffer.  The proposed RCCE protocol is somewhat over-constrained, in that it should be possible for the sender to proceed and be restricted only from issuing another send until the previous send has been matched by a corresponding recv.  In the YouTube email example, this is akin to an email server only allowing new video attachments to be sent if the recipient has already deleted all previous emails from that sender.

I think the more strict synchrony is implemented for simplicity's sake, and an option to remove this restriction is available, though only in the "gory" implementation of the RCCE message passing library.  It should be noted that using a "gory" build mode option on an experimental library is, in a way, its own reward, since it gives experience points in the programming demigod class ;-D
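To make the difference between the two constraints concrete, here is a hedged Python sketch with threads standing in for cores.  The class names are invented for illustration and this is not the RCCE API; it only models a one-slot buffer under the strict rule (send waits for the matching recv) and under the relaxed rule (send returns, but the next send must wait).

```python
import queue
import threading

class SyncChannel:
    """Strict synchrony: send() returns only after the matching recv()."""
    def __init__(self):
        self._slot = queue.Queue(maxsize=1)    # small one-message buffer
    def send(self, msg):
        self._slot.put(msg)
        self._slot.join()                      # wait until the receiver clears it
    def recv(self):
        msg = self._slot.get()
        self._slot.task_done()                 # releases the blocked sender
        return msg

class OneOutstandingChannel:
    """Relaxed rule: send() returns at once; only the next send must wait."""
    def __init__(self):
        self._slot = queue.Queue(maxsize=1)
    def can_send(self):
        return not self._slot.full()
    def send(self, msg):
        self._slot.put(msg)                    # blocks only if the slot is occupied
    def recv(self):
        return self._slot.get()

# Strict: the sender thread cannot finish until we receive.
sync = SyncChannel()
sender = threading.Thread(target=sync.send, args=("hello",))
sender.start()
blocked_before_recv = sender.is_alive()
msg = sync.recv()
sender.join()

# Relaxed: the first send completes with no receiver; a second would wait.
relaxed = OneOutstandingChannel()
relaxed.send("first")
can_send_before_recv = relaxed.can_send()
first = relaxed.recv()
can_send_after_recv = relaxed.can_send()
```

Both versions need the same one-message buffer, which is why the relaxed rule looks like it should be affordable on small on-core memories.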

Friday, November 19, 2010

China takes top supercomputer spot

It is now well known that China has taken the top supercomputer spot with a gigantic GPU cluster.  It's not surprising that GPUs are able to power the Linpack benchmark to such great heights.  Linpack benefits from a division of bigger tasks into smaller tasks known as "blocking".  This is not an embarrassingly parallel breakdown, which means it doesn't necessarily represent the type of performance that could be expected on data-parallel benchmarks, because bandwidth and latency are put to the test at every level: the network, the server, and the chip.  Where Nvidia can really deliver a high value is with their very high ~200 GBytes/sec of memory bandwidth per GPU whereas a commodity server node gets about 8-12 GBytes/sec.  With such high bandwidth it is possible for the GPUs to approach their maximum SIMD capabilities, and for double-precision SIMD floating point operations they just scream relative to commodity processors.

Linpack is very close to the type of application GPUs were originally invented for, and with Nvidia being slowly pushed out of the discrete graphics processor business by Intel and AMD integrating increasingly better graphics processors directly onto the CPU die, Nvidia had to branch out and add supercomputer-like capabilities to their graphics cards in order to try to fetch market in the HPC space.  They added double-precision, ECC, caching, and some other features.  I wouldn't have thought the result would be a migration of the top supercomputer to China, but that is indeed what has happened.

The Renaissance of IT

You have to appreciate Songnian Zhou, CEO of Platform Computing.  Here he shines in an interview that in fact merits a post just on his fantastic quotes:

"Cloud?  What is cloud?  I don't know.  Everybody's trying to get what cloud is.  Dark cloud, white cloud, big cloud, floating cloud... but... is that hype?  I believe there's some substance behind it.  I think this is probably the biggest invention in computing, and the business model of computing, in the last 30 years... one can say the cloud is the endpoint of distributed computing, where things are so distributed, so accessible through the Internet, that it represents the democratization, and popularization, and mainstreaming of HPC."

"So if you think about the word HPC, it gives the concept of very high performance, esoteric [applications] maybe in the cloud. In fact this methodology of using computing to study nature, to design products, to optimize businesses, to have entertaining Internet games, all these things are very broadly applicable."

"If you look at computing, at the IT industry, we are in adolescence at best, and we have been operating in the stone age... You run your own computers, you run your own servers, you run your own applications... you must be crazy.  You don't have to own the place... the point is that the IT industry is entering into a mature stage, it's a lot like the auto industry, just like the transportation industry in the sense that computing and applications will deliver services at low cost, more accessible, doing more interesting things... the range of applications because of the enablement of cloud computing, is going to grow ten... a hundred times.  And this is now the renaissance of IT."

Zhou has some interesting points, like that every mature industry eventually becomes service based, and there is other great stuff in that 10 minute interview, including a great metaphor likening the cloud to the airline industry (which happens to fly airplanes through clouds... nice).

Thursday, November 18, 2010

The age of "good enough" computing

In the last few years, consumers have been very clear in showing that they care about better CPU performance only insofar as it enables new features they care about.  Hennessy agrees (yes, the Hennessy), claiming that one of the big demands for CPU performance now comes from "the Googles of the world", meaning people want more products and features from cloud computing.  The other big demand for CPU performance comes from users that want a better user interface - and a failure to innovate on the user-interface side of things has resulted in a lack of demand for CPU performance.

AMD would agree, and they have taken on the corollary that people want the qualitatively best user experience available today at the lowest power (i.e. best battery life) and lowest cost possible.  Enter AMD's hybrid CPU/GPU processors, called accelerated processing units (APU) that deliver the best graphical experiences possible within 9w (Ontario) and 18w (Zacate) for netbooks and laptops respectively.  AMD has given up chasing Intel's single-threaded performance, which, as long as Moore's law continues and Intel maintains a process technology lead, will arguably never be beaten again.

"Good enough" can also be applied to operating systems, and the strong user base of Microsoft's 9-year-old OS is evidence that WindowsXP was indeed good enough.  I remember when Bill Gates told me and a small crowd of interns that Microsoft's biggest competitor is free software.  Not open source software or Linux, but free software of the type that users have already bought from Microsoft and is now free to operate for all time.  So originally it was not free, but it doesn't bring any money into Microsoft and Microsoft must compete against it to produce more sales.  That means Bill thought that Microsoft had been fighting "good enough" for quite some time, and foretold of the longevity of their greatest OS.

An interesting side-effect of x86 processors' progression toward multi-core is that it is now possible for an outsider to throw away multicore processing in favor of one really fast core.  It is interesting to think of what might have been, or what could still be, if the design decisions that resulted in the 3.8ghz Pentium 4 were put into action today, at 22nm, four process technologies beyond the highest clock speed processor ever released by Intel.  Although it wouldn't deliver 16x the performance, it would still run at 7ghz+ with 24MB+ of cache and would beat today's 3.3ghz processors at single-threaded applications (i.e. almost every desktop application) by a good margin.  But that processor would consume 130 watts, which is not "good enough" for the mobile computing users are trending towards today.  Nor is it good enough for cloud computing, which requires thousands of cores to execute its parallel applications.

Progress in the world of caches

I was researching the Sun Niagara 3, aka SPARC T3, and came across an interesting aspect: there is no L3 cache.  Each core gets an L2 cache in the increasingly popular size of near 256KB (Tilera, Intel Nehalem, Sandy Bridge), but slightly larger at 384KB.  This is coupled with L1 caches that are surprisingly small relative to the number of threads (8) supported per core: 8KB L1 data cache and 16KB L1 instruction cache - that's 1KB and 2KB respectively per thread in the case of the memory not being usefully shared, but realistically all of the instruction cache can be shared between the 8 threads on each core.  Furthermore, by supporting 8 threads per core, and topping out at just 1.65ghz, it is realistic that the local L2 cache has latency under 8 cycles so that there is close to no penalty hitting the L2 cache for data fairly frequently.  This suggests the L1 data cache is just there to reduce the number of accesses to L2, freeing up its bandwidth and reducing power by replacing higher-power L2 accesses with lower power L1 accesses.

Although they both lack L3 and use distributed and similarly sized L2 caches, there are some interesting architectural divergences between the Niagara 3 and Tilera's to-be-released-some-time-in-2011 Gx-100.  Tilera doesn't multithread, so their in-order cores will take a performance hit whenever they hit memory, even in the L1 cache, which suggests that programmers will need to use asynchronous memory transfers or the Tilera tools somehow extract these function calls automatically (with the same caveat all such automated program analysis tools have in that they sometimes work, and sometimes don't).  In contrast, the Niagara 3 hardware-multithreading is naturally tolerant to the latency of accessing memories belonging to other cores on the same chip.

It really is an open question as to whether tiled routing like Tilera's will take off, with Intel's cloud research processor having also used the method, as did their Terascale research chip.  In contrast, the mainstream Sandy Bridge and CELL processors use ring buses, making them a relatively proven architecture, though without modification the cross-chip latency will scale linearly rather than with the square root of the number of cores.  This linear latency scaling has less impact right now since, in the existing ring bus examples, the caches themselves have latency that is similar to or greater than the latency resulting from core-to-core data passing.
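The scaling difference can be illustrated by counting hops.  The following Python sketch assumes an idealized bidirectional ring and a square mesh with one hop per link, which ignores everything that makes real interconnects interesting (router delay, contention, wire length), so treat it as a back-of-the-envelope model only.

```python
import math

def ring_avg_hops(n):
    """Average shortest-path hops between distinct nodes on a bidirectional ring."""
    total = sum(min(d, n - d) for d in range(1, n))
    return total / (n - 1)

def mesh_avg_hops(n):
    """Average Manhattan distance between distinct tiles on a square mesh."""
    side = math.isqrt(n)
    if side * side != n:
        raise ValueError("n must be a perfect square")
    tiles = [(x, y) for x in range(side) for y in range(side)]
    total = sum(abs(ax - bx) + abs(ay - by)
                for (ax, ay) in tiles for (bx, by) in tiles)
    return total / (n * (n - 1))

for cores in (16, 64, 256):
    print(cores, round(ring_avg_hops(cores), 2), round(mesh_avg_hops(cores), 2))
```

Quadrupling the core count roughly quadruples the average ring distance but only doubles the average mesh distance, which is the linear versus square-root behavior described above.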

AMD and Intel continue forward with their large L3 caches, with Intel having transitioned to a ring-bus style L3 from Nehalem to Sandy Bridge, which has reportedly had a favorable impact on L3 latency (though that seems somewhat counterintuitive).  The IBM BlueGene/P uses cross-bar access to an 8MB L3 cache with relatively high latency at 35 cycles (that is a really great review of the architecture, btw), but this is not unusual since cross-bar is a standard method for 4-cores or less.

So the jury is not just out on whether giant high latency L3 caches will continue to prevail, but also whether the largest on-chip cache will utilize a ring bus, tiled mesh, hierarchical, or other topology as chips continue to progress to ever larger core counts.

Tuesday, November 16, 2010

Technicalities of "x" and "%"

When something is "2x faster", does that mean it is 200% faster?  I have seen "x" and "%" sometimes used in such a way that the answer would be yes, and sometimes no.  My impression from working in the field of computer architecture and hardware acceleration is that there is a general rule that is followed, so since the terminology is somewhat ambiguous Mac and I formalized how we use them some time back.

1) We have formal understandings of what it means to be Y% faster, it means the new performance is 100% + Y%. Slower works in the same way but with subtraction, so Y% slower means the new performance is 100% - Y%.  Therefore if A is 50% faster than B, B is 33% slower than A.

2) Higher performance x's work similarly but without adding the 100%.  So 3x faster means the final performance is 300% of the original.  Slower performance x's work quite differently, by turning the x into a division sign "/".  5x slower means the new performance is 1/5, or 20% of the original performance.  Therefore if A is 30x faster than B, B is 30x slower than A.
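The two rules above can be written down as a few one-line Python functions.  The function names are just for illustration; each returns the new performance as a multiple of the original.

```python
def percent_faster(y):
    """Y% faster: new performance is 100% + Y% of the original."""
    return 1 + y / 100

def percent_slower(y):
    """Y% slower: new performance is 100% - Y% of the original."""
    return 1 - y / 100

def times_faster(x):
    """Nx faster: new performance is N times the original (no added 100%)."""
    return x

def times_slower(x):
    """Nx slower: the x turns into a division."""
    return 1 / x

def reverse_percent_slower(y):
    """If A is Y% faster than B, return how many percent slower B is than A."""
    return (1 - 1 / percent_faster(y)) * 100

print(percent_faster(50), times_faster(3), times_slower(5))
print(round(reverse_percent_slower(50), 1))
```

Note the asymmetry this exposes: percentages do not invert cleanly (50% faster one way is about 33% slower the other way), while x's do (30x faster one way is exactly 30x slower the other).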

Here's an example of Mozilla using the same definition for higher performance x's.  The reference is regarding the performance of Firefox's new beta - one nice thing is that there is an instance where they round up from 2.94x to 3x.  They also round from 3.49x down to 3x.  In these cases they are rounding to one significant digit.  At Cognitive we typically use two or three significant digits and round down, or whichever direction is a more conservative estimate for Cognitive's performance.

Wednesday, November 10, 2010

Where MapReduce fault tolerance comes from

Google's MapReduce is a great programming paradigm.  It takes data parallelism for granted, runs on however many processors are available, and keeps running even if some of the computers crash, get unplugged, or catch fire.

MapReduce is able to tolerate faults for an interesting reason.  During the Map phase, the Master Controller (not sure if that's the official label, but if not this is better anyway :) assigns data chunks to each map worker node, and the death of a map worker just means that the Master Controller must reassign those Map inputs to a different node.

Here's the interesting part: Each mapper caches the outputs produced during the map phase and separates them according to which reducer node will receive them as inputs.  If a reducer node crashes, gets unplugged, or gets too cold, the Master Controller tells a different reducer node to use the mapper caches to finish up the missing processing.
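Here is a toy, single-process Python sketch of that recovery path, using word counting as the job.  It is not Google's implementation; the function names, the partitioning rule, and the way a crash is simulated are all assumptions made for illustration.

```python
from collections import defaultdict

def partition_for(key, num_reducers):
    """Deterministic toy partitioner: which reducer owns this key."""
    return sum(map(ord, key)) % num_reducers

def run_map(chunks, num_reducers):
    """Each mapper caches its outputs, separated by destination reducer."""
    mapper_caches = []
    for chunk in chunks:
        cache = defaultdict(list)              # reducer id -> [(word, 1), ...]
        for word in chunk.split():
            cache[partition_for(word, num_reducers)].append((word, 1))
        mapper_caches.append(cache)
    return mapper_caches

def run_reduce(partition, mapper_caches):
    """Any node can reduce a partition by reading the mapper caches."""
    counts = defaultdict(int)
    for cache in mapper_caches:
        for word, n in cache.get(partition, []):
            counts[word] += n
    return dict(counts)

def master(chunks, num_reducers, crashed_reducers=frozenset()):
    mapper_caches = run_map(chunks, num_reducers)   # the map phase runs once
    results, reassigned = {}, []
    for partition in range(num_reducers):
        if partition in crashed_reducers:
            reassigned.append(partition)       # hand the partition to a survivor
        results.update(run_reduce(partition, mapper_caches))
    return results, reassigned

chunks = ["a b a", "b c"]
healthy, _ = master(chunks, num_reducers=2)
recovered, reassigned = master(chunks, num_reducers=2, crashed_reducers={1})
print(healthy, recovered, reassigned)
```

The thing to notice is that losing a reducer never forces the map phase to run again, because the partitioned map outputs are still sitting in the mapper caches.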

It works this way because clever MapReduce designers realized they could cache the map results to *memory* until a sufficiently large chunk could be sent to disk as an efficient sequential write.  The hard drive caching never became a performance bottleneck because the sequential writes and sequential reads are faster than the Gigabit Ethernet connecting them to the network.

This should have caught up to Google by now, because 10GigE is getting cheap and hard drives didn't get any faster.  But as history has shown, it's not a good idea to bet against Google.  SSDs arrived just in time to save the day, and Google will be able to transition just fine to 10GigE by coupling a few of them with each server (SATA 3 reaches 6 gbps, which can be saturated by 3-4 SSDs or so).  Google's main concern about servers is their power consumption, and those SSDs sip power relative to the other big power-drawing PC components.  For the IO bound MapReduce tasks, 10GigE + SSD means life is good.

Tuesday, November 9, 2010

7Gbps wireless a step backwards?

A lot of noise is being made over the new 7Gbps WiGig standard, which, at first glance, is about a thousand times faster than my laptop gets at home.  This has gotta be good, right?

Well, as is typical of anything having to do with wireless, the real story is a lot more complicated than just peak bandwidth.  The range is just terrible compared to even 802.11b, with full performance only achieved within a 15-foot radius with line-of-sight - and don't think about transmitting through walls because the bandwidth will be awful if the connection doesn't completely drop.  This is played up as a good thing, because it prevents neighbors from stepping on each other's bandwidth, but that also makes it unsuitable for most in-home purposes.

The practical uses of such bandwidth are also hard to come by.  Most hard drives won't read much faster than 60MB/sec in sequential mode, which is 480mbps, or about 1/15th the bandwidth for which WiGig has been designed.   Sequential read speeds for blazing fast SSD hard drives are only 2gbps, still not pushing the limit.  A favorite scenario that is often cited is the ability to transfer a Blu-ray in under a minute - but, ignoring the fact that movies would have to be decrypted, the fastest Blu-ray drives are 12x, where 1x is equal to 4.5MB/sec, so 12x is 54MB/sec, or 432mbps; i.e. also less than 1/15th the optimal WiGig speed.  And let's not get started on the abysmal "broadband" speeds in the U.S. for which existing WiFi is already overpowered - where I pay $60/month for 20mbps and in reality get 5mbps (a data rate that transfers just fine on 7-year-old 802.11b cards) - or about 1,000x slower than WiGig.
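The unit conversions above are easy to get wrong, so here is a short Python sketch that redoes them.  It takes the hard drive figure as 60 megabytes per second (which is what the 480mbps conversion implies) and uses 8 bits per byte with no protocol overhead, which is a simplifying assumption.

```python
WIGIG_MBPS = 7000   # 7 Gbps peak, in megabits per second

def mbytes_to_mbits(mbytes_per_sec):
    """Convert megabytes/sec to megabits/sec (8 bits per byte, no overhead)."""
    return mbytes_per_sec * 8

sources_mbps = {
    "hard drive, sequential": mbytes_to_mbits(60),
    "fast SSD, sequential": 2000,
    "12x Blu-ray drive": mbytes_to_mbits(12 * 4.5),
    "home broadband, measured": 5,
}

for name, mbps in sources_mbps.items():
    print(f"{name}: {mbps:g} mbps, WiGig is {WIGIG_MBPS / mbps:.1f}x faster")
```

By this count the only source on the list within an order of magnitude of WiGig is the SSD, and even that leaves more than two thirds of the link idle.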

I'm not saying WiGig won't be great for connecting external screens (in the same room) wirelessly, because it will be, but that is a fairly niche purpose.  In fact, people don't plug in devices to screens all that often (a friend bringing over the new PlayStation would be an exception, but that only happens once every ten years, right?), so WiGig isn't really replacing wires, it's just making things wireless that most of us rarely do to begin with.

Monday, November 8, 2010

Solid State Drives mature, leave home for new housing

Most of us buying laptops with SSD hard drives love them for their speed and ruggedness.  They're expensive, but that's the price we pay for the latest and greatest, right?  It may be surprising, then, that your new bleeding edge Flash-based hard drive is housed in the standard 2.5" laptop form factor (69.85 mm × 7–15 mm × 100 mm) originally designed for spinning disks in... 1988.  That's pretty old, and a company has finally taken a stand, said enough is enough, and brought SSD to a form factor worthy of the new millennium.

Toshiba announced from Tokyo today a new form factor that is not just smaller than the current standard, but literally 1/10th the size (24mm x 2.2-3.7mm x 108.9mm).  Called the Blade X-gale, it's basically the same length, one third the width and about one third the height.  In fact it's almost identical to the DIMM form factor used for memory, which we discussed yesterday (a size that seems to have been tried back in August but didn't catch on).

One might expect the capacity-to-weight ratio, or gigabytes per gram, not to have improved much over 2.5" magnetic disk drives because, while heavier than Flash, they also have greater capacity.  The current biggest laptop drive is the Seagate Momentus 640GB, which weighs in at 120 grams, achieving 5.3 GB/gram.  The highest capacity 3.5" drive is currently the 3TB drive from Western Digital's Caviar Green series, which weighs 730 grams, yielding just 4.1 GB/gram.  This is where it gets interesting, as the new 256GB Blade X-gale from Toshiba weighs just 13.2 grams, achieving 19.4 GB/gram - besting the legacy form factors by 4x-5x.
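For anyone who wants to plug in other drives, the comparison is a one-liner per drive.  This Python sketch uses only the capacities and weights quoted above, treating 3TB as 3000GB; the dictionary keys are informal labels.

```python
# (capacity in GB, weight in grams), as quoted in the post
drives = {
    "Seagate Momentus 2.5in": (640, 120),
    "WD Caviar Green 3.5in": (3000, 730),
    "Toshiba Blade X-gale": (256, 13.2),
}

def gb_per_gram(capacity_gb, weight_g):
    """Capacity-to-weight ratio in gigabytes per gram."""
    return capacity_gb / weight_g

density = {name: gb_per_gram(*spec) for name, spec in drives.items()}
blade = density["Toshiba Blade X-gale"]

for name, value in density.items():
    print(f"{name}: {value:.1f} GB/gram, Blade advantage {blade / value:.1f}x")
```

Unrounded, the Blade's advantage works out to a bit over 3.6x against the laptop drive and a bit over 4.7x against the 3.5" drive.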

With such big improvements in size and weight, Toshiba's new product line is a good reminder that many components of the Personal Computer are mired in their own legacy, just waiting to be updated.  PC BIOS is another example of this, which first debuted back in 1981, and industry has been so slow to move on that the newest hard drives are no longer fully functional.  Indeed, the three terabyte drive discussed above can't serve as a boot drive, and is therefore limited to secondary data storage roles until motherboard manufacturers implement the newer Unified Extensible Firmware Interface (UEFI) more broadly.

This leads to the question of what other parts of the PC may be stuck in the past, with order-of-magnitude improvements still waiting to be unleashed...

Friday, November 5, 2010

Flash vs DRAM

DRAM and Flash store bits in fundamentally the same way: charge (on-bit) or a lack of charge (off-bit) is stored in a capacitor (a very small battery) which will be tested later to detect whether the charge is present.  MLC Flash uses the same technique but varies the level of charge stored in the capacitor in order to get more than 1-bit per capacitor.

Flash chips hold a lot more data than DRAM.  This sounds intuitive when you remember that Flash goes in hard drives, which are much bigger than the memory (DRAM) of a typical computer.  But the difference is more striking when the two chips are placed side-by-side, because the Flash chip is basically the same size as the DRAM chip (here's a neat picture showing how two Flash chips fit in one SD card).  Case in point, I was browsing Flash integrated circuits (ICs) and stumbled upon a monster at Micron.  The Micron catalog shows the MT29F256G08CUCBBH3-12 coming in at 32 gigabytes.  For a frame of reference, the best DRAM chips (Micron catalog) hold 512 megabytes.  That's a difference of 64x!

The transistors are similar sizes for the latest DRAM and Flash, so the capacity difference is achieved by putting multiple flash layers on top of each other (this is on its way to achieving 128GB Flash chips).  3D chip stacking has a traditional problem of overheating because each layer consumes power and the layers insulate themselves, causing internal temperature to escalate.  Stacking Flash solves this problem because capacitors don't consume power, and Flash doesn't lose its charge for years, so most of the layers are not in use at a given moment.  Thus, Flash chips are utilizing their dark silicon to achieve extreme densities.

On the other hand, Flash is slow and only supports low data transfer speeds.  For the example above, the Flash chip runs at 166mhz at one byte per cycle whereas DDR3 DRAM achieves an effective 1600mhz (800mhz DDR) at one byte per cycle.  So DRAM chips allow about 10x the transfer rates of Flash chips.

Lastly, there is not much benefit in fetching data from Flash in chunks much smaller than 4KB, meaning about 4,000 cycles over an 8-bit bus.  This is why SSD speed is sometimes measured in IOPs, where a 166mhz Flash operating on 4KB blocks with an 8-bit interconnect provides about 40,000 IOPs (with a really good controller).   In contrast, DDR does generally have a minimum of 8 transfers per access, which, over an 8-bit connection, is 8 bytes per access.  At an effective 1600mhz this is 200,000,000 IOPs, or about 5000x as many as the Flash chip.  Thus, DRAM allows the memory bandwidth to be dedicated to many accesses of smaller chunks of data instead of only very large chunks like Flash.
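The operation counts above follow directly from clock rate divided by bytes per access.  This Python sketch reproduces them under the same simplifications used in the text: one byte per cycle on an 8-bit bus, no command or controller overhead, and constant names chosen for illustration.

```python
FLASH_BYTES_PER_SEC = 166e6    # 166 MHz at one byte per cycle
FLASH_ACCESS_BYTES = 4096      # a 4KB block per useful access

DRAM_BYTES_PER_SEC = 1600e6    # effective 1600 MHz at one byte per cycle
DRAM_ACCESS_BYTES = 8          # minimum burst of 8 transfers over 8 bits

def accesses_per_sec(bytes_per_sec, bytes_per_access):
    """Peak accesses per second if the bus does nothing but these accesses."""
    return bytes_per_sec / bytes_per_access

flash_iops = accesses_per_sec(FLASH_BYTES_PER_SEC, FLASH_ACCESS_BYTES)
dram_iops = accesses_per_sec(DRAM_BYTES_PER_SEC, DRAM_ACCESS_BYTES)
bandwidth_ratio = DRAM_BYTES_PER_SEC / FLASH_BYTES_PER_SEC

print(f"Flash: {flash_iops:,.0f}/sec, DRAM: {dram_iops:,.0f}/sec")
print(f"bandwidth ratio {bandwidth_ratio:.1f}x, access ratio {dram_iops / flash_iops:,.0f}x")
```

The roughly 10x bandwidth gap and the 512x gap in access size multiply together, which is where the roughly 5000x difference in operations per second comes from.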

The differences between DRAM and Flash are indeed striking, with Flash providing about two orders of magnitude greater density and DRAM providing about 4 orders of magnitude more operations per second.

Edit: You can see a follow-up post here.

Thursday, November 4, 2010

The Nano Duo

For 100 points, what company has been the second largest manufacturer of motherboard chipsets and recently settled with Intel to extend its x86 license to 2018?  Although AMD probably comes to mind first, the answer is VIA Technologies, the scrappiest company to ever make computer chips.  Here's a great interview of VIA execs over at bit-tech that helps explain why you haven't heard from them in a while.

Around the time Intel released the Atom, VIA introduced the Nano processor (codenamed Isaiah, here's the whitepaper), which achieved double the performance of Intel's Atom while consuming about the same amount of power.  The Nano even consumed less power at idle, a typical state for the netbooks it was being designed to drive.  More benchmarks from 2009 also showed Intel being handed its own posterior.

But nobody used it.  Well, IBM and Samsung put it in netbooks, Dell put it in mini servers for physicalization (the first gigabit is always free), and maybe there were some other devices (I seem to remember trying to put it into a robot at one point).  Intel continued advancing the Atom, multiplying its cores, adding out-of-order execution, and consolidating the north and south bridge so that, as a platform, Atom consumed less power than before.  This has gone far enough that you can't find Nano on newegg anymore (it should be on the shelf next to the ARM netbook, shouldn't it?).

Well things may turn around as the Nano has gotten a facelift, jumped one-and-a-half Moore's Law cycles, gone dual core, and integrated a good graphics processor.  Perhaps unsurprisingly, it is back to handing out posteriors, playing PC desktop games that were previously impossible on a netbook, and coming very close to the performance of a Core 2 Duo.  Hey, I remember that processor - it's the one powering this laptop as I type.  Hmm...

Maybe VIA can do something with their new demon this time.

Wednesday, November 3, 2010

Golden ages of technology never to return. Part2: WindowsXP

I first became a fanboi of Windows for the games.  Doom and Quake changed the way I thought about computers.  They made me want to learn how to program them and, when I could first afford my own, to know why one computer was better than another.  This latter motivation coincided with the original dual Celeron hack, which raised issues that are still at the forefront of computing (what's the difference between on-chip and off-chip cache?  Why do some processors overclock more than others?  How does higher voltage increase clock headroom?  Why was overclocking the bus important?  How does a graphics card offload CPU work?).

Back on topic, Windows 95 had crashing problems.  Windows 98 had fewer crashing problems but only supported single processors.  Windows NT crashed even less and allowed dual processors, but had software and driver incompatibilities.  WindowsXP was the first OS that supported multiple processors, crashed very little, and was compatible with all first-release computer games.  Another great feature is that Linux and WindowsXP could dual-boot with a little care, and at the time it was fun (for me at least) to learn what ruined a dual-boot installation and how it could be done properly.  Those were the first reasons I latched onto WindowsXP.

Then something expected happened :).  Mac market share continued to dwindle, reaching an all-time low around 2002, coinciding with the time most kinks were ironed out in WindowsXP with service pack 1.  For example, market share at universities like Cornell (a traditional haven for Mac fans) had fallen from 41% in 1994 to a sustained period of 5% from 2000 to 2002.  This meant that every piece of software released in 2002 came out for WindowsXP (please post exceptions in the comments, as well as whether those companies are still in business).  Put another way, there was no software released in 2002 that you couldn't run on your WindowsXP computer.

This period also coincided with an all-time peak in Internet Explorer adoption rate and the release of Internet Explorer 6.  This meant that there were no browser incompatibilities for WindowsXP users in 2002 - everything worked with IE6 or died.  In addition, any hardware that came out got very cheap very fast as the hardware manufacturers all competed on basically a single platform (elongating the lull in Mac usage as the hardware benefited less from economies of scale).  Finally, all this software automatically got twice as fast (as it had for the previous ~20 years) as clock speeds and Instruction-Level-Parallelism continued to scale without the need for dual cores or multithreaded programming.

During its golden age, WindowsXP created the most compatible computers of all time.  This period was eventually followed by an increase in Mac usage (a healthy thing from many perspectives), which is now between 50% and 70% among incoming college students.  In conjunction, there was an increase in browser diversity, and an operating system from Microsoft that intentionally irritated users (dominance inspiring hubris, which leads to bad products, is a consistent theme for a golden age).  Software also stopped getting 2x faster automatically with Moore's Law, with subsequent improvements requiring downloads and reinstallation.

It is sad that this compatibility came at the expense of Apple etc., and some will see it as a dark ages of sorts, but in terms of compatibility (both applications and web surfing) it is hard to argue that computers were ever more compatible before, or will ever be more compatible than they were during the golden age of WindowsXP.

Tuesday, November 2, 2010

Intel hedging against Moore's Law?

Intel's delivery of the first 16 Moore's Law cycles is widely admired across the industry as being on-time and on-budget.  This unique reputation, 40 years in the making, strikes a fear of falling behind into the hearts of competing manufacturers.  Photons not traveling straight enough?  No problem, just immerse the whole process in liquid to straighten that up.  Need the performance of a 4-atom-thick insulating layer without the defect rate?  No problem, just change the way transistors have been made since CMOS was invented.  Up to now, Intel has achieved these milestones without any outside help, and has lately accrued a substantial lead in process technology.  Competitors are running to each other in the hopes of not falling further behind.

That's what makes this story about Intel partnering with Toshiba and Samsung for the next two Moore's Law cycles so surprising.  Who would have thought Intel could use some help pushing Moore's Law along?

Now, it is possible to downplay this.  I mean, it is only for flash memory technology - and Intel is not confirming the story either, so it may not happen at all.  But let's suppose it is true and think through this for a second.  Many steps in the process to make flash memory are also used to make microprocessors (e.g. both require fabricating a type of transistor), so Toshiba and Samsung should get a serious leg up on their way to producing non-flash devices at 10nm as well.  This could potentially concede part of Intel's lead in fabrication technology - that's a big downside.  Why risk it?

One answer is that Intel foresees real struggles and the potential for long delays before achieving 10 nanometer parts. By partnering, Intel trades the increased risk of losing its technology lead for a decreased risk of reaching 10 nanometers slowly or not at all.  That is some serious doubt coming out of the company that should be most confident about its future.

Let's hope the story is wrong and that Intel is indeed as confident as can be about their timely achievement of 10 nanometers and beyond.

Monday, November 1, 2010

Intel follows AMD's lead, spins off foundry business

</sensational headline> Well, the spinoff is yet to be announced :-P, but Intel is indeed opening their fab to an outsider for the first time.  Achronix, a relatively new FPGA company (background), must have seriously impressed some execs to win the keys to Intel city and its 22nm fabrication facilities, which come online next year.  With Intel now maintaining roughly a half-node advantage over all other fabs, Achronix will be releasing production 22nm FPGAs by the time Xilinx and Altera are at full production with their 28nm FPGAs from TSMC.  When fabricating the same design, a 22nm die will hold about 60% more logic than a 28nm die of the same area, and Intel's 22nm will probably consume about 30%-50% less power for the same design at the same speed.
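
For anyone wondering where the "about 60%" figure comes from, here is a quick back-of-the-envelope sketch.  It assumes ideal scaling, where the area of a design shrinks with the square of the feature size; real processes rarely scale this cleanly, and the function name `density_gain` is just something made up for this illustration.

```python
def density_gain(old_nm: float, new_nm: float) -> float:
    """Ideal density gain when moving a design from old_nm to new_nm.

    Assumption: every dimension shrinks linearly with the node name,
    so the area per gate scales with the square of the feature size.
    """
    return (old_nm / new_nm) ** 2


gain = density_gain(28, 22)
# Prints: 1.62x density, about 62% more logic in the same area
print(f"{gain:.2f}x density, about {100 * (gain - 1):.0f}% more logic in the same area")
```

Under that idealized assumption the result lands right around the 60% mark quoted above.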

The current method of circumventing the power wall, ILP wall, and memory wall by adding more cores to each processor die may not maintain the Moore's Law rate of 2x per cycle.  A nice aspect of FPGAs is that they still achieve 2x per tech node, or slightly better.  Designs on even the biggest and fastest FPGAs are still not near the ~150-watt power wall, and off-chip communication bandwidth continues to roughly double as the onboard high-speed transceivers are still getting faster and more numerous.  Part of the reason for this continued ability to scale is that FPGAs are programmed in Hardware Description Languages (HDLs) in which huge amounts of parallelism must be declared directly by the programmer (hard problems like timing closure and clock boundary crossing also fall on the programmer's shoulders).

Thus, with Intel's 22nm tech node not expected to deliver a big improvement in serial processing speed, and only (optimistically) doubling x86 core count to 12 (which AMD's 32nm Magny Cours already reaches at 2.2ghz), the greatest capabilities reaped from the timely arrival of Intel's 22nm tech node (potentially 1-2 years ahead of Global Foundries, TSMC, UMC, etc.) may come in the form of FPGAs with the highest speed and capacity on the planet (by a large margin).

More importantly, these new best-on-the-planet FPGAs may be provided at prices like $400.  That is 25x less than today!  Even if you can afford ~$10k to get the best right now, you still need like 3-6 months lead time.  Taking all this into account, what's most surprising is that stock prices didn't plummet for companies that will soon be competing on unlevel ground (Xilinx up 0.1% and Altera down 0.1% on the day).