Monday, November 7, 2011

128-bit SIMD is dead. Long live 256-bit SIMD.

Linus recently made known his preference for bigger cores, and the death of 128-bit SIMD is definitely a step in that direction.  As announced back in 2008, Intel now considers 256-bit SIMD to be standard, and AMD has followed with their all-purpose Bulldozer line of processor models.

History: The top-500 supercomputer list was the first consistently updated ranking of supercomputers, and because it was first it is considered the standard.  Although high performance computing can now be considered a particular configuration of cloud computing resources, it was originally dedicated almost exclusively to scientific pursuits.  As scientific notation suggests, science needs to represent numbers in which the placement of the decimal point is flexible and controlled by a secondary "exponent" number.  Thus the number, its exponent, and the sign (positive or negative) combine to represent the numbers used in science within the finite resources (bits) available in computers.  Although representations with 3 (16-bit half precision) and 6 (32-bit single precision) significant digits are available, science has almost exclusively used the double precision representation, which provides a whopping 15 digits of accuracy.

The top-500 list therefore ranks supercomputers by their double-precision performance (floating point operations per second, or FLOPS), which is measured using the Linpack benchmark.  Linpack's running time is dominated by the Double-precision General Matrix Multiply routine, or DGEMM, whose bottleneck is carrying out the inner product of matrix fragments.  (Interestingly, it can in fact be more efficient to carry out an outer-product function within the registers in order to reduce the number of memory reads/writes required to feed the register file a sufficient number of operands, but that is for another post.)
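As a rough check on those digit counts, here is a minimal sketch.  It assumes the usual IEEE 754 significand widths (11, 24 and 53 bits, counting the implicit leading bit) and the common rule of thumb that p bits guarantee floor((p - 1) * log10(2)) decimal digits.  The names are mine, chosen for illustration.

```python
import math

# Significand precision in bits (including the implicit leading bit),
# assuming the IEEE 754 binary16 / binary32 / binary64 formats.
PRECISION_BITS = {
    "half (16-bit)": 11,
    "single (32-bit)": 24,
    "double (64-bit)": 53,
}

def guaranteed_decimal_digits(p):
    """Decimal digits that always survive storage in p significand bits."""
    return math.floor((p - 1) * math.log10(2))

digits = {name: guaranteed_decimal_digits(p) for name, p in PRECISION_BITS.items()}

for name, d in digits.items():
    print(f"{name}: {d} significant decimal digits")
```

The double precision figure can also be read from `sys.float_info.dig` on most Python builds.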

Using what are now old-style 128-bit SIMD systems it is possible to hold two 64-bit double precision numbers in each register.  The typical operation performed on these registers is A = A + B*C (the so-called multiply accumulate, or MAC, although accumulation suggests that the operands being summed are of the same sign, which significantly simplifies the design of the adder).  The addition and the multiplication are each counted as a floating point operation, so one MAC on 128-bit SIMD registers performs four double-precision floating point operations.  The register file is designed to be sufficiently large, and the cache bandwidth and the secondary issue port for SIMD memory operations sufficient, to allow each core to initiate a SIMD MAC every cycle.  The results are stored in registers that will not be read for several cycles, which gives the multi-cycle MAC unit time to finish producing them.  This results in four FLOPs per core per cycle (each core typically has one SIMD register file and ALU).
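To make the counting concrete, here is a small sketch that models a SIMD register as a plain Python list of lanes.  It is only an illustration of the bookkeeping, not of how the hardware is built, and the function names are hypothetical.

```python
def simd_mac(a, b, c):
    """Lane-wise A = A + B*C over equal-length 'registers' (lists of floats)."""
    if not (len(a) == len(b) == len(c)):
        raise ValueError("all registers must have the same number of lanes")
    return [ai + bi * ci for ai, bi, ci in zip(a, b, c)]

def flops_per_mac(register_bits, operand_bits=64):
    """One multiply plus one add per lane."""
    lanes = register_bits // operand_bits
    return 2 * lanes

# Two double-precision lanes fit in a 128-bit register.
result = simd_mac([1.0, 2.0], [3.0, 4.0], [5.0, 6.0])
print(result)              # lane 0: 1 + 3*5, lane 1: 2 + 4*6
print(flops_per_mac(128))  # FLOPs counted for one 128-bit MAC
```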

With 256-bit SIMD a total of four 64-bit operands can be held in each register, allowing the initiation of a total of eight FLOPs per core per cycle, thereby doubling potential performance.  This new wider SIMD unit became standard with the addition of Intel's AVX instructions to the x86 standard.  Doing a little math, we can see that Fujitsu's announcement of a 23.2 Petaflop system using 16-core processors at 1.848 GHz delivering 236.5 GFlops each works out to eight FLOPs per core per cycle, which is what a 256-bit SIMD unit provides.
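The "little math" can be sketched as follows.  This assumes one SIMD MAC issued per core per cycle, as described earlier, and the helper name is made up for the example.

```python
def peak_gflops(cores, clock_ghz, simd_bits, operand_bits=64, macs_per_cycle=1):
    """Peak double-precision GFlops for one chip under the one-MAC-per-cycle assumption."""
    lanes = simd_bits // operand_bits
    flops_per_cycle = 2 * lanes * macs_per_cycle   # multiply + add per lane
    return cores * clock_ghz * flops_per_cycle

per_chip_256 = peak_gflops(cores=16, clock_ghz=1.848, simd_bits=256)
per_chip_128 = peak_gflops(cores=16, clock_ghz=1.848, simd_bits=128)

# Number of such chips needed to reach the announced 23.2 Petaflops.
chips_needed = 23.2e6 / per_chip_256

print(f"256-bit: {per_chip_256:.1f} GFlops per chip")
print(f"128-bit: {per_chip_128:.1f} GFlops per chip")
print(f"chips for 23.2 PFlops: {chips_needed:.0f}")
```

Only the 256-bit case reproduces the quoted per-chip figure; the 128-bit case comes out at half of it.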

The extension of general purpose processors to include 256-bit SIMD units as standard narrows the gap between GPU and CPU at SIMD operations.  China's fastest supercomputer, which previously held the world title, is a heterogeneous supercomputer which uses GPUs with 512-bit wide SIMD.  Intel's Many-integrated-core architecture also touts 512-bit wide SIMD.

At just 2x, the difference between general purpose SIMD units and vector processors (GPUs) has never been so small.  The main difference between these architectures is no longer the SIMD units, but the performance of the cache and the cache coherency simulating a shared memory between the cores.  Intel continues to attempt to push the coherency boulder up the hill, and if third tries are truly a charm, then the cache coherent 512-bit SIMD cores of the Larrabee 3 hybrid, aka Knights Corner, will be something to see.

Saturday, November 5, 2011

Linus on big vs small cores

There's a discussion involving Linus Torvalds, David Kanter, and Michael Schuette in which they argue the merits of big cores vs small cores.  If Carmack, Sharky, Anand, and Stokes were to get in on it then it would be a most epic CPU discussion (hint hint).

Thursday, November 3, 2011

The MOST EFFICIENT pseudo random number generator

I exhumed Marsaglia's Complementary Multiply with Carry (CMWC) algorithm years ago in pursuit of implementing a really efficient random number generator on a 32-bit system that has a full 64-bit result from its 32x32 multiplier.  It so happens that in this case CMWC is really the most efficient algorithm that can be thought up, and about a million times more efficient than Mersenne Twister, while Marsaglia has shown in multiple articles that its randomness appears to be as good as any existing pseudo-random number generator.

It was not easy to find all of the necessary information to implement it - I had to follow several threads from old newsgroups, and find copies of the referenced papers in particular in order to find a suitable value for "a", which is one of the factors in the repetitive multiplication.  I implemented it using four 32-bit seeds, which creates a very low likelihood of repetition (before the development of our Sun into a Red Giant).  I thought I was so clever to have unearthed such a great algorithm, so well suited to the hardware I was programming.

Well it turns out that Wikipedia now includes everything you need to know about CMWC.  They include an example C implementation with an apparently proper sample "a" value, and even use a default implementation of four 32-bit seeds.  This must be how it feels to finish your dissertation's bibliography just before Google Scholar comes out and electronic copies of journal articles are ubiquitous.  That Wikipedia page would have saved me A LOT of time...

Anyway, for everyone trying to generate random numbers quickly, check out CMWC - from my experience it is faster than anything else out there, and plenty random.

Edit: The Wikipedia implementation uses 4096 seeds of 32 bits each, so it is not very compact relative to the four-seed version.  Although changing to four seeds is not difficult, generating an "a" value that works for four seeds still requires a little bit of research.
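For the curious, here is a minimal Python sketch of the CMWC step, modeled on Marsaglia's widely circulated CMWC4096 listing (multiplier a = 18782, lag 4096, starting carry 362436).  It is not the four-seed version I used: the class takes the lag table and "a" as parameters, but a multiplier that is actually valid for four seeds is not supplied here and still has to be looked up.  The class name and the seeding via Python's `random` module are my own assumptions for the demo.

```python
import random

MASK32 = 0xFFFFFFFF

class CMWC:
    """Complementary multiply-with-carry sketch, base b = 2**32 - 1."""

    def __init__(self, seeds, a, carry):
        self.q = [s & MASK32 for s in seeds]   # lag table; len(seeds) is the lag
        self.a = a                             # multiplier; must suit the chosen lag
        self.c = carry
        self.i = len(self.q) - 1

    def next_u32(self):
        self.i = (self.i + 1) % len(self.q)
        t = self.a * self.q[self.i] + self.c   # 32x32 multiply, 64-bit result
        self.c = t >> 32                       # high word becomes the new carry
        x = (t + self.c) & MASK32              # low word plus carry, wrapped to 32 bits
        if x < self.c:                         # wrap-around correction
            x = (x + 1) & MASK32
            self.c += 1
        self.q[self.i] = (0xFFFFFFFE - x) & MASK32   # the "complementary" step
        return self.q[self.i]

# Demo with the CMWC4096 parameters; the seed source is arbitrary.
seeder = random.Random(2011)
gen = CMWC([seeder.getrandbits(32) for _ in range(4096)], a=18782, carry=362436)
sample = [gen.next_u32() for _ in range(5)]
print(sample)
```

Each call costs one multiply, a shift, a couple of adds and a compare, which is where the speed comes from on hardware with a full 64-bit product.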

Friday, October 21, 2011

Mac ARM'ed to the teeth?

Rumors of a Mac OS X migration to ARM processors are flowing from multiple reliable sources.  We hear from Jon Stokes' blog (now on Wired's Cloudline) of some evidence that Mac OS X was running on Intel hardware from the beginning, long before it was determined that Mac would switch to x86 (thereby removing most backward compatibility).  It is now believed that there is a similar port being maintained again, but this time for ARM processors.

That Cortex A15 review indicates that performance will be on par with a 2 GHz Core 2 Duo (3-issue superscalar 15-stage pipeline).  Although Intel achieved this performance over 5 years ago, the chip is, admirably, about half as fast as the very fastest processors (and typical of laptop processors) at serial processing, which is what the vast majority of applications consist of.  At 32nm the ARM processor could be expected to do so at around 5 watts per core at peak, opening the way for speculation of its debut in the MacBook Air.

Charlie at SemiAccurate seems to think it a certainty, though he is betting on a successor to the A15 somewhere around 2015 (this was also speculated about a year ago).

The processor wars are heating up thanks to the increased importance of power efficiency.  If things continue to go the way they are, I wonder if the DoJ will end up allowing ARM to acquire AMD?

Friday, October 14, 2011

I-Opener resurrected

This post will include a lot of "I", and generally bad writing, because it is being written from an excited state.  Forgive ok?

I spent a good portion of the last week cleaning my office area.  I'm very happy with the results and felt like rewarding myself.. but how?  It's not an easy choice because I already had ice cream and pizza today, enjoyed the pool, and the budget doesn't allow for extravagance.

Oh I know, pry open the old I-Opener and see if we can get it running again.  I had spotted its various parts during the move to Palo Alto this summer, and clung to it with ferocity when my well meaning parents tried to help us clean up after the move and wanted to dump it.

There are many issues with the I-Opener, too many to list today, but the end result of rewarding myself with unrestricted hacking is that the I-Opener is playing Cocteau Twins as I speak, running Windows 98 SE.  The I-Opener hasn't made a peep in about 9 years, after I attempted to improve the sound but accidentally desoldered the sound chip (I thought it was a misc unused chip and needed the space for the other upgrades it holds within).  Enter the JLAB USB speaker, which needs no driver disks, just plugs in and starts making sound.

Can you say AWESOME!?

I also had to remove a nice 9mm 3GB laptop hard drive from within it and replace it with a 13mm 32GB I had loaded previously but had removed to use as a backup drive during the final days of my dissertation.  The 13mm doesn't quite fit as nicely - the additional pressure on the chassis is noticeable, which mattered back when I was being careful but now that the hack is just for lulz who cares?

Oh how I love my glorious I-Opener.  I believe it is uniquely configured and there is none like it in the world.  It has a 4x CD-Rom inserted inside the case, which I have not heard of or seen in any other (pictures with working CD-Roms on the net have them dangling outside).  Many internal components had to be moved or removed in order to make the space.  The RAM is upgraded from 32MB to the maximal 128MB, and the slow slow 180 MHz WinChip is upgraded to a 300 MHz K6-2 (not an easy hack!), which benches about 3x faster.  The massive passive (band name?) heat sink had to be removed and replaced with a lasagna fan cooler, and a giant resistor that was inserted near the RAM had to be moved to be adjacent to the lasagna fan to prevent overheating and crashes.  The onboard storage is 32GB, upgraded from the 16MB flash chip.  The original keyboard was hacked to replace the joystick mouse with a real mouse that connects by wire to the keyboard, and uses the single keyboard plug to supply both mouse and keyboard PS2 connections.

I recall flashing the bios by hacking the original I-Opener software to dial into a Windows PC I had set up on a second phone line as a dial-up server.

It is amazing that a company could get funding to sell $300 - $400 worth of hardware for $99, arguing to make up the difference in dial-up fees.  With hacked I-Openers there were no dial-up fees, and Netpliance popped like so many bubbles :-(

Sad?  Yes.  Hell of a hack?  Damn straight.

Tuesday, October 11, 2011

Newegg starts "First from Asia" program

If you've ever looked to build something new from commodity parts you've discovered that Alibaba has a million different products but the prices are usually not listed.  In fact you're lucky to get contact info for the Asian factories that create the products, and the minimum orders are usually thousands of units.

Enter Newegg, whose new First from Asia program means fast-shipping products straight from Asia in single unit quantities.  Oh, and instead of taking a leap of faith, you get real honest to goodness user reviews to peruse.  Giving it a quick once over indicates that the USB banana is a hot item (edit: now sold out!).

I have previously looked into buying the parts for a mobile robot from Asia, and in fact made some purchases with good results in quality, price, and timeliness.  It will be interesting to see how Newegg's inventory fills out, and whether its legendary service and speediness extend one continent to the left.

Monday, October 10, 2011

Good enough computing comes to SyFy

I started watching Warehouse 13 on Netflix this weekend and I must say I am impressed.  The special effects are not the best I've seen... nor are the actors and acting.  With regular levels of suspension of disbelief, however, the show is fun, interesting, and most of all has a conspiracy theorist vibe that is quirky and avoids taking itself too seriously (see the strange neutralizing "goo": when someone asks "what's it do?", the response is "I don't know", heh).

If you've watched any of the too-many-to-count original SyFy movies you will see that they are so cheesy they are hard to watch without crackers (please point me to the exceptions and I will happily amend this post to include them).  Usually there is some monster crocodile, dinosaur, or swamp monster killing people (usually beautiful ones), a menace that humans have brought upon themselves through meddling with the environment, science, or other.  The stories are not intoxicating but I will admit they are fun to watch while intoxicated.

Through much trial and error, SyFy has discovered a few actors that are easy on the eyes and sufficiently believable.  SyFy has then spread these actors across multiple series, thereby scaling their value within the SyFy franchise.  For people that love a genre that is large-but-still-niche, greatness is not really expected (Firefly was an indulgent luxury, not a need, for SciFi lovers).  But it can't be a joke - i.e. laughter from wives or husbands that are not so in love with the genre still stings - Science Fiction junkies have absorbed enough ridicule to know how to avoid it.

The SyFy channel is now benefiting from the effect of "Good Enough Computing" - that is, regular amounts of computing resources now make special effects that are "good enough" to be believed (under regular levels of suspension of disbelief :).  Netflix seems perfectly willing to pay to stream shows from SyFy, and if these shows are good enough to watch, then Netflix gets ever closer to having enough content that is good enough to disconnect cable (thereby creating a high value proposition through consumer savings).

It used to be that SyFy's low level of resources was only enough to create lame movies, but through acquisition of actors and the improvement of technology they are leaving the lameness behind.

Friday, October 7, 2011

Zet soft core running Windows 3.0

Sending a shout out to the Zet team.  Congrats on supporting DE0 and DE2 Altera FPGAs.  With the DE0 Nano the price of running a soft x86 system in FPGA is reduced to just $59!  That brings the cost of entry from the funded research realm well into the hobbyist and enthusiast user space.  Well done!

The capabilities of FPGAs are indeed just now beginning to overcome the threshold of x86 processor complexity.  Exciting things to come!

Thursday, October 6, 2011

Farewell Steve Jobs, 1955 - 2011

He stayed hungry, he stayed foolish.  From "Put on your Sunday clothes":

Out there
There's a world outside of Yonkers
Way out there beyond this hick town, Barnaby
There's a slick town, Barnaby
Out there
Full of shine and full of sparkle
Close your eyes and see it glisten, Barnaby
Listen, Barnaby...
Put on your Sunday clothes,
There's lots of world out there

This song was featured in the movie Wall-E, by Pixar, one of the multiple companies Steve founded that expanded the territory of imagination.

Wednesday, October 5, 2011

What are computers for?

When my father brought home our first computer, the Atari ST, my mind ran through myriad things I imagined it could do.  I understood that it was different from a game system, even though it could play games - it was supposed to be able to do things that a video game system couldn’t do.  The next question was “What is this for?”  I learned to program BASIC on it, drew dinosaurs with the mouse, and typed homework, but still felt I hadn’t figured it out.  What is supposed to be done on computers?

The answer came later when my Dad installed America Online on a brand new Pentium 133 MHz desktop computer.  I searched the web, wrote emails to my friend who was spending a semester abroad, and chatted in chat rooms.  The light switch went on and I realized that computers are for the Internet.  More accurately, computers are for communicating, whether over the Internet, cell phone networks, undersea sonar, or satellite radio.

It is striking that the first instructions a computer runs on boot up belong to the Basic Input Output System (BIOS), which enables the computer to communicate with the network, memory, disk, monitor, keyboard, etc.  No apps will run during this process; in fact the operating system won’t even start until the BIOS finishes loading.

Thursday, September 29, 2011

Sun Sparc T4

Something about Oracle Sparc T4 doesn't roll off the tongue right so we shall stage a protest - long live Sun!

<steps off soap box, onto a smaller soap box>

The Sparc T4 was recently released, much to the delight of Sun, err, Oracle hardware users.  Interestingly, it has only half as many cores as the T3 (8 instead of 16), but reaches almost twice the clock rate (3 GHz instead of 1.67 GHz) - all with the same number of threads per core (8).  With an upgrade from in-order to out-of-order execution (OOO, heh), and improved branch prediction the T4 is clearly targeted at single threaded performance.
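A back-of-the-envelope comparison of those numbers, as a sketch only: it uses cores times clock as a crude stand-in for aggregate throughput, which ignores the per-clock gains from out-of-order execution and better branch prediction.  The helper name is hypothetical.

```python
def chip_summary(cores, threads_per_core, clock_ghz):
    """Crude totals for one chip; cores * clock is only a rough throughput proxy."""
    return {
        "hardware_threads": cores * threads_per_core,
        "aggregate_core_ghz": cores * clock_ghz,
    }

t3 = chip_summary(cores=16, threads_per_core=8, clock_ghz=1.67)
t4 = chip_summary(cores=8, threads_per_core=8, clock_ghz=3.0)
clock_ratio = 3.0 / 1.67

print("T3:", t3)
print("T4:", t4)
print(f"T4 clock is {clock_ratio:.2f}x the T3 clock")
```

By this rough measure the two chips offer similar aggregate clock cycles, but the T4 concentrates them in half as many hardware threads, which fits the single threaded focus.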

In the age of scaling applications into the cloud it is still an open question whether the "sort of parallel" applications that would improve when moving from T3 to T4 are the future of the server market.  Certainly users that already depend upon sort-of-parallel applications, and need binary compatibility, will want to upgrade.  But the NoSQL, MapReduce, and memcached movements are new versions of old programming tools with the non-scalable pieces removed.  They are used by newer tech companies that use their scale for competitive advantage (Google, Yahoo, Facebook) - which is another way of saying that they are the future.

On a side note, it is interesting that ArsTechnica no longer has the tried and true voice of Jon Stokes analyzing the latest processors (btw, if you want an insanely good book on computer architecture, get his).  Fingers crossed that he's just on vacation.  In either case, we at DailyCircuitry wish him the best of luck.

Wednesday, September 28, 2011

Larrabee 3 rumor

The list of things that are bigger in Texas just got, err, bigger.  Get out of the way Ranger, here comes a Stampede!  No longer content with "just" half a petaflop, University of Texas has upped the ante, unveiling their plans for a 10 Petaflop supercomputer.  Interestingly, one fifth of the GFlops (an often misleading benchmark, but how could that matter?) will be provided by the dual processors (8C16T each) powering each server, while the other 80% will be provided by Knights Corner (aka Larrabee 3) accelerator cards.  Until now, a select few researchers have had access to the predecessor of LRB3, known as Knights Ferry (aka Larrabee 2).  Knights Ferry is interesting in that it supports only single precision floating point, which is typical of graphics processors but atypical of the new GPGPU cards sold by Nvidia with which it will compete.

Love him or hate him, Charlie over at SemiAccurate is at it again with a very interesting tidbit.  LRB3 is rumored to be binary incompatible with LRB2 (if you have a better reference than Charlie, please send it to me).  When not hating on Nvidia Charlie tends to be more accurate than "semi" would suggest, so this is an interesting development (well... he's also pretty accurate when hating, heh).  In the ensuing discussion, Exophase makes a good point that binary compatibility doesn't really matter for highly parallel software.  I tend to agree with this, since discussions between Cognitive Electronics and many different developers have also indicated that to be true.

Won't it be great when we can access a variety of different architectures in the cloud?  Benchmarking for pennies.  Can't wait!

Tuesday, September 27, 2011

Google Circles vs Facebook

When I first joined Facebook I was taken aback by how incredibly impertinent the news in the news feed was.  I wrote a few long private messages to friends to try to catch up, but vowed to log on very infrequently for fear I would become addicted to it like some people I've heard about (I am easily addicted to video games, as my Diablo II hardcore sorceress's #1 USWEST ladder rank can attest, or could have in 2001, heh).

I had accrued some principled reasons for being anti-Facebook.  Of course there are the privacy issues, but I especially disliked how an account can be revoked and there is very little one can do about it.  I also don't like how Chinese activists have had difficulty using names of their own choosing.  What I wanted most was a way of controlling what subset of friends would see certain posts.

I had told myself I would switch to whatever social network Google would come up with (since Buzz and Orkut did not seriously compete with Facebook).  Google+ has now hit the scene and I have joined, read posts, and posted on it.

What I am most surprised by is how much I detest making the choice of who should see my post.  Is this post for friends, family, professional colleagues, or people that are only in a combination of those, or only a subset of those?  This is too much thinking, too much pressure for a post about how nice the weather in Palo Alto is (very).  Or how much better the Mexican food is (much much).  Or how much I enjoyed playing Red Dead Redemption with my uncle during my last visit to Southern California (a lot).

I find myself using Facebook much more now.  I think Google+ has taught me that I do not want to think all that much about my social network postings.  I would rather spend 10 seconds writing a dumb musing instead of 2 minutes.  In fact, after spending 2 minutes thinking about it, I generally would just abandon the idea of posting... or log on to Facebook to check if there are any posts I might troll, err, comment on.  If other people are like me then that does not spell good news for Google.

This reminds me of the "affordances" taught in the human-computer-interaction course I took in college.  Every item has a use, or affordance.  To kids, trees are "for" climbing, stones are for throwing, and soccer balls are for kicking.  It was funny that this notion also extended to more nefarious purposes.  For example, the behavior of hooligans could be controlled by the material that covers an opening in a building.  The hooligans thought that windows were for breaking, and boards were for graffiti, even though windows can be graffitied and boards can be broken.

In my mind Facebook is for posting musings, typically aimless but sometimes with a little bit of a point.  I can also see the importance of LinkedIn - people there would not be surprised to see a message that a contact is or plans to be out of work and is looking for something interesting.  LinkedIn is for work.

Google+ is trying to be whatever you want it to be at the moment.  It can be used for both Facebook and LinkedIn types of posts.  If we do end up needing a toolbox to hold all of our tools, then Google+ has a bright future.  But we might only need a couple.  In support of this latter position, I'll finish by quoting Clint Eastwood from Gran Torino:

"Take these three items, some WD-40, a vice grip, and a roll of duct tape. Any man worth his salt can fix almost any problem with this stuff alone."