Wednesday, November 10, 2010

Where MapReduce fault tolerance comes from

Google's MapReduce is a great programming paradigm.  It lets you take data parallelism for granted: it runs on however many processors are available, and it keeps running even if some of the machines crash, get unplugged, or catch fire.

MapReduce tolerates faults for an interesting reason.  During the Map phase, the Master Controller (Google's paper just calls it "the master", but this label is better anyway :) assigns input data chunks to each map worker node, and the death of a map worker just means the Master Controller has to reassign that worker's Map inputs to a different node.
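
To make that concrete, here is a minimal sketch of the reassignment bookkeeping.  The names (MasterController, Task, assign_next) are my own invention for illustration, not anything from Google's implementation.

```python
# A minimal sketch of the bookkeeping described above -- not Google's code.
# The class and field names (MasterController, Task, ...) are invented here.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Task:
    task_id: int
    input_chunk: str            # e.g. one split of the input data
    state: str = "idle"         # idle -> in_progress -> completed
    worker: Optional[str] = None

class MasterController:
    def __init__(self, tasks):
        self.tasks = tasks

    def assign_next(self, worker_id):
        """Hand the next idle map task to a worker asking for work."""
        for t in self.tasks:
            if t.state == "idle":
                t.state, t.worker = "in_progress", worker_id
                return t
        return None

    def on_worker_death(self, worker_id):
        """A dead map worker's unfinished tasks simply go back in the pool."""
        for t in self.tasks:
            if t.worker == worker_id and t.state != "completed":
                t.state, t.worker = "idle", None
```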

Here's the interesting part: each mapper caches the output it produces during the map phase, partitioned by which reducer node will receive it as input.  If a reducer node crashes, gets unplugged, or gets too cold, the Master Controller simply points a different reducer node at those cached mapper outputs to finish up the missing processing.
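
Roughly, the per-reducer partitioning might look like the sketch below.  The hash(key) mod R scheme is the standard approach; the function names and the word-count example are mine.

```python
# A sketch of per-reducer partitioning -- names and details are assumptions.
# Implementations typically use a deterministic hash(key) mod R scheme.
import zlib

NUM_REDUCERS = 4

def partition(key, num_reducers=NUM_REDUCERS):
    # Deterministic hash so every mapper routes a given key to the same reducer.
    return zlib.crc32(key.encode()) % num_reducers

def run_map(map_fn, records):
    # One bucket per reducer; these buckets are the "mapper caches" that a
    # replacement reducer can re-read if the original reducer dies.
    buckets = [[] for _ in range(NUM_REDUCERS)]
    for record in records:
        for key, value in map_fn(record):
            buckets[partition(key)].append((key, value))
    return buckets

# Example: the map half of word count.
word_count_map = lambda line: [(word, 1) for word in line.split()]
print(run_map(word_count_map, ["to be or not to be"]))
```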

It works this way because the clever MapReduce designers realized they could buffer the map results in *memory* until a sufficiently large chunk could be written to disk as one efficient sequential write.  Caching to the hard drive never became a performance bottleneck because sequential writes and sequential reads are faster than the Gigabit Ethernet link connecting each server to the rest of the cluster.
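
A sketch of that buffer-then-spill idea is below; the 64 MB threshold and the tab-separated file format are assumptions of mine, not Google's actual layout.

```python
# Buffer key/value pairs in memory, then flush them as one sequential write.
# The flush threshold and on-disk format here are illustrative assumptions.
class SpillBuffer:
    def __init__(self, path, flush_bytes=64 * 1024 * 1024):
        self.path = path
        self.flush_bytes = flush_bytes
        self.pending = []
        self.pending_size = 0

    def add(self, key, value):
        line = f"{key}\t{value}\n".encode()
        self.pending.append(line)
        self.pending_size += len(line)
        if self.pending_size >= self.flush_bytes:
            self.flush()

    def flush(self):
        # One large append is one sequential write, which a spinning disk
        # handles far better than many small random writes.
        if self.pending:
            with open(self.path, "ab") as f:
                f.write(b"".join(self.pending))
            self.pending, self.pending_size = [], 0
```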

This should have caught up with Google by now, because 10GigE is getting cheap and hard drives haven't gotten any faster.  But as history has shown, it's not a good idea to bet against Google.  SSDs arrived just in time to save the day, and Google will be able to transition to 10GigE just fine by coupling a few of them with each server (SATA 3 reaches 6 Gbps per drive, so 3-4 SSDs or so can saturate a 10GigE link).  Google's main concern about servers is power consumption, and SSDs sip power relative to the other big power-drawing PC components.  For I/O-bound MapReduce tasks, 10GigE + SSD means life is good.
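
The back-of-envelope math behind that claim, using rough throughput figures that are my own assumptions rather than measurements:

```python
# Back-of-envelope arithmetic with rough, assumed 2010-era throughput numbers;
# real drives and links vary, so treat these as order-of-magnitude only.
GIG_E_MB_S   = 1_000 / 8       # ~125 MB/s over Gigabit Ethernet
TEN_GIG_MB_S = 10_000 / 8      # ~1250 MB/s over 10GigE
HDD_SEQ_MB_S = 100             # sequential throughput of one spinning disk
SSD_SEQ_MB_S = 350             # a reasonable SATA 3 SSD

print(f"Disks to fill 1GigE:  {GIG_E_MB_S / HDD_SEQ_MB_S:.1f}")    # ~1.3
print(f"Disks to fill 10GigE: {TEN_GIG_MB_S / HDD_SEQ_MB_S:.1f}")  # ~12.5
print(f"SSDs to fill 10GigE:  {TEN_GIG_MB_S / SSD_SEQ_MB_S:.1f}")  # ~3.6
```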
