Until the HPC Cluster instance debuted back in July, there was a lot to complain about regarding CPU performance per dollar in the Amazon EC2 cloud. Single-threaded performance of their next-best offering, the so-called "High CPU" instance, is less than a tenth that of a modern desktop PC (2.5/8 = 0.3125 vs. 33.5/8 = 4.1875 EC2 compute units for a 2.93 GHz Nehalem). Indeed, it is well known that Amazon slices its real cores into many virtual cores, each carrying only a fraction of the computing resources. This was the norm until the HPC Cluster instance, which is the first to provide a 1:1 real-to-virtual core ratio.
Without special arrangements, only 8 HPC Cluster instances can be recruited, at a cost of $1.60 each per hour, or $12.80 for all 8. The theoretical maximum double-precision throughput (an imperfect and often misleading metric, but adequate for how we use it here) is 93.76 GFLOPS/instance * 8 instances = 750 GFLOPS (only about half this rate was achieved in Amazon's poster benchmark, but we will give the benefit of the doubt). An hour of processing (the smallest unit that can be purchased) therefore delivers 750 GFLOPS * 3600 seconds = 2.7 petaFLOP, i.e. 2.7 quadrillion floating-point operations.
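The arithmetic is worth laying out explicitly. A quick sketch (all figures taken from the discussion above):

```python
# Back-of-the-envelope check of the EC2 HPC Cluster figures above.
INSTANCES = 8                # max instances without special arrangements
PRICE_PER_HOUR = 1.60        # USD per instance-hour
GFLOPS_PER_INSTANCE = 93.76  # theoretical peak double-precision GFLOPS

aggregate_gflops = GFLOPS_PER_INSTANCE * INSTANCES  # ~750 GFLOPS
hourly_cost = PRICE_PER_HOUR * INSTANCES            # $12.80 for all 8
flop_per_hour = aggregate_gflops * 1e9 * 3600       # ~2.7e15 FLOP

print(f"aggregate: {aggregate_gflops:.0f} GFLOPS")
print(f"hourly cost: ${hourly_cost:.2f}")
print(f"work per purchased hour: {flop_per_hour:.2e} FLOP")
```

At roughly 2.7e15 floating-point operations for $12.80, that works out to about 210 trillion theoretical-peak operations per dollar.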
Some applications can scale performance on N processors as O(N), meaning linear scaling minus some overhead that does not grow out of proportion. "Embarrassingly parallel" algorithms are good examples, such as Monte Carlo methods and algorithms that process large amounts of independent data, like web search.
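To make the "embarrassingly parallel" idea concrete, here is a minimal sketch (my own illustration, not from any EC2 benchmark): Monte Carlo estimation of pi, where each worker draws random points independently and the only coordination is summing per-worker hit counts at the end, so throughput scales almost linearly with worker count.

```python
# Embarrassingly parallel sketch: Monte Carlo estimation of pi.
# Workers share no state, so adding workers adds throughput nearly linearly.
import random
from multiprocessing import Pool

def count_hits(n_samples):
    # Count random points in the unit square that land inside the quarter circle.
    rng = random.Random()
    hits = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

def estimate_pi(total_samples, workers=4):
    per_worker = total_samples // workers
    with Pool(workers) as pool:
        hits = sum(pool.map(count_hits, [per_worker] * workers))
    return 4.0 * hits / (per_worker * workers)

if __name__ == "__main__":
    print(estimate_pi(400_000))  # close to 3.14159
```

Web search fits the same mold: each shard of the index can be queried independently, with only a cheap merge step at the end.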
Suppose a scalable algorithm that bottlenecks on the SIMD double-precision FPU takes an hour to complete a task on Amazon's 8 available HPC instances (any faster, and performance per dollar drops due to the 1-hour minimum). Ignoring initialization time (which we will discuss in a future posting, and which today is not charged to EC2 users), scalable algorithms can do massive amounts of work in almost no time by recruiting tons of hardware for very brief periods. In this example, if 28,800 instances were available and there were no 1-hour minimum, the task would finish in about 1 second for the same price as the 1-hour scenario, utilizing 2.7 petaFLOPS (quadrillion floating-point operations per second). At the current cost per compute-second, err.. compute-hour, the total cost would be $12.80.
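The equivalence holds because both scenarios buy the same 28,800 instance-seconds; a quick sketch of the trade (assuming hypothetical per-second billing in place of the real 1-hour minimum):

```python
# Trading instance count against wall-clock time at constant cost.
# Assumes per-second billing, unlike EC2's actual 1-hour minimum.
PRICE_PER_HOUR = 1.60        # USD per instance-hour
GFLOPS_PER_INSTANCE = 93.76  # theoretical peak double-precision GFLOPS

def cost_and_flop(instances, seconds):
    cost = instances * PRICE_PER_HOUR * seconds / 3600
    flop = instances * GFLOPS_PER_INSTANCE * 1e9 * seconds
    return cost, flop

hour_cost, hour_flop = cost_and_flop(instances=8, seconds=3600)
burst_cost, burst_flop = cost_and_flop(instances=28_800, seconds=1)

print(f"8 instances x 1 hour:   ${hour_cost:.2f}, {hour_flop:.2e} FLOP")
print(f"28,800 instances x 1 s: ${burst_cost:.2f}, {burst_flop:.2e} FLOP")
```

Either way the bill is $12.80 and the total work is about 2.7e15 operations; only the wall-clock time changes.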
Conventional commodity-server-based systems will probably never be capable of delivering this type of performance because of initialization time (currently 5-20 minutes on EC2, depending on OS), but it is easy to envision custom cloud architectures that would confer nearly instant execution on scalable algorithms.