I was researching the Sun Niagara III, aka the SPARC T3, and came across an interesting aspect: there is no L3 cache. Each core gets an L2 cache in the increasingly popular range near 256KB (Tilera, Intel Nehalem, Sandy Bridge), though slightly larger at 384KB. This is coupled with L1 caches that are surprisingly small relative to the eight threads supported per core: an 8KB L1 data cache and a 16KB L1 instruction cache. That works out to 1KB and 2KB per thread respectively when the memory isn't usefully shared, though realistically all of the instruction cache can be shared among a core's eight threads. Furthermore, with eight threads per core and a clock topping out at just 1.65GHz, it is realistic for the local L2 cache to have a latency under 8 cycles, so there is close to no penalty for hitting the L2 fairly frequently. This suggests the L1 data cache is there mainly to reduce the number of accesses to L2, freeing up L2 bandwidth and saving power by replacing higher-power L2 accesses with lower-power L1 accesses.
Although both lack an L3 and use distributed, similarly sized L2 caches, there are some interesting architectural divergences between the Niagara 3 and Tilera's Gx-100, due for release some time in 2011. Tilera doesn't multithread, so its in-order cores will take a performance hit whenever they hit memory, even in the L1 cache. This suggests that programmers will need to use asynchronous memory transfers, or that the Tilera tools will somehow insert them automatically (with the usual caveat of automated program-analysis tools: they sometimes work, and sometimes don't). In contrast, Niagara 3's hardware multithreading is naturally tolerant of the latency of accessing memory belonging to other cores on the same chip.
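To illustrate the kind of explicit latency hiding I mean, here is a minimal sketch using GCC's generic `__builtin_prefetch` builtin rather than any Tilera-specific DMA or prefetch API (which I haven't seen); the prefetch distance is a tuning assumption, not a vendor-recommended value.

```c
#include <stddef.h>

/* Sum an array while prefetching a fixed distance ahead, so the
 * memory latency overlaps with useful work instead of stalling an
 * in-order core on every miss.  16 elements ahead is an arbitrary
 * illustrative distance; real tuning depends on the memory latency. */
#define PREFETCH_DIST 16

long sum_with_prefetch(const long *a, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DIST < n)
            __builtin_prefetch(&a[i + PREFETCH_DIST], /* read */ 0, /* low temporal locality */ 0);
        total += a[i];
    }
    return total;
}
```

The point is that on a non-multithreaded in-order core this kind of scheduling burden falls on the programmer or the compiler, whereas Niagara 3 simply switches to another hardware thread while the load is in flight.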
It really is an open question whether tiled routing like Tilera's will take off; Intel's Single-chip Cloud Computer research processor also used the method, as did their Terascale research chip. In contrast, the mainstream Sandy Bridge and Cell processors use ring buses, making the ring a relatively proven architecture, though without modification cross-chip latency scales linearly with the number of cores rather than with its square root. This linear scaling has less impact right now because, in the existing ring-bus examples, the caches themselves have latency similar to or greater than the latency of passing data from core to core.
AMD and Intel continue forward with their large L3 caches, with Intel having transitioned to a ring-bus-style L3 between Nehalem and Sandy Bridge, which reportedly had a favorable impact on L3 latency (though that seems somewhat counterintuitive). The IBM BlueGene/P uses crossbar access to an 8MB L3 cache with a relatively high latency of 35 cycles (that is a really great review of the architecture, btw), but this is not unusual, since a crossbar is a standard approach for four cores or fewer.
So the jury is out not just on whether giant high-latency L3 caches will continue to prevail, but also on whether the largest on-chip cache will use a ring bus, tiled mesh, hierarchical, or other topology as chips progress to ever larger core counts.