Modern computer designs generally include processor caches in order to reduce the apparent delay of memory read/write operations.
The simplest kind of cache stores individual words, and is effective for applications that compute for a relatively long time over a small "working set" of memory locations. In that case, most memory accesses will be satisfied immediately by the cache.
However, not all applications have small working sets. Many compute-intensive applications, in particular, have inner loops that sequentially scan a set of consecutive memory locations. For these applications, the basic word-by-word cache is not effective. For that reason, most machines have "wide" caches, where the data is managed in relatively large chunks ("lines"). Thus accessing one word of memory will normally cause W consecutive words to be brought into the cache. If memory accesses are indeed sequential, a wide cache has the effect of predicting and anticipating memory accesses, so that the processor does not have to stop for them.
On the other hand, for applications that access memory in random order, or sequentially but with large strides, a cache will not be of much help, and will in fact multiply the memory-processor traffic by the cache width W. This extra traffic can easily slow down the system by an order of magnitude.
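As a concrete (if artificial) illustration, the small C program below performs the same set of additions twice, once with unit stride and once with a large stride. The array size and the stride are round numbers picked for the sketch, not parameters of the benchmark described later; on a machine with W-word cache lines, the second loop should generate roughly W times the memory traffic of the first, for exactly the same arithmetic.

    #include <stdio.h>

    #define N (1L << 22)    /* 4M doubles (32 MB): larger than the caches */
    #define STRIDE 1024     /* big stride: every access hits a new line   */

    static double a[N];

    int main(void) {
      double s;
      long i, j;

      /* Unit-stride scan: one cache-line fetch serves W consecutive     */
      /* accesses, so roughly N/W lines cross the memory network.        */
      s = 0.0;
      for (i = 0; i < N; i++) s += a[i];
      printf("sequential sum = %g\n", s);

      /* The same N accesses in stride-1024 order: by the time the scan  */
      /* returns to the neighborhood of a line, it has been evicted, so  */
      /* nearly every access costs a full W-word line transfer.          */
      s = 0.0;
      for (j = 0; j < STRIDE; j++)
        for (i = j; i < N; i += STRIDE) s += a[i];
      printf("strided sum    = %g\n", s);

      return 0;
    }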
A multiprocessor machine can be idealized as a set of N processors connected to a set of M memory modules by a memory network. The latter can be a simple or multiple bus, a crossbar switch, a high-speed packet network, etc. We consider here only shared-memory multiprocessor (SMMP) machines, where any processor can directly access data in any memory module, without software intervention.
SMMP machines generally have dedicated caches attached to each processor, for the same reasons that monoprocessors have them. For applications with small working sets, the caches have the added benefit of reducing contention for the memory network among the processors. For those applications, the cache allows the ideal speedup N to be achieved, for fairly large N, even when the memory network is a simple bus.
For applications that scan large amounts of memory sequentially, contention for the memory network can easily become a bottleneck. Caches should be of little help in this case, since they will not reduce the total amount of data transferred across the network. Wide caches may improve things slightly, however, by reducing the number of separate memory transactions.
For such applications, the maximum number of processors Nmax that can be usefully employed is roughly c·m, where c is the average number of computation clock cycles performed by the application for each sequential memory read or write; and m is the maximum number of words that can be transferred across the memory network per clock cycle. For these applications, therefore, the design of the memory network is critical, and single-bus designs start with a disadvantage---at least in theory.
For applications that access data randomly, or in large strides, wide caches may have a doubly disastrous effect on SMMP performance. As in monoprocessor machines, the actual memory traffic for each processor will be multiplied by W, which may impact performance even when N=1. Moreover, the total traffic across the memory network will be multiplied by W too; so the maximum achievable parallelism Nmax is c·m/W, which can easily be less than 1. In particular, on a single-bus machine, if a single processor is memory-bound when executing a given application, then activating additional processors will not make the computation faster.
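To put illustrative numbers on these formulas (the figures below are made up for the sake of the example, not measured on any of the machines under evaluation): suppose an application performs c = 10 computation cycles per word of memory traffic, the network can move m = 2 words per cycle, and the cache lines hold W = 16 words. A sequential application could then keep up to Nmax = c·m = 20 processors busy, while a random-access version of the same application saturates the network at c·m/W = 20/16 = 1.25 processors; that is, a second processor would bring essentially no benefit.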
Multi-threaded applications with scattered working sets may be penalized further by cache-line contention: if two or more processors make repeated writes to different words that happen to belong to the same cache line, then the whole line will have to be shuffled many times between the caches and the main memory. This shuffling will further increase the amount of data that has to be transferred across the memory network. If the network is already saturated (or nearly so) by a single processor, using more processors will in fact make the application take longer to finish.
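A minimal C sketch of the phenomenon (hypothetical code, not part of the benchmark): two threads repeatedly update different words that almost certainly share a cache line, so each update by one processor forces the line to move out of the other processor's cache.

    #include <pthread.h>

    /* Two adjacent counters: 16 bytes, almost certainly in one cache   */
    /* line.  Padding each counter out to a full line (e.g. 64 or 128   */
    /* bytes) would eliminate the contention.                           */
    static long counter[2];

    static void *worker(void *arg) {
      long id = (long)arg;                  /* 0 or 1 */
      long k;
      for (k = 0; k < 100000000L; k++)
        counter[id]++;                      /* write to the shared line */
      return NULL;
    }

    int main(void) {
      pthread_t t0, t1;
      pthread_create(&t0, NULL, worker, (void *)0L);
      pthread_create(&t1, NULL, worker, (void *)1L);
      pthread_join(t0, NULL);
      pthread_join(t1, NULL);
      return 0;
    }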
To get a sense of the magnitude of these effects, we prepared a special-purpose benchmark program, which we tested on SMMPs proposed by three vendors that were competing for the Fominhas grant.
The program allocates a large array with R rows and C columns, and applies a certain linear algebra operation (let's call it a "reflection step") K times to each of P pairs of rows, in somewhat random order. Each reflection step scans the two rows in parallel, performing C "elementary operations"---each consisting of a couple of read/write, arithmetic, and pointer increment operations on their elements.
The parameter C is specified by the user; then R is computed so that the array fills the available memory; K is chosen so that the number of elementary operations applied to each pair of rows, K·C, is comfortably large (about 2,000,000 in these tests); and, finally, the number of pairs P is chosen so that the total number of elementary operations, P·K·C, is about the same in all tests (about 300,000,000). Another user option determines whether the array is laid out in memory row-by-row, as in C, or column-by-column, as in FORTRAN.
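The inner loop of the reflection step is not reproduced in this report, but it can be pictured roughly as below; the coefficients ca, cb and the exact update rule are invented for the illustration, and only the access pattern matters: each iteration reads and writes one element of each row and advances two pointers.

    /* Apply one "reflection step" to rows xi and xj, each with C       */
    /* elements spaced `stride` words apart (stride = 1 for row-by-row  */
    /* storage, stride = R for column-by-column storage).  Sketch only: */
    /* the benchmark's actual coefficients and formula may differ.      */
    static void reflect(double *xi, double *xj, long C, long stride,
                        double ca, double cb) {
      long k;
      for (k = 0; k < C; k++, xi += stride, xj += stride) {
        double u = *xi, v = *xj;      /* two reads                      */
        *xi = ca*u + cb*v;            /* arithmetic and one write       */
        *xj = cb*u - ca*v;            /* arithmetic and one write       */
      }
    }

With row-by-row storage the two rows are scanned with unit stride (the friendly pattern); with column-by-column storage consecutive elements of a row lie R words apart, which is precisely the large-stride, scattered pattern discussed earlier.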
The actual computations are carried out by T concurrent threads. Each thread repeatedly grabs two row indices i,j from a shared queue, applies the reflection step K times to rows i and j, and returns the indices to the queue. If the work per pair K·C is reasonably large (as it was in the experiments reported here), the thread synchronization overhead is negligible, and the computation is dominated by the inner loop of the reflection step.
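In outline, the body of each worker thread might look like the sketch below. The helper routines queue_get, queue_put, and row are placeholders standing in for the benchmark's actual queue and array code, and the use of a POSIX-thread-style worker function is our assumption; reflect is the sketch given above.

    /* Placeholders for the benchmark's shared queue and array layout.  */
    extern void queue_get(long *i, long *j);  /* take two free row      */
                                              /* indices; i = -1 when   */
                                              /* there is no more work  */
    extern void queue_put(long i, long j);    /* give the indices back  */
    extern double *row(long i);               /* address of row i       */
    extern long C, K, stride;

    /* Body of each of the T worker threads.                            */
    static void *thread_main(void *arg) {
      (void)arg;
      for (;;) {
        long i, j, k;
        queue_get(&i, &j);            /* grab a pair of row indices     */
        if (i < 0) break;             /* no pairs left: thread exits    */
        for (k = 0; k < K; k++)       /* K reflection steps on the pair */
          reflect(row(i), row(j), C, stride, 0.6, 0.8);
        queue_put(i, j);              /* return the indices to the queue */
      }
      return NULL;
    }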
The "working set" of each thread, for most of the computation, is therefore a pair of rows, each of them consisting of C words; this working set is scanned K times before the thread shifts to another pair of rows (usually disjoint from the first one).
By varying the row size C and the array storage order, this program can probe the machine's performance in all the regimes discussed above. The main cases we tested were:
Compact/Small: array stored by row, C = 418 (working set = 6.7 KB).
Compact/Large: array stored by row, C = 10,000 (working set = 160 KB).
Scattered/Small: array stored by column, C = 418 (working set = 6.7 KB).
Scattered/Large: array stored by column, C = 10,000 (working set = 160 KB).
The working set sizes above ignore the effect of cache lumping. Thus, for the scattered tests, one should multiply those numbers by the cache width W, in order to get the effective size of the working set.
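As an illustration (the value of W here is an assumption, not a measured property of the machines tested): with lines of W = 16 words, each word touched in a scattered test drags a whole 16-word line into the cache, so the effective working set of the Scattered/Small case is roughly 16 × 6.7 KB = 107 KB, and that of the Scattered/Large case is roughly 16 × 160 KB = 2.5 MB, comparable to the several MB of off-chip cache mentioned in the next paragraph.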
(In retrospect, the "large" working set size may have been too small. Given that the machines tested have several MB of on-board, off-chip cache, the Compact/Large test would have been more informative if C were 300,000 or more. Unfortunately, the whole benchmark was already quite expensive to run, and we did not want to abuse the vendors' good will by asking them to run these additional tests.)
It should be emphasized that the total number of arithmetic, read/write, and pointer increment operations was very nearly the same in all tests. In particular, the total amount of data actually read (and written back) by the CPUs in each test is 2K·P·C = 600,000,000 words = 4.8 GB. (This number does not include access to the loop index and pointer variables, which should be mapped to registers by the compiler, or reside permanently in the cache.) Thus, the (huge!) differences in running time observed between the four cases above are basically due to differences in the performance of the memory network and cache mechanism.
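To spell out the arithmetic behind that figure: each elementary operation reads (and later writes back) one word of each of the two rows, so the data read is 2 × P·K·C = 2 × 300,000,000 = 600,000,000 words; at 8 bytes per word, as the 4.8 GB total implies, that is 600,000,000 × 8 bytes = 4.8 GB, with the same amount written back.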
The tests above were repeated using various numbers of active threads. The row selection policy was such that two threads running in parallel would often pick adjacent rows to work on. Therefore, in tests with scattered (by-column) layout, the two threads would often want to modify words in the same cache line, incurring the cache-line contention penalty discussed above. This analysis probably explains why, on two of the machines, the scattered tests take longer with two processors than with a single one.
On the other hand, as the number of parallel threads T increases beyond two, their working sets will be spread out over more rows, and therefore over a wider set of cache lines, so that the contention for cache-lines will be lessened. This effect seems to be the likely explanation for the (slight) decrease in running time of the scattered tests as the number of threads is increased beyond 2. Note that the times continue to improve even after T exceeds the number of processors, so the speedup cannot be due to parallelism per se.
The most important information we got from this benchmark is the enormous effect of data layout on performance. We were expecting to see significant differences, but certainly not the factors of 50 to 100 that we actually got. In fact the relative difference in performance between the `friendly' and `nasty' tests was much greater on these modern machines than on our slightly older multiprocessors. It seems that the newer machines were further optimized for the `friendly' cases, while little was done to improve the `nasty' ones.
Furthermore we learned that a `nasty' access pattern will not only make the single-threaded application 50 times slower, but it will also negate the expected advantage of multiple processors, since the bottleneck then gets shifted to the memory network.
It was also disappointing to find that the cc-NUMA architecture did not perform visibly better here than the single-bus designs, in spite of its theoretical advantage. Presumably the difference becomes visible only when the number of processors is very large, and/or the memory access pattern is only slightly `nasty'.
Unfortunately, these surprises mean that the benchmark results are not entirely conclusive, for the purpose of selecting the Fominha machine. As a research and academic institution, we cannot predict which kinds of applications we will want to run on it, and in most cases we will not have the manpower resources to fine-tune the access patterns. Obviously it is difficult to choose the best among three machines, whose raw CPU speed varies by a factor of 2, when the expected running time of a typical application includes an unknown `memory nastiness' factor that may range between 1 and 100.
On the bright side, we confirmed that all three machines could achieve almost perfect speedup in the `friendly' tests. While the fastest CPU was almost twice as fast as the slowest one, the number of processors offered by each vendor (for basically the same price) was such that all three machines were able to do the main loop of the `most friendly' test in the same time (0.5 second).