
Cache/Memory Performance Benchmarks of
Shared-Memory Multiprocessor Machines

Introduction and nomenclature

Caches and memory access patterns

Modern computer designs generally include processor caches in order to reduce the apparent delay of memory read/write operations.

The simplest kind of cache stores individual words, and is effective for applications that compute for a relatively long time over a small "working set" of memory locations. In that case most memory accesses will be satisfied immediately by the cache.

However, not all applications have small working sets. Many compute-intensive applications, in particular, have inner loops that sequentially scan a set of consecutive memory locations. For these applications, the basic word-by-word cache is not effective. For that reason, most machines have "wide" caches, where the data is managed in relatively large chunks ("lines"). Thus accessing one word of memory will normally cause W consecutive words to be brought into the cache. If memory accesses are indeed sequential, a wide cache has the effect of predicting and anticipating memory accesses, so that the processor does not have to stop for them.

On the other hand, for applications that access memory in random order, or sequentially but with large strides, a cache will not be of much help, and will in fact multiply the memory-processor traffic by the cache width W. This extra traffic can easily slow down the system by an order of magnitude.
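As a concrete illustration (this fragment is ours, not part of the benchmark described below), the following C sketch contrasts the two access patterns; the array size and stride are arbitrary. Both functions perform exactly the same additions, but with a line width of W words the strided scan moves roughly W times as much data across the memory channel:

    /* Sequential vs. strided scans over the same data (illustrative). */
    #define N (1 << 22)      /* array length in doubles (arbitrary) */
    #define STRIDE 1024      /* chosen larger than any plausible line width W */

    static double a[N];

    double scan_sequential(void)
    { double s = 0.0;
      long i;
      for (i = 0; i < N; i++) s += a[i];   /* unit stride: each cache line
                                              is fetched once, used W times */
      return s;
    }

    double scan_strided(void)
    { double s = 0.0;
      long i, j;
      for (j = 0; j < STRIDE; j++)
        for (i = j; i < N; i += STRIDE)
          s += a[i];                       /* stride >= W: a whole line is
                                              fetched for every word used */
      return s;
    }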

Caches in multiprocessor machines

A shared-memory multiprocessor machine (SMMP) can be idealized as a set of N processors connected to a set of M memory modules by a memory channel. The latter can be a simple or multiple bus, a crossbar, a high-speed packet network, etc.

SMMPs generally have dedicated caches attached to each processor, for the same reasons that monoprocessors have them. For applications with small working sets, the caches have the added benefit of reducing contention for the memory channels among the processors. For those applications, the cache allows the ideal speedup N to be achieved, for fairly large N, even when the memory channel is a simple bus.

For applications that scan large amounts of memory sequentially, contention for the memory-processor channels can easily become a bottleneck. Caches should be of little help in this case, since they will not reduce the total amount of memory-processor traffic. Wide caches may be slightly beneficial, however, by reducing the number of separate memory transactions.

For such applications, the maximum number of processors Nmax that can be usefully employed is roughly c·m, where c is the average number of computation clock cycles performed for each sequential memory read or write, and m is the maximum number of words that can be transferred across the memory channel per clock cycle. For these applications, therefore, the design of the memory channel is critical, and single-bus designs start out with a distinct disadvantage.
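To make the formula concrete (the numbers here are invented for illustration): if each processor computes for c = 10 clock cycles per word read or written, and the channel can move m = 2 words per clock cycle, then Nmax ≈ 10·2 = 20; beyond twenty processors the channel saturates, and additional processors merely wait for memory.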

For applications that access data randomly or in large strides, wide caches may have a doubly disastrous effect on SMMP performance. As in monoprocessor machines, the actual memory traffic for each processor will be multiplied by W, which may hurt performance even when N = 1. Moreover, the total traffic across the memory channel will be multiplied by W too, so the maximum achievable parallelism is Nmax = c·m/W, which can easily be less than 1. In particular, on a single-bus machine, if a single processor is memory-bound when executing a given application, then activating additional processors can only make things worse.
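Continuing the invented numbers above: a cache width of W = 8 cuts the limit from Nmax = 20 down to 20/8 = 2.5; and on a slower bus moving only m = 0.5 words per cycle, Nmax = 10·0.5/8 ≈ 0.6, that is, even a single processor is starved by the memory channel.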

A simple benchmark program

To get a sense of these effects, we ran a special-purpose benchmark program on several SMMPs.

The program allocates a large array with R rows and C columns, and applies a certain linear algebra operation (let's call it a "reflection step") K times to each of P pairs of rows, in somewhat random order. Each reflection step scans the two rows in parallel, performing C "elementary operations", each consisting of a few read/write, arithmetic, and pointer-increment operations on their elements.
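The text does not spell out the reflection step; as a minimal sketch, assume it is a Givens-style reflection that replaces the two rows by linear combinations of themselves (the coefficients c, s with c^2 + s^2 = 1 are arbitrary here):

    /* Hypothetical reflection step: overwrite rows x and y with a
       2x2 reflection of themselves.  Each iteration is one
       "elementary operation": 2 reads, 2 writes, 4 multiplications,
       2 additions, and 2 pointer increments. */
    void reflect(double *x, double *y, long C, double c, double s)
    { long k;
      for (k = 0; k < C; k++)
        { double xk = *x, yk = *y;
          *x++ = c*xk + s*yk;
          *y++ = s*xk - c*yk;    /* determinant -1: a reflection */
        }
    }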

The parameter C is specified by the user; then R is computed so that the array fills the available memory; K is chosen so that the number of elementary operations applied to each pair of rows, K·C, is comfortably large (about 2,000,000 in these tests); and, finally, the number of pairs P is chosen so that the total number of elementary operations, P·K·C, is about the same in all tests (about 300,000,000). Another user option determines whether the array is laid out in memory row by row, as in C, or column by column, as in FORTRAN.
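In code, the parameter choices described above might look as follows (a sketch; MEM_WORDS and the two targets are assumed constants, and the function name is ours):

    /* Hypothetical parameter setup following the rules above. */
    #define MEM_WORDS  (32L*1024*1024)  /* memory to fill, in words (assumed) */
    #define PAIR_OPS   2000000.0        /* target for K*C, ops per row pair */
    #define TOTAL_OPS  300000000.0      /* target for P*K*C, ops per test */

    void choose_params(long C, long *R, long *K, long *P)
    { *R = MEM_WORDS/C;                     /* array fills available memory */
      *K = (long)(PAIR_OPS/C + 0.5);        /* so that K*C ~ 2 million */
      *P = (long)(TOTAL_OPS/((*K)*(double)C) + 0.5);  /* P*K*C ~ 300 million */
    }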

The actual computations are carried out by T concurrent threads. Each thread repeatedly grabs two row indices i,j from a shared queue, applies the reflection step K times to rows i and j, and returns the indices to the queue. If the work per pair K·C is reasonably large (as it was in the experiments reported here), the thread synchronization overhead is negligible, and the computation is dominated by the inner loop of the reflection step.
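A minimal sketch of the thread body, assuming a POSIX-threads implementation; the queue functions and do_pair are hypothetical names standing in for the benchmark's own routines:

    #include <pthread.h>

    extern void do_pair(long i, long j);    /* K reflection steps on rows i,j */
    extern int  get_pair(long *i, long *j); /* hypothetical: pop a pair from
                                               the shared queue; 0 when done */
    extern void put_pair(long i, long j);   /* hypothetical: push the pair back */

    static void *worker(void *arg)
    { long i, j;
      while (get_pair(&i, &j))   /* grab two row indices */
        { do_pair(i, j);         /* apply the reflection step K times */
          put_pair(i, j);        /* return the indices to the queue */
        }
      return NULL;
    }

    void run_threads(int T)      /* spawn T workers and wait for them */
    { pthread_t th[64];
      int t;
      for (t = 0; t < T; t++) pthread_create(&th[t], NULL, worker, NULL);
      for (t = 0; t < T; t++) pthread_join(th[t], NULL);
    }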

The "working set" of each thread, for most of the computation, is therefore a pair of rows, each of them consisting of C words; this working set is scanned K times before the thread shifts to another pair of rows (usually disjoint from the first one).

Effect of parameters on memory access patterns

By varying the parameters C, T, and the array storage order, this program can probe the machine's performance in all the regimes discussed above. The main cases are:

  * Small C, array stored row by row: each thread's working set (two rows) fits comfortably in its processor's cache, so nearly all accesses are satisfied locally, and near-ideal speedup can be expected.

  * Large C, array stored row by row: each reflection step is a long sequential scan through memory; the cache cannot reduce the total traffic, and the memory channel may become the bottleneck as T grows.

  * Array stored column by column: consecutive elements of a row lie R words apart, so each elementary operation touches a different cache line, multiplying the actual traffic by the line width W; performance can be poor even with T = 1, and extra processors may not help at all.

It should be emphasized that the total number of arithmetic, read/write, and pointer-increment operations is very nearly the same in all cases. In particular, the total memory traffic actually requested by the program is 2·K·P·C words. Any variation in the actual memory traffic is therefore due to cache effects.
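The layout option affects only the index arithmetic of the inner loop. A sketch of the two address computations for the R×C array, stored in a flat vector (function names are ours):

    /* Element [i][j] of an R x C array stored in a flat vector a[]. */
    double *elem_rowwise(double *a, long C, long i, long j)
    { return &a[i*C + j]; }   /* row by row (C order): stride 1 along a row */

    double *elem_colwise(double *a, long R, long i, long j)
    { return &a[j*R + i]; }   /* column by column (FORTRAN order):
                                 stride R along a row */

Once R exceeds the cache line width, the column-by-column layout makes every elementary operation of a row scan land on a different line.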

Actual data

Compaq (Alpha) server

(Layout: C = row by row, as in C; S = column by column, as in FORTRAN. "Elapsed" is wall-clock time; "Total" is CPU time summed over all processors; "Rel spdup" is speedup relative to the T = 1 run; "Rel effcy" is efficiency relative to the best case, the row-by-row runs with C = 418.)

Array   Columns  Wk.set  Threads  Elapsed   Rel     Total    Rel
layout     C       KB       T       sec     spdup    sec    effcy

  C        418     6.7      1        4.6     1.       4.5   1.00
  C        418     6.7      2        2.3     2.       4.6   0.98
  C        418     6.7      4        1.1     4.       4.3   1.04
  C        418     6.7      8        0.6     8.       4.6   0.98
  C        418     6.7     10        0.6     8.       4.7   0.96
  C        418     6.7     16        0.6     8.       4.7   0.96
  C        418     6.7     20        0.6     8.       4.8   0.94
  C        418     6.7     24        0.6     8.       4.7   0.96
  C        418     6.7     32        0.6     8.       4.8   0.94
  C        418     6.7     48        0.6     8.       4.9   0.92

  C      10000     160      1       10.5     1.0     10.5   0.43
  C      10000     160      2        5.3     2.0     10.7   0.42
  C      10000     160      4        2.6     4.0     10.2   0.44
  C      10000     160      8        1.5     7.0     10.9   0.41
  C      10000     160     10        1.6     6.6     11.8   0.38
  C      10000     160     16        1.3     8.0     10.0   0.45
  C      10000     160     20        1.4     7.5     11.0   0.41
  C      10000     160     24        1.3     8.0     10.1   0.45
  C      10000     160     32        1.3     8.0     10.2   0.44
  C      10000     160     48        1.3     8.0     10.2   0.44

  S        418     6.7      1      274.6     1.0    274.4   0.016
  S        418     6.7      2      214.6     1.3    428.9   0.010
  S        418     6.7      4      142.7     1.9    568.1   0.008
  S        418     6.7      8      103.6     2.7    817.3   0.006
  S        418     6.7     10      101.1     2.7    801.8   0.006
  S        418     6.7     20       81.4     3.4    643.5   0.007
  S        418     6.7     24       76.0     3.6    604.1   0.007
  S        418     6.7     32       68.2     4.0    543.2   0.008
  S        418     6.7     48       56.8     4.9    450.8   0.010

  S      10000     160      1      291.2     1.0    291.1   0.015
  S      10000     160      2      206.6     1.4    411.8   0.011
  S      10000     160      4      135.4     2.2    538.9   0.008
  S      10000     160      8      100.5     2.9    788.5   0.006
  S      10000     160     10      104.7     2.8    823.9   0.005
  S      10000     160     16       94.8     3.1    753.0   0.006
  S      10000     160     20       92.5     3.1    733.0   0.006
  S      10000     160     24       88.3     3.3    700.4   0.006
  S      10000     160     32       83.0     3.5    658.9   0.007
  S      10000     160     48       74.1     3.9    590.7   0.008