Modern computer designs generally include processor caches in order to reduce the apparent delay of memory read/write operations.
The simplest kind of cache stores individual words, and is effective for applications that compute for a relatively long time over a small "working set" of memory locations. In that case most memory accesses will be satisfied immediately by the cache.
However, not all applications have small working sets. Many compute-intensive applications, in particular, have inner loops that sequentially scan a set of consecutive memory locations. For these applications, the basic word-by-word cache is not effective. For that reason, most machines have "wide" caches, where the data is managed in relatively large chunks ("lines"). Thus accessing one word of memory will normally cause W consecutive words to be brought into the cache. If memory accesses are indeed sequential, a wide cache has the effect of predicting and anticipating memory accesses, so that the processor does not have to stop for them.
On the other hand, for applications that access memory in random order, or sequentially but with large strides, a wide cache will not be of much help; in fact it will multiply the memory-processor traffic by the cache width W. This extra traffic can easily slow down the system by an order of magnitude.
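A minimal C sketch of the two access patterns (the function and all names are ours, purely for illustration):

    #include <stddef.h>

    /* Sum n doubles taken every `stride` elements.  With stride 1 every
       word of each W-word cache line is used; with stride >= W each
       access pulls in a whole line for the sake of a single word, so
       the memory traffic grows by roughly the factor W. */
    double scan(const double *a, size_t n, size_t stride)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += a[i * stride];
        return s;
    }

The loop body is identical in both cases; only the address sequence, and hence the cache behavior, differs.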
A shared-memory multiprocessor machine (SMMP) can be idealized as a set of N processors connected to a set of M memory modules by a memory channel. The latter can be a simple or multiple bus, a crossbar, a high-speed packet network, etc.
SMMPs generally have dedicated caches attached to each processor, for the same reasons that monoprocessors have them. For applications with small working sets, the caches have the added benefit of reducing contention for the memory channels among the processors. For those applications, the cache allows the ideal speedup N to be achieved, for fairly large N, even when the memory channel is a simple bus.
For applications that scan large amounts of memory sequentially, contention for the memory-processor channels can easily become a bottleneck. Caches should be of little help in this case, since they will not reduce the total amount of memory-processor traffic. Wide caches may be slightly beneficial, however, by reducing the number of separate memory transactions.
For such applications, the maximum number of processors Nmax that can be usefully employed is roughly c·m, where c is the average number of computation clock cycles performed for each sequential memory read or write, and m is the number of words that the memory channel can transfer per clock cycle.
For applications that access data randomly or in large strides, wide caches may have a doubly disastrous effect on SMMP performance. As in monoprocessor machines, the actual memory traffic for each processor will be multiplied by W, which may hurt performance even when N=1. Moreover, the total traffic across the memory channel will be multiplied by W as well; so the maximum achievable parallelism Nmax drops to c·m/W, which can easily be less than 1. In particular, on a single-bus machine, if a single processor is already memory-bound when executing a given application, then activating additional processors can only make things worse.
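To put illustrative numbers on these bounds (the values of c, m, and W here are made up for the example, not measured): on a bus that delivers m = 1 word per clock cycle, an application performing c = 12 computation cycles per word accessed saturates the bus only at Nmax ≈ 12 processors when scanning sequentially. With W = 8-word cache lines and strided access, the bound falls to c·m/W = 1.5; and with c = 4 it falls to 0.5, meaning a single processor already saturates the bus.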
To get a sense for these effects, we have executed a special-purpose benchmark program on several SMMPs.
The program allocates a large array with R rows and C columns, and applies a certain linear algebra operation (let's call it a "reflection step") K times to each of P pairs of rows, in somewhat random order. Each reflection step scans the two rows in parallel, performing C "elementary operations"---each consisting of a couple of read/write, arithmetic, and pointer increment operations on their elements.
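The note does not spell out the reflection step itself; the following C sketch shows one plausible shape for its inner loop, a fixed 2x2 linear transform of two rows (an orthogonal reflection when ca^2 + cb^2 = 1; the coefficients and all names are invented for illustration). The stride parameter anticipates the two layouts discussed below: 1 for row-by-row storage, R for column-by-column.

    /* One "reflection step": C elementary operations, each touching one
       element of each row.  stride = 1 for row-major layout, R for
       column-major.  (Illustrative sketch; not the benchmark's code.) */
    void reflect(double *u, double *v, long C, long stride,
                 double ca, double cb)
    {
        for (long k = 0; k < C; k++) {
            long p = k * stride;
            double t = ca * u[p] + cb * v[p];
            v[p]     = cb * u[p] - ca * v[p];
            u[p]     = t;
        }
    }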
The parameter C is specified by the user; then R is computed so that the array fills the available memory; K is chosen so that the number of elementary operations applied to each pair of rows, K·C, is comfortably large (about 2,000,000 in these tests); and, finally, the number of pairs P is chosen so that the total number of elementary operations, P·K·C, is about the same in all tests (about 300,000,000). Another user option determines whether the array is laid out in memory row-by-row, as in C, or column-by-column, as in FORTRAN.
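As a cross-check against the table below (assuming 8-byte words): for C = 418, a pair of rows occupies 2·418·8 ≈ 6.7 KB, and for C = 10000 it occupies 2·10000·8 = 160 KB, which is exactly the "working set" column. Likewise, K·C ≈ 2,000,000 gives K ≈ 4800 reflection steps per pair for C = 418 and K = 200 for C = 10000; and P·K·C ≈ 300,000,000 gives P ≈ 150 pairs in both cases.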
The actual computations are carried out by T concurrent threads. Each thread repeatedly grabs two row indices i,j from a shared queue, applies the reflection step K times to rows i and j, and returns the indices to the queue. If the work per pair K·C is reasonably large (as it was in the experiments reported here), the thread synchronization overhead is negligible, and the computation is dominated by the inner loop of the reflection step.
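A sketch of the thread body under those assumptions (all names, and the queue interface in particular, are ours; the note does not show the actual benchmark code):

    #include <pthread.h>

    extern long   K, C, STRIDE;                /* test parameters */
    extern int    get_pair(long *i, long *j);  /* pop a pair; 0 = no work left */
    extern void   put_pair(long i, long j);    /* push the pair back */
    extern double *row(long i);                /* address of row i */
    void reflect(double *u, double *v, long C, long stride,
                 double ca, double cb);        /* inner loop sketched above */

    /* Thread body: grab a pair of rows, apply K reflection steps,
       return the pair to the shared queue, repeat. */
    void *worker(void *unused)
    {
        (void)unused;
        long i, j;
        while (get_pair(&i, &j)) {
            for (long k = 0; k < K; k++)
                reflect(row(i), row(j), C, STRIDE, 0.6, 0.8);
            put_pair(i, j);
        }
        return NULL;
    }

With K·C around 2,000,000, the cost of the two queue operations per pair is indeed negligible next to the K·C elementary operations performed between them.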
The "working set" of each thread, for most of the computation, is therefore a pair of rows, each of them consisting of C words; this working set is scanned K times before the thread shifts to another pair of rows (usually disjoint from the first one).
By varying the parameters C, T, and the array storage order, this program can probe the machine's performance in all the regimes discussed above. Here are the main cases:
Compact (row-by-row array layout): In this case, the working set consists of two blocks, each with C consecutive words. We have two sub-cases:
Compact/Small: If the per-processor cache holds at least 2C words, we should observe zero memory latency and near ideal speedup (a factor of T), for T up to N. The total memory traffic will be 2P·C.
Compact/Large: If the cache holds less than 2C words, we should still observe low memory latency, because most words will be pre-fetched; but the total memory traffic will be K times larger, i.e. 2K·P·C, since each pair of rows will be read and written K times instead of just once. Thus, (1) the performance of each processor may be reduced, due to the increased traffic at its memory port; and (2) the speedup factor for T greater than 1 is expected to level off sooner, as the memory channel should saturate more easily (especially on systems with a single memory bus).
Scattered (column-by-column array layout): In this case, the working set consists of two series of C words each, with successive words of each series lying R words apart. Since each element of a row will fall on a different cache line, the memory traffic for a single processor will be multiplied by the cache width W; and further penalties may be incurred when several processors fight over the same cache line. We have two sub-cases:
Scattered/Small: If the processor cache holds at least 2C·W words, we should observe relatively good performance for a single processor, as the working set will be loaded once and processed K times. The total memory traffic will be 2P·C·W.
However, when two processors are working on adjacent columns, they will continuously steal cache lines from each other. Thus the cache will be ineffective, and the total traffic may approach 2P·K·C·W.
Scattered/Large: If the processor cache holds less than 2C·W words, then every memory access will read or write W words of memory. The total traffic will be close to 2P·K·C·W, even for a single processor.
It should be emphasized that the total number of arithmetic, read/write, and pointer increment operations is very nearly the same in all cases. In particular, the total memory traffic actually requested by the program is 2K·P·C words. The variations in total memory traffic are due to cache effects.
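Concretely, with the test parameters above (2K·P·C ≈ 6×10^8 words requested) and an assumed line width of W = 8: Compact/Small moves only 2P·C words (about 1.3×10^5 for C = 418); Compact/Large moves the full 6×10^8; Scattered/Small moves roughly 2P·C·W ≈ 10^6 words on a single processor; and Scattered/Large moves 2K·P·C·W ≈ 4.8×10^9, a factor of almost 40,000 between the best and worst regime for exactly the same computation.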
In the table, the layout code is C for the compact (row-by-row) layout and S for the scattered (column-by-column) one. "Total" is elapsed time multiplied by the number of threads, i.e. total processor time; relative speedup is measured against the T = 1 run of the same configuration; and relative efficiency is the total time of the best run (compact layout, C = 418, T = 1) divided by the total time of the run in question.

| Array layout | Columns C | Wk. set (KB) | Threads T | Elapsed (s) | Rel. speedup | Total (s) | Rel. efficiency |
|--------------|-----------|--------------|-----------|-------------|--------------|-----------|-----------------|
| C | 418 | 6.7 | 1 | 4.6 | 1. | 4.5 | 1.00 |
| C | 418 | 6.7 | 2 | 2.3 | 2. | 4.6 | 0.98 |
| C | 418 | 6.7 | 4 | 1.1 | 4. | 4.3 | 1.04 |
| C | 418 | 6.7 | 8 | 0.6 | 8. | 4.6 | 0.98 |
| C | 418 | 6.7 | 10 | 0.6 | 8. | 4.7 | 0.96 |
| C | 418 | 6.7 | 16 | 0.6 | 8. | 4.7 | 0.96 |
| C | 418 | 6.7 | 20 | 0.6 | 8. | 4.8 | 0.94 |
| C | 418 | 6.7 | 24 | 0.6 | 8. | 4.7 | 0.96 |
| C | 418 | 6.7 | 32 | 0.6 | 8. | 4.8 | 0.94 |
| C | 418 | 6.7 | 48 | 0.6 | 8. | 4.9 | 0.92 |
| C | 10000 | 160 | 1 | 10.5 | 1.0 | 10.5 | 0.43 |
| C | 10000 | 160 | 2 | 5.3 | 2.0 | 10.7 | 0.42 |
| C | 10000 | 160 | 4 | 2.6 | 4.0 | 10.2 | 0.44 |
| C | 10000 | 160 | 8 | 1.5 | 7.0 | 10.9 | 0.41 |
| C | 10000 | 160 | 10 | 1.6 | 6.6 | 11.8 | 0.38 |
| C | 10000 | 160 | 16 | 1.3 | 8.0 | 10.0 | 0.45 |
| C | 10000 | 160 | 20 | 1.4 | 7.5 | 11.0 | 0.41 |
| C | 10000 | 160 | 24 | 1.3 | 8.0 | 10.1 | 0.45 |
| C | 10000 | 160 | 32 | 1.3 | 8.0 | 10.2 | 0.44 |
| C | 10000 | 160 | 48 | 1.3 | 8.0 | 10.2 | 0.44 |
| S | 418 | 6.7 | 1 | 274.6 | 1.0 | 274.4 | 0.016 |
| S | 418 | 6.7 | 2 | 214.6 | 1.3 | 428.9 | 0.010 |
| S | 418 | 6.7 | 4 | 142.7 | 1.9 | 568.1 | 0.008 |
| S | 418 | 6.7 | 8 | 103.6 | 2.7 | 817.3 | 0.006 |
| S | 418 | 6.7 | 10 | 101.1 | 2.7 | 801.8 | 0.006 |
| S | 418 | 6.7 | 20 | 81.4 | 3.4 | 643.5 | 0.007 |
| S | 418 | 6.7 | 24 | 76.0 | 3.6 | 604.1 | 0.007 |
| S | 418 | 6.7 | 32 | 68.2 | 4.0 | 543.2 | 0.008 |
| S | 418 | 6.7 | 48 | 56.8 | 4.9 | 450.8 | 0.010 |
| S | 10000 | 160 | 1 | 291.2 | 1.0 | 291.1 | 0.015 |
| S | 10000 | 160 | 2 | 206.6 | 1.4 | 411.8 | 0.011 |
| S | 10000 | 160 | 4 | 135.4 | 2.2 | 538.9 | 0.008 |
| S | 10000 | 160 | 8 | 100.5 | 2.9 | 788.5 | 0.006 |
| S | 10000 | 160 | 10 | 104.7 | 2.8 | 823.9 | 0.005 |
| S | 10000 | 160 | 16 | 94.8 | 3.1 | 753.0 | 0.006 |
| S | 10000 | 160 | 20 | 92.5 | 3.1 | 733.0 | 0.006 |
| S | 10000 | 160 | 24 | 88.3 | 3.3 | 700.4 | 0.006 |
| S | 10000 | 160 | 32 | 83.0 | 3.5 | 658.9 | 0.007 |
| S | 10000 | 160 | 48 | 74.1 | 3.9 | 590.7 | 0.008 |