

# Traditional Sources of Performance Improvement are Flat-Lining

### New Constraints

15 years of *exponential* clock rate growth has ended

## But Moore's Law continues!

- How do we use all of those transistors to keep performance increasing at historical rates?
- Industry response: the number of cores per chip doubles every 18 months *instead* of clock frequency!
- Is multicore the correct response?
  Office of Science





# **ERSC** Microprocessors: Up Against the Wall(s)

### From Joe Gebis

- Microprocessors are hitting a power wall
  - Higher clock rates and greater leakage increasing power consumption
- Reaching the limits of what non-heroic heat solutions can handle
- Newer technology becoming more difficult to produce, removing the previous trend of "free" power improvement





Sites: LBNL, ANL, ORNL, PNNL

## ORNL Computing Power and Cooling 2006-2011

### Computer Center Power Projections

- Immediate need to add 8 MW to prepare for 2007 installs of new systems


- NLCF petascale system could require an additional 10 MW by 2008
- Need total of 40-50 MW for projected systems by 2011
- Numbers just for computers: add 75% for cooling
- Cooling will require 12,000-15,000 tons of chiller capacity
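The projection arithmetic in these bullets can be sketched roughly as follows. This is an illustration, not the center's actual cost model: the function names are hypothetical, the 75% cooling overhead is taken from the bullet above, and the flat rate of \$50/MWh with constant full load all year is an assumption chosen for illustration.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years


def total_power_mw(computer_mw, cooling_overhead=0.75):
    """Computer power plus cooling, using the 75% overhead quoted above."""
    return computer_mw * (1.0 + cooling_overhead)


def annual_cost_dollars(total_mw, rate_per_mwh=50.0):
    """Yearly electricity cost, assuming constant load at a flat (assumed) rate."""
    return total_mw * HOURS_PER_YEAR * rate_per_mwh


# The 40-50 MW range projected for 2011.
for computer_mw in (40, 50):
    total = total_power_mw(computer_mw)
    cost = annual_cost_dollars(total)
    print(f"{computer_mw} MW of computers -> {total:.1f} MW total, "
          f"about ${cost / 1e6:.1f}M per year")
```

Under these assumptions, cooling alone adds tens of megawatts to the projected load, which is why the power numbers cannot be read as computer power only.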

[Figure: projected power (MW) by year, 2005 to 2011, split into Computers and Cooling, with cost labels of \$3M, \$9M, \$17M, \$23M, and \$31M. Cost estimates based on \$0.05 per kWh.]

Electrical Power Rates (\$/MWh)

| FY 2005 | FY 2006 | FY 2007 | FY 2008 | FY 2009 | FY 2010 |
|---------|---------|---------|---------|---------|---------|
| 43.70   | 50.23   | 53.43   | 57.51   | 58.20   | 56.40 * |
| 44.92   | 53.01   |         |         |         |         |

Annual Average

51.33

N/A

Data taken from Energy Management System-4 (EMS4). EMS4 is the DOE corporate system for collecting energy information from the sites. EMS4 is a web-based system that collects energy consumption and cost information for all energy sources used at each DOE site. Information is entered into EMS4 by the site and reviewed at Headquarters for accuracy.


46.34

49.82









Benchmark results from Michael Wehner, Art Mirin, Patrick Worley, Leonid Oliker



- Coarse Grained DVFS (slow down entire chip or core)
- Clock gating
- Ad-hoc environmental monitoring

## Need innovations in

- Power efficiency metrics
- Tight coupling of instrumentation, system HW, & software response/instrumentation (sensors and actuators)
- Power aware algorithms
- Joule Counters
- PAPI-like analogue for collecting/unifying power/environmental monitoring data
- Fine grained DVFS
  - must always slow down for something
  - New notion of system balance (unbalanced if slow down to wait for the same resource)
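As a rough illustration of what a joule counter would report, the sketch below integrates timestamped power samples into energy using the trapezoidal rule. The function name and the sample format are hypothetical. A real counter would read hardware sensors rather than a Python list.

```python
def energy_joules(samples):
    """Integrate (time_s, power_w) samples into joules with the trapezoidal rule.

    `samples` is a hypothetical list of (timestamp in seconds, power in watts)
    pairs, sorted by time.
    """
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total


# Made-up trace: 100 W for one second, then a ramp down to 60 W over one second.
trace = [(0.0, 100.0), (1.0, 100.0), (2.0, 60.0)]
print(energy_joules(trace))
```

Energy to solution, rather than instantaneous power, is the quantity a power-aware algorithm would try to minimize.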





# **Ultimate Destination is Manycore**

(what building blocks should we be leveraging from industry)

## Convergence between HPC and Embedded Computing

- Technology from the embedded market is now trickling <u>up</u> into server design, rather than the traditional trickle-<u>down</u> flow of innovation (*BlueGene and SiCortex*)

## Convergence towards manycore

- hundreds of cores per chip (Cisco Metro, Intel TFLOPs, NVidia CUDA)

## Effects on Computer Architecture

- More/simpler cores per chip!
- Lower degree interconnects
- Constrained memory sizes (no longer 1 byte/flop)
- Doubling of concurrency every 18 months
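To get a feel for what doubling every 18 months implies, here is a minimal sketch of the compounding. The starting point of 2 cores per chip and the function name are assumptions for illustration only.

```python
def projected_cores(initial_cores, years, doubling_period_years=1.5):
    """Cores per chip after `years`, doubling every 18 months."""
    return initial_cores * 2 ** (years / doubling_period_years)


# Hypothetical starting point of 2 cores per chip.
for years in (0, 3, 6, 9, 12, 15):
    print(f"after {years:2d} years: ~{projected_cores(2, years):,.0f} cores")
```

At this rate a dual-core chip reaches thousands of cores within 15 years, which is the sense in which manycore is the destination.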

## Effect on users

- How to ride a wave of exponentially increasing concurrency
- As significant as migration from vector to MPP (early 90's)
- Widespread panic regarding programming model



# Tension between concurrency and power efficiency

- Highly concurrent systems can be more power efficient
  - Dynamic power is proportional to V<sup>2</sup>fC
  - Build systems with even higher concurrency?
- However, many algorithms are unable to exploit massive concurrency yet
  - If higher concurrency cannot deliver faster time to solution, then power efficiency benefit wasted
  - So we should build fewer/faster processors?
- With massive concurrency, the assumptions our current software infrastructure is built upon are no longer valid
  - Programming model will break
  - System software will break
  - Applications will break
  - Hardware will be unbalanced
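The V<sup>2</sup>fC argument can be made concrete with a small worked example. This is a sketch under simplifying assumptions: leakage is ignored, supply voltage is assumed to scale linearly with frequency, and the workload is assumed to parallelize perfectly. All names are hypothetical.

```python
def dynamic_power(voltage, frequency, capacitance=1.0):
    """Relative dynamic power, proportional to V^2 * f * C."""
    return voltage ** 2 * frequency * capacitance


def relative_chip_power(n_cores, freq_scale):
    """Power of n_cores running at freq_scale, relative to one full-speed core.

    Assumes voltage can be lowered in proportion to frequency (an idealization).
    """
    return n_cores * dynamic_power(voltage=freq_scale, frequency=freq_scale)


# Two half-speed cores match one full-speed core in peak throughput
# (2 * 0.5 = 1.0) if the work parallelizes perfectly.
print(relative_chip_power(1, 1.0))  # baseline
print(relative_chip_power(2, 0.5))  # same peak throughput, lower power
```

Under these assumptions the two-core configuration draws about a quarter of the baseline power. That benefit is lost if the algorithm cannot use the extra concurrency.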

## Some of these fears are unfounded



Some require fundamental SW/HW innovation



# Multicore is NOT an SMP-on-a-Chip

- What about Message Passing on a chip?
  - MPI buffers and data structures growing as O(N) or O(N<sup>2</sup>) are a problem for constrained memory
  - Redundant use of memory for shared variables and program image
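To see why per-process state that grows with N becomes a problem, the sketch below estimates buffer memory if every process keeps one buffer per peer. The 16 KiB buffer size and the one-buffer-per-peer policy are assumptions for illustration, not a description of any particular MPI implementation.

```python
def per_process_buffer_bytes(n_procs, bytes_per_peer=16 * 1024):
    """Buffer memory in one process: one buffer per peer, so O(N)."""
    return (n_procs - 1) * bytes_per_peer


def aggregate_buffer_bytes(n_procs, bytes_per_peer=16 * 1024):
    """Buffer memory summed over all processes: O(N^2)."""
    return n_procs * per_process_buffer_bytes(n_procs, bytes_per_peer)


for n in (64, 1024, 16384):
    per_proc = per_process_buffer_bytes(n) / 2**20
    total = aggregate_buffer_bytes(n) / 2**30
    print(f"{n:6d} processes: {per_proc:8.1f} MiB per process, "
          f"{total:10.1f} GiB in aggregate")
```

With memory per core shrinking, the per-process figure is the one that bites first.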

#### What about SMP on a chip?


- Hybrid Model: Long and mostly unsuccessful history due to loop startup/shutdown
- But it is NOT an SMP on a chip
  - 10-100x higher bandwidth on chip
  - 10-100x lower latency on chip
- SMP model ignores potential for much tighter coupling of cores
- Same deal for stream programming model!

### Looking beyond SMP

- Cache Coherency: necessary but not sufficient
- Fine-grained language elements difficult to build on top of CC protocol
- Hardware support for fine-grained synchronization
- Message Queues
- Transactions: Protect against incorrect reasoning about concurrency











- More exotic solutions may be our ultimate destination
  - But need practical experience with exotic HW to find their limits
  - But research pipeline is pretty empty (killed by 15 yrs of relentless clock frequency scaling)
- Locality is key
  - Must be able to expose/manage through language constructs
  - Slim hope of full automation of locality management: existing serial programming languages do not offer sufficient guarantees about locality of effect, so the compiler has too little information to make sane decisions
- Rediscovering dataflow (although we aren't calling it that)
  - Hardware implementations of transactional memory look just like dataflow activation frames from Monsoon
  - Similar observation on programming models for Cell and G80 GPU.







## Basic Processor Efficiency: The Usual List of Suspects

|                                          | IBM/<br>Sony/<br>Toshiba<br>Cell | IBM<br>BlueGene<br>/L<br>(PowerPC<br>440 ASIC) | AMD<br>Opteron<br>K8L | Intel Xeon<br>5100<br>Woodcrest | Intel<br>Itanium2 | Xtensa-<br>based<br>SIMD/LIW<br>Scientific<br>Engine |
|------------------------------------------|----------------------------------|------------------------------------------------|-----------------------|---------------------------------|-------------------|------------------------------------------------------|
| DP Operations per<br>Cycle per Processor | 0.6<br>per SPE                   | 4                                              | 4                     | 4                               | 4                 | 4                                                    |
| Cycles per second<br>(GHz)               | 3.2                              | 0.7                                            | 3.0                   | 3.0                             | 1.4               | 0.65                                                 |
| Processors per IC                        | 8                                | 2                                              | 2                     | 2                               | 1                 | 32                                                   |
| Aggregate DP GFLOPS<br>per IC            | 15                               | 5.6                                            | 24                    | 24                              | 5.6               | 83                                                   |
| Approx IC Power<br>(Watts)               | 30                               | 10                                             | 80                    | 65                              | 130               | 12                                                   |
| IC GFLOPS/Watt                           | 0.5                              | 0.6                                            | 0.3                   | 0.4                             | 0.06              | 7                                                    |

DP FP pipelines in FPGA: 15.9 GFLOPs @ 25W (Xilinx Virtex-4 LX200): 0.63 GFLOPS/W

Source: Vendor websites, www.geek.com, www.answers.com
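The derived rows of the table follow from the first three and the power row. As a sketch of that arithmetic (the class and method names are hypothetical, and the inputs are copied from the table), aggregate GFLOPS is operations per cycle times clock rate times processors per IC, and efficiency is that figure divided by power. The computed ratios may differ slightly from the rounded entries in the table.

```python
from dataclasses import dataclass


@dataclass
class Chip:
    name: str
    dp_ops_per_cycle: float  # per processor
    clock_ghz: float
    processors: int
    watts: float

    def gflops(self):
        """Aggregate double-precision GFLOPS per IC."""
        return self.dp_ops_per_cycle * self.clock_ghz * self.processors

    def gflops_per_watt(self):
        """Efficiency: aggregate GFLOPS divided by approximate IC power."""
        return self.gflops() / self.watts


# Values copied from the table above.
chips = [
    Chip("Cell", 0.6, 3.2, 8, 30),
    Chip("BlueGene/L", 4, 0.7, 2, 10),
    Chip("Opteron K8L", 4, 3.0, 2, 80),
    Chip("Xeon 5100", 4, 3.0, 2, 65),
    Chip("Itanium2", 4, 1.4, 1, 130),
    Chip("Xtensa-based engine", 4, 0.65, 32, 12),
]

for chip in chips:
    print(f"{chip.name:20s} {chip.gflops():6.1f} GFLOPS "
          f"{chip.gflops_per_watt():5.2f} GFLOPS/W")
```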



# Materials Science Mass Migration to New Algorithms

### Materials Science

- Predict bulk material properties from first principles (ab-initio)
- One algorithm, Planewave DFT, accounts for 75% of the materials science workload
- Codes: QBox, PARATEC, VASP
- QBox won Gordon Bell award for scalability!
- However, this is not the correct algorithm to use for petaflop scale calculations!
  - FLOP requirements grow as O(N<sup>3</sup>)
  - Increasingly dominated by BLAS3 (good for FLOPs)
  - But only get to simulate marginally larger system
  - Fails to exploit locality of quantum wave component!
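The "marginally larger system" point follows directly from the cubic scaling. A minimal sketch, assuming cost scales exactly as a power of the system size N and ignoring constants and memory limits (the function name is hypothetical):

```python
def system_size_gain(flops_gain, exponent=3):
    """Factor by which N can grow when available FLOPs grow by flops_gain,
    for an algorithm whose cost scales as N**exponent."""
    return flops_gain ** (1.0 / exponent)


# A 1000x jump in compute (e.g. teraflop to petaflop scale):
print(system_size_gain(1000, exponent=3))  # cubic-scaling method
print(system_size_gain(1000, exponent=1))  # linear-scaling method
```

A thousandfold increase in compute buys roughly a tenfold larger system with a cubic method, against a thousandfold larger one with a linear method.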

### Classical DFT approach cannot continue!

- O(N) algorithms will eventually replace them
- O(N) methods are not yet fully developed because attention goes to classical DFT, which generates impressive FLOP rates
- 75% of the NERSC MatSci workload will have to migrate to O(N) methods, but there is little support for that migration









- Cell memory requests can be nearly completely hidden behind the computation due to asynchronous DMA engines
- Performance model is simple and deterministic (much simpler than modeling a complex cache hierarchy): with memory traffic overlapped by computation, run time is `max(time_for_memory_ops, time_for_core_exec)`
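A minimal sketch of an overlap model of this kind is given below. It assumes that with perfect DMA/compute overlap the slower of the two phases determines the run time, so the estimate is the larger of the two times. The function names and the example timings are hypothetical.

```python
def overlapped_time(time_for_memory_ops, time_for_core_exec):
    """Estimated run time when DMA transfers fully overlap computation.

    Assumption: with perfect overlap the slower of the two phases
    determines the total, so the estimate is the larger of the two times.
    """
    return max(time_for_memory_ops, time_for_core_exec)


def serialized_time(time_for_memory_ops, time_for_core_exec):
    """Estimate with no overlap at all, for comparison."""
    return time_for_memory_ops + time_for_core_exec


# Hypothetical kernel: 4 ms of memory traffic, 3 ms of computation.
print(overlapped_time(4.0, 3.0), serialized_time(4.0, 3.0))
```

The model needs only two inputs per kernel, which is what makes it much simpler than modeling a cache hierarchy.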










# Disconnect Between Productive Science and Easy Scaling

### Combustion/Adaptive Mesh Refinement

- Not limited by bisection bandwidth

- Dominated by compute component for relevant problems
- Scaling of Hyperbolic problems trivial, Elliptic problems challenging

### Materials Science (PARATEC, LS3DF)

- Planewave DFT dominates the materials science workload at NERSC
- Dominated by O(N<sup>3</sup>) localized BLAS3 at petaflop scale (good for FLOP rates)
- Must move to O(N) methods beyond 1k atoms (mass migration required)

### Accelerator Modeling

- Currently formulated as direct solve on sparse matrix and will not scale
- Moving to petaflop scale requires innovation in the mathematical formulation of the problem!
- NERSC is handling all of the most difficult-to-scale applications
  - Leadership application selection process favors applications that already demonstrate scalability
  - Resulting distilling process concentrates hard-to-scale applications at NERSC!

