#### **Principle of Locality**

- Programs access a small proportion of their address space at any time
- Temporal locality
  - Items accessed recently are likely to be accessed again soon
  - e.g., instructions in a loop, induction variables
- Spatial locality
  - Items near those accessed recently are likely to be accessed soon
  - e.g., sequential instruction access, array data

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 3

#### **Taking Advantage of Locality**

- Memory hierarchy
- Store everything on disk
- Copy recently accessed (and nearby) items from disk to smaller DRAM memory
  - Main memory
- Copy more recently accessed (and nearby) items from DRAM to smaller SRAM memory
  - Cache memory attached to CPU

#### **Tags and Valid Bits**

- How do we know which particular block is stored in a cache location?
  - Store block address as well as the data
  - Actually, only need the high-order bits
  - Called the tag
- What if there is no data in a location?
  - Valid bit: 1 = present, 0 = not present
  - Initially 0

#### Cache Example

- 8 blocks, 1 word/block, direct mapped
- Initial state

| Index | V | Tag | Data |
|-------|---|-----|------|
| 000   | N |     |      |
| 001   | N |     |      |
| 010   | N |     |      |
| 011   | N |     |      |
| 100   | N |     |      |
| 101   | N |     |      |
| 110   | N |     |      |
| 111   | N |     |      |

| Word addr | Binary addr | Hit/miss | Cache block |
|-----------|-------------|----------|-------------|
| 22        | 10 110      | Miss     | 110         |

| Index | V | Tag | Data       |
|-------|---|-----|------------|
| 000   | N |     |            |
| 001   | N |     |            |
| 010   | N |     |            |
| 011   | N |     |            |
| 100   | N |     |            |
| 101   | N |     |            |
| 110   | Y | 10  | Mem[10110] |
| 111   | N |     |            |

#### Block Size Considerations

- Larger blocks should reduce miss rate
  - Due to spatial locality
- But in a fixed-sized cache
  - Larger blocks ⇒ fewer of them
    - More competition ⇒ increased miss rate
  - Larger blocks ⇒ pollution
- Larger miss penalty
  - Can override
benefit of reduced miss rate
- Early restart and critical-word-first can help

#### **Cache Misses**

- On cache hit, CPU proceeds normally
- On cache miss
  - Stall the CPU pipeline
  - Fetch block from next level of hierarchy
  - Instruction cache miss
    - Restart instruction fetch
  - Data cache miss
    - Complete data access

#### Write-Through

- On data-write hit, could just update the block in cache
  - But then cache and memory would be inconsistent
- Write through: also update memory
- But makes writes take longer
  - e.g., if base CPI = 1, 10% of instructions are stores, write to memory takes 100 cycles
    - Effective CPI = 1 + 0.1 × 100 = 11
- Solution: write buffer
  - Holds data waiting to be written to memory
  - CPU continues immediately
    - Only stalls on write if write buffer is already full

#### Write-Back

- Alternative: On data-write hit, just update the block in cache
  - Keep track of whether each block is dirty
- When a dirty block is replaced
  - Write it back to memory
  - Can use a write buffer to allow replacing block to be read first

#### Write Allocation

- What should happen on a write miss?
- Alternatives for write-through
  - Allocate on miss: fetch the block
  - Write around: don't fetch the block
    - Since programs often write a whole block before reading it (e.g., initialization)
- For write-back
  - Usually fetch the block

#### Example: Intrinsity FastMATH

- Embedded MIPS processor
  - 12-stage pipeline
  - Instruction and data access on each cycle
- Split cache: separate I-cache and D-cache
  - Each 16KB: 256 blocks × 16 words/block
  - D-cache: write-through or write-back
- SPEC2000 miss rates
  - I-cache: 0.4%

#### **Advanced DRAM Organization**

- Bits in a DRAM are organized as a rectangular array
  - DRAM accesses an entire row
  - Burst mode: supply successive words from a row with reduced latency
- Double data rate (DDR) DRAM
  - Transfer on rising and falling clock edges
- Quad data rate (QDR) DRAM
  - Separate DDR inputs and outputs

#### Measuring Cache Performance

- Components of CPU time
  - Program execution cycles
    - Includes cache hit time
  - Memory stall cycles
    - Mainly from cache misses
- With simplifying assumptions:

$$
\begin{aligned}
\text{Memory stall cycles} &= \frac{\text{Memory accesses}}{\text{Program}} \times \text{Miss rate} \times \text{Miss penalty} \\
&= \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Misses}}{\text{Instruction}} \times \text{Miss penalty}
\end{aligned}
$$

#### Cache Performance Example

- Given
  - I-cache miss rate = 2%
  - D-cache miss rate = 4%
  - Miss penalty = 100 cycles
  - Base CPI (ideal cache) = 2
  - Loads and stores are 36% of instructions
- Miss cycles per instruction
  - I-cache: 0.02 × 100 = 2
  - D-cache: 0.36 × 0.04 × 100 = 1.44
- Actual CPI = 2 + 2 + 1.44 = 5.44
  - Ideal CPU is 5.44/2 = 2.72 times faster

#### **Performance Summary**

- When CPU performance
increases
  - Miss penalty becomes more significant
- Decreasing base CPI
  - Greater proportion of time spent on memory stalls
- Increasing clock rate
  - Memory stalls account for more CPU cycles
- Can't neglect cache behavior when evaluating system performance

#### **Associative Caches**

- Fully associative
  - Allow a given block to go in any cache entry
  - Requires all entries to be searched at once
  - Comparator per entry (expensive)
- n-way set associative
  - Each set contains n entries
  - Block number determines which set
    - (Block number) modulo (#Sets in cache)
  - Search all entries in a given set at once
  - n comparators (less expensive)

#### **How Much Associativity**

- Increased associativity decreases miss rate
  - But with diminishing returns
- Simulation of a system with 64KB D-cache, 16-word blocks, SPEC2000
  - 1-way: 10.3%
  - 2-way: 8.6%
  - 4-way: 8.3%
  - 8-way: 8.1%

#### **Replacement Policy**

- Direct mapped: no choice
- Set associative
  - Prefer non-valid entry, if there is one
  - Otherwise, choose among entries in the set
- Least-recently used (LRU)
  - Choose the one unused for the longest time
    - Simple for 2-way, manageable for 4-way, too hard beyond that
- Random
  - Gives approximately the same performance as LRU for high associativity

#### **Multilevel Caches**

- Primary cache attached to CPU
  - Small, but fast
- Level-2 cache services misses from primary cache
  - Larger, slower, but still faster than main memory
- Main memory services L-2 cache misses
- Some high-end systems include L-3 cache

#### **Multilevel Cache Example**

- Given
  - CPU base CPI = 1, clock rate = 4GHz
  - Miss rate/instruction = 2%
  - Main memory access time = 100ns
- With just primary cache
  - Miss penalty = 100ns/0.25ns = 400 cycles
  - Effective CPI = 1 + 0.02 ×
400 = 9

#### **Example (cont.)**

- Now add L-2 cache
  - Access time = 5ns
  - Global miss rate to main memory = 0.5%
- Primary miss with L-2 hit
  - Penalty = 5ns/0.25ns = 20 cycles
- Primary miss with L-2 miss
  - Extra penalty = 400 cycles
- CPI = 1 + 0.02 × 20 + 0.005 × 400 = 3.4
- Performance ratio = 9/3.4 = 2.6

#### **Multilevel Cache Considerations**

- Primary cache
  - Focus on minimal hit time
- L-2 cache
  - Focus on low miss rate to avoid main memory access
  - Hit time has less overall impact
- Results
  - L-1 cache usually smaller than a single cache
  - L-1 block size smaller than L-2 block size

#### **Interactions with Advanced CPUs**

- Out-of-order CPUs can execute instructions during cache miss
  - Pending store stays in load/store unit
  - Dependent instructions wait in reservation stations
    - Independent instructions continue
- Effect of miss depends on program data flow
  - Much harder to analyse
  - Use system simulation

#### Interactions with Software

- Misses depend on memory access patterns
  - Compiler optimization for memory access

#### **Virtual Memory**

- Use main memory as a "cache" for secondary (disk) storage
  - Managed jointly by CPU hardware and the operating system (OS)
- Programs share main memory
  - Each gets a private virtual address space holding its frequently used code and data
  - Protected from other programs
- CPU and OS translate virtual addresses to physical addresses
  - VM "block" is called a page
  - VM translation "miss" is called a page fault

#### **Page Fault Penalty**

- On page fault, the page must be fetched from disk
  - Takes millions of clock cycles
  - Handled by OS code
- Try to minimize page fault rate
  - Fully associative placement
  - Smart replacement algorithms
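The stall arithmetic used in the multilevel cache example above also shows why the page fault rate has to be kept tiny. The following is a minimal Python sketch, not part of the original slides: the helper names `to_cycles` and `effective_cpi` are illustrative, and the page fault penalty and fault rate are assumed values chosen only to match the "millions of clock cycles" order of magnitude.

```python
def to_cycles(time_ns, clock_ghz):
    """Convert a latency in nanoseconds to clock cycles (ns x GHz = cycles)."""
    return time_ns * clock_ghz


def effective_cpi(base_cpi, stall_events):
    """Base CPI plus, per event type, (events per instruction) x (penalty in cycles)."""
    return base_cpi + sum(rate * penalty for rate, penalty in stall_events)


CLOCK_GHZ = 4                                # from the example: 0.25 ns per cycle
memory_penalty = to_cycles(100, CLOCK_GHZ)   # 100 ns main memory access
l2_penalty = to_cycles(5, CLOCK_GHZ)         # 5 ns L-2 access

# Numbers taken from the multilevel cache example.
cpi_l1_only = effective_cpi(1, [(0.02, memory_penalty)])
cpi_with_l2 = effective_cpi(1, [(0.02, l2_penalty), (0.005, memory_penalty)])

# ASSUMED numbers (not from the slides): a 1,000,000-cycle page fault penalty
# and one page fault per 100,000 instructions.
ASSUMED_FAULT_PENALTY = 1_000_000
ASSUMED_FAULTS_PER_INSTRUCTION = 1e-5
cpi_with_faults = effective_cpi(
    1, [(ASSUMED_FAULTS_PER_INSTRUCTION, ASSUMED_FAULT_PENALTY)]
)

print(f"L1 only:          CPI = {cpi_l1_only:.1f}")
print(f"L1 + L2:          CPI = {cpi_with_l2:.1f}")
print(f"Speedup from L2:  {cpi_l1_only / cpi_with_l2:.1f}x")
print(f"Page faults only: CPI = {cpi_with_faults:.1f}")
```

Under these assumed values, a fault on only one instruction in 100,000 adds about 10 cycles to every instruction on average, more than the whole cache hierarchy in the example, which is the motivation for fully associative placement and careful replacement.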
#### **Page Tables**

- Stores placement information
  - Array of page table entries, indexed by virtual page number
  - Page table register in CPU points to page table in physical memory
- If page is present in memory
  - PTE stores the physical page number
  - Plus other status bits (referenced, dirty, ...)
- If page is not present
  - PTE can refer to location in swap space on disk

#### Replacement and Writes

- To reduce page fault rate, prefer least-recently used (LRU) replacement
  - Reference bit (aka use bit) in PTE set to 1 on access to page
  - Periodically cleared to 0 by OS
  - A page with reference bit = 0 has not been used recently
- Disk writes take millions of cycles
  - Block at once, not individual locations
  - Write through is impractical
  - Use write-back
  - Dirty bit in PTE set when page is written

#### Fast Translation Using a TLB

- Address translation would appear to require extra memory references
  - One to access the PTE
  - Then the actual memory access
- But access to page tables has good locality
  - So use a fast cache of PTEs within the CPU
  - Called a Translation Look-aside Buffer (TLB)
  - Typical: 16–512 PTEs, 0.5–1 cycle for hit, 10–100 cycles for miss, 0.01%–1% miss rate
  - Misses could be handled by hardware or software

#### **TLB Misses**

- If page is in memory
  - Load the PTE from memory and retry
  - Could be handled in hardware
    - Can get complex for more complicated page table structures
  - Or in software
    - Raise a special exception, with optimized handler
- If page is not in memory (page fault)
  - OS handles fetching the page and updating the page table
  - Then restart the faulting instruction

#### **TLB Miss Handler**

- TLB miss indicates
  - Page present, but PTE not in TLB
  - Page not present
- Must recognize TLB miss before destination register
is overwritten
  - Raise exception
- Handler copies PTE from memory to TLB
  - Then restarts instruction
  - If page not present, page fault will occur

#### Page Fault Handler

- Use faulting virtual address to find PTE
- Locate page on disk
- Choose page to replace
  - If dirty, write to disk first
- Read page into memory and update page table
- Make process runnable again
  - Restart from faulting instruction

#### The Memory Hierarchy

#### The BIG Picture

- Common principles apply at all levels of the memory hierarchy
  - Based on notions of caching
- At each level in the hierarchy
  - Block placement
  - Finding a block
  - Replacement on a miss
  - Write policy

#### **Block Placement**

- Determined by associativity
  - Direct mapped (1-way associative)
    - One choice for placement
  - n-way set associative
    - n choices within a set
  - Fully associative
    - Any location
- Higher associativity reduces miss rate
  - Increases complexity, cost, and access time

#### Finding a Block

| Associativity         | Location method                               | Tag comparisons |
|-----------------------|-----------------------------------------------|-----------------|
| Direct mapped         | Index                                         | 1               |
| n-way set associative | Set index, then search entries within the set | n               |
| Fully associative     | Search all entries                            | #entries        |
| Fully associative     | Full lookup table                             | 0               |

- Hardware caches
  - Reduce comparisons to reduce cost
- Virtual memory
  - Full table lookup makes full associativity feasible
  - Benefit in reduced miss rate

#### Replacement

- Choice of entry to replace on a miss
  - Least recently used (LRU)
    - Complex and costly hardware for high associativity
  - Random
    - Close to LRU, easier to implement
- Virtual memory
  - LRU approximation with hardware support
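The placement, lookup, and LRU replacement rules above can be combined in a small simulator. This is a minimal Python sketch under stated assumptions, not the slides' own code: the class name `SetAssociativeCache`, the 4-block cache size, and the block-number trace are all hypothetical choices for illustration. The set index follows the rule (block number) modulo (number of sets), and each set is an `OrderedDict` kept in least-recently-used order.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """n-way set associative cache of block numbers with LRU replacement."""

    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        # One ordered map per set: oldest (least recently used) tag first.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, block_number):
        """Return True on a hit, False on a miss (the block is then loaded)."""
        index = block_number % self.num_sets   # which set to search
        tag = block_number // self.num_sets    # identifies the block within the set
        entries = self.sets[index]
        if tag in entries:
            entries.move_to_end(tag)           # mark as most recently used
            return True
        if len(entries) >= self.ways:
            entries.popitem(last=False)        # set is full: evict the LRU entry
        entries[tag] = True                    # otherwise an empty entry is used
        return False


def count_misses(num_sets, ways, trace):
    cache = SetAssociativeCache(num_sets, ways)
    return sum(0 if cache.access(block) else 1 for block in trace)


# ASSUMED illustration: a 4-block cache and a short trace of block numbers.
trace = [0, 8, 0, 6, 8]
direct_mapped = count_misses(num_sets=4, ways=1, trace=trace)
two_way = count_misses(num_sets=2, ways=2, trace=trace)
fully_associative = count_misses(num_sets=1, ways=4, trace=trace)
print(direct_mapped, two_way, fully_associative)  # prints "5 4 3"
```

Direct mapped and fully associative are the two extremes of the same structure (1 way per set, or a single set), which is why one class covers all three rows of the comparison.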
#### **Write Policy**

- Write-through
  - Update both upper and lower levels
  - Simplifies replacement, but may require write buffer
- Write-back
  - Update upper level only
  - Update lower level when block is replaced
  - Need to keep more state
- Virtual memory
  - Only write-back is feasible, given disk write latency

#### **Sources of Misses**

- Compulsory misses (aka cold start misses)
  - First access to a block
- Capacity misses
  - Due to finite cache size
  - A replaced block is later accessed again
- Conflict misses (aka collision misses)
  - In a non-fully associative cache
  - Due to competition for entries in a set
  - Would not occur in a fully associative cache of the same total size

#### Cache Design Trade-offs

| Design change          | Effect on miss rate        | Negative performance effect                                                                 |
|------------------------|----------------------------|---------------------------------------------------------------------------------------------|
| Increase cache size    | Decrease capacity misses   | May increase access time                                                                    |
| Increase associativity | Decrease conflict misses   | May increase access time                                                                    |
| Increase block size    | Decrease compulsory misses | Increases miss penalty. For very large block size, may increase miss rate due to pollution. |

#### **Concluding Remarks**

- Fast memories are small, large memories are slow
  - We really want fast, large memories ☹
  - Caching gives this illusion ☺
- Principle of locality
  - Programs use a small part of their memory space frequently
- Memory hierarchy
  - L1 cache ↔ L2 cache ↔ ... ↔ DRAM memory ↔ disk
- Memory system design is critical for multiprocessors
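The three categories from Sources of Misses can be told apart by running a direct-mapped cache next to a fully associative cache of the same total size. This is a rough Python sketch rather than a definitive method: the function name `classify_misses`, the traces, and the choice of LRU for the fully associative reference cache are assumptions made for illustration.

```python
from collections import OrderedDict


def classify_misses(trace, num_blocks):
    """Classify each access to a direct-mapped cache of `num_blocks` blocks."""
    seen = set()                    # blocks referenced at least once so far
    direct = [None] * num_blocks    # direct mapped: one block per index
    full = OrderedDict()            # fully associative reference cache, LRU order
    counts = {"hit": 0, "compulsory": 0, "capacity": 0, "conflict": 0}

    for block in trace:
        # Update the fully associative reference cache (ASSUMED LRU policy).
        full_hit = block in full
        if full_hit:
            full.move_to_end(block)
        else:
            if len(full) >= num_blocks:
                full.popitem(last=False)
            full[block] = True

        # Update the direct-mapped cache being studied.
        index = block % num_blocks
        direct_hit = direct[index] == block
        direct[index] = block

        if direct_hit:
            counts["hit"] += 1
        elif block not in seen:
            counts["compulsory"] += 1   # first access to this block
        elif full_hit:
            counts["conflict"] += 1     # full associativity would have hit
        else:
            counts["capacity"] += 1     # missed even with full associativity
        seen.add(block)
    return counts


# ASSUMED traces of block numbers for a 4-block cache.
conflict_heavy = classify_misses([0, 8, 0, 6, 8], num_blocks=4)
capacity_case = classify_misses([0, 1, 2, 3, 4, 0], num_blocks=4)
print(conflict_heavy)
print(capacity_case)
```

In the first trace, blocks 0 and 8 compete for the same index, so the repeated accesses count as conflict misses; in the second, five distinct blocks cannot fit in four entries, so the return to block 0 is a capacity miss.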