MO601 / MC973 - Computer Architecture II

22/12	Final grades.
18/12	Exam 2 grades.
12/12	Project 4 deadline postponed to Dec. 15th.
11/12	Project 2 grades.
05/12	Project 1 grades.
20/10	Class on 21 / Oct canceled. Project 2 postponed to 23 / Oct (end of day).
16/10	Exam 1 grades.
10/08	Lectures will start on Monday, 22nd.
July 22nd	Important dates in the academic calendar: undergraduate and graduate programs.

This course will cover tools and methodologies for Computer Architecture research, including modern simulators and benchmarks for single-core processors, multicore processors, and clusters. We will study recent papers in the area and how they model pipelines, caches, execution engines, power evaluation, etc.

The recommended bibliography contains:

Processor Microarchitecture: An Implementation Perspective. Antonio González, Fernando Latorre and Grigorios Magklis. Synthesis Lectures on Computer Architecture. Morgan & Claypool Publishers.
Modern Processor Design: Fundamentals of Superscalar Processors. John Paul Shen, Mikko H. Lipasti. Waveland Press. 2013.
Papers from Top Computer Architecture Conferences

Videos about virtual memory (Contributed by Alberto Oliveira): 1, 2, 3, 4.

Two written exams: 60% of final grade (30% each).

Practical Projects: 40% of final grade.

Grade ranges: A for grade > 8.4, B for grade > 6.9, C for grade > 4.9, D for grade < 5.
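
The weights and thresholds above can be sketched as a small calculation. This is only an illustration: the function names are hypothetical, the equal weighting of the projects inside the 40% share is an assumption, and so is the treatment of any grade that is not above 4.9 as a D.

```python
def final_grade(exam1, exam2, projects):
    """Weighted final grade: 30% per exam, 40% for the projects."""
    # Assumption: every project weighs the same within the 40% share.
    project_mean = sum(projects) / len(projects)
    return 0.3 * exam1 + 0.3 * exam2 + 0.4 * project_mean


def letter(grade):
    """Map a 0-10 grade to a letter using the thresholds listed above."""
    if grade > 8.4:
        return "A"
    if grade > 6.9:
        return "B"
    if grade > 4.9:
        return "C"
    # Assumption: everything that is not above 4.9 is a D.
    return "D"


if __name__ == "__main__":
    grade = final_grade(8.0, 9.0, [10.0, 10.0, 10.0, 10.0])
    print(round(grade, 2), letter(grade))
```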

Any unethical behavior related to the evaluation process will result in failing the course with the lowest possible grade. Every assignment is an individual assignment unless otherwise mentioned. Students are not expected to look at each other's solutions to the assignments.

Exercises provided by Marcus Angeloni

What are benchmarks?
What is the difference between computer architecture and microarchitecture?
What are the most common dimensions for classifying processor microarchitectures? Explain three of them.
What is the difference between multicore and multithreaded processors?
What are the seven stages of a microprocessor pipeline?
How big would cache memory be on an ideal system? Why?
What is the difference between first-level cache and other levels?
What are the motivations for using virtual addressing?
What is virtual aliasing?
What is a page?
What is a TLB and what is it for?
What is the difference between parallel tag and serial tag in the context of data array access?
What are the types of misses in the context of lockup-free caches?
What is the difference between explicitly addressed and implicitly addressed MSHRs?
What is the difference between multiport and multibank caches?
What is the responsibility of the instruction fetch unit?
How does branch prediction work and what is it for?
What is the difference between BTB and RAS?
What is the difference between conventional cache and trace cache?
What is the difference between static and dynamic prediction?
How do you choose which branch predictor to use?
What is the purpose of the instruction decoder? What does it identify?
What is the difference between RISC and CISC?
How many bytes can an instruction occupy on the x86 architecture?
What are micro-operations and what are they for?
What is the role of the allocation phase in the microprocessor pipeline?
Why is register renaming necessary?
What is the main limitation on the number of registers on the processor?
What is the reorder buffer for?
What is the difference between rename buffer and reorder buffer?
How does the merged register file strategy work?
What are the strategies for when register values are read? What are the advantages of each of them?
What is the idea of the SimPoints approach? What are the benefits?
What are phases of a program?
What are SimPoints?

Other exercises

What are the main limitations on the number of instructions executed in parallel?
Design a piece of C code containing a function call, a loop, and an if statement. Show where the branches would be in the assembly code. For each branch, explain how easy or difficult it is to predict.
Show a piece of assembly code containing at least one each of WAR, RAR, WAW, and RAW dependencies. What will happen to this code after register renaming?
Show an example (piece of code) of a Trace Cache performing better than a conventional cache. How good is this cache in this example?
How can you decode multiple x86 instructions from a block of bytes retrieved from the instruction cache?
Consider a sequence of 10 instructions to be fetched from memory in a scalar processor. Also assume that this processor has a branch predictor. How many times will the branch predictor hardware be accessed? Why?

Exercises provided by Marcus Angeloni

What is the purpose of the dispatch stage?
What is the difference between in-order and out-of-order dispatch?
How does scoreboarding work?
How does the dispatch queue work when operands are read before dispatch?
What are the different dispatch queue events, and what does each of them do?
What is the main difference between reading after dispatch and before dispatch?
How does the dispatch queue work when the operands are read after dispatch?
How do the different types of memory disambiguation work?
What are the concepts behind a distributed dispatch queue?
What are the matrices of indeterminacy and dependency for?
Why is speculating on memory access considered so critical?
What are the differences between conservative and speculative wakeup?
What is the purpose of the execution stage?
What are the most common execution units?
What is the bypassing network?
Why are the arithmetic and logic units generally separate from the multiplication and division units?
How are multiplication and division performed on processors that do not implement these units?
What is the difference between segmented and flat memory models?
What is an effective address?
What is the purpose of the branch unit?
How do SIMD units work? What are their advantages?
How is a SIMD unit generally organized? What are lanes?
What are bubbles? Can bypassing minimize them?
What are the advantages and disadvantages of bypass?
Why is result bypassing in in-order processors often more complex than in out-of-order processors?
What are SRF and what are they for?
How does clustering work?
Name and explain two types of clustering.
Why is a commit stage necessary?
What are the architectural and speculative states and how are they related?
What is the difference between architectural states based on the Retire Register File and the Merged Register File? For what type of processors is it best to use each one?
How is a branch misprediction recovered?
What are the ways of handling a branch misprediction?
How is an exception recovered?

Exercises provided by Pedro Henrique Amorim

What is the role of the Issue stage?
What are the main issue schemes?
In general, how do the in-order and out-of-order schemes work?
What is the role of scoreboarding?
Briefly comment on how the "in-order" technique works on VLIW processors.
Describe the scenario in which the unified issue queue is assumed.
Describe the scenario in which reservation stations are assumed.
A read-before-issue scheme has several components. Comment on the purpose of each of the following:

What is the role of the "ctrl info" block in the out-of-order issue scheme?
What does a CAM (Content-Addressable Memory) do?
Arrays called Src1Id and Src2Id?
Blocks V1 and V2?
Blocks R1 and R2?

Describe the following events:

Issue queue allocation.
Wakeup instruction.
Instruction selection.
Issue queue reclamation.

Name two reasons that can cause an instruction to stall in the issue stage.
In the pipeline of a given processor, the issue stage handles 8 instructions at a time, while the fetch stage can handle 16 instructions simultaneously. How many instructions will complete simultaneously at the commit stage?

Exercises provided by Rafael Junio

How do you calculate the size of a flat page table? Calculate the size of a flat page table for a 64-bit processor using 4 KB pages.
How can a multilevel page table save space compared to a flat page table?
Given a 64-bit virtual address, a 64 KB page size, and 8 B page table entries, with only addresses between 0 and 4 GB in use, calculate the size of the flat page table and of a 4-level page table whose index bits are divided equally among the levels.
Calculate the latency of a miss in a virtual to physical address translation considering a 3-level page table. Consider that:

1 cycle to compute the virtual address
1 cycle to access the cache
20 cycles to access memory
90% hit rate for data

page table is not cached.
page table is cached and has a 90% hit rate.

How big is a TLB with a 4KB page such that it has the same hit rate as a cache with 32KB in size and 64B in block size?
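
As a starting point for the page table exercises above, the following sketch computes the size of a flat page table for a deliberately different configuration (32-bit virtual addresses, 4 KB pages, 4-byte entries), so it does not give away the answers. The function name and the entry size are assumptions made for illustration only.

```python
def flat_page_table_bytes(va_bits, page_size, entry_bytes):
    """Size of a flat page table: one entry per virtual page."""
    # page_size is assumed to be a power of two, so its log2 is the offset width.
    offset_bits = page_size.bit_length() - 1
    virtual_pages = 1 << (va_bits - offset_bits)
    return virtual_pages * entry_bytes


if __name__ == "__main__":
    size = flat_page_table_bytes(va_bits=32, page_size=4096, entry_bytes=4)
    print(size // (1024 * 1024), "MiB")
```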

I will provide office hours after each class. If you need more or alternative time, feel free to schedule by email.

Every assignment is an individual assignment unless otherwise mentioned. Students are not expected to look at each other's solutions to the assignments.

Project 1

Infrastructure: PIN and SPEC 2006

Tasks:

Install SPEC 2006, run it, understand the runspec script.
Install PIN, run a few examples. Understand how it works.
Use the available pintools to count the number of instructions of each SPEC program.
Create a new pintool and use it in at least 5 SPEC programs.

Report:

Create a folder called project1 in your repository
Due Sep 19, 10 AM
Report document. SBC Template. You can choose either English or Portuguese for your report. Maximum of 6 pages containing the count (item 3 above) and the description and results of your pintool. Filename: report.pdf
Presentation document. Create a file called presentation.pdf containing slides for a 5-minute presentation of your project.
CSV file. Include a file results.csv containing the results of the instruction count. This file should contain two columns, where the first is the program name and the second is the instruction count. For programs that run more than once, include the execution number (1, 2, 3) after their names.
Source code. Include your source code in the folder src. This folder should contain all the code that you created or modified, together with scripts to execute it and a README.md file explaining dependencies. Your scripts may rely on environment variables for prerequisites.
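
One possible way to produce the two-column results.csv is sketched below. The page does not specify a header row or how the execution number is attached to the program name, so the absence of a header and the "_<n>" suffix are assumptions, and the counts are placeholders rather than measurements.

```python
import csv
import io


def write_results(records, out):
    """Write (program, run, instruction_count) records as two-column CSV rows."""
    # Count the runs of each program so single-run programs keep a bare name.
    runs_per_program = {}
    for program, _run, _count in records:
        runs_per_program[program] = runs_per_program.get(program, 0) + 1

    writer = csv.writer(out, lineterminator="\n")
    for program, run, count in records:
        # Assumption: the execution number is appended as "_<n>".
        name = f"{program}_{run}" if runs_per_program[program] > 1 else program
        writer.writerow([name, count])


# Placeholder values, not real instruction counts.
sample = [("400.perlbench", 1, 1000), ("400.perlbench", 2, 2000), ("429.mcf", 1, 3000)]
buffer = io.StringIO()
write_results(sample, buffer)
print(buffer.getvalue(), end="")
```

To write the real file, pass an object opened with open("results.csv", "w", newline="") instead of the in-memory buffer.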

Project 2

Infrastructure: PIN, and 10 benchmarks

Tasks:

Evaluate Virtual to Physical memory translation for 4KB and 4MB pages.
Consider up to 512 TLB entries for instruction and data.
Consider a 3- or 4-level page table.
Look for benchmarks with a large memory footprint.
Create one toy benchmark to check your environment.
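
Before writing the pintool, the counting logic for the tasks above can be prototyped offline. The sketch below is not a pintool and makes several assumptions that the project does not prescribe: a fully associative TLB with LRU replacement, a page walk that touches every level on each miss (no page-walk cache), and a synthetic address trace. All names are hypothetical.

```python
from collections import OrderedDict


class TLB:
    """Fully associative TLB with LRU replacement (an assumed organization)."""

    def __init__(self, entries, page_size):
        self.entries = entries
        # page_size is assumed to be a power of two.
        self.page_shift = page_size.bit_length() - 1
        self.pages = OrderedDict()  # virtual page number -> present, in LRU order
        self.accesses = 0
        self.misses = 0
        self.page_table_accesses = 0

    def access(self, vaddr, levels):
        """Translate one address and return True on a TLB hit."""
        self.accesses += 1
        vpn = vaddr >> self.page_shift
        if vpn in self.pages:
            self.pages.move_to_end(vpn)  # mark as most recently used
            return True
        self.misses += 1
        # Assumption: a miss reads one entry per page table level.
        self.page_table_accesses += levels
        if len(self.pages) >= self.entries:
            self.pages.popitem(last=False)  # evict the least recently used entry
        self.pages[vpn] = True
        return False


def run(trace, entries, page_size, levels):
    """Feed a list of virtual addresses to a fresh TLB and return it."""
    tlb = TLB(entries, page_size)
    for vaddr in trace:
        tlb.access(vaddr, levels)
    return tlb


# Toy trace: touch one address in each of 1024 consecutive 4 KB regions, twice.
trace = [region * 4096 for region in range(1024)] * 2
small = run(trace, entries=512, page_size=4 * 1024, levels=4)
large = run(trace, entries=512, page_size=4 * 1024 * 1024, levels=3)
print("4KB pages:", small.misses, "misses,", small.page_table_accesses, "page table accesses")
print("4MB pages:", large.misses, "misses,", large.page_table_accesses, "page table accesses")
```

A trace like this also works as a toy benchmark for checking the environment, since its footprint exceeds the reach of 512 entries with 4 KB pages but fits in a single 4 MB page.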

Report:

Create a folder called project2 in your repository
Due 23 / Oct (end of the day)
Report document. SBC Template. You can choose either English or Portuguese for your report. Maximum of 6 pages containing the count (item 3 above) and the description and results of your pintool. Filename: report.pdf
Presentation document. Create a file called presentation.pdf containing slides for a 5-minute presentation of your project.
CSV file. Include a file results.csv containing your results. For each benchmark, include the following columns: benchmark name (and input if necessary), total memory accesses for instructions, total TLB misses for instructions, total page table accesses for instructions, total memory accesses for data, total TLB misses for data, total page table accesses for data.
Source code. Include your source code in the folder src. This folder should contain all the code that you created or modified, together with scripts to execute it and a README.md file explaining dependencies. Your scripts may rely on environment variables for prerequisites.

Project 3

Goal: Reproduce one item (graph, table, etc.) of a pre-selected paper from the last three editions of the following conferences: ISCA, ASPLOS, MICRO, HPCA.

Tasks:

Create a folder called project3 in your repository
Every Friday, up to 21 / Oct, add a reference to one paper that you have preliminarily inspected to a file called papers.txt
When you feel that you have selected the desired paper, talk to me to reserve it and avoid conflicts. Add the paper's PDF to your repository as paper.pdf and create a short overview of it, together with the specification of your task and how you plan to execute it in the following weeks. This presentation should be in a file called presentation1.pdf, and you are expected to talk about it for 15 minutes in the classes of 31 / Oct and 04 / Nov.

Report:

Create a folder called project3 in your repository
Due 17 / Nov (end of the day)
Report document. SBC Template. You can choose either English or Portuguese for your report. Maximum of 6 pages containing the count (item 3 above) and the description and results of your pintool. Filename: report.pdf
Presentation document. Create a file called presentation2.pdf containing slides for a presentation of your project of up to 15 minutes.
Source code. Include your source code in the folder src. This folder should contain all the code that you created or modified, together with scripts to execute it and a README.md file explaining dependencies. Your scripts may rely on environment variables for prerequisites.

Selected papers

João Paulo Labegalini de Carvalho: Varun Agrawal, Abhiroop Dabral, Tapti Palit, Yongming Shen, and Michael Ferdman. 2015. Architectural Support for Dynamic Linking. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '15). ACM, New York, NY, USA, 691-702. DOI: http://dx.doi.org/10.1145/2694344.2694392
Gustavo Ciotto Pinton: R. Parihar and M. C. Huang, "Accelerating decoupled look-ahead via weak dependence removal: A metaheuristic approach," 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), Orlando, FL, 2014, pp. 662-677.
Ciro Ceissler: Shao, Yakun Sophia, et al. "Aladdin: A pre-rtl, power-performance accelerator simulator enabling large design space exploration of customized architectures." 2014 ACM / IEEE 41st International Symposium on Computer Architecture (ISCA). IEEE, 2014.
Rafael Junio: FLUSH + RELOAD: A High Resolution, Low Noise, L3 Cache Side-Channel Attack. USENIX Security Symposium 2014.
Lucas Prado Melo: Hilton, A. D., B. C. Lee, and Z. Huang. "Decoupling loads for nano-instruction set computers." Proceedings of The 43rd International Symposium on Computer Architecture. 2016.
Paulo Henrique Junqueira Amorim: Tri M. Nguyen and David Wentzlaff. 2015. MORC: the manycore-oriented compressed cache. In Proceedings of the 48th International Symposium on Microarchitecture (MICRO-48).
Uglaybe Fernandes: Albericio, Jorge, et al. "Wormhole: Wisely predicting multidimensional branches." Proceedings of the 47th Annual IEEE / ACM International Symposium on Microarchitecture. IEEE Computer Society, 2014.
Alceu Emanuel Bissoto: Akanksha Jain, Calvin Lin. "Back to the Future: Leveraging Belady's Algorithm for Improved Cache Replacement". Proceedings of The 43rd International Symposium on Computer Architecture (ISCA). 2016.
Marcus de Assis Angeloni: S. Bucur, J. Kinder, and G. Candea - "Prototyping symbolic execution engines for interpreted languages". In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '14). ACM, New York, NY, USA, 239-254.
Rafael Soares: S. Girbal, G. Mouchard, A. Cohen, and O. Temam. DiST: A simple, reliable and scalable method to significantly reduce processor architecture simulation time. In Proceedings of the 2003 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pages 1–12, June 2003.
Alfredo Salvarani: D. Gope and M. H. Lipasti, "Bias-Free Branch Predictor," 2014 47th Annual IEEE / ACM International Symposium on Microarchitecture, Cambridge, 2014, pp. 521-532.
George Araújo: S. Khan, A. R. Alameldeen, C. Wilkerson, O. Mutlu and D. A. Jiménez, "Improving cache performance using read-write partitioning," 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), Orlando, FL, 2014, pp. 452-463.
Paulo Henrique: Eric Rotenberg, Steve Bennett, and James E. Smith. 1996. Trace cache: a low latency approach to high bandwidth instruction fetching. In Proceedings of the 29th annual ACM / IEEE international symposium on Microarchitecture (MICRO 29). IEEE Computer Society, Washington, DC, USA, 24-35.
Davi Castro: G. Aşılıoğlu, Z. Jin, M. Köksal, O. Javeri and S. Önder, "LaZy Superscalar," 2015 ACM / IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), Portland, OR, 2015, pp. 260-271.
Pedro Tadahiro: The Inner Most Loop Iteration counter: a new dimension in branch history - Andre Seznec (INRIA / IRISA), Joshua San Miguel (University of Toronto), Jorge Albericio (University of Toronto)

Project 4

Goal: Extend one activity from the Project 3 paper.

Tasks:

Create a folder called project4 in your repository
You can explore more configurations, make any variation to the algorithm, or pursue other opportunities. You do not need to get better results. You can even work on better explaining the published result.

Report:

Create a folder called project4 in your repository
Due 12 / Dec (end of the day)
Report document. SBC Template. You can choose either English or Portuguese for your report. Maximum of 6 pages containing the count (item 3 above) and the description and results of your pintool. Filename: report.pdf
Presentation document. Create a file called presentation.pdf containing slides for a presentation of your project of up to 15 minutes.
Source code. Include your source code in the folder src. This folder should contain all the code that you created or modified, together with scripts to execute it and a README.md file explaining dependencies. Your scripts may rely on environment variables for prerequisites.

Date	Topic
22 / Aug	Introduction
26 / Aug	Reading a paper - Project 3 description
29 / Aug	Introduction to Microarchitecture
02 / Sep	Overview of Execution Environments
05 / Sep	Caches
09 / Sep	Worktime for Project 1
12 / Sep	Worktime for Project 1
16 / Sep	Fetch Unit
19 / Sep	Project 1
23 / Sep	Decode Unit
26 / Sep	Allocation
30 / Sep	SimPoints
03 / Oct	Worktime for Project 2
07 / Oct	Worktime for Project 2
10 / Oct	Review
14 / Oct	Exam 1
17 / Oct	Exam 1 resolution
21 / Oct	Class canceled
24 / Oct	Project 2 - Presentation
28 / Oct	Holiday
31 / Oct	Project 3 - Preliminary presentation
04 / Nov	Project 3 - Preliminary presentation
07 / Nov	Issue
11 / Nov	Review Project 3
14 / Nov	Holiday
18 / Nov	Project 3 - Presentation
21 / Nov	Project 3 - Presentation
25 / Nov	Execute
28 / Nov	Commit
02 / Dec	Cache coherence
05 / Dec	Review
09 / Dec	Holiday
12 / Dec	Exam 2
16 / Dec	Project 4
