# The Pentium<sup>®</sup> II/III Processor "Compiler on a Chip"

Ronny Ronen, Senior Principal Engineer, Director of Architecture Research, Intel Labs - Haifa

**Intel Corporation** 

Tel Aviv University, January 20, 2004

## Agenda

- General Information
- µarchitecture basics
- Pentium<sup>®</sup> Pro Processor µarchitecture
- SW aspects


# **Technology Profile**

#### Pentium Pro - 1995

- Core @200MHz
- 256K L2 on package, @200MHz
- Performance:
  - **8.09** SPECint95
  - **6.70** SPECfp95
- 0.35 μm BiCMOS
- 5.5M transistors
- 195 sq mm (14x14)
- 3.3V, 11.2A
- 28.1W / 35.0W

#### Pentium-II - 1998

- Core @333MHz
- 512KB L2 in SEC @167MHz
- Performance:
  - **12.8** SPECint95
  - **9.14** SPECfp95 (P55C: 7.12/5.21)
- 0.25 μm CMOS process
- 7.5M transistors

#### Pentium-III - 1999

- Core @600MHz
- 512KB L2 @ ???MHz
- Performance:
  - 24.0 SPECint95
  - **15.9 SPECfp95**
- 0.25 μm CMOS process
- ???M transistors




# **Technology Profile (cont.)**

- Coppermine (Pentium-III, 2000)
  - Core @1000MHz
  - 256KB L2 on chip @ 1000MHz
  - Performance:
    - >46 SPECint95
    - >20 SPECfp95
  - 0.18 μm CMOS process
  - ~20M transistors
- Tualatin (Pentium-III, 2002)
  - Core @1400MHz
  - 512KB L2 on chip @ 1400MHz
  - Performance (estimated):
    - >60 SPECint95
    - >30 SPECfp95
  - 0.13 μm CMOS process
  - ~44M transistors
- Banias (Pentium M Processor, 2003)
  - Core @1800MHz
  - 1024KB L2 on chip @ 1800MHz
  - Performance (estimated):
    - >80 SPECint95
    - >50 SPECfp95
  - 0.10 μm CMOS process
  - ~77M transistors

# Terminology

- Intel Architecture
- Pipeline, Super Scalar
- Branch Prediction
- Speculative Execution
- Dynamic Scheduling
- Data dependency
- Register Renaming
- Out Of Order
- Re-order Buffer & Memory Order Buffer
- Reservation Stations
- Micro-Operations












































*[Block diagram not recovered. Labels from the figure:]*

- ROB
- MIS: Produces uops for complex instructions.
- RAT: Register Alias Table

# **Branch Prediction**

#### Implementation

- Use local history to predict direction
- Need to predict multiple branches
  - ⇒ Need to predict branches before previous branches are resolved
  - ⇒ Branch history is updated first based on the prediction, later based on actual execution (speculative history)
- Target address taken from the BTB
- Prediction rate: ~92%
  - ~60 instructions between mispredictions (assuming 1 branch per 5 instructions on average)
  - A high prediction rate is crucial for long pipelines
  - Especially important for OOOE and speculative execution:
    - On a misprediction, all instructions following the branch in the instruction window are flushed
    - The effective size of the window is determined by prediction accuracy
- RSB used for Call/Return pairs
- Totally re-done on Banias!
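The local-history scheme and the misprediction arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not the actual P6 design: the history length `HISTORY_BITS`, the 2-bit counters, the class `LocalPredictor`, and both helper functions are assumptions made for the example. Each branch is also resolved right after it is predicted, so the repair step never has to discard younger speculative history bits as a real pipeline would.

```python
HISTORY_BITS = 4  # hypothetical history length, chosen only for illustration


class LocalPredictor:
    """Two-level sketch: a per-branch history selects a 2-bit counter."""

    def __init__(self):
        self.history = {}   # branch PC -> speculative local history
        self.counters = {}  # (PC, history) -> saturating counter in 0..3

    @staticmethod
    def _shift(history, taken):
        mask = (1 << HISTORY_BITS) - 1
        return ((history << 1) | int(taken)) & mask

    def predict(self, pc):
        h = self.history.get(pc, 0)
        taken = self.counters.get((pc, h), 1) >= 2
        # Speculative update: the history first records the predicted direction.
        self.history[pc] = self._shift(h, taken)
        return taken, h  # h is handed back so resolve() can train and repair

    def resolve(self, pc, h, predicted, actual):
        c = self.counters.get((pc, h), 1)
        self.counters[(pc, h)] = min(3, c + 1) if actual else max(0, c - 1)
        if predicted != actual:
            # Later update: rebuild the history from the actual outcome.
            self.history[pc] = self._shift(h, actual)


def count_mispredictions(outcomes, pc=0x40):
    """Run one branch through the predictor; return a miss flag per outcome."""
    predictor = LocalPredictor()
    misses = []
    for actual in outcomes:
        predicted, h = predictor.predict(pc)
        predictor.resolve(pc, h, predicted, actual)
        misses.append(predicted != actual)
    return misses


def instructions_between_mispredictions(accuracy, instructions_per_branch):
    """Average instruction count between two mispredicted branches."""
    return instructions_per_branch / (1.0 - accuracy)
```

On a repeating taken, taken, taken, not-taken pattern the sketch stops mispredicting after a short warm-up, because every 4-bit history then identifies the next outcome uniquely. With 92% accuracy and one branch per 5 instructions, the helper gives about 62.5 instructions, in line with the ~60 quoted above.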


















## Memory Order Buffer (MOB)

- Goal: allow out-of-order execution among memory operations
- Problem: memory dependencies cannot be fully resolved statically (memory disambiguation)
  - store r1,a; load r2,b ⇒ can advance load before store
  - store r1,[r3]; load r2,b ⇒ load should wait till r3 is known
- Structure similar in concept to ROB
- Every memory uop is allocated an entry in order.
- Address & data (for stores) are updated when known
- Loads may pass loads/stores
- Stores are in order
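The disambiguation rule in the two store/load examples above can be sketched as a small check. This is a simplified illustration under stated assumptions: the `Store` record and the function `load_may_pass` are hypothetical names, addresses are compared as plain values, and store-to-load forwarding is not modelled (a load that matches an older store simply waits).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Store:
    address: Optional[str]  # None while the address register is not yet known


def load_may_pass(load_address, older_stores):
    """Return True if the load can be advanced ahead of all older stores."""
    for store in older_stores:
        if store.address is None:
            return False  # e.g. store r1,[r3] with r3 unknown: must wait
        if store.address == load_address:
            return False  # same location: the load needs the store's data
    return True
```

For example, `load_may_pass("b", [Store("a")])` allows the load to advance, while `load_may_pass("b", [Store(None)])` makes it wait until the store address is known.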











## Flow of Uops through OOO Cluster

- ISSUE:
  - ALLOC unit allocates one entry per uop in the RS and in the ROB (for up to 3 uops per cycle)
    - If source data is available from the ROB (either from the RRF or from the Result Buffer (RB)), it is written in the RS entry
    - Otherwise, it is marked invalid in the RS (and should be captured from the WB bus)
- READY/SCHEDULE:
  - Data-ready uops are checked to see if the desired functional unit is available
  - Up to 5 resource-ready uops are selected and dispatched per clock
- DISPATCH:
  - Ship scheduled uops to the appropriate functional unit (RS)
- WRITEBACK:
  - Capture results returned by the functional units in a result buffer (ROB)
  - Snoop result writeback ports for results that are sources to uops in RS
  - Update data-ready status of these uops (RS)
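The READY/SCHEDULE and WRITEBACK steps above might be modelled roughly as follows. This is only a sketch: `RsEntry`, `writeback`, `schedule`, and the oldest-first selection order are assumptions made for the example, and the slides do not specify the real selection policy. The dispatch width of 5 is taken from the text.

```python
from dataclasses import dataclass, field

DISPATCH_WIDTH = 5  # up to 5 uops dispatched per clock, as stated above


@dataclass(eq=False)
class RsEntry:
    uop: str
    unit: str                                  # functional unit the uop needs
    waiting: set = field(default_factory=set)  # source tags not yet captured


def writeback(rs, tag):
    """Snoop a writeback: every entry waiting on `tag` captures the result."""
    for entry in rs:
        entry.waiting.discard(tag)


def schedule(rs, free_units):
    """Pick data-ready uops whose unit is free (oldest first, an assumption)."""
    free = list(free_units)
    picked = []
    for entry in rs:
        if len(picked) == DISPATCH_WIDTH:
            break
        if not entry.waiting and entry.unit in free:
            free.remove(entry.unit)
            picked.append(entry)
    # Dispatched uops leave the reservation station.
    rs[:] = [e for e in rs if e not in picked]
    return [e.uop for e in picked]
```

A uop whose source is still outstanding stays in the RS until `writeback` broadcasts the matching tag; only then does `schedule` consider it.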



#### RETIREMENT:

- 3 consecutive entries are read out of the ROB
  - these entries are candidates for retirement
- Algorithm to determine fitness for retirement: a candidate is retired if
  - its ready bit is set
  - it will not cause an exception
  - all preceding candidates are eligible for retirement
- Commit results from the result buffer to architecturally visible state in original "Issue" order
- Clear the machine and restart execution if "badness" occurs (ROB)
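The retirement rule above can be sketched as one retirement cycle over a queue. The names `RobEntry` and `retire_cycle` are hypothetical, and treating "badness" as a full flush of the remaining entries is a simplification chosen for the example; restart of execution is not modelled.

```python
from collections import deque
from dataclasses import dataclass

RETIRE_WIDTH = 3  # 3 consecutive ROB entries are candidates each cycle


@dataclass
class RobEntry:
    name: str
    ready: bool = False
    faults: bool = False  # True if retiring this uop would raise an exception


def retire_cycle(rob):
    """Retire in order from the ROB head; return (retired names, flushed)."""
    retired = []
    for _ in range(RETIRE_WIDTH):
        if not rob or not rob[0].ready:
            break  # a candidate that is not ready blocks everything behind it
        if rob[0].faults:
            rob.clear()  # "badness": clear the machine
            return retired, True
        retired.append(rob.popleft().name)
    return retired, False
```

Because candidates are only ever taken from the head, a ready uop behind an unfinished one stays in the ROB, which is what keeps the architecturally visible state in original order.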






## Code Example (rename & Sched)

### Let's follow this code:

| PC  | Instruction | After Renaming   | Execution |
|-----|-------------|------------------|-----------|
| n   | mov r4,r1   | r4, r1_1         | DEW       |
| n+1 | add r1,r2   | r1_1, r2, r2_1   | DEW       |
| n+2 | mov M2,r1   | M2, r1_2         | DE W      |
| n+3 | add r1,r3   | r1_2, r3, r3_1   | DEW       |
| n+4 | jmp L2      |                  | DEW       |
| n+5 | add r3,r4   | r3_1, r4, r4_1   | DEW       |
| n+6 | mov M3,r1   | M3, r1_3         | DEW       |
| n+7 | add r1,r4   | r1_3, r4_1, r4_2 | DEW       |
| n+8 | dec r5      | r5, r5_1         | DE W      |
|     |             | cycle:           | 0123456   |

Every cycle, 4 instructions are decoded
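The "After Renaming" column can be reproduced with a small renaming sketch. It assumes the operand order used in the table (source first, destination last) and that every write to a register creates a new version `rN_k`; the function `rename` and the tuple encoding of the program are hypothetical and only serve the example. Timing (the Execution column) is not modelled.

```python
def rename(program):
    """Rename registers; each write to a register creates a new version."""
    version = {}  # architectural register -> latest version number

    def current(reg):
        v = version.get(reg, 0)
        return reg if v == 0 else f"{reg}_{v}"

    def fresh(reg):
        version[reg] = version.get(reg, 0) + 1
        return f"{reg}_{version[reg]}"

    renamed = []
    for op, operands in program:
        if op == "jmp":
            renamed.append("")  # no register operands to rename
        elif op == "mov":
            src, dst = operands
            # Memory operands (M2, M3) are not renamed.
            s = src if src.startswith("M") else current(src)
            renamed.append(f"{s}, {fresh(dst)}")
        elif op == "add":
            src, dst = operands
            s, d = current(src), current(dst)  # read sources before the write
            renamed.append(f"{s}, {d}, {fresh(dst)}")
        elif op == "dec":
            (dst,) = operands
            d = current(dst)
            renamed.append(f"{d}, {fresh(dst)}")
    return renamed


PROGRAM = [
    ("mov", ("r4", "r1")), ("add", ("r1", "r2")), ("mov", ("M2", "r1")),
    ("add", ("r1", "r3")), ("jmp", ("L2",)), ("add", ("r3", "r4")),
    ("mov", ("M3", "r1")), ("add", ("r1", "r4")), ("dec", ("r5",)),
]
```

Calling `rename(PROGRAM)` yields the same strings as the table. Note how the three writes to r1 become r1_1, r1_2, and r1_3, which removes the false dependencies between the instruction groups that reuse r1.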
























