



![](_page_1_Figure_0.jpeg)

| Itanium Caches |                                                                                                                                                                                             |  |  |  |
|----------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|--|--|
| • L1           | Data Cache (128 bits in one read)                                                                                                                                                           |  |  |  |
| _              | 16Kbytes, 4-way set associative, write through, no write allocate with 32-<br>byte lines. "No Write Allocate" means that on a write miss, the missed<br>block is NOT loaded into the cache. |  |  |  |
| -              | Can sustain 2 loads, 2 stores, or 1 load and 1 store per clock (dual ported cache!!! Needed for superscalar operation!).                                                                    |  |  |  |
| -              | Integer mem op hits in L1 have a 2 cycle latency to most operations.<br>Floating point mem ops bypass this cache.                                                                           |  |  |  |
| • L1           | Instruction Cache                                                                                                                                                                           |  |  |  |
| _              | 16 Kbytes, 4-way set associative with 32-byte lines                                                                                                                                         |  |  |  |
| _              | 1 Read = 128 bits = 3 instruction 'bundle' + 2 parity bits (1 inst = 41 bits)                                                                                                               |  |  |  |
| • L2           | Unified Cache                                                                                                                                                                               |  |  |  |
| _              | 96Kbyte, 6 way set associative, write back, write allocate (on write miss, load missed block into cache), 64 byte line.                                                                     |  |  |  |
| _              | Two ports, can sustain two memory ops/clock or one line fill.                                                                                                                               |  |  |  |
| -              | Integer mem op hits have 6 cycle latency, Floating point mem op hits have 9 cycle latency.                                                                                                  |  |  |  |
| 3/27/01        | BR 4                                                                                                                                                                                        |  |  |  |

![](_page_2_Figure_0.jpeg)

| A-64 Virtual Memory                                                                                                                                                                                                                                                                                                                                                         |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Virtual Region Number) use to select 1 of 8 Region<br>that contains a Region ID (RID).                                                                                                                                                                                                                                                                                      |
| f Region Registers managed by OS. Typically, one<br>is constant and is assigned to the OS, while the<br>ned by the current task.                                                                                                                                                                                                                                            |
| ister is 24 bits.                                                                                                                                                                                                                                                                                                                                                           |
| address spaces (16 Million), each space 2 <sup>61</sup> bytes abytes).                                                                                                                                                                                                                                                                                                      |
| al Page Number (VPN) offset                                                                                                                                                                                                                                                                                                                                                 |
| 61- offset bits                                                                                                                                                                                                                                                                                                                                                             |
| Virtual address is RID + VPN<br>RID<br>24<br>BR $6$                                                                                                                                                                                                                                                                                                                         |
| that contains a Region ID (RID).<br>f Region Registers managed by OS. Typically, one<br>is constant and is assigned to the OS, while the<br>ned by the current task.<br>ister is 24 bits.<br>address spaces (16 Million), each space $2^{61}$ bytes<br>abytes).<br>0<br>al Page Number (VPN) offset<br>61- offset bits<br>Virtual address is RID + VPN<br>RID<br>24<br>BR 6 |

![](_page_3_Figure_0.jpeg)

![](_page_3_Figure_1.jpeg)

![](_page_4_Figure_0.jpeg)

![](_page_4_Figure_1.jpeg)

![](_page_5_Figure_0.jpeg)

![](_page_5_Figure_1.jpeg)

![](_page_6_Figure_0.jpeg)

![](_page_6_Picture_1.jpeg)

![](_page_7_Figure_0.jpeg)

![](_page_7_Picture_1.jpeg)

## A Numerical Example

Assume a 64K byte page size (2<sup>16</sup>) with a VMA size of 48 bits without the region bits.

What is the address of the PTE for the address given that each PTE is 8 bytes using a linear mapping?

Let VMA = 0x 07FF1A002000

Offset = 0x2000 (lower 16 bits).

VPN = 0x07FFF1A00

VMA of PTE = VPN \* 8 =  $0x0000 07FFF1A00 * 2^3 =$ = 0x0000FFF8D000

BR

17

3/27/01

![](_page_8_Figure_8.jpeg)

![](_page_9_Figure_0.jpeg)

![](_page_9_Figure_1.jpeg)

|                                                                            | Technical by Each Instruct                                                                                 |                                                                  | Instruction Tur                                               |                                                                           |
|----------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------|---------------------------------------------------------------|---------------------------------------------------------------------------|
| Mnemonic                                                                   | Locality Hint                                                                                              | Load                                                             | Store                                                         | lfetch,<br>lfetch.fault                                                   |
| none                                                                       | Temporal, level 1                                                                                          | x                                                                | x                                                             | x                                                                         |
| nt1                                                                        | Non-temporal, level 1                                                                                      | x                                                                |                                                               | x                                                                         |
| nt2                                                                        | Non-temporal, level 2                                                                                      |                                                                  |                                                               | x                                                                         |
| nta                                                                        | Non-temporal, all levels                                                                                   | x                                                                | x                                                             | x                                                                         |
|                                                                            |                                                                                                            |                                                                  |                                                               |                                                                           |
| streamin<br>ween the                                                       | g and non-streamin<br>caches according                                                                     | ng data.<br>to the hint                                          | Data is<br>ts (see r                                          | allocated<br>ext page).                                                   |
| streamin<br>ween the<br>e IA-64 in<br>a post-in<br>the block<br>ded into a | g and non-streamin<br>caches according a<br>mplements implicit<br>acrement addressin<br>of the post-increm | ng data.<br>to the hint<br>t prefetchi<br>g mode (l<br>ented add | Data is<br>ts (see r<br>ing for 1<br>ld r3, [r4<br>lress is a | allocated<br>next page).<br>loads/stores t<br>4++]). The<br>automatically |

![](_page_10_Figure_1.jpeg)

![](_page_11_Figure_0.jpeg)

| • An                   | advanced load instruction $(ld.a)$ is used for a                                                                                                                                                | lata                  |
|------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------|
| spec                   | Culative loading<br>The register number used in the load and load address<br>in Advanced Load Address Table (ALAT)                                                                              | s is stored in        |
| • Befe<br>spec<br>be u | bore the result of the load can be used by a not<br>culative instruction, the check instruction (ch<br>used to see if load value is valid<br>Check must use same register number, simply checks | n-<br>nk.a) must      |
|                        | ALAT entry is still present. If NOT PRESENT, then<br>compiler-generated recovery code that re-executes the<br>lependent instructions                                                            | jump to<br>e load and |
| • An                   | ALAT table entry is removed if                                                                                                                                                                  |                       |
| - t                    | here is a STORE that overlaps an ALAT entry                                                                                                                                                     |                       |
| - A<br>S               | An ALAT entry has been replaced by another entry d ame register or cache replacement choice                                                                                                     | ue to the             |
| _ ]                    | The OS or other hardware conditions invalidate the end                                                                                                                                          | ntry                  |

![](_page_12_Figure_0.jpeg)

![](_page_12_Figure_1.jpeg)

![](_page_13_Figure_0.jpeg)

![](_page_13_Figure_1.jpeg)

## **Branch Prediction**

- IA-64 has many ways to reduce branch mis-prediction
- Branch prediction hints can be given by compiler
  - within the branch itself
  - via a separate instruction (branch hint instruction)
- For branches, hints added by compiler are:
  - Prediction method: Static Not-Taken, Static Taken, Dynamic-Not-Taken (use dynamic predictor, 1<sup>st</sup> guess is not-taken), Dynamic-Taken
  - Sequential Prefetch hint of FEW or MANY. If FEW, then only prefetch a few instruction cache lines. If MANY, then prefetch more.
  - Predictor de-allocation hint of NONE or CLEAR. If NONE, then remember branch history and branch prediction of this branch. If CLEAR, then free all branch prediction resources for this branch after execution (guess that you will not execute this branch again).

29

BR

3/27/01

<section-header><list-item><list-item><list-item><list-item><list-item><list-item><list-item><list-item>

## Special Branches

- In addition to the normal conditional branch instructions, there are special branch instructions that are useful for common structures and which can eliminate misprediction
- *Counted loop* Branch (cloop) uses a special register for counting executing of a loop. No mispredictions for this branch.
- While.Top, While.Exit loop branches for while loops with tests at top of loop and end of loop
  - Does not affect branch prediction
  - Works with hardware support for software pipelining of loops may discuss this later in the course.

3/27/01

BR

31