Bruce Jacob

University of Maryland

# How Not to Configure Your DRAM System

**Bruce Jacob** 

**Electrical & Computer Engineering, University of Maryland, College Park**

http://www.ece.umd.edu/~blj/

#### **OUTLINE:**

- DRAM Primer
- Yesterday's Results
- Today's Experiments & Results
- Conclusions




### Sources

"A Performance Study of Contemporary DRAM Architectures," *Proc. ISCA '99*. V. Cuppu, B. Jacob, B. Davis, and T. Mudge.

"DDR2 and Low Latency Variants," *Memory Wall Workshop*, in conjunction with ISCA '00. B. Davis, T. Mudge, V. Cuppu, and B. Jacob.

Recent experiments by Vinodh Cuppu, Ph.D. student at University of Maryland


### Goal

#### PRELIMINARY DRAM STUDY:

- Bus Transmission
- Row Access
- Column Access
- Data Transfer
- Bus Wait/Synch Time
- Stalls Due to Refresh
- The OVERLAP of These Components (with each other and with CPU execution)

MODEL EXISTING TECHNOLOGY


# **DRAM Primer**

#### **BUS TRANSMISSION**




# **DRAM Primer**

#### **ROW ACCESS**




# **DRAM Primer**

#### **COLUMN ACCESS**




# **DRAM Primer**

#### **DATA TRANSFER**



note: page mode enables overlap with COL


# **DRAM Primer**

#### **BUS TRANSMISSION**



note: overlapped component not shown


# **DRAM Primer**

#### **Read Timing for Conventional DRAM**




# **DRAM Primer**

#### **Read Timing for Fast Page Mode DRAM**




# **DRAM Primer**

#### Read Timing for Extended Data Out DRAM




# **DRAM Primer**

### **Read Timing for Synchronous DRAM**




# **DRAM Primer**

#### **Read Timing for Rambus DRAM**




# **DRAM Primer**

#### **Read Timing for Direct Rambus DRAM**




# **Simulator Overview**

CPU: SimpleScalar v3.0a

- 8-way out-of-order
- L1 cache: split 64K/64K, lockup free x32
- L2 cache: unified 1MB, lockup free x1
- L2 blocksize: 128 bytes

Main Memory: eight 64Mb DRAMs

- 100MHz/128-bit memory bus
- Optimistic open-page policy (close-immediately can be calculated)

Represents a "typical" workstation
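The bus figures above imply a peak transfer rate and a minimum occupancy per L2 block. A minimal sketch of that arithmetic, assuming one transfer of the full bus width per bus clock (the slide does not state the data rate); the function names are illustrative only:

```c
#include <assert.h>

/* Peak bus bandwidth in bytes per second.
 * Assumption: one bus-wide transfer per bus clock (single data rate). */
double peak_bytes_per_sec(double bus_hz, int bus_bits)
{
    assert(bus_bits > 0 && bus_bits % 8 == 0);
    return bus_hz * (bus_bits / 8);
}

/* Bus cycles needed to move one cache block, rounded up. */
int cycles_per_block(int block_bytes, int bus_bits)
{
    int bytes_per_cycle = bus_bits / 8;
    assert(block_bytes > 0 && bytes_per_cycle > 0);
    return (block_bytes + bytes_per_cycle - 1) / bytes_per_cycle;
}
```

Under that assumption, the 100MHz/128-bit bus peaks at 1.6 GB/s, and a 128-byte L2 block occupies the bus for 8 cycles (80ns) before any DRAM access time is counted.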


### **Conclusions**

**100MHz/128-bit Bus is Current Bottleneck**

- Solution: Fast bus(es) & memory controller (MC) on CPU (e.g. Alpha 21364, Sony Emotion, ...)

**Current DRAMs Solving Bandwidth Problem (but not Latency Problem)** 

- Solution: New cores with on-chip SRAM (e.g. ESDRAM, VCDRAM, ...)
- Solution: New cores with smaller banks (e.g. MoSys "SRAM", FCRAM, ...)


### **Recent Work**

Detailed Study of DDR2 Proposals in Concurrent Environment, Including Comparison with DRDRAM

Highly Concurrent System Organizations (Multiple Channels, Queueing Mechanisms, Priority Schemes, Optimal Burst Sizes)


# **DDR2 Study Results**




# **Concurrency Study: Timing**






# Read/Write Request Shapes

#### **READ REQUESTS:**



#### **WRITE REQUESTS:**




# **Pipelined/Split Transactions**

(a) Legal if R/R to different banks:



(b) Legal if turnaround ≤ 8.75ns and R/W to different banks (note: write can start up to 7.5ns later if turnaround = 1.25ns):



(c) Back-to-back R/W pair that cannot be nestled:
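The legality rules in (a) and (b) can be written as two predicates. This is a sketch, not the simulator's code: the function names are hypothetical, and the slack formula (8.75ns minus the turnaround) is an inference from the single data point quoted in (b).

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_TURNAROUND_NS 8.75  /* limit quoted in rule (b) */

/* Rule (a): two reads may be pipelined only if they target different banks. */
bool can_pipeline_reads(int bank_a, int bank_b)
{
    return bank_a != bank_b;
}

/* Rule (b): a read/write pair may be nestled only if the banks differ
 * and the bus turnaround fits within the limit. */
bool can_nestle_read_write(int read_bank, int write_bank, double turnaround_ns)
{
    assert(turnaround_ns >= 0.0);
    return read_bank != write_bank && turnaround_ns <= MAX_TURNAROUND_NS;
}

/* How much later the write may start and still be nestled.
 * Assumption: slack shrinks linearly with turnaround, which matches the
 * one quoted case (1.25ns turnaround gives 7.5ns of slack). */
double write_start_slack_ns(double turnaround_ns)
{
    return MAX_TURNAROUND_NS - turnaround_ns;
}
```

Case (c) is then any R/W pair for which `can_nestle_read_write` returns false.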




# **Channels & Banks**



- One independent channel: banking degrees of 1, 2, 4, ...
- Two independent channels: banking degrees of 1, 2, 4, ...
- Four independent channels: banking degrees of 1, 2, 4, ...

Parameters varied:

- 1, 2, 4 channels at 800 MHz
- 8, 16, 32, 64 data bits per channel
- 1, 2, 4, 8 independent banks per channel
- 32, 64, 128 bytes per burst
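The four parameter ranges above span 3 x 4 x 4 x 3 = 144 combinations. A sketch that enumerates them, assuming the full cross product is of interest (the slide does not say whether every combination was simulated); the struct and function names are illustrative:

```c
#include <assert.h>
#include <stddef.h>

/* One point in the design space listed above. */
struct dram_config {
    int channels;     /* independent 800 MHz channels */
    int data_bits;    /* data bits per channel */
    int banks;        /* independent banks per channel */
    int burst_bytes;  /* bytes per burst */
};

/* Writes the full cross product into out[] and returns the count.
 * Assumption: every combination is a candidate configuration. */
size_t enumerate_configs(struct dram_config *out, size_t cap)
{
    static const int channels[] = {1, 2, 4};
    static const int data_bits[] = {8, 16, 32, 64};
    static const int banks[] = {1, 2, 4, 8};
    static const int bursts[] = {32, 64, 128};
    size_t n = 0;

    for (size_t c = 0; c < sizeof channels / sizeof channels[0]; c++)
        for (size_t w = 0; w < sizeof data_bits / sizeof data_bits[0]; w++)
            for (size_t b = 0; b < sizeof banks / sizeof banks[0]; b++)
                for (size_t s = 0; s < sizeof bursts / sizeof bursts[0]; s++) {
                    assert(n < cap);  /* caller must supply enough room */
                    out[n++] = (struct dram_config){
                        channels[c], data_bits[w], banks[b], bursts[s]};
                }
    return n;
}

/* Data-bus transfers needed to move one burst on one channel. */
int beats_per_burst(const struct dram_config *cfg)
{
    return cfg->burst_bytes * 8 / cfg->data_bits;
}
```

The narrowest channel with the smallest burst (8 bits, 32 bytes) needs 32 transfers per burst; the widest channel with the largest burst (64 bits, 128 bytes) needs 16.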


# **Burst Scheduling**

(Back-to-Back Read Requests)



- Critical-burst-first
- Non-critical bursts are promoted
- Writes have lowest priority (tend to back up in request queue ...)
- Tension between large & small bursts: amortization vs. faster time to data
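One possible reading of the priority rules above is a static three-level ordering: critical read bursts first, then non-critical read bursts, then writes, with ties broken by arrival time. The sketch below encodes that reading only; the slide does not define "promoted", so the ordering, the type names, and the use of a full sort are all assumptions.

```c
#include <assert.h>
#include <stdlib.h>

/* Lower value = higher priority (assumed ordering, see lead-in). */
enum burst_kind { CRITICAL_READ = 0, NONCRITICAL_READ = 1, WRITE = 2 };

struct burst {
    enum burst_kind kind;
    unsigned long arrival;  /* arrival time, used only to break ties */
};

/* qsort comparator: by priority class, then by arrival time. */
int burst_cmp(const void *pa, const void *pb)
{
    const struct burst *a = pa;
    const struct burst *b = pb;

    if (a->kind != b->kind)
        return (a->kind < b->kind) ? -1 : 1;
    if (a->arrival != b->arrival)
        return (a->arrival < b->arrival) ? -1 : 1;
    return 0;
}

/* Reorders the request queue so the next burst to issue is q[0]. */
void schedule(struct burst *q, size_t n)
{
    assert(q != NULL || n == 0);
    if (n > 1)
        qsort(q, n, sizeof *q, burst_cmp);
}
```

Because writes always sort last under this ordering, a steady stream of reads keeps them waiting, which is consistent with the note that writes tend to back up in the request queue.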


# **The Bottom Line**




# It's Not Queue Size ...



Black = infinite request queue, Red = 32-entry request queue


# ... It's Also Not Turnaround ...



Benchmark = BZIP (SPEC 2000), 32-byte burst, 16-bit bus


# ... It's Related to Concurrency



Benchmark = BZIP (SPEC 2000), 32-byte burst, 16-bit bus


# **New Bar-Chart Definition**

- t<sub>PROC</sub>: CPU with 1-cycle L2 miss
- t<sub>REAL</sub>: realistic CPU/DRAM config
- t<sub>SYS</sub>: CPU with 1-cycle DRAM latency
- t<sub>DRAM</sub>: time seen by DRAM system




# Bandwidth vs. Burst Width



System Bandwidth (GB/s = Channels \* Width \* 800MHz)
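The axis formula can be evaluated directly. A minimal sketch, assuming Width is the number of data bits per channel and 800MHz means 800 million transfers per second per data pin; the function name is illustrative:

```c
#include <assert.h>

/* System bandwidth in GB/s = channels * width * 800 MHz.
 * Assumptions: width_bits is data bits per channel, each data pin
 * carries 800 million transfers per second, and 1 GB = 1e9 bytes. */
double system_bandwidth_gbs(int channels, int width_bits)
{
    assert(channels > 0 && width_bits > 0 && width_bits % 8 == 0);
    return channels * (width_bits / 8) * 800e6 / 1e9;
}
```

With the channel counts and widths used in this study, the range runs from 0.8 GB/s (one 8-bit channel) to 25.6 GB/s (four 64-bit channels); one 16-bit channel gives 1.6 GB/s.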


## Bandwidth vs. Burst Width





System Bandwidth (GB/s = Channels \* Width \* 800MHz)

Benchmark = GCC (SPEC 2000), 2 banks/channel






Benchmark = MCF (SPEC 2000)






Benchmark = BZIP (SPEC 2000)


### **Conclusions**

**CAREFUL TUNING YIELDS 30–40% GAIN** 

MORE CONCURRENCY == BETTER

- Via Channels → NOT w/ LARGE BURSTS
- Via Banks → ALWAYS SAFE
- Via Bursts → DOESN'T PAY OFF
- Via MSHRs → NECESSARY

**WIDER == BETTER (Thank you, Pontiac)** 

Gang Multiple RAMBUS Channels

#### **BURSTS AMORTIZE COST OF PRECHARGE**

- Typical Systems: 32 bytes (even DDR2)
  - → THIS IS NOT ENOUGH
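The amortization argument can be made concrete with a simple ratio: the fraction of time the data bus does useful work is the transfer time divided by transfer time plus the fixed per-access overhead. This is a sketch under stated assumptions: the 800 MHz data rate is taken from the bandwidth slides, and the overhead value passed in is hypothetical, not a measured precharge time.

```c
#include <assert.h>

/* Fraction of bus time spent moving data for one burst.
 * Assumptions: 800 million transfers per second per data pin, and a
 * fixed per-burst overhead (precharge etc.) that does not overlap
 * with the data transfer. */
double bus_efficiency(int burst_bytes, int width_bits, double overhead_ns)
{
    assert(burst_bytes > 0 && width_bits >= 8 && overhead_ns >= 0.0);
    double bytes_per_ns = (width_bits / 8) * 0.8;
    double transfer_ns = burst_bytes / bytes_per_ns;
    return transfer_ns / (transfer_ns + overhead_ns);
}
```

For a 16-bit channel and a hypothetical 40ns overhead, a 32-byte burst transfers for 20ns and reaches about 33% efficiency, while a 128-byte burst transfers for 80ns and reaches about 67%.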


## **CONTACT INFO:**

**Prof. Bruce Jacob** 

Electrical & Computer Engineering, University of Maryland, College Park

http://www.ece.umd.edu/~blj/

blj@eng.umd.edu




## Dilemma: THIS ...

# STATUS QUO in MEMORY-SYSTEM RESEARCH:

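```
/* Blocking model: every L2 miss advances the global cycle count
   by one fixed latency. */
if (INSTR.loadstore) {
    if (L1_cache_miss(INSTR.daddr)) {
        if (L2_cache_miss(INSTR.daddr)) {
            cycles += DRAM_LATENCY;  /* constant: no bank, bus, or refresh effects */
        }
    }
}
```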


## ... or THIS ...

# STATUS QUO in MEMORY-SYSTEM RESEARCH:

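```
/* Non-blocking model: the miss delays only this instruction,
   but still by one fixed latency. */
if (INSTR.loadstore) {
    if (L1_cache_miss(INSTR.daddr)) {
        if (L2_cache_miss(INSTR.daddr)) {
            INSTR.ready = now() + DRAM_LATENCY;  /* misses never contend with each other */
        }
    }
}
```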
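To see how the two status-quo models differ, both can be run over the same short trace. This is a compilable sketch, not code from any real simulator: the `struct instr` layout, the precomputed miss flags, the one-instruction-per-cycle issue rate, and the 100-cycle `DRAM_LATENCY` are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define DRAM_LATENCY 100  /* hypothetical fixed latency, in CPU cycles */

/* Trace record; miss flags are assumed to be precomputed. */
struct instr {
    bool loadstore;
    bool l1_miss;
    bool l2_miss;
    unsigned long ready;  /* cycle at which the result is available */
};

static bool goes_to_dram(const struct instr *in)
{
    return in->loadstore && in->l1_miss && in->l2_miss;
}

/* First model: each L2 miss adds the full latency to the cycle count. */
unsigned long blocking_model(const struct instr *trace, int n)
{
    unsigned long cycles = 0;
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        cycles += 1;  /* assumed: one instruction issues per cycle */
        if (goes_to_dram(&trace[i]))
            cycles += DRAM_LATENCY;
    }
    return cycles;
}

/* Second model: each L2 miss only delays its own instruction.
 * Returns the cycle at which the last result becomes ready. */
unsigned long nonblocking_model(struct instr *trace, int n)
{
    unsigned long now = 0, finish = 0;
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        now += 1;
        trace[i].ready = goes_to_dram(&trace[i]) ? now + DRAM_LATENCY : now;
        if (trace[i].ready > finish)
            finish = trace[i].ready;
    }
    return finish;
}
```

For two back-to-back misses, the first model reports 202 cycles and the second 102. Neither number depends on which banks the misses hit, whether the bus is busy, or whether a refresh is pending.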


## ... or THIS

#### Fast Page Mode Read Cycle




## **Motivation**

#### **HERE'S WHAT YOU MISS:**



#### **DRAM LATENCY:**




## **Definitions** (variation on Burger et al.)

- t<sub>PROC</sub>: processor with perfect memory
- t<sub>REAL</sub>: realistic configuration
- t<sub>BW</sub>: CPU with wide memory paths
- t<sub>DRAM</sub>: time seen by DRAM system




# **DRAM Configurations**





**Note: TRANSFER WIDTH of Direct Rambus Channel** 

- equals that of ganged FPM, EDO, etc.
- is 2x that of Rambus & SLDRAM


# **DRAM Configurations**

Strawman: Rambus, etc.




# Overhead: Memory vs. CPU



Variable: speed of processor & caches






note: SLDRAM & RDRAM 2x data transfers


### **Cost-Performance**

#### FPM, EDO, SDRAM, ESDRAM:

- Lower Latency => Wide/Fast Bus
- Increase Capacity => Decrease Latency
- Low System Cost

#### Rambus, Direct Rambus, SLDRAM:

- Lower Latency => Multiple Channels
- Increase Capacity => Increase Capacity
- High System Cost

However, 1 DRDRAM = Multiple SDRAM