
Page 1

Lecture 13

Main Memory

Computer Architecture

COE 501

Page 2

Who Cares About the Memory Hierarchy?

Processor-DRAM Memory Gap (latency)

[Figure: performance vs. time, 1980-2000. Processor performance ("Moore's Law") improves about 60%/year (2x every 1.5 years), while DRAM performance improves about 9%/year (2x every 10 years); the processor-memory performance gap grows about 50% per year.]

Page 3

Impact on Performance

• Suppose a processor executes at
  – Clock Rate = 200 MHz (5 ns per cycle)
  – CPI = 1.1
  – 50% arith/logic, 30% load/store, 20% control
• Suppose that 10% of memory operations incur a 50-cycle miss penalty
• CPI = ideal CPI + average stalls per instruction
      = 1.1 (cycles) + (0.30 (data mops/inst) x 0.10 (misses/data mop) x 50 (cycles/miss))
      = 1.1 cycles + 1.5 cycles
      = 2.6
• 58% of the time the processor is stalled waiting for memory!
• A 1% instruction miss rate would add an additional 0.5 cycles to the CPI! (This arithmetic is sketched in code below.)

[Pie chart of the CPI breakdown with both miss types: Ideal CPI 1.1 (35%), data-miss stalls 1.5 (49%), instruction-miss stalls 0.5 (16%).]
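A minimal sketch in C of the stall arithmetic above, using the parameters from this slide (0.30 data memory operations per instruction, 10% data miss rate, 1% instruction miss rate, 50-cycle penalty); the variable names are illustrative.

    #include <stdio.h>

    int main(void) {
        double ideal_cpi    = 1.1;   /* CPI with a perfect memory system        */
        double data_refs    = 0.30;  /* data memory operations per instruction  */
        double data_miss    = 0.10;  /* fraction of data references that miss   */
        double inst_miss    = 0.01;  /* fraction of instructions that miss      */
        double miss_penalty = 50.0;  /* cycles per miss                         */

        double data_stalls = data_refs * data_miss * miss_penalty;    /* 1.5 */
        double inst_stalls = inst_miss * miss_penalty;                 /* 0.5 */

        double cpi_data_only = ideal_cpi + data_stalls;                /* 2.6 */
        double cpi_total     = ideal_cpi + data_stalls + inst_stalls;  /* 3.1 */

        printf("CPI (data misses only): %.1f, stalled %.0f%% of the time\n",
               cpi_data_only, 100.0 * data_stalls / cpi_data_only);
        printf("CPI (data + instruction misses): %.1f\n", cpi_total);
        return 0;
    }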

Page 4

Main Memory Background

• Performance of Main Memory:
  – Latency: cache miss penalty
    » Access Time: time between a request and when the desired word arrives
    » Cycle Time: minimum time between requests
  – Bandwidth: I/O & large block miss penalty (L2)
• Main Memory is DRAM: Dynamic Random Access Memory
  – Dynamic since it needs to be refreshed periodically (8 ms)
  – Addresses divided into 2 halves:
    » RAS or Row Access Strobe
    » CAS or Column Access Strobe
• Cache uses SRAM: Static Random Access Memory
  – No refresh (6 transistors/bit vs. 1 transistor/bit for DRAM)
  – Address not divided
• Size: DRAM/SRAM is 4-8x; Cost & cycle time: SRAM/DRAM is 8-16x

Page 5

Typical SRAM Organization: 16-word x 4-bit

[Figure: a 16 x 4 array of SRAM cells. A 4-bit address (A0-A3) drives an address decoder that selects one of 16 word lines (Word 0 ... Word 15). Each column has a write driver & precharger (inputs Din 0-3, controlled by WrEn and Precharge) and a sense amplifier producing outputs Dout 0-3.]

Q: Which is longer: the word line or the bit line?

Page 6

Logic Diagram of a Typical SRAM

[Figure: a 2^N-word x M-bit SRAM with an N-bit address input A, an M-bit bidirectional data bus D, and active-low controls WE_L and OE_L.]

• Write Enable is usually active low (WE_L)
• Bidirectional data
  – A new control signal, Output Enable (OE_L), is needed
  – WE_L asserted (Low), OE_L deasserted (High):
    » D serves as the data input pin
  – WE_L deasserted (High), OE_L asserted (Low):
    » D serves as the data output pin
  – Both WE_L and OE_L asserted:
    » Result is unknown. Don't do that!

Page 7

Classical DRAM Organization (square)

[Figure: a square RAM cell array. The row address feeds a row decoder that drives the word (row) select lines; the column address feeds the column selector & I/O circuits, which connect the bit (data) lines to the data pins. Each intersection of a word line and a bit line is a 1-T DRAM cell.]

• Row and column address together:
  – Select 1 bit at a time

Page 8

Logic Diagram of a Typical DRAM

[Figure: a 256K x 8 DRAM with a 9-bit multiplexed address bus A, an 8-bit bidirectional data bus D, and active-low controls RAS_L, CAS_L, WE_L, OE_L.]

• Control signals (RAS_L, CAS_L, WE_L, OE_L) are all active low
• Din and Dout are combined (D):
  – WE_L asserted (Low), OE_L deasserted (High):
    » D serves as the data input pin
  – WE_L deasserted (High), OE_L asserted (Low):
    » D serves as the data output pin
• Row and column addresses share the same pins (A)
  – RAS_L goes low: pins A are latched in as the row address
  – CAS_L goes low: pins A are latched in as the column address
  – RAS/CAS edge-sensitive (address splitting is sketched below)
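A minimal sketch of how a memory controller might split a linear address into the multiplexed row/column halves described above. The 256K x 8 geometry (9 row bits + 9 column bits on the same pins) matches the part on this slide, but the field layout and names (ROW_BITS, COL_BITS, dram_split_addr) are illustrative assumptions, not part of the lecture.

    #include <stdint.h>
    #include <stdio.h>

    /* Assumed geometry for a 256K x 8 DRAM: 2^18 locations,
       multiplexed as 9 row bits + 9 column bits on the same pins. */
    #define ROW_BITS 9
    #define COL_BITS 9

    /* Split a linear address into the half driven while RAS_L falls (row)
       and the half driven while CAS_L falls (column). */
    static void dram_split_addr(uint32_t addr, uint32_t *row, uint32_t *col)
    {
        *col = addr & ((1u << COL_BITS) - 1);                /* low bits -> column */
        *row = (addr >> COL_BITS) & ((1u << ROW_BITS) - 1);  /* high bits -> row   */
    }

    int main(void)
    {
        uint32_t row, col;
        dram_split_addr(0x2A5F3, &row, &col);
        printf("row = %u, col = %u\n", row, col);
        return 0;
    }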

Page 9

Main Memory Performance

• Simple:
  – CPU, Cache, Bus, and Memory are all the same width (32 bits)
• Interleaved:
  – CPU, Cache, Bus are 1 word wide; Memory is N modules (4 modules); the example is word interleaved
• Wide:
  – CPU/Mux is 1 word; Mux/Cache, Bus, Memory are N words (Alpha: 64 bits & 256 bits)

Page 10

Main Memory Performance

• DRAM (read/write) cycle time >> DRAM (read/write) access time
  – roughly 2:1; why?
• DRAM (read/write) cycle time:
  – How frequently can you initiate an access?
  – Analogy: a little kid can only ask his father for money on Saturday
• DRAM (read/write) access time:
  – How quickly will you get what you want once you initiate an access?
  – Analogy: as soon as he asks, his father will give him the money
• DRAM bandwidth limitation analogy:
  – What happens if he runs out of money on Wednesday?

[Timing diagram: access time is the delay from the start of an access until the data is available; cycle time is the minimum spacing between the starts of successive accesses.]

Page 11

Increasing Bandwidth - Interleaving

[Timing diagrams:]

Access pattern without interleaving (CPU to a single memory):
  Start access for D1 ... D1 available ... only then start access for D2; each access must wait for the previous one to complete.

Access pattern with 4-way interleaving (CPU to Memory Banks 0-3):
  Access Bank 0, then Bank 1, Bank 2, and Bank 3 on successive cycles; by the time Bank 3 has been started, we can access Bank 0 again.

Page 12

Main Memory Performance (Figure 5.32, pg. 432)

• How long does it take to send 4 words?
• Timing model
  – 1 cycle to send the address
  – 6 cycles to access data + 1 cycle to send data
  – Cache block is 4 words
• Simple memory      = 4 x (1 + 6 + 1) = 32 cycles
• Wide memory        = 1 + 6 + 1      = 8 cycles
• Interleaved memory = 1 + 6 + 4 x 1  = 11 cycles (this arithmetic is sketched in code below)

Word-interleaved address mapping (4 banks):
  Bank 0: 0, 4, 8, 12
  Bank 1: 1, 5, 9, 13
  Bank 2: 2, 6, 10, 14
  Bank 3: 3, 7, 11, 15
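A minimal sketch, in C, of the miss-penalty arithmetic above. The cycle counts (1 to send the address, 6 to access, 1 per word transferred, 4-word block) come from the slide; the function names are illustrative.

    #include <stdio.h>

    /* Timing parameters from the slide. */
    enum { ADDR_CYCLES = 1, ACCESS_CYCLES = 6, XFER_CYCLES = 1, BLOCK_WORDS = 4 };

    /* Simple memory: every word repeats address + access + transfer. */
    static int simple_penalty(void)
    {
        return BLOCK_WORDS * (ADDR_CYCLES + ACCESS_CYCLES + XFER_CYCLES);
    }

    /* Wide memory: the whole block moves in one address/access/transfer. */
    static int wide_penalty(void)
    {
        return ADDR_CYCLES + ACCESS_CYCLES + XFER_CYCLES;
    }

    /* Interleaved memory: bank accesses overlap, but the words still come
       back over the one-word bus one per cycle. */
    static int interleaved_penalty(void)
    {
        return ADDR_CYCLES + ACCESS_CYCLES + BLOCK_WORDS * XFER_CYCLES;
    }

    int main(void)
    {
        printf("simple = %d, wide = %d, interleaved = %d\n",
               simple_penalty(), wide_penalty(), interleaved_penalty()); /* 32, 8, 11 */
        return 0;
    }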

Page 13

Wider vs. Interleaved Memory

• Wider memory
  – Cost of the wider connection
  – Need a multiplexor between cache and CPU
  – May lead to significantly larger/more expensive memory
• Interleaved memory
  – Send the address to several banks simultaneously
  – Want: number of banks > number of clocks to access a bank
  – As the size of memory chips increases, it becomes difficult to interleave memory
  – Difficult to expand memory by small amounts
  – May not work well for non-sequential accesses

Page 14

Avoiding Bank Conflicts

• Lots of banks

    int x[256][512];
    for (j = 0; j < 512; j = j + 1)
        for (i = 0; i < 256; i = i + 1)
            x[i][j] = 2 * x[i][j];

• Even with 128 banks there are conflicts, since 512 is an even multiple of 128
• SW: loop interchange or declaring the array size not a power of 2
• HW: prime number of banks
  – bank number = address mod number of banks
  – address within bank = address / number of banks
  – a modulo & divide per memory access?
  – address within bank = address mod number of words in bank (3, 7, 31) (sketched below)
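A minimal sketch of the hardware trick in the last bullet: with a prime number of banks (7, as in the slide's examples) and a power-of-two number of words per bank, taking the address modulo each value, rather than dividing, still gives every word a unique (bank, index) slot, by the Chinese Remainder Theorem. The words-per-bank count (32) and names are illustrative assumptions.

    #include <stdio.h>
    #include <string.h>

    #define NUM_BANKS      7    /* prime number of banks (slide examples: 3, 7, 31) */
    #define WORDS_PER_BANK 32   /* assumed power-of-2 words per bank */

    int main(void)
    {
        int used[NUM_BANKS][WORDS_PER_BANK];
        memset(used, 0, sizeof used);

        for (int addr = 0; addr < NUM_BANKS * WORDS_PER_BANK; addr++) {
            int bank  = addr % NUM_BANKS;               /* bank number = address mod banks     */
            int index = addr & (WORDS_PER_BANK - 1);    /* address within bank: cheap low bits */
            used[bank][index]++;                        /* no divide needed per access         */
        }

        /* Verify the mapping is one-to-one: every slot is hit exactly once. */
        int collisions = 0;
        for (int b = 0; b < NUM_BANKS; b++)
            for (int w = 0; w < WORDS_PER_BANK; w++)
                if (used[b][w] != 1)
                    collisions++;

        printf("slots not hit exactly once: %d\n", collisions);  /* prints 0 */
        return 0;
    }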

Page 15

Page Mode DRAM: Motivation

• Regular DRAM organization:
  – N rows x N columns x M bits
  – Read & write M bits at a time
  – Each M-bit access requires a RAS/CAS cycle
• Fast Page Mode DRAM
  – N x M "register" to save a row

[Timing and array diagrams: each M-bit access presents a row address while RAS_L falls and then a column address while CAS_L falls, so the 1st and 2nd M-bit accesses each pay a full RAS/CAS cycle. The array is N rows x N columns of DRAM cells; the row address selects a row and the column address selects the M-bit output.]

Page 16

Fast Page Mode Operation

• Fast Page Mode DRAM
  – N x M "SRAM" to save a row
• After a row is read into the register
  – Only CAS is needed to access other M-bit blocks in that row
  – RAS_L remains asserted while CAS_L is toggled (timing sketched below)

[Timing and array diagrams: the row address is presented once with RAS_L; then successive column addresses are presented as CAS_L toggles, giving the 1st, 2nd, 3rd, and 4th M-bit accesses out of the N x M "SRAM" row buffer without another row access.]
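A minimal sketch of why the row buffer helps: consecutive accesses that hit the currently open row pay only the column-access time. The cycle counts (T_RAS, T_CAS) and names below are illustrative assumptions, not figures from the lecture.

    #include <stdio.h>

    /* Illustrative timings (cycles), not from the slide. */
    enum { T_RAS = 4, T_CAS = 2 };

    static int open_row = -1;   /* row currently held in the page-mode row buffer */

    /* Cycles to read one M-bit block at (row, col) with fast page mode. */
    static int fpm_access(int row, int col)
    {
        (void)col;              /* the column only selects which bits, not the cost */
        if (row == open_row)
            return T_CAS;       /* row hit: only a CAS is needed                    */
        open_row = row;
        return T_RAS + T_CAS;   /* row miss: full RAS + CAS cycle                   */
    }

    int main(void)
    {
        /* Four sequential blocks in the same row: 1 row miss + 3 row hits. */
        int total = 0;
        for (int col = 0; col < 4; col++)
            total += fpm_access(10, col);
        printf("4 accesses in one row: %d cycles (vs %d without page mode)\n",
               total, 4 * (T_RAS + T_CAS));
        return 0;
    }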

Page 17

Fast Memory Systems: DRAM specific

• Multiple column accesses: page mode
  – The DRAM buffers a row of bits inside the chip for column access (e.g., a 16 Kbit row for a 64 Mbit DRAM)
  – Allows repeated column accesses without another row access
  – 64 Mbit DRAM: cycle time = 90 ns, optimized access = 25 ns
• New DRAMs: what will they cost, will they survive?
  – Synchronous DRAM: provide a clock signal to the DRAM; transfers are synchronous to the system clock
  – RAMBUS: startup company; reinvented the DRAM interface
    » Each chip is a module vs. a slice of memory
    » Short bus between CPU and chips
    » Does its own refresh
    » Variable amount of data returned
    » 1 byte / 2 ns (500 MB/s per chip)
• Niche memory or main memory?
  – e.g., Video RAM for frame buffers: DRAM + fast serial output

Page 18

Main Memory Summary

• Wider Memory

• Interleaved Memory: for sequential or independent accesses

• Avoiding bank conflicts: SW & HW

• DRAM-specific optimizations: page mode & specialty DRAM