the arm10 family of advanced microprocessor cores · the architecture for the digital world tm hot...

20
1 THE ARCHITECTURE FOR THE DIGITAL WORLD TM Hot Chips 13 The ARM10 Family of Advanced Microprocessor Cores Stephen Hill ARM Austin Design Center

Upload: others

Post on 20-May-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: The ARM10 Family of Advanced Microprocessor Cores · THE ARCHITECTURE FOR THE DIGITAL WORLD TM Hot Chips 13 3 ARM1020E Overview Max frequency: 400MHz 0.9V, worst case TSMC 0.13um

1THE ARCHITECTURE FOR THE DIGITAL WORLDTM Hot Chips 13

The ARM10 Family of Advanced

Microprocessor Cores

Stephen HillARM Austin Design Center

Page 2: The ARM10 Family of Advanced Microprocessor Cores · THE ARCHITECTURE FOR THE DIGITAL WORLD TM Hot Chips 13 3 ARM1020E Overview Max frequency: 400MHz 0.9V, worst case TSMC 0.13um

2THE ARCHITECTURE FOR THE DIGITAL WORLDTM Hot Chips 13

Agenda

Design overview

Microarchitecture� ARM10

o Memory System

o Interrupt response

3. Powero Dynamic power

o Power down modes

4. VFP10

ETM10

Summary

Page 3: The ARM10 Family of Advanced Microprocessor Cores · THE ARCHITECTURE FOR THE DIGITAL WORLD TM Hot Chips 13 3 ARM1020E Overview Max frequency: 400MHz 0.9V, worst case TSMC 0.13um

3THE ARCHITECTURE FOR THE DIGITAL WORLDTM Hot Chips 13

ARM1020E Overview

Max frequency: 400MHz0.9V, worst case

TSMC 0.13um LV

MIPS/MHz: 1.25500 MIPS @ 400MHz

Dhrystone 2.1

Active power consumption: 0.51mA/MIPSRoom Temp / Typical / 1.1V

Average when running Dhrystone 2.1

AreaARM1022E (2x16KB): 6.9mm2

ARM1020E (2x32KB): 10.3mm2

Page 4: The ARM10 Family of Advanced Microprocessor Cores · THE ARCHITECTURE FOR THE DIGITAL WORLD TM Hot Chips 13 3 ARM1020E Overview Max frequency: 400MHz 0.9V, worst case TSMC 0.13um

4THE ARCHITECTURE FOR THE DIGITAL WORLDTM Hot Chips 13

ARM10200 System Overview

ARM10E

BIU

I-Side AHB

D-Side AHB

BIU

VFP10

16 or 32k Data Cache 16 or 32k Instruction

Cache

D-SideMMU

WriteBuffer

I-SideMMU

Bridge SDRAM CTL

16 or 32k Data Cache

16 or 32k Data Cache

ARM1020E

ETM10

ARM10200

64-bit data

Page 5: The ARM10 Family of Advanced Microprocessor Cores · THE ARCHITECTURE FOR THE DIGITAL WORLD TM Hot Chips 13 3 ARM1020E Overview Max frequency: 400MHz 0.9V, worst case TSMC 0.13um

5THE ARCHITECTURE FOR THE DIGITAL WORLDTM Hot Chips 13

ARM10E Microarchitecture

64-bit instruction and data interfaces

Static branch prediction with branch folding

Parallel load/store pipelineDedicated machine for LDM/STM execution; all butthe first cycle of these instructions are hidden if nodependencies are encountered

Parallel execution of multi-cyclecoprocessor operations

Multiply 16 bits per cycle1-3 cycle throughput and 2-4 cycle latency

No data-dependent MUL cycle counts

Page 6: The ARM10 Family of Advanced Microprocessor Cores · THE ARCHITECTURE FOR THE DIGITAL WORLD TM Hot Chips 13 3 ARM1020E Overview Max frequency: 400MHz 0.9V, worst case TSMC 0.13um

6THE ARCHITECTURE FOR THE DIGITAL WORLDTM Hot Chips 13

ARM7 Pipeline versus ARM10

InstructionFetch

Reg Select

RegisterRead

Shift ALU RegWrite

FETCH DECODE EXECUTE

ARM7TDMIThumb→→→→ARMdecompress

ARM decode

ARM10

InstructionAddress

Generator

Data CacheInterface

RegWrite

FETCH DECODE EXECUTE MEMORY WRITE

ARM orThumb

InstructionDecode

CoprocessorInstruction

Issue

Register

Read

+

Result

Forward

+

ScoreboardMultiply

Add

CoprocessorData

Interface

BranchPredictor

InstructionFetch

ISSUE

Shift+ ALU

Multiply

Data + BranchAddress

Generator

Page 7: The ARM10 Family of Advanced Microprocessor Cores · THE ARCHITECTURE FOR THE DIGITAL WORLD TM Hot Chips 13 3 ARM1020E Overview Max frequency: 400MHz 0.9V, worst case TSMC 0.13um

7THE ARCHITECTURE FOR THE DIGITAL WORLDTM Hot Chips 13

ARM9 Pipeline versus ARM10

InstructionFetch

Shift + ALU MemoryAccess

RegWrite

RegRead

RegDecode

FETCH DECODE EXECUTE MEMORY WRITE

ARM9TDMIARM or Thumb

Inst Decode

ARM10

InstructionAddress

Generator

Shift+

ALU

Data CacheInterface

RegWrite

FETCH DECODE EXECUTE MEMORY WRITE

ARM orThumb

InstructionDecode

CoprocessorInstruction

Issue

Register

Read

+

Result

Forward

+

Scoreboard

MultiplyMultiply

Add

CoprocessorData

Interface

BranchPredictor

InstructionFetch

ISSUE

Data + Branch

Address

Generator

Page 8: The ARM10 Family of Advanced Microprocessor Cores · THE ARCHITECTURE FOR THE DIGITAL WORLD TM Hot Chips 13 3 ARM1020E Overview Max frequency: 400MHz 0.9V, worst case TSMC 0.13um

8THE ARCHITECTURE FOR THE DIGITAL WORLDTM Hot Chips 13

ARM1020E Memory System

Instruction & Data Caches

32Kbyte Instruction and Data caches

Virtually addressed, 64-way set-associative,32-byte lines, 64-bit R/W

Configurable for Write Through or Write Back operation

Lockable by line (1/64 of the cache)

MMUs

Two fully associative (I and D) 64-entry TLBs

Lockable by entry

Support for software loadable TLBs

Write Buffer

Eight 64-bit entries, plus 32-byte cache line castout buffer

AHB Bus Interface

64-bit wide data transfers, split transactions

Multi-layer AHB support (separate I and D-side system interfaces)

Page 9: The ARM10 Family of Advanced Microprocessor Cores · THE ARCHITECTURE FOR THE DIGITAL WORLD TM Hot Chips 13 3 ARM1020E Overview Max frequency: 400MHz 0.9V, worst case TSMC 0.13um

9THE ARCHITECTURE FOR THE DIGITAL WORLDTM Hot Chips 13

ARM1020E Memory System

Performance features:

Critical word first

Non-blocking data cache

Hit-under-miss (H-U-M)

Data cache streaming (forwarding) fromlinefills

Data cache store merging into linefills

Page 10: The ARM10 Family of Advanced Microprocessor Cores · THE ARCHITECTURE FOR THE DIGITAL WORLD TM Hot Chips 13 3 ARM1020E Overview Max frequency: 400MHz 0.9V, worst case TSMC 0.13um

10THE ARCHITECTURE FOR THE DIGITAL WORLDTM Hot Chips 13

ARM1020E Interrupts

Interrupts taken in Execute stage

Fast interrupt mode:

First load miss stops further memory ops but not other instructions

Limit write buffer depth

Recommended measures for fast interrupt response:

Lock handler code into Caches and TLB

Set data cache to write-though (no cast outs)

Limit LDM length to 9 registers (spans only 2 cache lines)

Page 11: The ARM10 Family of Advanced Microprocessor Cores · THE ARCHITECTURE FOR THE DIGITAL WORLD TM Hot Chips 13 3 ARM1020E Overview Max frequency: 400MHz 0.9V, worst case TSMC 0.13um

11THE ARCHITECTURE FOR THE DIGITAL WORLDTM Hot Chips 13

ARM1020E InterruptsWorst case Interrupt response time to enter handler (G:H Clock 1:1)

Worst case of outstanding memory operations (LDM just started)

3 table walks needed (unless TLB locked down)

Write buffer full, bus not granted by default

1663

948

9129

Writethrough D

cache

16148

16171

TLBlockeddown

Max Regsin LDM

FastInterrupt

mode

CYCLES

(approx)

Page 12: The ARM10 Family of Advanced Microprocessor Cores · THE ARCHITECTURE FOR THE DIGITAL WORLD TM Hot Chips 13 3 ARM1020E Overview Max frequency: 400MHz 0.9V, worst case TSMC 0.13um

12THE ARCHITECTURE FOR THE DIGITAL WORLDTM Hot Chips 13

ARM1020E Dynamic Power

ARM10E Core

32%

Clocks

6%

Buffers

1%

BIU

5%

Data MMU

5%

Instruction MMU

5%

Data Cache

24%

Instruction Cache

22%

ARM1020E Dhrystone 2.1

(PowerMill simulation)

Page 13: The ARM10 Family of Advanced Microprocessor Cores · THE ARCHITECTURE FOR THE DIGITAL WORLD TM Hot Chips 13 3 ARM1020E Overview Max frequency: 400MHz 0.9V, worst case TSMC 0.13um

13THE ARCHITECTURE FOR THE DIGITAL WORLDTM Hot Chips 13

ARM10E Dynamic Power

ETM Interface

3%Forwarding

3%

CP Interface

2%

Multiplier

2%

RegBank

5%

Execute

7%

Clock

8%

Other

12%

LSU

10%

Prefetch

15%

Decode and

Sequencer

33%

ARM10E CoreDhrystone 2.1

(PowerMill simulation)

Page 14: The ARM10 Family of Advanced Microprocessor Cores · THE ARCHITECTURE FOR THE DIGITAL WORLD TM Hot Chips 13 3 ARM1020E Overview Max frequency: 400MHz 0.9V, worst case TSMC 0.13um

14THE ARCHITECTURE FOR THE DIGITAL WORLDTM Hot Chips 13

Power Down

Active Power

Cycle Delay

to Resume

CPU executing

(Fast/Normal/Slow)

CPU clock stopped.Wake on interrupt ordebug event.

CPU state lost.Cache statepreserved.

CPU & Cache statelost. All core powerremoved.

>100 cycles

>101 cycles

>104 cycles

>102 cycles

Resume

RUN

STANDBY

DORMANT

SHUTDOWN

Page 15: The ARM10 Family of Advanced Microprocessor Cores · THE ARCHITECTURE FOR THE DIGITAL WORLD TM Hot Chips 13 3 ARM1020E Overview Max frequency: 400MHz 0.9V, worst case TSMC 0.13um

15THE ARCHITECTURE FOR THE DIGITAL WORLDTM Hot Chips 13

Power Down

Power

Management

Controller

PLL

Debug

Hardware

Interrupt

Controller

PM Layer Clo

cks

Clo

cks

ARM High Performance Bus

ROMRAM

Cache

PM Layer

CORE and SCC

Cache

PM Layer

CORE and SCC

DSP Core

PM Layer PM Layer

PM Layer

PM

Layer

PM Layer

Page 16: The ARM10 Family of Advanced Microprocessor Cores · THE ARCHITECTURE FOR THE DIGITAL WORLD TM Hot Chips 13 3 ARM1020E Overview Max frequency: 400MHz 0.9V, worst case TSMC 0.13um

16THE ARCHITECTURE FOR THE DIGITAL WORLDTM Hot Chips 13

VFP10

Full IEEE 754 compliant (with SW support)

Performance:236 MFLOPS Linpack (SAxPY) @ 400MHz

400M FIR Taps (800 Peak MFLOPS) @ 400MHz

Functions supported in hardwareMultiply, add, multiply-add, subtract, multiply-subtract,negate, negate multiply, negate multiply-add, negatemultiply-subtract, absolute value, compare, convert,divide and square root, conversions

Most IEEE 754 exceptions handled in

hardware

RunFast mode

No trapping enabled (Denormals flush to +0)

NaN fractions not propagated (not typical)

Page 17: The ARM10 Family of Advanced Microprocessor Cores · THE ARCHITECTURE FOR THE DIGITAL WORLD TM Hot Chips 13 3 ARM1020E Overview Max frequency: 400MHz 0.9V, worst case TSMC 0.13um

17THE ARCHITECTURE FOR THE DIGITAL WORLDTM Hot Chips 13

VFP10

7 Stage pipelineFetch - Issue - Decode - Execute (E 1) - E 2 - E 3 - E 4/WB

32 Single precision / 16 Double precision registers

High performance short vector operationsRegister banks operate as hardware circular queues and canbe addressed as short vectors (up to 8 values)

Separate divide/square root unitSupports load/store, and arithmetic operation in parallel withdivide/square root operation

Separate load/store unitLoad/store operations may be done in parallel with dataprocessing operations

64-bit unidirectional data interfaces

Area: ~1.6mm2 in 0.13um

Page 18: The ARM10 Family of Advanced Microprocessor Cores · THE ARCHITECTURE FOR THE DIGITAL WORLD TM Hot Chips 13 3 ARM1020E Overview Max frequency: 400MHz 0.9V, worst case TSMC 0.13um

18THE ARCHITECTURE FOR THE DIGITAL WORLDTM Hot Chips 13

ETM10

Address

Control

EmbeddedICE

Logic

ARM1020E

SOC

ETM10

Trace

Port

JTAG

Port

Trace Port

Analyzer

Multi-ICE

Data

TAP

Embedded Trace Macrocell

Page 19: The ARM10 Family of Advanced Microprocessor Cores · THE ARCHITECTURE FOR THE DIGITAL WORLD TM Hot Chips 13 3 ARM1020E Overview Max frequency: 400MHz 0.9V, worst case TSMC 0.13um

19THE ARCHITECTURE FOR THE DIGITAL WORLDTM Hot Chips 13

ETM10

Full real-time instruction and data tracing

Monitors the core’s internal buses

Zero performance overheadSupports high frequency trace with demux-port

Configurable synthesis for optimum:

area

features

pin count

Programmed non-intrusively through JTAG

Page 20: The ARM10 Family of Advanced Microprocessor Cores · THE ARCHITECTURE FOR THE DIGITAL WORLD TM Hot Chips 13 3 ARM1020E Overview Max frequency: 400MHz 0.9V, worst case TSMC 0.13um

20THE ARCHITECTURE FOR THE DIGITAL WORLDTM Hot Chips 13

ARM1020E Family Summary

ARM1020E:500 DMIPS @400 MHz

0.51 mA/MIPS

10.3mm2 / 6.9mm2

VFP10:236 MFLOPS @400MHz

IEEE 754 Compatible

ETM10:Full speed, real time

embedded trace

VFP10

Integer UnitDMMUIMMU

I-Cache D-Cache

(ARM10200r1)