Download - Simulation Techniques.ppt

8/17/2019 Simulation Techniques.ppt

1/39

Computer

ArchitectureSimulators

Alex RamirezEines i Tecniques

de Mesura

Master CANS

* Some contents taken from Sangyeun Cho, CS/COE1541 course U. of Pittsburg.


2/39

2

What is a simulator?

● A tool to reproduce the behavior of a computing device

● Why use a simulator?

● Obtain fine-grain details about internal behavior

● Performance analysis

● Enable software development on not available platforms

● Obtain performance predictions for candidate system architectures


3/39

3

Taxonomy of simulation tools

Architecture simulators

Functional Performance / Timing

Trace-driven Execution-drivenTrace-driven Execution-driven

User code Full system


4/39

4

Functional vs. Timing simulators

● Functional simulators implement the visible architecture state

● Maintain the correctness of the programmer vision of the architecture● Main purpose: software development and/or emulation

● Eg. Virtual machines (simnow, qemu, vmware)

● Timing simulators implement the microarchitecture details

● Model the system internals tructures● Processor pipeline

● Memory hierarchy

● Branch predictors

● Interconnection network

● ...

● Functional simulation is much faster than performance simulation


5/39

5

Trace vs. Execution driven

● Trace driven simulators

● Instrument application, and execute on real platform

● Instrumentation records an execution trace

● Recorded trace is used as input to the simulator

● Execution driven simulators

● The application runs directly on top of the simulator

● Simulators maintains both application state and architecture state

● Trace-driven simulation is usually faster than execution driven

● Only needs to maintain the architecture state

● No need to reproduce the computation parts of the application

● Trace driven simulator makes host and target system independent

● Obtain traces on machine A, simulate on machine B● Traces enable simulation of propietary applications & input sets

● Take traces away, independent of inputs & binaries

● Traces can not capture dynamic properties of the application


6/39

6

User-code vs Full-system

● User-code simulators only simulate the application code

● System calls and I/O fall-back to functional simulation● Often calls the native OS to perform the functional emulation

● Requires that host OS and target OS are the same

● Full-system simulators include OS and I/O devices

● Functional and timing simulation of OS codes● Functional and timing simulation of all devices

● Disks, Network, ...


7/397

Building a supercomputer from the ground up

F WMAD

F WMAD

F WMAD

F WMAD

F WMAD

F WMAD

F WMAD

F WMAD

Processor exploits ILP

F WMAD

F WMAD

F WMAD

F WMAD

F WMAD

F WMAD

F WMAD

F WMAD

F WMAD

F WMAD

F WMAD

F WMAD

F WMAD

F WMAD

F WMAD

F WMAD

F WMAD

F WMAD

F WMAD

F WMAD

F WMAD

F WMAD

F WMAD

F WMAD

F WMAD

F WMAD

F WMAD

F WMAD

F WMAD

F WMAD

F WMAD

F WMAD

Multicore exploits TLP Node board connects

multiple chips

Rack connects

multiple nodes

Many interconnected racks

build a supercomputer


8/398

Multiple levels of abstraction

● Simulation can be performed at multiple levels of abstraction

● Cluster-level

● Processors as atomic blocks

● No simulation of pipeline or memory system

● Detailed simulation of interconnection network

● Shared memory node

● Processor microarchitecture as atomic block

● Detailed simulation of memory system

● No interaction with I/O and interconnection network

● Processor

● Detailed simulation of processor piepeline stages

● Detailed simulation of first leevl caches

● Little interaction with off-chip memory system

● No interaction with I/O and interconnection network


9/399

The Zen of architecture simulators

● Can't build a simulator that achieves

everything ...

● FPGA prototypes

● Fast + accurate, but not flexible

● Detailed software models

● Accurate + flexible, but slow

● Abstract software models

● Fast + flexible, but not accurate

Speed

Accuracy Flexibility


10/3910

Developing architecture simulators

● Monolithic C code

● Models architecture in low-level c code

● Fast, Accurate (but error-prone), not reallyflexible

● Need to model interfaces + timing of

components

● Object-oriented (C++, Java)

● Objects closely match architecturecomponents

● Methods implement the object interfaces

● Still need to model timing of components

● Domain-specific languages

● Modular infrastructures

● Programmer defines components +

interfaces

● Programmer defines componnets

interconnect

● Simulation negine models timing


11/3911

The simulation speed problem

● Trace-driven simulation

● Collect traces on one machine

● Run simulations in a different machine

● Run multiple parallel simulations

● No need for total correctness

● Easier for first-approach data

● Execution-driven simulation

● Must run application and simulator on the same machine

● Cross-ISA emulation is costly

● In all cases, simulation is slow (and traces are huge)

● 100x to 100.000x slower than real runs

● 1 min. of real application can take

● 2h to 2 years of simulation

● Gigabytes of trace storage


12/3912

Simulator speed example

338x147,648x43h 13m 41sw/Ruby + Opal

89x39,131x11h 27m 25sw/Ruby

1437x7m 41ssimics

25x4,029x1h 11m 07ssim-outorder

1158x2m 47ssim-fast----1.054sNative

Ratio to FunctionalRatio to NativeTime

● Simplescalar toolset

● sim-fast provides functional emulation

● sim-outorder provides timing simulation of processor out of order pipeline

● Simics + GEMS toolset● simics provides multicore full-system functional simulation

● Ruby provides timing simulation of the caches + memory hierarchy

● Processor abstracted as fixed 1IPC

● Opal provides timing simulation of the out-of-order processor pipeline

* gcc from SPEC'00 small input on Xeon 3.8GHz


13/3913

Don't simulate faster: Simulate LESS

● Simulation is a single-thread process

● Even when simulating parallel hardware● Processors do not get any faster: we have hit the power wall

● Simulation speed no longer improves with technology

● Parallel machines used to run multiple simulations in parallel

● We can not make a faster simulator ● Do not simulate the entire system

● Do not simulate the entire application


14/3914

Downsizing the simulated platform

● Simulate a smaller system

● eg. Multiprocessor of 16 cores instead of 1024● Does not expose scalability issues

● Competition for shared resources

● Conflicts

● Race conditions

● Simulate a smaller application

● eg. Matrix Multiply of 1K x 1K instead of 1M x 1M

● Does not exercise the hardware properly

● Smaller working set fits in cache

● Capacity and conflict issues● Less data / code reuse

● Heavier weight of initialization vs. steady state


15/3915

Simulation techniques

● Sampling

● Simulate random samples of the application● Fast-forwarding

● Reaching the sample point in the application

● Warm-up

● Fast simulation of components previous to measuring stage

● Checkpointing● Dump application and architecture state before the simulation sample

● Phase detection

● Select (non-random) samples based on application analysis


16/3916

Simulation sampling

● Statistical approach

● Do not examine the entire population

● Interview a representative sample

● Mathematical approach

● confidence margin (eg. 95%)

● confidence interval (eg. +/- 2.5)

● Two sampling approaches● Systematic sampling

● Sample every N instructions

● Random sampling

U

Actual simulator

measurement

N instructions (total benchmark execution)


17/3917

Number of required samples

● Sample population is proportional to

● Variance of the target metric● Desired confidence interval

● Desired confidence level

● n >= ( level * variance / (1 - interval) ) ^ 2


18/39

18

SimFlex sample sizes

● Multiprocessor applications exhibit much higher variance

● Larger samples required to stabilize measurements

95 +/- 550.000 cycles100.000 cyclesOLTP (TPC-C)

Web server

99.7 +/- 31.000 instr 2.000 instr SPEC CPU

Confidence intervalSimulation unitDetailed warm-up Application


19/39

19

Simulator warm-up

● Can't simulate samples on an empty architecture state

● Caches do not start in invalid state

● Branch predictor, TLB, ...

● Operating system state● Files open / close, read / write pointers, ...

● The architecture takes some time to warm-up

● The larger the structure, the longer the warm-up

Functional simulation

Warm-up of large structures

W

Detailed simulator

warm-up

U

Actual simulator

measurement



20/39

20

Checkpointing and sampling simulation

● Store warmed-up state before each sample to a checkpoint file● Amortizes the functional simulation overhead time over more detailed simulations

● Allows parallel simulation of all samples

Functional simulation

Warm-up of large structures

W

Detailed simulator

warm-up

U

Actual simulator

measurement


Restore checkpoint state


21/39

21

Non-random sampling

● Exploit application's repetitive behavior to select valid samples

● Assume application behavior is linked to static code execution

● Each phase corresponds to a static section of the code


22/39

22

Basic block vector

● Count executions of each basic block

● For the whole program execution● For every N instructions executed

● Typical N is 100 million instructions

● Normalize vectors to the total number of basic blocks in the BBV

● The sum of all elements in the BBV adds up to 1

● Compare sample BBV to global BBV

● Manhattan distance

● Sum of absolute differences produces a value between 0 and 2

● Euclidean distance

● SQRT of the sum of the squared difference


23/39

23

BBV comparison

● Comparison of BBVs also shows

periodic behavior in the application

● BBV difference graph used to

● Identify initialization phase

● Identify repetitive interval

● Fourier analysis


24/39

24

SimPoint issues

● The most representative BBV might be far ahead in the code

● Large fast-forward period● Pick the earliest significant sample

● A single sample might not represent the whole application

● Pick one sample from each application phase

● Selected samples are architecture dependent

● However, the same code with a different compiler has the same BBV


25/39

25

Multithreaded workloads

● Random samples of multiple threads do not overlap in time

● Invalidates simulation of multithreaded systems

● Taking a vertical sample does not guarantee the proper alignment either

● Thread speed is not the same during

● Functional simulation

● Warm-up

● Simulation

● Fast-forward is measured in instructions, not time


26/39

26

Alternatives to simulation

● Statistical simulation

● Analytical modeling● Abstract simulation

● Hyerarchical simulation


27/39

27

Task-based programming model

● Programmer annotates the code to identify tasks, inputs, and outputs

● Total task working set must fit in the LS

● Runtime library takes care of

● Detecting parallelism

● Bundling tasks together

● Transferring data in / out of the LS

● Automatically performs double buffering for tasks in a bundle

The first Tasks on a Bundle

require to wait for DMAs to

finished

The following Tasks on a Bundle

don't have to wait much, since their

data has been requested

before the previous Task was executed


28/39

28

The TaskSim simulator

● TaskSim is a system simulator based on application level

traces

● Thread phases + inter-thread dependencies + DMA events

● Abstract simulation of CPU bursts

● Obtain burst duration from trace● Aplly CPU speed factor based on phase id

● Detailed (cycle-accurate) simulation of DMA controller, caches,

interconnect, memory controller, and DRAM


29/39

29

Sample TaskSim application trace

Helper Thread

CPU 8, 15.51us Prepare bundle

CPU 9, 1.19us Prepare submissionCPU 11, 1.51us Submit bundle

SIGNAL 8 Bundle 8 is ready

CPU 16, 0us Low level wait for events

WAIT 101 Wait for Task 101

CPU 7, 5,52us Schedule

CPU 8, 15.61us Prepare bundle

SPE Thread 8

CPU 19, 4521.23us Wait for tasks

WAIT 8 Wait for Bundle 8

CPU 20, 6.42us Get task description

CPU 21, 1.35us Task Stage in

CPU 23, 561.21us Task execution

In the simulation, duration of wait

phases will be ignored, as it depends

on the WAIT event


30/39

30

Where does CPU burst timing come from?

● CPU bursts in the TaskSim trace are annotated with their duration

● TaskSim can then adjust such CPU burst duration

● Change duration to a fixed amount (eg. DMA Wait goes to 1ns)

● Multiply duration for a given factor (eg. SPU computation gets 2x faster)

● CPU burst duration obtained from● The real execution at the time the trace was collected

● Cycle-accurate simulation on another simulator

● Take an instruction-level trace through another simulator

● Simulate a sample in an execution-driven simulator

● All CPU bursts can be simulated independently on such cycle-accurate

models

● Hierarchical simulation


31/39

31

Dynamic task scheduling automatically achieves load balancing

● The simulator task scheduler dynamically allocates tasks (bundles) to the first

available processor

● Simulate a variable number of processors from a single application trace


32/39

32

Matrix Multiply simulations (8 to 256 CPU)

Original execution, 8 SPU, 25.6 GB/s

16 SPU, 25.6 GB/s

3x PPU, 16 SPU, 25.6 GB/s

5x PPU, 32 SPU, 25.6 GB/s

3 PPU, 10x PPU, 256 SPU, 409.6 GB/s


33/39

33

MatMul zoom on 256 CPU simulation

● Not even 3 masters at 10x speed can keep up with 256 workers● Wrong task creation + scheduling strategy (depth first vs. breatdh first)


34/39

34

Task generation scheme scalability

● Task generation (green) on the master task limits scalability (on the left)

● Parallelization of task generation (on the right) is crucial to avoid this bottleneck

16p

32p

64p


35/39

35

Simulator results also allow analysis of per-module behavior

● Paraver trace shows cache accesses / ns for each of the 4 cache banks● 128-byte interleave evenly balances pressure on the 4 caches

● 128 KB interleave is only using 1 out of 4 caches at any given time

● Worst case quotient of data size (or DMA size?) vs. interleave grain

● 256KB interleave is using 2 out of 4 caches

Cache interleave every 128 bytes

Cache interleave every 128 KB

Cache interleave every 256 KB


36/39

36

Also produces statistics (counters + histograms) in addition to traces

● Graphs show the number of simultaneous data transfers on the cluster and global buses● Little intra-cluster traffic

● All traffic competes to reach the global bus port

● High traffic on the global bus

● Continously handles 2-4 transfers

● Very little use for more than 4 simultaneous transfers


37/39

37

Why FFT3D does not perform well with 32 cache modules?

● Cache write-allocate policy hurts 1st FFT pass too badly

● Even if afterwards the whole working set fits in the cache

● TransposeYZ works better with cache due tor educed page conflicts and lower latency

FFT3D with 800GB/s bandwidth

FFT3D with 32 cache blocks (800GB/s bandwidth) and 100GB/s memory bandwidth

Cache is better

for some

access patterns


38/39

38

Hyerarchical simulation methodology

● First execution

● Trace real application at a high level

of abstraction (CPU bursts,synchronization, communicationevents)

● Find Representative segment ofexecution

● Non-linear filtering and spectralanalysis

● 10-100x smaller trace

● CPU burst clustering algorithm

● Density based clustering usingperformance counters

● Selection of CPU burstsrepresentatives

● 10-100x smaller trace● Second execution

● Trace CPU bursts representatives atmicroarchitecture level


39/39

Trace Obtention: WRF

Download - Simulation Techniques.ppt

Top Related