Download - Simulation Techniques.ppt
-
8/17/2019 Simulation Techniques.ppt
1/39
Computer
ArchitectureSimulators
Alex RamirezEines i Tecniques
de Mesura
Master CANS
* Some contents taken from Sangyeun Cho, CS/COE1541 course U. of Pittsburg.
-
8/17/2019 Simulation Techniques.ppt
2/39
2
What is a simulator?
● A tool to reproduce the behavior of a computing device
● Why use a simulator?
● Obtain fine-grain details about internal behavior
● Performance analysis
● Enable software development on not available platforms
● Obtain performance predictions for candidate system architectures
-
8/17/2019 Simulation Techniques.ppt
3/39
3
Taxonomy of simulation tools
Architecture simulators
Functional Performance / Timing
Trace-driven Execution-drivenTrace-driven Execution-driven
User code Full system
-
8/17/2019 Simulation Techniques.ppt
4/39
4
Functional vs. Timing simulators
● Functional simulators implement the visible architecture state
● Maintain the correctness of the programmer vision of the architecture● Main purpose: software development and/or emulation
● Eg. Virtual machines (simnow, qemu, vmware)
● Timing simulators implement the microarchitecture details
● Model the system internals tructures● Processor pipeline
● Memory hierarchy
● Branch predictors
● Interconnection network
● ...
● Functional simulation is much faster than performance simulation
-
8/17/2019 Simulation Techniques.ppt
5/39
5
Trace vs. Execution driven
● Trace driven simulators
● Instrument application, and execute on real platform
● Instrumentation records an execution trace
● Recorded trace is used as input to the simulator
● Execution driven simulators
● The application runs directly on top of the simulator
● Simulators maintains both application state and architecture state
● Trace-driven simulation is usually faster than execution driven
● Only needs to maintain the architecture state
● No need to reproduce the computation parts of the application
● Trace driven simulator makes host and target system independent
● Obtain traces on machine A, simulate on machine B● Traces enable simulation of propietary applications & input sets
● Take traces away, independent of inputs & binaries
● Traces can not capture dynamic properties of the application
-
8/17/2019 Simulation Techniques.ppt
6/39
6
User-code vs Full-system
● User-code simulators only simulate the application code
● System calls and I/O fall-back to functional simulation● Often calls the native OS to perform the functional emulation
● Requires that host OS and target OS are the same
● Full-system simulators include OS and I/O devices
● Functional and timing simulation of OS codes● Functional and timing simulation of all devices
● Disks, Network, ...
-
8/17/2019 Simulation Techniques.ppt
7/397
Building a supercomputer from the ground up
F WMAD
F WMAD
F WMAD
F WMAD
F WMAD
F WMAD
F WMAD
F WMAD
Processor exploits ILP
F WMAD
F WMAD
F WMAD
F WMAD
F WMAD
F WMAD
F WMAD
F WMAD
F WMAD
F WMAD
F WMAD
F WMAD
F WMAD
F WMAD
F WMAD
F WMAD
F WMAD
F WMAD
F WMAD
F WMAD
F WMAD
F WMAD
F WMAD
F WMAD
F WMAD
F WMAD
F WMAD
F WMAD
F WMAD
F WMAD
F WMAD
F WMAD
Multicore exploits TLP Node board connects
multiple chips
Rack connects
multiple nodes
Many interconnected racks
build a supercomputer
-
8/17/2019 Simulation Techniques.ppt
8/398
Multiple levels of abstraction
● Simulation can be performed at multiple levels of abstraction
● Cluster-level
● Processors as atomic blocks
● No simulation of pipeline or memory system
● Detailed simulation of interconnection network
● Shared memory node
● Processor microarchitecture as atomic block
● Detailed simulation of memory system
● No interaction with I/O and interconnection network
● Processor
● Detailed simulation of processor piepeline stages
● Detailed simulation of first leevl caches
● Little interaction with off-chip memory system
● No interaction with I/O and interconnection network
-
8/17/2019 Simulation Techniques.ppt
9/399
The Zen of architecture simulators
● Can't build a simulator that achieves
everything ...
● FPGA prototypes
● Fast + accurate, but not flexible
● Detailed software models
● Accurate + flexible, but slow
● Abstract software models
● Fast + flexible, but not accurate
Speed
Accuracy Flexibility
-
8/17/2019 Simulation Techniques.ppt
10/3910
Developing architecture simulators
● Monolithic C code
● Models architecture in low-level c code
● Fast, Accurate (but error-prone), not reallyflexible
● Need to model interfaces + timing of
components
● Object-oriented (C++, Java)
● Objects closely match architecturecomponents
● Methods implement the object interfaces
● Still need to model timing of components
● Domain-specific languages
● Modular infrastructures
● Programmer defines components +
interfaces
● Programmer defines componnets
interconnect
● Simulation negine models timing
-
8/17/2019 Simulation Techniques.ppt
11/3911
The simulation speed problem
● Trace-driven simulation
● Collect traces on one machine
● Run simulations in a different machine
● Run multiple parallel simulations
● No need for total correctness
● Easier for first-approach data
● Execution-driven simulation
● Must run application and simulator on the same machine
● Cross-ISA emulation is costly
● In all cases, simulation is slow (and traces are huge)
● 100x to 100.000x slower than real runs
● 1 min. of real application can take
● 2h to 2 years of simulation
● Gigabytes of trace storage
-
8/17/2019 Simulation Techniques.ppt
12/3912
Simulator speed example
338x147,648x43h 13m 41sw/Ruby + Opal
89x39,131x11h 27m 25sw/Ruby
1437x7m 41ssimics
25x4,029x1h 11m 07ssim-outorder
1158x2m 47ssim-fast----1.054sNative
Ratio to FunctionalRatio to NativeTime
● Simplescalar toolset
● sim-fast provides functional emulation
● sim-outorder provides timing simulation of processor out of order pipeline
● Simics + GEMS toolset● simics provides multicore full-system functional simulation
● Ruby provides timing simulation of the caches + memory hierarchy
● Processor abstracted as fixed 1IPC
● Opal provides timing simulation of the out-of-order processor pipeline
* gcc from SPEC'00 small input on Xeon 3.8GHz
-
8/17/2019 Simulation Techniques.ppt
13/3913
Don't simulate faster: Simulate LESS
● Simulation is a single-thread process
● Even when simulating parallel hardware● Processors do not get any faster: we have hit the power wall
● Simulation speed no longer improves with technology
● Parallel machines used to run multiple simulations in parallel
● We can not make a faster simulator ● Do not simulate the entire system
● Do not simulate the entire application
-
8/17/2019 Simulation Techniques.ppt
14/3914
Downsizing the simulated platform
● Simulate a smaller system
● eg. Multiprocessor of 16 cores instead of 1024● Does not expose scalability issues
● Competition for shared resources
● Conflicts
● Race conditions
● Simulate a smaller application
● eg. Matrix Multiply of 1K x 1K instead of 1M x 1M
● Does not exercise the hardware properly
● Smaller working set fits in cache
● Capacity and conflict issues● Less data / code reuse
● Heavier weight of initialization vs. steady state
-
8/17/2019 Simulation Techniques.ppt
15/3915
Simulation techniques
● Sampling
● Simulate random samples of the application● Fast-forwarding
● Reaching the sample point in the application
● Warm-up
● Fast simulation of components previous to measuring stage
● Checkpointing● Dump application and architecture state before the simulation sample
● Phase detection
● Select (non-random) samples based on application analysis
-
8/17/2019 Simulation Techniques.ppt
16/3916
Simulation sampling
● Statistical approach
● Do not examine the entire population
● Interview a representative sample
● Mathematical approach
● confidence margin (eg. 95%)
● confidence interval (eg. +/- 2.5)
● Two sampling approaches● Systematic sampling
● Sample every N instructions
● Random sampling
U
Actual simulator
measurement
N instructions (total benchmark execution)
-
8/17/2019 Simulation Techniques.ppt
17/3917
Number of required samples
● Sample population is proportional to
● Variance of the target metric● Desired confidence interval
● Desired confidence level
● n >= ( level * variance / (1 - interval) ) ^ 2
-
8/17/2019 Simulation Techniques.ppt
18/39
18
SimFlex sample sizes
● Multiprocessor applications exhibit much higher variance
● Larger samples required to stabilize measurements
95 +/- 550.000 cycles100.000 cyclesOLTP (TPC-C)
Web server
99.7 +/- 31.000 instr 2.000 instr SPEC CPU
Confidence intervalSimulation unitDetailed warm-up Application
-
8/17/2019 Simulation Techniques.ppt
19/39
19
Simulator warm-up
● Can't simulate samples on an empty architecture state
● Caches do not start in invalid state
● Branch predictor, TLB, ...
● Operating system state● Files open / close, read / write pointers, ...
● The architecture takes some time to warm-up
● The larger the structure, the longer the warm-up
Functional simulation
Warm-up of large structures
W
Detailed simulator
warm-up
U
Actual simulator
measurement
N instructions (total benchmark execution)
-
8/17/2019 Simulation Techniques.ppt
20/39
20
Checkpointing and sampling simulation
● Store warmed-up state before each sample to a checkpoint file● Amortizes the functional simulation overhead time over more detailed simulations
● Allows parallel simulation of all samples
Functional simulation
Warm-up of large structures
W
Detailed simulator
warm-up
U
Actual simulator
measurement
N instructions (total benchmark execution)
Restore checkpoint state
-
8/17/2019 Simulation Techniques.ppt
21/39
21
Non-random sampling
● Exploit application's repetitive behavior to select valid samples
● Assume application behavior is linked to static code execution
● Each phase corresponds to a static section of the code
-
8/17/2019 Simulation Techniques.ppt
22/39
22
Basic block vector
● Count executions of each basic block
● For the whole program execution● For every N instructions executed
● Typical N is 100 million instructions
● Normalize vectors to the total number of basic blocks in the BBV
● The sum of all elements in the BBV adds up to 1
● Compare sample BBV to global BBV
● Manhattan distance
● Sum of absolute differences produces a value between 0 and 2
● Euclidean distance
● SQRT of the sum of the squared difference
-
8/17/2019 Simulation Techniques.ppt
23/39
23
BBV comparison
● Comparison of BBVs also shows
periodic behavior in the application
● BBV difference graph used to
● Identify initialization phase
● Identify repetitive interval
● Fourier analysis
-
8/17/2019 Simulation Techniques.ppt
24/39
24
SimPoint issues
● The most representative BBV might be far ahead in the code
● Large fast-forward period● Pick the earliest significant sample
● A single sample might not represent the whole application
● Pick one sample from each application phase
● Selected samples are architecture dependent
● However, the same code with a different compiler has the same BBV
-
8/17/2019 Simulation Techniques.ppt
25/39
25
Multithreaded workloads
● Random samples of multiple threads do not overlap in time
● Invalidates simulation of multithreaded systems
● Taking a vertical sample does not guarantee the proper alignment either
● Thread speed is not the same during
● Functional simulation
● Warm-up
● Simulation
● Fast-forward is measured in instructions, not time
-
8/17/2019 Simulation Techniques.ppt
26/39
26
Alternatives to simulation
● Statistical simulation
● Analytical modeling● Abstract simulation
● Hyerarchical simulation
-
8/17/2019 Simulation Techniques.ppt
27/39
27
Task-based programming model
● Programmer annotates the code to identify tasks, inputs, and outputs
● Total task working set must fit in the LS
● Runtime library takes care of
● Detecting parallelism
● Bundling tasks together
● Transferring data in / out of the LS
● Automatically performs double buffering for tasks in a bundle
The first Tasks on a Bundle
require to wait for DMAs to
finished
The following Tasks on a Bundle
don't have to wait much, since their
data has been requested
before the previous Task was executed
-
8/17/2019 Simulation Techniques.ppt
28/39
28
The TaskSim simulator
● TaskSim is a system simulator based on application level
traces
● Thread phases + inter-thread dependencies + DMA events
● Abstract simulation of CPU bursts
● Obtain burst duration from trace● Aplly CPU speed factor based on phase id
● Detailed (cycle-accurate) simulation of DMA controller, caches,
interconnect, memory controller, and DRAM
-
8/17/2019 Simulation Techniques.ppt
29/39
29
Sample TaskSim application trace
Helper Thread
CPU 8, 15.51us Prepare bundle
CPU 9, 1.19us Prepare submissionCPU 11, 1.51us Submit bundle
SIGNAL 8 Bundle 8 is ready
CPU 16, 0us Low level wait for events
WAIT 101 Wait for Task 101
CPU 7, 5,52us Schedule
CPU 8, 15.61us Prepare bundle
SPE Thread 8
CPU 19, 4521.23us Wait for tasks
WAIT 8 Wait for Bundle 8
CPU 20, 6.42us Get task description
CPU 21, 1.35us Task Stage in
CPU 23, 561.21us Task execution
In the simulation, duration of wait
phases will be ignored, as it depends
on the WAIT event
-
8/17/2019 Simulation Techniques.ppt
30/39
30
Where does CPU burst timing come from?
● CPU bursts in the TaskSim trace are annotated with their duration
● TaskSim can then adjust such CPU burst duration
● Change duration to a fixed amount (eg. DMA Wait goes to 1ns)
● Multiply duration for a given factor (eg. SPU computation gets 2x faster)
● CPU burst duration obtained from● The real execution at the time the trace was collected
● Cycle-accurate simulation on another simulator
● Take an instruction-level trace through another simulator
● Simulate a sample in an execution-driven simulator
● All CPU bursts can be simulated independently on such cycle-accurate
models
● Hierarchical simulation
-
8/17/2019 Simulation Techniques.ppt
31/39
31
Dynamic task scheduling automatically achieves load balancing
● The simulator task scheduler dynamically allocates tasks (bundles) to the first
available processor
● Simulate a variable number of processors from a single application trace
-
8/17/2019 Simulation Techniques.ppt
32/39
32
Matrix Multiply simulations (8 to 256 CPU)
Original execution, 8 SPU, 25.6 GB/s
16 SPU, 25.6 GB/s
3x PPU, 16 SPU, 25.6 GB/s
5x PPU, 32 SPU, 25.6 GB/s
3 PPU, 10x PPU, 256 SPU, 409.6 GB/s
-
8/17/2019 Simulation Techniques.ppt
33/39
33
MatMul zoom on 256 CPU simulation
● Not even 3 masters at 10x speed can keep up with 256 workers● Wrong task creation + scheduling strategy (depth first vs. breatdh first)
-
8/17/2019 Simulation Techniques.ppt
34/39
34
Task generation scheme scalability
● Task generation (green) on the master task limits scalability (on the left)
● Parallelization of task generation (on the right) is crucial to avoid this bottleneck
16p
32p
64p
-
8/17/2019 Simulation Techniques.ppt
35/39
35
Simulator results also allow analysis of per-module behavior
● Paraver trace shows cache accesses / ns for each of the 4 cache banks● 128-byte interleave evenly balances pressure on the 4 caches
● 128 KB interleave is only using 1 out of 4 caches at any given time
● Worst case quotient of data size (or DMA size?) vs. interleave grain
● 256KB interleave is using 2 out of 4 caches
Cache interleave every 128 bytes
Cache interleave every 128 KB
Cache interleave every 256 KB
-
8/17/2019 Simulation Techniques.ppt
36/39
36
Also produces statistics (counters + histograms) in addition to traces
● Graphs show the number of simultaneous data transfers on the cluster and global buses● Little intra-cluster traffic
● All traffic competes to reach the global bus port
● High traffic on the global bus
● Continously handles 2-4 transfers
● Very little use for more than 4 simultaneous transfers
-
8/17/2019 Simulation Techniques.ppt
37/39
37
Why FFT3D does not perform well with 32 cache modules?
● Cache write-allocate policy hurts 1st FFT pass too badly
● Even if afterwards the whole working set fits in the cache
● TransposeYZ works better with cache due tor educed page conflicts and lower latency
FFT3D with 800GB/s bandwidth
FFT3D with 32 cache blocks (800GB/s bandwidth) and 100GB/s memory bandwidth
Cache is better
for some
access patterns
-
8/17/2019 Simulation Techniques.ppt
38/39
38
Hyerarchical simulation methodology
● First execution
● Trace real application at a high level
of abstraction (CPU bursts,synchronization, communicationevents)
● Find Representative segment ofexecution
● Non-linear filtering and spectralanalysis
● 10-100x smaller trace
● CPU burst clustering algorithm
● Density based clustering usingperformance counters
● Selection of CPU burstsrepresentatives
● 10-100x smaller trace● Second execution
● Trace CPU bursts representatives atmicroarchitecture level
-
8/17/2019 Simulation Techniques.ppt
39/39
Trace Obtention: WRF