lec wolf slides

8/9/2019 Lec Wolf Slides

1/55

Smart Cameras as HighPerformance Embedded

Systems

Wayne Wolf, Tiehan Lv,

Burak Ozer, Jason Fritts


2/55

Outline

Problem and methodology.Ozer, Lv, Wolf: smart camera system.

Ozer, Lv, Wolf: Optimizing the smartcamera software.

Wolf, Lv, Ozer: Architectures for smart

cameras.

Xu, Wolf: Wave pipelining for NoCs.


3/55

What is video processing?

Initial steps operate on pixels, aredominated by data.

Later steps operate on other types of

data:

Smaller data volumes.

Wider variety of data types.More variation in control flow, run time.


4/55

Multimedia requirements

Complex algorithms:multiple phases;

data and control.

Todays applications: compression.

Tomorrows applications: analysis.


5/55

The multimedia processing

funnel

Data

volumeData

abstraction

pixel processing

principal component analysis,

hidden Markov models

Edge extraction


6/55


7/55

Questions

Measurement:

What do we measure?

On whatimplementation do wemeasure it?

How accurate do ourmeasurements have tobe?

Architecture:

What uniprocessorarchitecture is best?

Do we need amultiprocessor?

How do we balanceprogrammability withother goals?


8/55

The Parapet Project

Goal: design SoC networks for real-timedistributed vision.

The best way to get a good design example

is to create our own.Video is a high-performance, low-power,

cost-sensitive application.

Vision is an important problem.


9/55

Parapet goals

Algorithms (gesture recognition).How do we adapt algorithms to the needs of real-

time embedded video.

Distributed systems.Communicating cameras.

Embedded software.Middleware, code optimization.

SoC architecture.Heterogeneous multiprocessors.


10/55

Smart cameras for smart

rooms

Coordinated cameras track subject:


11/55

Why multiple cameras?

Helps with occlusion.

Can synthesize new views:

Pan/focus of attention.

Zoom/resolution.

Aperture.


12/55

Why distributed?

Centralized server is impractical:

Network bandwidth.

Network power consumption.

Latency for system adjustements (lighting,etc.)

Smart cameras plays into strengths ofVLSI---increases volume.


13/55

Ozer et al: human activity

recognition algorithm

Regionextraction Contourfollowing

Ellipse

fitting

Graph

matching

HMM head

HMM body

HMM hand1

HMM hand2

Gestureclassifier


14/55

Real-time analysis


15/55

Original

Region finding Ellipse fitting


16/55

Tuning the smart camera

software

Initial C/Trimedia was direct translationfrom Matlab.

Goals:

Increase frame rate.

Reduce latency.

Identify bottlenecks for next-generationarchitecture.


17/55

Real-time vs. just fast

Real-time computing adheres toconstraints:

Must perform at a given rate.

To satisfy the rate, must minimize variationsin processing time.


18/55

Stage times before

optimization

0.0%

10.0%

20.0%

30.0%

40.0%

50.0%

60.0%

ProcessingTime(%)

Region Contour Super Match


19/55

Smart camera CPU times

0 5 10 15 20 25 30 35 40 4526.5

27

27.5

28

28.5

29

29.5

Frame number

Processingtime

(msec)

Skin Detection

0 5 10 15 20 25 30 35 40 4565.6

65.8

66

66.2

66.4

66.6

66.8

67

67.2

Frame number

Processingtime(msec)

Contour Following

Skin detection Contour detection

67.2 ms29.5 ms


20/55

Smart camera CPU times,

contd.

0 5 10 15 20 25 30 35 40 4550

100

150

200

250

300

350

Frame number


Superellipse Fitting

0 5 10 15 20 25 30 35 40 450

5

10

15

20

25

30

35

Frame number


Graph Matching

Superellipse fitting Graph matching

35 ms250 ms


21/55

Normalized standard

deviation of stage times

0

0.1

0.2

0.3

0.4

0.50.6

0.7

0.8

0.9

region contour super match all


22/55

Optimizations

Change the algorithm.

Change the program structure.

Change the instructions.


23/55

Algorithmic changes

Superellipses were expensive to fit andoverkill.

Replaced with ellipse fitting.

Improved adjacency algorithm.


24/55

Region finding

Operates on 3 x 3 window.

Roughly linear in frame size.

Sequential algorithm---window moves onepixel per step.


25/55

Program changes

Contour fitting is very control intensive:

Compares local configurations of bits.

Transformed into data-parallel operations

for VLIW: 011

0

1

0

1

1

107 Table

Control-orientedData-oriented


26/55

Instruction changes

Trimedia provides library of intrinsicfunctions that map onto Trimediainstruction sequences.

Goal: eliminate branches.Special instructions.

Loop unrolling.


27/55

Before and after stage

times

0

10

20

30

40

50

60

Processing

Time(ms

Region Contour SuperFit Match Total

Original Optimized


28/55

Results

Before: 5 frames/sec.

After: 31 frames/sec w/o HMM, 25frames/sec with HMM.

Latency approx. 100 ms.

Smaller variation in frame processing

time.Intel port runs well.

M lti


29/55

Multiprocessor

architectures for video

Interested in high-speed videoprocessing.

150 frames/sec.

Want reasonably low-power operation forpervasive applications.


30/55

High-speed smart cameras

High frame rates provide better motioncapture.

Frame rate of 150 frames/sec is

considered desirable.

Stanford CMOS camera can digitize at

10,000 frames/sec.

Wh h t


31/55

Why heterogeneous

architectures make sense

Data

volumeData

abstraction

pixel processing

principal component analysis,

hidden Markov models

Edge extractionVLIWRISC


32/55

Algorithm flow

Region

extraction

Contour

following

Ellipse

fitting

Graph

matching

HMM head

HMM body

HMM hand1

HMM hand2

Gestureclassifier


33/55

Memory structure

Feed-forward data communication.

Not much global memory required.

Allocating memory depends on datavolumes, access patterns, flexibility.

A i ti


34/55

Average processing time

by stage

0.00E+00

2.00E+05

4.00E+05

6.00E+05

8.00E+051.00E+06

1.20E+06

1.40E+06

1.60E+061.80E+06

clock_

cycles/frame

region contour ellipse match hmm

algorithm stages


35/55

Average IPC by stage

0

2

4

6

IPC

region contour ellipse match hmm

Algorithm Stages

Ti h VLIW


36/55

Tiehans VLIW

implementation

Unroll loop to perform multiplecomparisons in parallel.

Pack results into bit vector to address

results table.

Register file, cache provide for reuse of

pixel values.


37/55

Contour crawler machine

Hardware implementation of VLIW code:

memory

Q

Q

Q

FSM


38/55

Crawler and memory

Crawler performance depends on memorysystem.

Access patterns vary in 2 dimensions:

1 2 3

X 4

567

8


39/55

Memory system design

Want to minimize number of partitions toreduce row/column overhead.

Only memory organization that allows for

all parallel accesses is one-word partition.

Assume we fetch one row or column at a

time---3 fetches/cycle.


40/55

Single contour crawler

Assuming row/column access pattern,crawler is faster than VLIW by a relativelysmall constant.


41/55

Multiple crawlers

Assuming we can patch together contours, we

can start multiple crawlers.

Multiple crawler performance is limited bymemory.Multiple crawlers memory accesses can conflict.


42/55

Full-frame SIMD

Can build a large SIMD array with oneprocessor per pixel.

Area*delay:

Speed is roughly constant.

PE is probably about the same size as the

crawler.Not clear it is worth the silicon.


43/55

Heterogeneous system

Region:

Stream processor with current algorithm.Stream processor + RISC for others.

Contour:Crawler.

Ellipse:Superscalar/RISC.

Graph:RISC.


44/55

Stage pipelining

Significant amount of time in non-streaming

stages.

010

20

30

40

50

60

ProcessingTime(m

s)

Region Contour SuperFit Match Total

Original Optimized


45/55

Heterogeneous vs. VLIW

VLIW:

Off-the-shelf IP.

Easy to program.

10 mm2 in 0.13 micron.

Heterogeneous:

Requires more design of blocks, memory.Pipelineable for 2.3X speed-up.

Heterogeneous


46/55

Heterogeneous

multiprocessor size

stage PE

area

(mm^2)

background

MIPS32

4Km 0.9

contour custom 0.001

ellipse, graph MIPS64 5Kf 5

total frame processor 5.901

classification MIPS64 5Kf 5

number of frame processors 3

grand total 22.703


47/55

Thoughts on networking

Dont want to send all the video all of thetime.Need to send bursts, possibly lower frame

rate or resolution.

Distributed control requires low latency.

Kodak is developing a low-cost wireless

network for ad-hoc camera networks.Consider 802.11 too expensive.


48/55

Networks-on-chips

SoC trends:Lots of transistors.

Designs built from

multiple IP blocks.Structured as

heterogeneous

multiprocessors.Why not bundle

interconnect as IP?

CPU CPU

CPU CPU

mem mem

Advantages of NoC based


49/55

Advantages of NoC-based

design

More structured wiring---eases physicaldesign problems.

Encourages design reuse of interconnect.

Structures and simplifies applications byproviding communication primitives.

Leverages networked multiprocessordesign knowledge.

Challenges in NoC based


50/55

Challenges in NoC-based

design

What networks, protocols, etc?

Physical layer---rich but crummyinterconnect.

Application-specific design: what is worthmodifying and what should be left alone?


51/55

Wave pipelining history

L. Cotton introduced wave pipelining in 1969.Used in Cray-1.

Pipelined memory access in Rambus.

Difficult to design.

Multiple data elements in flight.

F1 F2 F3Buffer

Buffer

Pipelining with latch Wave pipelining

Data1 Data2 Data3

F1 F2 F3Buffer

Buffer

Data1 Data2 Data3


52/55

Simulation model

Cadence Spectre 10000m interconnection

Inverters are evenly spaced

Worst case and best case 0.25m Al (width, space)=(1.2m, 2.0m)

[KAH98]

Loa

d

Load

Load

wire model

R/3 R/3 R/3

C/6 C/3 C/3 C/6


53/55

Simulation results (I)

When inverter size is 100md=379ps, t=226ps

When inverter size reduces to 40mN=3.28, t=325ps On one end, throughput is 70% higher; on the other end,

save about 60% inverter area; in middle, increase

throughput while save area.

0

100

200

300

400

500

600

700

120 110 100 90 80 70 60 50 40 30

Inverter sizes (um)

Delay

s(ps)

d t


54/55

Simulation results (II)

When inverter size is 100mEt=20.5pJ/bit, Ew=17.1pJ/bit

When inverter size reduces to 40m

Ew=9.88pJ/bit

Wave pipelining could save energy.

0

5

10

15

20

25

120 110 100 90 80 70 60 50 40 30

Inverter sizes (um)

Averageenergiespe

rbit(pJ/bit) Ew

Et


55/55

Summary

Multimedia applications are already morecomplex and will become more so:

multiple algorithms;

complex control and data.

Instruction-level parallelism helps, but

isnt everything.

lec wolf slides

Documents