lec wolf slides

Upload: basky

Post on 29-May-2018

226 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/9/2019 Lec Wolf Slides

    1/55

    Smart Cameras as HighPerformance Embedded

    Systems

    Wayne Wolf, Tiehan Lv,

    Burak Ozer, Jason Fritts

  • 8/9/2019 Lec Wolf Slides

    2/55

    Outline

    Problem and methodology.Ozer, Lv, Wolf: smart camera system.

    Ozer, Lv, Wolf: Optimizing the smartcamera software.

    Wolf, Lv, Ozer: Architectures for smart

    cameras.

    Xu, Wolf: Wave pipelining for NoCs.

  • 8/9/2019 Lec Wolf Slides

    3/55

    What is video processing?

    Initial steps operate on pixels, aredominated by data.

    Later steps operate on other types of

    data:

    Smaller data volumes.

    Wider variety of data types.More variation in control flow, run time.

  • 8/9/2019 Lec Wolf Slides

    4/55

    Multimedia requirements

    Complex algorithms:multiple phases;

    data and control.

    Todays applications: compression.

    Tomorrows applications: analysis.

  • 8/9/2019 Lec Wolf Slides

    5/55

    The multimedia processing

    funnel

    Data

    volumeData

    abstraction

    pixel processing

    principal component analysis,

    hidden Markov models

    Edge extraction

  • 8/9/2019 Lec Wolf Slides

    6/55

  • 8/9/2019 Lec Wolf Slides

    7/55

    Questions

    Measurement:

    What do we measure?

    On whatimplementation do wemeasure it?

    How accurate do ourmeasurements have tobe?

    Architecture:

    What uniprocessorarchitecture is best?

    Do we need amultiprocessor?

    How do we balanceprogrammability withother goals?

  • 8/9/2019 Lec Wolf Slides

    8/55

    The Parapet Project

    Goal: design SoC networks for real-timedistributed vision.

    The best way to get a good design example

    is to create our own.Video is a high-performance, low-power,

    cost-sensitive application.

    Vision is an important problem.

  • 8/9/2019 Lec Wolf Slides

    9/55

    Parapet goals

    Algorithms (gesture recognition).How do we adapt algorithms to the needs of real-

    time embedded video.

    Distributed systems.Communicating cameras.

    Embedded software.Middleware, code optimization.

    SoC architecture.Heterogeneous multiprocessors.

  • 8/9/2019 Lec Wolf Slides

    10/55

    Smart cameras for smart

    rooms

    Coordinated cameras track subject:

  • 8/9/2019 Lec Wolf Slides

    11/55

    Why multiple cameras?

    Helps with occlusion.

    Can synthesize new views:

    Pan/focus of attention.

    Zoom/resolution.

    Aperture.

  • 8/9/2019 Lec Wolf Slides

    12/55

    Why distributed?

    Centralized server is impractical:

    Network bandwidth.

    Network power consumption.

    Latency for system adjustements (lighting,etc.)

    Smart cameras plays into strengths ofVLSI---increases volume.

  • 8/9/2019 Lec Wolf Slides

    13/55

    Ozer et al: human activity

    recognition algorithm

    Regionextraction Contourfollowing

    Ellipse

    fitting

    Graph

    matching

    HMM head

    HMM body

    HMM hand1

    HMM hand2

    Gestureclassifier

  • 8/9/2019 Lec Wolf Slides

    14/55

    Real-time analysis

  • 8/9/2019 Lec Wolf Slides

    15/55

    Original

    Region finding Ellipse fitting

  • 8/9/2019 Lec Wolf Slides

    16/55

    Tuning the smart camera

    software

    Initial C/Trimedia was direct translationfrom Matlab.

    Goals:

    Increase frame rate.

    Reduce latency.

    Identify bottlenecks for next-generationarchitecture.

  • 8/9/2019 Lec Wolf Slides

    17/55

    Real-time vs. just fast

    Real-time computing adheres toconstraints:

    Must perform at a given rate.

    To satisfy the rate, must minimize variationsin processing time.

  • 8/9/2019 Lec Wolf Slides

    18/55

    Stage times before

    optimization

    0.0%

    10.0%

    20.0%

    30.0%

    40.0%

    50.0%

    60.0%

    ProcessingTime(%)

    Region Contour Super Match

  • 8/9/2019 Lec Wolf Slides

    19/55

    Smart camera CPU times

    0 5 10 15 20 25 30 35 40 4526.5

    27

    27.5

    28

    28.5

    29

    29.5

    Frame number

    Processingtime

    (msec)

    Skin Detection

    0 5 10 15 20 25 30 35 40 4565.6

    65.8

    66

    66.2

    66.4

    66.6

    66.8

    67

    67.2

    Frame number

    Processingtime(msec)

    Contour Following

    Skin detection Contour detection

    67.2 ms29.5 ms

  • 8/9/2019 Lec Wolf Slides

    20/55

    Smart camera CPU times,

    contd.

    0 5 10 15 20 25 30 35 40 4550

    100

    150

    200

    250

    300

    350

    Frame number

    Processingtime(msec)

    Superellipse Fitting

    0 5 10 15 20 25 30 35 40 450

    5

    10

    15

    20

    25

    30

    35

    Frame number

    Processingtime(msec)

    Graph Matching

    Superellipse fitting Graph matching

    35 ms250 ms

  • 8/9/2019 Lec Wolf Slides

    21/55

    Normalized standard

    deviation of stage times

    0

    0.1

    0.2

    0.3

    0.4

    0.50.6

    0.7

    0.8

    0.9

    region contour super match all

  • 8/9/2019 Lec Wolf Slides

    22/55

    Optimizations

    Change the algorithm.

    Change the program structure.

    Change the instructions.

  • 8/9/2019 Lec Wolf Slides

    23/55

    Algorithmic changes

    Superellipses were expensive to fit andoverkill.

    Replaced with ellipse fitting.

    Improved adjacency algorithm.

  • 8/9/2019 Lec Wolf Slides

    24/55

    Region finding

    Operates on 3 x 3 window.

    Roughly linear in frame size.

    Sequential algorithm---window moves onepixel per step.

  • 8/9/2019 Lec Wolf Slides

    25/55

    Program changes

    Contour fitting is very control intensive:

    Compares local configurations of bits.

    Transformed into data-parallel operations

    for VLIW: 011

    0

    1

    0

    1

    1

    107 Table

    Control-orientedData-oriented

  • 8/9/2019 Lec Wolf Slides

    26/55

    Instruction changes

    Trimedia provides library of intrinsicfunctions that map onto Trimediainstruction sequences.

    Goal: eliminate branches.Special instructions.

    Loop unrolling.

  • 8/9/2019 Lec Wolf Slides

    27/55

    Before and after stage

    times

    0

    10

    20

    30

    40

    50

    60

    Processing

    Time(ms

    Region Contour SuperFit Match Total

    Original Optimized

  • 8/9/2019 Lec Wolf Slides

    28/55

    Results

    Before: 5 frames/sec.

    After: 31 frames/sec w/o HMM, 25frames/sec with HMM.

    Latency approx. 100 ms.

    Smaller variation in frame processing

    time.Intel port runs well.

    M lti

  • 8/9/2019 Lec Wolf Slides

    29/55

    Multiprocessor

    architectures for video

    Interested in high-speed videoprocessing.

    150 frames/sec.

    Want reasonably low-power operation forpervasive applications.

  • 8/9/2019 Lec Wolf Slides

    30/55

    High-speed smart cameras

    High frame rates provide better motioncapture.

    Frame rate of 150 frames/sec is

    considered desirable.

    Stanford CMOS camera can digitize at

    10,000 frames/sec.

    Wh h t

  • 8/9/2019 Lec Wolf Slides

    31/55

    Why heterogeneous

    architectures make sense

    Data

    volumeData

    abstraction

    pixel processing

    principal component analysis,

    hidden Markov models

    Edge extractionVLIWRISC

  • 8/9/2019 Lec Wolf Slides

    32/55

    Algorithm flow

    Region

    extraction

    Contour

    following

    Ellipse

    fitting

    Graph

    matching

    HMM head

    HMM body

    HMM hand1

    HMM hand2

    Gestureclassifier

  • 8/9/2019 Lec Wolf Slides

    33/55

    Memory structure

    Feed-forward data communication.

    Not much global memory required.

    Allocating memory depends on datavolumes, access patterns, flexibility.

    A i ti

  • 8/9/2019 Lec Wolf Slides

    34/55

    Average processing time

    by stage

    0.00E+00

    2.00E+05

    4.00E+05

    6.00E+05

    8.00E+051.00E+06

    1.20E+06

    1.40E+06

    1.60E+061.80E+06

    clock_

    cycles/frame

    region contour ellipse match hmm

    algorithm stages

  • 8/9/2019 Lec Wolf Slides

    35/55

    Average IPC by stage

    0

    2

    4

    6

    IPC

    region contour ellipse match hmm

    Algorithm Stages

    Ti h VLIW

  • 8/9/2019 Lec Wolf Slides

    36/55

    Tiehans VLIW

    implementation

    Unroll loop to perform multiplecomparisons in parallel.

    Pack results into bit vector to address

    results table.

    Register file, cache provide for reuse of

    pixel values.

  • 8/9/2019 Lec Wolf Slides

    37/55

    Contour crawler machine

    Hardware implementation of VLIW code:

    memory

    Q

    Q

    Q

    FSM

  • 8/9/2019 Lec Wolf Slides

    38/55

    Crawler and memory

    Crawler performance depends on memorysystem.

    Access patterns vary in 2 dimensions:

    1 2 3

    X 4

    567

    8

  • 8/9/2019 Lec Wolf Slides

    39/55

    Memory system design

    Want to minimize number of partitions toreduce row/column overhead.

    Only memory organization that allows for

    all parallel accesses is one-word partition.

    Assume we fetch one row or column at a

    time---3 fetches/cycle.

  • 8/9/2019 Lec Wolf Slides

    40/55

    Single contour crawler

    Assuming row/column access pattern,crawler is faster than VLIW by a relativelysmall constant.

  • 8/9/2019 Lec Wolf Slides

    41/55

    Multiple crawlers

    Assuming we can patch together contours, we

    can start multiple crawlers.

    Multiple crawler performance is limited bymemory.Multiple crawlers memory accesses can conflict.

  • 8/9/2019 Lec Wolf Slides

    42/55

    Full-frame SIMD

    Can build a large SIMD array with oneprocessor per pixel.

    Area*delay:

    Speed is roughly constant.

    PE is probably about the same size as the

    crawler.Not clear it is worth the silicon.

  • 8/9/2019 Lec Wolf Slides

    43/55

    Heterogeneous system

    Region:

    Stream processor with current algorithm.Stream processor + RISC for others.

    Contour:Crawler.

    Ellipse:Superscalar/RISC.

    Graph:RISC.

  • 8/9/2019 Lec Wolf Slides

    44/55

    Stage pipelining

    Significant amount of time in non-streaming

    stages.

    010

    20

    30

    40

    50

    60

    ProcessingTime(m

    s)

    Region Contour SuperFit Match Total

    Original Optimized

  • 8/9/2019 Lec Wolf Slides

    45/55

    Heterogeneous vs. VLIW

    VLIW:

    Off-the-shelf IP.

    Easy to program.

    10 mm2 in 0.13 micron.

    Heterogeneous:

    Requires more design of blocks, memory.Pipelineable for 2.3X speed-up.

    Heterogeneous

  • 8/9/2019 Lec Wolf Slides

    46/55

    Heterogeneous

    multiprocessor size

    stage PE

    area

    (mm^2)

    background

    MIPS32

    4Km 0.9

    contour custom 0.001

    ellipse, graph MIPS64 5Kf 5

    total frame processor 5.901

    classification MIPS64 5Kf 5

    number of frame processors 3

    grand total 22.703

  • 8/9/2019 Lec Wolf Slides

    47/55

    Thoughts on networking

    Dont want to send all the video all of thetime.Need to send bursts, possibly lower frame

    rate or resolution.

    Distributed control requires low latency.

    Kodak is developing a low-cost wireless

    network for ad-hoc camera networks.Consider 802.11 too expensive.

  • 8/9/2019 Lec Wolf Slides

    48/55

    Networks-on-chips

    SoC trends:Lots of transistors.

    Designs built from

    multiple IP blocks.Structured as

    heterogeneous

    multiprocessors.Why not bundle

    interconnect as IP?

    CPU CPU

    CPU CPU

    mem mem

    Advantages of NoC based

  • 8/9/2019 Lec Wolf Slides

    49/55

    Advantages of NoC-based

    design

    More structured wiring---eases physicaldesign problems.

    Encourages design reuse of interconnect.

    Structures and simplifies applications byproviding communication primitives.

    Leverages networked multiprocessordesign knowledge.

    Challenges in NoC based

  • 8/9/2019 Lec Wolf Slides

    50/55

    Challenges in NoC-based

    design

    What networks, protocols, etc?

    Physical layer---rich but crummyinterconnect.

    Application-specific design: what is worthmodifying and what should be left alone?

  • 8/9/2019 Lec Wolf Slides

    51/55

    Wave pipelining history

    L. Cotton introduced wave pipelining in 1969.Used in Cray-1.

    Pipelined memory access in Rambus.

    Difficult to design.

    Multiple data elements in flight.

    F1 F2 F3Buffer

    Buffer

    Pipelining with latch Wave pipelining

    Data1 Data2 Data3

    F1 F2 F3Buffer

    Buffer

    Data1 Data2 Data3

  • 8/9/2019 Lec Wolf Slides

    52/55

    Simulation model

    Cadence Spectre 10000m interconnection

    Inverters are evenly spaced

    Worst case and best case 0.25m Al (width, space)=(1.2m, 2.0m)

    [KAH98]

    Loa

    d

    Load

    Load

    wire model

    R/3 R/3 R/3

    C/6 C/3 C/3 C/6

  • 8/9/2019 Lec Wolf Slides

    53/55

    Simulation results (I)

    When inverter size is 100md=379ps, t=226ps

    When inverter size reduces to 40mN=3.28, t=325ps On one end, throughput is 70% higher; on the other end,

    save about 60% inverter area; in middle, increase

    throughput while save area.

    0

    100

    200

    300

    400

    500

    600

    700

    120 110 100 90 80 70 60 50 40 30

    Inverter sizes (um)

    Delay

    s(ps)

    d t

  • 8/9/2019 Lec Wolf Slides

    54/55

    Simulation results (II)

    When inverter size is 100mEt=20.5pJ/bit, Ew=17.1pJ/bit

    When inverter size reduces to 40m

    Ew=9.88pJ/bit

    Wave pipelining could save energy.

    0

    5

    10

    15

    20

    25

    120 110 100 90 80 70 60 50 40 30

    Inverter sizes (um)

    Averageenergiespe

    rbit(pJ/bit) Ew

    Et

  • 8/9/2019 Lec Wolf Slides

    55/55

    Summary

    Multimedia applications are already morecomplex and will become more so:

    multiple algorithms;

    complex control and data.

    Instruction-level parallelism helps, but

    isnt everything.