speech recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/dlhlp20/asr2 (v6).pdfย ยท recognition speech...

Post on 14-Aug-2020

9 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

SpeechRecognitionHUNG-YI LEE ๆŽๅฎๆฏ…

LAS: ๅฐฑๆ˜ฏ seq2seq

CTC: decoder ๆ˜ฏ linear classifier ็š„ seq2seq

RNA: ่ผธๅ…ฅไธ€ๅ€‹ๆฑ่ฅฟๅฐฑ่ฆ่ผธๅ‡บไธ€ๅ€‹ๆฑ่ฅฟ็š„ seq2seq

RNN-T: ่ผธๅ…ฅไธ€ๅ€‹ๆฑ่ฅฟๅฏไปฅ่ผธๅ‡บๅคšๅ€‹ๆฑ่ฅฟ็š„ seq2seq

Neural Transducer: ๆฏๆฌก่ผธๅ…ฅไธ€ๅ€‹ window ็š„ RNN-T

MoCha: window ็งปๅ‹•ไผธ็ธฎ่‡ชๅฆ‚็š„ Neural Transducer

Last Time

Two Points of Views

Source of image: ๆŽ็ณๅฑฑ่€ๅธซใ€Šๆ•ธไฝ่ชž้Ÿณ่™•็†ๆฆ‚่ซ–ใ€‹

Seq-to-seq HMM

Hidden Markov Model (HMM)

SpeechRecognition

speech text

X Y

Yโˆ— = ๐‘Ž๐‘Ÿ๐‘”maxY

๐‘ƒ Y|๐‘‹

= ๐‘Ž๐‘Ÿ๐‘”maxY

๐‘ƒ ๐‘‹|Y ๐‘ƒ Y

๐‘ƒ ๐‘‹

= ๐‘Ž๐‘Ÿ๐‘”maxY

๐‘ƒ ๐‘‹|Y ๐‘ƒ Y

๐‘ƒ ๐‘‹|Y : Acoustic Model

๐‘ƒ Y : Language Model

HMM

Decode

HMM

hh w aa t d uw y uw th ih ng k

what do you think

t-d+uw1 t-d+uw2 t-d+uw3

โ€ฆโ€ฆ t-d+uw d-uw+y uw-y+uw y-uw+th โ€ฆโ€ฆ

d-uw+y1 d-uw+y2 d-uw+y3

Phoneme:

Tri-phone:

State:

๐‘ƒ ๐‘‹|Y ๐‘ƒ ๐‘‹|๐‘†

A token sequence Y corresponds to a sequence of states S

HMM๐‘ƒ ๐‘‹|Y ๐‘ƒ ๐‘‹|๐‘†

A sentence Y corresponds to a sequence of states S

๐‘Ž ๐‘ ๐‘

Start End

๐‘† ๐‘‹

HMM๐‘ƒ ๐‘‹|Y ๐‘ƒ ๐‘‹|๐‘†

A sentence Y corresponds to a sequence of states S

Transition Probability

๐‘ ๐‘|๐‘Ž๐‘Ž ๐‘

Emission Probability

t-d+uw1

d-uw+y3

P(x|โ€t-d+uw1โ€)

P(x|โ€d-uw+y3โ€)

Gaussian Mixture Model (GMM)

Probability from one state to another

HMM โ€“ Emission Probability

โ€ข Too many states โ€ฆโ€ฆ

P(x|โ€t-d+uw3โ€)P(x|โ€d-uw+y3โ€)

Tied-state

Same Addresspointer pointer

็ต‚ๆฅตๅž‹ๆ…‹: Subspace GMM [Povey, et al., ICASSPโ€™10]

(Geoffrey Hinton also published deep learning for ASR in the same conference)

[Mohamed , et al., ICASSPโ€™10]

P๐œƒ ๐‘‹|๐‘† =?

transition

๐‘ ๐‘|๐‘Ž

๐‘Ž ๐‘emission(GMM)

โ„Žโˆˆ๐‘Ž๐‘™๐‘–๐‘”๐‘›(๐‘†)

๐‘ƒ ๐‘‹|โ„Ž

alignment

which state generates which vector

โ„Ž1 = ๐‘Ž๐‘Ž๐‘๐‘๐‘๐‘

๐‘Ž ๐‘ ๐‘

โ„Ž2 = ๐‘Ž๐‘๐‘๐‘๐‘๐‘

โ„Ž = ๐‘Ž๐‘๐‘๐‘๐‘๐‘

๐‘ƒ ๐‘‹|โ„Ž1

๐‘ƒ ๐‘‹|โ„Ž2

โ„Ž = ๐‘Ž๐‘๐‘๐‘๐‘

๐‘Ž ๐‘Ž ๐‘ ๐‘ ๐‘ ๐‘๐‘ƒ ๐‘Ž|๐‘Ž ๐‘ƒ ๐‘|๐‘Ž ๐‘ƒ ๐‘|๐‘ ๐‘ƒ ๐‘|๐‘ ๐‘ƒ ๐‘|๐‘

๐‘ฅ1 ๐‘ฅ2 ๐‘ฅ3 ๐‘ฅ4 ๐‘ฅ5 ๐‘ฅ6

๐‘ƒ ๐‘ฅ1|๐‘Ž ๐‘ƒ ๐‘ฅ2|๐‘Ž ๐‘ƒ ๐‘ฅ3|๐‘ ๐‘ƒ ๐‘ฅ4|๐‘ ๐‘ƒ ๐‘ฅ5|๐‘ ๐‘ƒ ๐‘ฅ6|๐‘

How to use Deep Learning?

Method 1: Tandem

โ€ฆโ€ฆ โ€ฆโ€ฆ

๐‘ฅ๐‘–

Size of output layer= No. of states DNN

โ€ฆโ€ฆ

โ€ฆโ€ฆ

Last hidden layer or bottleneck layer are also possible.

New acoustic feature for HMM

๐‘(๐‘Ž|๐‘ฅ๐‘–) ๐‘(๐‘|๐‘ฅ๐‘–) ๐‘(๐‘|๐‘ฅ๐‘–)

State classifier

Method 2: DNN-HMM Hybrid

DNN

๐‘ƒ ๐‘Ž|๐‘ฅโ€ฆโ€ฆ

๐‘ฅ

๐‘ƒ ๐‘ฅ|๐‘Ž

๐‘ƒ ๐‘ฅ|๐‘Ž =๐‘ƒ ๐‘ฅ, ๐‘Ž

๐‘ƒ ๐‘Ž=๐‘ƒ ๐‘Ž|๐‘ฅ ๐‘ƒ ๐‘ฅ

๐‘ƒ ๐‘ŽCount from training data

CNN, LSTM โ€ฆ

DNN output

How to train a state classifier?

Train HMM-GMM model

state sequence: { a , b , c }

Acoustic features:

Utterance + Label(without alignment)

Utterance + Label(aligned)

state sequence:

Acoustic features:

a a a b b c c

How to train a state classifier?

Train HMM-GMM model

Utterance + Label(without alignment)

Utterance + Label(aligned)

DNN1

How to train a state classifier?

Utterance + Label(without alignment)

Utterance + Label(aligned)

DNN2DNN1realignment

Human Parity!

โ€ข ๅพฎ่ปŸ่ชž้Ÿณ่พจ่ญ˜ๆŠ€่ก“็ช็ ด้‡ๅคง้‡Œ็จ‹็ข‘๏ผšๅฐ่ฉฑ่พจ่ญ˜่ƒฝๅŠ›้”ไบบ้กžๆฐดๆบ–๏ผ(2016.10)

โ€ข https://www.bnext.com.tw/article/41414/bn-2016-10-19-020437-216

โ€ข IBM vs Microsoft: 'Human parity' speech recognition record changes hands again (2017.03)

โ€ข http://www.zdnet.com/article/ibm-vs-microsoft-human-parity-speech-recognition-record-changes-hands-again/

Machine 5.9% v.s. Human 5.9%

Machine 5.5% v.s. Human 5.1%

[Yu, et al., INTERSPEECHโ€™16]

[Saon, et al., INTERSPEECHโ€™17]

Very Deep

[Yu, et al., INTERSPEECHโ€™16]

Back to End-to-end

LAS

๐‘ƒ ๐‘Œ|X =?

Yโˆ— = ๐‘Ž๐‘Ÿ๐‘”maxY

๐‘™๐‘œ๐‘”๐‘ƒ Y|๐‘‹

๐‘ฅ4๐‘ฅ3๐‘ฅ2๐‘ฅ1

โ€ข LAS directly computes ๐‘ƒ ๐‘Œ|X

๐‘Ž ๐‘

๐‘ƒ ๐‘Œ|X = ๐‘ ๐‘Ž|๐‘‹ ๐‘ ๐‘|๐‘Ž, ๐‘‹ โ€ฆ๐‘ง0

๐‘0

๐‘ง1

Size V

๐‘1

๐‘ง1 ๐‘ง2

a

๐‘ ๐‘Ž ๐‘ ๐‘

๐‘2

๐‘ง3

b

๐‘ ๐ธ๐‘‚๐‘†

Beam Search

๐œƒโˆ— = ๐‘Ž๐‘Ÿ๐‘”max๐œƒ

๐‘™๐‘œ๐‘”P๐œƒ ๐‘Œ|๐‘‹

Decoding:

Training:

CTC, RNN-T

๐‘ƒ ๐‘Œ|X =?

๐‘ฅ4๐‘ฅ3๐‘ฅ2๐‘ฅ1

โ€ข LAS directly computes ๐‘ƒ ๐‘Œ|X

๐‘Ž ๐‘

๐‘ƒ ๐‘Œ|X = ๐‘ ๐‘Ž|๐‘‹ ๐‘ ๐‘|๐‘Ž, ๐‘‹ โ€ฆ

โ€ข CTC and RNN-T need alignment

Encoder

โ„Ž4โ„Ž3โ„Ž2โ„Ž1

โ„Ž = ๐‘Ž ๐œ™ ๐‘ ๐œ™๐‘ƒ โ„Ž|๐‘‹

๐‘ ๐‘Ž ๐‘ ๐‘๐‘ ๐œ™ ๐‘ ๐œ™

P Y|๐‘‹ =

โ„Žโˆˆ๐‘Ž๐‘™๐‘–๐‘”๐‘› ๐‘Œ

๐‘ƒ โ„Ž|๐‘‹

โ†’ ๐‘Ž ๐‘

Yโˆ— = ๐‘Ž๐‘Ÿ๐‘”maxY

๐‘™๐‘œ๐‘”๐‘ƒ Y|๐‘‹Beam Search

๐œƒโˆ— = ๐‘Ž๐‘Ÿ๐‘”max๐œƒ

๐‘™๐‘œ๐‘”P๐œƒ ๐‘Œ|๐‘‹

Decoding:

Training:

HMM, CTC, RNN-T

P๐œƒ ๐‘‹|๐‘† =

โ„Žโˆˆ๐‘Ž๐‘™๐‘–๐‘”๐‘› ๐‘†

๐‘ƒ ๐‘‹|โ„Ž

๐œƒโˆ— = ๐‘Ž๐‘Ÿ๐‘”max๐œƒ

๐‘™๐‘œ๐‘”P๐œƒ ๐‘Œ|๐‘‹

Yโˆ— = ๐‘Ž๐‘Ÿ๐‘”maxY

๐‘™๐‘œ๐‘”๐‘ƒ Y|๐‘‹

P๐œƒ Y|๐‘‹ =

โ„Žโˆˆ๐‘Ž๐‘™๐‘–๐‘”๐‘› ๐‘Œ

๐‘ƒ โ„Ž|๐‘‹

HMM CTC, RNN-T

๐œ•P๐œƒ ๐‘Œ|๐‘‹

๐œ•๐œƒ=?

1. Enumerate all the possible alignments

2. How to sum over all the alignments

3. Training:

4. Testing (Inference, decoding):

HMM, CTC, RNN-T

๐‘ƒ ๐‘‹|๐‘† =

โ„Žโˆˆ๐‘Ž๐‘™๐‘–๐‘”๐‘› ๐‘†

๐‘ƒ ๐‘‹|โ„Ž

๐œƒโˆ— = ๐‘Ž๐‘Ÿ๐‘”max๐œƒ

๐‘™๐‘œ๐‘”P๐œƒ ๐‘Œ|๐‘‹

Yโˆ— = ๐‘Ž๐‘Ÿ๐‘”maxY

๐‘™๐‘œ๐‘”๐‘ƒ Y|๐‘‹

๐‘ƒ Y|๐‘‹ =

โ„Žโˆˆ๐‘Ž๐‘™๐‘–๐‘”๐‘› ๐‘Œ

๐‘ƒ โ„Ž|๐‘‹

HMM CTC, RNN-T

๐œ•P๐œƒ ๐‘Œ|๐‘‹

๐œ•๐œƒ=?

1. Enumerate all the possible alignments

2. How to sum over all the alignments

3. Training:

4. Testing (Inference, decoding):

All the alignments

HMM

CTC

RNN-T

LAS

ไฝ ๅ€‘ๅœจๅฟ™ไป€้บผโ˜บ

c a t c c a a a t c a a a a t

c ๐œ™ a a t t

add ๐œ™

duplicate to length ๐‘‡

๐œ™ c a ๐œ™ t ๐œ™

add ๐œ™ x ๐‘‡

c ๐œ™ ๐œ™ ๐œ™ a ๐œ™ ๐œ™ t ๐œ™

c ๐œ™ ๐œ™ a ๐œ™ ๐œ™ t ๐œ™ ๐œ™

๐‘‡ = 6

SpeechRecognition

speech text

c a t

๐‘ = 3

โ€ฆ

duplicatec a t

to length ๐‘‡

โ€ฆ

c a t โ€ฆโ€ฆ

๐‘Œ โ„Ž โˆˆ ๐‘Ž๐‘™๐‘–๐‘”๐‘›(๐‘Œ)

HMM c a t c c a a a t c a a a a tduplicate to length ๐‘‡

โ€ฆ

๐‘ฅ1 ๐‘ฅ2 ๐‘ฅ3 ๐‘ฅ4 ๐‘ฅ5 ๐‘ฅ6

c

a

t

nexttoken

duplicate

For n = 1 to ๐‘

output the n-th token ๐‘ก๐‘› times

constraint: ๐‘ก1 + ๐‘ก2 +โ‹ฏ๐‘ก๐‘ = ๐‘‡, ๐‘ก๐‘› > 0Trellis Graph

c c c

a

a

t

aa

t

t t

HMM c a t c c a a a t c a a a a tduplicate to length ๐‘‡

โ€ฆ

๐‘ฅ1 ๐‘ฅ2 ๐‘ฅ3 ๐‘ฅ4 ๐‘ฅ5 ๐‘ฅ6

c

a

t

nexttoken

duplicate

For n = 1 to ๐‘

output the n-th token ๐‘ก๐‘› times

constraint: ๐‘ก1 + ๐‘ก2 +โ‹ฏ๐‘ก๐‘ = ๐‘‡, ๐‘ก๐‘› > 0Trellis Graph

c c c c cc

For n = 1 to ๐‘

output the n-th token ๐‘ก๐‘› times

constraint:

output โ€œ๐œ™โ€ ๐‘๐‘› times

๐‘ก๐‘› > 0 ๐‘๐‘› โ‰ฅ 0

output โ€œ๐œ™โ€ ๐‘0 times

๐‘ก1 + ๐‘ก2 +โ‹ฏ๐‘ก๐‘ +

๐‘0 + ๐‘1 +โ‹ฏ๐‘๐‘ = ๐‘‡

CTC c ๐œ™ a a t t

add ๐œ™

๐œ™ c a ๐œ™ t ๐œ™duplicate

c a t

to length ๐‘‡

โ€ฆ

๐‘ฅ1 ๐‘ฅ2 ๐‘ฅ3 ๐‘ฅ4 ๐‘ฅ5 ๐‘ฅ6

๐œ™

c

๐œ™

a

๐œ™

t

๐œ™

next token

duplicate

insert ๐œ™

CTC c ๐œ™ a a t t

add ๐œ™

๐œ™ c a ๐œ™ t ๐œ™duplicate

c a t

to length ๐‘‡

โ€ฆ

(๐œ™ can be skipped)

c

CTC c ๐œ™ a a t t

add ๐œ™

๐œ™ c a ๐œ™ t ๐œ™duplicate

c a t

to length ๐‘‡

โ€ฆ

๐‘ฅ1 ๐‘ฅ2 ๐‘ฅ3 ๐‘ฅ4 ๐‘ฅ5 ๐‘ฅ6

๐œ™

c

๐œ™

a

๐œ™

t

๐œ™

duplicate ๐œ™

next token

cannot skip any token

๐œ™

๐‘ฅ1 ๐‘ฅ2 ๐‘ฅ3 ๐‘ฅ4 ๐‘ฅ5 ๐‘ฅ6

๐œ™

c

๐œ™

a

๐œ™

t

๐œ™

next token

duplicate

insert ๐œ™

CTC c ๐œ™ a a t t

add ๐œ™

๐œ™ c a ๐œ™ t ๐œ™duplicate

c a t

to length ๐‘‡

โ€ฆ

duplicate

next token

CTC c ๐œ™ a a t t

add ๐œ™

๐œ™ c a ๐œ™ t ๐œ™duplicate

c a t

to length ๐‘‡

โ€ฆ

๐‘ฅ1 ๐‘ฅ2 ๐‘ฅ3 ๐‘ฅ4 ๐‘ฅ5 ๐‘ฅ6

๐œ™

c

๐œ™

a

๐œ™

t

๐œ™

cc

a

t

๐œ™

a

๐œ™

t

๐œ™๐œ™

c

๐œ™

CTC c ๐œ™ a a t t

add ๐œ™

๐œ™ c a ๐œ™ t ๐œ™duplicate

c a t

to length ๐‘‡

โ€ฆ

๐‘ฅ1 ๐‘ฅ2 ๐‘ฅ3 ๐‘ฅ4 ๐‘ฅ5 ๐‘ฅ6

๐œ™

c

๐œ™

a

๐œ™

t

๐œ™

c

a

t

๐œ™๐œ™

c c ca

t

c

๐œ™

CTC c ๐œ™ a a t t

add ๐œ™

๐œ™ c a ๐œ™ t ๐œ™duplicate

c a t

to length ๐‘‡

โ€ฆ

๐‘ฅ1 ๐‘ฅ2 ๐‘ฅ3 ๐‘ฅ4 ๐‘ฅ5 ๐‘ฅ6

๐œ™

s

๐œ™

e

๐œ™

e

๐œ™

next token

duplicate

insert ๐œ™

โ€ฆ ee โ€ฆ โ†’ e

Exception: when the next token is the same token

c

๐œ™

RNN-Tadd ๐œ™ x ๐‘‡

c ๐œ™ ๐œ™ ๐œ™ a ๐œ™ ๐œ™ t ๐œ™

c ๐œ™ ๐œ™ a ๐œ™ ๐œ™ t ๐œ™ ๐œ™c a t โ€ฆโ€ฆ

For n = 1 to ๐‘

output the n-th token 1 times

constraint:

output โ€œ๐œ™โ€ ๐‘๐‘› times

output โ€œ๐œ™โ€ ๐‘0 times

๐‘0 + ๐‘1 +โ‹ฏ๐‘๐‘ = ๐‘‡

c a t

Put some ๐œ™ (option)

Put some ๐œ™ (option)

Put some ๐œ™ (option)

Put some ๐œ™

at least once

๐‘๐‘ > 0

๐‘๐‘› โ‰ฅ 0 for n = 1 to ๐‘ โˆ’ 1

๐‘ฅ1 ๐‘ฅ2 ๐‘ฅ3 ๐‘ฅ4 ๐‘ฅ5 ๐‘ฅ6

c

a

t

RNN-Tadd ๐œ™ x ๐‘‡

c ๐œ™ ๐œ™ ๐œ™ a ๐œ™ ๐œ™ t ๐œ™

c ๐œ™ ๐œ™ a ๐œ™ ๐œ™ t ๐œ™ ๐œ™c a t โ€ฆโ€ฆ

output token

Insert ๐œ™

๐œ™

๐œ™

c

๐‘ฅ1 ๐‘ฅ2 ๐‘ฅ3 ๐‘ฅ4 ๐‘ฅ5 ๐‘ฅ6

c

a

t

output token

Insert ๐œ™

RNN-Tadd ๐œ™ x ๐‘‡

c ๐œ™ ๐œ™ ๐œ™ a ๐œ™ ๐œ™ t ๐œ™

c ๐œ™ ๐œ™ a ๐œ™ ๐œ™ t ๐œ™ ๐œ™c a t โ€ฆโ€ฆ

๐œ™c

๐œ™ ๐œ™๐‘Ž

๐‘ก๐œ™

๐‘ก๐œ™ ๐œ™

c

a

๐œ™ ๐œ™ ๐œ™ ๐œ™ ๐œ™ ๐œ™

๐‘ฅ1 ๐‘ฅ2 ๐‘ฅ3 ๐‘ฅ4 ๐‘ฅ5 ๐‘ฅ6

c

a

t

output token

Insert ๐œ™

RNN-Tadd ๐œ™ x ๐‘‡

c ๐œ™ ๐œ™ ๐œ™ a ๐œ™ ๐œ™ t ๐œ™

c ๐œ™ ๐œ™ a ๐œ™ ๐œ™ t ๐œ™ ๐œ™c a t โ€ฆโ€ฆ

๐œ™

๐œ™ ๐œ™ ๐œ™ ๐œ™ ๐œ™

๐‘ก

c

a

๐œ™

c a tStart End

c a tStart End

๐œ™ ๐œ™ ๐œ™ ๐œ™

c a tStart End

๐œ™ ๐œ™ ๐œ™ ๐œ™

HMM

CTC

RNN-T

not gen not gen not gen

1. Enumerate all the possible alignments

HMM, CTC, RNN-T

๐‘ƒ ๐‘‹|Y =

โ„Žโˆˆ๐‘Ž๐‘™๐‘–๐‘”๐‘› ๐‘Œ

๐‘ƒ ๐‘‹|โ„Ž

๐œƒโˆ— = ๐‘Ž๐‘Ÿ๐‘”max๐œƒ

๐‘™๐‘œ๐‘”P๐œƒ ๐‘Œ|๐‘‹

Yโˆ— = ๐‘Ž๐‘Ÿ๐‘”maxY

๐‘™๐‘œ๐‘”๐‘ƒ Y|๐‘‹

๐‘ƒ Y|๐‘‹ =

โ„Žโˆˆ๐‘Ž๐‘™๐‘–๐‘”๐‘› ๐‘Œ

๐‘ƒ โ„Ž|๐‘‹

HMM CTC, RNN-T

๐œ•P๐œƒ ๐‘Œ|๐‘‹

๐œ•๐œƒ=?

2. How to sum over all the alignments

3. Training:

4. Testing (Inference, decoding):

This part is challenging.

Score Computation

๐‘ฅ1 ๐‘ฅ2 ๐‘ฅ3 ๐‘ฅ4 ๐‘ฅ5 ๐‘ฅ6

c

a

t

output token

Insert ๐œ™

๐œ™c

๐œ™ ๐œ™๐‘Ž

๐œ™๐‘ก๐œ™ ๐œ™

โ„Ž = ๐œ™ c ๐œ™ ๐œ™ a ๐œ™ t ๐œ™ ๐œ™

๐‘ƒ โ„Ž|๐‘‹

= ๐‘ƒ ๐œ™|๐‘‹

ร— ๐‘ƒ ๐‘|๐‘‹, ๐œ™

ร— ๐‘ƒ ๐œ™|๐‘‹, ๐œ™๐‘ โ€ฆโ€ฆ

๐œ™

โ„Ž1 โ„Ž2 โ„Ž2 โ„Ž3 โ„Ž4 โ„Ž4 โ„Ž5 โ„Ž5 โ„Ž6

๐‘1,0

๐‘

๐‘2,0

๐œ™

๐‘2,1

๐œ™

๐‘3,1

๐‘Ž

๐‘4,1

๐œ™

๐‘4,2

๐‘ก

๐‘5,2

๐œ™

๐‘5,3

๐œ™

๐‘6,3

c a t<BOS>

๐‘™0 ๐‘™1 ๐‘™2 ๐‘™3

๐‘™0 ๐‘™0 ๐‘™1 ๐‘™1 ๐‘™1 ๐‘™2 ๐‘™2 ๐‘™3๐‘™3

โ„Ž = ๐œ™ c ๐œ™ ๐œ™ a ๐œ™ t ๐œ™ ๐œ™

Score Computation

๐‘ฅ1 ๐‘ฅ2 ๐‘ฅ3 ๐‘ฅ4 ๐‘ฅ5 ๐‘ฅ6

c

a

t

๐‘1,0 ๐œ™

๐‘2,0 ๐‘

๐‘2,1 ๐œ™ ๐‘3,1 ๐œ™๐‘4,1 ๐‘Ž

๐‘4,2 ๐œ™๐‘5,2 ๐‘ก

๐‘5,3 ๐œ™ ๐‘6,3 ๐œ™

โ„Ž = ๐œ™ c ๐œ™ ๐œ™ a ๐œ™ t ๐œ™ ๐œ™

๐‘ƒ โ„Ž|๐‘‹

Score Computation

๐‘ฅ1 ๐‘ฅ2 ๐‘ฅ3 ๐‘ฅ4 ๐‘ฅ5 ๐‘ฅ6

c

a

t

c<B

OS>

a

๐‘™ 0๐‘™ 1

๐‘™ 2

Because ๐œ™ is not considered!

๐‘4,2 ๐œ™

๐‘4,2 ๐‘Ž๐‘4,2

๐œ™

โ„Ž1 โ„Ž2 โ„Ž2 โ„Ž3 โ„Ž4 โ„Ž4 โ„Ž5 โ„Ž5 โ„Ž6

๐‘1,0

๐‘

๐‘2,0

๐œ™

๐‘2,1

๐œ™

๐‘3,1

๐‘Ž

๐‘4,1

๐œ™

๐‘4,2

๐‘ก

๐‘5,2

๐œ™

๐‘5,3

๐œ™

๐‘6,3

c a t<BOS>

๐‘™0 ๐‘™1 ๐‘™2 ๐‘™3

๐‘™0 ๐‘™0 ๐‘™1 ๐‘™1 ๐‘™1 ๐‘™2 ๐‘™2 ๐‘™3๐‘™3

๐œ™

โ„Ž4 โ„Ž5 โ„Ž5 โ„Ž6

๐‘ ๐œ™ ๐œ™ ๐‘Ž ๐œ™

๐‘4,2

๐‘ก

๐‘5,2

๐œ™

๐‘5,3

๐œ™

๐‘6,3

c a t<BOS>

๐‘™0 ๐‘™1 ๐‘™2 ๐‘™3

๐‘™2 ๐‘™2 ๐‘™3๐‘™3๐œ™ ๐‘๐œ™ ๐œ™ ๐‘Ž

๐œ™๐‘ ๐œ™ ๐œ™๐‘Ž

๐‘ฅ1 ๐‘ฅ2 ๐‘ฅ3 ๐‘ฅ4 ๐‘ฅ5 ๐‘ฅ6

c

a

t

๐›ผ4,2

๐›ผ4,2 = ๐›ผ4,1๐‘4,1 ๐‘Ž + ๐›ผ3,2๐‘3,2 ๐œ™

๐›ผ4,1

๐›ผ3,2

generate โ€œaโ€

read ๐‘ฅ4

generate โ€œ๐œ™ โ€,

๐›ผ๐‘–,๐‘—:the summation of the scores of all the alignmentsthat read i-th acoustic features and output j-th tokens

๐›ผ๐‘–,๐‘—

โ„Žโˆˆ๐‘Ž๐‘™๐‘–๐‘”๐‘› ๐‘Œ

๐‘ƒ โ„Ž|๐‘‹

๐‘ฅ1 ๐‘ฅ2 ๐‘ฅ3 ๐‘ฅ4 ๐‘ฅ5 ๐‘ฅ6

c

a

t

๐›ผ4,2 = ๐›ผ4,1๐‘4,1 ๐‘Ž + ๐›ผ3,2๐‘3,2 ๐œ™

๐›ผ๐‘–,๐‘—:the summation of the scores of all the alignmentsthat read i-th acoustic features and output j-th tokens

You can compute summation of the scores of all the alignments.

1. Enumerate all the possible alignments

HMM, CTC, RNN-T

P๐œƒ ๐‘‹|Y =

โ„Žโˆˆ๐‘Ž๐‘™๐‘–๐‘”๐‘› ๐‘Œ

๐‘ƒ ๐‘‹|โ„Ž

๐œƒโˆ— = ๐‘Ž๐‘Ÿ๐‘”max๐œƒ

๐‘™๐‘œ๐‘”P๐œƒ ๐‘Œ|๐‘‹

Yโˆ— = ๐‘Ž๐‘Ÿ๐‘”maxY

๐‘™๐‘œ๐‘”๐‘ƒ Y|๐‘‹

P๐œƒ Y|๐‘‹ =

โ„Žโˆˆ๐‘Ž๐‘™๐‘–๐‘”๐‘› ๐‘Œ

๐‘ƒ โ„Ž|๐‘‹

HMM CTC, RNN-T

๐œ•P๐œƒ ๐‘Œ|๐‘‹

๐œ•๐œƒ=?

2. How to sum over all the alignments

3. Training:

4. Testing (Inference, decoding):

Training

๐œ™ c ๐œ™ ๐œ™ a ๐œ™ t ๐œ™ ๐œ™

๐‘1,0 ๐œ™ ๐‘2,0 ๐‘ ๐‘2,1 ๐œ™ ๐‘3,1 ๐œ™ ๐‘4,1 ๐‘Ž ๐‘4,2 ๐œ™ ๐‘5,2 ๐‘ก ๐‘5,3 ๐œ™ ๐‘6,3 ๐œ™

๐œƒโˆ— = ๐‘Ž๐‘Ÿ๐‘”max๐œƒ

๐‘™๐‘œ๐‘”๐‘ƒ ๐‘Œ|๐‘‹

๐‘ƒ ๐‘Œ|๐‘‹ =

โ„Ž

๐‘ƒ โ„Ž|๐‘‹

๐œ•๐‘ƒ ๐‘Œ|๐‘‹

๐œ•๐œƒ=?

๐‘ฅ1 ๐‘ฅ2 ๐‘ฅ3 ๐‘ฅ4 ๐‘ฅ5 ๐‘ฅ6

c

a

t

๐‘4,1 ๐‘Ž

๐‘3,2 ๐œ™

๐‘1,0 ๐œ™ ๐‘2,0 ๐‘ ๐‘2,1 ๐œ™ ๐‘3,1 ๐œ™ ๐‘4,1 ๐‘Ž ๐‘4,2 ๐œ™ ๐‘5,2 ๐‘ก ๐‘5,3 ๐œ™ ๐‘6,3 ๐œ™

๐‘ƒ ๐‘Œ|๐‘‹ =

โ„Ž

๐‘ƒ โ„Ž|๐‘‹

Each arrow is a component in ๐‘ƒ ๐‘Œ|๐‘‹ =

โ„Ž

๐‘ƒ โ„Ž|๐‘‹

Training

๐œ™ c ๐œ™ ๐œ™ a ๐œ™ t ๐œ™ ๐œ™

๐‘1,0 ๐œ™ ๐‘2,0 ๐‘ ๐‘2,1 ๐œ™ ๐‘3,1 ๐œ™ ๐‘4,1 ๐‘Ž ๐‘4,2 ๐œ™ ๐‘5,2 ๐‘ก ๐‘5,3 ๐œ™ ๐‘6,3 ๐œ™

๐œ•๐‘ƒ ๐‘Œ|๐‘‹

๐œ•๐œƒ=?

๐œƒโˆ— = ๐‘Ž๐‘Ÿ๐‘”max๐œƒ

๐‘™๐‘œ๐‘”๐‘ƒ ๐‘Œ|๐‘‹

๐‘ƒ ๐‘Œ|๐‘‹ =

โ„Ž

๐‘ƒ โ„Ž|๐‘‹

๐œƒ

๐‘4,1 ๐‘Ž

๐‘3,2 ๐œ™

โ€ฆโ€ฆ๐‘ƒ ๐‘Œ|๐‘‹

๐œ•๐‘4,1 ๐‘Ž

๐œ•๐œƒ

๐œ•๐‘ƒ ๐‘Œ|๐‘‹

๐œ•๐‘4,1 ๐‘Ž+๐œ•๐‘3,2 ๐œ™

๐œ•๐œƒ

๐œ•๐‘ƒ ๐‘Œ|๐‘‹

๐œ•๐‘3,2 ๐œ™+โ‹ฏ

๐‘ฅ1 ๐‘ฅ2 ๐‘ฅ3 ๐‘ฅ4 ๐‘ฅ5 ๐‘ฅ6

c

a

t

๐œƒ

๐‘4,1 ๐‘Ž

๐‘3,2 ๐œ™

โ€ฆโ€ฆ๐‘ƒ ๐‘Œ|๐‘‹

๐œ•๐‘ƒ ๐‘Œ|๐‘‹

๐œ•๐œƒ=?

Each arrow is a component

๐œ•๐‘4,1 ๐‘Ž

๐œ•๐œƒ

๐œ•๐‘ƒ ๐‘Œ|๐‘‹

๐œ•๐‘4,1 ๐‘Ž+๐œ•๐‘3,2 ๐œ™

๐œ•๐œƒ

๐œ•๐‘ƒ ๐‘Œ|๐‘‹

๐œ•๐‘3,2 ๐œ™+โ‹ฏ

๐‘4,1 ๐‘Ž

๐‘3,2 ๐œ™

๐‘1,0 ๐œ™

โ„Ž1 โ„Ž2 โ„Ž2 โ„Ž3 โ„Ž4

๐‘1,0

๐‘2,0 ๐‘

๐‘2,0

๐‘2,1 ๐œ™

๐‘2,1

๐‘3,1 ๐œ™

๐‘3,1

๐‘4,1 ๐‘Ž

๐‘4,1

c<BOS>

๐‘™0 ๐‘™1

๐‘™0 ๐‘™0 ๐‘™1 ๐‘™1 ๐‘™1

๐œ•๐‘4,1 ๐‘Ž

๐œ•๐œƒ=?

Backpropagation (through time)

To encoder

๐‘ƒ ๐‘Œ|๐‘‹ =

โ„Ž ๐‘ค๐‘–๐‘กโ„Ž ๐‘4,1 ๐‘Ž

๐‘ƒ โ„Ž|๐‘‹ +

โ„Ž ๐‘ค๐‘–๐‘กโ„Ž๐‘œ๐‘ข๐‘ก ๐‘4,1 ๐‘Ž

๐‘ƒ โ„Ž|๐‘‹

=

โ„Ž ๐‘ค๐‘–๐‘กโ„Ž ๐‘4,1 ๐‘Ž

๐‘ƒ โ„Ž|๐‘‹

๐‘4,1 ๐‘Ž

๐œ•๐‘ƒ ๐‘Œ|๐‘‹

๐œ•๐œƒ=?

๐œ•๐‘4,1 ๐‘Ž

๐œ•๐œƒ

๐œ•๐‘ƒ ๐‘Œ|๐‘‹

๐œ•๐‘4,1 ๐‘Ž+๐œ•๐‘3,2 ๐œ™

๐œ•๐œƒ

๐œ•๐‘ƒ ๐‘Œ|๐‘‹

๐œ•๐‘3,2 ๐œ™+โ‹ฏ

๐‘4,1 ๐‘Ž ร— ๐‘œ๐‘กโ„Ž๐‘’๐‘Ÿ

๐œ•๐‘ƒ ๐‘Œ|๐‘‹

๐œ•๐‘4,1 ๐‘Ž

=1

๐‘4,1 ๐‘Ž

โ„Ž ๐‘ค๐‘–๐‘กโ„Ž ๐‘4,1 ๐‘Ž

๐‘ƒ โ„Ž|๐‘‹

=

โ„Ž ๐‘ค๐‘–๐‘กโ„Ž ๐‘4,1 ๐‘Ž

๐‘œ๐‘กโ„Ž๐‘’๐‘Ÿ

๐‘ฅ1 ๐‘ฅ2 ๐‘ฅ3 ๐‘ฅ4 ๐‘ฅ5 ๐‘ฅ6

c

a

t

๐›ฝ4,2

๐›ฝ4,2 = ๐›ฝ4,3๐‘4,2 ๐‘ก + ๐›ฝ5,2๐‘4,2 ๐œ™

generate โ€œtโ€

read ๐‘ฅ5

generate โ€œ๐œ™โ€

๐›ฝ๐‘–,๐‘—:the summation of the score of all the alignmentsstaring from i-th acoustic features and j-th tokens

๐›ฝ5,2

๐›ฝ4,3

๐‘ฅ1 ๐‘ฅ2 ๐‘ฅ3 ๐‘ฅ4 ๐‘ฅ5 ๐‘ฅ6

c

a

t

๐›ฝ4,2

๐›ผ4,1

๐‘4,1 ๐‘Ž

๐œ•๐‘ƒ ๐‘Œ|๐‘‹

๐œ•๐‘4,1 ๐‘Ž=

1

๐‘4,1 ๐‘Ž

๐‘Ž ๐‘ค๐‘–๐‘กโ„Ž ๐‘4,1 ๐‘Ž

๐‘ƒ โ„Ž|๐‘‹ ๐›ผ4,1 ๐‘4,1 ๐‘Ž ๐›ฝ4,2

๐œ•๐‘ƒ ๐‘Œ|๐‘‹

๐œ•๐œƒ=?

๐œ•๐‘4,1 ๐‘Ž

๐œ•๐œƒ

๐œ•๐‘ƒ ๐‘Œ|๐‘‹

๐œ•๐‘4,1 ๐‘Ž+๐œ•๐‘3,2 ๐œ™

๐œ•๐œƒ

๐œ•๐‘ƒ ๐‘Œ|๐‘‹

๐œ•๐‘3,2 ๐œ™+โ‹ฏ๐›ผ4,1๐›ฝ4,2

1. Enumerate all the possible alignments

HMM, CTC, RNN-T

P๐œƒ ๐‘‹|Y =

โ„Žโˆˆ๐‘Ž๐‘™๐‘–๐‘”๐‘› ๐‘Œ

๐‘ƒ ๐‘‹|โ„Ž

๐œƒโˆ— = ๐‘Ž๐‘Ÿ๐‘”max๐œƒ

๐‘™๐‘œ๐‘”P๐œƒ ๐‘Œ|๐‘‹

Yโˆ— = ๐‘Ž๐‘Ÿ๐‘”maxY

๐‘™๐‘œ๐‘”๐‘ƒ Y|๐‘‹

P๐œƒ Y|๐‘‹ =

โ„Žโˆˆ๐‘Ž๐‘™๐‘–๐‘”๐‘› ๐‘Œ

๐‘ƒ โ„Ž|๐‘‹

HMM CTC, RNN-T

๐œ•P๐œƒ ๐‘Œ|๐‘‹

๐œ•๐œƒ=?

2. How to sum over all the alignments

3. Training:

4. Testing (Inference, decoding):

Yโˆ— = ๐‘Ž๐‘Ÿ๐‘”maxY

๐‘™๐‘œ๐‘”P Y|๐‘‹

= ๐‘Ž๐‘Ÿ๐‘”maxY

๐‘™๐‘œ๐‘”

โ„Žโˆˆ๐‘Ž๐‘™๐‘–๐‘”๐‘› ๐‘Œ

๐‘ƒ โ„Ž|๐‘‹

โ‰ˆ ๐‘Ž๐‘Ÿ๐‘”maxY

maxโ„Žโˆˆ๐‘Ž๐‘™๐‘–๐‘”๐‘› ๐‘Œ

๐‘™๐‘œ๐‘”๐‘ƒ โ„Ž|๐‘‹

maxโ„Žโˆˆ๐‘Ž๐‘™๐‘–๐‘”๐‘› ๐‘Œ

๐‘ƒ โ„Ž|๐‘‹็†ๆƒณ

โ„Žโˆ— = ๐‘Ž๐‘Ÿ๐‘” maxโ„Ž

๐‘™๐‘œ๐‘”๐‘ƒ โ„Ž|๐‘‹

Yโˆ— = ๐‘Ž๐‘™๐‘–๐‘”๐‘›โˆ’1 โ„Žโˆ—

Decoding

โ„Ž = ๐œ™ c ๐œ™ ๐œ™ a ๐œ™ t ๐œ™ ๐œ™ โ€ฆ

๐‘ƒ โ„Ž|๐‘‹ = ๐‘ƒ โ„Ž1|๐‘‹ ๐‘ƒ โ„Ž2|๐‘‹, โ„Ž1 ๐‘ƒ โ„Ž3|๐‘‹, โ„Ž1, โ„Ž2 โ€ฆ

โ„Ž1 โ„Ž2 โ„Ž3

็พๅฏฆ

โ„Žโˆ— = ๐‘Ž๐‘Ÿ๐‘” maxโ„Ž

๐‘™๐‘œ๐‘”๐‘ƒ โ„Ž|๐‘‹

๐œ™

โ„Ž1 โ„Ž2 โ„Ž2 โ„Ž3 โ„Ž4 โ„Ž4 โ„Ž5 โ„Ž5 โ„Ž6

๐‘1,0

๐‘

๐‘2,0

๐œ™

๐‘2,1

๐œ™

๐‘3,1

๐‘Ž

๐‘4,1

๐œ™

๐‘4,2

๐‘ก

๐‘5,2

๐œ™

๐‘5,3

๐œ™

๐‘6,3

c a t<BOS>

๐‘™0 ๐‘™1 ๐‘™2 ๐‘™3

๐‘™0 ๐‘™0 ๐‘™1 ๐‘™1 ๐‘™1 ๐‘™2 ๐‘™2 ๐‘™3๐‘™3

max

โ„Žโˆ—

Beam Search for better

approximation

Summary

LAS CTC RNN-T

Decoder dependent independent dependent

Alignment not explicit (soft alignment)

Yes Yes

Training just train it sum over alignment

sum over alignment

On-line No Yes Yes

Reference

โ€ข [Yu, et al., INTERSPEECHโ€™16] Dong Yu, Wayne Xiong, Jasha Droppo, Andreas Stolcke , Guoli Ye, Jinyu Li , Geoffrey Zweig, Deep Convolutional Neural Networks with Layer-wise Context Expansion and Attention, INTERSPEECH, 2016

โ€ข [Saon, et al., INTERSPEECHโ€™17] George Saon, Gakuto Kurata, Tom Sercu, Kartik Audhkhasi, Samuel Thomas, Dimitrios Dimitriadis, Xiaodong Cui, BhuvanaRamabhadran, Michael Picheny, Lynn-Li Lim, Bergul Roomi, Phil Hall, English Conversational Telephone Speech Recognition by Humans and Machines, INTERSPEECH, 2017

โ€ข [Povey, et al., ICASSPโ€™10] Daniel Povey, Lukas Burget, Mohit Agarwal, Pinar Akyazi, Kai Feng, Arnab Ghoshal, Ondrej Glembek, Nagendra Kumar Goel, Martin Karafiat, Ariya Rastrow, Richard C. Rose, Petr Schwarz, Samuel Thomas, Subspace Gaussian Mixture Models for speech recognition, ICASSP, 2010

โ€ข [Mohamed , et al., ICASSPโ€™10] Abdel-rahman Mohamed and Geoffrey Hinton, Phone recognition using Restricted Boltzmann Machines, ICASSP, 2010

top related