The JHU Multi-Microphone Multi-Speaker ASR System for the CHiME-6 Challenge

*Ashish Arora, *Desh Raj, *Aswin Shanmugam Subramanian, *Ke Li, Bar Ben-Yair, Matthew Maciejewski, Piotr Żelasko, Paola Garcia, Shinji Watanabe, Sanjeev Khudanpur


Page 1: Title (as above)

Page 2: Challenge Overview

Track 1
• ASR only
• Oracle speaker segments provided

Track 2
• Diarization + ASR
• No speaker segments

Page 3: Challenge Overview (continued)

We will present our Track-2 system.

Same ASR model used in both tracks.

Page 4: Our Track-2 Pipeline

[Pipeline diagram: each array U01 ... U06 (channels CH1-CH4) -> WPE dereverberation -> beamforming -> SAD with multi-array posterior fusion -> diarization with multi-array PLDA score fusion -> VB-based overlap assignment -> multi-array GSS -> acoustic model (ASR) -> lattice combination -> language model -> transcription. The SAD and diarization stages are replaced with oracle segments in Track 1.]

Page 5: Speech Enhancement

[Pipeline diagram repeated from Page 4.]

Page 6: Speech Enhancement: WPE & BeamformIt

• The first preprocessing step was weighted prediction error (WPE*)-based online multi-channel dereverberation^.
• Subsequently, a delay-and-sum beamformer (BeamformIt+) was used to denoise the signal.

* T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, and B.-H. Juang, "Speech dereverberation based on variance-normalized delayed linear prediction," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 7, pp. 1717-1731, Sep. 2010.
^ L. Drude, J. Heymann, C. Boeddeker, and R. Haeb-Umbach, "NARA-WPE: A Python package for weighted prediction error dereverberation in Numpy and Tensorflow for online and offline processing," 13th ITG-Symposium, 2018.
+ X. Anguera, C. Wooters, and J. Hernando, "Acoustic beamforming for speaker diarization of meetings," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 7, pp. 2011-2023, Sep. 2007.
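As a rough illustration, here is a minimal dereverberation sketch using the NARA-WPE package cited above; the STFT size/shift and the taps/delay/iterations values are illustrative defaults, not the challenge configuration, and the plain channel average is only a stand-in for BeamformIt:

```python
import numpy as np
from nara_wpe.wpe import wpe
from nara_wpe.utils import stft, istft

def dereverberate(audio, taps=10, delay=3, iterations=5):
    """WPE dereverberation of one array's signal.

    audio: (channels, samples), e.g. the 4 channels of one CHiME-6 array.
    """
    Y = stft(audio, size=512, shift=128)        # (channels, frames, bins)
    Z = wpe(Y.transpose(2, 0, 1),               # wpe expects (bins, channels, frames)
            taps=taps, delay=delay, iterations=iterations)
    return istft(Z.transpose(1, 2, 0), size=512, shift=128)

def delay_and_sum(audio):
    """Plain channel average as a stand-in for BeamformIt, which also
    estimates per-channel delays and weights before summing."""
    return audio.mean(axis=0)

enhanced = delay_and_sum(dereverberate(np.random.randn(4, 16000)))
```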

Page 7: The JHU Multi-Microphone Multi-Speaker ASR System for the

• Trained a neural network to separate noise from speech mixtures.• Used voxceleb data for simulation with CHiME-6 noises (mixtures of 1-4 speakers).• Perceptually seemed better in very noisy segments but didn’t help final performance. So it was not

used in our final system.

Enhancement Array Dev WER (%) Eval WER (%)BeamformIt U06 76.0 72.6

Neural Beamforming U06 76.2 73.6

Denoising AlternativeNeural Beamforming

Page 8: Other Ideas We Tried: BSS & TasNet

• If blind speech separation (BSS) could be performed before speaker segmentation, speaker diarization would not be required.
• We tried two methods: (1) independent vector analysis (IVA)* and (2) TasNet+.
• We were not able to obtain good results with this approach, so it was not used in our final system.

* R. Scheibler, E. Bezzam, and I. Dokmanić, "Pyroomacoustics: A Python package for audio room simulations and array processing algorithms," ICASSP 2018.
+ Y. Luo and N. Mesgarani, "TasNet: Time-Domain Audio Separation Network for Real-Time, Single-Channel Speech Separation," ICASSP 2018.

Page 9: Speech Separation: Multi-Array GSS*

• Guided source separation (GSS) was used to separate the target source using the time annotations.
• The ground-truth annotations were used for Track 1; diarization outputs were used to obtain the time annotations for Track 2.

* N. Kanda, C. Boeddeker, J. Heitkaemper, Y. Fujita, S. Horiguchi, K. Nagamatsu, and R. Haeb-Umbach, "Guided source separation meets a strong ASR backend: Hitachi/Paderborn University joint investigation for dinner party ASR," Interspeech 2019.
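To make the "guided" part concrete, here is a minimal sketch (with hypothetical helper names and frame rate) of turning time annotations into a per-speaker activity matrix that initializes GSS's per-speaker time-frequency masks; the real method refines these with spatial mixture-model (CACGMM) EM iterations:

```python
import numpy as np

def activity_matrix(segments, num_frames, frames_per_sec=100):
    """Build a (num_speakers, num_frames) 0/1 activity matrix from
    RTTM-style segments: (speaker_id, start_sec, end_sec)."""
    speakers = sorted({spk for spk, _, _ in segments})
    act = np.zeros((len(speakers), num_frames))
    for spk, start, end in segments:
        s = int(start * frames_per_sec)
        e = min(int(end * frames_per_sec), num_frames)
        act[speakers.index(spk), s:e] = 1.0
    return speakers, act

def init_masks(act, num_bins, noise_floor=1e-2):
    """Initial per-speaker T-F masks: uniform over the speakers active
    in each frame, broadcast over frequency. GSS's EM starts from this
    guide so that each separated output tracks one annotated speaker."""
    act = act + noise_floor                       # keep all masks nonzero
    masks = act / act.sum(axis=0, keepdims=True)  # normalize per frame
    return np.repeat(masks[:, :, None], num_bins, axis=2)  # (spk, T, F)

speakers, act = activity_matrix([("P01", 0.0, 2.5), ("P02", 1.5, 4.0)],
                                num_frames=400)
masks = init_masks(act, num_bins=257)
```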

[Pipeline diagram repeated from Page 4.]

Page 10: Multi-Array GSS: Sensitivity to Diarization Output (Track 2)

Method            Overlap Detection   Dev WER (%)   Eval WER (%)
Multi-Array GSS   N                   71.0          68.8
Multi-Array GSS   Y                   69.3          68.8

• GSS is very sensitive to the diarization output.
• VB-HMM based overlap detection helps GSS.

Page 11: Combining Early & Late Fusion (Track 2): Multi-View Decoding with GSS

Separation Method                    Early Fusion   Late Fusion   Dev WER (%)   Eval WER (%)
Multi-Array GSS                      Y              N             69.3          68.8
Single-Array GSS (U06)               N              N             73.1          72.1
Single-Array GSS (U04)               N              N             72.3          74.4
Multi-Array GSS + Single-Array GSS   Y              Y             68.3          68.3

• Early fusion was performed by incorporating all arrays while beamforming.
• Late fusion was performed by lattice combination of the multi-array GSS output with the individual-array outputs (a conceptual sketch follows).
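The actual late fusion used Kaldi lattice combination with tuned lattice weights; as a simplified, system-agnostic stand-in, the sketch below combines per-system word posteriors slot by slot with fixed weights (all names and values here are illustrative assumptions):

```python
from collections import defaultdict

def combine_hypotheses(systems, weights):
    """systems: list over systems; each system is a list over time slots
    of {word: posterior} dicts (a confusion-network-style flattening of
    a lattice). Picks the highest weighted-sum word per slot."""
    combined = []
    for slots in zip(*systems):              # walk aligned slots
        scores = defaultdict(float)
        for slot, w in zip(slots, weights):
            for word, p in slot.items():
                scores[word] += w * p        # weighted linear combination
        combined.append(max(scores, key=scores.get))
    return combined

# Multi-array GSS system weighted above a single-array system.
print(combine_hypotheses(
    [[{"dinner": 0.8, "winner": 0.2}], [{"winner": 0.6, "dinner": 0.4}]],
    weights=[0.7, 0.3]))                     # -> ['dinner']
```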

[Pages 12-14 repeat the slide above, adding figure callouts for Early Fusion, Late Fusion, and Lattice Weights.]

Page 15: Speech Enhancement Takeaways

• GSS gives good improvements, and it is sensitive to the diarization output.
• Further improvement can be obtained by combining early and late fusion.

[Pipeline diagram repeated from Page 4.]

Page 16: Speech Activity Detection

[Pipeline diagram repeated from Page 4.]

Page 17: Speech Activity Detection

§ We built a TDNN-Stats based architecture for the baseline.
§ We use the same core model.
§ NEW: multi-array extension.

[Model diagram: stacked TDNN layers with statistics pooling and a softmax output.]
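For readers unfamiliar with this architecture, here is a minimal PyTorch sketch of a TDNN-Stats SAD model; the layer widths, dilations, and two-class output are illustrative assumptions, not the exact challenge configuration:

```python
import torch
import torch.nn as nn

class StatsPooling(nn.Module):
    """Append the window-level mean and std of the hidden activations,
    giving the per-frame classifier longer-range context."""
    def forward(self, x):                    # x: (batch, channels, time)
        mean = x.mean(dim=2, keepdim=True).expand_as(x)
        std = x.std(dim=2, keepdim=True).expand_as(x)
        return torch.cat([x, mean, std], dim=1)

class TdnnStatsSad(nn.Module):
    def __init__(self, feat_dim=40, hidden=256, num_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            # TDNN layers = 1-D convolutions with increasing dilation
            nn.Conv1d(feat_dim, hidden, kernel_size=3, dilation=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=2), nn.ReLU(),
            StatsPooling(),
            nn.Conv1d(3 * hidden, hidden, kernel_size=3, dilation=3), nn.ReLU(),
            StatsPooling(),
            nn.Conv1d(3 * hidden, num_classes, kernel_size=1),
        )

    def forward(self, feats):               # feats: (batch, feat_dim, time)
        return self.net(feats)              # speech/non-speech logits per
                                            # frame (softmax at inference)

logits = TdnnStatsSad()(torch.randn(1, 40, 500))
```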

Page 18: Baseline SAD

Beamforming over channels within each array.

[Diagram: for each array U01 ... U06, channels CH1-CH4 -> WPE -> beamforming -> TDNN-Stats neural network -> segments.]

Page 19: Baseline SAD

Random selection of array U06.

[Diagram repeated from Page 18, keeping only the randomly selected array U06.]

Page 20: Baseline SAD

U06 is not the best selection.

[Diagram repeated from Page 18.]

Page 21: Multi-Channel + Multi-Array

1. Can we do better than beamforming for channel combination?
2. Can we do better than random array selection?

[Diagram repeated from Page 18.]

Page 22: Can channel-level posterior fusion do better than beamforming?

[Diagram: U01 channels CH1-CH4 -> WPE -> one TDNN-Stats network per channel -> posterior fusion -> segments.]

Page 23: Channel-Level Posterior Fusion Results

Posterior Mean < Posterior Max ≈ Beamforming, but at 4x the compute -> keep beamforming!

[Diagram repeated from Page 22.]

Page 24: Posterior Fusion for Array Combination

[Diagram: each array U01 ... U06 -> WPE -> beamforming -> TDNN-Stats network -> posterior fusion across arrays -> segments.]


Page 26: The JHU Multi-Microphone Multi-Speaker ASR System for the

CH

1C

H2

CH

3C

H4

U01 U06

Beamforming

WPE

CH

1C

H2

CH

3C

H4

Beamforming

WPE

SAD with mutli-arrayposterior fusion

. . .

Diarization with multi-arrayPLDA score fusion

VB-based overlapassignment

Multi-array GSS

Acoustic ModelASR

Lattice Combination

Language Model

Transcription

Rep

lace

d w

ith o

racl

e in

trac

k 1

TDNN-Stats neuralnetwork

Beam

form

ingCH1

CH2CH3CH4

WPE

U01

Beam

form

ingCH1

CH2CH3CH4

WPE

.

.

.

U06 TDNN-Stats neural

network

segments

Post

erio

r fus

ion

34.3%

Posterior fusion helps array combination significantly

• Huge improvement in missed speech

• Directly affects downstream DER and WER
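As a minimal sketch of the fusion step (applicable at either the channel or the array level, per Pages 22-26), assuming the per-source speech posteriors are already frame-aligned; the threshold and smoothing are illustrative:

```python
import numpy as np

def fuse_posteriors(posteriors, method="mean", threshold=0.5):
    """Fuse per-frame speech posteriors from several sources.

    posteriors: (num_sources, num_frames), e.g. one row per array (or
    per channel). Returns fused posteriors and a 0/1 speech decision.
    A real system would further smooth the decisions (e.g. Viterbi or
    minimum-duration constraints) before emitting segments.
    """
    fused = (posteriors.mean(axis=0) if method == "mean"
             else posteriors.max(axis=0))
    return fused, (fused >= threshold).astype(int)

post = np.array([[0.2, 0.9, 0.7],    # array U01
                 [0.1, 0.6, 0.8]])   # array U06
fused, decisions = fuse_posteriors(post, method="max")
```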

Page 27: SAD Takeaways

• Beamforming is better than channel-level posterior fusion.
• Array-level posterior fusion gives a 34% relative improvement.

[Pipeline diagram repeated from Page 4.]

Page 28: Speaker Diarization

[Pipeline diagram repeated from Page 4.]

Page 29: Baseline Diarization System

[Diagram: U06 channels CH1-CH4 -> WPE -> beamforming -> baseline SAD -> x-vector extraction (1.5 s window / 0.75 s shift) -> PLDA scoring -> AHC -> RTTM.]
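To illustrate the clustering step, here is a minimal sketch of agglomerative hierarchical clustering (AHC) over a PLDA score matrix using scipy; the similarity-to-distance conversion and the stopping threshold are illustrative assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def ahc_from_plda(scores, distance_threshold):
    """Cluster subsegments from a symmetric PLDA score matrix.

    scores[i, j]: higher means more likely the same speaker. scipy
    works with distances, so flip the sign convention, then cut the
    dendrogram at the given threshold. Returns one label per subsegment.
    """
    dist = scores.max() - scores              # similarity -> distance
    np.fill_diagonal(dist, 0.0)
    tree = linkage(squareform(dist, checks=False), method="average")
    return fcluster(tree, t=distance_threshold, criterion="distance")

scores = np.array([[10.0, 8.0, 1.0],
                   [8.0, 10.0, 0.0],
                   [1.0, 0.0, 10.0]])
print(ahc_from_plda(scores, distance_threshold=5.0))  # -> [1 1 2]
```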

Page 30: Better SAD Improves Diarization

[Diagram as on Page 29, with the baseline SAD replaced by the multi-array SAD.]

DER on dev for U06 (evaluated using the original reference without UEM):
63.49% -> 58.95% (59.82% mean DER over all arrays)

All further results use the multi-array SAD.

Page 31: Two Questions

1. Only one array is used -> how to use multi-array information?
2. Overlaps are ignored -> how to handle overlapping speakers?

[Diagram repeated from Page 30.]

Page 32: DOVER to Combine Array Outputs

[Diagram: each array U01 ... U06 is diarized independently (multi-array SAD -> x-vector extraction (1.5 s / 0.75 s) -> PLDA scoring -> AHC -> RTTM 1 ... RTTM 6); DOVER then combines the per-array RTTMs.]

System            Dev DER (%)
Baseline (mean)   59.8
DOVER             61.1

A. Stolcke and T. Yoshioka, "DOVER: A method for combining diarization outputs," IEEE ASRU 2019, pp. 757-763.

Page 33: PLDA Score Fusion Helps Slightly

[Diagram: per-array x-vector extraction (1.5 s / 0.75 s) and PLDA scoring; the PLDA scores are fused across arrays before a single AHC produces the RTTM.]

System            Dev DER (%)
Baseline (mean)   59.8
PLDA score MEAN   59.9
PLDA score MAX    59.0
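A minimal sketch of the score-fusion step, assuming each array yields a PLDA score matrix over the same set of subsegments (so the matrices are element-wise comparable):

```python
import numpy as np

def fuse_plda_scores(score_matrices, method="max"):
    """Fuse per-array PLDA score matrices into one matrix for AHC.

    score_matrices: iterable of (N, N) pairwise scores over the same N
    subsegments. MAX fusion keeps, for each pair, the most confident
    same-speaker evidence from any array (best on dev above); MEAN
    averages the evidence instead.
    """
    stack = np.stack(list(score_matrices))
    return stack.max(axis=0) if method == "max" else stack.mean(axis=0)

mats = [np.random.randn(100, 100) for _ in range(6)]
mats = [(m + m.T) / 2 for m in mats]          # PLDA scores are symmetric
fused = fuse_plda_scores(mats, method="max")  # feed to AHC as on Page 29
```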

Page 34: Can We Do Better?

System            Dev DER (%)
Baseline (mean)   59.8
PLDA score MEAN   59.9
PLDA score MAX    59.0

• Small improvement in DER, but underwhelming!
• Maybe we need to break segments into more pieces?

[Diagram repeated from Page 33.]

Page 35: Use a 0.25 s Shift for x-Vector Extraction

[Diagram as on Page 33, with x-vector extraction using a 1.5 s window / 0.25 s shift.]

F. Landini et al., "BUT system for the second DIHARD speech diarization challenge," arXiv preprint, 2020.

Page 36: Better DER with Smaller Segments

System            Dev DER (%)
Baseline (mean)   59.8
PLDA score MAX    59.0
+ 0.25 s shift    57.9

[Diagram repeated from Page 35.]
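A minimal sketch of the sliding-window subsegmentation behind these settings (window/shift in seconds; function name is illustrative):

```python
def subsegment(segments, window=1.5, shift=0.25):
    """Split SAD segments (start_sec, end_sec) into overlapping
    fixed-length subsegments for x-vector extraction. A smaller shift
    yields more x-vectors and finer-grained clustering (57.9% vs 59.0%
    dev DER in the table above)."""
    out = []
    for start, end in segments:
        t = start
        while t + window < end:
            out.append((t, t + window))
            t += shift
        out.append((t, min(t + window, end)))  # final (possibly short) piece
    return out

print(subsegment([(0.0, 3.0)]))  # (0.0, 1.5), (0.25, 1.75), ..., (1.5, 3.0)
```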

Page 37: Overlap Handling

[Diagram: the fused-PLDA diarization pipeline from Page 35, plus an LSTM-based overlap detector (on U01) whose detected overlaps feed an overlap-assignment step before the final RTTM.]

L. Bullock, H. Bredin, and L. P. Garcia-Perera, "Overlap-aware diarization: resegmentation using neural end-to-end overlapped speech detection," arXiv:1910.11646, 2019.

Page 38: Overlap Handling

Overlap assignment:
i. Heuristic-based
ii. VB-HMM based

[Diagram and citation repeated from Page 37.]

Page 39: A Simple Heuristic Gives a Small DER Improvement

System                Dev DER (%)
Baseline (mean)       59.8
PLDA score MAX        59.0
+ 0.25 s shift        57.9
Overlap (heuristic)   56.2

[Diagram repeated from Page 37.]
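The slides do not spell out the heuristic; one common choice (an assumption here, not necessarily the exact rule used) is to give each detected overlap region a second speaker label, namely the closest other speaker in time:

```python
def assign_overlaps(turns, overlaps):
    """turns: (speaker, start, end) single-speaker segments from AHC.
    overlaps: (start, end) regions from the overlap detector. For each
    overlap region, add the nearest other speaker as a second label,
    yielding extra RTTM entries. This is an assumed heuristic, not the
    documented challenge rule."""
    def distance(turn, region):
        (_, s, e), (rs, re) = turn, region
        return max(0.0, rs - e, s - re)      # 0 if the spans intersect

    extra = []
    for region in overlaps:
        # speaker already assigned in/around this region
        current = min(turns, key=lambda t: distance(t, region))[0]
        others = [t for t in turns if t[0] != current]
        if others:
            second = min(others, key=lambda t: distance(t, region))[0]
            extra.append((second, region[0], region[1]))
    return extra

turns = [("P01", 0.0, 3.0), ("P02", 2.5, 5.0)]
print(assign_overlaps(turns, [(2.5, 3.0)]))  # -> [('P02', 2.5, 3.0)]
```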

Page 40: VB-HMM Gives a Large DER Improvement

System                Dev DER (%)
Baseline (mean)       59.8
PLDA score MAX        59.0
+ 0.25 s shift        57.9
Overlap (heuristic)   56.2
Overlap (VB-HMM)      50.4

[Diagram repeated from Page 37.]

Page 41: Diarization Takeaways

• Multi-array PLDA fusion helps with smaller segments.
• VB-HMM overlap assignment helps significantly!

[Pipeline diagram repeated from Page 4.]

Page 42: Acoustic Model Training Pipeline

• HMM-GMM/DNN with the LF-MMI objective.
• Training data: a combination of worn-mic utterances (80 h), beamformed array data (160 h), and multi-array GSS-enhanced data (40 h).
• Pronunciation and silence modeling are used in the HMM-GMM training stage.
• Speed perturbation is used in the LF-MMI training stage.

Page 43: Optimizing Model Architecture

Model Architecture   Dev WER (%)   Eval WER (%)
TDNN-F               51.8          51.3
CNN-TDNN-LSTM        50.1          49.8
SA-CNN-TDNN-F        49.9          49.4
CNN-TDNN-F           48.3          48.5

• The CNN-TDNN-F model outperforms the other models on CHiME-6 data.
• An oracle overlap-information bit is appended to the Fbank features.

Page 44: Training Data Selection

• Enhanced far-field data outperforms raw far-field and simulated data.
• Data selection speeds up experiments.
• Speed perturbation is applied in all cases.

Worn Mic (80h)   Worn Mic + Aug (320h)   Array (200h)   Array + BeamformIt (160h)   Array + GSS (40h)   Dev WER (%)   Eval WER (%)
yes              yes                     yes            -                           -                   49.6          49.4
yes              yes                     -              yes                         yes                 44.6          45.4
yes              -                       -              yes                         yes                 44.5          44.9

C. Zorila et al., "An investigation into the effectiveness of enhancement in ASR training and test for CHiME-5 dinner party transcription," arXiv:1909.12208, 2019.

Page 45: Appending Oracle Overlap Information (Track 1)

Overlap info in i-vector   Overlap info in nnet   Dev WER (%)   Eval WER (%)
no                         no                     44.5          44.9
yes                        no                     44.9          45.2
yes                        yes                    44.3          44.4

• Using overlap information as an auxiliary input for AM training obtains slight WER improvements.
• Track 2 experiments are in progress.

Page 46: Neural LM and Rescoring

• A forward and a backward LSTM:
  • Each is a 2-layer projected LSTM.
  • Hidden dim = 512, projection dim = 128.
  • The backward LSTM is trained on transcriptions reversed at the sentence level.
• 2-stage pruned lattice rescoring:
  • Stage 1: forward LSTM.
  • Stage 2: backward LSTM.
• Kaldi is used for neural LM training and rescoring.
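As a reference point, here is a minimal PyTorch sketch of a 2-layer projected LSTM LM with the stated dimensions (the actual system used Kaldi's neural LM tooling; the embedding size and vocabulary are illustrative assumptions):

```python
import torch
import torch.nn as nn

class ProjectedLstmLm(nn.Module):
    """2-layer LSTM LM with per-layer output projection (LSTMP).
    PyTorch's proj_size implements the 512 -> 128 projection; training
    the same model on sentence-reversed text gives the backward LM."""
    def __init__(self, vocab_size=10000, embed_dim=128,
                 hidden_dim=512, proj_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2,
                            proj_size=proj_dim, batch_first=True)
        self.out = nn.Linear(proj_dim, vocab_size)

    def forward(self, tokens):              # tokens: (batch, time)
        h, _ = self.lstm(self.embed(tokens))
        return self.out(h)                  # next-word logits per position

logits = ProjectedLstmLm()(torch.randint(0, 10000, (2, 20)))
```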

Page 47: Results and Discussion

Page 48: Step-by-Step Improvements for Track 1

[Figure: step-by-step WER improvements; annotated values 21.1% and 22.2%.]

Page 49: Step-by-Step Improvements for Track 2 (Contribution of Applied Methods)

[Figure: step-by-step WER improvements; annotated values 13.4% and 19.6%.]

Page 50: Summary

Frontend
- GSS performance improved with improvement in DER.

SAD
- Multi-array posterior fusion improves error rate by 34% relative.

Diarization
- Multi-array PLDA score fusion shows a small improvement in DER.
- VB-HMM based overlap assignment shows a large DER gain and some WER improvement.

Acoustic Model
- The deep CNN-TDNN-F model is an effective architecture.
- Enhanced far-field data outperforms raw far-field and simulated data as data augmentation.

Language Model
- Neural LM rescoring obtains modest WER reductions.

[Pipeline diagram repeated from Page 4.]