
When the Distribution Is the Answer
VizWiz Challenge

Sandro Pezzelle

Contacts: [email protected]

skype: sandro.pezzelle
mobile: +39 349 0537325

sandropezzelle.github.io
researchgate

linkedIn
scholar

arXiv

Work address: CIMeC, University of Trento

Corso Bettini, 31
38068 Rovereto (TN), Italy

Skills

Languages

• Italian • English • French

Programming

• Unix • Python • Keras • Tensorflow

• Matlab • Psychtoolbox

• Lua/Torch

Statistics & Others

• R/RStudio • lme4 • ggplot2

• LaTeX • LibreOffice • Inkscape • Html

Soft Skills

• Communication • Writing • Organization • Learning • Networking

Sandro Pezzelle
PhD Student

About me

PhD Student in Cognitive and Brain Sciences, track Language, Interaction and Computation. My current research - at the intersection between Computational Linguistics, Computer Vision and Cognition - is focused on the learning of quantity expressions (numbers, proportions, quantifiers). I’d define myself as an enthusiastic, communicative, multi-faceted person. Proactive and inclined to lifelong learning. “Let’s try!” as a personal motto. My code is full of print().

Education

2015 - present, PhD in Cognitive and Brain Sciences
CIMeC, University of Trento, Italy. Supervisor: Raffaella Bernardi
Computational Linguistics, Computer Vision, Cognitive Sciences, Machine Learning, AI

2012 - 2015, MSc in Linguistics, 110/110 cum laude
University of Padova, Italy. Supervisors: Laura Vanelli, Marco Marelli
Distributional Semantics, Psycholinguistics, Morphology

Jan 2014 - Jul 2014, Erasmus Program
Université Catholique de Louvain, Belgium.
Applied Linguistics, Computational Linguistics, Statistics

2009 - 2012, BSc in Modern Literature, 110/110 cum laude
University of Padova, Italy. Supervisor: Luca Zuliani
Stylistics and Metrics, Formal Linguistics, Philology

Relevant Experience

Oct 2017, Research Intern
ILLC, University of Amsterdam. Supervisor: Jakub Szymanik
Distributional Semantics, Formal Linguistics, Language Modelling

Nov 2016 - Jun 2017, Language Specialist
Appen. Part-time, project-oriented remote position
Computational Linguistics, Formal Linguistics

Training

2017 Mini-Symposium on Deep Generative Models, Amsterdam
2017 iV&L Training School on Cognitive Robotics, Athens
2016 26th ESSLLI, Bolzano
2016 iV&L Training School on Deep Learning, Malta
2015 - 2016 Machine Learning by Stanford University, Coursera

Recent Presentations

Oct 24, 2017 Learning to Quantify from Language and Vision: Insights from Behavioral and Computational Studies. Talk at Comp. Ling. Series, Amsterdam.
Sep 28, 2017 Quantifiers and Proportions in Language and Vision: Insights from Behavioral and Computational Studies. Talk at CoSaQ Workshop, Amsterdam.
Sep 26, 2017 Be Precise or Fuzzy: Learning the Meaning of Cardinals and Quantifiers from Vision. Poster at Google NLP Summit, Zurich.

Denis Dushi Sandro Pezzelle Tassilo Klein Moin Nabi

INTERNAL © 2018 SAP SE or an SAP affiliate company. All rights reserved.

VQA Task

Input

Q: “What is this?”

Annotations

A1 bottle, A2 bottle, A3 tv, A4 office, A5 bottle, A6 tv, A7 bottle, A8 room, A9 office, A10 bottle

| answer | count |
|--------|-------|
| bottle | 5 |
| tv | 2 |
| office | 2 |
| room | 1 |

Ground Truth

“bottle”


VQA Evaluation metric

| answer | count |
|--------|-------|
| bottle | 5 |
| tv | 2 |
| office | 2 |
| room | 1 |

Ground Truth

“bottle”

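$$\text{accuracy} = \min\left(\frac{\#\ \text{annotators providing that answer}}{3},\ 1\right) \tag{1}$$

$$L(x, c, w) = \sum_{i=1}^{|c|} w_i \left( -\log \frac{e^{x_{c_i}}}{\sum_{j=1}^{|x|} e^{x_j}} \right) \tag{2}$$

| num answers/classes | 1 | 2 | 5 | 50 | 300 | 3000 | 40271 |
|---|---|---|---|---|---|---|---|
| soft-loss model acc. (val) | 0.349 | 0.402 | 0.424 | 0.481 | 0.504 | 0.516 | 0.512 |

Table 2: Accuracy of soft-loss model using N classes in prediction.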

Annotations

Evaluation Accuracy

| prediction | accuracy |
|------------|----------|
| bottle | 100% |
| tv | ~67% |
| office | ~67% |
| room | ~33% |
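The per-answer accuracies above follow directly from Equation (1). A minimal sketch, assuming the ten annotations of the "What is this?" example; the function name `vqa_accuracy` is illustrative and not taken from the slides:

```python
from collections import Counter

# The ten annotations (A1..A10) from the "What is this?" example.
annotations = ["bottle", "bottle", "tv", "office", "bottle",
               "tv", "bottle", "room", "office", "bottle"]


def vqa_accuracy(prediction, annotations):
    """min(# annotators who gave `prediction` / 3, 1), as in Equation (1)."""
    counts = Counter(annotations)
    # Counter returns 0 for answers that no annotator gave.
    return min(counts[prediction] / 3, 1.0)


for answer in ["bottle", "tv", "office", "room", "dog"]:
    print(f"{answer}: {vqa_accuracy(answer, annotations):.2f}")
```

An answer given by at least three annotators already scores 1, so more than one prediction can be fully correct for the same question.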

Training Loss

[1] Antol et al. (2015). VQA: Visual Question Answering. Proceedings of the IEEE International Conference on Computer Vision: 2425–2433

[1]


Subjectivity

[2] Jolly, Pezzelle et al. (2018). The Wisdom of MaSSeS: Majority, Subjectivity, and Semantic Similarity in the Evaluation of VQA


Coverage analysis


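| num answers/classes | 1 | 2 | 5 | 50 | 300 | 3000 | 40271 |
|---|---|---|---|---|---|---|---|
| num samples (train) | 9541 | 11570 | 12531 | 14963 | 17046 | 19425 | 20K |
| % samples (train) | 47.70 | 57.85 | 62.65 | 74.81 | 85.23 | 97.12 | 100 |

Table 1: Number and percentage of samples covered by using the top-N answers (row 1).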
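A coverage count of this kind could be computed roughly as follows. This is only a sketch under two assumptions that the slides do not spell out: answers are ranked by how often they occur over all annotations, and a sample counts as covered when at least one of its annotations is among the top-N answers. The function name and the toy data are hypothetical.

```python
from collections import Counter


def top_n_coverage(samples, n):
    """Return (count, percentage) of samples with at least one annotation
    among the n most frequent answers."""
    # Assumption: rank answers by their frequency over all annotations.
    freq = Counter(answer for annotations in samples for answer in annotations)
    top = {answer for answer, _ in freq.most_common(n)}
    covered = sum(any(a in top for a in annotations) for annotations in samples)
    return covered, 100.0 * covered / len(samples)


# Toy data (hypothetical): three annotations per sample instead of ten.
samples = [
    ["unanswerable", "unanswerable", "bottle"],
    ["bottle", "bottle", "can"],
    ["cloudy", "overcast", "cloudy"],
    ["unanswerable", "unanswerable", "dog"],
]
for n in (1, 2, 5):
    print(n, top_n_coverage(samples, n))
```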

• Coverage of samples considering all the annotations


Most frequent answer: unanswerable

| count | covered samples | % covered samples |
|-------|-----------------|-------------------|
| 1 | 3059 | 32% |
| 2 | 1878 | 20% |
| ≥ 3 | 4604 | 48% |


Uncertainty-aware training

• Methods that use only the most-frequent answer ignore:

1. Contribution of other answers

2. Uncertainty of each answer

• Uncertainty-aware training: uncertainty is modeled as the agreement among human annotators
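One simple way to turn annotator agreement into per-answer weights is to normalize the answer counts. The sketch below assumes exactly that; the slides do not state how the weights are derived, and the name `agreement_weights` is illustrative.

```python
from collections import Counter


def agreement_weights(annotations):
    """Map each distinct answer to the fraction of annotators who gave it."""
    counts = Counter(annotations)
    total = len(annotations)
    return {answer: n / total for answer, n in counts.items()}


# The ten annotations from the "What is this?" example.
annotations = ["bottle", "bottle", "tv", "office", "bottle",
               "tv", "bottle", "room", "office", "bottle"]
weights = agreement_weights(annotations)
print(weights)
```

With these weights the less frequent answers ("tv", "office", "room") still contribute to training instead of being discarded in favor of "bottle".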


Soft cross-entropy loss

«What's the weather like outside on this photo? Thank you»

[Figure: VQA Model, shown with the annotators' answer counts for this question (7 cloudy, 0 unsuitable, 0 yes, 2 overcast, 0 blue, 0 dog, ...) out of 10 annotations.]

$$L(x, c, w) = \sum_{i=1}^{|c|} w_i \left( -\log \frac{e^{x_{c_i}}}{\sum_{j=1}^{|x|} e^{x_j}} \right) \tag{2}$$

[3] Ilievski et al. (2017). A simple loss function for improving the convergence and accuracy of visual question answering models.

[4] Kazemi et al. (2017). Show, Ask, Attend, and Answer: A Strong Baseline For Visual Question Answering.

[3]

• Standard VQA model [4]
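Equation (2) can be sketched in plain Python as below. Here x is read as the vector of model scores over answer classes, c as the indices of the annotated answers, and w as one weight per annotated answer. The six-word vocabulary and the weights (annotator counts divided by 10) are assumptions made for the weather example, not values confirmed by the slides.

```python
import math


def soft_cross_entropy(x, c, w):
    """Equation (2): sum_i w_i * (-log softmax(x)[c_i])."""
    # -log(e^{x_c} / sum_j e^{x_j}) = logsumexp(x) - x_c,
    # computed stably by factoring out the maximum score.
    m = max(x)
    log_norm = m + math.log(sum(math.exp(v - m) for v in x))
    return sum(w_i * (log_norm - x[c_i]) for c_i, w_i in zip(c, w))


# Hypothetical vocabulary matching the weather example.
vocab = ["cloudy", "unsuitable", "yes", "overcast", "blue", "dog"]
logits = [0.0] * len(vocab)  # an untrained model with uniform scores
c = [vocab.index("cloudy"), vocab.index("overcast")]
w = [0.7, 0.2]               # assumed: annotator counts / 10
loss = soft_cross_entropy(logits, c, w)
print(loss)
```

With a single annotated class and weight 1 the function reduces to the standard cross-entropy loss.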


Results

| Dataset | Augmented VizWiz (train) | 50% VizWiz (train) | Balanced VizWiz (val) |
|---|---|---|---|
| Accuracy | 0.501 | 0.446 | 0.111 |

| Predicted class \ Actual class (most freq. answer) | other | unanswerable / unsuitable |
|---|---|---|
| other | 1199 | 118 |
| unanswerable / unsuitable | 1052 | 804 |

Table 3: Confusion matrix. "unanswerable" and "unsuitable" are the answers with the highest coverage of samples in VizWiz.

| | only Text | only Vision | Multimodal |
|---|---|---|---|
| unanswerable | 0.784 | 0.796 | 0.803 |
| other | 0.138 | 0.299 | 0.340 |
| yes/no | 0.499 | 0.346 | 0.690 |
| number | 0.243 | 0.319 | 0.285 |
| tot. accuracy | 0.377 | 0.476 | 0.516 |

Table 4: Ablation study.

• Accuracy on validation split

• Accuracy on test-challenge split

| method | acc |
|--------|-----|
| SoA | 0.475 |
| Ours | 0.512 |

[5] Gurari et al. (2018). VizWiz Grand Challenge: Answering Visual Questions from Blind People.

[5]


Preprocessing

• Accuracy on test-challenge

| method | acc |
|--------|-----|
| SoA | 0.4750 |
| Ours | 0.5120 |
| Ours + prepro | 0.5163 |

1. Smartly stripping punctuation

e.g. “can’t” → “cant”

2. Filtering conversational words

e.g. “hello”, “please”, “thank you”, “goodbye” ...
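The two preprocessing steps could look roughly like this. It is a sketch under assumptions: the slides give only the examples above, so the phrase list, the lowercasing, and the function name `preprocess` are illustrative choices rather than the authors' actual rules.

```python
import re
import string

# Assumption: illustrative list; the slides name only these four examples.
CONVERSATIONAL = ["thank you", "hello", "please", "goodbye"]


def preprocess(question):
    # Lowercasing is an assumption; it makes the phrase filter case-insensitive.
    text = question.lower()
    # 1. Strip punctuation without splitting words, so "can't" becomes "cant".
    text = text.translate(str.maketrans("", "", string.punctuation + "’"))
    # 2. Remove conversational words and phrases, matching whole words only.
    for phrase in CONVERSATIONAL:
        text = re.sub(rf"\b{re.escape(phrase)}\b", " ", text)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())


print(preprocess("Hello, can't you tell me what this is, please? Thank you!"))
```

Deleting the apostrophe instead of replacing it with a space keeps contractions as single tokens, which matches the “can’t” → “cant” example.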


[5]


Answerability task

• Accuracy on test-dev

| method | F1 | AP |
|--------|----|----|
| Ours | 65.02 | 74.71 |
| Ours + Up | 68.84 | 74.73 |

1. Change output layer of multi-class model

Label: 0/1 (unanswerable/answerable)

2. Balance dataset

Imbalanced dataset (71.3% answerable)

• Up-sampling

• Down-sampling
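Up-sampling the minority class can be sketched as random duplication until both classes are the same size. The function name, the fixed seed, and the toy data are assumptions; the slides do not say how the up-sampling was actually performed.

```python
import random


def upsample_minority(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until both
    classes (0 = unanswerable, 1 = answerable) have the same size."""
    rng = random.Random(seed)
    by_label = {0: [], 1: []}
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)
    minority = min(by_label, key=lambda y: len(by_label[y]))
    majority = 1 - minority
    deficit = len(by_label[majority]) - len(by_label[minority])
    # Draw the extra minority samples with replacement.
    extra = [rng.choice(by_label[minority]) for _ in range(deficit)]
    balanced = ([(s, majority) for s in by_label[majority]]
                + [(s, minority) for s in by_label[minority] + extra])
    rng.shuffle(balanced)
    return balanced


# Toy data (hypothetical): 7 answerable and 3 unanswerable questions.
samples = [f"q{i}" for i in range(10)]
labels = [1] * 7 + [0] * 3
balanced = upsample_minority(samples, labels)
print(sum(y for _, y in balanced), sum(1 - y for _, y in balanced))
```

Down-sampling would instead discard majority-class samples, trading training data for balance.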

• Accuracy on test-challenge

| method | F1 | AP |
|--------|----|----|
| SoA | - | 71.7 |
| Ours + Up | 67.71 | 73.11 |


[5]


Conclusion

1. Multi-class task

• Soft cross-entropy

• Smart preprocessing

2. Answerability task

• Binary classifier with up-sampling of unanswerable samples


Denis Dushi Sandro Pezzelle Tassilo Klein Moin Nabi

Thank you.
(Answerable) Questions?
