dr. s. arunmozhi1

Upload: ilasundaram

Post on 06-Apr-2018

TRANSCRIPT

  • 8/2/2019 Dr. S. Arunmozhi1

    1/46

    Selvaraj [email protected]

    Dravidian University

  • 8/2/2019 Dr. S. Arunmozhi1

    2/46

    24-Jan-2012

    2

    SRM University

    Lexical Resources

    In recent years, lexical resources, monolingual and multilingual, have become more readily available.

    Information about words and their relations, both within and across languages, has become richer and more easily exploitable in various applications.

  • 8/2/2019 Dr. S. Arunmozhi1

    3/46

    Parallel corpora aligned at word level have created possibilities for analyzing translational correspondences and deriving lexical relations within and across languages by means of new computational methods such as Semantic Mirrors.

    Furthermore, unstructured texts, such as ordinary web materials, can be mined in different ways by tools such as SketchEngine in order to fully automatically derive overviews of how lexical items behave in context.

  • 8/2/2019 Dr. S. Arunmozhi1

    4/46

    TDIL Initiatives

    MCIT started TDIL (Technology Development for Indian Languages) in 1991

    To develop information processing tools and techniques

    To facilitate human-machine interaction without language barriers

    To create and access multilingual knowledge resources and integrate them to develop innovative user products and services

  • 8/2/2019 Dr. S. Arunmozhi1

    5/46

    Basic tools for Indian languages

    Software tools and fonts for all 22 Indian languages have been released in the public domain

    Software tools for enabling the linguistic community in the digital age

    www.ildc.in

  • 8/2/2019 Dr. S. Arunmozhi1

    6/46

    Ongoing projects in Consortium mode

    English-IL MT system

    - system

    On-line handwriting recognition system

    -

    Speech Corpora/Technologies

    Language Corpora

  • 8/2/2019 Dr. S. Arunmozhi1

    7/46

    Lexical Resources

    WordNet

    Corpora

  • 8/2/2019 Dr. S. Arunmozhi1

    8/46

    WordNet

    WordNets are being used in word sense disambiguation, information extraction and information retrieval.

    Over 60 WordNets have been developed around the world.

    Typologically different languages have faced challenges in adapting the original model and linking WordNets across languages.

  • 8/2/2019 Dr. S. Arunmozhi1

    9/46

    What is WordNet?

    A large lexical database, or electronic dictionary

    Covers most English nouns, verbs, adjectives, adverbs

    Electronic format makes it amenable to automatic manipulation

    (... and sorting, machine translation, ...)

  • 8/2/2019 Dr. S. Arunmozhi1

    10/46

    What's so special about WordNet?

    Traditional paper dictionaries are organized alphabetically, so words that are grouped together (on the same page) are unrelated

    WordNet is organized by meaning, so words in close proximity are related

    Users can browse WordNet and find words related to their queries (like in a thesaurus)

  • 8/2/2019 Dr. S. Arunmozhi1

    11/46

    Basic Design of WN

    WordNet entries are word-concept mappings

    Natural languages map many-to-many:

    One concept can be expressed by many words (synonymy): {car, auto, automobile}

    {close, shut}

  • 8/2/2019 Dr. S. Arunmozhi1

    12/46

    One word can express many concepts (polysemy):

    {club, stick}

    {club, nightclub} {club, playing card}

    The words we use most frequently are the most polysemous (have the most meanings)!

  • 8/2/2019 Dr. S. Arunmozhi1

    13/46

    WordNet handles synonymy and polysemy

    Represents words and concepts unambiguously

    Meaningfully relates words and concepts

  • 8/2/2019 Dr. S. Arunmozhi1

    14/46

    WordNet's building blocks: sets of synonyms (synsets)

    {hit, beat}

    {queue, line}

    Each synset expresses a distinct concept.

    Currently, WordNet contains approx. 117,000 synsets
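The many-to-many mapping between words and concepts described on these slides can be sketched with two small lookup tables. This is only a toy illustration: the synset identifiers (`car.n`, `club.stick`, and so on) and the function names are hypothetical labels, not WordNet's own data format or API.

```python
from collections import defaultdict

# Toy synsets taken from the slides; the identifiers on the left are
# hypothetical labels invented for this sketch.
SYNSETS = {
    "car.n": {"car", "auto", "automobile"},
    "club.stick": {"club", "stick"},
    "club.nightclub": {"club", "nightclub"},
    "club.card": {"club", "playing card"},
}


def build_word_index(synsets):
    """Invert the synset -> words table into a word -> synset ids table."""
    index = defaultdict(set)
    for synset_id, words in synsets.items():
        for word in words:
            index[word].add(synset_id)
    return index


WORD_INDEX = build_word_index(SYNSETS)


def synonyms(word):
    """All other words sharing at least one synset with `word` (synonymy)."""
    result = set()
    for synset_id in WORD_INDEX.get(word, ()):
        result |= SYNSETS[synset_id]
    result.discard(word)
    return result


def senses(word):
    """All synsets a word belongs to (polysemy), in sorted order."""
    return sorted(WORD_INDEX.get(word, ()))
```

Here `synonyms("car")` collects the other members of the one synset containing "car", while `senses("club")` lists three distinct synsets, which is the polysemy case from the slides.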

  • 8/2/2019 Dr. S. Arunmozhi1

    15/46

    WordNet stores, and allows one to retrieve,

    all words that express a given concept

    -based relations

    Result: a large semantic network

    (as opposed to a flat list in a paper dictionary)

  • 8/2/2019 Dr. S. Arunmozhi1

    16/46

    Relations among noun synsets

    Hyperonymy/hyponymy relates superordinate/subordinate synsets (denoting more/less general concepts):

                     {vehicle}
                    /         \
       {car, automobile}   {bicycle, bike}
          /        \               \
    {convertible}  {SUV}      {mountain bike}

    Transitivity: A car is a kind of vehicle

    An SUV is a kind of car => An SUV is a kind of vehicle
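The transitivity of the "is a kind of" relation can be sketched by walking upward through the hypernym links. This is a minimal sketch that assumes each synset has a single superordinate and names each synset by one member word; real WordNet allows multiple hypernyms, and the function names below are hypothetical.

```python
# Direct "is a kind of" links from the slide; each synset is named by one
# of its member words. Single parents only (a simplifying assumption).
HYPERNYM = {
    "car": "vehicle",
    "bicycle": "vehicle",
    "convertible": "car",
    "SUV": "car",
    "mountain bike": "bicycle",
}


def hypernym_chain(synset):
    """Walk upward from `synset`, collecting every more general synset."""
    chain = []
    while synset in HYPERNYM:
        synset = HYPERNYM[synset]
        chain.append(synset)
    return chain


def is_kind_of(specific, general):
    """True if `general` is reachable from `specific` via hypernym links."""
    return general in hypernym_chain(specific)
```

`is_kind_of("SUV", "vehicle")` holds even though only SUV -> car and car -> vehicle are stored, which is the inference shown on the slide.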

  • 8/2/2019 Dr. S. Arunmozhi1

    17/46

    Relations among noun synsets

    Meronymy/holonymy (part/whole):

         {car, automobile}
                 |
             {engine}
             /      \
    {spark plug}  {cylinder}

    Inheritance: A car has an engine

    An engine has spark plugs => A car has spark plugs

  • 8/2/2019 Dr. S. Arunmozhi1

    18/46

    Relations among verb synsets

    Verbs denote events

         {communicate}
               |
            {talk}
            /    \
    {stammer}  {whisper}

  • 8/2/2019 Dr. S. Arunmozhi1

    19/46

    Semantics of events (verbs) are very different

    WordNet captures this fact with different relations

    Relations refer to temporal properties of events

    partial and complete overlap of two events

    prior or posterior events

  • 8/2/2019 Dr. S. Arunmozhi1

    20/46

    Relations among synsets create an interconnected network

    Different senses of polysemous words are members of distinct synsets that are related to different synsets, i.e., occupy different locations in the network

    e.g., {stock, broth} has superordinate synset {dish}

    {stock, breed} has superordinate {variety}

    These different synsets are also linked to

  • 8/2/2019 Dr. S. Arunmozhi1

    21/46

    A word's meaning can be defined in terms of its position in the network

    club 1 is a kind of association, has members

    club 2 is a kind of stick

    Relatedness between words or synsets can be quantified in terms of path length (number of connections among synsets)

  • 8/2/2019 Dr. S. Arunmozhi1

    22/46

    How closely related are {zebra} and {horse}? Very: both share the direct superordinate {equine}

    What about {horse, sawhorse} and {horse, gymnastic horse}? Related, but less so: the joint superordinate {artifact} is 4-5 levels up

    What about {zebra} and {horse, gymnastic horse}? Unrelated: the trees containing them never intersect
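Path-length relatedness, as used in these three examples, can be sketched as a breadth-first search over the synset network. The graph below is a toy: zebra, horse, equine and artifact come from the slides, while the intermediate nodes between the two "horse" artifacts and {artifact} are illustrative assumptions chosen to give the 4-5 levels mentioned above, not data copied from WordNet.

```python
from collections import deque

# (subordinate, superordinate) links. The intermediate artifact nodes are
# illustrative assumptions for this sketch.
EDGES = [
    ("zebra", "equine"),
    ("horse (equine)", "equine"),
    ("sawhorse", "framework"),
    ("framework", "supporting structure"),
    ("supporting structure", "structure"),
    ("structure", "artifact"),
    ("gymnastic horse", "gymnastic apparatus"),
    ("gymnastic apparatus", "sports equipment"),
    ("sports equipment", "equipment"),
    ("equipment", "instrumentality"),
    ("instrumentality", "artifact"),
]


def build_graph(edges):
    """Build an undirected adjacency map so paths can go up and down."""
    graph = {}
    for child, parent in edges:
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set()).add(child)
    return graph


def path_length(graph, start, goal):
    """Number of links on the shortest path, or None if no path exists."""
    if start not in graph or goal not in graph:
        return None
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


GRAPH = build_graph(EDGES)
```

In this toy network zebra and the equine horse are two links apart, the two artifact senses are connected only through {artifact}, and zebra has no path at all to the gymnastic horse, mirroring the "very", "less so" and "unrelated" cases.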

  • 8/2/2019 Dr. S. Arunmozhi1

    23/46

    Word sense disambiguation (WSD) is a major problem in Natural Language Processing

    Assumption: words in a context (phrase, sentence, discourse) are semantically related

    In the neighborhood of zebra, horse is likely to mean equine; in the neighborhood of gym it likely means gymnastic horse.

    If you want to disambiguate horse in the context of zebra, look for the shortest paths from zebra to horse.

    The shortest path indicates the intended sense of horse.
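The disambiguation heuristic on this slide can be sketched as "pick the sense with the shortest path to the context word". The sense labels (`horse#equine`, `horse#gymnastic`), the hypernym links for gym, and all function names are hypothetical, and the sketch assumes a single parent per node; it is not a description of any particular WSD system.

```python
# Hypothetical single-parent hypernym links; only zebra, gym, equine and
# gymnastic horse are named on the slides.
PARENT = {
    "zebra": "equine",
    "horse#equine": "equine",
    "horse#gymnastic": "gymnastic apparatus",
    "gymnastic apparatus": "artifact",
    "gym": "facility",
    "facility": "artifact",
}

# Candidate senses for each ambiguous word (toy inventory).
SENSES = {"horse": ["horse#equine", "horse#gymnastic"]}


def ancestors_with_depth(node):
    """Map `node` and each of its ancestors to its distance from `node`."""
    depths = {}
    depth = 0
    while node is not None:
        depths[node] = depth
        node = PARENT.get(node)
        depth += 1
    return depths


def path_length(a, b):
    """Links on the path through the lowest shared ancestor, else None."""
    up_from_a = ancestors_with_depth(a)
    steps = 0
    node = b
    while node is not None:
        if node in up_from_a:
            return up_from_a[node] + steps
        node = PARENT.get(node)
        steps += 1
    return None


def disambiguate(word, context_word):
    """Return the sense of `word` closest to `context_word`, or None."""
    best_sense, best_len = None, None
    for sense in SENSES[word]:
        length = path_length(context_word, sense)
        if length is not None and (best_len is None or length < best_len):
            best_sense, best_len = sense, length
    return best_sense
```

With zebra as context only the equine sense is reachable, and with gym as context only the gymnastic sense is, so each context selects a different sense of horse.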

  • 8/2/2019 Dr. S. Arunmozhi1

    24/46

    Freely downloadable:

    http://wordnet.princeton.edu

  • 8/2/2019 Dr. S. Arunmozhi1

    25/46

    WordNets around the world

    Currently, WordNets exist for some 60 languages, including ..., Estonian, Hebrew, Icelandic, Italian, Kannada, Latvian, Persian, Romanian, Sanskrit, Tamil, Telugu, Thai, Turkish, Urdu, ...

    Global WordNet Association: http://www.globalwordnet.org

  • 8/2/2019 Dr. S. Arunmozhi1

    26/46

    WordNets in Indian Languages

    Pioneer: Hindi WordNet

    Other Indian languages under construction

    North-East WordNet

    , ,

    Indradhanush

    Bengali, Gujarati, Kashmiri, Konkani, Oriya,

    Punjabi, Urdu

  • 8/2/2019 Dr. S. Arunmozhi1

    27/46

    Dravidian WordNet

    Tamil (Tamil University), Telugu (Dravidian University), ... Viswavidyalayam), Kannada (University of Mysore)

    Funding Agency: DIT

    Budget: 152 lakhs

    Time frame: 24 months

    Starting Date: 26-12-2011

  • 8/2/2019 Dr. S. Arunmozhi1

    28/46

    Work already done

    Tamil WordNet -

    Tamil Virtual University

    Available for download from www.nrcfoss.in

    Dravidian WordNet: 11,000 synsets developed

    Available online from http://www.cfilt.iitb.ac.in/indowordnet

  • 8/2/2019 Dr. S. Arunmozhi1

    29/46

    IndoWordNet

    Collaborative effort to develop/link all Indian language WordNets

    Foundation of WordNet construction:

    Source: Hindi WordNet

    Expansion Approach

  • 8/2/2019 Dr. S. Arunmozhi1

    30/46

    Three Principles

    Minimality

    The principle of minimality insists on capturing the minimal set of words in the synset which uniquely identifies the concept. For example, {family, house} uniquely identifies a concept (e.g., "he is from the house of the King of Jaipur").

  • 8/2/2019 Dr. S. Arunmozhi1

    31/46

    Coverage

    The principle of coverage stresses the completion of the synset, i.e., capturing ALL the words that stand for the concept expressed by the synset (e.g., {family, house, household, ménage} completes the synset).

    Within the synset, the words should be ordered according to their frequency in the corpus.
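The frequency-ordering rule can be sketched as a sort of synset members by their counts in a corpus. The toy corpus and the function name `order_synset` are invented for illustration; a real project would count over a large, tokenized corpus.

```python
from collections import Counter


def order_synset(words, corpus_tokens):
    """Return synset members sorted by descending corpus frequency.

    Words with equal counts keep their original relative order,
    because Python's sorted() is stable.
    """
    counts = Counter(corpus_tokens)
    return sorted(words, key=lambda w: -counts[w])


# Invented toy corpus, split on whitespace.
corpus = "the family left the house and the household joined the family".split()
ordered = order_synset(["household", "house", "family", "ménage"], corpus)
```

Under this ordering the most frequent members come first, which is also where the replaceability principle expects the mutually substitutable words to sit.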

  • 8/2/2019 Dr. S. Arunmozhi1

    32/46

    Replaceability

    Replaceability demands that the most common words in the synset, i.e., words towards the beginning of the synset, should be able to replace one another in the example sentence associated with the synset.

  • 8/2/2019 Dr. S. Arunmozhi1

    33/46

    Some Statistics on IndoWordNet -

    Assamese     3530 / 19609
    Bengali      8679 / 18563
    Bodo         3837 / 13357
    Gujarati      970 / 2125
    Hindi       33900 / 82000
    Kannada      5920 / 7344
    Malayalam    6154 / 8622
    Manipuri     2744 / 5231
    Marathi      9739 / 21223
    Nepali       5 0 2 0 27
    Sanskrit     3340 / 17820
    Tamil        4750 / 9821
    Telugu      10639 / 18250
    Urdu         6123 / 9641

  • 8/2/2019 Dr. S. Arunmozhi1

    34/46

    Corpora

  • 8/2/2019 Dr. S. Arunmozhi1

    35/46

    Indian Languages Corpora Initiative

    The Indian Languages Corpora Initiative (ILCI) is a research project for technology development for Indian languages.

    The Special Centre for Sanskrit Studies of Jawaharlal Nehru University is coordinating this national project and is the consortium leader of the ILCI project.

  • 8/2/2019 Dr. S. Arunmozhi1

    36/46

    Consortium Members

    Punjabi University for Punjabi

    JNU (Center for Indian languages) for Urdu

    ISI Kolkata for Bangla

    Utkal University for Oriya

    IIT Mumbai for Marathi

    Gujarat University for Gujarati

    Dravidian University for Telugu

    Tamil University for Tamil

    IITM-K Trivandrum for Malayalam

    Goa University for Konkani

    Each consortium member will develop corpora and standards in their respective languages.

  • 8/2/2019 Dr. S. Arunmozhi1

    37/46

    The main objective

    (... 11 Indian languages along with English) with standards for 12 major Indian languages including English in the domain of tourism and health.

    Major aims of the project are

    build parallel corpora in the domain of tourism and health (Hindi-English and Hindi-Indian languages) &

    annotate (label) the parallel corpora.

  • 8/2/2019 Dr. S. Arunmozhi1

    38/46

    Aims

    Evolving Draft Standards includes evaluation of

    as part of various projects under Technology

    Development in Indian Languages (TDIL), and evaluating existing standards for their usability.

    Standards for corpora collection, for corpora

    The task of corpora development includes corpora collection in Hindi, parallel corpora in 11 Indian languages and parallel corpora in English.

  • 8/2/2019 Dr. S. Arunmozhi1

    39/46

    The basic starting point for this project is a list of 50,000 Hindi sentences used in the tourism and health domain.

    A list of data source institutions, including Tourism and Health departments, was made to collect data for Hindi.

    ... the given 11 Indian languages and English has been created as per the standards evolved.

    ... English are almost completed as per the BIS standards

  • 8/2/2019 Dr. S. Arunmozhi1

    40/46

    50K sentences from Hindi into Telugu were translated

    25K each in tourism and health domain

    ... based on BIS-POS Tagset

    Will be ready by 1st Jan and

    Will be made available online from www.tdil.gov.in

  • 8/2/2019 Dr. S. Arunmozhi1

    41/46

    Tools developed

    Corpora Annotation Tool

    en er

    Stemmer

    Frequency list builder

  • 8/2/2019 Dr. S. Arunmozhi1

    42/46

    ILCI-Phase II

    Major aims of the project are:

    Corpora collection for source language

    ... target languages

    Corpora annotation of parallel corpora in 23 languages

    Agriculture and Culture domains (in addition to tourism and health)

    More than 10 million word corpus to be developed

  • 8/2/2019 Dr. S. Arunmozhi1

    43/46

    Budget

    1049.26 lakhs (10 crores, 49 lakhs and 26 thousand)

    ... partners: ... lakhs

    New partners: 60.38 lakhs

    ... Communications and IT, GoI.

  • 8/2/2019 Dr. S. Arunmozhi1

    44/46

    New languages in ILCI-PII

    Maithili

    Kannada

    Sanskrit

    Dogri

    Sindhi

    Assamese

    Manipuri

    Nepali

    Bodo

  • 8/2/2019 Dr. S. Arunmozhi1

    45/46

    Advertisement

    M.Sc. in Computational Linguistics, Dravidian University

    Under UGC's Innovative Programme

  • 8/2/2019 Dr. S. Arunmozhi1

    46/46

    Thank you for your kind attention!