Linguistic Processors and Infrastructure

Document Number: D3.3
Project ref.: IST-2001-34460
Project Acronym: MEANING
Project full title: Developing Multilingual Web-scale Language Technologies
Project URL: http://www.lsi.upc.es/~nlp/meaning/meaning.html
Availability: Public
Authors: Luisa Bentivogli, Bernardo Magnini (ITC-irst) - Inaki Alegria Loinaz (EHU) - Lluis Padró, Jordi Atzerias Batalla (UPC) - Rob Koeling (Sussex)

    INFORMATION SOCIETY TECHNOLOGIES

WP3-D3.3, Version: Draft
Linguistic Processors and Infrastructure

Project ref.: IST-2001-34460
Project Acronym: MEANING
Project full title: Developing Multilingual Web-scale Language Technologies
Security (Distribution level): Public
Contractual date of delivery: DATE
Actual date of delivery: February 2, 2005
Document Number: D3.3
Type: Report
Status & version: v Draft
Number of pages: 36
WP contributing to the deliverable: WP3
WPTask responsible: ITC-irst
Authors: Luisa Bentivogli, Bernardo Magnini (ITC-irst) - Inaki Alegria Loinaz (EHU) - Lluis Padró, Jordi Atzerias Batalla (UPC) - Rob Koeling (Sussex)
Other contributors:
Reviewer:
EC Project Officer: Evangelia Markidou
Keywords: Linguistic Processors, Linguistic Resources, Corpora.

Abstract: This deliverable reports an analysis of the situation of each partner of the Meaning project with respect to the availability of Linguistic Processors and Resources after the second phase of the project (LP2). First, a comparison with the previous plans of development is carried out to verify whether they have been accomplished. Second, a final inventory of tools, lexical resources, and corpora which have been developed or enhanced within the project is presented. Then follows a description of how these tools and resources have been used in the second cycle of acquisition (WP5) and disambiguation (WP6). Finally, a list of web sites where a number of Meaning tools and resources can be freely accessed is presented.

IST-2001-34460 - MEANING - Developing Multilingual Web-scale Language Technologies

Contents

1 Introduction
2 Original Plan for LP1 and LP2
3 Situation at LP2 with respect to the plan devised at LP0 and LP1
    3.1 Italian
    3.2 Catalan
    3.3 Spanish
    3.4 Basque
    3.5 English
4 Final Inventories at LP2
    4.1 Linguistic Processors
    4.2 Lexical Resources
    4.3 Corpora
5 On-line tools and resources
6 Usage of tools and resources in the second Meaning cycle
    6.1 WP5 Acquisition
    6.2 WP6 Word Sense Disambiguation
    6.3 WP8 Evaluation
7 Conclusion
A New Linguistic Processors developed within the project (not available on-line)
    A.1 EHU Processors
        A.1.1 Eihera (LP1)
    A.2 UPC Processors
        A.2.1 HMM Tagger (LP1)
        A.2.2 Text Classification tool (LP2)
    A.3 University of Sussex Processors
        A.3.1 Multiword Identifier (LP1)
    A.4 ITC-irst Processors
        A.4.1 SentencePro (developed at LP1, enhanced at LP2)
        A.4.2 LemmaPro (LP1)
        A.4.3 Italian Chunker (LP1)
        A.4.4 NERD (LP1)
        A.4.5 TextPro (LP2)
        A.4.6 Chronos (LP2)
        A.4.7 DDS: Domain Driven Similarity (LP2)
B New versions of WordNets at LP2
    B.1 Basque WordNet (EuskWordNet)
    B.2 Spanish WordNet
    B.3 Catalan WordNet
    B.4 Italian WordNet (MultiWordNet MEANING Version)
C New corpora developed within the project (not available on-line)
    C.1 3LB
    C.2 Sussex sense-annotated corpus

1 Introduction

The aim of workpackage WP3 "Linguistic Processors and Infrastructure" is, first of all, to analyze the situation of each partner of the Meaning project with respect to the availability of Linguistic Processors, Lexical Resources, and Corpora, and to make a plan for the improvement of the currently available resources and the development of further resources useful in the acquisition (WP5) and disambiguation (WP6) phases of the project.

The Meaning approach relies heavily on linguistic information automatically acquired from large corpora for different languages (in Meaning, Spanish, English, Italian, Basque and Catalan are considered). To this end, such corpora need to be processed at several levels of linguistic analysis, including morphology, syntax, and semantics. As a consequence, it is of the utmost importance that the partners of the project are adequately equipped with state-of-the-art processors for their respective languages and for the kind of linguistic data they want to acquire and use in word sense disambiguation.

Deliverable D3.1 reported the situation of each partner at the beginning of the project, providing an extensive description of all tools and resources available at the different sites for the languages involved in the project, together with detailed plans for their development in the following two phases scheduled at T18 (LP1) and T27 (LP2).

Deliverable D3.2 analyzed the situation at T18 (LP1), after the first phase of the project, focusing on the improvements of linguistic processors and resources and their use within the first cycle of acquisition (ACQ1) and word sense disambiguation (WSD1).

In Deliverable D3.3, we sum up the developments of Linguistic Processors throughout the whole Meaning project. On the basis of the plan originally presented in D3.1, we describe the situation of each Meaning language at LP2 with respect to that plan and we give details concerning how tools, linguistic resources and corpora have been progressively developed, enhanced, and used by each partner during the two project cycles of acquisition (ACQ1 and ACQ2) and word sense disambiguation (WSD1 and WSD2).

In Section 2 we report the plan, originally presented in D3.1, regarding the development and improvement of tools and resources for the different languages involved in the Meaning project. Section 3 then describes the situation of each partner at LP2 with respect to that plan. In Section 4 we present the final inventory of the linguistic processors, lexical resources, and corpora that have been developed within the Meaning project and that are available at LP2. In Section 5 we present a list of web sites where a number of Meaning tools and resources can be freely accessed. In Section 6 we describe the usage of the developed tools and resources within the second cycle of acquisition (WP5) and word sense disambiguation (WP6). Finally, three appendixes describe the new linguistic processors (A), the final versions of wordnets (B) and the new corpora (C) available at the Meaning partners at LP2.

2 Original Plan for LP1 and LP2

The development of new linguistic processors and the improvement of already running tools are strictly related to their use in the acquisition and disambiguation phases scheduled within the project.

Tables 1 and 2 report the plan which was previously devised and described in Deliverable D3.1. This plan regards the development of tools and resources for the different languages in order to meet the needs of the two phases of acquisition (ACQ1 at T18 and ACQ2 at T27) and word sense disambiguation (WSD1 at T18 and WSD2 at T27).

| Languages | Tools and resources | LP1 | LP2 |
|---|---|---|---|
| Italian | NE Recognizer | Porting to Italian | |
| | Multiwords recognizer | Extraction (collocations, Phrasets, domain specific terms) | Uploading into WN |
| | Chunker | development | |
| | Meaning corpus | Annotation (orthographic, structure, morpho-syntactic) | Annotation (syntactic, multiwords, NE, semantic) |
| Catalan | Morphological analyzer | improvement of the analyzer + enlargement and debugging of the lexicon | |
| | Sentence splitter | enhance recall while maintaining precision | |
| | NE Recognizer | full integration | |
| | POS tagger | bootstrapping of tagger and hand-checked corpus | |
| | Co-reference resolution | | development of a co-reference detection module |
| | Anaphora resolution | | porting to Catalan of the Spanish implementation |
| | Text Classification | may be developed provided training corpora | |
| Spanish | Sentence splitter | enhance recall while maintaining precision | |
| | NE Recognizer | full integration | |
| | Chunking | | development of a treebank |
| | Co-reference resolution | | development of a co-reference detection module |
| | Anaphora resolution | | final integration of the whole system |
| | Text Classification | prototype integration | |

Table 1: Plan of development for Italian, Catalan, and Spanish languages

| Languages | Tools and resources | LP1 | LP2 |
|---|---|---|---|
| Basque | Language identifier | improvement | integration (common client/server architecture and XML format) |
| | Tokenizer | improvement | integration |
| | Lemmatizer | improvement | integration |
| | POS tagger | improvement | integration |
| | Text classifier | improvement | integration |
| | Multiwords recognizer | improve coverage | integration |
| | Chunker-parser | improvement | integration in the batch process including first version with treatment of complex prepositions |
| | NE Recognizer | first version prototype | evaluated and revised version + integration in the batch process |
| English | Shallow parsing system | XML input/output | |
| | Multiwords recognizer | improve coverage | incorporating results |
| | NE Recognizer | first implementation | |

Table 2: Plan of development for Basque and English languages

3 Situation at LP2 with respect to the plan devised at LP0 and LP1

At the beginning of the project (LP0) a plan was devised for the improvement of the available resources and the development of further resources useful in the acquisition (WP5) and disambiguation (WP6) phases of the project. Deliverable D3.2 showed that the original plan for LP1 had been fulfilled and confirmed the suitability of the original plan for LP2 as well. This deliverable shows that the original plan has also been largely followed for LP2. Moreover, during the two phases of the Meaning project, not only have the plans been largely respected, but new tasks have also been performed.

In the following, we discuss the final situation of linguistic processors for each Meaning language with respect to the plan devised at LP0 and confirmed at LP1. The achieved results, in terms of both respected goals and additional performed tasks, are reported for both LP1 and LP2.

3.1 Italian

• NE Recognizer:
  - LP1: porting to Italian done.
• Multiwords Recognizer:
  - LP1: extraction of a list of 181,938 multiword expressions (77,984 bigrams and 103,954 trigrams) from the "La Repubblica" corpus (years 1999-2000, 38 million words)
  - LP2: uploading into WordNet of 1,100 multiword expressions (547 restricted collocations and 553 named entities)
• Chunker:
  - LP1: development of a first version
• Meaning corpus:
  - LP1: increase in the size of the Micro-balanced component (last version composed of 21,310,540 tokens); the whole corpus annotated up to the morphosyntactic level (96,837,555 tokens for the Macro-balanced component and 21,310,540 tokens for the Micro-balanced component)
  - LP2: annotation at all planned levels except syntactic

Other performed tasks:

LP1:
• New Sentence Splitter and POS tagger
• New version of WordNet-Domains under construction (reorganization of the hierarchy, mapping of all the domains with the corresponding DDC codes). First release scheduled for LP2

LP2:
• Improvement of the portability of some tools: TokenPro, SentencePro and LemmaPro are available for Unix, Linux and Windows platforms
• New text analysis package: TextPro
• New tool for the recognition and normalization of temporal expressions: Chronos
• New term similarity and domain model inference module based on Latent Semantic Analysis: DDS

3.2 Catalan

• Morphological analyzer:
  - LP1: improvement of the tool, lexicon extension and debugging
• Sentence splitter:
  - LP1: partial improvement
• NE Recognizer:
  - LP1: integration postponed to LP2
  - LP2: implementation done
• POS tagger:
  - LP1: bootstrapping done; hand-tagged corpus almost finished
• Text Classification system:
  - LP1: not yet developed due to lack of training corpora
• Co-reference resolution module:
  - LP2: development not carried out
• Anaphora resolution module:
  - LP2: porting to Catalan of the Spanish implementation not carried out

Other performed tasks:

LP1: improvement in the speed of linguistic processors
• morphological analyzer ported to C++
• fast HMM tagger developed

LP2:
• Release of the FreeLing suite of analyzers, an integrated reimplementation of all preexisting tools, under a free software license

3.3 Spanish

• Sentence splitter:
  - LP1: improvement partially done
• NE Recognizer:
  - LP1: integration postponed to LP2
  - LP2: integration carried out
• Chunking:
  - LP2: development of a Spanish Treebank of 500 Kwords.
• Co-reference resolution module:
  - LP2: development not carried out
• Anaphora resolution module:
  - LP2: integration not carried out
• Text Classification system:
  - LP1: integration postponed to LP2
  - LP2: integration carried out

Other performed tasks:

LP1: improvement in the speed of linguistic processors
• morphological analyzer ported to C++
• fast HMM tagger developed

LP2:
• Release of the FreeLing suite of analyzers, an integrated reimplementation of all preexisting tools, under a free software license

3.4 Basque

• Language identifier:
  - LP1: improvements done for 6 languages; English documentation added
  - LP2: integration done
• Tokenizer:
  - LP1: slight changes introduced
  - LP2: integration done
• Lemmatizer:
  - LP1: improvement in the disambiguation of proper names and guessing; new version of lexicon
  - LP2: integration done
• POS tagger:
  - LP1: improvement in the disambiguation of proper names and guessing; new version of lexicon
  - LP2: integration done
• Text classifier:
  - LP1: complete evaluation
  - LP2: integration done
• Multiwords recognizer:
  - LP1: wider coverage including new units in the lexicon
  - LP2: integration done
• Chunker-parser:
  - LP1: enhanced; integration in the batch process including postpositions
  - LP2: integration in the batch process including treatment of complex prepositions done
• NE recognizer:
  - LP1: first version of Eihera finished; integration in the batch process
  - LP2: evaluation, revised version, and integration in the batch process

Other performed tasks:

LP1:
• Chunker-parser: development of a treebank almost finished
• Integration using XML: almost finished for the tokenizer, lemmatizer and POS tagger.

3.5 English

• Shallow parsing system:
  - LP1: XML input/output added to RASP system
• Multiword recognizer:
  - LP1: data extracted and integrated
  - LP2: results have not been incorporated
• NE Recognizer:
  - LP1: maximum entropy model-based, trained on CoNLL data, and integrated

4 Final Inventories at LP2

The systems developed within WP5 (Acquisition) and WP6 (Word Sense Disambiguation) relied heavily on the tools and the different kinds of resources developed within WP3. The tools and resources available at the end of the Meaning project (i.e. LP2) are described in detail in the following sections.

4.1 Linguistic Processors

Building on a common framework of basic linguistic processors available for all the languages involved in the project (i.e. tokenizers, morphological analyzers, and Part of Speech taggers), during the various phases of the Meaning project each partner has produced a number of tools. These tools may differ across the languages considered in Meaning and deal with tasks of varying complexity, ranging from basic text processing to more advanced processing.

At the beginning of the project (LP0), some tools were already available. During the project, most of these tools have been improved and, in some cases, they have also been integrated in (multilingual) linguistic analysis platforms. In addition to existing tools, some new Linguistic Processors have been developed, both at LP1 and LP2. Moreover, in addition to developing modules for their own languages, most of the partners of the project have also developed linguistic processors working for English.

All Linguistic Processors available at LP2, both existing and new, are described in Tables 3 and 4. Processors already available at the beginning of the project but which have been improved are shown in italic. New processors developed at LP1 are in bold, while new processors developed at LP2 are in bold and italic.

Moreover, some of the linguistic processors developed within the Meaning project have been made publicly available on-line by the project partners. These on-line tools are listed in Section 5, whereas a detailed description of the other new tools not available on-line is presented in Appendix A.

Finally, it is important to mention the Word Sense Disambiguation modules which have been developed by all the project partners as one of the main goals of the Meaning project. These modules are described in detail in the WP6 deliverable on WSD.

4.2 Lexical Resources

The main lexical resource adopted in the Meaning project is WordNet. WordNets for different languages have been developed by the partners of the Consortium and, to maintain compatibility among them, a Multilingual Central Repository has been created (WP4). The knowledge acquired from each language has been consistently uploaded to the Multilingual Central Repository and ported over to the local WordNets involved in the project.

Moreover, besides the WordNets, some of the partners have developed other lexical resources useful for the Meaning experiments. In particular, the following resources are available at LP2:

• WordNet-Domains 2.0. The new version of WordNet-Domains. The resource has been revised: the hierarchy has been reorganized, and all the domains have been mapped with the corresponding DDC codes. See Working Paper 3.6.
• Spanish MiniDir. A WSD-oriented sense dictionary, containing sense definitions related to WordNet synsets, as well as examples and collocations for each sense. It also contains multiword senses and synonyms. The sense inventory covers 46 Spanish words used in the Senseval-3 lexical sample task. The data encoding format is XML.
• Catalan MiniDir. A WSD-oriented sense dictionary, containing sense definitions related to WordNet synsets, as well as examples and collocations for each sense. It also contains multiword senses and synonyms. The sense inventory covers 27 Catalan words used in the Senseval-3 lexical sample task. The data encoding format is XML.
• EDBL. A lexical database for Basque available on-line (see Section 5).

It is worth noting that some of the lexical resources developed within the Meaning project have been made publicly available on-line by the project partners. These resources are listed in Section 5, whereas complete data about the new versions of the single WordNets available at the partners' sites at LP2 are reported in Appendix B.

| Processor | Basque | Catalan | Spanish |
|---|---|---|---|
| Language Identifier | LangId (EHU) | LangId (EHU) | LangId (EHU) |
| Tokenizer | Tokenizer (EHU) | Freeling-tok (UPC) | Freeling-tok (UPC) |
| Sentence Splitter | | Freeling-split (UPC) | Freeling-split (UPC) |
| Morphological Analyzer | Lemati (EHU) | Freeling-morpho (UPC) | Freeling-morpho (UPC) |
| Word Aligner | | | |
| Key Concept Extractor | | | |
| Chunker | Zatiak (EHU) | | |
| NE Recognizer | Eihera (EHU) | Freeling-nerc (UPC) | Freeling-nerc (UPC) |
| Multiword Identifier | Lemati-MWD (EHU) | | |
| PoS Tagger | Euslem (EHU) | Freeling-relax (UPC); Freeling-hmm (UPC) | Freeling-relax (UPC); Freeling-hmm (UPC); Spanish TreeTagger (UPC) |
| Parser | | Freeling-CP (UPC) | Freeling-CP (UPC) |
| Text Classifier | Sailka (EHU) | KB-TC (UPC); ML-TC (UPC) | Classifier (UPC) |
| Time Recogn/Norm | | | |
| LSA | DDS (irst) | DDS (irst) | DDS (irst) |

Table 3: Linguistic processors available at LP2 for Basque, Catalan, and Spanish.

| Processor | English | Italian |
|---|---|---|
| Language Identifier | LangId (EHU) | |
| Tokenizer | Freeling-tok (UPC); Tokenizer (Sussex) | TokenPro (irst); LEX-Tokenizer (irst); CL-Tokenizer (irst) |
| Sentence Splitter | Freeling-split (UPC) | SentencePro (irst) |
| Morphological Analyzer | Analyzer (Sussex); Freeling-morpho (UPC) | MorphoPro (irst); FSTAN (irst) |
| Word Aligner | KNOWA (irst) | KNOWA (irst) |
| Key Concept Extractor | KX (irst) | KX (irst) |
| Chunker | NG (Sussex) | Italian Chunker (irst) |
| NE Recognizer | LearningPinocchio (irst); NERD (irst); Freeling-nerc (UPC) | LearningPinocchio (irst); NERD (irst) |
| Multiword Identifier | MWD-Id (Sussex) | |
| PoS Tagger | PoS Tagger (Sussex); Freeling-relax (UPC); Freeling-hmm (UPC); SVMTool (UPC) | SSI Tagger (irst); LemmaPro (irst) |
| Parser | RASP (Sussex); Freeling-CP (UPC) | |
| Text Classifier | Categorizer (Sussex) | |
| Time Expression Recogn/Normalization | | Chronos (irst) |
| LSA | DDS (irst) | DDS (irst) |

Table 4: Linguistic processors available at LP2 for English and Italian.

4.3 Corpora

Corpora play a crucial role within the Meaning project as they represent an important source of lexical information which is useful both in the acquisition and in the disambiguation tasks. Deliverable D3.1 reported the corpora available at the Consortium at the beginning of the project. These corpora have been progressively annotated, automatically, at different levels.

Besides the existing corpora, a number of new corpora have been developed during the project, such as the parallel English/Italian MultiSemCor corpus (see deliverable D3.2 - WP3.4), a Spanish TreeBank and an English domain-specific sense-tagged corpus (see Appendix C).

Some of these Meaning corpora have been made publicly available on-line by the project partners. They are listed in Section 5.

Moreover, the Meaning Consortium has actively participated in the Senseval 3 evaluation exercises for Word Sense Disambiguation. For this purpose, annotated corpora for all the Meaning languages have been developed for both training and testing of the WSD systems participating in the competition. These corpora are described in detail in the papers published in the Senseval 3 Workshop Proceedings by the organizers of the Italian (ITC-irst), Spanish and Catalan (UPC), and Basque (EHU) lexical sample tasks. Both these papers and the corpora can be directly downloaded from the Senseval 3 web site:
http://www.senseval.org/senseval3

5 On-line tools and resources

A very important achievement of the Meaning project is that some of the developed linguistic processors, lexical resources, and corpora have been made publicly available on-line to the scientific community. The respective web sites are listed below:

• Eulia. An XML client in Java for integrated EUSLEM (lemmatizer/tagger for Basque)
  http://ixa3.si.ehu.es/eulia
• TextPro. A text analysis package working for Italian and English.
  http://tcc.itc.it/projects/textpro/
• Freeling. The FreeLing package consists of a library providing language analysis services (such as morphological analysis, date recognition, PoS tagging, etc.). The current version (1.2) of the package provides tokenizing, sentence splitting, morphological analysis, NE detection, date/number/currency recognition, PoS tagging, and chart-based shallow parsing. Future versions will improve performance in existing functionalities, as well as incorporate new features, such as NE classification, document classification, etc.
  http://www.lsi.upc.es/~nlp/freeling
• SVMTool. The SVMTool is a simple and effective part-of-speech tagger based on Support Vector Machines. By means of a rigorous experimental evaluation, we conclude that the proposed SVM-based tagger is robust and flexible for feature modelling (including lexicalization), trains efficiently with almost no parameters to tune, and is able to tag thousands of words per second, which makes it really practical for real NLP applications.
  http://www.lsi.upc.es/~nlp/SVMTool
• WN-Mappings. Mappings between different WN versions, performed using mainly structural information (i.e. synset relationships), and some help from glosses and synset words in the cases where structure was not enough (adjectives and adverbs).
  http://www.lsi.upc.edu/~nlp/tools/mapping.html

• WN-Mapping software. C++ code to perform mappings between different WN versions, using the relaxation labelling algorithm.
  http://www.lsi.upc.edu/~nlp/tools/mapping.html
• EuskoWordNet. Basque WordNet
  http://siuc02.si.ehu.es/cgi-bin/mcrWei/public/wei.consult.perl
• EDBL. A lexical database for Basque
  http://ixa2.si.ehu.es/edbl
• MultiWordNet. Italian WordNet aligned with the English and Spanish WordNets.
  http://multiwordnet.itc.it
• MultiSemCor. An English/Italian parallel corpus, aligned at the word level and morphosyntactically and semantically annotated according to WordNet senses.
  http://multisemcor.itc.it

6 Usage of tools and resources in the second Meaning cycle

In this Section we describe the tools and resources that have been used in the second Meaning cycle for acquisition, word sense disambiguation, and evaluation.

6.1 WP5 Acquisition

Experiment A (Multilingual Acquisition)
    Tools: Basic linguistic processors, NE Recognizer, Chunker
    Resources: The "Sport" and "Finance" components of the corpora representing the different languages involved in the project

Experiment C (Domain for NEs)
    Tools: NERD for English, POS tagger for English
    Resources: The Reuters corpus and the MCR (WordNet Domains) for the English language

Experiment D (Topic Signatures)
    Tools: Basic linguistic processors
    Resources: WEB

Experiment E (Sense Examples)
    Tools: Basic linguistic processors
    Resources: MCR, WEB

Experiment G (Selectional Preferences)
    Tools: RASP tokeniser, POS tagger, lemmatiser and parser (with grammatical relations output)
    Resources: BNC corpus

Experiment I (Multiword Acquisition)
    Tools: RASP tokeniser, POS tagger, lemmatiser and parser (with grammatical relations output)
    Resources: BNC corpus

Experiment J (Enriching WordNet with collocations)
    Tools: Basic linguistic processors
    Resources: BNC corpus

6.2 WP6 Word Sense Disambiguation

Experiment A (All-words for English)
    Tools: sentence splitter, POS tagger, lemmatizer, MINIPAR, NERD NE recognizer
    Resources: SemCor (for training), MCR, Senseval-2 all-words corpus, WN-Domains, SUMO, BNC, (EuroWordNet Top Ontology)

Experiment E (All-words non-English)
    Tools: POS tagger, lemmatizer
    Resources: For WSD we used the MCR and WN-Domains. In the evaluation phase, for Italian we used MultiSemCor, while for Spanish and Catalan we used the Senseval-3 lexical sample corpus.

Experiment F (Features)
    Tools: POS tagger, lemmatizer
    Resources: WN-Domains, SUMO, EuroWordNet Top Ontology, MCR

Experiment G (Unsupervised WSD)
    Tools: RASP tokeniser, POS tagger, lemmatiser and parser (with grammatical relations output)
    Resources: BNC corpus

Experiment (Effect of sense clusters)
    Tools: Basic linguistic processors
    Resources: MCR, corpora

Experiment J (Semantic Class classifiers)
    Tools: Basic linguistic processors
    Resources: WN-Domains, SUMO, EuroWordNet Top Ontology, MCR, WordNet semantic files

Experiment K (Automatic ranking of word senses)
    Tools: RASP tokeniser, POS tagger, lemmatiser and parser (with grammatical relations output)
    Resources: BNC, Reuters

Experiment L (Disambiguating WordNet glosses)
    Tools: Basic linguistic processors
    Resources: MCR

6.3 WP8 Evaluation

For the evaluation of the project the Multilingual Central Repository is being extensively used. For a detailed description see Deliverable D8.3.

7 Conclusion

This deliverable reported an analysis of the situation of each partner of the Meaning project with respect to the availability of Linguistic Processors and Resources after the second phase of the project (LP2). The systems that have been developed within WP5 (Acquisition) and WP6 (Word Sense Disambiguation) relied on the tools and the different kinds of resources developed within WP3. First, a comparison with the previous plans of development has been carried out to verify whether they have been accomplished. Second, a final inventory of tools, lexical resources, and corpora which have been developed or enhanced within the project has been presented. Then, a list of web sites where a number of Meaning tools and resources can be freely accessed has been presented. Finally, a description of how these tools and resources have been used in the second cycle of acquisition (WP5) and disambiguation (WP6) has been given.

A New Linguistic Processors developed within the project (not available on-line)

This appendix contains a detailed description of the new linguistic processors which are available at the different partners' sites at LP2 but not available on-line. For the linguistic processors available on-line, the web site is reported in Section 5.

A.1 EHU Processors

A.1.1 Eihera (LP1)

Type: Named entity recognizer [Alegria et al., 2003]
Author: EHU
Description: Detects the boundaries of NEs. Assigns a class to each detected entity (person, location, organization, misc)
Languages: Basque
Portability: Difficult. A grammar is used.
Requirements: Unix/Linux. xfst.
I/O format: The tool itself has no I/O format but it uses the output from EUSLEM. The CoNLL format or an XML output (integrated with POS tagging and chunking) can be obtained.
Examples:
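The CoNLL-style output mentioned under I/O format can be consumed with a few lines of code. The following Python sketch is an illustration only: it assumes a whitespace-separated layout with the token in the first column and a BIO tag (B-PER, I-LOC, O, ...) in the last column, which is the usual CoNLL shared-task convention but is not confirmed here for Eihera. The function name read_entities and the sample tokens are invented for the example.

```python
from typing import Iterable, List, Optional, Tuple


def read_entities(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Collect (entity text, class) pairs from CoNLL-style token/tag lines.

    Assumption: first column = token, last column = BIO tag.
    """
    entities: List[Tuple[str, str]] = []
    tokens: List[str] = []
    label: Optional[str] = None

    def flush() -> None:
        # Close the entity currently being built, if any.
        nonlocal tokens, label
        if tokens:
            entities.append((" ".join(tokens), label))
        tokens, label = [], None

    for line in lines:
        fields = line.split()
        if not fields:
            # A blank line marks a sentence boundary.
            flush()
            continue
        token, tag = fields[0], fields[-1]
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != label):
            # A new entity starts here.
            flush()
            tokens, label = [token], tag[2:]
        elif tag.startswith("I-"):
            # Continuation of the current entity.
            tokens.append(token)
        else:
            # "O" tag: outside any entity.
            flush()
    flush()
    return entities


# Made-up sample, not actual Eihera output.
SAMPLE = [
    "Miren B-PER",
    "Agirre I-PER",
    "Bilbon B-LOC",
    "dago O",
    ". O",
]

print(read_entities(SAMPLE))
# [('Miren Agirre', 'PER'), ('Bilbon', 'LOC')]
```

The four classes listed in the description (person, location, organization, misc) would appear as the tag suffixes under this convention.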

  • A.2 UPC Processors

    A.2.1 HMM Tagger (LP1)
    Type: PoS Tagger
    Author: Muntsa Padro
    Description: Trigram-based HMM tagger
    Languages: Spanish, Catalan, English
    Portability: May be ported to any language, provided that training corpora are available.
    Requirements: C++ compiler, training corpora
    I/O format: No specific I/O format. The tagger is a library that may be called from any main application.
    Examples:
    Input:
        the the DT 1
        cat cat NN 0.444444 cat NNP 0.444444 cat VBP 0.111111
        eats eat VBZ 0.875 eats NNS 0.125
        fish fish NN 0.490196 fish NNS 0.490196 fish VBP 0.0196078
        . . Fp 1
    Output:
        the the DT
        cat cat NN
        eats eat VBZ
        fish fish NN
        . . Fp

    A.2.2 Text Classification tool (LP2)
    Type: Text Classification tool
    Author: Xavier Carreras
    Description: A text categorization tool trained on the EFE collection for the year 2000. It uses the IPTC top-level categories. Accuracy is over 80% F1.

  • Languages: Spanish
    Portability: Low. It is a prototype consisting of a loosely integrated set of scripts.
    Requirements: Perl
    I/O format: Takes a document repository in XML as input; outputs a set of weights and predictions for each category.
    Examples:
    Input:
    ONU-CHIPRE
    DE SOTO CONVERSARA PRESIDENTE CHIPRE Y DIRIGENTE TURCOCHIPRIOTA

    Alvaro de Soto, el enviado especial del secretario general de la ONU, el ghanés Kofi Annan, para Chipre, emprendió hoy viaje a ese país, en donde se entrevistará con el presidente chipriota, Glafcos Clerides, y con el máximo dirigente turcochipriota, Rauf Denktash.

    Chipre se encuentra dividida desde que, en 1974, las tropas turcas tomaran por la fuerza el norte del país, después de un golpe de Estado militar en Nicosia, inspirado por la junta militar que entonces gobernaba en Grecia

    Output:
    ACE -7.406883 -1
    CLJ -5.691495 -1
    EDU -25.466964 -1
    FIN -7.697371 -1
    HTH -4.994665 -1
    HUM -13.868578 -1
    LAB -10.420984 -1
    POL -0.461913 -1
    REL -24.914497 -1

  • SCI -9.716915 -1
    SPO -27.164951 -1
    WAR 3.693150 +1
    WEA -56.989698 -1

    A.3 University of Sussex Processors

    A.3.1 Multiword Identifier (LP1)
    Type: Multiword identifier
    Author: Diana McCarthy (UoS)
    Description: A simple program that takes raw text as input and outputs the same text, with occurrences of multiword expressions changed so that adjacent words in the expression are joined with an underscore (_). The multiwords to be handled are specified in lists and fed to a preprocessor, which automatically produces the flex code required for the data in these lists. Currently, only multiwords made up of adjacent words are handled. Variation in capitalisation and in the spacing around the multiwords is permitted, and there is an option to indicate those multiwords which can take a regular (English) plural. Further extensions to allow more variation should be made in a module envisaged for a post-lemmatisation stage. Phrasal verbs can be identified in English by the RASP parser.
    Dependencies: Lists of multiwords that occur adjacently. Currently only handles variation in capitalisation, spacing, and regular plurals.
    Languages: English, but it could be used for other languages, depending on the availability of multiword lists in that language and a change to the pattern currently used for the recognition of regular plurals.
    Portability: Programmed in Perl and Flex. Easy to port.
    Requirements: Perl and Flex.
    I/O format: In: raw text and lists of multiwords; Out: raw text, with adjacent multiwords joined with an underscore.
    Examples:
    Input:
    Now I really like shish kebabs, a pina colada and several vol au vents,
    tout de suite.
    Output:

  • Now I really like shish_kebabs , a pina_colada and several vol_au_vents, tout_de_suite.

    A.4 ITC-irst Processors

    A.4.1 SentencePro (developed at LP1, enhanced at LP2)
    Type: Sentence Splitter
    Version: 31-12-2004
    Author: Emanuele Pianta, Simone Romagnoli, Christian Girardi (ITC-irst)
    Description: Splits a stream of tokens into sentences.
    Languages: Italian and English
    Portability: Easy. Requires a list of abbreviations, a list of non-breaking abbreviations and a list of breaking tags.
    Requirements: Unix/Linux/Windows. Perl 5.
    I/O format: SentencePro reads a tokenized text (one token per line) from the standard input and writes a revised tokenized text with sentence boundaries to the standard output. New abbreviations may be identified in the output, and sentence boundaries are marked with a special tag.
    Examples:
    Input:
    IncidentesullaA22coinvolgeilsig.Valdelli.Trafficointerrottodalletrea.m.

    Three different formats of the output are available:

  • Output 1:Incidente sulla A22 coinvolge il sig. Valdelli . Traffico interrotto dalle tre a.m.

    Output 2:IncidentesullaA22coinvolgeilsig.Valdelli.Trafficointerrottodalletrea.m.

    Output 3:IncidentesullaA22coinvolgeilsig.Valdelli.Trafficointerrottodalletrea.m.
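The splitting behaviour described for SentencePro can be illustrated with a minimal Python sketch. This is not the tool's Perl implementation: the function name `split_sentences`, the input tokenization (full stops detached from the preceding word), and the contents of the abbreviation and breaking-token lists are all assumptions made for illustration. The sketch only covers the abbreviation list and the set of breaking tokens; the separate list of non-breaking abbreviations mentioned under Portability is not modelled.

```python
def split_sentences(tokens, abbreviations, breaking=frozenset({".", "!", "?"})):
    """Group a token stream into sentences (hypothetical helper, not SentencePro itself).

    A detached "." is merged into the previous token when the result is a known
    abbreviation (e.g. "sig" + "." -> "sig."); otherwise a breaking token closes
    the current sentence.
    """
    sentences, current = [], []
    for tok in tokens:
        if tok == "." and current and current[-1] + "." in abbreviations:
            current[-1] += "."      # revise the tokenization: abbreviation found
            continue
        current.append(tok)
        if tok in breaking:
            sentences.append(current)
            current = []
    if current:                      # flush a trailing sentence without a final stop
        sentences.append(current)
    return sentences


# Assumed input tokenization for the example sentence pair used above.
tokens = ["Incidente", "sulla", "A22", "coinvolge", "il", "sig", ".", "Valdelli", ".",
          "Traffico", "interrotto", "dalle", "tre", "a.m", "."]
abbreviations = {"sig.", "a.m."}    # assumed abbreviation list

for sentence in split_sentences(tokens, abbreviations):
    print(" ".join(sentence))
```

Under these assumptions the two printed lines correspond to the two sentences of Output 1, with "sig." and "a.m." kept as single tokens.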

    A.4.2 LemmaPro (LP1)
    Type: Part-of-Speech tagger

  • Author: Simone Romagnoli, Emanuele Pianta (ITC-irst)
    Description: Given a text in ASCII format, performs tokenization, morphological analysis, part-of-speech tagging and lemma selection.
    Languages: Italian, English
    Portability:
    - A PoS-tagged corpus is required.
    - A tokenizer is required.
    - A morphological analyser is required.
    - A PoS tagger is required.
    Requirements: Written in Perl; requires Perl 5. Works on both Solaris and Windows.
    I/O format: LemmaPro takes a text in ASCII format as input and returns a lemmatized text arranged in tab-separated columns; sub-columns are separated by blanks.
    Examples:
    Call example:
    LemmaPro.pl -c token+pos+lemma+comp_morpho+full_morpho -l eng -o testo-eng.txt.out testo-eng.txt
    Input:
    The sleeping cats were sleeping on the beds after they fought!
    Output:
    The AT0 the the+art the+adv the+art
    sleeping AJ0 sleeping sleeping+adj+zero sleep+v+gerund+pres sleeping+n+sing sleeping+adj+zero
    cats NN2 cat cat+n+plur cat+n+plur cat+indic+pres+sing3
    were VBD be be+v+indic+past be+v+indic+past
    sleeping VVG sleep sleep+v+gerund+pres sleep+v+gerund+pres sleeping+n+sing sleeping+adj+zero
    on PRP on on+prep on+n+sing on+adv on+adj+zero on+prep
    the AT0 the the+art the+adv the+art
    beds NN2 bed bed+n+plur bed+n+plur bed+indic+pres+sing3
    after CJS after aft+adj+comp after+prep after+adv after+adj+zero after+conj
    they PNP they they+pron they+pron
    fought VVD fight fight+v+indic+past fight+v+indic+past

  • fight+v+part+past
    ! PUN

    A.4.3 Italian Chunker (LP1)
    Type: Chunker Parser
    Author: Octavian Popescu, Manuela Speranza (ITC-irst)
    Description: Performs non-recursive chunking of a sentence. The chunker, originally made by Kinyon, is a client-server Java program, and the analysis is rule-based. The rules we employed were strictly specialized to our particular purposes, i.e. to identify the verbal and nominal groups with as much accuracy as possible. A rule specifies what may start or end a grammatical phrase. Items whose grammatical role cannot be found are left untagged. We purposely did not devise rules for semantically void elements (such as pronouns). As a consequence, a sentence whose subject is a pronoun with a referent that cannot be identified within reach is treated as having no subject at all.
    Languages: Italian
    Portability: High
    Requirements: Unix/Linux. Java library
    I/O format: INPUT: txt; OUTPUT: txt
    Examples:
    [ Il_det_RS confronto_confronto_SS ]
    [ e'__VIY impressionante_impressionare_VSP ]
    ,_comma_XPW
    [ secondo_secondo_E i_det_RP dati_dato_SP raccolti_raccolto_AP ]
    [ dal_da_ES Cer__SPN ]
    :_colon_XPO
    [ l'_det_RS evasione_evasione_SS -__XPO elusione_elusione_SS teorica_teorico_AS ]
    (_open_parenthesis_XPB
    [ lrapporto_rapporto_SS ]
    [ tra_tra_E valore_valore_SS ]
    [ aggiunto_aggiungere_VSP dichiarato_dichiarare_VSP ]
    [ al_a_ES fisco_fisco_SS ]
    [ e_e_C ]
    [ valore_valore_SS aggiunto_aggiunto_AS risultante_risultante_SS ]
    [ dalla_da_ES contabilita'_contabilita`_SN nazionale_nazionale_AS ]

  • )_close_parenthesis_XPB
    [ era_essere_VI ]
    ,_comma_XPW
    [ nel_in_ES 1989__N ]
    ,_comma_XPW
    [ attorno__E ]
    [ al_a_ES 10%__N ]
    [ nell'_in_ES industria_industria_SS ]
    ,_comma_XPW
    [ del_di_ES 55%__N circa_circa_B ]
    [ nei_in_EP servizi_servizio_SP ]

    A.4.4 NERD (LP1)
    Type: NE recognizer [Magnini et al., 2002] [Negri and Magnini, 2004]
    Author: Matteo Negri, Bernardo Magnini (ITC-irst)
    Description: NERD is a multilingual Named Entity Recognizer for Italian and English. The system has been designed for the identification and categorization of entity names (such as person, location and organization names), temporal expressions (dates and times) and certain types of numerical expressions (measures, monetary values, and percentages) in a written text. NERD relies on the combination of a set of language-dependent rules (approximately 350 for English and 400 for Italian) with a set of language-independent predicates, defined on the MultiWordNet hierarchy, for the identification of both proper nouns and trigger words. The system, integrated into the DIOGENE Question Answering architecture, has been successfully used for the ITC-irst participation in the last two editions of the TREC QA main task and in the first edition of the CLEF multiple language QA track.
    Table 5 shows the evaluation results for the system's performance (both for English and Italian) over the three critical NE categories PERSON, LOCATION, and ORGANIZATION.

                      Recall          Precision       F-Measure
    PERSON            91.48 (73.19)   85.08 (73.59)   88.16 (73.39)
    LOCATION          97.27 (91.90)   80.45 (85.78)   88.07 (88.74)
    ORGANIZATION      83.88 (90.15)   72.70 (74.84)   77.89 (81.79)
    All categories    91.32           74.75           82.21

    Table 5: Overall Precision, Recall and F-Measure scores over the categories PERSON, LOCATION, and ORGANIZATION

  • Languages: English and Italian
    Portability: PoS tagger and MultiWordNet (offline)
    Requirements: Written in Allegro Common Lisp; Solaris 2.7
    I/O format: INPUT: a text; OUTPUT: a text in XML format with the NEs tagged
    Input format:
    APW19980314.0392 NEWS STORY 03/14/1998 10:36:00 Russian show of power in World Cup finale
    OSLO, Norway (AP) _ Russia's Alexey Prokurorov had no real challengers as he won the 50-kilometer classical-style cross country World Cup race in Holmenkollen Saturday, clocking 2 hours, 32 minutes, 25.3 seconds.
    Output format:
    APW19980314.0392 NEWS STORY 03/14/1998 10:36:00 Russian show of power in World Cup finale
    OSLO,Norway(AP) _Russia'sAlexey Prokurorovhad no real challengers as he won the

  • 50-kilometerclassical-style cross country World Cup race inHolmenkollenSaturday, clocking2 hours, 32 minutes, 25.3 seconds.

    A.4.5 TextPro (LP2)
    Type: NLP wrapper
    Author: Emanuele Pianta, Christian Girardi, Oleksandr Vagin
    Description: TextPro integrates the following tools: TokenPro, SentencePro and LemmaPro.
    Languages: English, Italian
    Portability: TokenPro, SentencePro and LemmaPro are required.
    Requirements: Unix/Linux/Windows. Perl 5.
    I/O format: TextPro takes a raw text or HTML file as input. By default it returns a file with the complete annotation; options make it possible to disable one or more annotations. The output file contains one token per line, with the annotations for that token on the same line, separated by tabs.
    Examples:
    Call example:
    TextPro -c token+pos+sentence+lemma+comp_morpho -l ita testo-ita.txt
    Input:
    IncidentesullaA22coinvolgeilsig.Valdelli.Traffico

  • interrottodalletrea.m.

    Output:
    Incidente - SS incidente incidente+n+m+sing
    sulla - ES
    A22 - N
    coinvolge - VI coinvolgere coinvolgere+v+indic+pres+nil+3+sing
    il - RS det det+art+m+sing
    sig. - YA
    Valdelli - SPN
    . XPS
    Traffico - SS traffico traffico+n+m+sing
    interrotto - AS interrotto interrotto+adj+m+sing+pst
    dalle - EP
    tre - N tre tre+adj+_+_+pst+num
    a.m. YA

    A.4.6 Chronos (LP2)
    Type: Tool for the recognition and normalization of temporal expressions
    Author: Matteo Negri, Luca Marseglia (ITC-irst)
    Description: Chronos [Negri and Marseglia, 2004] is a tool for the recognition and normalization of temporal expressions. It extends the capabilities of NERD [Magnini et al., 2002] [Negri and Magnini, 2004], a rule-based multilingual Named Entity Recognizer for Italian and English. Chronos both recognizes and normalizes temporal expressions within an input English text. To this end, the system automatically annotates textual data with the TIMEX2 tag [Ferro et al., 2003], which includes attributes for expressing the normalized, intended meaning or value of a broad range of temporal expressions. The system was evaluated within the Full Task framework of the ACE/TERN 2004 evaluation campaign, ranking 2nd out of 12 participants.
    Table 6 shows the evaluation results obtained by the system at TERN-2004. Participating systems were evaluated on their performance both on the detection/bracketing of temporal expressions (TIMEX2 and TEXT rows in Table 6) and on normalization (ANCHOR_DIR, ANCHOR_VAL, MOD, SET, and VAL rows) over the different normalization attributes required by the TIMEX2 annotation formalism.
    Languages: English

  •               Recall   Precision   F-Measure
    TIMEX2        0.880    0.976       0.926
    ANCHOR_DIR    0.698    0.833       0.760
    ANCHOR_VAL    0.775    0.683       0.726
    MOD           0.720    0.837       0.774
    SET           0.564    0.880       0.688
    TEXT          0.798    0.885       0.839
    VAL           0.870    0.875       0.872

    Table 6: Precision, Recall and F-Measure scores obtained by Chronos at TERN-2004

    Portability: PoS tagger
    Requirements: Written in Allegro Common Lisp; Solaris 2.7
    I/O format: INPUT: a text; OUTPUT: a text in XML format with the temporal expressions tagged with TIMEX2 tags
    Examples:
    January 10, 2005
    Sunday
    5:00 p.m.
    five days ago
    the early 1990s
    the past 10 years
    daily

    A.4.7 DDS: Domain Driven Similarity (LP2)
    Type: Term similarity and domain model inference
    Author: Alfio Gliozzo and Carlo Strapparava
    Description: We define a domain model (DM) for a set of instances described by a set of features as a (soft) cluster of both the instances and the features. More formally, let $D = \{D_1, D_2, \ldots, D_{k'}\}$ be a set of domains; a domain model is a mapping function $\mathcal{D} : F \cup \mathbb{R}^{k} \to \mathbb{R}^{k'}$.
    Thus a Domain Model induces a mapping from both the vector space $\mathbb{R}^{k}$ of the training dataset $T$ and the feature set $F$ into a common domain VSM, i.e. into a space $\mathbb{R}^{k'}$ of lower dimensionality ($k' \ll k$). In this space both the instances and the features are represented by means of Domain Vectors (DVs), i.e. vectors expressing the domain relevance of an object with respect to all the domains in $D$. DVs for

  • both instances and features are then expressed in a common feature set, which makes it possible to estimate the similarity among them.
    Latent Semantic Analysis (LSA) is an unsupervised technique for estimating the similarity among texts and terms in a corpus. Even though it was developed in the area of computational linguistics, it can be applied to any dataset from which a features-by-instances matrix can be extracted. LSA is performed by means of a Singular Value Decomposition (SVD) of the term-by-document matrix describing the training corpus. SVD performs a simultaneous principal component analysis on both the feature and the instance space, finding the most informative dimensions. These are used as a basis to define a new space, in which the number of dimensions required to describe the original space is considerably lower, while most of the information is preserved. SVD thus induces a simultaneous mapping from both the instance and the feature space to a common LSA space, in which the similarity among features and instances can be uniformly estimated.
    SVD decomposes $T$ (i.e. the matrix describing the training dataset) into three matrices, $T \simeq F \Sigma_{k'} I^{T}$, where $\Sigma_{k'}$ is the diagonal $k \times k$ matrix containing the highest $k'$ eigenvalues of $T$, with all the remaining elements set to 0. The matrix $F_{p \times k'}$ contains the DVs for all the features in $F$, while the matrix $I_{p \times k'}$ contains the DVs for all the instances in $T$. The DV for the feature $f_i$ is the vector $\vec{f}'_i / |\vec{f}'_i|$, where $\vec{f}'_i$ is the $i$-th row of the matrix $F_{p \times k'}$. The parameter $k'$ is the dimensionality of the LSA space and can be fixed in advance.
    Languages: The technique is language independent.
    Portability: The module is written in Allegro Common Lisp
    Requirements: A corpus of documents
    I/O format: Each line is a document. Each token in a line represents a term with its frequency in the document.
    Examples:
    Input:
    agricolo#a:3 agriturismo#n:1 alimento#n:10 aprire#v:8 ...
    arricchire#v:1 avicoltura#n:1 capitale#n:3 cibo#n:11 conferenza#n:1 confermare#v:1 ...
    ...
    Output: The output matrices of the LSA.
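To make the DDS input format concrete, the following Python sketch reads documents in the `lemma#pos:frequency` layout shown in the example and builds the term-by-document matrix on which the SVD would then be computed. It is an illustration only: the DDS module itself is written in Allegro Common Lisp, and the function names `parse_document` and `term_by_document_matrix`, as well as the choice to skip elision tokens such as `...`, are assumptions. The decomposition step is left to a linear algebra library and is not shown.

```python
from collections import Counter


def parse_document(line):
    """Parse one input line into a Counter mapping 'lemma#pos' to its frequency."""
    counts = Counter()
    for token in line.split():
        term, sep, freq = token.rpartition(":")
        if not sep or not freq.isdigit():
            continue  # not a term:frequency pair (e.g. the "..." elision)
        counts[term] += int(freq)
    return counts


def term_by_document_matrix(lines):
    """Return (terms, matrix): one row per term, one column per document."""
    docs = [parse_document(line) for line in lines]
    terms = sorted(set().union(*docs))
    matrix = [[doc[term] for doc in docs] for term in terms]
    return terms, matrix


# The two (truncated) documents from the example input above.
lines = [
    "agricolo#a:3 agriturismo#n:1 alimento#n:10 aprire#v:8 ...",
    "arricchire#v:1 avicoltura#n:1 capitale#n:3 cibo#n:11 conferenza#n:1 confermare#v:1 ...",
]
terms, matrix = term_by_document_matrix(lines)
for term, row in zip(terms, matrix):
    print(term, row)
```

Each printed row gives the frequency of one term in the two documents; terms absent from a document get a count of 0.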

  • B New versions of WordNets at LP2

    This appendix contains information about further developments of the WordNets originally available at the different partners' sites. Some of these WordNets are also available on-line. See Section 5 for the web site addresses.

    B.1 Basque WordNet (EuskWordNet)
    Version: 31-12-2004
    Author: UPV/EHU
    Description: Version aligned with PWN 1.6
    Characteristics: Data and relations are listed below. The Basque WordNet contains 20 different EWN relations.

                                         Total    Nouns    Verbs   Adjectives   Adverbs
    Word senses                          51,243   41,871   9,232   140          -
    Lemmas                               25,563   22,486   3,179   50           -
    Synsets                              31,286   27,874   3,299   113          -
    Proper nouns                         677      -        -       -            -
    New Synsets                          -        -        -       -            -
    Gaps corresponding to PWN synsets    1,239    1,205    26      8            -

    Relations          Number    Relations            Number
    be-in-state        38        verb-group           63
    causes             110       near-synonym         22
    has-derived        1         role                 5,482
    has-hyponym        22,300    role-agent           86,591
    has-mero-madeof    203       role-instrument      253
    has-mero-member    310       role-location        70
    has-mero-part      1,969     role-patient         116,691
    has-subevent       125       see-also-wn15        135
    nearest            81        xpos-fuzzynym        35
    near-antonym       1,092     xpos-near-synonym    250
    TOTAL              235,821

    Alignment to PWN version: 1.6
    Database: mSQL, but it is being ported to MySQL
    Interface: Web Interface (WEI). A Perl and SOAP API is being developed.

  • B.2 Spanish WordNet
    Version: 31-12-2004
    Author: UPC-UB
    Description: New release of the Spanish WordNet
    Characteristics: Data and relations are listed below. The Spanish WordNet contains 21 different relations from EWN. In this version the relation "entailment-wn15" is now encoded as "has-subevent", while "relational-adj-wn15" is encoded as "pertains-to". The Spanish EuroWordNet also contains all the relationships from the original Princeton WordNet, but only relations for the Spanish WordNet synsets are reported here.

                                         Total    Nouns    Verbs    Adjectives   Adverbs
    Word senses                          94,367   63,426   12,459   18,482       -
    Lemmas                               62,022   47,672   5,297    9,053        -
    Synsets                              67,351   43,367   9,043    14,941       -
    Proper nouns                         -        -        -        -            -
    New Synsets                          5,831    5,344    213      274          -
    Gaps corresponding to PWN synsets    5,059    424      1,564    3,071        -

    Relations          Number    Relations            Number
    be-in-state        1,176     near-antonym         5,491
    causes             189       near-synonym         17,500
    has-subevent       356       pertains-to          42
    has-derived        2,154     verb-group           195
    has-hyponym        50,161    xpos-fuzzynym        36
    has-mero-madeof    338       see-also-wn15        3,033
    has-mero-member    5,182     xpos-near-synonym    307
    has-mero-part      3,930     has-xpos-hyponym     477
    role               102       role-location        82
    role-agent         504       role-patient         6
    role-instrument    282
    TOTAL              91,543

    Alignment to PWN version: 1.5
    Database: mSQL (but it is being ported to MySQL)
    Interface: Web Interface (WEI). A Perl and SOAP API is being developed.

  • B.3 Catalan WordNet
    Version: 31-12-2004
    Author: UPC-UB
    Description: New release of the Catalan WordNet
    Characteristics: Data and relations are listed below. The Catalan WordNet contains 16 different relations from EWN. In this version the relation "entailment-wn15" is now encoded as "has-subevent", while "relational-adj-wn15" is encoded as "pertains-to". The Catalan EuroWordNet also contains all the relationships from the original Princeton WordNet, but only relations for the Catalan WordNet synsets are reported here.

                                         Total    Nouns    Verbs    Adjectives   Adverbs
    Word senses                          66,686   46,860   11,591   8,235        -
    Lemmas                               43,243   34,288   4,616    4,339        -
    Synsets                              43,722   33,042   5,907    4,773        -
    Proper nouns                         -        -        -        -            -
    New Synsets                          676      666      10       -            -
    Gaps corresponding to PWN synsets    2,206    954      560      692          -

    Relations          Number    Relations            Number
    be-in-state        569       near-antonym         2,408
    causes             156       near-synonym         4,176
    has-subevent       213       pertains-to          15
    has-derived        288       verb-group           101
    has-hyponym        36,487    role-agent           1
    has-mero-madeof    297       see-also-wn15        696
    has-mero-member    3,510     xpos-near-synonym    3
    has-mero-part      2,839     has-xpos-hyponym     169
    TOTAL              51,928

    Alignment to PWN version: 1.5
    Database: mSQL (but it is being ported to MySQL)
    Interface: Web Interface (WEI). A Perl and SOAP API is being developed.

    B.4 Italian WordNet (MultiWordNet MEANING Version)
    Version: MEANING version
    Author: ITC-irst

  • Description: New version of MultiWordNet released for the MEANING project
    Characteristics: Data and relations are listed below. The Italian WordNet contains 11 PWN relations plus a new relation called NEAREST. The NEAREST relation is an intralinguistic semantic relation connecting a synset which is a gap to its semantically nearest synset (usually a hyponym or a hypernym).

                                         Total    Nouns    Verbs   Adjectives   Adverbs
    Word senses                          66,514   48,156   9,767   6,390        2,201
    Lemmas                               45,499   34,004   4,881   5,032        1,582
    Synsets                              37,808   27,821   4,912   3,851        1,224
    Proper Nouns
    New synsets                          2,977    2,871    61      43           2
    Phrases                              1,514    868      225     361          60
    Phrasets                             1,291    692      204     339          56
    Gaps corresponding to PWN synsets    949      536      98      282          33

    Relations        Number    Relations     Number
    has-hypernym     32,043    similar-to    7,402
    has-member       783       entailment    251
    has-part         4,569     attribute     885
    has-substance    355       causes        127
    nearest          85        antonym       14
    pertain          2         also-see      1
    TOTAL            46,517

    Alignment to PWN version: 1.6
    Other info: Domain information for all synsets
    Database: MySQL, XML
    Interface: Web interface (coupled with MySQL) for browsing and editing. A Java API was developed (to be used with the MySQL database).
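As a quick consistency check on the Italian WordNet figures above, the short Python sketch below re-adds the relation counts and the per-category counts and compares them with the stated totals. The numbers are copied from the tables; the variable and function names are illustrative only. The Lemmas and Synsets rows are included because, in this table, their category counts also add up to the stated totals.

```python
# Relation counts copied from the Italian WordNet relations table.
RELATIONS = {
    "has-hypernym": 32043, "similar-to": 7402, "has-member": 783,
    "entailment": 251, "has-part": 4569, "attribute": 885,
    "has-substance": 355, "causes": 127, "nearest": 85,
    "antonym": 14, "pertain": 2, "also-see": 1,
}
STATED_RELATION_TOTAL = 46517

# Row name -> (stated total, [nouns, verbs, adjectives, adverbs]).
ROWS = {
    "Word senses": (66514, [48156, 9767, 6390, 2201]),
    "Lemmas": (45499, [34004, 4881, 5032, 1582]),
    "Synsets": (37808, [27821, 4912, 3851, 1224]),
    "New synsets": (2977, [2871, 61, 43, 2]),
    "Phrases": (1514, [868, 225, 361, 60]),
    "Phrasets": (1291, [692, 204, 339, 56]),
    "Gaps corresponding to PWN synsets": (949, [536, 98, 282, 33]),
}


def mismatches():
    """Return the names of all rows whose parts do not add up to the stated total."""
    bad = []
    if sum(RELATIONS.values()) != STATED_RELATION_TOTAL:
        bad.append("Relations TOTAL")
    for name, (total, parts) in ROWS.items():
        if sum(parts) != total:
            bad.append(name)
    return bad


print(mismatches())
```

An empty list means that every stated total agrees with the sum of its parts.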


  • C New corpora developed within the project (not available on-line)

    This appendix contains information about the new corpora which are available at the different partners' sites but not on-line. The list of the corpora available on-line is reported in Section 5.

    C.1 3LB
    Version: 1.0
    Type: Treebank
    Author: UPC
    Description: Based on the CLIC-TALP corpus (varied content: articles, news, literature...)
    Languages: Spanish
    Time-span: N/A
    Sources: Various
    Total Size: 100,000 words
    Annotation: PoS, syntax, syntactic role. 42,000 words with sense annotation.
    Data Encoding Format: XML

    C.2 Sussex sense-annotated corpus
    Version:
    Type:
    Author:
    Description:
    Languages:
    Time-span:
    Sources:
    Total Size:
    Annotation:
    Data Encoding Format:

  • References

    [Alegria et al., 2003] I. Alegria, I. Balza, N. Ezeiza, I. Fernandez, and R. Urizar. Named Entity Recognition and Classification for texts in Basque. In II Jornadas de Tratamiento y Recuperación de Información, JOTRI, Madrid, Spain, 2003.

    [Ferro et al., 2003] L. Ferro, L. Gerber, I. Mani, B. Sundheim, and G. Wilson. TIDES 2003 Standard for the Annotation of Temporal Expressions. Technical report, MITRE Corp., 2003.

    [Magnini et al., 2002] B. Magnini, M. Negri, H. Tanev, and R. Prevete. A WordNet-Based Approach to Named Entities Recognition. In Proceedings of the SemaNet'02 workshop on Building and Using Semantic Networks, Taipei, Taiwan, 2002.

    [Negri and Magnini, 2004] M. Negri and B. Magnini. Using WordNet predicates for multilingual named entity recognition. To appear in Proceedings of the 2nd International Conference of the Global WordNet Association, Brno, Czech Republic, 2004.

    [Negri and Marseglia, 2004] M. Negri and L. Marseglia. Recognition and Normalization of Time Expressions: ITC-irst at TERN 2004. Technical report, ITC-irst, 2004.
