Data Driven Methods for Improving Mono- and Cross-lingual IR Performance in Environments

TamPub


dc.contributor.author Järvelin, Antti -
dc.contributor.author Talvensaari, Tuomas -
dc.contributor.author Järvelin, Anni -
dc.date.accessioned 2012-06-17T20:11:26Z
dc.date.available 2012-06-15 06:24:14 -
dc.date.available 2012-06-17T20:11:26Z
dc.date.issued 2009 -
dc.identifier.isbn 978-951-44-7929-8 -
dc.identifier.uri http://tampub.uta.fi/handle/10024/65729
dc.description.abstract In cross-language information retrieval (CLIR), novel or non-standard expressions, technical terminology, and rare proper nouns can be seen as noise when they appear in queries or in the target collection, because they are often out-of-vocabulary (OOV) for the dictionaries used to translate queries. This paper presents three data-driven approaches to these problems. The first two methods, transformation rule based translation (TRT) and the classified s-gram method, operate at the string level. With them, approximate matches of a query word can be recognized in the target document collection and included in the target query. The third method, the corpus-based approach, employs comparable corpora to derive translation knowledge that can be used to translate OOV words. In addition to an overview of the methods, three case studies highlighting their practical applications in CLIR are presented. The methods are shown to be effective in OOV word translation (s-grams), in query translation without dictionaries between closely related languages (TRT and s-grams), and in query translation in a special domain (s-grams, TRT, and corpus-based methods). Keywords: cross-language information retrieval, noise, OOV words, TRT, s-grams, corpus-based methods -
dc.language.iso en -
dc.title Data Driven Methods for Improving Mono- and Cross-lingual IR Performance in Environments -
dc.type Monograph -
dc.identifier.urn urn:isbn:978-951-44-7929-8 -
dc.relation.numberinseries Research Notes 1 -
dc.type.version Publisher's version -
dc.seriesname TRIM Research Notes -
dc.subject.okm Media and communications -
dc.oldstats 353 -
