Data Driven Methods for Improving Mono- and Cross-lingual IR Performance in Environments


dc.description.abstract In cross-language information retrieval (CLIR), novel or non-standard expressions, technical terminology, and rare proper nouns can be seen as noise when they appear in queries or in the target collection, because they are often out-of-vocabulary (OOV) for the dictionaries used to translate queries. In this paper, three data driven approaches to these problems are presented. The first two methods, the transformation rule based translation (TRT) method and the classified s-gram method, operate at the string level. With them, approximate matches of a query word can be recognized in the target document collection and included in the target query. In the third method, the corpus-based approach, comparable corpora are employed to derive translation knowledge that can be used to translate OOV words. Besides an overview of the methods, three case studies highlighting their practical applications in CLIR are presented. The methods are shown to be effective in OOV word translation (s-grams), in query translation without dictionaries between closely related languages (TRT and s-grams), and in query translation in a special domain (s-grams, TRT, and corpus-based methods). Keywords: cross-language information retrieval, noise, OOV words, TRT, s-grams, corpus-based methods -
dc.title Data Driven Methods for Improving Mono- and Cross-lingual IR Performance in Environments -
