Evaluating lexical resources for a semantic tagger

Scott S. L. Piao1, Paul Rayson2, Dawn Archer1, Tony McEnery1

1Department of Linguistics and MEL

2Computing Department

Lancaster University

Lancaster LA1 4YT

United Kingdom

{s.piao; paul; d.archer; amcenery}@…

Abstract

Semantic lexical resources play an important part in both linguistic study and natural language engineering. At Lancaster, a large semantic lexical resource has been built over the past 14 years, which provides a knowledge base for the USAS semantic tagger. Capturing semantic lexicological theory and empirical lexical usage information extracted from corpora, the Lancaster semantic lexicon provides a valuable resource for the corpus research and NLP community. In this paper, we evaluate the lexical coverage of the semantic lexicon in terms of both genres and time periods. We conducted the evaluation on test corpora including the BNC sampler, the METER Corpus of law/court journalism reports and several corpora of Newsbooks, prose and fictional works published between the 17th and 19th centuries. In the evaluation, the semantic lexicon achieved a lexical coverage of 98.49% on the BNC sampler, 95.38% on the METER Corpus and 92.76% to 97.29% on the historical data. Our evaluation reveals that the Lancaster semantic lexicon has a remarkably high lexical coverage of the modern English lexicon, but needs expansion with domain-specific terms and historical words. Our evaluation also shows that, in order to make claims about the lexical coverage of annotation systems, as well as to render them ‘future proof’, we need to evaluate their potential both synchronically and diachronically across genres.

1. Introduction

Lexical resources play an important part in both linguistic study and natural language engineering. Over the past decade in particular, large semantic lexicons such as WordNet (Fellbaum, 1998), EuroWordNet (Vossen, 1998) and HowNet have been built and applied to various tasks.

During the same period, another large semantic lexical resource has been built at Lancaster University, as a knowledge base for an English semantic tagger named USAS (Rayson and Wilson 1996; Piao et al. 2003). Employing a semantic annotation scheme, this lexicon links English lexemes and multiword expressions to their potential semantic categories, which are disambiguated according to their context in actual discourse.

In this paper, we present our evaluation of the lexical coverage of the semantic lexicon of the Lancaster semantic tagger. During the evaluation, we examined the system’s lexical coverage on both modern general English and a narrow-domain English corpus. We also investigated how time period affects the lexical coverage of our semantic lexicon. As this paper will show, our evaluation suggests that lexical resources are best evaluated over multiple genres and various time periods, using a large representative corpus or several domain-specific corpora.
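As a rough illustration of what a lexical coverage figure measures, the sketch below computes the percentage of corpus tokens that have an entry in a lexicon. This token-based definition, the function name `lexical_coverage` and the toy data are all assumptions made for illustration; the evaluation procedure actually used in this paper may differ in detail.

```python
from typing import Iterable, Set


def lexical_coverage(tokens: Iterable[str], lexicon: Set[str]) -> float:
    """Percentage of tokens found in the lexicon (assumed, token-based definition)."""
    lowered = [t.lower() for t in tokens]
    if not lowered:
        return 0.0
    known = sum(1 for t in lowered if t in lexicon)
    return 100.0 * known / len(lowered)


# Toy data, invented for illustration only.
toy_lexicon = {"the", "judge", "hearing"}
toy_tokens = ["The", "judge", "adjourned", "the", "hearing"]
print(lexical_coverage(toy_tokens, toy_lexicon))  # 4 of 5 tokens are known
```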

2. Lancaster Semantic Lexicon

As mentioned earlier, the Lancaster semantic lexicon has been developed as a semantic lexical knowledge database for a semantic tagger. It consists of two main parts: a single word sub-lexicon and a multi-word expression (MWE) sub-lexicon. Currently it contains over 42,300 single word entries and over 18,400 multi-word expression entries. In the single word sub-lexicon, each entry maps a word, together with its POS category1, to its potential semantic categories. For example, the word “iron” is mapped to the category of {object/substance and material} when it is used as a noun, and to the category of {cleaning and personal care} when it is used as a verb. When provided with context, these candidate categories can be disambiguated.
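The single word entries described above can be pictured as a lookup keyed by word and POS tag. The following is a minimal sketch under stated assumptions: the dictionary layout, the function name `candidate_categories` and the choice of the C7 tags NN1 (singular common noun) and VV0 (base-form lexical verb) are illustrative, and the category labels are the descriptive names used in the text rather than the lexicon's actual tag codes.

```python
from typing import Dict, List, Tuple

# Hypothetical miniature sub-lexicon: (word, POS tag) -> candidate semantic categories.
SINGLE_WORD_LEXICON: Dict[Tuple[str, str], List[str]] = {
    ("iron", "NN1"): ["object/substance and material"],
    ("iron", "VV0"): ["cleaning and personal care"],
}


def candidate_categories(word: str, pos: str) -> List[str]:
    """Return the candidate categories for a POS-tagged word, or [] if it has no entry."""
    return SINGLE_WORD_LEXICON.get((word.lower(), pos), [])


print(candidate_categories("iron", "NN1"))
print(candidate_categories("iron", "VV0"))
```

Keying on the POS tag as well as the word form means that part of the ambiguity is already resolved before any contextual disambiguation is attempted.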

The entries in the MWE lexicon have a structure similar to that of their single word counterparts, but the key words are replaced by MWEs. Here, the constituent words of each MWE are considered as a single semantic entity, and are thus mapped to one or more semantic categories together. For example, the MWE “life expectancy” is mapped to the categories of {time/age} and {expect}.

In addition, to cover MWEs of similar structure with a single entry, many MWEs are transcribed as templates using a simplified form of regular expression. For example, the template {*ing_NN1 machine*_NN*} represents a set of MWEs including “washing machine/s”, “vending machine/s”, etc. As a result, the MWE lexicon covers many more MWEs than the number of individual entries. Furthermore, the templates also capture discontinuous MWEs.
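To make the template idea concrete, the sketch below matches the template from the example against a POS-tagged token sequence. It assumes that each template element has the form word-pattern_POS-pattern and that `*` stands for any run of characters; the function names and the tagged sentence are hypothetical, and this is not the tagger's own matching code. It handles contiguous matches only, so the discontinuous MWEs mentioned above are outside its scope.

```python
import re
from typing import List, Tuple

TaggedToken = Tuple[str, str]  # (word, POS tag)


def wildcard_to_regex(pattern: str) -> str:
    """Turn a pattern in which '*' means 'any characters' into a regex string."""
    return ".*".join(re.escape(part) for part in pattern.split("*"))


def parse_template(template: str) -> List[Tuple[str, str]]:
    """Split e.g. '*ing_NN1 machine*_NN*' into (word regex, POS regex) pairs."""
    pairs = []
    for element in template.split():
        word_pat, pos_pat = element.rsplit("_", 1)
        pairs.append((wildcard_to_regex(word_pat), wildcard_to_regex(pos_pat)))
    return pairs


def find_matches(template: str, tokens: List[TaggedToken]) -> List[List[TaggedToken]]:
    """Return every contiguous token span that matches the template."""
    pairs = parse_template(template)
    if not pairs:
        return []
    n = len(pairs)
    matches = []
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        if all(
            re.fullmatch(w_re, word, re.IGNORECASE) and re.fullmatch(p_re, pos)
            for (w_re, p_re), (word, pos) in zip(pairs, window)
        ):
            matches.append(window)
    return matches


# Hypothetical tagged sentence; the POS tags are illustrative.
sentence = [("the", "AT"), ("washing", "NN1"), ("machines", "NN2"), ("broke", "VVD")]
print(find_matches("*ing_NN1 machine*_NN*", sentence))
```

Because the wildcard applies to both the word form and the POS tag, the one template accepts singular and plural variants alike, which is how a single entry can stand for many surface MWEs.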

The Lancaster semantic taxonomy was initially based on Tom McArthur's Longman Lexicon of Contemporary English (McArthur, 1981), but has undergone a series of expansions and improvements. Currently it contains 21 major discourse fields that expand, in turn, into 232 categories (for further details

1 In the Lancaster semantic lexicon, the C7 POS tagset is used to encode POS information.
