Published: 2014-11-26
Title: A Corpus-Based Investigation of the Lexical Characteristics of Dialogue Acts
Dialogue acts (DAs) play a key role in the interpretation of the communicative behaviour of dialogue participants and offer valuable insight into the design of human-machine dialogue systems (Bunt et al. 2010). Research has been conducted to identify a core set of DAs and to automatically label utterances according to such a set. This talk reports a corpus-based investigation of the lexical characteristics of dialogue acts. The study uses the Switchboard Corpus (Jurafsky et al. 1997), which comprises 1,155 transcribed telephone conversations, totalling 205,000 utterances or 1.4 million word tokens. The corpus is fully annotated, and each utterance is functionally labelled with a tag from a set of 60 different DA types.
The objective of the study is to examine the lexical characteristics of the utterances according to their communicative functions. To this end, a set of sub-corpora is created, each containing utterances labelled with the same DA tag. We first measure the dispersion of word unigrams across the 60 DA types: word unigrams are identified as one-category words, two-category words and so on, where words unique to one DA are defined as one-category words and words common to all the DA types as 60-category words. We then apply the measure of Chi by degrees of freedom (CBDF) to calculate the similarity of the 60 DAs. Finally, machine learning techniques are applied to extract lexical features for the automatic separation of the utterances according to their DA types. Performance will be reported in terms of precision, recall and F-score as empirical evidence in support of the correlation between lexical uses (representing either semantic content or grammatical construction or both) and communicative functions.
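The dispersion count described above can be illustrated with a minimal sketch. It assumes the corpus is available as (DA tag, utterance text) pairs and that words are obtained by lowercasing and splitting on whitespace; the function name, the tag names and the toy utterances are hypothetical and are not taken from the study or from the Switchboard annotation.

```python
from collections import defaultdict


def dispersion_by_word(tagged_utterances):
    """Map each word to the number of distinct DA tags it occurs under.

    A value of 1 marks a "one-category word"; a value equal to the total
    number of DA types marks a word common to all of them.
    """
    tags_per_word = defaultdict(set)
    for tag, utterance in tagged_utterances:
        # Assumption: naive whitespace tokenisation on lowercased text.
        for word in utterance.lower().split():
            tags_per_word[word].add(tag)
    return {word: len(tags) for word, tags in tags_per_word.items()}


# Toy data with hypothetical DA tags, for illustration only.
sample = [
    ("statement", "i think it is fine"),
    ("question", "is it fine"),
    ("backchannel", "uh-huh"),
    ("statement", "it rains"),
]

dispersion = dispersion_by_word(sample)
one_category_words = sorted(w for w, n in dispersion.items() if n == 1)
```

In this toy sample, "it" occurs under two tags and so counts as a two-category word, while "uh-huh" occurs under one tag only and is a one-category word.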
As our results will show, the dispersion of word unigrams is uneven across different DAs, and word unigrams that are significant in terms of frequency may not be significant DA differentiators. Our results will demonstrate that the dispersion index is an effective criterion for selecting word unigrams as lexical cues to identify DAs. While these results support the view that utterances are multifunctional and hence confusable to human annotators as well as automatic DA recognition systems, they nonetheless suggest that a granular approach to the DAMSL scheme and a re-grouping of the DA tags may produce better results, a suggestion that has also emerged recently from our manual inspection of some problematic cases.
Bunt, H., J. Alexandersson, J. Carletta, J.-W. Choe, A.C. Fang, K. Hasida, K. Lee, V. Petukhova, A. Popescu-Belis, L. Romary, C. Soria, and D. Traum. 2010. Towards an ISO Standard for Dialogue Act Annotation. In Proceedings of the Seventh International Conference on Language Resources and Evaluation. Valletta, Malta, 17-23 May 2010.
Jurafsky, D., E. Shriberg and D. Biasca. 1997. Switchboard SWBD-DAMSL Shallow-Discourse-Function Annotation Coders’ Manual, Draft 13. University of Colorado, Boulder Institute of Cognitive Science Technical Report 97-02.
Alex Chengyu Fang is a computational linguist working in corpus linguistics and natural language processing. He held various positions in the English Department of University College London (UCL), UK, before his appointment as Deputy Director of the Survey of English Usage, UCL. He later became a Senior Research Fellow and then Honorary Senior Research Fellow in UCL’s Department of Phonetics and Linguistics and lectured in the Computer Science Department of the same university on intelligent text processing. He is now at the Department of Chinese, Translation, and Linguistics, City University of Hong Kong, where he is Associate Professor and lectures on corpus linguistics and cognitive linguistics. He is an expert member of the International Organization for Standardization (ISO) in the area of language resources as well as an expert member of the China National Technical Committee on Terminology for Standardization (全国术语标准化技术委员会).
