VNU-UET Repository

A hybrid approach to Vietnamese word segmentation

Tuan Phong Nguyen and Anh Cuong Le (2016) A hybrid approach to Vietnamese word segmentation. In: The 2016 IEEE RIVF International Conference on Computing and Communication Technologies, 7-9 November 2016, Hanoi, Vietnam.

Full text not available from this repository.

Official URL: http://doi.org/10.1109/RIVF.2016.7800279

Abstract

Word segmentation is the first task in Vietnamese language processing, and word-segmented text is the input to almost all other NLP tasks. The task is challenging because of specific characteristics of the language. As in many other Asian languages such as Japanese, Korean and Chinese, white spaces in Vietnamese are not always used as word separators, and a word may contain one or more syllables. In this paper, we propose an efficient hybrid approach to detecting word boundaries in Vietnamese text that combines logistic regression, used as a binary classifier, with a longest-matching algorithm. First, the longest-matching algorithm is used to catch words that contain more than two syllables in the input sentence. Next, the system uses the classifier to determine the boundaries of 2-syllable words and proper names. Then, the low-confidence predictions made by the classifier are verified against a dictionary to obtain the final result. Our system achieves an F-measure of 98.82%, which is, to the best of our knowledge, the most accurate result for Vietnamese word segmentation. The system is also fast: it segments nearly 34k tokens per second.
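The three stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy dictionary, the hand-picked weights standing in for a trained logistic regression model, the two features, the confidence margin and all function names are assumptions made purely for illustration. For simplicity the sketch joins at most two syllables in the classifier stage, so proper names longer than two syllables are not handled.

```python
import math

# Toy lexicon (assumption); a real system would load a full dictionary.
DICTIONARY = {"công nghệ thông tin", "công nghệ", "thông tin", "sinh viên"}
# Hand-picked weights (assumption) standing in for a trained model.
WEIGHTS = {"bias": -1.0, "both_capitalized": 3.0, "seen_pair": 1.2}
# Hypothetical syllable pairs "seen joined" in training data.
SEEN_PAIRS = {("sinh", "viên")}


def longest_match(syllables, min_len=3, max_len=4):
    """Stage 1: greedy left-to-right longest matching for long words.

    Only dictionary entries with at least min_len syllables are matched.
    Returns (text, locked) pairs; locked items are matched dictionary words.
    """
    out, i = [], 0
    while i < len(syllables):
        for n in range(min(max_len, len(syllables) - i), min_len - 1, -1):
            candidate = " ".join(syllables[i:i + n])
            if candidate.lower() in DICTIONARY:
                out.append((candidate, True))
                i += n
                break
        else:  # no long word starts here: keep the single syllable
            out.append((syllables[i], False))
            i += 1
    return out


def join_probability(left, right):
    """Stage 2: logistic model for P(left and right form one word)."""
    z = WEIGHTS["bias"]
    if left[0].isupper() and right[0].isupper():  # proper-name cue
        z += WEIGHTS["both_capitalized"]
    if (left.lower(), right.lower()) in SEEN_PAIRS:
        z += WEIGHTS["seen_pair"]
    return 1.0 / (1.0 + math.exp(-z))


def segment(sentence, margin=0.15):
    """Run all three stages; syllables of one word are joined with '_'."""
    items = longest_match(sentence.split())
    words, i = [], 0
    while i < len(items):
        text, locked = items[i]
        nxt = items[i + 1] if i + 1 < len(items) else None
        if not locked and nxt is not None and not nxt[1]:
            p = join_probability(text, nxt[0])
            join = p >= 0.5
            # Stage 3: a low-confidence prediction defers to the dictionary.
            if abs(p - 0.5) < margin:
                join = f"{text} {nxt[0]}".lower() in DICTIONARY
            if join:
                words.append(f"{text}_{nxt[0]}")
                i += 2
                continue
        words.append(text.replace(" ", "_"))
        i += 1
    return words


print(segment("sinh viên học công nghệ thông tin ở Hà Nội"))
```

In this sketch the 4-syllable entry is fixed by longest matching first, the capitalized pair is joined by the classifier with high confidence, and the pair "sinh viên" falls inside the low-confidence margin and is therefore confirmed by the dictionary lookup.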

Item Type: Conference or Workshop Item (Paper)
Subjects: Information Technology (IT)
Divisions: Faculty of Information Technology (FIT)
ID Code: 2401
Deposited By: Tuan-Phong Nguyen
Deposited On: 12 Jan 2017 15:55
Last Modified: 12 Jan 2017 15:55
