%A Tuan Phong Nguyen
%A Anh Cuong Le
%T A hybrid approach to Vietnamese word segmentation
%X Word segmentation is the very first task for Vietnamese language processing. Word-segmented text is the input of almost other NLP tasks. This task faces some challenges due to specific characteristics of the language. As in many other Asian languages such as Japanese, Korean and Chinese, white spaces in Vietnamese are not always used as word separators and a word may contain one or more syllables. In this paper, we propose an efficient hybrid approach to detect word boundary for Vietnamese texts using logistic regression as a binary classifier combining with longest matching algorithm. First, longest matching algorithm is used to catch words that contain more than two syllables in input sentence. Next, the system utilizes the classifier to determine the boundary of 2-syllable words and proper names. Then, the predictions having low confidence conducted by the classifier are verified by a dictionary to get the final result. Our system can achieve an F-measure of 98.82% which is the most accurate result for Vietnamese word segmentation to the best of our knowledge. Moreover, the system also has a high speed. It can run word segmentation for nearly 34k tokens per second.
%P 114-119
%B 2016 IEEE RIVF International Conference on Computing & Communication Technologies, Research, Innovation, and Vision for the Future (RIVF)
%D 2016
%C Hanoi, Vietnam
%R doi:10.1109/RIVF.2016.7800279
%L SisLab2401