eprintid: 2401
rev_number: 6
eprint_status: archive
userid: 294
dir: disk0/00/00/24/01
datestamp: 2017-01-12 15:55:15
lastmod: 2017-01-12 15:55:15
status_changed: 2017-01-12 15:55:15
type: conference_item
metadata_visibility: show
creators_name: Nguyen, Tuan Phong
creators_name: Le, Anh Cuong
creators_id: tuanphong94@gmail.com
creators_id: cuongla@vnu.edu.vn
title: A hybrid approach to Vietnamese word segmentation
ispublished: pub
subjects: IT
divisions: fac_fit
abstract: Word segmentation is the very first task in Vietnamese language processing. Word-segmented text is the input to almost all other NLP tasks. This task faces some challenges due to specific characteristics of the language. As in many other Asian languages such as Japanese, Korean and Chinese, white spaces in Vietnamese are not always used as word separators, and a word may contain one or more syllables. In this paper, we propose an efficient hybrid approach to detecting word boundaries in Vietnamese text that combines logistic regression, used as a binary classifier, with the longest matching algorithm. First, the longest matching algorithm is used to catch words that contain more than two syllables in the input sentence. Next, the system uses the classifier to determine the boundaries of 2-syllable words and proper names. Then, the classifier's low-confidence predictions are verified against a dictionary to get the final result. Our system achieves an F-measure of 98.82%, which is, to the best of our knowledge, the most accurate result for Vietnamese word segmentation. Moreover, the system is fast: it segments nearly 34k tokens per second.
date: 2016-11-07
date_type: published
official_url: http://doi.org/10.1109/RIVF.2016.7800279
id_number: doi:10.1109/RIVF.2016.7800279
full_text_status: none
pres_type: paper
pagerange: 114-119
event_title: The 2016 IEEE RIVF International Conference on Computing and Communication Technologies
event_location: Hanoi, Vietnam
event_dates: 7-9 November 2016
event_type: conference
refereed: TRUE
book_title: 2016 IEEE RIVF International Conference on Computing & Communication Technologies, Research, Innovation, and Vision for the Future (RIVF)
related_url_url: http://ieeexplore.ieee.org/document/7800279/
related_url_type: pub
citation: Nguyen, Tuan Phong and Le, Anh Cuong (2016) A hybrid approach to Vietnamese word segmentation. In: The 2016 IEEE RIVF International Conference on Computing and Communication Technologies, 7-9 November 2016, Hanoi, Vietnam.