eprintid: 1888
rev_number: 6
eprint_status: archive
userid: 290
dir: disk0/00/00/18/88
datestamp: 2016-11-14 02:37:56
lastmod: 2016-11-14 02:37:56
status_changed: 2016-11-14 02:37:56
type: article
metadata_visibility: show
creators_name: Phan, Xuan Hieu
creators_name: Nguyen, Cam Tu
creators_name: Le, Dieu Thu
creators_name: Nguyen, Le Minh
creators_name: Horiguchi, Susumu
creators_name: Ha, Quang Thuy
creators_id: hieupx@vnu.edu.vn
creators_id: thuyhq@vnu.edu.vn
title: A Hidden Topic-Based Framework toward Building Applications with Short Web Documents
ispublished: pub
subjects: IT
subjects: isi
divisions: fac_fit
abstract: This paper introduces a hidden topic-based framework for processing short and sparse documents (e.g., search result snippets, product descriptions, book/movie summaries, and advertising messages) on the Web. The framework focuses on solving two main challenges posed by these kinds of documents: 1) data sparseness and 2) synonyms/homonyms. The former leads to the lack of shared words and contexts among documents while the latter are big linguistic obstacles in natural language processing (NLP) and information retrieval (IR). The underlying idea of the framework is that common hidden topics discovered from large external data sets (universal data sets), when included, can make short documents less sparse and more topic-oriented. Furthermore, hidden topics from universal data sets help handle unseen data better. The proposed framework can also be applied for different natural languages and data domains. We carefully evaluated the framework by carrying out two experiments for two important online applications (Web search result classification and matching/ranking for contextual advertising) with large-scale universal data sets and we achieved significant results.
date: 2011
date_type: published
official_url: http://doi.org/10.1109/TKDE.2010.27
id_number: doi:10.1109/TKDE.2010.27
contact_email: hieupx@vnu.edu.vn
full_text_status: none
publication: IEEE Transactions on Knowledge and Data Engineering
volume: 23
number: 7
pagerange: 961-976
refereed: TRUE
issn: 1041-4347
funders: Japan Society for the Promotion of Science
projects: Project No.P06366
citation: Phan, Xuan Hieu and Nguyen, Cam Tu and Le, Dieu Thu and Nguyen, Le Minh and Horiguchi, Susumu and Ha, Quang Thuy (2011) A Hidden Topic-Based Framework toward Building Applications with Short Web Documents. IEEE Transactions on Knowledge and Data Engineering, 23 (7). pp. 961-976. ISSN 1041-4347