LDA topics: Representation and evaluation
Journal of Information Science
Department of Information and Communication Engineering, Yeungnam University, 214-1, Dae-dong, Gyeongsan, Gyeongsangbuk, South Africa; Kunsan National University, South Korea; Troy University, United States
In recent years many automated topic coherence formulas (using the top-m words of a topic inferred by latent Dirichlet allocation) based on word similarities have been proposed and evaluated against human ratings. We treat a wordy topic as an object and quantitatively describe it via normalized mean values of pair-wise word similarities. Two types of word similarities, thesaurus and local corpus-based, are used as the descriptive features of a topic. We perform topic classification using represented topics as input and bi-level human ratings about topic coherence as class labels. Classification results (precision, recall and accuracy) based on two datasets and three supervised classification algorithms suggest that the novel topic representation is consistent with human ratings. Corpus-based word similarities are positively correlated with human ratings whereas thesaurus-based similarities have negative relations. The proposed representation of topics opens a window for us to investigate the utilization of topics with different perspectives. © Chartered Institute of Library and Information Professionals.
Statistics; Thesauri; Latent dirichlet allocations; Supervised classification; topic evaluation; Topic Modeling; topic representation; Classification (of information)