Feature analysis for duplicate detection in programming QA communities

Zhang, W.E.; Sheng, Q.Z.; Shu, Y.; Nguyen, V.K.

Please use this identifier to cite or link to this item: https://hdl.handle.net/2440/121447

Type:	Conference paper
Title:	Feature analysis for duplicate detection in programming QA communities
Author:	Zhang, W.E.; Sheng, Q.Z.; Shu, Y.; Nguyen, V.K.
Citation:	Lecture Notes in Artificial Intelligence, 2017 / Cong, G., Peng, W.C., Zhang, W.E., Li, C., Sun, A. (ed./s), vol.10604, pp.623-638
Publisher:	SPRINGER INTERNATIONAL PUBLISHING AG
Publisher Place:	Switzerland
Issue Date:	2017
Series/Report no.:	Lecture Notes in Artificial Intelligence
ISBN:	9783319691787
ISSN:	0302-9743; 1611-3349
Conference Name:	International Conference on Advanced Data Mining and Applications (ADMA) (5 Nov 2017 - 6 Nov 2017 : Singapore, SINGAPORE)
Editor:	Cong, G.; Peng, W.C.; Zhang, W.E.; Li, C.; Sun, A.
Statement of Responsibility:	Wei Emma Zhang, Quan Z. Sheng, Yanjun Shu, and Vanh Khuyen Nguyen
Abstract:	In community question answering (CQA), duplicate questions are questions that were previously created and answered but occur again. These questions add noise to CQA websites, which impedes users from finding answers efficiently. Programming CQA (PCQA), a branch of CQA that hosts questions related to programming, also suffers from this problem. Existing work on duplicate detection in PCQA websites framed the task as supervised learning on question pairs and relied on a number of features extracted from those pairs. However, these approaches extracted only textual features and did not consider the source code in the questions, which is linguistically very different from natural language. Our work focuses on developing novel features for PCQA duplicate detection. We leverage continuous word vectors from the deep learning literature, probabilistic models from information retrieval, and association pairs mined from duplicate questions using machine translation. We provide extensive empirical analysis of the performance of these features and their various combinations using a range of learning models. Our work could be helpful for both research and practical applications that require extracting features from texts that are not entirely natural language.
Keywords:	Feature analysis; question answering; duplicate detection
Rights:	© Springer International Publishing AG 2017
DOI:	10.1007/978-3-319-69179-4_44
Published version:	http://dx.doi.org/10.1007/978-3-319-69179-4_44
Appears in Collections:	Aurora harvest 4 Computer Science publications

Adelaide Research & Scholarship