 DNA Sequences Representation Derived from Discrete Wavelet Transformation for Text Similarity Recognition
Author(s): Phan Hieu Ho, Ngoc Anh Thi Nguyen, Trung Hung Vo
Published in: Springer SCI Book, Modern Approaches for Intelligent Information and Database Systems; No.: SCI 769; Pages: 75-85; Year: 2018
Field: Information Technology; Type: Scientific article; Category: International
Recognizing text similarity, also known as duplicate-document detection, is considered the most important solution for plagiarism detection, a problem that has risen dramatically in the recent era of digital revolution. With the aim of contributing an efficient plagiarism detection system, we investigate a new approach to text similarity mining via a DNA sequence representation derived from the Discrete Wavelet Transformation (DWT). The contribution of the paper is threefold. First, we convert the raw source materials into a unique set of floating-point number series, called DeoxyriboNucleic Acid (DNA) sequences, using DWT. Second, the test documents given as input are converted into the same DNA-based structure. Lastly, a text similarity discovery algorithm is applied to the resulting DNA sequences by computing the Euclidean distance between them. The experimental results demonstrate the advantages of the proposed method, with very high precision in detecting text similarity on the standard PAN dataset (Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection).
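The three steps in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how text is turned into numbers, which wavelet is used, or how many decomposition levels are applied. The word encoding in `text_to_series`, the choice of the Haar wavelet, the `levels=2` default, the padding rule, and the truncation to a common length in `euclidean_distance` are all assumptions made only for illustration, and every function name here is hypothetical.

```python
import math


def text_to_series(text):
    """Assumed encoding (not from the paper): one number per word,
    the mean Unicode code point of the word's characters."""
    return [sum(map(ord, word)) / len(word) for word in text.lower().split()]


def haar_dwt(series):
    """One level of the Haar DWT. Returns (approximation, detail) coefficients.
    Odd-length input is padded by repeating the last value (an assumption)."""
    if len(series) % 2:
        series = series + [series[-1]]
    scale = math.sqrt(2.0)
    pairs = list(zip(series[0::2], series[1::2]))
    approx = [(a + b) / scale for a, b in pairs]
    detail = [(a - b) / scale for a, b in pairs]
    return approx, detail


def dna_sequence(text, levels=2):
    """Floating-point series standing in for the paper's 'DNA sequence':
    the Haar approximation coefficients after `levels` decompositions."""
    series = text_to_series(text)
    for _ in range(levels):
        if len(series) < 2:
            break
        series, _detail = haar_dwt(series)
    return series


def euclidean_distance(x, y):
    """Euclidean distance over the common prefix of two series.
    Truncating to the shorter length is an assumption of this sketch."""
    n = min(len(x), len(y))
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x[:n], y[:n])))


if __name__ == "__main__":
    source = "the quick brown fox jumps over the lazy dog"
    suspect = "the quick brown fox leaps over the lazy dog"
    distance = euclidean_distance(dna_sequence(source), dna_sequence(suspect))
    print(distance)  # smaller values indicate more similar documents
```

In this sketch a distance of zero means the two series are identical over the compared prefix. A practical system would also need a decision threshold and a strategy for comparing documents of different lengths, neither of which the abstract describes.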
