Using language representation learning approach to efficiently identify protein complex categories in electron transport chain
Authors: Trinh-Trung-Duong Nguyen, Yu-Yen Ou, Nguyen-Quoc-Khanh Le, Dinh-Van Phan, Quang-Thai Ho
Published in: Molecular Informatics
Issue: 39(10)
Pages: 2000033
Year: 2020
Field: Science
Type: Scientific article
Category: International
ABSTRACT
We herein propose a novel approach based on language representation learning to categorize electron transport chain complex proteins into 5 types. The idea stems from the shared characteristics of human language and the "language" of protein sequences, so advanced natural language processing techniques were used to extract useful features. Specifically, we employed transfer learning and word embedding techniques to analyze electron complex sequences and create efficient feature sets, then used a support vector machine algorithm to classify them. During the 5-fold cross-validation process, seven types of sequence-based features were analyzed to find the optimal features. On average, our final classification models achieved accuracy, specificity, sensitivity, and MCC of 96%, 96.1%, 95.3%, and 0.86, respectively, on cross-validation data. On the independent test data, the corresponding performance scores were 95.3%, 92.6%, 94%, and 0.87. We conclude that, using features extracted by these representation learning methods, the prediction performance of a simple machine learning algorithm is on par with the existing deep neural network method on the task of categorizing electron complexes, while enjoying a much faster feature-generation process. Furthermore, the results also showed that combining features learned from the representation learning methods with sequence motif counts yields better performance.
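The pipeline the abstract describes (sequence-derived features fed to an SVM, scored with 5-fold cross-validation) can be sketched as follows. This is a minimal illustration, not the paper's method: the learned word embeddings are replaced by simple normalized k-mer counts, and the sequences and class labels are synthetic.

```python
import random
from itertools import product

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
K = 2  # k-mer ("word") size; a stand-in for the paper's richer NLP features
KMER_INDEX = {"".join(p): i for i, p in enumerate(product(AMINO_ACIDS, repeat=K))}

def kmer_features(seq: str) -> np.ndarray:
    """Normalized k-mer count vector for one protein sequence."""
    vec = np.zeros(len(KMER_INDEX))
    for i in range(len(seq) - K + 1):
        vec[KMER_INDEX[seq[i : i + K]]] += 1
    total = vec.sum()
    return vec / total if total else vec

# Toy data: two synthetic "complex types" with different residue biases.
rng = random.Random(0)

def make_seq(bias: str) -> str:
    pool = bias * 5 + AMINO_ACIDS  # over-represent the biased residues
    return "".join(rng.choice(pool) for _ in range(120))

X = np.array(
    [kmer_features(make_seq("AL")) for _ in range(40)]
    + [kmer_features(make_seq("KR")) for _ in range(40)]
)
y = np.array([0] * 40 + [1] * 40)

# Linear-kernel SVM scored with 5-fold cross-validation, as in the abstract.
scores = cross_val_score(SVC(kernel="linear"), X, y, cv=5)
print(round(scores.mean(), 3))
```

In the paper itself the feature vectors come from transfer learning and word embeddings rather than raw k-mer counts, and the task is 5-class rather than binary, but the surrounding machinery (feature vector per sequence, SVM classifier, 5-fold cross-validation of accuracy/specificity/sensitivity/MCC) follows this shape.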
© Đại học Đà Nẵng (University of Danang)
Address: 41 Lê Duẩn, Đà Nẵng
Phone: (84) 0236 3822 041; Email: dhdn@ac.udn.vn