Identification of isotope clusters from mass spectra using neural network model

Title
Identification of isotope clusters from mass spectra using neural network model
Other Titles
Identification of isotope clusters in mass spectra using an artificial neural network (인공신경망을 이용한 질량 스펙트럼의 동위원소 클러스터 식별)
Author
권대욱
Alternative Author(s)
권대욱
Advisor(s)
Eunok Paek
Issue Date
2020-02
Publisher
한양대학교
Degree
Master
Abstract
Mass spectrometry-based proteomics plays an important role in identifying peptides. Peptide identification depends strongly on the precursor mass estimated from mass spectrometry; however, precise precursor masses are difficult to estimate because mass spectra are too noisy to yield correct isotope clusters reliably. Conventional tools such as RAPID and MS-Deconv mitigate this problem by applying heuristic functions to recognize correct isotope clusters, so that more precise precursor masses can be estimated. However, because these heuristic functions are based on similarity to theoretical isotope clusters, they are limited in modelling the patterns of experimental isotope clusters. Here, we propose a machine learning approach to identifying correct isotope clusters, in the hope that it can better characterize experimental isotope clusters. Furthermore, we extend this concept to predict monoisotopic masses in addition to recognizing isotope clusters by developing a new software tool called MaSIC, which stands for MAss Spectrum Isotopic Cluster. We designed an artificial neural network model to learn the characteristics of isotope clusters. The model takes as input the monoisotopic mass and the intensities of the first through twelfth peaks in a cluster, and predicts whether the given cluster is an isotope cluster. To train the model, we collected 3,749,487 peptide-spectrum matches (PSMs) from a previous study. Predicted isotope clusters (PICs) corresponding to each PSM were generated with both RAPID and MS-Deconv, yielding ~1.73 M PICs after de-duplication. We then generated 0.75 M negative isotope clusters (NICs) consisting of subsequences of the 1.73 M PICs. Four-fifths of the PICs and NICs were used for training and the remainder for testing. We applied 5-fold cross-validation to guard against overfitting; the average accuracy was 99.98%. We also tested the model on PICs and NICs derived from different experimental methods, obtaining a sensitivity of 99.95% and a specificity of 99.85%. DL4J, a Java library for machine learning, was used to make the trained model available on the Java platform. Given mass spectra in mzXML format as input, MaSIC can predict all possible isotope clusters. Using MaSIC together with heuristic software in a complementary way can increase prediction performance.
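The input layout and the two reported metrics can be illustrated with a minimal Python sketch. It is not the thesis code: the function names are hypothetical, and the zero-padding of short clusters, truncation of long ones, and scaling to the largest peak are assumptions made only for illustration. The abstract itself states only that the input is a monoisotopic mass plus the intensities of the first through twelfth peaks, and that sensitivity and specificity were measured.

```python
from typing import List, Sequence, Tuple

NUM_PEAKS = 12  # the abstract states: intensities of the first through twelfth peaks


def cluster_features(monoisotopic_mass: float,
                     intensities: Sequence[float]) -> List[float]:
    """Build a 13-value input: monoisotopic mass followed by 12 intensities.

    Assumptions (not from the thesis): clusters longer than 12 peaks are
    truncated, shorter ones are zero-padded, and intensities are scaled so
    that the largest retained peak equals 1.0.
    """
    peaks = list(intensities[:NUM_PEAKS])
    if not peaks or max(peaks) <= 0:
        raise ValueError("a cluster needs at least one positive intensity")
    top = max(peaks)
    scaled = [p / top for p in peaks]
    scaled += [0.0] * (NUM_PEAKS - len(scaled))  # pad to a fixed length
    return [monoisotopic_mass] + scaled


def sensitivity_specificity(labels: Sequence[int],
                            predictions: Sequence[int]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels (1 = isotope cluster)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in the labels")
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    # A made-up three-peak cluster, purely for illustration.
    print(cluster_features(1000.5, [50.0, 100.0, 25.0]))
    print(sensitivity_specificity([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

Sensitivity here is the fraction of true isotope clusters (PICs) the classifier accepts, and specificity is the fraction of negative clusters (NICs) it rejects, which matches the usual definitions of the two figures quoted in the abstract.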
URI
https://repository.hanyang.ac.kr/handle/20.500.11754/123826
http://hanyang.dcollection.net/common/orgView/200000436907
Appears in Collections:
GRADUATE SCHOOL[S](대학원) > COMPUTER SCIENCE(컴퓨터·소프트웨어학과) > Theses (Master)
Files in This Item:
There are no files associated with this item.