Methods to Estimate Efficiently False Discovery Rate in Peptide Identification

Title
Methods to Estimate Efficiently False Discovery Rate in Peptide Identification
Other Titles
Methods for Efficiently Estimating the False Discovery Rate in Peptide Search (펩타이드 검색에서 거짓 발견 비율을 효율적으로 추정하기 위한 방법)
Author
이상정
Alternative Author(s)
Sangjeong Lee
Advisor(s)
박희진
Issue Date
2023. 8
Publisher
Hanyang University (한양대학교)
Degree
Doctor
Abstract
Proteomics is the study of all proteins in cells and tissues, and one of its central goals is the identification and quantitative analysis of those proteins. Shotgun proteomics is currently the most widely used approach to protein identification in large-scale studies. It is a bottom-up technique that identifies proteins in complex mixtures using LC-MS/MS, which couples liquid chromatography (LC) with tandem mass spectrometry (MS/MS). Various methods have been proposed for identifying peptides from LC-MS/MS spectra; the most common is the database search. A database search compares each experimental MS/MS spectrum against the peptides in a database and reports the peptide with the highest similarity. Because it reports a best match even when the spectrum does not match the peptide completely, it assigns a peptide-spectrum match (PSM) to most spectra in a data set. Not every PSM is therefore correct: a spectrum may be assigned the wrong peptide through a random match, so the results may contain false positives. Estimating the false discovery rate (FDR) of a result is consequently essential for assessing its reliability. Several FDR estimation methods have been proposed, of which the target-decoy strategy (TDS) is the most commonly used. TDS is a simple way to estimate the FDR: a decoy database of the same size as the target database is searched together with it. However, as the database grows, TDS incurs increasing time and space costs, and if an ideal decoy database cannot be constructed, the estimated FDR may be inaccurate.
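The target-decoy idea described above can be illustrated with a minimal sketch. It assumes one common form of the estimator, the number of decoy PSMs divided by the number of target PSMs at or above a score threshold; the abstract does not state which formula the thesis uses. The `PSM` class, the `tds_fdr` function, and the example scores are hypothetical names and values chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class PSM:
    """One peptide-spectrum match from a concatenated target+decoy search (hypothetical type)."""
    score: float     # search-engine similarity score, higher is better
    is_decoy: bool   # True if the best-matching peptide came from the decoy database


def tds_fdr(psms, threshold):
    """Estimate FDR among target PSMs scoring at or above `threshold`.

    Assumption: an incorrect match is equally likely to hit a target or a
    decoy peptide, so the decoy count approximates the number of false
    target matches.
    """
    targets = sum(1 for p in psms if p.score >= threshold and not p.is_decoy)
    decoys = sum(1 for p in psms if p.score >= threshold and p.is_decoy)
    if targets == 0:
        return 0.0  # no accepted target PSMs, nothing to estimate
    return decoys / targets


# Made-up scores for illustration.
psms = [
    PSM(9.1, False), PSM(8.7, False), PSM(8.2, True),
    PSM(7.9, False), PSM(7.5, False), PSM(6.0, True),
]
print(tds_fdr(psms, 7.0))  # 1 decoy / 4 targets = 0.25
```

The estimate depends directly on the equal-probability assumption noted in the docstring.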
We therefore propose two methods: the target-small decoy method, which improves on TDS in time and space, and cTDS, which improves on it in accuracy. Because TDS concatenates a decoy database of the same size as the target database, the database that is searched is twice as large as the original. As advances in proteomic and genomic analysis technology produce ever more data, databases will inevitably grow, and the time and space problems of TDS will become increasingly prominent. The target-small decoy strategy replaces the equal-size decoy database with one reduced by a fixed ratio. We show that it estimates the FDR almost as accurately as conventional TDS while using less time and space. In addition, TDS estimates the FDR under the assumption that an incorrectly identified spectrum is equally likely to be matched to a target or a decoy peptide. For most spectra this probability is close to 0.5, but only 1.14–4.85% of all spectra have a probability of exactly 0.5. That is, for most spectra the probabilities of an incorrect match to a target and to a decoy peptide are not exactly equal, so the FDR may be estimated inaccurately. We therefore propose cTDS, which estimates the FDR using, for each spectrum, the probability of its being incorrectly identified as a target or a decoy peptide. This probability is calculated from the numbers of target and decoy candidate peptides for the spectrum. We show that cTDS estimates the FDR more accurately than conventional TDS.
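The abstract does not give the formulas behind either method, so the following is only a sketch of how such corrections could work, not the thesis's actual estimators. It assumes that an incorrect match lands on a target or a decoy peptide in proportion to how many candidates of each kind are available. Under that assumption, a decoy database reduced to a fraction `decoy_ratio` of the target size means each decoy hit stands for `1 / decoy_ratio` false target hits, and a per-spectrum version weights each decoy hit by that spectrum's ratio of target to decoy candidates. All function, class, and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CandidatePSM:
    """A PSM with its spectrum's candidate counts (hypothetical type)."""
    is_decoy: bool   # True if the reported peptide is a decoy
    n_target: int    # number of target candidate peptides for this spectrum
    n_decoy: int     # number of decoy candidate peptides for this spectrum


def small_decoy_fdr(n_target_hits, n_decoy_hits, decoy_ratio):
    """Assumed scaling for a decoy database `decoy_ratio` times the target size."""
    if n_target_hits == 0:
        return 0.0
    # Each decoy hit is taken to represent 1/decoy_ratio false target hits.
    return (n_decoy_hits / decoy_ratio) / n_target_hits


def candidate_weighted_fdr(psms):
    """Assumed per-spectrum correction using candidate counts.

    If an incorrect match hits a decoy with probability
    n_decoy / (n_target + n_decoy), each observed decoy hit corresponds to
    n_target / n_decoy expected false target hits.
    """
    targets = sum(1 for p in psms if not p.is_decoy)
    if targets == 0:
        return 0.0
    expected_false_targets = sum(
        p.n_target / p.n_decoy for p in psms if p.is_decoy and p.n_decoy > 0
    )
    return expected_false_targets / targets


# Made-up numbers for illustration.
print(small_decoy_fdr(1000, 5, 0.5))  # (5 / 0.5) / 1000 = 0.01
example = [CandidatePSM(False, 30, 20)] * 4 + [CandidatePSM(True, 30, 20)]
print(candidate_weighted_fdr(example))  # (30 / 20) / 4 = 0.375
```

When every spectrum has equal numbers of target and decoy candidates, the weighted estimate reduces to the plain decoy-to-target ratio of conventional TDS.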
URI
http://hanyang.dcollection.net/common/orgView/200000684743
https://repository.hanyang.ac.kr/handle/20.500.11754/187257
Appears in Collections:
GRADUATE SCHOOL[S](대학원) > COMPUTER SCIENCE(컴퓨터·소프트웨어학과) > Theses (Ph.D.)
Files in This Item:
There are no files associated with this item.

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
