
Discovering rare variant elements from next generation sequencing data in genomics

Heejin Park
Issue Date: 2024-02
Graduate School, Hanyang University
Bioinformatics is a dynamic, continually evolving field that bridges biology and computational science. Its applications range from fundamental biological research to medical advances and biotechnological innovation. It enables scientists to make sense of the vast and complex biological data generated in the modern era, ultimately advancing our understanding of life and improving human health and well-being. In particular, detecting rare variants from Next-Generation Sequencing (NGS) data is a crucial task in genomics, as these variants may play a significant role in understanding the genetic basis of various diseases.
Recent advances in sequencing technology have made it possible to investigate personal genomes for structural variations, which have been studied extensively to identify their association with the physiology of diseases such as cancer. Mobile genetic elements (MGEs) are one of the major constituents of the human genome and cause genome instability through insertion, mutation, and rearrangement. We have developed a new program, iMGEins, that identifies such novel MGEs from the sequencing reads of individual genomes and explores the breakpoints together with the supporting reads and the MGEs detected. iMGEins is the first universal MGE detection tool that applies three algorithmic paradigms: discordant read-pair mapping, split-read mapping, and contig assembly. Our evaluation showed excellent performance in detecting novel MGEs from simulated genomes as well as real personal genomes. Specifically, the average recall of iMGEins was 96.25%, about twice that of the two other tools compared. The average precision of iMGEins was 99.57%, also about twice that of the other tools. When tested on real human genome data from the NA12878 sample, iMGEins found 2,040 known MGEs that are individually inserted, along with 122 inserted MGE sequences.
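As an illustration of two of the paradigms named above, the following is a minimal sketch of how a read alignment might be flagged as discordant or split. It is not the iMGEins implementation: the `Alignment` record, the function names, and the thresholds `min_clip` and `max_insert` are all hypothetical and chosen only for this example.

```python
import re
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical, simplified alignment record (not a real library type)."""
    name: str
    chrom: str
    pos: int
    cigar: str
    mate_chrom: str
    mate_pos: int
    insert_size: int


# Matches one CIGAR operation, e.g. "60M" -> ("60", "M").
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def is_split_read(aln, min_clip=20):
    """A read with a long soft/hard clip at either end may span a breakpoint."""
    ops = CIGAR_OP.findall(aln.cigar)
    if not ops:
        return False
    ends = [ops[0], ops[-1]]
    return any(op in "SH" and int(length) >= min_clip for length, op in ends)


def is_discordant(aln, max_insert=1000):
    """A pair is discordant if mates map to different chromosomes
    or farther apart than the assumed library insert size allows."""
    return aln.chrom != aln.mate_chrom or abs(aln.insert_size) > max_insert


def classify(aln):
    """Return the evidence labels that apply to one alignment."""
    labels = []
    if is_discordant(aln):
        labels.append("discordant")
    if is_split_read(aln):
        labels.append("split")
    return labels or ["concordant"]


if __name__ == "__main__":
    reads = [
        Alignment("r1", "chr1", 100, "100M", "chr1", 400, 400),
        Alignment("r2", "chr1", 100, "60M40S", "chr1", 400, 400),
        Alignment("r3", "chr1", 100, "100M", "chr5", 9000, 0),
    ]
    for r in reads:
        print(r.name, classify(r))
```

In a sketch like this, reads carrying either label would be collected around a candidate breakpoint and could then be passed to a contig assembly step, the third paradigm, which is not shown here.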
Numerous programs have been developed for finding SNPs and short indels from NGS data. However, existing programs, in which users set parameters directly and then call variants, still miss many true positives or call many false positives despite the efforts of countless researchers. The problem is particularly serious when different SNP/indel variants occur at the same location. To address it, we conducted research on finding SNPs and indels from NGS read data by applying a deep-learning method that showed better performance than the general state-of-the-art algorithm. After processing the read pileup data into a form suitable for text-based NGS data, we applied it to a deep learning model adapted from the Transformer model. This method performed similarly to other existing programs overall, but performed better in the special case where SNPs and indels are mixed.
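To make the idea of turning a read pileup into text concrete, here is a minimal sketch under stated assumptions. The abstract does not specify the encoding, so the token format `ref|bases`, the use of `-` for a deleted base, and the function names below are hypothetical and serve only as an illustration.

```python
from collections import defaultdict


def pileup_columns(reads):
    """Group read bases by reference position.

    reads: iterable of (start, aligned_seq) pairs, assumed already aligned;
    '-' in aligned_seq marks a base deleted relative to the reference.
    """
    cols = defaultdict(list)
    for start, seq in reads:
        for offset, base in enumerate(seq):
            cols[start + offset].append(base)
    return cols


def columns_to_tokens(reference, ref_start, cols):
    """Emit one text token per reference position: 'ref_base|observed_bases'.

    Observed bases are sorted so the token does not depend on read order;
    '.' marks a position with no coverage.
    """
    tokens = []
    for i, ref_base in enumerate(reference):
        bases = "".join(sorted(cols.get(ref_start + i, [])))
        tokens.append(f"{ref_base}|{bases or '.'}")
    return tokens


if __name__ == "__main__":
    reads = [(10, "ACGT"), (11, "CTT"), (12, "G-")]
    cols = pileup_columns(reads)
    print(columns_to_tokens("ACGT", 10, cols))
```

A token sequence of this kind could then be fed to a sequence model. Position 13 in the example mixes a matching base with a deletion, loosely resembling the mixed SNP/indel case the abstract highlights.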