Article

A transformer based deep learning framework for accurate single nucleotide variant correction in heterogeneous samples

X D Wang (Nanjing Geneseeq Technology Inc., Nanjing)
Shenjie Wang (Department of Respiratory Medicine, The Second Affiliated Hospital of Xi'an Jiaotong University)
Zhili Chang (Nanjing Geneseeq Technology Inc., Nanjing)
Minchao Zhao (Nanjing Geneseeq Technology Inc., Nanjing)
Xi Zhang (Nanjing Geneseeq Technology Inc., Nanjing)
Nazarov Fayzullo (AI and Digital Technologies Faculty, Samarkand State University)
Eshtemirov Bunyod (AI and Digital Technologies Faculty, Samarkand State University)
Shuotong Li (College of Economics, Shenzhen University)
Yì Wáng (Department of Respiratory Medicine, The Second Affiliated Hospital of Xi'an Jiaotong University)

Frontiers in Microbiology · journal · 2026 · en

Abstract

Profiling host genetic variation in heterogeneous host-microbiome mixtures is crucial for understanding cross-species interactions and microenvironmental dynamics. However, the variable host DNA fraction (purity) in bulk sequencing data severely compromises the performance of standard variant callers, leading to significant systematic biases in quantifying single nucleotide variants (SNVs). To address this, we developed a Transformer-based computational framework designed to model sequence context and technical artifacts in low-purity samples. The architecture employs a group-encoding mechanism to process multidimensional features, including variant allele frequency (VAF) distributions, base-level purity estimates, sequencing depth, and local genomic context (such as repeat regions and chromatin accessibility). By capturing long-range dependencies among these diverse signals, the model neutralizes purity-induced biases and recovers the true host SNV count. We evaluated the framework using simulated sequencing data across a broad purity gradient (0.2-1.0). Our approach significantly reduced quantification errors, achieving high concordance between the corrected and ground-truth SNV counts. Benchmarking the corrected counts against the raw outputs of conventional callers (Mutect, Freebayes, LoFreq, and Platypus) demonstrated substantial performance gains, particularly under ultra-low purity conditions (0.2-0.3), where traditional statistical priors typically fail to provide reliable quantification. Feature ablation and residual analyses further validated the independence of the multidimensional inputs and the unbiased, zero-centered nature of the count corrections. This deep learning pipeline provides a robust solution for the accurate quantification of host SNVs in complex biological mixtures, facilitating reliable downstream genetic analyses in highly heterogeneous microenvironments.
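To make the purity effect described in the abstract concrete, the following is a minimal toy sketch, not the authors' model. It assumes a simple dilution picture that the abstract does not specify: a heterozygous host variant has an allele fraction of 0.5 within host reads, non-host reads at the locus carry the reference allele, and a hypothetical caller reports a variant only when it sees at least `min_alt_reads` supporting reads. All function names, the threshold, and the default values are illustrative assumptions.

```python
from math import comb


def expected_vaf(purity: float, host_allele_fraction: float = 0.5) -> float:
    """Expected observed VAF when only a `purity` fraction of reads is host-derived.

    Assumption (not from the source): non-host reads carry the reference allele,
    so the observed VAF is the host allele fraction scaled by purity.
    """
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must lie in [0, 1]")
    return purity * host_allele_fraction


def prob_at_least_k_alt_reads(depth: int, vaf: float, min_alt_reads: int) -> float:
    """Binomial probability of observing at least `min_alt_reads` variant reads."""
    p_fewer = sum(
        comb(depth, i) * vaf**i * (1.0 - vaf) ** (depth - i)
        for i in range(min_alt_reads)
    )
    return 1.0 - p_fewer


def expected_detected_count(
    true_count: int, purity: float, depth: int, min_alt_reads: int = 3
) -> float:
    """Expected number of SNVs a threshold-based caller reports (toy model)."""
    vaf = expected_vaf(purity)
    return true_count * prob_at_least_k_alt_reads(depth, vaf, min_alt_reads)


if __name__ == "__main__":
    # Sweep the purity range mentioned in the abstract (0.2 to 1.0).
    for purity in (0.2, 0.3, 0.5, 0.8, 1.0):
        detected = expected_detected_count(true_count=1000, purity=purity, depth=30)
        print(f"purity={purity:.1f}  expected detected SNVs={detected:.1f}")
```

Under these assumptions the expected raw count falls increasingly short of the true count as purity drops, which is the kind of systematic, purity-dependent bias a correction model would need to learn to undo.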

Topics

Genomics and Phylogenetic Studies; Genetic Associations and Epidemiology; Cancer Genomics and Diagnostics

Identifiers

DOI: 10.3389/fmicb.2026.1838029

Citations and references

0 citations · 19 references
