Modeling Crosstalk in High-throughput DNA Sequencing Data
Base-calling is the signal processing step that identifies DNA sequences from the raw signals measured by a high-throughput DNA sequencing machine. Its accuracy is crucial for high-throughput DNA sequencing and for downstream analysis. We therefore set out to reduce sequencing errors on Illumina systems by correcting three kinds of crosstalk in the cluster intensity data: color crosstalk caused by the overlap of dye emission spectra, phasing/pre-phasing caused by out-of-step nucleotide synthesis, and spatial crosstalk between adjacent clusters. We discovered that spatial crosstalk accounts for a large portion of sequencing errors on Illumina systems, even after color crosstalk and phasing/pre-phasing have been corrected. Importantly, the spatial crosstalk between adjacent clusters is cluster-specific and often asymmetric, so existing deconvolution methods cannot correct it. We introduce a novel mathematical method that estimates and removes spatial crosstalk, reducing base-calling errors by 44-69%. Furthermore, the resolution gained from this study opens room for higher throughput in DNA sequencing and, more generally, in measurement systems that use fluorescence-based imaging technology. This is joint work with Bo Wang, Anqi Wang and Lei M Li.