Fast stochastic compression algorithms for biological data analysis

Duan Chen (May 8, 2020)

Please install the Flash Plugin

Abstract

Our recent work is motivated by two types of biological problems. One is inferring 3D structures of chromatins based on chromosome conformation capture (3C), such as Hi-C, which is a high-throughput sequencing technique that produces millions of contact data between genomic loci pairs. The other problem is computational deconvolution of gene expression data from heterogeneous brain samples, for extracting cell type-specific information for patients with Alzheimer's Disease (AD). Both problems involve large volumes of data, thus fast algorithms are indispensable in either direct optimization or machine learning methods. A central approach is the low-rank approximation of matrices. Conventional matrix decomposition methods such as SVD, QR, etc, are expensive, so not suitable for repeated implementation in these biological problems. Instead, we develop fast stochastic matrix compressions based on randomized numerical linear algebra (RNLA) theories. In this talk, we will emphasize on a recently developed stochastic kernel matrix compression algorithm. In this algorithm, samples are taken at no (or low) cost and the original kernel matrix is reconstructed efficiently with desired accuracy. Storage and compressing processes are only at O(N) or O(NlogN) complexity. These stochastic matrix compressing can be used to the above-mentioned biological problems to greatly improve algorithm efficiency, they can also be applied to other kernel based machine learning algorithms for scientific computing problems with non-local interactions (such as fractional differential equations), since no analytic formulation of the kernel function is required in our algorithms.