Deep probability model for NGS data

Single-cell RNA sequencing is used to analyze gene expression in individual cells, thereby adding to existing knowledge of biological phenomena. Accordingly, this technology is widely used in biomedical studies. Variational autoencoders have recently been adopted for the analysis of single-cell data owing to their capacity to handle large-scale data, and many variants have been applied with strong results. However, because the variational autoencoder is nonlinear, it does not provide parameters that can be used to explain the underlying biological patterns. In this paper, we propose an interpretable nonnegative matrix factorization method that decomposes parameters into those shared across cells and those that are cell-specific. Nonlinear dimension reduction is achieved by applying a variational autoencoder to the cell-specific parameters. In addition to nonlinear dimension reduction, our model can estimate cell-type-specific gene expression. To improve the estimation accuracy, we introduce log-regularization, which reflects a property of single-cell data. Overall, our approach displayed excellent performance in a simulation study and in real data analyses, while maintaining good biological interpretability.

Structural Tensor Analysis and Decomposition

We develop a general framework for analyzing the distribution of genomic changes, such as single nucleotide variants (SNVs), and relating them to sample-level and variant-level covariates. The model incorporates sample-level covariates (e.g., sex, age, and disease status) and variant-level covariates (e.g., transcription strand and nucleosomal position) to extract mutational signatures while also performing inference on signature prevalence across samples.

A prediction model for healthcare time-series data

Healthcare outcomes such as blood pressure and heart rate are now commonly tracked over time, owing to technological advances in wearable devices. This makes it possible to predict health risks and to practice personalized medicine. For this type of healthcare data, it is important to account for the large variation among subjects, where each subject is an experimental unit. We extend the deep mixed effect model to a mixture of deep mixed effect models. Each mixed effect model is based on a Gaussian process whose mean function is a deep neural network, which captures flexible time trends. Our model finds a highly nonlinear trend shared among segments of patients while clustering patients with similar trends into groups.

Deconvolution of mixed signals with NMF in the presence of many zeros

Latent factor models for count data are widely used to deconvolute mixed signals in biological data, as exemplified by sequencing data from transcriptome or microbiome studies. When pure samples such as single-cell transcriptome data are available, the accuracy of the estimates can be much improved. However, this advantage quickly disappears in the presence of excessive zeros. To correctly account for this phenomenon in both mixed and pure samples, we propose a zero-inflated non-negative matrix factorization and derive an effective multiplicative parameter updating rule. In simulation studies, our method yielded the smallest bias. We applied our approach to brain gene expression and fecal microbiome datasets, illustrating the superior performance of the approach. Our method is implemented as a publicly available R package, iNMF.
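To give a feel for what a multiplicative updating rule looks like, the sketch below implements the standard (not zero-inflated) Lee-Seung updates for NMF under a Poisson / generalized Kullback-Leibler loss. It is only an illustration of the general technique and is not the zero-inflated rule derived for iNMF; the function names nmf_kl and kl_divergence and all parameter choices are assumptions made for this example.

```python
import numpy as np

def nmf_kl(V, rank, n_iter=200, eps=1e-10, seed=0):
    """Plain NMF, V ~ W @ H, via Lee-Seung multiplicative updates (KL loss).

    Illustration only: this is the standard rule, NOT the zero-inflated
    update described in the text.
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + eps   # nonnegative starting values
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        # H <- H * (W^T (V / WH)) / (column sums of W)
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
        WH = W @ H + eps
        # W <- W * ((V / WH) H^T) / (row sums of H)
        W *= ((V / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H

def kl_divergence(V, WH, eps=1e-10):
    """Generalized KL divergence D(V || WH); zero counts contribute only WH."""
    ratio = np.where(V > 0, V / (WH + eps), 1.0)
    return float(np.sum(V * np.log(ratio) - V + WH))

# Toy count matrix with many zeros (simulated, hypothetical data).
rng = np.random.default_rng(1)
V = rng.poisson(rng.random((30, 3)) @ rng.random((3, 20)) * 2).astype(float)
W, H = nmf_kl(V, rank=3)
print("fraction of zeros:", np.mean(V == 0))
print("KL divergence after fitting:", kl_divergence(V, W @ H))
```

Because every update multiplies the current value by a nonnegative ratio, W and H stay nonnegative without any explicit projection step, which is the main appeal of this family of rules.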

Graphical Models

We characterize a complex dependence structure among correlated biological variables using data integration. With the emergence of large collections of diverse biological datasets, a remaining challenge is how to integrate these rich collections in order to better reflect the biological process under study. To accomplish this aim, Dr. Chun and her collaborators developed a conditional Gaussian graphical model (CGGM) in which extra information is incorporated as additional predictors with a flexible reproducing kernel Hilbert space estimator. The research established a framework of data integration to systematically model complex biological networks of gene-gene and gene-genome regulations. The framework was then expanded in several directions. For example, Dr. Chun incorporated the feature that enables joint estimation of multiple networks and relaxed a distributional assumption to broaden the applicability.
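As a rough illustration of the idea of conditioning a graphical model on extra predictors, the sketch below adjusts a set of variables for covariates and then reads partial correlations off the inverse covariance of the residuals. It makes two simplifying assumptions that depart from the method described above: the adjustment is an ordinary linear regression rather than a reproducing kernel Hilbert space estimator, and no sparsity penalty is applied. The function name and the simulated data are hypothetical.

```python
import numpy as np

def conditional_partial_correlations(Y, X):
    """Partial correlations among columns of Y after linearly adjusting for X.

    Simplified stand-in for a conditional Gaussian graphical model:
    linear adjustment (assumption), no RKHS estimator, no sparsity penalty.
    """
    n = Y.shape[0]
    X1 = np.column_stack([np.ones(n), X])            # add an intercept
    beta, *_ = np.linalg.lstsq(X1, Y, rcond=None)     # regress Y on X
    R = Y - X1 @ beta                                 # residuals
    S = R.T @ R / (n - X1.shape[1])                   # residual covariance
    Omega = np.linalg.inv(S)                          # precision matrix
    d = np.sqrt(np.diag(Omega))
    P = -Omega / np.outer(d, d)                       # partial correlations
    np.fill_diagonal(P, 1.0)
    return P

# Simulated example: a shared covariate x drives both Y0 and Y1.
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
Y = np.column_stack([2 * x + rng.normal(size=n),
                     2 * x + rng.normal(size=n),
                     rng.normal(size=n)])
print("marginal correlation Y0, Y1:", np.corrcoef(Y.T)[0, 1])
print("partial correlation given x:", conditional_partial_correlations(Y, x)[0, 1])
```

In this toy setting the two variables look strongly correlated marginally, but the dependence largely vanishes once the shared covariate is accounted for, which is the kind of spurious edge that conditioning is meant to remove.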

Sparse partial least squares regression

Dr. Chun developed sparse partial least squares (SPLS) regression, which performs dimension reduction and variable selection simultaneously. This approach produces a sparse solution (a model with a small number of variables) that gives excellent predictive power and contains only relevant variables. Dr. Chun applied this approach to important genetics/genomics studies, such as expression quantitative trait loci (eQTL) analyses and genome-wide association studies (GWAS), yielding superior results over existing methods.
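The way sparsity and dimension reduction can be combined is easiest to see for a single direction vector with a univariate response. The sketch below soft-thresholds the normalized covariance vector between the predictors and the response, so that predictors weakly related to the response receive a weight of exactly zero. It is a simplified illustration under stated assumptions, not the full SPLS algorithm: it computes only the first direction, and the function name sparse_direction, the tuning value eta, and the simulated data are hypothetical.

```python
import numpy as np

def sparse_direction(X, y, eta=0.5):
    """First sparse direction vector via soft-thresholding of X^T y.

    Simplified sketch (assumption): univariate response, one direction only.
    eta in [0, 1) sets the threshold as a fraction of the largest |entry|.
    """
    Xc = X - X.mean(axis=0)                 # center predictors
    yc = y - y.mean()                       # center response
    z = Xc.T @ yc
    z = z / np.linalg.norm(z)               # normalized covariance vector
    thresh = eta * np.max(np.abs(z))
    w = np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)  # soft-threshold
    norm = np.linalg.norm(w)
    return w / norm if norm > 0 else w

# Simulated example: only the first two of 50 predictors matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = 3 * X[:, 0] - 3 * X[:, 1] + rng.normal(size=200)
w = sparse_direction(X, y, eta=0.5)
print("selected predictors:", np.flatnonzero(w))
t = (X - X.mean(axis=0)) @ w                # latent component (score)
```

The latent component t is a linear combination of the selected predictors only, so variable selection and dimension reduction happen in the same step; larger values of eta give sparser directions.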