Eukaryotic chromatin has a complex higher-order structure: DNA winds around histones to form a beads-on-a-string arrangement, which then folds and compacts further. For a gene to be transcribed, the corresponding chromatin must open to create an accessible region where transcriptional regulatory factors can bind. Open chromatin regions are therefore the window through which the genome encodes life.
Single-cell ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing) uses Tn5 DNA transposase to insert sequencing adapters into open chromatin, which is then labeled and sequenced at the single-cell level. The result is a high-resolution map of open chromatin in individual cells that helps reveal the regulatory mechanisms behind cell heterogeneity. More and more researchers are applying single-cell ATAC-seq to generate large amounts of sequencing data in the fields of tumors, immunity, and development. However, effective methods for analyzing and mining the valuable biological information in these massive datasets are still lacking. The difficulty lies in the data itself. First, a cell contains hundreds of thousands of candidate open chromatin sites, which leads to the curse of dimensionality. Second, for biological reasons, many potentially open sites show no signal, so the data is extremely sparse, and the data loss caused by technical limitations greatly aggravates this sparsity. In particular, each open region generally has only two copies in a diploid genome, which makes the data almost binary. These problems pose huge challenges for the analysis of single-cell ATAC-seq data.
Our open chromatin regions analysis method uses deep learning, combining a variational autoencoder with a Gaussian mixture model to extract latent features from single-cell ATAC-seq data. It projects the complex, sparse, high-dimensional chromatin accessibility space onto a simple, abstract, low-dimensional feature space.
This processing not only discovers and analyzes cell-specific chromatin accessibility patterns, but also fills in missing values caused by technical limitations by sharing information across similar cells. In this way it addresses the high dimensionality, sparsity, and near-binarization of single-cell ATAC-seq data.
Our model provides visualization, clustering, and data enhancement, supports the mining of downstream biological information, and gives researchers a powerful tool for decoding single-cell epigenetics.
SCALE model
Our system uses the SCALE model, which combines a variational autoencoder (VAE) and a Gaussian Mixture Model (GMM) to model the distribution of high-dimensional, sparse scATAC-seq data (Figure 1). The VAE is an unsupervised generative model that can be used to extract features from data; the GMM is a weighted combination of several Gaussian distributions. SCALE combines the two to fit the multi-modal distribution of single-cell ATAC-seq data. Specifically, SCALE consists of an encoder and a decoder in the VAE framework. The encoder is a four-layer neural network (3200–1600–800–400), and the decoder is a single-layer network in which the 10-dimensional latent variables (features) are directly connected to the output. The latent variables lie on the GMM manifold parameterized by μc and σc.
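To make the role of the GMM more concrete, the following is a minimal sketch of how a latent vector could be softly assigned to mixture components. It is not the SCALE implementation: the function names, the mixture weights, and the toy two-dimensional latent space are illustrative assumptions, and σc is treated here as a per-dimension standard deviation.

```python
import math


def log_gaussian_diag(z, mu, sigma):
    """Log density of z under a Gaussian with diagonal covariance.

    Assumption: sigma holds per-dimension standard deviations.
    """
    return sum(
        -0.5 * math.log(2.0 * math.pi) - math.log(s) - 0.5 * ((x - m) / s) ** 2
        for x, m, s in zip(z, mu, sigma)
    )


def cluster_responsibilities(z, weights, mus, sigmas):
    """Posterior probability p(c | z) of each mixture component c."""
    log_joint = [
        math.log(w) + log_gaussian_diag(z, mu, sigma)
        for w, mu, sigma in zip(weights, mus, sigmas)
    ]
    # Log-sum-exp keeps the normalization numerically stable.
    top = max(log_joint)
    log_norm = top + math.log(sum(math.exp(v - top) for v in log_joint))
    return [math.exp(v - log_norm) for v in log_joint]


# Hypothetical toy mixture: two components in a 2-dimensional latent space
# (the text above describes 10 latent dimensions).
weights = [0.5, 0.5]
mus = [[0.0, 0.0], [4.0, 4.0]]
sigmas = [[1.0, 1.0], [1.0, 1.0]]

resp = cluster_responsibilities([0.2, -0.1], weights, mus, sigmas)
assigned = max(range(len(resp)), key=resp.__getitem__)
```

Here `resp` holds one probability per component and `assigned` is the index of the most likely component, which is one simple way a cluster label could be read off the latent features.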
Figure 1 Overview of the SCALE framework
Feature embedding and clustering
Compared with other tools such as PCA, scVI, cisTopic, and Cicero, the SCALE model we use clusters cells better, separating different cell types more clearly and accurately. We also use t-SNE to visualize the features extracted by these tools alongside the original data (Figure 2).
Figure 2 Feature embedding and clustering
Data denoising and imputation efficiency on simulation and real datasets
SCALE can accurately estimate the actual distribution of scATAC-seq data, which usually contains noise and a large number of missing values. The values estimated by SCALE can be used to remove noise and restore lost data. We compare SCALE with the scRNA-seq imputation methods scImpute, SAVER, MAGIC, and scVI; the results are shown in Figure 3. On all scATAC-seq datasets, SCALE outperforms all of the scRNA-seq imputation methods. It achieves the highest correlation between single cells and their corresponding meta-cells (Figure 3a), indicating that it obtains a better estimate of the actual scATAC-seq data distribution. Compared with the original data, SCALE's embedding of the estimated data shows better-defined clusters, each of which corresponds well to a biologically defined subtype (Figure 3b).
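The single-cell versus meta-cell correlation can be illustrated with a short sketch. It assumes that a meta-cell is the average raw profile of all cells of the same type and that Pearson correlation is the measure used; the helper names and all numbers below are made up for illustration and are not results from SCALE.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant profile
    return cov / math.sqrt(var_x * var_y)


def meta_cell(profiles):
    """Average accessibility per peak across cells of one type (assumed definition)."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]


# Hypothetical raw binary profiles: three cells of one type, six peaks.
raw_cells = [
    [1, 0, 0, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 1, 0, 1, 0, 0],
]
# Hypothetical imputed values for the first cell (illustrative only).
imputed_first_cell = [0.8, 0.6, 0.1, 0.7, 0.0, 0.1]

meta = meta_cell(raw_cells)
r_raw = pearson(raw_cells[0], meta)
r_imputed = pearson(imputed_first_cell, meta)
```

In this toy case the imputed profile correlates more strongly with the meta-cell than the raw profile does, which is the kind of improvement the comparison above measures.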
Figure 3 Data denoising and imputation efficiency on simulation and real datasets
Application of SCALE on cell types and motifs
SCALE separates the two cell types well and is superior to PCA and scVI in terms of latent embedding (Figure 4a). In clustering, SCALE also produced the result closest to the protein index markers, better than scVI, scABC, and SC3 (Figure 4b).
Figure 4 Application of SCALE on cell types and motifs
SCALE disentangles biological cell types and batch effects
SCALE successfully separated Epcam+ tumor cells from CD45+ tumor-infiltrating immune cells (Figure 5a). A heatmap of the ten features learned by SCALE is shown in Figure 5b.
Figure 5 SCALE disentangles biological cell types and batch effects
Practical
Our system can be used for datasets generated with different protocols and of varying overall quality.
Visual
Our system uses t-SNE to visualize and present the analysis results to users clearly and directly.
Accurate
SCALE can accurately model high-dimensional, sparse, and multi-modal scATAC-seq data.
CD ComputaBio has assembled a team of experts with strong backgrounds in imaging science and clinical domain knowledge, providing AI-driven solutions for open chromatin regions analysis tailored to your detailed requirements.
Services