SignallingSingleCell is an R package that provides an end-to-end approach for the analysis of scRNA-seq data, with a particular focus on building ligand and receptor signaling networks.
The workflow of SignallingSingleCell consists of a few key steps;
Additionally, the package has a variety of features that were built in an ad hoc manner. These features include algorithms for the detection of marker genes that are specific to each cell type, aggregation functions used to construct in silico bulk data, and a supervised method of scRNA-seq analysis in order to contextualize the scRNA-seq data with commonly used immunological markers.
Each of the steps in the analysis workflow will be described in the following vignette by applying the package to the BMDC and human vitiligo scRNA-seq data.
We performed scRNA-seq on the bone marrow-derived dendritic cells (BDMCs) in order to determine the cell populations that are present, contextualize these cell populations based on recent studies. We have analyzed the data using the R package, SignallingSingleCell.
The data used in this vignette is derived from Bone marrow extracted from 6-8 week old female C57 mice, and floating cells were collected on day 8 after derivation with GM-CSF following the protocol in (Donnard et al., 2018). BMDCs were then stimulated with LPS for 0, 1, and 4 hr before performing scRNA-seq using an in house built inDrop system (Klein et al., 2015).
# devtools::install_github("garber-lab/SignallingSingleCell")
library("SignallingSingleCell")
library(kernlab)
load(url("https://vitiligo.dolphinnext.com/data/mDC_UMI_Table.Rdata"))
load(url("https://vitiligo.dolphinnext.com/data/ex_sc_skin.Rdata"))
The ExpressionSet class (ex_sc) is a convenient data structure that contains 3 dataframes. These dataframes contain expression data, cell information, and gene information respectively.
exprs(ex_sc) is the expression data, where rows are genes and columns are cells. pData(ex_sc) is cell information, where rows are cells and columns are metadata. fData(ex_sc) is gene information, where rows are genes and columns are metadata.
ex_sc <- construct_ex_sc(input = mDC_UMI_Table) # mDC_UMI_Table == Digital Gene Expression Matrix
class(mDC_UMI_Table)
[1] "matrix" "array"
class(ex_sc)
[1] "ExpressionSet"
attr(,"package")
[1] "Biobase"
ex_sc # pData() and fData() are empty
ExpressionSet (storageMode: lockedEnvironment)
assayData: 11584 features, 3814 samples
element names: exprs
protocolData: none
phenoData: none
featureData: none
experimentData: use 'experimentData(object)'
Annotation:
exprs(ex_sc)[1:5,1:5]
0hrA_TGACGGACAAGTAATC 0hrA_CACAACAGTAGCCTCG 0hrA_GTTTGTTTGCACCTCT
0610007P14Rik 0 0 0
0610009B22Rik 0 0 0
0610009O20Rik 0 0 0
0610010B08Rik 0 0 0
0610010F05Rik 0 0 0
0hrA_GCTTACCTTGACCCTC 0hrA_GGAGAAGCGCTTTGGC
0610007P14Rik 0 0
0610009B22Rik 0 0
0610009O20Rik 0 0
0610010B08Rik 0 0
0610010F05Rik 0 0
pData(ex_sc)[1:5,]
data frame with 0 columns and 5 rows
fData(ex_sc)[1:5,]
data frame with 0 columns and 5 rows
rm(mDC_UMI_Table)
Often we have metadata about the experiment that can be valuable in the analysis! Writing that information now may be appropriate. Our experiment consists of a time course with LPS stimulation. The cell names in the DGE Matrix contain a substring encoding this information.
rownames(pData(ex_sc))[c(1,2000,3000)]
[1] "0hrA_TGACGGACAAGTAATC" "1hrA_TCATGAGGCTTACTCC" "4hrA_CATACATTCCTATTCA"
pData(ex_sc)$Timepoint <- NA # initialize a new pData column
table(pData(ex_sc)$Timepoint)
< table of extent 0 >
pData(ex_sc)[grep("0hr", rownames(pData(ex_sc))),"Timepoint"] <- "0hr"
pData(ex_sc)[grep("1hr", rownames(pData(ex_sc))),"Timepoint"] <- "1hr"
pData(ex_sc)[grep("4hr", rownames(pData(ex_sc))),"Timepoint"] <- "4hr"
head(pData(ex_sc))
Timepoint
0hrA_TGACGGACAAGTAATC 0hr
0hrA_CACAACAGTAGCCTCG 0hr
0hrA_GTTTGTTTGCACCTCT 0hr
0hrA_GCTTACCTTGACCCTC 0hr
0hrA_GGAGAAGCGCTTTGGC 0hr
0hrA_AAATCAGAGATCTCGG 0hr
table(ex_sc$Timepoint)
0hr 1hr 4hr
1729 1004 1081
Often you will want to filter your data to remove low quality cells. There are many ways to do this, often using the number of captured transcripts, unique genes per cell, or the percent of mitochrondrial genes to remove low quality cells.
ex_sc <- calc_libsize(ex_sc, suffix = "raw") # sums counts for each cell
ex_sc <- pre_filter(ex_sc, minUMI = 1000, maxUMI = 10000, threshold = 1, minCells = 10, print_progress = TRUE) # filters cells and genes
[1] "Filtering Genes"
[1] "Filtering Low Cells"
ex_sc <- calc_libsize(ex_sc, suffix = "raw_filtered")
plot_density(ex_sc, title = "UMI Density", val = "UMI_sum_raw_filtered", statistic = "mean")
head(pData(ex_sc))
Timepoint UMI_sum_raw UMI_sum_raw_filtered
0hrA_TGACGGACAAGTAATC 0hr 4572 4570
0hrA_CACAACAGTAGCCTCG 0hr 1581 1580
0hrA_GTTTGTTTGCACCTCT 0hr 2296 2288
0hrA_GCTTACCTTGACCCTC 0hr 2098 2097
0hrA_GGAGAAGCGCTTTGGC 0hr 2425 2421
0hrA_AAATCAGAGATCTCGG 0hr 6618 6601
plot_density_ridge(ex_sc, title = "UMI Density", val = "UMI_sum_raw_filtered", color_by = "Timepoint")
Picking joint bandwidth of 444
Dimensionality reduction is necessary in order to bring the cells from a high dimensional gene expression space (~10k dimensions, one dimension per gene) down to a more reasonable number. Typically this is done first with PCA to bring it down to ~5-15 dimensions, before a final embedding is done using tSNE or UMAP to bring it down to 2 dimensions.
?subset_genes
gene_subset <- subset_genes(ex_sc, method = "PCA", threshold = 1, minCells = 20, nComp = 10, cutoff = 0.85)
dim(ex_sc)
Features Samples
11055 3812
length(gene_subset)
[1] 1513
ex_sc <- dim_reduce(ex_sc, genelist = gene_subset, pre_reduce = "vPCA", nVar = .9, nComp = 30, iterations = 500, print_progress=TRUE)
[1] "Starting vPCA"
[1] "Starting tSNE"
Read the 3812 x 6 data matrix successfully!
Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
Computing input similarities...
Building tree...
Done in 0.31 seconds (sparsity = 0.030912)!
Learning embedding...
Iteration 50: error is 85.845353 (50 iterations in 0.50 seconds)
Iteration 100: error is 75.307036 (50 iterations in 0.49 seconds)
Iteration 150: error is 74.449677 (50 iterations in 0.44 seconds)
Iteration 200: error is 74.240812 (50 iterations in 0.42 seconds)
Iteration 250: error is 74.171150 (50 iterations in 0.42 seconds)
Iteration 300: error is 2.162252 (50 iterations in 0.38 seconds)
Iteration 350: error is 1.794241 (50 iterations in 0.37 seconds)
Iteration 400: error is 1.620242 (50 iterations in 0.39 seconds)
Iteration 450: error is 1.520989 (50 iterations in 0.39 seconds)
Iteration 500: error is 1.459182 (50 iterations in 0.40 seconds)
Fitting performed in 4.20 seconds.
plot_tsne_metadata(ex_sc, color_by = "Timepoint")
ex_sc <- dim_reduce(ex_sc, genelist = gene_subset, pre_reduce = "iPCA", nComp = 12, tSNE_perp = 30, iterations = 500, print_progress=FALSE)
plot_tsne_metadata(ex_sc, color_by = "Timepoint")
plot_tsne_metadata(ex_sc, color_by = "iPC_Comp1", title = "PC1 cell loadings")
plot_tsne_metadata(ex_sc, color_by = "iPC_Comp2", title = "PC2 cell loadings")
plot_tsne_metadata(ex_sc, color_by = "iPC_Comp3", title = "PC3 cell loadings")
Now that we have dimension reduced data we can try clustering it! For dimensions, both Comp and 2d are supported. There will determine if the clustering is done on principal components, or on the 2D representation. There are also 2 clustering algorithms available, density and spectral. Typically we recommend spectral clustering on PCA components, or density clustering on the 2d representation. Try both!
ex_sc <- cluster_sc(ex_sc, dimension = "Comp", method = "spectral", num_clust = 4)
ex_sc$cluster_spectral <- ex_sc$Cluster
ex_sc <- cluster_sc(ex_sc, dimension = "2d", method = "density", num_clust = 4)
Distance cutoff calculated to 0.05082406
ex_sc$cluster_density <- ex_sc$Cluster
plot_tsne_metadata(ex_sc, color_by = "cluster_spectral")
plot_tsne_metadata(ex_sc, color_by = "cluster_density")
There are many possible ways to identify cell types based on their gene expression. The id_markers function will identify genes that are highly expressed in a high proportion of a given cluster, relative to the other clusters.
ex_sc <- id_markers(ex_sc, print_progress = TRUE)
[1] "Finding markers based on fraction expressing"
[1] "Finding markers based on mean expressing"
[1] "Merging Lists"
ex_sc <- calc_agg_bulk(ex_sc, aggregate_by = "Cluster")
markers <- return_markers(ex_sc, num_markers = 15)
plot_heatmap(ex_sc, genes = unique(unlist(markers)), type = "bulk")
Now that the data has preliminary clusters, we can normalize. SCRAN normalization will first normalize internally in clusters, before normalizing across clusters. Once the data is normalized we can run the same steps as above before visualization. The first step is to select the genes to be used for normalization. One method would be to first only use genes expressed in more than n cells, and then remove the most variable genes. This method can be computationally expensive, and is currently commented out. A simpler approach, counts per million, is also provided below.
table(pData(ex_sc)$Cluster)
Cluster1 Cluster2 Cluster3 Cluster4
962 736 362 1752
# ex_sc_norm <- norm_sc(ex_sc, pool_sizes = c(20,25,30,35,40))
x <- exprs(ex_sc)
cSum <- apply(x,2,sum) # sum counts for each cell
x <- as.matrix(sweep(x,2,cSum,FUN='/'))*1e4 # normalize to UMIs per 10k
ex_sc_norm <- construct_ex_sc(x) # Make a new expression set with normalized counts
pData(ex_sc_norm) <- pData(ex_sc) # Copy over the pData
ex_sc_norm <- calc_libsize(ex_sc_norm, suffix = "CP10k")
plot_tsne_metadata(ex_sc_norm, color_by = "UMI_sum_raw")
plot_tsne_metadata(ex_sc_norm, color_by = "UMI_sum_CP10k")
From the above analysis, it is clear that some clusters are formed based on their cell type, while others are based on their experimental condition. In these cases it can be useful to incorporate prior information in order to obtain clusters and annotations that are grounded in biological significance. Below, we can assign “panels” similar to flow cytometry, that will enable cell groupings based on the expression of genes that you believe to signify biologically relevant cell types.
panel1 <- c("S100a9", "Mmp9") # Neutrophil Markers
panel2 <- c("Ccr7", "Fscn1", "Flt3") # DC
panel3 <- c("Mertk", "Csf1r") # Mac
panels <- list(panel1, panel2, panel3)
plot_tsne_gene(ex_sc_norm, gene = panels[[1]], title = "", log_scale = T)
plot_tsne_gene(ex_sc_norm, gene = panels[[2]], title = "", log_scale = T)
plot_tsne_gene(ex_sc_norm, gene = panels[[3]], title = "", log_scale = T)
names(panels) <- c("Neutrophil", "Dendritic", "Macrophage")
panels
$Neutrophil
[1] "S100a9" "Mmp9"
$Dendritic
[1] "Ccr7" "Fscn1" "Flt3"
$Macrophage
[1] "Mertk" "Csf1r"
ex_sc_norm <- flow_filter(ex_sc_norm, panels = panels, title = "Flow Pass Cells")
ex_sc_norm <- flow_svm(ex_sc_norm, pcnames = "Comp")
Loading required package: lattice
plot_tsne_metadata(ex_sc_norm, color_by = "cluster_spectral", facet_by = "Timepoint")
plot_tsne_metadata(ex_sc_norm, color_by = "SVM_Classify")
Now that cells are grouped by their cell type, we can run DE in order to determine which genes are change in association with our experimental conditions.
For simplicity we can subset to 0hr and 4hr, to find the genes that change between these conditions.
It should be noted that DE should always be run on raw counts, not on the normalized counts!
ex_sc_norm_0_4 <- subset_ex_sc(ex_sc_norm, variable = "Timepoint", select = c("0hr", "4hr"))
[1] "Subsetted Data is 2808 cells"
0hr 4hr
1729 1079
table(pData(ex_sc_norm_0_4)[,c("SVM_Classify", "Timepoint")])
Timepoint
SVM_Classify 0hr 4hr
Dendritic 115 109
Macrophage 1582 921
Neutrophil 32 49
findDEgenes(input = ex_sc,
pd = pData(ex_sc_norm_0_4),
DEgroup = "Timepoint",
contrastID = "4hr",
facet_by = "SVM_Classify",
outdir = "~/Downloads/")
[1] "Performing DE for Dendritic"
Calculating norm factors...
Estimating dispersion...
Doing likelihood ratio fit...
[1] "Performing DE for Macrophage"
Calculating norm factors...
Estimating dispersion...
Doing likelihood ratio fit...
[1] "Performing DE for Neutrophil"
Calculating norm factors...
Estimating dispersion...
Doing likelihood ratio fit...
# Macrophages
g <- plot_volcano(de_path = "~/Downloads/", de_file = "Macrophage_4hr_DEresults.tsv", fdr_cut = 1E-150, logfc_cut = 1)
plot(g)
mac_de <- unique(g$data$label)
mac_de <- mac_de[-which(mac_de == "")]
plot_violin(ex_sc_norm_0_4, color_by = "Timepoint", facet_by = "SVM_Classify", gene = "Rsad2")
# DCs
g <- plot_volcano(de_path = "~/Downloads/", de_file = "Dendritic_4hr_DEresults.tsv", fdr_cut = 0.0001, logfc_cut = 2)
plot(g)
dc_de <- unique(g$data$label)
dc_de <- dc_de[-which(dc_de == "")]
plot_violin(ex_sc_norm_0_4, color_by = "Timepoint", facet_by = "SVM_Classify", gene = "Ifit1")
# Neutrophils
g <- plot_volcano(de_path = "~/Downloads/", de_file = "Neutrophil_4hr_DEresults.tsv", fdr_cut = 1E-5, logfc_cut = 3)
plot(g)
neut_de <- unique(g$data$label)
neut_de <- neut_de[-which(neut_de == "")]
plot_violin(ex_sc_norm_0_4, color_by = "Timepoint", facet_by = "SVM_Classify", gene = "Mmp9")
# Heatmap of these DE genes
all_de <- c(mac_de, dc_de, neut_de)
unique(all_de)
[1] "Rsad2" "Cxcl10" "Ccl4" "Ifit1"
[5] "Ccl5" "Cmpk2" "Il1b" "Il12b"
[9] "Ccl3" "Ifit2" "Irg1" "Il1a"
[13] "Oasl1" "Il6" "Cd69" "Cxcl2"
[17] "Cxcl3" "Tnfsf15" "Tnf" "Cd40"
[21] "Ccnd2" "Saa3" "Ptx3" "Mx1"
[25] "Socs3" "Gbp5" "Slc7a2" "Ifih1"
[29] "Sdc4" "Ifit3" "Clec4e" "Ifi47"
[33] "Nos2" "Ch25h" "Isg15" "Socs1"
[37] "Ifi205" "Plek" "Gm14446" "Gbp2"
[41] "Iigp1" "Marcksl1" "Plet1" "Tnfaip3"
[45] "AA467197" "Serpina3g" "Ptgs2" "Usp18"
[49] "Clic4" "Ccl2" "Pyhin1" "Trim30a"
[53] "Slfn5" "Oasl2" "Slco3a1" "Inhba"
[57] "Gbp7" "Mndal" "Irgm1" "Irf7"
[61] "Ifi203" "Mnda" "Cav1" "Il1rn"
[65] "Wdr92" "Cd14" "Ccrl2" "Slc7a11"
[69] "Ifi204" "Nt5c3" "Parp14" "Cd274"
[73] "Cfb" "Gbp3" "Gm12250" "Helz2"
[77] "Peli1" "Acsl1" "Il15" "Ddx58"
[81] "Jak2" "Vcan" "Fam26f" "Cflar"
[85] "Gm4951" "Gadd45b" "Ms4a4c" "Sh3bp5"
[89] "Pim1" "Trim30d" "Rtp4" "Gbp4"
[93] "Daxx" "Pnp" "Slamf7" "Igtp"
[97] "Ehd1" "Tap1" "Sod2" "Bcl2a1b"
[101] "Bcl2a1a" "Nfkbia" "Znfx1" "Tnfaip2"
[105] "Slfn2" "Csf1r" "Srgn" "Ftl1"
[109] "Dusp2" "Fgl2" "Ccl17" "Lgals9"
[113] "Cxcl9" "Cited2" "Parp9" "Bcl2l11"
[117] "Ccl9" "Dck" "Sp100" "Cd86"
[121] "Bcl2l1" "Ebi3" "Cd70" "Alcam"
[125] "Ndrg2" "Dhx58" "Prkx" "Xaf1"
[129] "Snx2" "Irgm2" "Nampt" "Flt3"
[133] "Dnajc5" "Cd53" "Tcf4" "Tra2a"
[137] "Phf11b" "Vopp1" "Pfkp" "Zbp1"
[141] "Parp12" "Ramp3" "Lacc1" "Mylk"
[145] "Ehd4" "Usp25" "Stat5a" "AI607873"
[149] "Flnb" "Bbx" "Herc6" "Rapgef2"
[153] "Tnip3" "Prkrir" "Fn1" "Lcn2"
[157] "Pla1a" "Ranbp2" "Rfk" "Cd81"
[161] "Pea15a" "Tor1aip1" "Retnlg" "Lipg"
[165] "Ccl7" "H2-M2" "Slfn4" "Cycs"
[169] "Stat2" "4632428N05Rik" "Hist1h1c" "Mmp9"
[173] "Tagap" "Add3" "Cdc42ep3" "Cnn2"
[177] "Fam101b" "Cd300lb" "Mt2" "Nfkbiz"
[181] "Cxcl11" "Slc2a3" "Mapkapk2" "R3hdm4"
[185] "Msl1" "Tor3a" "Zfp36l2" "Csf3r"
[189] "Capza2"
ex_sc_norm <- calc_agg_bulk(ex_sc_norm, aggregate_by = c("Timepoint", "SVM_Classify"))
plot_heatmap(ex_sc_norm, genes = c(mac_de, dc_de, neut_de), type = "bulk", facet_by = "SVM_Classify", gene_names = F)
We performed scRNA data analysis using SignallingSingleCell on affected and unaffected skin from vitiligo patients, as well as healthy controls, to define the role of each cell type in coordinating autoimmunity during disease progression.
The data was generated using suction blistering followed by scRNA-seq on 10 subjects with active vitiligo (treatment-naïve for at least 6 months) and 7 healthy individuals. For each vitiligo subject, we collected 2 sets of blisters, one lesional sample from affected skin and one non-lesional sample from unaffected skin, located at least 10cm away from any visible lesion.
You can find more details in,
https://www.science.org/doi/10.1126/scitranslmed.abd8995
For convenience we are providing 3 processed data tables here. One is a raw sparse matrix, that is the output of our processing pipeline with no further manipulation. We are also providing a fully processed UMI table that has been filtered, dimension reduced, clustered, and normalized. The processed UMI table is provided in the expression set class format (https://www.bioconductor.org/packages/devel/bioc/vignettes/Biobase/inst/doc/ExpressionSetIntroduction.pdf). We are also providing a separate processed UMI table for the T Cells.
The raw .fastq files have been deposited through dbGAP.
https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs002455.v1.p1
These files are protected, and you will need an account and to request access to download the files. As of 10/5/21 the data files are processing on the dbGAP server.
# devtools::install_github("garber-lab/SignallingSingleCell")
library(SignallingSingleCell)
library(kernlab)
# load(url("https://vitiligo.dolphinnext.com/data/phs002455.v1.p1_UMITable.Rdata")) # The Raw Data
load(url("https://vitiligo.dolphinnext.com/data/phs002455_processed_UMItable.Rdata"))
load(url("https://vitiligo.dolphinnext.com/data/phs002455_TCells_processed_UMItable.Rdata"))
The ExpressionSet class (ex_sc) is a convenient data structure that contains 3 dataframes. These dataframes contain expression data, cell information, and gene information respectively.
exprs(ex_sc) is the expression data, where rows are genes and columns are cells.
pData(ex_sc) is cell information, where rows are cells and columns are metadata.
fData(ex_sc) is gene information, where rows are genes and columns are metadata.
colnames(pData(phs002455_processed_UMItable))
[1] "Patient" "Disease" "Phenotype" "Skin"
[5] "UMI_sum_raw" "x" "y" "Cluster"
[9] "UMI_sum_norm" "CellType" "Cluster_Fig_1B" "Cluster_Refined"
# Columns 1-4 are all phenotypic metadata related to the patient sample
# Columns 5-12 are all related to cell specific metadata such as tSNE coordinates, Clusters, etc
colnames(fData(phs002455_processed_UMItable))
[1] "CD8_marker_score_Cluster_Refined"
[2] "DC_marker_score_Cluster_Refined"
[3] "GD_marker_score_Cluster_Refined"
[4] "KRT-B1_marker_score_Cluster_Refined"
[5] "KRT-B2_marker_score_Cluster_Refined"
[6] "KRT-ECR_marker_score_Cluster_Refined"
[7] "KRT-GR_marker_score_Cluster_Refined"
[8] "KRT-SP_marker_score_Cluster_Refined"
[9] "MAC_marker_score_Cluster_Refined"
[10] "MEL_marker_score_Cluster_Refined"
[11] "NK_marker_score_Cluster_Refined"
[12] "TConv_marker_score_Cluster_Refined"
[13] "Treg_marker_score_Cluster_Refined"
# The only thing stored in fData right now are cluster marker scores.
# These were calculated with id_markers()
Reproduce Figure 1B from https://www.science.org/doi/10.1126/scitranslmed.abd8995
plot_tsne_metadata(phs002455_processed_UMItable, color_by = "Cluster_Fig_1B", shuffle = T)
plot_tsne_gene(phs002455_processed_UMItable, gene = c("TRAC", "TYR"))
# Additional plots
plot_violin(phs002455_processed_UMItable, gene = c("IFNG"), color_by = "Skin", facet_by = "Cluster_Refined")
markers <- return_markers(phs002455_processed_UMItable, return_by = "Cluster_Refined", num_markers = 5)
markers
$CD8_Markers
[1] "CD8A" "NKG7" "CCL5" "CD8B" "CD3D"
$DC_Markers
[1] "CD207" "CD1A" "FCER1A" "HLA-DQB2" "AIF1"
$GD_Markers
[1] "TARP" "TRGC1" "CD8A" "CCL5" "TRGC2"
$`KRT-B1_Markers`
[1] "CYR61" "COL17A1" "POSTN" "SYT8" "RND3"
$`KRT-B2_Markers`
[1] "S100A2" "KRT15" "CCDC3" "TP63" "MIR205HG"
$`KRT-ECR_Markers`
[1] "KRT6B" "MUCL1" "S100P" "MMP7" "KRT77"
$`KRT-GR_Markers`
[1] "KRT2" "LGALS7" "SBSN" "KLK11" "KRTDAP"
$`KRT-SP_Markers`
[1] "KRTDAP" "LGALS7" "LGALS7B" "SBSN" "DMKN"
$MAC_Markers
[1] "CLEC10A" "FPR3" "LGALS2" "LYZ" "CPVL"
$MEL_Markers
[1] "APOD" "TYR" "BCAN" "DCT" "PLP1"
$NK_Markers
[1] "SPINK2" "XCL1" "CTSW" "KLRB1" "TRDC"
$TConv_Markers
[1] "IL32" "TRBC2" "CD3D" "CD6" "CD48"
$Treg_Markers
[1] "CD27" "CD3D" "IL32" "TRBC2" "CTLA4"
plot_gene_dots(phs002455_processed_UMItable, genes = unique(unlist(markers)), break_by = "Cluster_Refined")
Reproduce Figure 1C from https://www.science.org/doi/10.1126/scitranslmed.abd8995
plot_tsne_metadata(phs002455_TCells_processed_UMItable, color_by = "Cluster", shuffle = T)
plot_tsne_gene(phs002455_TCells_processed_UMItable, gene = c("TRAC", "CD4", "CD8A", "FOXP3", "TRGC1", "FCER1G"))
# To create the aggregate bulk heatmaps first calculate the aggregate bulk values
phs002455_TCells_processed_UMItable <- calc_agg_bulk(phs002455_TCells_processed_UMItable, aggregate_by = "Cluster")
# Then identify markers
phs002455_TCells_processed_UMItable <- id_markers(phs002455_TCells_processed_UMItable, id_by = "Cluster", overwrite = T)
[1] "Finding markers based on fraction expressing"
[1] "Finding markers based on mean expressing"
[1] "Merging Lists"
markers <- return_markers(phs002455_TCells_processed_UMItable)
markers
$CD8_Markers
[1] "CD8A" "GZMH" "CD8B" "NKG7" "GZMK" "GZMB" "EOMES" "GZMA" "CCL4"
[10] "CCL5"
$GD_Markers
[1] "TARP" "TRGC1" "TRGC2" "KLRC3" "CD8A" "CD8B" "NKG7"
[8] "FXYD2" "KIR3DL2" "KLRC2"
$NK_Markers
[1] "SPINK2" "FCER1G" "XCL2" "KLRF2" "XCL1" "MB"
[7] "TMPRSS11E" "KRT81" "SPTSSB" "TRDC"
$TConv_Markers
[1] "CD40LG" "CD4" "CD6" "CCR6" "TNFRSF25" "ADAM19"
[7] "IL26" "CTSH" "HLF" "RORA"
$Treg_Markers
[1] "CD4" "FOXP3" "CTLA4" "GBP5" "TBC1D4" "TIGIT" "CD27" "LAIR2"
[9] "IKZF4" "F5"
markers <- unique(unlist(markers))
Now create a heatmap using these values
plot_heatmap(phs002455_TCells_processed_UMItable, type = "bulk", genes = markers)
These functions are still in development, but we have included them for illustration purposes. The goal is to take aggregate bulk CPM values derived from scRNA-seq, cross reference them to a database of ligand and receptor pairs (Ramilowski, Jordan A et al. “A draft network of ligand-receptor-mediated multicellular signalling in human.” Nature communications vol. 6 7866. 22 Jul. 2015, doi:10.1038/ncomms8866), and then construct a network. The first step is to calculate the aggregate bulk CPM values, followed by annotating the ligands and receptors in the data.
# This function is calculating our aggregate bulk CPM values. We have some thresholds (25 CPM or the value will be set to 0, and 15 minimum cells expressing within the group or the value set to 0). This is to greatly reduce the complexity of the network, and reduce spurious nodes with little supporting data.
phs002455_processed_UMItable <- calc_agg_bulk(phs002455_processed_UMItable, aggregate_by = c("Skin", "Cluster_Refined"), group_by = "Skin", cutoff_cpm = 25, cutoff_num = 15)
# The aggregate bulk data is written into fData()
colnames(fData(phs002455_processed_UMItable))[14:ncol(fData(phs002455_processed_UMItable))]
[1] "1-Healthy_CD8_num_genes_11521_num_cells_311_percent_3.28_bulk"
[2] "2-NonLesional_CD8_num_genes_10925_num_cells_197_percent_1.79_bulk"
[3] "3-Disease_CD8_num_genes_13319_num_cells_1444_percent_12.13_bulk"
[4] "1-Healthy_DC_num_genes_12950_num_cells_1860_percent_19.64_bulk"
[5] "2-NonLesional_DC_num_genes_11836_num_cells_753_percent_6.82_bulk"
[6] "3-Disease_DC_num_genes_12558_num_cells_1184_percent_9.95_bulk"
[7] "1-Healthy_GD_num_genes_9552_num_cells_96_percent_1.01_bulk"
[8] "2-NonLesional_GD_num_genes_9891_num_cells_116_percent_1.05_bulk"
[9] "3-Disease_GD_num_genes_12381_num_cells_564_percent_4.74_bulk"
[10] "1-Healthy_KRT-B1_num_genes_8841_num_cells_157_percent_1.66_bulk"
[11] "2-NonLesional_KRT-B1_num_genes_8561_num_cells_142_percent_1.29_bulk"
[12] "3-Disease_KRT-B1_num_genes_10920_num_cells_401_percent_3.37_bulk"
[13] "1-Healthy_KRT-B2_num_genes_12948_num_cells_1909_percent_20.16_bulk"
[14] "2-NonLesional_KRT-B2_num_genes_13707_num_cells_3924_percent_35.57_bulk"
[15] "3-Disease_KRT-B2_num_genes_13762_num_cells_3721_percent_31.27_bulk"
[16] "1-Healthy_KRT-ECR_num_genes_7783_num_cells_108_percent_1.14_bulk"
[17] "2-NonLesional_KRT-ECR_num_genes_6756_num_cells_78_percent_0.71_bulk"
[18] "3-Disease_KRT-ECR_num_genes_8872_num_cells_171_percent_1.44_bulk"
[19] "1-Healthy_KRT-GR_num_genes_12351_num_cells_1100_percent_11.61_bulk"
[20] "2-NonLesional_KRT-GR_num_genes_11157_num_cells_457_percent_4.14_bulk"
[21] "3-Disease_KRT-GR_num_genes_10234_num_cells_193_percent_1.62_bulk"
[22] "1-Healthy_KRT-SP_num_genes_13079_num_cells_2593_percent_27.38_bulk"
[23] "2-NonLesional_KRT-SP_num_genes_13648_num_cells_4431_percent_40.16_bulk"
[24] "3-Disease_KRT-SP_num_genes_12802_num_cells_1871_percent_15.72_bulk"
[25] "1-Healthy_MAC_num_genes_10720_num_cells_75_percent_0.79_bulk"
[26] "2-NonLesional_MAC_num_genes_10462_num_cells_63_percent_0.57_bulk"
[27] "3-Disease_MAC_num_genes_12454_num_cells_192_percent_1.61_bulk"
[28] "1-Healthy_MEL_num_genes_13299_num_cells_591_percent_6.24_bulk"
[29] "2-NonLesional_MEL_num_genes_13235_num_cells_503_percent_4.56_bulk"
[30] "3-Disease_MEL_num_genes_11742_num_cells_151_percent_1.27_bulk"
[31] "1-Healthy_NK_num_genes_11275_num_cells_228_percent_2.41_bulk"
[32] "2-NonLesional_NK_num_genes_9811_num_cells_80_percent_0.73_bulk"
[33] "3-Disease_NK_num_genes_12048_num_cells_365_percent_3.07_bulk"
[34] "1-Healthy_TConv_num_genes_11591_num_cells_303_percent_3.2_bulk"
[35] "2-NonLesional_TConv_num_genes_10372_num_cells_145_percent_1.31_bulk"
[36] "3-Disease_TConv_num_genes_12256_num_cells_571_percent_4.8_bulk"
[37] "1-Healthy_Treg_num_genes_10566_num_cells_140_percent_1.48_bulk"
[38] "2-NonLesional_Treg_num_genes_10429_num_cells_144_percent_1.31_bulk"
[39] "3-Disease_Treg_num_genes_13003_num_cells_1073_percent_9.02_bulk"
# This function will now cross reference the ligand and receptor database, and then annotate this information into fData
phs002455_processed_UMItable <- id_rl(phs002455_processed_UMItable)
# The ligand and receptor annotations are written into fData(). The IFNG entries are highlighted below.
fData(phs002455_processed_UMItable)[c("IFNG", "IFNGR1", "IFNGR2"),c(53:56)]
networks_Receptors networks_ligands networks_ligs_to_receptor
IFNG FALSE TRUE <NA>
IFNGR1 TRUE FALSE IFNG
IFNGR2 TRUE FALSE IFNG
networks_receptor_to_ligs
IFNG IFNGR1_IFNGR2
IFNGR1 <NA>
IFNGR2 <NA>
vals <- which(fData(phs002455_processed_UMItable)[,"networks_ligands"])
ligs <- rownames(fData(phs002455_processed_UMItable))[vals]
length(ligs)
[1] 278
plot_heatmap(phs002455_processed_UMItable, genes = ligs, type = "bulk", cluster_by = "row",facet_by = "Cluster_Refined", pdf_format = "tile", scale_by = "row", cluster_type = "kmeans", k = 15)
vals <- which(fData(phs002455_processed_UMItable)[,"networks_Receptors"])
recs <- rownames(fData(phs002455_processed_UMItable))[vals]
length(recs)
[1] 319
plot_heatmap(phs002455_processed_UMItable, genes = recs, type = "bulk", cluster_by = "row",facet_by = "Cluster_Refined", pdf_format = "tile", scale_by = "row", cluster_type = "kmeans", k = 15)
Now we can calculate the long format network table. Please note this step is slow and poorly optimized… it needs to be rewritten. On my laptop it takes about 30 minutes to run.
Please keep in mind the network scales exponentially with the number of groups you provide. For each individual ligand or receptor, a node is created when the cell type expresses it, and an edge between all of its expressed pairs. This means that if 5 cells express ligand A which binds to receptor B, there will be 5 ligand A nodes, and an edge from each of them to receptor B node. The more promiscuous a given ligand and receptor interaction, and the more cell types in the data, this network quickly explodes to become unmanageable in size.
We have found 10-15 cell types is the upper limit of what can be done on a local computer without needing a cluster environment.
For convenience a download link to the output file is provided below, however you may opt to run the command on your own.
### Warning Slow!!! Use provided output table for speed. The command is left commented for speed reasons ###
# network_table <- calc_rl_connections(phs002455_processed_UMItable, nodes = "Cluster_Refined", group_by = "Skin", print_progress = T)
load(url("https://www.dropbox.com/s/p30mwg5pcj58cc6/network_table.Rdata?dl=1"))
str(network_table)
List of 3
$ Summary :'data.frame': 507 obs. of 6 variables:
..$ Lig_produce : chr [1:507] "CD8" "CD8" "CD8" "CD8" ...
..$ Rec_receive : chr [1:507] "CD8" "CD8" "CD8" "DC" ...
..$ Skin : chr [1:507] "1-Healthy" "2-NonLesional" "3-Disease" "1-Healthy" ...
..$ num_connections : int [1:507] 133 141 141 113 115 115 107 106 110 69 ...
..$ fraction_connections : num [1:507] 0.01154 0.01291 0.01059 0.00981 0.01053 ...
..$ proportion_connections: num [1:507] 0.0968 0.1004 0.0999 0.0822 0.0819 ...
$ full_network :'data.frame': 47540 obs. of 21 variables:
..$ Cluster_Refined : chr [1:47540] "KRT-SP" "KRT-SP" "KRT-SP" "KRT-SP" ...
..$ Ligand : chr [1:47540] "GPI" "HLA-A" "HLA-B" "APP" ...
..$ Cluster_Refined : chr [1:47540] "KRT-SP" "KRT-SP" "KRT-SP" "KRT-SP" ...
..$ Receptor : chr [1:47540] "AMFR" "APLP2" "CANX" "CAV1" ...
..$ Skin : chr [1:47540] "1-Healthy" "1-Healthy" "1-Healthy" "1-Healthy" ...
..$ Ligand_expression : num [1:47540] 142.5 412.4 959.8 58.1 59.9 ...
..$ Receptor_expression : num [1:47540] 36.1 31.3 190.2 81.4 81.4 ...
..$ Connection_product : num [1:47540] 5150 12928 182571 4726 4876 ...
..$ log10_Connection_product : num [1:47540] 3.71 4.11 5.26 3.67 3.69 ...
..$ Connection_ratio : num [1:47540] 3.946 13.157 5.046 0.714 0.736 ...
..$ log10_Connection_ratio : num [1:47540] 0.596 1.119 0.703 -0.146 -0.133 ...
..$ Connection_rank_coarse : num [1:47540] 1913 1361 308 1971 1957 ...
..$ Connection_Z_coarse : num [1:47540] 2.46 2.76 3.64 2.43 2.44 ...
..$ Connection_rank_coarse_grouped: num [1:47540] 684 493 104 712 705 ...
..$ Connection_Z_coarse_grouped : num [1:47540] 2.41 2.7 3.56 2.38 2.39 ...
..$ Connection_rank_fine : num [1:47540] 92 52 4 99 95 44 63 161 5 187 ...
..$ Connection_Z_fine : num [1:47540] 2.99 3.35 4.38 2.96 2.97 ...
..$ Connection_rank_fine_grouped : num [1:47540] 38 23 2 41 40 19 26 62 3 68 ...
..$ Connection_Z_fine_grouped : num [1:47540] 2.85 3.2 4.18 2.82 2.83 ...
..$ Zscores_genes : num [1:47540] 0.0813 -0.3408 -1.5589 0.9897 1.4664 ...
..$ Zscores_genes_grouped : num [1:47540] 0.0538 -0.3484 -1.528 0.9381 1.5023 ...
$ full_network_raw:'data.frame': 351351 obs. of 21 variables:
..$ Cluster_Refined : chr [1:351351] "KRT-SP" "KRT-SP" "KRT-SP" "KRT-SP" ...
..$ Ligand : chr [1:351351] "CALM1" "CALM2" "CALM3" "LIN7C" ...
..$ Cluster_Refined : chr [1:351351] "KRT-SP" "KRT-SP" "KRT-SP" "KRT-SP" ...
..$ Receptor : chr [1:351351] "ABCA1" "ABCA1" "ABCA1" "ABCA1" ...
..$ Skin : chr [1:351351] "1-Healthy" "1-Healthy" "1-Healthy" "1-Healthy" ...
..$ Ligand_expression : num [1:351351] 409.5 608.3 203.5 32.3 54.3 ...
..$ Receptor_expression : num [1:351351] 0 0 0 0 0 0 0 0 0 0 ...
..$ Connection_product : num [1:351351] 0 0 0 0 0 0 0 0 0 0 ...
..$ log10_Connection_product : num [1:351351] 0 0 0 0 0 0 0 0 0 0 ...
..$ Connection_ratio : num [1:351351] 0 0 0 0 0 0 0 0 0 0 ...
..$ log10_Connection_ratio : num [1:351351] 0 0 0 0 0 0 0 0 0 0 ...
..$ Connection_rank_coarse : num [1:351351] 15109 15109 15109 15109 15109 ...
..$ Connection_Z_coarse : num [1:351351] -0.356 -0.356 -0.356 -0.356 -0.356 ...
..$ Connection_rank_coarse_grouped: num [1:351351] 5035 5035 5035 5035 5035 ...
..$ Connection_Z_coarse_grouped : num [1:351351] -0.357 -0.357 -0.357 -0.357 -0.357 ...
..$ Connection_rank_fine : num [1:351351] 1145 1145 1145 1145 1145 ...
..$ Connection_Z_fine : num [1:351351] -0.327 -0.327 -0.327 -0.327 -0.327 ...
..$ Connection_rank_fine_grouped : num [1:351351] 382 382 382 382 382 382 382 382 382 382 ...
..$ Connection_Z_fine_grouped : num [1:351351] -0.329 -0.329 -0.329 -0.329 -0.329 ...
..$ Zscores_genes : num [1:351351] -0.288 -0.288 -0.288 -0.263 -0.191 ...
..$ Zscores_genes_grouped : num [1:351351] -0.288 -0.288 -0.288 -0.263 -0.19 ...
# The lists contains 3 entries.
# Summary provides a high level view of the number of connections between cell types.
# full_network is the same as full_network_raw, except that full_network_raw will preserve entries where either the ligand or receptor is 0 in expression.
head(network_table$full_network)
Cluster_Refined Ligand Cluster_Refined Receptor Skin Ligand_expression
27 KRT-SP GPI KRT-SP AMFR 1-Healthy 142.54554
28 KRT-SP HLA-A KRT-SP APLP2 1-Healthy 412.41139
46 KRT-SP HLA-B KRT-SP CANX 1-Healthy 959.83108
48 KRT-SP APP KRT-SP CAV1 1-Healthy 58.08089
50 KRT-SP CYR61 KRT-SP CAV1 1-Healthy 59.91347
51 KRT-SP GNAI2 KRT-SP CAV1 1-Healthy 195.81901
Receptor_expression Connection_product log10_Connection_product
27 36.12779 5149.856 3.711795
28 31.34619 12927.527 4.111515
46 190.21150 182570.911 5.261432
48 81.37710 4726.455 3.674535
50 81.37710 4875.584 3.688027
51 81.37710 15935.183 4.202357
Connection_ratio log10_Connection_ratio Connection_rank_coarse
27 3.9455924 0.5961122 1913
28 13.1566661 1.1191459 1361
46 5.0461254 0.7029580 308
48 0.7137253 -0.1464689 1971
50 0.7362448 -0.1329777 1957
51 2.4063160 0.3813527 1231
Connection_Z_coarse Connection_rank_coarse_grouped
27 2.459681 684
28 2.762894 493
46 3.635177 104
48 2.431417 712
50 2.441651 705
51 2.831803 433
Connection_Z_coarse_grouped Connection_rank_fine Connection_Z_fine
27 2.405913 92 2.994012
28 2.703450 52 3.351666
46 3.559407 4 4.380566
48 2.378178 99 2.960674
50 2.388220 95 2.972745
51 2.771069 44 3.432948
Connection_rank_fine_grouped Connection_Z_fine_grouped Zscores_genes
27 38 2.853324 0.08134314
28 23 3.196009 -0.34080106
46 2 4.181846 -1.55893737
48 41 2.821381 0.98973688
50 40 2.832947 1.46642871
51 19 3.273889 0.80125671
Zscores_genes_grouped
27 0.05375211
28 -0.34838912
46 -1.52802185
48 0.93809202
50 1.50234668
51 0.77596199
Each row is defined as a cell type expressing a ligand, a cell type expressing the receptor, as well as the condition (Skin, ie Healthy, Non-lesional, Lesional), the expression values of the ligand and receptor, as well as the log10_Connection_product, which is a log10 transformed product of the ligand and expression CPM values.
There are also some basic statistics for these connections (z scores, etc), however these are not used in the manuscript and are beyond the scope of this vignette.
Now we can take this long format table, and convert it into an igraph object. There are several ways this could be done…
One way would be to create 3 separate networks. One for healthy, one for non-lesional, one for lesional. However this has caveats. Because a ligand or receptor may go from OFF to ON, this means the landscape (the nodes and edges of the network) would change between conditions. To avoid this complexity, we opted to create a network that is a superset of all available networks (merge_all = T argument). In this way all possible nodes and edges are represented. Later on, we then query the original network_table in order to determine the edge values associate with each condition.
Please note that these functions write out files by DEFAULT. These files include plots and data. Set your working directory accordingly to track these files location. The prefix argument determines the prefix for these written files. Here we use the log10_Connection_product as the edge values for this network.
network_igraph <- build_rl_network(input = network_table, merge_all = T, value = "log10_Connection_product", prefix = "scitranslmed.abd8995_network")
names(network_igraph)
[1] "igraph_Network" "layout" "clusters"
[4] "clusters_subgraphs" "interactive"
The igraph_network is the full network
The layout is a 2D matrix with node positions
Clusters shows the cluster membership of each node
clusters_subgraphs contains the subgraphs for each cluster
interactive is the web browser based display. Keep in mind this renders in browser and can take a long time to fully render for the full network.
The network contains both a main connected body, as well as some orphan ligands and receptors with no annotated edge to connect them to the main network. For this reason, further clustering is only performed on the main graph (subset = 1, subset_on = “connected” arguments)
network_igraph_C1_analyzed <- analyze_rl_network(network_igraph, subset = 1, subset_on = "connected", prefix = "scitranslmed.abd8995_network_C1", cluster_type = "louvain", merge_singles = F)
names(network_igraph_C1_analyzed)
[1] "Degree" "Node_degree"
[3] "Node_betweeness" "Node_authority"
[5] "Edge_degree" "Edge_betweeness"
[7] "Edge_hub" "Crossing"
[9] "Clusters_Results" "Clusters_individual"
[11] "Clusters_betweeness" "Clusters_edge_hub"
[13] "Clusters_node_authority" "Interactive"
[15] "igraph_Network" "layout"
The output contains statistics related to the clustering results, as well as individual cluster results and the interactive network.
The html files are a great way to view the network. Please note that they are rendered in the browser, so the full network can take a few minutes to display properly. There are drop down menus that allow you to select a particular node of interest, as well as a particular community (cluster) of the network.
input_cell_net <- network_igraph_C1_analyzed$igraph_Network
V(input_cell_net)$membership <- network_igraph_C1_analyzed$Clusters_Results$membership
clusters <- network_igraph_C1_analyzed$Clusters_Results
weights_clusters <- ifelse(crossing(clusters, input_cell_net), 1, 10)
set.seed(10)
cluster_weighted_layout <- layout_with_fr(input_cell_net, weights = weights_clusters)
edge_colors <- ifelse(crossing(clusters, input_cell_net), "red", "black")
members <- V(input_cell_net)$membership
colopal <- RColorBrewer::brewer.pal(n = 8, name = "Dark2")
f <- colorRampPalette(colopal)
colopal <- f(length(unique(members)))
colopal <- colopal[as.factor(members)]
plot(input_cell_net,
layout = cluster_weighted_layout,
vertex.color = colopal,
vertex.size = 1.5,
vertex.label = "",
vertex.label.cex = 1,
vertex.label.color = "black",
vertex.frame.color=colopal,
edge.width = 0.1,
edge.arrow.size = 0.01,
edge.arrow.width = 0.1,
edge.color = "gray50")