sctree

The goal of sctree is to create a tool to accelerate the transition from single cell rna-sequencing to calidation and new sub-population discovery.

Features suggesting pseudo-gating strategies to purify found populations via flow-cytometry, antibody querying and cross validations between datasets.

Please check our companion documentation website hosted at jspaezp.github.io/sctree.

Number of lines in roxygen comments: 1039

Number of lines in R code: 1968

Installation

Currently we are not hosted in CRAN but the package can be installed by using the following function in R

remotes::install_github("jspaezp/sctree")

Usage

I am assuming you have already done your clustering and dimensional reduction using seurat and we have our seurat object.

For this examples we will use a dummy dataset that come bundled with the package

small_5050_mix, this dataset comes originally from the 1:1 mixture of Jurkat and 293T cells provided by 10x.

Original data can be found here:

  1. 1:1 mixture
  2. 99:1 mixture

Finding important variables to classify clusters

We base our importances on the “classification value” they give to a random forest (using the implementation in the ranger package)

So lets fit the random forest … Here we are adding the warn.imp.method to prevent a warning message sent by ranger when most of the variables are correlated with the clustering.

Please reffer to the of the importance_pvalues section in the ranger documentation when addressing this issue and for more details.

By default, we obtain a data frame containing only importances with pvalues under 0.05.

Seurat Interface

As an analogous function to Seurat’s FindAllMarkers, we offer ranger_importances or the RangerDE option for FindAllMarkers

Note how variable importances can be high if a marker is either preferentially present of preferentially absent. Therefore as a pre-filtering step we implemented a modified version of seurat’s “FindMarkers”

Visualizing the expected outcome of a flow cytometry experiment

Lets say we choose the top n markers from the former list and we did a flow experiment … HYPOTHETICALLY the marker distribution would resemble the rna expression profile for which we have the function plot_flowstyle

g <- plot_flowstyle(small_5050_mix, markernames = top_markers)
g

We can also focus in one of the pannels (and check the color conventions)

Suggesting a gating strategy for the markers

A general strategy to get separate all clusters

Visualizing the tree as … a tree … we can see how our model is a simple series of yes/no questions.

If we wanted to classifiy a random cell: in the first node, we check if the expression of that gene is higher or lower than a given value, if it is lower, we proceed to the left, if not we go right. We keep doing that until we have no more branches. This final node will have a predicted cluster, in this plot we can also see how pure can we expect this group to be and how many of the cells in our training set clasify as part of it.

plot(tree_fit)

When inspecting the tree_fit, we can see a more detailed text representation of this tree.

Sometimes one might think that the proposed strategy is too complicated or not implementable in the experimental settings, in order to add constraints to the fit one can give additional arguments that will be passed to partykit::ctree_control, such as maxdepth = 2 (maximum 2 questions per cell)

Since not all variables are ultimately used in our classifier, one can access the ones that were by using varimp(tree_fit)

partykit::varimp(tree_fit)
#>      CD3D      ASNS 
#> 0.2849840 0.1505534
plot_flowstyle(small_5050_mix, names(partykit::varimp(tree_fit)))

One can also request the package to suggest a specific strategy only for a given cluster. This function is not expected to give drastically different results in datasets with few clusters, but it can definitely come usefull when many clusters are present and one is interested in a specific one.

Sometimes it is useful to visualize directly the subset of cells that will be “gated” out in each rule, this can be easily achieved by using our implementation of plot_gates

plot_gates(small_5050_mix, tree_fit, "6")

We have also implemented a way to export these rules as a garnett classifier. for more detail on how the classifier is implemented please refer to the garnett documentation

Finding antibodies for the experiment

Since we acknowledge most experimental workflows need antibodies. We have implemented several functions to look for antibodies in vendor websites, as well as some helper functions to find the other posible aliases a gene might have.

Here is a simple example for a gene widely know to have an antibody available

sessionInfo()
#> R version 3.6.2 (2019-12-12)
#> Platform: x86_64-redhat-linux-gnu (64-bit)
#> Running under: Fedora 30 (Workstation Edition)
#> 
#> Matrix products: default
#> BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] sctree_0.0.7.0000 Seurat_3.1.1     
#> 
#> loaded via a namespace (and not attached):
#>   [1] Rtsne_0.15           colorspace_1.4-1     selectr_0.4-2       
#>   [4] ggridges_0.5.1       farver_2.0.1         leiden_0.3.1        
#>   [7] listenv_0.8.0        npsurv_0.4-0         ggrepel_0.8.1       
#>  [10] bit64_0.9-7          mvtnorm_1.0-11       AnnotationDbi_1.46.1
#>  [13] xml2_1.2.2           ranger_0.12.1        codetools_0.2-16    
#>  [16] splines_3.6.2        R.methodsS3_1.7.1    lsei_1.2-0          
#>  [19] libcoin_1.0-5        knitr_1.26           Formula_1.2-3       
#>  [22] jsonlite_1.6         ica_1.0-2            cluster_2.1.0       
#>  [25] png_0.1-7            R.oo_1.23.0          uwot_0.1.5          
#>  [28] sctransform_0.2.0    compiler_3.6.2       httr_1.4.1          
#>  [31] assertthat_0.2.1     Matrix_1.2-18        lazyeval_0.2.2      
#>  [34] htmltools_0.4.0      tools_3.6.2          rsvd_1.0.2          
#>  [37] igraph_1.2.4.2       partykit_1.2-5       gtable_0.3.0        
#>  [40] glue_1.3.1           RANN_2.6.1           reshape2_1.4.3      
#>  [43] dplyr_0.8.3          Rcpp_1.0.3           Biobase_2.44.0      
#>  [46] vctrs_0.2.2          gdata_2.18.0         ape_5.3             
#>  [49] nlme_3.1-142         gbRd_0.4-11          lmtest_0.9-37       
#>  [52] inum_1.0-1           xfun_0.11            stringr_1.4.0       
#>  [55] globals_0.12.5       rvest_0.3.5          lifecycle_0.1.0     
#>  [58] irlba_2.3.3          gtools_3.8.1         future_1.16.0       
#>  [61] MASS_7.3-51.4        zoo_1.8-6            scales_1.1.0        
#>  [64] parallel_3.6.2       RColorBrewer_1.1-2   yaml_2.2.0          
#>  [67] memoise_1.1.0        reticulate_1.13      pbapply_1.4-2       
#>  [70] gridExtra_2.3        ggplot2_3.2.1        rpart_4.1-15        
#>  [73] reshape_0.8.8        stringi_1.4.6        RSQLite_2.1.4       
#>  [76] S4Vectors_0.22.1     caTools_1.17.1.3     BiocGenerics_0.30.0 
#>  [79] bibtex_0.4.2         Rdpack_0.11-0        SDMTools_1.1-221.2  
#>  [82] rlang_0.4.4          pkgconfig_2.0.3      bitops_1.0-6        
#>  [85] evaluate_0.14        lattice_0.20-38      ROCR_1.0-7          
#>  [88] purrr_0.3.3          labeling_0.3         htmlwidgets_1.5.1   
#>  [91] cowplot_1.0.0        bit_1.1-14           tidyselect_0.2.5    
#>  [94] GGally_1.4.0         RcppAnnoy_0.0.14     wrapr_1.9.3         
#>  [97] plyr_1.8.5           magrittr_1.5         R6_2.4.1            
#> [100] IRanges_2.18.3       gplots_3.0.1.1       DBI_1.0.0           
#> [103] pillar_1.4.3         fitdistrplus_1.0-14  survival_3.1-8      
#> [106] tibble_2.1.3         future.apply_1.3.0   tsne_0.1-3          
#> [109] crayon_1.3.4         KernSmooth_2.23-16   plotly_4.9.1        
#> [112] rmarkdown_1.18       viridis_0.5.1        grid_3.6.2          
#> [115] data.table_1.12.6    blob_1.2.0           metap_1.1           
#> [118] digest_0.6.24        tidyr_1.0.0          R.utils_2.9.1       
#> [121] RcppParallel_4.4.4   stats4_3.6.2         munsell_0.5.0       
#> [124] viridisLite_0.3.0

Steps down the road

  1. Address some of the TODO’s in this repository
  2. Reduce dependecies by replacing functions to base equivalents.
  3. Add links to the documentation to make nicer to explore the package from inside R
  4. Implement a way to find markers for clusters exclusively upregulated
  5. Refctor the code to make the coding nomenclature same as seurat