Testing the quality of the generated classifiers has three main parts:

  1. Split the data to be classified in two.
  2. Use one of the splits to train the classifier.
  3. Use the model on the remaining data and check its accuracy.

To illustrate the process, we will use a tiny dataset that comes bundled with our package. Additional information can be obtained by running ?small_5050_mix.
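A minimal way to load it could look like the following; note that the package name used here is an assumption:

```r
# Assumption: the dataset ships with the package as a lazy-loaded object.
library(sctree)        # package name is an assumption
data("small_5050_mix") # see ?small_5050_mix for details
small_5050_mix
```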

Splitting the dataset

There are several ways in which this can be done. Here we will sample 80% of the cells for training, and then split the object using Seurat’s subset interface.

Calling sample will select 204 cell names, which we store in cells_train; we store the remaining names in cells_test.
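A self-contained sketch of the split using base R’s sample. Placeholder cell names stand in for colnames(small_5050_mix); 80% of 255 names gives the 204 cells mentioned above:

```r
set.seed(42)  # arbitrary seed, for reproducibility

# In the real workflow these would be colnames(small_5050_mix);
# placeholder names keep the sketch self-contained.
all_cells <- paste0("cell_", seq_len(255))

cells_train <- sample(all_cells, size = floor(0.8 * length(all_cells)))
cells_test  <- setdiff(all_cells, cells_train)

length(cells_train)  # 204
```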

Now, using the subset operation, we get two Seurat objects, each containing the corresponding cells.
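The subsetting and the initial fit might look as follows; the use of subset()’s cells argument and fit_ctree’s default arguments are assumptions based on the calls shown later in this vignette:

```r
# Seurat's subset() can take an explicit vector of cell names
training_mix <- subset(small_5050_mix, cells = cells_train)
testing_mix  <- subset(small_5050_mix, cells = cells_test)

# Fit the classifier on the training split (default arguments assumed)
tree_model <- fit_ctree(training_mix)
```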

Checking the consistency of the classifier

This model can now be used to generate a prediction on the testing data. For this we will use the predict generic, which requires the new data to be passed as a data.frame.
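A hedged sketch of the prediction step; which assay slot should be used, and that cells must be the rows of the data.frame, are assumptions:

```r
# Build a cells-x-genes data.frame from the testing object
test_df <- as.data.frame(t(as.matrix(GetAssayData(testing_mix, slot = "data"))))

# `tree_model` is the classifier fitted on the training split
predictions <- predict(tree_model, newdata = test_df)
```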

We can store these predictions in the Seurat object itself for future usage.
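Storing them is a matter of adding a metadata column; the column name predicted_cluster is arbitrary:

```r
# The `$` shortcut adds a metadata column to a Seurat object
testing_mix$predicted_cluster <- predictions

# Equivalent, more explicit form:
# testing_mix <- AddMetaData(testing_mix, predictions,
#                            col.name = "predicted_cluster")
```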

Furthermore, we can generate a confusion matrix based on the classifications by passing the data onto table.
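For instance, assuming the observed labels are the current Seurat identities:

```r
# Cross-tabulate predicted against observed labels
confusion_tbl <- table(
    predicted = testing_mix$predicted_cluster,
    observed  = Idents(testing_mix)
)
confusion_tbl
```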

We can also display this information graphically using autoplot:

autoplot(as.frequency.matrix(confusion_tbl), show_number = TRUE)

Tuning the model

It is possible to modify how the model is generated. In this case, we might want to define the markers based on prior information or some quality metric. Here we will use the top 6 genes according to a random-forest-based importance metric.
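One way to build such a ranking, sketched here with the ranger package on synthetic data; the real workflow would rank the genes of training_mix, and every name below is illustrative:

```r
library(ranger)

# Illustrative synthetic data: 100 cells x 10 genes plus a cluster label.
set.seed(1)
expr_df <- data.frame(
    cluster = factor(rep(c("0", "1"), each = 50)),
    matrix(rnorm(100 * 10), nrow = 100,
           dimnames = list(NULL, paste0("gene", 1:10)))
)

# Random forest with impurity-based variable importance
rf_fit <- ranger(cluster ~ ., data = expr_df, importance = "impurity")

# Keep the top 6 genes by importance
imp <- sort(importance(rf_fit), decreasing = TRUE)
top_markers <- data.frame(gene = names(imp)[seq_len(6)])
```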

We can pass to fit_ctree any parameter accepted by ?partykit::ctree_control. The parameters most likely to affect the final tree are:

  • alpha, which modifies the splitting criterion; values close to 1 mean that more splits will be considered.
  • maxdepth, which specifies how deep the tree can be.
  • minbucket, which specifies the minimum number of cells each terminal node must contain.
  • minsplit, which specifies the minimum size a node must have to be considered for further splitting.

Please note that more complicated classifiers will usually have better classification performance, but less interpretability. Nonetheless, there will be a point at which the classifier keeps improving on the training data but no longer improves on the testing data (a phenomenon called over-fitting).

tree_model_topn <- fit_ctree(
    training_mix,
    genes_use = top_markers$gene,
    alpha = 0.99, minbucket = 2, minsplit = 1)

print(as.garnett(tree_model_topn))
#> > 0_node_14  (n = 4)
#> expressed below: ARHGDIB 0, CD3D 3.124, NDFIP2 0, TMSB4X 0
#> expressed between: CKB 5.142 4.539
#> 
#> > 0_node_15  (n = 4)
#> expressed above: TMSB4X 0
#> expressed below: ARHGDIB 0, CD3D 3.124, NDFIP2 0
#> expressed between: CKB 5.142 4.539
#> 
#> > 0_node_24  (n = 2)
#> expressed below: CD3D 1.741
#> expressed between: CKB 4.848 4.803, NDFIP2 2.215 0
#> 
#> > 0_node_28  (n = 2)
#> expressed above: CKB 4.539
#> expressed below: TMSB4X 0
#> expressed between: CD3D 3.124 1.741, NDFIP2 2.841 0
#> 
#> > 0_node_31  (n = 2)
#> expressed above: CKB 5.031, TMSB4X 0
#> expressed between: CD3D 3.124 1.741, NDFIP2 2.841 0
#> 
#> > 0_node_32  (n = 5)
#> expressed above: CKB 4.539, NDFIP2 2.841
#> expressed below: CD3D 3.124
#> 
#> > 0_node_36  (n = 102)
#> expressed above: CD3D 3.124
#> expressed below: NDFIP2 1.588, TMSB4X 5.7
#> 
#> > 0_node_38  (n = 2)
#> expressed above: CD3D 3.124
#> expressed below: TMSB4X 5.7
#> expressed between: NDFIP2 1.63 1.588
#> 
#> > 0_node_39  (n = 8)
#> expressed above: CD3D 3.124
#> expressed below: TMSB4X 5.7
#> expressed between: NDFIP2 2.013 1.63
#> 
#> > 0_node_40  (n = 2)
#> expressed above: CD3D 3.124, TMSB4X 5.7
#> expressed below: NDFIP2 2.013
#> 
#> > 0_node_42  (n = 3)
#> expressed above: CD3D 3.124, NDFIP2 2.013
#> expressed below: CKB 0
#> 
#> > 0_node_5   (n = 9)
#> expressed below: CD3D 1.616, CKB 4.539
#> 
#> > 0_node_7   (n = 2)
#> expressed below: CKB 4.539
#> expressed between: CD3D 2.07 1.616
#> 
#> > 0_node_8   (n = 3)
#> expressed below: CKB 4.539
#> expressed between: CD3D 2.749 2.07
#> 
#> > 0_node_9   (n = 2)
#> expressed below: CKB 4.539
#> expressed between: CD3D 3.124 2.749
#> 
#> > 1_node_16  (n = 3)
#> expressed above: CKB 5.142
#> expressed below: ARHGDIB 0, CD3D 3.124, NDFIP2 0
#> 
#> > 1_node_17  (n = 2)
#> expressed above: ARHGDIB 0, CKB 4.539
#> expressed below: CD3D 3.124, NDFIP2 0
#> 
#> > 1_node_22  (n = 8)
#> expressed below: CD3D 1.741
#> expressed between: CKB 4.803 4.539, NDFIP2 2.841 0
#> 
#> > 1_node_25  (n = 3)
#> expressed below: CD3D 1.741
#> expressed between: CKB 4.848 4.803, NDFIP2 2.841 2.215
#> 
#> > 1_node_26  (n = 27)
#> expressed above: CKB 4.848
#> expressed below: CD3D 1.741
#> expressed between: NDFIP2 2.841 0
#> 
#> > 1_node_30  (n = 4)
#> expressed above: TMSB4X 0
#> expressed between: CD3D 3.124 1.741, CKB 5.031 4.539, NDFIP2 2.841 0
#> 
#> > 1_node_43  (n = 5)
#> expressed above: CD3D 3.124, CKB 0, NDFIP2 2.013

(note how much more complicated the model ends up being …)