Stability Measures and Consensus Clustering
Download a copy of the vignette to follow along here: stability_measures.Rmd
In this vignette, we will highlight the main stability measure options in the metasnf package.
Do brace yourself: stability measures multiply SNF computation time by the number of settings matrix rows times the number of resamples of the data you use. Consider first trying these functions on scaled-down versions of your data, or on just a couple of rows of the settings_matrix, to get a sense of how long they may take to complete on your full dataset. The code below isn’t actually evaluated in this document (all documentation is re-rendered on every commit, and this vignette simply takes too long to run during development), but descriptions of the outputs are provided and you should feel free to try the functions yourself.
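To make the scaling concrete, here is a rough back-of-the-envelope sketch in base R. The sizes and the per-run time are hypothetical placeholders, not measurements from metasnf; swap in values timed on your own data.

```r
# Hypothetical sizes; substitute your own values.
n_settings_rows <- 3  # rows in the settings matrix
n_subsamples <- 30    # resamples of the data

# Each settings matrix row is re-run once per subsample.
n_snf_runs <- n_settings_rows * n_subsamples

# Hypothetical time for a single SNF run on your data, in seconds.
seconds_per_run <- 2
estimated_minutes <- n_snf_runs * seconds_per_run / 60

n_snf_runs         # 90
estimated_minutes  # 3
```

Timing a single settings matrix row on a single subsample with system.time() is one simple way to get a realistic seconds_per_run.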
Data set-up
library(metasnf)
# Generate data_list
data_list <- generate_data_list(
list(
data = gender_df,
name = "gender",
domain = "demographics",
type = "categorical"
),
list(
data = diagnosis_df,
name = "diagnosis",
domain = "clinical",
type = "categorical"
),
list(
data = age_df,
name = "age",
domain = "demographics",
type = "discrete"
),
uid = "patient_id"
)
# Generate settings_matrix
settings_matrix <- generate_settings_matrix(
data_list,
nrow = 3,
max_k = 40,
seed = 42
)
As an added part of the data set-up, we’ll also calculate some subsamples of the data_list.
data_list_subsamples <- subsample_data_list(
data_list,
n_subsamples = 3, # calculate 3 subsamples
subsample_fraction = 0.8 # for each subsample, use random 80% of patients
)
data_list_subsamples contains a list of 3 variations of the full data_list. Each variation only has a random 80% of the original patients.
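As a rough illustration of what subsampling means here, the base R sketch below draws three 80% subsamples from a single toy data frame. The toy_df data and all variable names are hypothetical, and this is not the implementation used by subsample_data_list; in particular, a real data_list holds several data frames, and the same patients would need to be kept across all of them within a subsample.

```r
set.seed(42)

# Hypothetical toy data: 10 patients with a single feature.
toy_df <- data.frame(
    patient_id = paste0("p", 1:10),
    age = c(8, 9, 10, 11, 12, 8, 9, 10, 11, 12)
)

subsample_fraction <- 0.8
n_subsamples <- 3
n_keep <- round(subsample_fraction * nrow(toy_df))

# For each subsample, draw the rows to keep without replacement.
toy_subsamples <- lapply(seq_len(n_subsamples), function(i) {
    keep <- sort(sample(nrow(toy_df), n_keep))
    toy_df[keep, , drop = FALSE]
})

sapply(toy_subsamples, nrow)  # 8 8 8
```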
Pairwise Adjusted Rand Indices Across Subsamples
pairwise_aris is a dataframe that contains, for each row of the settings matrix, the mean and standard deviation of the pairwise adjusted Rand indices between the cluster solutions obtained from the subsamples.
pairwise_aris <- subsample_pairwise_aris(
data_list_subsamples,
settings_matrix
)
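For intuition about what is being summarized, the following base R sketch computes the adjusted Rand index between every pair of three toy cluster solutions and then takes the mean and standard deviation. The adjusted_rand_index helper, the toy solutions, and the choice to compare each pair only on the patients they share are assumptions made for illustration; they are not taken from the metasnf source.

```r
# Adjusted Rand index between two cluster label vectors of equal length.
adjusted_rand_index <- function(x, y) {
    tab <- table(x, y)
    n <- sum(tab)
    sum_cells <- sum(choose(tab, 2))
    sum_rows <- sum(choose(rowSums(tab), 2))
    sum_cols <- sum(choose(colSums(tab), 2))
    expected <- sum_rows * sum_cols / choose(n, 2)
    max_index <- (sum_rows + sum_cols) / 2
    (sum_cells - expected) / (max_index - expected)
}

# Hypothetical cluster solutions from three subsamples (names are patients).
solutions <- list(
    c(p1 = 1, p2 = 1, p3 = 2, p4 = 2, p5 = 3),
    c(p2 = 2, p3 = 1, p4 = 1, p5 = 3, p6 = 3),
    c(p1 = 1, p3 = 1, p4 = 2, p5 = 2, p6 = 1)
)

# Compare every pair of solutions on the patients they have in common.
pairs <- combn(length(solutions), 2)
aris <- apply(pairs, 2, function(idx) {
    a <- solutions[[idx[1]]]
    b <- solutions[[idx[2]]]
    shared <- intersect(names(a), names(b))
    adjusted_rand_index(a[shared], b[shared])
})

c(mean = mean(aris), sd = sd(aris))
```

The first two toy solutions partition their shared patients identically (only the labels differ), so their adjusted Rand index is 1; the other pairs agree no better than chance and come out negative.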
Persistence of Co-Clustering Across Subsamples
The fraction_clustered_together function calculates, for every pair of patients that clustered together in the full sample, how often they continued to cluster together in the data subsamples.
# Run SNF and clustering
solutions_matrix <- batch_snf(
data_list,
settings_matrix
)
fraction_together <- fraction_clustered_together(
data_list_subsamples,
settings_matrix,
solutions_matrix
)
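The idea can be sketched on toy data in base R. Everything below (the toy solutions, the fraction_together_pair helper, and the decision to count only subsamples that contain both patients) is a hypothetical illustration of the description above, not the code behind fraction_clustered_together.

```r
# Hypothetical cluster assignments from the full data.
full_solution <- c(p1 = 1, p2 = 1, p3 = 2, p4 = 2)

# Hypothetical cluster assignments from three subsamples.
subsample_solutions <- list(
    c(p1 = 1, p2 = 1, p3 = 2),
    c(p1 = 2, p2 = 1, p4 = 1),
    c(p1 = 1, p2 = 1, p3 = 1, p4 = 2)
)

# Fraction of subsamples containing both patients in which they co-cluster.
fraction_together_pair <- function(i, j, subsample_solutions) {
    has_both <- vapply(
        subsample_solutions,
        function(s) all(c(i, j) %in% names(s)),
        logical(1)
    )
    together <- vapply(
        subsample_solutions[has_both],
        function(s) s[[i]] == s[[j]],
        logical(1)
    )
    mean(together)
}

# Keep only the pairs that clustered together in the full solution.
pairs <- combn(names(full_solution), 2)
together_in_full <- full_solution[pairs[1, ]] == full_solution[pairs[2, ]]
pairs <- pairs[, together_in_full, drop = FALSE]

fractions <- apply(pairs, 2, function(p) {
    fraction_together_pair(p[1], p[2], subsample_solutions)
})
names(fractions) <- paste(pairs[1, ], pairs[2, ], sep = "-")
fractions
```

Here p1 and p2 stay together in two of the three subsamples that contain them both, while p3 and p4 are split in the only subsample that contains them both.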
Co-clustering Heatmaps
You can visualize co-clustering across the resamples for a single row of the settings matrix using the generate_cocluster_data and cocluster_heatmap functions.
cocluster_heatmap will only work if every pair of patients was part of the same subsampled data at least once. You’ll see a descriptive warning if this isn’t the case in your own data_list_subsamples. This can most easily be resolved by increasing the number of subsamples you examine or the subsample fraction.
data_list_subsamples <- subsample_data_list(
data_list,
n_subsamples = 30, # calculate 30 subsamples
subsample_fraction = 0.8 # for each subsample, use random 80% of patients
)
cocluster_data <- generate_cocluster_data(
data_list,
data_list_subsamples = data_list_subsamples,
settings_matrix_row = settings_matrix[1, ]
)
cocluster_data is a list with two matrices:
- same_solution: A patient x patient matrix where each cell is the number of subsamples that contained both of those patients
- same_cluster: A patient x patient matrix where each cell is the number of subsamples where those patients were clustered together
same_cluster <- cocluster_data$"same_cluster"
same_solution <- cocluster_data$"same_solution"
cocluster_matrix <- same_cluster / same_solution
This matrix is automatically calculated and plotted in the cocluster_heatmap function.
hm <- cocluster_heatmap(cocluster_data)
hm
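The same_cluster / same_solution ratio described above also shows why every pair of patients needs to appear together in at least one subsample. The base R sketch below uses small hypothetical matrices (not output from generate_cocluster_data) in which one pair was never sampled together.

```r
patients <- c("p1", "p2", "p3")

# Hypothetical counts of subsamples containing both patients.
same_solution <- matrix(
    c(3, 2, 0,
      2, 3, 1,
      0, 1, 2),
    nrow = 3, byrow = TRUE, dimnames = list(patients, patients)
)

# Hypothetical counts of subsamples where both patients shared a cluster.
same_cluster <- matrix(
    c(3, 1, 0,
      1, 3, 1,
      0, 1, 2),
    nrow = 3, byrow = TRUE, dimnames = list(patients, patients)
)

cocluster_matrix <- same_cluster / same_solution
cocluster_matrix["p1", "p2"]  # 0.5

# p1 and p3 never shared a subsample, so their cell is 0 / 0 = NaN.
never_together <- which(
    same_solution == 0 & upper.tri(same_solution),
    arr.ind = TRUE
)
never_together
```

A check like never_together on your own same_solution matrix is one way to see which pairs are responsible if the heatmap warns about patients that were never subsampled together.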
You can pull out the patient order of the heatmap as follows:
order can now be applied to the data in the data_list to get the same patient order as shown in the heatmap:
# The order of patients as they appear in the heatmap
data_list[[1]]$"data"[order, "subjectkey"]
By default, generate_cocluster_data only operates on a single row of the settings matrix and essentially just summarizes the data across the resamplings. You can also pool together the results from several rows using the pooled_cocluster_heatmap function as follows.
# Arguments are passed by position: full data_list, subsamples, settings row
cocluster_data_2 <- generate_cocluster_data(
data_list,
data_list_subsamples,
settings_matrix[2, ]
)
cocluster_data_3 <- generate_cocluster_data(
data_list,
data_list_subsamples,
settings_matrix[3, ]
)
pooled_cocluster_heatmap(
cocluster_list = list(
cocluster_data,
cocluster_data_2,
cocluster_data_3
)
)