Download a copy of the vignette to follow along here: stability_measures.Rmd

In this vignette, we will highlight the main stability measure options in the metasnf package.

Do brace yourself: stability measures scale SNF computation time by the number of settings matrix rows times the number of resamples of the data you use. Consider trying these functions on scaled-down versions of your data, or on just a couple of rows of the settings_matrix, to get a sense of how long they may take to complete on your full dataset. The code below isn’t evaluated in this document (all documentation is re-rendered on every commit, and this vignette takes too long to run during development), but descriptions of the outputs are provided and you should feel free to try the functions yourself.
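As a rough back-of-the-envelope check on that scaling, the sketch below multiplies a per-run time by the number of settings matrix rows and the number of subsamples. The helper estimate_total_minutes is hypothetical (it is not part of metasnf), and the 2-second per-run figure is a made-up placeholder you would replace with a timing from your own trial run.

```r
# Hypothetical helper, not part of metasnf: estimates total run time
# assuming every SNF run takes about the same time.
estimate_total_minutes <- function(seconds_per_run, n_settings_rows, n_subsamples) {
    # One SNF run per (settings row, subsample) combination
    n_runs <- n_settings_rows * n_subsamples
    seconds_per_run * n_runs / 60
}

# Placeholder numbers: 2 seconds per run, 3 settings rows, 30 subsamples
total_minutes <- estimate_total_minutes(
    seconds_per_run = 2,
    n_settings_rows = 3,
    n_subsamples = 30
)
total_minutes
```

Timing a single row on a single subsample first (for example with system.time) gives a usable value for seconds_per_run.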

Data set-up

library(metasnf)

# Generate data_list
data_list <- generate_data_list(
    list(
        data = gender_df,
        name = "gender",
        domain = "demographics",
        type = "categorical"
    ),
    list(
        data = diagnosis_df,
        name = "diagnosis",
        domain = "clinical",
        type = "categorical"
    ),
    list(
        data = age_df,
        name = "age",
        domain = "demographics",
        type = "discrete"
    ),
    uid = "patient_id"
)

# Generate settings_matrix
settings_matrix <- generate_settings_matrix(
    data_list,
    nrow = 3,
    max_k = 40,
    seed = 42
)

As an added part of the data set-up, we’ll also calculate some subsamples of the data_list.

data_list_subsamples <- subsample_data_list(
    data_list,
    n_subsamples = 3, # calculate 3 subsamples
    subsample_fraction = 0.8 # for each subsample, use random 80% of patients
)

data_list_subsamples contains a list of 3 variations of the full data_list. Each variation only has a random 80% of the original patients.
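To make the idea of a subsample concrete, here is a minimal base R sketch that draws random 80% subsets of a set of patient IDs without replacement. The IDs and the subsample_ids helper are hypothetical, and rounding the subsample size is an assumption; subsample_data_list may handle these details differently.

```r
set.seed(42)

# Hypothetical patient IDs for illustration
patient_ids <- paste0("patient_", 1:10)

# Hypothetical helper: keep a random fraction of the IDs, without replacement
subsample_ids <- function(ids, fraction) {
    n_keep <- round(length(ids) * fraction) # rounding rule is an assumption
    sample(ids, size = n_keep, replace = FALSE)
}

# Three independent subsamples, each holding 8 of the 10 patients
id_subsamples <- replicate(
    3,
    subsample_ids(patient_ids, fraction = 0.8),
    simplify = FALSE
)

lengths(id_subsamples)
```

Each subsample drops a different random 20% of patients, which is what lets the stability measures below compare cluster solutions across perturbed versions of the data.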

Pairwise Adjusted Rand Indices Across Subsamples

pairwise_aris is a dataframe that contains, for each row of the settings matrix, the mean and standard deviation of the pairwise adjusted Rand indices between the cluster solutions obtained from the subsamples.

pairwise_aris <- subsample_pairwise_aris(
    data_list_subsamples,
    settings_matrix
)
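For intuition about what is being summarized, the following base R sketch computes the adjusted Rand index between every pair of toy subsample cluster solutions, restricted to the patients the two subsamples share, and then takes the mean and standard deviation. The adjusted_rand_index helper and the toy solutions are hypothetical, and restricting to shared patients is an assumption about how the comparison is done; this is not the package's internal code.

```r
# Hypothetical helper: adjusted Rand index between two label vectors
# that describe the same patients in the same order.
adjusted_rand_index <- function(labels_a, labels_b) {
    stopifnot(length(labels_a) == length(labels_b))
    tab <- table(labels_a, labels_b)
    sum_cells <- sum(choose(tab, 2))
    sum_rows <- sum(choose(rowSums(tab), 2))
    sum_cols <- sum(choose(colSums(tab), 2))
    expected <- sum_rows * sum_cols / choose(length(labels_a), 2)
    max_index <- (sum_rows + sum_cols) / 2
    (sum_cells - expected) / (max_index - expected)
}

# Toy cluster solutions from three subsamples (names are patient IDs)
solutions <- list(
    c(p1 = 1, p2 = 1, p3 = 2, p4 = 2, p5 = 3),
    c(p1 = 2, p2 = 2, p3 = 1, p4 = 1, p6 = 3),
    c(p1 = 1, p2 = 2, p3 = 2, p4 = 1, p5 = 3)
)

# ARI for every pair of subsamples, using only the patients they share
solution_pairs <- utils::combn(length(solutions), 2)
aris <- apply(solution_pairs, 2, function(idx) {
    a <- solutions[[idx[1]]]
    b <- solutions[[idx[2]]]
    shared <- intersect(names(a), names(b))
    adjusted_rand_index(a[shared], b[shared])
})

c(mean_ari = mean(aris), sd_ari = sd(aris))
```

The first two toy solutions differ only in how the clusters are labelled, so their ARI is 1; the ARI is insensitive to label names and only reflects which patients are grouped together.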

Persistence of Co-Clustering Across Subsamples

The fraction_clustered_together function calculates, for every pair of patients that clustered together in the full sample, how often they continued to cluster together in the data subsamples.

# Run SNF and clustering
solutions_matrix <- batch_snf(
    data_list,
    settings_matrix
)

fraction_together <- fraction_clustered_together(
    data_list_subsamples,
    settings_matrix,
    solutions_matrix
)
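The idea behind this measure can be sketched in base R on toy data. The helper fraction_together_sketch below is hypothetical: it pools all (pair, subsample) observations into one fraction and only counts subsamples that contain both patients of a pair. Both choices are assumptions and may differ from how fraction_clustered_together aggregates its result.

```r
# Hypothetical helper: among pairs that share a cluster in the full solution,
# the fraction of (pair, subsample) observations in which they stay together.
fraction_together_sketch <- function(full_solution, subsample_solutions) {
    pairs <- utils::combn(names(full_solution), 2)
    # Keep only the pairs that were clustered together in the full solution
    together_full <- full_solution[pairs[1, ]] == full_solution[pairs[2, ]]
    pairs <- pairs[, together_full, drop = FALSE]
    n_observed <- 0
    n_together <- 0
    for (solution in subsample_solutions) {
        # A pair is only observable if both patients are in this subsample
        present <- pairs[1, ] %in% names(solution) &
            pairs[2, ] %in% names(solution)
        kept <- pairs[, present, drop = FALSE]
        n_observed <- n_observed + ncol(kept)
        n_together <- n_together +
            sum(solution[kept[1, ]] == solution[kept[2, ]])
    }
    n_together / n_observed
}

# Toy full-sample solution and three subsample solutions
full_solution <- c(p1 = 1, p2 = 1, p3 = 2, p4 = 2)
subsample_solutions <- list(
    c(p1 = 1, p2 = 1, p3 = 2),
    c(p1 = 1, p2 = 2, p4 = 2),
    c(p2 = 1, p3 = 2, p4 = 2)
)

toy_fraction <- fraction_together_sketch(full_solution, subsample_solutions)
toy_fraction
```

In this toy example the pairs (p1, p2) and (p3, p4) are observed three times in total and stay together in two of those observations.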

Co-clustering Heatmaps

You can visualize co-clustering across the resamples for a single row of the settings matrix using the generate_cocluster_data and cocluster_heatmap functions.

cocluster_heatmap will only work if every pair of patients appeared together in the same subsample at least once. You’ll see a descriptive warning if this didn’t end up happening in your own data_list_subsamples. This can most easily be resolved by increasing the number of subsamples you examine or the subsample fraction.

data_list_subsamples <- subsample_data_list(
    data_list,
    n_subsamples = 30, # calculate 30 subsamples
    subsample_fraction = 0.8 # for each subsample, use random 80% of patients
)
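A short calculation shows why 30 subsamples is a safer choice than 3 here. If each subsample keeps m of n patients at random without replacement, a given pair lands in the same subsample with probability m(m - 1) / (n(n - 1)), and misses every one of s independent subsamples with probability one minus that, raised to the power s. The sketch below assumes independent subsamples and a hypothetical cohort of 100 patients; pair_coverage is an illustrative helper, not a metasnf function.

```r
# Hypothetical helper: how likely is a pair of patients to never share a subsample?
pair_coverage <- function(n_patients, subsample_fraction, n_subsamples) {
    m <- round(n_patients * subsample_fraction)
    # Probability that both patients of a given pair are in one subsample
    p_both <- (m * (m - 1)) / (n_patients * (n_patients - 1))
    # Probability that the pair is never observed together
    p_never <- (1 - p_both)^n_subsamples
    c(
        p_both = p_both,
        p_never = p_never,
        expected_missing_pairs = choose(n_patients, 2) * p_never
    )
}

coverage_3 <- pair_coverage(100, subsample_fraction = 0.8, n_subsamples = 3)
coverage_30 <- pair_coverage(100, subsample_fraction = 0.8, n_subsamples = 30)

coverage_3
coverage_30
```

Under these assumptions, 3 subsamples leave a couple of hundred of the 4950 pairs never observed together on average, while with 30 subsamples the expected number of such pairs is effectively zero.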

cocluster_data <- generate_cocluster_data(
    data_list = data_list,
    data_list_subsamples = data_list_subsamples,
    settings_matrix_row = settings_matrix[1, ]
)

cocluster_data is a list with two matrices:

  • same_solution: A patient x patient matrix where each cell is the number of subsamples that contained both of those patients
  • same_cluster: A patient x patient matrix where each cell is the number of subsamples where those patients were clustered together

same_cluster <- cocluster_data$"same_cluster"
same_solution <- cocluster_data$"same_solution"

cocluster_matrix <- same_cluster / same_solution

This matrix, which holds the fraction of shared subsamples in which each pair of patients co-clustered, is automatically calculated and plotted in the cocluster_heatmap function.
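To see how the two count matrices relate to the plotted fractions, the base R sketch below builds same_solution and same_cluster by hand from three toy subsample solutions. The patient IDs and cluster labels are made up, and the loop is an illustration of the definitions given earlier rather than the code generate_cocluster_data runs.

```r
ids <- c("p1", "p2", "p3", "p4")

# Toy cluster solutions from three subsamples (names are patient IDs)
subsample_solutions <- list(
    c(p1 = 1, p2 = 1, p3 = 2),
    c(p1 = 1, p2 = 2, p4 = 2),
    c(p2 = 1, p3 = 2, p4 = 2)
)

same_solution <- matrix(0, nrow = 4, ncol = 4, dimnames = list(ids, ids))
same_cluster <- same_solution

for (solution in subsample_solutions) {
    present <- names(solution)
    # Every pair of patients in this subsample was observed together once more
    same_solution[present, present] <- same_solution[present, present] + 1
    # Pairs with equal cluster labels were clustered together once more
    same_cluster[present, present] <- same_cluster[present, present] +
        outer(solution, solution, "==")
}

cocluster_matrix <- same_cluster / same_solution
cocluster_matrix
```

A pair that never shares a subsample has a 0 in same_solution, so the division yields NaN for that cell, which is why every pair needs to be observed together at least once before plotting.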

hm <- cocluster_heatmap(cocluster_data)

hm

You can pull out the patient order of the heatmap as follows:

hm <- ComplexHeatmap::draw(hm)
order <- ComplexHeatmap::row_order(hm)

order can now be applied to the data in the data_list to get the same patient order as shown in the heatmap:

# The order of patients as they appear in the heatmap
data_list[[1]]$"data"[order, "subjectkey"]

generate_cocluster_data operates on a single row of the settings matrix and summarizes the co-clustering results across the resamplings. You can also pool together the results from several rows using the pooled_cocluster_heatmap function as follows.

cocluster_data_2 <- generate_cocluster_data(
    data_list = data_list,
    data_list_subsamples = data_list_subsamples,
    settings_matrix_row = settings_matrix[2, ]
)

cocluster_data_3 <- generate_cocluster_data(
    data_list = data_list,
    data_list_subsamples = data_list_subsamples,
    settings_matrix_row = settings_matrix[3, ]
)

pooled_cocluster_heatmap(
    cocluster_list = list(
        cocluster_data,
        cocluster_data_2,
        cocluster_data_3
    )
)
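One plausible way to pool results across settings rows is to add up the count matrices before dividing, so that rows with more shared subsamples carry more weight. The sketch below illustrates that approach on made-up 2 x 2 matrices. The helper pool_cocluster_counts is hypothetical, it assumes every matrix lists patients in the same order, and pooled_cocluster_heatmap may combine rows differently.

```r
# Hypothetical helper: sum the counts across settings rows, then divide.
# Assumes all matrices share the same patient order.
pool_cocluster_counts <- function(cocluster_list) {
    same_cluster <- Reduce(
        `+`,
        lapply(cocluster_list, function(x) x$same_cluster)
    )
    same_solution <- Reduce(
        `+`,
        lapply(cocluster_list, function(x) x$same_solution)
    )
    same_cluster / same_solution
}

# Made-up counts for two patients under two settings matrix rows
toy_row_1 <- list(
    same_cluster = matrix(c(3, 1, 1, 3), nrow = 2),
    same_solution = matrix(c(3, 2, 2, 3), nrow = 2)
)
toy_row_2 <- list(
    same_cluster = matrix(c(2, 2, 2, 2), nrow = 2),
    same_solution = matrix(c(2, 2, 2, 2), nrow = 2)
)

pooled <- pool_cocluster_counts(list(toy_row_1, toy_row_2))
pooled
```

Here the two patients co-clustered in 3 of the 4 subsample solutions that contained them both, giving a pooled off-diagonal value of 0.75.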