A Less Simple Example • metasnf

This vignette walks through how this package can be used over a complete SNF-subtyping pipeline.

Data set-up and pre-processing

1. Load the library and data into the R environment

Your data should be loaded into the R environment in the following format:

The data is in one or multiple data.frame objects
The data is in wide form (one row per patient)
Any dataframe should one column that uniquely identifies which patient the row has data for

It is fine to have missing data at this stage.

The package comes with a few mock dataframes based on real data from the Adolescent Brain Cognitive Development study:

abcd_anxiety (anxiety scores from the CBCL)
abcd_depress (depression scores from the CBCL)
abcd_cort_t (cortical thicknesses)
abcd_cort_sa (cortical surface areas in mm^2)
abcd_subc_v (subcortical volumes in mm^3)
abcd_income (household income on a 1-3 scale)
abcd_pubertal (pubertal status on a 1-5 scale)

Here’s what the cortical thickness data looks like:

library(metasnf)

class(abcd_cort_t)
#> [1] "tbl_df"     "tbl"        "data.frame"

dim(abcd_cort_t)
#> [1] 188 152

str(abcd_cort_t[1:5, 1:5])
#> Classes 'tbl_df', 'tbl' and 'data.frame':    5 obs. of  5 variables:
#>  $ patient : chr  "NDAR_INV0567T2Y9" "NDAR_INV0GLZNC2W" "NDAR_INV0IZ157F8" "NDAR_INV0J4PYA5F" ...
#>  $ mrisdp_1: num  2.6 2.62 2.62 2.6 2.53
#>  $ mrisdp_2: num  2.49 2.85 2.29 2.67 2.76
#>  $ mrisdp_3: num  2.8 2.78 2.53 2.68 2.83
#>  $ mrisdp_4: num  2.95 2.85 2.96 2.94 2.99

abcd_cort_t[1:5, 1:5]
#>            patient mrisdp_1 mrisdp_2 mrisdp_3 mrisdp_4
#> 1 NDAR_INV0567T2Y9    2.601    2.487    2.801    2.954
#> 2 NDAR_INV0GLZNC2W    2.619    2.851    2.784    2.846
#> 3 NDAR_INV0IZ157F8    2.621    2.295    2.530    2.961
#> 4 NDAR_INV0J4PYA5F    2.599    2.670    2.676    2.938
#> 5 NDAR_INV0OYE291Q    2.526    2.761    2.829    2.986

The first column “patient” is the unique identifier (UID) for all subjects in the data.

Here’s the household income data:

dim(abcd_income)
#> [1] 300   2

str(abcd_income[1:5, ])
#> Classes 'tbl_df', 'tbl' and 'data.frame':    5 obs. of  2 variables:
#>  $ patient         : chr  "NDAR_INV0567T2Y9" "NDAR_INV0GLZNC2W" "NDAR_INV0IZ157F8" "NDAR_INV0J4PYA5F" ...
#>  $ household_income: num  3 NA 1 2 1

abcd_income[1:5, ]
#>            patient household_income
#> 1 NDAR_INV0567T2Y9                3
#> 2 NDAR_INV0GLZNC2W               NA
#> 3 NDAR_INV0IZ157F8                1
#> 4 NDAR_INV0J4PYA5F                2
#> 5 NDAR_INV0OYE291Q                1

Putting everything in a list will help us get quicker summaries of all the data.

abcd_data <- list(
    abcd_anxiety,
    abcd_depress,
    abcd_cort_t,
    abcd_cort_sa,
    abcd_subc_v,
    abcd_income,
    abcd_pubertal
)

# The number of rows in each dataframe:
lapply(abcd_data, dim)
#> [[1]]
#> [1] 275   2
#> 
#> [[2]]
#> [1] 275   2
#> 
#> [[3]]
#> [1] 188 152
#> 
#> [[4]]
#> [1] 188 152
#> 
#> [[5]]
#> [1] 174  31
#> 
#> [[6]]
#> [1] 300   2
#> 
#> [[7]]
#> [1] 275   2

# Whether or not each dataframe has missing values:
lapply(abcd_data,
    function(x) {
        any(is.na(x))
    }
)
#> [[1]]
#> [1] TRUE
#> 
#> [[2]]
#> [1] TRUE
#> 
#> [[3]]
#> [1] FALSE
#> 
#> [[4]]
#> [1] FALSE
#> 
#> [[5]]
#> [1] FALSE
#> 
#> [[6]]
#> [1] TRUE
#> 
#> [[7]]
#> [1] TRUE

Some of our data has missing values and not all of our dataframes have the same number of participants.

Generating the data list

The data_list structure is a structured list of dataframes (like the one already created), but with some additional metadata about each dataframe. It should only contain the input dataframes we want to directly use as inputs for the clustering. Out of all the data we have available to us, we may be working in a context where the anxiety and depression data are especially important patient outcomes, and we want to know if we can find subtypes using the rest of the data which still do a good job of separating out patients by their anxiety and depression scores. We’ll set aside anxiety and depression for now and use the rest of the data as inputs for our subtyping, which means loading them into the data_list.

data_list <- generate_data_list(
    list(abcd_cort_t, "cortical_thickness", "neuroimaging", "continuous"),
    list(abcd_cort_sa, "cortical_surface_area", "neuroimaging", "continuous"),
    list(abcd_subc_v, "subcortical_volume", "neuroimaging", "continuous"),
    list(abcd_income, "household_income", "demographics", "continuous"),
    list(abcd_pubertal, "pubertal_status", "demographics", "continuous"),
    uid = "patient"
)
#> [1] "UID successfully converted to subjectkey."

The first entries are all lists which contains the following elements:

The actual dataframe
A name for the dataframe (string)
A name for the domain of information the dataframe is representative of (string)
The type of variable stored in the dataframe (options are continuous, discrete, ordinal, categorical, and mixed)

Finally, there’s an argument for the uid (the column name that currently uniquely identifies all the subjects in your data).

Behind the scenes, this function is building a nested list that keeps track of all this information, but it is also:

Converting the UID of the data into “subjectkey”
Removing all observations that contain any NAs
Removing all subjects who are not present in all input dataframes
Arranging the subjects in all the dataframe by their UID
Prefixing the UID values with the string “subject_” to help with cluster result characterization

Any rows containing NAs are removed. If you don’t want a bunch of your data to get removed because there are a few NAs sprinkled around here and there, consider using imputation. The mice package in R is nice for this.

We can get a summary of our constructed data_list with the summarize_dl function:

summarize_dl(data_list)
#>                    name       type       domain length width
#> 1    cortical_thickness continuous neuroimaging    100   152
#> 2 cortical_surface_area continuous neuroimaging    100   152
#> 3    subcortical_volume continuous neuroimaging    100    31
#> 4      household_income continuous demographics    100     2
#> 5       pubertal_status continuous demographics    100     2

Each input dataframe now has the same 100 subjects with complete data.

Generating the settings matrix

The settings_matrix stores all the information about the settings we’d like to use for each of our SNF runs. Calling the generate_settings_matrix function with a specified number of rows will automatically build a randomly populated settings_matrix.

settings_matrix <- generate_settings_matrix(
    data_list,
    nrow = 20,
    min_k = 20,
    max_k = 50,
    seed = 42
)
#> [1] "The global seed has been changed!"

settings_matrix[1:5, ]
#>   row_id alpha  k  t snf_scheme clust_alg cont_dist disc_dist ord_dist cat_dist
#> 1      1   0.5 29 20          2         1         1         1        1        1
#> 2      2   0.4 26 20          1         1         1         1        1        1
#> 3      3   0.3 44 20          2         2         1         1        1        1
#> 4      4   0.3 43 20          1         1         1         1        1        1
#> 5      5   0.5 29 20          2         2         1         1        1        1
#>   mix_dist input_wt domain_wt inc_cortical_thickness inc_cortical_surface_area
#> 1        1        1         1                      1                         0
#> 2        1        1         1                      1                         1
#> 3        1        1         1                      1                         0
#> 4        1        1         1                      1                         1
#> 5        1        1         1                      1                         1
#>   inc_subcortical_volume inc_household_income inc_pubertal_status
#> 1                      1                    0                   1
#> 2                      1                    1                   1
#> 3                      0                    1                   1
#> 4                      0                    1                   1
#> 5                      1                    1                   1

The columns are:

row_id: Integer to keep track of which row is which
alpha - A hyperparameter for SNF (variable that influences the subtyping process)
k - A hyperparameter for SNF
t - A hyperparameter for SNF
snf_scheme - the specific way in which input data gets collapsed into a final fused network (discussed further in the SNF schemes vignette)
clust_alg - Which clustering algorithm will be applied to the final fused network produced by SNF
*_dist - Which distance metric will be used for the different types of data (discussed further in the distance metrics vignette)
inc_* - binary columns indicating whether or not an input dataframe is included (1) or excluded (0) from the corresponding SNF run (discussed further in the random dropout vignette)

Without specifying any additional parameters, generate_settings_matrix randomly populates these columns and ensures that no generated rows are identical.

What’s important for now is that the matrix (technically a dataframe in the R environment) contains several rows which each outline a different but reasonable way in which the raw data could be converted into patient subtypes. Further customization of the settings_matrix will enable you to access the broadest possible space of reasonable cluster solutions that your data can produce using SNF and ideally get you closer to a generalizable and useful solution for your context. More on settings_matrix customization can be found in the settings matrix vignette

Setting the optional seed parameter (which will affect the seed of your entire R session) ensures that the same settings matrix is generated each time we run the code.

While we end up with a random set of settings here, there is nothing wrong with manually altering the settings matrix to suit your needs. For example, if you wanted to know how much of a difference one input dataframe made, you could ensure that half of the rows included this input dataframe and the other half didn’t. You can also add random rows to an already existing dataframe using the add_settings_matrix_rows function (further discussed in the vignette).

Running SNF for all the rows in the settings matrix

The batch_snf function integrates the data in the data_list using each of the sets of settings contained in the settings_matrix. The resulting structure is an solutions_matrix which is an extension of the settings_matrix that contains columns specifying which cluster each subject was assigned for the corresponding settings_matrix row.

solutions_matrix <- batch_snf(data_list, settings_matrix)
#> [1] "Row: 1/20 | Time remaining: 3 seconds"
#> [1] "Row: 2/20 | Time remaining: 4 seconds"
#> [1] "Row: 3/20 | Time remaining: 3 seconds"
#> [1] "Row: 4/20 | Time remaining: 3 seconds"
#> [1] "Row: 5/20 | Time remaining: 3 seconds"
#> [1] "Row: 6/20 | Time remaining: 3 seconds"
#> [1] "Row: 7/20 | Time remaining: 2 seconds"
#> [1] "Row: 8/20 | Time remaining: 2 seconds"
#> [1] "Row: 9/20 | Time remaining: 2 seconds"
#> [1] "Row: 10/20 | Time remaining: 2 seconds"
#> [1] "Row: 11/20 | Time remaining: 2 seconds"
#> [1] "Row: 12/20 | Time remaining: 1 seconds"
#> [1] "Row: 13/20 | Time remaining: 1 seconds"
#> [1] "Row: 14/20 | Time remaining: 1 seconds"
#> [1] "Row: 15/20 | Time remaining: 1 seconds"
#> [1] "Row: 16/20 | Time remaining: 1 seconds"
#> [1] "Row: 17/20 | Time remaining: 1 seconds"
#> [1] "Row: 18/20 | Time remaining: 0 seconds"
#> [1] "Row: 19/20 | Time remaining: 0 seconds"
#> [1] "Row: 20/20 | Time remaining: 0 seconds"
#> [1] "Total time taken: 4 seconds."

colnames(solutions_matrix)[1:30]
#>  [1] "row_id"                    "alpha"                    
#>  [3] "k"                         "t"                        
#>  [5] "snf_scheme"                "clust_alg"                
#>  [7] "cont_dist"                 "disc_dist"                
#>  [9] "ord_dist"                  "cat_dist"                 
#> [11] "mix_dist"                  "input_wt"                 
#> [13] "domain_wt"                 "inc_cortical_thickness"   
#> [15] "inc_cortical_surface_area" "inc_subcortical_volume"   
#> [17] "inc_household_income"      "inc_pubertal_status"      
#> [19] "nclust"                    "subject_NDAR_INV0567T2Y9" 
#> [21] "subject_NDAR_INV0IZ157F8"  "subject_NDAR_INV0J4PYA5F" 
#> [23] "subject_NDAR_INV10OMKVLE"  "subject_NDAR_INV15FPCW4O" 
#> [25] "subject_NDAR_INV19NB4RJK"  "subject_NDAR_INV1HLGR738" 
#> [27] "subject_NDAR_INV1KR0EZFU"  "subject_NDAR_INV1L3Y9EOP" 
#> [29] "subject_NDAR_INV1TCP5GNM"  "subject_NDAR_INV1ZHRDJ6B"

It goes on like this for some time.

Just like that, the clustering is done!

You can pull the clustering results out of each row using the get_cluster_df function:

cluster_solutions <- get_cluster_solutions(solutions_matrix)

head(cluster_solutions)
#>                 subjectkey 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
#> 1 subject_NDAR_INV0567T2Y9 1 2 1 2 2 2 1 2 2  2  5  2  2  2  1  2  1  1  2  1
#> 2 subject_NDAR_INV0IZ157F8 2 1 1 1 1 1 1 5 1  1  4  1  1  1  1  1  1  1  1  3
#> 3 subject_NDAR_INV0J4PYA5F 1 2 2 2 2 2 2 4 2  3  2  2  2  2  2  2  2  3  2  2
#> 4 subject_NDAR_INV10OMKVLE 1 1 1 1 1 1 1 1 1  1  5  1  1  1  1  1  1  1  1  3
#> 5 subject_NDAR_INV15FPCW4O 1 2 2 2 2 2 2 1 2  1  2  2  2  2  2  2  2  2  2  2
#> 6 subject_NDAR_INV19NB4RJK 1 2 1 2 2 2 1 3 2  4  4  1  2  2  2  2  1  2  2  3

Note: Parallel processing is available on an older release of the package (v0.2.0) and will be integrated into the latest release shortly. See the “processes” parameter in ?batch_snf.

Picking a solution

Now that we have access to 20 different clustering solutions, we’ll need to find some way to pick a favourite (or a few). In this case, plotting or running stats manually on each of the solutions might be a reasonable way to determine which ones we like the most. But when the number of solutions generated goes up into the hundreds (or thousands), we’re going to need some more automated approaches.

Below are some different tools that you can use to try to pick a solution that is the best for your purposes.

1: Examining “meta clusters”

This is the approach introduced by the original meta clustering paper. This is a good approach to use when you can’t quantitatively describe what makes one cluster solution better than another, but you can have an expert compare the two solutions and “intuit” which of the two is more desirable.

The idea is to cluster the clustering solutions themselves to arrive at a small number of qualitatively different solutions. From there, a user can manually pick out some representative solutions and do the evaluations themselves.

The first step is to calculate the adjusted Rand index (ARI) between each pair of cluster solutions. This metric is tells us how similar the solutions are to each other, thereby allowing us to find clusters of cluster solutions.

solutions_matrix_aris <- calc_om_aris(solutions_matrix)
#> [1] "Please wait - this may take a minute."
#> 
52.63158% completed...

We can visualize the resulting inter-cluster similarities with a heatmap:

adjusted_rand_index_heatmap(solutions_matrix_aris)

You can optionally save your heatmap by specifying a path with the save parameter (e.g., adjusted_rand_index_heatmap(solutions_matrix_aris, save = "./adjusted_rand_index_heatmap.png")).

The clustering solutions are along the rows and columns of the above figure, and the cells at the intersection between two solutions show how similar (big ARI) those solutions are to each other. The diagonals should always be red, representing the maximum value of 1, as they show the similarity between any clustering solution and itself. Agglomerative hierarchical clustering is being applied to these solutions by default (thank you, pheatmap package) so the orders of the clustering solutions here do not exactly line up with the order of the clustering solutions present in the settings matrix.

If you see something interesting in your heatmap, you may be curious to know how that corresponds to the settings that were in your settings matrix.

First, extract the ordering of the rows obtained from the clustering of the cluster solutions:

meta_cluster_order <- get_heatmap_order(solutions_matrix_aris)

The order is just a vector of numbers…

meta_cluster_order
#>  [1]  1 10 11 20  8 18  2  6  9 13  4 14  5 19  7 12 17 16  3 15

… which can be passed into the settings_matrix_heatmap function:

settings_matrix_heatmap(settings_matrix, order = meta_cluster_order)

This heatmap rescales all the columns in the settings_matrix to have a maximum value of 1. The purpose of re-ordering the settings matrix in this way is to see if any associations exist between certain settings values and pairwise cluster solution similarities.

Maybe you’ll see something interesting!

2. Quality measures

Quality metrics are another useful heuristic for the goodness of a cluster that don’t require any contextualization of results in the domain they may be used in. r-package enables measures of silhouette scores, Dunn indices, and Davies-Bouldin indices. To calculate these values, we’ll need not only the cluster results but also the final fused network (the similarity matrices produced by SNF) that the clusters came from. These affinity matrices can be collected from the batch_snf using the return_affinity_matrices parameter:

batch_snf_results <- batch_snf(
    data_list,
    settings_matrix,
    return_affinity_matrices = TRUE
)
#> [1] "Row: 1/20 | Time remaining: 2 seconds"
#> [1] "Row: 2/20 | Time remaining: 3 seconds"
#> [1] "Row: 3/20 | Time remaining: 2 seconds"
#> [1] "Row: 4/20 | Time remaining: 2 seconds"
#> [1] "Row: 5/20 | Time remaining: 2 seconds"
#> [1] "Row: 6/20 | Time remaining: 2 seconds"
#> [1] "Row: 7/20 | Time remaining: 2 seconds"
#> [1] "Row: 8/20 | Time remaining: 2 seconds"
#> [1] "Row: 9/20 | Time remaining: 2 seconds"
#> [1] "Row: 10/20 | Time remaining: 2 seconds"
#> [1] "Row: 11/20 | Time remaining: 2 seconds"
#> [1] "Row: 12/20 | Time remaining: 1 seconds"
#> [1] "Row: 13/20 | Time remaining: 1 seconds"
#> [1] "Row: 14/20 | Time remaining: 1 seconds"
#> [1] "Row: 15/20 | Time remaining: 1 seconds"
#> [1] "Row: 16/20 | Time remaining: 1 seconds"
#> [1] "Row: 17/20 | Time remaining: 1 seconds"
#> [1] "Row: 18/20 | Time remaining: 0 seconds"
#> [1] "Row: 19/20 | Time remaining: 0 seconds"
#> [1] "Row: 20/20 | Time remaining: 0 seconds"
#> [1] "Total time taken: 4 seconds."

solutions_matrix <- batch_snf_results$"solutions_matrix"
affinity_matrices <- batch_snf_results$"affinity_matrices"

This time, the output of batch_snf is a list. The first element of the list is a single solutions_matrix, like what we usually get. The second element is yet another list containing one final fused network (AKA affinity matrix / similarity matrix) per SNF run. Using those two lists, we can calculate the above mentioned quality metrics:

silhouette_scores <- calculate_silhouettes(solutions_matrix, affinity_matrices)

dunn_indices <- calculate_dunn_indices(solutions_matrix, affinity_matrices)

db_indices <- calculate_db_indices(solutions_matrix, affinity_matrices)

The first function is a wrapper around cluster::silhouette while the second and third come from the clv package. clv isn’t set as a mandatory part of the installation, so you’ll ned to install it yourself to calculate these two metrics.

The original documentation on these functions can be helpful for interpreting and working with them:

3. Stability measures

r-package offers tools to evaluate two different measures of stability:

Pairwise adjusted Rand indices (across resamplings of the clustering, on average, how similar was every pair of solutions according to the adjusted Rand index?)
Fraction clustered together (what is the average fraction of times that patients who clustered together in the full results clustered together in resampled results?)

To calculate either of these, you’ll need to first generate subsamples of the data_list.

data_list_subsamples <- subsample_data_list(
    data_list,
    n_subsamples = 30, # calculate 30 subsamples
    subsample_fraction = 0.8 # for each subsample, use random 80% of patients
)

data_list_subsamples is a list that now contains 30 smaller subsamples of the original data_list.

Then the stability calculations:

pairwise_aris <- subsample_pairwise_aris(
    data_list_subsamples,
    settings_matrix
)

fraction_together <- fraction_clustered_together(
    data_list_subsamples,
    settings_matrix,
    solutions_matrix
)

Be warned, that second function is especially extremely slow. As the number of patients and number of solutions you’re evaluating grows, these functions can get pretty slow. Consider only using them after eliminating solutions that you are certainly not interested in further characterizing.

4. Evaluating separation across “target variables” of importance

Warning: This approach can very easily result in overfitting your data and producing clustering results that generalize poorly to subjects outside of your dataset. Consider setting aside some data to validate your results to avoid this issue.

If you can specify a metric or objective function that may tell you how useful a clustering solution will be for your purposes in advance, that makes the cluster selection process much less arbitrary.

There are many ways to go about doing this, but this package offers one way through the target_list structure. The target_list contains dataframes what we can examine our clustering results over through linear regression (continuous data), ordinal regression (ordinal data), or the Chi-squared test (categorical data).

target_list <- generate_target_list(
    list(abcd_anxiety, "anxiety", "ordinal"),
    list(abcd_depress, "depressed", "ordinal"),
    uid = "patient"
)
#> [1] "UID successfully converted to subjectkey."

summarize_target_list(target_list)
#>        name    type length width
#> 1   anxiety ordinal    275     2
#> 2 depressed ordinal    275     2

The target_list is like the data_list, but without the domain attribute. At this time, each dataframe used to build an target_list must be a single-feature.

Just like when generating the initial data_list, we need to specify the name of the column in the provided dataframes that is originally being used to uniquely identify the different observations from each other with the uid parameter.

We will next extend our solutions_matrix with p-values from regressing the target_list features onto our generated clusters.

extended_solutions_matrix <- extend_solutions(solutions_matrix, target_list)

colnames(extended_solutions_matrix)[1:25]
#>  [1] "row_id"                    "alpha"                    
#>  [3] "k"                         "t"                        
#>  [5] "snf_scheme"                "clust_alg"                
#>  [7] "cont_dist"                 "disc_dist"                
#>  [9] "ord_dist"                  "cat_dist"                 
#> [11] "mix_dist"                  "input_wt"                 
#> [13] "domain_wt"                 "inc_cortical_thickness"   
#> [15] "inc_cortical_surface_area" "inc_subcortical_volume"   
#> [17] "inc_household_income"      "inc_pubertal_status"      
#> [19] "nclust"                    "subject_NDAR_INV0567T2Y9" 
#> [21] "subject_NDAR_INV0IZ157F8"  "subject_NDAR_INV0J4PYA5F" 
#> [23] "subject_NDAR_INV10OMKVLE"  "subject_NDAR_INV15FPCW4O" 
#> [25] "subject_NDAR_INV19NB4RJK"

# Looking at the newly added columns
head(no_subs(extended_solutions_matrix))
#>   row_id alpha  k  t snf_scheme clust_alg cont_dist disc_dist ord_dist cat_dist
#> 1      1   0.5 29 20          2         1         1         1        1        1
#> 2      2   0.4 26 20          1         1         1         1        1        1
#> 3      3   0.3 44 20          2         2         1         1        1        1
#> 4      4   0.3 43 20          1         1         1         1        1        1
#> 5      5   0.5 29 20          2         2         1         1        1        1
#> 6      6   0.4 26 20          2         1         1         1        1        1
#>   mix_dist input_wt domain_wt inc_cortical_thickness inc_cortical_surface_area
#> 1        1        1         1                      1                         0
#> 2        1        1         1                      1                         1
#> 3        1        1         1                      1                         0
#> 4        1        1         1                      1                         1
#> 5        1        1         1                      1                         1
#> 6        1        1         1                      1                         1
#>   inc_subcortical_volume inc_household_income inc_pubertal_status nclust
#> 1                      1                    0                   1      2
#> 2                      1                    1                   1      2
#> 3                      0                    1                   1      2
#> 4                      0                    1                   1      2
#> 5                      1                    1                   1      2
#> 6                      1                    1                   1      2
#>    cbcl_anxiety_r_p   cbcl_depress_r_p  min_p_val mean_p_val
#> 1 0.758497782121108  0.151784599479758 0.15178460  0.4551412
#> 2 0.436382083446635 0.0851033924288291 0.08510339  0.2607427
#> 3 0.319771547840626  0.687379490468759 0.31977155  0.5035755
#> 4 0.672455644238922  0.863299444565211 0.67245564  0.7678775
#> 5 0.244155078226786  0.113101620093214 0.11310162  0.1786283
#> 6 0.436382083446635 0.0851033924288291 0.08510339  0.2607427

If you just want the p-values:

p_val_select(extended_solutions_matrix)
#>    row_id cbcl_anxiety_r_p cbcl_depress_r_p
#> 1       1       0.75849778       0.15178460
#> 2       2       0.43638208       0.08510339
#> 3       3       0.31977155       0.68737949
#> 4       4       0.67245564       0.86329944
#> 5       5       0.24415508       0.11310162
#> 6       6       0.43638208       0.08510339
#> 7       7       0.19942831       0.67036268
#> 8       8       0.65534060       0.52733675
#> 9       9       0.62019372       0.06557707
#> 10     10       0.18683694       0.02026506
#> 11     11       0.09935434       0.18142564
#> 12     12       0.77200235       0.18269804
#> 13     13       0.60082272       0.10298457
#> 14     14       0.24415508       0.21488301
#> 15     15       0.93251462       0.51911329
#> 16     16       0.28223304       0.91903755
#> 17     17       0.42077342       0.39268244
#> 18     18       0.10214094       0.82503782
#> 19     19       0.58761917       0.13269839
#> 20     20       0.12232512       0.45334136

There is a heatmap for visualizing this too:

target_pvals <- p_val_select(extended_solutions_matrix)

pvals_heatmap(target_pvals, order = meta_cluster_order)

These p-values hold no real meaning for the traditional hypothesis-testing context, but they are reasonable proxies of the magnitude of the effect size / separation of the clusters across the variables in question. Here, they are just a tool to find clustering solutions that are well-separated according to the outcome measures you’ve specified. Finding a cluster solution like this is similar to a supervised learning approach, but where the optimization method is just random sampling. The risk for overfitting your data with this approach is considerable, so make sure you have some rigorous external validation before reporting your findings.

We recommend using label propagation (provided by the SNFtool package in the groupPredict function) for validation: take the top clustering solutions found in some training data, assign predicted clusters to some held out test subjects, and then characterize those test subjects to see how well the clustering solution seemed to have worked.

Validating results with label propagation

Here’s a quick step through of the complete procedure, from the beginning, with label propagation to validate our findings.

The metasnf package comes equipped with a function to do the training/testing split for you :)

# All the subjects present in all dataframes with no NAs
all_subjects <- data_list[[1]]$"data"$"subjectkey"

# Remove the "subject_" prefix to allow merges with the original data
all_subjects <- gsub("subject_", "", all_subjects)

# Dataframe assigning 80% of subjects to train and 20% to test
assigned_splits <- train_test_assign(train_frac = 0.8, subjects = all_subjects)

# Partition a training set
train_abcd_cort_t <- keep_split(abcd_cort_t, assigned_splits, "train", uid = "patient")
train_abcd_cort_sa <- keep_split(abcd_cort_sa, assigned_splits, "train", uid = "patient")
train_abcd_subc_v <- keep_split(abcd_subc_v, assigned_splits, "train", uid = "patient")
train_abcd_income <- keep_split(abcd_income, assigned_splits, "train", uid = "patient")
train_abcd_pubertal <- keep_split(abcd_pubertal, assigned_splits, "train", uid = "patient")
train_abcd_anxiety <- keep_split(abcd_anxiety, assigned_splits, "train", uid = "patient")
train_abcd_depress <- keep_split(abcd_depress, assigned_splits, "train", uid = "patient")

# Partition a test set
test_abcd_cort_t <- keep_split(abcd_cort_t, assigned_splits, "test", uid = "patient")
test_abcd_cort_sa <- keep_split(abcd_cort_sa, assigned_splits, "test", uid = "patient")
test_abcd_subc_v <- keep_split(abcd_subc_v, assigned_splits, "test", uid = "patient")
test_abcd_income <- keep_split(abcd_income, assigned_splits, "test", uid = "patient")
test_abcd_pubertal <- keep_split(abcd_pubertal, assigned_splits, "test", uid = "patient")
test_abcd_anxiety <- keep_split(abcd_anxiety, assigned_splits, "test", uid = "patient")
test_abcd_depress <- keep_split(abcd_depress, assigned_splits, "test", uid = "patient")

# Construct the data lists
train_data_list <- generate_data_list(
    list(train_abcd_cort_t, "cortical_thickness", "neuroimaging", "continuous"),
    list(train_abcd_cort_sa, "cortical_surface_area", "neuroimaging", "continuous"),
    list(train_abcd_subc_v, "subcortical_volume", "neuroimaging", "continuous"),
    list(train_abcd_income, "household_income", "demographics", "continuous"),
    list(train_abcd_pubertal, "pubertal_status", "demographics", "continuous"),
    uid = "patient"
)
#> [1] "UID successfully converted to subjectkey."

test_data_list <- generate_data_list(
    list(test_abcd_cort_t, "cortical_thickness", "neuroimaging", "continuous"),
    list(test_abcd_cort_sa, "cortical_surface_area", "neuroimaging", "continuous"),
    list(test_abcd_subc_v, "subcortical_volume", "neuroimaging", "continuous"),
    list(test_abcd_income, "household_income", "demographics", "continuous"),
    list(test_abcd_pubertal, "pubertal_status", "demographics", "continuous"),
    uid = "patient"
)
#> [1] "UID successfully converted to subjectkey."

Note, in a label propagation workflow, maintaining a consistent ordering of train and test subjects across all dataframes is particularly important. To ensure the full data list is constructed with an order that is compatible with the training and testing splits made above, you can either specify both the train_subjects and test_subjects parameters yourself, or provide the assigned_splits output from the train_test_assign function.

full_data_list <- generate_data_list(
    list(abcd_cort_t, "cortical_thickness", "neuroimaging", "continuous"),
    list(abcd_cort_sa, "cortical_surface_area", "neuroimaging", "continuous"),
    list(abcd_subc_v, "subcortical_volume", "neuroimaging", "continuous"),
    list(abcd_income, "household_income", "demographics", "continuous"),
    list(abcd_pubertal, "pubertal_status", "demographics", "continuous"),
    uid = "patient",
    assigned_splits = assigned_splits
)
#> [1] "UID successfully converted to subjectkey."

# Construct the target lists
train_target_list <- generate_target_list(
    list(train_abcd_anxiety, "anxiety", "ordinal"),
    list(train_abcd_depress, "depressed", "ordinal"),
    uid = "patient"
)
#> [1] "UID successfully converted to subjectkey."

# Find a clustering solution in your training data
settings_matrix <- generate_settings_matrix(
    train_data_list,
    nrow = 5,
    seed = 42,
    min_k = 10,
    max_k = 30
)
#> [1] "The global seed has been changed!"

min(summarize_dl(train_data_list)$"length")
#> [1] 83

train_solutions_matrix <- batch_snf(
    train_data_list,
    settings_matrix
)
#> [1] "Row: 1/5 | Time remaining: 0 seconds"
#> [1] "Row: 2/5 | Time remaining: 0 seconds"
#> [1] "Row: 3/5 | Time remaining: 0 seconds"
#> [1] "Row: 4/5 | Time remaining: 0 seconds"
#> [1] "Row: 5/5 | Time remaining: 0 seconds"
#> [1] "Total time taken: 1 seconds."

extended_solutions_matrix <- extend_solutions(
    train_solutions_matrix,
    train_target_list
)

# The fourth row had the lowest minimum p-value across our outcomes
which(extended_solutions_matrix$"min_p_val" ==
      min(extended_solutions_matrix$"min_p_val"))
#> [1] 4

# Keep track of your top solution
top_row <- extended_solutions_matrix[4, ]

# Hand over the solutions matrix as well as the full data list to propagate labels
#  to the test subjects
propagated_labels <- lp_row(top_row, full_data_list)
#> [1] "If you add a 'significance' column to your solutions matrix those values will be used to name each solution (instead of row IDs)"
#> [1] "Processing row 1 of 1..."
head(propagated_labels)
#>                 subjectkey group 4
#> 1 subject_NDAR_INV0567T2Y9 train 2
#> 2 subject_NDAR_INV0IZ157F8 train 1
#> 3 subject_NDAR_INV0J4PYA5F train 2
#> 4 subject_NDAR_INV10OMKVLE train 2
#> 5 subject_NDAR_INV15FPCW4O train 2
#> 6 subject_NDAR_INV19NB4RJK train 2
tail(propagated_labels)
#>                   subjectkey group 4
#> 95  subject_NDAR_INVGDBYXWV4  test 2
#> 96  subject_NDAR_INVHEUWA52I  test 2
#> 97  subject_NDAR_INVK9ULDQA2  test 2
#> 98  subject_NDAR_INVKYH529RD  test 2
#> 99  subject_NDAR_INVL045Z1TY  test 2
#> 100 subject_NDAR_INVLDQH8ATK  test 2

You could, if you wanted, see how all of your clustering solutions propagate to the test set, but that would mean reusing your test set and removing the protection against overfitting conferred by this procedure.

propagated_labels_all <- lp_row(extended_solutions_matrix, full_data_list)
#> [1] "If you add a 'significance' column to your solutions matrix those values will be used to name each solution (instead of row IDs)"
#> [1] "Processing row 1 of 5..."
#> [1] "Processing row 2 of 5..."
#> [1] "Processing row 3 of 5..."
#> [1] "Processing row 4 of 5..."
#> [1] "Processing row 5 of 5..."
head(propagated_labels_all)
#>                 subjectkey group 1 2 3 4 5
#> 1 subject_NDAR_INV0567T2Y9 train 1 2 2 2 2
#> 2 subject_NDAR_INV0IZ157F8 train 2 1 4 1 1
#> 3 subject_NDAR_INV0J4PYA5F train 1 2 3 2 2
#> 4 subject_NDAR_INV10OMKVLE train 1 1 1 2 2
#> 5 subject_NDAR_INV15FPCW4O train 1 2 1 2 2
#> 6 subject_NDAR_INV19NB4RJK train 1 2 4 2 2
tail(propagated_labels_all)
#>                   subjectkey group 1 2 3 4 5
#> 95  subject_NDAR_INVGDBYXWV4  test 1 2 1 2 2
#> 96  subject_NDAR_INVHEUWA52I  test 1 2 2 2 2
#> 97  subject_NDAR_INVK9ULDQA2  test 1 2 4 2 2
#> 98  subject_NDAR_INVKYH529RD  test 1 2 3 2 2
#> 99  subject_NDAR_INVL045Z1TY  test 1 2 1 2 2
#> 100 subject_NDAR_INVLDQH8ATK  test 1 2 1 2 2

That’s all for now!

If you have any questions, comments, suggestions, bugs, etc. feel free to post an issue at https://github.com/BRANCHlab/metasnf.

Appendix

(sections below will move into separate vignettes)

inc*

This section describes how the generate_settings_matrix function randomly assigns input dataframess to be dropped from different SNF runs to access a wider space of possible solutions.

By default, generate_settings_matrix will build an empty dataframe containing the columns outlined in the settings matrix section. If a non-zero number of rows are specified when calling generate_settings_matrix, the function calls on add_settings_matrix_rows to add rows with randomly valid values. The input dataframe inclusion variables rely on yet another helper function:

random_removal
#> function (columns, min_removed_inputs, max_removed_inputs, dropout_dist = "exponential") 
#> {
#>     inclusion_columns <- columns[startsWith(columns, "inc")]
#>     num_cols <- length(inclusion_columns)
#>     if (dropout_dist == "none") {
#>         inclusions_df <- data.frame(t(rep(1, num_cols)))
#>         colnames(inclusions_df) <- inclusion_columns
#>         rownames(inclusions_df) <- NULL
#>         return(inclusions_df)
#>     }
#>     if (is.null(min_removed_inputs)) {
#>         min_removed_inputs <- 0
#>     }
#>     if (is.null(max_removed_inputs)) {
#>         max_removed_inputs <- num_cols - 1
#>     }
#>     if (max_removed_inputs >= num_cols || min_removed_inputs < 
#>         0) {
#>         stop(paste0("The number of removed elements must be between 0 and the", 
#>             " total number of elements in the data_list (", num_cols, 
#>             ")."))
#>     }
#>     if (dropout_dist == "uniform") {
#>         possible_number_removed <- seq(min_removed_inputs, max_removed_inputs, 
#>             by = 1)
#>         num_removed <- resample(possible_number_removed, 1)
#>     }
#>     if (dropout_dist == "exponential") {
#>         rand_vals <- stats::rexp(10000)
#>         rand_vals <- rand_vals/max(rand_vals)
#>         difference <- max_removed_inputs - min_removed_inputs
#>         rand_vals <- rand_vals * difference
#>         rand_vals <- rand_vals + min_removed_inputs
#>         rand_vals <- round(rand_vals)
#>         num_removed <- sample(rand_vals, 1)
#>     }
#>     remove_placeholders <- rep(0, num_removed)
#>     keep_placeholders <- rep(1, num_cols - num_removed)
#>     unshuffled_removals <- c(remove_placeholders, keep_placeholders)
#>     shuffled_removals <- sample(unshuffled_removals)
#>     inclusions_df <- t(data.frame(shuffled_removals))
#>     colnames(inclusions_df) <- inclusion_columns
#>     rownames(inclusions_df) <- NULL
#>     return(inclusions_df)
#> }
#> <bytecode: 0x55eb6b9a3ed8>
#> <environment: namespace:metasnf>

This random_removal function randomly picks a number of columns to remove according to an exponential function. The most likely number of input dataframes to be removed is 0, followed by 1, all the way up to \(D - 1\) dataframes where \(D\) is the number of provided input dataframes. The exponential distribution seemed preferrable to a uniform one, which would lead to a large number of input dataframes being dropped on every SNF run.

snf_scheme

This section describes how individual input dataframes are converted into a final fused network in different ways using the snf_scheme parameter.

snf_scheme: One consideration of SNF is choosing how to organize individual features (columns) into the input dataframes that get fused by SNF.

In the original SNF paper, input dataframes were organized according to “measurement type”: three input dataframes each containing several features related to miRNA, mRNA, or DNA methylation. In my own work, I had a very large set of input dataframes that had varying amounts of overlap in expected information content, which made the grouping of features a little bit less clear. For example, consider having input dataframes for (1) the thicknesses of cortical brain regions, (2) the surface areas of cortical brain regions, (3) the volumes of subcortical brain regions, and (4) mean blood pressure. We have 4 distinct sets of features, 3 distinct sets of units (cortical thickness and subcortical volume both being in \(mm^3\)), and (arguably) 2 distinct sources of information: structural brain data and blood pressure data.

It is not immediately clear that collapsing the data into any one of those mentioned sets will give rise to the most useful clustering solution for our purpose. The number of ways in which we can take these individual features and produce a single SNF-fused similarity matrix at the end is quite large.

Here are two examples of getting a single similarity matrix from two single-feature input dataframes:

Example 1: Concatenation -> single similarity matrix

df1 <- data.frame(
    subjectkey = c("person1", "person2", "person3", "person4"),
    feature1 = c(1, 2, 3, 4)
)

df2 <- data.frame(
    subjectkey = c("person1", "person2", "person3", "person4"),
    feature2 = c(3, 3, 4, 5)
)

Two individual dataframes, prior to concatenation

subjectkey	feature1
person1	1
person2	2
person3	3
person4	4

subjectkey	feature2
person1	3
person2	3
person3	4
person4	5

We start by joining our two features by concatenation:

df3 <- dplyr::inner_join(df1, df2, by = "subjectkey")

A single concatenated dataframe
subjectkey	feature1	feature2
person1	1	3
person2	2	3
person3	3	4
person4	4	5

Giving rise to the following distance matrix:

df3_dist <- df3[, 2:3] |>
    stats::dist() |>
    as.matrix() |>
    round(2)

\[\begin{bmatrix} 0&1&2.24&3.61 \\ 1&0&1.41&2.83 \\ 2.24&1.41&0&1.41 \\ 3.61&2.83&1.41&0 \\ \end{bmatrix}\]

And subsequent similarity matrix:

df3_sim <- SNFtool::affinityMatrix(df3_dist, K = 3, sigma = 0.5) |>
    round(2)

\[\begin{bmatrix} 0.52&0.23&0.04&0.01 \\ 0.23&0.69&0.11&0.02 \\ 0.04&0.11&0.71&0.14 \\ 0.01&0.02&0.14&0.46 \\ \end{bmatrix}\]

\(~\)

Example 2: Similarity matrices -> integration by SNF

df1 <- data.frame(
    subjectkey = c("person1", "person2", "person3", "person4"),
    feature1 = c(1, 2, 3, 4)
)

df2 <- data.frame(
    subjectkey = c("person1", "person2", "person3", "person4"),
    feature2 = c(3, 3, 4, 5)
)

We first convert each of our input dataframes into distance matrices:

df1_dist <- df1[, 2] |>
    stats::dist() |>
    as.matrix() |>
    round(2)

df2_dist <- df2[, 2] |>
    stats::dist() |>
    as.matrix() |>
    round(2)

\[\begin{bmatrix} 0&1&2&3 \\ 1&0&1&2 \\ 2&1&0&1 \\ 3&2&1&0 \\ \end{bmatrix}\] \[\begin{bmatrix} 0&0&1&2 \\ 0&0&1&2 \\ 1&1&0&1 \\ 2&2&1&0 \\ \end{bmatrix}\]

We then convert these distance matrices into similarity matrices:

df1_sim <- SNFtool::affinityMatrix(df1_dist, K = 3, sigma = 0.5) |>
    round(2)

df2_sim <- SNFtool::affinityMatrix(df2_dist, K = 3, sigma = 0.5) |>
    round(2)

\[\begin{bmatrix} 0.6&0.21&0.04&0.01 \\ 0.21&0.9&0.17&0.04 \\ 0.04&0.17&0.9&0.21 \\ 0.01&0.04&0.21&0.6 \\ \end{bmatrix}\] \[\begin{bmatrix} 1.2&1.2&0.11&0.02 \\ 1.2&1.2&0.11&0.02 \\ 0.11&0.11&1.2&0.17 \\ 0.02&0.02&0.17&0.72 \\ \end{bmatrix}\]

And then we fuse these similarity matrices together using SNF:

df1_df2_fused <- SNFtool::SNF(list(df1_sim, df2_sim), K = 3) |>
    round(2)

\[\begin{bmatrix} 0.5&0.24&0.13&0.12 \\ 0.24&0.5&0.14&0.13 \\ 0.13&0.14&0.5&0.24 \\ 0.12&0.13&0.24&0.5 \\ \end{bmatrix}\]

Contrast this with the similarity matrix we obtained in Example 1:

\[\begin{bmatrix} 0.52&0.23&0.04&0.01 \\ 0.23&0.69&0.11&0.02 \\ 0.04&0.11&0.71&0.14 \\ 0.01&0.02&0.14&0.46 \\ \end{bmatrix}\]

I’m not entirely sure about what is happening with the diagonal values not being identical in the second matrix, but that aside, the results here seem mostly similar. Person 1 is most similar to person 2, person 2 is most similar to person 1, person 3 is most similar to person 4, and person 4 is most similar to person 3.

But

The values aren’t identical, and you could imagine that for a more complicated example or a slightly different choice in feature collapsing, the resulting clustering solution would be different.

Going back to the example where our input dataframes are:

cortical thicknesses
cortical surface areas
subcortical volumes
mean blood pressure

You could imagine that individually converting each of these 4 input dataframes into 4 separate distance matrices, converting those into similarity matrices, and then fusing those 4 matrices into one single network by SNF could be a little bit skewed in favour of the brain imaging information. 3 of our sources are all describing structural brain features, while 1 source is about the blood.

Varying the snf_scheme parameter gives us access to a wider range of possible clustering solutions by changing which of the following approaches are taken to digest the initial input dataframes to a final fused network:

Individual: Input dataframes are combined into a final network by SNF.
Two-step: Input dataframes are combined within user-specified data domains by one round of SNF and then combined across domains by a second round of SNF.
Domain: Input dataframes are combined within domains by concatenation and then combined into a single fused network by SNF.

These options are not comprehensive. If there are other schemes you would like to be added into the package, feel free to make a request on GitHub.

References

Caruana, Rich, Mohamed Elhawary, Nam Nguyen, and Casey Smith. 2006. “Meta Clustering.” In Sixth International Conference on Data Mining (ICDM’06), 107–18. https://doi.org/10.1109/ICDM.2006.103.

Wang, Bo, Aziz M. Mezlini, Feyyaz Demir, Marc Fiume, Zhuowen Tu, Michael Brudno, Benjamin Haibe-Kains, and Anna Goldenberg. 2014. “Similarity Network Fusion for Aggregating Data Types on a Genomic Scale.” Nature Methods 11 (3): 333–37. https://doi.org/10.1038/nmeth.2810.