The Settings Matrix • metasnf

Download a copy of the vignette to follow along here: settings_matrix.Rmd

This vignette outlines the main functionality of the generate_settings_matrix function.

The basic settings_matrix

The most minimal settings_matrix can be obtained by providing a data_list object.

library(metasnf)

# It's best to list out the individual elements with names, i.e. data = ...,
#  name = ..., domain = ..., type = ..., but we'll skip that here for brevity.
data_list <- generate_data_list(
    list(cort_t, "cortical_thickness", "neuroimaging", "continuous"),
    list(cort_sa, "cortical_surface_area", "neuroimaging", "continuous"),
    list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
    list(income, "household_income", "demographics", "continuous"),
    list(pubertal, "pubertal_status", "demographics", "continuous"),
    uid = "unique_id"
)

## Warning in generate_data_list(list(cort_t, "cortical_thickness",
## "neuroimaging", : 175 subject(s) dropped due to incomplete data.

settings_matrix <- generate_settings_matrix(
    data_list
)

head(settings_matrix)

##  [1] row_id                    alpha                    
##  [3] k                         t                        
##  [5] snf_scheme                clust_alg                
##  [7] cont_dist                 disc_dist                
##  [9] ord_dist                  cat_dist                 
## [11] mix_dist                  inc_cortical_thickness   
## [13] inc_cortical_surface_area inc_subcortical_volume   
## [15] inc_household_income      inc_pubertal_status      
## <0 rows> (or 0-length row.names)

The resulting columns are:

row_id: A label to keep track of which row is which
alpha: The alpha (also referred to as sigma or eta) hyperparameter in SNF
k: The K (nearest neighbours) hyperparameter in similarity matrix calculations and SNF
t: The T (number of iterations) hyperparameter used in SNF
snf_scheme: Which SNF “scheme” is being used to convert the initial provided dataframes into a final fused network (more on this in the appendix of the “Less Simple Example” vignette)
clust_alg: Which clustering algorithm will be applied to the final fused network. By default, this varies between the pre-provided options of (1) spectral clustering with the number of clusters determined by the eigen-gap heuristic and (2) the same thing but using the rotation cost heuristic. You can learn more about using this parameter in the clustering algorithnms vignette.
Columns ending in dist: Which distance metric is being used for the various types of features (more on this in the distance metrics vignette)
Columns starting with inc: Whether or not the corresponding dataframe will be included (1) or excluded (0) from this row

By varying the values in these columns, we can define distinct SNF pipelines that should give rise to a broader space of cluster solutions. The following sections outline how to use generate_settings_matrix to build a wide range of settings that will hopefully help you find a subtyping solution useful for your purposes.

Adding random rows

When not specifying any parameters beyond the number of rows that are created, the function will randomly (but sensibly) vary the values in the matrix.

# Through minimums and maximums
settings_matrix <- generate_settings_matrix(
    data_list,
    nrow = 100,
)

head(settings_matrix)

##   row_id alpha  k  t snf_scheme clust_alg cont_dist disc_dist ord_dist cat_dist
## 1      1   0.6 39 20          1         2         1         1        1        1
## 2      2   0.8 75 20          2         2         1         1        1        1
## 3      3   0.4 51 20          2         2         1         1        1        1
## 4      4   0.6 77 20          1         2         1         1        1        1
## 5      5   0.6 25 20          3         2         1         1        1        1
## 6      6   0.6 27 20          2         2         1         1        1        1
##   mix_dist inc_cortical_thickness inc_cortical_surface_area
## 1        1                      1                         1
## 2        1                      1                         1
## 3        1                      1                         0
## 4        1                      1                         1
## 5        1                      0                         1
## 6        1                      1                         1
##   inc_subcortical_volume inc_household_income inc_pubertal_status
## 1                      1                    1                   1
## 2                      1                    1                   1
## 3                      1                    1                   1
## 4                      1                    0                   1
## 5                      1                    1                   1
## 6                      1                    1                   1

The alpha and k hyperparameters are varied from 0.3 to 0.8 and 10 to 100 respectively based on suggestion from the authors of SNF.

The t hyperparameter, which controls how many iterations of updates occur to the fused network during SNF, stays fixed at 20, by default. This value (20) has been empirically demonstrated to be sufficient for achieving convergence of the matrix, and varying it doesn’t seem to have much relevance to what kinds of cluster solutions are produced.

The snf_scheme column will vary from 1 to 3, which outlines the 3 differente schemes that are available.

The clust_alg column will vary randomly between (1) spectral clustering using the eigen-gap heuristic and (2) spectral clustering using the rotation cost heuristic by default.

The distance columns will always be 1 by default, as they will just use the default distance metrics of simple Euclidean for anything numeric and Gower’s distance for anything mixed or categorical.

Controlling the scheme, the clustering algorithms, and the distance metrics are discussed in more details in separate vignettes linked to above. Controlling the remaining options are shown below.

Alpha, k, and t

You can control any of these parameters either by providing a vector of values you’d like to randomly sample from or by specifying a minimum and maximum range.

# Through minimums and maximums
settings_matrix <- generate_settings_matrix(
    data_list,
    nrow = 100,
    min_k = 10,
    max_k = 60,
    min_alpha = 0.3,
    max_alpha = 0.8,
    min_t = 15,
    max_t = 30
)

## Warning in add_settings_matrix_rows(settings_matrix = settings_matrix_base, :
## The original SNF paper recommends a t between 10 to 20. Empirically, setting t
## above 20 is always sufficient for SNF to converge. This warning is raised
## anytime a user tries to set a t value smaller than 10 or larger than 20.

# Through specific value sampling
settings_matrix <- generate_settings_matrix(
    data_list,
    nrow = 20,
    k_values = c(10, 25, 50),
    alpha_values = c(0.4, 0.8),
    t_values = c(20, 30)
)

## Warning in add_settings_matrix_rows(settings_matrix = settings_matrix_base, :
## The original SNF paper recommends a t between 10 to 20. Empirically, setting t
## above 20 is always sufficient for SNF to converge. This warning is raised
## anytime a user tries to set a t value smaller than 10 or larger than 20.

Inclusion columns

Bounds on the number of input dataframes removed as well as the way in which the number removed is chosen can be controlled.

By default, generate_settings_matrix will pick a random value between 0 and 1 less than the total number of available dataframes based on an exponential probability distribution. The exponential distribution makes it so that it is very likely that a small number of dataframes will be dropped and much less likely that a large number of dataframes will be dropped.

You can control the distribution by changing the dropout_dist value to “uniform” (which will result in a much higher number of dataframes being dropped on average) or “none” (which will result in no dataframes being dropped).

# Exponential dropping
settings_matrix <- generate_settings_matrix(
    data_list,
    nrow = 20,
    dropout_dist = "exponential" # the default behaviour
)

head(settings_matrix)

##   row_id alpha  k  t snf_scheme clust_alg cont_dist disc_dist ord_dist cat_dist
## 1      1   0.4 63 20          2         2         1         1        1        1
## 2      2   0.3 70 20          1         1         1         1        1        1
## 3      3   0.5 96 20          2         1         1         1        1        1
## 4      4   0.6 21 20          2         2         1         1        1        1
## 5      5   0.7 61 20          2         1         1         1        1        1
## 6      6   0.7 95 20          1         1         1         1        1        1
##   mix_dist inc_cortical_thickness inc_cortical_surface_area
## 1        1                      1                         1
## 2        1                      1                         1
## 3        1                      1                         1
## 4        1                      1                         1
## 5        1                      1                         1
## 6        1                      1                         1
##   inc_subcortical_volume inc_household_income inc_pubertal_status
## 1                      1                    1                   1
## 2                      1                    1                   1
## 3                      1                    1                   1
## 4                      1                    0                   1
## 5                      1                    0                   1
## 6                      1                    1                   1

# Uniform dropping
settings_matrix <- generate_settings_matrix(
    data_list,
    nrow = 20,
    dropout_dist = "uniform"
)

head(settings_matrix)

##   row_id alpha  k  t snf_scheme clust_alg cont_dist disc_dist ord_dist cat_dist
## 1      1   0.3 69 20          2         1         1         1        1        1
## 2      2   0.5 87 20          2         2         1         1        1        1
## 3      3   0.5 10 20          2         1         1         1        1        1
## 4      4   0.8 80 20          2         1         1         1        1        1
## 5      5   0.5 40 20          2         1         1         1        1        1
## 6      6   0.7 70 20          1         1         1         1        1        1
##   mix_dist inc_cortical_thickness inc_cortical_surface_area
## 1        1                      1                         1
## 2        1                      0                         1
## 3        1                      0                         0
## 4        1                      1                         1
## 5        1                      1                         1
## 6        1                      1                         1
##   inc_subcortical_volume inc_household_income inc_pubertal_status
## 1                      1                    0                   1
## 2                      0                    1                   0
## 3                      1                    0                   0
## 4                      1                    1                   1
## 5                      1                    1                   0
## 6                      0                    1                   1

# No dropping
settings_matrix <- generate_settings_matrix(
    data_list,
    nrow = 20,
    dropout_dist = "none"
)

head(settings_matrix)

##   row_id alpha  k  t snf_scheme clust_alg cont_dist disc_dist ord_dist cat_dist
## 1      1   0.3 83 20          3         2         1         1        1        1
## 2      2   0.7 45 20          1         2         1         1        1        1
## 3      3   0.5 30 20          1         1         1         1        1        1
## 4      4   0.8 19 20          3         1         1         1        1        1
## 5      5   0.6 64 20          2         2         1         1        1        1
## 6      6   0.8 76 20          3         2         1         1        1        1
##   mix_dist inc_cortical_thickness inc_cortical_surface_area
## 1        1                      1                         1
## 2        1                      1                         1
## 3        1                      1                         1
## 4        1                      1                         1
## 5        1                      1                         1
## 6        1                      1                         1
##   inc_subcortical_volume inc_household_income inc_pubertal_status
## 1                      1                    1                   1
## 2                      1                    1                   1
## 3                      1                    1                   1
## 4                      1                    1                   1
## 5                      1                    1                   1
## 6                      1                    1                   1

The bounds on the number of dataframes that can be dropped can be controlled using the min_removed_inputs and max_removed_inputs:

settings_matrix <- generate_settings_matrix(
    data_list,
    nrow = 20,
    min_removed_inputs = 3
)

# No row will exclude fewer than 3 dataframes during SNF
head(settings_matrix)

##   row_id alpha  k  t snf_scheme clust_alg cont_dist disc_dist ord_dist cat_dist
## 1      1   0.5 94 20          2         2         1         1        1        1
## 2      2   0.6 22 20          1         2         1         1        1        1
## 3      3   0.7 38 20          3         2         1         1        1        1
## 4      4   0.7 69 20          1         2         1         1        1        1
## 5      5   0.4 73 20          1         1         1         1        1        1
## 6      6   0.6 39 20          1         2         1         1        1        1
##   mix_dist inc_cortical_thickness inc_cortical_surface_area
## 1        1                      0                         0
## 2        1                      1                         0
## 3        1                      0                         0
## 4        1                      0                         1
## 5        1                      0                         1
## 6        1                      0                         1
##   inc_subcortical_volume inc_household_income inc_pubertal_status
## 1                      0                    1                   1
## 2                      0                    0                   1
## 3                      1                    1                   0
## 4                      1                    0                   0
## 5                      0                    0                   1
## 6                      0                    0                   1

Grid searching

If you are interested in grid searching over perhaps just a specific set of alpha and k values, you may want to consider varying those parameters and keeping everything else fixed:

settings_matrix <- generate_settings_matrix(
    data_list,
    nrow = 10,
    alpha_values = c(0.3, 0.5, 0.8),
    k_values = c(20, 40, 60),
    dropout_dist = "none"
)

Assembling a settings_matrix in pieces

Rather than varying everything equally all at once, you may be interested in looking at “chunks” of solution spaces that are based on distinct settings matrices.

For example, you may want to look at 50 solutions generated with k = 50 and look at another 50 solutions generated with k = 80. You can build two separate settings matrices and concatenate them, but you can also build up a single matrix in parts using the add_settings_matrix_rows function:

set.seed(42)
settings_matrix <- generate_settings_matrix(
    data_list,
    nrow = 25,
    k_values = 50
)

settings_matrix <- add_settings_matrix_rows(
    settings_matrix,
    nrow = 25,
    k_values = 80
)

dim(settings_matrix)

## [1] 50 16

settings_matrix$"k"

##  [1] 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50
## [26] 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80

Manual adjustments

Don’t forget that the settings matrix is just a dataframe. You can always go in and modify things as you wish, but you do risk generating duplicate or invalid rows that the package functions would have prevented.

“Matrix building failed”

generate_settings_matrix will never build duplicate rows. A consequence of this is that if you request a very large number of rows over a very small range of possible values to vary over, it will be impossible for the matrix to be built. For example, there’s no way to generate 10 unique rows when the only thing allowed to vary is which clustering algorithm (1 or 2) is used - only 2 rows could ever be created.

If you encounter the error “Matrix building failed”, try to generate fewer rows or to be a little less strict with what values are allowed.