Download a copy of the vignette to follow along here: snf_config.Rmd
This vignette outlines how to construct and use the SNF config, an object storing all the settings and hyperparameters required to convert data in a data_list class object into a space of cluster solutions.
Creating a default SNF config
The most minimal SNF config (snf_config class object) can be obtained by passing a data list to the snf_config() function.
library(metasnf)
dl <- data_list(
  list(cort_t, "cortical_thickness", "neuroimaging", "continuous"),
  list(cort_sa, "cortical_surface_area", "neuroimaging", "continuous"),
  list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
  list(income, "household_income", "demographics", "continuous"),
  list(pubertal, "pubertal_status", "demographics", "continuous"),
  uid = "unique_id"
)
## ℹ 175 observations dropped due to incomplete data.
sc <- snf_config(dl, n_solutions = 5)
## ℹ No distance functions specified. Using defaults.
## ℹ No clustering functions specified. Using defaults.
sc
## Settings Data Frame:
## 1 2 3 4 5
## SNF hyperparameters:
## alpha 0.6 0.8 0.4 0.6 0.6
## k 39 75 51 77 25
## t 20 20 20 20 20
## SNF scheme:
## 1 2 2 1 3
## Clustering functions:
## 2 2 2 2 2
## Distance functions:
## CNT 1 1 1 1 1
## DSC 1 1 1 1 1
## ORD 1 1 1 1 1
## CAT 1 1 1 1 1
## MIX 1 1 1 1 1
## Component dropout:
## cortical_thickness ✔ ✔ ✔ ✔ ✖
## cortical_surface_area ✔ ✔ ✖ ✔ ✔
## subcortical_volume ✔ ✔ ✔ ✔ ✔
## household_income ✔ ✔ ✔ ✖ ✔
## pubertal_status ✔ ✔ ✔ ✔ ✔
## Distance Functions List:
## Continuous (1):
## [1] euclidean_distance
## Discrete (1):
## [1] euclidean_distance
## Ordinal (1):
## [1] euclidean_distance
## Categorical (1):
## [1] gower_distance
## Mixed (1):
## [1] gower_distance
## Clustering Functions List:
## [1] spectral_eigen
## [2] spectral_rot
## Weights Matrix:
## Weights defined for 5 cluster solutions.
## $ mrisdp_1 1, 1, 1, 1, 1
## $ mrisdp_2 1, 1, 1, 1, 1
## $ mrisdp_3 1, 1, 1, 1, 1
## $ mrisdp_4 1, 1, 1, 1, 1
## $ mrisdp_5 1, 1, 1, 1, 1
## …and 329 more features.
Similarity network fusion-based clustering pipelines require the following steps:
1. Selecting a set of data frames to integrate
2. Converting those data frames into distance matrices using a distance function
3. Converting those distance matrices to similarity matrices using the SNFtool package’s affinityMatrix() function
4. Integrating the similarity matrices into one final similarity matrix using the SNFtool package’s SNF() function
5. Converting that final similarity matrix into a cluster solution using a clustering function
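The five steps above can be sketched directly with SNFtool, without any of the conveniences metasnf adds. This is a minimal illustration on simulated data, not a description of how metasnf runs the pipeline internally. The toy data frames and the hyperparameter values (K = 10, sigma = 0.5, two clusters) are assumptions chosen only for the example.

```r
set.seed(1)
n <- 40

# Step 1: two toy data frames describing the same 40 observations
df_a <- data.frame(a1 = rnorm(n), a2 = rnorm(n))
df_b <- data.frame(b1 = rnorm(n), b2 = rnorm(n))

# Step 2: data frames -> distance matrices (Euclidean on scaled features)
dist_a <- as.matrix(dist(scale(df_a)))
dist_b <- as.matrix(dist(scale(df_b)))

# Step 3: distance matrices -> similarity matrices
sim_a <- SNFtool::affinityMatrix(dist_a, K = 10, sigma = 0.5)
sim_b <- SNFtool::affinityMatrix(dist_b, K = 10, sigma = 0.5)

# Step 4: fuse the similarity matrices into one final similarity matrix
fused <- SNFtool::SNF(list(sim_a, sim_b), K = 10, t = 20)

# Step 5: final similarity matrix -> cluster solution
solution <- SNFtool::spectralClustering(fused, K = 2)
table(solution)
```

Every hard-coded value in this sketch (which data frames go in, K, sigma, t, the clustering function) is something the SNF config lets you vary across many runs.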
The SNF config is made up of four parts that each address different parts of that pipeline:
- The settings data frame (class settings_df, extends class data.frame), which contains information about SNF-specific hyperparameters (steps 3 and 4), which distance and clustering functions will be used (steps 2 and 5), and whether any components of the data list (data frames) will be excluded on a particular run (step 1). Each row of the data frame corresponds to a complete set of settings that can yield a single cluster solution from the data list.
- The distance functions list (class dist_fns_list, extends class list), which stores the actual distance functions that are referenced in the settings data frame (step 2).
- The clustering functions list (class clust_fns_list, extends class list), which similarly stores clustering functions (step 5).
- The weights matrix (class weights_matrix, extends classes matrix, array), which contains feature weights to account for during the data to distance matrix conversion step (step 2).
The settings data frame
You can view the settings data frame in closer detail as follows:
sc$"settings_df"
## 1 2 3 4 5
## SNF hyperparameters:
## alpha 0.6 0.8 0.4 0.6 0.6
## k 39 75 51 77 25
## t 20 20 20 20 20
## SNF scheme:
## 1 2 2 1 3
## Clustering functions:
## 2 2 2 2 2
## Distance functions:
## CNT 1 1 1 1 1
## DSC 1 1 1 1 1
## ORD 1 1 1 1 1
## CAT 1 1 1 1 1
## MIX 1 1 1 1 1
## Component dropout:
## cortical_thickness ✔ ✔ ✔ ✔ ✖
## cortical_surface_area ✔ ✔ ✖ ✔ ✔
## subcortical_volume ✔ ✔ ✔ ✔ ✔
## household_income ✔ ✔ ✔ ✖ ✔
## pubertal_status ✔ ✔ ✔ ✔ ✔
# Printed as a regular data frame
sc$"settings_df" |> as.data.frame()
## solution alpha k t snf_scheme clust_alg cnt_dist dsc_dist ord_dist cat_dist
## 1 1 0.6 39 20 1 2 1 1 1 1
## 2 2 0.8 75 20 2 2 1 1 1 1
## 3 3 0.4 51 20 2 2 1 1 1 1
## 4 4 0.6 77 20 1 2 1 1 1 1
## 5 5 0.6 25 20 3 2 1 1 1 1
## mix_dist inc_cortical_thickness inc_cortical_surface_area
## 1 1 1 1
## 2 1 1 1
## 3 1 1 0
## 4 1 1 1
## 5 1 0 1
## inc_subcortical_volume inc_household_income inc_pubertal_status
## 1 1 1 1
## 2 1 1 1
## 3 1 1 1
## 4 1 0 1
## 5 1 1 1
The columns in a settings_df class object include:
- solution: A label to keep track of each generated cluster solution.
- alpha: The alpha (also referred to as sigma or eta in the original SNF paper) hyperparameter in SNF. This hyperparameter plays a role in converting distance matrices into similarity matrices. SNFtool::affinityMatrix() does this conversion by essentially plugging the distance value in as the x-coordinate of a normal distribution and taking the density at that point as the similarity. The width of the normal distribution is regulated by alpha, where a larger alpha leads to a broader normal distribution and a greater sensitivity to discriminating distances.
- k: The k (nearest neighbours) hyperparameter in the distance matrix to similarity matrix conversion as well as in similarity network fusion. In the distance matrix to similarity matrix conversion (SNFtool::affinityMatrix()), k controls how many nearest neighbours to consider when calculating how similar each observation is to its nearest neighbours. The closer an observation is to its k nearest neighbours, the narrower the normal distribution that is used for the distance to similarity conversion. For the similarity network fusion step (SNFtool::SNF()), k controls how intensely all the matrices should be sparsified before information is passed between them. With a very small k, say, k = 1, all the values in all the matrices will be reduced to 0 with the exception of one value between each observation and that observation’s most similar neighbour.
- t: The T (number of iterations) hyperparameter used in SNF. A larger t results in more rounds of information passing between similarity matrices. SNF eventually converges, so overshooting this value offers no benefit but undershooting can yield inaccurate results. The original SNF developers recommend leaving this value at 20.
- snf_scheme: Which SNF “scheme” is being used to convert the initial provided data frames into a final fused network (more on this in the SNF schemes vignette).
- clust_alg: Which clustering algorithm function from the clustering functions list of the config will be applied to the final fused network. You can learn more about using this parameter in the clustering algorithms vignette.
- Columns ending in dist: Which distance metric function from the distance functions list of the config will be used for each of the various types of features in the data list (more on this in the distance metrics vignette).
- Columns starting with inc: Whether the corresponding data frame will be included (1) or excluded (0) from this row.
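To make the roles of alpha and k in the distance to similarity conversion more concrete, here is a simplified base R sketch. It is not SNFtool's actual affinityMatrix() code: toy_affinity() is a hypothetical helper that only mimics the idea described above, taking the normal density at each distance with a spread scaled by alpha and by each observation's mean distance to its k nearest neighbours.

```r
# Hypothetical, simplified stand-in for the distance -> similarity step
toy_affinity <- function(dist_mat, k, alpha) {
  # Mean distance from each observation to its k nearest neighbours
  # (position 1 of each sorted row is the observation itself, at distance 0)
  knn_mean <- apply(dist_mat, 1, function(d) mean(sort(d)[2:(k + 1)]))
  # Pairwise local scale: average neighbourhood size of the two observations
  local_scale <- outer(knn_mean, knn_mean, "+") / 2
  # Similarity is the normal density at the observed distance
  dnorm(dist_mat, mean = 0, sd = alpha * local_scale)
}

set.seed(1)
x <- matrix(rnorm(40), ncol = 2)  # 20 toy observations, 2 features
d <- as.matrix(dist(x))

narrow <- toy_affinity(d, k = 5, alpha = 0.3)
broad <- toy_affinity(d, k = 5, alpha = 0.8)

# A smaller alpha gives a narrower distribution, which concentrates more
# density at distance 0 (the diagonal) and decays faster with distance
head(cbind(narrow = diag(narrow), broad = diag(broad)))
```

In this sketch, observations sitting in tight neighbourhoods get a smaller local scale and so a narrower distribution, while alpha stretches or shrinks every distribution by the same factor.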
By default, the alpha and k hyperparameters are randomly varied from 0.3 to 0.8 and 10 to 100 respectively based on suggestions from the original SNF paper. The t hyperparameter by default stays fixed at 20. The snf_scheme column varies randomly from 1 to 3, corresponding to each of the three different schemes that are available. The clust_alg column varies randomly between 1 and 2 for the two default clustering algorithm functions: (1) spectral clustering using the eigen-gap heuristic to calculate the number of clusters and (2) spectral clustering using the rotation cost heuristic. The distance columns will always be 1 by default, as there is only one default distance metric function per variable type: simple Euclidean for anything numeric and Gower’s distance for anything mixed or categorical.
The distance functions list
The distance functions list is simply a list of functions capable of converting a data frame into a distance matrix. Distance functions within the list are organized based on what type of variable they deal with: continuous, discrete, ordinal, categorical, or mixed (any combination of the former 4).
dfl <- sc$"dist_fns_list"
dfl
## Continuous (1):
## [1] euclidean_distance
## Discrete (1):
## [1] euclidean_distance
## Ordinal (1):
## [1] euclidean_distance
## Categorical (1):
## [1] gower_distance
## Mixed (1):
## [1] gower_distance
names(dfl)
## [1] "cnt_dist_fns" "dsc_dist_fns" "ord_dist_fns" "cat_dist_fns" "mix_dist_fns"
dfl$"cnt_dist_fns"[[1]]
## function (df, weights_row)
## {
## weights <- diag(weights_row, nrow = length(weights_row))
## weighted_df <- as.matrix(df) %*% weights
## distance_matrix <- as.matrix(stats::dist(weighted_df, method = "euclidean"))
## return(distance_matrix)
## }
## <bytecode: 0x55f5c2e02478>
## <environment: namespace:metasnf>
You can learn more about customizing distance metrics in the distance metrics vignette.
The clustering functions list
The clustering functions list is similarly a list of functions capable of converting a similarity matrix into a cluster solution (numeric vector).
cfl <- sc$"clust_fns_list"
cfl
## [1] spectral_eigen
## [2] spectral_rot
names(cfl)
## [1] "spectral_eigen" "spectral_rot"
cfl[[1]]
## function (similarity_matrix)
## {
## estimated_n <- estimate_nclust_given_graph(W = similarity_matrix,
## NUMC = 2:10)
## nclust_estimate <- estimated_n$`Eigen-gap best`
## solution <- SNFtool::spectralClustering(similarity_matrix,
## nclust_estimate)
## return(solution)
## }
## <bytecode: 0x55f5c2edbc68>
## <environment: namespace:metasnf>
You can learn more about customizing clustering functions in the clustering algorithms vignette.
The weights matrix
wm <- sc$"weights_matrix"
wm
## Weights defined for 5 cluster solutions.
## $ mrisdp_1 1, 1, 1, 1, 1
## $ mrisdp_2 1, 1, 1, 1, 1
## $ mrisdp_3 1, 1, 1, 1, 1
## $ mrisdp_4 1, 1, 1, 1, 1
## $ mrisdp_5 1, 1, 1, 1, 1
## …and 329 more features.
class(wm) <- "matrix"
wm[1:5, 1:5]
## mrisdp_1 mrisdp_2 mrisdp_3 mrisdp_4 mrisdp_5
## [1,] 1 1 1 1 1
## [2,] 1 1 1 1 1
## [3,] 1 1 1 1 1
## [4,] 1 1 1 1 1
## [5,] 1 1 1 1 1
There’s one row in the weights matrix corresponding to every row in the settings data frame and one column for every feature in the data list. By default, all the weights are set to 1, so no weighting occurs.
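The effect of non-default weights can be illustrated with a standalone copy of the Euclidean distance function body printed earlier in this vignette. This is a sketch on a made-up three-row data frame, and it assumes that weights_row is the row of the weights matrix belonging to the solution being built; weighted_euclidean and toy_df are names introduced only for this example.

```r
# Same logic as the euclidean_distance function body shown above
weighted_euclidean <- function(df, weights_row) {
  weights <- diag(weights_row, nrow = length(weights_row))
  weighted_df <- as.matrix(df) %*% weights
  as.matrix(stats::dist(weighted_df, method = "euclidean"))
}

toy_df <- data.frame(f1 = c(0, 3, 0), f2 = c(0, 0, 4))

# All weights equal to 1: identical to an unweighted Euclidean distance
unweighted <- weighted_euclidean(toy_df, c(1, 1))

# Doubling the weight on f1 doubles distances measured along f1
upweighted <- weighted_euclidean(toy_df, c(2, 1))

unweighted[1, 2] # 3
upweighted[1, 2] # 6
upweighted[1, 3] # 4, unchanged: observations 1 and 3 differ only in f2
```

Each feature column is multiplied by its weight before distances are computed, so a weight above 1 makes differences in that feature count for more and a weight of 1 leaves them untouched.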
Customizing an SNF config
When no arguments beyond the number of solutions are specified, snf_config() randomly varies the values in the settings data frame.
# Specifying only the number of solutions
sc <- snf_config(
  dl,
  n_solutions = 100
)
## ℹ No distance functions specified. Using defaults.
## ℹ No clustering functions specified. Using defaults.
sc
## Settings Data Frame:
## 1 2 3 4 5 6 7 8 9 10
## SNF hyperparameters:
## alpha 0.6 0.8 0.3 0.7 0.7 0.8 0.6 0.7 0.8 0.3
## k 27 84 50 57 16 80 65 32 95 56
## t 20 20 20 20 20 20 20 20 20 20
## SNF scheme:
## 2 2 3 3 1 3 2 1 3 2
## Clustering functions:
## 2 1 1 1 1 1 1 1 1 2
## Distance functions:
## CNT 1 1 1 1 1 1 1 1 1 1
## DSC 1 1 1 1 1 1 1 1 1 1
## ORD 1 1 1 1 1 1 1 1 1 1
## CAT 1 1 1 1 1 1 1 1 1 1
## MIX 1 1 1 1 1 1 1 1 1 1
## Component dropout:
## cortical_thickness ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔
## cortical_surface_area ✔ ✔ ✔ ✔ ✔ ✖ ✔ ✔ ✔ ✔
## subcortical_volume ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔
## household_income ✔ ✔ ✔ ✖ ✔ ✔ ✔ ✖ ✔ ✔
## pubertal_status ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔
## …and settings defined to create 90 more cluster solutions.
## Distance Functions List:
## Continuous (1):
## [1] euclidean_distance
## Discrete (1):
## [1] euclidean_distance
## Ordinal (1):
## [1] euclidean_distance
## Categorical (1):
## [1] gower_distance
## Mixed (1):
## [1] gower_distance
## Clustering Functions List:
## [1] spectral_eigen
## [2] spectral_rot
## Weights Matrix:
## Weights defined for 100 cluster solutions.
## $ mrisdp_1 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ mrisdp_2 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ mrisdp_3 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ mrisdp_4 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ mrisdp_5 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## …and 329 more features.
Alpha, k, and t
You can control any of these parameters either by providing a vector of values you’d like to randomly sample from or by specifying a minimum and maximum range.
# Through minimums and maximums
sc <- snf_config(
  dl,
  n_solutions = 100,
  min_k = 10,
  max_k = 60,
  min_alpha = 0.3,
  max_alpha = 0.8
)
## ℹ No distance functions specified. Using defaults.
## ℹ No clustering functions specified. Using defaults.
# Through specific value sampling
sc <- snf_config(
  dl,
  n_solutions = 20,
  k_values = c(10, 25, 50),
  alpha_values = c(0.4, 0.8)
)
## ℹ No distance functions specified. Using defaults.
## ℹ No clustering functions specified. Using defaults.
Inclusion columns and data frame dropout
You can control both the bounds on how many input data frames are removed and the way in which that number is chosen.
By default, the settings_df generated during the call to snf_config() will drop a random number of data frames, between 0 and one less than the total number of available data frames in the data list, based on an exponential probability distribution. In the printed output, a dropped data frame (inclusion value 0) appears as a red X and an included data frame (inclusion value 1) appears as a green checkmark. The exponential distribution makes it very likely that a small number of data frames will be dropped and much less likely that a large number of data frames will be dropped.
You can control the distribution by changing the dropout_dist value to “uniform” (which will result in a much higher number of data frames being dropped on average) or “none” (which will result in no data frames being dropped).
# Exponential dropping
sc <- snf_config(
  dl,
  n_solutions = 20,
  dropout_dist = "exponential" # the default behaviour
)
## ℹ No distance functions specified. Using defaults.
## ℹ No clustering functions specified. Using defaults.
sc
## Settings Data Frame:
## 1 2 3 4 5 6 7 8 9 10
## SNF hyperparameters:
## alpha 0.4 0.7 0.5 0.8 0.5 0.4 0.8 0.3 0.4 0.5
## k 46 64 27 33 74 32 85 63 10 17
## t 20 20 20 20 20 20 20 20 20 20
## SNF scheme:
## 2 2 1 2 1 1 3 3 1 3
## Clustering functions:
## 2 2 1 1 1 2 2 2 1 1
## Distance functions:
## CNT 1 1 1 1 1 1 1 1 1 1
## DSC 1 1 1 1 1 1 1 1 1 1
## ORD 1 1 1 1 1 1 1 1 1 1
## CAT 1 1 1 1 1 1 1 1 1 1
## MIX 1 1 1 1 1 1 1 1 1 1
## Component dropout:
## cortical_thickness ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔
## cortical_surface_area ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔
## subcortical_volume ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔
## household_income ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔
## pubertal_status ✔ ✔ ✔ ✖ ✔ ✔ ✔ ✔ ✔ ✔
## …and settings defined to create 10 more cluster solutions.
## Distance Functions List:
## Continuous (1):
## [1] euclidean_distance
## Discrete (1):
## [1] euclidean_distance
## Ordinal (1):
## [1] euclidean_distance
## Categorical (1):
## [1] gower_distance
## Mixed (1):
## [1] gower_distance
## Clustering Functions List:
## [1] spectral_eigen
## [2] spectral_rot
## Weights Matrix:
## Weights defined for 20 cluster solutions.
## $ mrisdp_1 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ mrisdp_2 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ mrisdp_3 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ mrisdp_4 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ mrisdp_5 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## …and 329 more features.
# Uniform dropping
sc <- snf_config(
  dl,
  n_solutions = 20,
  dropout_dist = "uniform"
)
## ℹ No distance functions specified. Using defaults.
## ℹ No clustering functions specified. Using defaults.
sc
## Settings Data Frame:
## 1 2 3 4 5 6 7 8 9 10
## SNF hyperparameters:
## alpha 0.4 0.5 0.4 0.8 0.8 0.3 0.5 0.4 0.6 0.3
## k 23 68 14 50 91 61 95 61 86 26
## t 20 20 20 20 20 20 20 20 20 20
## SNF scheme:
## 2 3 1 3 2 2 3 2 1 2
## Clustering functions:
## 2 1 1 2 2 1 1 2 1 1
## Distance functions:
## CNT 1 1 1 1 1 1 1 1 1 1
## DSC 1 1 1 1 1 1 1 1 1 1
## ORD 1 1 1 1 1 1 1 1 1 1
## CAT 1 1 1 1 1 1 1 1 1 1
## MIX 1 1 1 1 1 1 1 1 1 1
## Component dropout:
## cortical_thickness ✔ ✖ ✔ ✖ ✔ ✔ ✖ ✖ ✔ ✔
## cortical_surface_area ✔ ✔ ✔ ✖ ✖ ✔ ✖ ✔ ✔ ✔
## subcortical_volume ✔ ✖ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔
## household_income ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔
## pubertal_status ✔ ✖ ✔ ✔ ✔ ✔ ✔ ✖ ✔ ✔
## …and settings defined to create 10 more cluster solutions.
## Distance Functions List:
## Continuous (1):
## [1] euclidean_distance
## Discrete (1):
## [1] euclidean_distance
## Ordinal (1):
## [1] euclidean_distance
## Categorical (1):
## [1] gower_distance
## Mixed (1):
## [1] gower_distance
## Clustering Functions List:
## [1] spectral_eigen
## [2] spectral_rot
## Weights Matrix:
## Weights defined for 20 cluster solutions.
## $ mrisdp_1 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ mrisdp_2 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ mrisdp_3 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ mrisdp_4 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ mrisdp_5 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## …and 329 more features.
# No dropping
sc <- snf_config(
  dl,
  n_solutions = 20,
  dropout_dist = "none"
)
## ℹ No distance functions specified. Using defaults.
## ℹ No clustering functions specified. Using defaults.
sc
## Settings Data Frame:
## 1 2 3 4 5 6 7 8 9 10
## SNF hyperparameters:
## alpha 0.8 0.4 0.8 0.4 0.7 0.6 0.8 0.4 0.3 0.7
## k 80 53 27 67 63 11 53 14 33 27
## t 20 20 20 20 20 20 20 20 20 20
## SNF scheme:
## 1 2 1 1 1 3 2 2 1 2
## Clustering functions:
## 1 2 2 1 2 1 2 1 2 2
## Distance functions:
## CNT 1 1 1 1 1 1 1 1 1 1
## DSC 1 1 1 1 1 1 1 1 1 1
## ORD 1 1 1 1 1 1 1 1 1 1
## CAT 1 1 1 1 1 1 1 1 1 1
## MIX 1 1 1 1 1 1 1 1 1 1
## Component dropout:
## cortical_thickness ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔
## cortical_surface_area ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔
## subcortical_volume ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔
## household_income ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔
## pubertal_status ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔
## …and settings defined to create 10 more cluster solutions.
## Distance Functions List:
## Continuous (1):
## [1] euclidean_distance
## Discrete (1):
## [1] euclidean_distance
## Ordinal (1):
## [1] euclidean_distance
## Categorical (1):
## [1] gower_distance
## Mixed (1):
## [1] gower_distance
## Clustering Functions List:
## [1] spectral_eigen
## [2] spectral_rot
## Weights Matrix:
## Weights defined for 20 cluster solutions.
## $ mrisdp_1 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ mrisdp_2 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ mrisdp_3 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ mrisdp_4 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ mrisdp_5 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## …and 329 more features.
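The difference between the three dropout_dist options can be made concrete by comparing how many data frames each would drop on average. The sketch below is base R only and does not reproduce metasnf's internal sampling: the exponential weights exp(-n) are an assumed, illustrative choice, and all variable names are hypothetical.

```r
n_dfs <- 5                  # data frames in the data list
n_dropped <- 0:(n_dfs - 1)  # at least one data frame must remain

# Assumed weighting: probability decays exponentially with the number dropped
p_exponential <- exp(-n_dropped) / sum(exp(-n_dropped))
# Uniform: every allowed number of dropped data frames is equally likely
p_uniform <- rep(1 / length(n_dropped), length(n_dropped))
# None: always drop zero data frames
p_none <- c(1, rep(0, n_dfs - 1))

# Expected number of dropped data frames under each distribution
expected_dropped <- c(
  exponential = sum(n_dropped * p_exponential),
  uniform = sum(n_dropped * p_uniform),
  none = sum(n_dropped * p_none)
)
round(expected_dropped, 2)
```

Under these assumptions the exponential option drops about half a data frame per solution on average, while the uniform option drops two, which matches the much denser pattern of dropped components in the uniform output above.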
The bounds on the number of data frames that can be dropped can be controlled using the min_removed_inputs and max_removed_inputs arguments:
sc <- snf_config(
  dl,
  n_solutions = 20,
  min_removed_inputs = 3
)
## ℹ No distance functions specified. Using defaults.
## ℹ No clustering functions specified. Using defaults.
# No row will exclude fewer than 3 data frames during SNF
sc
## Settings Data Frame:
## 1 2 3 4 5 6 7 8 9 10
## SNF hyperparameters:
## alpha 0.5 0.7 0.8 0.3 0.6 0.4 0.4 0.7 0.3 0.3
## k 82 44 24 56 96 10 94 27 34 61
## t 20 20 20 20 20 20 20 20 20 20
## SNF scheme:
## 1 1 1 2 2 3 2 1 2 2
## Clustering functions:
## 1 2 2 1 1 1 2 1 1 1
## Distance functions:
## CNT 1 1 1 1 1 1 1 1 1 1
## DSC 1 1 1 1 1 1 1 1 1 1
## ORD 1 1 1 1 1 1 1 1 1 1
## CAT 1 1 1 1 1 1 1 1 1 1
## MIX 1 1 1 1 1 1 1 1 1 1
## Component dropout:
## cortical_thickness ✖ ✔ ✖ ✖ ✖ ✖ ✖ ✔ ✔ ✖
## cortical_surface_area ✖ ✖ ✔ ✖ ✖ ✔ ✖ ✔ ✖ ✔
## subcortical_volume ✔ ✖ ✖ ✔ ✔ ✖ ✔ ✖ ✔ ✖
## household_income ✔ ✖ ✖ ✔ ✔ ✖ ✔ ✖ ✖ ✖
## pubertal_status ✖ ✔ ✔ ✖ ✖ ✔ ✖ ✖ ✖ ✔
## …and settings defined to create 10 more cluster solutions.
## Distance Functions List:
## Continuous (1):
## [1] euclidean_distance
## Discrete (1):
## [1] euclidean_distance
## Ordinal (1):
## [1] euclidean_distance
## Categorical (1):
## [1] gower_distance
## Mixed (1):
## [1] gower_distance
## Clustering Functions List:
## [1] spectral_eigen
## [2] spectral_rot
## Weights Matrix:
## Weights defined for 20 cluster solutions.
## $ mrisdp_1 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ mrisdp_2 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ mrisdp_3 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ mrisdp_4 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ mrisdp_5 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## …and 329 more features.
Grid searching
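If you are interested in searching over just a specific set of alpha and k values, you can restrict those parameters to the values of interest and turn off data frame dropout. Note that snf_config() still samples settings at random rather than enumerating every combination, and the SNF scheme and clustering function continue to vary, as the output below shows: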
sc <- snf_config(
  dl,
  n_solutions = 10,
  alpha_values = c(0.3, 0.5, 0.8),
  k_values = c(20, 40, 60),
  dropout_dist = "none"
)
## ℹ No distance functions specified. Using defaults.
## ℹ No clustering functions specified. Using defaults.
sc
## Settings Data Frame:
## 1 2 3 4 5 6 7 8 9 10
## SNF hyperparameters:
## alpha 0.8 0.5 0.3 0.8 0.3 0.3 0.3 0.3 0.8 0.3
## k 40 20 40 40 60 40 20 20 20 40
## t 20 20 20 20 20 20 20 20 20 20
## SNF scheme:
## 3 3 3 3 2 2 3 2 2 2
## Clustering functions:
## 2 2 2 1 1 1 1 1 2 2
## Distance functions:
## CNT 1 1 1 1 1 1 1 1 1 1
## DSC 1 1 1 1 1 1 1 1 1 1
## ORD 1 1 1 1 1 1 1 1 1 1
## CAT 1 1 1 1 1 1 1 1 1 1
## MIX 1 1 1 1 1 1 1 1 1 1
## Component dropout:
## cortical_thickness ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔
## cortical_surface_area ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔
## subcortical_volume ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔
## household_income ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔
## pubertal_status ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔
## Distance Functions List:
## Continuous (1):
## [1] euclidean_distance
## Discrete (1):
## [1] euclidean_distance
## Ordinal (1):
## [1] euclidean_distance
## Categorical (1):
## [1] gower_distance
## Mixed (1):
## [1] gower_distance
## Clustering Functions List:
## [1] spectral_eigen
## [2] spectral_rot
## Weights Matrix:
## Weights defined for 10 cluster solutions.
## $ mrisdp_1 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
## $ mrisdp_2 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
## $ mrisdp_3 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
## $ mrisdp_4 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
## $ mrisdp_5 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
## …and 329 more features.
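Because the settings are sampled, a small number of solutions may not cover every alpha and k combination. The base R sketch below enumerates the full 3 x 3 grid with expand.grid() and compares it with the alpha and k values copied from the printed output above. The variable names are introduced only for this illustration.

```r
alpha_values <- c(0.3, 0.5, 0.8)
k_values <- c(20, 40, 60)

# Every alpha/k combination in the full grid
grid <- expand.grid(alpha = alpha_values, k = k_values)
nrow(grid) # 9

# alpha and k rows copied from the settings data frame printed above
sampled <- data.frame(
  alpha = c(0.8, 0.5, 0.3, 0.8, 0.3, 0.3, 0.3, 0.3, 0.8, 0.3),
  k = c(40, 20, 40, 40, 60, 40, 20, 20, 20, 40)
)

# Grid combinations that were never sampled
grid_keys <- paste(grid$alpha, grid$k)
sampled_keys <- unique(paste(sampled$alpha, sampled$k))
missing_combos <- grid[!grid_keys %in% sampled_keys, ]
missing_combos
```

In this particular run, three of the nine combinations never appear, so requesting more solutions (or checking coverage as above) is worthwhile when full coverage of the grid matters.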
Assembling an SNF config in pieces
Rather than varying everything equally all at once, you may be interested in looking at “chunks” of solution spaces that are based on distinct SNF configs. For example, you may want to look at 25 solutions generated with k = 50 and look at another 25 solutions generated with k = 80. You can build two separate SNF configs and join them using the merge() function.
set.seed(42)
sc_1 <- snf_config(
  dl,
  n_solutions = 25,
  k_values = 50
)
## ℹ No distance functions specified. Using defaults.
## ℹ No clustering functions specified. Using defaults.
sc_2 <- snf_config(
  dl,
  n_solutions = 25,
  k_values = 80
)
## ℹ No distance functions specified. Using defaults.
## ℹ No clustering functions specified. Using defaults.
full_sc <- merge(sc_1, sc_2)
“settings_df building failed to converge”
snf_config() will never build duplicate rows. A consequence of this is that if you request a very large number of rows over a very small range of possible values to vary over, it will be impossible for the settings data frame to be built. For example, there’s no way to generate 10 unique rows when the only varying parameter is which of two clustering algorithms is used: only 2 rows could ever be created. If you encounter the error “Matrix building failed”, try to generate fewer rows or to be a little less strict with what values are allowed.
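A quick way to reason about this limit is to count how many distinct rows a set of allowed values can produce. The base R sketch below is only a counting aid with hypothetical variable names. It does not call metasnf or reproduce its row-building logic.

```r
requested_rows <- 10

# Only the clustering function (1 or 2) is allowed to vary
narrow_space <- expand.grid(clust_alg = 1:2)
n_narrow <- nrow(narrow_space)
n_narrow >= requested_rows # FALSE: 10 unique rows cannot exist

# Letting alpha and k vary as well enlarges the space of unique rows
wider_space <- expand.grid(
  clust_alg = 1:2,
  alpha = c(0.4, 0.8),
  k = c(10, 25, 50)
)
n_wider <- nrow(wider_space)
n_wider >= requested_rows # TRUE: 12 distinct combinations are available
```

If the count of possible combinations is below or only slightly above the number of requested solutions, lowering n_solutions or widening the allowed values is the safer choice.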