Run SNF clustering pipeline on a list of subsampled data lists

Usage

batch_snf_subsamples(
  dl_subsamples,
  sc,
  processes = 1,
  return_sim_mats = FALSE,
  sim_mats_dir = NULL
)

Arguments

dl_subsamples

A list of subsampled data lists. This object is generated by the function subsample_dl().

sc

An snf_config class object that stores all sets of hyperparameters used to transform the data in each data list into cluster solutions. See ?settings_df or https://branchlab.github.io/metasnf/articles/settings_df.html for more details.

processes

Number of processes used to complete the SNF iterations:

  • 1 (default) Sequential processing: function will iterate through the settings_df one row at a time with a for loop. This option will not make use of multiple CPU cores, but will show a progress bar.

  • 2 or higher: Parallel processing: function will use future.apply::future_apply() to distribute the SNF iterations across the specified number of CPU cores. If the value is higher than the number of available cores, a warning will be raised and the maximum number of cores will be used.

  • "max": All available cores will be used.
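
The rules above for resolving processes can be sketched in a few lines of R. This is only an illustration of the documented behaviour, not the package's own code: resolve_processes() and its available argument are hypothetical names introduced here, and the core count is taken from parallel::detectCores() as an assumption.

```r
# Hypothetical helper (not part of metasnf): mirrors the documented rules
# for turning the `processes` argument into a number of workers.
resolve_processes <- function(processes = 1,
                              available = parallel::detectCores()) {
    # detectCores() can return NA on some platforms; fall back to one core.
    if (is.na(available)) {
        available <- 1L
    }
    # "max" means: use every available core.
    if (identical(processes, "max")) {
        return(as.integer(available))
    }
    if (!is.numeric(processes) || length(processes) != 1 || processes < 1) {
        stop("`processes` must be a positive whole number or \"max\".")
    }
    # Asking for more than is available: warn and cap at the maximum.
    if (processes > available) {
        warning(
            "Requested ", processes, " processes but only ", available,
            " cores are available; using ", available, "."
        )
        return(as.integer(available))
    }
    as.integer(processes)
}

resolve_processes(1, available = 4)                      # 1 (sequential)
resolve_processes("max", available = 4)                  # 4
suppressWarnings(resolve_processes(16, available = 4))   # 4, after a warning
```

Passing available explicitly keeps the sketch deterministic; in practice the default would query the machine it runs on.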

return_sim_mats

If TRUE, function will return a list where the first element is the solutions data frame and the second element is a list of similarity matrices for each row in the sol_df. Default FALSE.

sim_mats_dir

If specified, this directory will be used to save all generated similarity matrices.

Value

By default, returns a one-element list, cluster_solutions, which is itself a list of cluster solution data frames, one for each of the provided data list subsamples. Setting return_sim_mats to TRUE extends the returned list with the corresponding solutions data frames and the final fused similarity matrices of those cluster solutions, should you require these objects for your own stability calculations.

Examples

# \donttest{
my_dl <- data_list(
    list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
    list(income, "household_income", "demographics", "continuous"),
    list(pubertal, "pubertal_status", "demographics", "continuous"),
    uid = "unique_id"
)
#>  175 observations dropped due to incomplete data.

sc <- snf_config(my_dl, n_solutions = 5, max_k = 40)
#>  No distance functions specified. Using defaults.
#>  No clustering functions specified. Using defaults.

my_dl_subsamples <- subsample_dl(
    my_dl,
    n_subsamples = 20,
    subsample_fraction = 0.85
)

batch_subsample_results <- batch_snf_subsamples(
    my_dl_subsamples,
    sc
)
# }
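
As a further sketch, the example below requests the similarity matrices as well and then inspects the result. It assumes the metasnf package and its bundled example data frames (subc_v, income, pubertal) are available, and it uses only the arguments documented above; the smaller n_subsamples value and the object names are choices made here to keep the run short. The element name cluster_solutions is taken from the Value section, while the names of any additional elements are not assumed.

```r
library(metasnf)

# Same data list and config as in the example above.
my_dl <- data_list(
    list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
    list(income, "household_income", "demographics", "continuous"),
    list(pubertal, "pubertal_status", "demographics", "continuous"),
    uid = "unique_id"
)

sc <- snf_config(my_dl, n_solutions = 5, max_k = 40)

# Fewer subsamples than above, purely to keep this sketch quick.
my_dl_subsamples <- subsample_dl(
    my_dl,
    n_subsamples = 5,
    subsample_fraction = 0.85
)

# Also return the fused similarity matrices for each subsample.
full_results <- batch_snf_subsamples(
    my_dl_subsamples,
    sc,
    return_sim_mats = TRUE
)

# List the elements that were returned.
names(full_results)

# One set of cluster solutions per subsampled data list.
length(full_results$cluster_solutions)
```

Calling names() first is a deliberate choice: it shows which extra elements are present without hard-coding names that the documentation above does not spell out.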