Run variations of SNF — batch

This is the core function of the metasnf package. Using the information stored in a settings_df (see ?settings_df) and a data list (see ?data_list), it runs repeated, complete SNF pipelines to generate a broad space of post-SNF cluster solutions.

Usage

batch_snf(dl, sc, processes = 1, return_sim_mats = FALSE, sim_mats_dir = NULL)

Arguments

dl

A nested list of input data from data_list().

sc

An snf_config class object that stores all sets of hyperparameters used to transform the data in dl into cluster solutions. See ?settings_df or https://branchlab.github.io/metasnf/articles/settings_df.html for more details.

processes

The number of processes used to complete SNF iterations:

1 (default): Sequential processing. The function iterates through the settings_df one row at a time with a for loop. This option does not make use of multiple CPU cores, but does show a progress bar.
2 or higher: Parallel processing. The function uses future.apply::future_apply to distribute the SNF iterations across the specified number of CPU cores. If the value is higher than the number of available cores, a warning is raised and the maximum number of cores is used.
max: All available cores are used.
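
The rules above can be sketched in base R. This is only an illustration of the documented behaviour, not the code metasnf itself uses: resolve_processes is a hypothetical helper, and it assumes that the max option is passed as the string "max".

```r
# Hypothetical helper, not part of metasnf: maps a `processes` value to a
# worker count following the rules described for the `processes` argument.
resolve_processes <- function(processes, available = parallel::detectCores()) {
    # detectCores() can return NA on some platforms; fall back to one core.
    if (is.na(available) || available < 1) {
        available <- 1L
    }
    # Assumption: "max" is supplied as a character string.
    if (identical(processes, "max")) {
        return(as.integer(available))
    }
    if (!is.numeric(processes) || length(processes) != 1 ||
        is.na(processes) || processes < 1) {
        stop("`processes` must be a positive number or \"max\".")
    }
    # Requests above the available core count are capped, with a warning.
    if (processes > available) {
        warning("Requested more processes than available cores; using all cores.")
        return(as.integer(available))
    }
    as.integer(processes)
}

resolve_processes(1, available = 4)      # sequential: a single worker
resolve_processes("max", available = 4)  # every available core
```

Passing available explicitly, as in the two calls above, keeps the sketch independent of the machine it runs on.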

return_sim_mats

If TRUE, the function returns a list whose first element is the solutions data frame and whose second element is a list of similarity matrices, one for each row of the solutions data frame. Default: FALSE.

sim_mats_dir

If specified, this directory will be used to save all generated similarity matrices.
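
A minimal sketch of how sim_mats_dir might be used, reusing the data list from the Examples section. The output location is a hypothetical temporary directory, creating it beforehand is a precaution rather than a documented requirement, and the names and format of the saved files are not described here, so the sketch simply lists whatever was written. The block only runs when metasnf is installed, and it assumes gender_df and diagnosis_df are available once the package is attached.

```r
# Hypothetical output location; any writable directory could be used instead.
out_dir <- file.path(tempdir(), "sim_mats")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

if (requireNamespace("metasnf", quietly = TRUE)) {
    library(metasnf)

    # Same data list and config as in the Examples section.
    input_dl <- data_list(
        list(gender_df, "gender", "demographics", "categorical"),
        list(diagnosis_df, "diagnosis", "clinical", "categorical"),
        uid = "patient_id"
    )
    sc <- snf_config(input_dl, n_solutions = 3)

    # Ask batch_snf to save the similarity matrices it generates to out_dir.
    sol_df <- batch_snf(input_dl, sc, sim_mats_dir = out_dir)

    # Inspect what was written; file names and format are not assumed here.
    print(list.files(out_dir))
}
```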

Value

By default, returns a solutions data frame (class "data.frame"): a data frame containing one row for every row of the provided settings data frame, all the original columns of that settings data frame, and new columns containing the assigned cluster of each observation from the cluster solution derived by that row's settings. If return_sim_mats is TRUE, the function will instead return a list containing the solutions data frame as well as a list of the final similarity matrices (class "matrix") generated by SNF for each row of the settings data frame. If suppress_clustering is TRUE, the solutions data frame will not be returned in the output.

Examples

input_dl <- data_list(
    list(gender_df, "gender", "demographics", "categorical"),
    list(diagnosis_df, "diagnosis", "clinical", "categorical"),
    uid = "patient_id"
)

sc <- snf_config(input_dl, n_solutions = 3)
#> ℹ No distance functions specified. Using defaults.
#> ℹ No clustering functions specified. Using defaults.

# A solutions data frame without similarity matrices:
sol_df <- batch_snf(input_dl, sc)

# A solutions data frame with similarity matrices:
sol_df <- batch_snf(input_dl, sc, return_sim_mats = TRUE)
sim_mats_list(sol_df)
#> A similarity matrix list storing 3 200x200 similarity matrices.
#> Use `sim_mats_list[[i]]` to view the ith matrix.
#>