Calculate coclustering data.

Usage

calculate_coclustering(subsample_solutions, sol_df, verbose = FALSE)

Arguments

subsample_solutions

A list containing cluster solutions from distinct subsamples of the data. This object is generated by the function batch_snf_subsamples(). These solutions should correspond to the ones in the solutions data frame.

sol_df

A solutions data frame. This object is generated by the function batch_snf(). The solutions in the solutions data frame should correspond to those in the subsample solutions.

verbose

If TRUE, print estimates of the time remaining to the console.

Value

A list containing the following components:

  • cocluster_dfs: A list of data frames, one per cluster solution. For every pair of observations in the original cluster solution, each data frame shows the number of times the pair occurred in the same subsample, the number of times the pair clustered together in a subsample, and the corresponding fraction of shared subsamples in which the pair clustered together.

  • cocluster_ss_mats: The number of times every pair of observations occurred in the same subsample, formatted as a pairwise matrix.

  • cocluster_sc_mats: The number of times every pair of observations occurred in the same cluster, formatted as a pairwise matrix.

  • cocluster_cf_mats: The fraction of times every pair of observations occurred in the same cluster, formatted as a pairwise matrix.

  • cocluster_summary: Among pairs of observations that clustered together in the original full cluster solution, the fraction of those pairs that remained clustered together across the subsample solutions. This information is formatted as a data frame with one row per cluster solution.
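
The quantities described above can be illustrated with a minimal base R sketch. This is not the package's implementation: the toy labels, the object names (full_solution, toy_subsamples, same_subsample, same_cluster, cocluster_fraction, mean_fraction) and the final averaging step are all assumptions made for illustration, and the actual summary statistic may be defined differently.

```r
# Hypothetical full-data cluster labels for five observations.
full_solution <- c(a = 1, b = 1, c = 2, d = 2, e = 2)

# Hypothetical cluster labels from three subsamples; each subsample
# contains only some of the observations.
toy_subsamples <- list(
    c(a = 1, b = 1, c = 2, d = 2),
    c(a = 1, b = 2, c = 2, e = 2),
    c(b = 1, c = 2, d = 2, e = 2)
)

ids <- names(full_solution)
n <- length(ids)

# Pairwise count matrices, initialised to zero.
same_subsample <- matrix(0, n, n, dimnames = list(ids, ids))
same_cluster <- same_subsample

for (labels in toy_subsamples) {
    present <- names(labels)
    # Every pair drawn into this subsample gains one "same subsample" count.
    same_subsample[present, present] <- same_subsample[present, present] + 1
    # Pairs given the same label gain one "same cluster" count.
    same_cluster[present, present] <- same_cluster[present, present] +
        outer(labels, labels, "==")
}

# Fraction of shared subsamples in which each pair clustered together.
# Pairs that never share a subsample would give NaN (0 / 0).
cocluster_fraction <- same_cluster / same_subsample

# One possible summary (an assumption): the average fraction among pairs
# that clustered together in the full solution, counting each pair once.
together <- outer(full_solution, full_solution, "==") &
    upper.tri(same_cluster)
mean_fraction <- mean(cocluster_fraction[together])
```

In this toy data, observations a and b share two subsamples and cluster together in one of them, so their coclustering fraction is 0.5, while c and d cluster together in both of the subsamples they share.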

Examples

# my_dl <- data_list(
#     list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
#     list(income, "household_income", "demographics", "continuous"),
#     list(pubertal, "pubertal_status", "demographics", "continuous"),
#     uid = "unique_id"
# )
# 
# sc <- snf_config(my_dl, n_solutions = 5, max_k = 40)
# 
# sol_df <- batch_snf(my_dl, sc)
# 
# my_dl_subsamples <- subsample_dl(
#     my_dl,
#     n_subsamples = 20,
#     subsample_fraction = 0.85
# )
# 
# batch_subsample_results <- batch_snf_subsamples(
#     my_dl_subsamples,
#     sc,
#     verbose = TRUE
# )
# 
# coclustering_results <- calculate_coclustering(
#     batch_subsample_results,
#     sol_df,
#     verbose = TRUE
# )