Download a copy of the vignette to follow along here: feature_weights.Rmd

Generating and Using the Weights Matrix

Every distance metric used in metasnf can apply custom weights to the included features. The code below shows how to generate and use a weights_matrix (a matrix of feature weights, with one column per feature and one row per settings_matrix row).

library(metasnf)

# Make sure to throw in all the data you're interested in visualizing for this
# data_list, including out-of-model measures and confounding variables.
data_list <- generate_data_list(
    list(abcd_h_income, "household_income", "demographics", "ordinal"),
    list(abcd_pubertal, "pubertal_status", "demographics", "continuous"),
    list(abcd_colour, "favourite_colour", "demographics", "categorical"),
    list(abcd_anxiety, "anxiety", "behaviour", "ordinal"),
    list(abcd_depress, "depressed", "behaviour", "ordinal"),
    uid = "patient"
)

summarize_dl(data_list)
#>                              name        type       domain length width
#> household_income household_income     ordinal demographics    136     2
#> pubertal_status   pubertal_status  continuous demographics    136     2
#> favourite_colour favourite_colour categorical demographics    136     2
#> anxiety                   anxiety     ordinal    behaviour    136     2
#> depressed               depressed     ordinal    behaviour    136     2

settings_matrix <- generate_settings_matrix(
    data_list,
    nrow = 20,
    min_k = 20,
    max_k = 50,
    seed = 42
)
#> [1] "The global seed has been changed!"

weights_matrix <- generate_weights_matrix(
    data_list,
    nrow = 20
)

head(weights_matrix)
#>      household_income pubertal_status colour cbcl_anxiety_r cbcl_depress_r
#> [1,]                1               1      1              1              1
#> [2,]                1               1      1              1              1
#> [3,]                1               1      1              1              1
#> [4,]                1               1      1              1              1
#> [5,]                1               1      1              1              1
#> [6,]                1               1      1              1              1

By default, the weights are all 1. This is what batch_snf uses when no weights_matrix is supplied.

If you have custom feature weights you’d like to use, you can populate this matrix manually. There is one column per feature (column order does not matter), and the number of rows should match the number of rows in the settings_matrix.
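As a minimal sketch of manual population, the base R snippet below builds a weights matrix by hand. It assumes that weights_matrix behaves like a plain numeric matrix with one named column per feature, as the head() output above suggests. The names custom_weights and n_settings are hypothetical, and the specific weight values are arbitrary illustrations.

```r
# Feature names copied from the head(weights_matrix) output above.
feature_names <- c(
    "household_income", "pubertal_status", "colour",
    "cbcl_anxiety_r", "cbcl_depress_r"
)

# Stand-in for nrow(settings_matrix); 20 matches the example above.
n_settings <- 20

# Start from the default of all 1s, one row per settings_matrix row.
custom_weights <- matrix(
    1,
    nrow = n_settings,
    ncol = length(feature_names),
    dimnames = list(NULL, feature_names)
)

# Double the weight of anxiety in every row (arbitrary illustrative value).
custom_weights[, "cbcl_anxiety_r"] <- 2

# Halve the weight of favourite colour in the first 10 rows only.
custom_weights[1:10, "colour"] <- 0.5

# Sanity check: the row count matches the number of settings rows.
stopifnot(nrow(custom_weights) == n_settings)
```

Indexing columns by name rather than by position keeps the assignments correct regardless of column order.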

If you are just looking to broaden the space of cluster solutions you generate, you can use some of the built-in randomization options for the weights:

# Random uniformly distributed values
generate_weights_matrix(
    data_list,
    nrow = 5,
    fill = "uniform"
)
#>      household_income pubertal_status    colour cbcl_anxiety_r cbcl_depress_r
#> [1,]       0.08161542       0.3198375 0.8328815      0.9943410      0.3955367
#> [2,]       0.40378037       0.4627980 0.3132912      0.7119147      0.9593465
#> [3,]       0.83551451       0.9353873 0.2794196      0.4951427      0.1132382
#> [4,]       0.59499701       0.5917005 0.7100717      0.8079317      0.2355968
#> [5,]       0.35140389       0.5460431 0.3481677      0.5611197      0.5104740

# Random exponentially distributed values
generate_weights_matrix(
    data_list,
    nrow = 5,
    fill = "exponential"
)
#>      household_income pubertal_status    colour cbcl_anxiety_r cbcl_depress_r
#> [1,]        0.5123907       0.1624127 1.6042481      3.7447548     1.53441037
#> [2,]        3.9471338       0.4178442 0.2354796      0.3647522     0.22186034
#> [3,]        0.4215409       0.2394908 0.1519102      0.8262260     0.03348363
#> [4,]        1.4107604       2.3230736 2.0428148      0.2279961     0.48877057
#> [5,]        0.1756311       0.5256458 1.3623835      0.1072554     0.24304379
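
If neither built-in fill suits your needs, one option is to draw the random weights yourself and populate the matrix manually. The sketch below uses base R only. It assumes a plain numeric matrix with named feature columns is an acceptable weights_matrix, and the range of 0.5 to 1.5 is an arbitrary choice for illustration, not a metasnf default.

```r
set.seed(42)

# Feature names copied from the output above.
feature_names <- c(
    "household_income", "pubertal_status", "colour",
    "cbcl_anxiety_r", "cbcl_depress_r"
)
n_rows <- 5

# Uniform weights restricted to [0.5, 1.5] so that no feature is
# weighted close to zero.
random_weights <- matrix(
    runif(n_rows * length(feature_names), min = 0.5, max = 1.5),
    nrow = n_rows,
    dimnames = list(NULL, feature_names)
)

head(random_weights)
```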

Once you’re happy with your weights_matrix, you can pass it into batch_snf:

batch_snf(
    data_list = data_list,
    settings_matrix = settings_matrix,
    weights_matrix = weights_matrix
)

The Nitty Gritty of How Weights are Used

How the weights are applied during distance matrix calculation depends on the distance metric used, which you can learn more about in the distance metrics vignette.
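
To give a feel for what weighting can do inside a distance calculation, here is a minimal sketch of one common form, a weighted Euclidean distance, written in base R. It is an illustration under assumptions rather than metasnf's own implementation: the helper weighted_euclidean and the toy observations are hypothetical, and the actual formulas are described in the distance metrics vignette.

```r
# Weighted Euclidean distance: sqrt(sum_j w_j * (x_j - y_j)^2).
# Scaling each column by sqrt(w_j) and then taking the ordinary
# Euclidean distance gives the same result.
weighted_euclidean <- function(x, weights) {
    stopifnot(length(weights) == ncol(x))
    scaled <- sweep(x, 2, sqrt(weights), `*`)
    as.matrix(dist(scaled))
}

# Three toy observations measured on two features.
obs <- matrix(
    c(0, 0,
      3, 0,
      0, 2),
    ncol = 2,
    byrow = TRUE,
    dimnames = list(c("A", "B", "C"), c("income", "anxiety"))
)

# Equal weights: C is A's nearest neighbour.
equal <- weighted_euclidean(obs, c(1, 1))

# Down-weighting income shrinks the A-B distance, making B the
# nearest neighbour of A instead.
down <- weighted_euclidean(obs, c(0.1, 1))

equal["A", c("B", "C")]
down["A", c("B", "C")]
```

The point of the example is that weights change which observations count as similar only when several features enter the same distance calculation.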

To know precisely how your weights are being used, you also need to understand the SNF schemes. Depending on which scheme is specified in a settings_matrix row, the variable columns involved in each distance matrix calculation can differ substantially.

For example, in the domain scheme, all variables of the same domain are concatenated prior to distance matrix calculation. If any of your domains contain multiple types of variables (e.g., continuous and categorical), the mixed distance metric (Gower’s method by default) will be used for those domains. Weights will still be applied, but only within each domain’s distance calculation: they change the relative contributions of variables inside a domain, not across domains.

Here’s a more concrete example of how data setup and SNF scheme can influence the variable weighting process: consider generating a data_list where every input dataframe contains only one input variable. If that data_list is processed exclusively using the “individual” SNF scheme, feature weights won’t matter. This is because the individual SNF scheme calculates a separate distance matrix for every input dataframe before fusing them together with SNF. Any time a distance matrix is calculated, it will be for a single variable only, so the purpose of feature weighting (changing the relative contributions of input variables during the distance matrix calculations) is lost.
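
The single-variable case can be sketched in base R. The snippet below assumes a weighted Euclidean distance purely for illustration (metasnf's metrics may differ), and the toy anxiety values are hypothetical. With only one variable in the calculation, the weight multiplies every pairwise distance by the same constant, so no pair of observations becomes relatively closer or farther apart.

```r
# One input dataframe holding a single variable for four observations.
anxiety <- matrix(
    c(1, 4, 6, 10),
    ncol = 1,
    dimnames = list(c("p1", "p2", "p3", "p4"), "anxiety")
)

w <- 5  # an arbitrary feature weight

# Unweighted and weighted distance matrices for the single variable.
d_plain <- as.matrix(dist(anxiety))
d_weighted <- as.matrix(dist(anxiety * sqrt(w)))

# Ratio of weighted to unweighted distance for every pair.
ratio <- d_weighted[upper.tri(d_weighted)] / d_plain[upper.tri(d_plain)]

# Every pair is rescaled by the same factor, sqrt(w).
isTRUE(all.equal(ratio, rep(sqrt(w), length(ratio))))
```

Because the relative structure of the distance matrix is unchanged, this is consistent with the statement above that feature weights have no effect in this setup.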