Download a copy of the vignette to follow along here: nmi_scores.Rmd

Normalized mutual information (NMI) scores were used in the original SNFtool package as a unitless way to compare the relative importance of different features in a final cluster solution. The premise is that if a feature was very important, clustering on that feature alone should produce a solution very similar to the one generated by clustering on all the features together.
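As a rough illustration of this premise (not the SNFtool or metasnf implementation), the base R sketch below defines a hypothetical helper, nmi(), using one common normalization, I(X; Y) / sqrt(H(X) * H(Y)). It swaps in k-means on the built-in iris data in place of SNF and spectral clustering. The helper name, the dataset, and the choice of three clusters are assumptions made only for this example.

```r
# Hypothetical helper: normalized mutual information between two label vectors.
# Assumed normalization: I(X; Y) / sqrt(H(X) * H(Y)).
nmi <- function(x, y) {
    stopifnot(length(x) == length(y))
    joint <- table(x, y) / length(x)  # joint distribution p(x, y)
    px <- rowSums(joint)              # marginal distribution p(x)
    py <- colSums(joint)              # marginal distribution p(y)
    entropy <- function(p) {
        p <- p[p > 0]
        -sum(p * log(p))
    }
    hx <- entropy(px)
    hy <- entropy(py)
    # A partition with a single cluster carries no information
    if (hx == 0 || hy == 0) {
        return(0)
    }
    expected <- outer(px, py)         # p(x) * p(y) under independence
    nonzero <- joint > 0
    mutual_info <- sum(
        joint[nonzero] * log(joint[nonzero] / expected[nonzero])
    )
    mutual_info / sqrt(hx * hy)
}

# Identical partitions with swapped labels: NMI is 1 (up to floating point)
nmi(c(1, 1, 1, 2, 2, 2), c(2, 2, 2, 1, 1, 1))

# Statistically independent partitions: NMI is 0
nmi(c(1, 1, 2, 2), c(1, 2, 1, 2))

# Solo-feature vs. all-feature comparison, with k-means standing in for SNF
set.seed(42)
features <- scale(iris[, 1:4])
all_feature_solution <- kmeans(features, centers = 3, nstart = 10)$cluster

solo_nmis <- sapply(colnames(features), function(feature) {
    solo_solution <- kmeans(features[, feature], centers = 3, nstart = 10)$cluster
    nmi(solo_solution, all_feature_solution)
})

print(round(solo_nmis, 2))
```

Features whose solo solution scores closer to 1 reproduce the all-feature partition more closely. The exact values depend on the clustering algorithm and the seed.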

In the original SNFtool implementation, the cluster solution for each individual feature was always generated using squared Euclidean distance, a K hyperparameter value of 20, an alpha hyperparameter value of 0.5, and spectral clustering, with the number of clusters chosen by the best eigen-gap value among solutions spanning 2 to 5 clusters.

In contrast, the metasnf implementation reuses all the architectural details and hyperparameters supplied in the original settings_matrix and batch_snf() call, so that the solo-feature and all-feature solutions are as comparable as possible.

The chunk below outlines how the primary NMI calculating function, batch_nmi(), can be used.

library(metasnf)

data_list <- generate_data_list(
    list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
    list(income, "household_income", "demographics", "continuous"),
    list(pubertal, "pubertal_status", "demographics", "continuous"),
    list(anxiety, "anxiety", "behaviour", "ordinal"),
    list(depress, "depressed", "behaviour", "ordinal"),
    uid = "unique_id"
)
#> Warning in generate_data_list(list(subc_v, "subcortical_volume",
#> "neuroimaging", : 188 subject(s) dropped due to incomplete data.

set.seed(42)
settings_matrix <- generate_settings_matrix(
    data_list,
    nrow = 20,
    min_k = 20,
    max_k = 50
)

# Generation of 20 cluster solutions
solutions_matrix <- batch_snf(data_list, settings_matrix)

# Let's just calculate NMIs of the anxiety and depression data types for the
# first 5 cluster solutions to save time:
feature_nmis <- batch_nmi(data_list[4:5], solutions_matrix[1:5, ])

print(feature_nmis)
#>          feature   row_id_1  row_id_2  row_id_3  row_id_4  row_id_5
#> 1 cbcl_anxiety_r 0.08307759 0.3825622 0.5532495 0.4068634 0.2532882
#> 2 cbcl_depress_r 0.30514882 0.3348474 0.4058227 0.2307721 0.1486859

Note that if the cluster space you set up when calling batch_snf() relied on custom distance metrics, custom clustering algorithms, or the automatic_standard_normalize parameter, you should pass those same values to batch_nmi() as well.

Another important note is that by default, batch_nmi() ignores the inc_* columns of the settings matrix, i.e., no data types are dropped when calculating solo-feature cluster solutions. This can lead to an odd interpretation if you view NMI as a direct reflection of a feature's contribution to the final SNF output: a feature that was not part of a particular cluster solution can still produce a solo cluster solution that has a very high NMI with it. If you wish to suppress the calculation of NMIs for features that were excluded from a particular SNF run by a 0 value in the inclusion column, set the ignore_inclusions parameter to FALSE.

Finally, if you’d like the NMI information in the opposite orientation to the output printed above, set transpose to FALSE.