NMI Scores • metasnf

Download a copy of the vignette to follow along here: nmi_scores.Rmd

Normalized mutual information (NMI) scores were used in the original SNFtool package as a unitless way to compare the relative importance of different features in a final cluster solution. The premise is that if a feature was very important, clustering on that feature alone should produce a solution very similar to the one generated by clustering on all the features together.
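To make the premise concrete, here is a minimal base R sketch of an NMI calculation between two sets of cluster labels. The helper names `entropy()` and `nmi()` are hypothetical and not part of metasnf or SNFtool, and the normalization by the geometric mean of the two entropies is an assumption (other conventions exist).

```r
# Shannon entropy of a vector of cluster labels
entropy <- function(labels) {
    p <- table(labels) / length(labels)
    -sum(p * log(p))
}

# NMI between two labelings of the same observations.
# Assumption: mutual information is divided by sqrt(H(a) * H(b)).
nmi <- function(a, b) {
    joint <- table(a, b) / length(a)
    expected <- outer(rowSums(joint), colSums(joint))
    nz <- joint > 0  # skip empty cells to avoid log(0)
    mi <- sum(joint[nz] * log(joint[nz] / expected[nz]))
    mi / sqrt(entropy(a) * entropy(b))
}

full_solution <- c(1, 1, 1, 1, 2, 2, 2, 2)
relabelled    <- c(2, 2, 2, 2, 1, 1, 1, 1)  # same partition, swapped labels
unrelated     <- c(1, 1, 2, 2, 1, 1, 2, 2)  # independent of full_solution

nmi(full_solution, relabelled)  # 1 (up to floating point): identical partitions
nmi(full_solution, unrelated)   # 0: the partitions share no information
```

Because NMI depends only on how observations are grouped, not on the label values themselves, a solo-feature solution that reproduces the all-feature partition scores 1 even if its cluster numbering differs.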

In the original SNFtool implementation, the cluster solution for the individual feature being assessed was always generated using squared Euclidean distance, a K hyperparameter value of 20, an alpha hyperparameter value of 0.5, and spectral clustering, with the number of clusters chosen by the best eigen-gap value among candidate solutions of 2 to 5 clusters.
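The fixed pipeline described above can be sketched with functions exported by SNFtool. This is an illustration of the described steps on simulated data, not a copy of SNFtool's internal code; the simulated feature and the stand-in all-feature solution are hypothetical.

```r
library(SNFtool)

set.seed(42)
n <- 60

# Hypothetical single feature with two well-separated groups
feature <- as.matrix(c(rnorm(n / 2, mean = 0), rnorm(n / 2, mean = 5)))

# Stand-in for the all-feature cluster solution (illustrative labels only)
full_solution <- rep(1:2, each = n / 2)

# Squared Euclidean distance between all pairs of observations
sq_dist <- dist2(feature, feature)

# Affinity matrix with K = 20 and alpha = 0.5 (the argument is named `sigma`)
affinity <- affinityMatrix(sq_dist, K = 20, sigma = 0.5)

# Best eigen-gap estimate of the number of clusters among 2 to 5
n_clusters <- estimateNumberOfClustersGivenGraph(affinity, NUMC = 2:5)[[1]]

# Spectral clustering of the solo-feature affinity matrix
solo_solution <- spectralClustering(affinity, n_clusters)

# NMI between the solo-feature and all-feature solutions
solo_nmi <- calNMI(full_solution, solo_solution)
```

Every setting here is hard-coded, regardless of how the all-feature solution was produced.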

In contrast, the metasnf implementation leverages all the architectural details and hyperparameters supplied in the original SNF config and batch_snf() call to make the solo-feature and all-feature solutions as comparable as possible.

The chunk below outlines how the primary NMI calculating function, calc_nmis(), can be used.

library(metasnf)

dl <- data_list(
    list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
    list(income, "household_income", "demographics", "continuous"),
    list(pubertal, "pubertal_status", "demographics", "continuous"),
    list(anxiety, "anxiety", "behaviour", "ordinal"),
    list(depress, "depressed", "behaviour", "ordinal"),
    uid = "unique_id"
)
#> ℹ 188 observations dropped due to incomplete data.

set.seed(42)
sc <- snf_config(
    dl = dl,
    n_solutions = 2,
    min_k = 20,
    max_k = 50
)
#> ℹ No distance functions specified. Using defaults.
#> ℹ No clustering functions specified. Using defaults.

# Generation of 2 cluster solutions (n_solutions = 2 above)
sol_df <- batch_snf(dl, sc)

# Let's just calculate NMIs of the anxiety and depression data types (dl[4:5])
# for both cluster solutions to save time:
feature_nmis <- calc_nmis(dl[4:5], sol_df)

print(feature_nmis)
#>          feature         s1        s2
#> 1 cbcl_anxiety_r 0.08307759 0.3825622
#> 2 cbcl_depress_r 0.30514882 0.3348474

One important thing to note is that if the cluster space you initially set up when calling batch_snf() relied on custom distance metrics, clustering algorithms, or the automatic_standard_normalize parameter, you should use those same values when calling calc_nmis() as well.

Another important note is that by default, calc_nmis() ignores the inc_* columns of the settings data frame, i.e., no data types are dropped during solo-feature cluster solution calculations. This can lead to a slightly odd interpretation if you view NMI as a direct reflection of contribution to the final SNF output: a feature that was not part of a particular cluster solution can still produce its own cluster solution with a very high NMI score relative to that solution. If you wish to suppress the calculation of NMIs for features that were not actually included in a particular SNF run (because they had a 0 value in the inclusion column), you can set the ignore_inclusions parameter to FALSE.

Finally, if you’d like the NMI information to be presented in the transposed orientation relative to the output shown above, you can do that too by setting transpose to FALSE.
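As a rough sketch, the two options described above could be passed as follows. This continues from the earlier chunk (it reuses `dl` and `sol_df`) and assumes the arguments are named exactly as in the text; the result object names are hypothetical.

```r
library(metasnf)

# Skip NMI calculations for features that were excluded from a given SNF run
included_only_nmis <- calc_nmis(dl[4:5], sol_df, ignore_inclusions = FALSE)

# Return the NMI data frame in the other orientation
flipped_nmis <- calc_nmis(dl[4:5], sol_df, transpose = FALSE)
```

Check `?calc_nmis` in your installed version of metasnf to confirm the argument names and defaults before relying on this.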