Skip to contents

Download a copy of the vignette to follow along here: distance_metrics.Rmd

The distance_metrics_list

metasnf enables users to customize what distance metrics are used in the SNF pipeline. All information about distance metrics are stored in a distance_metrics_list object.

By default, batch_snf will create its own distance_metrics_list by calling the generate_distance_metrics_list function with no additional arguments.

library(metasnf)

distance_metrics_list <- generate_distance_metrics_list()

The list is a list of functions (euclidean_distance() and gower_distance()), and printing them out directly can be messy. The summarize_dml() function prints the object with a nicer format.

summarize_dml(distance_metrics_list)
#> $continuous_distances
#> [1] "euclidean_distance"
#> 
#> $discrete_distances
#> [1] "euclidean_distance"
#> 
#> $ordinal_distances
#> [1] "euclidean_distance"
#> 
#> $categorical_distances
#> [1] "gower_distance"
#> 
#> $mixed_distances
#> [1] "gower_distance"

These lists must always contain at least 1 distance metric for each of the 5 recognized types of features: continuous, discrete, ordinal, categorical, and mixed (any combination of the previous four).

By default, continuous, discrete, and ordinal data are converted to distance matrices using simple Euclidean distance. Categorical and mixed data are handled using Gower’s formula as implemented by the cluster package (see ?cluster::daisy).

How the distance_metrics_list is used

To show how the distance_metrics_list is used, we’ll start by extending our distance_metrics_list beyond just the default options. metasnf provides a Euclidean distance function that applies standard normalization first, sn_euclidean_distance() (a wrapper around SNFtool::standardNormalization + stats::dist). Here’s how we can create a custom distance_metrics_list that includes this metric for continuous and discrete features.

my_distance_metrics <- generate_distance_metrics_list(
    continuous_distances = list(
        "standard_norm_euclidean" = sn_euclidean_distance
    ),
    discrete_distances = list(
        "standard_norm_euclidean" = sn_euclidean_distance
    )
)

summarize_dml(my_distance_metrics)
#> $continuous_distances
#> [1] "euclidean_distance"      "standard_norm_euclidean"
#> 
#> $discrete_distances
#> [1] "euclidean_distance"      "standard_norm_euclidean"
#> 
#> $ordinal_distances
#> [1] "euclidean_distance"
#> 
#> $categorical_distances
#> [1] "gower_distance"
#> 
#> $mixed_distances
#> [1] "gower_distance"

Now, during settings_matrix generation, we can provide our distance_metrics_list to ensure our new distance metrics will be used on some of our SNF runs.

Before making the settings_matrix, we’ll quickly need to setup some data (as was done in A Simple Example).

library(SNFtool)

data(Data1)
data(Data2)

Data1$"patient_id" <- 101:(nrow(Data1) + 100) # nolint
Data2$"patient_id" <- 101:(nrow(Data2) + 100) # nolint

data_list <- generate_data_list(
    list(Data1, "genes_1_and_2_exp", "gene_expression", "continuous"),
    list(Data2, "genes_1_and_2_meth", "gene_methylation", "continuous"),
    uid = "patient_id"
)

And then the settings_matrix can be generated:

settings_matrix <- generate_settings_matrix(
    data_list,
    nrow = 10,
    distance_metrics_list = my_distance_metrics
)

# showing just the columns that are related to distances
settings_matrix |> dplyr::select(dplyr::ends_with("dist"))
#>    cont_dist disc_dist ord_dist cat_dist mix_dist
#> 1          2         2        1        1        1
#> 2          1         1        1        1        1
#> 3          2         2        1        1        1
#> 4          2         1        1        1        1
#> 5          1         2        1        1        1
#> 6          1         1        1        1        1
#> 7          2         1        1        1        1
#> 8          2         2        1        1        1
#> 9          1         1        1        1        1
#> 10         1         2        1        1        1

The continuous and discrete distance metrics values randomly fluctuate between 1 and 2, where 1 means that the first metric (euclidean_distance()) will be used and 2 means that the second metric (sn_euclidean_distance) will be used.

It’s important to note that the settings_matrix itself does not store the distance metrics, just the pointers to which position metric in the distance_metrics_list should be used for that SNF run. Because of that, you’ll need to supply the distance_metrics_list once again when calling batch_snf().

solutions_matrix <- batch_snf(
    data_list,
    settings_matrix,
    distance_metrics_list = my_distance_metrics
)

Removing the default distance_metrics

There are two ways to avoid using the default distance metrics if you don’t want to ever use them.

The first way is to use the keep_defaults parameter in generate_distance_metrics_list():

no_default_metrics <- generate_distance_metrics_list(
    continuous_distances = list(
        "standard_norm_euclidean" = sn_euclidean_distance
    ),
    discrete_distances = list(
        "standard_norm_euclidean" = sn_euclidean_distance
    ),
    ordinal_distances = list(
        "standard_norm_euclidean" = sn_euclidean_distance
    ),
    categorical_distances = list(
        "standard_norm_euclidean" = gower_distance
    ),
    mixed_distances = list(
        "standard_norm_euclidean" = gower_distance
    ),
    keep_defaults = FALSE
)

summarize_dml(no_default_metrics)
#> $continuous_distances
#> [1] "standard_norm_euclidean"
#> 
#> $discrete_distances
#> [1] "standard_norm_euclidean"
#> 
#> $ordinal_distances
#> [1] "standard_norm_euclidean"
#> 
#> $categorical_distances
#> [1] "standard_norm_euclidean"
#> 
#> $mixed_distances
#> [1] "standard_norm_euclidean"

For this option, it is necessary to provide at least one distance metric for every feature type. A distance_metrics_list cannot have any of its feature types be completely empty (even if you have no data for that feature type in the first place).

The second way is to explicitly specify which indices you want to sample from during settings_matrix generation:

my_distance_metrics <- generate_distance_metrics_list(
    continuous_distances = list(
        "standard_norm_euclidean" = sn_euclidean_distance,
        "some_other_metric" = sn_euclidean_distance
    ),
    discrete_distances = list(
        "standard_norm_euclidean" = sn_euclidean_distance,
        "some_other_metric" = sn_euclidean_distance
    )
)

summarize_dml(my_distance_metrics)
#> $continuous_distances
#> [1] "euclidean_distance"      "standard_norm_euclidean"
#> [3] "some_other_metric"      
#> 
#> $discrete_distances
#> [1] "euclidean_distance"      "standard_norm_euclidean"
#> [3] "some_other_metric"      
#> 
#> $ordinal_distances
#> [1] "euclidean_distance"
#> 
#> $categorical_distances
#> [1] "gower_distance"
#> 
#> $mixed_distances
#> [1] "gower_distance"

settings_matrix <- generate_settings_matrix(
    data_list,
    nrow = 10,
    distance_metrics_list = my_distance_metrics,
    continuous_distances = 1,
    discrete_distances = c(2, 3)
)

settings_matrix |> dplyr::select(dplyr::ends_with("dist"))
#>    cont_dist disc_dist ord_dist cat_dist mix_dist
#> 1          1         2        1        1        1
#> 2          1         2        1        1        1
#> 3          1         3        1        1        1
#> 4          1         3        1        1        1
#> 5          1         3        1        1        1
#> 6          1         2        1        1        1
#> 7          1         3        1        1        1
#> 8          1         3        1        1        1
#> 9          1         2        1        1        1
#> 10         1         2        1        1        1

The second option can be quite useful when paired with add_settings_matrix_rows(), enabling you to build up distinct blocks of rows in the settings matrix that have different combinations of distance metrics. This can save you the trouble of needing to manage several distinct distance metrics lists or manage your solution space over separate runs of batch_snf.

settings_matrix <- generate_settings_matrix(
    data_list,
    nrow = 10,
    distance_metrics_list = my_distance_metrics,
    continuous_distances = 1,
    discrete_distances = c(2, 3)
)

settings_matrix <- add_settings_matrix_rows(
    settings_matrix,
    nrow = 10,
    distance_metrics_list = my_distance_metrics,
    continuous_distances = 3,
    discrete_distances = 1
)

settings_matrix |> dplyr::select(dplyr::ends_with("dist"))
#>    cont_dist disc_dist ord_dist cat_dist mix_dist
#> 1          1         3        1        1        1
#> 2          1         2        1        1        1
#> 3          1         3        1        1        1
#> 4          1         3        1        1        1
#> 5          1         2        1        1        1
#> 6          1         2        1        1        1
#> 7          1         2        1        1        1
#> 8          1         2        1        1        1
#> 9          1         2        1        1        1
#> 10         1         2        1        1        1
#> 11         3         1        1        1        1
#> 12         3         1        1        1        1
#> 13         3         1        1        1        1
#> 14         3         1        1        1        1
#> 15         3         1        1        1        1
#> 16         3         1        1        1        1
#> 17         3         1        1        1        1
#> 18         3         1        1        1        1
#> 19         3         1        1        1        1
#> 20         3         1        1        1        1

In rows 1 to 10, only continuous data will always be handled by the first continuous distance metric while discrete data will be handled by the second and third discrete distance metrics. In rows 11 to 20, continuous data will always be handled by the third continuous distance metric while discrete data will only be handled by the first distance metric.

Supplying weights to distance metrics

Some distance metric functions can accept weights. Usually, weights will be applied by direct scaling of specified features. In some cases (e.g. categorical distance metric functions), the way in which weights are applied may be somewhat less intuitive. The bottom of this vignette outlines the available distance metric functions grouped by whether or not they accept weights. You can examine the documentation of those weighted functions to learn more about how the weights you provide will be used.

An important note on providing weights for a run of SNF is that the specific form of the data may not be what you expect by the time it is ready to be converted into a distance metric function. The “individual” and “two-step” SNF schemes involve distance metrics only being applied to the input dataframes in the data_list as they are. The “domain” scheme, however, concatenates data within a domain before converting that larger dataframe into a distance matrix. Anytime you have more than one dataframe with the same domain label and you use the domain SNF scheme, all the columns associated with that domain will be in a single dataframe when the distance metric function is applied.

The first step in providing custom weights is to generate a weights_matrix:

weights_matrix <- generate_weights_matrix(
    data_list
)

weights_matrix
#>      V1 V2 V3 V4
#> [1,]  1  1  1  1

By default, this function will return a dataframe containing all the columns in the data_list and a single row of 1s, which are the weights that will be used for a single run of SNF.

To actually use the matrix during SNF, you’ll need to make sure that the number of rows in the weights matrix is the same as the number of rows in your settings matrix.

settings_matrix <- generate_settings_matrix(
    data_list,
    nrow = 10,
    distance_metrics_list = my_distance_metrics,
    continuous_distances = 1,
    discrete_distances = c(2, 3)
)

weights_matrix <- generate_weights_matrix(
    data_list,
    nrow = nrow(settings_matrix)
)

weights_matrix[1:5, ]
#>      V1 V2 V3 V4
#> [1,]  1  1  1  1
#> [2,]  1  1  1  1
#> [3,]  1  1  1  1
#> [4,]  1  1  1  1
#> [5,]  1  1  1  1

solutions_matrix_1 <- batch_snf(
    data_list,
    settings_matrix,
    distance_metrics_list = my_distance_metrics,
    weights_matrix = weights_matrix
)

solutions_matrix_2 <- batch_snf(
    data_list,
    settings_matrix,
    distance_metrics_list = my_distance_metrics
)

identical(
    solutions_matrix_1,
    solutions_matrix_2
) # Try this on your machine - It'll evaluate to TRUE
#> [1] TRUE

A weights_matrix of all 1s (the default weights_matrix used when you don’t supply one yourself) ’won’t actually do anything to the data. You can either replace the 1s with your own weights that you’ve calculated outside of the package, or use some random weights following a uniform or exponential distribution.

weights_matrix <- generate_weights_matrix(
    data_list,
    nrow = nrow(settings_matrix),
    fill = "uniform" # or fill = "exponential"
)

The default metrics (simple Euclidean for continuous, discrete, and ordinal data and Gower’s distance for categorical and mixed data) are both capable of applying weights to data before distance matrix generation.

Custom distance metrics

The remainder of this vignette deals with supplying custom distance metrics (including custom feature weighting). Making use of this functionality will require a good understanding of working with functions in R.

You can also supply your own custom distance metrics. Looking at the code from one of the package-provided distance functions shows some of the essential aspects of a well-formated distance function.

euclidean_distance
#> function (df, weights_row) 
#> {
#>     weights <- diag(weights_row, nrow = length(weights_row))
#>     weighted_df <- as.matrix(df) %*% weights
#>     distance_matrix <- as.matrix(stats::dist(weighted_df, method = "euclidean"))
#>     return(distance_matrix)
#> }
#> <bytecode: 0x559fef08ca28>
#> <environment: namespace:metasnf>

The function should accept two arguments: df and weights_row, and only give one output, distance_matrix. The function doesn’t actually need to make use of those weights if you don’t want it to.

By the time your data reaches a distance metric function, it (referred to as df) will always:

  1. have no UID column
  2. have at least one feature column
  3. have no missing values
  4. be a data.frame (not a tibble)

The feature column names won’t be altered from the values they had when they were loaded into the data_list.

For example, consider the anxiety raw data supplied by metasnf:

head(anxiety)
#> # A tibble: 6 × 2
#>   unique_id        cbcl_anxiety_r
#>   <chr>                     <dbl>
#> 1 NDAR_INV0567T2Y9              3
#> 2 NDAR_INV0GLZNC2W              1
#> 3 NDAR_INV0IZ157F8             NA
#> 4 NDAR_INV0J4PYA5F              0
#> 5 NDAR_INV0OYE291Q              2
#> 6 NDAR_INV0SM1JLXQ              2

Here’s how to make it look more like what the distance metric functions will expect to see:

processed_anxiety <- anxiety |>
    na.omit() |> # no NAs
    dplyr::rename("subjectkey" = "unique_id") |>
    data.frame(row.names = "subjectkey")

head(processed_anxiety)
#>                  cbcl_anxiety_r
#> NDAR_INV0567T2Y9              3
#> NDAR_INV0GLZNC2W              1
#> NDAR_INV0J4PYA5F              0
#> NDAR_INV0OYE291Q              2
#> NDAR_INV0SM1JLXQ              2
#> NDAR_INV0Z87UJDR              0

If we want to have a distance metric that calculates Euclidean distance, but also scales the resulting matrix down such that the biggest allowed distance is a 1, it would look like this:

my_scaled_euclidean <- function(df, weights_row) {
    # this function won't apply the weights it is given
    distance_matrix <- df |>
        stats::dist(method = "euclidean") |>
        as.matrix() # make sure it's formatted as a matrix
    distance_matrix <- distance_matrix / max(distance_matrix)
    return(distance_matrix)
}

You’ll need to be mindful of any edge cases that your function will run into. For example, this function will fail if the pairwise distances for all patients is 0 (a division by 0 will occur). If that specific situation ever happens, there’s probably something quite wrong with the data.

Once you’re happy that you distance function is working as you’d like it to:

my_scaled_euclidean(processed_anxiety)[1:5, 1:5]
#>                  NDAR_INV0567T2Y9 NDAR_INV0GLZNC2W NDAR_INV0J4PYA5F
#> NDAR_INV0567T2Y9              0.0              0.2              0.3
#> NDAR_INV0GLZNC2W              0.2              0.0              0.1
#> NDAR_INV0J4PYA5F              0.3              0.1              0.0
#> NDAR_INV0OYE291Q              0.1              0.1              0.2
#> NDAR_INV0SM1JLXQ              0.1              0.1              0.2
#>                  NDAR_INV0OYE291Q NDAR_INV0SM1JLXQ
#> NDAR_INV0567T2Y9              0.1              0.1
#> NDAR_INV0GLZNC2W              0.1              0.1
#> NDAR_INV0J4PYA5F              0.2              0.2
#> NDAR_INV0OYE291Q              0.0              0.0
#> NDAR_INV0SM1JLXQ              0.0              0.0

You can load it into a custom distance_metrics_list:

my_distance_metrics <- generate_distance_metrics_list(
    continuous_distances = list(
        "my_scaled_euclidean" = my_scaled_euclidean
    )
)

Requesting metrics

If there’s a metric you’d like to see added as a prewritten option included in the package, feel free to post an issue or make a pull request on the package’s GitHub.

List of prewritten distance metrics functions

These metrics can be used as is. They are all capable of accepting and applying custom weights provided by a weights_matrix.

  • euclidean_distance (Euclidean distance)
    • applies to continuous, discrete, or ordinal data
  • sn_euclidean_distance (Standard normalized Euclidean distance)
    • Standard normalize data, then use Euclidean distance
    • applies to continuous, discrete, or ordinal data
  • gower_distance (Gower’s distance)
    • applies to any data
  • siw_euclidean_distance (Squared (including weights) Euclidean distance)
    • Apply weights to dataframe, then calculate Euclidean distance, then square the results
  • sew_euclidean_distance (Squared (excluding weights) Euclidean distance)
    • Apply square root of weights to dataframe, then calculate Euclidean distance, then square the results
  • hamming_distance (Hamming distance)
    • Distance between patients is (1 * feature weight) summed over all features

References

Wang, Bo, Aziz M. Mezlini, Feyyaz Demir, Marc Fiume, Zhuowen Tu, Michael Brudno, Benjamin Haibe-Kains, and Anna Goldenberg. 2014. “Similarity Network Fusion for Aggregating Data Types on a Genomic Scale.” Nature Methods 11 (3): 333–37. https://doi.org/10.1038/nmeth.2810.