This function generates the main data object that is processed when iterating through each SNF pipeline defined in the settings_matrix. The data_list is a named, nested list. Each component contains an input data frame (data), the name of that data frame (for the user's reference), the 'domain' of that data frame (the broader source of information that it captures, as determined by the user's domain knowledge), and the type of feature stored in the data frame (continuous, discrete, ordinal, categorical, or mixed).
Usage
generate_data_list(
    ...,
    uid = NULL,
    test_subjects = NULL,
    train_subjects = NULL,
    sort_subjects = TRUE,
    remove_missing = TRUE,
    return_missing = FALSE
)
Arguments
- ...
Any number of lists formatted as (df, "df_name", "df_domain", "df_type") OR any number of lists of such lists
- uid
(string) the name of the uid column currently used in the data
- test_subjects
character vector of test subjects (useful if building a full data list for label propagation)
- train_subjects
character vector of train subjects (useful if building a full data list for label propagation)
- sort_subjects
If TRUE, the subjects in the data_list will be sorted
- remove_missing
If TRUE (default), subjects with incomplete data will be dropped from data_list creation. Setting this value to FALSE may lead to unusual and/or unstable results during SNF, clustering, p-value calculations or label propagation.
- return_missing
If TRUE, function returns a list where the first element is the data_list and the second element is a vector of unique IDs of patients who were removed during the complete data filtration step.
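The complete data filtration step described for remove_missing and return_missing can be pictured with a minimal base R sketch. This is not the package's implementation; it only assumes that a subject is dropped when any input data frame has a missing value for them. The data frames df_a and df_b and all variable names are hypothetical.

```r
# Hypothetical inputs: patient "2" has a missing value in df_a
df_a <- data.frame(
    patient_id = c("1", "2", "3"),
    var1 = c(0.04, NA, 0.3)
)
df_b <- data.frame(
    patient_id = c("1", "2", "3"),
    var2 = c(900, 1990, 373)
)
dfs <- list(df_a, df_b)

# IDs of subjects with no missing values, computed per data frame
complete_ids <- lapply(dfs, function(df) df$patient_id[complete.cases(df)])

# Keep only subjects who are complete in every data frame
kept_ids <- Reduce(intersect, complete_ids)

# Subjects removed by the filter (analogous to the vector that
# return_missing = TRUE is described as returning)
all_ids <- unique(unlist(lapply(dfs, function(df) df$patient_id)))
removed_ids <- setdiff(all_ids, kept_ids)

# Apply the filter to each data frame
filtered_dfs <- lapply(dfs, function(df) {
    df[df$patient_id %in% kept_ids, , drop = FALSE]
})
```

In this sketch, kept_ids holds patients "1" and "3", and removed_ids holds patient "2".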
Value
A nested "list" class object. Each list component contains a 4-item list of a data frame, the user-assigned name of the data frame, the user-assigned domain of the data frame, and the user-labeled type of the data frame.
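As a rough illustration of that nested structure, the sketch below builds a mock list by hand in base R and reads values back out of it. It assumes the four items are named data, name, domain, and type, as in the explicit-loading example; the object returned by generate_data_list may differ in detail, and mock_data_list is a hypothetical name.

```r
# A hand-built stand-in for one component of a data_list
heart_rate_df <- data.frame(
    patient_id = c("1", "2", "3"),
    var1 = c(0.04, 0.1, 0.3)
)
mock_data_list <- list(
    list(
        data = heart_rate_df,
        name = "heart_rate",
        domain = "clinical",
        type = "continuous"
    )
)

# Access the items of the first component
first_name <- mock_data_list[[1]]$name
first_df <- mock_data_list[[1]]$data

# Collect one item across all components
all_domains <- vapply(mock_data_list, function(x) x$domain, character(1))
```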
Examples
heart_rate_df <- data.frame(
    patient_id = c("1", "2", "3"),
    var1 = c(0.04, 0.1, 0.3),
    var2 = c(30, 2, 0.3)
)
personality_test_df <- data.frame(
    patient_id = c("1", "2", "3"),
    var3 = c(900, 1990, 373),
    var4 = c(509, 2209, 83)
)
survey_response_df <- data.frame(
    patient_id = c("1", "2", "3"),
    var5 = c(1, 3, 3),
    var6 = c(2, 3, 3)
)
city_df <- data.frame(
    patient_id = c("1", "2", "3"),
    var7 = c("toronto", "montreal", "vancouver")
)
# Explicitly (Name each nested list element):
data_list <- generate_data_list(
    list(
        data = heart_rate_df,
        name = "heart_rate",
        domain = "clinical",
        type = "continuous"
    ),
    list(
        data = personality_test_df,
        name = "personality_test",
        domain = "surveys",
        type = "continuous"
    ),
    list(
        data = survey_response_df,
        name = "survey_response",
        domain = "surveys",
        type = "ordinal"
    ),
    list(
        data = city_df,
        name = "city",
        domain = "location",
        type = "categorical"
    ),
    uid = "patient_id"
)
# Compact loading
data_list <- generate_data_list(
    list(heart_rate_df, "heart_rate", "clinical", "continuous"),
    list(personality_test_df, "personality_test", "surveys", "continuous"),
    list(survey_response_df, "survey_response", "surveys", "ordinal"),
    list(city_df, "city", "location", "categorical"),
    uid = "patient_id"
)
# Printing data_list summaries
summarize_dl(data_list)
#>               name        type   domain length width
#> 1       heart_rate  continuous clinical      3     3
#> 2 personality_test  continuous  surveys      3     3
#> 3  survey_response     ordinal  surveys      3     3
#> 4             city categorical location      3     2
# Alternative loading: providing a single list of lists
list_of_lists <- list(
    list(heart_rate_df, "data1", "domain1", "continuous"),
    list(personality_test_df, "data2", "domain2", "continuous")
)
dl <- generate_data_list(
    list_of_lists,
    uid = "patient_id"
)
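# When there are many input data frames, a list of lists like the one above
# can also be assembled programmatically. The following is a minimal base R
# sketch under stated assumptions: the data frames are redefined here so the
# snippet is self-contained, and dfs, df_names, df_domains, and df_types are
# hypothetical names, not part of the package.

```r
# Redefined so the sketch runs on its own
heart_rate_df <- data.frame(
    patient_id = c("1", "2", "3"),
    var1 = c(0.04, 0.1, 0.3)
)
personality_test_df <- data.frame(
    patient_id = c("1", "2", "3"),
    var3 = c(900, 1990, 373)
)

# Parallel vectors describing each data frame
dfs <- list(heart_rate_df, personality_test_df)
df_names <- c("data1", "data2")
df_domains <- c("domain1", "domain2")
df_types <- c("continuous", "continuous")

# Build one (df, name, domain, type) list per data frame
list_of_lists <- unname(Map(
    function(df, name, domain, type) list(df, name, domain, type),
    dfs, df_names, df_domains, df_types
))

# The result can then be passed on as in the example above:
# dl <- generate_data_list(list_of_lists, uid = "patient_id")
```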