Title: | Easy Spatial Modeling with Random Forest |
---|---|
Description: | Automatic generation and selection of spatial predictors for spatial regression with Random Forest. Spatial predictors are surrogates of variables driving the spatial structure of a response variable. The package offers two methods to generate spatial predictors from a distance matrix among training cases: 1) Moran's Eigenvector Maps (MEMs; Dray, Legendre, and Peres-Neto 2006 <DOI:10.1016/j.ecolmodel.2006.02.015>): computed as the eigenvectors of a weighted matrix of distances; 2) RFsp (Hengl et al. <DOI:10.7717/peerj.5518>): columns of the distance matrix used as spatial predictors. Spatial predictors help minimize the spatial autocorrelation of the model residuals and facilitate an honest assessment of the importance scores of the non-spatial predictors. Additionally, functions to reduce multicollinearity, identify relevant variable interactions, tune random forest hyperparameters, assess model transferability via spatial cross-validation, and explore model results via partial dependence curves and interaction surfaces are included in the package. The modelling functions are built around the highly efficient 'ranger' package (Wright and Ziegler 2017 <DOI:10.18637/jss.v077.i01>). |
Authors: | Blas M. Benito [aut, cre, cph]
|
Maintainer: | Blas M. Benito <[email protected]> |
License: | GPL-3 |
Version: | 1.1.4 |
Built: | 2025-03-06 03:03:27 UTC |
Source: | https://github.com/blasbenito/spatialrf |
Computes the area under the ROC curve in models with binary responses.
auc(o, p)
auc(o, p)
o |
Numeric vector with observations, must have the same length as |
p |
Numeric vector with predictions, must have the same length as |
Numeric, AUC value.
if(interactive()){ out <- auc( o = c(0, 0, 1, 1), p = c(0.1, 0.6, 0.4, 0.8) ) }
if(interactive()){ out <- auc( o = c(0, 0, 1, 1), p = c(0.1, 0.6, 0.4, 0.8) ) }
Computes the correlation matrix among a set of predictors, orders the correlation matrix according to a user-defined preference order, and removes variables one by one, taking into account the preference order, until the remaining ones are below a given Pearson correlation threshold. Warning: variables in preference.order
not in colnames(x)
, and non-numeric columns are removed silently from x
and preference.order
. The same happens with rows having NA values (na.omit()
is applied). The function issues a warning if zero-variance columns are found.
auto_cor( x = NULL, preference.order = NULL, cor.threshold = 0.5, verbose = TRUE )
auto_cor( x = NULL, preference.order = NULL, cor.threshold = 0.5, verbose = TRUE )
x |
A data frame with predictors, or the result of |
preference.order |
Character vector indicating the user's order of preference to keep variables. Doesn't need to contain If not provided, variables in |
cor.threshold |
Numeric between 0 and 1, with recommended values between 0.5 and 0.9. Maximum Pearson correlation between any pair of the selected variables. Default: |
verbose |
Logical. if |
Can be chained together with auto_vif()
through pipes, see the examples below.
List with three slots:
cor
: correlation matrix of the selected variables.
selected.variables
: character vector with the names of the selected variables.
selected.variables.df
: data frame with the selected variables.
if(interactive()){ #load data data(plant_richness_df) #on a data frame out <- auto_cor(x = plant_richness_df[, 5:21]) #getting the correlation matrix out$cor #getting the names of the selected variables out$selected.variables #getting the data frame of selected variables out$selected.variables.df #on the result of auto_vif out <- auto_vif(x = plant_richness_df[, 5:21]) out <- auto_cor(x = out) #with pipes out <- plant_richness_df[, 5:21] %>% auto_vif() %>% auto_cor() }
if(interactive()){ #load data data(plant_richness_df) #on a data frame out <- auto_cor(x = plant_richness_df[, 5:21]) #getting the correlation matrix out$cor #getting the names of the selected variables out$selected.variables #getting the data frame of selected variables out$selected.variables.df #on the result of auto_vif out <- auto_vif(x = plant_richness_df[, 5:21]) out <- auto_cor(x = out) #with pipes out <- plant_richness_df[, 5:21] %>% auto_vif() %>% auto_cor() }
Selects predictors that are not linear combinations of other predictors by using computing their variance inflation factors (VIF). Allows the user to define an order of preference for the selection of predictors. Warning: variables in preference.order
not in colnames(x)
, and non-numeric columns are removed silently from x
and preference.order
. The same happens with rows having NA values (na.omit()
is applied). The function issues a warning if zero-variance columns are found.
auto_vif( x = NULL, preference.order = NULL, vif.threshold = 5, verbose = TRUE )
auto_vif( x = NULL, preference.order = NULL, vif.threshold = 5, verbose = TRUE )
x |
A data frame with predictors or the result of |
preference.order |
a character vector with columns names of x ordered by the user preference, Default: |
vif.threshold |
Numeric between 2.5 and 10 defining the selection threshold for the VIF analysis. Higher numbers result in a more relaxed variable selection. Default: 5. |
verbose |
Logical. if |
This function has two modes of operation:
1. When the argument preference.order
is NULL
, the function removes on each iteration the variable with the highest VIF until all VIF values are lower than vif.threshold
.
2. When preference.order
is provided, the variables are selected by giving them priority according to their order in preference.order
. If there are variables not in preference.order
, these are selected as in option 1. Once both groups of variables have been processed, all variables are put together and selected by giving priority to the ones in preference.order
. This method preserves the variables desired by the user as much as possible.
Can be chained together with auto_cor()
through pipes, see the examples below.
List with three slots:
vif
: data frame with the names of the selected variables and their respective VIF scores.
selected.variables
: character vector with the names of the selected variables.
selected.variables.df
: data frame with the selected variables.
if(interactive()){ #loading data data(plant_richness_df) #on a data frame out <- auto_vif(x = plant_richness_df[, 5:21]) #getting out the vif data frame out$vif #getting the names of the selected variables out$selected.variables #getting the data frame of selected variables out$selected.variables.df #on the result of auto_cor out <- auto_cor(x = plant_richness_df[, 5:21]) out <- auto_vif(x = out) #with pipes out <- plant_richness_df[, 5:21] %>% auto_cor() %>% auto_vif() }
if(interactive()){ #loading data data(plant_richness_df) #on a data frame out <- auto_vif(x = plant_richness_df[, 5:21]) #getting out the vif data frame out$vif #getting the names of the selected variables out$selected.variables #getting the data frame of selected variables out$selected.variables.df #on the result of auto_cor out <- auto_cor(x = plant_richness_df[, 5:21]) out <- auto_vif(x = out) #with pipes out <- plant_richness_df[, 5:21] %>% auto_cor() %>% auto_vif() }
Defines a Beowulf cluster from the IPs of the machines in the cluster, the number of cores of each machine, and the user name. The returned cluster has to be registered with doParallel::registerDoParallel()
.
beowulf_cluster( cluster.ips = NULL, cluster.cores = NULL, cluster.user = Sys.info()[["user"]], cluster.port = "11000", outfile = NULL )
beowulf_cluster( cluster.ips = NULL, cluster.cores = NULL, cluster.user = Sys.info()[["user"]], cluster.port = "11000", outfile = NULL )
cluster.ips |
Character vector with the IPs of the machines in the cluster. The first machine will be considered the main node of the cluster, and will generally be the machine on which the R code is being executed. Default: |
cluster.cores |
Numeric integer vector, number of cores on each machine. Default: |
cluster.user |
Character string, name of the user (should be the same throughout machines), Defaults to the current system user. |
cluster.port |
Character, port used by the machines in the cluster to communicate. The firewall in all computers must allow traffic from and to such port. Default: |
outfile |
Where to direct the messages provided by the workers. When working on a local computer, |
A list ready to be used as input for the spec
argument of the function makeCluster.
if(interactive()){ beowulf.cluster <- beowulf_cluster( cluster.ips = c( "10.42.0.1", "10.42.0.34", "10.42.0.104" ), cluster.cores = c(7, 4, 4), cluster.user = "blas", cluster.port = "11000" ) doParallel::registerDoParallel(cl = beowulf.cluster) #PARALLELIZED foreach LOOP HERE parallel::stopCluster(cl = beowulf.cluster) }
if(interactive()){ beowulf.cluster <- beowulf_cluster( cluster.ips = c( "10.42.0.1", "10.42.0.34", "10.42.0.104" ), cluster.cores = c(7, 4, 4), cluster.user = "blas", cluster.port = "11000" ) doParallel::registerDoParallel(cl = beowulf.cluster) #PARALLELIZED foreach LOOP HERE parallel::stopCluster(cl = beowulf.cluster) }
When the data is binary, setting the ranager
argument case.weights
helps to minimize the issues produced by class imbalance. This function takes a binary response variable and returns a vector of weights populated with the values 1/#zeros
and 1/#ones
. It is used internally by the function rf()
.
case_weights(data = NULL, dependent.variable.name = NULL)
case_weights(data = NULL, dependent.variable.name = NULL)
data |
Data frame with a response variable and a set of predictors. Default: |
dependent.variable.name |
Character string with the name of the response variable. Must be in the column names of |
A vector with a length equal to nrow(data)
with the respective weights of the cases.
if(interactive()){ data <- data.frame( response = c(0, 0, 0, 1, 1) ) case_weights( data = data, dependent.variable.name = "response" ) }
if(interactive()){ data <- data.frame( response = c(0, 0, 0, 1, 1) ) case_weights( data = data, dependent.variable.name = "response" ) }
Generates four distance thresholds, from 0 to max(distance.matrix)/2.
default_distance_thresholds(distance.matrix = NULL)
default_distance_thresholds(distance.matrix = NULL)
distance.matrix |
Distance matrix. Default: |
A numeric vector with distance thresholds.
if(interactive()){ #loading example distance matrix data(distance_matrix) #computing set of default distance thresholds default_distance_thresholds(distance_matrix) }
if(interactive()){ #loading example distance matrix data(distance_matrix) #computing set of default distance thresholds default_distance_thresholds(distance_matrix) }
Distance matrix (in km) among the edges of the American ecoregions described in the plant_richness_df dataset.
data(distance_matrix)
data(distance_matrix)
A numeric matrix with 227 rows and columns.
Generates a double-centered matrix (row and column means are zero) from the weights of a distance matrix (see weights_from_distance_matrix()
) and a distance threshold. This is a required step before the computation of Moran's Eigenvector Maps.
double_center_distance_matrix ( distance.matrix = NULL, distance.threshold = 0 )
double_center_distance_matrix ( distance.matrix = NULL, distance.threshold = 0 )
distance.matrix |
Distance matrix. Default: |
distance.threshold |
Numeric, positive, in the range of values of |
A double-centered matrix of the same dimensions as x
.
weights_from_distance_matrix()
, mem()
, mem_multithreshold()
if(interactive()){ #loading the distance matrix data(distance_matrix) x <- double_center_distance_matrix( distance.matrix = distance_matrix ) x }
if(interactive()){ #loading the distance matrix data(distance_matrix) x <- double_center_distance_matrix( distance.matrix = distance_matrix ) x }
Removes spatial predictors that are pair-wise correlated with other spatial predictors (which happens when there are several close distance thresholds), and spatial predictors correlated with non-spatial predictors.
filter_spatial_predictors( data = NULL, predictor.variable.names = NULL, spatial.predictors.df = NULL, cor.threshold = 0.5 )
filter_spatial_predictors( data = NULL, predictor.variable.names = NULL, spatial.predictors.df = NULL, cor.threshold = 0.5 )
data |
Data frame with a response variable and a set of predictors. Default: |
predictor.variable.names |
Character vector with the names of the predictive variables. Every element of this vector must be in the column names of |
spatial.predictors.df |
Data frame of spatial predictors. |
cor.threshold |
Numeric between 0 and 1, maximum Pearson correlation between any pair of the selected variables. Default: |
A data frame with non-redundant spatial predictors.
if(interactive()){ #loading data data("distance_matrix") data("plant_richness_df") #computing Moran's Eigenvector Maps spatial.predictors.df <- mem_multithreshold( distance_matrix = distance_matrix, distance.thresholds = c(0, 1000) ) #filtering spatial predictors spatial.predictors.df <- filter_spatial_predictors( data = plant_richness_df, predictor.variable.names = colnames(plant_richness_df)[5:21], spatial.predictors.df = spatial.predictors.df, cor.threshold = 0.50 ) }
if(interactive()){ #loading data data("distance_matrix") data("plant_richness_df") #computing Moran's Eigenvector Maps spatial.predictors.df <- mem_multithreshold( distance_matrix = distance_matrix, distance.thresholds = c(0, 1000) ) #filtering spatial predictors spatial.predictors.df <- filter_spatial_predictors( data = plant_richness_df, predictor.variable.names = colnames(plant_richness_df)[5:21], spatial.predictors.df = spatial.predictors.df, cor.threshold = 0.50 ) }
Returns performance metrics produced by rf_evaluate()
.
get_evaluation(model)
get_evaluation(model)
model |
A model fitted with |
A data frame with evaluation scores. The following columns are shown:
model
: Identifies the given model. The values are "Full", (original model introduced into rf_evaluate()
), "Training" (model trained on an independent training spatial fold), and "Testing" (predictive performance of the training model on an independent testing spatial fold). The performance values of the "Testing" model represent the model performance on unseen data, and hence its ability to generalize.
metric
: Four values representing different evaluation metrics, "rmse", "nrmse", "r.squared", and "pseudo.r.squared".
mean
, sd
, min
, and max
: Average, standard deviation, minimum, and maximum of each metric across the evaluation (cross-validation) iterations.
rf_evaluate()
, plot_evaluation()
, print_evaluation()
if(interactive()){ #loading data data(plant_richness_df) data(distance_matrix) #fitting a random forest model rf.model <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], distance.matrix = distance_matrix, distance.thresholds = 0, n.cores = 1, verbose = FALSE ) #evaluating the model with spatial cross-validation rf.model <- rf_evaluate( model = rf.model, xy = plant_richness_df[, c("x", "y")], n.cores = 1, verbose = FALSE ) #getting evaluation results from the model x <- get_evaluation(rf.model) x }
if(interactive()){ #loading data data(plant_richness_df) data(distance_matrix) #fitting a random forest model rf.model <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], distance.matrix = distance_matrix, distance.thresholds = 0, n.cores = 1, verbose = FALSE ) #evaluating the model with spatial cross-validation rf.model <- rf_evaluate( model = rf.model, xy = plant_richness_df[, c("x", "y")], n.cores = 1, verbose = FALSE ) #getting evaluation results from the model x <- get_evaluation(rf.model) x }
Gets variable importance scores from rf()
, rf_repeat()
, and rf_spatial()
models.
get_importance(model)
get_importance(model)
model |
A model fitted with |
A data frame with variable names and importance scores.
rf()
, rf_repeat()
, rf_spatial()
, plot_importance()
, print_importance()
.
if(interactive()){ data(plant_richness_df) data(distance_matrix) rf.model <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], distance.matrix = distance_matrix, distance.thresholds = 0, n.cores = 1, verbose = FALSE ) x <- get_importance(rf.model) x }
if(interactive()){ data(plant_richness_df) data(distance_matrix) rf.model <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], distance.matrix = distance_matrix, distance.thresholds = 0, n.cores = 1, verbose = FALSE ) x <- get_importance(rf.model) x }
Gets local importance scores from rf()
, rf_repeat()
, and rf_spatial()
models.
get_importance_local(model)
get_importance_local(model)
model |
A model fitted with |
A data frame with variable names and local importance scores.
rf()
, rf_repeat()
, rf_spatial()
, plot_importance()
, print_importance()
.
if(interactive()){ #loading example data data(plant_richness_df) data(distance_matrix) #fittinga random forest model rf.model <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], distance.matrix = distance_matrix, distance.thresholds = 0, n.cores = 1, verbose = FALSE ) #getting importance scores x <- get_importance_local(rf.model) x }
if(interactive()){ #loading example data data(plant_richness_df) data(distance_matrix) #fittinga random forest model rf.model <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], distance.matrix = distance_matrix, distance.thresholds = 0, n.cores = 1, verbose = FALSE ) #getting importance scores x <- get_importance_local(rf.model) x }
Returns the Moran's I test on the residuals of a model produced by rf()
, rf_repeat()
, or rf_spatial()
.
get_moran(model)
get_moran(model)
model |
A model fitted with |
A data frame with Moran's I test results produced by moran_multithreshold()
.
moran()
, moran_multithreshold()
, plot_moran()
, print_moran()
.
if(interactive()){ #loading example data data(plant_richness_df) data(distance_matrix) #fitting a random forest model rf.model <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], distance.matrix = distance_matrix, distance.thresholds = c(0, 1000, 2000), n.cores = 1, verbose = FALSE ) #getting Moran's I of the residuals x <- get_moran(rf.model) }
if(interactive()){ #loading example data data(plant_richness_df) data(distance_matrix) #fitting a random forest model rf.model <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], distance.matrix = distance_matrix, distance.thresholds = c(0, 1000, 2000), n.cores = 1, verbose = FALSE ) #getting Moran's I of the residuals x <- get_moran(rf.model) }
Returns the performance slot of an rf()
, rf_repeat()
, or rf_spatial()
model computed on the out-of-bag data.
get_performance(model)
get_performance(model)
model |
Model fitted with |
A data frame with four columns:
metric
Name of the performance metric.
median
Value of the performance metric. Truly a median only if the model is fitted with rf_repeat()
.
median_absolute_deviation
median absolute deviation (MAD), only if the model is fitted with rf_repeat()
, and NA
otherwise.
if(interactive()){ #loading example data data(plant_richness_df) data(distance.matrix) #fitting random forest model rf.model <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], distance.matrix = distance_matrix, distance.thresholds = 0, n.cores = 1, verbose = FALSE ) #getting model performance x <- get_performance(rf.model) x }
if(interactive()){ #loading example data data(plant_richness_df) data(distance.matrix) #fitting random forest model rf.model <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], distance.matrix = distance_matrix, distance.thresholds = 0, n.cores = 1, verbose = FALSE ) #getting model performance x <- get_performance(rf.model) x }
Returns model predictions from a model fitted with rf()
, rf_repeat()
, or rf_spatial()
.
get_predictions(model)
get_predictions(model)
model |
A model produced by |
A vector with predictions, or median of the predictions across repetitions if the model was fitted with rf_repeat()
.
if(interactive()){ #loading example data data(plant_richness_df) #fitting a random forest model rf.model <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], n.cores = 1, verbose = FALSE ) #get vector of predictions x <- get_predictions(rf.model) x }
if(interactive()){ #loading example data data(plant_richness_df) #fitting a random forest model rf.model <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], n.cores = 1, verbose = FALSE ) #get vector of predictions x <- get_predictions(rf.model) x }
Returns the residuals of models fitted with rf()
, rf_repeat()
, or rf_spatial()
.
get_residuals(model)
get_residuals(model)
model |
A model fitted with |
A vector with model residuals, or the median of model residuals across repetitions if the model was fitted with rf_repeat()
.
if(interactive()){ #load example data data(plant_richness_df) #fit random forest model rf.model <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], n.cores = 1, verbose = FALSE ) #getting vector with residuals x <- get_residuals(rf.model) x }
if(interactive()){ #load example data data(plant_richness_df) #fit random forest model rf.model <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], n.cores = 1, verbose = FALSE ) #getting vector with residuals x <- get_residuals(rf.model) x }
Generates and returns the data required to plot the response curves of a model fitted with rf()
, rf_repeat()
, or rf_spatial()
.
get_response_curves( model = NULL, variables = NULL, quantiles = c(0.1, 0.5, 0.9), grid.resolution = 200, verbose = TRUE )
get_response_curves( model = NULL, variables = NULL, quantiles = c(0.1, 0.5, 0.9), grid.resolution = 200, verbose = TRUE )
model |
A model fitted with |
variables |
Character vector, names of predictors to plot. If |
quantiles |
Numeric vector with values between 0 and 1, argument |
grid.resolution |
Integer between 20 and 500. Resolution of the plotted curve Default: |
verbose |
Logical, if TRUE the plot is printed. Default: |
All variables that are not plotted in a particular response curve are set to the values of their respective quantiles, and the response curve for each one of these quantiles is shown in the plot.
A data frame with the following columns:
response
: Predicted values of the response, obtained with stats::predict()
.
predictor
: Values of the given predictor.
quantile
: Grouping column, values of the quantiles at which the other predictors are set to generate the response curve.
model
: Model number, only relevant if the model was produced with rf_repeat()
.
predictor.name
: Grouping variable, name of the predictor.
response.name
: Grouping variable, name of the response variable.
if(interactive()){ #loading example data data(plant_richness_df) #fitting random forest model out <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], n.cores = 1, verbose = FALSE ) #getting data frame with response curves p <- get_response_curves(out) head(p) }
if(interactive()){ #loading example data data(plant_richness_df) #fitting random forest model out <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], n.cores = 1, verbose = FALSE ) #getting data frame with response curves p <- get_response_curves(out) head(p) }
Returns spatial predictors from a model fitted with rf_spatial()
in order of importance.
get_spatial_predictors(model)
get_spatial_predictors(model)
model |
A model fitted with |
A data frame with the spatial predictors included in the model.
if(interactive()){ #loading example data data(distance_matrix) data(plant_richness_df) #fittind spatial model model <- rf_spatial( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], distance.matrix = distance_matrix, distance.thresholds = c(0, 1000), n.cores = 1, method = "mem.moran.sequential" ) #getting data frame with the selected spatial predictors spatial.predictors <- get_spatial_predictors(model) head(spatial.predictors) }
if(interactive()){ #loading example data data(distance_matrix) data(plant_richness_df) #fittind spatial model model <- rf_spatial( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], distance.matrix = distance_matrix, distance.thresholds = c(0, 1000), n.cores = 1, method = "mem.moran.sequential" ) #getting data frame with the selected spatial predictors spatial.predictors <- get_spatial_predictors(model) head(spatial.predictors) }
Returns TRUE
if dependent.variable.name
is a binary variable with the values 1 and 0.
is_binary(data = NULL, dependent.variable.name = NULL)
is_binary(data = NULL, dependent.variable.name = NULL)
data |
Data frame with a response variable and a set of predictors. Default: |
dependent.variable.name |
Character string with the name of the response variable. Must be in the column names of |
Logical.
if(interactive()){ #dummy data frame data <- data.frame( response = c(0, 0, 0, 1, 1) ) #checking if response is binary is_binary( data = data, dependent.variable.name = "response" ) }
if(interactive()){ #dummy data frame data <- data.frame( response = c(0, 0, 0, 1, 1) ) #checking if response is binary is_binary( data = data, dependent.variable.name = "response" ) }
Used internally by make_spatial_folds()
and rf_evaluate()
. Uses the coordinates of a point xy.i
to generate two spatially independent data folds from the data frame xy
. It does so by growing a rectangular buffer from xy.i
until a number of records defined by training.fraction
is inside the buffer. The indices of these records are then stored as "training" in the output list. The indices of the remaining records outside of the buffer are stored as "testing". These training and testing records can be then used to evaluate a model on independent data via cross-validation.
make_spatial_fold( data = NULL, dependent.variable.name = NULL, xy.i = NULL, xy = NULL, distance.step.x = NULL, distance.step.y = NULL, training.fraction = 0.8 )
make_spatial_fold( data = NULL, dependent.variable.name = NULL, xy.i = NULL, xy = NULL, distance.step.x = NULL, distance.step.y = NULL, training.fraction = 0.8 )
data |
Data frame with a response variable and a set of predictors. Default: |
dependent.variable.name |
Character string with the name of the response variable. Must be in the column names of |
xy.i |
One row data frame with at least three columns: "x" (longitude), "y" (latitude), and "id" (integer, id of the record). Can be a row of |
xy |
A data frame with at least three columns: "x" (longitude), "y" (latitude), and "id" (integer, index of the record). Default: |
distance.step.x |
Numeric, distance step used during the growth in the x axis of the buffers defining the training folds. Default: |
distance.step.y |
Numeric, distance step used during the growth in the y axis of the buffers defining the training folds. Default: |
training.fraction |
Numeric, fraction of the data to be included in the training fold, Default: |
A list with two slots named training
and testing
with the former having the indices of the training records selected from xy
, and the latter having the indices of the testing records.
make_spatial_folds()
, rf_evaluate()
if(interactive()){ #loading example data data(plant_richness_df) #getting case coordinates xy <- plant_richness_df[, 1:3] colnames(xy) <- c("id", "x", "y") #building a spatial fold centered in the first pair of coordinates out <- make_spatial_fold( xy.i = xy[1, ], xy = xy, training.fraction = 0.6 ) #indices of the training and testing folds out$training out$testing #plotting the data plot(xy[ c("x", "y")], type = "n", xlab = "", ylab = "") #plots training points points(xy[out$training, c("x", "y")], col = "red4", pch = 15) #plots testing points points(xy[out$testing, c("x", "y")], col = "blue4", pch = 15) #plots xy.i points(xy[1, c("x", "y")], col = "black", pch = 15, cex = 2) }
if(interactive()){ #loading example data data(plant_richness_df) #getting case coordinates xy <- plant_richness_df[, 1:3] colnames(xy) <- c("id", "x", "y") #building a spatial fold centered in the first pair of coordinates out <- make_spatial_fold( xy.i = xy[1, ], xy = xy, training.fraction = 0.6 ) #indices of the training and testing folds out$training out$testing #plotting the data plot(xy[ c("x", "y")], type = "n", xlab = "", ylab = "") #plots training points points(xy[out$training, c("x", "y")], col = "red4", pch = 15) #plots testing points points(xy[out$testing, c("x", "y")], col = "blue4", pch = 15) #plots xy.i points(xy[1, c("x", "y")], col = "black", pch = 15, cex = 2) }
Applies make_spatial_fold()
to every record in a data frame xy.selected
to generate as many spatially independent folds over the dataset xy
as rows are in xy.selected
.
make_spatial_folds( data = NULL, dependent.variable.name = NULL, xy.selected = NULL, xy = NULL, distance.step.x = NULL, distance.step.y = NULL, training.fraction = 0.75, n.cores = parallel::detectCores() - 1, cluster = NULL )
make_spatial_folds( data = NULL, dependent.variable.name = NULL, xy.selected = NULL, xy = NULL, distance.step.x = NULL, distance.step.y = NULL, training.fraction = 0.75, n.cores = parallel::detectCores() - 1, cluster = NULL )
data |
Data frame with a response variable and a set of predictors. Default: |
dependent.variable.name |
Character string with the name of the response variable. Must be in the column names of |
xy.selected |
Data frame with at least three columns: "x" (longitude), "y" (latitude), and "id" (integer, id of the record). Usually a subset of |
xy |
data frame with at least three columns: "x" (longitude), "y" (latitude), and "id" (integer, index of the record). Default: |
distance.step.x |
Numeric, distance step used during the growth in the x axis of the buffers defining the training folds. Default: |
distance.step.y |
Numeric, distance step used during the growth in the y axis of the buffers defining the training folds. Default: |
training.fraction |
numeric, fraction of the data to be included in the growing buffer as training data, Default: |
n.cores |
Integer, number of cores to use for parallel execution. Creates a socket cluster with |
cluster |
A cluster definition generated with |
A list with as many slots as rows are in xy.selected
. Each slot has two slots named training
and testing
, with the former having the indices of the training records selected from xy, and the latter having the indices of the testing records.
make_spatial_fold()
, rf_evaluate()
if(interactive()){ #loading example data data(plant_richness_df) #getting case coordinates xy <- plant_richness_df[, 1:3] colnames(xy) <- c("id", "x", "y") #thining til 20 cases xy.selected <- thinning_til_n( xy = xy, n = 20 ) #making spatial folds centered on these 20 cases out <- make_spatial_folds( xy.selected = xy.selected, xy = xy, distance.step = 0.05, #degrees training.fraction = 0.6, n.cores = 1 ) #plotting training and testing folds plot(xy[ c("x", "y")], type = "n", xlab = "", ylab = "") #plots training points points(xy[out[[10]]$training, c("x", "y")], col = "red4", pch = 15) #plots testing points points(xy[out[[10]]$testing, c("x", "y")], col = "blue4", pch = 15) #plots xy.i points(xy[10, c("x", "y")], col = "black", pch = 15, cex = 2) }
if(interactive()){ #loading example data data(plant_richness_df) #getting case coordinates xy <- plant_richness_df[, 1:3] colnames(xy) <- c("id", "x", "y") #thining til 20 cases xy.selected <- thinning_til_n( xy = xy, n = 20 ) #making spatial folds centered on these 20 cases out <- make_spatial_folds( xy.selected = xy.selected, xy = xy, distance.step = 0.05, #degrees training.fraction = 0.6, n.cores = 1 ) #plotting training and testing folds plot(xy[ c("x", "y")], type = "n", xlab = "", ylab = "") #plots training points points(xy[out[[10]]$training, c("x", "y")], col = "red4", pch = 15) #plots testing points points(xy[out[[10]]$testing, c("x", "y")], col = "blue4", pch = 15) #plots xy.i points(xy[10, c("x", "y")], col = "black", pch = 15, cex = 2) }
Computes the positive Moran's Eigenvector Maps of a distance matrix.
mem( distance.matrix = NULL, distance.threshold = 0, colnames.prefix = "mem" )
mem( distance.matrix = NULL, distance.threshold = 0, colnames.prefix = "mem" )
distance.matrix |
Distance matrix. Default: |
distance.threshold |
Numeric vector with distance thresholds defining different neighborhood extents within the distance matrix, Default: 0 |
colnames.prefix |
Character, name prefix for the output columns. Default: |
Takes the distance matrix x
, double-centers it with double_center_distance_matrix()
, applies eigen, and returns eigenvectors with positive normalized eigenvalues (a.k.a Moran's Eigenvector Maps, or MEMs). These MEMs are later used as spatial predictors by rf_spatial()
.
A data frame with positive Moran's Eigenvector Maps.
mem_multithreshold()
, rf_spatial()
if(interactive()){ #loading example distance matrix data(distance_matrix) #Moran's Eigenvector Maps of the distance matrix mem <- mem(x = distance_matrix) }
if(interactive()){ #loading example distance matrix data(distance_matrix) #Moran's Eigenvector Maps of the distance matrix mem <- mem(x = distance_matrix) }
Computes Moran's Eigenvector Maps of a distance matrix (using mem()
) over different distance thresholds.
mem_multithreshold( distance.matrix = NULL, distance.thresholds = NULL, max.spatial.predictors = NULL )
mem_multithreshold( distance.matrix = NULL, distance.thresholds = NULL, max.spatial.predictors = NULL )
distance.matrix |
Distance matrix. Default: |
distance.thresholds |
Numeric vector with distance thresholds defining neighborhood in the distance matrix, Default: |
max.spatial.predictors |
Maximum number of spatial predictors to generate. Only useful to save memory when the distance matrix |
The function takes the distance matrix x
, computes its weights at difference distance thresholds, double-centers the resulting weight matrices with double_center_distance_matrix()
, applies eigen to each double-centered matrix, and returns eigenvectors with positive normalized eigenvalues for different distance thresholds.
A data frame with as many rows as the distance matrix x
containing positive Moran's Eigenvector Maps. The data frame columns are named "spatial_predictor_DISTANCE_COLUMN", where DISTANCE is the given distance threshold, and COLUMN is the column index of the given spatial predictor.
if(interactive()){ #loading example data data(distance_matrix) #computing Moran's eigenvector maps for 0, 1000, and 2000 km mem.df <- mem_multithreshold( distance.matrix = distance_matrix, distance.thresholds = c(0, 1000, 2000) ) head(mem.df) }
if(interactive()){ #loading example data data(distance_matrix) #computing Moran's eigenvector maps for 0, 1000, and 2000 km mem.df <- mem_multithreshold( distance.matrix = distance_matrix, distance.thresholds = c(0, 1000, 2000) ) head(mem.df) }
Computes the spatial correlation coefficient (Moran's I) of a vector given a distance matrix, and a distance threshold used to define "neighborhood".
moran( x = NULL, distance.matrix = NULL, distance.threshold = NULL, verbose = TRUE )
moran( x = NULL, distance.matrix = NULL, distance.threshold = NULL, verbose = TRUE )
x |
Numeric vector, generally model residuals, Default: |
distance.matrix |
Distance matrix among cases in |
distance.threshold |
numeric value in the range of values available in |
verbose |
Logical, if |
Inspired in the Moran.I()
function of the ape package.
A list with three named slots:
test
: Data frame with observed and expected Moran's I values, p-value, and interpretation.
plot
: Moran's plot of the vector x against the spatial lags of x.
plot.df
: Data used in the Moran's plot.
if(interactive()){ #loading example data data(distance_matrix) data(plant_richness) #Moran's I of the response variable out <- moran( x = plant_richness$richness_species_vascular, distance.matrix = distance_matrix ) out }
if(interactive()){ #loading example data data(distance_matrix) data(plant_richness) #Moran's I of the response variable out <- moran( x = plant_richness$richness_species_vascular, distance.matrix = distance_matrix ) out }
Applies moran()
to different distance thresholds at the same time.
moran_multithreshold( x = NULL, distance.matrix = NULL, distance.thresholds = NULL, verbose = TRUE )
moran_multithreshold( x = NULL, distance.matrix = NULL, distance.thresholds = NULL, verbose = TRUE )
x |
Numeric vector, generally model residuals, Default: |
distance.matrix |
Distance matrix among cases in |
distance.thresholds |
Numeric vector, distances below each value are set to 0 on separated copies of the distance matrix for the computation of Moran's I at different neighborhood distances. If |
verbose |
Logical, if |
Using different distance thresholds helps to take into account the uncertainty about what "neighborhood" means in ecological systems (1000km in geological time means little, but 100m might be quite a long distance for a tree to disperse seeds over), and allows to explore spatial autocorrelation of model residuals for several minimum-distance criteria at once.
A named list with the slots:
df
: Data frame with the results of moran per distance threshold.
plot
: A plot of Moran's I across distance thresholds.
max.moran
: Maximum value of Moran's I across thresholds.
max.moran.distance.threshold
: Distance threshold with the maximum Moran's I value.
if(interactive()){ #loading example data data(distance_matrix) data(plant_richness) #computing Moran's I for the response variable at several reference distances out <- moran_multithreshold( x = plant_richness$richness_species_vascular, distance.matrix = distance_matrix, distance.thresholds = c(0, 100, 1000, 10000), plot = TRUE ) out }
if(interactive()){ #loading example data data(distance_matrix) data(plant_richness) #computing Moran's I for the response variable at several reference distances out <- moran_multithreshold( x = plant_richness$richness_species_vascular, distance.matrix = distance_matrix, distance.thresholds = c(0, 100, 1000, 10000), plot = TRUE ) out }
Shows the size of the objects currently in the R environment. Helps to locate large objects cluttering the R environment and/or causing memory problems during the execution of large workflows.
objects_size(n = 10)
objects_size(n = 10)
n |
Number of objects to show, Default: |
A data frame with the row names indicating the object name, the field 'Type' indicating the object type, 'Size' indicating the object size, and the columns 'Length/Rows' and 'Columns' indicating the object dimensions if applicable.
if(interactive()){ #creating dummy objects x <- matrix(runif(100), 10, 10) y <- matrix(runif(10000), 100, 100) #reading their in-memory size objects_size() }
if(interactive()){ #creating dummy objects x <- matrix(runif(100), 10, 10) y <- matrix(runif(10000), 100, 100) #reading their in-memory size objects_size() }
Optimizes the selection of spatial predictors using two different methods: "moran.i", and "p.value".
optimization_function( x = NULL, weight.r.squared = NULL, weight.penalization.n.predictors = NULL, optimization.method = "moran.i" )
optimization_function( x = NULL, weight.r.squared = NULL, weight.penalization.n.predictors = NULL, optimization.method = "moran.i" )
x |
Optimization data frame generated internally by |
weight.r.squared |
Numeric between 0 and 1, weight of R-squared in the optimization process. Default: |
weight.penalization.n.predictors |
Numeric between 0 and 1, weight of the penalization on the number of added spatial predictors. Default: |
optimization.method |
Character, one of "moran.i", and "p.value". Default: |
The method "moran.i" tries to maximize 1 - Moran's
I while taking into account the R-squared of the model and a penalization on the number of introduced spatial predictors through the expression
(1 - Moran's I) + w1 * r.squared - w2 * penalization
The method "p.value" uses a binary version of the p-values of Moran's I (1 if >= 0.05, 0 otherwise), and uses the expression
max(1 - Moran's I, binary p-value) + w1 * r.squared - w2 * penalization
The "moran.i" method generally selects more spatial predictors than the "p.value" method.
A numeric vector with the optimization criteria.
select_spatial_predictors_recursive()
, select_spatial_predictors_sequential()
Extracts all factors of a principal component analysis of a matrix or data frame. Just a convenient wrapper for prcomp.
pca( x = NULL, colnames.prefix = "pca_factor" )
pca( x = NULL, colnames.prefix = "pca_factor" )
x |
numeric matrix or data frame, Default: NULL |
colnames.prefix |
character, name prefix for the output columns, Default: 'pca_factor' |
Columns in x
with zero variance are removed before computing the PCA.
A data frame with the PCA factors of x
.
if(interactive()){ #load example distance matrix data(distance_matrix) #PCA of the distance matrix out <- pca(x = distance_matrix) out }
if(interactive()){ #load example distance matrix data(distance_matrix) #PCA of the distance matrix out <- pca(x = distance_matrix) out }
Computes PCA factors of a distance matrix over different distance thresholds to generate spatial predictors for a model fitted with rf_spatial()
.
pca_multithreshold( distance.matrix = NULL, distance.thresholds = NULL, max.spatial.predictors = NULL )
pca_multithreshold( distance.matrix = NULL, distance.thresholds = NULL, max.spatial.predictors = NULL )
distance.matrix |
Distance matrix. Default: |
distance.thresholds |
Numeric vector with distance thresholds defining neighborhood in the distance matrix, Default: |
max.spatial.predictors |
Integer, maximum number of spatial predictors to generate. Only useful when the distance matrix |
The distance matrix is converted into weights with weights_from_distance_matrix()
before computing the PCA. This produces more meaningful spatial predictors than using the distance matrix as is.
A data frame with the PCA factors of the thresholded matrix. The data frame columns are named "spatial_predictor_DISTANCE_COLUMN", where DISTANCE is the given distance threshold, and COLUMN is the column index of the given predictor.
if(interactive()){ #loading example distance matrix load(distance_matrix) #PCA factors of the distance matrix for two reference distances x <- pca_multithreshold( distance.matrix = distance_matrix, distance.thresholds = c(0, 1000) ) head(x) }
if(interactive()){ #loading example distance matrix load(distance_matrix) #PCA factors of the distance matrix for two reference distances x <- pca_multithreshold( distance.matrix = distance_matrix, distance.thresholds = c(0, 1000) ) head(x) }
Richness of vascular plants of the American ecoregions as defined in Ecoregions 2017.
data(plant_richness_df)
data(plant_richness_df)
A data frame with 227 rows and 22 columns:
ecoregion_id
: Id of the ecoregion).
x
: Longitude in degrees (WGS84).
y
: Latitude in degrees (WGS84).
richness_species_vascular
: Number of vascular species found in the ecoregion. Response variable.
bias_area_km2
: Area of the ecoregion in squared kilometers.
bias_species_per_record
: Number of species divided by the number of spatial GBIF records available in the ecoregion as a measure of sampling bias.
climate_aridity_index_average
: Average of the ecoregion.
climate_hypervolume
: Volume of the climatic envelope of the ecoregion, computed with the hypervolume package.
climate_velocity_lgm_average
: Average climate velocity of the ecoregion since the Last Glacial Maximum.
neighbors_count
: Number of immediate neighbors of the ecoregion as a measure of connectivity/isolation.
neighbors_percent_shared_edge
: Percentage of shared edge with the neighbors as a measure of connectivity/isolation.
human_population_density
: Population density of the ecoregion.
topography_elevation_average
: Average elevation of the ecoregion.
landcover_herbs_percent_average
: Average cover percentage of herbs extracted from MODIS Vegetation Continuous Fields.
fragmentation_cohesion
: Geographic fragmentation index of the ecoregion as computed with the R package landscapemetrics.
fragmentation_division
: Another fragmentation index.
neighbors_area
: Total area of the ecoregions's immediate neighbors.
human_population
: Human population in the ecoregion.
human_footprint_average
: Average human footprint in the ecoregion.
climate_bio1_average
: Average mean annual temperature.
climate_bio15_minimum
: Average precipitation seasonality.
Plots the results of an spatial cross-validation performed with rf_evaluate()
.
plot_evaluation( model, fill.color = viridis::viridis( 3, option = "F", alpha = 0.8, direction = -1 ), line.color = "gray30", verbose = TRUE, notch = TRUE )
plot_evaluation( model, fill.color = viridis::viridis( 3, option = "F", alpha = 0.8, direction = -1 ), line.color = "gray30", verbose = TRUE, notch = TRUE )
model |
A model resulting from |
fill.color |
Character vector with three hexadecimal codes (e.g. "#440154FF" "#21908CFF" "#FDE725FF"), or function generating a palette (e.g. |
line.color |
Character string, color of the line produced by |
verbose |
Logical, if |
notch |
Logical, if |
A ggplot.
rf_evaluate()
, get_evaluation()
, print_evaluation()
.
if(interactive()){ #loading example data data(plant_richness_df) data(distance_matrix) #fitting a random forest model rf.model <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], distance.matrix = distance_matrix, distance.thresholds = 0, n.cores = 1, verbose = FALSE ) #evaluating the model with spatial cross-validation rf.model <- rf_evaluate( model = rf.model, xy = plant_richness_df[, c("x", "y")], n.cores = 1 ) #plotting the evaluation results plot_evaluation(rf.model) }
if(interactive()){ #loading example data data(plant_richness_df) data(distance_matrix) #fitting a random forest model rf.model <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], distance.matrix = distance_matrix, distance.thresholds = 0, n.cores = 1, verbose = FALSE ) #evaluating the model with spatial cross-validation rf.model <- rf_evaluate( model = rf.model, xy = plant_richness_df[, c("x", "y")], n.cores = 1 ) #plotting the evaluation results plot_evaluation(rf.model) }
Plots variable importance scores of rf()
, rf_repeat()
, and rf_spatial()
models. Distributions of importance scores produced with rf_repeat()
are plotted using ggplot2::geom_violin
, which shows the median of the density estimate rather than the actual median of the data. However, the violin plots are ordered from top to bottom by the real median of the data to make small differences in median importance easier to spot. Ths function does not plot the result of rf_importance()
yet, but you can find it under model$importance$cv.per.variable.plot
.
plot_importance( model, fill.color = viridis::viridis( 100, option = "F", direction = -1, alpha = 1, end = 0.9 ), line.color = "white", verbose = TRUE )
plot_importance( model, fill.color = viridis::viridis( 100, option = "F", direction = -1, alpha = 1, end = 0.9 ), line.color = "white", verbose = TRUE )
model |
A model fitted with |
fill.color |
Character vector with hexadecimal codes (e.g. "#440154FF" "#21908CFF" "#FDE725FF"), or function generating a palette (e.g. |
line.color |
Character string, color of the line produced by |
verbose |
Logical, if |
A ggplot.
print_importance()
, get_importance()
if(interactive()){ #loading example data data(plant_richness_df) data(distance_matrix) #fitting a random forest model rf.model <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], distance.matrix = distance_matrix, distance.thresholds = 0, n.cores = 1, verbose = FALSE ) #plotting variable importance scores plot_importance(model = rf.model) }
if(interactive()){ #loading example data data(plant_richness_df) data(distance_matrix) #fitting a random forest model rf.model <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], distance.matrix = distance_matrix, distance.thresholds = 0, n.cores = 1, verbose = FALSE ) #plotting variable importance scores plot_importance(model = rf.model) }
Plots the results of spatial autocorrelation tests for a variety of functions within the package. The x axis represents the Moran's I estimate, the y axis contains the values of the distance thresholds, the dot sizes represent the p-values of the Moran's I estimate, and the red dashed line represents the theoretical null value of the Moran's I estimate.
plot_moran( model, point.color = viridis::viridis( 100, option = "F", direction = -1 ), line.color = "gray30", option = 1, ncol = 1, verbose = TRUE )
plot_moran( model, point.color = viridis::viridis( 100, option = "F", direction = -1 ), line.color = "gray30", option = 1, ncol = 1, verbose = TRUE )
model |
A model fitted with |
point.color |
Colors of the plotted points. Can be a single color name (e.g. "red4"), a character vector with hexadecimal codes (e.g. "#440154FF" "#21908CFF" "#FDE725FF"), or function generating a palette (e.g. |
line.color |
Character string, color of the line produced by |
option |
Integer, type of plot. If |
ncol |
Number of columns of the plot. Only relevant when |
verbose |
Logical, if |
A ggplot.
moran()
, moran_multithreshold()
if(interactive()){ #loading example data data(plant_richness_df) data(distance.matrix) #fitting a random forest model rf.model <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], distance.matrix = distance_matrix, distance.thresholds = c(0, 1000, 2000), n.cores = 1, verbose = FALSE ) #Incremental/multiscale Moran's I plot_moran(rf.model) #Moran's scatterplot plot_moran(rf.model, option = 2) }
if(interactive()){ #loading example data data(plant_richness_df) data(distance.matrix) #fitting a random forest model rf.model <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], distance.matrix = distance_matrix, distance.thresholds = c(0, 1000, 2000), n.cores = 1, verbose = FALSE ) #Incremental/multiscale Moran's I plot_moran(rf.model) #Moran's scatterplot plot_moran(rf.model, option = 2) }
Plots optimization data frames produced by select_spatial_predictors_sequential()
and select_spatial_predictors_recursive()
.
plot_optimization( model, point.color = viridis::viridis( 100, option = "F", direction = -1 ), verbose = TRUE )
plot_optimization( model, point.color = viridis::viridis( 100, option = "F", direction = -1 ), verbose = TRUE )
model |
A model produced by |
point.color |
Colors of the plotted points. Can be a single color name (e.g. "red4"), a character vector with hexadecimal codes (e.g. "#440154FF" "#21908CFF" "#FDE725FF"), or function generating a palette (e.g. |
verbose |
Logical, if |
If the method used to fit a model with rf_spatial()
is "hengl", the function returns nothing, as this method does not require optimization.
A ggplot.
if(interactive()){ #loading example data data(distance_matrix) data(plant_richness_df) #names of the response and predictors dependent.variable.name <- "richness_species_vascular" predictor.variable.names <- colnames(plant_richness_df)[5:21] #spatial model model <- rf_spatial( data = plant_richness_df, dependent.variable.name = dependent.variable.name, predictor.variable.names = predictor.variable.names, distance.matrix = distance_matrix, distance.thresholds = 0, method = "mem.moran.sequential", n.cores = 1, seed = 1 ) #plotting selection of spatial predictors plot_optimization(model = model) }
if(interactive()){ #loading example data data(distance_matrix) data(plant_richness_df) #names of the response and predictors dependent.variable.name <- "richness_species_vascular" predictor.variable.names <- colnames(plant_richness_df)[5:21] #spatial model model <- rf_spatial( data = plant_richness_df, dependent.variable.name = dependent.variable.name, predictor.variable.names = predictor.variable.names, distance.matrix = distance_matrix, distance.thresholds = 0, method = "mem.moran.sequential", n.cores = 1, seed = 1 ) #plotting selection of spatial predictors plot_optimization(model = model) }
Plots normality and autocorrelation tests of model residuals.
plot_residuals_diagnostics( model, point.color = viridis::viridis(100, option = "F"), line.color = "gray10", fill.color = viridis::viridis(4, option = "F", alpha = 0.95)[2], option = 1, ncol = 1, verbose = TRUE )
plot_residuals_diagnostics( model, point.color = viridis::viridis(100, option = "F"), line.color = "gray10", fill.color = viridis::viridis(4, option = "F", alpha = 0.95)[2], option = 1, ncol = 1, verbose = TRUE )
model |
A model produced by |
point.color |
Colors of the plotted points. Can be a single color name (e.g. "red4"), a character vector with hexadecimal codes (e.g. "#440154FF" "#21908CFF" "#FDE725FF"), or function generating a palette (e.g. |
line.color |
Character string, color of the line produced by |
fill.color |
Character string, fill color of the bars produced by |
option |
(argument of |
ncol |
(argument of |
verbose |
Logical, if |
A patchwork object.
if(interactive()){ #load example data data(plant_richness_df) data(distance_matrix) #fit a random forest model x <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], distance.matrix = distance_matrix, n.cores = 1 ) #residuals diagnostics plot_residuals_diagnostics(x) }
if(interactive()){ #load example data data(plant_richness_df) data(distance_matrix) #fit a random forest model x <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], distance.matrix = distance_matrix, n.cores = 1 ) #residuals diagnostics plot_residuals_diagnostics(x) }
Plots the response curves of models fitted with rf()
, rf_repeat()
, or rf_spatial()
.
plot_response_curves( model = NULL, variables = NULL, quantiles = c(0.1, 0.5, 0.9), grid.resolution = 200, line.color = viridis::viridis(length(quantiles), option = "F", end = 0.9), ncol = 2, show.data = FALSE, verbose = TRUE )
plot_response_curves( model = NULL, variables = NULL, quantiles = c(0.1, 0.5, 0.9), grid.resolution = 200, line.color = viridis::viridis(length(quantiles), option = "F", end = 0.9), ncol = 2, show.data = FALSE, verbose = TRUE )
model |
A model fitted with |
variables |
Character vector, names of predictors to plot. If |
quantiles |
Numeric vector with values between 0 and 1, argument |
grid.resolution |
Integer between 20 and 500. Resolution of the plotted curve Default: |
line.color |
Character vector with colors, or function to generate colors for the lines representing |
ncol |
Integer, argument of wrap_plots. Defaults to the rounded squared root of the number of plots. Default: |
show.data |
Logical, if |
verbose |
Logical, if TRUE the plot is printed. Default: |
All variables that are not plotted in a particular response curve are set to the values of their respective quantiles, and the response curve for each one of these quantiles is shown in the plot. When the input model was fitted with rf_repeat()
with keep.models = TRUE
, then the plot shows the median of all model runs, and each model run separately as a thinner line. The output list can be plotted all at once with patchwork::wrap_plots(p)
or cowplot::plot_grid(plotlist = p)
, or one by one by extracting each plot from the list.
A list with slots named after the selected variables
, with one ggplot each.
if(interactive()){ #loading example data data(plant_richness_df) #fitting a random forest model m <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], n.cores = 1, verbose = FALSE ) #response curves of most important predictors plot_response_curves(model = m) }
if(interactive()){ #loading example data data(plant_richness_df) #fitting a random forest model m <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], n.cores = 1, verbose = FALSE ) #response curves of most important predictors plot_response_curves(model = m) }
Plots response surfaces for any given pair of predictors in a rf()
, rf_repeat()
, or rf_spatial()
model.
plot_response_surface( model = NULL, a = NULL, b = NULL, quantiles = 0.5, grid.resolution = 100, point.size.range = c(0.5, 2.5), point.alpha = 1, fill.color = viridis::viridis(100, option = "F", direction = -1, alpha = 0.9), point.color = "gray30", verbose = TRUE )
plot_response_surface( model = NULL, a = NULL, b = NULL, quantiles = 0.5, grid.resolution = 100, point.size.range = c(0.5, 2.5), point.alpha = 1, fill.color = viridis::viridis(100, option = "F", direction = -1, alpha = 0.9), point.color = "gray30", verbose = TRUE )
model |
A model fitted with |
a |
Character string, name of a model predictor. If |
b |
Character string, name of a model predictor. If |
quantiles |
Numeric vector between 0 and 1. Argument |
grid.resolution |
Integer between 20 and 500. Resolution of the plotted surface Default: |
point.size.range |
Numeric vector of length 2 with the range of point sizes used by geom_point. Using |
point.alpha |
Numeric between 0 and 1, transparency of the points. Setting it to |
fill.color |
Character vector with hexadecimal codes (e.g. "#440154FF" "#21908CFF" "#FDE725FF"), or function generating a palette (e.g. |
point.color |
Character vector with a color name (e.g. "red4"). Default: |
verbose |
Logical, if TRUE the plot is printed. Default: |
All variables that are not a
or b
in a response curve are set to the values of their respective quantiles to plot the response surfaces. The output list can be plotted all at once with patchwork::wrap_plots(p)
or cowplot::plot_grid(plotlist = p)
, or one by one by extracting each plot from the list.
A list with slots named after the selected quantiles
, each one with a ggplot.
if(interactive()){ #load example data data(plant_richness_df) #fit random forest model out <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], n.cores = 1, verbose = FALSE ) #plot interactions between most important predictors plot_response_surfaces(x = out) }
if(interactive()){ #load example data data(plant_richness_df) #fit random forest model out <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], n.cores = 1, verbose = FALSE ) #plot interactions between most important predictors plot_response_surfaces(x = out) }
Plots the dependent variable against each predictor.
plot_training_df( data = NULL, dependent.variable.name = NULL, predictor.variable.names = NULL, ncol = 4, method = "loess", point.color = viridis::viridis(100, option = "F"), line.color = "gray30" )
plot_training_df( data = NULL, dependent.variable.name = NULL, predictor.variable.names = NULL, ncol = 4, method = "loess", point.color = viridis::viridis(100, option = "F"), line.color = "gray30" )
data |
Data frame with a response variable and a set of predictors. Default: |
dependent.variable.name |
Character string with the name of the response variable. Must be in the column names of |
predictor.variable.names |
Character vector with the names of the predictive variables. Every element of this vector must be in the column names of |
ncol |
Number of columns of the plot. Argument |
method |
Method for geom_smooth, one of: "lm", "glm", "gam", "loess", or a function, for example |
point.color |
Colors of the plotted points. Can be a single color name (e.g. "red4"), a character vector with hexadecimal codes (e.g. "#440154FF" "#21908CFF" "#FDE725FF"), or function generating a palette (e.g. |
line.color |
Character string, color of the line produced by |
A wrap_plots object.
if(interactive()){ #load example data data(plant_richness_df) #scatterplot of the training data plot_training_data( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21] ) }
if(interactive()){ #load example data data(plant_richness_df) #scatterplot of the training data plot_training_data( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21] ) }
Plots the the Moran's I test of the response and the predictors in a training data frame.
plot_training_df_moran( data = NULL, dependent.variable.name = NULL, predictor.variable.names = NULL, distance.matrix = NULL, distance.thresholds = NULL, fill.color = viridis::viridis(100, option = "F", direction = -1), point.color = "gray30" )
plot_training_df_moran( data = NULL, dependent.variable.name = NULL, predictor.variable.names = NULL, distance.matrix = NULL, distance.thresholds = NULL, fill.color = viridis::viridis(100, option = "F", direction = -1), point.color = "gray30" )
data |
Data frame with a response variable and a set of predictors. Default: |
dependent.variable.name |
Character string with the name of the response variable. Must be in the column names of |
predictor.variable.names |
Character vector with the names of the predictive variables. Every element of this vector must be in the column names of |
distance.matrix |
Squared matrix with the distances among the records in |
distance.thresholds |
Numeric vector, distances below each value are set to 0 on separated copies of the distance matrix for the computation of Moran's I at different neighborhood distances. If |
fill.color |
Character vector with hexadecimal codes (e.g. "#440154FF" "#21908CFF" "#FDE725FF"), or function generating a palette (e.g. |
point.color |
Character vector with a color name (e.g. "red4"). Default: |
A ggplot2 object.
if(interactive()){ #load example data data(plant_richness_df) data(distance_matrix) #plot Moran's I of training data plot_moran_training_data( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], distance.matrix = distance_matrix, distance.thresholds = c( 0, 2000, 4000, 6000, 8000 ) ) }
if(interactive()){ #load example data data(plant_richness_df) data(distance_matrix) #plot Moran's I of training data plot_moran_training_data( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], distance.matrix = distance_matrix, distance.thresholds = c( 0, 2000, 4000, 6000, 8000 ) ) }
rf_tuning()
Plots the tuning of the hyperparameters num.trees
, mtry
, and min.node.size
performed by rf_tuning()
.
plot_tuning( model, point.color = viridis::viridis( 100, option = "F" ), verbose = TRUE )
plot_tuning( model, point.color = viridis::viridis( 100, option = "F" ), verbose = TRUE )
model |
A model fitted with |
point.color |
Colors of the plotted points. Can be a single color name (e.g. "red4"), a character vector with hexadecimal codes (e.g. "#440154FF" "#21908CFF" "#FDE725FF"), or function generating a palette (e.g. |
verbose |
Logical, if |
A ggplot.
if(interactive()){ #load example data data(plant_richness_df) #fit random forest model rf.model <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], distance.matrix = distance_matrix, distance.thresholds = 0, n.cores = 1, verbose = FALSE ) #tune random forest model rf.model <- rf_tuning( model = rf.model, xy = plant_richness_df[, c("x", "y")], n.cores = 1, verbose = FALSE ) #generate tuning plot plot_tuning(model = rf.model) }
if(interactive()){ #load example data data(plant_richness_df) #fit random forest model rf.model <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], distance.matrix = distance_matrix, distance.thresholds = 0, n.cores = 1, verbose = FALSE ) #tune random forest model rf.model <- rf_tuning( model = rf.model, xy = plant_richness_df[, c("x", "y")], n.cores = 1, verbose = FALSE ) #generate tuning plot plot_tuning(model = rf.model) }
Prepares variable importance data frames and plots for models fitted with rf_spatial()
.
prepare_importance_spatial(model)
prepare_importance_spatial(model)
model |
An importance data frame with spatial predictors, or a model fitted with |
A list with importance data frames in different formats depending on whether the model was fitted with rf()
or rf_repeat()
.
if(interactive()){ #loading example data data(distance_matrix) data(plant_richness_df) #fittind spatial model model <- rf_spatial( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], distance.matrix = distance_matrix, distance.thresholds = 0, n.cores = 1 ) #preparing the importance data frame importance <- prepare_importance_spatial(model) names(importance) }
if(interactive()){ #loading example data data(distance_matrix) data(plant_richness_df) #fittind spatial model model <- rf_spatial( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], distance.matrix = distance_matrix, distance.thresholds = 0, n.cores = 1 ) #preparing the importance data frame importance <- prepare_importance_spatial(model) names(importance) }
Prints the results of an spatial cross-validation performed with rf_evaluate()
.
print_evaluation(model)
print_evaluation(model)
model |
A model resulting from |
A table printed to the standard output.
plot_evaluation()
, get_evaluation()
if(interactive()){ #loading example data data(plant_richness_df) data(distance_matrix) #fitting random forest model rf.model <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], distance.matrix = distance_matrix, distance.thresholds = 0, n.cores = 1, verbose = FALSE ) #evaluation with spatial cross-validation rf.model <- rf_evaluate( model = rf.model, xy = plant_richness_df[, c("x", "y")], n.cores = 1 ) #checking evaluation results print_evaluation(rf.model) }
if(interactive()){ #loading example data data(plant_richness_df) data(distance_matrix) #fitting random forest model rf.model <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], distance.matrix = distance_matrix, distance.thresholds = 0, n.cores = 1, verbose = FALSE ) #evaluation with spatial cross-validation rf.model <- rf_evaluate( model = rf.model, xy = plant_richness_df[, c("x", "y")], n.cores = 1 ) #checking evaluation results print_evaluation(rf.model) }
Prints variable importance scores from rf, rf_repeat, and rf_spatial models.
print_importance( model, verbose = TRUE )
print_importance( model, verbose = TRUE )
model |
A model fitted with rf, rf_repeat, or rf_spatial. |
verbose |
Logical, if |
A table printed to the standard output.
plot_importance()
, get_importance()
if(interactive()){ #loading example data data(plant_richness_df) data(distance.matrix) #fitting a random forest model rf.model <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], distance.matrix = distance_matrix, distance.thresholds = 0, n.cores = 1, verbose = FALSE ) #printing variable importance scores print_importance(model = rf.model) }
if(interactive()){ #loading example data data(plant_richness_df) data(distance.matrix) #fitting a random forest model rf.model <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], distance.matrix = distance_matrix, distance.thresholds = 0, n.cores = 1, verbose = FALSE ) #printing variable importance scores print_importance(model = rf.model) }
Prints the results of a Moran's I test on the residuals of a model.
print_moran( model, caption = NULL, verbose = TRUE )
print_moran( model, caption = NULL, verbose = TRUE )
model |
A model fitted with |
caption |
Character, caption of the output table, Default: |
verbose |
Logical, if |
Prints a table in the console using the huxtable package.
moran()
, moran_multithreshold()
, get_moran()
, plot_moran()
if(interactive()){ #loading example data data(plant_richness_df) data(distance.matrix) #fitting random forest model rf.model <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], distance.matrix = distance_matrix, distance.thresholds = c(0, 1000, 2000), n.cores = 1, verbose = FALSE ) #printing Moran's I of model's residuals print_moran(rf.model) }
if(interactive()){ #loading example data data(plant_richness_df) data(distance.matrix) #fitting random forest model rf.model <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], distance.matrix = distance_matrix, distance.thresholds = c(0, 1000, 2000), n.cores = 1, verbose = FALSE ) #printing Moran's I of model's residuals print_moran(rf.model) }
Prints the performance slot of a model fitted with rf()
, rf_repeat()
, or rf_spatial()
. For models fitted with rf_repeat()
it shows the median and the median absolute deviation of each performance measure.
print_performance(model)
print_performance(model)
model |
Model fitted with |
Prints model performance scores to the console.
print_performance()
, get_performance()
if(interactive()){ #loading example data data(plant_richness_df) data(distance.matrix) #fitting a random forest model rf.model <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], distance.matrix = distance_matrix, distance.thresholds = 0, n.cores = 1, verbose = FALSE ) #printing performance scores print_performance(rf.model) }
if(interactive()){ #loading example data data(plant_richness_df) data(distance.matrix) #fitting a random forest model rf.model <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], distance.matrix = distance_matrix, distance.thresholds = 0, n.cores = 1, verbose = FALSE ) #printing performance scores print_performance(rf.model) }
Custom print method for models fitted with rf()
, rf_repeat()
, and rf_spatial()
.
## S3 method for class 'rf' print(x, ...)
## S3 method for class 'rf' print(x, ...)
x |
A model fitted with |
... |
Additional arguments for print methods. |
Prints model details to the console.
print_evaluation()
, print_importance()
, print_moran()
, print_performance()
if(interactive()){ #loading example data data("plant_richness_df") data("distance_matrix") #fitting random forest model rf.model <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], distance.matrix = distance_matrix, distance.thresholds = 0, n.cores = 1 ) #printing model summary print(rf.model) }
if(interactive()){ #loading example data data("plant_richness_df") data("distance_matrix") #fitting random forest model rf.model <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], distance.matrix = distance_matrix, distance.thresholds = 0, n.cores = 1 ) #printing model summary print(rf.model) }
Ranks spatial predictors generated by mem_multithreshold()
or pca_multithreshold()
by their effect in reducing the Moran's I of the model residuals (ranking.method = "effect"
), or by their own Moran's I (ranking.method = "moran"
).
In the former case, one model of the type y ~ predictors + spatial_predictor_X
is fitted per spatial predictor, and the Moran's I of this model's residuals is compared with the one of the model without spatial predictors (y ~ predictors
), to finally rank the spatial predictor from maximum to minimum difference in Moran's I.
In the latter case, the spatial predictors are ordered by their Moran's I alone (this is the faster option).
In both cases, spatial predictors that are redundant with others at a Pearson correlation > 0.5 and spatial predictors with no effect (no reduction of Moran's I or Moran's I of the spatial predictor equal or lower than 0) are removed.
This function has been designed to be used internally by rf_spatial()
rather than directly by a user.
rank_spatial_predictors( data = NULL, dependent.variable.name = NULL, predictor.variable.names = NULL, distance.matrix = NULL, distance.thresholds = NULL, ranger.arguments = NULL, spatial.predictors.df = NULL, ranking.method = c("moran", "effect"), reference.moran.i = 1, verbose = FALSE, n.cores = parallel::detectCores() - 1, cluster = NULL )
rank_spatial_predictors( data = NULL, dependent.variable.name = NULL, predictor.variable.names = NULL, distance.matrix = NULL, distance.thresholds = NULL, ranger.arguments = NULL, spatial.predictors.df = NULL, ranking.method = c("moran", "effect"), reference.moran.i = 1, verbose = FALSE, n.cores = parallel::detectCores() - 1, cluster = NULL )
data |
Data frame with a response variable and a set of predictors. Default: |
dependent.variable.name |
Character string with the name of the response variable. Must be in the column names of |
predictor.variable.names |
Character vector with the names of the predictive variables. Every element of this vector must be in the column names of |
distance.matrix |
Squared matrix with the distances among the records in |
distance.thresholds |
Numeric vector with neighborhood distances. All distances in the distance matrix below each value in |
ranger.arguments |
List with ranger arguments. See rf or rf_repeat for further details. |
spatial.predictors.df |
Data frame of spatial predictors. |
ranking.method |
Character, method used by to rank spatial predictors. The method "effect" ranks spatial predictors according how much each predictor reduces Moran's I of the model residuals, while the method "moran" ranks them by their own Moran's I. Default: |
reference.moran.i |
Moran's I of the residuals of the model without spatial predictors. Default: |
verbose |
Logical, ff |
n.cores |
Integer, number of cores to use for parallel execution. Creates a socket cluster with |
cluster |
A cluster definition generated with |
A list with four slots:
method
: Character, name of the method used to rank the spatial predictors.
criteria
: Data frame with two different configurations depending on the ranking method. If ranking.method = "effect"
, the columns contain the names of the spatial predictors, the r-squared of the model, the Moran's I of the model residuals, the difference between the Moran's I of the model including the given spatial predictor, and the Moran's I of the model fitted without spatial predictors, and the interpretation of the Moran's I value. If ranking.method = "moran"
, only the name of the spatial predictor and it's Moran's I are in the output data frame.
ranking
: Ordered character vector with the names of the spatial predictors selected.
spatial.predictors.df
: data frame with the selected spatial predictors in the order of the ranking.
if(interactive()){ #loading distance matrix data(distance_matrix) #computing Moran's Eigenvector Maps mem.df <- mem( distance.matrix = distance_matrix[1:50, 1:50], distance.threshold = 0 ) #ranking by the Moran's I of the spatial predictor rank <- rank_spatial_predictors( distance.matrix = distance_matrix[1:50, 1:50], distance.thresholds = 0, spatial.predictors.df = mem.df, ranking.method = "moran", n.cores = 1 ) #checking Moran's I of MEMs rank$criteria #checking rank of MEMs rank$ranking }
if(interactive()){ #loading distance matrix data(distance_matrix) #computing Moran's Eigenvector Maps mem.df <- mem( distance.matrix = distance_matrix[1:50, 1:50], distance.threshold = 0 ) #ranking by the Moran's I of the spatial predictor rank <- rank_spatial_predictors( distance.matrix = distance_matrix[1:50, 1:50], distance.thresholds = 0, spatial.predictors.df = mem.df, ranking.method = "moran", n.cores = 1 ) #checking Moran's I of MEMs rank$criteria #checking rank of MEMs rank$ranking }
Rescales a numeric vector to a new range.
rescale_vector( x = NULL, new.min = 0, new.max = 1, integer = FALSE )
rescale_vector( x = NULL, new.min = 0, new.max = 1, integer = FALSE )
x |
Numeric vector. Default: |
new.min |
New minimum value. Default: |
new.max |
New maximum value. Default: |
integer |
Logical, if |
A numeric vector of the same length as x, but with its values rescaled between new.min
and new.max.
if(interactive()){ out <- rescale_vector( x = rnorm(100), new.min = 0, new.max = 100, integer = TRUE ) out }
if(interactive()){ out <- rescale_vector( x = rnorm(100), new.min = 0, new.max = 100, integer = TRUE ) out }
Applies a Shapiro-Wilks test to a numeric vector, and plots the qq plot and the histogram.
residuals_diagnostics(residuals, predictions)
residuals_diagnostics(residuals, predictions)
residuals |
Numeric vector, model residuals. |
predictions |
Numeric vector, model predictions. |
The function shapiro.test()
has a hard limit of 5000 cases. If the model residuals have more than 5000 cases, then sample(x = residuals, size = 5000)
is applied to the model residuals before the test.
A list with four slots:
/item w
W statistic returned by shapiro.test()
.
/item p.value
p-value of the Shapiro test.
/item interpretation
Character vector, one of "x is normal", "x is not normal".
/item plot
A patchwork plot with the qq plot and the histogram of x.
ggplot
,aes
,geom_qq_line
,ggtheme
,labs
,geom_freqpoly
,geom_abline
plot_annotation
if(interactive()){ residuals_diagnostics( residuals = runif(100), predictions = runif(100) ) }
if(interactive()){ residuals_diagnostics( residuals = runif(100), predictions = runif(100) ) }
Applies a Shapiro-Wilks test to a numeric vector, and returns a list with the statistic W, its p-value, and a character string with the interpretation.
residuals_test(residuals)
residuals_test(residuals)
residuals |
Numeric vector, model residuals. |
A list with four slots:
/item w
W statistic returned by shapiro.test()
.
/item p.value
p-value of the Shapiro test.
/item interpretation
Character vector, one of "x is normal", "x is not normal".
/item plot
A patchwork plot with the qq plot and the histogram of x.
ggplot
,aes
,geom_qq_line
,ggtheme
,labs
,geom_freqpoly
,geom_abline
plot_annotation
if(interactive()){ residuals_test(residuals = runif(100)) }
if(interactive()){ residuals_test(residuals = runif(100)) }
A convenient wrapper for ranger that completes its output by providing the Moran's I of the residuals for different distance thresholds, the rmse and nrmse (as computed by root_mean_squared_error()
), and variable importance scores based on a scaled version of the data generated by scale.
rf( data = NULL, dependent.variable.name = NULL, predictor.variable.names = NULL, distance.matrix = NULL, distance.thresholds = NULL, xy = NULL, ranger.arguments = NULL, scaled.importance = FALSE, seed = 1, verbose = TRUE, n.cores = parallel::detectCores() - 1, cluster = NULL )
rf( data = NULL, dependent.variable.name = NULL, predictor.variable.names = NULL, distance.matrix = NULL, distance.thresholds = NULL, xy = NULL, ranger.arguments = NULL, scaled.importance = FALSE, seed = 1, verbose = TRUE, n.cores = parallel::detectCores() - 1, cluster = NULL )
data |
Data frame with a response variable and a set of predictors. Default: |
dependent.variable.name |
Character string with the name of the response variable. Must be in the column names of |
predictor.variable.names |
Character vector with the names of the predictive variables. Every element of this vector must be in the column names of |
distance.matrix |
Squared matrix with the distances among the records in |
distance.thresholds |
Numeric vector with neighborhood distances. All distances in the distance matrix below each value in |
xy |
(optional) Data frame or matrix with two columns containing coordinates and named "x" and "y". It is not used by this function, but it is stored in the slot |
ranger.arguments |
Named list with ranger arguments (other arguments of this function can also go here). All ranger arguments are set to their default values except for 'importance', that is set to 'permutation' rather than 'none'. The ranger arguments |
scaled.importance |
Logical, if |
seed |
Integer, random seed to facilitate reproducibility. If set to a given number, the returned model is always the same. Default: |
verbose |
Boolean. If TRUE, messages and plots generated during the execution of the function are displayed. Default: |
n.cores |
Integer, number of cores to use. Default: |
cluster |
A cluster definition generated with |
Please read the help file of ranger for further details. Notice that the formula
interface of ranger is supported through ranger.arguments
, but variable interactions are not allowed (but check the_feature_engineer()
).
A ranger model with several extra slots:
ranger.arguments
: Stores the values of the arguments used to fit the ranger model.
importance
: A list containing a data frame with the predictors ordered by their importance, a ggplot showing the importance values, and local importance scores (difference in accuracy between permuted and non permuted variables for every case, computed on the out-of-bag data).
performance
: performance scores: R squared on out-of-bag data, R squared (cor(observed, predicted) ^ 2), pseudo R squared (cor(observed, predicted)), RMSE, and normalized RMSE (NRMSE).
residuals
: residuals, normality test of the residuals computed with residuals_test()
, and spatial autocorrelation of the residuals computed with moran_multithreshold()
.
if(interactive()){ #loading example data data("plant_richness_df") data("distance_matrix") #fittind random forest model out <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], distance.matrix = distance_matrix, distance.thresholds = 0, n.cores = 1 ) class(out) #data frame with ordered variable importance out$importance$per.variable #variable importance plot out$importance$per.variable.plot #performance out$performance #spatial correlation of the residuals out$spatial.correlation.residuals$per.distance #plot of the Moran's I of the residuals for different distance thresholds out$spatial.correlation.residuals$plot #predictions for new data as done with ranger models: predicted <- stats::predict( object = out, data = plant_richness_df, type = "response" )$predictions #alternative data input methods ############################### #ranger.arguments can contain ranger arguments and any other rf argument my.ranger.arguments <- list( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[8:21], distance.matrix = distance_matrix, distance.thresholds = c(0, 1000) ) #fitting model with these ranger arguments out <- rf( ranger.arguments = my.ranger.arguments, n.cores = 1 ) }
if(interactive()){ #loading example data data("plant_richness_df") data("distance_matrix") #fittind random forest model out <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], distance.matrix = distance_matrix, distance.thresholds = 0, n.cores = 1 ) class(out) #data frame with ordered variable importance out$importance$per.variable #variable importance plot out$importance$per.variable.plot #performance out$performance #spatial correlation of the residuals out$spatial.correlation.residuals$per.distance #plot of the Moran's I of the residuals for different distance thresholds out$spatial.correlation.residuals$plot #predictions for new data as done with ranger models: predicted <- stats::predict( object = out, data = plant_richness_df, type = "response" )$predictions #alternative data input methods ############################### #ranger.arguments can contain ranger arguments and any other rf argument my.ranger.arguments <- list( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[8:21], distance.matrix = distance_matrix, distance.thresholds = c(0, 1000) ) #fitting model with these ranger arguments out <- rf( ranger.arguments = my.ranger.arguments, n.cores = 1 ) }
Uses rf_evaluate()
to compare the performance of several models on independent spatial folds via spatial cross-validation.
rf_compare( models = NULL, xy = NULL, repetitions = 30, training.fraction = 0.75, metrics = c("r.squared", "pseudo.r.squared", "rmse", "nrmse", "auc"), distance.step = NULL, distance.step.x = NULL, distance.step.y = NULL, fill.color = viridis::viridis(100, option = "F", direction = -1, alpha = 0.8), line.color = "gray30", seed = 1, verbose = TRUE, n.cores = parallel::detectCores() - 1, cluster = NULL )
rf_compare( models = NULL, xy = NULL, repetitions = 30, training.fraction = 0.75, metrics = c("r.squared", "pseudo.r.squared", "rmse", "nrmse", "auc"), distance.step = NULL, distance.step.x = NULL, distance.step.y = NULL, fill.color = viridis::viridis(100, option = "F", direction = -1, alpha = 0.8), line.color = "gray30", seed = 1, verbose = TRUE, n.cores = parallel::detectCores() - 1, cluster = NULL )
models |
Named list with models resulting from |
xy |
Data frame or matrix with two columns containing coordinates and named "x" and "y". Default: |
repetitions |
Integer, number of spatial folds to use during cross-validation. Must be lower than the total number of rows available in the model's data. Default: |
training.fraction |
Proportion between 0.5 and 0.9 indicating the proportion of records to be used as training set during spatial cross-validation. Default: |
metrics |
Character vector, names of the performance metrics selected. The possible values are: "r.squared" ( |
distance.step |
Numeric, argument |
distance.step.x |
Numeric, argument |
distance.step.y |
Numeric, argument |
fill.color |
Character vector with hexadecimal codes (e.g. "#440154FF" "#21908CFF" "#FDE725FF"), or function generating a palette (e.g. |
line.color |
Character string, color of the line produced by |
seed |
Integer, random seed to facilitate reproduciblity. If set to a given number, the results of the function are always the same. Default: |
verbose |
Logical. If |
n.cores |
Integer, number of cores to use for parallel execution. Creates a socket cluster with |
cluster |
A cluster definition generated with |
A list with three slots:
comparison.df
: Data frame with one performance value per spatial fold, metric, and model.
spatial.folds
: List with the indices of the training and testing records for each evaluation repetition.
plot
: Violin-plot of comparison.df
.
if(interactive()){ #loading example data data(distance_matrix) data(plant_richness_df) #fitting random forest model rf.model <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], distance.matrix = distance_matrix, distance.thresholds = 0, n.cores = 1 ) #fitting a spatial model with Moran's Eigenvector Maps rf.spatial <- rf_spatial( model = rf.model, n.cores = 1 ) #comparing the spatial and non spatial models comparison <- rf_compare( models = list( `Non spatial` = rf.model, Spatial = rf.spatial ), xy = plant_richness_df[, c("x", "y")], metrics = c("r.squared", "rmse"), n.cores = 1 ) }
if(interactive()){ #loading example data data(distance_matrix) data(plant_richness_df) #fitting random forest model rf.model <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], distance.matrix = distance_matrix, distance.thresholds = 0, n.cores = 1 ) #fitting a spatial model with Moran's Eigenvector Maps rf.spatial <- rf_spatial( model = rf.model, n.cores = 1 ) #comparing the spatial and non spatial models comparison <- rf_compare( models = list( `Non spatial` = rf.model, Spatial = rf.spatial ), xy = plant_richness_df[, c("x", "y")], metrics = c("r.squared", "rmse"), n.cores = 1 ) }
Evaluates the performance of random forest on unseen data over independent spatial folds.
rf_evaluate( model = NULL, xy = NULL, repetitions = 30, training.fraction = 0.75, metrics = c("r.squared", "pseudo.r.squared", "rmse", "nrmse", "auc"), distance.step = NULL, distance.step.x = NULL, distance.step.y = NULL, grow.testing.folds = FALSE, seed = 1, verbose = TRUE, n.cores = parallel::detectCores() - 1, cluster = NULL )
rf_evaluate( model = NULL, xy = NULL, repetitions = 30, training.fraction = 0.75, metrics = c("r.squared", "pseudo.r.squared", "rmse", "nrmse", "auc"), distance.step = NULL, distance.step.x = NULL, distance.step.y = NULL, grow.testing.folds = FALSE, seed = 1, verbose = TRUE, n.cores = parallel::detectCores() - 1, cluster = NULL )
model |
Model fitted with |
xy |
Data frame or matrix with two columns containing coordinates and named "x" and "y". If |
repetitions |
Integer, number of spatial folds to use during cross-validation. Must be lower than the total number of rows available in the model's data. Default: |
training.fraction |
Proportion between 0.5 and 0.9 indicating the proportion of records to be used as training set during spatial cross-validation. Default: |
metrics |
Character vector, names of the performance metrics selected. The possible values are: "r.squared" ( |
distance.step |
Numeric, argument |
distance.step.x |
Numeric, argument |
distance.step.y |
Numeric, argument |
grow.testing.folds |
Logic. By default, this function grows contiguous training folds to keep the spatial structure of the data as intact as possible. However, when setting |
seed |
Integer, random seed to facilitate reproduciblity. If set to a given number, the results of the function are always the same. Default: |
verbose |
Logical. If |
n.cores |
Integer, number of cores to use for parallel execution. Creates a socket cluster with |
cluster |
A cluster definition generated with |
The evaluation algorithm works as follows: the number of repetitions
and the input dataset (stored in model$ranger.arguments$data
) are used as inputs for the function thinning_til_n()
, that applies thinning()
to the input data until as many cases as repetitions
are left, and as separated as possible. Each of these remaining records will be used as a "fold center". From that point, the fold grows, until a number of points equal (or close) to training.fraction
is reached. The indices of the records within the grown spatial fold are stored as "training" in the output list, and the remaining ones as "testing". Then, for each spatial fold, a "training model" is fitted using the cases corresponding with the training indices, and predicted over the cases corresponding with the testing indices. The model predictions on the "unseen" data are compared with the observations, and the performance measures (R squared, pseudo R squared, RMSE and NRMSE) computed.
A model of the class "rf_evaluate" with a new slot named "evaluation", that is a list with the following slots:
training.fraction
: Value of the argument training.fraction
.
spatial.folds
: Result of applying make_spatial_folds()
on the data coordinates. It is a list with as many slots as repetitions
are indicated by the user. Each slot has two slots named "training" and "testing", each one having the indices of the cases used on the training and testing models.
per.fold
: Data frame with the evaluation results per spatial fold (or repetition). It contains the ID of each fold, it's central coordinates, the number of training and testing cases, and the training and testing performance measures: R squared, pseudo R squared (cor(observed, predicted)), rmse, and normalized rmse.
per.model
: Same data as above, but organized per fold and model ("Training", "Testing", and "Full").
aggregated
: Same data, but aggregated by model and performance measure.
if(interactive()){ #loading example data data(plant_richness_df) data(distance_matrix) #fitting random forest model rf.model <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], distance.matrix = distance_matrix, distance.thresholds = 0, n.cores = 1, verbose = FALSE ) #evaluation with spatial cross-validation rf.model <- rf_evaluate( model = rf.model, xy = plant_richness_df[, c("x", "y")], n.cores = 1 ) #checking evaluation results plot_evaluation(rf.model) print_evaluation(rf.model) x <- get_evaluation(rf.model) }
if(interactive()){ #loading example data data(plant_richness_df) data(distance_matrix) #fitting random forest model rf.model <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], distance.matrix = distance_matrix, distance.thresholds = 0, n.cores = 1, verbose = FALSE ) #evaluation with spatial cross-validation rf.model <- rf_evaluate( model = rf.model, xy = plant_richness_df[, c("x", "y")], n.cores = 1 ) #checking evaluation results plot_evaluation(rf.model) print_evaluation(rf.model) x <- get_evaluation(rf.model) }
Evaluates the contribution of the predictors to model transferability via spatial cross-validation. The function returns the median increase or decrease in a given evaluation metric (R2, pseudo R2, RMSE, nRMSE, or AUC) when a variable is introduced in a model, by comparing and evaluating via spatial cross-validation models with and without the given variable. This function was devised to provide importance scores that would be less sensitive to spatial autocorrelation than those computed internally by random forest on the out-of-bag data. This function is experimental.
rf_importance( model = NULL, xy = NULL, repetitions = 30, training.fraction = 0.75, metric = c("r.squared", "pseudo.r.squared", "rmse", "nrmse", "auc"), distance.step = NULL, distance.step.x = NULL, distance.step.y = NULL, fill.color = viridis::viridis(100, option = "F", direction = -1, alpha = 1, end = 0.9), line.color = "white", seed = 1, verbose = TRUE, n.cores = parallel::detectCores() - 1, cluster = NULL )
rf_importance( model = NULL, xy = NULL, repetitions = 30, training.fraction = 0.75, metric = c("r.squared", "pseudo.r.squared", "rmse", "nrmse", "auc"), distance.step = NULL, distance.step.x = NULL, distance.step.y = NULL, fill.color = viridis::viridis(100, option = "F", direction = -1, alpha = 1, end = 0.9), line.color = "white", seed = 1, verbose = TRUE, n.cores = parallel::detectCores() - 1, cluster = NULL )
model |
Model fitted with |
xy |
Data frame or matrix with two columns containing coordinates and named "x" and "y". If |
repetitions |
Integer, number of spatial folds to use during cross-validation. Must be lower than the total number of rows available in the model's data. Default: |
training.fraction |
Proportion between 0.5 and 0.9 indicating the proportion of records to be used as training set during spatial cross-validation. Default: |
metric |
Character, nams of the performance metric to use. The possible values are: "r.squared" ( |
distance.step |
Numeric, argument |
distance.step.x |
Numeric, argument |
distance.step.y |
Numeric, argument |
fill.color |
Character vector with hexadecimal codes (e.g. "#440154FF" "#21908CFF" "#FDE725FF"), or function generating a palette (e.g. |
line.color |
Character string, color of the line produced by |
seed |
Integer, random seed to facilitate reproduciblity. If set to a given number, the results of the function are always the same. Default: |
verbose |
Logical. If |
n.cores |
Integer, number of cores to use for parallel execution. Creates a socket cluster with |
cluster |
A cluster definition generated with |
The input model with new data in its "importance" slot. The new importance scores are included in the data frame model$importance$per.variable
, under the column names "importance.cv" (median contribution to transferability over spatial cross-validation repetitions), "importance.cv.mad" (median absolute deviation of the performance scores over spatial cross-validation repetitions), "importance.cv.percent" ("importance.cv" expressed as a percent, taking the full model's performance as baseline), and "importance.cv.mad" (median absolute deviation of "importance.cv"). The plot is stored as "cv.per.variable.plot".
if(interactive()){ #loading example data data(plant_richness_df) data(distance_matrix) xy <- plant_richness_df[, c("x", "y")] #fitting random forest model rf.model <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], distance.matrix = distance_matrix, distance.thresholds = 0, xy = xy, n.cores = 1, verbose = FALSE ) #computing predictor contribution to model transferability rf.model <- rf_importance(rf.model) }
if(interactive()){ #loading example data data(plant_richness_df) data(distance_matrix) xy <- plant_richness_df[, c("x", "y")] #fitting random forest model rf.model <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], distance.matrix = distance_matrix, distance.thresholds = 0, xy = xy, n.cores = 1, verbose = FALSE ) #computing predictor contribution to model transferability rf.model <- rf_importance(rf.model) }
Fits several random forest models on the same data in order to capture the effect of the algorithm's stochasticity on the variable importance scores, predictions, residuals, and performance measures. The function relies on the median to aggregate performance and importance values across repetitions. It is recommended to use it after a model is fitted (rf()
or rf_spatial()
), tuned (rf_tuning()
), and/or evaluated (rf_evaluate()
). This function is designed to be used after fitting a model with rf()
or rf_spatial()
, tuning it with rf_tuning()
and evaluating it with rf_evaluate()
.
rf_repeat( model = NULL, data = NULL, dependent.variable.name = NULL, predictor.variable.names = NULL, distance.matrix = NULL, distance.thresholds = NULL, xy = NULL, ranger.arguments = NULL, scaled.importance = FALSE, repetitions = 10, keep.models = TRUE, seed = 1, verbose = TRUE, n.cores = parallel::detectCores() - 1, cluster = NULL )
rf_repeat( model = NULL, data = NULL, dependent.variable.name = NULL, predictor.variable.names = NULL, distance.matrix = NULL, distance.thresholds = NULL, xy = NULL, ranger.arguments = NULL, scaled.importance = FALSE, repetitions = 10, keep.models = TRUE, seed = 1, verbose = TRUE, n.cores = parallel::detectCores() - 1, cluster = NULL )
model |
A model fitted with |
data |
Data frame with a response variable and a set of predictors. Default: |
dependent.variable.name |
Character string with the name of the response variable. Must be in the column names of |
predictor.variable.names |
Character vector with the names of the predictive variables. Every element of this vector must be in the column names of |
distance.matrix |
Squared matrix with the distances among the records in |
distance.thresholds |
Numeric vector with neighborhood distances. All distances in the distance matrix below each value in |
xy |
(optional) Data frame or matrix with two columns containing coordinates and named "x" and "y". It is not used by this function, but it is stored in the slot |
ranger.arguments |
Named list with ranger arguments (other arguments of this function can also go here). All ranger arguments are set to their default values except for 'importance', that is set to 'permutation' rather than 'none'. Please, consult the help file of ranger if you are not familiar with the arguments of this function. |
scaled.importance |
Logical. If |
repetitions |
Integer, number of random forest models to fit. Default: |
keep.models |
Logical, if |
seed |
Integer, random seed to facilitate reproduciblity. If set to a given number, the results of the function are always the same. Default: |
verbose |
Logical, ff |
n.cores |
Integer, number of cores to use for parallel execution. Creates a socket cluster with |
cluster |
A cluster definition generated with |
A ranger model with several new slots:
ranger.arguments
: Stores the values of the arguments used to fit the ranger model.
importance
: A list containing a data frame with the predictors ordered by their importance, a ggplot showing the importance values, and local importance scores.
performance
: out-of-bag performance scores: R squared, pseudo R squared, RMSE, and normalized RMSE (NRMSE).
pseudo.r.squared
: computed as the correlation between the observations and the predictions.
residuals
: residuals, normality test of the residuals computed with residuals_test()
, and spatial autocorrelation of the residuals computed with moran_multithreshold()
.
if(interactive()){ #loading example data data(plant_richness_df) data(distance_matrix) #fitting 5 random forest models out <- rf_repeat( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], distance.matrix = distance_matrix, distance.thresholds = 0, repetitions = 5, n.cores = 1 ) #data frame with ordered variable importance out$importance$per.variable #per repetition out$importance$per.repetition #variable importance plot out$importance$per.repetition.plot #performance out$performance #spatial correlation of the residuals for different distance thresholds out$spatial.correlation.residuals$per.distance #plot of the Moran's I of the residuals for different distance thresholds out$spatial.correlation.residuals$plot #using a model as an input for rf_repeat() rf.model <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[8:21], distance.matrix = distance_matrix, distance.thresholds = 0, n.cores = 1 ) #repeating the model 5 times rf.repeat <- rf_repeat( model = rf.model, n.cores = 1 ) rf.repeat$performance rf.repeat$importance$per.repetition.plot }
if(interactive()){ #loading example data data(plant_richness_df) data(distance_matrix) #fitting 5 random forest models out <- rf_repeat( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], distance.matrix = distance_matrix, distance.thresholds = 0, repetitions = 5, n.cores = 1 ) #data frame with ordered variable importance out$importance$per.variable #per repetition out$importance$per.repetition #variable importance plot out$importance$per.repetition.plot #performance out$performance #spatial correlation of the residuals for different distance thresholds out$spatial.correlation.residuals$per.distance #plot of the Moran's I of the residuals for different distance thresholds out$spatial.correlation.residuals$plot #using a model as an input for rf_repeat() rf.model <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[8:21], distance.matrix = distance_matrix, distance.thresholds = 0, n.cores = 1 ) #repeating the model 5 times rf.repeat <- rf_repeat( model = rf.model, n.cores = 1 ) rf.repeat$performance rf.repeat$importance$per.repetition.plot }
Fits spatial random forest models using different methods to generate, rank, and select spatial predictors acting as proxies of spatial processes not considered by the non-spatial predictors. The end goal is providing the model with information about the spatial structure of the data to minimize the spatial correlation (Moran's I) of the model residuals and generate honest variable importance scores.
rf_spatial( model = NULL, data = NULL, dependent.variable.name = NULL, predictor.variable.names = NULL, distance.matrix = NULL, distance.thresholds = NULL, xy = NULL, ranger.arguments = NULL, scaled.importance = TRUE, method = c("mem.moran.sequential", "mem.effect.sequential", "mem.effect.recursive", "hengl", "hengl.moran.sequential", "hengl.effect.sequential", "hengl.effect.recursive", "pca.moran.sequential", "pca.effect.sequential", "pca.effect.recursive"), max.spatial.predictors = NULL, weight.r.squared = NULL, weight.penalization.n.predictors = NULL, seed = 1, verbose = TRUE, n.cores = parallel::detectCores() - 1, cluster = NULL )
rf_spatial( model = NULL, data = NULL, dependent.variable.name = NULL, predictor.variable.names = NULL, distance.matrix = NULL, distance.thresholds = NULL, xy = NULL, ranger.arguments = NULL, scaled.importance = TRUE, method = c("mem.moran.sequential", "mem.effect.sequential", "mem.effect.recursive", "hengl", "hengl.moran.sequential", "hengl.effect.sequential", "hengl.effect.recursive", "pca.moran.sequential", "pca.effect.sequential", "pca.effect.recursive"), max.spatial.predictors = NULL, weight.r.squared = NULL, weight.penalization.n.predictors = NULL, seed = 1, verbose = TRUE, n.cores = parallel::detectCores() - 1, cluster = NULL )
model |
A model fitted with |
data |
Data frame with a response variable and a set of predictors. Default: |
dependent.variable.name |
Character string with the name of the response variable. Must be in the column names of |
predictor.variable.names |
Character vector with the names of the predictive variables. Every element of this vector must be in the column names of |
distance.matrix |
Squared matrix with the distances among the records in |
distance.thresholds |
Numeric vector with distances in the same units as |
xy |
(optional) Data frame or matrix with two columns containing coordinates and named "x" and "y". It is not used by this function, but it is stored in the slot |
ranger.arguments |
Named list with ranger arguments (other arguments of this function can also go here). All ranger arguments are set to their default values except for 'importance', that is set to 'permutation' rather than 'none'. Please, consult the help file of ranger if you are not familiar with the arguments of this function. |
scaled.importance |
Logical. If |
method |
Character, method to build, rank, and select spatial predictors. One of:
|
max.spatial.predictors |
Integer, maximum number of spatial predictors to generate. Useful when memory problems arise due to a large number of spatial predictors, Default: |
weight.r.squared |
Numeric between 0 and 1, weight of R-squared in the selection of spatial components. See Details, Default: |
weight.penalization.n.predictors |
Numeric between 0 and 1, weight of the penalization for adding an increasing number of spatial predictors during selection. Default: |
seed |
Integer, random seed to facilitate reproducibility. Default: |
verbose |
Logical. If TRUE, messages and plots generated during the execution of the function are displayed, Default: |
n.cores |
Integer, number of cores to use for parallel execution. Creates a socket cluster with |
cluster |
A cluster definition generated with |
The function uses three different methods to generate spatial predictors ("hengl", "pca", and "mem"), two methods to rank them in order to define in what order they are introduced in the model ("effect" and "moran), and two methods to select the spatial predictors that minimize the spatial correlation of the model residuals ("sequential" and "recursive"). All method names but "hengl" (that uses the complete distance matrix as predictors in the spatial model) are named by combining a method to generate the spatial predictors, a method to rank them, and a method to select them, separated by a point. Examples are "mem.moran.sequential" or "mem.effect.recursive". All combinations are not possible, since the ranking method "moran" cannot be used with the selection method "recursive" (because the logics behind them are very different, see below). Methods to generate spatial predictors:
"hengl"
: named after the method RFsp presented in the paper "Random forest as a generic framework for predictive modeling of spatial and spatio-temporal variables", by Hengl et al. (2018), where the authors propose to use the distance matrix among records as predictors in spatial random forest models (RFsp method). In this function, all methods starting with "hengl" use either the complete distance matrix, or select columns of the distance matrix as spatial predictors.
"mem"
: Generates Moran's Eigenvector Maps, that is, the eigenvectors of the double-centered weights of the distance matrix. The method is described in "Spatial modelling: a comprehensive framework for principal coordinate analysis of neighbour matrices (PCNM)", by Dray et al. (2006), and "Statistical methods for temporal and space–time analysis of community composition data", by Legendre and Gauthier (2014).
"pca"
: Computes spatial predictors from the principal component analysis of a weighted distance matrix (see weights_from_distance_matrix()
). This is an experimental method, use with caution.
Methods to rank spatial predictors (see rank_spatial_predictors()
):
"moran"
: Computes the Moran's I of each spatial predictor, selects the ones with positive values, and ranks them from higher to lower Moran's I.
"effect"
: If a given non-spatial random forest model is defined as y = p1 + ... + pn
, being p1 + ... + pn
the set of predictors, for every spatial predictor generated (spX
) a spatial model y = p1 + ... + pn + spX
is fitted, and the Moran's I of its residuals is computed. The spatial predictors are then ranked by how much they help to reduce spatial autocorrelation between the non-spatial and the spatial model.
Methods to select spatial predictors:
"sequential"
(see select_spatial_predictors_sequential()
): The spatial predictors are added one by one in the order they were ranked, and once all spatial predictors are introduced, the best first n predictors are selected. This method is similar to the one employed in the MEM methodology (Moran's Eigenvector Maps) described in the paper "Spatial modelling: a comprehensive framework for principal coordinate analysis of neighbour matrices (PCNM)", by Dray et al. (2006), and "Statistical methods for temporal and space–time analysis of community composition data", by Legendre and Gauthier (2014). This method generally introduces tens of predictors into the model, but usually offers good results.
"recursive"
(see select_spatial_predictors_recursive()
): This method tries to find the smallest combination of spatial predictors that reduce the spatial correlation of the model's residuals the most. The algorithm goes as follows: 1. The first ranked spatial predictor is introduced into the model; 2. the remaining predictors are ranked again using the "effect" method, using the model in 1. as reference. The first spatial predictor in the resulting ranking is then introduced into the model, and the steps 1. and 2. are repeated until spatial predictors stop having an effect in reducing the Moran's I of the model residuals. This method takes longer to compute, but generates smaller sets of spatial predictors. This is an experimental method, use with caution.
Once ranking procedure is completed, an algorithm is used to select the minimal subset of spatial predictors that reduce the most the Moran's I of the residuals: for each new spatial predictor introduced in the model, the Moran's I of the residuals, it's p-value, a binary version of the p-value (0 if < 0.05 and 1 if >= 0.05), the R-squared of the model, and a penalization linear with the number of spatial predictors introduced (computed as (1 / total spatial predictors) * introduced spatial predictors
) are rescaled between 0 and 1. Then, the optimization criteria is computed as max(1 - Moran's I, p-value binary) + (weight.r.squared * R-squared) - (weight.penalization.n.predictors * penalization)
. The predictors from the first one to the one with the highest optimization criteria are then selected as the best ones in reducing the spatial correlation of the model residuals, and used along with data
to fit the final spatial model.
A ranger model with several new slots:
ranger.arguments
: Values of the arguments used to fit the ranger model.
importance
: A list containing the vector of variable importance as originally returned by ranger (scaled or not depending on the value of 'scaled.importance'), a data frame with the predictors ordered by their importance, and a ggplot showing the importance values.
performance
: With the out-of-bag R squared, pseudo R squared, RMSE and NRMSE of the model.
residuals
: residuals, normality test of the residuals computed with residuals_test()
, and spatial autocorrelation of the residuals computed with moran_multithreshold()
.
spatial
: A list with four slots:
method
: Character, method used to generate, rank, and select spatial predictors.
names
: Character vector with the names of the selected spatial predictors. Not returned if the method is "hengl".
optimization
: Criteria used to select the spatial predictors. Not returned if the method is "hengl".
plot
: Plot of the criteria used to select the spatial predictors. Not returned if the method is "hengl".
if(interactive()){ #loading example data data(distance_matrix) data(plant_richness_df) #names of the response and predictors dependent.variable.name <- "richness_species_vascular" predictor.variable.names <- colnames(plant_richness_df)[5:21] #hengl model <- rf_spatial( data = plant_richness_df, dependent.variable.name = dependent.variable.name, predictor.variable.names = predictor.variable.names, distance.matrix = distance_matrix, distance.thresholds = 0, method = "hengl", n.cores = 1 ) #mem.moran.sequential model <- rf_spatial( data = plant_richness_df, dependent.variable.name = dependent.variable.name, predictor.variable.names = predictor.variable.names, distance.matrix = distance_matrix, distance.thresholds = 0, method = "mem.moran.sequential", n.cores = 1 ) #fitting an rf_spatial model from an rf model rf.model <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], distance.matrix = distance_matrix, distance.thresholds = 0, n.cores = 1, verbose = FALSE ) rf.model$spatial.correlation.residuals$plot #spatial version of the rf model rf.spatial <- rf_spatial(model = rf.model) rf.spatial$spatial.correlation.residuals$plot }
if(interactive()){ #loading example data data(distance_matrix) data(plant_richness_df) #names of the response and predictors dependent.variable.name <- "richness_species_vascular" predictor.variable.names <- colnames(plant_richness_df)[5:21] #hengl model <- rf_spatial( data = plant_richness_df, dependent.variable.name = dependent.variable.name, predictor.variable.names = predictor.variable.names, distance.matrix = distance_matrix, distance.thresholds = 0, method = "hengl", n.cores = 1 ) #mem.moran.sequential model <- rf_spatial( data = plant_richness_df, dependent.variable.name = dependent.variable.name, predictor.variable.names = predictor.variable.names, distance.matrix = distance_matrix, distance.thresholds = 0, method = "mem.moran.sequential", n.cores = 1 ) #fitting an rf_spatial model from an rf model rf.model <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], distance.matrix = distance_matrix, distance.thresholds = 0, n.cores = 1, verbose = FALSE ) rf.model$spatial.correlation.residuals$plot #spatial version of the rf model rf.spatial <- rf_spatial(model = rf.model) rf.spatial$spatial.correlation.residuals$plot }
Finds the optimal set of random forest hyperparameters num.trees
, mtry
, and min.node.size
via grid search by maximizing the model's R squared, or AUC, if the response variable is binomial, via spatial cross-validation performed with rf_evaluate()
.
rf_tuning( model = NULL, num.trees = NULL, mtry = NULL, min.node.size = NULL, xy = NULL, repetitions = 30, training.fraction = 0.75, seed = 1, verbose = TRUE, n.cores = parallel::detectCores() - 1, cluster = NULL )
rf_tuning( model = NULL, num.trees = NULL, mtry = NULL, min.node.size = NULL, xy = NULL, repetitions = 30, training.fraction = 0.75, seed = 1, verbose = TRUE, n.cores = parallel::detectCores() - 1, cluster = NULL )
model |
A model fitted with |
num.trees |
Numeric integer vector with the number of trees to fit on each model repetition. Default: |
mtry |
Numeric integer vector, number of predictors to randomly select from the complete pool of predictors on each tree split. Default: |
min.node.size |
Numeric integer, minimal number of cases in a terminal node. Default: |
xy |
Data frame or matrix with two columns containing coordinates and named "x" and "y". If |
repetitions |
Integer, number of independent spatial folds to use during the cross-validation. Default: |
training.fraction |
Proportion between 0.2 and 0.9 indicating the number of records to be used in model training. Default: |
seed |
Integer, random seed to facilitate reproduciblity. If set to a given number, the results of the function are always the same. Default: |
verbose |
Logical. If TRUE, messages and plots generated during the execution of the function are displayed, Default: |
n.cores |
Integer, number of cores to use for parallel execution. Creates a socket cluster with |
cluster |
A cluster definition generated with |
A model with a new slot named tuning
, with a data frame with the results of the tuning analysis.
if(interactive()){ #loading example data data(plant_richness_df) data(distance_matrix) #fitting model to tune out <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], distance.matrix = distance_matrix, distance.thresholds = 0, n.cores = 1 ) #model tuning tuning <- rf_tuning( model = out, num.trees = c(100, 500), mtry = c(2, 8), min.node.size = c(5, 10), xy = plant_richness_df[, c("x", "y")], n.cores = 1 ) }
if(interactive()){ #loading example data data(plant_richness_df) data(distance_matrix) #fitting model to tune out <- rf( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], distance.matrix = distance_matrix, distance.thresholds = 0, n.cores = 1 ) #model tuning tuning <- rf_tuning( model = out, num.trees = c(100, 500), mtry = c(2, 8), min.node.size = c(5, 10), xy = plant_richness_df[, c("x", "y")], n.cores = 1 ) }
Computes the rmse or normalized rmse (nrmse) between two numeric vectors of the same length representing observations and model predictions.
root_mean_squared_error( o, p, normalization = c("rmse", "all", "mean", "sd", "maxmin", "iq") )
root_mean_squared_error( o, p, normalization = c("rmse", "all", "mean", "sd", "maxmin", "iq") )
o |
Numeric vector with observations, must have the same length as |
p |
Numeric vector with predictions, must have the same length as |
normalization |
character, normalization method, Default: "rmse" (see Details). |
The normalization methods go as follows:
"rmse"
: RMSE with no normalization.
"mean"
: RMSE dividied by the mean of the observations (rmse/mean(o)).
"sd"
: RMSE dividied by the standard deviation of the observations (rmse/sd(o)).
"maxmin"
: RMSE divided by the range of the observations (rmse/(max(o) - min(o))).
"iq"
: RMSE divided by the interquartile range of the observations (rmse/(quantile(o, 0.75) - quantile(o, 0.25)))
Named numeric vector with either one or 5 values, as selected by the user.
if(interactive()){ root_mean_squared_error( o = runif(10), p = runif(10) ) }
if(interactive()){ root_mean_squared_error( o = runif(10), p = runif(10) ) }
Selects spatial predictors following these steps:
Gets the spatial predictors ranked by rank_spatial_predictors()
and fits a model of the form y ~ predictors + best_spatial_predictor_1
. The Moran's I of the residuals of this model is used as reference value for the next step.
The remaining spatial predictors are introduced again into rank_spatial_predictors()
, and the spatial predictor with the highest ranking is introduced in a new model of the form y ~ predictors + best_spatial_predictor_1 + best_spatial_predictor_2
.
Steps 1 and 2 are repeated until the Moran's I doesn't improve for a number of repetitions equal to the 20 percent of the total number of spatial predictors introduced in the function.
This method allows to select the smallest set of spatial predictors that have the largest joint effect in reducing the spatial correlation of the model residuals, while maintaining the model's R-squared as high as possible. As a consequence of running rank_spatial_predictors()
on each iteration, this method includes in the final model less spatial predictors than the sequential method implemented in select_spatial_predictors_sequential()
would do, while minimizing spatial correlation and maximizing the R squared of the model as much as possible.
select_spatial_predictors_recursive( data = NULL, dependent.variable.name = NULL, predictor.variable.names = NULL, distance.matrix = NULL, distance.thresholds = NULL, ranger.arguments = NULL, spatial.predictors.df = NULL, spatial.predictors.ranking = NULL, weight.r.squared = 0.25, weight.penalization.n.predictors = 0, n.cores = parallel::detectCores() - 1, cluster = NULL )
select_spatial_predictors_recursive( data = NULL, dependent.variable.name = NULL, predictor.variable.names = NULL, distance.matrix = NULL, distance.thresholds = NULL, ranger.arguments = NULL, spatial.predictors.df = NULL, spatial.predictors.ranking = NULL, weight.r.squared = 0.25, weight.penalization.n.predictors = 0, n.cores = parallel::detectCores() - 1, cluster = NULL )
data |
Data frame with a response variable and a set of predictors. Default: |
dependent.variable.name |
Character string with the name of the response variable. Must be in the column names of |
predictor.variable.names |
Character vector with the names of the predictive variables. Every element of this vector must be in the column names of |
distance.matrix |
Squared matrix with the distances among the records in |
distance.thresholds |
Numeric vector with neighborhood distances. All distances in the distance matrix below each value in |
ranger.arguments |
Named list with ranger arguments (other arguments of this function can also go here). All ranger arguments are set to their default values except for 'importance', that is set to 'permutation' rather than 'none'. Please, consult the help file of ranger if you are not familiar with the arguments of this function. |
spatial.predictors.df |
Data frame of spatial predictors. |
spatial.predictors.ranking |
Ranking of predictors returned by |
weight.r.squared |
Numeric between 0 and 1, weight of R-squared in the optimization index. Default: |
weight.penalization.n.predictors |
Numeric between 0 and 1, weight of the penalization for the number of spatial predictors added in the optimization index. Default: |
n.cores |
Integer, number of cores to use. Default: |
cluster |
A cluster definition generated by |
The algorithm works as follows. If the function rank_spatial_predictors()
returns 10 ranked spatial predictors (sp1 to sp10, being sp7 the best one), select_spatial_predictors_recursive()
is going to first fit the model y ~ predictors + sp7
. Then, the spatial predictors sp2 to sp9 are again ranked with rank_spatial_predictors()
using the model y ~ predictors + sp7
as reference (at this stage, some of the spatial predictors might be dropped due to lack of effect). When the new ranking of spatial predictors is ready (let's say they are sp5, sp3, and sp4), the best one (sp5) is included in the model y ~ predictors + sp7 + sp5
, and the remaining ones go again to rank_spatial_predictors()
to repeat the process until spatial predictors are depleted.
A list with two slots: optimization
, a data frame with the index of the spatial predictor added on each iteration, the spatial correlation of the model residuals, and the R-squared of the model, and best.spatial.predictors
, that is a character vector with the names of the spatial predictors that minimize the Moran's I of the residuals and maximize the R-squared of the model.
if(interactive()){ #loading example data data(distance_matrix) data(plant_richness_df) #response and preditor names dependent.variable.name = "richness_species_vascular" predictor.variable.names = colnames(plant_richness_df)[5:21] #non-spatial model model <- rf( data = plant_richness_df, dependent.variable.name = dependent.variable.name, predictor.variable.names = predictor.variable.names, distance.matrix = distance_matrix, distance.thresholds = 0, n.cores = 1 ) #preparing spatial predictors spatial.predictors <- mem_multithreshold( distance.matrix = distance_matrix, distance.thresholds = 0 ) #ranking spatial predictors spatial.predictors.ranking <- rank_spatial_predictors( data = plant_richness_df, dependent.variable.name = dependent.variable.name, predictor.variable.names = predictor.variable.names, spatial.predictors.df = spatial.predictors, ranking.method = "moran", reference.moran.i = model$spatial.correlation.residuals$max.moran, distance.matrix = distance_matrix, distance.thresholds = 0, n.cores = 1 ) #selecting the best subset of predictors selection <- select_spatial_predictors_recursive( data = plant_richness_df, dependent.variable.name = dependent.variable.name, predictor.variable.names = predictor.variable.names, distance.matrix = distance_matrix, distance.thresholds = 0, spatial.predictors.df = spatial.predictors, spatial.predictors.ranking = spatial.predictors.ranking, n.cores = 1 ) selection$optimization selection$best.spatial.predictors plot_optimization(selection$optimization) }
if(interactive()){ #loading example data data(distance_matrix) data(plant_richness_df) #response and preditor names dependent.variable.name = "richness_species_vascular" predictor.variable.names = colnames(plant_richness_df)[5:21] #non-spatial model model <- rf( data = plant_richness_df, dependent.variable.name = dependent.variable.name, predictor.variable.names = predictor.variable.names, distance.matrix = distance_matrix, distance.thresholds = 0, n.cores = 1 ) #preparing spatial predictors spatial.predictors <- mem_multithreshold( distance.matrix = distance_matrix, distance.thresholds = 0 ) #ranking spatial predictors spatial.predictors.ranking <- rank_spatial_predictors( data = plant_richness_df, dependent.variable.name = dependent.variable.name, predictor.variable.names = predictor.variable.names, spatial.predictors.df = spatial.predictors, ranking.method = "moran", reference.moran.i = model$spatial.correlation.residuals$max.moran, distance.matrix = distance_matrix, distance.thresholds = 0, n.cores = 1 ) #selecting the best subset of predictors selection <- select_spatial_predictors_recursive( data = plant_richness_df, dependent.variable.name = dependent.variable.name, predictor.variable.names = predictor.variable.names, distance.matrix = distance_matrix, distance.thresholds = 0, spatial.predictors.df = spatial.predictors, spatial.predictors.ranking = spatial.predictors.ranking, n.cores = 1 ) selection$optimization selection$best.spatial.predictors plot_optimization(selection$optimization) }
Selects spatial predictors by adding them sequentially into a model while monitoring the Moran's I of the model residuals and the model's R-squared. Once all the available spatial predictors have been added to the model, the function identifies the first n
predictors that minimize the spatial correlation of the residuals and maximize R-squared, and returns the names of the selected spatial predictors and a data frame with the selection criteria.
select_spatial_predictors_sequential( data = NULL, dependent.variable.name = NULL, predictor.variable.names = NULL, distance.matrix = NULL, distance.thresholds = NULL, ranger.arguments = NULL, spatial.predictors.df = NULL, spatial.predictors.ranking = NULL, weight.r.squared = 0.75, weight.penalization.n.predictors = 0.25, verbose = FALSE, n.cores = parallel::detectCores() - 1, cluster = NULL )
select_spatial_predictors_sequential( data = NULL, dependent.variable.name = NULL, predictor.variable.names = NULL, distance.matrix = NULL, distance.thresholds = NULL, ranger.arguments = NULL, spatial.predictors.df = NULL, spatial.predictors.ranking = NULL, weight.r.squared = 0.75, weight.penalization.n.predictors = 0.25, verbose = FALSE, n.cores = parallel::detectCores() - 1, cluster = NULL )
data |
Data frame with a response variable and a set of predictors. Default: |
dependent.variable.name |
Character string with the name of the response variable. Must be in the column names of |
predictor.variable.names |
Character vector with the names of the predictive variables. Every element of this vector must be in the column names of |
distance.matrix |
Squared matrix with the distances among the records in |
distance.thresholds |
Numeric vector with neighborhood distances. All distances in the distance matrix below each value in |
ranger.arguments |
Named list with ranger arguments (other arguments of this function can also go here). All ranger arguments are set to their default values except for 'importance', that is set to 'permutation' rather than 'none'. Please, consult the help file of ranger if you are not familiar with the arguments of this function. |
spatial.predictors.df |
Data frame of spatial predictors. |
spatial.predictors.ranking |
Ranking of the spatial predictors returned by |
weight.r.squared |
Numeric between 0 and 1, weight of R-squared in the optimization index. Default: |
weight.penalization.n.predictors |
Numeric between 0 and 1, weight of the penalization for the number of spatial predictors added in the optimization index. Default: |
verbose |
Logical, ff |
n.cores |
Integer, number of cores to use. Default: |
cluster |
A cluster definition generated by |
The algorithm works as follows: If the function rank_spatial_predictors returns 10 spatial predictors (sp1 to sp10, ordered from best to worst), select_spatial_predictors_sequential is going to fit the models y ~ predictors + sp1
, y ~ predictors + sp1 + sp2
, until all spatial predictors are used in y ~ predictors + sp1 ... sp10
. The model with lower Moran's I of the residuals and higher R-squared (computed on the out-of-bag data) is selected, and its spatial predictors returned.
A list with two slots: optimization
, a data frame with the index of the spatial predictor added on each iteration, the spatial correlation of the model residuals, and the R-squared of the model, and best.spatial.predictors
, that is a character vector with the names of the spatial predictors that minimize the Moran's I of the residuals and maximize the R-squared of the model.
if(interactive()){ #loading example data data(distance_matrix) data(plant_richness_df) #common arguments dependent.variable.name = "richness_species_vascular" predictor.variable.names = colnames(plant_richness_df)[5:21] #non-spatial model model <- rf( data = plant_richness_df, dependent.variable.name = dependent.variable.name, predictor.variable.names = predictor.variable.names, distance.matrix = distance_matrix, distance.thresholds = 0, n.cores = 1 ) #preparing spatial predictors spatial.predictors <- mem_multithreshold( distance.matrix = distance.matrix, distance.thresholds = 0 ) #ranking spatial predictors by their Moran's I (faster option) spatial.predictors.ranking <- rank_spatial_predictors( ranking.method = "moran", spatial.predictors.df = spatial.predictors, reference.moran.i = model$spatial.correlation.residuals$max.moran, distance.matrix = distance.matrix, distance.thresholds = 0, n.cores = 1 ) #selecting the best subset of predictors selection <- select_spatial_predictors_sequential( data = plant_richness_df, dependent.variable.name = dependent.variable.name, predictor.variable.names = predictor.variable.names, distance.matrix = distance_matrix, distance.thresholds = 0, spatial.predictors.df = spatial.predictors, spatial.predictors.ranking = spatial.predictors.ranking, n.cores = 1 ) selection$optimization selection$best.spatial.predictors plot_optimization(selection$optimization) }
if(interactive()){ #loading example data data(distance_matrix) data(plant_richness_df) #common arguments dependent.variable.name = "richness_species_vascular" predictor.variable.names = colnames(plant_richness_df)[5:21] #non-spatial model model <- rf( data = plant_richness_df, dependent.variable.name = dependent.variable.name, predictor.variable.names = predictor.variable.names, distance.matrix = distance_matrix, distance.thresholds = 0, n.cores = 1 ) #preparing spatial predictors spatial.predictors <- mem_multithreshold( distance.matrix = distance.matrix, distance.thresholds = 0 ) #ranking spatial predictors by their Moran's I (faster option) spatial.predictors.ranking <- rank_spatial_predictors( ranking.method = "moran", spatial.predictors.df = spatial.predictors, reference.moran.i = model$spatial.correlation.residuals$max.moran, distance.matrix = distance.matrix, distance.thresholds = 0, n.cores = 1 ) #selecting the best subset of predictors selection <- select_spatial_predictors_sequential( data = plant_richness_df, dependent.variable.name = dependent.variable.name, predictor.variable.names = predictor.variable.names, distance.matrix = distance_matrix, distance.thresholds = 0, spatial.predictors.df = spatial.predictors, spatial.predictors.ranking = spatial.predictors.ranking, n.cores = 1 ) selection$optimization selection$best.spatial.predictors plot_optimization(selection$optimization) }
Computes the standard error of the mean of a numeric vector as round(sqrt(var(x)/length(x)), 3)
standard_error(x)
standard_error(x)
x |
A numeric vector. |
The function removes NA
values before computing the standard error, and rounds the result to 3 decimal places.
A numeric value.
if(interactive()){ standard_error(runif(10)) }
if(interactive()){ standard_error(runif(10)) }
Computes the mode of a numeric or character vector
statistical_mode(x)
statistical_mode(x)
x |
Numeric or character vector. |
Statistical mode of x
.
if(interactive()){ statistical_mode(c(10, 9, 10, 8)) }
if(interactive()){ statistical_mode(c(10, 9, 10, 8)) }
Suggests candidate variable interactions and composite features able to improve predictive accuracy over data not used to train the model via spatial cross-validation with rf_evaluate()
. For a pair of predictors a
and b
, interactions are build via multiplication (a * b
), while composite features are built by extracting the first factor of a principal component analysis performed with pca()
, after rescaling a
and b
between 1 and 100. Interactions and composite features are named a..x..b
and a..pca..b
respectively.
Candidate variables a
and b
are selected from those predictors in predictor.variable.names
with a variable importance above importance.threshold
(set by default to the median of the importance scores).
For each interaction and composite feature, a model including all the predictors plus the interaction or composite feature is fitted, and it's R squared (or AUC if the response is binary) computed via spatial cross-validation (see rf_evaluate()
) is compared with the R squared of the model without interactions or composite features.
From all the potential interactions screened, only those with a positive increase in R squared (or AUC when the response is binomial) of the model, a variable importance above the median, and a maximum correlation among themselves and with the predictors in predictor.variable.names
not higher than cor.threshold
(set to 0.5 by default) are selected. Such a restrictive set of rules ensures that the selected interactions can be used right away for modeling purposes without increasing model complexity unnecessarily. However, the suggested variable interactions might not make sense from a domain expertise standpoint, so please, examine them with care.
The function returns the criteria used to select the interactions, and the data required to use these interactions a model.
the_feature_engineer( data = NULL, dependent.variable.name = NULL, predictor.variable.names = NULL, xy = NULL, ranger.arguments = NULL, repetitions = 30, training.fraction = 0.75, importance.threshold = 0.75, cor.threshold = 0.75, point.color = viridis::viridis(100, option = "F", alpha = 0.8), seed = NULL, verbose = TRUE, n.cores = parallel::detectCores() - 1, cluster = NULL )
the_feature_engineer( data = NULL, dependent.variable.name = NULL, predictor.variable.names = NULL, xy = NULL, ranger.arguments = NULL, repetitions = 30, training.fraction = 0.75, importance.threshold = 0.75, cor.threshold = 0.75, point.color = viridis::viridis(100, option = "F", alpha = 0.8), seed = NULL, verbose = TRUE, n.cores = parallel::detectCores() - 1, cluster = NULL )
data |
Data frame with a response variable and a set of predictors. Default: |
dependent.variable.name |
Character string with the name of the response variable. Must be in the column names of |
predictor.variable.names |
Character vector with the names of the predictive variables, or object of class |
xy |
Data frame or matrix with two columns containing coordinates and named "x" and "y". If not provided, the comparison between models with and without variable interactions is not done. |
ranger.arguments |
Named list with ranger arguments (other arguments of this function can also go here). All ranger arguments are set to their default values except for 'importance', that is set to 'permutation' rather than 'none'. Please, consult the help file of ranger if you are not familiar with the arguments of this function. |
repetitions |
Integer, number of spatial folds to use during cross-validation. Must be lower than the total number of rows available in the model's data. Default: |
training.fraction |
Proportion between 0.5 and 0.9 indicating the proportion of records to be used as training set during spatial cross-validation. Default: |
importance.threshold |
Numeric between 0 and 1, quantile of variable importance scores over which to select individual predictors to explore interactions among them. Larger values reduce the number of potential interactions explored. Default: |
cor.threshold |
Numeric, maximum Pearson correlation between any pair of the selected interactions, and between any interaction and the predictors in |
point.color |
Colors of the plotted points. Can be a single color name (e.g. "red4"), a character vector with hexadecimal codes (e.g. "#440154FF" "#21908CFF" "#FDE725FF"), or function generating a palette (e.g. |
seed |
Integer, random seed to facilitate reproduciblity. If set to a given number, the results of the function are always the same. Default: |
verbose |
Logical. If |
n.cores |
Integer, number of cores to use for parallel execution. Creates a socket cluster with |
cluster |
A cluster definition generated with |
A list with seven slots:
screening
: Data frame with selection scores of all the interactions considered.
selected
: Data frame with selection scores of the selected interactions.
df
: Data frame with the computed interactions.
plot
: List of plots of the selected interactions versus the response variable. The output list can be plotted all at once with patchwork::wrap_plots(p)
or cowplot::plot_grid(plotlist = p)
, or one by one by extracting each plot from the list.
data
: Data frame with the response variable, the predictors, and the selected interactions, ready to be used as data
argument in the package functions.
dependent.variable.name
: Character, name of the response.
predictor.variable.names
: Character vector with the names of the predictors and the selected interactions.
if(interactive()){ #load example data data(plant_richness_df) new.features <- the_feature_engineer( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], n.cores = 1, verbose = TRUE ) new.features$screening new.features$selected new.features$columns }
if(interactive()){ #load example data data(plant_richness_df) new.features <- the_feature_engineer( data = plant_richness_df, dependent.variable.name = "richness_species_vascular", predictor.variable.names = colnames(plant_richness_df)[5:21], n.cores = 1, verbose = TRUE ) new.features$screening new.features$selected new.features$columns }
Resamples a set of points with x and y coordinates to impose a minimum distance among nearby points.
thinning(xy, minimum.distance = NULL)
thinning(xy, minimum.distance = NULL)
xy |
A data frame with columns named "x" and "y" representing geographic coordinates. |
minimum.distance |
Numeric, minimum distance to be set between nearby points, in the same units as the coordinates of xy. |
Generally used to remove redundant points that could produce pseudo-replication, and to limit sampling bias by disaggregating clusters of points.
A data frame with the same columns as xy
with points separated by the defined minimum distance.
if(interactive()){ #load example data data(plant_richness_df) #thinning to points separated by 5 degrees plant_richness.thin <- thinning( x = plant_richness_df, minimum.distance = 5 #points separated by at least 5 degrees ) plant_richness.thin }
if(interactive()){ #load example data data(plant_richness_df) #thinning to points separated by 5 degrees plant_richness.thin <- thinning( x = plant_richness_df, minimum.distance = 5 #points separated by at least 5 degrees ) plant_richness.thin }
Resamples a set of points with x and y coordinates by increasing the distance step by step until a given sample size is obtained.
thinning_til_n( xy, n = 30, distance.step = NULL )
thinning_til_n( xy, n = 30, distance.step = NULL )
xy |
A data frame with columns named "x" and "y" representing geographic coordinates. Default: |
n |
Integer, number of samples to obtain. Must be lower than |
distance.step |
Numeric, distance step used during the thinning iterations. If |
A data frame with the same columns as xy with a row number close to n.
if(interactive()){ #loading example data data(plant_richness_df) #thinning to ~20 records plant_richness.thin <- thinning_til_n( x = plant_richness_df, n = 20 ) plant_richness.thin }
if(interactive()){ #loading example data data(plant_richness_df) #thinning to ~20 records plant_richness.thin <- thinning_til_n( x = plant_richness_df, n = 20 ) plant_richness.thin }
Computes the variance inflation factor (VIF) of the colums in a data frame. Warning: variables in preference.order
not in colnames(x)
, and non-numeric columns are removed silently from x
and preference.order
. The same happens with rows having NA values (na.omit()
is applied). The function issues a warning if zero-variance columns are found.
vif(x)
vif(x)
x |
Data frame with numeric columns, typically containing a set of model predictors. |
A data frame with two columns having the name of the variables in 'x' and their respective VIF values.
if(interactive()){ data(plant_richness_df) vif(plant_richness_df[, 5:21]) }
if(interactive()){ data(plant_richness_df) vif(plant_richness_df[, 5:21]) }
Transforms a distance matrix into weights (1/distance.matrix) normalized by the row sums. Used to compute Moran's I values and Moran's Eigenvector Maps. Allows to apply a threshold to the distance matrix before computing the weights.
weights_from_distance_matrix( distance.matrix = NULL, distance.threshold = 0 )
weights_from_distance_matrix( distance.matrix = NULL, distance.threshold = 0 )
distance.matrix |
Distance matrix. Default: |
distance.threshold |
Numeric, positive, in the range of values of |
A weighted distance matrix.
if(interactive()){ #loading example distance matrix data(distance_matrix) #computing matrix of weights distance.matrix.weights <- weights_from_distance_matrix( distance.matrix = distance_matrix ) distance.matrix.weights }
if(interactive()){ #loading example distance matrix data(distance_matrix) #computing matrix of weights distance.matrix.weights <- weights_from_distance_matrix( distance.matrix = distance_matrix ) distance.matrix.weights }