| Title: | Easy Spatial Modeling with Random Forest |
|---|---|
| Description: | Automatic generation and selection of spatial predictors for Random Forest models fitted to spatially structured data. Spatial predictors are constructed from a distance matrix among training samples using Moran's Eigenvector Maps (MEMs; Dray, Legendre, and Peres-Neto 2006 <DOI:10.1016/j.ecolmodel.2006.02.015>) or the RFsp approach (Hengl et al. <DOI:10.7717/peerj.5518>). These predictors are used alongside user-supplied explanatory variables in Random Forest models. The package provides functions for model fitting, multicollinearity reduction, interaction identification, hyperparameter tuning, evaluation via spatial cross-validation, and result visualization using partial dependence and interaction plots. Model fitting relies on the 'ranger' package (Wright and Ziegler 2017 <DOI:10.18637/jss.v077.i01>). |
| Authors: | Blas M. Benito [aut, cre, cph] (ORCID: <https://orcid.org/0000-0001-5105-7232>) |
| Maintainer: | Blas M. Benito <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.1.5 |
| Built: | 2026-05-12 11:22:53 UTC |
| Source: | https://github.com/blasbenito/spatialrf |
Computes variance inflation factors for all variables in a data frame and returns them in a tidy format, sorted by VIF in descending order.
.vif_to_df(x)
x |
Data frame with numeric predictors for which to compute VIF values. |
Data frame with two columns: variable (character, variable names) and vif (numeric, VIF scores), sorted by VIF in descending order.
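For orientation, the sketch below builds a tidy VIF table of the kind described above in base R, taking the VIFs from the diagonal of the inverse correlation matrix. The helper name `vif_sketch` and the toy data are hypothetical, and the package's internal implementation may differ.

```r
# Hypothetical helper (base R only): VIF_j = 1 / (1 - R2_j), which equals
# the j-th diagonal element of the inverse of the correlation matrix.
vif_sketch <- function(x) {
  vif <- diag(solve(cor(x)))
  out <- data.frame(
    variable = colnames(x),
    vif = as.numeric(vif),
    stringsAsFactors = FALSE
  )
  # Sort by VIF in descending order
  out <- out[order(out$vif, decreasing = TRUE), ]
  rownames(out) <- NULL
  out
}

# Toy data (assumption): a and b are nearly collinear, d is independent
set.seed(1)
a <- rnorm(100)
toy <- data.frame(
  a = a,
  b = a + rnorm(100, sd = 0.1),
  d = rnorm(100)
)
vif_df <- vif_sketch(toy)
vif_df
```

In this toy case the two collinear columns should appear at the top of the table and the independent column at the bottom, with a VIF close to 1.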
Other utilities:
auc(),
beowulf_cluster(),
objects_size(),
optimization_function(),
prepare_importance_spatial(),
rescale_vector(),
root_mean_squared_error(),
setup_parallel_execution(),
standard_error(),
statistical_mode(),
thinning(),
thinning_til_n()
Computes the area under the ROC curve (AUC) for binary classification.
auc(o, p)
o |
Numeric vector of actual binary labels (0 or 1). Must have the same length as |
p |
Numeric vector of predicted probabilities (typically 0 to 1). Must have the same length as |
Numeric value between 0 and 1 representing the AUC. Higher values indicate better classification performance, with 0.5 indicating random performance and 1.0 indicating perfect classification.
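One common way to obtain this quantity is the rank-sum (Mann-Whitney) identity. The sketch below is a minimal base R illustration under that assumption; `auc_sketch` is a hypothetical name and is not claimed to be how `auc()` is implemented.

```r
# Hypothetical helper: AUC as the probability that a randomly chosen
# positive case receives a higher score than a randomly chosen negative case.
auc_sketch <- function(o, p) {
  stopifnot(length(o) == length(p))
  r <- rank(p)        # average ranks, so ties count as one half
  n1 <- sum(o == 1)   # number of positives
  n0 <- sum(o == 0)   # number of negatives
  (sum(r[o == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

auc_value <- auc_sketch(
  o = c(0, 0, 1, 1),
  p = c(0.1, 0.6, 0.4, 0.8)
)
auc_value
```

For these four cases, three of the four positive-negative pairs are ordered correctly, so the sketch returns 0.75.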
Other utilities:
.vif_to_df(),
beowulf_cluster(),
objects_size(),
optimization_function(),
prepare_importance_spatial(),
rescale_vector(),
root_mean_squared_error(),
setup_parallel_execution(),
standard_error(),
statistical_mode(),
thinning(),
thinning_til_n()
    auc(
      o = c(0, 0, 1, 1),
      p = c(0.1, 0.6, 0.4, 0.8)
    )
Filters predictors using sequential evaluation of pairwise correlations. Predictors are ranked by user preference (or column order) and evaluated sequentially. Each candidate is added to the selected pool only if its maximum absolute correlation with already-selected predictors does not exceed the threshold.
auto_cor(x = NULL, preference.order = NULL, cor.threshold = 0.5, verbose = TRUE)
x |
Data frame with predictors, or a |
preference.order |
Character vector specifying variable preference order. Does not need to include all variables in |
cor.threshold |
Numeric between 0 and 1 (recommended: 0.5 to 0.9). Maximum allowed absolute Pearson correlation between selected variables. Default: |
verbose |
Logical. If |
The algorithm follows these steps:
1. Rank predictors by preference.order (or use column order if NULL).
2. Initialize the selection pool with the first predictor.
3. For each remaining candidate:
   - Compute its maximum absolute correlation with the already-selected predictors.
   - If the maximum correlation is equal to or lower than cor.threshold, add the candidate to the selected pool.
   - Otherwise, skip the candidate.
4. Return the selected predictors.
Data cleaning: Variables in preference.order not found in colnames(x) are silently removed. Non-numeric columns are removed with a warning. Rows with NA values are removed via na.omit(). Zero-variance columns trigger a warning but are not removed.
This function can be chained with auto_vif() through pipes (see examples).
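The sequential filter described above can be sketched in a few lines of base R. The function `cor_filter_sketch` and the toy data are hypothetical; the sketch skips the data-cleaning steps and only illustrates the selection logic, not the package's actual code.

```r
# Hypothetical re-implementation of the sequential correlation filter.
cor_filter_sketch <- function(x, preference.order = NULL, cor.threshold = 0.5) {
  # Preferred variables first, remaining variables in column order
  preference.order <- intersect(preference.order, colnames(x))
  ranked <- c(preference.order, setdiff(colnames(x), preference.order))
  cor.matrix <- abs(cor(x[, ranked, drop = FALSE]))
  # The first predictor always enters the pool
  selected <- ranked[1]
  for (candidate in ranked[-1]) {
    # Keep the candidate only if it is not too correlated with the pool
    if (max(cor.matrix[candidate, selected]) <= cor.threshold) {
      selected <- c(selected, candidate)
    }
  }
  selected
}

# Toy data (assumption): a and b are nearly collinear, d is independent
set.seed(1)
a <- rnorm(200)
toy <- data.frame(
  a = a,
  b = a + rnorm(200, sd = 0.1),
  d = rnorm(200)
)
selected_default <- cor_filter_sketch(toy)
selected_pref <- cor_filter_sketch(toy, preference.order = "b")
selected_default
selected_pref
```

With column order, `a` is kept and `b` is dropped; moving `b` to the front of the preference order reverses that outcome, which is the point of ranking predictors before filtering.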
List with class variable_selection containing:
cor: Correlation matrix of selected variables (only if 2+ variables selected).
selected.variables: Character vector of selected variable names.
selected.variables.df: Data frame containing selected variables.
Other preprocessing:
auto_vif(),
case_weights(),
default_distance_thresholds(),
double_center_distance_matrix(),
is_binary(),
make_spatial_fold(),
make_spatial_folds(),
the_feature_engineer(),
weights_from_distance_matrix()
    data(
      plants_df,
      plants_predictors
    )
    y <- auto_cor(
      x = plants_df[, plants_predictors]
    )
    y$selected.variables
    y$cor
    head(y$selected.variables.df)
Filters predictors using sequential evaluation of variance inflation factors. Predictors are ranked by user preference (or column order) and evaluated sequentially. Each candidate is added to the selected pool only if the maximum VIF of all predictors (candidate plus already-selected) does not exceed the threshold.
auto_vif(x = NULL, preference.order = NULL, vif.threshold = 5, verbose = TRUE)
x |
Data frame with predictors, or a |
preference.order |
Character vector specifying variable preference order. Does not need to include all variables in |
vif.threshold |
Numeric (recommended: 2.5 to 10). Maximum allowed VIF among selected variables. Higher values allow more collinearity. Default: |
verbose |
Logical. If |
The algorithm follows these steps:
1. Rank predictors by preference.order (or use column order if NULL).
2. Initialize the selection pool with the first predictor.
3. For each remaining candidate:
   - Compute VIF values for the candidate plus all already-selected predictors.
   - If the maximum VIF is equal to or lower than vif.threshold, add the candidate to the selected pool.
   - Otherwise, skip the candidate.
4. Return the selected predictors with their VIF values.
Data cleaning: Variables in preference.order not found in colnames(x) are silently removed. Non-numeric columns are removed with a warning. Rows with NA values are removed via na.omit(). Zero-variance columns trigger a warning but are not removed.
This function can be chained with auto_cor() through pipes (see examples).
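As a rough illustration of the sequential VIF filter described above, the base R sketch below recomputes the VIFs of the candidate pool at every step. The name `vif_filter_sketch` and the toy data are hypothetical, data cleaning is omitted, and the package's own implementation may differ.

```r
# Hypothetical re-implementation of the sequential VIF filter.
vif_filter_sketch <- function(x, preference.order = NULL, vif.threshold = 5) {
  preference.order <- intersect(preference.order, colnames(x))
  ranked <- c(preference.order, setdiff(colnames(x), preference.order))
  selected <- ranked[1]
  for (candidate in ranked[-1]) {
    pool <- c(selected, candidate)
    # VIFs of the pool: diagonal of the inverse correlation matrix
    # (solve() would fail for perfectly collinear columns)
    vif <- diag(solve(cor(x[, pool, drop = FALSE])))
    if (max(vif) <= vif.threshold) {
      selected <- pool
    }
  }
  selected
}

# Toy data (assumption): a and b are nearly collinear, d is independent
set.seed(1)
a <- rnorm(200)
toy <- data.frame(
  a = a,
  b = a + rnorm(200, sd = 0.1),
  d = rnorm(200)
)
selected_vif <- vif_filter_sketch(toy)
selected_vif
```

Adding `b` to a pool that already contains `a` pushes the maximum VIF far above 5, so `b` is skipped while `d` is accepted.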
List with class variable_selection containing:
vif: Data frame with selected variable names and their VIF scores.
selected.variables: Character vector of selected variable names.
selected.variables.df: Data frame containing selected variables.
Other preprocessing:
auto_cor(),
case_weights(),
default_distance_thresholds(),
double_center_distance_matrix(),
is_binary(),
make_spatial_fold(),
make_spatial_folds(),
the_feature_engineer(),
weights_from_distance_matrix()
    data(
      plants_df,
      plants_predictors
    )
    y <- auto_vif(
      x = plants_df[, plants_predictors]
    )
    y$selected.variables
    y$vif
    head(y$selected.variables.df)
Creates a Beowulf cluster configuration from machine IPs, core counts, and user credentials.
beowulf_cluster(cluster.ips = NULL, cluster.cores = NULL, cluster.user = Sys.info()[["user"]], cluster.port = "11000", outfile = NULL)
cluster.ips |
Character vector of machine IP addresses in the cluster. The first IP is the main node (typically the machine running this code). Default: |
cluster.cores |
Integer vector of core counts for each machine. Must match the length of |
cluster.user |
Character string for the user name across all machines. Default: current system user. |
cluster.port |
Character string specifying the communication port. Default: |
outfile |
Character string or |
Network requirements: Firewalls on all machines must allow traffic on the specified port.
Usage workflow:
1. Create the cluster with this function.
2. Register it with doParallel::registerDoParallel().
3. Run parallelized code (e.g., foreach loops).
4. Stop the cluster with parallel::stopCluster().
Cluster object created by parallel::makeCluster(), ready for registration with doParallel::registerDoParallel().
Other utilities:
.vif_to_df(),
auc(),
objects_size(),
optimization_function(),
prepare_importance_spatial(),
rescale_vector(),
root_mean_squared_error(),
setup_parallel_execution(),
standard_error(),
statistical_mode(),
thinning(),
thinning_til_n()
    ## Not run:
    # Create cluster with 3 machines
    beowulf.cluster <- beowulf_cluster(
      cluster.ips = c(
        "192.168.1.10", # main node
        "192.168.1.11",
        "192.168.1.12"
      ),
      cluster.cores = c(7, 4, 4),
      cluster.user = "username",
      cluster.port = "11000"
    )
    # Register cluster for parallel processing
    doParallel::registerDoParallel(cl = beowulf.cluster)
    # Run parallelized code (e.g., foreach loop)
    # your_parallel_code_here
    # Stop cluster when done
    parallel::stopCluster(cl = beowulf.cluster)
    ## End(Not run)
Generates case weights to balance binary response variables for use with ranger models. Used internally by rf().
case_weights(data = NULL, dependent.variable.name = NULL)
data |
Data frame containing the response variable. Default: |
dependent.variable.name |
Character string specifying the response variable name. Must be a column in |
The weighting scheme assigns higher weights to the minority class to balance training:
Cases with value 0: weight = 1 / n_zeros
Cases with value 1: weight = 1 / n_ones
This ensures both classes contribute equally to model training regardless of class imbalance.
Numeric vector of length nrow(data) with case weights. Each weight is the inverse of the class frequency: 1/n_zeros for 0s and 1/n_ones for 1s.
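The weighting scheme is simple enough to sketch directly in base R. The helper `case_weights_sketch` below is hypothetical and works on a plain vector rather than a data frame; it only illustrates the inverse-frequency rule stated above.

```r
# Hypothetical helper: inverse class-frequency weights for a 0/1 vector.
case_weights_sketch <- function(y) {
  n_zeros <- sum(y == 0)
  n_ones <- sum(y == 1)
  ifelse(y == 0, 1 / n_zeros, 1 / n_ones)
}

y <- c(0, 0, 0, 1, 1)
w <- case_weights_sketch(y)
w

# Total weight per class: both classes sum to 1
class_totals <- tapply(w, y, sum)
class_totals
```

Because the weights within each class sum to 1, the three zeros and the two ones carry the same total weight despite the imbalance.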
Other preprocessing:
auto_cor(),
auto_vif(),
default_distance_thresholds(),
double_center_distance_matrix(),
is_binary(),
make_spatial_fold(),
make_spatial_folds(),
the_feature_engineer(),
weights_from_distance_matrix()
    # Imbalanced dataset: 3 zeros, 2 ones
    weights <- case_weights(
      data = data.frame(
        response = c(0, 0, 0, 1, 1)
      ),
      dependent.variable.name = "response"
    )
    weights
    # Returns: 0.333, 0.333, 0.333, 0.5, 0.5
    # Zeros get weight 1/3, ones get weight 1/2
Generates four evenly-spaced distance thresholds for spatial predictor generation, ranging from 0 to half the maximum distance in the matrix.
default_distance_thresholds(distance.matrix = NULL)
distance.matrix |
Numeric distance matrix (typically square and symmetric). Default: |
The maximum threshold is set to half the maximum distance to avoid spatial predictors based on distances that are too large to capture meaningful spatial autocorrelation. The four thresholds are evenly spaced using seq() with length.out = 4.
Numeric vector of length 4 with distance thresholds (floored to integers).
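The rule described above (four evenly spaced values from 0 to half the maximum distance, floored to integers) can be reproduced with `seq()` and `floor()`. The helper name `thresholds_sketch` and the toy distance matrix are assumptions made for illustration only.

```r
# Hypothetical helper following the description above.
thresholds_sketch <- function(distance.matrix) {
  floor(
    seq(
      from = 0,
      to = max(distance.matrix) / 2,
      length.out = 4
    )
  )
}

# Toy distance matrix (assumption): five points along a line
toy_distance <- as.matrix(dist(c(0, 100, 250, 600, 1000)))
toy_thresholds <- thresholds_sketch(toy_distance)
toy_thresholds
```

Here the maximum distance is 1000, so the thresholds run from 0 to 500 and come out as 0, 166, 333, and 500 after flooring.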
Other preprocessing:
auto_cor(),
auto_vif(),
case_weights(),
double_center_distance_matrix(),
is_binary(),
make_spatial_fold(),
make_spatial_folds(),
the_feature_engineer(),
weights_from_distance_matrix()
    data(plants_distance)
    thresholds <- default_distance_thresholds(
      distance.matrix = plants_distance
    )
    thresholds
    # Example output: c(0, 3333, 6666, 10000)
    # Four evenly-spaced thresholds from 0 to max(plants_distance)/2
Double-centers a distance matrix by converting it to weights and centering to zero row and column means. Required for computing Moran's Eigenvector Maps.
double_center_distance_matrix(distance.matrix = NULL, distance.threshold = 0)
distance.matrix |
Numeric distance matrix. Default: |
distance.threshold |
Numeric distance threshold for weight calculation. Distances above this threshold are set to 0 during weighting. Default: |
Double-centering is performed in two steps:
1. Convert distances to weights using weights_from_distance_matrix().
2. Center the matrix: subtract row means, subtract column means, and add the grand mean.
The resulting matrix is symmetric with zero row and column means, suitable for Moran's Eigenvector Maps computation.
Double-centered numeric matrix with the same dimensions as distance.matrix. The matrix has row means and column means of zero.
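The centering step can be sketched in base R as shown below. The helper `double_center_sketch` is hypothetical, and the plain inverse-distance weights used here are an assumption standing in for weights_from_distance_matrix(), which may threshold and normalize the weights differently.

```r
# Hypothetical helper: subtract row means and column means,
# then add the grand mean.
double_center_sketch <- function(w) {
  row.means <- rowMeans(w)
  col.means <- colMeans(w)
  grand.mean <- mean(w)
  centered <- sweep(w, 1, row.means, "-")
  centered <- sweep(centered, 2, col.means, "-")
  centered + grand.mean
}

# Toy distance matrix (assumption): five points along a line
toy_distance <- as.matrix(dist(c(0, 1, 3, 7, 12)))

# Assumed weighting: inverse distance, with zeros on the diagonal
toy_weights <- 1 / toy_distance
diag(toy_weights) <- 0

centered <- double_center_sketch(toy_weights)
round(rowMeans(centered), 10)
round(colMeans(centered), 10)
```

Both sets of means are zero up to floating-point error, and the result stays symmetric because the input weights are symmetric.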
weights_from_distance_matrix(), mem(), mem_multithreshold()
Other preprocessing:
auto_cor(),
auto_vif(),
case_weights(),
default_distance_thresholds(),
is_binary(),
make_spatial_fold(),
make_spatial_folds(),
the_feature_engineer(),
weights_from_distance_matrix()
    data(plants_distance)
    # Double-center the distance matrix
    centered <- double_center_distance_matrix(
      distance.matrix = plants_distance
    )
    # Verify row means are zero
    head(rowMeans(centered))
    # Verify column means are zero
    head(colMeans(centered))
Removes spatial predictors that are highly correlated with other spatial predictors or with non-spatial predictors. Particularly useful when using multiple distance thresholds that produce correlated spatial predictors.
filter_spatial_predictors(data = NULL, predictor.variable.names = NULL, spatial.predictors.df = NULL, cor.threshold = 0.5)
data |
Data frame containing the predictor variables. Default: |
predictor.variable.names |
Character vector of non-spatial predictor names. Must match column names in |
spatial.predictors.df |
Data frame of spatial predictors (e.g., from |
cor.threshold |
Numeric between 0 and 1 (recommended: 0.5 to 0.75). Maximum allowed absolute Pearson correlation. Default: |
Filtering is performed in two steps:
1. Remove spatial predictors correlated with each other (using auto_cor()).
2. Remove spatial predictors correlated with non-spatial predictors.
This two-step process ensures the retained spatial predictors are independent of both each other and the environmental predictors, improving model interpretability and reducing multicollinearity.
Data frame containing only spatial predictors with correlations below cor.threshold (both among themselves and with non-spatial predictors).
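To make the two-step filter concrete, the base R sketch below applies a sequential correlation filter among the spatial predictors and then drops those correlated with the non-spatial predictors. The function `filter_sketch`, the column names, and the toy data are all hypothetical; the package itself delegates the first step to auto_cor().

```r
# Hypothetical two-step filter in base R.
filter_sketch <- function(predictors, spatial, cor.threshold = 0.5) {
  # Step 1: sequentially drop spatial predictors correlated with each other
  cor.spatial <- abs(cor(spatial))
  keep <- colnames(spatial)[1]
  for (candidate in colnames(spatial)[-1]) {
    if (max(cor.spatial[candidate, keep]) <= cor.threshold) {
      keep <- c(keep, candidate)
    }
  }
  # Step 2: drop spatial predictors correlated with non-spatial predictors
  cor.mixed <- abs(cor(spatial[, keep, drop = FALSE], predictors))
  keep <- keep[apply(cor.mixed, 1, max) <= cor.threshold]
  spatial[, keep, drop = FALSE]
}

# Toy data (assumption)
set.seed(1)
env <- data.frame(temperature = rnorm(200))
s1 <- rnorm(200)
spatial <- data.frame(
  sp_1 = s1,
  sp_2 = s1 + rnorm(200, sd = 0.1),              # redundant with sp_1
  sp_3 = env$temperature + rnorm(200, sd = 0.1), # redundant with temperature
  sp_4 = rnorm(200)
)
filtered <- filter_sketch(env, spatial)
colnames(filtered)
```

In this toy case `sp_2` is removed in the first step and `sp_3` in the second, leaving `sp_1` and `sp_4`.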
Other spatial_analysis:
mem(),
mem_multithreshold(),
moran(),
moran_multithreshold(),
pca(),
pca_multithreshold(),
rank_spatial_predictors(),
residuals_diagnostics(),
residuals_test(),
select_spatial_predictors_recursive(),
select_spatial_predictors_sequential()
    data(
      plants_df,
      plants_predictors,
      plants_distance
    )
    # Generate spatial predictors using multiple distance thresholds
    mem.df <- mem_multithreshold(
      distance.matrix = plants_distance,
      distance.thresholds = c(0, 1000)
    )
    # Filter spatial predictors to remove redundancy
    # Removes spatial predictors correlated > 0.50 with each other
    # or with environmental predictors
    spatial.predictors.filtered <- filter_spatial_predictors(
      data = plants_df,
      predictor.variable.names = plants_predictors,
      spatial.predictors.df = mem.df,
      cor.threshold = 0.50
    )
    # Check dimensions
    ncol(mem.df) # original number
    ncol(spatial.predictors.filtered) # after filtering
Extracts aggregated performance metrics from a model evaluated with rf_evaluate().
get_evaluation(model)
model |
Model object with class |
This function returns aggregated statistics across all cross-validation repetitions. The "Testing" model metrics indicate the model's ability to generalize to unseen spatial locations.
Data frame with aggregated evaluation metrics containing:
model: Model type - "Full" (original model), "Training" (trained on training folds), or "Testing" (performance on testing folds, representing generalization ability).
metric: Metric name - "rmse", "nrmse", "r.squared", or "pseudo.r.squared".
mean, sd, min, max: Summary statistics across cross-validation repetitions.
rf_evaluate(), plot_evaluation(), print_evaluation()
Other model_info:
get_importance(),
get_importance_local(),
get_moran(),
get_performance(),
get_predictions(),
get_residuals(),
get_response_curves(),
get_spatial_predictors(),
print.rf(),
print_evaluation(),
print_importance(),
print_moran(),
print_performance()
    if(interactive()){
      data(plants_rf, plants_xy)
      # Evaluate model with spatial cross-validation
      m_evaluated <- rf_evaluate(
        model = plants_rf,
        xy = plants_xy,
        repetitions = 5,
        n.cores = 1
      )
      # Extract evaluation metrics
      eval_metrics <- get_evaluation(m_evaluated)
      eval_metrics
    }
Extracts variable importance scores from models fitted with rf(), rf_repeat(), or rf_spatial().
get_importance(model)
model |
Model object from |
For spatial models (rf_spatial()) with many spatial predictors, this function returns aggregated importance statistics for spatial predictors to improve readability. Non-spatial models return per-variable importance scores directly.
Data frame with columns variable (character) and importance (numeric), sorted by decreasing importance.
rf(), rf_repeat(), rf_spatial(), plot_importance(), print_importance()
Other model_info:
get_evaluation(),
get_importance_local(),
get_moran(),
get_performance(),
get_predictions(),
get_residuals(),
get_response_curves(),
get_spatial_predictors(),
print.rf(),
print_evaluation(),
print_importance(),
print_moran(),
print_performance()
    data(plants_rf)
    # Extract variable importance
    importance <- get_importance(plants_rf)
    head(importance)
    # View top 5 most important variables
    importance[1:5, ]
Extracts local (case-specific) variable importance scores from models fitted with rf(), rf_repeat(), or rf_spatial().
get_importance_local(model)
model |
Model object from |
Local importance measures how much each predictor contributes to predictions for individual observations, unlike global importance which summarizes contributions across all observations. This can reveal spatial or contextual patterns in variable influence.
Data frame with one row per observation and one column per predictor variable. Each cell contains the local importance score for that variable at that observation.
rf(), rf_repeat(), rf_spatial(), get_importance(), plot_importance(), print_importance()
Other model_info:
get_evaluation(),
get_importance(),
get_moran(),
get_performance(),
get_predictions(),
get_residuals(),
get_response_curves(),
get_spatial_predictors(),
print.rf(),
print_evaluation(),
print_importance(),
print_moran(),
print_performance()
    data(plants_rf)
    # Extract local importance scores
    local_imp <- get_importance_local(plants_rf)
    # View structure: rows = observations, columns = variables
    dim(local_imp)
    head(local_imp)
    # Find which variable is most important for first observation
    colnames(local_imp)[which.max(local_imp[1, ])]
Extracts Moran's I test results for spatial autocorrelation in model residuals from models fitted with rf(), rf_repeat(), or rf_spatial().
get_moran(model)
model |
Model object from |
Moran's I tests for spatial autocorrelation in model residuals. Significant positive values indicate residuals are spatially clustered, suggesting the model hasn't fully captured spatial patterns. For spatial models (rf_spatial()), low or non-significant Moran's I values indicate successful removal of spatial autocorrelation.
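To show what the statistic measures, the sketch below computes Moran's I from its textbook definition in base R. The helper `moran_i_sketch`, the inverse-distance weights, and the toy residuals are assumptions for illustration; the package's moran() may weight and test the statistic differently.

```r
# Hypothetical helper: I = (n / sum(w)) * sum(w_ij * z_i * z_j) / sum(z_i^2),
# where z are the centered values and w the spatial weights.
moran_i_sketch <- function(x, w) {
  z <- x - mean(x)
  n <- length(x)
  (n / sum(w)) * sum(w * outer(z, z)) / sum(z^2)
}

# Toy setting (assumption): 20 sites along a line, inverse-distance weights
toy_distance <- as.matrix(dist(1:20))
toy_weights <- 1 / toy_distance
diag(toy_weights) <- 0

# Spatially clustered residuals: neighbors share the same sign
clustered <- c(rep(-1, 10), rep(1, 10))
# Alternating residuals: neighbors have opposite signs
alternating <- rep(c(-1, 1), times = 10)

moran_clustered <- moran_i_sketch(clustered, toy_weights)
moran_alternating <- moran_i_sketch(alternating, toy_weights)
moran_clustered
moran_alternating
```

The clustered residuals give a positive value and the alternating residuals a negative one, matching the interpretation above that positive Moran's I signals spatially clustered residuals.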
Data frame with Moran's I statistics at multiple distance thresholds. Columns include distance.threshold, moran.i (statistic), p.value, interpretation, and method.
moran(), moran_multithreshold(), plot_moran(), print_moran()
Other model_info:
get_evaluation(),
get_importance(),
get_importance_local(),
get_performance(),
get_predictions(),
get_residuals(),
get_response_curves(),
get_spatial_predictors(),
print.rf(),
print_evaluation(),
print_importance(),
print_moran(),
print_performance()
    data(plants_rf)
    # Extract Moran's I test results
    moran_results <- get_moran(plants_rf)
    moran_results
    # Check for significant spatial autocorrelation
    significant <- moran_results[moran_results$p.value < 0.05, ]
    significant
Extracts out-of-bag (OOB) performance metrics from models fitted with rf(), rf_repeat(), or rf_spatial().
get_performance(model)
model |
Model object from |
Out-of-bag (OOB) performance is computed using observations not included in bootstrap samples during model training. Metrics typically include R-squared, pseudo R-squared, RMSE, and normalized RMSE. For repeated models, the median and median absolute deviation summarize performance across repetitions.
Data frame with performance metrics:
For rf() and rf_spatial(): columns metric and value
For rf_repeat(): columns metric, median, and median_absolute_deviation (MAD across repetitions)
rf(), rf_repeat(), rf_spatial(), print_performance()
Other model_info:
get_evaluation(),
get_importance(),
get_importance_local(),
get_moran(),
get_predictions(),
get_residuals(),
get_response_curves(),
get_spatial_predictors(),
print.rf(),
print_evaluation(),
print_importance(),
print_moran(),
print_performance()
    data(plants_rf)
    # Extract OOB performance metrics
    performance <- get_performance(plants_rf)
    performance
    # For repeated models, median and MAD are returned
    # (example would require rf_repeat model)
Extracts fitted (in-sample) predictions from models fitted with rf(), rf_repeat(), or rf_spatial().
get_predictions(model)
model |
Model object from |
This function returns fitted predictions for the training data used to build the model, not predictions for new data. For out-of-sample predictions on new data use stats::predict().
Numeric vector of fitted predictions with length equal to the number of training observations. For rf_repeat() models, returns the median prediction across repetitions.
rf(), rf_repeat(), rf_spatial(), get_residuals()
Other model_info:
get_evaluation(),
get_importance(),
get_importance_local(),
get_moran(),
get_performance(),
get_residuals(),
get_response_curves(),
get_spatial_predictors(),
print.rf(),
print_evaluation(),
print_importance(),
print_moran(),
print_performance()
    data(plants_rf)
    # Extract fitted predictions
    predictions <- get_predictions(plants_rf)
    head(predictions)
    # Check length matches number of observations
    length(predictions)
    # Compare with observed values to assess fit
    # (observed values would be in original data)
Extracts residuals (observed - predicted values) from models fitted with rf(), rf_repeat(), or rf_spatial().
get_residuals(model)
model |
Model object from |
Residuals are calculated as observed minus predicted values. They can be used to assess model fit, check assumptions, and diagnose patterns such as spatial autocorrelation (see get_moran()). Ideally, residuals should be randomly distributed with no systematic patterns.
Numeric vector of residuals with length equal to the number of training observations. For rf_repeat() models, returns the median residual across repetitions.
rf(), rf_repeat(), rf_spatial(), get_predictions(), get_moran(), plot_residuals_diagnostics()
Other model_info:
get_evaluation(),
get_importance(),
get_importance_local(),
get_moran(),
get_performance(),
get_predictions(),
get_response_curves(),
get_spatial_predictors(),
print.rf(),
print_evaluation(),
print_importance(),
print_moran(),
print_performance()
    data(plants_rf)
    # Extract residuals
    residuals <- get_residuals(plants_rf)
    head(residuals)
    # Check basic statistics
    summary(residuals)
    # Plot distribution to check for patterns
    hist(residuals, main = "Residual Distribution", xlab = "Residuals")
Extracts data for plotting partial dependence (response) curves showing how predictions vary with each predictor from models fitted with rf(), rf_repeat(), or rf_spatial().
get_response_curves(model = NULL, variables = NULL, quantiles = c(0.1, 0.5, 0.9), grid.resolution = 200, verbose = TRUE)
model |
Model object from |
variables |
Character vector of predictor names to plot. If |
quantiles |
Numeric vector of quantiles (0 to 1) at which to fix non-plotted predictors. Multiple quantiles show response variation under different scenarios. Default: |
grid.resolution |
Integer (20 to 500) specifying the number of points along the predictor axis. Higher values produce smoother curves. Default: |
verbose |
Logical. If |
Response curves (also called partial dependence plots) show how predicted values change as a focal predictor varies while holding other predictors constant at specified quantile values. This reveals the marginal effect of each predictor.
The function generates curves by:
1. Creating a grid of values for the focal predictor.
2. Fixing non-plotted predictors at each quantile (e.g., 0.1, 0.5, 0.9).
3. Predicting responses across the grid.
4. Repeating for each selected predictor and quantile combination.
Multiple quantiles reveal whether the effect of a predictor is consistent across different environmental contexts (parallel curves) or varies depending on other conditions (non-parallel curves).
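The procedure above can be sketched with any model that has a `predict()` method. The example below uses a small `lm()` with an interaction instead of a random forest so that it runs in base R; `response_curve_sketch`, the toy data, and the reduced grid size are assumptions, and the quantile column is kept numeric for simplicity.

```r
# Toy data (assumption): the effect of x1 depends on x2
set.seed(1)
toy <- data.frame(x1 = runif(100), x2 = runif(100))
toy$y <- toy$x1 * toy$x2 + rnorm(100, sd = 0.01)
toy_model <- lm(y ~ x1 * x2, data = toy)

# Hypothetical helper: one curve per quantile for a single focal predictor
response_curve_sketch <- function(model, data, focal, others, quantiles,
                                  grid.resolution = 20) {
  curves <- lapply(quantiles, function(q) {
    # Grid of values along the focal predictor
    grid <- data.frame(
      seq(min(data[[focal]]), max(data[[focal]]), length.out = grid.resolution)
    )
    colnames(grid) <- focal
    # Fix every other predictor at the requested quantile
    for (other in others) {
      grid[[other]] <- unname(quantile(data[[other]], probs = q))
    }
    data.frame(
      response = as.numeric(predict(model, newdata = grid)),
      predictor = grid[[focal]],
      quantile = q,
      predictor.name = focal
    )
  })
  do.call(rbind, curves)
}

curves <- response_curve_sketch(
  model = toy_model,
  data = toy,
  focal = "x1",
  others = "x2",
  quantiles = c(0.1, 0.5, 0.9)
)
head(curves)
```

Because the toy response contains an interaction, the curve for quantile 0.9 of `x2` is much steeper than the curve for quantile 0.1, which is the non-parallel pattern described above.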
Data frame with the following columns:
response: Predicted response values.
predictor: Predictor values along the gradient.
quantile: Factor indicating which quantile was used to fix other predictors.
model: Model index (only for rf_repeat() models with multiple repetitions).
predictor.name: Character name of the focal predictor.
response.name: Character name of the response variable.
rf(), rf_repeat(), rf_spatial(), plot_response_curves(), get_importance()
Other model_info:
get_evaluation(),
get_importance(),
get_importance_local(),
get_moran(),
get_performance(),
get_predictions(),
get_residuals(),
get_spatial_predictors(),
print.rf(),
print_evaluation(),
print_importance(),
print_moran(),
print_performance()
    data(plants_rf)
    # Extract response curve data for plotting
    curves <- get_response_curves(
      model = plants_rf,
      variables = NULL, # auto-select important variables
      quantiles = c(0.1, 0.5, 0.9)
    )
    # View structure
    head(curves)
    str(curves)
    # Check unique predictors included
    unique(curves$predictor.name)
Extracts the spatial predictors (Moran's Eigenvector Maps) used in a model fitted with rf_spatial().
get_spatial_predictors(model)
model |
Model object from |
Spatial predictors are Moran's Eigenvector Maps (MEMs) automatically generated and selected by rf_spatial() to capture spatial autocorrelation patterns in the data. This function extracts these predictors, which can be useful for understanding spatial structure or for making predictions on new spatial locations.
Data frame containing the spatial predictor values for each observation, with predictors ordered by decreasing importance.
rf_spatial(), mem(), mem_multithreshold(), get_importance()
Other model_info:
get_evaluation(),
get_importance(),
get_importance_local(),
get_moran(),
get_performance(),
get_predictions(),
get_residuals(),
get_response_curves(),
print.rf(),
print_evaluation(),
print_importance(),
print_moran(),
print_performance()
    data(plants_rf_spatial)
    # Extract spatial predictors
    spatial_preds <- get_spatial_predictors(plants_rf_spatial)
    head(spatial_preds)
    # Check dimensions
    dim(spatial_preds)
    # View predictor names (ordered by importance)
    colnames(spatial_preds)
Tests whether a variable contains only the values 0 and 1.
is_binary(data = NULL, dependent.variable.name = NULL)
data |
Data frame containing the variable to check. |
dependent.variable.name |
Character string with the name of the variable to test. Must be a column name in |
This function is used internally by spatialRF to determine whether to apply classification-specific methods (e.g., case weighting with case_weights()). The function returns FALSE if:
The variable has more than two unique values
The variable has only one unique value (constant)
The unique values are not exactly 0 and 1 (e.g., 1 and 2, or TRUE and FALSE)
Missing values (NA) are ignored when determining unique values.
Logical. TRUE if the variable contains exactly two unique values (0 and 1), FALSE otherwise.
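The rules above can be sketched in base R. This is a minimal illustration, not the package's implementation: `is_binary_sketch()` is a hypothetical name, and rejecting non-numeric columns is an assumption made here so that TRUE/FALSE columns return FALSE, as described.

```r
# Hypothetical helper illustrating the documented rules (not spatialRF code)
is_binary_sketch <- function(data, dependent.variable.name) {
  x <- data[[dependent.variable.name]]
  # Missing values are ignored when determining unique values
  values <- unique(x[!is.na(x)])
  # Assumption: logical or character columns are never treated as binary
  if (!is.numeric(values)) {
    return(FALSE)
  }
  # TRUE only when the distinct values are exactly 0 and 1
  length(values) == 2 && all(sort(values) == c(0, 1))
}

res.binary <- is_binary_sketch(data.frame(response = c(0, 1, NA, 1)), "response")
res.shifted <- is_binary_sketch(data.frame(response = c(1, 1, 2, 2)), "response")
res.logical <- is_binary_sketch(data.frame(response = c(TRUE, FALSE)), "response")
res.constant <- is_binary_sketch(data.frame(response = c(1, 1, 1)), "response")
```

Only the first call should return TRUE; the other three cover the shifted, logical, and constant cases listed above.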
Other preprocessing:
auto_cor(),
auto_vif(),
case_weights(),
default_distance_thresholds(),
double_center_distance_matrix(),
make_spatial_fold(),
make_spatial_folds(),
the_feature_engineer(),
weights_from_distance_matrix()
# Binary variable (returns TRUE)
is_binary(
  data = data.frame(response = c(0, 0, 0, 1, 1)),
  dependent.variable.name = "response"
)
# Non-binary variable (returns FALSE)
is_binary(
  data = data.frame(response = c(1, 2, 3, 4, 5)),
  dependent.variable.name = "response"
)
# Binary but wrong values (returns FALSE)
is_binary(
  data = data.frame(response = c(1, 1, 2, 2)),
  dependent.variable.name = "response"
)
Generates two spatially independent data folds by growing a rectangular buffer from a focal point until a specified fraction of records falls inside. Used internally by make_spatial_folds() and rf_evaluate() for spatial cross-validation.
make_spatial_fold(
  data = NULL,
  dependent.variable.name = NULL,
  xy.i = NULL,
  xy = NULL,
  distance.step.x = NULL,
  distance.step.y = NULL,
  training.fraction = 0.8
)
data |
Data frame containing response variable and predictors. Required only for binary response variables. |
dependent.variable.name |
Character string with the name of the response variable. Must be a column name in |
xy.i |
Single-row data frame with columns "x" (longitude), "y" (latitude), and "id" (record identifier). Defines the focal point from which the buffer grows. |
xy |
Data frame with columns "x" (longitude), "y" (latitude), and "id" (record identifier). Contains all spatial coordinates for the dataset. |
distance.step.x |
Numeric value specifying the buffer growth increment along the x-axis. Default: |
distance.step.y |
Numeric value specifying the buffer growth increment along the y-axis. Default: |
training.fraction |
Numeric value between 0.1 and 0.9 specifying the fraction of records to include in the training fold. Default: |
This function creates spatially independent training and testing folds for spatial cross-validation. The algorithm works as follows:
Starts with a small rectangular buffer centered on the focal point (xy.i)
Grows the buffer incrementally by distance.step.x and distance.step.y
Continues growing until the buffer contains the desired number of records (training.fraction * total records)
Assigns records inside the buffer to training and records outside to testing
Special handling for binary response variables:
When data and dependent.variable.name are provided and the response is binary (0/1), the function ensures that training.fraction applies to the number of presences (1s), not total records. This prevents imbalanced sampling in presence-absence models.
List with two elements:
training: Integer vector of record IDs (from xy$id) in the training fold.
testing: Integer vector of record IDs (from xy$id) in the testing fold.
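The buffer-growing steps described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: `grow_buffer_fold()` is a hypothetical name, the initial buffer half-width is assumed to equal one distance step, and the special handling of binary responses is omitted.

```r
# Hypothetical sketch of the rectangular buffer algorithm (not spatialRF code)
grow_buffer_fold <- function(xy.i, xy, distance.step.x, distance.step.y,
                             training.fraction = 0.8) {
  # Number of records the buffer must contain before it stops growing
  n.target <- round(training.fraction * nrow(xy))
  # Assumption: the buffer starts one step wide on each side of the focal point
  half.x <- distance.step.x
  half.y <- distance.step.y
  repeat {
    inside <- abs(xy$x - xy.i$x) <= half.x & abs(xy$y - xy.i$y) <= half.y
    if (sum(inside) >= n.target) break
    # Grow the buffer by one step along each axis
    half.x <- half.x + distance.step.x
    half.y <- half.y + distance.step.y
  }
  # Records inside the buffer go to training, the rest to testing
  list(training = xy$id[inside], testing = xy$id[!inside])
}

set.seed(1)
xy <- data.frame(id = 1:100, x = runif(100), y = runif(100))
fold <- grow_buffer_fold(
  xy.i = xy[1, ],
  xy = xy,
  distance.step.x = 0.01,
  distance.step.y = 0.01,
  training.fraction = 0.6
)
length(fold$training)
length(fold$testing)
```

Because the buffer grows in discrete steps, the training fold may slightly overshoot the requested fraction; smaller steps give a closer match at the cost of more iterations.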
make_spatial_folds(), rf_evaluate(), is_binary()
Other preprocessing:
auto_cor(),
auto_vif(),
case_weights(),
default_distance_thresholds(),
double_center_distance_matrix(),
is_binary(),
make_spatial_folds(),
the_feature_engineer(),
weights_from_distance_matrix()
data(plants_df, plants_xy)
# Create spatial fold centered on first coordinate
fold <- make_spatial_fold(
  xy.i = plants_xy[1, ],
  xy = plants_xy,
  training.fraction = 0.6
)
# View training and testing record IDs
fold$training
fold$testing
# Visualize the spatial split (training = red, testing = blue, center = black)
if (interactive()) {
  plot(plants_xy[c("x", "y")], type = "n", xlab = "", ylab = "")
  points(plants_xy[fold$training, c("x", "y")], col = "red4", pch = 15)
  points(plants_xy[fold$testing, c("x", "y")], col = "blue4", pch = 15)
  points(plants_xy[1, c("x", "y")], col = "black", pch = 15, cex = 2)
}
Applies make_spatial_fold() to every row in xy.selected, generating one spatially independent fold centered on each focal point. Used for spatial cross-validation in rf_evaluate().
make_spatial_folds(
  data = NULL,
  dependent.variable.name = NULL,
  xy.selected = NULL,
  xy = NULL,
  distance.step.x = NULL,
  distance.step.y = NULL,
  training.fraction = 0.75,
  n.cores = parallel::detectCores() - 1,
  cluster = NULL
)
data |
Data frame containing response variable and predictors. Required only for binary response variables. |
dependent.variable.name |
Character string with the name of the response variable. Must be a column name in |
xy.selected |
Data frame with columns "x" (longitude), "y" (latitude), and "id" (record identifier). Defines the focal points for fold creation. Typically a spatially thinned subset of |
xy |
Data frame with columns "x" (longitude), "y" (latitude), and "id" (record identifier). Contains all spatial coordinates for the dataset. |
distance.step.x |
Numeric value specifying the buffer growth increment along the x-axis. Default: |
distance.step.y |
Numeric value specifying the buffer growth increment along the y-axis. Default: |
training.fraction |
Numeric value between 0.1 and 0.9 specifying the fraction of records to include in the training fold. Default: |
n.cores |
Integer specifying the number of CPU cores for parallel execution. Default: |
cluster |
Optional cluster object created with |
This function creates multiple spatially independent folds for spatial cross-validation by calling make_spatial_fold() once for each row in xy.selected. Each fold is created by growing a rectangular buffer from the corresponding focal point until the desired training.fraction is achieved.
Parallel execution:
The function uses parallel processing to speed up fold creation. You can control parallelization with n.cores or provide a pre-configured cluster object.
Typical workflow:
Thin spatial points with thinning() or thinning_til_n() to create xy.selected
Create spatial folds with this function
Use the folds for spatial cross-validation in rf_evaluate()
List where each element corresponds to a row in xy.selected and contains:
training: Integer vector of record IDs (from xy$id) in the training fold.
testing: Integer vector of record IDs (from xy$id) in the testing fold.
make_spatial_fold(), rf_evaluate(), thinning(), thinning_til_n()
Other preprocessing:
auto_cor(),
auto_vif(),
case_weights(),
default_distance_thresholds(),
double_center_distance_matrix(),
is_binary(),
make_spatial_fold(),
the_feature_engineer(),
weights_from_distance_matrix()
data(plants_df, plants_xy)
# Thin to 10 focal points to speed up example
xy.thin <- thinning_til_n(
  xy = plants_xy,
  n = 10
)
# Create spatial folds centered on the 10 thinned points
folds <- make_spatial_folds(
  xy.selected = xy.thin,
  xy = plants_xy,
  distance.step.x = 0.05,
  training.fraction = 0.6,
  n.cores = 1
)
# Each element is a fold with training and testing indices
length(folds) # 10 folds
names(folds[[1]]) # "training" and "testing"
# Visualize first fold (training = red, testing = blue, center = black)
if (interactive()) {
  plot(plants_xy[c("x", "y")], type = "n", xlab = "", ylab = "")
  points(plants_xy[folds[[1]]$training, c("x", "y")], col = "red4", pch = 15)
  points(plants_xy[folds[[1]]$testing, c("x", "y")], col = "blue4", pch = 15)
  points(
    plants_xy[folds[[1]]$training[1], c("x", "y")],
    col = "black",
    pch = 15,
    cex = 2
  )
}
Computes Moran's Eigenvector Maps (MEMs) from a distance matrix. Returns only eigenvectors with positive spatial autocorrelation, which capture broad to medium-scale spatial patterns.
mem(distance.matrix = NULL, distance.threshold = 0, colnames.prefix = "mem")
distance.matrix |
Numeric distance matrix between spatial locations. |
distance.threshold |
Numeric value specifying the maximum distance for spatial neighbors. Distances above this threshold are set to zero. Default: |
colnames.prefix |
Character string used as prefix for column names in the output. Default: |
Moran's Eigenvector Maps (MEMs) are spatial variables that represent spatial structures at different scales. The function creates MEMs through the following steps:
Double-centers the distance matrix using double_center_distance_matrix()
Computes eigenvectors and eigenvalues using base::eigen()
Normalizes eigenvalues by dividing by the maximum absolute eigenvalue
Selects only eigenvectors with positive normalized eigenvalues
Positive vs. negative eigenvalues:
Eigenvectors with positive eigenvalues represent positive spatial autocorrelation (nearby locations are similar), capturing broad to medium-scale spatial patterns. Eigenvectors with negative eigenvalues represent negative spatial autocorrelation (nearby locations are dissimilar) and are excluded. The returned MEMs are ordered by eigenvalue magnitude, with the first columns capturing the broadest spatial patterns.
These MEMs are used as spatial predictors in rf_spatial() to account for spatial autocorrelation in model residuals.
Data frame where each column is a MEM (spatial predictor) representing a different scale of spatial pattern. Columns are named with the pattern <prefix>_<number> (e.g., "mem_1", "mem_2").
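The four steps listed above can be sketched with base R alone. This is an illustration rather than the package's code: the inverse-distance weights with a zero diagonal and the simulated coordinates are assumptions of the sketch, and double centering is written out with a centering matrix instead of calling double_center_distance_matrix().

```r
set.seed(1)
# Simulated coordinates and their distance matrix (illustrative data)
xy <- cbind(runif(30), runif(30))
d <- as.matrix(dist(xy))

# Assumption: inverse-distance weights with a zero diagonal
w <- 1 / d
diag(w) <- 0

# Step 1: double centering (subtract row and column means, add the grand mean)
n <- nrow(w)
centering <- diag(n) - matrix(1 / n, n, n)
w.centered <- centering %*% w %*% centering

# Step 2: eigen decomposition of the centered matrix
e <- eigen(w.centered, symmetric = TRUE)

# Step 3: normalize eigenvalues by the maximum absolute eigenvalue
values <- e$values / max(abs(e$values))

# Step 4: keep only eigenvectors with positive normalized eigenvalues
keep <- values > 0
mems <- as.data.frame(e$vectors[, keep, drop = FALSE])
colnames(mems) <- paste("mem", seq_len(ncol(mems)), sep = "_")

dim(mems)
```

With `symmetric = TRUE`, `eigen()` returns eigenvalues in decreasing order, so the retained columns are already sorted from the largest positive eigenvalue downwards.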
mem_multithreshold(), rf_spatial(), double_center_distance_matrix()
Other spatial_analysis:
filter_spatial_predictors(),
mem_multithreshold(),
moran(),
moran_multithreshold(),
pca(),
pca_multithreshold(),
rank_spatial_predictors(),
residuals_diagnostics(),
residuals_test(),
select_spatial_predictors_recursive(),
select_spatial_predictors_sequential()
data(plants_distance)
# Compute MEMs from distance matrix
mems <- mem(distance.matrix = plants_distance)
# View structure
head(mems)
dim(mems)
# Check column names
colnames(mems)[1:5]
Computes Moran's Eigenvector Maps (MEMs) using mem() at multiple distance thresholds and combines them into a single data frame. This creates spatial predictors capturing patterns at different spatial scales.
mem_multithreshold(
  distance.matrix = NULL,
  distance.thresholds = NULL,
  max.spatial.predictors = NULL
)
distance.matrix |
Numeric distance matrix between spatial locations. |
distance.thresholds |
Numeric vector of distance thresholds. Each threshold defines the maximum distance for spatial neighbors at that scale. Default: |
max.spatial.predictors |
Integer specifying the maximum number of spatial predictors to return. If the total number of MEMs exceeds this value, only the first |
This function generates spatial predictors at multiple spatial scales by computing MEMs at different distance thresholds. Different thresholds capture spatial patterns at different scales:
Smaller thresholds (e.g., 0) capture fine-scale spatial patterns
Larger thresholds capture broad-scale spatial patterns
Algorithm:
For each distance threshold, calls mem() to compute MEMs
Each mem() call applies the threshold, double-centers the matrix, and extracts positive eigenvectors
Combines all MEMs into a single data frame
Optionally limits the total number of predictors with max.spatial.predictors
The resulting MEMs are used as spatial predictors in rf_spatial() to model spatial autocorrelation at multiple scales simultaneously.
Data frame with one row per observation (matching distance.matrix dimensions) and columns representing MEMs at different distance thresholds. Column names follow the pattern spatial_predictor_<threshold>_<number> (e.g., "spatial_predictor_0_1", "spatial_predictor_1000_2").
mem(), rf_spatial(), default_distance_thresholds(), double_center_distance_matrix()
Other spatial_analysis:
filter_spatial_predictors(),
mem(),
moran(),
moran_multithreshold(),
pca(),
pca_multithreshold(),
rank_spatial_predictors(),
residuals_diagnostics(),
residuals_test(),
select_spatial_predictors_recursive(),
select_spatial_predictors_sequential()
data(plants_distance)
# Compute MEMs for multiple distance thresholds
mems <- mem_multithreshold(
  distance.matrix = plants_distance,
  distance.thresholds = c(0, 1000, 5000)
)
# View structure
head(mems)
dim(mems)
# Check column names showing threshold and predictor number
colnames(mems)[1:6]
# Limit number of spatial predictors
mems_limited <- mem_multithreshold(
  distance.matrix = plants_distance,
  distance.thresholds = c(0, 1000, 5000),
  max.spatial.predictors = 20
)
dim(mems_limited)
Computes Moran's I, a measure of spatial autocorrelation that tests whether values are more similar (positive autocorrelation) or dissimilar (negative autocorrelation) among spatial neighbors than expected by chance.
moran(
  x = NULL,
  distance.matrix = NULL,
  distance.threshold = NULL,
  verbose = TRUE
)
x |
Numeric vector to test for spatial autocorrelation. Typically model residuals or a response variable. |
distance.matrix |
Numeric distance matrix between observations. Must have the same number of rows as the length of |
distance.threshold |
Numeric value specifying the maximum distance for spatial neighbors. Distances above this threshold are set to zero during weighting. Default: |
verbose |
Logical. If |
Moran's I is a measure of spatial autocorrelation that quantifies the degree to which nearby observations have similar values. The statistic ranges approximately from -1 to +1:
Positive values: Similar values cluster together (positive spatial autocorrelation)
Values near zero: Random spatial pattern (no spatial autocorrelation)
Negative values: Dissimilar values are adjacent (negative spatial autocorrelation, rare in practice)
Statistical testing:
The function compares the observed Moran's I to the expected value under the null hypothesis of no spatial autocorrelation, EI = -1/(n-1), where n is the number of observations. The p-value is computed using a normal approximation. Results are interpreted at the 0.05 significance level.
Moran's scatterplot:
The plot shows original values (x-axis) against spatially lagged values (y-axis). The slope of the fitted line approximates Moran's I. Points in quadrants I and III indicate positive spatial autocorrelation; points in quadrants II and IV indicate negative spatial autocorrelation.
This implementation is inspired by the Moran.I() function in the ape package.
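A minimal base R sketch of the statistic and of the scatterplot slope may help make the description concrete. It is not the package's implementation: `moran_i_sketch()` is a hypothetical name, and the row-standardized inverse-distance weights and the one-dimensional transect are assumptions of the example. Under row-standardized weights the slope of the spatial lag regressed on the original values equals Moran's I.

```r
# Hypothetical helper: Moran's I for values x and a weights matrix w
# (w must have a zero diagonal)
moran_i_sketch <- function(x, w) {
  n <- length(x)
  z <- x - mean(x)
  (n / sum(w)) * sum(w * outer(z, z)) / sum(z^2)
}

# Twenty locations along a transect (illustrative data)
coords <- 1:20
d <- as.matrix(dist(coords))

# Assumption: inverse-distance weights, zero diagonal, rows summing to 1
w <- 1 / d
diag(w) <- 0
w <- w / rowSums(w)

x.gradient <- as.numeric(coords)      # smooth gradient: neighbors are similar
x.alternating <- rep(c(0, 1), 10)     # checkerboard: neighbors are dissimilar

moran.gradient <- moran_i_sketch(x.gradient, w)
moran.alternating <- moran_i_sketch(x.alternating, w)

# Expected value under the null hypothesis of no spatial autocorrelation
moran.expected <- -1 / (length(coords) - 1)

# Moran's scatterplot: spatially lagged values against original values
x.lag <- as.numeric(w %*% x.gradient)
slope <- unname(coef(lm(x.lag ~ x.gradient))[2])
```

The gradient yields a value above the null expectation and the alternating pattern a negative value, which matches the interpretation of positive and negative spatial autocorrelation given above.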
List of class "moran" with three elements:
test: Data frame containing:
distance.threshold: The distance threshold used
moran.i.null: Expected Moran's I under null hypothesis of no spatial autocorrelation
moran.i: Observed Moran's I statistic
p.value: Two-tailed p-value from normal approximation
interpretation: Text interpretation of the result
plot: ggplot object showing Moran's scatterplot (values vs. spatial lag values with linear fit).
plot.df: Data frame with columns x (original values) and x.lag (spatially lagged values) used to generate the plot.
moran_multithreshold(), get_moran()
Other spatial_analysis:
filter_spatial_predictors(),
mem(),
mem_multithreshold(),
moran_multithreshold(),
pca(),
pca_multithreshold(),
rank_spatial_predictors(),
residuals_diagnostics(),
residuals_test(),
select_spatial_predictors_recursive(),
select_spatial_predictors_sequential()
data(plants_df, plants_distance, plants_response)
# Test for spatial autocorrelation in response variable
moran_test <- moran(
  x = plants_df[[plants_response]],
  distance.matrix = plants_distance,
  distance.threshold = 1000
)
# View test results
moran_test$test
# Access components
moran_test$test$moran.i # Observed Moran's I
moran_test$test$p.value # P-value
moran_test$test$interpretation # Text interpretation
Computes Moran's I at multiple distance thresholds to assess spatial autocorrelation across different neighborhood scales. Identifies the distance threshold with the strongest spatial autocorrelation.
moran_multithreshold(
  x = NULL,
  distance.matrix = NULL,
  distance.thresholds = NULL,
  verbose = TRUE
)
x |
Numeric vector to test for spatial autocorrelation. Typically model residuals or a response variable. |
distance.matrix |
Numeric distance matrix between observations. Must have the same number of rows as the length of |
distance.thresholds |
Numeric vector of distance thresholds defining different neighborhood scales. Each threshold specifies the maximum distance for spatial neighbors at that scale. Default: |
verbose |
Logical. If |
This function applies moran() at multiple distance thresholds to explore spatial autocorrelation at different spatial scales. This multi-scale approach is valuable for several reasons:
Scale exploration: Different processes may operate at different spatial scales. Testing multiple thresholds reveals the scale(s) at which spatial autocorrelation is strongest.
Optimal neighborhood definition: Identifies the distance threshold that best captures the spatial structure in the data.
Uncertainty assessment: Spatial neighborhoods are often uncertain in ecological and spatial data. Testing multiple thresholds accounts for this uncertainty.
Interpreting results:
The plot shows Moran's I values across distance thresholds. Peaks in Moran's I indicate spatial scales where autocorrelation is strongest. The max.moran and max.moran.distance.threshold values identify the optimal scale. Significant results (p equal to or lower than 0.05) indicate spatial autocorrelation at that particular scale.
This function is commonly used to:
Detect spatial autocorrelation in model residuals at multiple scales
Determine appropriate distance thresholds for generating spatial predictors with mem_multithreshold()
Assess whether spatial patterns vary across scales
List with four elements:
per.distance: Data frame with one row per distance threshold, containing columns:
distance.threshold: Distance threshold used
moran.i: Observed Moran's I statistic
moran.i.null: Expected Moran's I under null hypothesis
p.value: Two-tailed p-value
interpretation: Text interpretation of the result
plot: ggplot object showing how Moran's I varies across distance thresholds, highlighting significant results.
max.moran: Numeric value of the maximum Moran's I observed across all thresholds.
max.moran.distance.threshold: Distance threshold (in distance matrix units) where Moran's I is maximized.
moran(), mem_multithreshold(), default_distance_thresholds(), get_moran()
Other spatial_analysis:
filter_spatial_predictors(),
mem(),
mem_multithreshold(),
moran(),
pca(),
pca_multithreshold(),
rank_spatial_predictors(),
residuals_diagnostics(),
residuals_test(),
select_spatial_predictors_recursive(),
select_spatial_predictors_sequential()
data(plants_df, plants_distance, plants_response)
# Test spatial autocorrelation at multiple distance thresholds
moran_multi <- moran_multithreshold(
  x = plants_df[[plants_response]],
  distance.matrix = plants_distance,
  distance.thresholds = c(0, 1000, 5000)
)
# View results for all thresholds
moran_multi$per.distance
# Find optimal distance threshold
moran_multi$max.moran.distance.threshold
moran_multi$max.moran
# Plot shows spatial autocorrelation across scales
moran_multi$plot
Returns a summary of objects in the current R workspace, sorted from largest to smallest by memory size. Useful for identifying memory-intensive objects and diagnosing memory issues.
objects_size(n = 10)
n |
Integer specifying the number of largest objects to display. Default: |
This utility function helps monitor memory usage by displaying the largest objects in your workspace. It's particularly useful for:
Identifying memory bottlenecks during large spatial analyses
Deciding which objects to remove to free memory
Understanding the memory footprint of different data structures
The function examines all objects in the global environment (.GlobalEnv) and calculates their memory usage using utils::object.size(). Objects are automatically sorted by size in descending order.
Data frame with object names as row names and four columns:
Type: Object class (e.g., "data.frame", "matrix", "list").
Size: Memory size with automatic unit formatting (e.g., "1.2 Mb", "500 bytes").
Length/Rows: Number of elements (for vectors) or rows (for data frames/matrices).
Columns: Number of columns (for data frames/matrices; NA for vectors and other objects).
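The core of this approach (list objects, measure them with utils::object.size(), sort in descending order) can be sketched as follows. `objects_size_sketch()` is a hypothetical name, and the `envir` argument and the single `Size.bytes` column are simplifications introduced for the illustration; the function documented here inspects the global environment and returns the four columns listed above.

```r
# Hypothetical sketch (not spatialRF code): rank objects in an environment by size
objects_size_sketch <- function(n = 10, envir = globalenv()) {
  object.names <- ls(envir = envir)
  # Size of each object in bytes
  sizes <- vapply(
    object.names,
    function(nm) as.numeric(utils::object.size(get(nm, envir = envir))),
    numeric(1)
  )
  out <- data.frame(Size.bytes = sizes, row.names = object.names)
  # Largest objects first
  out <- out[order(out$Size.bytes, decreasing = TRUE), , drop = FALSE]
  utils::head(out, n)
}

# Use a dedicated environment so the example does not depend on the workspace
demo.env <- new.env()
assign("small.vector", runif(100), envir = demo.env)
assign("medium.matrix", matrix(runif(10000), 100, 100), envir = demo.env)
assign("large.matrix", matrix(runif(100000), 1000, 100), envir = demo.env)

res <- objects_size_sketch(n = 2, envir = demo.env)
res
```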
utils::object.size(), base::ls(), base::rm()
Other utilities:
.vif_to_df(),
auc(),
beowulf_cluster(),
optimization_function(),
prepare_importance_spatial(),
rescale_vector(),
root_mean_squared_error(),
setup_parallel_execution(),
standard_error(),
statistical_mode(),
thinning(),
thinning_til_n()
# Create some objects of different sizes
small_vector <- runif(100)
medium_matrix <- matrix(runif(10000), 100, 100)
large_matrix <- matrix(runif(100000), 1000, 100)
# View the 5 largest objects
objects_size(n = 5)
# Check all objects (up to 10 by default)
objects_size()
Computes optimization scores for candidate spatial predictor sets using either the "moran.i" or "p.value" method. Higher scores indicate better trade-offs between spatial autocorrelation reduction, model performance, and parsimony.
optimization_function(
  x = NULL,
  weight.r.squared = NULL,
  weight.penalization.n.predictors = NULL,
  optimization.method = "moran.i"
)
x |
Data frame containing optimization metrics for candidate spatial predictor sets. Generated internally by |
weight.r.squared |
Numeric value between 0 and 1 specifying the weight for R-squared in the optimization score. Higher values prioritize model performance. |
weight.penalization.n.predictors |
Numeric value between 0 and 1 specifying the weight for penalizing the number of spatial predictors. Higher values favor more parsimonious models. |
optimization.method |
Character string specifying the optimization method: |
This function balances three objectives when selecting spatial predictors:
Reduce spatial autocorrelation: Maximize 1 - Moran's I to minimize residual spatial autocorrelation
Maintain model performance: Account for model R-squared
Favor parsimony: Penalize models with many spatial predictors
Optimization methods:
The "moran.i" method computes:
score = (1 - Moran's I) + w1 × R² - w2 × penalization
where all components are rescaled to the range 0 to 1, w1 = weight.r.squared, and w2 = weight.penalization.n.predictors.
The "p.value" method computes:
score = max(1 - Moran's I, binary p-value) + w1 × R² - w2 × penalization
where the binary p-value is 1 if p is greater than 0.05 (no significant spatial autocorrelation), and 0 otherwise.
Practical differences:
The "moran.i" method uses continuous Moran's I values and typically selects more spatial predictors to achieve lower spatial autocorrelation
The "p.value" method uses binary significance thresholds and typically selects fewer predictors, stopping once spatial autocorrelation becomes non-significant
The optimal model is the one with the highest optimization score.
Numeric vector of optimization scores, one per row in x. Higher scores indicate better solutions. All values are rescaled between 0 and 1 for comparability.
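The two scoring formulas can be sketched in base R to show how they rank candidate solutions. This is an illustration under assumptions, not the package's code: `rescale01()` and `score_sketch()` are hypothetical helpers, the column names follow the simulated input used in the examples of this help page, and `p.value.binary` is taken to be 1 when residual spatial autocorrelation is not significant.

```r
# Hypothetical helper: rescale a numeric vector to the range 0 to 1
rescale01 <- function(v) {
  rng <- range(v)
  if (diff(rng) == 0) {
    return(rep(0, length(v)))
  }
  (v - rng[1]) / diff(rng)
}

# Hypothetical sketch of both scoring methods (not spatialRF code)
score_sketch <- function(x, w1, w2, method = c("moran.i", "p.value")) {
  method <- match.arg(method)
  # Autocorrelation term: larger when Moran's I is lower
  autocor <- rescale01(1 - x$moran.i)
  if (method == "p.value") {
    # A non-significant test counts as a full reduction in autocorrelation
    autocor <- pmax(autocor, x$p.value.binary)
  }
  raw <- autocor +
    w1 * rescale01(x$r.squared) -
    w2 * rescale01(x$penalization.per.variable)
  rescale01(raw)
}

# Simulated candidates, from fewest (row 1) to most (row 4) spatial predictors
opt_data <- data.frame(
  moran.i = c(0.5, 0.3, 0.2, 0.15),
  r.squared = c(0.6, 0.65, 0.68, 0.69),
  penalization.per.variable = c(0.1, 0.2, 0.3, 0.4),
  p.value.binary = c(0, 0, 1, 1)
)

scores.moran <- score_sketch(opt_data, w1 = 0.5, w2 = 0.5, method = "moran.i")
scores.pvalue <- score_sketch(opt_data, w1 = 0.5, w2 = 0.5, method = "p.value")

which.max(scores.moran)   # row 4: keeps adding predictors to lower Moran's I
which.max(scores.pvalue)  # row 3: stops at the first non-significant candidate
```

On this toy input the "p.value" variant prefers the smaller candidate set, in line with the practical differences described above.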
select_spatial_predictors_recursive(), select_spatial_predictors_sequential(), moran()
Other utilities:
.vif_to_df(),
auc(),
beowulf_cluster(),
objects_size(),
prepare_importance_spatial(),
rescale_vector(),
root_mean_squared_error(),
setup_parallel_execution(),
standard_error(),
statistical_mode(),
thinning(),
thinning_til_n()
## Not run:
# This function is typically called internally during spatial predictor selection
# Example showing the structure of input data:
# Simulated optimization data frame
opt_data <- data.frame(
  moran.i = c(0.5, 0.3, 0.2, 0.15),
  r.squared = c(0.6, 0.65, 0.68, 0.69),
  penalization.per.variable = c(0.1, 0.2, 0.3, 0.4),
  p.value.binary = c(0, 0, 1, 1)
)
# Compute optimization scores
scores_moran <- optimization_function(
  x = opt_data,
  weight.r.squared = 0.5,
  weight.penalization.n.predictors = 0.5,
  optimization.method = "moran.i"
)
# Compare methods
scores_pvalue <- optimization_function(
  x = opt_data,
  weight.r.squared = 0.5,
  weight.penalization.n.predictors = 0.5,
  optimization.method = "p.value"
)
# Higher score indicates better solution
which.max(scores_moran)
which.max(scores_pvalue)
## End(Not run)
Computes principal components from a numeric matrix or data frame with automatic scaling and zero-variance removal. Returns all principal components as a data frame. Wrapper for stats::prcomp().
pca(x = NULL, colnames.prefix = "pca_factor")
x |
Numeric matrix or data frame to decompose into principal components. |
colnames.prefix |
Character string used as prefix for column names in the output. Default: |
This function performs Principal Component Analysis (PCA) to create uncorrelated linear combinations of the original variables. The PCA process:
Removes columns with zero variance (constant values)
Scales all remaining variables to mean = 0 and standard deviation = 1
Computes principal components using singular value decomposition
Returns all principal components ordered by decreasing variance explained
Usage in spatial analysis:
PCA is useful for dimension reduction when working with spatial distance matrices or multiple correlated spatial predictors. It creates orthogonal (uncorrelated) variables that capture the main patterns of variation while reducing dimensionality.
For spatial modeling with rf_spatial(), principal components of distance matrices can serve as alternative spatial predictors to Moran's Eigenvector Maps (MEMs). Use pca_multithreshold() to compute PCA across multiple distance thresholds for multi-scale spatial modeling.
Data frame where each column is a principal component, ordered by decreasing variance explained. Columns are named with the pattern <prefix>_<number> (e.g., "pca_factor_1", "pca_factor_2"). The number of rows matches the number of rows in x.
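The PCA process described above can be sketched directly with stats::prcomp(). `pca_sketch()` is a hypothetical name used for illustration, and the sketch assumes numeric input without missing values; it is not the package's implementation.

```r
# Hypothetical sketch of the documented PCA steps (not spatialRF code)
pca_sketch <- function(x, colnames.prefix = "pca_factor") {
  x <- as.data.frame(x)
  # Step 1: remove zero-variance columns, which cannot be scaled to unit variance
  keep <- vapply(x, function(col) stats::var(col) > 0, logical(1))
  x <- x[, keep, drop = FALSE]
  # Steps 2 and 3: center, scale, and decompose (prcomp uses SVD internally)
  fit <- stats::prcomp(x, center = TRUE, scale. = TRUE)
  # Step 4: components are returned by decreasing variance explained
  out <- as.data.frame(fit$x)
  colnames(out) <- paste(colnames.prefix, seq_len(ncol(out)), sep = "_")
  out
}

set.seed(1)
df <- data.frame(a = rnorm(20), b = rnorm(20), constant = 1)
pcs <- pca_sketch(df)
colnames(pcs)
```

The constant column is dropped before scaling, so the toy input yields two components rather than three.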
pca_multithreshold(), mem(), stats::prcomp()
Other spatial_analysis:
filter_spatial_predictors(),
mem(),
mem_multithreshold(),
moran(),
moran_multithreshold(),
pca_multithreshold(),
rank_spatial_predictors(),
residuals_diagnostics(),
residuals_test(),
select_spatial_predictors_recursive(),
select_spatial_predictors_sequential()
data(plants_distance)
# Compute principal components from distance matrix
pca_components <- pca(x = plants_distance)
# View structure
head(pca_components)
dim(pca_components)
# Check column names
colnames(pca_components)[1:5]
# Custom column prefix
pca_custom <- pca(
  x = plants_distance,
  colnames.prefix = "distance_pc"
)
colnames(pca_custom)[1:3]
Computes principal components of a distance matrix at multiple distance thresholds to generate multi-scale spatial predictors for rf_spatial(). Each distance threshold defines a different neighborhood scale, and PCA is applied to the weighted distance matrix at each scale.
pca_multithreshold(
  distance.matrix = NULL,
  distance.thresholds = NULL,
  max.spatial.predictors = NULL
)
distance.matrix |
Numeric distance matrix between observations. |
distance.thresholds |
Numeric vector of distance thresholds defining different neighborhood scales. Each threshold specifies the maximum distance for spatial neighbors at that scale. If |
max.spatial.predictors |
Integer specifying the maximum number of spatial predictors to retain. If the total number of generated predictors exceeds this value, only the first |
This function generates multi-scale spatial predictors by applying PCA to distance matrices at different neighborhood scales. The process for each distance threshold:
Converts the distance matrix to weights using weights_from_distance_matrix(), where distances above the threshold are set to zero
Applies pca() to the weighted distance matrix to extract principal components
Names the resulting predictors with the distance threshold for identification
Filters out predictors with all near-zero values
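The four steps above can be sketched in base R as follows. The helpers toy_weights() and toy_pca_multithreshold() are hypothetical, and the inverse-distance weighting is an assumption based on the description above. The actual behavior of weights_from_distance_matrix() and pca() may differ.

```r
# Toy data (assumption): 20 random points and their Euclidean distances
set.seed(1)
xy <- data.frame(x = runif(20, 0, 100), y = runif(20, 0, 100))
d <- as.matrix(dist(xy))

# Hypothetical weighting: inverse distance, zero beyond the threshold
toy_weights <- function(d, threshold) {
  w <- 1 / d
  w[!is.finite(w)] <- 0   # diagonal, where distance is zero
  w[d > threshold] <- 0   # pairs farther apart than the threshold
  w
}

# Hypothetical multi-threshold PCA following the steps listed above
toy_pca_multithreshold <- function(d, thresholds) {
  out <- lapply(thresholds, function(th) {
    w <- toy_weights(d, th)
    pcs <- stats::prcomp(w, center = TRUE, scale. = FALSE)$x
    # Encode the threshold and the component rank in the column name
    colnames(pcs) <- paste0("spatial_predictor_", th, "_", seq_len(ncol(pcs)))
    # Drop components whose values are all near zero
    keep <- apply(abs(pcs), 2, max) > 1e-8
    pcs[, keep, drop = FALSE]
  })
  as.data.frame(do.call(cbind, out))
}

predictors <- toy_pca_multithreshold(d, thresholds = c(25, 50))
dim(predictors)
colnames(predictors)[1:3]
```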
Multi-scale spatial modeling:
Different distance thresholds capture spatial patterns at different scales. Combining predictors from multiple thresholds allows rf_spatial() to account for spatial autocorrelation operating at multiple spatial scales simultaneously. This is analogous to mem_multithreshold() but uses PCA instead of Moran's Eigenvector Maps.
Comparison with MEMs:
Both pca_multithreshold() and mem_multithreshold() generate spatial predictors from distance matrices, but differ in their approach:
PCA: Captures the main patterns of variation in the weighted distance matrix without considering spatial autocorrelation structure
MEMs: Explicitly extracts spatial patterns with specific autocorrelation scales (positive and negative eigenvalues)
In practice, MEMs are generally preferred for spatial modeling because they explicitly target spatial autocorrelation patterns, but PCA can serve as a simpler alternative or for comparison.
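To make the contrast concrete, the following base R sketch derives MEM-like predictors as eigenvectors of a double-centered weights matrix and, for comparison, PCA factors from the same weights. The toy data and the inverse-distance weights are assumptions. It illustrates the general technique only, not the code used by mem() or pca().

```r
# Toy data (assumption): 12 random points and their Euclidean distances
set.seed(7)
xy <- data.frame(x = runif(12), y = runif(12))
d <- as.matrix(dist(xy))

# Symmetric inverse-distance weights with a zero diagonal (assumption)
w <- 1 / d
diag(w) <- 0

# Double-center the weights and extract eigenvectors
n <- nrow(w)
centering <- diag(n) - matrix(1 / n, n, n)
eig <- eigen(centering %*% w %*% centering, symmetric = TRUE)

# Eigenvectors with positive eigenvalues describe positive autocorrelation
positive <- eig$values > 1e-8
mem_like <- eig$vectors[, positive, drop = FALSE]
ncol(mem_like)

# PCA of the same weights, which ignores the sign of the autocorrelation
pca_like <- stats::prcomp(w)$x
ncol(pca_like)
```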
Data frame where each column is a spatial predictor derived from PCA at a specific distance threshold. Columns are named with the pattern spatial_predictor_<distance>_<number> (e.g., "spatial_predictor_1000_1", "spatial_predictor_5000_2"), where <distance> is the distance threshold and <number> is the principal component rank. The number of rows matches the number of observations in distance.matrix.
pca(), mem_multithreshold(), weights_from_distance_matrix(), default_distance_thresholds()
Other spatial_analysis:
filter_spatial_predictors(),
mem(),
mem_multithreshold(),
moran(),
moran_multithreshold(),
pca(),
rank_spatial_predictors(),
residuals_diagnostics(),
residuals_test(),
select_spatial_predictors_recursive(),
select_spatial_predictors_sequential()
data(plants_distance)

# Compute PCA spatial predictors at multiple distance thresholds
pca_predictors <- pca_multithreshold(
  distance.matrix = plants_distance,
  distance.thresholds = c(0, 1000, 5000)
)

# View structure
head(pca_predictors)
dim(pca_predictors)

# Check predictor names (show scale information)
colnames(pca_predictors)[1:6]

# Limit number of predictors to save memory
pca_limited <- pca_multithreshold(
  distance.matrix = plants_distance,
  distance.thresholds = c(0, 1000, 5000),
  max.spatial.predictors = 20
)
ncol(pca_limited) # At most 20 predictors
Vascular plant species richness for American ecoregions as defined in Ecoregions 2017.
data(plants_df)
A data frame with 227 rows and 22 columns:
ecoregion_id: Ecoregion identifier.
x: Longitude in degrees (WGS84).
y: Latitude in degrees (WGS84).
richness_species_vascular: Number of vascular plant species (response variable).
bias_area_km2: Ecoregion area in square kilometers.
bias_species_per_record: Species count divided by GBIF spatial records (sampling bias metric).
climate_aridity_index_average: Average aridity index.
climate_hypervolume: Climatic envelope volume computed with hypervolume.
climate_velocity_lgm_average: Average climate velocity since the Last Glacial Maximum.
neighbors_count: Number of immediate neighbors (connectivity metric).
neighbors_percent_shared_edge: Percentage of shared edge with neighbors (connectivity metric).
human_population_density: Human population density.
topography_elevation_average: Average elevation.
landcover_herbs_percent_average: Average herb cover from MODIS Vegetation Continuous Fields.
fragmentation_cohesion: Cohesion index computed with landscapemetrics.
fragmentation_division: Division index computed with landscapemetrics.
neighbors_area: Total area of immediate neighbors.
human_population: Total human population.
human_footprint_average: Average human footprint index.
climate_bio1_average: Average mean annual temperature.
climate_bio15_minimum: Minimum precipitation seasonality.
Other data:
plants_distance,
plants_predictors,
plants_response,
plants_rf,
plants_rf_spatial,
plants_xy
Distance matrix (in km) between the edges of American ecoregions in plants_df.
data(plants_distance)
Numeric matrix with 227 rows and 227 columns.
Other data:
plants_df,
plants_predictors,
plants_response,
plants_rf,
plants_rf_spatial,
plants_xy
Character vector of predictor variable names from plants_df (columns 5 to 21).
data(plants_predictors)
A character vector of length 17.
Other data:
plants_df,
plants_distance,
plants_response,
plants_rf,
plants_rf_spatial,
plants_xy
Character string containing the name of the response variable in plants_df: "richness_species_vascular".
data(plants_response)
A character string of length 1.
Other data:
plants_df,
plants_distance,
plants_predictors,
plants_rf,
plants_rf_spatial,
plants_xy
Fitted random forest model using plants_df. Provided for testing and examples without requiring model fitting. Fitted with reduced complexity for faster computation and smaller object size.
data(plants_rf)
An object of class rf fitted with the following parameters:
data: plants_df
dependent.variable.name: plants_response ("richness_species_vascular")
predictor.variable.names: plants_predictors (17 variables)
distance.matrix: plants_distance
xy: plants_xy
distance.thresholds: c(100, 1000, 2000, 4000)
num.trees: 50
min.node.size: 30
n.cores: 1
This model uses reduced complexity (50 trees, min.node.size = 30) to keep object size small for package distribution. For actual analyses, use higher values (e.g., num.trees = 500, min.node.size = 5).
rf(), plants_df, plants_response, plants_predictors
Other data:
plants_df,
plants_distance,
plants_predictors,
plants_response,
plants_rf_spatial,
plants_xy
Fitted spatial random forest model using plants_df with spatial predictors from Moran's Eigenvector Maps. Provided for testing and examples without requiring model fitting. Fitted with reduced complexity for faster computation and smaller object size.
data(plants_rf_spatial)
An object of class rf fitted with the following parameters:
data: plants_df
dependent.variable.name: plants_response ("richness_species_vascular")
predictor.variable.names: plants_predictors (17 variables)
distance.matrix: plants_distance
xy: plants_xy
distance.thresholds: c(100, 1000, 2000, 4000)
method: "mem.effect.recursive"
num.trees: 50
min.node.size: 30
n.cores: 14
This spatial model includes spatial predictors (Moran's Eigenvector Maps) selected using the recursive method to minimize residual spatial autocorrelation. Uses reduced complexity (50 trees, min.node.size = 30) to keep object size small for package distribution. For actual analyses, use higher values (e.g., num.trees = 500, min.node.size = 5).
rf_spatial(), rf(), plants_rf, plants_df, plants_response, plants_predictors
Other data:
plants_df,
plants_distance,
plants_predictors,
plants_response,
plants_rf,
plants_xy
Spatial coordinates (longitude and latitude) extracted from plants_df for use in spatial modeling functions.
data(plants_xy)
A data frame with 227 rows and 2 columns:
x: Longitude in degrees (WGS84).
y: Latitude in degrees (WGS84).
Other data:
plants_df,
plants_distance,
plants_predictors,
plants_response,
plants_rf,
plants_rf_spatial
Creates boxplots comparing model performance metrics across training, testing, and full datasets from spatial cross-validation performed by rf_evaluate(). Displays distributions of R-squared, RMSE, and other metrics across all spatial folds.
plot_evaluation(
  model,
  fill.color = viridis::viridis(3, option = "F", alpha = 0.8, direction = -1),
  line.color = "gray30",
  verbose = TRUE,
  notch = TRUE
)
model |
Model fitted with |
fill.color |
Character vector with three colors (one for each model type: Testing, Training, Full) or a function that generates a color palette. Accepts hexadecimal codes (e.g., |
line.color |
Character string specifying the color of boxplot borders. Default: |
verbose |
Logical. If |
notch |
Logical. If |
This function visualizes the distribution of performance metrics across spatial folds, with separate boxplots for three model variants:
Testing: Performance on spatially independent testing folds (most reliable estimate of generalization)
Training: Performance on training folds (typically optimistic)
Full: Performance on the complete dataset (reference baseline)
Interpreting the plot:
The boxplots show the distribution of each metric across all spatial folds. Ideally:
Testing performance should be reasonably close to training performance (indicates good generalization)
Large gaps between training and testing suggest overfitting
Low variance across folds indicates stable, consistent model performance
High variance suggests performance depends strongly on spatial location
The plot includes a title showing the number of spatial folds used in the evaluation.
Available metrics:
Displayed metrics depend on the response variable type:
Continuous response: R-squared, RMSE (Root Mean Squared Error), NRMSE (Normalized RMSE)
Binary response: AUC (Area Under ROC Curve), pseudo R-squared
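For reference, the metrics listed above can be computed by hand as in the base R sketch below. The toy vectors, the use of the squared Pearson correlation as R-squared, the range-based normalization of RMSE, and the helper toy_auc() are assumptions for illustration. The package's own functions may use different conventions.

```r
# Toy observed and predicted values (assumption)
observed <- c(1, 2, 3, 4)
predicted <- c(1, 2, 3, 6)

# R-squared taken here as the squared Pearson correlation (assumption)
r_squared <- cor(observed, predicted)^2

# Root mean squared error
rmse <- sqrt(mean((observed - predicted)^2))

# RMSE normalized by the range of the observations (one common convention)
nrmse <- rmse / diff(range(observed))

# AUC for a binary response via the rank-sum (Mann-Whitney) formulation
toy_auc <- function(o, p) {
  r <- rank(p)
  n1 <- sum(o == 1)
  n0 <- sum(o == 0)
  (sum(r[o == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
binary_observed <- c(0, 0, 1, 1)
binary_predicted <- c(0.1, 0.4, 0.35, 0.8)
auc_value <- toy_auc(binary_observed, binary_predicted)

c(r_squared = r_squared, rmse = rmse, nrmse = nrmse, auc = auc_value)
```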
ggplot object that can be further customized or saved. The plot displays boxplots of performance metrics (R-squared, RMSE, NRMSE, pseudo R-squared, or AUC depending on model type) across spatial folds, faceted by metric.
rf_evaluate(), get_evaluation(), print_evaluation()
Other visualization:
plot_importance(),
plot_moran(),
plot_optimization(),
plot_residuals_diagnostics(),
plot_response_curves(),
plot_response_surface(),
plot_training_df(),
plot_training_df_moran(),
plot_tuning()
if(interactive()){

  data(plants_rf, plants_xy)

  # Perform spatial cross-validation
  plants_rf <- rf_evaluate(
    model = plants_rf,
    xy = plants_xy,
    repetitions = 5,
    n.cores = 1
  )

  # Visualize evaluation results
  plot_evaluation(plants_rf)

  # Without notches for simpler boxplots
  plot_evaluation(plants_rf, notch = FALSE)

  # Custom colors
  plot_evaluation(
    plants_rf,
    fill.color = c("#E64B35FF", "#4DBBD5FF", "#00A087FF")
  )

  # Print summary statistics
  print_evaluation(plants_rf)

  # Extract evaluation data for custom analysis
  evaluation_data <- get_evaluation(plants_rf)
  head(evaluation_data)

}
Creates a visualization of variable importance scores from models fitted with rf(), rf_repeat(), or rf_spatial(). For single-run models (rf(), rf_spatial()), displays points ordered by importance. For repeated models (rf_repeat()), displays violin plots showing the distribution of importance scores across model repetitions.
plot_importance(
  model,
  fill.color = viridis::viridis(100, option = "F", direction = -1, alpha = 1, end = 0.9),
  line.color = "white",
  verbose = TRUE
)
model |
Model fitted with |
fill.color |
Character vector of colors or a function generating a color palette. Accepts hexadecimal codes (e.g., |
line.color |
Character string specifying the color of point borders (single-run models) or violin plot outlines (repeated models). Default: |
verbose |
Logical. If |
This function creates different visualizations depending on the model type:
Single-run models (rf(), rf_spatial() without repetitions):
Displays points showing the importance value for each variable
Variables ordered top-to-bottom by importance (most important at top)
Point color represents importance magnitude using a continuous gradient
Repeated models (rf_repeat(), rf_spatial() with repetitions):
Displays violin plots showing the distribution of importance across repetitions
Variables ordered top-to-bottom by median importance (most important at top)
The median line within each violin shows the center of the distribution
Width of violin reflects the density of importance values at each level
Each variable receives a distinct fill color
Importance metric:
The x-axis shows permutation importance, which measures the increase in prediction error when a variable's values are randomly shuffled. Higher values indicate more important variables. Importance is computed on out-of-bag (OOB) samples, providing an unbiased estimate of variable contribution.
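The idea behind permutation importance can be illustrated with a small base R sketch. It uses a linear model and in-sample error instead of a Random Forest with out-of-bag samples, so it is only a simplified analogue of what the paragraph above describes. The toy data and object names are assumptions.

```r
# Toy data (assumption): only x1 drives the response, x2 is noise
set.seed(123)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 3 * toy$x1 + rnorm(n, sd = 0.5)

model <- lm(y ~ x1 + x2, data = toy)
rmse <- function(o, p) sqrt(mean((o - p)^2))
baseline <- rmse(toy$y, predict(model, toy))

# Increase in error after shuffling each predictor in turn
permutation_importance <- sapply(c("x1", "x2"), function(v) {
  shuffled <- toy
  shuffled[[v]] <- sample(shuffled[[v]])
  rmse(toy$y, predict(model, shuffled)) - baseline
})
permutation_importance
```

Shuffling x1 breaks its relationship with the response and inflates the error, while shuffling x2 changes the error very little.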
Spatial predictors:
In rf_spatial() models, all spatial predictors (MEMs or PCA factors) are grouped into a single category labeled "spatial_predictors" to simplify comparison with non-spatial predictors.
Note on violin plots:
Violin plots display kernel density estimates. The median line shown is the median of the density estimate, which may differ slightly from the actual data median. However, variables are always ordered by the true median importance to ensure accurate ranking.
Cross-validated importance:
This function does not plot results from rf_importance(). For cross-validated importance plots, access model$importance$cv.per.variable.plot after running rf_importance().
ggplot object that can be further customized or saved. The plot displays variable importance on the x-axis and variable names on the y-axis, ordered by importance (highest at top).
print_importance(), get_importance(), rf_importance()
Other visualization:
plot_evaluation(),
plot_moran(),
plot_optimization(),
plot_residuals_diagnostics(),
plot_response_curves(),
plot_response_surface(),
plot_training_df(),
plot_training_df_moran(),
plot_tuning()
if(interactive()){

  data(plants_rf, plants_rf_spatial)

  # Plot importance from Random Forest model
  plot_importance(plants_rf)

  # Plot importance from Spatial Random Forest model
  plot_importance(plants_rf_spatial)

}
Plots the results of spatial autocorrelation tests for a variety of functions within the package. The x axis represents the Moran's I estimate, the y axis contains the values of the distance thresholds, the dot sizes represent the p-values of the Moran's I estimate, and the red dashed line represents the theoretical null value of the Moran's I estimate.
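As background for the quantities shown in the plot, the base R sketch below computes Moran's I for a toy variable and the theoretical null value, -1 / (n - 1), that the dashed line represents. The helper toy_moran(), the toy data, and the inverse-distance weights are assumptions for illustration and do not reproduce the internals of moran().

```r
# Hypothetical helper: Moran's I for values x and a weights matrix w
toy_moran <- function(x, w) {
  n <- length(x)
  z <- x - mean(x)
  (n / sum(w)) * sum(w * outer(z, z)) / sum(z^2)
}

# Ten points on a line with a smooth gradient (assumption)
position <- 1:10
values <- 1:10
d <- as.matrix(dist(position))

# Inverse-distance weights with a zero diagonal (assumption)
w <- 1 / d
diag(w) <- 0

observed_i <- toy_moran(values, w)
null_i <- -1 / (length(values) - 1)  # expected value without autocorrelation
c(observed = observed_i, null = null_i)
```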
plot_moran(
  model,
  point.color = viridis::viridis(100, option = "F", direction = -1),
  line.color = "gray30",
  option = 1,
  ncol = 1,
  verbose = TRUE
)
model |
A model fitted with |
point.color |
Colors of the plotted points. Can be a single color name (e.g. "red4"), a character vector with hexadecimal codes (e.g. "#440154FF" "#21908CFF" "#FDE725FF"), or function generating a palette (e.g. |
line.color |
Character string, color of the line produced by |
option |
Integer, type of plot. If |
ncol |
Number of columns of the plot. Only relevant when |
verbose |
Logical, if |
A ggplot.
moran(), moran_multithreshold()
Other visualization:
plot_evaluation(),
plot_importance(),
plot_optimization(),
plot_residuals_diagnostics(),
plot_response_curves(),
plot_response_surface(),
plot_training_df(),
plot_training_df_moran(),
plot_tuning()
data(plants_rf)

plot_moran(plants_rf)
plot_moran(plants_rf, option = 2)
Plots optimization data frames produced by select_spatial_predictors_sequential() and select_spatial_predictors_recursive().
plot_optimization(
  model,
  point.color = viridis::viridis(100, option = "F", direction = -1),
  verbose = TRUE
)
model |
A model produced by |
point.color |
Colors of the plotted points. Can be a single color name (e.g. "red4"), a character vector with hexadecimal codes (e.g. "#440154FF" "#21908CFF" "#FDE725FF"), or function generating a palette (e.g. |
verbose |
Logical, if |
The function returns NULL invisibly (without plotting) when:
The method used to fit a model with rf_spatial() is "hengl" (no optimization required)
No spatial predictors were selected during model fitting
The model is non-spatial
A ggplot, or NULL invisibly if no optimization data is available.
Other visualization:
plot_evaluation(),
plot_importance(),
plot_moran(),
plot_residuals_diagnostics(),
plot_response_curves(),
plot_response_surface(),
plot_training_df(),
plot_training_df_moran(),
plot_tuning()
data(plants_rf_spatial)

plot_optimization(plants_rf_spatial)
Plots normality and autocorrelation tests of model residuals.
plot_residuals_diagnostics(
  model,
  point.color = viridis::viridis(100, option = "F"),
  line.color = "gray10",
  fill.color = viridis::viridis(4, option = "F", alpha = 0.95)[2],
  option = 1,
  ncol = 1,
  verbose = TRUE
)
model |
A model produced by |
point.color |
Colors of the plotted points. Can be a single color name (e.g. "red4"), a character vector with hexadecimal codes (e.g. "#440154FF" "#21908CFF" "#FDE725FF"), or function generating a palette (e.g. |
line.color |
Character string, color of the line produced by |
fill.color |
Character string, fill color of the bars produced by |
option |
(argument of |
ncol |
(argument of |
verbose |
Logical, if |
A patchwork object.
Other visualization:
plot_evaluation(),
plot_importance(),
plot_moran(),
plot_optimization(),
plot_response_curves(),
plot_response_surface(),
plot_training_df(),
plot_training_df_moran(),
plot_tuning()
data(plants_rf)

plot_residuals_diagnostics(plants_rf)
Plots the response curves of models fitted with rf(), rf_repeat(), or rf_spatial().
plot_response_curves(
  model = NULL,
  variables = NULL,
  quantiles = c(0.1, 0.5, 0.9),
  grid.resolution = 200,
  line.color = viridis::viridis(length(quantiles), option = "F", end = 0.9),
  ncol = 2,
  show.data = FALSE,
  verbose = TRUE
)
model |
A model fitted with |
variables |
Character vector, names of predictors to plot. If |
quantiles |
Numeric vector with values between 0 and 1, argument |
grid.resolution |
Integer between 20 and 500. Resolution of the plotted curve. Default: |
line.color |
Character vector with colors, or function to generate colors for the lines representing |
ncol |
Integer, argument of wrap_plots. Defaults to the rounded square root of the number of plots. Default: |
show.data |
Logical, if |
verbose |
Logical, if TRUE the plot is printed. Default: |
All variables that are not plotted in a particular response curve are set to the values of their respective quantiles, and the response curve for each one of these quantiles is shown in the plot. When the input model was fitted with rf_repeat() with keep.models = TRUE, then the plot shows the median of all model runs, and each model run separately as a thinner line. The output list can be plotted all at once with patchwork::wrap_plots(p) or cowplot::plot_grid(plotlist = p), or one by one by extracting each plot from the list.
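The quantile-based logic described above can be sketched in base R: the variable of interest varies along a grid while the other predictor is held at each requested quantile, giving one curve per quantile. The toy data, the linear model with an interaction, and the object names are assumptions. The sketch does not reproduce the internals of plot_response_curves().

```r
# Toy data (assumption): the effect of a depends on b
set.seed(1)
toy <- data.frame(a = runif(100), b = runif(100))
toy$y <- toy$a * (1 + 2 * toy$b) + rnorm(100, sd = 0.1)
model <- lm(y ~ a * b, data = toy)

quantiles <- c(0.1, 0.5, 0.9)
grid.resolution <- 20

# One prediction grid per quantile: a varies, b is fixed at its quantile
curves <- do.call(rbind, lapply(quantiles, function(q) {
  grid <- data.frame(
    a = seq(min(toy$a), max(toy$a), length.out = grid.resolution),
    b = unname(quantile(toy$b, probs = q)),
    quantile = q
  )
  grid$prediction <- predict(model, newdata = grid)
  grid
}))
head(curves)
```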
A list with slots named after the selected variables, with one ggplot each.
Other visualization:
plot_evaluation(),
plot_importance(),
plot_moran(),
plot_optimization(),
plot_residuals_diagnostics(),
plot_response_surface(),
plot_training_df(),
plot_training_df_moran(),
plot_tuning()
data(plants_rf)

plot_response_curves(
  model = plants_rf,
  variables = "climate_bio1_average"
)

plot_response_curves(
  model = plants_rf,
  variables = "climate_bio1_average",
  show.data = TRUE
)
Plots response surfaces for any given pair of predictors in a rf(), rf_repeat(), or rf_spatial() model.
plot_response_surface(
  model = NULL,
  a = NULL,
  b = NULL,
  quantiles = 0.5,
  grid.resolution = 100,
  point.size.range = c(0.5, 2.5),
  point.alpha = 1,
  fill.color = viridis::viridis(100, option = "F", direction = -1, alpha = 0.9),
  point.color = "gray30",
  verbose = TRUE
)
model |
A model fitted with |
a |
Character string, name of a model predictor. If |
b |
Character string, name of a model predictor. If |
quantiles |
Numeric vector between 0 and 1. Argument |
grid.resolution |
Integer between 20 and 500. Resolution of the plotted surface. Default: |
point.size.range |
Numeric vector of length 2 with the range of point sizes used by geom_point. Using |
point.alpha |
Numeric between 0 and 1, transparency of the points. Setting it to |
fill.color |
Character vector with hexadecimal codes (e.g. "#440154FF" "#21908CFF" "#FDE725FF"), or function generating a palette (e.g. |
point.color |
Character vector with a color name (e.g. "red4"). Default: |
verbose |
Logical, if TRUE the plot is printed. Default: |
All variables other than a and b are set to the values of their respective quantiles to plot the response surfaces. The output list can be plotted all at once with patchwork::wrap_plots(p) or cowplot::plot_grid(plotlist = p), or one by one by extracting each plot from the list.
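A response surface of this kind can be sketched in base R by predicting over a regular grid of the two focal predictors while every other predictor is fixed at the chosen quantile. The toy data, the linear model, and the object names below are assumptions. The sketch does not reproduce the internals of plot_response_surface().

```r
# Toy data (assumption): three predictors and a continuous response
set.seed(2)
toy <- data.frame(a = runif(100), b = runif(100), c = runif(100))
toy$y <- toy$a + toy$b^2 + 0.5 * toy$c + rnorm(100, sd = 0.1)
model <- lm(y ~ a + I(b^2) + c, data = toy)

grid.resolution <- 25

# Regular grid over the two focal predictors
surface <- expand.grid(
  a = seq(min(toy$a), max(toy$a), length.out = grid.resolution),
  b = seq(min(toy$b), max(toy$b), length.out = grid.resolution)
)

# The remaining predictor is held at its 0.5 quantile
surface$c <- unname(quantile(toy$c, probs = 0.5))
surface$prediction <- predict(model, newdata = surface)
dim(surface)
```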
A list with slots named after the selected quantiles, each one with a ggplot.
Other visualization:
plot_evaluation(),
plot_importance(),
plot_moran(),
plot_optimization(),
plot_residuals_diagnostics(),
plot_response_curves(),
plot_training_df(),
plot_training_df_moran(),
plot_tuning()
data(plants_rf)

plot_response_surface(
  model = plants_rf,
  a = "climate_bio1_average",
  b = "human_population",
  grid.resolution = 50
)
Plots the dependent variable against each predictor.
plot_training_df(
  data = NULL,
  dependent.variable.name = NULL,
  predictor.variable.names = NULL,
  ncol = 4,
  method = "loess",
  point.color = viridis::viridis(100, option = "F"),
  line.color = "gray30"
)
data |
Data frame with a response variable and a set of predictors. Default: |
dependent.variable.name |
Character string with the name of the response variable. Must be in the column names of |
predictor.variable.names |
Character vector with the names of the predictive variables. Every element of this vector must be in the column names of |
ncol |
Number of columns of the plot. Argument |
method |
Method for geom_smooth, one of: "lm", "glm", "gam", "loess", or a function, for example |
point.color |
Colors of the plotted points. Can be a single color name (e.g. "red4"), a character vector with hexadecimal codes (e.g. "#440154FF" "#21908CFF" "#FDE725FF"), or function generating a palette (e.g. |
line.color |
Character string, color of the line produced by |
A wrap_plots object.
Other visualization:
plot_evaluation(),
plot_importance(),
plot_moran(),
plot_optimization(),
plot_residuals_diagnostics(),
plot_response_curves(),
plot_response_surface(),
plot_training_df_moran(),
plot_tuning()
data(
  plants_df,
  plants_response,
  plants_predictors
)

plot_training_df(
  data = plants_df,
  dependent.variable.name = plants_response,
  predictor.variable.names = plants_predictors[1:4]
)
Plots the Moran's I test of the response and the predictors in a training data frame.
plot_training_df_moran(
  data = NULL,
  dependent.variable.name = NULL,
  predictor.variable.names = NULL,
  distance.matrix = NULL,
  distance.thresholds = NULL,
  fill.color = viridis::viridis(100, option = "F", direction = -1),
  point.color = "gray30"
)
data |
Data frame with a response variable and a set of predictors. Default: |
dependent.variable.name |
Character string with the name of the response variable. Must be in the column names of |
predictor.variable.names |
Character vector with the names of the predictive variables. Every element of this vector must be in the column names of |
distance.matrix |
Square matrix with the distances among the records in |
distance.thresholds |
Numeric vector, distances below each value are set to 0 on separate copies of the distance matrix for the computation of Moran's I at different neighborhood distances. If |
fill.color |
Character vector with hexadecimal codes (e.g. "#440154FF" "#21908CFF" "#FDE725FF"), or function generating a palette (e.g. |
point.color |
Character vector with a color name (e.g. "red4"). Default: |
A ggplot2 object.
Other visualization:
plot_evaluation(),
plot_importance(),
plot_moran(),
plot_optimization(),
plot_residuals_diagnostics(),
plot_response_curves(),
plot_response_surface(),
plot_training_df(),
plot_tuning()
data(
  plants_df,
  plants_response,
  plants_predictors,
  plants_distance
)

plot_training_df_moran(
  data = plants_df,
  dependent.variable.name = plants_response,
  predictor.variable.names = plants_predictors[1:4],
  distance.matrix = plants_distance,
  distance.thresholds = c(1000, 2000, 4000)
)
Plots the tuning of the hyperparameters num.trees, mtry, and min.node.size performed by rf_tuning().
plot_tuning(
  model,
  point.color = viridis::viridis(100, option = "F"),
  verbose = TRUE
)
model |
A model fitted with |
point.color |
Colors of the plotted points. Can be a single color name (e.g. "red4"), a character vector with hexadecimal codes (e.g. "#440154FF" "#21908CFF" "#FDE725FF"), or function generating a palette (e.g. |
verbose |
Logical, if |
A ggplot.
Other visualization:
plot_evaluation(),
plot_importance(),
plot_moran(),
plot_optimization(),
plot_residuals_diagnostics(),
plot_response_curves(),
plot_response_surface(),
plot_training_df(),
plot_training_df_moran()
if(interactive()){

  data(
    plants_rf,
    plants_xy
  )

  plants_rf_tuned <- rf_tuning(
    model = plants_rf,
    num.trees = c(25, 50),
    mtry = c(5, 10),
    min.node.size = c(10, 20),
    xy = plants_xy,
    repetitions = 5,
    n.cores = 1
  )

  plot_tuning(plants_rf_tuned)

}
Prepares variable importance data frames and plots for models fitted with rf_spatial().
prepare_importance_spatial(model)
model |
An importance data frame with spatial predictors, or a model fitted with |
A list with importance data frames in different formats depending on whether the model was fitted with rf() or rf_repeat().
Other utilities:
.vif_to_df(),
auc(),
beowulf_cluster(),
objects_size(),
optimization_function(),
rescale_vector(),
root_mean_squared_error(),
setup_parallel_execution(),
standard_error(),
statistical_mode(),
thinning(),
thinning_til_n()
data(plants_rf_spatial)

prepare_importance_spatial(plants_rf_spatial) %>%
  head()
Prints the results of a spatial cross-validation performed with rf_evaluate().
print_evaluation(model)
model |
A model resulting from |
A table printed to the standard output.
plot_evaluation(), get_evaluation()
Other model_info:
get_evaluation(),
get_importance(),
get_importance_local(),
get_moran(),
get_performance(),
get_predictions(),
get_residuals(),
get_response_curves(),
get_spatial_predictors(),
print.rf(),
print_importance(),
print_moran(),
print_performance()
if(interactive()){

  data(
    plants_rf,
    plants_xy
  )

  plants_rf <- rf_evaluate(
    model = plants_rf,
    xy = plants_xy,
    repetitions = 5,
    n.cores = 1
  )

  print_evaluation(plants_rf)

}
Prints variable importance scores from rf, rf_repeat, and rf_spatial models.
print_importance(
  model,
  verbose = TRUE
)
model |
A model fitted with rf, rf_repeat, or rf_spatial. |
verbose |
Logical, if |
A table printed to the standard output.
plot_importance(), get_importance()
Other model_info:
get_evaluation(),
get_importance(),
get_importance_local(),
get_moran(),
get_performance(),
get_predictions(),
get_residuals(),
get_response_curves(),
get_spatial_predictors(),
print.rf(),
print_evaluation(),
print_moran(),
print_performance()
data(plants_rf)

print_importance(plants_rf)
Prints the results of a Moran's I test on the residuals of a model.
print_moran(
  model,
  caption = NULL,
  verbose = TRUE
)
model |
A model fitted with |
caption |
Character, caption of the output table, Default: |
verbose |
Logical, if |
Prints a table in the console using the huxtable package.
moran(), moran_multithreshold(), get_moran(), plot_moran()
Other model_info:
get_evaluation(),
get_importance(),
get_importance_local(),
get_moran(),
get_performance(),
get_predictions(),
get_residuals(),
get_response_curves(),
get_spatial_predictors(),
print.rf(),
print_evaluation(),
print_importance(),
print_performance()
data(plants_rf)
print_moran(plants_rf)
Prints the performance slot of a model fitted with rf(), rf_repeat(), or rf_spatial(). For models fitted with rf_repeat() it shows the median and the median absolute deviation of each performance measure.
print_performance(model)
model |
Model fitted with |
Prints model performance scores to the console.
print_performance(), get_performance()
Other model_info:
get_evaluation(),
get_importance(),
get_importance_local(),
get_moran(),
get_performance(),
get_predictions(),
get_residuals(),
get_response_curves(),
get_spatial_predictors(),
print.rf(),
print_evaluation(),
print_importance(),
print_moran()
data(plants_rf)
print_performance(plants_rf)
Custom print method for models fitted with rf(), rf_repeat(), and rf_spatial().
## S3 method for class 'rf'
print(x, ...)
x |
A model fitted with |
... |
Additional arguments for print methods. |
Prints model details to the console.
print_evaluation(), print_importance(), print_moran(), print_performance()
Other model_info:
get_evaluation(),
get_importance(),
get_importance_local(),
get_moran(),
get_performance(),
get_predictions(),
get_residuals(),
get_response_curves(),
get_spatial_predictors(),
print_evaluation(),
print_importance(),
print_moran(),
print_performance()
data(plants_rf)
print(plants_rf)
#or
plants_rf
Ranks spatial predictors generated by mem_multithreshold() or pca_multithreshold() by their effect in reducing the Moran's I of the model residuals (ranking.method = "effect"), or by their own Moran's I (ranking.method = "moran").
In the former case, one model of the type y ~ predictors + spatial_predictor_X is fitted per spatial predictor, and the Moran's I of its residuals is compared with that of the model without spatial predictors (y ~ predictors). The spatial predictors are then ranked from maximum to minimum difference in Moran's I.
In the latter case, the spatial predictors are ordered by their Moran's I alone (this is the faster option).
In both cases, two kinds of spatial predictors are removed: those that are redundant with others (Pearson correlation > 0.5), and those with no effect (no reduction of Moran's I, or a Moran's I of the spatial predictor equal to or lower than 0).
This function has been designed to be used internally by rf_spatial() rather than directly by a user.
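The "moran" ranking and the redundancy filter described above can be sketched in base R. This is only an illustration under stated assumptions: the helper morans_i(), the inverse-distance weights, the toy predictors, and the use of the absolute correlation are all made up for the example and are not the package's internal implementation.

```r
# Hypothetical helper (not part of spatialRF): Moran's I of x
# given a square matrix of spatial weights w.
morans_i <- function(x, w) {
  z <- x - mean(x)
  (length(x) / sum(w)) * sum(w * outer(z, z)) / sum(z^2)
}

set.seed(1)
n <- 30
xy <- cbind(x = runif(n), y = runif(n))

# Assumed weighting scheme: inverse distance, zero on the diagonal.
d <- as.matrix(dist(xy))
w <- 1 / d
diag(w) <- 0

# Toy spatial predictors: a smooth gradient, a near copy of it, and noise.
gradient <- xy[, "x"] + xy[, "y"]
sp <- data.frame(
  sp_gradient = gradient,
  sp_gradient_copy = gradient + rnorm(n, sd = 0.05),
  sp_noise = rnorm(n)
)

# Rank by Moran's I, dropping predictors with Moran's I <= 0.
moran_scores <- sapply(sp, morans_i, w = w)
ranking <- names(sort(moran_scores[moran_scores > 0], decreasing = TRUE))

# Keep a predictor only if its absolute Pearson correlation with
# every better-ranked predictor already kept is <= 0.5.
selected <- character(0)
for (p in ranking) {
  if (length(selected) == 0 ||
      all(abs(cor(sp[[p]], sp[, selected, drop = FALSE])) <= 0.5)) {
    selected <- c(selected, p)
  }
}
selected
```

In this toy setup the gradient and its near copy are almost perfectly correlated, so only the better-ranked of the two survives the filter.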
rank_spatial_predictors(
  data = NULL,
  dependent.variable.name = NULL,
  predictor.variable.names = NULL,
  distance.matrix = NULL,
  distance.thresholds = NULL,
  ranger.arguments = NULL,
  spatial.predictors.df = NULL,
  ranking.method = c("moran", "effect"),
  reference.moran.i = 1,
  verbose = FALSE,
  n.cores = parallel::detectCores() - 1,
  cluster = NULL
)
data |
Data frame with a response variable and a set of predictors. Default: |
dependent.variable.name |
Character string with the name of the response variable. Must be in the column names of |
predictor.variable.names |
Character vector with the names of the predictive variables. Every element of this vector must be in the column names of |
distance.matrix |
Squared matrix with the distances among the records in |
distance.thresholds |
Numeric vector with neighborhood distances. All distances in the distance matrix below each value in |
ranger.arguments |
List with ranger arguments. See rf or rf_repeat for further details. |
spatial.predictors.df |
Data frame of spatial predictors. |
ranking.method |
Character, method used to rank spatial predictors. The method "effect" ranks spatial predictors according to how much each predictor reduces Moran's I of the model residuals, while the method "moran" ranks them by their own Moran's I. Default: |
reference.moran.i |
Moran's I of the residuals of the model without spatial predictors. Default: |
verbose |
Logical, if |
n.cores |
Integer, number of cores to use for parallel execution. Creates a socket cluster with |
cluster |
A cluster definition generated with |
A list with four slots:
method: Character, name of the method used to rank the spatial predictors.
criteria: Data frame with two different configurations depending on the ranking method. If ranking.method = "effect", the columns contain the names of the spatial predictors, the r-squared of the model, the Moran's I of the model residuals, the difference between the Moran's I of the model including the given spatial predictor and the Moran's I of the model fitted without spatial predictors, and the interpretation of the Moran's I value. If ranking.method = "moran", only the name of the spatial predictor and its Moran's I are in the output data frame.
ranking: Ordered character vector with the names of the spatial predictors selected.
spatial.predictors.df: data frame with the selected spatial predictors in the order of the ranking.
Other spatial_analysis:
filter_spatial_predictors(),
mem(),
mem_multithreshold(),
moran(),
moran_multithreshold(),
pca(),
pca_multithreshold(),
residuals_diagnostics(),
residuals_test(),
select_spatial_predictors_recursive(),
select_spatial_predictors_sequential()
if(interactive()){
  data(plants_df, plants_response, plants_distance)
  #subset to speed up example
  idx <- 50:90
  plants_distance_sub <- plants_distance[idx, idx]
  y <- mem(
    distance.matrix = plants_distance_sub,
    distance.threshold = 1000
  )
  #rank spatial predictors by Moran's I
  y_rank <- rank_spatial_predictors(
    distance.matrix = plants_distance_sub,
    distance.thresholds = 1000,
    spatial.predictors.df = y,
    ranking.method = "moran",
    n.cores = 1
  )
  y_rank$criteria
  y_rank$ranking
  #rank spatial predictors by association with response
  y_rank <- rank_spatial_predictors(
    data = plants_df[idx, ],
    dependent.variable.name = plants_response,
    distance.matrix = plants_distance_sub,
    distance.thresholds = 1000,
    spatial.predictors.df = y,
    ranking.method = "effect",
    n.cores = 1
  )
  y_rank$criteria
  y_rank$ranking
}
Rescales a numeric vector to a new range.
rescale_vector(x = NULL, new.min = 0, new.max = 1, integer = FALSE)
x |
Numeric vector. Default: |
new.min |
New minimum value. Default: |
new.max |
New maximum value. Default: |
integer |
Logical, if |
A numeric vector of the same length as x, but with its values rescaled between new.min and new.max.
Other utilities:
.vif_to_df(),
auc(),
beowulf_cluster(),
objects_size(),
optimization_function(),
prepare_importance_spatial(),
root_mean_squared_error(),
setup_parallel_execution(),
standard_error(),
statistical_mode(),
thinning(),
thinning_til_n()
y <- rescale_vector(
  x = rnorm(100),
  new.min = 0,
  new.max = 100,
  integer = TRUE
)
y
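For readers who want to see the arithmetic behind this kind of rescaling, the following base R sketch applies a standard min-max transformation. The helper rescale_sketch() is hypothetical, and rounding to obtain integers is an assumption; the package's own function may differ in both respects.

```r
# Hypothetical helper illustrating min-max rescaling; not the package's code.
rescale_sketch <- function(x, new.min = 0, new.max = 1, integer = FALSE) {
  old.min <- min(x)
  old.max <- max(x)
  # Map [old.min, old.max] linearly onto [new.min, new.max].
  y <- new.min + (x - old.min) * (new.max - new.min) / (old.max - old.min)
  if (integer) {
    y <- round(y)  # assumption: integers obtained by rounding
  }
  y
}

set.seed(1)
y <- rescale_sketch(rnorm(100), new.min = 0, new.max = 100)
range(y)
```

The output has the same length as the input, and its extremes match new.min and new.max up to floating-point precision.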
Applies a Shapiro-Wilk test to a numeric vector, and plots the qq plot and the histogram.
residuals_diagnostics(residuals, predictions)
residuals |
Numeric vector, model residuals. |
predictions |
Numeric vector, model predictions. |
The function shapiro.test() has a hard limit of 5000 cases. If the model residuals have more than 5000 cases, then sample(x = residuals, size = 5000) is applied to the model residuals before the test.
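The subsampling step described above can be sketched in base R as follows. The 0.05 significance threshold used for the interpretation string and the simulated residuals are assumptions made for the example.

```r
set.seed(1)
residuals <- rnorm(6000)  # simulated residuals, more than 5000 cases

# shapiro.test() accepts at most 5000 cases, so subsample when needed.
if (length(residuals) > 5000) {
  residuals <- sample(x = residuals, size = 5000)
}

test <- shapiro.test(residuals)

result <- list(
  w = unname(test$statistic),
  p.value = test$p.value,
  # Assumption: normality is rejected when the p-value is <= 0.05.
  interpretation = ifelse(
    test$p.value > 0.05,
    "x is normal",
    "x is not normal"
  )
)
result
```

Because the subsample is random, the W statistic and p-value can change slightly between runs unless a seed is set.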
A list with four slots:
w: W statistic returned by shapiro.test().
p.value: p-value of the Shapiro test.
interpretation: Character vector, one of "x is normal", "x is not normal".
plot: A patchwork plot with the qq plot and the histogram of x.
ggplot,aes,geom_qq_line,ggtheme,labs,geom_freqpoly,geom_abline
plot_annotation
Other spatial_analysis:
filter_spatial_predictors(),
mem(),
mem_multithreshold(),
moran(),
moran_multithreshold(),
pca(),
pca_multithreshold(),
rank_spatial_predictors(),
residuals_test(),
select_spatial_predictors_recursive(),
select_spatial_predictors_sequential()
data(plants_rf)
y <- residuals_diagnostics(
  residuals = get_residuals(plants_rf),
  predictions = get_predictions(plants_rf)
)
y
Applies a Shapiro-Wilk test to a numeric vector, and returns a list with the statistic W, its p-value, and a character string with the interpretation.
residuals_test(residuals)
residuals |
Numeric vector, model residuals. |
A list with three slots:
w: W statistic returned by shapiro.test().
p.value: p-value of the Shapiro test.
interpretation: Character vector, one of "x is normal", "x is not normal".
Other spatial_analysis:
filter_spatial_predictors(),
mem(),
mem_multithreshold(),
moran(),
moran_multithreshold(),
pca(),
pca_multithreshold(),
rank_spatial_predictors(),
residuals_diagnostics(),
select_spatial_predictors_recursive(),
select_spatial_predictors_sequential()
residuals_test(residuals = runif(100))
Fits a random forest model using ranger and extends it with spatial diagnostics: residual autocorrelation (Moran's I) at multiple distance thresholds, performance metrics (RMSE, NRMSE via root_mean_squared_error()), and variable importance scores computed on scaled data (via scale).
rf(
  data = NULL,
  dependent.variable.name = NULL,
  predictor.variable.names = NULL,
  distance.matrix = NULL,
  distance.thresholds = NULL,
  xy = NULL,
  ranger.arguments = NULL,
  scaled.importance = FALSE,
  seed = 1,
  verbose = TRUE,
  n.cores = parallel::detectCores() - 1,
  cluster = NULL
)
data |
Data frame with a response variable and a set of predictors. Default: |
dependent.variable.name |
Character string with the name of the response variable. Must be a column name in |
predictor.variable.names |
Character vector with predictor variable names. All names must be columns in |
distance.matrix |
Square matrix with pairwise distances between observations in |
distance.thresholds |
Numeric vector of distance thresholds for spatial autocorrelation analysis. For each threshold, distances below that value are set to zero when computing Moran's I. If |
xy |
Data frame or matrix with two columns containing coordinates, named "x" and "y". Not used by this function but stored in the model for use by |
ranger.arguments |
Named list with ranger arguments. Arguments for this function can also be passed here. The default importance method is 'permutation' instead of ranger's default 'none'. The |
scaled.importance |
If |
seed |
Random seed for reproducibility. Default: |
verbose |
If |
n.cores |
Number of cores for parallel execution. Default: |
cluster |
Cluster object from |
See ranger documentation for additional details. The formula interface is supported via ranger.arguments, but variable interactions are not permitted. For feature engineering including interactions, see the_feature_engineer().
A ranger model object with additional slots:
ranger.arguments: Arguments used to fit the model.
importance: List with global importance data frame (predictors ranked by importance), importance plot, and local importance scores (per-observation difference in accuracy between permuted and non-permuted predictors, based on out-of-bag data).
performance: Model performance metrics including R-squared (out-of-bag and standard), pseudo R-squared, RMSE, and NRMSE.
residuals: Model residuals with normality diagnostics (residuals_diagnostics()) and spatial autocorrelation (moran_multithreshold()).
Other main_models:
rf_spatial()
data(plants_df, plants_response, plants_predictors, plants_distance)
m <- rf(
  data = plants_df,
  dependent.variable.name = plants_response,
  predictor.variable.names = plants_predictors,
  distance.matrix = plants_distance,
  distance.thresholds = c(100, 1000, 2000),
  ranger.arguments = list(
    num.trees = 50,
    min.node.size = 20
  ),
  verbose = FALSE,
  n.cores = 1
)
class(m)
#variable importance
m$importance$per.variable
m$importance$per.variable.plot
#model performance
m$performance
#autocorrelation of residuals
m$residuals$autocorrelation$per.distance
m$residuals$autocorrelation$plot
#model predictions
m$predictions$values
#predictions for new data (using stats::predict)
y <- stats::predict(
  object = m,
  data = plants_df[1:5, ],
  type = "response"
)$predictions
#alternative: pass arguments via ranger.arguments list
args <- list(
  data = plants_df,
  dependent.variable.name = plants_response,
  predictor.variable.names = plants_predictors,
  distance.matrix = plants_distance,
  distance.thresholds = c(100, 1000, 2000),
  num.trees = 50,
  min.node.size = 20,
  num.threads = 1
)
m <- rf(
  ranger.arguments = args,
  verbose = FALSE
)
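The performance slot described for rf() reports metrics such as RMSE, NRMSE, and pseudo R-squared. The base R sketch below shows how such values can be computed from observed and predicted vectors. The simulated data are made up, and normalizing RMSE by the interquartile range is an assumption chosen for illustration; the package may use a different normalization.

```r
set.seed(1)
observed <- rnorm(100, mean = 50, sd = 10)
predicted <- observed + rnorm(100, sd = 5)  # simulated model predictions

# Root mean squared error.
rmse <- sqrt(mean((observed - predicted)^2))

# Normalized RMSE. Assumption: normalized by the interquartile
# range of the observations.
nrmse <- rmse / IQR(observed)

# Pseudo R-squared as the correlation between observations and predictions.
pseudo.r.squared <- cor(observed, predicted)

c(rmse = rmse, nrmse = nrmse, pseudo.r.squared = pseudo.r.squared)
```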
Uses rf_evaluate() to compare the performance of several models on independent spatial folds via spatial cross-validation.
rf_compare(
  models = NULL,
  xy = NULL,
  repetitions = 30,
  training.fraction = 0.75,
  metrics = c("r.squared", "pseudo.r.squared", "rmse", "nrmse", "auc"),
  distance.step = NULL,
  distance.step.x = NULL,
  distance.step.y = NULL,
  fill.color = viridis::viridis(100, option = "F", direction = -1, alpha = 0.8),
  line.color = "gray30",
  seed = 1,
  verbose = TRUE,
  n.cores = parallel::detectCores() - 1,
  cluster = NULL
)
models |
Named list with models resulting from |
xy |
Data frame or matrix with two columns containing coordinates and named "x" and "y". Default: |
repetitions |
Integer, number of spatial folds to use during cross-validation. Must be lower than the total number of rows available in the model's data. Default: |
training.fraction |
Proportion between 0.5 and 0.9 indicating the proportion of records to be used as training set during spatial cross-validation. Default: |
metrics |
Character vector, names of the performance metrics selected. The possible values are: "r.squared" ( |
distance.step |
Numeric, argument |
distance.step.x |
Numeric, argument |
distance.step.y |
Numeric, argument |
fill.color |
Character vector with hexadecimal codes (e.g. "#440154FF" "#21908CFF" "#FDE725FF"), or function generating a palette (e.g. |
line.color |
Character string, color of the line produced by |
seed |
Integer, random seed to facilitate reproducibility. If set to a given number, the results of the function are always the same. Default: |
verbose |
Logical. If |
n.cores |
Integer, number of cores to use for parallel execution. Creates a socket cluster with |
cluster |
A cluster definition generated with |
A list with three slots:
comparison.df: Data frame with one performance value per spatial fold, metric, and model.
spatial.folds: List with the indices of the training and testing records for each evaluation repetition.
plot: Violin-plot of comparison.df.
Other model_workflow:
rf_evaluate(),
rf_importance(),
rf_repeat(),
rf_tuning()
if(interactive()){
  data(plants_rf, plants_rf_spatial, plants_xy)
  comparison <- rf_compare(
    models = list(
      `Non spatial` = plants_rf,
      Spatial = plants_rf_spatial
    ),
    repetitions = 5,
    xy = plants_xy,
    metrics = "rmse",
    n.cores = 1
  )
}
Evaluates the performance of random forest on unseen data over independent spatial folds.
rf_evaluate(
  model = NULL,
  xy = NULL,
  repetitions = 30,
  training.fraction = 0.75,
  metrics = c("r.squared", "pseudo.r.squared", "rmse", "nrmse", "auc"),
  distance.step = NULL,
  distance.step.x = NULL,
  distance.step.y = NULL,
  grow.testing.folds = FALSE,
  seed = 1,
  verbose = TRUE,
  n.cores = parallel::detectCores() - 1,
  cluster = NULL
)
model |
Model fitted with |
xy |
Data frame or matrix with two columns containing coordinates and named "x" and "y". If |
repetitions |
Integer, number of spatial folds to use during cross-validation. Must be lower than the total number of rows available in the model's data. Default: |
training.fraction |
Proportion between 0.5 and 0.9 indicating the proportion of records to be used as training set during spatial cross-validation. Default: |
metrics |
Character vector, names of the performance metrics selected. The possible values are: "r.squared" ( |
distance.step |
Numeric, argument |
distance.step.x |
Numeric, argument |
distance.step.y |
Numeric, argument |
grow.testing.folds |
Logical. By default, this function grows contiguous training folds to keep the spatial structure of the data as intact as possible. However, when setting |
seed |
Integer, random seed to facilitate reproducibility. If set to a given number, the results of the function are always the same. Default: |
verbose |
Logical. If |
n.cores |
Integer, number of cores to use for parallel execution. Creates a socket cluster with |
cluster |
A cluster definition generated with |
The evaluation algorithm works as follows: the number of repetitions and the input dataset (stored in model$ranger.arguments$data) are passed to thinning_til_n(), which applies thinning() to the input data until as many cases as repetitions are left, as far apart as possible. Each remaining record is used as a "fold center". From that point, the fold grows until the proportion of records it contains is equal (or close) to training.fraction. The indices of the records within the grown spatial fold are stored as "training" in the output list, and the remaining ones as "testing". Then, for each spatial fold, a "training model" is fitted on the cases identified by the training indices and used to predict the cases identified by the testing indices. The model predictions on the "unseen" data are compared with the observations to compute the performance measures (R squared, pseudo R squared, RMSE, and NRMSE).
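A single iteration of this procedure can be sketched in base R. The sketch is a simplification under stated assumptions: the fold is grown by taking the records nearest to a hypothetical fold center (the package relies on thinning_til_n() and make_spatial_folds() instead), the data are simulated, and lm() stands in for the random forest model.

```r
set.seed(1)
n <- 100
df <- data.frame(x = runif(n), y = runif(n))
df$response <- df$x + rnorm(n, sd = 0.1)  # simulated response

training.fraction <- 0.75
fold.center <- 1  # hypothetical index of one fold center

# Grow the fold: take the records closest to the center until
# the training fraction is reached (simplifying assumption).
d <- sqrt((df$x - df$x[fold.center])^2 + (df$y - df$y[fold.center])^2)
n.training <- floor(n * training.fraction)
training <- order(d)[seq_len(n.training)]
testing <- setdiff(seq_len(n), training)

# Fit on the training records, predict on the testing records.
# lm() is a stand-in for the random forest model.
m <- lm(response ~ x + y, data = df[training, ])
predicted <- predict(m, newdata = df[testing, ])
observed <- df$response[testing]

# Performance on the "unseen" data.
rmse <- sqrt(mean((observed - predicted)^2))
pseudo.r.squared <- cor(observed, predicted)
c(rmse = rmse, pseudo.r.squared = pseudo.r.squared)
```

Repeating these steps for every fold center yields one set of performance values per spatial fold.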
A model of the class "rf_evaluate" with a new slot named "evaluation", that is a list with the following slots:
training.fraction: Value of the argument training.fraction.
spatial.folds: Result of applying make_spatial_folds() on the data coordinates. It is a list with as many slots as repetitions are indicated by the user. Each slot has two slots named "training" and "testing", each one having the indices of the cases used on the training and testing models.
per.fold: Data frame with the evaluation results per spatial fold (or repetition). It contains the ID of each fold, its central coordinates, the number of training and testing cases, and the training and testing performance measures: R squared, pseudo R squared (cor(observed, predicted)), RMSE, and normalized RMSE.
per.model: Same data as above, but organized per fold and model ("Training", "Testing", and "Full").
aggregated: Same data, but aggregated by model and performance measure.
Other model_workflow:
rf_compare(),
rf_importance(),
rf_repeat(),
rf_tuning()
if(interactive()){
  data(plants_rf, plants_xy)
  plants_rf <- rf_evaluate(
    model = plants_rf,
    xy = plants_xy,
    repetitions = 5,
    n.cores = 1
  )
  plot_evaluation(plants_rf, notch = FALSE)
  print_evaluation(plants_rf)
  get_evaluation(plants_rf)
}
Evaluates the contribution of the predictors to model transferability via spatial cross-validation. The function returns the median increase or decrease in a given evaluation metric (R2, pseudo R2, RMSE, nRMSE, or AUC) when a variable is introduced in a model, by comparing and evaluating via spatial cross-validation models with and without the given variable. This function was devised to provide importance scores that would be less sensitive to spatial autocorrelation than those computed internally by random forest on the out-of-bag data. This function is experimental.
rf_importance(
  model = NULL,
  xy = NULL,
  repetitions = 30,
  training.fraction = 0.75,
  metric = c("r.squared", "pseudo.r.squared", "rmse", "nrmse", "auc"),
  distance.step = NULL,
  distance.step.x = NULL,
  distance.step.y = NULL,
  fill.color = viridis::viridis(100, option = "F", direction = -1, alpha = 1, end = 0.9),
  line.color = "white",
  seed = 1,
  verbose = TRUE,
  n.cores = parallel::detectCores() - 1,
  cluster = NULL
)
model |
Model fitted with |
xy |
Data frame or matrix with two columns containing coordinates and named "x" and "y". If |
repetitions |
Integer, number of spatial folds to use during cross-validation. Must be lower than the total number of rows available in the model's data. Default: |
training.fraction |
Proportion between 0.5 and 0.9 indicating the proportion of records to be used as training set during spatial cross-validation. Default: |
metric |
Character, name of the performance metric to use. The possible values are: "r.squared" ( |
distance.step |
Numeric, argument |
distance.step.x |
Numeric, argument |
distance.step.y |
Numeric, argument |
fill.color |
Character vector with hexadecimal codes (e.g. "#440154FF" "#21908CFF" "#FDE725FF"), or function generating a palette (e.g. |
line.color |
Character string, color of the line produced by |
seed |
Integer, random seed to facilitate reproducibility. If set to a given number, the results of the function are always the same. Default: |
verbose |
Logical. If |
n.cores |
Integer, number of cores to use for parallel execution. Creates a socket cluster with |
cluster |
A cluster definition generated with |
The input model with new data in its "importance" slot. The new importance scores are included in the data frame model$importance$per.variable, under the column names "importance.cv" (median contribution to transferability over spatial cross-validation repetitions), "importance.cv.mad" (median absolute deviation of the performance scores over spatial cross-validation repetitions), "importance.cv.percent" ("importance.cv" expressed as a percent, taking the full model's performance as baseline), and "importance.cv.mad" (median absolute deviation of "importance.cv"). The plot is stored as "cv.per.variable.plot".
Other model_workflow:
rf_compare(),
rf_evaluate(),
rf_repeat(),
rf_tuning()
if(interactive()){
  data(plants_rf)
  m_importance <- rf_importance(
    model = plants_rf,
    repetitions = 5
  )
}
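The idea behind rf_importance(), comparing models with and without a given variable on held-out data and taking the median difference, can be sketched in base R. This is an illustration under stated assumptions: the data are simulated, lm() stands in for the random forest, the helper r2_test() is hypothetical, and the folds are random rather than spatial.

```r
set.seed(1)
n <- 200
df <- data.frame(a = rnorm(n), b = rnorm(n))
df$y <- 2 * df$a + 0.1 * df$b + rnorm(n)  # "a" matters far more than "b"

# Hypothetical helper: R squared of a model on held-out data.
r2_test <- function(formula, train, test) {
  m <- lm(formula, data = train)
  cor(test$y, predict(m, newdata = test))^2
}

# For each repetition, compare models with and without "a".
# Assumption: random 75/25 splits instead of spatial folds.
differences <- vapply(
  seq_len(5),
  function(i) {
    idx <- sample(n, size = 0.75 * n)
    with.a <- r2_test(y ~ a + b, df[idx, ], df[-idx, ])
    without.a <- r2_test(y ~ b, df[idx, ], df[-idx, ])
    with.a - without.a
  },
  numeric(1)
)

# Median contribution of "a" and its median absolute deviation.
importance.cv <- median(differences)
importance.cv.mad <- mad(differences)
c(importance.cv = importance.cv, importance.cv.mad = importance.cv.mad)
```

A positive median indicates that adding the variable improves R squared on unseen data in this toy setup.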
Fits several random forest models on the same data in order to capture the effect of the algorithm's stochasticity on the variable importance scores, predictions, residuals, and performance measures. The function relies on the median to aggregate performance and importance values across repetitions. It is designed to be used after a model is fitted (rf() or rf_spatial()), tuned (rf_tuning()), and/or evaluated (rf_evaluate()).
rf_repeat(
  model = NULL,
  data = NULL,
  dependent.variable.name = NULL,
  predictor.variable.names = NULL,
  distance.matrix = NULL,
  distance.thresholds = NULL,
  xy = NULL,
  ranger.arguments = NULL,
  scaled.importance = FALSE,
  repetitions = 10,
  keep.models = TRUE,
  seed = 1,
  verbose = TRUE,
  n.cores = parallel::detectCores() - 1,
  cluster = NULL
)
model |
A model fitted with |
data |
Data frame with a response variable and a set of predictors. Default: |
dependent.variable.name |
Character string with the name of the response variable. Must be in the column names of |
predictor.variable.names |
Character vector with the names of the predictive variables. Every element of this vector must be in the column names of |
distance.matrix |
Squared matrix with the distances among the records in |
distance.thresholds |
Numeric vector with neighborhood distances. All distances in the distance matrix below each value in |
xy |
(optional) Data frame or matrix with two columns containing coordinates and named "x" and "y". It is not used by this function, but it is stored in the slot |
ranger.arguments |
Named list with ranger arguments (other arguments of this function can also go here). All ranger arguments are set to their default values except for 'importance', that is set to 'permutation' rather than 'none'. Please, consult the help file of ranger if you are not familiar with the arguments of this function. |
scaled.importance |
Logical. If |
repetitions |
Integer, number of random forest models to fit. Default: |
keep.models |
Logical, if |
seed |
Integer, random seed to facilitate reproducibility. If set to a given number, the results of the function are always the same. Default: |
verbose |
Logical, if |
n.cores |
Integer, number of cores to use for parallel execution. Creates a socket cluster with |
cluster |
A cluster definition generated with |
A ranger model with several new slots:
ranger.arguments: Stores the values of the arguments used to fit the ranger model.
importance: A list containing a data frame with the predictors ordered by their importance, a ggplot showing the importance values, and local importance scores.
performance: out-of-bag performance scores: R squared, pseudo R squared, RMSE, and normalized RMSE (NRMSE).
pseudo.r.squared: computed as the correlation between the observations and the predictions.
residuals: residuals, normality test of the residuals computed with residuals_test(), and spatial autocorrelation of the residuals computed with moran_multithreshold().
Other model_workflow:
rf_compare(),
rf_evaluate(),
rf_importance(),
rf_tuning()
if(interactive()){
  data(plants_rf)
  m_repeat <- rf_repeat(
    model = plants_rf,
    repetitions = 5,
    n.cores = 1
  )
  #performance scores across repetitions
  m_repeat$performance
  print_performance(m_repeat)
  #variable importance
  plot_importance(m_repeat)
  #response curves
  plot_response_curves(
    model = m_repeat,
    variables = "climate_bio1_average",
    quantiles = 0.5
  )
}
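rf_repeat() aggregates values across repetitions with the median. A minimal base R sketch of that aggregation, together with the median absolute deviation as a measure of spread, is shown below. The performance values are simulated and the data frame layout is an assumption, not the structure returned by the package. Note that stats::mad() scales its result by a constant of 1.4826 by default.

```r
set.seed(1)
repetitions <- 10

# Simulated performance scores, one row per fitted model.
performance <- data.frame(
  repetition = seq_len(repetitions),
  r.squared = runif(repetitions, min = 0.5, max = 0.6),
  rmse = runif(repetitions, min = 900, max = 1000)
)

# Aggregate each metric across repetitions.
metrics <- c("r.squared", "rmse")
aggregated <- data.frame(
  metric = metrics,
  median = vapply(performance[metrics], median, numeric(1)),
  mad = vapply(performance[metrics], mad, numeric(1)),
  row.names = NULL
)
aggregated
```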
Fits spatial random forest models using different methods to generate, rank, and select spatial predictors acting as proxies of spatial processes not considered by the non-spatial predictors. The end goal is providing the model with information about the spatial structure of the data to minimize the spatial correlation (Moran's I) of the model residuals and generate honest variable importance scores.
rf_spatial(
  model = NULL,
  data = NULL,
  dependent.variable.name = NULL,
  predictor.variable.names = NULL,
  distance.matrix = NULL,
  distance.thresholds = NULL,
  xy = NULL,
  ranger.arguments = NULL,
  scaled.importance = TRUE,
  method = c(
    "mem.moran.sequential",
    "mem.effect.sequential",
    "mem.effect.recursive",
    "hengl",
    "hengl.moran.sequential",
    "hengl.effect.sequential",
    "hengl.effect.recursive",
    "pca.moran.sequential",
    "pca.effect.sequential",
    "pca.effect.recursive"
  ),
  max.spatial.predictors = NULL,
  weight.r.squared = NULL,
  weight.penalization.n.predictors = NULL,
  seed = 1,
  verbose = TRUE,
  n.cores = parallel::detectCores() - 1,
  cluster = NULL
)
model |
A model fitted with |
data |
Data frame with a response variable and a set of predictors. Default: |
dependent.variable.name |
Character string with the name of the response variable. Must be in the column names of |
predictor.variable.names |
Character vector with the names of the predictive variables. Every element of this vector must be in the column names of |
distance.matrix |
Squared matrix with the distances among the records in |
distance.thresholds |
Numeric vector with distances in the same units as |
xy |
(optional) Data frame or matrix with two columns containing coordinates and named "x" and "y". It is not used by this function, but it is stored in the slot |
ranger.arguments |
Named list with ranger arguments (other arguments of this function can also go here). All ranger arguments are set to their default values except for 'importance', that is set to 'permutation' rather than 'none'. Please, consult the help file of ranger if you are not familiar with the arguments of this function. |
scaled.importance |
Logical. If |
method |
Character, method to build, rank, and select spatial predictors. One of:
|
max.spatial.predictors |
Integer, maximum number of spatial predictors to generate. Useful when memory problems arise due to a large number of spatial predictors. Default: |
weight.r.squared |
Numeric between 0 and 1, weight of R-squared in the selection of spatial components. See Details. Default: |
weight.penalization.n.predictors |
Numeric between 0 and 1, weight of the penalization for adding an increasing number of spatial predictors during selection. Default: |
seed |
Integer, random seed to facilitate reproducibility. Default: |
verbose |
Logical. If TRUE, messages and plots generated during the execution of the function are displayed. Default: |
n.cores |
Integer, number of cores to use for parallel execution. Creates a socket cluster with |
cluster |
A cluster definition generated with |
The function uses three different methods to generate spatial predictors ("hengl", "pca", and "mem"), two methods to rank them, which define the order in which they are introduced in the model ("effect" and "moran"), and two methods to select the spatial predictors that minimize the spatial correlation of the model residuals ("sequential" and "recursive"). All method names except "hengl" (which uses the complete distance matrix as predictors in the spatial model) combine a method to generate the spatial predictors, a method to rank them, and a method to select them, separated by a period. Examples are "mem.moran.sequential" or "mem.effect.recursive". Not all combinations are possible, since the ranking method "moran" cannot be used with the selection method "recursive" (because the logic behind them is very different, see below).
Methods to generate spatial predictors:
"hengl": named after the method RFsp presented in the paper "Random forest as a generic framework for predictive modeling of spatial and spatio-temporal variables", by Hengl et al. (2018), where the authors propose to use the distance matrix among records as predictors in spatial random forest models (RFsp method). In this function, all methods starting with "hengl" use either the complete distance matrix, or select columns of the distance matrix as spatial predictors.
"mem": Generates Moran's Eigenvector Maps, that is, the eigenvectors of the double-centered weights of the distance matrix. The method is described in "Spatial modelling: a comprehensive framework for principal coordinate analysis of neighbour matrices (PCNM)", by Dray et al. (2006), and "Statistical methods for temporal and space–time analysis of community composition data", by Legendre and Gauthier (2014).
"pca": Computes spatial predictors from the principal component analysis of a weighted distance matrix (see weights_from_distance_matrix()). This is an experimental method, use with caution.
Methods to rank spatial predictors (see rank_spatial_predictors()):
"moran": Computes the Moran's I of each spatial predictor, selects the ones with positive values, and ranks them from higher to lower Moran's I.
"effect": If a given non-spatial random forest model is defined as y = p1 + ... + pn, being p1 + ... + pn the set of predictors, for every spatial predictor generated (spX) a spatial model y = p1 + ... + pn + spX is fitted, and the Moran's I of its residuals is computed. The spatial predictors are then ranked by how much they help to reduce spatial autocorrelation between the non-spatial and the spatial model.
Methods to select spatial predictors:
"sequential" (see select_spatial_predictors_sequential()): The spatial predictors are added one by one in the order they were ranked, and once all spatial predictors are introduced, the best first n predictors are selected. This method is similar to the one employed in the MEM methodology (Moran's Eigenvector Maps) described in the paper "Spatial modelling: a comprehensive framework for principal coordinate analysis of neighbour matrices (PCNM)", by Dray et al. (2006), and "Statistical methods for temporal and space–time analysis of community composition data", by Legendre and Gauthier (2014). This method generally introduces tens of predictors into the model, but usually offers good results.
"recursive" (see select_spatial_predictors_recursive()): This method tries to find the smallest combination of spatial predictors that reduce the spatial correlation of the model's residuals the most. The algorithm goes as follows: 1. The first ranked spatial predictor is introduced into the model; 2. the remaining predictors are ranked again using the "effect" method, using the model in 1. as reference. The first spatial predictor in the resulting ranking is then introduced into the model, and the steps 1. and 2. are repeated until spatial predictors stop having an effect in reducing the Moran's I of the model residuals. This method takes longer to compute, but generates smaller sets of spatial predictors. This is an experimental method, use with caution.
Once the ranking procedure is completed, an algorithm selects the minimal subset of spatial predictors that most reduces the Moran's I of the residuals: for each new spatial predictor introduced in the model, the Moran's I of the residuals, its p-value, a binary version of the p-value (0 if < 0.05 and 1 if >= 0.05), the R-squared of the model, and a penalization linear with the number of spatial predictors introduced (computed as (1 / total spatial predictors) * introduced spatial predictors) are rescaled between 0 and 1. Then, the optimization criterion is computed as max(1 - Moran's I, p-value binary) + (weight.r.squared * R-squared) - (weight.penalization.n.predictors * penalization). The predictors from the first one to the one with the highest optimization criterion are then selected as the best ones in reducing the spatial correlation of the model residuals, and used along with data to fit the final spatial model.
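The optimization criterion can be worked through on a small made-up example. The values in `selection`, the helper `rescale01()`, and the two weights are illustrative assumptions; the package computes these quantities from fitted models.

```r
# Rescale a numeric vector to the range [0, 1].
rescale01 <- function(x) {
  if (diff(range(x)) == 0) return(rep(0, length(x)))
  (x - min(x)) / (max(x) - min(x))
}

# Hypothetical selection results: one row per spatial predictor added.
selection <- data.frame(
  moran.i   = c(0.30, 0.18, 0.09, 0.05, 0.04, 0.04),
  p.value   = c(0.001, 0.004, 0.03, 0.20, 0.35, 0.40),
  r.squared = c(0.50, 0.55, 0.58, 0.60, 0.60, 0.59)
)
n <- nrow(selection)
weight.r.squared <- 0.75                  # illustrative weight
weight.penalization.n.predictors <- 0.25  # illustrative weight

moran <- rescale01(selection$moran.i)
p.binary <- ifelse(selection$p.value >= 0.05, 1, 0)
r.squared <- rescale01(selection$r.squared)
penalization <- rescale01((1 / n) * seq_len(n))

# max(1 - Moran's I, p-value binary) + weighted R-squared - weighted penalty
selection$optimization <- pmax(1 - moran, p.binary) +
  weight.r.squared * r.squared -
  weight.penalization.n.predictors * penalization

# Predictors from the first one up to the one with the highest criterion.
best <- which.max(selection$optimization)
selected <- seq_len(best)
```

With these numbers the criterion peaks at the fourth predictor: it is the first one with a non-significant Moran's I, and later predictors add penalization without improving R-squared.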
A ranger model with several new slots:
ranger.arguments: Values of the arguments used to fit the ranger model.
importance: A list containing the vector of variable importance as originally returned by ranger (scaled or not depending on the value of 'scaled.importance'), a data frame with the predictors ordered by their importance, and a ggplot showing the importance values.
performance: With the out-of-bag R squared, pseudo R squared, RMSE and NRMSE of the model.
residuals: residuals, normality test of the residuals computed with residuals_test(), and spatial autocorrelation of the residuals computed with moran_multithreshold().
spatial: A list with four slots:
method: Character, method used to generate, rank, and select spatial predictors.
names: Character vector with the names of the selected spatial predictors. Not returned if the method is "hengl".
optimization: Criteria used to select the spatial predictors. Not returned if the method is "hengl".
plot: Plot of the criteria used to select the spatial predictors. Not returned if the method is "hengl".
Other main_models:
rf()
if (interactive()) {
  data(
    plants_df,
    plants_response,
    plants_predictors,
    plants_distance,
    plants_rf
  )
  #subset to speed up example
  idx <- 1:100
  plants_df <- plants_df[idx, ]
  plants_distance <- plants_distance[idx, idx]
  #fit spatial model from scratch
  m_spatial <- rf_spatial(
    data = plants_df,
    dependent.variable.name = plants_response,
    predictor.variable.names = plants_predictors,
    distance.matrix = plants_distance,
    distance.thresholds = c(100, 1000, 2000),
    method = "mem.moran.sequential",
    ranger.arguments = list(num.trees = 30),
    n.cores = 1
  )
  plot_residuals_diagnostics(m_spatial)
  #optimization of MEM selection
  plot_optimization(m_spatial)
  #from non-spatial to spatial model
  m_spatial <- rf_spatial(
    model = plants_rf
  )
}
Finds the optimal set of random forest hyperparameters num.trees, mtry, and min.node.size via grid search by maximizing the model's R squared, or AUC, if the response variable is binomial, via spatial cross-validation performed with rf_evaluate().
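As a rough illustration of the grid search, the candidate values can be expanded into every combination and the best row picked by its evaluation metric. The R squared values below are made-up placeholders; in the package they would come from spatial cross-validation.

```r
# Candidate hyperparameter values (hypothetical choices for the example).
grid <- expand.grid(
  num.trees = c(25, 50),
  mtry = c(5, 10),
  min.node.size = c(10, 20)
)

# Placeholder scores: made-up numbers standing in for cross-validated R squared.
set.seed(1)
grid$r.squared <- round(runif(nrow(grid), min = 0.4, max = 0.6), 3)

# The selected combination is the one that maximizes the metric.
best <- grid[which.max(grid$r.squared), ]
best
```

Two candidate values for each of three hyperparameters give eight combinations, so the number of models to evaluate grows multiplicatively with each value added.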
rf_tuning(
  model = NULL,
  num.trees = NULL,
  mtry = NULL,
  min.node.size = NULL,
  xy = NULL,
  repetitions = 30,
  training.fraction = 0.75,
  seed = 1,
  verbose = TRUE,
  n.cores = parallel::detectCores() - 1,
  cluster = NULL
)
model |
A model fitted with |
num.trees |
Numeric integer vector with the number of trees to fit on each model repetition. Default: |
mtry |
Numeric integer vector, number of predictors to randomly select from the complete pool of predictors on each tree split. Default: |
min.node.size |
Numeric integer, minimal number of cases in a terminal node. Default: |
xy |
Data frame or matrix with two columns containing coordinates and named "x" and "y". If |
repetitions |
Integer, number of independent spatial folds to use during the cross-validation. Default: |
training.fraction |
Proportion between 0.2 and 0.9 indicating the number of records to be used in model training. Default: |
seed |
Integer, random seed to facilitate reproducibility. If set to a given number, the results of the function are always the same. Default: |
verbose |
Logical. If TRUE, messages and plots generated during the execution of the function are displayed. Default: |
n.cores |
Integer, number of cores to use for parallel execution. Creates a socket cluster with |
cluster |
A cluster definition generated with |
A model with a new slot named tuning, with a data frame with the results of the tuning analysis.
Other model_workflow:
rf_compare(),
rf_evaluate(),
rf_importance(),
rf_repeat()
if(interactive()){
  data(
    plants_rf,
    plants_xy
  )
  plants_rf_tuned <- rf_tuning(
    model = plants_rf,
    num.trees = c(25, 50),
    mtry = c(5, 10),
    min.node.size = c(10, 20),
    xy = plants_xy,
    repetitions = 5,
    n.cores = 1
  )
  plot_tuning(plants_rf_tuned)
}
Computes the RMSE or normalized RMSE (NRMSE) between two numeric vectors of the same length representing observations and model predictions.
root_mean_squared_error(
  o,
  p,
  normalization = c("rmse", "all", "mean", "sd", "maxmin", "iq")
)
o |
Numeric vector with observations, must have the same length as |
p |
Numeric vector with predictions, must have the same length as |
normalization |
Character, normalization method. Default: "rmse" (see Details). |
The normalization methods go as follows:
"rmse": RMSE with no normalization.
"mean": RMSE dividied by the mean of the observations (rmse/mean(o)).
"sd": RMSE dividied by the standard deviation of the observations (rmse/sd(o)).
"maxmin": RMSE divided by the range of the observations (rmse/(max(o) - min(o))).
"iq": RMSE divided by the interquartile range of the observations (rmse/(quantile(o, 0.75) - quantile(o, 0.25)))
Named numeric vector with either one or 5 values, as selected by the user.
Other utilities:
.vif_to_df(),
auc(),
beowulf_cluster(),
objects_size(),
optimization_function(),
prepare_importance_spatial(),
rescale_vector(),
setup_parallel_execution(),
standard_error(),
statistical_mode(),
thinning(),
thinning_til_n()
root_mean_squared_error(
  o = runif(10),
  p = runif(10)
)
Selects spatial predictors following these steps:
Gets the spatial predictors ranked by rank_spatial_predictors() and fits a model of the form y ~ predictors + best_spatial_predictor_1. The Moran's I of the residuals of this model is used as reference value for the next step.
The remaining spatial predictors are introduced again into rank_spatial_predictors(), and the spatial predictor with the highest ranking is introduced in a new model of the form y ~ predictors + best_spatial_predictor_1 + best_spatial_predictor_2.
Steps 1 and 2 are repeated until the Moran's I doesn't improve for a number of repetitions equal to 20 percent of the total number of spatial predictors introduced in the function.
This method selects the smallest set of spatial predictors with the largest joint effect in reducing the spatial correlation of the model residuals, while keeping the model's R-squared as high as possible. Because rank_spatial_predictors() is run on each iteration, this method includes fewer spatial predictors in the final model than the sequential method implemented in select_spatial_predictors_sequential().
select_spatial_predictors_recursive(
  data = NULL,
  dependent.variable.name = NULL,
  predictor.variable.names = NULL,
  distance.matrix = NULL,
  distance.thresholds = NULL,
  ranger.arguments = NULL,
  spatial.predictors.df = NULL,
  spatial.predictors.ranking = NULL,
  weight.r.squared = 0.25,
  weight.penalization.n.predictors = 0,
  n.cores = parallel::detectCores() - 1,
  cluster = NULL
)
data |
Data frame with a response variable and a set of predictors. Default: |
dependent.variable.name |
Character string with the name of the response variable. Must be in the column names of |
predictor.variable.names |
Character vector with the names of the predictive variables. Every element of this vector must be in the column names of |
distance.matrix |
Squared matrix with the distances among the records in |
distance.thresholds |
Numeric vector with neighborhood distances. All distances in the distance matrix below each value in |
ranger.arguments |
Named list with ranger arguments (other arguments of this function can also go here). All ranger arguments are set to their default values except for 'importance', that is set to 'permutation' rather than 'none'. Please, consult the help file of ranger if you are not familiar with the arguments of this function. |
spatial.predictors.df |
Data frame of spatial predictors. |
spatial.predictors.ranking |
Ranking of predictors returned by |
weight.r.squared |
Numeric between 0 and 1, weight of R-squared in the optimization index. Default: |
weight.penalization.n.predictors |
Numeric between 0 and 1, weight of the penalization for the number of spatial predictors added in the optimization index. Default: |
n.cores |
Integer, number of cores to use. Default: |
cluster |
A cluster definition generated by |
The algorithm works as follows. If the function rank_spatial_predictors() returns 10 ranked spatial predictors (sp1 to sp10, with sp7 being the best one), select_spatial_predictors_recursive() first fits the model y ~ predictors + sp7. Then, the remaining spatial predictors are ranked again with rank_spatial_predictors() using the model y ~ predictors + sp7 as reference (at this stage, some of the spatial predictors might be dropped due to lack of effect). When the new ranking of spatial predictors is ready (let's say they are sp5, sp3, and sp4), the best one (sp5) is included in the model y ~ predictors + sp7 + sp5, and the remaining ones go again to rank_spatial_predictors() to repeat the process until spatial predictors are depleted.
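The recursive loop can be sketched with base R under strong simplifications: `lm()` stands in for the random forest model, the loop stops at the first candidate that fails to reduce Moran's I (rather than after a tolerance of several iterations), and the data, the helper `moran_i()`, and the candidate names are all hypothetical.

```r
# Moran's I of a numeric vector given a weights matrix with zero diagonal.
moran_i <- function(x, w) {
  z <- x - mean(x)
  (length(x) / sum(w)) * sum(w * outer(z, z)) / sum(z^2)
}

# Hypothetical data: the response has a spatial trend that predictor a misses.
set.seed(1)
xy <- expand.grid(x = 1:6, y = 1:6)
w <- 1 / as.matrix(dist(xy))
diag(w) <- 0
df <- data.frame(a = rnorm(36))
df$y <- df$a + xy$x + rnorm(36, sd = 0.3)

# Made-up candidate spatial predictors; only sp2 carries the trend.
candidates <- data.frame(sp1 = rnorm(36), sp2 = xy$x, sp3 = rnorm(36))

selected <- character(0)
remaining <- names(candidates)
reference <- moran_i(residuals(lm(y ~ a, data = df)), w)

while (length(remaining) > 0) {
  # Re-rank the remaining candidates by the residual Moran's I obtained
  # when each one is added to the current model.
  effect <- sapply(remaining, function(s) {
    m <- lm(y ~ ., data = cbind(df, candidates[, c(selected, s), drop = FALSE]))
    moran_i(residuals(m), w)
  })
  best <- names(which.min(effect))
  if (effect[[best]] >= reference) break  # simplified stopping rule
  selected <- c(selected, best)
  remaining <- setdiff(remaining, best)
  reference <- effect[[best]]
}
selected
```

Because the ranking is recomputed against the current model at every step, a candidate is only added when it still reduces the residual autocorrelation given the predictors already selected.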
A list with two slots: optimization, a data frame with the index of the spatial predictor added on each iteration, the spatial correlation of the model residuals, and the R-squared of the model, and best.spatial.predictors, that is a character vector with the names of the spatial predictors that minimize the Moran's I of the residuals and maximize the R-squared of the model.
Other spatial_analysis:
filter_spatial_predictors(),
mem(),
mem_multithreshold(),
moran(),
moran_multithreshold(),
pca(),
pca_multithreshold(),
rank_spatial_predictors(),
residuals_diagnostics(),
residuals_test(),
select_spatial_predictors_sequential()
if (interactive()) {
  data(
    plants_df,
    plants_response,
    plants_predictors,
    plants_distance,
    plants_rf
  )
  #subset to speed up example
  idx <- 1:20
  plants_df <- plants_df[idx, ]
  plants_distance <- plants_distance[idx, idx]
  #generate spatial predictors
  mems <- mem_multithreshold(
    distance.matrix = plants_distance,
    distance.thresholds = 100
  )
  #rank them from higher to lower moran
  mems.rank <- rank_spatial_predictors(
    ranking.method = "moran",
    spatial.predictors.df = mems,
    reference.moran.i = plants_rf$residuals$autocorrelation$max.moran,
    distance.matrix = plants_distance,
    distance.thresholds = 100,
    n.cores = 1
  )
  #select best subset via recursive selection
  selection <- select_spatial_predictors_recursive(
    data = plants_df,
    dependent.variable.name = plants_response,
    predictor.variable.names = plants_predictors,
    distance.matrix = plants_distance,
    distance.thresholds = 0,
    spatial.predictors.df = mems,
    spatial.predictors.ranking = mems.rank,
    ranger.arguments = list(num.trees = 30),
    n.cores = 1
  )
  #names of selected spatial predictors
  selection$best.spatial.predictors
  #optimization plot
  plot_optimization(selection$optimization)
}
Selects spatial predictors by adding them sequentially into a model while monitoring the Moran's I of the model residuals and the model's R-squared. Once all the available spatial predictors have been added to the model, the function identifies the first n predictors that minimize the spatial correlation of the residuals and maximize R-squared, and returns the names of the selected spatial predictors and a data frame with the selection criteria.
select_spatial_predictors_sequential(
  data = NULL,
  dependent.variable.name = NULL,
  predictor.variable.names = NULL,
  distance.matrix = NULL,
  distance.thresholds = NULL,
  ranger.arguments = NULL,
  spatial.predictors.df = NULL,
  spatial.predictors.ranking = NULL,
  weight.r.squared = 0.75,
  weight.penalization.n.predictors = 0.25,
  verbose = FALSE,
  n.cores = parallel::detectCores() - 1,
  cluster = NULL
)
data |
Data frame with a response variable and a set of predictors. Default: |
dependent.variable.name |
Character string with the name of the response variable. Must be in the column names of |
predictor.variable.names |
Character vector with the names of the predictive variables. Every element of this vector must be in the column names of |
distance.matrix |
Squared matrix with the distances among the records in |
distance.thresholds |
Numeric vector with neighborhood distances. All distances in the distance matrix below each value in |
ranger.arguments |
Named list with ranger arguments (other arguments of this function can also go here). All ranger arguments are set to their default values except for 'importance', that is set to 'permutation' rather than 'none'. Please, consult the help file of ranger if you are not familiar with the arguments of this function. |
spatial.predictors.df |
Data frame of spatial predictors. |
spatial.predictors.ranking |
Ranking of the spatial predictors returned by |
weight.r.squared |
Numeric between 0 and 1, weight of R-squared in the optimization index. Default: |
weight.penalization.n.predictors |
Numeric between 0 and 1, weight of the penalization for the number of spatial predictors added in the optimization index. Default: |
verbose |
Logical, if |
n.cores |
Integer, number of cores to use. Default: |
cluster |
A cluster definition generated by |
The algorithm works as follows: if the function rank_spatial_predictors() returns 10 spatial predictors (sp1 to sp10, ordered from best to worst), select_spatial_predictors_sequential() fits the models y ~ predictors + sp1, y ~ predictors + sp1 + sp2, and so on, until all spatial predictors are used in y ~ predictors + sp1 ... sp10. The model with the lowest Moran's I of the residuals and the highest R-squared (computed on the out-of-bag data) is selected, and its spatial predictors are returned.
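A simplified base R sketch of the sequential scheme follows. It is not the package's implementation: `lm()` replaces the random forest model, R-squared is computed in-sample rather than out-of-bag, and the data, the helper `moran_i()`, and the predictors sp1 to sp3 are hypothetical.

```r
# Moran's I of a numeric vector given a weights matrix with zero diagonal.
moran_i <- function(x, w) {
  z <- x - mean(x)
  (length(x) / sum(w)) * sum(w * outer(z, z)) / sum(z^2)
}

# Hypothetical data: the response has trends along x and y not captured by a.
set.seed(1)
xy <- expand.grid(x = 1:6, y = 1:6)
w <- 1 / as.matrix(dist(xy))
diag(w) <- 0
df <- data.frame(a = rnorm(36))
df$y <- df$a + 0.5 * xy$x + 0.5 * xy$y + rnorm(36, sd = 0.3)

# Made-up spatial predictors, assumed to be already ranked from best to worst.
sp <- data.frame(sp1 = xy$x, sp2 = xy$y, sp3 = rnorm(36))

# Fit y ~ a + sp1, y ~ a + sp1 + sp2, ... and record the selection criteria.
criteria <- do.call(rbind, lapply(seq_along(sp), function(i) {
  m <- lm(y ~ ., data = cbind(df, sp[, seq_len(i), drop = FALSE]))
  data.frame(
    n.predictors = i,
    moran.i = moran_i(residuals(m), w),
    r.squared = summary(m)$r.squared
  )
}))
criteria
```

The resulting data frame has one row per model, which is the kind of table from which the first n predictors are then chosen.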
A list with two slots: optimization, a data frame with the index of the spatial predictor added on each iteration, the spatial correlation of the model residuals, and the R-squared of the model, and best.spatial.predictors, that is a character vector with the names of the spatial predictors that minimize the Moran's I of the residuals and maximize the R-squared of the model.
Other spatial_analysis:
filter_spatial_predictors(),
mem(),
mem_multithreshold(),
moran(),
moran_multithreshold(),
pca(),
pca_multithreshold(),
rank_spatial_predictors(),
residuals_diagnostics(),
residuals_test(),
select_spatial_predictors_recursive()
if(interactive()){
  data(
    plants_df,
    plants_response,
    plants_predictors,
    plants_distance,
    plants_rf
  )
  #subset to speed up example
  idx <- 1:20
  plants_df <- plants_df[idx, ]
  plants_distance <- plants_distance[idx, idx]
  #generate spatial predictors
  mems <- mem_multithreshold(
    distance.matrix = plants_distance,
    distance.thresholds = 100
  )
  #rank them from higher to lower moran
  mems.rank <- rank_spatial_predictors(
    ranking.method = "moran",
    spatial.predictors.df = mems,
    reference.moran.i = plants_rf$residuals$autocorrelation$max.moran,
    distance.matrix = plants_distance,
    distance.thresholds = 100,
    n.cores = 1
  )
  #select best subset via sequential addition
  selection <- select_spatial_predictors_sequential(
    data = plants_df,
    dependent.variable.name = plants_response,
    predictor.variable.names = plants_predictors,
    distance.matrix = plants_distance,
    distance.thresholds = 0,
    spatial.predictors.df = mems,
    spatial.predictors.ranking = mems.rank,
    ranger.arguments = list(num.trees = 30),
    n.cores = 1
  )
  #names of selected spatial predictors
  selection$best.spatial.predictors
  #optimization plot
  plot_optimization(selection$optimization)
}
Internal helper to manage parallel backend setup with support for user-managed backends, external clusters, and internal clusters.
setup_parallel_execution(cluster = NULL, n.cores = parallel::detectCores() - 1)
cluster |
A cluster object from parallel::makeCluster(), or NULL |
n.cores |
Number of cores for internal cluster creation |
A list with:
cluster: The cluster object to pass to child functions (or NULL)
mode: One of "user_backend", "external_cluster", "internal_cluster", "sequential"
cleanup: A function to call in on.exit() for proper cleanup
Other utilities:
.vif_to_df(),
auc(),
beowulf_cluster(),
objects_size(),
optimization_function(),
prepare_importance_spatial(),
rescale_vector(),
root_mean_squared_error(),
standard_error(),
statistical_mode(),
thinning(),
thinning_til_n()
Computes the standard error of the mean of a numeric vector as round(sqrt(var(x)/length(x)), 3).
standard_error(x)
x |
A numeric vector. |
The function removes NA values before computing the standard error, and rounds the result to 3 decimal places.
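A minimal sketch of this computation, assuming NA removal happens before both the variance and the length are taken (the name `standard_error_sketch()` is hypothetical):

```r
# Standard error of the mean, ignoring NA values, rounded to 3 decimals.
standard_error_sketch <- function(x) {
  x <- x[!is.na(x)]
  round(sqrt(var(x) / length(x)), 3)
}

se <- standard_error_sketch(c(1, 2, 3, 4, NA))
se
```

For the four non-missing values the variance is 5/3, so the result is sqrt(5/12) rounded to 0.645.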
A numeric value.
Other utilities:
.vif_to_df(),
auc(),
beowulf_cluster(),
objects_size(),
optimization_function(),
prepare_importance_spatial(),
rescale_vector(),
root_mean_squared_error(),
setup_parallel_execution(),
statistical_mode(),
thinning(),
thinning_til_n()
standard_error(x = runif(10))
Computes the mode of a numeric or character vector.
statistical_mode(x)
x |
Numeric or character vector. |
Statistical mode of x.
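One common way to compute a mode in base R is sketched below. The function name is hypothetical, and the tie-breaking rule (the value that appears first wins) is an assumption of this sketch, not something the documentation states.

```r
# Most frequent value of a numeric or character vector.
statistical_mode_sketch <- function(x) {
  values <- unique(x)
  # Count occurrences of each unique value and return the most frequent one.
  values[which.max(tabulate(match(x, values)))]
}

mode_numeric <- statistical_mode_sketch(c(10, 9, 10, 8))
mode_character <- statistical_mode_sketch(c("a", "b", "b"))
```

Here `mode_numeric` is 10 and `mode_character` is "b".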
Other utilities:
.vif_to_df(),
auc(),
beowulf_cluster(),
objects_size(),
optimization_function(),
prepare_importance_spatial(),
rescale_vector(),
root_mean_squared_error(),
setup_parallel_execution(),
standard_error(),
thinning(),
thinning_til_n()
statistical_mode(x = c(10, 9, 10, 8))
Suggests candidate variable interactions and composite features able to improve predictive accuracy over data not used to train the model via spatial cross-validation with rf_evaluate(). For a pair of predictors a and b, interactions are built via multiplication (a * b), while composite features are built by extracting the first factor of a principal component analysis performed with pca(), after rescaling a and b between 1 and 100. Interactions and composite features are named a..x..b and a..pca..b respectively.
Candidate variables a and b are selected from those predictors in predictor.variable.names with a variable importance above importance.threshold (a quantile of the importance scores, set to 0.75 by default).
For each interaction and composite feature, a model including all the predictors plus the interaction or composite feature is fitted, and its R squared (or AUC if the response is binary) computed via spatial cross-validation (see rf_evaluate()) is compared with the R squared of the model without interactions or composite features.
From all the potential interactions screened, only those with a positive increase in R squared (or AUC when the response is binomial) of the model, a variable importance above the median, and a maximum correlation among themselves and with the predictors in predictor.variable.names not higher than cor.threshold (set to 0.75 by default) are selected. Such a restrictive set of rules ensures that the selected interactions can be used right away for modeling purposes without increasing model complexity unnecessarily. However, the suggested variable interactions might not make sense from a domain expertise standpoint, so please examine them with care.
The function returns the criteria used to select the interactions, and the data required to use these interactions in a model.
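The two kinds of candidate features described above can be sketched for a single pair of predictors. This is an illustration in base R: `prcomp()` stands in for the package's pca(), the helper `rescale_1_100()` and the toy data are hypothetical, and applying the rescaling only to the composite feature is an assumption of this sketch.

```r
# Rescale a numeric vector to the range [1, 100].
rescale_1_100 <- function(x) 1 + 99 * (x - min(x)) / (max(x) - min(x))

# Hypothetical predictors a and b.
set.seed(1)
df <- data.frame(a = rnorm(50), b = runif(50))

# Interaction: product of the two predictors.
df[["a..x..b"]] <- df$a * df$b

# Composite feature: first principal component of the rescaled pair.
rescaled <- cbind(a = rescale_1_100(df$a), b = rescale_1_100(df$b))
df[["a..pca..b"]] <- prcomp(rescaled, scale. = TRUE)$x[, 1]

head(df)
```

Both new columns can then be screened like any other predictor, for example by checking their correlation with a and b before adding them to a model.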
the_feature_engineer(
  data = NULL,
  dependent.variable.name = NULL,
  predictor.variable.names = NULL,
  xy = NULL,
  ranger.arguments = NULL,
  repetitions = 30,
  training.fraction = 0.75,
  importance.threshold = 0.75,
  cor.threshold = 0.75,
  point.color = viridis::viridis(100, option = "F", alpha = 0.8),
  seed = NULL,
  verbose = TRUE,
  n.cores = parallel::detectCores() - 1,
  cluster = NULL
)
data |
Data frame with a response variable and a set of predictors. Default: |
dependent.variable.name |
Character string with the name of the response variable. Must be in the column names of |
predictor.variable.names |
Character vector with the names of the predictive variables, or object of class |
xy |
Data frame or matrix with two columns containing coordinates and named "x" and "y". If not provided, the comparison between models with and without variable interactions is not done. |
ranger.arguments |
Named list with ranger arguments (other arguments of this function can also go here). All ranger arguments are set to their default values except for 'importance', that is set to 'permutation' rather than 'none'. Please, consult the help file of ranger if you are not familiar with the arguments of this function. |
repetitions |
Integer, number of spatial folds to use during cross-validation. Must be lower than the total number of rows available in the model's data. Default: |
training.fraction |
Proportion between 0.5 and 0.9 indicating the proportion of records to be used as training set during spatial cross-validation. Default: |
importance.threshold |
Numeric between 0 and 1, quantile of variable importance scores over which to select individual predictors to explore interactions among them. Larger values reduce the number of potential interactions explored. Default: |
cor.threshold |
Numeric, maximum Pearson correlation between any pair of the selected interactions, and between any interaction and the predictors in |
point.color |
Colors of the plotted points. Can be a single color name (e.g. "red4"), a character vector with hexadecimal codes (e.g. "#440154FF" "#21908CFF" "#FDE725FF"), or function generating a palette (e.g. |
seed |
Integer, random seed to facilitate reproducibility. If set to a given number, the results of the function are always the same. Default: |
verbose |
Logical. If |
n.cores |
Integer, number of cores to use for parallel execution. Creates a socket cluster with |
cluster |
A cluster definition generated with |
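The roles of importance.threshold and cor.threshold can be illustrated with a minimal base R sketch. It is not the package's implementation: the importance scores and data are made up, and computing the interaction as the product of two predictors is an assumption made only for illustration.

```r
# Hypothetical permutation importance scores for six predictors (made-up values).
importance <- c(a = 0.90, b = 0.70, c = 0.40, d = 0.20, e = 0.10, f = 0.05)

# importance.threshold is a quantile of the importance scores:
# only predictors above it are considered for interactions.
importance.threshold <- 0.75
cutoff <- quantile(importance, probs = importance.threshold)
candidates <- names(importance)[importance > cutoff]

# Every pair of candidate predictors is a potential interaction.
pairs <- t(combn(candidates, 2))

# Simulated predictors, used only to illustrate the correlation filter.
set.seed(1)
df <- as.data.frame(
  matrix(runif(600), ncol = 6, dimnames = list(NULL, names(importance)))
)

# Assumption: the interaction is the product of the two predictors.
interaction.ab <- df[[pairs[1, 1]]] * df[[pairs[1, 2]]]

# cor.threshold: keep the interaction only if its absolute Pearson
# correlation with every predictor is at or below the threshold.
cor.threshold <- 0.75
max.cor <- max(abs(cor(interaction.ab, df)))
keep <- max.cor <= cor.threshold
```

With a larger importance.threshold, fewer predictors pass the cutoff, so fewer pairs are generated and screened.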
A list with seven slots:
screening: Data frame with selection scores of all the interactions considered.
selected: Data frame with selection scores of the selected interactions.
df: Data frame with the computed interactions.
plot: List of plots of the selected interactions versus the response variable. The output list can be plotted all at once with patchwork::wrap_plots(p) or cowplot::plot_grid(plotlist = p), or one by one by extracting each plot from the list.
data: Data frame with the response variable, the predictors, and the selected interactions, ready to be used as data argument in the package functions.
dependent.variable.name: Character, name of the response.
predictor.variable.names: Character vector with the names of the predictors and the selected interactions.
Other preprocessing:
auto_cor(),
auto_vif(),
case_weights(),
default_distance_thresholds(),
double_center_distance_matrix(),
is_binary(),
make_spatial_fold(),
make_spatial_folds(),
weights_from_distance_matrix()
if (interactive()) {

  data(
    plants_df,
    plants_response,
    plants_predictors,
    plants_xy,
    plants_rf
  )

  #get five most important predictors from plants_rf to speed-up example
  predictors <- get_importance(plants_rf)[1:5, "variable"]

  #subset to speed-up example
  idx <- 1:30
  plants_df <- plants_df[idx, ]
  plants_xy <- plants_xy[idx, ]

  #data subsetted to speed-up example runtime
  y <- the_feature_engineer(
    data = plants_df,
    dependent.variable.name = plants_response,
    predictor.variable.names = predictors,
    xy = plants_xy,
    repetitions = 5,
    n.cores = 1,
    ranger.arguments = list(
      num.trees = 30
    ),
    verbose = TRUE
  )

  #all tested interactions
  y$screening

  #selected interaction (same as above in this case)
  y$selected

  #new column added to data
  head(y$data[, y$selected$interaction.name])

}
Resamples a set of points with x and y coordinates to impose a minimum distance among nearby points.
thinning(xy, minimum.distance = NULL)
xy |
A data frame with columns named "x" and "y" representing geographic coordinates. |
minimum.distance |
Numeric, minimum distance to be set between nearby points, in the same units as the coordinates of xy. |
Generally used to remove redundant points that could produce pseudo-replication, and to limit sampling bias by disaggregating clusters of points.
A data frame with the same columns as xy with points separated by the defined minimum distance.
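The idea of imposing a minimum distance can be sketched in base R with a greedy filter. This is only an illustration under stated assumptions: thin_points() is a hypothetical helper, not the algorithm used by thinning(), and it keeps points in row order.

```r
# Hypothetical helper: keep a point only if it is at least
# minimum.distance away from every point already kept.
thin_points <- function(xy, minimum.distance) {
  d <- as.matrix(dist(xy[, c("x", "y")]))
  keep <- integer(0)
  for (i in seq_len(nrow(xy))) {
    if (length(keep) == 0 || all(d[i, keep] >= minimum.distance)) {
      keep <- c(keep, i)
    }
  }
  xy[keep, , drop = FALSE]
}

# Toy coordinates: two tight clusters and one isolated point.
xy <- data.frame(x = c(0, 1, 5, 6, 20), y = 0)
xy.thinned <- thin_points(xy, minimum.distance = 3)
xy.thinned
```

In this toy case, one point survives from each cluster and the isolated point is retained, so no two remaining points are closer than the minimum distance.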
Other utilities:
.vif_to_df(),
auc(),
beowulf_cluster(),
objects_size(),
optimization_function(),
prepare_importance_spatial(),
rescale_vector(),
root_mean_squared_error(),
setup_parallel_execution(),
standard_error(),
statistical_mode(),
thinning_til_n()
data(plants_xy)

y <- thinning(
  xy = plants_xy,
  minimum.distance = 10
)

if (interactive()) {

  plot(
    plants_xy[, c("x", "y")],
    col = "blue",
    pch = 15
  )

  points(
    y[, c("x", "y")],
    col = "red",
    pch = 15
  )

}
Resamples a set of points with x and y coordinates by increasing the distance step by step until a given sample size is obtained.
thinning_til_n(
  xy,
  n = 30,
  distance.step = NULL
)
xy |
A data frame with columns named "x" and "y" representing geographic coordinates. Default: |
n |
Integer, number of samples to obtain. Must be lower than |
distance.step |
Numeric, distance step used during the thinning iterations. If |
A data frame with the same columns as xy and a number of rows close to n.
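The step-by-step increase of the distance can be sketched as a loop around a simple thinning routine. Both thin_points() and thin_until_n() are hypothetical helpers written for this illustration, not the package's code, and distance.step must be supplied explicitly here.

```r
# Hypothetical greedy thinning helper (keeps points in row order).
thin_points <- function(xy, minimum.distance) {
  d <- as.matrix(dist(xy[, c("x", "y")]))
  keep <- integer(0)
  for (i in seq_len(nrow(xy))) {
    if (length(keep) == 0 || all(d[i, keep] >= minimum.distance)) {
      keep <- c(keep, i)
    }
  }
  xy[keep, , drop = FALSE]
}

# Increase the minimum distance by distance.step until
# no more than n points remain.
thin_until_n <- function(xy, n, distance.step) {
  minimum.distance <- distance.step
  thinned <- thin_points(xy, minimum.distance)
  while (nrow(thinned) > n) {
    minimum.distance <- minimum.distance + distance.step
    thinned <- thin_points(xy, minimum.distance)
  }
  thinned
}

# Twenty points on a line, one unit apart.
xy <- data.frame(x = 0:19, y = 0)
xy.thinned <- thin_until_n(xy, n = 5, distance.step = 1)
xy.thinned
```

Because the distance grows in discrete steps, the final sample size can fall below n rather than match it exactly, which is why the result is described as close to n. A smaller step gives finer control at the cost of more iterations.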
Other utilities:
.vif_to_df(),
auc(),
beowulf_cluster(),
objects_size(),
optimization_function(),
prepare_importance_spatial(),
rescale_vector(),
root_mean_squared_error(),
setup_parallel_execution(),
standard_error(),
statistical_mode(),
thinning()
data(plants_xy)

y <- thinning_til_n(
  xy = plants_xy,
  n = 10
)

if (interactive()) {

  plot(
    plants_xy[, c("x", "y")],
    col = "blue",
    pch = 15
  )

  points(
    y[, c("x", "y")],
    col = "red",
    pch = 15,
    cex = 1.5
  )

}
Transforms a distance matrix into weights (1/distance.matrix) normalized by the row sums. Used to compute Moran's I values and Moran's Eigenvector Maps. A threshold can be applied to the distance matrix before computing the weights.
weights_from_distance_matrix(
  distance.matrix = NULL,
  distance.threshold = 0
)
distance.matrix |
Distance matrix. Default: |
distance.threshold |
Numeric, positive, in the range of values of |
A weighted distance matrix.
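The transformation described above (inverse distances normalized by row sums) can be sketched in base R. weights_sketch() is a hypothetical function, and the way it handles the threshold (pairs at or below the threshold receive a weight of zero) is an assumption, not a statement about the package's behavior.

```r
# Hypothetical sketch of row-normalized inverse-distance weights.
weights_sketch <- function(distance.matrix, distance.threshold = 0) {
  w <- 1 / distance.matrix
  # Assumption: pairs at or below the threshold get zero weight.
  w[distance.matrix <= distance.threshold] <- 0
  # A point carries no weight on itself.
  diag(w) <- 0
  # Normalize each row by its sum, guarding against empty rows.
  row.sums <- rowSums(w)
  row.sums[row.sums == 0] <- 1
  w / row.sums
}

# Three points on a line at positions 0, 1, and 3.
d <- as.matrix(dist(c(0, 1, 3)))
w <- weights_sketch(d)
w
```

For the first point, the raw weights are 1 and 1/3, which normalize to 0.75 and 0.25, so nearer neighbors receive proportionally larger weights and each row sums to one.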
Other preprocessing:
auto_cor(),
auto_vif(),
case_weights(),
default_distance_thresholds(),
double_center_distance_matrix(),
is_binary(),
make_spatial_fold(),
make_spatial_folds(),
the_feature_engineer()
data(plants_distance)

y <- weights_from_distance_matrix(
  distance.matrix = plants_distance
)

y[1:5, 1:5]