| Title: | Quantifying Ecological Memory in Palaeoecological Datasets and Other Long Time-Series |
|---|---|
| Description: | Quantifies ecological memory in long time-series using Random Forest models ('Benito', 'Gil-Romera', and 'Birks' 2019 <doi:10.1111/ecog.04772>) fitted with 'ranger' (Wright and Ziegler 2017 <doi:10.18637/jss.v077.i01>). Ecological memory is assessed by modeling a response variable as a function of lagged predictors, distinguishing endogenous memory (lagged response) from exogenous memory (lagged environmental drivers). Designed for palaeoecological datasets and simulated pollen curves from 'virtualPollen', but applicable to any long time-series with environmental drivers and a biotic response. |
| Authors: | Blas M. Benito [aut, cre, cph] (ORCID: <https://orcid.org/0000-0001-5105-7232>) |
| Maintainer: | Blas M. Benito <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.1.0 |
| Built: | 2026-06-10 07:46:25 UTC |
| Source: | https://github.com/blasbenito/memoria |
Aligns multiple time series datasets to a common temporal resolution using LOESS interpolation and joins them into a single dataframe. This is useful when combining datasets with different sampling intervals.

    alignTimeSeries(
      datasets.list = NULL,
      time.column = NULL,
      interpolation.interval = NULL
    )

    mergePalaeoData(
      datasets.list = NULL,
      time.column = NULL,
      interpolation.interval = NULL
    )

| Argument | Description |
|---|---|
| datasets.list | list of dataframes, as in |
| time.column | character string, name of the time column of the datasets provided in |
| interpolation.interval | numeric, temporal resolution of the output data, in the same units as the time columns of the input data. Default: |
This function fits a loess model of the form y ~ x, where y is any numeric column in the input datasets and x is the column given by the time.column argument. The model is used to interpolate column y on a regular time series of intervals equal to interpolation.interval. All numeric columns in every provided dataset go through this process to generate the final data with samples separated by regular time intervals. Non-numeric columns are ignored and absent from the output dataframe.
A dataframe with every column of the initial dataset interpolated to a regular time grid of resolution defined by interpolation.interval. Column names follow the form datasetName.columnName, so the origin of columns can be tracked.
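The interpolation step described above can be sketched in base R. This is a minimal illustration, not the package's implementation: the toy data, the `span` value, and the helper name `interpolate_column()` are assumptions made for the example.

```r
# Toy irregular time series (hypothetical data, for illustration only)
set.seed(1)
age <- sort(runif(60, min = 0, max = 10))
pollen_toy <- data.frame(
  age = age,
  pinus = 50 + 10 * sin(age) + rnorm(60, sd = 1)
)

# Hypothetical helper: fit y ~ time with loess and predict on a regular grid
interpolate_column <- function(data, column, time.column, grid, span = 0.3) {
  model <- stats::loess(
    formula = stats::as.formula(paste(column, "~", time.column)),
    data = data,
    span = span # assumed smoothing span, not the package's choice
  )
  stats::predict(model, newdata = stats::setNames(data.frame(grid), time.column))
}

# Regular grid kept inside the observed range,
# because loess does not extrapolate by default
interval <- 0.2
grid <- seq(
  from = ceiling(min(pollen_toy$age) / interval) * interval,
  to = floor(max(pollen_toy$age) / interval) * interval,
  by = interval
)

aligned <- data.frame(age = grid)
# Output columns follow the datasetName.columnName pattern
aligned[["pollen.pinus"]] <- interpolate_column(pollen_toy, "pinus", "age", grid)
head(aligned)
```

Repeating the helper call for every numeric column of every dataset, and binding the results by the shared time grid, would give a table shaped like the one described above.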
Blas M. Benito <[email protected]>
Other data_preparation:
lagTimeSeries()

    # Loading data
    data(pollen)
    data(climate)

    x <- alignTimeSeries(
      datasets.list = list(
        pollen = pollen,
        climate = climate
      ),
      time.column = "age",
      interpolation.interval = 0.2
    )

A dataframe containing palaeoclimate data at 1 ky temporal resolution with the following columns:

    data(climate)

dataframe with 6 columns and 800 rows.
age: in kiloyears before present (ky BP).
temperatureAverage: average annual temperature in degrees Celsius.
rainfallAverage: average annual precipitation in millimetres per day (mm/day).
temperatureWarmestMonth: average temperature of the warmest month, in degrees Celsius.
temperatureColdestMonth: average temperature of the coldest month, in degrees Celsius.
oxigenIsotope: delta O18, global ratio of stable isotopes in the sea floor; see http://lorraine-lisiecki.com/stack.html for further details.
Blas M. Benito <[email protected]>
Other example_data:
palaeodata,
palaeodataLagged,
palaeodataMemory,
pollen
Takes the output of prepareLaggedData to fit the following model with Random Forest:

$p_{t} = p_{t-1} + \dots + p_{t-n} + d_{t} + d_{t-1} + \dots + d_{t-n} + r$

where:
$d$ is a driver (several drivers can be added).
$t$ is the time of any given value of the response $p$.
$t-1$ is the lag number 1 (in time units).
$p_{t-1} + \dots + p_{t-n}$ represents the endogenous component of ecological memory.
$d_{t-1} + \dots + d_{t-n}$ represents the exogenous component of ecological memory.
$d_{t}$ represents the concurrent effect of the driver over the response.
$r$ represents a column of random values, used to test the significance of the variable importance scores returned by Random Forest.
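A model of this kind, with one lag of the response, one driver, and a random term, can be sketched as follows. This is an illustration under stated assumptions, not the package's code: the simulated data, the column names (`p_t`, `p_lag1`, `d_t`, `d_lag1`, `r`), and the choice of permutation importance are all made up for the example, and the fit only runs if ranger is installed.

```r
set.seed(4)
n <- 300

# Toy autocorrelated driver and a response that depends on its own past
# and on the past of the driver (hypothetical simulation)
d <- as.numeric(stats::filter(rnorm(n), 0.7, method = "recursive"))
p <- numeric(n)
for (t in 2:n) {
  p[t] <- 0.6 * p[t - 1] + 0.5 * d[t - 1] + rnorm(1, sd = 0.3)
}

# Rows are ordered from oldest to newest, so dropping the first value
# gives time t and dropping the last value gives time t - 1
model.data <- data.frame(
  p_t = p[-1],     # response at time t
  p_lag1 = p[-n],  # endogenous component (lag 1)
  d_t = d[-1],     # concurrent effect
  d_lag1 = d[-n],  # exogenous component (lag 1)
  r = rnorm(n - 1) # random term used as benchmark
)

if (requireNamespace("ranger", quietly = TRUE)) {
  fit <- ranger::ranger(
    dependent.variable.name = "p_t",
    data = model.data,
    importance = "permutation", # assumed importance mode for this sketch
    num.trees = 500,
    num.threads = 2,
    seed = 4
  )
  print(sort(fit$variable.importance, decreasing = TRUE))
}
```

In this toy setup the lagged response and the lagged driver would be expected to rank above the random term, which is the comparison the model is built to support.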

    computeMemory(
      lagged.data = NULL,
      response = NULL,
      drivers = NULL,
      random.mode = "autocorrelated",
      repetitions = 10,
      subset.response = "none",
      num.threads = 2
    )

| Argument | Description |
|---|---|
| lagged.data | a lagged dataset resulting from |
| response | character string, name of the response variable. Not required if lagged.data was generated with prepareLaggedData. Default: |
| drivers | a character string or character vector with variables to be used as predictors in the model. Not required if lagged.data was generated with prepareLaggedData. Important: |
| random.mode | either "none", "white.noise" or "autocorrelated". See details. Default: |
| repetitions | integer, number of random forest models to fit. Default: |
| subset.response | character string with values "up", "down" or "none", triggers the subsetting of the input dataset. "up" only models memory on cases where the response's trend is positive, "down" selects cases with negative trends, and "none" selects all cases. Default: |
| num.threads | integer, number of cores ranger can use for multithreading. Default: |
This function uses the ranger package to fit Random Forest models. Please check the help of the ranger function to better understand how Random Forest is parameterized in this package. The model explained above is fitted as many times as defined in the argument repetitions.
To test the statistical significance of the variable importance scores returned by random forest, on each repetition the model is fitted with a different r (random) term, unless random.mode = "none". If random.mode equals "autocorrelated", the random term will have a temporal autocorrelation, and if it equals "white.noise", it will be a pseudo-random sequence of numbers generated with rnorm, with no temporal autocorrelation. The importance of the random sequence in predicting the response is stored for each model run, and used as a benchmark to assess the importance of the other predictors.
Importance values of other predictors that are above the median of the importance of the random term should be interpreted as non-random, and therefore, significant.
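The two kinds of random term and the benchmark comparison can be sketched in base R. The moving-average construction of the autocorrelated term, its window length, and the importance scores below are assumptions made for illustration; they are not taken from the package.

```r
set.seed(10)
n <- 500

# White noise: pseudo-random values with no temporal autocorrelation
white.noise <- rnorm(n)

# Autocorrelated term: here, white noise smoothed with a moving average.
# The window length (20) is an assumption made for this sketch.
window <- 20
autocorrelated <- as.numeric(
  stats::filter(rnorm(n + window), rep(1 / window, window), sides = 1)
)
autocorrelated <- autocorrelated[!is.na(autocorrelated)][seq_len(n)]

# Lag-1 autocorrelation of each random term
lag1 <- function(x) stats::cor(x[-1], x[-length(x)])
c(white.noise = lag1(white.noise), autocorrelated = lag1(autocorrelated))

# Benchmarking with hypothetical importance scores across 10 repetitions
importance <- data.frame(
  predictor = rnorm(10, mean = 30, sd = 3),
  random = rnorm(10, mean = 12, sd = 3)
)

# A predictor is read as non-random when its median importance
# exceeds the median importance of the random term
median(importance$predictor) > median(importance$random)
```

An autocorrelated benchmark is the stricter of the two when the response is itself autocorrelated, since a smooth random series can pick up spurious importance that white noise would not.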
A list with 5 slots:
response: character, response variable name.
drivers: character vector, driver variable names.
memory: dataframe with six columns:
  median: numeric, median importance across repetitions of the given variable according to Random Forest.
  sd: numeric, standard deviation of the importance values of the given variable across repetitions.
  min and max: numeric, quantiles 0.05 and 0.95 of the importance values of the given variable across repetitions.
  variable: character, names of the different variables used to model ecological memory.
  lag: numeric, time lag values.
R2: vector, pseudo R-squared values obtained for the Random Forest model fitted on each repetition. Pseudo R-squared is the Pearson correlation between the observed and predicted data.
prediction: dataframe, with the same columns as the dataframe in the slot memory, with the median and confidence intervals of the predictions of all random forest models fitted.
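The summary columns of the memory slot can be reproduced from a table of per-repetition importance scores. The sketch below uses made-up scores and base R only; the way the scores are stored internally by the package is not documented here, so the layout of `importance` is an assumption.

```r
set.seed(2)
repetitions <- 10

# Hypothetical importance scores: one row per repetition, one column per term
importance <- data.frame(
  "pollen.pinus__0.2" = rnorm(repetitions, mean = 40, sd = 2),
  "climate.temperatureAverage__0" = rnorm(repetitions, mean = 25, sd = 2),
  "climate.temperatureAverage__0.2" = rnorm(repetitions, mean = 15, sd = 2),
  check.names = FALSE
)

# Split "variable__lag" column names into their two parts
parts <- strsplit(names(importance), "__", fixed = TRUE)

# One row per term, with the six columns described above
memory <- data.frame(
  median = vapply(importance, stats::median, numeric(1)),
  sd = vapply(importance, stats::sd, numeric(1)),
  min = vapply(importance, stats::quantile, numeric(1), probs = 0.05, names = FALSE),
  max = vapply(importance, stats::quantile, numeric(1), probs = 0.95, names = FALSE),
  variable = vapply(parts, `[`, character(1), 1),
  lag = as.numeric(vapply(parts, `[`, character(1), 2)),
  row.names = NULL
)
memory

# Pseudo R-squared as defined above: Pearson correlation between
# observed and predicted values (toy vectors)
observed <- rnorm(100)
predicted <- observed + rnorm(100, sd = 0.5)
stats::cor(observed, predicted)
```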
Blas M. Benito <[email protected]>
Wright, M. N. & Ziegler, A. (2017). ranger: A fast implementation of random forests for high dimensional data in C++ and R. J Stat Softw 77:1-17. doi:10.18637/jss.v077.i01.
Breiman, L. (2001). Random forests. Mach Learn, 45:5-32. doi:10.1023/A:1010933404324.
Hastie, T., Tibshirani, R., Friedman, J. (2009). The Elements of Statistical Learning. Springer, New York. 2nd edition.
plotMemory, extractMemoryFeatures
Other memoria:
extractMemoryFeatures(),
plotMemory()

    # Loading data
    data(palaeodataLagged)

    # Simplified call: response and drivers are auto-detected from attributes
    memory.output <- computeMemory(
      lagged.data = palaeodataLagged,
      random.mode = "autocorrelated",
      repetitions = 10
    )

    str(memory.output)
    str(memory.output$memory)

    # Plotting output
    plotMemory(memory.output = memory.output)

runExperiment into a long table.
Takes the output of runExperiment, extracts the dataframes containing the ecological memory patterns generated by computeMemory, and binds them together into a single dataframe ready for further analyses or plotting.

    experimentToTable(experiment.output = NULL, parameters.file = NULL)

| Argument | Description |
|---|---|
| experiment.output | list, output of |
| parameters.file | dataframe of simulation parameters. Default: |
This function is used internally by plotExperiment, but it is also available to users in case they want to do other kinds of analyses or plots with the data.
A dataframe.
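The binding step can be pictured with a small base R sketch. The two memory tables, their values, and the `label` column are hypothetical; the real output of runExperiment has a richer structure, so this only shows the general idea of stacking per-taxon tables into one long table.

```r
# Two hypothetical memory tables, one per simulated taxon
memory.a <- data.frame(
  variable = c("Pollen", "Driver.A"),
  lag = c(20, 20),
  median = c(35, 12)
)
memory.b <- data.frame(
  variable = c("Pollen", "Driver.A"),
  lag = c(20, 20),
  median = c(18, 27)
)
outputs <- list(taxon.1 = memory.a, taxon.2 = memory.b)

# Add a label column to each table, then stack them into one long table
long <- do.call(
  rbind,
  lapply(names(outputs), function(label) {
    cbind(label = label, outputs[[label]])
  })
)
rownames(long) <- NULL
long
```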
Blas M. Benito <[email protected]>
Other virtualPollen:
plotExperiment(),
runExperiment()
computeMemory.
Computes the following features of the ecological memory patterns returned by computeMemory:
memory strength: maximum difference in relative importance between each component (endogenous, exogenous, and concurrent) and the median of the random component. This is computed for exogenous, endogenous, and concurrent effect.
memory length: proportion of lags over which the importance of a memory component is above the median of the random component. This is only computed for endogenous and exogenous memory.
dominance: proportion of the lags above the median of the random term over which a memory component has a higher importance than the other component. This is only computed for endogenous and exogenous memory.
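The three features can be worked through by hand on a toy memory pattern. The importance values below are invented, and two details are assumptions of this sketch rather than documented behaviour: the random benchmark is taken as a single median pooled over lags, and dominance is evaluated over the lags where the component itself exceeds that benchmark.

```r
# Toy memory pattern: hypothetical median importance per lag
lags <- c(0.2, 0.4, 0.6, 0.8, 1.0)
endogenous <- c(40, 30, 20, 12, 8)
exogenous <- c(15, 18, 22, 14, 9)
random <- c(10, 11, 9, 10, 10)

# Benchmark: median importance of the random term (assumed pooled over lags)
benchmark <- stats::median(random)

# Strength: largest gap between a component and the benchmark
strength.endogenous <- max(endogenous) - benchmark
strength.exogenous <- max(exogenous) - benchmark

# Length: share of lags where a component exceeds the benchmark
length.endogenous <- mean(endogenous > benchmark)
length.exogenous <- mean(exogenous > benchmark)

# Dominance: among the lags where a component exceeds the benchmark,
# share of lags where it also exceeds the other component
endo.above <- endogenous > benchmark
exo.above <- exogenous > benchmark
dominance.endogenous <- mean(endogenous[endo.above] > exogenous[endo.above])
dominance.exogenous <- mean(exogenous[exo.above] > endogenous[exo.above])

data.frame(
  strength.endogenous, strength.exogenous,
  length.endogenous, length.exogenous,
  dominance.endogenous, dominance.exogenous
)
```

Here the endogenous component is stronger (30 against 12 importance units above the benchmark), both components are above the benchmark on four of five lags, and each dominates on half of those lags.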

    extractMemoryFeatures(
      memory.pattern = NULL,
      exogenous.component = NULL,
      endogenous.component = NULL,
      scale.strength = TRUE
    )

| Argument | Description |
|---|---|
| memory.pattern | either a list resulting from |
| exogenous.component | character string or character vector, name of the variable or variables defining the exogenous component. When |
| endogenous.component | character string, name of the variable defining the endogenous component. When |
| scale.strength | boolean. If |
Warning: this function is designed for models fitted by computeMemory with a single exogenous component (driver). If more than one driver is provided through the argument exogenous.component, the maximum importance score across all exogenous variables is used. In other words, the importance of exogenous variables is not additive.
A dataframe with 8 columns and 1 row if memory.pattern is the output of computeMemory and 13 columns and as many rows as taxa are in the input if it is the output of experimentToTable. The columns are:
label: character string to identify the taxon. It either inherits its values from experimentToTable, or sets the default ID as "1".
strength.endogenous: numeric, difference between the maximum importance of the endogenous component at any lag and the median of the random component (see details in computeMemory). When scale.strength = TRUE (default), values are scaled to [0, 1]; otherwise values are in importance units (percentage of increment in MSE).
strength.exogenous: numeric, same as above, but for the exogenous component.
strength.concurrent: numeric, same as above, but for the concurrent component (driver at lag 0).
length.endogenous: numeric in the range [0, 1], proportion of lags over which the importance of the endogenous memory component is above the median of the random component.
length.exogenous: numeric in the range [0, 1], same as above but for the exogenous memory component.
dominance.endogenous: numeric in the range [0, 1], proportion of the lags above the median of the random term over which the endogenous memory component has a higher importance than the exogenous component.
dominance.exogenous: numeric in the range [0, 1], same as above, but for the exogenous component relative to the endogenous one.
maximum.age: numeric, trait of the given taxon. Like every column after it, this column is only provided if memory.pattern is the output of experimentToTable.
fecundity: numeric, trait of the given taxon.
niche.mean: numeric, trait of the given taxon.
niche.sd: numeric, trait of the given taxon.
Blas M. Benito <[email protected]>
Other memoria:
computeMemory(),
plotMemory()

    # Loading example data (output of computeMemory)
    data(palaeodataMemory)

    # Simplified call: components are auto-detected from computeMemory output
    memory.features <- extractMemoryFeatures(
      memory.pattern = palaeodataMemory
    )

    # Explicit call: still supported for backwards compatibility
    memory.features <- extractMemoryFeatures(
      memory.pattern = palaeodataMemory,
      exogenous.component = c(
        "climate.temperatureAverage",
        "climate.rainfallAverage"
      ),
      endogenous.component = "pollen.pinus"
    )

Takes a multivariate time series and creates time-lagged columns for modeling. This generates one new column per lag and variable, enabling analysis of how past values influence current observations.

    lagTimeSeries(
      input.data = NULL,
      response = NULL,
      drivers = NULL,
      time = NULL,
      oldest.sample = "first",
      lags = NULL,
      time.zoom = NULL,
      scale = FALSE
    )

    prepareLaggedData(
      input.data = NULL,
      response = NULL,
      drivers = NULL,
      time = NULL,
      oldest.sample = "first",
      lags = NULL,
      time.zoom = NULL,
      scale = FALSE
    )

| Argument | Description |
|---|---|
| input.data | a dataframe with one time series per column. Default: |
| response | character string, name of the numeric column to be used as response in the model. Default: |
| drivers | character vector, names of the numeric columns to be used as predictors in the model. Default: |
| time | character vector, name of the numeric column with the time. Default: |
| oldest.sample | character string, either "first" or "last". When "first", the first row is taken as the oldest case of the time series and the last row is taken as the newest case, so ecological memory flows from the first to the last row of |
| lags | numeric vector, lags to be used in the equation, in the same units as |
| time.zoom | numeric vector of two values from the range of the |
| scale | boolean, if TRUE, applies the |
The function interprets the time column as an index representing the temporal position of each sample. It uses the lag function from the zoo package to shift columns by the specified lags, generating one new column per lag and variable.
A dataframe with columns representing time-delayed values of the drivers and the response. Column names have the lag number as a suffix. The dataframe has the attributes response and drivers, which are later used by computeMemory.
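The lagging step can be illustrated without any package. The sketch below is a base R stand-in for the zoo-based approach mentioned above: the toy data, the helper `shift_older()`, and the decision to drop incomplete rows are assumptions made for the example. Rows run from newest to oldest, as with oldest.sample = "last".

```r
# Toy regular time series, newest sample first and oldest sample last
# (hypothetical values, chosen so the shifts are easy to follow)
input <- data.frame(
  age = seq(0, 2, by = 0.2),
  pollen.pinus = 1:11,
  climate.temperatureAverage = 11:21
)

# Hypothetical helper: for each row, the value found `steps` rows below,
# which is the older sample when the oldest sample is last
shift_older <- function(x, steps) {
  if (steps == 0) {
    return(x)
  }
  c(x[-seq_len(steps)], rep(NA, steps))
}

interval <- 0.2
lags <- c(0, 0.2, 0.4, 0.6, 0.8, 1)

# Start from a dataframe with the right number of rows and no columns
lagged <- input[, 0]
for (variable in c("pollen.pinus", "climate.temperatureAverage")) {
  for (lag in lags) {
    steps <- round(lag / interval)
    lagged[[paste0(variable, "__", lag)]] <- shift_older(input[[variable]], steps)
  }
}
lagged$time <- input$age

# Rows without a complete set of lags are dropped
lagged <- stats::na.omit(lagged)

# Attributes that a downstream function could read
attr(lagged, "response") <- "pollen.pinus"
attr(lagged, "drivers") <- "climate.temperatureAverage"

names(lagged)
```

With six lags (including lag 0) and two variables, the result has twelve lagged columns plus the time column, and the five oldest rows are lost because they have no older samples to look back on.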
Blas M. Benito <[email protected]>
Other data_preparation:
alignTimeSeries()

    # Loading data
    data(palaeodata)

    # Adding lags
    lagged.data <- lagTimeSeries(
      input.data = palaeodata,
      response = "pollen.pinus",
      drivers = c("climate.temperatureAverage", "climate.rainfallAverage"),
      time = "age",
      oldest.sample = "last",
      lags = seq(0.2, 1, by = 0.2)
    )

    str(lagged.data)

    # Check attributes (used by computeMemory)
    attributes(lagged.data)

A dataframe with a regular time grid of 0.2 ky resolution resulting from applying mergePalaeoData to the datasets climate and pollen:

    data(palaeodata)

dataframe with 10 columns and 7986 rows.
age: in ky before present (ky BP).
pollen.pinus: pollen percentages of Pinus.
pollen.quercus: pollen percentages of Quercus.
pollen.poaceae: pollen percentages of Poaceae.
pollen.artemisia: pollen percentages of Artemisia.
climate.temperatureAverage: average annual temperature in degrees Celsius.
climate.rainfallAverage: average annual precipitation in millimetres per day (mm/day).
climate.temperatureWarmestMonth: average temperature of the warmest month, in degrees Celsius.
climate.temperatureColdestMonth: average temperature of the coldest month, in degrees Celsius.
climate.oxigenIsotope: delta O18, global ratio of stable isotopes in the sea floor; see http://lorraine-lisiecki.com/stack.html for further details.
Blas M. Benito <[email protected]>
Other example_data:
climate,
palaeodataLagged,
palaeodataMemory,
pollen
prepareLaggedData.
A dataframe resulting from the application of prepareLaggedData to the dataset palaeodata. The dataframe columns are named using the pattern VariableName__LagValue:

    data(palaeodataLagged)

dataframe with 19 columns and 3988 rows.
pollen.pinus__0: numeric, values of the response variable (pollen counts of Pinus) at lag 0 (current time). This column is used as the response variable by computeMemory.
pollen.pinus__0.2-1: numeric, time-delayed values of the response for lags 0.2 to 1 (in ky). These columns represent the endogenous ecological memory.
climate.temperatureAverage__0: numeric, temperature values at lag 0 (concurrent effect).
climate.rainfallAverage__0: numeric, rainfall values at lag 0 (concurrent effect).
climate.temperatureAverage__0.2-1: numeric, time-delayed temperature values for lags 0.2 to 1 (exogenous memory).
climate.rainfallAverage__0.2-1: numeric, time-delayed rainfall values for lags 0.2 to 1 (exogenous memory).
time: numeric, the time/age column.
The dataframe has attributes response and drivers that are automatically used by computeMemory.
Blas M. Benito <[email protected]>
Other example_data:
climate,
palaeodata,
palaeodataMemory,
pollen
computeMemory
List containing the output of computeMemory applied to palaeodataLagged. Its slots are:

    data(palaeodataMemory)

List with five slots.
response: character, response variable name.
drivers: character vector, driver variable names.
memory: dataframe with six columns:
  variable: character, names of the different variables used to model ecological memory.
  lag: numeric, time lag values.
  median: numeric, median importance across repetitions of the given variable according to Random Forest.
  sd: numeric, standard deviation of the importance values of the given variable across repetitions.
  min and max: numeric, quantiles 0.05 and 0.95 of the importance values of the given variable across repetitions.
R2: vector, pseudo R-squared values obtained for the Random Forest model fitted on each repetition. Pseudo R-squared is the Pearson correlation between the observed and predicted data.
prediction: dataframe, with the same columns as the dataframe in the slot memory, with the median and confidence intervals of the predictions of all random forest models fitted.
Blas M. Benito <[email protected]>
Other example_data:
climate,
palaeodata,
palaeodataLagged,
pollen
runExperiment.
Takes the output of runExperiment, and generates plots of ecological memory patterns for a large number of simulated pollen curves.

    plotExperiment(
      experiment.output = NULL,
      parameters.file = NULL,
      ribbon = FALSE
    )

| Argument | Description |
|---|---|
| experiment.output | list, output of |
| parameters.file | dataframe of simulation parameters. Default: |
| ribbon | logical, switches plotting of confidence intervals on (TRUE) and off (FALSE). Default: |
A ggplot2 object.
Blas M. Benito <[email protected]>
Other virtualPollen:
experimentToTable(),
runExperiment()
computeMemory
Plots the ecological memory pattern yielded by computeMemory.

    plotMemory(
      memory.output = NULL,
      ribbon = FALSE,
      legend.position = "right",
      ...
    )

| Argument | Description |
|---|---|
| memory.output | list, output of |
| ribbon | logical, switches plotting of confidence intervals on (TRUE) and off (FALSE). Default: |
| legend.position | character, position of the legend. Default: |
| ... | additional arguments for internal use. |
A ggplot object.
Blas M. Benito <[email protected]>
Other memoria:
computeMemory(),
extractMemoryFeatures()

    # Loading data
    data(palaeodataMemory)

    # Plotting memory pattern
    plotMemory(memory.output = palaeodataMemory)

    # With confidence ribbon
    plotMemory(memory.output = palaeodataMemory, ribbon = TRUE)

A dataframe with the following columns:

    data(pollen)

dataframe with 5 columns and 639 rows.
age: in kiloyears before present (ky BP).
pinus: pollen counts of Pinus.
quercus: pollen counts of Quercus.
poaceae: pollen counts of Poaceae.
artemisia: pollen counts of Artemisia.
Blas M. Benito <[email protected]>
Other example_data:
climate,
palaeodata,
palaeodataLagged,
palaeodataMemory
virtualPollen package.
Applies computeMemory to assess ecological memory on a large set of virtual pollen curves.

    runExperiment(
      simulations.file = NULL,
      selected.rows = NULL,
      selected.columns = NULL,
      parameters.file = NULL,
      parameters.names = NULL,
      driver.column = NULL,
      response.column = "Pollen",
      subset.response = "none",
      time.column = "Time",
      time.zoom = NULL,
      lags = NULL,
      repetitions = 10
    )

| Argument | Description |
|---|---|
| simulations.file | List of dataframes produced by |
| selected.rows | Numeric vector indicating which virtual taxa (list elements) from |
| selected.columns | Numeric vector indicating which sampling schemes (columns) from |
| parameters.file | Dataframe of simulation parameters produced by |
| parameters.names | Character vector of column names from |
| driver.column | Character vector of column names representing environmental drivers in the simulation dataframes. Common choices: |
| response.column | Character string naming the response variable column in the simulation dataframes. Use |
| subset.response | character string, one of "up", "down" or "none", triggers the subsetting of the input dataset. "up" only models ecological memory on cases where the response's trend is positive, "down" selects cases with negative trends, and "none" selects all cases. Default: |
| time.column | character string, name of the time/age column. Usually, "Time". Default: |
| time.zoom | numeric vector with two numbers defining the time/age extremes of the time interval of interest. Default: |
| lags | numeric vector, lags to be used in the equation, in the same units as |
| repetitions | integer, number of random forest models to fit. Default: |
A list with 2 slots:
names: matrix of character strings, with as many rows and columns as simulations.file. Each cell holds a simulation name to be used afterwards, when plotting the results of the ecological memory analysis.
output: a list with as many rows and columns as simulations.file. Each slot holds an output of computeMemory.
memory: dataframe with five columns:
  Variable: character, names and lags of the different variables used to model ecological memory.
  median: numeric, median importance across repetitions of the given Variable according to Random Forest.
  sd: numeric, standard deviation of the importance values of the given Variable across repetitions.
  min and max: numeric, quantiles 0.05 and 0.95 of the importance values of the given Variable across repetitions.
R2: vector, pseudo R-squared values obtained for the Random Forest model fitted on each repetition. Pseudo R-squared is the Pearson correlation between the observed and predicted data.
prediction: dataframe, with the same columns as the dataframe in the slot memory, with the median and confidence intervals of the predictions of all random forest models fitted.
multicollinearity: multicollinearity analysis on the input data performed with vif_df. A VIF value higher than 5 indicates that the given variable is highly correlated with other variables.
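The variance inflation factor behind this check can be computed by hand. The sketch below is a base R stand-in built on the textbook definition, VIF = 1 / (1 - R²), and does not call vif_df; the toy predictors and the helper name `vif_one()` are assumptions made for the example.

```r
set.seed(3)
n <- 200

# Toy predictors: `c` is nearly a copy of `a`, `b` is independent
predictors <- data.frame(a = rnorm(n), b = rnorm(n))
predictors$c <- predictors$a + rnorm(n, sd = 0.1)

# Hypothetical helper: VIF of one column, from the R-squared of a linear
# model that predicts that column from all the others
vif_one <- function(column, data) {
  others <- setdiff(names(data), column)
  fit <- stats::lm(stats::reformulate(others, response = column), data = data)
  1 / (1 - summary(fit)$r.squared)
}

vif <- vapply(names(predictors), vif_one, numeric(1), data = predictors)
vif

# Variables above the threshold of 5 mentioned above
names(vif)[vif > 5]
```

In this toy case `a` and `c` are flagged because each is almost fully explained by the other, while `b` stays close to the minimum value of 1.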
Blas M. Benito <[email protected]>
Other virtualPollen:
experimentToTable(),
plotExperiment()