---
title: "Feature Selection with Blocked Cross-Validation"
author: "Gilles Colling"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Feature Selection with Blocked Cross-Validation}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5,
  dev = "svglite",
  fig.ext = "svg"
)
library(BORG)
library(ggplot2)

theme_borg <- theme_minimal(base_size = 12) +
  theme(
    panel.background = element_rect(fill = "transparent", colour = NA),
    plot.background = element_rect(fill = "transparent", colour = NA),
    legend.background = element_rect(fill = "transparent", colour = NA)
  )
theme_set(theme_borg)
```

# Introduction

Feature selection is a standard step in predictive modeling. Remove irrelevant
predictors and the model becomes simpler, faster to fit, and easier to interpret.
In most textbook treatments the procedure is straightforward: wrap your selection
criterion inside cross-validation, pick the subset that maximizes held-out
performance, and move on. The problem is that textbook cross-validation assumes
independence between observations. Spatial, temporal, and grouped data violate
that assumption, and the consequences for feature selection are worse than you
might expect.

When observations are autocorrelated, random CV allows nearby points to leak
information between training and test folds. This inflates the apparent
performance of every candidate model. But the inflation is not uniform across
variables: features that happen to correlate with the spatial or temporal
structure of the data receive a disproportionate boost. The result is that random
CV selects variables that track spatial pattern rather than the true
signal-to-noise relationship. A model built on those variables will look
excellent in cross-validation and fail on genuinely new locations or time periods.

BORG addresses this by routing every evaluation through blocked CV. When
coordinates, time indices, or group labels are present, the package generates
folds that separate dependent observations into distinct blocks. Feature
selection then operates on performance estimates that reflect true
generalization capacity rather than within-block interpolation. This vignette
covers the full suite of feature-level tools: block-permutation importance for
post-hoc variable ranking, forward feature selection for building parsimonious
models, best subset selection for exhaustive search when the number of candidates
is small, and spatial SHAP values for observation-level attribution.

The distinction between "spatially aware" and "standard" feature selection is
not a minor technical footnote. Published benchmarks (Roberts et al., 2017;
Meyer and Pebesma, 2021) show that random CV can inflate apparent R-squared by
0.1 to 0.4 units on spatially autocorrelated datasets. When feature selection is
embedded inside that inflated evaluation loop, the selected variables encode the
inflation rather than the signal. The downstream model inherits those bad
choices and fails silently in deployment, producing predictions that look
correct on paper but miss the mark on genuinely new locations.

All examples use synthetic data generated with `borg_simulate()`, where the true
data-generating process is known. This lets us verify that each method recovers
the correct signal variables and rejects noise.

# Simulated data

Throughout this vignette we use a single spatial dataset with five predictors, a
spatially autocorrelated response, and known ground truth. The simulation places
300 observations on a two-dimensional grid with moderate spatial autocorrelation
(rho = 0.5) and a signal-to-noise ratio of 0.3.

```{r simulate}
sim <- borg_simulate(
  n = 300,
  type = "spatial",
  n_predictors = 5,
  signal_strength = 0.3,
  autocorrelation = 0.5,
  seed = 42
)
d <- sim$data
```

The spatial simulation creates coordinate columns `x` and `y`, predictors `x1`
through `x5`, and a response column. Because R's `data.frame()` renames the
response `y` to `y.1` to avoid colliding with the coordinate column `y`, we
refer to the target as `"y.1"` in all subsequent code.

```{r inspect-data}
names(d)
cat("True R-squared:", round(sim$true_r2, 4), "\n")
cat("Observations:", nrow(d), "\n")
```

We also generate blocked CV folds up front. These folds will be reused across
multiple feature selection methods, keeping comparisons consistent.

```{r create-folds}
folds <- borg_cv(d, coords = c("x", "y"), target = "y.1", v = 5, verbose = FALSE)
folds
```

# Block-permutation importance

## The problem with standard permutation importance

Standard permutation importance (Breiman, 2001) measures a variable's
contribution by shuffling its values across all observations and recording the
increase in prediction error. The logic is sound for independent data: shuffling
breaks the association between the variable and the response, and the resulting
performance drop measures how much the model relies on that variable.

With autocorrelated data, row-level shuffling does something subtler and more
damaging. When you permute a spatially structured variable by randomly
reassigning values across all locations, you destroy both the
variable-to-response association *and* the spatial structure of the variable
itself. The permuted variable no longer resembles any realistic spatial field.
This creates an artificially large contrast between the original (spatially
smooth) and permuted (spatially random) versions, inflating importance for any
variable that has spatial structure, regardless of whether that structure carries
predictive information. Noise variables that happen to be spatially
autocorrelated will appear important; truly predictive variables that lack
spatial pattern will be undervalued.

## How block-permutation importance works

Block-permutation importance (Bieganek et al., 2024; see also the approach used
in the CAST package) solves this by permuting values across spatial blocks rather
than across individual rows. BORG's implementation works as follows:

1. Partition observations into spatial blocks using k-means clustering on
   coordinates.
2. For each candidate variable, shuffle the block assignments rather than
   individual values. Each observation receives the variable value from the
   corresponding position in a different block.
3. Predict with the modified dataset and record the change in error.
4. Repeat the permutation multiple times to estimate variability.

The block permutation preserves within-block spatial structure while breaking
between-block associations. This gives a cleaner estimate of whether a variable
carries predictive information beyond its spatial pattern.

The `n_blocks` parameter controls the granularity of the permutation. More
blocks produce finer-grained permutations that are closer to row-level shuffling;
fewer blocks preserve more of the original spatial structure. The default of 10
is a reasonable starting point for most datasets. If the study area is large
with distinct environmental zones, increase `n_blocks` to 15 or 20. If the area
is small or the autocorrelation range is long relative to the study extent,
reduce `n_blocks` to 5. The `n_rep` parameter controls how many times the
permutation is repeated for each variable; more repeats produce more stable
importance estimates at the cost of additional computation. Ten repeats is
usually sufficient for a rough ranking; increase to 25 or 50 if you need
precise standard errors on the importance values.

## Worked example

We first fit a simple linear model using all five predictors, then compute
block-permutation importance.

```{r fit-model}
model <- lm(y.1 ~ x1 + x2 + x3 + x4 + x5, data = d)
```

```{r importance}
imp <- borg_importance(
  model = model,
  data = d,
  target = "y.1",
  coords = c("x", "y"),
  n_blocks = 10,
  n_rep = 10,
  metric = "rmse",
  seed = 42
)
imp
```

What you get back is a data frame with columns `variable`, `importance` (mean
increase in RMSE after permutation), `importance_sd` (standard deviation across
repeats), and `rank`. Variables with higher importance contribute more to the model's
predictions after accounting for spatial structure.

```{r importance-plot, fig.height = 4}
autoplot(imp)
```

The lollipop plot shows each variable's importance with error bars indicating
variability across permutation repeats. Variables whose error bars cross zero
contribute negligibly. The `max_vars` argument controls how many variables are
shown (default 20), which matters when working with high-dimensional data.

## Comparing block vs. row permutation

To see the difference that blocking makes, we can run the same analysis without
coordinates, which falls back to standard row permutation.

```{r importance-compare}
imp_row <- borg_importance(
  model = model,
  data = d,
  target = "y.1",
  coords = NULL,
  n_rep = 10,
  metric = "rmse",
  seed = 42
)

cat("Block-permutation ranking:", paste(imp$variable, collapse = " > "), "\n")
cat("Row-permutation ranking:  ", paste(imp_row$variable, collapse = " > "), "\n")
```

With autocorrelated data, the two rankings will often disagree. Row permutation
tends to inflate the importance of variables whose values are spatially smooth,
because replacing a smooth surface with random noise creates a larger contrast
than replacing a rough surface. Block permutation avoids this bias by keeping the
within-block spatial texture intact. The block-permutation ranking is the one you
should trust for decisions about which variables to retain.

# Forward feature selection

## Why greedy search with blocked CV

Forward feature selection (FFS) builds a model incrementally: starting from zero
variables, it evaluates every candidate addition, picks the one that most
improves the CV metric, and repeats until no further improvement is possible.
This greedy strategy avoids the combinatorial explosion of exhaustive search
while producing a parsimonious model. It is the same strategy used by the CAST
package's `ffs()` function, which introduced the idea of coupling forward
selection with spatial CV to the ecological modeling community (Meyer et al.,
2018).

The critical difference between BORG's implementation and standard stepwise
selection is the evaluation backbone. Each candidate model is scored using
blocked cross-validation, not random CV or in-sample criteria like AIC. This
means the selection procedure penalizes variables that exploit spatial leakage,
not just variables that fail to reduce residual variance.

`borg_forward_selection()` accepts a pre-built fold object from `borg_cv()`, a
custom fitting function, and a metric to optimize. It stops adding variables when
the CV metric no longer improves, or when all candidates have been added.

## Step-by-step example

```{r ffs}
ffs <- borg_forward_selection(
  data = d,
  target = "y.1",
  predictors = c("x1", "x2", "x3", "x4", "x5"),
  folds = folds,
  metric = "rmse",
  fit_fun = stats::lm,
  verbose = FALSE
)
ffs
```

This returns an object of class `borg_ffs` containing:

- `selected`: the variables chosen, in the order they were added.
- `history`: a data frame recording each step's metric value and which variable
  was added.
- `best_metric`: the best CV metric achieved by the final selected set.
- `all_vars`: all candidate variables that were considered.

```{r ffs-history}
ffs$history
```

The history table shows how the RMSE changed as variables were added. Each row
records the step number, the variable added at that step, the mean CV RMSE, and
the total number of variables in the model. The selection stops when adding
another variable fails to improve the metric, preventing overfitting to noise
variables.

## Using a custom model

The `fit_fun` argument accepts any function with the signature
`function(formula, data)` that returns an object supporting `predict()`. This
makes FFS compatible with any modeling framework that follows R's standard
formula interface.

```{r ffs-custom, eval = requireNamespace("ranger", quietly = TRUE)}
# ranger's predict() uses `data` not `newdata`, and returns a list.
# Wrap it so that stats::predict(object, newdata = ...) returns a vector.
ranger_fit <- function(formula, data) {
  fit <- ranger::ranger(formula, data = data, num.trees = 200)
  training_data <- data  # needed for predict method
  structure(
    list(fit = fit, training_data = training_data),
    class = "ranger_lm_compat"
  )
}

# Register the S3 method so stats::predict() dispatches correctly
predict.ranger_lm_compat <- function(object, newdata = NULL, ...) {
  if (is.null(newdata)) newdata <- object$training_data
  ranger::predictions(stats::predict(object$fit, data = newdata))
}
registerS3method("predict", "ranger_lm_compat", predict.ranger_lm_compat)

ffs_rf <- borg_forward_selection(
  data = d,
  target = "y.1",
  predictors = c("x1", "x2", "x3", "x4", "x5"),
  folds = folds,
  metric = "rmse",
  fit_fun = ranger_fit,
  verbose = FALSE
)
cat("RF selected:", paste(ffs_rf$selected, collapse = " + "), "\n")
cat("RF best RMSE:", round(ffs_rf$best_metric, 4), "\n")
```

Different model classes will select different variable subsets because linear
models and tree-based models interact differently with spatial structure. Run FFS
with the model class you intend to deploy whenever computational cost allows it.

## Interpreting the selection path

The selection history reveals how much each additional variable contributes to
out-of-block predictive accuracy. A steeply declining RMSE in the first few
steps, followed by flat or increasing RMSE, suggests that only the first few
variables carry real information and the rest add noise. If RMSE declines
gradually through all steps, most variables are contributing and parsimony may
not be appropriate.

Watch for cases where FFS selects a variable that block-permutation importance
ranked low. This happens when a variable's marginal contribution (what FFS
measures) differs from its importance conditional on all other variables (what
permutation importance measures). FFS asks "does this variable help when added
to what we already have?" while permutation importance asks "does this variable
matter in the full model?" The two questions have different answers whenever
collinearity is present.

Another pattern to watch for is the "diminishing returns" profile, where the
first variable produces a large improvement but each subsequent variable adds
only a tiny increment. This profile is common in spatial data where a single
environmental gradient dominates the response and the remaining variables
represent secondary gradients or noise. In these situations, a model with one or
two variables may be both more parsimonious and more transferable than a model
with all five. The blocked CV metric makes this judgment call transparent: if
adding the third variable improves the RMSE by 0.001, that improvement is
probably within the noise margin and the simpler model is preferable.

# Best subset selection

When the number of candidate predictors is small (15 or fewer), exhaustive
evaluation of all possible subsets is feasible. `borg_best_subset()` evaluates
every combination from size 1 to size p using blocked cross-validation and
returns the results ranked by metric.

```{r best-subset}
bss <- borg_best_subset(
  data = d,
  target = "y.1",
  predictors = c("x1", "x2", "x3", "x4", "x5"),
  folds = folds,
  metric = "rmse",
  verbose = FALSE
)
bss
```

Each row in the output corresponds to one evaluated subset, containing the
variable combination (as a `+`-separated string), the number of variables, the
mean CV metric, and the rank. The top row is the best subset.

```{r best-subset-top}
head(bss, 10)
```

Best subset selection is the gold standard for variable selection because it
guarantees finding the globally optimal combination, not just a greedy
approximation. However, it evaluates 2^p - 1 models (31 for p = 5, 32767 for
p = 15), each requiring a full round of blocked CV. The function enforces a hard
limit of 15 predictors to prevent accidental combinatorial explosions.

For problems with more than 15 candidate variables, use forward selection as the
primary method and permutation importance as a filter to pre-screen candidates
down to a manageable set before running best subset on the survivors.

The `max_vars` argument limits the maximum subset size, which can reduce
computation when you know that smaller models are preferred. Setting
`max_vars = 3` evaluates only single, pair, and triple combinations.

One advantage of best subset over forward selection is that it can detect
complementary variable pairs that are individually weak but strong together. FFS
would never pick the first member of such a pair because its marginal
contribution at step 1 is small. Best subset evaluates the pair directly and
can find it. This matters in ecological modeling where interaction effects
between environmental gradients are common: elevation alone may not predict
species occurrence, and precipitation alone may not either, but the combination
of both captures an environmental niche that neither captures individually.

# SHAP values

## Beyond global importance

Permutation importance and forward selection answer global questions: which
variables matter across the entire dataset? SHAP (SHapley Additive exPlanations)
values answer a local question: for a specific observation, how much did each
variable contribute to the prediction?

This distinction matters for spatial data because a variable's contribution can
vary across space. Temperature might be the dominant predictor in mountain
regions but irrelevant in lowland areas. Global importance averages over these
spatial differences; SHAP values preserve them.

## Spatial SHAP

Standard SHAP implementations approximate Shapley values by marginalizing over
background data points sampled uniformly from the dataset. For spatial data this
creates impossible feature combinations: a mountain observation might be paired
with lowland covariate values that never co-occur in reality, biasing the
contribution estimates.

BORG's `borg_shap()` addresses this by using block-conditional marginal
expectations. Instead of sampling background points from the full dataset, it
partitions observations into spatial blocks and samples only from blocks that are
spatially separated from the observation being explained. This preserves
realistic covariate relationships within each block while still breaking the
association between the focal variable and the response.

The algorithm proceeds as follows for each observation and each feature:

1. Identify the spatial block containing the observation.
2. Sample background points from other blocks.
3. Replace the focal feature's value with background values while keeping all
   other features fixed.
4. Compute the difference between the original prediction and the mean
   prediction with the replaced feature.

This approximation converges to the true Shapley value as the number of blocks
and background samples increases, while respecting spatial structure throughout.

## Worked example

```{r shap}
shap <- borg_shap(
  model = model,
  data = d,
  target = "y.1",
  coords = c("x", "y"),
  n_blocks = 10,
  n_samples = 50,
  explain_idx = 1:100,
  seed = 42
)
shap
```

`borg_shap()` returns a matrix of SHAP values (one row per explained
observation, one column per predictor), the baseline prediction (the model's
mean prediction), and a feature importance ranking based on mean absolute SHAP.

```{r shap-importance, fig.height = 4}
autoplot(shap, type = "importance")
```

The importance plot orders variables by their mean absolute SHAP contribution.
This should broadly agree with block-permutation importance, but small
differences are expected because the two methods use different estimation
strategies.

```{r shap-beeswarm, fig.height = 4}
autoplot(shap, type = "beeswarm")
```

The beeswarm plot shows the distribution of SHAP values for each variable across
all explained observations. Each point is one observation. Variables with wide
distributions have heterogeneous effects across space; variables with tight
distributions have consistent effects everywhere. Points far from zero indicate
observations where the variable strongly influences the prediction in either
direction.

## When to use SHAP vs. permutation importance

Use permutation importance when you need a quick global ranking of variables for
feature selection decisions. Use SHAP values when you need observation-level
attribution, spatial maps of variable contributions, or when you want to
understand how a model's behavior varies across your study area. SHAP is
computationally more expensive (it requires one prediction per background sample
per variable per explained observation), so the `explain_idx` argument lets you
select a representative subset of observations rather than explaining the entire
dataset.

A practical strategy is to explain a spatially stratified subsample rather than
a random subsample. If your study area contains distinct regions (e.g.,
different biomes, land cover types, or climate zones), sample observations from
each region to ensure the SHAP analysis covers the full environmental gradient.
Random subsampling can oversample dense regions and miss sparse but important
ones.

The `n_blocks` and `n_samples` parameters control the precision of the SHAP
approximation. More blocks produce more fine-grained spatial conditioning; more
background samples produce smoother marginal expectations. For a quick
exploratory analysis, 10 blocks and 30 background samples are sufficient. For
publication-quality results, increase to 15 blocks and 100 samples, and verify
that the SHAP-based importance ranking is stable across different random seeds.

# Comparing methods

The four feature-level tools in BORG answer overlapping but distinct questions.
Choosing the right one depends on your goal, the number of candidate variables,
and the computational budget available.

**Block-permutation importance** measures each variable's contribution to the
full model. It is the fastest method: only n_rep x p prediction rounds on the
full dataset, with no CV loop. Use it for post-hoc interpretation or as a
pre-screening filter before running more expensive selection methods. The main
limitation is that it evaluates variables in the context of the complete model,
so collinear variables may split importance in unpredictable ways.

**Forward feature selection** builds a model from scratch, adding variables
greedily at a total cost of O(p^2 x k) model fits where k is the number of folds.
This is the workhorse for building parsimonious models when exhaustive search is
infeasible, though the greedy selection order may not produce the globally optimal
combination.

**Best subset selection** evaluates all 2^p - 1 combinations with full blocked
CV at a cost of O(2^p x k) model fits. For small candidate sets (p <= 15), this
guarantees finding the optimal combination. Beyond that, it is computationally
prohibitive.

**Spatial SHAP** provides observation-level attribution rather than global
variable ranking. Total cost: O(n_explain x p x n_samples) predictions. SHAP is
not a selection method per se, but it can inform selection by revealing whether a
variable's contribution is concentrated in a small spatial region (suggesting it
might not generalize) or distributed across the study area (suggesting it is
broadly informative).

| Method | Output | Cost | Best for |
|---|---|---|---|
| `borg_importance()` | Global ranking | O(n_rep x p) | Post-hoc interpretation, pre-screening |
| `borg_forward_selection()` | Ordered variable set | O(p^2 x k) | Building parsimonious models |
| `borg_best_subset()` | All subsets ranked | O(2^p x k) | Small p (<= 15), guaranteed optimum |
| `borg_shap()` | Per-observation attribution | O(n x p x n_bg) | Spatial interpretation, heterogeneous effects |

In practice, a sensible workflow combines two or three methods. Start with
permutation importance to identify obviously irrelevant variables. Run forward
selection on the surviving candidates to get a parsimonious model. If the
candidate set is small enough after screening, use best subset to verify that the
greedy solution is actually optimal. Use SHAP for interpretation once the model
is finalized.

# Full pipeline

This section walks through a complete feature selection pipeline from raw data
to final model evaluation.

## Diagnose dependencies

```{r diagnose}
diag <- borg_diagnose(d, coords = c("x", "y"), target = "y.1")
diag
```

The diagnosis confirms that the data has spatial autocorrelation, which means
random CV would inflate performance metrics and bias feature selection toward
spatially structured noise. This step is not optional: skipping diagnosis and
proceeding directly to feature selection with random CV is the single most common
source of overfit variable sets in spatial modeling. The diagnosis result drives
the choice of CV strategy in the next step.

## Generate blocked folds

We already created folds earlier, but in a real pipeline this is where you would
generate them.

```{r pipeline-folds}
folds <- borg_cv(d, coords = c("x", "y"), target = "y.1", v = 5, verbose = FALSE)
```

## Screen with permutation importance

```{r pipeline-screen}
model_full <- lm(y.1 ~ x1 + x2 + x3 + x4 + x5, data = d)
imp <- borg_importance(model_full, d, target = "y.1", coords = c("x", "y"),
                       n_rep = 10, seed = 42)
imp
```

Variables with near-zero or negative importance can be safely excluded. Negative
importance means the model performs *better* when the variable is permuted,
indicating that the variable adds noise.

## Forward selection on screened candidates

```{r pipeline-ffs}
# Keep variables with positive importance
good_vars <- imp$variable[imp$importance > 0]
cat("Candidates after screening:", paste(good_vars, collapse = ", "), "\n")

ffs <- borg_forward_selection(
  data = d,
  target = "y.1",
  predictors = good_vars,
  folds = folds,
  metric = "rmse",
  verbose = FALSE
)
ffs
```

## Fit the final model

```{r pipeline-final}
final_formula <- as.formula(
  paste("y.1 ~", paste(ffs$selected, collapse = " + "))
)
final_model <- lm(final_formula, data = d)
summary(final_model)
```

The final model uses only the variables that survived both permutation
importance screening and forward selection. Its blocked CV metric
(`ffs$best_metric`) gives an honest estimate of out-of-block prediction error.
Compare the R-squared from `summary()` (which is the in-sample value computed on
all 300 training points) with the blocked CV RMSE from FFS. The in-sample
R-squared will almost always be higher because the model is evaluated on the same
data it was fitted to, with no spatial separation between training and evaluation
points. The blocked CV metric is the honest one; the in-sample metric is
optimistic by an amount that depends on how strong the spatial autocorrelation is.

## Interpret with SHAP

```{r pipeline-shap, fig.height = 4}
shap <- borg_shap(final_model, d, target = "y.1", coords = c("x", "y"),
                  explain_idx = 1:100, seed = 42)
autoplot(shap, type = "beeswarm")
```

SHAP values on the final model reveal whether each selected variable contributes
uniformly across space or has localized effects. This informs how much to trust
the model's predictions in different parts of the study area. If a variable's
SHAP values are tightly concentrated near zero for most observations but spike
for a few, that variable is driving predictions only in a localized region. This
pattern can indicate that the variable captures a local environmental gradient
that may not transfer well to other study areas or time periods. In contrast, a
variable with broadly distributed SHAP values is contributing consistently across
the spatial domain and is more likely to generalize.

# Practical guidance

## How many features relative to sample size?

The number of variables a model can reliably learn from depends on the effective
sample size, not the raw observation count. With autocorrelated data, nearby
observations carry redundant information, shrinking the effective sample size
well below n. As a rule of thumb:

- For linear models, keep the number of predictors below n_eff / 10, where
  n_eff is the effective sample size. With strong spatial autocorrelation
  (rho > 0.7), n_eff can be as low as n / 5 to n / 10.
- For tree-based models, the constraint is looser because they handle
  irrelevant variables better, but more variables still increase the risk of
  overfitting to spatial structure.
- When in doubt, use BORG's forward selection as a guardrail. If FFS stops after
  2 or 3 variables out of 20 candidates, that is the model telling you it
  cannot reliably extract more signal given the effective sample size and
  dependency structure.

These are guidelines, not laws. The appropriate ratio depends on signal strength,
collinearity, and the model class. Use the blocked CV metric as the arbiter: if
adding a variable does not improve blocked CV performance, it is not earning its
place in the model regardless of what in-sample statistics suggest. A useful
diagnostic is to compare the number of variables selected by FFS under blocked
CV versus the number selected under random CV on the same data. Random
CV typically selects more variables because its inflated performance estimates
make each additional variable look like an improvement. If random CV selects 8
variables but blocked CV selects 3, the discrepancy is a direct measure of how
much the random evaluation was rewarding spatial artifacts rather than genuine
predictive information. The blocked selection count is the one to trust.

## Stability of selection

On spatial data, feature selection is inherently less stable than on independent
data because the effective sample size is smaller. Two runs with different random
seeds for fold generation or block formation can produce different selected
subsets, especially when several variables have similar importance.

To assess stability, repeat the selection procedure with different fold
configurations or different random seeds and compare the results.

```{r stability}
ffs_seeds <- lapply(c(1, 2, 3), function(s) {
  f <- borg_cv(d, coords = c("x", "y"), target = "y.1", v = 5,
               verbose = FALSE)
  borg_forward_selection(d, target = "y.1",
                          predictors = c("x1", "x2", "x3", "x4", "x5"),
                          folds = f, metric = "rmse", verbose = FALSE)
})

cat("Run 1:", paste(ffs_seeds[[1]]$selected, collapse = " + "), "\n")
cat("Run 2:", paste(ffs_seeds[[2]]$selected, collapse = " + "), "\n")
cat("Run 3:", paste(ffs_seeds[[3]]$selected, collapse = " + "), "\n")
```

If the selected subsets are consistent across runs, you can be confident in the
selection. If they differ, focus on the variables that appear in every run (the
"core" set) and treat variables that appear intermittently with caution. They
may be marginally informative and should be retained or excluded based on domain
knowledge rather than purely data-driven criteria.

## Collinearity and importance

Collinear predictors cause problems for all feature selection methods, not just
in the spatial setting. When two variables carry the same information, their
importance is split between them. Permutation importance will undervalue both;
forward selection will pick one and ignore the other; SHAP values will
distribute attribution unevenly depending on the model class.

Before running any selection method, check pairwise correlations among your
candidates. Pairs above 0.8 are suspect. If two variables correlate that
strongly, consider removing one based on domain knowledge, or use variance
inflation factors (VIF) to identify multicollinearity in the fitted model more
systematically. BORG does not provide a built-in VIF function, but the standard
R approach works:

```{r vif, eval = FALSE}
car::vif(lm(y.1 ~ x1 + x2 + x3 + x4 + x5, data = d))
```

## When NOT to do feature selection

Sometimes the right move is to skip selection entirely. There are situations
where it can do more harm than good:

**Very small datasets.** When n_eff is small (say, below 50), the variance of
any selection procedure is so high that the selected subset is effectively
random. You are better off using domain knowledge to pick a small set of
variables and fitting a simple model without automated selection.

**When all variables are theoretically justified.** If every predictor has a
clear mechanistic reason to be in the model, removing variables based on
empirical CV performance may discard information that matters for interpretation
or for predictions in unsampled conditions. In this case, use permutation
importance and SHAP for interpretation, but keep the full model.

**When the goal is inference, not prediction.** Feature selection optimizes
predictive performance, which is a different objective from estimating causal
effects or testing hypotheses. A variable might be excluded by FFS because it
adds noise to predictions, but that same variable might be the one you need to
estimate the effect of. For inferential goals, use structured regression
approaches (e.g., mixed-effects models with spatial random fields) rather than
predictive feature selection.

**When the number of candidates is already small.** If you have five predictors
and 500 blocked observations, the potential benefit of removing one variable is
small and the risk of dropping something useful is real. Feature selection pays
off when the candidate set is large relative to the effective sample size.

**Stacking and ensembling.** Some modeling strategies (e.g., stacking multiple
learners, or using regularization methods like lasso or elastic net) perform
implicit feature selection during fitting. Running explicit feature selection
on top of these methods adds complexity without clear benefit. If you use a
regularized model, the regularization path itself is your feature selection.

**High signal-to-noise ratio with few predictors.** If you have three predictors
and all of them are strong, well-understood drivers of the response (e.g.,
temperature, precipitation, and elevation for species distribution modeling),
feature selection will retain all three. The computational
cost of running FFS or best subset is not justified when the outcome is
predetermined by domain knowledge. Save automated feature selection for
situations where the candidate set is large enough that human judgment alone
cannot reliably identify the useful variables.

## Reporting feature selection in publications

When reporting results from spatial feature selection, include:

1. The CV strategy used (spatial blocking, KNNDM, etc.) and the number of folds.
2. Which selection method was applied.
3. The variables selected, the order in which they were added (for FFS), and
   whether any candidates were pre-screened before selection.
4. The metric used (RMSE, MAE, R-squared) and whether it was minimized or
   maximized.
5. A stability assessment if feasible (how consistent is the selection across
   different fold configurations?).
6. The difference in apparent importance between blocked and random CV, if that
   comparison was made.

Transparency about the selection procedure prevents readers from mistaking an
optimized variable set for a pre-specified one, and it allows others to
reproduce or challenge your choices.

## Temporal and grouped data

Although this vignette focuses on spatial autocorrelation, every method described
here works identically with temporal or grouped dependencies. The only difference
is how `borg_cv()` generates folds. For temporal data, pass a `time` argument
instead of `coords`; for grouped data, pass a `groups` argument. The feature
selection functions (`borg_forward_selection()`, `borg_best_subset()`) accept any
`borg_cv` object regardless of the blocking strategy used to create it. Similarly,
`borg_importance()` and `borg_shap()` can operate without coordinates (falling
back to random block assignment), but the spatial variants should be preferred
whenever coordinate information is available.

The same principles apply: blocked folds prevent leakage from dependent
observations, and feature selection within those folds produces variable sets
that generalize to new time periods or new groups rather than overfitting to
within-group or within-period patterns. For temporal data specifically, the
leakage problem can be even more severe than for spatial data, because time
series often have strong short-range autocorrelation that makes adjacent
observations nearly identical. Random CV on such data will almost always select
too many variables, since every candidate predictor appears to improve
predictions when the test set contains observations from the same time window as
the training set.

# References

- Breiman, L. (2001). Random forests. *Machine Learning*, 45(1), 5--32.
- Meyer, H., & Pebesma, E. (2021). Predicting into unknown space? Estimating
  the area of applicability of spatial prediction models.
  *Methods in Ecology and Evolution*, 12(9), 1620--1633.
- Roberts, D. R., et al. (2017). Cross-validation strategies for data with
  temporal, spatial, hierarchical, or phylogenetic structure.
  *Ecography*, 40(8), 913--929.
- Linnenbrink, J., et al. (2024). kNNDM: k-fold Nearest Neighbour Distance
  Matching Cross-Validation for map accuracy estimation.
  *Geoscientific Model Development*, 17, 5897--5912.
- Meyer, H., Reudenbach, C., Hengl, T., Katurji, M., & Nauss, T. (2018).
  Improving performance of spatio-temporal machine learning models using
  forward feature selection and target-oriented validation.
  *Environmental Modelling & Software*, 101, 1--9.
- Lundberg, S. M., & Lee, S.-I. (2017). A unified approach to interpreting
  model predictions. *Advances in Neural Information Processing Systems*, 30.