---
title: "Framework Integration"
author: "Gilles Colling"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Framework Integration}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  dev = "svglite",
  fig.ext = "svg",
  fig.width = 7,
  fig.height = 5
)
library(BORG)

transparent_theme <- ggplot2::theme(
  plot.background = ggplot2::element_rect(fill = "transparent", colour = NA),
  panel.background = ggplot2::element_rect(fill = "transparent", colour = NA),
  legend.background = ggplot2::element_rect(fill = "transparent", colour = NA)
)

# Check package availability
has_caret <- requireNamespace("caret", quietly = TRUE)
has_recipes <- requireNamespace("recipes", quietly = TRUE)
has_rsample <- requireNamespace("rsample", quietly = TRUE)
has_mlr3 <- requireNamespace("mlr3", quietly = TRUE)
```

## Why This Guide Exists

Most R modelling frameworks provide their own cross-validation utilities:
`caret::trainControl()`, `rsample::vfold_cv()`, `mlr3::rsmp("cv")`. These
work well when observations are independent. When they are not (spatially
autocorrelated, temporally ordered, or clustered by group), random CV inflates
performance estimates because nearby or related points leak information between
the training and test folds. The inflation can be large enough to reverse
conclusions about model quality.
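
The effect is easy to reproduce without any framework. The sketch below is illustrative only and uses nothing from BORG: it builds a random walk along a transect (a strongly autocorrelated series), predicts each held-out point from its nearest training neighbour, and compares random folds with contiguous blocks. All names (`pos`, `predict_1nn`, `cv_rmse`) are made up for this example.

```r
set.seed(1)
n <- 200
pos <- seq_len(n)          # position along a transect
y <- cumsum(rnorm(n))      # random walk: neighbouring values are similar

# Predict each test point with the value of its nearest training point
predict_1nn <- function(train, test) {
  vapply(test, function(i) {
    y[train[which.min(abs(pos[train] - pos[i]))]]
  }, numeric(1))
}

# Pooled RMSE over all folds defined by an integer fold-id vector
cv_rmse <- function(fold_id) {
  errs <- unlist(lapply(sort(unique(fold_id)), function(k) {
    test <- which(fold_id == k)
    train <- which(fold_id != k)
    y[test] - predict_1nn(train, test)
  }))
  sqrt(mean(errs^2))
}

random_folds <- sample(rep(1:5, length.out = n))  # neighbours land in different folds
block_folds <- rep(1:5, each = n / 5)             # five contiguous blocks of 40

rmse_random <- cv_rmse(random_folds)
rmse_blocked <- cv_rmse(block_folds)
c(random = rmse_random, blocked = rmse_blocked)
```

Random folds should report the lower error here, because almost every test point has a training neighbour one or two steps away. The blocked estimate reflects prediction into stretches of the transect that were not sampled.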

BORG addresses this problem at two levels. First, it can **inspect** existing
framework objects (recipes, resampling schemes, fitted models) and report
whether information has leaked. Second, it provides **guarded wrappers** that
replace the standard CV constructors. These wrappers diagnose the data
structure before splitting, refuse to build random folds when they would be
invalid, and optionally switch to an appropriate blocked design. The result is
a CV object that plugs directly back into the framework you are already using.

This guide walks through integration with each framework, starting from
base R and progressing through caret, tidymodels, mlr3, and two widely used
species distribution modelling (SDM) frameworks: ENMeval and biomod2. Each
section compares the standard approach with the BORG-guarded alternative,
gives worked code examples, and advises on when each integration point
matters most.


## Quick Reference

The table below maps each framework's native CV function to its BORG
equivalent. "Validate" means BORG inspects an object that already exists;
"Generate" means BORG creates the CV folds itself.

| Framework | Native Function | BORG Validate | BORG Generate |
|-----------|----------------|---------------|---------------|
| Base R | manual indices | `borg()` | `borg_cv()` |
| caret | `trainControl()` | `borg_inspect(preProcess)` | `borg_trainControl()` |
| tidymodels | `vfold_cv()` | `borg_inspect(recipe)` | `borg_vfold_cv()` |
| tidymodels | `group_vfold_cv()` | `borg_inspect(rset)` | `borg_group_vfold_cv()` |
| tidymodels | `initial_split()` | `borg_inspect(rsplit)` | `borg_initial_split()` |
| tidymodels | (rset output) | -- | `borg_rset()` |
| mlr3 | `rsmp("cv")` | `borg_inspect(task)` | `borg_to_mlr3()` |
| ENMeval | `ENMevaluate()` | -- | `borg_to_enmeval()` |
| biomod2 | `BIOMOD_Modeling()` | -- | `borg_to_biomod2()` |
| Any | pipeline object | `borg_pipeline()` | -- |


## Base R

The simplest integration uses manual index vectors. This is also the lowest
level of BORG's API, and every framework integration ultimately reduces to
this pattern: a data frame, a vector of training indices, and a vector of
test indices.

```{r base-r}
data <- iris
set.seed(42)
n <- nrow(data)
train_idx <- sample(n, 0.7 * n)
test_idx <- setdiff(1:n, train_idx)

borg(data, train_idx = train_idx, test_idx = test_idx)
```

### Safe Preprocessing Pattern

A common source of leakage in base R workflows is fitting preprocessing
parameters (means, standard deviations, PCA loadings) on the full dataset
instead of the training set alone. When test observations contribute to the
centering or scaling, the model has indirect access to test-set information
during training. The correct pattern is to compute statistics from training
data and apply them to both sets:

```{r preprocessing-pattern}
train_data <- data[train_idx, ]
train_means <- colMeans(train_data[, 1:4])
train_sds <- apply(train_data[, 1:4], 2, sd)

scaled_train <- scale(data[train_idx, 1:4], center = train_means, scale = train_sds)
scaled_test <- scale(data[test_idx, 1:4], center = train_means, scale = train_sds)
```

This same principle applies to any learned transformation: imputation values,
Box-Cox parameters, feature selection thresholds. If fitting the
transformation touches test-set rows, whether their predictors or their
response `y`, the evaluation is contaminated.
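
As a minimal sketch of the same rule for imputation, the toy vector below has missing values in both partitions. The values and the names `fill_leaky` and `fill_safe` are invented for illustration.

```r
x <- c(1, NA, 3, 100, NA, 5)
train_idx <- 1:3
test_idx <- 4:6

# Leaky: the fill value is computed from all rows, including the test rows
fill_leaky <- median(x, na.rm = TRUE)

# Safe: the fill value is learned from the training rows only
fill_safe <- median(x[train_idx], na.rm = TRUE)

# Apply the training-derived value to both partitions
x_imputed <- ifelse(is.na(x), fill_safe, x)
x_imputed[test_idx]
```

The two fill values differ (4 versus 2) because the extreme test value 100 shifts the full-data median. That shift is exactly the information a model should not see before evaluation.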


## caret

### When to Use BORG with caret

caret's `trainControl()` provides `method = "cv"` and `method = "repeatedcv"`
for standard cross-validation, plus `method = "LGOCV"` (leave-group-out), which
despite its name draws repeated random train/test splits. None of these
account for spatial or temporal structure unless you manually supply fold
indices via the `index` argument. If your data has autocorrelation and you use
`method = "cv"` without custom indices, nearby observations will appear in
both train and test folds, inflating metrics.

BORG's `borg_trainControl()` intercepts this by running `borg_diagnose()` on
your data before building the `trainControl` object. If it detects spatial,
temporal, or group dependencies, it blocks random CV and tells you to generate
proper folds with `borg_cv()`.

### Validating preProcess Objects

A caret `preProcess` object stores the statistics (means, standard deviations,
and so on) it learned from whatever data it was given. `borg_inspect()` checks
whether those statistics came from the full dataset or only from the training
partition.

```{r caret-preprocess, eval = has_caret}
library(caret)

data(mtcars)
train_idx <- 1:25
test_idx <- 26:32

# BAD: preProcess on full data (leaks test-set statistics)
pp_bad <- preProcess(mtcars[, -1], method = c("center", "scale"))
borg_inspect(pp_bad, train_idx, test_idx, data = mtcars)

# GOOD: preProcess on training data only
pp_good <- preProcess(mtcars[train_idx, -1], method = c("center", "scale"))
borg_inspect(pp_good, train_idx, test_idx, data = mtcars)
```

### Guarded trainControl

`borg_trainControl()` accepts the same `method` and `number` arguments as
`caret::trainControl()`, plus optional `coords`, `time`, and `groups`
arguments that trigger dependency checking. When dependencies are found
and `allow_override = FALSE` (the default), the function stops with an
actionable error message explaining the detected dependency type and
estimated metric inflation.

```{r caret-traincontrol-example, eval = FALSE}
sim <- borg_simulate(n = 300, type = "spatial", seed = 10)
d <- sim$data

# This will error because spatial autocorrelation is present
ctrl <- borg_trainControl(
  data = d,
  method = "cv",
  number = 5,
  coords = c("x", "y"),
  target = "y.1"
)
```

The error message suggests generating blocked folds via `borg_cv()` and
passing them to `trainControl()` through its `index` argument (with `indexOut`
for the held-out rows). Here is the complete corrected workflow:

```{r caret-corrected, eval = FALSE}
# Generate spatially-blocked folds
cv <- borg_cv(d, coords = c("x", "y"), target = "y.1", v = 5,
              output = "caret")

# Use the BORG folds inside caret's trainControl
ctrl <- caret::trainControl(
  method = "cv",
  index = cv$index,
  indexOut = cv$indexOut
)

# Train as usual
model <- caret::train(y.1 ~ x1 + x2 + x3 + x4 + x5,
                      data = d,
                      method = "lm",
                      trControl = ctrl)
```

If you want caret to proceed anyway (for example, to compare random CV with
blocked CV), set `allow_override = TRUE`. BORG will emit a warning but
return a standard `trainControl` object.
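
For readers unfamiliar with caret's custom-fold format: `index` is a list with one element per resample holding the training rows, and `indexOut` holds the corresponding held-out rows. The sketch below shows how a fold-id vector maps onto those two lists. `fold_id` is a hypothetical stand-in for whatever blocking produced the folds, and the code does not claim to reproduce what `borg_cv(output = "caret")` returns internally.

```r
# Hypothetical block assignment: 12 rows in 3 contiguous blocks
fold_id <- rep(1:3, each = 4)
rows <- seq_along(fold_id)

# Training rows per resample (everything outside the held-out block)
index <- lapply(1:3, function(k) rows[fold_id != k])
# Held-out rows per resample
indexOut <- lapply(1:3, function(k) rows[fold_id == k])
names(index) <- names(indexOut) <- paste0("Fold", 1:3)

str(index)
# These lists can then be supplied as
# caret::trainControl(method = "cv", index = index, indexOut = indexOut)
```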


### Pipeline-Level Validation with borg_pipeline()

For fitted `caret::train` objects, `borg_pipeline()` decomposes the pipeline
into stages (preprocessing, resampling, tuning, model fitting) and inspects
each for leakage independently.

```{r caret-pipeline, eval = has_caret}
data(mtcars)
ctrl <- caret::trainControl(method = "cv", number = 5)
model <- caret::train(mpg ~ ., data = mtcars[1:25, ], method = "lm",
                      trControl = ctrl,
                      preProcess = c("center", "scale"))

result <- borg_pipeline(model, train_idx = 1:25, test_idx = 26:32,
                        data = mtcars)
result
```

The output lists each stage with its status (OK or LEAK) and issue count.
The `leaking_stages` field names the stages that need attention, so you can
fix them individually without re-inspecting the entire pipeline.


## tidymodels (rsample + recipes)

### When to Use BORG with tidymodels

tidymodels separates preprocessing (recipes), resampling (rsample), model
specification (parsnip), and tuning (tune) into distinct packages. This
modularity is a strength, but it also means leakage can appear at multiple
points: a recipe prepped on the full dataset, an `initial_split()` that
ignores temporal order, or `vfold_cv()` folds that mix spatially adjacent
observations. BORG integrates at each of these points.

### Validating Recipe Objects

`borg_inspect()` accepts prepped recipe objects and checks whether the
training set used for `prep()` matches the training partition. If the recipe
was prepped on the full dataset, BORG flags it as a hard violation.

```{r tidymodels-recipes, eval = has_recipes && has_rsample}
library(recipes)
library(rsample)

data(mtcars)
set.seed(123)
split <- initial_split(mtcars, prop = 0.8)
train_idx <- split$in_id
test_idx <- setdiff(seq_len(nrow(mtcars)), train_idx)

# BAD: Recipe prepped on full data
rec_bad <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors()) |>
  prep()

borg_inspect(rec_bad, train_idx, test_idx, data = mtcars)

# GOOD: Recipe prepped on training only
rec_good <- recipe(mpg ~ ., data = training(split)) |>
  step_normalize(all_numeric_predictors()) |>
  prep()

borg_inspect(rec_good, train_idx, test_idx, data = mtcars)
```

### Guarded rsample Functions

BORG provides drop-in replacements for three rsample constructors. Each
accepts the same core arguments as the original, plus optional `coords`,
`time`, `groups`, and `target` arguments for dependency detection.

**`borg_vfold_cv()`** replaces `rsample::vfold_cv()`. When no structure
hints are provided, it passes through to the standard rsample function. When
structure hints are present, it runs `borg_diagnose()` and takes one of
three actions depending on configuration:

1. **Block** (default): stop with an error describing the dependency.
2. **Auto-block** (`auto_block = TRUE`): silently switch to blocked CV.
3. **Override** (`allow_override = TRUE`): warn but return random folds.

```{r borg-vfold-example, eval = FALSE}
sim <- borg_simulate(n = 300, type = "spatial", seed = 10)
d <- sim$data

# Default: stops with an error because spatial autocorrelation is detected
folds <- borg_vfold_cv(d, coords = c("x", "y"), target = "y.1", v = 5)

# Auto-block: switches to spatial blocking and returns an rset
folds <- borg_vfold_cv(d, coords = c("x", "y"), target = "y.1",
                       v = 5, auto_block = TRUE)
```

**`borg_group_vfold_cv()`** replaces `rsample::group_vfold_cv()`. Group-based
CV is generally appropriate for clustered data, but when spatial or temporal
structure also exists, group-level folds may not be sufficient. BORG warns
if additional dependencies are found beyond the grouping variable.

```{r borg-group-vfold-example, eval = FALSE}
# Ecological monitoring: sites with repeated visits
set.seed(42)
site_data <- data.frame(
  site_id = rep(1:30, each = 10),
  lon = rep(runif(30, -10, 10), each = 10),
  lat = rep(runif(30, -10, 10), each = 10),
  abundance = rpois(300, lambda = 5)
)

folds <- borg_group_vfold_cv(
  data = site_data,
  group = "site_id",
  v = 5,
  coords = c("lon", "lat"),
  target = "abundance"
)
```

If the sites are spatially clustered (several nearby sites forming a regional
group), BORG will warn that group CV alone may not prevent spatial leakage.
In that case, switch to `borg_cv()` with `strategy = "spatial_block"`.

**`borg_initial_split()`** replaces `rsample::initial_split()`. When a
`time` column is specified, it sorts by time and splits chronologically,
ensuring that all training observations precede all test observations. For
spatial data, it warns if random splitting would cause leakage.

```{r borg-initial-split-example, eval = FALSE}
sim <- borg_simulate(n = 300, type = "temporal", seed = 7)
d <- sim$data

# Chronological split: 80% earliest for training, 20% latest for test
split <- borg_initial_split(d, prop = 0.8, time = "time")
```
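
The chronological rule itself is simple enough to write in base R. The sketch below is not BORG's implementation; it only illustrates the "sort by time, then cut" idea on a toy data frame whose rows are deliberately shuffled.

```r
set.seed(7)
d <- data.frame(
  time = sample(seq(as.Date("2021-01-01"), by = "day", length.out = 100)),
  y = rnorm(100)
)

ord <- order(d$time)                 # row indices from earliest to latest
n_train <- floor(0.8 * nrow(d))
train_idx <- ord[seq_len(n_train)]   # earliest 80% of observations
test_idx <- ord[-seq_len(n_train)]   # latest 20% of observations

# TRUE when every training date precedes every test date
max(d$time[train_idx]) < min(d$time[test_idx])
```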

### Converting BORG Folds to rset Objects

When you generate folds with `borg_cv()` and want to use them with
`tune::tune_grid()` or `tune::fit_resamples()`, convert them to an `rset`
object using `borg_rset()`. The returned object has class
`c("borg_rset", "rset", "tbl_df", ...)` and works directly with the
tidymodels tuning infrastructure.

```{r borg-rset-example, eval = has_rsample}
set.seed(42)
sim <- borg_simulate(n = 200, type = "spatial", seed = 42)
d <- sim$data

cv <- borg_cv(d, coords = c("x", "y"), target = "y.1", v = 5)
rset <- borg_rset(data = d, cv_obj = cv)
class(rset)
nrow(rset)
```

You can also pass raw fold lists (any list of lists with `$train` and `$test`
integer vectors) via the `folds` argument, which is useful if you built
custom folds outside of `borg_cv()`.
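
A raw fold list of that shape can be built in a few lines of base R. The sketch below assigns whole sites to folds; the data frame and the names `fold_of_site` and `fold_id` are invented for illustration, and the final call is left commented out because it needs BORG and rsample.

```r
set.seed(1)
d <- data.frame(site = rep(1:10, each = 5), y = rnorm(50))

# Assign whole sites to folds so that no site is split
fold_of_site <- sample(rep(1:5, length.out = 10))
fold_id <- fold_of_site[d$site]

# One list element per fold, each with $train and $test integer vectors
folds <- lapply(1:5, function(k) {
  list(train = which(fold_id != k), test = which(fold_id == k))
})
str(folds[[1]])

# rset <- borg_rset(data = d, folds = folds)
```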

### Convenience Constructors: borg_spatial_cv() and borg_temporal_cv()

For the common case where you know your data is spatial or temporal and just
want properly blocked folds in rsample format, BORG provides two convenience
functions. These call `borg_cv()` internally with `output = "rsample"` and
the appropriate strategy.

```{r convenience-constructors, eval = has_rsample}
set.seed(42)
sim <- borg_simulate(n = 200, type = "spatial", seed = 42)
d <- sim$data

# Spatial blocking, returned as rset
spatial_folds <- borg_spatial_cv(d, coords = c("x", "y"), target = "y.1", v = 5)
class(spatial_folds)
```

`borg_temporal_cv()` works the same way for time-series data, accepting a
`time` column name and an optional `embargo` parameter that introduces a gap
between training and test windows.
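
To make the idea of an embargo concrete, here is a base-R sketch of expanding-window folds with a gap. It assumes the embargo is counted in observations; how `borg_temporal_cv()` measures it is not shown in this guide, and `test_starts` and `test_len` are hypothetical values.

```r
n <- 100
embargo <- 5                 # assumed unit: observations
test_starts <- c(61, 81)     # hypothetical first row of each test window
test_len <- 20

folds <- lapply(test_starts, function(s) {
  list(
    train = seq_len(s - embargo - 1),  # stops `embargo` rows before the test window
    test = s:(s + test_len - 1)
  )
})

# Number of unused rows between the last training row and the first test row
gaps <- vapply(folds, function(f) min(f$test) - max(f$train) - 1, numeric(1))
gaps
```

The rows inside the gap are used for neither training nor testing, which keeps short-range temporal autocorrelation from linking the two windows.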

### Validating Existing rsample Objects

If you have an rsample object from another source (for example,
`rsample::rolling_origin()` or `spatialsample::spatial_block_cv()`), you can
validate it with `borg_inspect()`:

```{r rsample-validation, eval = has_rsample}
set.seed(123)
ts_data <- data.frame(
  date = seq(as.Date("2020-01-01"), by = "day", length.out = 200),
  value = cumsum(rnorm(200))
)

rolling <- rolling_origin(
  data = ts_data,
  initial = 100,
  assess = 20,
  cumulative = FALSE
)

borg_inspect(rolling, train_idx = NULL, test_idx = NULL)
```


## mlr3

### When to Use BORG with mlr3

mlr3 provides built-in resamplings through `rsmp()`, including `"cv"`,
`"repeated_cv"`, and `"holdout"`. For spatial or temporal data, the
`mlr3spatiotempcv` extension package adds specialized resamplings. If you are
already using `mlr3spatiotempcv`, BORG's main value is validation rather than
generation. If you are not using that extension, BORG can generate
properly blocked folds and inject them into mlr3 via `borg_to_mlr3()`.

### Validating mlr3 Tasks and Resamplings

Pass an mlr3 task and the train/test indices from one fold to
`borg_inspect()`:

```{r mlr3-example, eval = has_mlr3}
library(mlr3)

task <- TaskClassif$new("iris", iris, target = "Species")
resampling <- rsmp("cv", folds = 5)
resampling$instantiate(task)

train_idx <- resampling$train_set(1)
test_idx <- resampling$test_set(1)
borg_inspect(task, train_idx, test_idx)
```

For spatial tasks, this check catches cases where random CV was used on
autocorrelated data. You can iterate over all folds and aggregate the
results if needed.

### Generating mlr3 Resamplings from BORG Folds

`borg_to_mlr3()` converts BORG CV folds into a native mlr3 `Resampling`
object. Internally, it uses `mlr3::rsmp("custom")` and pre-instantiates it
with the BORG fold indices, so calling `$instantiate()` is not needed (and
will not overwrite the existing splits).

```{r mlr3-generate, eval = has_mlr3}
set.seed(42)
sim <- borg_simulate(n = 200, type = "spatial", seed = 42)
d <- sim$data

cv <- borg_cv(d, coords = c("x", "y"), target = "y.1", v = 5)
resampling <- borg_to_mlr3(cv_obj = cv, data = d)
resampling$iters
```

The returned resampling can be used directly with `mlr3::resample()` and
`mlr3::benchmark()`:

```{r mlr3-benchmark, eval = FALSE}
task <- mlr3::TaskRegr$new("spatial", backend = d, target = "y.1")
learner <- mlr3::lrn("regr.rpart")

rr <- mlr3::resample(task, learner, resampling)
rr$aggregate(mlr3::msr("regr.rmse"))
```

Note that `borg_to_mlr3()` accepts either a `cv_obj` (a `borg_cv` object)
or a raw `folds` list. The `data` argument is required in both cases; with
`cv_obj` this is because `borg_cv` objects do not store the original data
frame.


## Species Distribution Models

Species distribution models (SDMs) are among the most common spatial
modelling workflows in ecology, and they are particularly susceptible to
spatial CV inflation. Two widely used R frameworks, ENMeval and biomod2,
have their own partition formats. BORG can generate spatially blocked folds
and convert them to either format.

### ENMeval

ENMeval expects partitions as a named list with two elements: `occs.grp`
(an integer vector assigning each occurrence point to a fold) and `bg.grp`
(the same for background points). `borg_to_enmeval()` converts a `borg_cv`
object to this format.

```{r enmeval-example}
set.seed(42)
sim <- borg_simulate(n = 200, type = "spatial", seed = 42)
d <- sim$data

cv <- borg_cv(d, coords = c("x", "y"), target = "y.1", v = 4)
parts <- borg_to_enmeval(cv)
table(parts$occs.grp)
```

The `bg.grp` vector is set to all zeros by default. If your ENMeval workflow
uses background points drawn from a separate data frame, generate a second
`borg_cv` object for those points and use its fold assignments for `bg.grp`.
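
The partition format is easy to inspect by building one by hand. The sketch below turns a list of held-out rows into an `occs.grp` vector and pairs it with an all-zero `bg.grp`; the fold contents and point counts are invented, and this is not a description of `borg_to_enmeval()` internals.

```r
# Hypothetical folds: each element lists the held-out occurrence rows of one fold
test_rows <- list(c(1, 4, 7), c(2, 5, 8), c(3, 6, 9))
n_occs <- 9
n_bg <- 20

# Fold number for every occurrence point
occs.grp <- integer(n_occs)
for (k in seq_along(test_rows)) occs.grp[test_rows[[k]]] <- k

user_grp <- list(occs.grp = occs.grp, bg.grp = rep(0L, n_bg))
table(user_grp$occs.grp)
```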

In a full ENMeval workflow, you would pass the partitions like this:

```{r enmeval-workflow, eval = FALSE}
library(ENMeval)

# Assuming occs and bg are your occurrence and background data frames
# (both with lon/lat columns) and env_rasters holds the predictor rasters.
cv_occs <- borg_cv(occs, coords = c("lon", "lat"), v = 4)
parts <- borg_to_enmeval(cv_occs)

results <- ENMevaluate(
  occs = occs[, c("lon", "lat")],
  envs = env_rasters,
  bg = bg[, c("lon", "lat")],
  algorithm = "maxent.jar",
  # Illustrative settings: Maxent models need both fc and rm in tune.args
  tune.args = list(fc = c("L", "LQ"), rm = 1:2),
  partitions = "user",
  user.grp = parts
)
```

The key advantage over ENMeval's built-in partitioning methods (block,
checkerboard) is that BORG uses autocorrelation-aware block sizes derived
from `borg_diagnose()`. This means the block size adapts to the actual
spatial structure in your data rather than relying on a fixed geometric
partition.

### biomod2

biomod2 expects a `DataSplitTable`, a logical matrix where each column is a
CV run and each row is an observation. Values are `TRUE` for calibration
(training) and `FALSE` for validation (testing). `borg_to_biomod2()`
produces exactly this format.

```{r biomod2-example}
set.seed(42)
sim <- borg_simulate(n = 200, type = "spatial", seed = 42)
d <- sim$data

cv <- borg_cv(d, coords = c("x", "y"), target = "y.1", v = 4)
split_table <- borg_to_biomod2(cv)
dim(split_table)
head(split_table)
```
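
The same layout can be written by hand, which helps when checking a table before handing it to biomod2. The sketch below uses invented folds, and the `RUN1`, `RUN2`, ... column names are an assumption about the naming convention rather than something taken from `borg_to_biomod2()`.

```r
# Hypothetical folds: held-out rows for each CV run
test_rows <- list(1:3, 4:6, 7:9)
n <- 9

# TRUE = calibration (training), FALSE = validation (testing)
split_tab <- vapply(test_rows, function(te) !(seq_len(n) %in% te), logical(n))
colnames(split_tab) <- paste0("RUN", seq_along(test_rows))

split_tab
colSums(!split_tab)   # validation rows per run
```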

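In a biomod2 workflow, pass this table to `BIOMOD_Modeling()` via the
`data.split.table` argument. The argument names below follow biomod2 4.0 and
4.1; other releases name the cross-validation arguments differently, so check
`?BIOMOD_Modeling` for your installed version: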

```{r biomod2-workflow, eval = FALSE}
library(biomod2)

bm_data <- BIOMOD_FormatingData(
  resp.var = presence_vector,
  expl.var = env_stack,
  resp.xy = coords_matrix,
  resp.name = "species"
)

bm_options <- BIOMOD_ModelingOptions()

bm_models <- BIOMOD_Modeling(
  bm.format = bm_data,
  modeling.id = "borg_blocked",
  models = c("GLM", "RF", "MAXENT"),
  bm.options = bm_options,
  data.split.table = split_table,
  metric.eval = c("TSS", "ROC")
)
```


## Temporal Data Workflows

For time-series and panel data, the order of observations carries
information. A model trained on future data and tested on past data will
appear to perform well but fail in production. BORG enforces chronological
ordering in several ways.

### Basic Temporal Validation

When you provide a `time` column to `borg()`, it checks that no test
observation precedes any training observation. If the split violates temporal
order, BORG flags it as a hard violation.

```{r temporal-basic}
set.seed(123)
n <- 365
ts_data <- data.frame(
  date = seq(as.Date("2020-01-01"), by = "day", length.out = n),
  value = cumsum(rnorm(n)),
  feature = rnorm(n)
)

train_idx <- 1:252
test_idx <- 253:365

result <- borg(ts_data, train_idx = train_idx, test_idx = test_idx, time = "date")
result
```
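
The rule being enforced can be stated in two lines of base R. The sketch below applies it to a deliberately broken split and lists the offending test rows; it illustrates the rule as described above and is not BORG's own check.

```r
dates <- as.Date("2020-01-01") + 0:9
train_idx <- c(1:6, 9)     # row 9 is later than some test rows
test_idx <- c(7, 8, 10)

# Test rows dated on or before the latest training row violate temporal order
cutoff <- max(dates[train_idx])
violations <- test_idx[dates[test_idx] <= cutoff]
violations
```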

### Rolling Origin with rsample

`rsample::rolling_origin()` creates expanding or sliding window splits that
respect temporal order by construction. You can still validate these with
`borg_inspect()` to confirm no leakage:

```{r rolling-origin, eval = has_rsample}
rolling <- rolling_origin(
  data = ts_data,
  initial = 200,
  assess = 30,
  cumulative = FALSE
)

borg_inspect(rolling, train_idx = NULL, test_idx = NULL)
```


## Spatial Data Workflows

Spatial autocorrelation means nearby observations share information not
captured by the predictors. When train and test folds contain neighboring
points, the model can "cheat" by memorizing local patterns rather than
learning generalizable relationships.

### Spatial Block Validation

```{r spatial-basic}
set.seed(456)
n <- 200
spatial_data <- data.frame(
  lon = runif(n, -10, 10),
  lat = runif(n, -10, 10),
  response = rnorm(n),
  predictor = rnorm(n)
)

train_idx <- which(spatial_data$lon < 0)
test_idx <- which(spatial_data$lon >= 0)

result <- borg(spatial_data,
               train_idx = train_idx,
               test_idx = test_idx,
               coords = c("lon", "lat"))
result
```

### Automatic Spatial CV Generation

When you omit `train_idx` and `test_idx`, `borg()` generates spatially
blocked folds automatically. The block size is calibrated to the range of
spatial autocorrelation estimated by `borg_diagnose()`.

```{r spatial-auto}
result <- borg(spatial_data, coords = c("lon", "lat"),
               target = "response", v = 5)
result$diagnosis@recommended_cv

length(result$folds)
```
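
For intuition about what spatial blocking does, the base-R sketch below lays a square grid over the points and assigns whole grid cells to folds. The `block_size` of 5 is an assumed autocorrelation range chosen by hand, whereas BORG estimates it from the data, and the grid scheme is a simplification rather than BORG's algorithm.

```r
set.seed(456)
n <- 200
pts <- data.frame(lon = runif(n, -10, 10), lat = runif(n, -10, 10))

block_size <- 5   # assumed autocorrelation range, in coordinate units

# Label each point with the grid cell it falls in
block <- paste(floor(pts$lon / block_size), floor(pts$lat / block_size), sep = "_")

# Assign whole cells to folds at random, then look up each point's fold
blocks <- unique(block)
fold_of_block <- setNames(sample(rep(1:5, length.out = length(blocks))), blocks)
fold_id <- unname(fold_of_block[block])

table(fold_id)
```

Because folds are assigned per cell, two points in the same cell can never end up on opposite sides of a train/test split.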


## Grouped Data Workflows

For hierarchical data (patients, sites, species), observations within the
same group are not independent. Splitting a patient's visits across train and
test sets lets the model learn patient-specific patterns during training and
exploit them during testing. BORG's group-aware CV keeps all observations
from each group in the same fold.

```{r grouped-workflow}
set.seed(321)
clinical_data <- data.frame(
  patient_id = rep(1:50, each = 4),
  visit = rep(1:4, times = 50),
  outcome = rnorm(200)
)

result <- borg(clinical_data, groups = "patient_id",
               target = "outcome", v = 5)
result$diagnosis@recommended_cv

fold1 <- result$folds[[1]]
train_patients <- unique(clinical_data$patient_id[fold1$train])
test_patients <- unique(clinical_data$patient_id[fold1$test])
length(intersect(train_patients, test_patients))
```

The intersection is empty (length zero), confirming complete group separation.


## Complete Pipeline Validation

### borg_validate(): Workflow-Level Checks

`borg_validate()` accepts a list with `data`, `train_idx`, and `test_idx`
components and checks for index overlap, duplicate rows, and other structural
problems. It is framework-agnostic and works with any workflow that can be
described as a data frame plus index vectors.

```{r pipeline-validation}
data <- iris
set.seed(789)
n <- nrow(data)
train_idx <- sample(n, 0.7 * n)
test_idx <- setdiff(1:n, train_idx)

result <- borg_validate(list(
  data = data,
  train_idx = train_idx,
  test_idx = test_idx
))
result
```

### Catching Overlap

A common mistake is constructing indices that share rows between train and
test sets. This happens, for example, when train and test indices are drawn
with two independent `sample()` calls instead of deriving the test set with
`setdiff()`, or when off-by-one errors corrupt index boundaries.

```{r pipeline-bad}
bad_workflow <- list(
  data = iris,
  train_idx = 1:100,
  test_idx = 51:150
)

result <- borg_validate(bad_workflow)
result
```
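
The overlap in this example can also be confirmed by hand in base R. The repair at the end is only one possible choice (keep the test set and shrink the training set) and is shown to illustrate why the decision needs a human.

```r
train_idx <- 1:100
test_idx <- 51:150

overlap <- intersect(train_idx, test_idx)
length(overlap)    # number of rows that appear in both sets
range(overlap)

# One possible repair: keep the test set and drop the shared rows from training
train_fixed <- setdiff(train_idx, test_idx)
length(intersect(train_fixed, test_idx))
```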

### borg_pipeline(): Stage-by-Stage Decomposition

For richer pipelines (a fitted `caret::train` object, a tidymodels
`workflow`, or a list of named components), `borg_pipeline()` decomposes the
pipeline into stages and inspects each:

1. **Preprocessing**: recipe steps, `preProcess`, PCA, scaling
2. **Feature selection**: variable importance filtering
3. **Hyperparameter tuning**: inner CV resamples
4. **Model fitting**: training data scope and row counts
5. **Post-processing**: threshold optimization, calibration

Each stage receives its own BorgRisk assessment. The overall result
aggregates all risks across stages, and the `leaking_stages` field lists
the names of stages with hard violations.


## Automatic Repair with borg_assimilate()

`borg_assimilate()` attempts to fix detected leakage automatically. It can
handle cases like recipes prepped on the full data (by re-prepping on
training data only) or preprocessing objects fitted on too many rows. Some
problems, like index overlap, require a decision about which indices to keep
and cannot be fixed automatically.

```{r assimilate}
workflow <- list(
  data = iris,
  train_idx = 1:100,
  test_idx = 51:150
)

fixed <- borg_assimilate(workflow)

if (length(fixed$unfixable) > 0) {
  cat("Partial assimilation:", length(fixed$unfixable),
      "risk(s) require manual fix:",
      paste(fixed$unfixable, collapse = ", "), "\n")
} else {
  cat("Assimilation complete:", length(fixed$fixed),
      "risk(s) corrected\n")
}
```

In this example the overlap between `train_idx` and `test_idx` is not
repaired: fixing it means choosing a new split, which requires user input.


## Migration Checklist

When migrating an existing modelling pipeline to use BORG-guarded CV, follow
these steps:

1. **Identify the CV step.** Find where your code calls `vfold_cv()`,
   `trainControl()`, `rsmp("cv")`, or builds manual index vectors.

2. **Add structure hints.** Determine which columns carry spatial coordinates,
   time stamps, or group identifiers, and pass them to the corresponding
   BORG wrapper.

3. **Replace the CV constructor.** Swap `vfold_cv()` for `borg_vfold_cv()`,
   `trainControl()` for `borg_trainControl()`, or generate folds with
   `borg_cv()` and inject them via `output = "rsample"`, `"caret"`, or
   `"mlr3"`.

4. **Validate preprocessing.** Run `borg_inspect()` on any prepped recipe or
   `preProcess` object to confirm it was fitted on training data only.

5. **Run borg_pipeline() on the fitted model.** This catches leakage in
   stages you may have overlooked (nested CV, threshold selection,
   post-processing).

6. **Compare metrics.** If your blocked CV metrics are substantially lower
   than the random CV metrics, the difference is the inflation that was
   hiding in your original evaluation. The blocked estimate is the more
   realistic measure of how your model will perform on new data.


## See Also

- `vignette("quickstart")` for basic usage and concepts

- `vignette("risk-taxonomy")` for the complete catalog of detectable risks