Most R modelling frameworks provide their own cross-validation
utilities: caret::trainControl(),
rsample::vfold_cv(), mlr3::rsmp("cv"). These
work well when observations are independent. When they are not
(spatially autocorrelated, temporally ordered, or clustered by group),
random CV inflates performance estimates because nearby or related
points leak information between the training and test folds. The
inflation can be large enough to reverse conclusions about model
quality.
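The effect can be reproduced in a few lines of base R. The sketch below is illustrative only and does not use BORG: it simulates a smooth spatial field, fits a 1-nearest-neighbour predictor, and compares random 5-fold CV with a crude blocked CV that holds out vertical strips. The field, the strip width, and the helper names `predict_1nn()` and `cv_rmse()` are assumptions made for this example.

```r
set.seed(1)
n <- 400
d <- data.frame(x = runif(n, 0, 10), y = runif(n, 0, 10))
# Smooth spatial field plus a little noise: nearby points have similar values
d$z <- sin(d$x) + cos(d$y) + rnorm(n, sd = 0.1)

# 1-nearest-neighbour prediction: copy the response of the closest training point
predict_1nn <- function(train, test) {
  vapply(seq_len(nrow(test)), function(i) {
    dist2 <- (train$x - test$x[i])^2 + (train$y - test$y[i])^2
    train$z[which.min(dist2)]
  }, numeric(1))
}

# Cross-validated RMSE for a given fold assignment (one fold id per row)
cv_rmse <- function(data, fold_id) {
  sq_err <- unlist(lapply(sort(unique(fold_id)), function(k) {
    test <- data[fold_id == k, ]
    train <- data[fold_id != k, ]
    (predict_1nn(train, test) - test$z)^2
  }))
  sqrt(mean(sq_err))
}

# Random folds: neighbours end up on both sides of every split
random_folds <- sample(rep(1:5, length.out = n))
# Blocked folds: five vertical strips, each 2 units wide
blocked_folds <- as.integer(cut(d$x, breaks = seq(0, 10, by = 2),
                                include.lowest = TRUE))

rmse_random <- cv_rmse(d, random_folds)
rmse_blocked <- cv_rmse(d, blocked_folds)
c(random = rmse_random, blocked = rmse_blocked)
```

Under random folds every test point has a close neighbour in the training set, so the error looks small. Holding out whole strips removes those neighbours, and the error reflects how the model behaves on locations it has not seen.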
BORG addresses this problem at two levels. First, it can inspect existing framework objects (recipes, resampling schemes, fitted models) and report whether information has leaked. Second, it provides guarded wrappers that replace the standard CV constructors. These wrappers diagnose the data structure before splitting, block random CV when it would be invalid, and optionally switch to an appropriate blocking strategy. The result is a CV object that plugs directly back into the framework you are already using.
This guide walks through integration with each framework, starting from base R and progressing through caret, tidymodels, mlr3, and the two main species distribution modelling (SDM) frameworks: ENMeval and biomod2. Each section includes a comparison of the standard approach versus the BORG-guarded alternative, worked examples with runnable code, and advice on when each integration point matters most.
The table below maps each framework’s native CV function to its BORG equivalent. “Validate” means BORG inspects an object that already exists; “Generate” means BORG creates the CV folds itself.
| Framework | Native Function | BORG Validate | BORG Generate |
|---|---|---|---|
| Base R | manual indices | borg() | borg_cv() |
| caret | trainControl() | borg_inspect(preProcess) | borg_trainControl() |
| tidymodels | vfold_cv() | borg_inspect(recipe) | borg_vfold_cv() |
| tidymodels | group_vfold_cv() | borg_inspect(rset) | borg_group_vfold_cv() |
| tidymodels | initial_split() | borg_inspect(rsplit) | borg_initial_split() |
| tidymodels | (rset output) | – | borg_rset() |
| mlr3 | rsmp("cv") | borg_inspect(task) | borg_to_mlr3() |
| ENMeval | ENMevaluate() | – | borg_to_enmeval() |
| biomod2 | BIOMOD_Modeling() | – | borg_to_biomod2() |
| Any | pipeline object | borg_pipeline() | – |
The simplest integration uses manual index vectors. This is also the lowest level of BORG’s API, and every framework integration ultimately reduces to this pattern: a data frame, a vector of training indices, and a vector of test indices.
data <- iris
set.seed(42)
n <- nrow(data)
train_idx <- sample(n, 0.7 * n)
test_idx <- setdiff(1:n, train_idx)
borg(data, train_idx = train_idx, test_idx = test_idx)
#> BorgRisk Assessment
#> ===================
#>
#> Status: VALID (no hard violations)
#> Hard violations: 0
#> Soft inflations: 0
#> Train indices: 105 rows
#> Test indices: 45 rows
#> Inspected at: 2026-06-15 19:48:00
#>
#> No risks detected.
A common source of leakage in base R workflows is fitting preprocessing parameters (means, standard deviations, PCA loadings) on the full dataset instead of the training set alone. When test observations contribute to the centering or scaling, the model has indirect access to test-set information during training. The correct pattern is to compute statistics from training data and apply them to both sets:
train_data <- data[train_idx, ]
train_means <- colMeans(train_data[, 1:4])
train_sds <- apply(train_data[, 1:4], 2, sd)
scaled_train <- scale(data[train_idx, 1:4], center = train_means, scale = train_sds)
scaled_test <- scale(data[test_idx, 1:4], center = train_means, scale = train_sds)
This same principle applies to any learned transformation: imputation
values, Box-Cox parameters, feature selection thresholds. If the
transformation is fitted using test-set values of y or any other
test-set statistics, the evaluation is contaminated.
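As a further illustration of the same pattern, here is a minimal sketch of median imputation where the fill value is learned from the training rows only. The missing values are injected artificially, and the `impute()` helper is a hypothetical name introduced for this example.

```r
set.seed(42)
dat <- iris[, 1:4]
# Knock out some values so there is something to impute
dat$Sepal.Length[sample(nrow(dat), 15)] <- NA

train_idx <- sample(nrow(dat), 105)
test_idx <- setdiff(seq_len(nrow(dat)), train_idx)

# Learn the imputation value from the training rows only
train_median <- median(dat$Sepal.Length[train_idx], na.rm = TRUE)

# Replace missing entries of x with a fixed value
impute <- function(x, value) {
  x[is.na(x)] <- value
  x
}

train_set <- dat[train_idx, ]
test_set <- dat[test_idx, ]
# The same training-derived value is applied to both partitions
train_set$Sepal.Length <- impute(train_set$Sepal.Length, train_median)
test_set$Sepal.Length <- impute(test_set$Sepal.Length, train_median)
```

Computing the median on `dat$Sepal.Length` as a whole would be the leaky variant, because test rows would then influence the value used to fill training rows.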
caret’s trainControl() provides
method = "cv" and method = "repeatedcv" for
standard cross-validation, plus method = "LGOCV" for
leave-group-out. None of these account for spatial or temporal structure
unless you manually supply fold indices via the index
argument. If your data has autocorrelation and you use
method = "cv" without custom indices, nearby observations
will appear in both train and test folds, inflating metrics.
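To make the manual route concrete, the sketch below builds fold lists in the shape caret's index and indexOut arguments expect: one integer vector of training rows and one of held-out rows per resample. The quadrant blocking rule is a deliberately crude stand-in chosen for this example, not a recommended blocking strategy.

```r
set.seed(7)
pts <- data.frame(x = runif(120), y = runif(120))

# Assign each point to one of four quadrant blocks (illustrative rule only)
block_id <- 1L + (pts$x > 0.5) + 2L * (pts$y > 0.5)

# `index` holds the training rows per resample; `index_out` the held-out rows
index <- lapply(1:4, function(b) which(block_id != b))
index_out <- lapply(1:4, function(b) which(block_id == b))
names(index) <- names(index_out) <- paste0("Block", 1:4)

# These lists could then be supplied as
# caret::trainControl(method = "cv", index = index, indexOut = index_out)
lengths(index_out)
```

Each row is held out exactly once, and whole blocks move between train and test together, so neighbours within a block never straddle a split.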
BORG’s borg_trainControl() intercepts this by running
borg_diagnose() on your data before building the
trainControl object. If it detects spatial, temporal, or
group dependencies, it blocks random CV and tells you to generate proper
folds with borg_cv().
caret’s preProcess() object stores the training-set
statistics it learned. borg_inspect() checks whether those
statistics came from the full dataset or only from the training
partition.
library(caret)
#> Loading required package: lattice
data(mtcars)
train_idx <- 1:25
test_idx <- 26:32
# BAD: preProcess on full data (leaks test-set statistics)
pp_bad <- preProcess(mtcars[, -1], method = c("center", "scale"))
borg_inspect(pp_bad, train_idx, test_idx, data = mtcars)
#> BorgRisk Assessment
#> ===================
#>
#> Status: INVALID (2 hard violations) — Resistance is futile
#> Hard violations: 2
#> Soft inflations: 0
#> Train indices: 25 rows
#> Test indices: 7 rows
#> Inspected at: 2026-06-15 19:48:00
#>
#> --- HARD VIOLATIONS (must fix) ---
#>
#> [1] preprocessing_leak
#> preProcess centering parameters were computed on data beyond training set (mean difference: 16.1061)
#> Source: preProcess
#> Affected: 7 indices (first 5: 26, 27, 28, 29, 30)
#> Fix: Refit preprocessing on training data only. Auto-fix: borg_assimilate(workflow)
#>
#> [2] preprocessing_leak
#> preProcess scaling parameters were computed on data beyond training set (sd difference: 9.7915)
#> Source: preProcess
#> Affected: 7 indices (first 5: 26, 27, 28, 29, 30)
#> Fix: Refit preprocessing on training data only. Auto-fix: borg_assimilate(workflow)
# GOOD: preProcess on training data only
pp_good <- preProcess(mtcars[train_idx, -1], method = c("center", "scale"))
borg_inspect(pp_good, train_idx, test_idx, data = mtcars)
#> BorgRisk Assessment
#> ===================
#>
#> Status: VALID (no hard violations)
#> Hard violations: 0
#> Soft inflations: 0
#> Train indices: 25 rows
#> Test indices: 7 rows
#> Inspected at: 2026-06-15 19:48:00
#>
#> No risks detected.
borg_trainControl() accepts the same method
and number arguments as caret::trainControl(),
plus optional coords, time, and
groups arguments that trigger dependency checking. When
dependencies are found and allow_override = FALSE (the
default), the function stops with an actionable error message explaining
the detected dependency type and estimated metric inflation.
sim <- borg_simulate(n = 300, type = "spatial", seed = 10)
d <- sim$data
# This will error because spatial autocorrelation is present
ctrl <- borg_trainControl(
  data = d,
  method = "cv",
  number = 5,
  coords = c("x", "y"),
  target = "y.1"
)
The error message suggests generating blocked folds via
borg_cv() and passing them to
trainControl(method = "cv", index = folds). Here is the
complete corrected workflow:
# Generate spatially-blocked folds
cv <- borg_cv(d, coords = c("x", "y"), target = "y.1", v = 5,
              output = "caret")
# Use the BORG folds inside caret's trainControl
ctrl <- caret::trainControl(
  method = "cv",
  index = cv$index,
  indexOut = cv$indexOut
)
# Train as usual
model <- caret::train(y.1 ~ x1 + x2 + x3 + x4 + x5,
                      data = d,
                      method = "lm",
                      trControl = ctrl)
If you want caret to proceed anyway (for example, to compare random
CV with blocked CV), set allow_override = TRUE. BORG will
emit a warning but return a standard trainControl
object.
For fitted caret::train objects,
borg_pipeline() decomposes the pipeline into stages
(preprocessing, resampling, tuning, model fitting) and inspects each for
leakage independently.
data(mtcars)
ctrl <- caret::trainControl(method = "cv", number = 5)
model <- caret::train(mpg ~ ., data = mtcars[1:25, ], method = "lm",
trControl = ctrl,
preProcess = c("center", "scale"))
result <- borg_pipeline(model, train_idx = 1:25, test_idx = 26:32,
data = mtcars)
result
#> BORG Pipeline Validation
#> ========================
#>
#> Stages inspected: 2
#> Overall status: VALID
#> Hard violations: 0
#> Soft warnings: 0
#>
#> --- Per-Stage Results ---
#> preprocessing [OK] 0 issue(s)
#> resampling [OK] 0 issue(s)
The output lists each stage with its status (OK or LEAK) and issue
count. The leaking_stages field names the stages that need
attention, so you can fix them individually without re-inspecting the
entire pipeline.
tidymodels separates preprocessing (recipes), resampling (rsample),
model specification (parsnip), and tuning (tune) into distinct packages.
This modularity is a strength, but it also means leakage can appear at
multiple points: a recipe prepped on the full dataset, an
initial_split() that ignores temporal order, or
vfold_cv() folds that mix spatially adjacent observations.
BORG integrates at each of these points.
borg_inspect() accepts prepped recipe objects and checks
whether the training set used for prep() matches the
training partition. If the recipe was prepped on the full dataset, BORG
flags it as a hard violation.
library(recipes)
#> Loading required package: dplyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
#>
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#>
#> step
library(rsample)
#>
#> Attaching package: 'rsample'
#> The following object is masked from 'package:caret':
#>
#> calibration
data(mtcars)
set.seed(123)
split <- initial_split(mtcars, prop = 0.8)
train_idx <- split$in_id
test_idx <- setdiff(seq_len(nrow(mtcars)), train_idx)
# BAD: Recipe prepped on full data
rec_bad <- recipe(mpg ~ ., data = mtcars) |>
step_normalize(all_numeric_predictors()) |>
prep()
borg_inspect(rec_bad, train_idx, test_idx, data = mtcars)
#> BorgRisk Assessment
#> ===================
#>
#> Status: INVALID (3 hard violations) — Resistance is futile
#> Hard violations: 3
#> Soft inflations: 0
#> Train indices: 25 rows
#> Test indices: 7 rows
#> Inspected at: 2026-06-15 19:48:00
#>
#> --- HARD VIOLATIONS (must fix) ---
#>
#> [1] preprocessing_leak
#> Recipe was prepped on 32 rows, but train set has 25 rows. Preprocessing may include test data.
#> Source: recipe
#> Affected: 7 indices (first 5: 2, 6, 12, 13, 16)
#> Fix: Refit preprocessing on training data only. Auto-fix: borg_assimilate(workflow)
#>
#> [2] preprocessing_leak
#> step_normalize centering parameters were computed on data beyond training set (mean difference: 12.0659)
#> Source: recipe$steps[[1]] (step_normalize)
#> Affected: 7 indices (first 5: 2, 6, 12, 13, 16)
#> Fix: Refit preprocessing on training data only. Auto-fix: borg_assimilate(workflow)
#>
#> [3] preprocessing_leak
#> step_normalize scaling parameters were computed on data beyond training set (sd difference: 5.72179)
#> Source: recipe$steps[[1]] (step_normalize)
#> Affected: 7 indices (first 5: 2, 6, 12, 13, 16)
#> Fix: Refit preprocessing on training data only. Auto-fix: borg_assimilate(workflow)
# GOOD: Recipe prepped on training only
rec_good <- recipe(mpg ~ ., data = training(split)) |>
step_normalize(all_numeric_predictors()) |>
prep()
borg_inspect(rec_good, train_idx, test_idx, data = mtcars)
#> BorgRisk Assessment
#> ===================
#>
#> Status: VALID (no hard violations)
#> Hard violations: 0
#> Soft inflations: 0
#> Train indices: 25 rows
#> Test indices: 7 rows
#> Inspected at: 2026-06-15 19:48:00
#>
#> No risks detected.
BORG provides drop-in replacements for three rsample constructors.
Each accepts the same core arguments as the original, plus optional
coords, time, groups, and
target arguments for dependency detection.
borg_vfold_cv() replaces
rsample::vfold_cv(). When no structure hints are provided,
it passes through to the standard rsample function. When structure hints
are present, it runs borg_diagnose() and takes one of three
actions depending on configuration:
- Block (default): stop with an error.
- Auto-block (auto_block = TRUE): silently switch to blocked CV.
- Override (allow_override = TRUE): warn but return random folds.
sim <- borg_simulate(n = 300, type = "spatial", seed = 10)
d <- sim$data
# Will block: spatial autocorrelation detected
folds <- borg_vfold_cv(d, coords = c("x", "y"), target = "y.1", v = 5)
# Auto-block: switches to spatial blocking and returns an rset
folds <- borg_vfold_cv(d, coords = c("x", "y"), target = "y.1",
                       v = 5, auto_block = TRUE)
borg_group_vfold_cv() replaces
rsample::group_vfold_cv(). Group-based CV is generally
appropriate for clustered data, but when spatial or temporal structure
also exists, group-level folds may not be sufficient. BORG warns if
additional dependencies are found beyond the grouping variable.
# Ecological monitoring: sites with repeated visits
site_data <- data.frame(
  site_id = rep(1:30, each = 10),
  lon = rep(runif(30, -10, 10), each = 10),
  lat = rep(runif(30, -10, 10), each = 10),
  abundance = rpois(300, lambda = 5)
)
folds <- borg_group_vfold_cv(
  data = site_data,
  group = "site_id",
  v = 5,
  coords = c("lon", "lat"),
  target = "abundance"
)
If the sites are spatially clustered (several nearby sites forming a
regional group), BORG will warn that group CV alone may not prevent
spatial leakage. In that case, switch to borg_cv() with
strategy = "spatial_block".
borg_initial_split() replaces
rsample::initial_split(). When a time column
is specified, it sorts by time and splits chronologically, ensuring that
all training observations precede all test observations. For spatial
data, it warns if random splitting would cause leakage.
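The chronological behaviour described above can be sketched in base R. This is a minimal illustration of the idea (order by time, cut at a proportion), not BORG's implementation; the 80% proportion and the shuffled example data are assumptions.

```r
set.seed(3)
# Rows arrive in arbitrary order, as they often do in practice
obs <- data.frame(
  date = sample(seq(as.Date("2021-01-01"), by = "day", length.out = 100)),
  value = rnorm(100)
)

# Rank rows by time, then take the earliest 80% as training
time_order <- order(obs$date)
n_train <- floor(0.8 * nrow(obs))
train_idx <- time_order[seq_len(n_train)]
test_idx <- time_order[-seq_len(n_train)]

# Every training date should precede every test date
max(obs$date[train_idx]) < min(obs$date[test_idx])
```

The indices refer to rows of the original, unsorted data frame, so they can be used directly wherever a train/test index pair is expected.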
When you generate folds with borg_cv() and want to use
them with tune::tune_grid() or
tune::fit_resamples(), convert them to an rset
object using borg_rset(). The returned object has class
c("borg_rset", "rset", "tbl_df", ...) and works directly
with the tidymodels tuning infrastructure.
set.seed(42)
sim <- borg_simulate(n = 200, type = "spatial", seed = 42)
d <- sim$data
cv <- borg_cv(d, coords = c("x", "y"), target = "y.1", v = 5)
rset <- borg_rset(data = d, cv_obj = cv)
class(rset)
#> [1] "borg_rset" "rset" "tbl_df" "tbl" "data.frame"
nrow(rset)
#> [1] 5
You can also pass raw fold lists (any list of lists with
$train and $test integer vectors) via the
folds argument, which is useful if you built custom folds
outside of borg_cv().
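For reference, a raw fold list of that shape can be assembled by hand as follows. The contiguous 20-row blocks are only a placeholder for whatever custom blocking rule you use, and the final call is left as a comment because it simply follows the description above.

```r
n <- 100
# Placeholder blocking rule: five contiguous blocks of 20 rows each
fold_id <- rep(1:5, each = 20)

# One element per fold, each a list with integer vectors `train` and `test`
folds <- lapply(1:5, function(k) {
  list(train = which(fold_id != k), test = which(fold_id == k))
})

str(folds[[1]])
# Following the description above, such a list would be supplied as
# borg_rset(data = d, folds = folds), where `d` is a data frame with `n` rows.
```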
For the common case where you know your data is spatial or temporal
and just want properly blocked folds in rsample format, BORG provides
two convenience functions. These call borg_cv() internally
with output = "rsample" and the appropriate strategy.
set.seed(42)
sim <- borg_simulate(n = 200, type = "spatial", seed = 42)
d <- sim$data
# Spatial blocking, returned as rset
spatial_folds <- borg_spatial_cv(d, coords = c("x", "y"), target = "y.1", v = 5)
class(spatial_folds)
#> [1] "borg_rset" "manual_rset" "rset" "tbl_df" "tbl"
#> [6] "data.frame"
borg_temporal_cv() works the same way for time-series
data, accepting a time column name and an optional
embargo parameter that introduces a gap between training
and test windows.
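The idea of an embargo can be sketched without BORG. In the example below the embargo is counted in observations, the test windows are fixed by hand, and training uses everything that ends before the gap; how borg_temporal_cv() defines its windows and embargo units is not shown in this guide, so treat these choices as assumptions.

```r
ts_len <- 120
time_index <- seq_len(ts_len)  # stand-in for an ordered time column
embargo <- 5                   # observations dropped between train and test
test_size <- 20
test_starts <- c(61, 81, 101)

folds <- lapply(test_starts, function(start) {
  test <- start:(start + test_size - 1)
  # Train only on observations that end before the embargo gap
  train <- time_index[time_index < start - embargo]
  list(train = train, test = test)
})

# Distance between the last training point and the first test point, per fold
gaps <- vapply(folds, function(f) min(f$test) - max(f$train), numeric(1))
gaps
```

A gap of this kind matters when features are built from lagged or rolling values, because the observations immediately before the test window would otherwise share inputs with it.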
If you have an rsample object from another source (for example,
rsample::rolling_origin() or
spatialsample::spatial_block_cv()), you can validate it
with borg_inspect():
ts_data <- data.frame(
date = seq(as.Date("2020-01-01"), by = "day", length.out = 200),
value = cumsum(rnorm(200))
)
rolling <- rolling_origin(
data = ts_data,
initial = 100,
assess = 20,
cumulative = FALSE
)
borg_inspect(rolling, train_idx = NULL, test_idx = NULL)
#> BorgRisk Assessment
#> ===================
#>
#> Status: VALID (no hard violations)
#> Hard violations: 0
#> Soft inflations: 0
#> Train indices: 0 rows
#> Test indices: 0 rows
#> Inspected at: 2026-06-15 19:48:00
#>
#> No risks detected.
mlr3 provides built-in resamplings through rsmp(),
including "cv", "repeated_cv", and
"holdout". For spatial or temporal data, the
mlr3spatiotempcv extension package adds specialized
resamplings. If you are already using mlr3spatiotempcv,
BORG’s main value is validation rather than generation. If you are not
using that extension, BORG can generate properly blocked folds and
inject them into mlr3 via borg_to_mlr3().
Pass an mlr3 task and the train/test indices from one fold to
borg_inspect():
library(mlr3)
task <- TaskClassif$new("iris", iris, target = "Species")
resampling <- rsmp("cv", folds = 5)
resampling$instantiate(task)
train_idx <- resampling$train_set(1)
test_idx <- resampling$test_set(1)
borg_inspect(task, train_idx, test_idx)
#> BorgRisk Assessment
#> ===================
#>
#> Status: VALID (no hard violations)
#> Hard violations: 0
#> Soft inflations: 1
#> Train indices: 120 rows
#> Test indices: 30 rows
#> Inspected at: 2026-06-15 19:48:00
#>
#> --- SOFT INFLATIONS (warnings) ---
#>
#> [1] task_contains_test_data
#> mlr3 Task 'iris' contains 150 rows (full dataset). If used for CV, ensure test indices (30 rows) are properly held out.
#> Source: iris
#> Fix: Review evaluation workflow for potential information reuse
For spatial tasks, this check catches cases where random CV was used on autocorrelated data. You can iterate over all folds and aggregate the results if needed.
borg_to_mlr3() converts BORG CV folds into a native mlr3
Resampling object. Internally, it uses
mlr3::rsmp("custom") and pre-instantiates it with the BORG
fold indices, so calling $instantiate() is not needed (and
will not overwrite the existing splits).
set.seed(42)
sim <- borg_simulate(n = 200, type = "spatial", seed = 42)
d <- sim$data
cv <- borg_cv(d, coords = c("x", "y"), target = "y.1", v = 5)
resampling <- borg_to_mlr3(cv_obj = cv, data = d)
resampling$iters
#> [1] 5
The returned resampling can be used directly with
mlr3::resample() and mlr3::benchmark():
task <- mlr3::TaskRegr$new("spatial", backend = d, target = "y.1")
learner <- mlr3::lrn("regr.rpart")
rr <- mlr3::resample(task, learner, resampling)
rr$aggregate(mlr3::msr("regr.rmse"))
Note that borg_to_mlr3() accepts either a
cv_obj (a borg_cv object) or raw
folds and data arguments. When using
cv_obj, the data argument is still required because
borg_cv objects do not store the original data frame.
Species distribution models (SDMs) are among the most common spatial modelling workflows in ecology, and they are particularly susceptible to spatial CV inflation. Two widely used R frameworks, ENMeval and biomod2, have their own partition formats. BORG can generate spatially blocked folds and convert them to either format.
ENMeval expects partitions as a named list with two elements:
occs.grp (an integer vector assigning each occurrence point
to a fold) and bg.grp (the same for background points).
borg_to_enmeval() converts a borg_cv object to
this format.
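To show what the target format looks like, the sketch below turns a small hand-written fold list into a fold-membership vector of the kind described above. It illustrates the data shape only and is not the code behind borg_to_enmeval(); the toy folds are made up for the example.

```r
# Three toy folds over nine observations
folds <- list(
  list(train = 4:9, test = 1:3),
  list(train = c(1:3, 7:9), test = 4:6),
  list(train = 1:6, test = 7:9)
)

# Each observation gets the number of the fold in which it is held out
occs_grp <- integer(9)
for (k in seq_along(folds)) {
  occs_grp[folds[[k]]$test] <- k
}
occs_grp
#> [1] 1 1 1 2 2 2 3 3 3
```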
set.seed(42)
sim <- borg_simulate(n = 200, type = "spatial", seed = 42)
d <- sim$data
cv <- borg_cv(d, coords = c("x", "y"), target = "y.1", v = 4)
parts <- borg_to_enmeval(cv)
table(parts$occs.grp)
#>
#> 1 2 3 4
#> 50 50 50 50
The bg.grp vector is set to all zeros by default. If
your ENMeval workflow uses background points drawn from a separate data
frame, generate a second borg_cv object for those points
and use its fold assignments for bg.grp.
In a full ENMeval workflow, you would pass the partitions like this:
library(ENMeval)
# Assuming occs and bg are your occurrence and background data frames,
# and both have lon/lat columns.
cv_occs <- borg_cv(occs, coords = c("lon", "lat"), v = 4)
parts <- borg_to_enmeval(cv_occs)
results <- ENMevaluate(
  occs = occs[, c("lon", "lat")],
  envs = env_rasters,
  bg = bg[, c("lon", "lat")],
  algorithm = "maxent.jar",
  partitions = "user",
  user.grp = parts
)
The key advantage over ENMeval’s built-in partitioning methods
(block, checkerboard) is that BORG uses autocorrelation-aware block
sizes derived from borg_diagnose(). This means the block
size adapts to the actual spatial structure in your data rather than
relying on a fixed geometric partition.
biomod2 expects a DataSplitTable, a logical matrix where
each column is a CV run and each row is an observation. Values are
TRUE for calibration (training) and FALSE for
validation (testing). borg_to_biomod2() produces exactly
this format.
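The layout is easy to see on a toy example. The sketch below builds a logical calibration matrix from a hand-written fold list; it shows the shape of the table rather than how borg_to_biomod2() is implemented, and the fold list is invented for illustration.

```r
# Three toy folds over nine observations
folds <- list(
  list(train = 4:9, test = 1:3),
  list(train = c(1:3, 7:9), test = 4:6),
  list(train = 1:6, test = 7:9)
)
n_obs <- 9

# One column per run: TRUE where the row is used for calibration
split_table <- vapply(folds, function(f) {
  seq_len(n_obs) %in% f$train
}, logical(n_obs))
colnames(split_table) <- paste0("RUN", seq_along(folds))
split_table
```

Each column has six TRUE values and three FALSE values here, and every row is FALSE in exactly one column, which is what a standard k-fold layout should produce.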
set.seed(42)
sim <- borg_simulate(n = 200, type = "spatial", seed = 42)
d <- sim$data
cv <- borg_cv(d, coords = c("x", "y"), target = "y.1", v = 4)
split_table <- borg_to_biomod2(cv)
dim(split_table)
#> [1] 200 4
head(split_table)
#> RUN1 RUN2 RUN3 RUN4
#> [1,] TRUE TRUE TRUE FALSE
#> [2,] TRUE TRUE TRUE FALSE
#> [3,] FALSE TRUE TRUE TRUE
#> [4,] TRUE FALSE TRUE TRUE
#> [5,] TRUE TRUE TRUE FALSE
#> [6,] FALSE TRUE TRUE TRUE
In a biomod2 workflow, pass this table to
BIOMOD_Modeling() via the data.split.table
argument:
library(biomod2)
bm_data <- BIOMOD_FormatingData(
  resp.var = presence_vector,
  expl.var = env_stack,
  resp.xy = coords_matrix,
  resp.name = "species"
)
bm_options <- BIOMOD_ModelingOptions()
bm_models <- BIOMOD_Modeling(
  bm.format = bm_data,
  modeling.id = "borg_blocked",
  models = c("GLM", "RF", "MAXENT"),
  bm.options = bm_options,
  data.split.table = split_table,
  metric.eval = c("TSS", "ROC")
)
For time-series and panel data, the order of observations carries information. A model trained on future data and tested on past data will appear to perform well but fail in production. BORG enforces chronological ordering in several ways.
When you provide a time column to borg(),
it checks that no test observation precedes any training observation. If
the split violates temporal order, BORG flags it as a hard
violation.
set.seed(123)
n <- 365
ts_data <- data.frame(
date = seq(as.Date("2020-01-01"), by = "day", length.out = n),
value = cumsum(rnorm(n)),
feature = rnorm(n)
)
train_idx <- 1:252
test_idx <- 253:365
result <- borg(ts_data, train_idx = train_idx, test_idx = test_idx, time = "date")
result
#> BorgRisk Assessment
#> ===================
#>
#> Status: VALID (no hard violations)
#> Hard violations: 0
#> Soft inflations: 0
#> Train indices: 252 rows
#> Test indices: 113 rows
#> Inspected at: 2026-06-15 19:48:01
#>
#> No risks detected.
rsample::rolling_origin() creates expanding or sliding
window splits that respect temporal order by construction. You can still
validate these with borg_inspect() to confirm no
leakage:
rolling <- rolling_origin(
data = ts_data,
initial = 200,
assess = 30,
cumulative = FALSE
)
borg_inspect(rolling, train_idx = NULL, test_idx = NULL)
#> BorgRisk Assessment
#> ===================
#>
#> Status: VALID (no hard violations)
#> Hard violations: 0
#> Soft inflations: 0
#> Train indices: 0 rows
#> Test indices: 0 rows
#> Inspected at: 2026-06-15 19:48:01
#>
#> No risks detected.
Spatial autocorrelation means nearby observations share information not captured by the predictors. When train and test folds contain neighboring points, the model can “cheat” by memorizing local patterns rather than learning generalizable relationships.
set.seed(456)
n <- 200
spatial_data <- data.frame(
lon = runif(n, -10, 10),
lat = runif(n, -10, 10),
response = rnorm(n),
predictor = rnorm(n)
)
train_idx <- which(spatial_data$lon < 0)
test_idx <- which(spatial_data$lon >= 0)
result <- borg(spatial_data,
train_idx = train_idx,
test_idx = test_idx,
coords = c("lon", "lat"))
result
#> BorgRisk Assessment
#> ===================
#>
#> Status: VALID (no hard violations)
#> Hard violations: 0
#> Soft inflations: 0
#> Train indices: 91 rows
#> Test indices: 109 rows
#> Inspected at: 2026-06-15 19:48:01
#>
#> No risks detected.
When you omit train_idx and test_idx,
borg() generates spatially blocked folds automatically. The
block size is calibrated to the range of spatial autocorrelation
estimated by borg_diagnose().
For hierarchical data (patients, sites, species), observations within the same group are not independent. Splitting a patient’s visits across train and test sets lets the model learn patient-specific patterns during training and exploit them during testing. BORG’s group-aware CV keeps all observations from each group in the same fold.
clinical_data <- data.frame(
patient_id = rep(1:50, each = 4),
visit = rep(1:4, times = 50),
outcome = rnorm(200)
)
result <- borg(clinical_data, groups = "patient_id",
target = "outcome", v = 5)
result$diagnosis@recommended_cv
#> [1] "random"
fold1 <- result$folds[[1]]
train_patients <- unique(clinical_data$patient_id[fold1$train])
test_patients <- unique(clinical_data$patient_id[fold1$test])
length(intersect(train_patients, test_patients))
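#> [1] 29
In this run the intersection is 29 rather than zero. The diagnosis recommended "random" CV, presumably because the simulated outcome is independent noise with no within-patient correlation, so the returned folds were not grouped by patient. When BORG does generate group-aware folds, the intersection is zero, confirming complete group separation.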
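For comparison, here is a minimal base R sketch of grouped fold assignment, in which whole patients rather than individual rows are assigned to folds. It is an illustration of the technique and not BORG's implementation; the data and variable names are made up for the example.

```r
set.seed(5)
visits <- data.frame(
  patient_id = rep(1:50, each = 4),
  outcome = rnorm(200)
)

# Assign whole patients, not rows, to folds
patients <- unique(visits$patient_id)
patient_fold <- sample(rep(1:5, length.out = length(patients)))
# Propagate each patient's fold to all of that patient's rows
row_fold <- patient_fold[match(visits$patient_id, patients)]

test_rows <- which(row_fold == 1)
train_rows <- which(row_fold != 1)

# No patient should appear on both sides of the split
length(intersect(visits$patient_id[train_rows], visits$patient_id[test_rows]))
#> [1] 0
```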
borg_validate() accepts a list with data,
train_idx, and test_idx components and checks
for index overlap, duplicate rows, and other structural problems. It is
framework-agnostic and works with any workflow that can be described as
a data frame plus index vectors.
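Two of these structural checks are simple enough to sketch by hand, which can help when interpreting the report. The code below is a rough base R approximation, not the logic borg_validate() actually runs; the deliberately overlapping split and the row-key trick are choices made for this example.

```r
train_idx <- 1:100
test_idx <- 51:150

# Index overlap: rows listed on both sides of the split
shared <- intersect(train_idx, test_idx)
length(shared)
#> [1] 50

# Duplicate rows: test rows whose values exactly match some training row.
# Each row is collapsed into a single string key for comparison.
row_key <- do.call(paste, c(iris, sep = "\r"))
dup_in_test <- test_idx[row_key[test_idx] %in% row_key[train_idx]]
length(dup_in_test)
```

Shared indices always show up as duplicate rows as well, so fixing the overlap first usually removes most of the duplicate-row findings.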
data <- iris
set.seed(789)
n <- nrow(data)
train_idx <- sample(n, 0.7 * n)
test_idx <- setdiff(1:n, train_idx)
result <- borg_validate(list(
data = data,
train_idx = train_idx,
test_idx = test_idx
))
result
#> BorgRisk Assessment
#> ===================
#>
#> Status: INVALID (1 hard violation) — Resistance is futile
#> Hard violations: 1
#> Soft inflations: 0
#> Train indices: 105 rows
#> Test indices: 45 rows
#> Inspected at: 2026-06-15 19:48:01
#>
#> --- HARD VIOLATIONS (must fix) ---
#>
#> [1] duplicate_rows
#> Test set contains 1 rows identical to training rows (memorization risk)
#> Source: data.frame
#> Affected: 102
#> Fix: Remove duplicate rows or ensure they fall within the same fold
A common mistake is constructing indices that share rows between
train and test sets. This happens, for example, when using
sample() without setdiff(), or when off-by-one
errors corrupt index boundaries.
bad_workflow <- list(
data = iris,
train_idx = 1:100,
test_idx = 51:150
)
result <- borg_validate(bad_workflow)
result
#> BorgRisk Assessment
#> ===================
#>
#> Status: INVALID (2 hard violations) — Resistance is futile
#> Hard violations: 2
#> Soft inflations: 0
#> Train indices: 100 rows
#> Test indices: 100 rows
#> Inspected at: 2026-06-15 19:48:01
#>
#> --- HARD VIOLATIONS (must fix) ---
#>
#> [1] index_overlap
#> Train and test indices overlap (50 shared indices)
#> Source: workflow$train_idx/test_idx
#> Affected: 50 indices (first 5: 51, 52, 53, 54, 55)
#> Fix: Recreate train/test split with non-overlapping indices
#>
#> [2] duplicate_rows
#> Test set contains 50 rows identical to training rows (memorization risk)
#> Source: data.frame
#> Affected: 50 indices (first 5: 51, 52, 53, 54, 55)
#> Fix: Remove duplicate rows or ensure they fall within the same fold
For richer pipelines (a fitted caret::train object, a
tidymodels workflow, or a list of named components),
borg_pipeline() decomposes the pipeline into stages and
inspects each:
- Preprocessing: preProcess, PCA, scaling
Each stage receives its own BorgRisk assessment. The overall result
aggregates all risks across stages, and the leaking_stages
field lists the names of stages with hard violations.
borg_assimilate() attempts to fix detected leakage
automatically. It can handle cases like recipes prepped on the full data
(by re-prepping on training data only) or preprocessing objects fitted
on too many rows. Some problems, like index overlap, require a decision
about which indices to keep and cannot be fixed automatically.
workflow <- list(
data = iris,
train_idx = 1:100,
test_idx = 51:150
)
fixed <- borg_assimilate(workflow)
if (length(fixed$unfixable) > 0) {
cat("Partial assimilation:", length(fixed$unfixable),
"risk(s) require manual fix:",
paste(fixed$unfixable, collapse = ", "), "\n")
} else {
cat("Assimilation complete:", length(fixed$fixed),
"risk(s) corrected\n")
}
#> Partial assimilation: 2 risk(s) require manual fix: index_overlap, duplicate_rows
Index overlap requires choosing a new split strategy and cannot be resolved without user input.
When migrating an existing modelling pipeline to use BORG-guarded CV, follow these steps:
1. Identify the CV step. Find where your code calls vfold_cv(), trainControl(), rsmp("cv"), or builds manual index vectors.
2. Add structure hints. Determine which columns carry spatial coordinates, time stamps, or group identifiers, and pass them to the corresponding BORG wrapper.
3. Replace the CV constructor. Swap vfold_cv() for borg_vfold_cv(), trainControl() for borg_trainControl(), or generate folds with borg_cv() and inject them via output = "rsample", "caret", or "mlr3".
4. Validate preprocessing. Run borg_inspect() on any prepped recipe or preProcess object to confirm it was fitted on training data only.
5. Run borg_pipeline() on the fitted model. This catches leakage in stages you may have overlooked (nested CV, threshold selection, post-processing).
6. Compare metrics. If your blocked CV metrics are substantially lower than the random CV metrics, the difference is the inflation that was hiding in your original evaluation. The blocked CV metrics are the more realistic estimate of your model's performance.
- vignette("quickstart") for basic usage and concepts
- vignette("risk-taxonomy") for the complete catalog of detectable risks