--- title: "Quick Start" author: "Gilles Colling" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Quick Start} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5, dev = "svglite", fig.ext = "svg" ) library(BORG) library(ggplot2) theme_borg <- theme_minimal(base_size = 12) + theme( panel.background = element_rect(fill = "transparent", colour = NA), plot.background = element_rect(fill = "transparent", colour = NA), legend.background = element_rect(fill = "transparent", colour = NA) ) theme_set(theme_borg) ``` # Why your test accuracy might be wrong A model shows 95% accuracy on test data, then drops to 60% in production. The culprit, almost always, is data leakage: information from the test set that contaminated training. Leakage takes many forms: preprocessing fitted on all data before splitting, features derived from the outcome, the same patient appearing in train and test, or random cross-validation applied to spatially autocorrelated observations. These mistakes are easy to make and hard to spot manually. BORG (Bounded Outcome Risk Guard) detects these problems before you compute metrics. It examines your data, your splits, your preprocessing objects, and your fitted model to flag violations that would inflate performance estimates. The package covers the full evaluation pipeline: data dependency detection (spatial autocorrelation via Moran's I, temporal autocorrelation via the Ljung-Box test, clustered structure via the ICC), cross-validation fold generation (10+ blocking strategies), leakage validation (index overlap, target leakage, preprocessing leakage, group leakage, temporal ordering violations), and post-hoc diagnostics (residual spatial structure, calibration, null testing, area of applicability). When dependencies exist in the data, BORG generates cross-validation folds that respect those dependencies, so your reported error actually predicts how the model will perform on genuinely new data. This vignette walks through the core workflow: 1. **Simulate** data with known dependencies so we can verify BORG's detection. 2. **Diagnose** the dependency structure (spatial, temporal, grouped, or mixed). 3. **Generate** CV folds that respect the detected dependencies. 4. **Validate** existing train/test splits for leakage. 5. **Fit a model** inside a guarded pipeline that checks every fold. 6. **Compare** random vs blocked CV empirically to quantify the inflation. 7. **Check power** to confirm the dataset is large enough for blocked CV. After the core workflow, short taste tests demonstrate additional features: area of applicability, block-permutation importance, residual diagnostics, calibration, null testing, and interactive maps. Each taste test links to a dedicated vignette with the full treatment. # Simulating data with known dependencies Before working with real data, it helps to run BORG on synthetic data where the ground truth is known. `borg_simulate()` generates datasets with controlled dependency structures, so you can verify that BORG's detection and blocking are doing the right thing. ```{r simulate-spatial} sim <- borg_simulate( n = 300, type = "spatial", n_predictors = 5, signal_strength = 0.3, autocorrelation = 0.5, seed = 42 ) ``` The returned object contains a data frame with spatial coordinates (`x`, `y`), five predictors, and a response variable. The `true_r2` field records the theoretical upper bound for an honest evaluation. Any model reporting R-squared above this value is overfitting or benefiting from leakage. ```{r simulate-inspect} names(sim$data) cat("True R-squared:", sim$true_r2, "\n") cat("Coordinate columns:", sim$coords, "\n") ``` The spatial simulation places coordinates in `x` and `y`, predictors in `x1` through `x5`, and the response in `y.1` (renamed by R to avoid collision with the `y` coordinate). Temporal simulations use `time` for the time column and `y` for the response. Six simulation types are available, covering the main ways evaluation can go wrong: | Type | What it generates | Use case | |---|---|---| | `"spatial"` | Gaussian spatial autocorrelation via a Matern-like covariance | Testing spatial blocking | | `"temporal"` | AR(1) time series with configurable lag-1 autocorrelation | Testing temporal blocking | | `"target_leak"` | Noisy copy of the response added as a feature | Testing leakage detection | | `"preprocessing_leak"` | Data pre-normalized on all rows before splitting | Testing preprocessing checks | | `"combined"` | Spatial autocorrelation + target leakage together | Worst-case scenario | | `"independent"` | IID observations, no dependencies | Control / sanity check | The `signal_strength` parameter (0 to 1) controls how much variance in the response is explained by the predictors. At 0.3, the true R-squared is around 0.3; any model reporting substantially higher values is exploiting autocorrelation or leakage. The `autocorrelation` parameter (0 to 1) controls the strength of spatial or temporal dependence. ```{r simulate-leaky} sim_leak <- borg_simulate(n = 200, type = "target_leak", seed = 1) cat("Leaked variables:", sim_leak$leaked_vars, "\n") ``` The leaky dataset contains a variable that correlates almost perfectly with the response. A model using this variable would appear to have near-perfect predictive power, but the feature would not be available at prediction time on genuinely new data. # Diagnosing data dependencies The entry point for most workflows is `borg()`. When called without train/test indices, it diagnoses the data and generates CV folds in one step. ```{r diagnose-spatial} result <- borg( sim$data, coords = c("x", "y"), target = "y.1", v = 5 ) result ``` The diagnosis reports what kind of dependency was detected (spatial, temporal, clustered, or mixed), how severe it is, and which CV strategy BORG recommends. The `result$diagnosis` slot holds the full `BorgDiagnosis` object with quantitative details. ```{r diagnosis-details} diag <- result$diagnosis cat("Dependency type:", diag@dependency_type, "\n") cat("Severity:", diag@severity, "\n") cat("Recommended CV:", diag@recommended_cv, "\n") cat("Moran's I:", round(diag@spatial$morans_i, 3), "(p =", formatC(diag@spatial$morans_p, format = "e", digits = 2), ")\n") cat("Effective N:", diag@spatial$effective_n, "out of", diag@n_obs, "\n") ``` Moran's I quantifies spatial autocorrelation: values near zero indicate no spatial pattern, values near 1 indicate strong positive autocorrelation. The effective sample size is the number of independent observations after accounting for spatial dependence. A dataset of 300 points with effective N of 80 means random CV would treat 300 observations as independent when only 80 carry unique information; error estimates from random CV would be far too optimistic. The inflation estimate in the diagnosis predicts how much random CV would overstate model performance. For example, an `auc_inflation` of 0.05 means random CV would report AUC roughly 5 points higher than what the model achieves on truly independent data. BORG derives this estimate from the autocorrelation range and the effective sample size, following the framework of Roberts et al. (2017). You can also call `borg_diagnose()` directly if you only need the diagnosis without generating folds. ```{r diagnose-direct} diag_only <- borg_diagnose( sim$data, coords = c("x", "y"), target = "y.1" ) cat("Same result:", diag_only@dependency_type, "\n") ``` ## Temporal and grouped data The same interface handles temporal and grouped data. BORG auto-detects common column names (`date`, `time`, `site`, `patient_id`, etc.), but you can always specify them explicitly. ```{r diagnose-temporal} sim_temporal <- borg_simulate(n = 300, type = "temporal", seed = 7) result_temporal <- borg( sim_temporal$data, time = "time", target = "y", v = 5 ) cat("Recommended CV:", result_temporal$diagnosis@recommended_cv, "\n") ``` For temporal data, BORG runs a Ljung-Box test to detect autocorrelation and estimates the decorrelation lag (the number of time steps after which observations become approximately independent). The recommended CV strategy ensures no future data leaks into past folds, either through temporal blocking or a purged CV scheme with an appropriate embargo period. ```{r diagnose-grouped} grouped_data <- data.frame( site = rep(1:30, each = 10), x1 = rnorm(300), measurement = rnorm(300) ) result_grouped <- borg(grouped_data, groups = "site", target = "measurement", v = 5) cat("Recommended CV:", result_grouped$diagnosis@recommended_cv, "\n") ``` For grouped data, BORG computes the intraclass correlation coefficient (ICC), which measures how much variance is explained by group membership. An ICC of 0.5 means half the variance in the response comes from between-group differences. With high ICC, random CV creates optimistic estimates because the same group's observations appear in both train and test. BORG switches to leave-group-out folds automatically. When multiple dependency types coexist (spatial coordinates plus a group column), BORG detects the combined structure and recommends a strategy that respects all of them. For instance, spatial-temporal data gets a spatial-temporal blocking strategy that separates folds in both space and time. # Working with CV folds The generated folds live in `result$folds`. Each fold is a list with `train` and `test` index vectors. ```{r folds-structure} length(result$folds) fold1 <- result$folds[[1]] cat("Fold 1 - Train:", length(fold1$train), " Test:", length(fold1$test), "\n") ``` The `borg_cv` object in `result$cv` carries metadata about the strategy and can be visualized with `autoplot()`. ```{r folds-autoplot, fig.width=7, fig.height=4} autoplot(result$cv, type = "sizes") ``` For spatial data, pass the original data and coordinate names to see the folds on a map. ```{r folds-spatial-map, fig.width=7, fig.height=5} autoplot(result$cv, type = "spatial", data = sim$data, coords = c("x", "y")) ``` Each colour represents a different fold. Spatially blocked folds group nearby points together, so every fold's test set is geographically separated from its training set. Compare this to random CV, where train and test points are scattered across the entire study area; in that case, a test point might be surrounded by training points only metres away, and the model can exploit local patterns instead of learning the signal. The strategy BORG chooses depends on the diagnosis. For spatial data, the default is k-means spatial blocking. Other strategies are available through `borg_cv()` directly: | Strategy | When to use it | |---|---| | `"spatial_block"` | Default for spatial data. K-means clustering on coordinates. | | `"environmental_block"` | When environmental gradients matter more than geography. Clusters on predictor values. | | `"spatial_plus"` | Two-stage: spatial blocks refined by environmental similarity. | | `"knndm"` | When prediction locations are known. Matches CV distance distribution to prediction distances. | | `"leave_location_out"` | True spatial LOO with optional buffer zone. Expensive but unbiased. | | `"temporal_block"` | Contiguous time blocks. | | `"temporal_expanding"` | Expanding window: training set grows, test set slides forward. | | `"temporal_sliding"` | Fixed-width sliding window. | | `"purged_cv"` | Temporal CV with an embargo period between train and test. | | `"group_fold"` | Leave-group-out. | ## Exporting folds to other frameworks By default, `borg_cv()` returns a list of index vectors. Set the `output` argument to convert folds into the format your framework expects. ```{r folds-rsample, eval = requireNamespace("rsample", quietly = TRUE)} cv_rsample <- borg_cv( sim$data, diagnosis = result$diagnosis, coords = c("x", "y"), v = 5, output = "rsample" ) class(cv_rsample) ``` Supported formats: `"rsample"` (tidymodels), `"caret"` (trainControl), `"mlr3"` (Resampling object). The conversion preserves the fold structure exactly; no resampling or re-blocking occurs. You get the same train/test splits regardless of output format. This makes it straightforward to integrate BORG's blocked folds into an existing modelling pipeline without changing the downstream code. See `vignette("frameworks")` for framework-specific examples with caret's `train()`, tidymodels' `tune_grid()`, and mlr3's `resample()`. # Validating existing splits When you already have train/test indices, pass them to `borg()` and it switches to validation mode. Instead of generating folds, it checks the split for leakage. ```{r validate-clean} set.seed(1) train_idx <- sample(300, 210) test_idx <- setdiff(1:300, train_idx) risk <- borg(sim$data, train_idx = train_idx, test_idx = test_idx) risk ``` No violations. The split is clean. The `BorgRisk` object reports `is_valid = TRUE`, meaning no hard violations were found. Now introduce some problems, one at a time, to see how BORG responds. ## Overlapping indices The most basic mistake: some row indices appear in both the training and test set. This means the model is tested on data it already saw during training. ```{r validate-overlap, error = TRUE} bad_risk <- borg(sim$data, train_idx = 1:200, test_idx = 150:300) ``` `borg()` refuses to continue when it finds overlapping indices. It throws an error with the number of shared indices and the first few values. This fail-fast behaviour prevents you from accidentally computing metrics on a contaminated split. For a non-stopping approach that returns a `BorgRisk` object instead of throwing, use `borg_inspect()`: ```{r validate-overlap-inspect} bad_risk <- borg_inspect(sim$data, train_idx = 1:200, test_idx = 150:300) bad_risk ``` The `BorgRisk` object reports `is_valid = FALSE` and lists the overlap as a hard violation. Metrics computed from this split are meaningless. ## Target leakage Target leakage occurs when a feature is derived from (or highly correlated with) the response variable. The classic example: including `days_since_diagnosis` as a predictor when the outcome is `has_disease`. The feature is only available after the outcome is known and would not exist at prediction time. ```{r validate-leakage} risk_leak <- borg( sim_leak$data, train_idx = 1:140, test_idx = 141:200, target = "y" ) risk_leak ``` The leaky feature (`leaked_feature`) was flagged because its correlation with the response exceeds 0.99. Removing it before modelling would fix the problem. BORG also flags "proxy leakage" (correlation between 0.95 and 0.99) as a soft warning, since these features might be legitimate strong predictors that require domain knowledge to judge. ## Preprocessing leakage Preprocessing leakage is subtler than the examples above. When you fit a scaler, PCA, or imputer on the entire dataset before splitting, the parameters (means, standard deviations, principal components) incorporate information from the test set. The test data's distribution influences how training data is transformed. This creates a feedback loop that inflates apparent performance. BORG inspects preprocessing objects directly. If you scaled the data before splitting, the test set's statistics contaminated the scaler, and BORG will flag this. ```{r validate-preprocess} full_scaled <- scale(sim$data[, c("x1", "x2", "x3")]) risk_pp <- borg_inspect( full_scaled, train_idx = 1:210, test_idx = 211:300, data = sim$data ) risk_pp ``` The correct pattern: fit the scaler on training data only, then apply the same transformation to both sets. This way, the test set is transformed using parameters it did not influence. BORG validates this pattern too: pass the correctly fitted preprocessing object and BORG confirms no leakage occurred. ```{r validate-preprocess-good} train_subset <- sim$data[1:210, c("x1", "x2", "x3")] mu <- colMeans(train_subset) s <- apply(train_subset, 2, sd) scaled_correct <- scale(sim$data[, c("x1", "x2", "x3")], center = mu, scale = s) risk_pp_good <- borg_inspect( scaled_correct, train_idx = 1:210, test_idx = 211:300, data = sim$data ) cat("Valid:", risk_pp_good@is_valid, "\n") ``` ## Accessing risk details Every `BorgRisk` object stores a list of individual risks with their type, severity, description, and (where applicable) the affected rows or features. The `is_valid` slot gives a quick binary answer: `TRUE` if no hard violations were found, `FALSE` otherwise. For programmatic access, convert the object to a data frame. ```{r risk-details} df <- as.data.frame(bad_risk) df[, c("type", "severity", "description")] ``` # The borg_check() shortcut For quick pipeline integration, `borg_check()` runs diagnosis and validation in one call, returning a tidy data frame. ```{r borg-check} checks <- borg_check( sim$data, target = "y.1", coords = c("x", "y"), v = 5 ) checks[, c("risk_type", "severity", "description")] ``` This is convenient for CI pipelines or scripts that need a single yes/no answer. The diagnosis and CV folds are attached as attributes if you need them later. ```{r borg-check-attr} diag_from_check <- attr(checks, "diagnosis") cat("Dependency type:", diag_from_check@dependency_type, "\n") ``` You can build automated gates on top of `borg_check()`. For example, in a CI pipeline: if any row has `severity == "hard_inflation"`, fail the build. # Fitting a model with borg_workflow() The functions we have seen so far (`borg()`, `borg_diagnose()`, `borg_cv()`) handle the data side: detecting dependencies and creating folds. The next step is to actually fit a model inside those folds and collect performance metrics. You could do this manually: loop over folds, subset the data, fit, predict, compute RMSE. `borg_workflow()` does exactly this loop, but adds three things: (1) it validates every fold's train/test split for leakage, (2) it stores all fitted models and predictions for later inspection, and (3) it handles the diagnosis and fold generation internally if you have not done it yourself. You supply the data, a formula, and a fitting function. ```{r workflow-lm} wf <- borg_workflow( sim$data, formula = y.1 ~ x1 + x2 + x3 + x4 + x5, coords = c("x", "y"), v = 5, fit_fun = lm, metric = "rmse" ) wf ``` The workflow object reports per-fold RMSE (or whichever metric you chose) and flags any fold where leakage was detected. Every fold's train/test split is clean by construction because dependencies are handled at the fold-generation stage. High fold-to-fold variance in the performance data frame often signals that the dataset is too small or too spatially heterogeneous for stable estimates. ```{r workflow-performance} wf$performance ``` ```{r workflow-autoplot, fig.width=7, fig.height=4} autoplot(wf$performance) ``` Any custom fitting function works, as long as it accepts a `formula` and a `data` argument. This makes it easy to swap in different learners without changing the rest of the workflow. For example, to use `ranger` (random forest): ```{r workflow-ranger, eval = requireNamespace("ranger", quietly = TRUE)} wf_rf <- borg_workflow( sim$data, formula = y.1 ~ x1 + x2 + x3 + x4 + x5, coords = c("x", "y"), v = 5, fit_fun = function(formula, data, ...) ranger::ranger(formula, data = data), metric = "rmse" ) wf_rf ``` # Comparing random vs blocked CV The diagnosis tells you that dependencies exist, and the workflow gives you error estimates from blocked CV. But neither directly answers the question reviewers will ask: "how much does it matter?" How much would random CV have inflated the performance estimate compared to the blocked approach? `borg_compare_cv()` answers this empirically. It fits the same model under both random and blocked CV, with multiple repeats, and tests whether the difference is statistically significant using a paired t-test. ```{r compare-cv} comparison <- borg_compare_cv( sim$data, formula = y.1 ~ x1 + x2 + x3 + x4 + x5, coords = c("x", "y"), v = 5, repeats = 5, seed = 42 ) comparison ``` The inflation estimate tells you how much random CV underestimates prediction error (or overestimates R-squared) relative to blocked CV. The p-value comes from a paired t-test across repeated folds. Each repeat uses a different random seed for fold assignment, so the test accounts for the variability inherent in CV itself. ```{r compare-cv-plot, fig.width=7, fig.height=4} plot(comparison) ``` The boxplot shows the distribution of the chosen metric (RMSE by default) across repeats for both random and blocked CV. When the boxes do not overlap, the inflation is clear. Three possible outcomes and what to do with each: 1. **Significant inflation** (p < 0.05, blocked RMSE clearly higher). Random CV is unreliable for this dataset. Use blocked CV and report the empirical inflation in your methods section. 2. **Non-significant difference** (p >= 0.05). Blocking is still the safer choice, but you can report that the penalty was small for this dataset. 3. **Blocked CV reports lower error** (rare). This can happen when spatial structure helps rather than hurts, for instance when the signal is region-specific. Investigate whether the model generalizes to new regions. # Power analysis after blocking Blocked CV is more honest than random CV, but it comes at a cost: reduced statistical power. When observations within a block are correlated, the block contributes less information than the same number of independent observations would. The effective sample size shrinks, and with it the ability to distinguish a good model from a bad one. Before committing to blocked CV, check whether your dataset retains enough statistical power to detect the effect sizes you care about. ```{r power-analysis} pw <- borg_power( sim$data, coords = c("x", "y"), target = "y.1" ) pw ``` The design effect (DEFF) quantifies how much variance increases due to within-block correlation. A DEFF of 2.5 means you need 2.5 times as many observations to achieve the same power as independent data. The output reports power under both random and blocked CV, plus the minimum detectable effect size. ```{r power-summary} summary(pw) ``` If `sufficient` is `FALSE`, the dataset is too small for blocked CV to detect meaningful effects. Options in that case: - **Collect more data.** The most reliable fix. - **Reduce the number of folds.** Fewer folds means more training data per fold, which can restore power at the cost of higher variance in the CV estimate. - **Use leave-one-block-out** instead of k-fold, if the number of natural blocks is small. - **Use random CV with an explicit caveat** in the methods section, noting the power limitation and the expected direction of bias. # Putting it all together Here is the complete workflow from raw data to publication-ready output, using the spatial simulation as a stand-in for real data. Every step below would be identical with a real dataset; just replace `sim$data` with your own data frame. ```{r full-workflow} # 1. Simulate (or load) data sim_full <- borg_simulate(n = 400, type = "spatial", seed = 99) # 2. Diagnose + generate folds res <- borg(sim_full$data, coords = c("x", "y"), target = "y.1", v = 5) # 3. Fit a model across all folds wf_full <- borg_workflow( sim_full$data, formula = y.1 ~ x1 + x2 + x3 + x4 + x5, coords = c("x", "y"), v = 5, fit_fun = lm, metric = "rmse" ) # 4. Summarize performance cat("Mean RMSE:", mean(wf_full$performance$rmse), "\n") cat("SD RMSE:", sd(wf_full$performance$rmse), "\n") ``` ```{r full-workflow-compare} # 5. Compare to random CV comp <- borg_compare_cv( sim_full$data, formula = y.1 ~ x1 + x2 + x3 + x4 + x5, coords = c("x", "y"), v = 5, repeats = 5, seed = 99 ) comp ``` ```{r full-workflow-power} # 6. Power check pw_full <- borg_power(sim_full$data, coords = c("x", "y"), target = "y.1") cat("Sufficient power:", pw_full$sufficient, "\n") ``` ```{r full-workflow-methods} # 7. Generate methods text methods <- summary(res, comparison = comp) ``` Steps 2 through 7 took six function calls. The methods text from step 7 can go directly into a manuscript, including the statistical tests performed, the CV strategy chosen, and the empirical inflation estimate. This is the pattern we recommend for any spatial modelling analysis: diagnose first, block based on the diagnosis, quantify the inflation, check power, then report. If you need to customize the model fitting function or the error metric, the `fit_fun` and `metric` arguments to `borg_workflow()` accept arbitrary functions. The `borg_compare_cv()` function accepts `model_fn` and `predict_fn` for the same purpose, so you can compare random vs blocked CV with any learner. # Taste tests: more features Each of the features below has a dedicated vignette with a full worked example. Here we show just enough code to give the flavour. If a feature looks relevant to your work, follow the link to the dedicated vignette. ## Area of Applicability A model trained on data from lowland forests should not be trusted when predicting in alpine meadows. The training data simply does not cover that part of environmental space. `borg_aoa()` formalizes this intuition by computing a dissimilarity index (DI) for every new prediction location relative to the training data, following Meyer & Pebesma (2021). Points with DI above a threshold fall outside the area of applicability (AOA). Predictions at those points should be flagged or excluded from downstream analysis. ```{r aoa-taste, fig.width=7, fig.height=5} train_data <- sim$data[train_idx, ] test_data <- sim$data[test_idx, ] aoa <- borg_aoa( train = train_data, new = test_data, predictors = c("x1", "x2", "x3", "x4", "x5"), coords = c("x", "y") ) cat("Inside AOA:", sum(aoa$aoa), "/", nrow(aoa), "\n") autoplot(aoa) ``` The DI threshold is derived from the distribution of distances within the training set. Points beyond the threshold are in regions of predictor space that the training data does not cover. The `autoplot()` method shows a map (if coordinates are available) or a histogram of DI values, with the threshold marked. For a complementary view, `borg_extrapolation()` identifies which specific variables are outside their training range at each prediction point, and `borg_transferability()` estimates how well the model transfers to distinct geographic regions. See `vignette("aoa-extrapolation")` for the full treatment. ## Block-permutation variable importance Standard permutation importance shuffles a variable's values row by row and measures the drop in predictive performance. On autocorrelated data, this can be misleading: even after permuting one variable, nearby observations still share information through the remaining (non-permuted) spatially structured predictors. The result is that importance scores are deflated for the permuted variable and inflated for correlated neighbours. `borg_importance()` fixes this by permuting entire spatial blocks instead of individual rows. All observations within a block are permuted together, breaking the local spatial signal. ```{r importance-taste, fig.width=7, fig.height=4} # Fit a model first model <- lm(y.1 ~ x1 + x2 + x3 + x4 + x5, data = train_data) imp <- borg_importance( model = model, data = train_data, target = "y.1", coords = c("x", "y"), n_rep = 5 ) autoplot(imp) ``` The lollipop chart ranks variables by their block-permutation importance. Error bars reflect variability across repeated permutations (`n_rep`). Variables with importance near zero contribute nothing to prediction and can safely be dropped. BORG also provides `borg_forward_selection()` for stepwise variable selection using blocked CV at each step, and `borg_best_subset()` for exhaustive search over small predictor sets. See `vignette("feature-selection")` for the full treatment. ## Residual diagnostics After fitting a model, check whether the residuals still carry spatial structure. If they do, the model is missing a spatially structured predictor, or the CV strategy did not adequately separate spatial neighbours. Either way, error estimates are unreliable. ```{r residuals-taste, fig.width=7, fig.height=4} resid_check <- borg_check_residuals( model, data = train_data, coords = c("x", "y") ) resid_check autoplot(resid_check) ``` The variogram shows how residual semivariance changes with distance. A flat variogram (constant semivariance at all distances) means residuals are spatially independent, meaning the model has captured the spatial signal. An increasing variogram signals unmodelled spatial structure: nearby residuals are more similar than distant ones, which means the model is systematically missing a spatial pattern. `borg_check_residuals()` also reports Moran's I for the residuals and classifies the spatial signal as "clean", "mild", or "strong". See `vignette("diagnostics")` for spatial residual checks combined with posterior predictive checks and formal model comparison. ## Calibration Discrimination (does the model rank observations correctly?) gets most of the attention. Calibration (are the predicted values accurate in absolute terms?) is equally important but rarely checked. A model can have high AUC but still predict probabilities that are systematically too high or too low. ```{r calibration-taste, fig.width=7, fig.height=4} preds <- predict(model, newdata = test_data) cal <- borg_calibration( predicted = preds, actual = test_data$y.1, n_bins = 8 ) cal autoplot(cal) ``` The reliability diagram plots observed values against predicted values, binned into groups. A perfectly calibrated model falls on the 1:1 diagonal. Points above the diagonal indicate underprediction; points below indicate overprediction. The expected calibration error (ECE) summarizes the average deviation from the diagonal across all bins. Lower is better; ECE below 0.05 is generally considered well-calibrated. `borg_calibration()` also reports the reliability slope (ideal: 1.0) and intercept (ideal: 0.0), plus the Brier score for classification or calibration MSE for regression. Three binning strategies are available: `"uniform"` (equal bin width), `"quantile"` (equal bin count), and `"spatial"` (bins that respect spatial structure). The spatial binning strategy groups predictions by their geographic location, which matters when calibration varies across the study area. `vignette("diagnostics")` covers calibration diagnostics in full detail. ## Null test Before interpreting a model's predictive skill, verify that it beats random guessing. `borg_null_test()` permutes the response variable and re-evaluates performance many times to build a null distribution. This is worth checking whenever the signal is weak or the sample is small. ```{r null-taste} nt <- borg_null_test( data = sim$data, folds = result$folds, formula = y.1 ~ x1 + x2 + x3 + x4 + x5, n_null = 20 ) nt ``` If the observed metric falls within the null distribution, the model has no predictive skill beyond chance. The p-value counts what fraction of permuted performances equal or exceed the observed performance. Use `n_null = 100` or higher for publication-quality p-values; 20 is enough for a quick sanity check. The null test uses the same blocked folds you pass in, so the test respects the spatial (or temporal, or grouped) structure of the data. A standard permutation test with random folds would itself be subject to the same inflation problem BORG is designed to solve. ## Extrapolation detection A lighter-weight alternative to AOA, `borg_extrapolation()` flags prediction points that fall outside the training data's range in one or more variables. This is a univariate check (AOA uses multivariate dissimilarity), so it catches obvious range violations without the computational cost of a full DI computation. ```{r extrapolation-taste} extrap <- borg_extrapolation( train = train_data, new = test_data, predictors = c("x1", "x2", "x3", "x4", "x5"), coords = c("x", "y") ) cat("Extrapolating:", sum(extrap$extrapolating), "/", nrow(extrap), "\n") ``` Each row reports which variables (if any) are outside the training range at that prediction point, plus its Mahalanobis distance from the training centroid. Because the Mahalanobis distance accounts for correlations between predictors, a point within range on every individual variable can still be flagged if its combination of values is unusual relative to the training distribution. Use `borg_extrapolation()` as a quick pre-check before the full AOA analysis. No extrapolating points in any individual variable means the AOA will likely classify most points as inside the area of applicability. Many out-of-range points signal unreliable predictions before you invest computation time in the full multivariate analysis. ## Geographic distance analysis How well do the CV folds mimic the distance structure the model will face at prediction time? If nearest-neighbour distances in CV test-to-train splits are much shorter than prediction-to-train distances, the CV estimate is still too optimistic, even after spatial blocking. `borg_geodist()` visualizes both distance distributions and runs a Kolmogorov-Smirnov test for a formal check. ```{r geodist-taste, fig.width=7, fig.height=4} gd <- borg_geodist( data = sim$data, folds = result$folds, coords = c("x", "y") ) gd autoplot(gd) ``` A close match (small KS statistic, overlapping density curves) means the CV folds are representative of the actual prediction task. A large gap means the model will face harder predictions in deployment than the CV folds suggest, and the KNNDM strategy (`borg_cv(..., strategy = "knndm")`) is specifically designed to minimize this gap when prediction locations are known. ## Interactive maps For spatial data, `borg_leaflet()` creates an interactive Leaflet map where points are coloured by fold assignment. You can toggle layers to see training vs test points for each fold, which is useful for spotting spatial patterns in the blocking that a static scatter plot might miss. ```{r leaflet-taste, eval = requireNamespace("leaflet", quietly = TRUE), fig.width=7, fig.height=5} borg_leaflet(result, data = sim$data, coords = c("x", "y")) ``` Leaflet maps are most useful with real geographic coordinates (longitude/latitude). The simulated data here uses arbitrary x/y values, so the map places points at the origin. With real data, you get a full basemap with zoom and pan. For static publication figures, use `autoplot()` instead. For interactive exploration during analysis, `borg_leaflet()` is faster to iterate with because you can zoom into specific regions and hover over individual points to see their fold assignment and risk status. # S3 methods BORG objects support standard R generics. You rarely need to dig into S4 slots or list elements manually; the generics provide a consistent interface across all object types. | Method | Works on | What it does | |---|---|---| | `print()` | All BORG objects | One-screen summary with key numbers | | `summary()` | `borg_result`, `BorgDiagnosis`, `BorgRisk`, `borg_cv`, `borg_power`, `borg_workflow` | Detailed breakdown or publication-ready methods text | | `plot()` | `BorgRisk`, `borg_result`, `borg_comparison` | Base R visualizations (no ggplot2 needed) | | `autoplot()` | 25+ classes | ggplot2 visualizations with transparent backgrounds | | `as.data.frame()` | `BorgRisk`, `BorgDiagnosis` | Convert to tidy data frame for downstream analysis | The `autoplot()` methods are the recommended way to visualize results. They produce ggplot2 objects that you can customize with standard ggplot2 syntax (`+ labs(...)`, `+ theme(...)`, etc.). Every autoplot uses transparent backgrounds by default, so figures look correct in both light and dark themes. ## Methods text for publications Writing the model evaluation paragraph of a Methods section is tedious and error-prone. You need to describe the dependency type, the statistical test used to detect it, the CV strategy and its parameters, the number of folds, and the literature references for each choice. `summary()` on a `borg_result` generates this paragraph automatically. Three citation styles are available: APA (default), Nature (superscript numbers), and Ecology (inline author-year). ```{r methods-text} methods_text <- summary(result) ``` ```{r methods-styles, eval = FALSE} summary(result, style = "nature") summary(result, style = "ecology") ``` If you also ran `borg_compare_cv()`, pass the comparison object to include empirical inflation estimates in the methods text. This adds a sentence along the lines of "Random cross-validation inflated RMSE by X% compared to spatial blocking (paired t-test, p = Y)." ```{r methods-comparison, eval = FALSE} summary(result, comparison = comparison) ``` The returned string is a character vector that you can write directly to a file or paste into a manuscript. All references to BORG functions, statistical tests, and methodological citations are included. # Practical guidance ## When to use BORG Use BORG any time you evaluate a predictive model on data that has spatial, temporal, or grouped structure. This includes species distribution models, remote-sensing-based land cover classification, clinical models with repeated patient measurements, environmental time series forecasting, and any situation where observations are not independent. If your data is genuinely IID (no spatial coordinates, no time column, no grouping variable), BORG will detect `"none"` as the dependency type and recommend random CV. In this case, there is no penalty from using BORG; it simply confirms that random CV is appropriate. ## When NOT to use BORG BORG is designed for tabular prediction problems with a response variable and structured dependencies between observations. The following fall outside its scope: - Image or text data where spatial/temporal structure is internal to each observation rather than between observations. - Unsupervised learning (no target variable to evaluate). - Reinforcement learning or online learning settings. ## Rules of thumb - **Minimum 100 observations** for stable spatial blocking. Below this, folds become too small and fold-to-fold variance dominates the CV estimate. - **Minimum 20 clusters** for leave-group-out CV. With fewer groups, each fold removes a large fraction of the data, inflating variance. - **5 repeats minimum** for `borg_compare_cv()`. The paired t-test needs enough observations to distinguish a real signal from noise. Use 10 or more for publication. - **Check power before blocking.** If `borg_power()` reports `sufficient = FALSE`, consider whether the cost of biased error estimates (from random CV) outweighs the cost of noisy estimates (from blocked CV with low power). - **Always simulate first.** Use `borg_simulate()` with parameters similar to your real data (same n, similar autocorrelation strength) to calibrate your expectations. If BORG cannot detect the simulated dependency, it will also miss it in your real data. - **Include the validation certificate** in your supplementary materials. It documents every diagnostic test and CV decision, which makes peer review faster and more transparent. # Validation certificates For reproducibility, BORG can generate a certificate that records the diagnosis, the CV strategy, and the software environment. ```{r certificate} cert <- borg_certificate(result$diagnosis, data = sim$data) cert ``` The certificate records the R version, BORG version, diagnosis summary, recommended CV strategy, number of folds, and a timestamp. Export it to YAML or JSON for archival alongside your analysis code. ```{r export-cert, eval = FALSE} borg_export(result$diagnosis, sim$data, "validation.yaml") borg_export(result$diagnosis, sim$data, "validation.json") ``` Storing a validation certificate with your analysis makes the evaluation choices reproducible and auditable. If a reviewer asks why you used spatial blocking instead of random CV, the certificate documents the statistical tests and their results. # Spatial input formats BORG accepts three spatial input formats. Plain data frames with coordinate columns (as used throughout this vignette) are the simplest. If your data is already an `sf` object, BORG extracts coordinates from the geometry column automatically; you do not need to specify `coords`. ```{r sf-input, eval = requireNamespace("sf", quietly = TRUE)} library(sf) sf_data <- st_as_sf(sim$data, coords = c("x", "y")) result_sf <- borg(sf_data, target = "y.1", v = 5) cat("Detected from sf geometry:", result_sf$diagnosis@dependency_type, "\n") ``` Terra `SpatVector` objects are also supported. BORG calls `terra::crds()` to extract coordinates and converts the attribute table to a data frame internally. All three formats (data.frame + coords, sf, SpatVector) produce identical results. Choose whichever format your upstream data pipeline already uses. Converting between formats just to satisfy BORG is unnecessary; the package handles the conversion internally. # Function reference The table below lists the main user-facing functions. Each function has a help page (`?function_name`) with argument documentation and examples. ## Core | Function | Purpose | |---|---| | `borg()` | Main entry point: diagnose data or validate splits | | `borg_inspect()` | Detailed inspection of train/test split, preprocessing, or model | | `borg_diagnose()` | Analyze data dependencies only | | `borg_cv()` | Generate CV folds (list, rsample, caret, or mlr3 format) | | `borg_check()` | Quick one-call pipeline check | | `borg_workflow()` | Full diagnose-fit-validate cycle | | `borg_pipeline()` | Validate a multi-stage pipeline (recipe + model + tuning) | ## Evaluation | Function | Purpose | |---|---| | `borg_compare_cv()` | Empirical random vs blocked CV comparison | | `borg_power()` | Power analysis after blocking | | `borg_null_test()` | Permutation-based null hypothesis test | | `borg_calibration()` | Reliability diagram and calibration metrics | | `borg_conformal()` | Conformal prediction intervals | ## Spatial analysis | Function | Purpose | |---|---| | `borg_aoa()` | Area of applicability (Meyer & Pebesma 2021) | | `borg_extrapolation()` | Univariate range and Mahalanobis extrapolation check | | `borg_transferability()` | Region-to-region transfer performance | | `borg_geodist()` | CV vs prediction distance distribution matching | | `borg_block_size()` | Optimize spatial block size | | `borg_check_residuals()` | Spatial residual diagnostics (Moran's I, variogram) | | `borg_thin()` | Spatial thinning to reduce autocorrelation | ## Feature selection and importance | Function | Purpose | |---|---| | `borg_importance()` | Block-permutation variable importance | | `borg_forward_selection()` | Forward feature selection with blocked CV | | `borg_best_subset()` | Best subset selection (exhaustive, small predictor sets) | | `borg_shap()` | SHAP-based feature contributions | ## Reporting | Function | Purpose | |---|---| | `summary()` | Publication-ready methods text (APA, Nature, Ecology styles) | | `borg_certificate()` | Validation certificate for reproducibility | | `borg_export()` | Export certificate to YAML/JSON | | `borg_report()` | Generate an HTML validation report | | `borg_leaflet()` | Interactive spatial maps | ## Simulation | Function | Purpose | |---|---| | `borg_simulate()` | Generate data with known dependencies | | `borg_sample_design()` | Evaluate sampling design adequacy | # Next steps - `vignette("risk-taxonomy")`: Complete catalog of all risks BORG detects, with examples showing each violation type and how to fix it. - `vignette("frameworks")`: Integration with caret, tidymodels, mlr3, ENMeval, and biomod2. Shows how to use BORG-guarded CV functions as drop-in replacements in existing workflows. # References - Roberts, D.R. et al. (2017). Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. *Ecography*, 40(8), 913-929. doi:10.1111/ecog.02881 - Kaufman, S. et al. (2012). Leakage in data mining: Formulation, detection, and avoidance. *ACM TKDD*, 6(4), 1-21. doi:10.1145/2382577.2382579 - Meyer, H. & Pebesma, E. (2021). Predicting into unknown space? Estimating the area of applicability of spatial prediction models. *Methods in Ecology and Evolution*, 12(9), 1620-1633. doi:10.1111/2041-210X.13650 - Kapoor, S. & Narayanan, A. (2023). Leakage and the reproducibility crisis in machine-learning-based science. *Patterns*, 4(9), 100804. doi:10.1016/j.patter.2023.100804 - Linnenbrink, J. et al. (2024). kNNDM: k-fold nearest neighbour distance matching cross-validation for map accuracy estimation. *Geoscientific Model Development*, 17, 5897-5912. doi:10.5194/gmd-17-5897-2024