---
title: "Risk Taxonomy"
author: "Gilles Colling"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Risk Taxonomy}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5,
  dev = "svglite",
  fig.ext = "svg"
)
library(BORG)
```

This document catalogs all evaluation risks that BORG detects, organized by severity and mechanism. Each risk type includes a description of what triggers it, why it matters for real-world modelling, and how to fix it.

## Risk Classification

BORG classifies risks into two categories based on their impact on evaluation validity:

| Category | Impact | BORG Response |
|----------|--------|---------------|
| **Hard Violation** | Results are invalid | Blocks evaluation, requires fix |
| **Soft Inflation** | Results are biased | Warns, allows with caution |

Hard violations indicate that the evaluation result is not trustworthy at all. The computed metrics could be arbitrarily optimistic, and no conclusion should be drawn from them. Soft inflation risks mean the metrics are likely overestimated, but the direction of model comparisons may still hold. Fixing soft risks is best practice; fixing hard violations is mandatory.

# Hard Violations

These make your evaluation results invalid. Any metrics computed with these violations are unreliable.

## 1. Index Overlap

**What**: Same row indices appear in both training and test sets.

**Why it matters**: Because the model has seen the exact data it is being tested on, any sufficiently flexible learner will achieve perfect scores on the overlapping portion of the test set. In practice, index overlap usually arises from manual bookkeeping errors: copying indices incorrectly, using inclusive rather than exclusive ranges, or inadvertently sharing a row between two data splits.
The bug is easy to introduce and hard to catch by looking at metrics alone, because a small overlap (say 5% of the test set) produces a subtle, plausible-looking inflation.

**Detection**: Set intersection of `train_idx` and `test_idx`.

```{r index-overlap}
data <- data.frame(x = 1:100, y = rnorm(100))

# Accidental overlap: indices 51-60 appear in both sets
result <- borg_inspect(data, train_idx = 1:60, test_idx = 51:100)
result
```

**Fix**: Ensure indices are mutually exclusive. Use `setdiff()` to remove any shared indices, or generate non-overlapping splits from the start with `sample()` and complementary indexing.

## 2. Duplicate Rows

**What**: Test set contains rows identical to training rows.

**Why it matters**: Disjoint indices do not guarantee distinct data. The underlying rows may still be exact copies. This happens often in observational datasets that were merged from multiple sources (where the same record appears in two files), in datasets with many categorical predictors and few unique combinations, or when data augmentation accidentally copies real rows into a held-out set. A model that memorizes training patterns will score well on these duplicates regardless of its true generalization ability. The effect is proportional to the fraction of test rows that are duplicates: 10% duplicates in the test set can inflate accuracy by several percentage points.

**Detection**: Row hashing and comparison (C++ backend for numeric data).

```{r duplicate-rows}
# Data with duplicate rows
dup_data <- rbind(
  data.frame(x = 1:5, y = 1:5),
  data.frame(x = 1:5, y = 1:5)  # Duplicates
)

result <- borg_inspect(dup_data, train_idx = 1:5, test_idx = 6:10)
result
```

**Fix**: Remove duplicate rows before splitting, or ensure splits respect duplicates by keeping all copies in the same set. In ecological survey data, duplicates sometimes represent legitimate repeated measurements; in that case, treat them as a group and use group-aware splitting.

## 3. Preprocessing Leakage

**What**: Normalization, imputation, or dimensionality reduction fitted on the full dataset before splitting.

**Why it matters**: Column means, standard deviations, or PCA loadings computed from the entire dataset carry test-set information into the training pipeline. Information flows backwards from test to train through these shared statistics. The effect is typically small for large datasets with independent observations (a few percent inflation in R-squared) but can be severe for small datasets, datasets with strong distributional shifts between train and test, or high-dimensional data where PCA loadings are heavily sample-dependent. This is one of the most common forms of leakage in published studies because many preprocessing pipelines default to fitting on all available data.

**Detection**: Recompute statistics on train-only data and compare to stored parameters. Discrepancy indicates leakage.

**Supported objects**:

| Object Type | Parameters Checked |
|-------------|-------------------|
| `caret::preProcess` | `$mean`, `$std` |
| `recipes::recipe` | Step parameters after `prep()` |
| `prcomp` | `$center`, `$scale`, rotation matrix |
| `scale()` attributes | `center`, `scale` |

```{r preprocessing-leak, eval=FALSE}
# BAD: Scale fitted on all data
scaled_data <- scale(data)  # Uses all rows!
train <- scaled_data[1:70, ]
test <- scaled_data[71:100, ]

# BORG detects this
borg_inspect(scaled_data, train_idx = 1:70, test_idx = 71:100)
```

**Fix**: Fit preprocessing on training data only, then apply to test:

```r
train_data <- data[1:70, ]
test_data <- data[71:100, ]

# Fit on train
means <- colMeans(train_data)
sds <- apply(train_data, 2, sd)

# Apply to both
train_scaled <- scale(train_data, center = means, scale = sds)
test_scaled <- scale(test_data, center = means, scale = sds)
```

## 4. Target Leakage (Direct)

**What**: Feature has absolute correlation > 0.99 with target.

**Why it matters**: Near-perfect correlation (|r| > 0.99) between a feature and the outcome almost always means the feature was derived from the outcome, either directly or through a chain of deterministic transformations. Classic examples include `days_since_diagnosis` when predicting `has_disease`, `total_spent` when predicting `is_customer`, or aggregated future values leaked into current-period features. These features will dominate any model, producing impressive metrics that collapse entirely in deployment where the post-hoc information is unavailable. Target leakage is particularly common in tabular datasets assembled by joining multiple tables, where a downstream aggregate accidentally ends up as a predictor.

**Detection**: Compute Pearson correlation of each numeric feature with target on training data.

```{r target-leakage}
# Use borg_simulate for controlled target leakage
sim <- borg_simulate(n = 100, type = "target_leak", leak_strength = 0.99, seed = 1)

result <- borg_inspect(sim$data, train_idx = 1:70, test_idx = 71:100,
                       target = "y")
result
```

**Fix**: Remove or investigate the leaky feature. If it is a legitimate predictor, document why correlation > 0.99 is expected and verify the feature is available at prediction time.

## 5. Group Leakage

**What**: Same group (patient, site, species) appears in both train and test.

**Why it matters**: Observations from the same group share latent characteristics. If the same patient appears in both train and test, the model effectively learns patient-specific intercepts during training and is rewarded for them at test time. This inflates performance because, in deployment, the model will encounter new patients it has never seen. The magnitude of inflation depends on the intra-class correlation (ICC): when ICC is high (observations within a group are very similar), group leakage can inflate R-squared by 20-50% or more.
This risk is pervasive in medical data, ecological monitoring (repeated visits to the same site), and any longitudinal study design.

**Detection**: `borg_diagnose()` detects clustered structure and recommends group-aware CV. At the split level, check whether any group's members appear in both the training and test sets.

```{r group-leakage, eval=FALSE}
# Clinical data with patient IDs
clinical <- data.frame(
  patient_id = rep(1:10, each = 10),
  measurement = rnorm(100)
)

# Random split ignoring patients (groups will leak)
set.seed(123)
all_idx <- sample(100)
train_idx <- all_idx[1:70]
test_idx <- all_idx[71:100]

# Check which patients appear in both sets
train_patients <- unique(clinical$patient_id[train_idx])
test_patients <- unique(clinical$patient_id[test_idx])
shared <- intersect(train_patients, test_patients)
cat("Shared patients:", shared, "\n")
```

**Fix**: Use group-aware splitting so that entire groups are held out together:

```r
# Split at the patient level
train_patients <- sample(unique(clinical$patient_id), 7)
train_idx <- which(clinical$patient_id %in% train_patients)
test_idx <- which(!clinical$patient_id %in% train_patients)
```

Alternatively, use `borg_cv()` with the `groups` argument to generate group-respecting folds automatically.

## 6. Temporal Ordering Violation

**What**: Test observations predate training observations.

**Why it matters**: Training on later observations and testing on earlier ones introduces look-ahead bias, and in any time-dependent process the effect is severe. The model can exploit temporal trends, seasonal patterns, or regime shifts that would not be available at prediction time. This violation is especially damaging for autoregressive features (lagged values, rolling averages), because those features were computed under the assumption that earlier data is available. Random splitting of time series data is one of the most common methodological errors in applied ML: it is easy to do, produces excellent-looking metrics, and fails silently.
Studies that randomly split stock prices, weather records, or patient trajectories routinely report inflated accuracy that disappears in live deployment.

**Detection**: Compare max training timestamp to min test timestamp. Any test observation that falls before a training observation constitutes a violation.

```{r temporal-leak}
# Simulate temporal data
sim <- borg_simulate(n = 100, type = "temporal", autocorrelation = 0.7, seed = 1)

# Wrong: random split ignores time ordering
set.seed(42)
random_idx <- sample(100)
train_idx <- random_idx[1:70]
test_idx <- random_idx[71:100]

# borg_diagnose detects the temporal structure
diagnosis <- borg_diagnose(sim$data, time = "time", target = "y",
                           verbose = FALSE)
diagnosis@recommended_cv
```

**Fix**: Use chronological splits where all test data comes after training:

```r
train_idx <- 1:70
test_idx <- 71:100
```

## 7. CV Fold Contamination

**What**: Cross-validation folds contain test indices, or folds overlap incorrectly.

**Why it matters**: For nested cross-validation to be valid, the outer test set must remain completely invisible to all inner folds. If the outer test indices leak into any inner fold's training set, the tuning procedure has seen the test data. This is a subtle and surprisingly common mistake when manually constructing fold objects. It invalidates the entire nested CV result because the hyperparameter selection was influenced by test data.

**Detection**: Check if any fold's training indices intersect with held-out test set.

**Supported objects**:

- `caret::trainControl`: checks `$index` and `$indexOut`
- `rsample::vfold_cv` and other `rset` objects
- `rsample::rsplit` objects

## 8. Model Scope

**What**: Model was trained on more rows than the claimed training set.

**Why it matters**: If the fitted object shows that it was trained on more rows than the training indices specify, test data leaked into fitting.
This can happen indirectly through hyperparameter tuning on the full dataset, through ensemble methods that use the entire dataset for base learner training, or simply through a bookkeeping mismatch between the reported split and the actual training call. Unlike index overlap (which checks the split definition), model scope checks the fitted model itself for signs that it saw more data than it should have.

**Detection**: Compare `nrow(trainingData)` or `length(fitted.values)` to `length(train_idx)`.

**Supported objects**: `lm`, `glm`, `ranger`, `caret::train`, parsnip models, workflows.

# Soft Inflation Risks

These bias results but may not completely invalidate them. Model ranking might be preserved even if absolute metrics are optimistic.

## 1. Target Leakage (Proxy)

**What**: Feature has correlation 0.95-0.99 with target.

**Why warning not error**: A correlation in this range can represent either a legitimately strong predictor or a subtler form of target leakage. Domain knowledge is required to distinguish the two cases. For example, in ecological modelling, mean annual temperature might genuinely correlate at r = 0.97 with species richness. Conversely, a feature like `average_daily_sales_last_year` might correlate at r = 0.96 with `annual_revenue` because it is a near-direct transformation of the target. BORG warns but does not block, giving you the opportunity to investigate.

**Detection**: Same as direct leakage, different threshold.

```{r proxy-leakage}
# Use borg_simulate for controlled proxy leakage
sim <- borg_simulate(n = 100, type = "target_leak", leak_strength = 0.7, seed = 1)

result <- borg_inspect(sim$data, train_idx = 1:70, test_idx = 71:100,
                       target = "y")
result
```

**Action**: Review whether the feature should be available at prediction time in production. If it is a legitimate predictor, document the justification.

## 2. Spatial Proximity

**What**: Test points are very close to training points in geographic space.

**Why it matters**: Nearby locations share environmental conditions, soil properties, and species compositions through spatial autocorrelation. When train and test points are spatially interleaved, the model can interpolate between nearby training points rather than learning generalizable patterns. Error estimates become optimistic. The magnitude scales with the strength of spatial autocorrelation and the ratio of inter-point distance to autocorrelation range. In ecological and environmental modelling, where spatial structure is the rule rather than the exception, this inflation can be severe: reported R-squared values may drop by 30-60% when switching from random to spatially blocked CV.

**Detection**: Compute minimum distance from each test point to nearest training point. Flag if < 1% of spatial spread.

```{r spatial-proximity}
# Use borg_simulate for spatially autocorrelated data
sim <- borg_simulate(n = 100, type = "spatial", autocorrelation = 0.7, seed = 42)

# Random split intermixes nearby points
set.seed(42)
train_idx <- sample(100, 70)
test_idx <- setdiff(1:100, train_idx)

result <- borg_inspect(sim$data, train_idx = train_idx, test_idx = test_idx,
                       coords = c("x", "y"))
result
```

**Fix**: Use spatial blocking so that train and test occupy distinct geographic regions:

```r
# Geographic split along x-axis
train_idx <- which(sim$data$x < 0.5)
test_idx <- which(sim$data$x >= 0.5)
```

## 3. Spatial Overlap

**What**: Test region falls inside training region's convex hull.

**Why it matters**: Interpolation is easier than extrapolation. When test points are surrounded by training points, the model benefits from having nearby reference data in all directions. In deployment, the model may need to predict in regions where no training data exists (true extrapolation). Performance on interpolated test points overestimates the model's transferability to new areas.

**Detection**: Compute convex hull of training points, count test points inside.
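
As a rough illustration of the convex hull check described under Detection, the sketch below uses base R's `grDevices::chull()` and a same-side test against every hull edge. The helper `frac_inside_hull()` and the toy coordinates are hypothetical and are not part of BORG; BORG's internal implementation may differ. The sketch assumes the training points are not all collinear.

```r
# Hypothetical helper (not a BORG function): fraction of test points that
# lie inside the convex hull of the training points.
frac_inside_hull <- function(train_xy, test_xy) {
  train_xy <- as.matrix(train_xy)
  test_xy <- as.matrix(test_xy)
  # Hull vertices, ordered around the hull
  hull <- train_xy[grDevices::chull(train_xy), , drop = FALSE]
  # End point of each hull edge (the next vertex, wrapping around)
  nxt <- hull[c(2:nrow(hull), 1), , drop = FALSE]
  inside <- apply(test_xy, 1, function(p) {
    # Cross product of each edge vector with the vector from the edge start to p
    cross <- (nxt[, 1] - hull[, 1]) * (p[2] - hull[, 2]) -
      (nxt[, 2] - hull[, 2]) * (p[1] - hull[, 1])
    # Inside a convex polygon means the same side of every edge
    # (points on the boundary count as inside)
    all(cross >= 0) || all(cross <= 0)
  })
  mean(inside)
}

# Toy coordinates with a random split
set.seed(42)
xy <- cbind(x = runif(100), y = runif(100))
train_idx <- sample(100, 70)
test_idx <- setdiff(1:100, train_idx)
frac_inside_hull(xy[train_idx, ], xy[test_idx, ])
```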

**Threshold**: Warning if > 50% of test points fall inside training hull.

## 4. Random CV on Dependent Data

**What**: Using random k-fold CV when data has spatial, temporal, or group structure.

**Why it matters**: Random folds break dependencies artificially and produce optimistic error estimates. When observations are not independent (e.g., spatial neighbors, repeated measures, time series), random CV treats each observation as if it were drawn independently. This overestimates the effective sample size and underestimates the prediction error.

```{r random-cv-inflation}
# Simulate spatial data and diagnose its dependency structure
sim <- borg_simulate(n = 200, type = "spatial", autocorrelation = 0.7, seed = 42)

diagnosis <- borg_diagnose(sim$data, coords = c("x", "y"), target = "y.1",
                           verbose = FALSE)
cat("Dependency:", diagnosis@dependency_type, "\n")
cat("Recommended CV:", diagnosis@recommended_cv, "\n")
```

**Fix**: Use `borg_cv()` to generate appropriate blocked CV folds that respect the dependency structure, or call `borg()` for a full automated pipeline.

# Risk Interactions

In practice, evaluation risks rarely appear in isolation. Real-world datasets often exhibit multiple dependency structures simultaneously, and their effects compound in ways that are worse than the sum of parts.

## Spatial + Target Leakage

When a dataset has both spatial autocorrelation and a leaked feature, the two risks interact. Spatial proximity allows the model to exploit local patterns, while the leaked feature provides a near-perfect shortcut to the target. Random CV inflates metrics through both channels at once. The combined inflation is often much larger than either source alone.

```{r combined-risks}
# Simulate data with both spatial autocorrelation and target leakage
sim <- borg_simulate(n = 200, type = "combined", autocorrelation = 0.7,
                     leak_strength = 0.95, seed = 42)

# Random split
set.seed(1)
train_idx <- sample(200, 140)
test_idx <- setdiff(1:200, train_idx)

# Inspect for both risks
result <- borg_inspect(sim$data, train_idx = train_idx, test_idx = test_idx,
                       target = "y.1", coords = c("x", "y"))
result
```

In this example, `borg_inspect()` detects both the target leakage and the spatial proximity issue. Fixing only one of them still leaves the evaluation compromised.

## Preprocessing + Spatial Structure

Preprocessing leakage (fitting normalization on the full dataset) interacts with spatial structure in a subtle way. When the train and test regions have different covariate distributions (which is common in spatial data, where western and eastern regions may differ in elevation, rainfall, or land cover), the test region's statistics contaminate the training normalization. This makes the model's predictions in the test region artificially better, because the preprocessing has already "seen" the test distribution.

## Groups + Temporal Ordering

In longitudinal studies, group leakage and temporal violations can co-occur. If patient A has observations at times t=1,2,3 and you randomly split, some of patient A's early observations may end up in the test set while later observations are in training. The model then predicts the past for a patient it has already seen in the future. Both risks need to be addressed: split by groups first, then ensure temporal ordering within each group.

## Temporal + Spatial

Spatiotemporal datasets carry two dependency axes at once. A classic case is environmental monitoring: weather stations at fixed locations record daily values for years. Observations close in space share local conditions (microclimate, topography), while observations close in time share global conditions (season, regime).
A random split violates both dependencies simultaneously. Blocking on space alone still allows temporal leakage within each spatial block. Blocking on time alone still allows spatial interpolation within each time slice.

The practical consequence is that neither a pure spatial block CV nor a pure temporal block CV produces honest error estimates. A spatial block CV applied to a 10-year daily monitoring dataset will place all 3,650 days from a held-out station into the test set, but the model still sees the same days at neighboring stations. If the spatial autocorrelation range is 50 km and stations are 20 km apart, the model interpolates with ease. Conversely, a temporal block CV that holds out the final year still allows the model to train on the held-out year's spatial neighbors.

The correct approach is to block on both axes: hold out entire spatial regions for entire time windows. This is sometimes called "leave-location-and-time-out" (LLTO) CV. The inflation gap between random CV and LLTO CV is often the largest of any single correction, because two leakage channels close at once. In species distribution modelling with multi-year survey data, we have seen R-squared drop by 40-70% when switching from random to LLTO CV.
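
A minimal sketch of how LLTO folds could be built by hand, assuming each row carries a spatial block label and a time window label. `llto_folds()` is a hypothetical helper, not a BORG function, and the choice to drop from training every row that shares either the held-out block or the held-out window is an assumption about how strictly the two axes should be separated.

```r
# Hypothetical helper (not a BORG function): leave-location-and-time-out folds.
# `block` labels the spatial block of each row, `window` its time window.
llto_folds <- function(block, window) {
  block <- as.character(block)
  combos <- unique(data.frame(block = block, window = window))
  lapply(seq_len(nrow(combos)), function(i) {
    b <- combos$block[i]
    w <- combos$window[i]
    list(
      # Test: the held-out block during the held-out window
      test = which(block == b & window == w),
      # Train: rows that share neither the block nor the window
      # (assumption: rows sharing only one axis are discarded)
      train = which(block != b & window != w)
    )
  })
}

# Toy monitoring design: 4 spatial blocks x 3 yearly windows, 5 rows each
design <- expand.grid(rep = 1:5,
                      block = c("NW", "NE", "SW", "SE"),
                      window = 2021:2023)
folds <- llto_folds(design$block, design$window)
length(folds)        # one fold per block-window combination
lengths(folds[[1]])  # number of test and train rows in the first fold
```

In this toy design each fold tests on 5 rows, trains on 30, and discards the 25 rows that share only one axis with the test set, which is the price of closing both leakage channels.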

```{r temporal-spatial-interaction}
# Simulate one spatially structured and one temporally structured dataset
sim_sp <- borg_simulate(n = 200, type = "spatial", autocorrelation = 0.7, seed = 10)
sim_tm <- borg_simulate(n = 200, type = "temporal", autocorrelation = 0.7, seed = 10)

# Diagnose each independently
diag_sp <- borg_diagnose(sim_sp$data, coords = c("x", "y"), target = "y.1",
                         verbose = FALSE)
diag_tm <- borg_diagnose(sim_tm$data, time = "time", target = "y",
                         verbose = FALSE)

cat("Spatial dependency:", diag_sp@dependency_type, "\n")
cat("Temporal dependency:", diag_tm@dependency_type, "\n")
```

When both structures are present, run `borg_diagnose()` with both `coords` and `time` arguments so that BORG can recommend a spatiotemporal blocking strategy rather than a single-axis block.

# Decision Tree: Which Risks to Check

Not every dataset requires checking all risk types. Use this decision tree to identify which risks are relevant for your data.

**Step 1: Data structure**

- Does your data have repeated measurements from the same subject, site, or unit? Check **group leakage**.
- Does your data have coordinates (longitude/latitude, UTM, or any x/y)? Check **spatial proximity** and **spatial overlap**.
- Does your data have a time column (dates, timestamps, or ordered sequence)? Check **temporal ordering**.

**Step 2: Preprocessing pipeline**

- Did you normalize, impute, or reduce dimensionality on the full dataset before splitting? Check **preprocessing leakage**.
- Did you fit a PCA, encoder, or embedding on all data? Same risk.

**Step 3: Feature provenance**

- Were any features computed from the target variable? Check **target leakage**.
- Were any features aggregated across time periods that include the test period? Check **target leakage** (temporal variant).
- Do any features have suspiciously high correlation with the target? Use `borg_inspect()` with the `target` argument to scan automatically.

**Step 4: Split mechanics**

- Did you create train/test splits manually? Check **index overlap**.
- Did your data go through deduplication after splitting? Check **duplicate rows**.
- Are you using nested CV? Check **CV fold contamination**.
- Did you fit the model separately from splitting? Check **model scope**.

**Quick start**: if you are unsure, run `borg_diagnose()` on your data to detect dependency structure, then `borg_inspect()` on your split to scan for leakage:

```{r decision-tree-example}
# Step 1: Diagnose the dataset
sim <- borg_simulate(n = 200, type = "spatial", autocorrelation = 0.5, seed = 1)
diagnosis <- borg_diagnose(sim$data, coords = c("x", "y"), target = "y.1",
                           verbose = FALSE)
cat("Structure:", diagnosis@dependency_type, "\n")
cat("Severity:", diagnosis@severity, "\n")

# Step 2: Inspect a candidate split
set.seed(1)
train_idx <- sample(200, 140)
test_idx <- setdiff(1:200, train_idx)

inspection <- borg_inspect(sim$data, train_idx = train_idx, test_idx = test_idx,
                           coords = c("x", "y"), target = "y.1")
cat("\nValid split?", inspection@is_valid, "\n")
cat("Hard violations:", inspection@n_hard, "\n")
cat("Soft warnings:", inspection@n_soft, "\n")
```

When `borg_diagnose()` reports spatial or temporal structure and `borg_inspect()` flags spatial proximity on a random split, the diagnosis and inspection agree: you need blocked CV.

# Practical Guidance

## Hard vs. Soft: Triage Order

Hard violations invalidate the evaluation entirely. Soft inflation risks bias it. The triage order follows from this distinction: fix all hard violations first, because no amount of soft-risk mitigation matters if the evaluation is fundamentally broken. A dataset with index overlap and spatial proximity has two problems, but the index overlap must be fixed before the spatial proximity is even worth thinking about.
Once all hard violations are resolved, address soft risks in order of estimated inflation magnitude, which usually means spatial proximity first (it tends to produce the largest bias in environmental and ecological data), then proxy leakage, then spatial overlap.

A common mistake is to spend effort on spatial blocking while the dataset still contains duplicate rows from a sloppy merge. The spatial blocking will not help. The model memorizes the duplicates and scores well on them regardless of how the geographic blocks are arranged. Always clear the hard violations before tuning the soft ones.

## How Risk Severity Scales with Dataset Size

Small datasets amplify every risk. With n = 50, a single duplicate row in the test set represents about 6.7% of the test data (1 of 15 rows, assuming a 70/30 split). With n = 10,000 and the same one duplicate, the effect is 0.03% of the test set (1 of 3,000 rows), which is negligible. The same logic applies to preprocessing leakage: the train-set mean and the full-dataset mean converge as n grows, so the information leaked through preprocessing diminishes. At n > 5,000, the bias from preprocessing leakage on well-behaved numeric features is typically below 0.5% of the metric.

Spatial proximity is the exception. Its severity depends on the ratio of inter-point distance to autocorrelation range, not on n directly. A dataset with n = 100,000 points densely covering a small area can have worse spatial leakage than n = 200 points spread across a continent. The relevant quantity is the effective spatial coverage, not the sample size.

Target leakage severity is independent of n. A perfectly correlated feature dominates the model at any sample size.

## Rules of Thumb

These are empirically grounded defaults. They will not be correct in every situation, but they cover the majority of applied cases we have encountered.

**1. Below n = 200, treat every soft risk as hard.** With fewer than 200 observations, the margin between a biased evaluation and an honest one is small enough that soft inflation risks can flip model rankings. A spatial proximity warning on n = 150 is, in practice, as damaging as a hard violation on n = 5,000. Run `borg_diagnose()` and fix everything it flags.

**2. If the median nearest-neighbor distance between test and train points is less than 10% of the autocorrelation range, spatial blocking is mandatory.** This threshold comes from simulation studies in spatial statistics: below it, interpolation dominates and error estimates are optimistic by 30% or more. If the autocorrelation range is unknown, estimate it from a variogram. If that is not feasible, use 5% of the spatial extent as a conservative proxy for the autocorrelation range and apply the 10% rule to that.

**3. When all features have |r| < 0.8 with the target and n > 500, spatial proximity is the dominant remaining risk for georeferenced data.** In this regime, target leakage is absent and the sample is large enough that preprocessing leakage and duplicates have negligible effect. The evaluation's honesty hinges almost entirely on whether the spatial structure was respected. Conversely, if any feature has |r| > 0.95, target leakage is the dominant risk regardless of spatial structure, and it must be resolved first.

**4. For time series shorter than 100 observations, use a single temporal split rather than rolling-window CV.** Rolling-window CV on short series produces folds with very few test points (sometimes fewer than 10), which makes the performance estimate high-variance and unreliable. A single 70/30 chronological split gives a noisier but less biased estimate than 5-fold rolling windows with 15 test points each.

**5. Group-aware splitting is non-negotiable when the ICC exceeds 0.3.** An intra-class correlation of 0.3 means that 30% of the total variance lives between groups.
Random CV that ignores groups will exploit this structure and inflate metrics accordingly. At ICC > 0.5, the inflation from ignoring groups routinely exceeds the inflation from ignoring spatial structure.

```{r practical-guidance-example}
# Quick triage workflow
sim <- borg_simulate(n = 150, type = "combined", autocorrelation = 0.6,
                     leak_strength = 0.97, seed = 7)

set.seed(7)
train_idx <- sample(150, 105)
test_idx <- setdiff(1:150, train_idx)

result <- borg_inspect(sim$data, train_idx = train_idx, test_idx = test_idx,
                       target = "y.1", coords = c("x", "y"))
cat("Hard violations:", result@n_hard, "\n")
cat("Soft warnings:", result@n_soft, "\n")
```

With n = 150, both the target leakage (hard) and spatial proximity (soft) need fixing. The hard violation comes first. Remove or investigate the leaked feature, then switch to spatial blocking for the remaining soft risk.

# Quick Reference

| Risk Type | Severity | Detection Method | Fix |
|-----------|----------|------------------|-----|
| `index_overlap` | Hard | Index intersection | Use `setdiff()` |
| `duplicate_rows` | Hard | Row hashing | Deduplicate or group |
| `preprocessing_leak` | Hard | Parameter comparison | Fit on train only |
| `target_leakage` | Hard | Correlation > 0.99 | Remove feature |
| `group_leakage` | Hard | Group intersection | Group-aware split |
| `temporal_leak` | Hard | Timestamp comparison | Chronological split |
| `cv_contamination` | Hard | Fold index check | Rebuild folds |
| `model_scope` | Hard | Row count | Refit on train only |
| `proxy_leakage` | Soft | Correlation 0.95-0.99 | Domain review |
| `spatial_proximity` | Soft | Distance check | Spatial blocking |
| `spatial_overlap` | Soft | Convex hull | Geographic split |

# Accessing Risk Details

```{r risk-access}
# Create result with violations
result <- borg_inspect(
  data.frame(x = 1:100, y = rnorm(100)),
  train_idx = 1:60,
  test_idx = 51:100
)

# Summary
cat("Valid:", result@is_valid, "\n")
cat("Hard violations:", result@n_hard, "\n")
cat("Soft warnings:", result@n_soft, "\n")

# Individual risks
for (risk in result@risks) {
  cat("\n", risk$type, "(", risk$severity, "):\n", sep = "")
  cat("  ", risk$description, "\n")
  if (!is.null(risk$affected_indices)) {
    cat("  Affected:", head(risk$affected_indices, 5), "...\n")
  }
}

# Tabular format
as.data.frame(result)
```

## See Also

- `vignette("quickstart")` - Basic usage
- `vignette("frameworks")` - Framework integration