---
title: "Temporal Cross-Validation"
author: "Gilles Colling"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Temporal Cross-Validation}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5,
  dev = "svglite",
  fig.ext = "svg"
)
library(BORG)
library(ggplot2)

theme_borg <- theme_minimal(base_size = 12) +
  theme(
    panel.background = element_rect(fill = "transparent", colour = NA),
    plot.background = element_rect(fill = "transparent", colour = NA),
    legend.background = element_rect(fill = "transparent", colour = NA)
  )
theme_set(theme_borg)
```

# Introduction

Time series cross-validation is one of those techniques that most practitioners know they should use but often skip in favour of random k-fold CV. The cost of that shortcut is inflated performance estimates. A model that appears to explain 60% of variance under random CV might explain 25% when evaluated honestly on future data. The gap between those numbers is real: it shows up in production, in failed forecasts, and in published results that do not replicate.

The problem is look-ahead bias. When random CV scatters observations into folds without regard for time, the training set routinely contains future observations. A model trained on data from January through December and tested on March has already seen April. The test error shrinks because the model has access to information that would not exist at prediction time. This is not a subtle statistical nuance; it is a data leak that invalidates the entire evaluation.

The mechanism is straightforward. Suppose we want to predict species abundance in July 2023. Under random 5-fold CV, the training set could include observations from June and August 2023. A model that simply memorizes local trajectories will report low test error because it interpolates between temporally adjacent training points.
Under temporal CV, the training set contains only data from before July 2023. The model must extrapolate from the past, which is the actual task we care about.

Look-ahead bias affects any dataset where observations have a natural temporal ordering: financial time series, ecological monitoring records, climate projections, sensor readings, patient health trajectories. In finance, a stock-price model evaluated with random CV can appear to predict next-day returns with 2% accuracy when the true out-of-sample accuracy is 0.5%. In ecology, species distribution models fitted to multi-year survey data routinely show R-squared inflation of 15-40% under random CV because annual population counts are autocorrelated. In clinical medicine, patient health trajectories measured at weekly intervals carry enough autocorrelation that random CV inflates the apparent predictive power of longitudinal biomarker models by 10-30%.

Panel data deserves special mention. Longitudinal studies track the same units (patients, field sites, countries) over time. Both temporal autocorrelation within each unit and similarity between repeated measurements on the same unit inflate metrics simultaneously. A patient's blood glucose at time $t$ predicts their blood glucose at time $t+1$, and a random CV fold that places both observations in different sets allows the model to exploit this identity-level autocorrelation. Ignoring either the temporal or the group structure produces optimistic estimates; ignoring both produces estimates that can be off by a factor of two.

BORG provides temporal cross-validation strategies that enforce chronological ordering, insert embargo gaps between training and test periods, and detect when random CV would produce misleading results.
This vignette walks through the full workflow: diagnosing temporal structure, choosing between blocking and expanding window strategies, setting embargo periods, comparing random vs temporal CV, and detecting concept drift in deployment data. The functions we will use are `borg_simulate()`, `borg_diagnose()`, `borg_cv()`, `borg_temporal_cv()`, `borg_compare_cv()`, `borg_drift()`, and `borg_fold_performance()`. All code runs end to end with no external dependencies beyond BORG and ggplot2.

# The problem: temporal autocorrelation

Temporal autocorrelation is the tendency for consecutive observations to resemble each other. Today's temperature predicts tomorrow's. This quarter's GDP predicts next quarter's. A patient's blood pressure reading at 10:00 AM carries information about the reading at 10:15 AM. The closer two observations are in time, the more correlated they tend to be.

Three quantities characterize the severity of temporal autocorrelation for cross-validation purposes.

**Autocorrelation function (ACF).** The ACF measures the correlation between a time series and lagged versions of itself. The lag-1 ACF is the correlation between consecutive observations. For financial returns, this might be 0.05 (nearly independent). For daily temperature, it exceeds 0.9. For ecological population counts measured annually, values of 0.4-0.7 are common. When the lag-1 ACF exceeds 0.2, random CV starts to produce measurably inflated estimates.

**Ljung-Box test.** The Ljung-Box Q-statistic tests the null hypothesis that the first *k* autocorrelations are all zero. A significant result (p < 0.05) means the series contains temporal structure that CV must respect. BORG computes this at lag 10 by default. A non-significant result does not guarantee independence (the series could have nonlinear dependence), but a significant result is a clear signal that random CV is inappropriate.

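
The lag-1 ACF and the Ljung-Box p-value can both be computed with base R alone. The following is a minimal sketch, not BORG's internal code: the helper names `acf_lag1()` and `lb_p()` are ours, and the two series are simulated with `arima.sim()` and `rnorm()` purely for illustration.

```{r sketch-acf-ljung}
# Illustrative sketch with base R only; not BORG's internal implementation.
set.seed(1)
n <- 500
ar1 <- as.numeric(arima.sim(model = list(ar = 0.7), n = n))  # autocorrelated series
iid <- rnorm(n)                                              # independent series

# Lag-1 autocorrelation: acf() returns lag 0 first, so lag 1 is element 2.
acf_lag1 <- function(x) acf(x, lag.max = 1, plot = FALSE)$acf[2]

# Ljung-Box test on the first 10 autocorrelations.
lb_p <- function(x) Box.test(x, lag = 10, type = "Ljung-Box")$p.value

c(ar1 = acf_lag1(ar1), iid = acf_lag1(iid))
c(ar1 = lb_p(ar1), iid = lb_p(iid))
```

The autocorrelated series should show a lag-1 ACF near its generating value of 0.7 and a Ljung-Box p-value close to zero, while the independent series should show a lag-1 ACF near zero.
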
**Effective sample size.** With autocorrelation, 500 observations do not carry as much information as 500 independent observations. The effective sample size is $n_{\text{eff}} = n \cdot (1 - \rho) / (1 + \rho)$, where $\rho$ is the lag-1 ACF. A series of 500 observations with $\rho = 0.8$ has an effective sample size of about 56. Random CV acts as if each of those 500 observations contributes independent test information. They do not. The test error is estimated from 56 effective data points masquerading as 500, producing confidence intervals that are far too narrow and point estimates that are biased toward optimism.

The connection between autocorrelation and CV inflation is mechanical. Consider a simple scenario: you want to predict $y_{t+1}$ given predictors $x_t$. In random CV, the training set might contain $y_{t-1}$, $y_t$, $y_{t+2}$, and $y_{t+3}$. If $\rho = 0.8$, the model can predict $y_{t+1}$ to within $\sqrt{1 - \rho^2} \approx 0.6$ standard deviations just by interpolating between the surrounding training points. The apparent R-squared rises, but the model has learned nothing about the underlying data-generating process; it has memorized the local trajectory.

The interaction between autocorrelation and model complexity matters. Linear models are somewhat protected because they cannot memorize local trajectories as aggressively; they fit a global hyperplane and the autocorrelation mainly inflates confidence intervals rather than point predictions. Tree-based models (random forests, gradient boosted trees) are more vulnerable because they partition the feature space into small regions and can effectively memorize clusters of temporally adjacent training points. A random forest trained with random CV on daily temperature data will place nearby days in the same leaf nodes and report near-perfect test accuracy. The same forest evaluated with temporal CV will show the true predictive skill, which for a naive model is often poor.
Neural networks sit somewhere in between: their capacity to memorize depends on architecture and regularization, but high-capacity networks with dropout disabled can exploit temporal leakage as aggressively as tree-based methods.

The character of the autocorrelation also matters. Strong short-range autocorrelation, like daily weather or hourly sensor data, means that observations 1-3 time steps apart are nearly identical. The lag-1 ACF might be 0.95. Random CV on such data is almost equivalent to testing on the training set itself, because adjacent observations end up on opposite sides of the train/test split. The inflation is large (often 30-60%) but easy to detect with `borg_diagnose()`.

Weak long-range autocorrelation is subtler. Annual ecological surveys with lag-1 ACF of 0.3-0.5 produce moderate inflation (5-20%) that is easy to dismiss as noise. But at these autocorrelation levels, the effective sample size formula above already gives only about 33-54% of the nominal $n$ ($0.5/1.5 \approx 0.33$ at $\rho = 0.5$, $0.7/1.3 \approx 0.54$ at $\rho = 0.3$), and confidence intervals around the CV estimate are meaningfully too narrow. The danger with weak long-range autocorrelation is not that the point estimate is wildly wrong, but that the practitioner believes the estimate is more precise than it is.

# Simulating temporal data

We start by generating data with known temporal structure. `borg_simulate()` with `type = "temporal"` creates AR(1) predictors and an AR(1) noise process. Both the features and the response exhibit temporal autocorrelation, which is the typical scenario in applied work where predictors are themselves time series (economic indicators, environmental measurements, lagged variables).

```{r simulate}
sim <- borg_simulate(
  n = 500,
  type = "temporal",
  n_predictors = 5,
  signal_strength = 0.3,
  autocorrelation = 0.7,
  seed = 42
)
sim
```

The simulated data contains a `time` column (daily dates starting from 2020-01-01), five predictors (`x1` through `x5`), and a response `y`.
The true R-squared is `r round(sim$true_r2, 2)`, which represents the ceiling for any model that correctly recovers the signal without temporal leakage.

```{r colnames}
names(sim$data)
```

We can visualize the response over time to see the autocorrelation structure:

```{r plot-ts, fig.width = 7, fig.height = 4}
ggplot(sim$data, aes(x = time, y = y)) +
  geom_line(alpha = 0.6) +
  labs(title = "Simulated temporal response (AR(1), rho = 0.7)",
       x = "Time", y = "Response")
```

The smooth, slowly varying trajectory is the visual fingerprint of temporal autocorrelation. Consecutive values track each other closely, which means that a random train/test split would place temporally adjacent observations on both sides of the partition.

Now we confirm the temporal structure formally with `borg_diagnose()`:

```{r diagnose}
diag <- borg_diagnose(
  sim$data,
  time = "time",
  target = "y",
  verbose = FALSE
)
diag
```

BORG detects the temporal dependency, reports the lag-1 ACF, the Ljung-Box p-value, the decorrelation lag, and the minimum embargo period. The `@recommended_cv` slot tells us which CV strategy BORG would select automatically.

The diagnostic output deserves careful reading. The lag-1 ACF is the single most informative number: values above 0.2 indicate that random CV will produce measurably inflated estimates, and values above 0.5 indicate that the inflation will be large enough to change substantive conclusions. The Ljung-Box p-value is a formal hypothesis test; a value below 0.05 rejects the null of independence, but values well below 0.001 (as we see here) indicate that the temporal structure is not borderline but unambiguous. The decorrelation lag tells us how many time steps it takes for the autocorrelation to decay below the significance threshold ($2 / \sqrt{n}$).
For our data with $\rho = 0.7$ and $n = 500$, the significance threshold is about 0.09, and the ACF at lag $k$ is approximately $0.7^k$. Since $0.7^6 \approx 0.118$ and $0.7^7 \approx 0.082$, it takes roughly 7 time steps for the autocorrelation to become negligible. The embargo minimum is derived from this decorrelation lag and represents the smallest gap between training and test sets that will prevent temporal leakage. Setting the embargo below this value leaves residual autocorrelation between the last training observation and the first test observation.

We can inspect the temporal diagnostics directly:

```{r temporal-slots}
cat("ACF lag-1: ", diag@temporal$acf_lag1, "\n")
cat("Ljung-Box p: ", diag@temporal$ljung_box_p, "\n")
cat("Embargo minimum: ", diag@temporal$embargo_minimum, "\n")
```

The lag-1 ACF of `r round(diag@temporal$acf_lag1, 3)` confirms strong temporal dependence. The Ljung-Box p-value is effectively zero, rejecting the null of no autocorrelation. The embargo minimum tells us how many time units we need between training and test sets to break the autocorrelation link.

# Temporal block CV

Temporal block CV divides the time series into contiguous blocks and assigns each block to a fold. Within each iteration, the test set is a continuous time window, and the training set is everything before it (minus an embargo gap). This mimics the real prediction task: we train on the past and predict the future.

```{r temporal-block}
cv_block <- borg_cv(
  sim$data,
  time = "time",
  target = "y",
  strategy = "temporal_block",
  v = 5,
  verbose = TRUE
)
cv_block
```

The fold structure preserves chronological order. Each fold tests on a later time window than the previous fold's test set. Training data is always temporally prior to test data, with an embargo gap in between.

We can verify the chronological structure by checking the time ranges of training and test sets in each fold:

```{r fold-ranges}
for (i in seq_along(cv_block$folds)) {
  f <- cv_block$folds[[i]]
  train_range <- range(sim$data$time[f$train])
  test_range <- range(sim$data$time[f$test])
  cat(sprintf("Fold %d: train %s to %s | test %s to %s\n",
              i, train_range[1], train_range[2],
              test_range[1], test_range[2]))
}
```

In each fold, the training period ends before the test period begins. The gap between the last training observation and the first test observation is the embargo period. No future data leaks into the training set.

The contiguous block structure is not just a convenience; it is a requirement. Consider the alternative: assigning random time points to folds while keeping folds "mostly" in order. A fold that tests on March 2021 but includes two February 2021 observations in its test set and three March 2021 observations in the training set has a subtle leak. The autocorrelation between the March training and March test observations acts as a bridge. Contiguous blocks eliminate this by construction: every observation in the test fold is later than every observation in the training set. There is no ambiguity and no possibility of partial leakage.

We can also use the convenience wrapper `borg_temporal_cv()`, which returns an rsample-compatible `borg_rset` object for use with tidymodels:

```{r temporal-cv-wrapper, eval = requireNamespace("rsample", quietly = TRUE)}
cv_rset <- borg_temporal_cv(
  sim$data,
  time = "time",
  target = "y",
  v = 5
)
cv_rset
```

This plugs directly into `tune::tune_grid()` and `tune::fit_resamples()` without any adapter code.

Per-fold performance gives us a direct view of how model accuracy changes over time:

```{r fold-perf}
perf <- borg_fold_performance(
  sim$data,
  folds = cv_block,
  formula = y ~ x1 + x2 + x3 + x4 + x5,
  metric = "rmse"
)
perf
```

The per-fold RMSE values carry diagnostic information that aggregate metrics hide.
Three patterns are worth looking for. First, a monotonically increasing RMSE across folds suggests concept drift: the relationship between predictors and response is changing over time, and the model trained on early data predicts later periods poorly. Second, a U-shaped pattern (high RMSE in early and late folds, low in the middle) can indicate seasonality or a periodic process that the model captures during its central period but misses at the edges. Third, a single spike in one fold often points to an anomalous event (a sensor failure, a policy change, an extreme weather year) rather than a systematic problem. Random CV averages all of these patterns away, producing a single number that hides the temporal structure of model performance.

# Expanding window CV

Temporal block CV assigns each time block to exactly one fold. Expanding window CV (also called forward chaining or walk-forward validation) takes a different approach: it grows the training set over time. Fold 1 trains on the first block and tests on the second. Fold 2 trains on the first two blocks and tests on the third. By the final fold, the training set contains almost all historical data.

```{r expanding}
cv_expand <- borg_cv(
  sim$data,
  time = "time",
  target = "y",
  strategy = "temporal_expanding",
  v = 5,
  verbose = TRUE
)
cv_expand
```

Compare the training set sizes across folds:

```{r expanding-sizes}
sizes <- data.frame(
  fold = seq_along(cv_expand$folds),
  n_train = vapply(cv_expand$folds, function(f) length(f$train), integer(1)),
  n_test = vapply(cv_expand$folds, function(f) length(f$test), integer(1))
)
sizes
```

The training set grows with each fold while the test set remains roughly constant. This reflects the real deployment scenario more faithfully than block CV when the model will be retrained on all available data before making predictions.

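
To make the fold geometry concrete, here is a minimal sketch of how expanding window indices could be built by hand. `expanding_folds()` is a hypothetical helper, not a BORG function, and it ignores the embargo for simplicity; `borg_cv()` may partition the data differently.

```{r sketch-expanding}
# Hypothetical helper for illustration only; not part of BORG, no embargo applied.
expanding_folds <- function(n, v) {
  # Assign rows 1..n (assumed sorted by time) to v + 1 contiguous blocks.
  block <- ceiling(seq_len(n) * (v + 1) / n)
  lapply(seq_len(v), function(i) {
    # Fold i trains on blocks 1..i and tests on block i + 1.
    list(train = which(block <= i), test = which(block == i + 1))
  })
}

folds <- expanding_folds(n = 120, v = 5)
data.frame(
  fold    = seq_along(folds),
  n_train = vapply(folds, function(f) length(f$train), integer(1)),
  n_test  = vapply(folds, function(f) length(f$test), integer(1))
)
```

With 120 rows and 5 folds, the training sets in this sketch grow in steps of 20 rows while every test set stays at 20 rows, and each training index precedes every test index in the same fold.
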
**When to use which.** Temporal block CV is appropriate when you want unbiased estimates of average future performance and when the data-generating process is approximately stationary. Each fold contributes roughly equal weight, and the estimate is not dominated by the largest training set.

Expanding window CV is appropriate when the model will be deployed with a growing training history (which is almost always the case in production forecasting). It answers the question: "Given everything I know up to time $t$, how well do I predict the next window?" The cost is that early folds have small training sets and may underperform, which inflates the variance of the overall estimate.

The bias-variance tradeoff between the two strategies is concrete. Temporal block CV uses a fixed training set size for each fold (roughly $n \times (v-1)/v$ observations). This gives each fold comparable statistical power, so the per-fold estimates have similar variance. But if the data-generating process evolves over time, the early folds train on data that may no longer represent the current regime, introducing bias. Expanding window CV reduces bias in later folds because it trains on the most data, but early folds suffer from high variance due to small training sets. In practice, the expanding window estimate is dominated by the last few folds (which have the most training data and the most test-relevant information), while the block CV estimate weights all folds equally.

A third option exists: sliding window CV (also called rolling window). Here, the training set has a fixed width and slides forward in time. Fold 1 trains on observations 1-100 and tests on 101-120. Fold 2 trains on 21-120 and tests on 121-140. The training set never grows; old data is dropped as new data enters. This is the right strategy when old data is actively harmful (e.g., a financial regime changed two years ago, and pre-change data misleads the model).
BORG does not currently implement sliding window CV as a separate strategy, but you can approximate it by setting `strategy = "temporal_block"` and filtering the training indices in each fold to include only observations within a fixed time window before the test period.

A practical heuristic: if you have fewer than 200 observations, use expanding window CV to avoid wasting data on small training sets. If you have 500+ observations and care about average performance across the full time range, use temporal block CV. If concept drift is a concern, expanding window CV is more informative because you can plot performance against time and see whether the model degrades.

# Embargo periods

The embargo parameter controls the gap between the last training observation and the first test observation. Without an embargo, the observation immediately before the test window appears in training. If autocorrelation is high, that observation carries information about the test data, and the information leak (while reduced compared to random CV) persists.

BORG computes a recommended embargo from the decorrelation lag: the number of time steps after which the ACF drops below the significance threshold ($2 / \sqrt{n}$). For our simulated data with $\rho = 0.7$, the decorrelation lag is several time steps, and the embargo needs to span at least that many time steps to eliminate residual leakage.

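
As a rough check on the decorrelation lag, we can work through the arithmetic for an idealized AR(1) process, where the ACF at lag $k$ is $\rho^k$. This is a sketch under that assumption only; the lag BORG reports is estimated from the sample ACF and can differ.

```{r sketch-decorrelation-lag}
# Assumes an idealized AR(1) process, where the ACF at lag k equals rho^k.
rho <- 0.7
n <- 500
threshold <- 2 / sqrt(n)   # approximate significance bound for the ACF
lags <- 1:50
theoretical_acf <- rho^lags

# First lag at which the theoretical ACF falls below the threshold.
decorrelation_lag <- min(lags[theoretical_acf < threshold])

round(threshold, 3)
decorrelation_lag
```

Under these assumptions the threshold is about 0.089 and the first lag below it is 7, since $0.7^6 \approx 0.118$ and $0.7^7 \approx 0.082$.
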
Here we compare temporal block CV with and without an embargo:

```{r embargo-compare}
cv_no_embargo <- borg_cv(
  sim$data,
  time = "time",
  target = "y",
  strategy = "temporal_block",
  embargo = 0,
  v = 5,
  verbose = FALSE
)

cv_embargo <- borg_cv(
  sim$data,
  time = "time",
  target = "y",
  strategy = "temporal_block",
  embargo = 10,
  v = 5,
  verbose = FALSE
)
```

Check training set sizes to see the effect of the embargo:

```{r embargo-sizes}
no_emb_sizes <- vapply(cv_no_embargo$folds, function(f) length(f$train), integer(1))
emb_sizes <- vapply(cv_embargo$folds, function(f) length(f$train), integer(1))

cat("Without embargo:", paste(no_emb_sizes, collapse = ", "), "\n")
cat("With embargo=10:", paste(emb_sizes, collapse = ", "), "\n")
```

The embargo removes training observations that fall within the gap window. Larger embargoes exclude more data, which reduces the effective training set but also reduces leakage. The trade-off is clear: too small and leakage persists; too large and training data is wasted.

The embargo must also account for feature engineering. If the model uses lagged features or rolling statistics, the embargo needs to be at least as wide as the widest feature window. Consider a model that uses 7-day rolling averages as input features. The rolling average at day $t$ is computed from days $t-6$ through $t$. If the embargo is only 3 days, the training set includes observations from day $t-3$, and the 7-day rolling average at day $t$ in the test set was computed using data from day $t-6$, which overlaps with the training period. The embargo must be at least 7 days to prevent this indirect leakage. More generally, if the model uses $k$-step lagged features, rolling windows of width $w$, or exponentially weighted moving averages with a half-life of $h$, the embargo should be at least $\max(k, w, 3h)$ time steps.
BORG's automatic embargo is based on the raw autocorrelation of the response, not on the feature engineering pipeline, so you may need to override it with a larger value when lagged or rolling features are present.

How to choose the right embargo:

- Set it to the decorrelation lag from `borg_diagnose()`. This is the default when you let BORG choose automatically.
- For daily financial data, 1-5 days is typical.
- For monthly ecological monitoring, 2-6 months covers most autocorrelation structures.
- For annual data with low sample sizes, an embargo of 1-2 years may consume too many observations. In that case, accept a slightly optimistic estimate rather than discarding 20% of your data as an embargo buffer.

If you are uncertain, run `borg_fold_performance()` with several embargo values and check whether the per-fold RMSE stabilizes. If increasing the embargo from 5 to 10 does not change the metrics, 5 is sufficient.

# Comparing random vs temporal CV

The strongest evidence for using temporal CV is an empirical comparison on your own data. `borg_compare_cv()` runs both random and blocked CV, repeats the comparison for stability, and tests whether the performance gap is statistically significant.

```{r compare, fig.width = 7, fig.height = 5}
comparison <- borg_compare_cv(
  sim$data,
  formula = y ~ x1 + x2 + x3 + x4 + x5,
  time = "time",
  metric = "rmse",
  v = 5,
  repeats = 10,
  seed = 42,
  verbose = FALSE
)
comparison
```

The output reports mean RMSE under random and temporal CV, the estimated inflation as a percentage, and a p-value from a paired t-test. For our simulated data with $\rho = 0.7$, random CV typically underestimates RMSE by 20-40%. That gap represents the look-ahead bias.

Visualize the comparison:

```{r plot-compare, fig.width = 6, fig.height = 5}
plot(comparison, type = "boxplot")
```

Each box shows the distribution of RMSE values across repeats. When the boxes do not overlap, the inflation is both statistically and practically significant.
This plot serves as direct evidence for reviewers, or for yourself, that random CV on this dataset gives a misleading picture of model performance.

To calibrate our intuition, we can run the same comparison on independent data where random CV should be fine:

```{r compare-independent}
sim_ind <- borg_simulate(
  n = 500,
  type = "independent",
  n_predictors = 5,
  signal_strength = 0.3,
  seed = 123
)

# Add a time column so we can compare random vs temporal CV
sim_ind$data$time <- seq.Date(as.Date("2020-01-01"), by = "day",
                              length.out = nrow(sim_ind$data))

comp_ind <- borg_compare_cv(
  sim_ind$data,
  formula = y ~ x1 + x2 + x3 + x4 + x5,
  time = "time",
  metric = "rmse",
  v = 5,
  repeats = 10,
  seed = 123,
  verbose = FALSE
)
comp_ind
```

With independent data, the two CV approaches should give similar RMSE values. Any residual gap is due to the different fold geometries (temporal folds are contiguous; random folds are scattered), not leakage. This sanity check confirms that temporal CV is not introducing unnecessary pessimism when the data are genuinely independent.

The inflation number from `borg_compare_cv()` belongs in your methods section. A statement like "Random 5-fold CV underestimated RMSE by 32% relative to temporal block CV (paired t-test, p < 0.001)" gives readers a concrete, dataset-specific measure of how much the evaluation strategy matters. For a journal publication, we recommend including the following numbers: the mean and standard deviation of the metric under both random and temporal CV, the inflation percentage, and the paired t-test p-value. Reviewers increasingly expect this comparison, especially in ecology and environmental science where Roberts et al. (2017) has made temporal and spatial CV standard practice.

How you respond to the inflation result depends on its magnitude. If the inflation is small (below 5%), the temporal structure in your data is weak enough that random CV gives a reasonable estimate.
You can note this in the methods section and proceed with either strategy. If the inflation is between 5% and 20%, random CV is measurably optimistic but the qualitative conclusions (which model is best, which features matter) are likely unchanged. Report the temporal CV metrics as the primary result and mention the inflation as context. If the inflation exceeds 20%, random CV is actively misleading. The model's apparent performance is substantially better than its true out-of-sample performance. In this regime, random CV should not be used for model selection, hyperparameter tuning, or performance reporting. Use temporal CV exclusively, and flag the inflation to co-authors and reviewers as evidence that the evaluation strategy is not a methodological detail but a first-order concern.

# Concept drift detection

Temporal CV produces honest estimates of how well a model predicts the near future. But what happens when the deployment data differs from the training data? In time series, the data-generating process can change: a climate regime shifts, a market enters a new phase, a population crosses a tipping point. This is concept drift, and it means that even honestly estimated performance on historical data overestimates performance on deployment data.

`borg_drift()` detects whether the distribution of predictor variables has shifted between a training set and a new dataset. It combines per-feature Kolmogorov-Smirnov tests with a multivariate classifier two-sample test.

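
To illustrate the two ingredients, here is a minimal base R sketch of a per-feature Kolmogorov-Smirnov test and a classifier two-sample test. It is not how `borg_drift()` is implemented: the toy data, the choice of logistic regression as the classifier, and the in-sample AUC are all assumptions made for illustration.

```{r sketch-drift}
# Illustrative sketch with base R; borg_drift() may compute these differently.
set.seed(42)
train <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
new   <- data.frame(x1 = rnorm(150, mean = 1.5), x2 = rnorm(150))  # x1 is shifted

# Per-feature two-sample Kolmogorov-Smirnov tests.
ks <- t(vapply(names(train), function(v) {
  res <- ks.test(train[[v]], new[[v]])
  c(statistic = unname(res$statistic), p_value = res$p.value)
}, c(statistic = 0, p_value = 0)))
ks

# Classifier two-sample test: can a logistic regression tell the two sets apart?
both <- rbind(cbind(train, is_new = 0), cbind(new, is_new = 1))
fit <- glm(is_new ~ x1 + x2, data = both, family = binomial)
score <- predict(fit, type = "response")

# AUC via the rank-sum (Mann-Whitney) identity; in-sample, so slightly optimistic.
n1 <- sum(both$is_new == 1)
n0 <- sum(both$is_new == 0)
auc <- (sum(rank(score)[both$is_new == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
auc
```

The shifted feature `x1` should produce a large KS statistic and a small p-value, and the AUC should sit well above 0.5. A held-out or cross-validated AUC would be the more careful choice in practice.
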
To demonstrate, we split our simulated data into an early training period and a late deployment period:

```{r drift-split}
n <- nrow(sim$data)
train_data <- sim$data[1:round(n * 0.7), ]
deploy_data <- sim$data[(round(n * 0.7) + 1):n, ]

cat("Training period:", as.character(min(train_data$time)),
    "to", as.character(max(train_data$time)), "\n")
cat("Deployment period:", as.character(min(deploy_data$time)),
    "to", as.character(max(deploy_data$time)), "\n")
```

Now test for drift:

```{r drift-test}
drift <- borg_drift(
  train = train_data,
  new = deploy_data,
  predictors = paste0("x", 1:5),
  seed = 42
)
drift
```

For our AR(1) simulation with a constant data-generating process, the drift test should report no severe shift: the predictor distributions are stationary by construction. The classifier AUC should be near 0.5 (no better than random at distinguishing training from deployment), and the per-feature KS tests should mostly be non-significant.

In real applications, concept drift arises from identifiable causes. Regime changes are the most common: a lake transitions from oligotrophic to eutrophic, an economy enters a recession, or a new policy alters the relationship between predictors and outcomes. Sensor degradation is another source: an instrument that drifts out of calibration over months produces features whose distributions shift gradually. Seasonal or cyclical processes can mimic drift if the training period does not cover a full cycle. A model trained on spring and summer data will see apparent drift when evaluated on autumn data, even though the underlying process is stationary over years. Finally, changes in data collection protocols (a new survey method, a different sampling frequency, a replaced sensor) create abrupt distributional shifts that are easy to detect but hard to correct without retraining.

Here is an artificial example where we shift the deployment data:

```{r drift-shifted}
deploy_shifted <- deploy_data
deploy_shifted$x1 <- deploy_shifted$x1 + 1.5
deploy_shifted$x2 <- deploy_shifted$x2 * 2

drift_shifted <- borg_drift(
  train = train_data,
  new = deploy_shifted,
  predictors = paste0("x", 1:5),
  seed = 42
)
drift_shifted
```

With two features shifted, the classifier AUC rises well above 0.5, and the per-feature results identify exactly which variables drifted. The `overall_severity` field summarizes whether the shift is mild, moderate, or severe.

The severity levels map to concrete actions. "None" means the deployment data is indistinguishable from the training data; no action is needed. "Mild" means one or two features show statistically significant shifts, but the effect sizes are small (KS statistic below 0.15). The model's predictions are likely still reliable, but monitoring the drifted features over the next evaluation cycle is prudent. "Moderate" means the classifier can distinguish training from deployment data with AUC above 0.65, or multiple features show KS statistics between 0.15 and 0.30. Performance degradation is likely, and retraining on more recent data is recommended before deployment. "Severe" means the classifier AUC exceeds 0.75 or at least one feature has a KS statistic above 0.30. The deployment data comes from a different regime than the training data, and any performance estimates derived from the training period are unreliable. Retrain the model, investigate which features changed and why, and consider whether the model's feature set is still appropriate for the new regime.

The practical workflow is:

1. Train your model and evaluate it with temporal CV.
2. Before deploying, run `borg_drift()` on the deployment data.
3. If severity is "none" or "mild", proceed. If "moderate" or "severe", retrain on more recent data or investigate which features changed and why.

Concept drift detection does not replace temporal CV; it extends it.
Temporal CV tells you how well the model predicts the next time window given the current training data. Drift detection tells you whether the next time window is similar enough to the training data for those predictions to be trustworthy.

# Practical guidance

## Choosing a strategy

The choice between temporal block CV and expanding window CV depends on the prediction task and dataset size.

**Temporal block CV** (`strategy = "temporal_block"`): the default for temporal data. Each fold tests on a contiguous time block. Training data comes from before the test block. Use this when you want an average estimate of future prediction performance and when the time series is long enough (300+ observations with 5 folds, so each fold has 60+ observations).

**Expanding window CV** (`strategy = "temporal_expanding"`): the training set grows with each fold. Use this when the model will be retrained on all available data before deployment. Also the better choice for shorter time series (100-300 observations) because early folds still have enough training data to fit a reasonable model.

**Purged CV** (`strategy = "purged_cv"`): de Prado's combinatorial purged cross-validation. Uses all combinations of time groups as test sets, with embargo gaps. Most conservative; computationally expensive. Use for financial applications where even small leakage has large dollar consequences.

A decision tree:

- Fewer than 100 observations -> expanding window with v = 3.
- 100-300 observations -> expanding window with v = 5.
- 300+ observations, stationary process -> temporal block with v = 5.
- 300+ observations, suspected drift -> expanding window with v = 5-10, then plot per-fold performance to detect degradation.
- Financial time series -> purged CV with appropriate embargo.

## Rules of thumb with specific numbers

**Minimum dataset size.** Temporal block CV with 5 folds requires at least 100 observations for each fold to have 20 test observations.
Below 50 total observations, expanding window CV with v = 3 is the only practical option.

**Embargo size.** Set it to the decorrelation lag. For daily data with $\rho = 0.7$, this is typically 5-10 days. For monthly data with $\rho = 0.5$, 2-4 months. The embargo should never exceed 20% of the total time range; if it does, the training sets become too small.

**Number of folds.** Five is the default. Three folds work for short series (under 200 observations). Ten folds are appropriate for long series (1000+ observations) where you want lower-bias estimates.

**Repeats.** Temporal block CV is deterministic given the data ordering, so `repeats = 1` is fine for a single run. Use `repeats = 5` to `10` in `borg_compare_cv()` because the random CV baseline needs repeated sampling for stability.

**When to worry about inflation.** If `borg_compare_cv()` shows less than 5% inflation, temporal CV and random CV are practically equivalent for your data. This can happen when the autocorrelation is weak ($\rho < 0.1$) or when the signal-to-noise ratio is high enough that the signal dominates the autocorrelation structure.

**Feature engineering pipelines.** Temporal CV interacts with feature engineering in ways that are easy to overlook. If features are computed from the response variable (e.g., lagged response, rolling mean of response), those computations must happen inside the CV loop, not before it. Computing a 30-day rolling average of the response on the full dataset and then splitting into temporal folds leaks future response values into the training set through the rolling window. BORG does not perform feature engineering, so this is the user's responsibility. The safe approach is to compute time-dependent features separately for each fold's training set and apply the same transformation (using only training-set parameters) to the test set.
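A minimal base-R sketch of that fold-wise approach follows. It is an illustration, not part of BORG: `trailing_mean()` and `fold_features()` are hypothetical helpers, the data are simulated, and the fold is assumed to be a contiguous, time-ordered training block followed by a test block.

```r
# Hypothetical helper: mean of the k values strictly before each position
# (NA until k past values exist).
trailing_mean <- function(y, k) {
  n <- length(y)
  out <- rep(NA_real_, n)
  if (n > k) {
    for (i in (k + 1):n) out[i] <- mean(y[(i - k):(i - 1)])
  }
  out
}

# Hypothetical helper: build the response-derived feature for one fold,
# using training responses only.
fold_features <- function(y, train_idx, test_idx, k = 3) {
  y_train <- y[train_idx]
  list(
    train = trailing_mean(y_train, k),
    # Test rows receive the last level known at the end of training;
    # no test-period response enters the feature.
    test = rep(mean(tail(y_train, k)), length(test_idx))
  )
}

set.seed(1)
y <- cumsum(rnorm(40))   # simulated autocorrelated response
train_idx <- 1:30
test_idx <- 31:40

feats <- fold_features(y, train_idx, test_idx, k = 3)

# Perturb the test-period responses and rebuild the features.
y_altered <- y
y_altered[test_idx] <- y_altered[test_idx] + 100
feats_altered <- fold_features(y_altered, train_idx, test_idx, k = 3)

# Fold-wise features do not react to test-period responses.
identical(feats$train, feats_altered$train)
identical(feats$test, feats_altered$test)

# Contrast: a centred 3-point mean computed on the full series.
# The last training row (30) averages in y[31], a test-period value.
leaky <- as.numeric(stats::filter(y, rep(1 / 3, 3), sides = 2))
leaky_altered <- as.numeric(stats::filter(y_altered, rep(1 / 3, 3), sides = 2))
leaky[30] != leaky_altered[30]
```

Perturbing the test responses is a cheap leak check for any pipeline: if a training-set feature changes when only test-period responses change, the feature was computed with information from the future.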
For features derived only from predictors (e.g., lagged temperature, rolling mean of precipitation), pre-computation on the full dataset is safe as long as the embargo accounts for the feature window width.

**Reporting temporal CV in publications.** The methods section should state the CV strategy (temporal block or expanding window), the number of folds, the embargo size in the same time units as the data (e.g., "10-day embargo" not "embargo = 10"), and whether the embargo was set automatically or overridden. If `borg_compare_cv()` was run, report the inflation percentage and the paired t-test p-value. Include per-fold metrics in a supplementary table or figure so that reviewers can assess whether performance is stable across time or shows evidence of drift. A sentence like "Temporal block CV with 5 folds and a 10-day embargo produced mean RMSE of 0.42 (SD 0.05); random 5-fold CV underestimated RMSE by 28% (paired t-test, p < 0.001)" is sufficient for most journals.

## When NOT to use temporal CV

Temporal CV is not always the right choice, even when the data has timestamps.

**Interpolation tasks.** If the goal is to fill gaps in a time series (e.g., imputing missing sensor readings between known values), random CV correctly mimics the prediction task. The model will interpolate at prediction time, so testing its interpolation ability is appropriate.

**Cross-sectional data with timestamps.** If you survey 500 sites once, and each observation has a sampling date, but the dates reflect logistics (which field crew visited when) rather than temporal dynamics, the timestamps carry no temporal autocorrelation. Use `borg_diagnose()` to check: if the Ljung-Box test is non-significant and the lag-1 ACF is below 0.1, random CV is safe.

**Stationary data with weak autocorrelation.** When $\rho < 0.1$ and the Ljung-Box test is non-significant, the series is effectively independent for cross-validation purposes.
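Both criteria can also be checked by hand with base R. The sketch below is not how `borg_diagnose()` computes its diagnostics; `random_cv_defensible()`, its default thresholds, and the choice of 10 lags for the Ljung-Box test are assumptions made for illustration, and the input is assumed to be ordered by time.

```r
# Hypothetical helper: lag-1 autocorrelation plus Ljung-Box test on a
# time-ordered numeric series.
random_cv_defensible <- function(x, acf_threshold = 0.1, alpha = 0.05, lag = 10) {
  # acf() returns lag 0 first, so element 2 is the lag-1 autocorrelation.
  lag1 <- acf(x, lag.max = 1, plot = FALSE)$acf[2]
  lb_p <- Box.test(x, lag = lag, type = "Ljung-Box")$p.value
  list(
    lag1_acf = lag1,
    ljung_box_p = lb_p,
    # Both conditions must hold: small lag-1 ACF (in absolute value)
    # and a non-significant Ljung-Box test.
    ok = abs(lag1) < acf_threshold && lb_p > alpha
  )
}

set.seed(42)
white_noise <- rnorm(300)
ar1 <- as.numeric(arima.sim(model = list(ar = 0.7), n = 300))

random_cv_defensible(white_noise)  # inspect both diagnostics
random_cv_defensible(ar1)$ok       # strongly autocorrelated series fails
```

When both checks pass, the series is in the weak-autocorrelation regime described above.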
Temporal blocking would reduce the effective sample size without improving the estimate.

**Very short time series.** With fewer than 30 observations, no CV strategy produces stable estimates. Consider a train/test split at a fixed time point, or use bootstrap-based confidence intervals instead.

In all these cases, set `allow_random = TRUE` in `borg_cv()` to override the automatic blocking. BORG will issue a warning with the estimated inflation, so the decision is transparent and documented.

## Panel data

Panel data (multiple entities observed over time) requires respecting both the temporal and the group structure. Each entity (patient, site, country) has its own time series, and observations within an entity are autocorrelated.

For panel data, pass both `time` and `groups` to `borg_cv()`:

```{r panel, eval = FALSE}
cv_panel <- borg_cv(
  panel_data,
  time = "date",
  groups = "site_id",
  target = "response",
  v = 5
)
```

BORG will detect the joint structure and select a strategy that respects both the temporal ordering and the group boundaries. No site appears in both training and test within a fold, and training observations are always temporally prior to test observations.

If groups are not important (all entities share the same data-generating process and you want to predict future values for the same entities), temporal blocking alone is sufficient. If entities are important (you want to predict for new entities), group CV is needed in addition to temporal blocking.

## End-to-end workflow

Putting the pieces together, here is the workflow for temporal model validation:

1. **Diagnose.** Run `borg_diagnose()` with the `time` argument. Check the lag-1 ACF, the Ljung-Box p-value, and the recommended embargo.
2. **Generate folds.** Call `borg_cv()` with `time = "time"`. Let BORG select the strategy automatically, or choose between `"temporal_block"` and `"temporal_expanding"` based on the decision tree above.
3. **Set the embargo.** The default (from diagnosis) is usually adequate.
Override with `embargo = k` if you have domain knowledge about the decorrelation time.
4. **Evaluate per-fold.** Use `borg_fold_performance()` to compute per-fold metrics. Plot them in order. If performance degrades in later folds, concept drift may be present.
5. **Compare.** Run `borg_compare_cv()` at least once. Report the inflation percentage in your methods section.
6. **Check for drift.** Before deployment, run `borg_drift()` on the deployment data. Retrain if severity is moderate or above.
7. **Report.** Include the CV strategy, the embargo size, the number of folds, the inflation estimate, and the drift severity. These five items let readers assess whether the reported metrics are trustworthy.

```{r workflow-summary}
cat("Diagnosis:\n")
cat("  Dependency type:", diag@dependency_type, "\n")
cat("  Severity:       ", diag@severity, "\n")
cat("  Recommended CV: ", diag@recommended_cv, "\n\n")
cat("CV strategy used: ", cv_block$strategy, "\n")
cat("Number of folds:  ", length(cv_block$folds), "\n")
```

A few extra lines of code, a few extra numbers in the methods section. The payoff is an evaluation you can trust, and a model that performs in deployment the way your metrics predicted it would.

# References

- Roberts, D.R., et al. (2017). Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. *Ecography*, 40(8), 913-929. doi:10.1111/ecog.02881
- López de Prado, M. (2018). *Advances in Financial Machine Learning*. John Wiley & Sons. Chapter 7: Cross-Validation in Finance.
- Bergmeir, C., & Benítez, J.M. (2012). On the use of cross-validation for time series predictor evaluation. *Information Sciences*, 191, 192-213. doi:10.1016/j.ins.2011.12.028
- Cerqueira, V., Torgo, L., & Mozetič, I. (2020). Evaluating time series forecasting models: An empirical study on performance estimation methods. *Machine Learning*, 109, 1997-2028. doi:10.1007/s10994-020-05910-7
- Tashman, L.J. (2000).
Out-of-sample tests of forecasting accuracy: an analysis and review. *International Journal of Forecasting*, 16(4), 437-450. doi:10.1016/S0169-2070(00)00065-0