| Title: | Explicit Key Assumptions for Flat-File Data |
|---|---|
| Description: | Helps make implicit data assumptions explicit by attaching keys to flat-file data that error when those assumptions are violated. Designed for CSV-first workflows without database infrastructure or version control. Provides key definition, assumption checks, join diagnostics, and automatic drift detection via watched data frames that snapshot before each transformation and report cell-level changes. |
| Authors: | Gilles Colling [aut, cre, cph] (ORCID: <https://orcid.org/0000-0003-3070-6066>) |
| Maintainer: | Gilles Colling <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.2.0 |
| Built: | 2026-05-29 23:12:41 UTC |
| Source: | https://github.com/gcol33/keyed |
Adds a stable UUID column to each row. This is an opt-in feature for tracking row lineage through transformations.

    add_id(.data, .id = ".id", .overwrite = FALSE)

| Argument | Description |
|---|---|
| .data | A data frame. |
| .id | Column name for the ID (default: ".id"). |
| .overwrite | If TRUE, overwrite existing ID column. If FALSE (default), error if column exists. |

IDs are generated using a hash of row content plus a random salt, making them stable for identical rows within a session but unique across different data frames.
If the uuid package is available, it will be used for true UUIDs. Otherwise, a hash-based ID is generated.
Data frame with ID column added.

    df <- data.frame(x = 1:3, y = c("a", "b", "c"))
    df <- add_id(df)
    df

Binds data frames while properly handling ID columns. Checks for overlapping IDs, combines the data, and fills in missing IDs.

    bind_id(..., .id = ".id")

| Argument | Description |
|---|---|
| ... | Data frames to bind. |
| .id | Column name for IDs (default: ".id"). |

This function:

1. Checks if IDs overlap between datasets (warns if so).
2. Binds rows using dplyr::bind_rows().
3. Fills missing IDs using extend_id().

Use this instead of dplyr::bind_rows() when working with ID columns.
Combined data frame with valid IDs for all rows.

    df1 <- add_id(data.frame(x = 1:3))
    df2 <- data.frame(x = 4:6)
    combined <- bind_id(df1, df2)

Bind keyed data frames

    bind_keyed(..., .id = NULL)

| Argument | Description |
|---|---|
| ... | Data frames to bind. |
| .id | Optional column name to identify source. |

Wrapper for dplyr::bind_rows() that attempts to preserve key metadata.
A keyed data frame if key is preserved and unique, otherwise tibble.
Compares current data against its committed reference snapshot. When both snapshots are keyed with the same key columns, returns a cell-level diff. Otherwise falls back to structural comparison.

    check_drift(.data, reference = NULL)

| Argument | Description |
|---|---|
| .data | A data frame with a snapshot reference. |
| reference | Optional content hash to compare against. If NULL, uses the attached snapshot reference. |

A drift report (class keyed_drift_report), or NULL if no
snapshot found.

    df <- key(data.frame(id = 1:3, x = c("a", "b", "c")), id)
    df <- stamp(df)
    # Modify the data
    df$x[1] <- "modified"
    check_drift(df)

Validates ID column for common issues: missing values, duplicates, and suspicious formats.

    check_id(.data, .id = ".id")

| Argument | Description |
|---|---|
| .data | A data frame with ID column. |
| .id | Column name (default: ".id"). |

Invisibly returns a list with:

- valid: TRUE if no issues found
- n_na: count of NA values
- n_duplicates: count of duplicate IDs
- format_ok: TRUE if IDs look like proper UUIDs/hashes


    df <- add_id(data.frame(x = 1:3))
    check_id(df)

Verifies that ID columns don't overlap between datasets. Useful before binding datasets to ensure no ID collisions.

    check_id_disjoint(..., .id = ".id")

| Argument | Description |
|---|---|
| ... | Data frames to check. |
| .id | Column name for IDs (default: ".id"). |

Invisibly returns a list with:

- disjoint: TRUE if no overlaps found
- overlaps: character vector of overlapping IDs (if any)


    df1 <- add_id(data.frame(x = 1:3))
    df2 <- add_id(data.frame(x = 4:6))
    check_id_disjoint(df1, df2)

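
The idea of an overlap check can be sketched in base R. This is only an illustration of the concept, not the package's implementation; the ID strings are made up for the example and are not real add_id() output.

```r
# Made-up ID vectors standing in for the ID columns of three data frames.
id_sets <- list(
  c("a1", "a2", "a3"),
  c("b1", "b2"),
  c("a3", "c1")
)

# Deduplicate within each set first, so that only IDs shared
# between different sets count as overlaps.
all_ids <- unlist(lapply(id_sets, unique))
overlaps <- unique(all_ids[duplicated(all_ids)])
disjoint <- length(overlaps) == 0

overlaps  # "a3"
disjoint  # FALSE
```
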
Clear all snapshots from cache

    clear_all_snapshots(confirm = TRUE)

| Argument | Description |
|---|---|
| confirm | If TRUE, require confirmation. |

No return value, called for side effects.
Removes the snapshot reference from a data frame.

    clear_snapshot(.data, purge = FALSE)

| Argument | Description |
|---|---|
| .data | A data frame. |
| purge | If TRUE, also remove the snapshot from cache. |

Data frame without snapshot reference.
Compares IDs between two data frames to detect lost rows.

    compare_ids(before, after, .id = ".id")

| Argument | Description |
|---|---|
| before | Data frame before transformation. |
| after | Data frame after transformation. |
| .id | Column name for IDs (default: ".id"). |

A list with:

- lost: IDs present in before but not after
- gained: IDs present in after but not before
- preserved: IDs present in both


    df1 <- add_id(data.frame(x = 1:5))
    df2 <- df1[1:3, ]
    compare_ids(df1, df2)

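
The three fields correspond to ordinary set operations on the two ID vectors. The base R sketch below illustrates that reading; sketch_compare_ids() is a hypothetical helper written for this example, not part of the package, and the IDs are made-up strings.

```r
# Hypothetical helper: classify IDs as lost, gained, or preserved.
sketch_compare_ids <- function(before_ids, after_ids) {
  list(
    lost      = setdiff(before_ids, after_ids),
    gained    = setdiff(after_ids, before_ids),
    preserved = intersect(before_ids, after_ids)
  )
}

before_ids <- c("id-1", "id-2", "id-3", "id-4", "id-5")
after_ids  <- c("id-1", "id-2", "id-3", "id-6")

res <- sketch_compare_ids(before_ids, after_ids)
res$lost       # "id-4" "id-5"
res$gained     # "id-6"
res$preserved  # "id-1" "id-2" "id-3"
```
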
Identifies keys that are new, removed, or common between two keyed data frames. Does not compare values, only key membership.

    compare_keys(x, y, by = NULL)

| Argument | Description |
|---|---|
| x | First keyed data frame. |
| y | Second keyed data frame. |
| by | Column(s) to compare. If NULL, uses the key from x. |

A key comparison object.

    df1 <- key(data.frame(id = 1:3, x = 1:3), id)
    df2 <- key(data.frame(id = 2:4, x = 2:4), id)
    compare_keys(df1, df2)

Compares the structural properties of two data frames without comparing actual values. Useful for detecting schema drift.

    compare_structure(x, y)

| Argument | Description |
|---|---|
| x | First data frame. |
| y | Second data frame. |

A structure comparison object.

    df1 <- data.frame(id = 1:3, x = c("a", "b", "c"))
    df2 <- data.frame(id = 1:5, x = c("a", "b", "c", "d", "e"), y = 1:5)
    compare_structure(df1, df2)

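
As a rough picture of what "structural" means here, the base R sketch below compares row counts, column names, and column classes while ignoring cell values. sketch_structure() and the field names in its result are assumptions made for this illustration; the actual structure comparison object may be organised differently.

```r
df1 <- data.frame(id = 1:3, x = c("a", "b", "c"))
df2 <- data.frame(id = 1:5, x = c("a", "b", "c", "d", "e"), y = 1:5)

# Hypothetical helper: compare shape, column names, and column classes only.
sketch_structure <- function(x, y) {
  shared  <- intersect(names(x), names(y))
  class_x <- vapply(x[shared], function(col) class(col)[1], character(1))
  class_y <- vapply(y[shared], function(col) class(col)[1], character(1))
  list(
    nrow         = c(x = nrow(x), y = nrow(y)),
    cols_only_x  = setdiff(names(x), names(y)),
    cols_only_y  = setdiff(names(y), names(x)),
    type_changes = shared[class_x != class_y]
  )
}

s <- sketch_structure(df1, df2)
s$nrow          # x = 3, y = 5
s$cols_only_y   # "y"
s$type_changes  # character(0): shared columns keep their classes
```
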
Analyzes join cardinality without performing the full join. Useful for detecting many-to-many joins that would explode row count.

    diagnose_join(x, y, by = NULL, use_joinspy = TRUE)

| Argument | Description |
|---|---|
| x | Left data frame. |
| y | Right data frame. |
| by | Join specification. |
| use_joinspy | If TRUE (default), use joinspy for enhanced diagnostics when available. Set to FALSE to use built-in diagnostics only. |

If the joinspy package is installed, this function delegates to
joinspy::join_spy() for enhanced diagnostics including whitespace
detection, encoding issues, and detailed match analysis.
A list with cardinality information, or a JoinReport object if joinspy is used.
See also: joinspy::join_spy() for enhanced diagnostics (if installed).

    x <- data.frame(id = c(1, 1, 2), a = 1:3)
    y <- data.frame(id = c(1, 1, 2), b = 4:6)
    diagnose_join(x, y, by = "id", use_joinspy = FALSE)

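
Join cardinality can be estimated from key counts alone, which is why the full join is not needed. The base R sketch below shows the arithmetic for an inner join on a single key column; it is an illustration under that assumption, not the package's built-in diagnostic, and the variable names are chosen for this example.

```r
x <- data.frame(id = c(1, 1, 2), a = 1:3)
y <- data.frame(id = c(1, 1, 2), b = 4:6)

# Count how often each key value occurs on each side.
nx <- table(x$id)
ny <- table(y$id)
shared <- intersect(names(nx), names(ny))
cx <- as.integer(nx[shared])
cy <- as.integer(ny[shared])

# An inner join yields (count in x) * (count in y) rows per shared key.
expected_rows <- sum(cx * cy)

# A key duplicated on both sides signals a many-to-many join.
many_to_many <- any(cx > 1 & cy > 1)

expected_rows  # 5, although each input has only 3 rows
many_to_many   # TRUE
```

Here key 1 appears twice on each side and contributes 2 * 2 = 4 rows, and key 2 contributes 1 row.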
Compares two data frames row-by-row using the key from x to align rows.
Identifies added, removed, and modified rows, with cell-level detail for
modifications.

    ## S3 method for class 'keyed_df'
    diff(x, y, ...)

| Argument | Description |
|---|---|
| x | A keyed data frame (the "old" or "reference" state). |
| y | A data frame (the "new" state). Must contain the key columns from |
| ... | Ignored (present for S3 compatibility with |

A keyed_diff object with fields:

- key_cols: character vector of key column names
- n_removed, n_added, n_modified, n_unchanged: counts
- removed: data frame of rows in x not in y
- added: data frame of rows in y not in x
- changes: named list of per-column change tibbles (each with key columns, old, and new)
- cols_only_x, cols_only_y: columns present in only one side


    old <- key(data.frame(id = 1:3, x = c("a", "b", "c")), id)
    new <- data.frame(id = 2:4, x = c("B", "c", "d"))
    diff(old, new)

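
The key-aligned comparison can be pictured with plain data frames in base R. The sketch below is an illustration of the idea for a single key column and a single value column, not the method's implementation; it does not handle NA values, and the .old/.new suffixes are a choice made for this example.

```r
old <- data.frame(id = 1:3, x = c("a", "b", "c"))
new <- data.frame(id = 2:4, x = c("B", "c", "d"))

# Rows whose key appears on only one side.
removed <- old[!old$id %in% new$id, ]
added   <- new[!new$id %in% old$id, ]

# Align rows that share a key, then compare the value column cell by cell.
both      <- merge(old, new, by = "id", suffixes = c(".old", ".new"))
modified  <- both[both$x.old != both$x.new, ]
unchanged <- both[both$x.old == both$x.new, ]

removed$id    # 1
added$id      # 4
modified$id   # 2 (x changed from "b" to "B")
unchanged$id  # 3
```
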
Adds IDs to rows where the ID column is NA, preserving existing IDs. Useful after binding new data to an existing dataset with IDs.

    extend_id(.data, .id = ".id")

| Argument | Description |
|---|---|
| .data | A data frame with an ID column (possibly with NAs). |
| .id | Column name for IDs (default: ".id"). |

Data frame with IDs filled in for NA rows.

    # Original data with IDs
    old <- add_id(data.frame(x = 1:3))
    # New data without IDs
    new <- data.frame(.id = NA_character_, x = 4:5)
    # Combine and extend
    combined <- dplyr::bind_rows(old, new)
    extend_id(combined)

Identifies rows with duplicate key values.

    find_duplicates(.data, ...)

| Argument | Description |
|---|---|
| .data | A data frame. |
| ... | Column names to check. If empty, uses the key columns. |

Data frame containing only the rows with duplicate keys,
with a .n column showing the count.

    df <- data.frame(id = c(1, 1, 2, 3, 3, 3), x = letters[1:6])
    find_duplicates(df, id)

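
For intuition, the same kind of result can be approximated in base R by counting how often each key value occurs and keeping the rows whose count exceeds one. This is a sketch of the concept for a single key column, not the package's implementation; row order and column types in the real output may differ.

```r
df <- data.frame(id = c(1, 1, 2, 3, 3, 3), x = letters[1:6])

# Count occurrences of each key value, aligned to the rows.
n <- ave(seq_along(df$id), df$id, FUN = length)

# Keep only rows whose key occurs more than once; .n holds the count.
dups <- cbind(df[n > 1, ], .n = n[n > 1])
dups
```

The row with id 2 is dropped because its key occurs only once.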
Get ID column

    get_id(.data, .id = ".id")

| Argument | Description |
|---|---|
| .data | A data frame. |
| .id | Column name (default: ".id"). |

Character vector of IDs, or NULL if not present.
Get key column names

    get_key_cols(.data)

| Argument | Description |
|---|---|
| .data | A keyed data frame. |

Character vector of column names, or NULL if no key.
Check if data frame has IDs

    has_id(.data, .id = ".id")

| Argument | Description |
|---|---|
| .data | A data frame. |
| .id | Column name to check (default: ".id"). |

Logical.
Check if data frame has a key

    has_key(.data)

| Argument | Description |
|---|---|
| .data | A data frame. |

Logical.
Attaches key metadata to a data frame, marking which column(s) form the unique identifier for rows. Keys are validated for uniqueness at creation.

    key(.data, ..., .validate = TRUE, .strict = FALSE)
    key(.data) <- value

| Argument | Description |
|---|---|
| .data | A data frame or tibble. |
| ... | Column names (unquoted) that form the key. Can be a single column or multiple columns for a composite key. |
| .validate | If |
| .strict | If |
| value | Character vector of column names to use as key. |

A keyed data frame (class keyed_df).

    df <- data.frame(id = 1:3, x = c("a", "b", "c"))
    key(df, id)
    # Composite key
    df2 <- data.frame(country = c("US", "US", "UK"), year = c(2020, 2021, 2020), val = 1:3)
    key(df2, country, year)

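
A composite key is unique when no combination of its column values repeats, even if individual columns do. The base R sketch below illustrates that check; sketch_is_unique_key() is a hypothetical helper for this example and not the validation code used by key().

```r
df2 <- data.frame(country = c("US", "US", "UK"),
                  year = c(2020, 2021, 2020),
                  val = 1:3)

# Hypothetical helper: TRUE when the column combination identifies rows uniquely.
sketch_is_unique_key <- function(data, cols) {
  anyDuplicated(data[cols]) == 0
}

sketch_is_unique_key(df2, c("country", "year"))  # TRUE
sketch_is_unique_key(df2, "country")             # FALSE, "US" repeats
```
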
Checks whether the key columns still exist and are still unique.

    key_is_valid(.data)

| Argument | Description |
|---|---|
| .data | A keyed data frame. |

Logical. Returns FALSE with a warning if key is invalid.
Returns diagnostic information about a keyed data frame.

    key_status(.data)

| Argument | Description |
|---|---|
| .data | A data frame. |

A key status object with diagnostic information.

    df <- key(data.frame(id = 1:3, x = c("a", "b", "c")), id)
    key_status(df)

List all snapshots in cache

    list_snapshots()

Data frame with snapshot information, including size_mb.
Checks that all columns have no NA values.

    lock_complete(.data, .strict = FALSE)

| Argument | Description |
|---|---|
| .data | A data frame. |
| .strict | If |

Invisibly returns .data (for piping).
Checks that the fraction of non-NA values meets a threshold. Useful after joins to verify expected coverage.

    lock_coverage(.data, threshold, ..., .strict = FALSE)

| Argument | Description |
|---|---|
| .data | A data frame. |
| threshold | Minimum fraction of non-NA values (0 to 1). |
| ... | Column names (unquoted) to check. If empty, checks all columns. |
| .strict | If |

Invisibly returns .data (for piping).

    df <- data.frame(id = 1:10, x = c(1:8, NA, NA))
    lock_coverage(df, 0.8, x)
    lock_coverage(df, 0.9, x) # warns (only 80% coverage)

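
Coverage here is the fraction of non-NA values in a column. A base R sketch of that quantity, offered as an illustration of the arithmetic rather than the package's code:

```r
df <- data.frame(id = 1:10, x = c(1:8, NA, NA))

# Fraction of non-NA values per column.
coverage <- vapply(df, function(col) mean(!is.na(col)), numeric(1))

coverage[["id"]]       # 1
coverage[["x"]]        # 0.8
coverage[["x"]] < 0.9  # TRUE: a 0.9 threshold is not met
```
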
Checks that specified columns contain no NA values.

    lock_no_na(.data, ..., .strict = FALSE)

| Argument | Description |
|---|---|
| .data | A data frame. |
| ... | Column names (unquoted) to check. If empty, checks all columns. |
| .strict | If |

Invisibly returns .data (for piping).

    df <- data.frame(id = 1:3, x = c("a", NA, "c"))
    lock_no_na(df, id)
    lock_no_na(df, x) # warns

Checks that the number of rows is within an expected range. Useful for sanity checks after filtering or joins.

    lock_nrow(.data, min = 1, max = Inf, expected = NULL, .strict = FALSE)

| Argument | Description |
|---|---|
| .data | A data frame. |
| min | Minimum expected rows (inclusive). Default 1. |
| max | Maximum expected rows (inclusive). Default Inf. |
| expected | Exact expected row count. If provided, overrides min/max. |
| .strict | If |

Invisibly returns .data (for piping).

    df <- data.frame(id = 1:100)
    lock_nrow(df, min = 50, max = 200)
    lock_nrow(df, expected = 100)

Checks that the combination of specified columns has unique values. This is a point-in-time assertion that either passes silently or fails.

    lock_unique(.data, ..., .strict = FALSE)

| Argument | Description |
|---|---|
| .data | A data frame. |
| ... | Column names (unquoted) to check for uniqueness. |
| .strict | If |

Invisibly returns .data (for piping).

    df <- data.frame(id = 1:3, x = c("a", "b", "c"))
    lock_unique(df, id)
    # Fails with warning
    df2 <- data.frame(id = c(1, 1, 2), x = c("a", "b", "c"))
    lock_unique(df2, id)

Creates an ID column by combining values from one or more columns.
Unlike add_id(), this produces deterministic IDs based on column values.

    make_id(.data, ..., .id = ".id", .sep = "|")

| Argument | Description |
|---|---|
| .data | A data frame. |
| ... | Columns to combine into the ID. |
| .id | Column name for the ID (default: ".id"). |
| .sep | Separator between column values (default: "\|"). |

Data frame with ID column added.

    df <- data.frame(country = c("US", "UK", "US"), year = c(2020, 2020, 2021))
    make_id(df, country, year)
    #>       .id country year
    #> 1 US|2020      US 2020
    #> 2 UK|2020      UK 2020
    #> 3 US|2021      US 2021

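
The deterministic IDs shown above amount to pasting the chosen column values together with the separator. A base R sketch of that idea, written for illustration and not taken from the package source:

```r
df <- data.frame(country = c("US", "UK", "US"), year = c(2020, 2020, 2021))
cols <- c("country", "year")

# Paste the selected columns row by row, separated by "|".
ids <- do.call(paste, c(as.list(df[cols]), sep = "|"))
ids  # "US|2020" "UK|2020" "US|2021"

# Same input always gives the same IDs, unlike randomly salted IDs.
identical(ids, do.call(paste, c(as.list(df[cols]), sep = "|")))  # TRUE
```
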
Remove ID column

    remove_id(.data, .id = ".id")

| Argument | Description |
|---|---|
| .data | A data frame. |
| .id | Column name to remove (default: ".id"). |

Data frame without the ID column.
Stores a snapshot of the current data state, including the full data frame.
This enables cell-level drift reports when used with check_drift().

    stamp(.data, name = NULL, .silent = FALSE)
    commit_keyed(.data, name = NULL)

| Argument | Description |
|---|---|
| .data | A data frame (preferably keyed). |
| name | Optional name for the snapshot. If NULL, derived from data. |
| .silent | If |

Snapshots are stored in memory for the session. They are keyed by content hash, so identical data shares the same snapshot.
When data is watch()ed, dplyr verbs auto-stamp before executing,
creating an automatic safety net for drift detection.
Invisibly returns .data with snapshot metadata attached.
See also: watch() for automatic stamping before dplyr verbs.

    df <- key(data.frame(id = 1:3, x = c("a", "b", "c")), id)
    df <- stamp(df)
    # Later, check for drift
    df2 <- df
    df2$x[1] <- "z"
    check_drift(df2)

Summary method for keyed data frames

    ## S3 method for class 'keyed_df'
    summary(object, ...)

| Argument | Description |
|---|---|
| object | A keyed data frame. |
| ... | Additional arguments (ignored). |

Invisibly returns a summary list.
Remove key from a data frame

    unkey(.data)

| Argument | Description |
|---|---|
| .data | A keyed data frame. |

A tibble without key metadata.
Removes the watched attribute. Dplyr verbs will no longer auto-stamp.

    unwatch(.data)

| Argument | Description |
|---|---|
| .data | A data frame. |

.data without the watched attribute.

    df <- key(data.frame(id = 1:3, x = 1:3), id) |> watch()
    df <- unwatch(df)

Marks a keyed data frame as "watched". Watched data frames are
automatically stamped before each dplyr verb, so check_drift() always
reports changes from the most recent transformation step.

    watch(.data)

| Argument | Description |
|---|---|
| .data | A keyed data frame. |

Invisibly returns .data with watched attribute set and
a baseline snapshot committed.
See also: unwatch() to stop watching, stamp() for manual snapshots.

    df <- key(data.frame(id = 1:5, x = letters[1:5]), id) |> watch()
    df2 <- df |> dplyr::filter(id > 2)
    check_drift(df2)