| Title: | Diagnostic Tools for Data Frame Joins |
|---|---|
| Description: | Provides diagnostic tools for understanding and debugging data frame joins. Analyzes key columns before joining to detect duplicates, mismatches, encoding issues, and other common problems. Explains unexpected row count changes and provides safe join wrappers with cardinality enforcement. Concepts and diagnostics build on tidy data principles as described in 'Wickham' (2014) <doi:10.18637/jss.v059.i10>. |
| Authors: | Gilles Colling [aut, cre, cph] (ORCID: <https://orcid.org/0000-0003-3070-6066>) |
| Maintainer: | Gilles Colling <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.8.0 |
| Built: | 2026-05-29 23:13:06 UTC |
| Source: | https://github.com/gcol33/joinspy |
Analyzes a sequence of joins to identify potential issues in the chain. Useful for debugging complex multi-table joins.
analyze_join_chain(tables, joins)analyze_join_chain(tables, joins)
tables |
A named list of data frames to join. |
joins |
A list of join specifications, each with elements:
|
A summary of the join chain analysis.
orders <- data.frame(order_id = 1:3, customer_id = c(1, 2, 2)) customers <- data.frame(customer_id = 1:3, region_id = c(1, 1, 2)) regions <- data.frame(region_id = 1:2, name = c("North", "South")) analyze_join_chain( tables = list(orders = orders, customers = customers, regions = regions), joins = list( list(left = "orders", right = "customers", by = "customer_id"), list(left = "result", right = "regions", by = "region_id") ) )orders <- data.frame(order_id = 1:3, customer_id = c(1, 2, 2)) customers <- data.frame(customer_id = 1:3, region_id = c(1, 1, 2)) regions <- data.frame(region_id = 1:2, name = c("North", "South")) analyze_join_chain( tables = list(orders = orders, customers = customers, regions = regions), joins = list( list(left = "orders", right = "customers", by = "customer_id"), list(left = "result", right = "regions", by = "region_id") ) )
Warns if a join will produce a very large result due to many-to-many relationships (Cartesian product explosion).
check_cartesian(x, y, by, threshold = 10)check_cartesian(x, y, by, threshold = 10)
x |
A data frame (left table). |
y |
A data frame (right table). |
by |
Column names to join by. |
threshold |
Warn if result will exceed this many times the larger input. Default 10. |
A list with explosion analysis.
# Dangerous: both tables have duplicates x <- data.frame(id = c(1, 1, 2, 2), val_x = 1:4) y <- data.frame(id = c(1, 1, 2, 2), val_y = 1:4) check_cartesian(x, y, by = "id")# Dangerous: both tables have duplicates x <- data.frame(id = c(1, 1, 2, 2), val_x = 1:4) y <- data.frame(id = c(1, 1, 2, 2), val_y = 1:4) check_cartesian(x, y, by = "id")
Determines the actual cardinality relationship between two tables.
detect_cardinality(x, y, by)detect_cardinality(x, y, by)
x |
A data frame (left table). |
y |
A data frame (right table). |
by |
Column names to join by. |
Character string: "1:1", "1:n", "n:1", or "n:m".
# 1:1 relationship x <- data.frame(id = 1:3, val = 1:3) y <- data.frame(id = 1:3, name = c("A", "B", "C")) detect_cardinality(x, y, "id") # 1:n relationship x <- data.frame(id = 1:3, val = 1:3) y <- data.frame(id = c(1, 1, 2, 3), name = c("A1", "A2", "B", "C")) detect_cardinality(x, y, "id")# 1:1 relationship x <- data.frame(id = 1:3, val = 1:3) y <- data.frame(id = 1:3, name = c("A", "B", "C")) detect_cardinality(x, y, "id") # 1:n relationship x <- data.frame(id = 1:3, val = 1:3) y <- data.frame(id = c(1, 1, 2, 3), name = c("A1", "A2", "B", "C")) detect_cardinality(x, y, "id")
Performs a full join and automatically prints diagnostic information.
full_join_spy(x, y, by, verbose = TRUE, .quiet = FALSE, backend = NULL, ...)full_join_spy(x, y, by, verbose = TRUE, .quiet = FALSE, backend = NULL, ...)
x |
A data frame (left table). |
y |
A data frame (right table). |
by |
A character vector of column names to join by. |
verbose |
Logical. If |
.quiet |
Logical. If |
backend |
Character or |
... |
Additional arguments passed to the underlying join function. |
The joined data frame with a "join_report" attribute.
left_join_spy(), join_spy(), last_report()
Returns the current automatic log file path, if set.
get_log_file()get_log_file()
The log file path, or NULL if not set.
Performs an inner join and automatically prints diagnostic information.
inner_join_spy(x, y, by, verbose = TRUE, .quiet = FALSE, backend = NULL, ...)inner_join_spy(x, y, by, verbose = TRUE, .quiet = FALSE, backend = NULL, ...)
x |
A data frame (left table). |
y |
A data frame (right table). |
by |
A character vector of column names to join by. |
verbose |
Logical. If |
.quiet |
Logical. If |
backend |
Character or |
... |
Additional arguments passed to the underlying join function. |
The joined data frame with a "join_report" attribute.
left_join_spy(), join_spy(), last_report()
Check if Object is a JoinReport
is_join_report(x)is_join_report(x)
x |
An object to test. |
TRUE if x is a JoinReport, FALSE otherwise.
Shows a side-by-side comparison of key statistics before and after a join operation.
join_diff(before, after, by = NULL)join_diff(before, after, by = NULL)
before |
The original data frame (before joining). |
after |
The result data frame (after joining). |
by |
Optional. Column names to analyze for key statistics. |
Invisibly returns a comparison summary. Prints a formatted comparison.
before <- data.frame(id = 1:3, x = letters[1:3]) after <- data.frame(id = c(1, 2, 2, 3), x = c("a", "b", "b", "c"), y = 1:4 ) join_diff(before, after)before <- data.frame(id = 1:3, x = letters[1:3]) after <- data.frame(id = c(1, 2, 2, 3), x = c("a", "b", "b", "c"), y = 1:4 ) join_diff(before, after)
After performing a join, use this function to understand why the row count changed. It analyzes the original tables and the result to explain the difference.
join_explain(result, x, y, by, type = NULL)join_explain(result, x, y, by, type = NULL)
result |
The result of a join operation. |
x |
The original left data frame. |
y |
The original right data frame. |
by |
A character vector of column names used in the join. |
type |
Character. The type of join that was performed. One of
|
Invisibly returns a list with explanation details. Prints a human-readable explanation.
orders <- data.frame(id = c(1, 2, 2, 3), value = 1:4) customers <- data.frame(id = c(1, 2, 2, 4), name = c("A", "B1", "B2", "D")) result <- merge(orders, customers, by = "id", all.x = TRUE) # Explain why we got more rows than expected join_explain(result, orders, customers, by = "id", type = "left")orders <- data.frame(id = c(1, 2, 2, 3), value = 1:4) customers <- data.frame(id = c(1, 2, 2, 4), name = c("A", "B1", "B2", "D")) result <- merge(orders, customers, by = "id", all.x = TRUE) # Explain why we got more rows than expected join_explain(result, orders, customers, by = "id", type = "left")
Automatically fixes trivial join key issues like whitespace and case mismatches. Returns the repaired data frame(s) with a summary of changes.
join_repair( x, y = NULL, by, trim_whitespace = TRUE, standardize_case = NULL, remove_invisible = TRUE, empty_to_na = FALSE, dry_run = FALSE )join_repair( x, y = NULL, by, trim_whitespace = TRUE, standardize_case = NULL, remove_invisible = TRUE, empty_to_na = FALSE, dry_run = FALSE )
x |
A data frame (left table). |
y |
A data frame (right table). If NULL, only repairs x. |
by |
A character vector of column names to repair. |
trim_whitespace |
Logical. Trim leading/trailing whitespace. Default TRUE. |
standardize_case |
Character. Standardize case to "lower", "upper", or NULL (no change). Default NULL. |
remove_invisible |
Logical. Remove invisible Unicode characters. Default TRUE. |
empty_to_na |
Logical. Convert empty strings to NA. Default FALSE. |
dry_run |
Logical. If TRUE, only report what would be changed without modifying data. Default FALSE. |
If y is NULL, returns the repaired x. If both are provided,
returns a list with x and y. In dry_run mode, returns a summary of
proposed changes.
# Data with whitespace issues orders <- data.frame( id = c(" A", "B ", "C"), value = 1:3, stringsAsFactors = FALSE ) # Dry run to see what would change join_repair(orders, by = "id", dry_run = TRUE) # Actually repair orders_fixed <- join_repair(orders, by = "id")# Data with whitespace issues orders <- data.frame( id = c(" A", "B ", "C"), value = 1:3, stringsAsFactors = FALSE ) # Dry run to see what would change join_repair(orders, by = "id", dry_run = TRUE) # Actually repair orders_fixed <- join_repair(orders, by = "id")
Analyzes two data frames before joining to detect potential issues and predict the outcome. Returns a detailed report of key quality, match rates, and detected problems.
join_spy(x, y, by, sample = NULL, ...)join_spy(x, y, by, sample = NULL, ...)
x |
A data frame (left table in the join). |
y |
A data frame (right table in the join). |
by |
A character vector of column names to join by, or a named character
vector for joins where column names differ (e.g., |
sample |
Integer or NULL. If provided, randomly sample this many rows from each table for faster diagnostics on large datasets. Default NULL (analyze all rows). |
... |
Reserved for future use. |
This function detects the following common join issues:
Duplicate keys: Keys appearing multiple times, which cause row multiplication during joins
Whitespace: Leading or trailing spaces that prevent matches
Case mismatches: Keys that differ only by case (e.g., "ABC" vs "abc")
Encoding issues: Different character encodings or invisible Unicode characters
NA values: Missing values in key columns
A JoinReport object with the following components:
Summary statistics for keys in the left table
Summary statistics for keys in the right table
Details of which keys will/won't match
List of detected problems (duplicates, whitespace, etc.)
Predicted row counts for each join type
key_check(), join_explain(), join_strict()
# Create sample data with issues orders <- data.frame( order_id = 1:5, customer_id = c("A", "B", "B", "C", "D ") ) customers <- data.frame( customer_id = c("A", "B", "C", "E"), name = c("Alice", "Bob", "Carol", "Eve") ) # Get diagnostic report join_spy(orders, customers, by = "customer_id")# Create sample data with issues orders <- data.frame( order_id = 1:5, customer_id = c("A", "B", "B", "C", "D ") ) customers <- data.frame( customer_id = c("A", "B", "C", "E"), name = c("Alice", "Bob", "Carol", "Eve") ) # Get diagnostic report join_spy(orders, customers, by = "customer_id")
Performs a join operation that fails if the specified cardinality constraint is violated. Use this to catch unexpected many-to-many relationships early.
join_strict( x, y, by, type = c("left", "right", "inner", "full"), expect = c("1:1", "1:n", "1:many", "n:1", "many:1", "n:m", "many:many"), backend = NULL, ... )join_strict( x, y, by, type = c("left", "right", "inner", "full"), expect = c("1:1", "1:n", "1:many", "n:1", "many:1", "n:m", "many:many"), backend = NULL, ... )
x |
A data frame (left table). |
y |
A data frame (right table). |
by |
A character vector of column names to join by. |
type |
Character. The type of join to perform. One of |
expect |
Character. The expected cardinality relationship. One of:
|
backend |
Character or |
... |
Additional arguments passed to the underlying join function. |
The joined data frame if the cardinality constraint is satisfied. Throws an error if the constraint is violated.
orders <- data.frame(id = 1:3, product = c("A", "B", "C")) customers <- data.frame(id = 1:3, name = c("Alice", "Bob", "Carol")) # This succeeds (1:1 relationship) join_strict(orders, customers, by = "id", expect = "1:1") # This fails if customers had duplicate ids (wrapped in try to show error) customers_dup <- data.frame(id = c(1, 1, 2), name = c("A1", "A2", "B")) try(join_strict(orders, customers_dup, by = "id", expect = "1:1"))orders <- data.frame(id = 1:3, product = c("A", "B", "C")) customers <- data.frame(id = 1:3, name = c("Alice", "Bob", "Carol")) # This succeeds (1:1 relationship) join_strict(orders, customers, by = "id", expect = "1:1") # This fails if customers had duplicate ids (wrapped in try to show error) customers_dup <- data.frame(id = c(1, 1, 2), name = c("A1", "A2", "B")) try(join_strict(orders, customers_dup, by = "id", expect = "1:1"))
A fast check of join key quality that returns a simple pass/fail status
with a brief summary. Use this for quick validation; use join_spy() for
detailed diagnostics.
key_check(x, y, by, warn = TRUE)key_check(x, y, by, warn = TRUE)
x |
A data frame (left table in the join). |
y |
A data frame (right table in the join). |
by |
A character vector of column names to join by. |
warn |
Logical. If |
Invisibly returns a logical: TRUE if no issues detected, FALSE otherwise.
Also prints a brief status message unless warn = FALSE.
orders <- data.frame(id = c(1, 2, 2, 3), value = 1:4) customers <- data.frame(id = c(1, 2, 4), name = c("A", "B", "D")) # Quick check key_check(orders, customers, by = "id") # Silent check is_ok <- key_check(orders, customers, by = "id", warn = FALSE)orders <- data.frame(id = c(1, 2, 2, 3), value = 1:4) customers <- data.frame(id = c(1, 2, 4), name = c("A", "B", "D")) # Quick check key_check(orders, customers, by = "id") # Silent check is_ok <- key_check(orders, customers, by = "id", warn = FALSE)
Identifies rows with duplicate values in the specified key columns. Returns a data frame containing only the rows with duplicated keys, along with a count of occurrences.
key_duplicates(data, by, keep = c("all", "first", "last"))key_duplicates(data, by, keep = c("all", "first", "last"))
data |
A data frame. |
by |
A character vector of column names to check for duplicates. |
keep |
Character. Which duplicates to return:
|
A data frame containing the duplicated rows, with an additional
column .n_duplicates showing how many times each key appears.
Returns an empty data frame (0 rows) if no duplicates found.
df <- data.frame( id = c(1, 2, 2, 3, 3, 3, 4), value = letters[1:7] ) # Find all duplicates key_duplicates(df, by = "id") # Find first occurrence only key_duplicates(df, by = "id", keep = "first")df <- data.frame( id = c(1, 2, 2, 3, 3, 3, 4), value = letters[1:7] ) # Find all duplicates key_duplicates(df, by = "id") # Find first occurrence only key_duplicates(df, by = "id", keep = "first")
Retrieves the most recent JoinReport object from any *_join_spy() call.
Useful when using .quiet = TRUE in pipelines and wanting to inspect
the diagnostics afterward.
last_report()last_report()
The last JoinReport object, or NULL if no join has been performed.
orders <- data.frame(id = 1:3, value = c(10, 20, 30)) customers <- data.frame(id = c(1, 2, 4), name = c("A", "B", "D")) # Silent join in a pipeline result <- left_join_spy(orders, customers, by = "id", .quiet = TRUE) # Inspect the report afterward last_report()orders <- data.frame(id = 1:3, value = c(10, 20, 30)) customers <- data.frame(id = c(1, 2, 4), name = c("A", "B", "D")) # Silent join in a pipeline result <- left_join_spy(orders, customers, by = "id", .quiet = TRUE) # Inspect the report afterward last_report()
Performs a left join and automatically prints diagnostic information about the operation. The diagnostic report is also attached as an attribute.
left_join_spy(x, y, by, verbose = TRUE, .quiet = FALSE, backend = NULL, ...)left_join_spy(x, y, by, verbose = TRUE, .quiet = FALSE, backend = NULL, ...)
x |
A data frame (left table). |
y |
A data frame (right table). |
by |
A character vector of column names to join by. |
verbose |
Logical. If |
.quiet |
Logical. If |
backend |
Character or |
... |
Additional arguments passed to the underlying join function. |
The joined data frame with a "join_report" attribute containing
the diagnostic information.
join_spy(), join_strict(), last_report()
orders <- data.frame(id = 1:3, value = c(10, 20, 30)) customers <- data.frame(id = c(1, 2, 4), name = c("A", "B", "D")) result <- left_join_spy(orders, customers, by = "id") # Access the diagnostic report attr(result, "join_report") # Silent mode for pipelines result2 <- left_join_spy(orders, customers, by = "id", .quiet = TRUE) last_report() # Access diagnostics afterwardorders <- data.frame(id = 1:3, value = c(10, 20, 30)) customers <- data.frame(id = c(1, 2, 4), name = c("A", "B", "D")) result <- left_join_spy(orders, customers, by = "id") # Access the diagnostic report attr(result, "join_report") # Silent mode for pipelines result2 <- left_join_spy(orders, customers, by = "id", .quiet = TRUE) last_report() # Access diagnostics afterward
Writes a JoinReport object to a file for audit trails and reproducibility.
Supports plain text, JSON, and RDS formats.
log_report(report, file, append = FALSE, timestamp = TRUE)log_report(report, file, append = FALSE, timestamp = TRUE)
report |
A |
file |
File path to write to. Extension determines format:
|
append |
Logical. If |
timestamp |
Logical. If |
Invisibly returns the file path.
orders <- data.frame(id = 1:3, value = c(10, 20, 30)) customers <- data.frame(id = c(1, 2, 4), name = c("A", "B", "D")) report <- join_spy(orders, customers, by = "id") # Log to temporary file tmp <- tempfile(fileext = ".log") log_report(report, tmp, append = TRUE) unlink(tmp)orders <- data.frame(id = 1:3, value = c(10, 20, 30)) customers <- data.frame(id = c(1, 2, 4), name = c("A", "B", "D")) report <- join_spy(orders, customers, by = "id") # Log to temporary file tmp <- tempfile(fileext = ".log") log_report(report, tmp, append = TRUE) unlink(tmp)
Creates a Venn diagram showing key overlap between tables.
## S3 method for class 'JoinReport' plot( x, file = NULL, width = 6, height = 5, colors = c("#4A90D9", "#D94A4A"), ... )## S3 method for class 'JoinReport' plot( x, file = NULL, width = 6, height = 5, colors = c("#4A90D9", "#D94A4A"), ... )
x |
A |
file |
Optional file path to save the plot (PNG, SVG, or PDF based on extension). If NULL (default), displays in the current graphics device. |
width |
Width in inches (default 6). |
height |
Height in inches (default 5). |
colors |
Character vector of length 2 for left and right circle colors. |
... |
Additional arguments (ignored). |
Invisibly returns the plot data (left_only, both, right_only counts).
orders <- data.frame(id = 1:5, val = 1:5) customers <- data.frame(id = 3:7, name = letters[3:7]) report <- join_spy(orders, customers, by = "id") plot(report)orders <- data.frame(id = 1:5, val = 1:5) customers <- data.frame(id = 3:7, name = letters[3:7]) report <- join_spy(orders, customers, by = "id") plot(report)
Print Method for JoinReport
## S3 method for class 'JoinReport' print(x, ...)## S3 method for class 'JoinReport' print(x, ...)
x |
A |
... |
Additional arguments (ignored). |
Invisibly returns x.
Performs a right join and automatically prints diagnostic information.
right_join_spy(x, y, by, verbose = TRUE, .quiet = FALSE, backend = NULL, ...)right_join_spy(x, y, by, verbose = TRUE, .quiet = FALSE, backend = NULL, ...)
x |
A data frame (left table). |
y |
A data frame (right table). |
by |
A character vector of column names to join by. |
verbose |
Logical. If |
.quiet |
Logical. If |
backend |
Character or |
... |
Additional arguments passed to the underlying join function. |
The joined data frame with a "join_report" attribute.
left_join_spy(), join_spy(), last_report()
Sets up automatic logging of all join reports to a specified file.
When enabled, every *_join_spy() call will append its report to the log.
set_log_file(file, format = c("text", "json"))set_log_file(file, format = c("text", "json"))
file |
File path for automatic logging. Set to |
format |
Log format: "text" (default) or "json". |
Invisibly returns the previous log file setting.
# Enable automatic logging to temp file tmp <- tempfile(fileext = ".log") old <- set_log_file(tmp) # Disable logging and clean up set_log_file(NULL) unlink(tmp)# Enable automatic logging to temp file tmp <- tempfile(fileext = ".log") old <- set_log_file(tmp) # Disable logging and clean up set_log_file(NULL) unlink(tmp)
Analyzes join issues and returns R code snippets to fix them.
suggest_repairs(report)suggest_repairs(report)
report |
A |
Character vector of R code snippets to fix detected issues.
orders <- data.frame(id = c("A ", "B"), val = 1:2, stringsAsFactors = FALSE) customers <- data.frame(id = c("a", "b"), name = c("Alice", "Bob"), stringsAsFactors = FALSE) report <- join_spy(orders, customers, by = "id") suggest_repairs(report)orders <- data.frame(id = c("A ", "B"), val = 1:2, stringsAsFactors = FALSE) customers <- data.frame(id = c("a", "b"), name = c("Alice", "Bob"), stringsAsFactors = FALSE) report <- join_spy(orders, customers, by = "id") suggest_repairs(report)
Returns a compact summary data frame of the join diagnostic report.
## S3 method for class 'JoinReport' summary(object, format = c("data.frame", "text", "markdown"), ...)## S3 method for class 'JoinReport' summary(object, format = c("data.frame", "text", "markdown"), ...)
object |
A |
format |
Output format: "data.frame" (default), "text", or "markdown". |
... |
Additional arguments (ignored). |
A data frame with key metrics (or printed output for text/markdown).
orders <- data.frame(id = 1:5, val = 1:5) customers <- data.frame(id = 3:7, name = letters[3:7]) report <- join_spy(orders, customers, by = "id") summary(report) summary(report, format = "markdown")orders <- data.frame(id = 1:5, val = 1:5) customers <- data.frame(id = 3:7, name = letters[3:7]) report <- join_spy(orders, customers, by = "id") summary(report) summary(report, format = "markdown")