| Title: | Build Backbones and Enrichments for the 'taxify' Package |
|---|---|
| Description: | Build pipeline for the 'taxify' package. Downloads raw source data from official providers (WFO, COL, GBIF, ITIS, NCBI Taxonomy, Open Tree of Life, WoRMS, Euro+Med PlantBase, Index Fungorum, AlgaeBase) and a wide set of trait and conservation datasets, normalizes to a unified Darwin Core-like schema, and writes pre-compiled '.vtr' files that the 'taxify' runtime consumes. Separates build-time concerns (network access, parsing, schema normalization) from runtime concerns so that 'taxify' itself stays lean. |
| Authors: | Gilles Colling [aut, cre, cph] (ORCID: <https://orcid.org/0000-0003-3070-6066>) |
| Maintainer: | Gilles Colling <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.0 |
| Built: | 2026-06-19 23:06:13 UTC |
| Source: | https://github.com/gcol33/taxifydb |
Apply a binary delta to produce a new .vtr
apply_delta(old_path, delta_path, new_path)
old_path |
Character. Path to the current local .vtr. |
delta_path |
Character. Path to the downloaded .xdelta file. |
new_path |
Character. Output path for the patched .vtr. |
The new path (invisibly), or NULL on failure.
Build the AlgaeBase backbone .vtr
build_algaebase(output_dir = "output/algaebase", version = NULL, verbose = TRUE)
output_dir |
Character. |
version |
Character or NULL. |
verbose |
Logical. |
Path to the .vtr file (invisibly).
Locates each backbone in the user's taxify data directory and writes a
{backend}_name_lookup.vtr alongside it.
build_all_name_lookups(backends = c("wfo", "col", "gbif", "itis", "ncbi", "ott", "worms"), overwrite = FALSE)
backends |
Character vector. Default: 7 standard backends. |
overwrite |
Logical. Rebuild even if the lookup .vtr already exists. |
Character vector of paths to the built lookups.
Build a backbone .vtr file from source
Single entry point for backbone builds. Downloads the raw source,
normalizes to the unified schema, precomputes matching keys, and writes
the final .vtr with indexes and metadata sidecar.
build_backend(name, output_dir = NULL, version = NULL, verbose = TRUE)
name |
Character. Backend identifier (e.g., "itis"). See
|
output_dir |
Character. Output directory. Default: |
version |
Character or NULL. Version string for the build. If |
verbose |
Logical. |
Path to the built .vtr file (invisibly).
Also writes the SpeciesProfile.tsv (extinct/marine flags) as a separate
col_species_profile.vtr next to the main backbone, if present.
build_col(output_dir = "output/col", version = NULL, verbose = TRUE)
output_dir |
Character. |
version |
Character or NULL. |
verbose |
Logical. |
Path to the .vtr file (invisibly).
Build an enrichment .vtr from source
Downloads the raw source for an enrichment, parses it via the registered
parser, expands names across all 7 taxify backbones via
resolve_enrichment_names(), and writes a .vtr plus meta.json sidecar
via build_enrichment_vtr().
build_enrichment(name, output_dir = NULL, version = NULL, url = NULL, resolve_names = TRUE, verbose = TRUE)
name |
Character. Enrichment identifier. See |
output_dir |
Character or |
version |
Character or |
url |
Character or |
resolve_names |
Logical. Run |
verbose |
Logical. Default |
Path to the built .vtr file (invisibly).
Sorts by canonical_name, writes the .vtr, creates hash indexes on
canonical_name (and optionally a group column), and writes a
meta.json sidecar.
build_enrichment_vtr(df, vtr_path, name, version, source_url, source_doi = NULL, license = "unknown", attribution = NULL, group_col = NULL, batch_size = 50000L)
df |
A data.frame with at least a |
vtr_path |
Character. Output path for the .vtr file. |
name |
Character. Enrichment identifier (e.g., "woodiness"). |
version |
Character. Version string (e.g., "2026.04"). |
source_url |
Character. URL the source data was downloaded from. |
source_doi |
Character or NULL. DOI of the source dataset. |
license |
Character. License string (e.g., "CC0", "CC BY 4.0"). |
attribution |
Character. Human-readable attribution string. |
group_col |
Character or NULL. Column to index for group-based enrichments (e.g., "country_code", "tdwg_code", "lang"). |
batch_size |
Integer. Row group size for vectra (default 50000). |
The path to the .vtr file (invisibly).
Build the Euro+Med backbone .vtr
build_euromed(output_dir = "output/euromed", version = NULL, verbose = TRUE)
output_dir |
Character. |
version |
Character or NULL. |
verbose |
Logical. |
Path to the .vtr file (invisibly).
Build the Species Fungorum Plus backbone .vtr
build_fungorum(output_dir = "output/fungorum", version = NULL, verbose = TRUE)
output_dir |
Character. |
version |
Character or NULL. |
verbose |
Logical. |
Path to the .vtr file (invisibly).
Build the GBIF backbone .vtr
build_gbif(output_dir = "output/gbif", version = NULL, verbose = TRUE)
output_dir |
Character. |
version |
Character or NULL. |
verbose |
Logical. |
Path to the .vtr file (invisibly).
Downloads the ITIS SQLite dump, normalizes it, precomputes matching keys, embeds synonym info, and writes the final .vtr with indexes.
build_itis(output_dir = "output/itis", version = NULL, verbose = TRUE)
output_dir |
Character. Directory for the output .vtr. |
version |
Character or NULL. If NULL, uses YYYY.MM of build date. |
verbose |
Logical. |
Path to the .vtr file (invisibly).
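The YYYY.MM default for version can be illustrated in base R. This is a minimal sketch of the described behaviour, not the package's code; resolve_version() and its build_date argument are hypothetical names introduced here.

```r
# Hypothetical helper, not part of the package: fall back to YYYY.MM of the
# build date when no version string is supplied.
resolve_version <- function(version = NULL, build_date = Sys.Date()) {
  if (is.null(version)) format(build_date, "%Y.%m") else version
}

v_default  <- resolve_version(build_date = as.Date("2026-04-15"))  # "2026.04"
v_explicit <- resolve_version("2025.11")                           # "2025.11"
```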
Build a name-lookup .vtr from a backbone .vtr
build_name_lookup(bb_path, out_path, verbose = TRUE)
bb_path |
Character. Backbone .vtr path. |
out_path |
Character. Lookup .vtr destination. |
verbose |
Logical. |
out_path (invisibly).
Build the NCBI backbone .vtr
build_ncbi(output_dir = "output/ncbi", version = NULL, verbose = TRUE)
output_dir |
Character. |
version |
Character or NULL. |
verbose |
Logical. |
Path to the .vtr file (invisibly).
Build the OTT backbone .vtr
build_ott(output_dir = "output/ott", version = NULL, verbose = TRUE)
output_dir |
Character. |
version |
Character or NULL. |
verbose |
Logical. |
Path to the .vtr file (invisibly).
Sorts by genus (for zone-map pruning), writes the .vtr, creates hash indexes, and writes a metadata sidecar.
build_vtr(df, vtr_path, backend_name, version, source_url, batch_size = 50000L)
df |
A precomputed backbone data.frame (from |
vtr_path |
Character. Output path for the .vtr file. |
backend_name |
Character. Backend identifier (e.g., "itis"). |
version |
Character. Version string. |
source_url |
Character. URL the source data was downloaded from. |
batch_size |
Integer. Row group size for vectra (default 50000). |
The path to the .vtr file (invisibly).
Build the WFO backbone .vtr from source
build_wfo(output_dir = "output/wfo", version = NULL, verbose = TRUE)
output_dir |
Character. Output directory. |
version |
Character or NULL. Defaults to the bundled WFO release tag. |
verbose |
Logical. |
Path to the .vtr file (invisibly).
Also writes the SpeciesProfile.tsv (habitat flags) as a separate
worms_species_profile.vtr next to the main backbone.
build_worms(output_dir = "output/worms", version = NULL, verbose = TRUE)
output_dir |
Character. |
version |
Character or NULL. |
verbose |
Logical. |
Path to the .vtr file (invisibly).
Check all non-static enrichments in a manifest for version freshness
check_all_enrichment_versions(manifest_path = "manifest/manifest.json")
manifest_path |
Character. Path to manifest.json. |
Data.frame with columns: name, source_version, upstream_version, outdated, check_url, note.
Check a Dryad dataset for the latest version
check_dryad_version(source_url)
source_url |
Character. Dryad download URL containing a DOI. |
Named list with version and url.
Dispatch version check based on a manifest entry's source format
check_enrichment_source_version(entry)
entry |
List. Manifest enrichment entry with |
Named list with source_version, upstream_version, outdated,
check_url, and optionally note.
Check the GBIF backbone API for the latest update date
check_gbif_api_version(source_url)
source_url |
Character. (Unused; the GBIF backbone dataset ID is fixed.) |
Named list with version and url.
Check a GBIF hosted dataset for the Last-Modified date
check_gbif_version(source_url)
source_url |
Character. GBIF hosted dataset URL. |
Named list with version (date string) and url.
WCVP has no dedicated API, so we fall back to a HEAD request via
check_gbif_version().
check_wcvp_version(source_url)
source_url |
Character. WCVP download URL. |
Named list with version and url.
Check a Zenodo record for the latest version
check_zenodo_version(source_url)
source_url |
Character. Zenodo download URL. |
Named list with version (publication date) and url.
Count rows in a .vtr file
count_vtr_rows(vtr_path)
vtr_path |
Character. |
Integer.
Create a binary diff between two .vtr files
create_delta(old_path, new_path, delta_path)
old_path |
Character. Path to the previous-version .vtr. |
new_path |
Character. Path to the new-version .vtr. |
delta_path |
Character. Output path for the .xdelta file. |
The delta path (invisibly), or NULL if xdelta3 is unavailable.
Fetch all AlgaeBase records via /nameusage/search
download_algaebase(verbose = TRUE)
verbose |
Logical. |
A list of record objects (raw JSON list-of-lists).
Downloads url to dest_dir/source.zip (if not cached), extracts into
dest_dir/extracted/, and returns the first file matching pattern. If
pattern is NULL, returns the extraction directory itself.
download_and_unzip(url, dest_dir, pattern = NULL)
url |
Character. URL to a ZIP archive. |
dest_dir |
Character. Directory for download and extraction. |
pattern |
Character or |
Path to the matched file, or the extraction directory if pattern
is NULL.
Download and extract the COL DwC-A
download_col(dest = tempdir(), verbose = TRUE)
dest |
Character. Destination directory. |
verbose |
Logical. |
Path to the extracted COL directory.
Downloads url into dest_dir/filename if the destination does not
already exist (or is empty). Follows redirects, sets a User-Agent header,
and optionally a Referer.
download_curl_file(url, dest_dir, filename, referer = NULL, user_agent = NULL)
url |
Character. URL to download. |
dest_dir |
Character. Directory to save into. Created if missing. |
filename |
Character. Output filename. |
referer |
Character or |
user_agent |
Character or |
Path to the downloaded file.
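The "skip if already cached" rule described for this function (download only when the destination is missing or empty) can be sketched in base R without any network access. needs_download() is a hypothetical helper for illustration; how the package implements the check is not shown in this manual.

```r
# Hypothetical helper: TRUE when dest_dir/filename is absent or zero bytes,
# i.e. when a (re-)download would be required.
needs_download <- function(dest_dir, filename) {
  dest <- file.path(dest_dir, filename)
  !file.exists(dest) || file.size(dest) == 0
}

cache_dir <- tempfile("download-cache-")   # fresh, unique directory
dir.create(cache_dir)
target <- file.path(cache_dir, "source.csv")

missing_before <- needs_download(cache_dir, "source.csv")  # no file yet
file.create(target)                                        # zero-byte placeholder
empty_file <- needs_download(cache_dir, "source.csv")      # present but empty
writeLines("a,b", target)
cached <- needs_download(cache_dir, "source.csv")          # non-empty file
```

Treating a zero-byte file as missing guards against a previous download that was interrupted before any data was written.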
Download and extract the Euro+Med CSV
download_euromed(dest = tempdir(), verbose = TRUE)
dest |
Character. Destination directory. |
verbose |
Logical. |
Path to the extracted CSV.
Download and extract the Species Fungorum Plus ColDP archive
download_fungorum(dest = tempdir(), verbose = TRUE)
dest |
Character. Destination directory. |
verbose |
Logical. |
Path to the directory containing the extracted ColDP files.
Download the GBIF backbone simple.txt.gz
download_gbif(dest = tempdir(), verbose = TRUE)
dest |
Character. Destination directory. |
verbose |
Logical. |
Path to the .gz file.
Pages through a GBIF API endpoint that uses offset + limit and an
endOfRecords flag. Returns a combined data.frame of $results payloads.
download_gbif_api_pages(base_url, params, limit = 1000L, max_pages = 100L)
base_url |
Character. GBIF API endpoint. |
params |
Named list. Query parameters (excluding |
limit |
Integer. Page size. |
max_pages |
Integer. Maximum pages to fetch. |
A data.frame of combined results. Empty data.frame if the endpoint returned no rows.
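The offset + limit paging loop with an endOfRecords stop condition can be sketched as follows. To stay self-contained, the HTTP call is replaced by a stand-in function (mock_fetch) that serves slices of a small table; fetch_all_pages() and mock_fetch() are hypothetical names, and the real function's request handling may differ.

```r
# Hypothetical sketch of offset/limit paging; fetch_page() stands in for the
# HTTP request and must return list(results = <data.frame>, endOfRecords = <logical>).
fetch_all_pages <- function(fetch_page, limit = 1000L, max_pages = 100L) {
  pages <- list()
  offset <- 0L
  for (i in seq_len(max_pages)) {
    page <- fetch_page(offset = offset, limit = limit)
    if (NROW(page$results) > 0L) pages[[length(pages) + 1L]] <- page$results
    if (isTRUE(page$endOfRecords)) break   # server signals the last page
    offset <- offset + limit               # advance to the next page
  }
  if (length(pages) == 0L) return(data.frame())
  do.call(rbind, pages)
}

# Stand-in data source: a 5-row table served in slices.
all_rows <- data.frame(key = 1:5, name = letters[1:5], stringsAsFactors = FALSE)
mock_fetch <- function(offset, limit) {
  idx <- seq_len(nrow(all_rows))
  idx <- idx[idx > offset & idx <= offset + limit]
  list(results = all_rows[idx, , drop = FALSE],
       endOfRecords = offset + limit >= nrow(all_rows))
}

combined <- fetch_all_pages(mock_fetch, limit = 2L)  # three pages: 2 + 2 + 1 rows
```

The max_pages cap bounds the loop even if an endpoint never reports endOfRecords.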
Download and extract the ITIS SQLite database
download_itis(dest = tempdir(), verbose = TRUE)
dest |
Character. Destination directory. |
verbose |
Logical. |
Path to the extracted SQLite database file.
Download and extract NCBI taxdump
download_ncbi(dest = tempdir(), verbose = TRUE)
dest |
Character. Destination directory. |
verbose |
Logical. |
Path to the extraction directory.
Download and extract OTT taxonomy
download_ott(dest = tempdir(), verbose = TRUE)
dest |
Character. Destination directory. |
verbose |
Logical. |
Path to the extraction directory containing taxonomy.tsv.
Download and extract the WFO classification file
download_wfo(dest = tempdir(), verbose = TRUE)
dest |
Character. Destination directory. |
verbose |
Logical. |
Path to the extracted classification file.
Download and extract WoRMS DwC-A
download_worms(dest = tempdir(), verbose = TRUE)
dest |
Character. Destination directory. |
verbose |
Logical. |
Path to the extracted DwC-A directory.
Same as build_enrichment() but returns the parsed data.frame instead of
writing a .vtr file. Useful when the runtime needs the raw data in
memory (e.g. emergency fallback wired into taxify).
enrichment_emergency_fallback(name, verbose = TRUE)
name |
Character. Enrichment identifier. |
verbose |
Logical. Default |
data.frame with canonical_name + trait columns.
Check if xdelta3 is available on PATH
has_xdelta3()
Logical.
List available backend builders
list_backends()
Character vector of backend identifiers.
List available enrichment builders
list_enrichments()
Character vector of enrichment identifiers.
Renames source-specific columns to canonical names and ensures consistent types and formatting (uppercase ranks/statuses, trimmed whitespace, etc.).
normalize_backbone(df, col_map, extra_cols = NULL)
df |
The raw data.frame from a backend's download/parse step. |
col_map |
A named list mapping canonical names to source column names:
|
extra_cols |
Optional named list of additional columns to keep
( |
A data.frame with standardized column names and formatting.
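A rough base-R sketch of the renaming and cleaning steps described above. The canonical column names used here (canonical_name, taxon_rank, taxonomic_status) and the helper rename_and_clean() are assumptions for illustration; the manual does not list the package's actual canonical names.

```r
# Hypothetical helper: col_map maps canonical names (list names) to source
# column names (list values), as described for normalize_backbone().
rename_and_clean <- function(df, col_map) {
  src <- unlist(col_map)
  out <- df[, unname(src), drop = FALSE]
  names(out) <- names(src)                      # source names -> canonical names
  is_chr <- vapply(out, is.character, logical(1))
  out[is_chr] <- lapply(out[is_chr], trimws)    # trim surrounding whitespace
  # Assumed rank/status columns are upper-cased for consistent matching.
  for (col in intersect(c("taxon_rank", "taxonomic_status"), names(out))) {
    out[[col]] <- toupper(out[[col]])
  }
  out
}

raw <- data.frame(
  sciName  = c("  Quercus robur ", "Quercus pedunculata"),
  rankName = c("species", "Species"),
  usage    = c("accepted", "synonym "),
  stringsAsFactors = FALSE
)
clean <- rename_and_clean(
  raw,
  list(canonical_name = "sciName", taxon_rank = "rankName", taxonomic_status = "usage")
)
```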
Parse AlgaeTraits macroalgal traits (WoRMS ZIP export)
parse_algae_traits(path)
path |
Character. Directory holding CSV/TXT/XLSX files extracted from the AlgaeTraits archive. |
data.frame with canonical_name + algae trait columns.
Reads the "FirstRecords" sheet from the Seebens Excel file, maps region names to ISO 3166-1 alpha-2 codes, and deduplicates per species x country (keeping the earliest year).
parse_alien_first_records(path)
path |
Character. Path to the Seebens XLSX file. |
data.frame with canonical_name + country_code + first-record cols.
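The deduplication step (one row per species x country, keeping the earliest year) can be sketched in base R. The rows below are made-up placeholder values and first_record is an assumed column name; the actual first-record columns are not listed in this manual.

```r
# Toy input with placeholder names and years (not real records).
records <- data.frame(
  canonical_name = c("Genus alpha", "Genus alpha", "Genus alpha", "Genus beta"),
  country_code   = c("DE", "DE", "FR", "DE"),
  first_record   = c(1850L, 1736L, 1749L, 1780L),
  stringsAsFactors = FALSE
)

# Sort so the earliest year comes first within each species x country pair,
# then keep only the first row of every pair.
ordered  <- records[order(records$canonical_name, records$country_code,
                          records$first_record), ]
earliest <- ordered[!duplicated(ordered[c("canonical_name", "country_code")]), ]
rownames(earliest) <- NULL
```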
Parse AmphiBIO amphibian traits (CSV from ZIP)
parse_amphibio(path)
path |
Character. Path to the AmphiBIO CSV. |
data.frame with canonical_name + trait columns.
Parse AnAge longevity and life-history traits (TSV from ZIP)
parse_anage(path)
path |
Character. Path to anage_*.txt. |
data.frame with canonical_name + AnAge trait columns.
Parse AnimalTraits observations CSV (aggregate to species medians)
parse_animaltraits(path)
path |
Character. Path to observations.csv. |
data.frame with canonical_name + body_mass_kg + metabolic_rate_w.
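Aggregating observation-level rows to per-species medians can be sketched with stats::aggregate(). The input values are placeholders, and ignoring missing values via na.rm = TRUE is an assumption of this sketch; the package may treat them differently.

```r
# Toy observations with placeholder species names and values.
obs <- data.frame(
  canonical_name   = c("Genus alpha", "Genus alpha", "Genus alpha", "Genus beta"),
  body_mass_kg     = c(1.0, 2.0, 10.0, 0.5),
  metabolic_rate_w = c(3.0, NA, 5.0, 1.0),
  stringsAsFactors = FALSE
)

# One row per species; each trait column is reduced to its median,
# skipping missing observations.
medians <- aggregate(
  obs[c("body_mass_kg", "metabolic_rate_w")],
  by  = list(canonical_name = obs$canonical_name),
  FUN = median,
  na.rm = TRUE
)
```

The median is less sensitive than the mean to a single extreme observation, such as the 10.0 kg row above.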
Parse NW European Arthropod DwC-A (taxon + measurement + description)
parse_arthropod_traits(dir_path)
dir_path |
Character. Directory holding the DwC archive. |
data.frame with canonical_name + arthropod trait columns.
Parse AVONET bird morphology XLSX
parse_avonet(path)
path |
Character. Path to AVONET_BirdLife.xlsx. |
data.frame with canonical_name + morphology columns.
Merges vernacular names from three sources:
GBIF: VernacularName.tsv (has ISO 639-1 language codes)
NCBI: names.dmp where name_class == "common name" (no language)
OTT: synonyms.tsv where type == "common name" (no language)
parse_common_names(dir_path)
dir_path |
Character. Directory containing |
NCBI and OTT common names have no language tag, so lang is set to NA.
data.frame with canonical_name + lang + common_name.
Reads the IUCN Red List archive published on GBIF (taxon.txt core plus the
Distribution extension that carries threatStatus) and returns one global
Red List category per accepted species-rank name. Reading the archive
directly avoids the GBIF species/search endpoint, whose threat facet spans
every checklist (so a name picks up conflicting categories from regional or
erroneous lists) and whose offset ceiling truncates the large categories.
parse_conservation_status(dir_path)
dir_path |
Character. Directory holding the extracted IUCN archive
( |
data.frame with canonical_name + conservation_status.
Parse Diaz et al. 2022 supplementary traits (XLSX)
parse_diaz_traits(path)
path |
Character. Path to the Diaz 2022 XLSX. |
data.frame with canonical_name + seed_mass_mg + plant_height_m.
Parse EIVE 1.0 ecological indicator values (XLSX)
parse_eive(path)
path |
Character. Path to EIVE_1.0.xlsx. |
data.frame with canonical_name + indicator columns.
Parse EltonTraits 1.0 birds + mammals TSVs
parse_elton_traits(birds_path, mammals_path)
birds_path |
Character. Path to BirdFuncDat.txt. |
mammals_path |
Character. Path to MamFuncDat.txt. |
data.frame with canonical_name + diet/foraging/body mass columns.
Parse FISHMORPH freshwater fish morphological traits (CSV)
parse_fish_traits(path)
path |
Character. Path to FISHMORPH_Database.csv. |
data.frame with canonical_name + morphology columns.
Pulls the species and ecology tables from rfishbase, joins them, and
builds a canonical_name + trait data.frame.
parse_fishbase(path)
path |
Character. Not used (rfishbase fetches data directly), kept for interface consistency. |
data.frame with canonical_name + body/depth/diet columns.
Reads the Polme et al. (2020) supplementary XLSX, selects the most
informative trait columns, and returns a data.frame keyed on genus.
parse_fungal_traits(path)
path |
Character. Path to the downloaded XLSX file. |
data.frame with genus + 9 trait columns.
Reads the JSON array returned by the FUNGuild API, filters to genus and
species-level entries, and returns a clean data.frame with canonical_name
and trait columns.
parse_funguild(path)
path |
Character. Path to the downloaded JSON/HTML file. |
data.frame with canonical_name + trophic_mode + guild + growth_morphology + confidence_ranking.
Parse GBIF vernacular names (VernacularName.tsv + Taxon.tsv)
parse_gbif_common_names(dir_path)
dir_path |
Character. Directory holding VernacularName.tsv and Taxon.tsv. |
data.frame with canonical_name + lang + common_name.
Parse GloNAF taxon-region data (multiple CSVs/XLSXs from ZIP)
parse_glonaf(dir_path)
dir_path |
Character. Directory containing the GloNAF files. |
data.frame with canonical_name + region_id + naturalized.
Parse GRIIS Country Compendium CSV
parse_griis(path)
path |
Character. Path to GRIIS_Country_Compendium_V1_0.csv. |
data.frame with canonical_name + country_code + invasive_status.
Parse LEDA trait files (multiple semicolon/tab-delimited files)
parse_leda(dir_path)
dir_path |
Character. Directory containing the LEDA *.txt files. |
data.frame with canonical_name + 10 LEDA trait columns.
Parse LepTraits 1.0 butterfly consensus CSV
parse_leptraits(path)
path |
Character. Path to consensus.csv. |
data.frame with canonical_name + butterfly trait columns.
Parse Meiri (2018) lizard traits (XLSX from Figshare)
parse_lizard_traits(path)
path |
Character. Path to the ReptTraits/Meiri XLSX (or CSV/TSV). |
data.frame with canonical_name + lizard trait columns.
Parse NCBI common names from names.dmp
parse_ncbi_common_names(dir_path)
dir_path |
Character. Directory containing names.dmp. |
data.frame with canonical_name + lang (NA) + common_name.
Parse OTT common names from synonyms.tsv + taxonomy.tsv
parse_ott_common_names(dir_path)
dir_path |
Character. Directory containing synonyms.tsv + taxonomy.tsv. |
data.frame with canonical_name + lang (NA) + common_name.
Parse PanTHERIA mammal life-history traits (TSV)
parse_pantheria(path)
path |
Character. Path to PanTHERIA.txt. |
data.frame with canonical_name + life-history columns.
Parse WCVP names + distribution (from extracted ZIP directory)
parse_wcvp(dir_path)
dir_path |
Character. Directory containing wcvp_names + wcvp_distribution. |
data.frame with canonical_name + tdwg_code + native_status.
Parse Zanne et al. 2014 woodiness CSV
parse_woodiness(path)
path |
Character. Path to GlobalWoodinessDatabase.csv. |
data.frame with canonical_name + woodiness.
Adds the matching-key columns and embedded synonym info that taxify's
runtime expects. Operates on a normalized backbone (column names from
normalize_backbone()).
precompute_backbone(df, synonym_pattern = "SYNONYM")
df |
A normalized backbone data.frame. |
synonym_pattern |
Regex for synonym detection. |
The data.frame ready for build_vtr().
Uses the gh CLI. Assumes gh is authenticated and on PATH.
publish_release(backend_name, version, vtr_path, delta_path = NULL, meta_path = NULL, extras = character(0L), repo = "gcol33/taxifydb", notes = NULL)
backend_name |
Character. Backend identifier. |
version |
Character. Version string. |
vtr_path |
Character. Path to the main |
delta_path |
Character or NULL. Path to the |
meta_path |
Character or NULL. Path to the |
extras |
Character vector. Paths to additional sidecar artifacts
(e.g., |
repo |
Character. GitHub repo (e.g., "gcol33/taxifydb"). |
notes |
Character. Release notes. |
The release tag (invisibly).
Read and normalize a fetched list of AlgaeBase records
read_algaebase(records, verbose = TRUE)
records |
A list of raw search records (from |
verbose |
Logical. |
A normalized data.frame.
Read and normalize the COL Taxon.tsv
read_col(col_dir, verbose = TRUE)
col_dir |
Character. Path to the extracted COL directory. |
verbose |
Logical. |
A normalized data.frame.
Read a pipe-delimited NCBI .dmp file
read_dmp(path, col_names)
path |
Character. Path to the .dmp file. |
col_names |
Character vector of column names. |
A data.frame.
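NCBI taxdump .dmp files separate fields with tab, pipe, tab and end each row with tab, pipe. A minimal base-R sketch of parsing that layout follows; parse_dmp_lines() is a hypothetical helper that works on lines already read into memory, the sample lines are made up, and the sketch assumes the last field of every row is non-empty.

```r
# Hypothetical helper: split .dmp lines into a character data.frame.
parse_dmp_lines <- function(lines, col_names) {
  lines  <- sub("\t\\|$", "", lines)                   # drop the row terminator
  fields <- strsplit(lines, "\t|\t", fixed = TRUE)     # split on the field separator
  out <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
  names(out) <- col_names
  out
}

# Made-up lines in the four-column layout of names.dmp.
sample_lines <- c(
  "101\t|\tGenus alpha\t|\t\t|\tscientific name\t|",
  "101\t|\talpha weed\t|\t\t|\tcommon name\t|"
)
parsed <- parse_dmp_lines(
  sample_lines,
  col_names = c("tax_id", "name_txt", "unique_name", "name_class")
)
```

All columns come back as character; numeric identifiers such as tax_id would need an explicit as.integer() conversion afterwards.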
Resolves the parent-child hierarchy for family/genus, maps synonym relationships via TaxonConceptID, and produces the unified backbone schema.
read_euromed(csv_path, verbose = TRUE)
csv_path |
Character. Path to the EuroMed.csv file. |
verbose |
Logical. |
A normalized data.frame.
Read and normalize the Species Fungorum Plus ColDP files
read_fungorum(fungorum_dir, verbose = TRUE)
fungorum_dir |
Character. Directory containing Taxon.tsv, Name.tsv, Synonym.tsv. |
verbose |
Logical. |
A normalized data.frame.
Read and normalize the GBIF backbone
read_gbif(gz_path, verbose = TRUE)
gz_path |
Character. Path to simple.txt.gz. |
verbose |
Logical. |
A normalized data.frame ready for precompute_backbone().
Read and normalize the ITIS SQLite database
read_itis(sqlite_path, verbose = TRUE)
sqlite_path |
Character. Path to the ITIS .sqlite file. |
verbose |
Logical. |
A normalized data.frame ready for precompute_backbone().
Read and normalize the NCBI taxonomy dump
read_ncbi(dump_dir, verbose = TRUE)
dump_dir |
Character. Path to the extracted taxdump directory. |
verbose |
Logical. |
A normalized data.frame ready for precompute_backbone().
Read and normalize the OTT taxonomy
read_ott(ott_dir, verbose = TRUE)
ott_dir |
Character. Path to the extracted OTT directory. |
verbose |
Logical. |
A normalized data.frame.
Read and normalize the WFO classification file
read_wfo(txt_path, verbose = TRUE)
txt_path |
Character. Path to the WFO classification.txt file. |
verbose |
Logical. |
A normalized data.frame ready for precompute_backbone().
Read and normalize the WoRMS taxonomy
read_worms(worms_dir, verbose = TRUE)
worms_dir |
Character. Path to the extracted DwC-A directory. |
verbose |
Logical. |
A normalized data.frame.
Takes an enrichment data.frame with canonical_name + trait columns and
expands it so that each source name maps to all unique accepted_name
values across the requested backends.
resolve_enrichment_names(df, group_cols = NULL, backends = c("wfo", "col", "gbif", "itis", "ncbi", "ott", "worms"), verbose = TRUE, use_lookup = TRUE)
df |
A data.frame with at least a |
group_cols |
Character vector of grouping columns. Deduplication uses
|
backends |
Character vector of backend names. Default: all 7. |
verbose |
Logical. |
use_lookup |
Logical. Try the hash-join fast path first. Default |
By default tries the fast hash-join path against per-backend
name_lookup.vtr files in the user's taxify data directory (built by
build_all_name_lookups()); falls back to per-name-per-backend
taxify::taxify() if no lookup files are found.
The expanded data.frame.
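The expansion idea (each source name maps to every unique accepted_name found across the requested backends) can be sketched as a join against a lookup table. The lookup below is an in-memory stand-in with placeholder names; its column layout (backend, name, accepted_name) and the shape of the result are assumptions, not the format of the name_lookup.vtr files.

```r
# Enrichment rows keyed on canonical_name (placeholder values).
traits <- data.frame(
  canonical_name = c("Genus alpha", "Genus beta"),
  woodiness      = c("woody", "herbaceous"),
  stringsAsFactors = FALSE
)

# Stand-in lookup: which accepted name each backend assigns to a source name.
lookup <- data.frame(
  backend       = c("wfo", "col", "gbif", "wfo"),
  name          = c("Genus alpha", "Genus alpha", "Genus alpha", "Genus beta"),
  accepted_name = c("Genus alpha", "Genus alphina", "Genus alpha", "Genus beta"),
  stringsAsFactors = FALSE
)

backends <- c("wfo", "col", "gbif")
# Unique (source name, accepted name) pairs across the requested backends.
hits <- unique(lookup[lookup$backend %in% backends, c("name", "accepted_name")])
# One output row per pair, carrying the trait columns along.
expanded <- merge(traits, hits, by.x = "canonical_name", by.y = "name")
```

Here "Genus alpha" expands to two rows because the backends disagree on its accepted name, while the identical answer from two backends is collapsed by unique().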
Many backends (ITIS, NCBI, OTT) store taxonomy as a parent-child tree without explicit family/genus columns. This function walks up the tree to find the nearest ancestor at each target rank.
resolve_hierarchy(df, target_ranks = c("family", "genus"), max_depth = 20L)
df |
A data.frame with columns |
target_ranks |
Character vector of ranks to resolve (e.g.,
|
max_depth |
Maximum number of hops (default 20). |
The input data.frame with new columns named after target_ranks, containing the resolved ancestor name at that rank.
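Walking up a parent-child tree to the nearest ancestor at a given rank can be sketched in vectorized base R. The column names (id, parent_id, rank, name) and the helper resolve_rank() are assumptions for illustration, since the required input columns are not listed here. The sketch starts each walk at the row itself, so a genus row resolves to its own name; whether the package does the same is not stated.

```r
# Hypothetical helper: nearest ancestor (or self) at target_rank, per row.
resolve_rank <- function(df, target_rank, max_depth = 20L) {
  parent_row <- match(df$parent_id, df$id)   # row index of each parent; NA at the root
  current    <- seq_len(nrow(df))            # every walk starts at the row itself
  resolved   <- rep(NA_character_, nrow(df))
  for (hop in seq_len(max_depth)) {
    hit <- which(!is.na(current) & is.na(resolved) &
                   df$rank[current] == target_rank)
    resolved[hit] <- df$name[current[hit]]
    current <- parent_row[current]           # move all walks one level up
    if (all(is.na(current) | !is.na(resolved))) break
  }
  resolved
}

taxa <- data.frame(
  id        = c(1, 2, 3, 4),
  parent_id = c(NA, 1, 2, 3),
  rank      = c("order", "family", "genus", "species"),
  name      = c("Fagales", "Fagaceae", "Quercus", "Quercus robur"),
  stringsAsFactors = FALSE
)
taxa$family <- resolve_rank(taxa, "family")
taxa$genus  <- resolve_rank(taxa, "genus")
```

The max_depth bound keeps the loop finite even if the input contains a cycle in its parent links.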
Compute SHA-256 checksum of a file
sha256(path)
path |
Character. Path to the file. |
Character. Hex-encoded SHA-256 hash.
Update manifest.json enrichment entry from built meta.json
update_enrichment_manifest(manifest_path, name, vtr_path, repo = "gcol33/taxifydb")
manifest_path |
Character. Path to manifest.json. |
name |
Character. Enrichment identifier. |
vtr_path |
Character. Path to the built .vtr file. |
repo |
Character. GitHub repo for URL construction. |
The updated manifest (invisibly).
Update manifest.json after a successful backbone build
update_manifest(manifest_path, backend_name, version, vtr_path, delta_path = NULL, delta_from = NULL, extras = character(0L), repo = "gcol33/taxifydb", source_url = NULL)
manifest_path |
Character. Path to manifest.json. |
backend_name |
Character. |
version |
Character. |
vtr_path |
Character. |
delta_path |
Character or NULL. |
delta_from |
Character or NULL. Previous version the delta is from. |
extras |
Character vector. Paths to sidecar artifacts uploaded with
the release. Recorded as |
repo |
Character. GitHub repo for URL construction. |
source_url |
Character. Original data source URL. |
The updated manifest (invisibly).