---
title: "Out-of-core GIS with vectra"
author: "Gilles Colling"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Out-of-core GIS with vectra}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
eval = FALSE
)
```
```{r va-engine, echo = FALSE, results = "asis", eval = TRUE}
cat('
')
```
## The streaming envelope
Every desktop GIS runs its toolbox one in-memory layer at a time. Open a layer,
run the tool, write the result. The working layer has to fit in RAM, and a
country-scale point set or a national DEM often does not.
vectra runs the same toolbox one batch at a time. A query is pulled through the
C engine in fixed-size row groups, each spatial step works on the batch in front
of it, and the transformed batch spills back to disk as a fresh lazy node. Peak
memory is one batch and whatever small resident layer the step needs, so a
billion-row layer flows past a fixed memory budget.
Topology stays with [sf](https://r-spatial.github.io/sf/) and
[terra](https://rspatial.github.io/terra/): vectra adds no GEOS or GDAL link.
Geometry rides through the engine as hex-encoded WKB in an ordinary string
column, and the coordinate reference system is carried on the node. What vectra
contributes is the streaming envelope around the operations a desktop GIS keeps
resident.
```{r}
library(vectra)
```
## From a raster to one row per cell
A raster is a dense grid of values. To feed it into a tabular pipeline, read it
as one row per cell, each cell a record of its coordinate and value. That is the
bridge between the raster cube and every verb in the engine.
```{r raster-to-points-anim, echo = FALSE, results = "asis", eval = TRUE}
body <- "
var s=VA.setup('m2p-cv'); if(!s)return; var x=s.ctx,W=s.w,H=s.h,C=VA.C;
var GX=26,GY=72,COLS=8,ROWS=5,TW=44,TH=44,GW=COLS*TW,GH=ROWS*TH;
var N=COLS*ROWS;
var LX=GX+GW+70, LY=86, RH=15;
var seed=20260625;function rnd(){seed=(seed*1103515245+12345)&0x7fffffff;return seed/0x7fffffff;}
var V=[];for(var i=0;i ONE ROW PER CELL',16,30);VA.noglow(x);
var done=Math.floor(tc/PER);
for(var i=0;i\n",
"\n"))
```
`vec_open_raster()` opens a `.vec` raster
cube, and `vec_extract_points()` samples
it at coordinates. A whole national climate grid stays on disk while the
occurrence points it is queried at stay in memory.
```{r}
clim <- vec_open_raster("worldclim_bio.vec")
occ <- read.csv("occurrences.csv") # species, x, y
occ <- cbind(occ, vec_extract_points(clim, occ$x, occ$y))
head(occ)
```
For a GeoTIFF the same sampling is
`tiff_extract_points()`. From here the
points are an ordinary vectra table, ready for any verb.
## Select by location
The most-used vector tool keeps the features that stand in a spatial relation to
another layer: points inside a study region, parcels touching a road, patches
within a buffer. `spatial_filter()` streams
the large layer and tests each batch against a small resident locator with an sf
predicate, keeping or dropping whole features.
```{r select-by-location-anim, echo = FALSE, results = "asis", eval = TRUE}
body <- "
var s=VA.setup('sbl-cv'); if(!s)return; var x=s.ctx,W=s.w,H=s.h,C=VA.C;
var PER=6.0,HOLD=1.6,PERIOD=PER+HOLD;
var poly=[[330,128],[470,116],[548,214],[472,322],[332,332],[262,216]];
var seed=77;function rnd(){seed=(seed*1103515245+12345)&0x7fffffff;return seed/0x7fffffff;}
var P=[];for(var i=0;i<56;i++){P.push([60+rnd()*(W-120),64+rnd()*(H-118)]);}
P.sort(function(a,b){return a[0]-b[0];});
function inPoly(px,py){var inside=false;for(var i=0,j=poly.length-1;ipy)!=(yj>py))&&(px<(xj-xi)*(py-yi)/(yj-yi)+xi);if(hit)inside=!inside;}return inside;}
function drawPoly(){x.save();VA.glow(x,C.amber,8);x.strokeStyle=C.amber;x.lineWidth=2;x.beginPath();x.moveTo(poly[0][0],poly[0][1]);for(var i=1;i\n",
"\n"))
```
```{r}
region <- sf::st_read("study_area.gpkg", quiet = TRUE)
inside <- tbl("occurrences.vtr") |>
spatial_filter(region, coords = c("x", "y"), crs = 4326)
```
The result carries the same schema as the input, with the outside features
removed. `negate = TRUE` keeps the outside features instead. To cut geometry
along the region rather than keep whole features,
`spatial_clip()` intersects each batch with the
mask, and `erase = TRUE` keeps the part outside it.
```{r}
clipped <- tbl("rivers.vtr") |> spatial_clip(region, crs = 4326)
```
To tag each feature with the polygon it falls in rather than filter,
`spatial_join()` appends the polygon
attributes. When both layers are larger than RAM, `partition = grid(cellsize)`
bins them to a grid and joins one shard at a time, with the right layer itself a
streamed node.
```{r}
tagged <- tbl("gbif_points.vtr") |>
spatial_join(tbl("countries.vtr"), coords = c("x", "y"),
crs = 4326, partition = grid(1))
```
## Rasterize a streamed point set
Folding a point set into a grid is the headline larger-than-RAM operation: the
grid fits in memory, the points do not.
`rasterize()` maps each point to its cell through
the geotransform and accumulates a per-cell value in C, one batch at a time, no
spill. This is the monoid-fold tier: bounded memory, a single pass.
```{r rasterize-anim, echo = FALSE, results = "asis", eval = TRUE}
body <- "
var s=VA.setup('rz-cv'); if(!s)return; var x=s.ctx,W=s.w,H=s.h,C=VA.C;
var GX=44,GW=W-88,COLS=16,ROWS=8,TW=GW/COLS,GY=150,TH=27,GH=ROWS*TH;
var PERIOD=9.0;
var seed=531;function rnd(){seed=(seed*1103515245+12345)&0x7fffffff;return seed/0x7fffffff;}
var D=[];var cxC=10.5,cyC=4.5;
for(var i=0;i<150;i++){var c=Math.round(VA.clamp(cxC+(rnd()-0.5)*9+(rnd()-0.5)*5,0,COLS-1));var r=Math.round(VA.clamp(cyC+(rnd()-0.5)*5+(rnd()-0.5)*3,0,ROWS-1));D.push([c,r,rnd()]);}
function draw(tc){VA.bg(x,W,H);
VA.glow(x,C.cyan,7);x.fillStyle=C.cyan;x.textAlign='left';x.font=VA.F(15,true);x.fillText('RASTERIZE :: POINTS -> DENSITY GRID',16,28);VA.noglow(x);
var cnt=new Array(COLS*ROWS).fill(0),maxc=1;
for(var i=0;i=rt+0.7){var k=D[i][1]*COLS+D[i][0];cnt[k]++;if(cnt[k]>maxc)maxc=cnt[k];}}
for(var r=0;r0?'rgba(58,215,255,'+(0.14+0.78*v).toFixed(3)+')':C.off;
x.fillRect(GX+c*TW+1,GY+r*TH+1,TW-2,TH-2);}
x.strokeStyle=C.grid;for(var c2=0;c2<=COLS;c2++){x.beginPath();x.moveTo(GX+c2*TW,GY);x.lineTo(GX+c2*TW,GY+GH);x.stroke();}
for(var r2=0;r2<=ROWS;r2++){x.beginPath();x.moveTo(GX,GY+r2*TH);x.lineTo(GX+GW,GY+r2*TH);x.stroke();}
for(var i=0;i=0&&fall<0.7){var f=fall/0.7;var tcx=GX+(D[i][0]+0.5)*TW,tcy=GY+(D[i][1]+0.5)*TH;var py=VA.lerp(62,tcy,VA.ease(f));
VA.glow(x,C.green,7);x.fillStyle=C.green;x.globalAlpha=1-0.25*f;x.beginPath();x.arc(tcx,py,3.4,0,6.2832);x.fill();x.globalAlpha=1;VA.noglow(x);}}
x.fillStyle=C.cyan;x.font=VA.F(12,true);x.textAlign='right';x.fillText('brighter = more points per cell',W-16,GY-12);
x.fillStyle=C.mut;x.font=VA.F(11);x.textAlign='left';x.fillText('one batch streams past; the grid stays resident',16,H-18);
VA.scan(x,W,H);
}
VA.run(draw,PERIOD,null,'rz','rz-cv');
"
cat(paste0(
"\n",
"\n"))
```
```{r}
# Point density on a continental grid, streamed from a billion-row file.
density <- tbl("gbif_points.vtr") |>
rasterize(extent = c(-180, -90, 180, 90), res = 0.1,
fun = "count", path = "density.vec")
```
`fun` is the per-cell reduction: `"count"`, `"sum"`, `"mean"`, `"min"`, `"max"`.
With a `field` it aggregates that column rather than tallies points. The output
is a `.vec` cube or an in-memory matrix.
Going the other way, `zonal()` summarises a raster
within zones, streaming the grid one tile-row strip at a time and folding
per-zone moments in memory.
```{r}
admin <- sf::st_read("regions.gpkg", quiet = TRUE)
zonal(clim, admin, fun = c("mean", "sd"), zone_field = "region_id")
```
## Terrain on a streamed DEM
Moving-window operations read each output tile expanded by the kernel radius (a
halo read), so a national DEM never has to be resident.
`focal()` applies an arbitrary weight window, and
`terrain()` derives slope, aspect, hillshade, TPI,
roughness, and TRI with Horn's method, matching `terra::terrain()`. When `path`
is given the result streams straight back to a new `.vec`, so neither the input
nor the output band is ever fully in memory.
```{r}
terrain("dem.vec", v = c("slope", "aspect", "hillshade"),
path = "dem_derivatives.vec")
```
To resample or reproject, `warp()` walks the output one
tile-row strip at a time, projects each strip's pixel centres into the source CRS
(via PROJ when the two CRSs differ), and samples the bounded source window with
nearest, bilinear, or cubic interpolation, matching `terra::project()`.
```{r}
warp("dem.vec", list(crs = 3035, res = 25),
method = "bilinear", path = "dem_laea.vec")
```
## Back from raster to vector
`rasterize()` carries points onto a grid; `polygonize()` carries a grid back to
vector. It reads one tile-row strip at a time and dissolves equal-valued cells
into one polygon per value, so a classified land-cover raster becomes a polygon
layer out of core.
```{r}
habitat <- polygonize("landcover.vec") # one polygon per class
```
`contours()` traces iso-lines from a continuous surface with marching squares
over the same haloed strip pass, then joins each level's segments into
continuous lines.
```{r}
isolines <- contours("dem.vec", levels = seq(0, 3000, by = 200))
```
## Masking and raster math
`mask()` clips a raster to a polygon layer one strip at a time, keeping the
pixels whose centre falls inside it, the raster counterpart of `spatial_clip()`.
```{r}
study <- mask("dem.vec", region, path = "dem_study.vec")
```
`rast_calc()` evaluates a cellwise expression across aligned rasters, reading
each input strip by strip. A vegetation index, a reclassification, or arithmetic
across layers is one expression over the named rasters.
```{r}
ndvi <- rast_calc(list(nir = "nir.vec", red = "red.vec"),
(nir - red) / (nir + red), path = "ndvi.vec")
```
`mosaic()` merges rasters that share a resolution and cell grid onto their union,
resolving overlap with `mean`, `first`, `last`, `min`, `max`, or `sum`, one
output strip at a time.
```{r}
tile_merge <- mosaic(list("n50.vec", "n51.vec", "n52.vec"), fun = "mean")
```
## Distance to the nearest feature
`proximity()` gives every cell its Euclidean distance to the nearest feature
cell, in CRS units. Features are the non-NA cells, or the cells whose value is
in `target`. The transform is separable: a pass along the rows, a pass along the
columns, each linear in the line length, with an out-of-core transpose between
them so the whole grid is never resident. Squared distances scale by the x and y
resolution, so the answer is exact on non-square cells.
```{r}
dist_to_road <- proximity("roads.vec", path = "road_distance.vec")
sea_distance <- proximity("landcover.vec", target = 0) # 0 = water
```
Cost-distance, which accumulates a per-cell friction along the path, is a global
shortest-path problem and stays resident: `collect()` the raster and run a
resident solver for it.
## The cost model
Each operation states its memory tier, so the toolbox reads as a cost model
rather than a grab-bag. The tier tells you what a run holds resident.
```{r cost-tiers-anim, echo = FALSE, results = "asis", eval = TRUE}
body <- "
var s=VA.setup('ct-cv'); if(!s)return; var x=s.ctx,W=s.w,H=s.h,C=VA.C;
var PERIOD=8.0;
var rows=[
{name:'monoid fold',sub:'spatial_filter, rasterize, zonal',col:C.green,f:function(t){return 0.15+0.02*Math.sin(t*9);}},
{name:'sort / partition',sub:'focal, terrain, warp, partitioned join',col:C.amber,f:function(t){return 0.16+0.2*(0.5-0.5*Math.cos(t*6.5));}},
{name:'all-to-all',sub:'self-overlay, voronoi, neighbour graphs',col:C.red,f:function(t){return VA.clamp(t,0,1)*0.9+0.05;}}
];
var PX=212,PW=W-212-92,BH=56,GAP=36,TOP=74;
function draw(tc){VA.bg(x,W,H);
VA.glow(x,C.cyan,7);x.fillStyle=C.cyan;x.textAlign='left';x.font=VA.F(15,true);x.fillText('COST-MODEL TIERS :: MEMORY OVER A RUN',16,28);VA.noglow(x);
var t=tc/PERIOD;
for(var i=0;i<3;i++){var y=TOP+i*(BH+GAP);var rw=rows[i];
x.strokeStyle=C.red;x.setLineDash([5,4]);x.beginPath();x.moveTo(PX,y-7);x.lineTo(PX+PW,y-7);x.stroke();x.setLineDash([]);
x.fillStyle=C.off;x.fillRect(PX,y,PW,BH);x.strokeStyle=C.grid;x.strokeRect(PX,y,PW,BH);
var cursor=VA.clamp(t,0,1),steps=Math.floor(cursor*PW);
x.fillStyle=rw.col;x.globalAlpha=0.82;
for(var px=0;px\n",
"\n"))
```
- **Monoid fold.** One batch at a time, bounded memory, no spill:
`spatial_filter()`, `spatial_clip()`, `rasterize()`, `zonal()`, and every
`spatial_map()` recipe.
- **Sort / partition.** Spatial locality first: `focal()`, `terrain()`,
`warp()` over tile-and-halo reads, `proximity()` over its transpose passes,
`spatial_dissolve()`, and the partitioned `spatial_join()`.
- **All-to-all.** Inherently resident: `spatial_overlay()` and operations such
as Voronoi, Delaunay, convex hull, and global neighbour graphs. For these,
collect the result and run sf or terra on it; vectra does not pretend to
stream what is global by nature.
That boundary is the honest part. The streaming envelope covers the operations
whose memory can be bounded, and names the ones it cannot.
## Where to go next
- [Larger-than-RAM strategy](large-data.html) for the spill and partition tiers
in depth.
- [Offloading](offload.html) for the cost grades the spatial tiers reuse.
- [Species distribution modelling](sdm.html) for an end-to-end ecological
workflow.
- The function reference for every spatial verb and its arguments.