---
title: "Geometry functions in expressions"
author: "Gilles Colling"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Geometry functions in expressions}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
has_sf <- requireNamespace("sf", quietly = TRUE)
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = has_sf
)
```

vectra carries geometry through the engine as hex-encoded WKB in an ordinary
string column. The verbs in
[Streaming spatial operations](streaming-spatial.html) wrap whole sf operations
around that column. This vignette covers the other half of the spatial surface:
a family of `st_*` functions that work inside the expression verbs themselves,
so a measure, a predicate, or a geometry transform is just another term in
`mutate()`, `filter()`, or `summarise()`.

These functions run on the GEOS C library straight off the WKB column, one row
at a time, with no per-batch round-trip through sf. `filter(st_area(geometry) >
1e6)` prunes the stream in C, and `mutate(here = st_centroid(geometry))` adds a
new WKB geometry column that any later verb can read. The per-row decode is
parallelised with OpenMP, so a measure over a large layer uses every core.

```{r libs, message = FALSE}
library(vectra)
library(sf)
```

## A layer to work on

The examples use the North Carolina counties shipped with sf. Writing the layer
to a `.vtr` is the usual first step: the geometry becomes a hex-WKB string
column (named `geometry` by convention), and the attributes ride alongside it.

```{r data}
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

f <- tempfile(fileext = ".vtr")
write_vtr(data.frame(
  NAME     = nc$NAME,
  BIR74    = nc$BIR74,
  geometry = st_as_binary(st_geometry(nc), hex = TRUE)
), f)

tbl(f)
```

The counties are stored in longitude and latitude, so every measure below is
planar in those units, the same convention the streaming verbs follow. Project
the layer first if you need metric areas or distances.

## Measures

A measure reads a geometry and returns a number, so it drops into `mutate()` as
an ordinary column.

```{r measures}
tbl(f) |>
  mutate(area   = st_area(geometry),
         perim  = st_perimeter(geometry),
         nverts = st_npoints(geometry)) |>
  select(NAME, area, perim, nverts) |>
  collect() |>
  head()
```

`st_length()` returns the boundary length of a polygon (an alias of
`st_perimeter()`) and the line length of a linestring. `st_ngeometries()`
counts the parts of a multi-geometry. `st_x()` and `st_y()` read the coordinate
of a point and return `NA` for anything that is not a point, which makes them
most useful on a centroid:

```{r coords}
tbl(f) |>
  mutate(centroid = st_centroid(geometry),
         cx = st_x(centroid),
         cy = st_y(centroid)) |>
  select(NAME, cx, cy) |>
  collect() |>
  head()
```

A geometry-valued function such as `st_centroid()` produces a new WKB column
(`centroid` above), and the next term reads it like any other column. That is
the whole mechanism: geometry in, geometry or a scalar out, all as columns.

## Predicates

A unary predicate tests one geometry: `st_is_valid()`, `st_is_empty()`,
`st_is_simple()`. A binary predicate tests a topological relation against a
second geometry: `st_intersects()`, `st_within()`, `st_contains()`,
`st_overlaps()`, `st_touches()`, `st_crosses()`, `st_equals()`,
`st_disjoint()`, `st_covers()`, `st_covered_by()`.

In `filter()` a predicate keeps the rows where the relation holds, the
geometry-expression form of select-by-location:

```{r predicate-filter}
aoi <- st_as_sfc(st_bbox(c(xmin = -80, ymin = 35, xmax = -78, ymax = 36.5)),
                 crs = st_crs(nc))

tbl(f) |>
  filter(st_intersects(geometry, aoi)) |>
  collect() |>
  nrow()
```

The second geometry here is a constant `sf` object. It is parsed once and shared
read-only across every row, so testing a whole stream against one area of
interest stays cheap. A multi-feature object is unioned to a single geometry
first.

In `mutate()` the same call returns a logical column, ready for a later verb:

```{r predicate-mutate}
tbl(f) |>
  mutate(near_raleigh = st_intersects(geometry, aoi)) |>
  filter(near_raleigh) |>
  select(NAME) |>
  collect() |>
  head()
```

## Distance

`st_distance()` returns the shortest planar distance to a second geometry,
again a constant or another column:

```{r distance}
raleigh <- st_sfc(st_point(c(-78.64, 35.78)), crs = st_crs(nc))

tbl(f) |>
  mutate(centroid   = st_centroid(geometry),
         d_raleigh  = st_distance(centroid, raleigh)) |>
  select(NAME, d_raleigh) |>
  arrange(d_raleigh) |>
  collect() |>
  head()
```

When the second argument is a geometry column instead of a constant, the
distance is computed row by row between the two columns.

## Aggregating a measure

Because a measure is an ordinary numeric column, it aggregates like one. A
grouped `summarise()` over a measure is a zonal total computed entirely in the
stream:

```{r summarise}
tbl(f) |>
  mutate(area = st_area(geometry)) |>
  summarise(total_area = sum(area), counties = n()) |>
  collect()
```

## Transforms

A transform returns a geometry, so it builds a new WKB column. Materialise it
with `collect_sf()`, which reads the WKB column back into an `sf` object (point
it at the column with `geom =`, and pass the `crs` the layer was stored in).

```{r transforms}
hulls <- tbl(f) |>
  mutate(geometry = st_convex_hull(geometry)) |>
  select(NAME, geometry) |>
  collect_sf(crs = st_crs(nc))

hulls
```

The transform set is `st_centroid()`, `st_point_on_surface()` (a point
guaranteed to lie on the geometry), `st_boundary()`, `st_envelope()` (the
bounding rectangle), `st_convex_hull()`, `st_make_valid()` (repair an invalid
geometry), and two parameterised forms: `st_buffer(g, dist)` and
`st_simplify(g, tol)`. Buffering each county and reading the areas back:

```{r buffer}
tbl(f) |>
  mutate(geometry = st_buffer(geometry, 0.1)) |>
  select(NAME, geometry) |>
  collect_sf(crs = st_crs(nc)) |>
  st_area() |>
  head()
```

`st_geometry_type()` returns the GEOS type name (`"Point"`, `"Polygon"`,
`"MultiPolygon"`, and so on) as a string column.

## The second geometry of a binary op

For `st_distance()` and the binary predicates, the second argument can be:

- another geometry column, compared row by row;

- a constant `sf` or `sfc` object, parsed once and reused across the stream (a
  multi-feature object is unioned to one geometry);
- a hex-WKB string.

## Missing geometry

A missing (`NA`) or unparseable geometry, or an operation with no answer (the
coordinate of a non-point, the distance to a missing geometry), yields `NA` for
that row rather than stopping the query:

```{r na}
g <- tempfile(fileext = ".vtr")
write_vtr(data.frame(
  id = 1:4,
  geometry = c(st_as_binary(st_geometry(nc)[1:3], hex = TRUE), NA)
), g)

tbl(g) |>
  mutate(area = st_area(geometry)) |>
  collect()
```

## Where this fits

The `st_*` expressions are the scalar, per-row layer of vectra's spatial
surface. They cover measures, predicates, and the common single-geometry
transforms at the price of a column term, with no sf object built per batch. For
an arbitrary per-feature transform that has no `st_*` form, reach for
`spatial_map()`, which runs any sf-in, sf-out function over each feature. For
constructions that read a whole feature set at once (dissolves, overlays, hulls
of a group, planar topology), the set-wise `spatial_*` verbs in
[Streaming spatial operations](streaming-spatial.html) and
[Coverage and topology](coverage-topology.html) are the tools.

```{r cleanup, include = FALSE}
unlink(c(f, g))
```