Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add GeometryIndex #7

Merged
merged 28 commits into from
Nov 23, 2022
Merged
Show file tree
Hide file tree
Changes from 12 commits
Commits
Show all changes
28 commits
Select commit Hold shift + click to select a range
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 3 additions & 3 deletions doc/source/api.rst
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,6 @@ Indexing
.. autosummary::
:toctree: generated/

ShapelySTRTreeIndex.__init__
ShapelySTRTreeIndex.from_variables
ShapelySTRTreeIndex.sel
GeoVectorIndex
GeoVectorIndex.crs
GeoVectorIndex.sindex
2,659 changes: 1,997 additions & 662 deletions doc/source/quickstart.ipynb

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,7 @@ classifiers = [
requires-python = ">=3.8"
dependencies = [
"xarray >= 2022.11.0",
"pyproj >= 2.6.1.post1",
benbovy marked this conversation as resolved.
Show resolved Hide resolved
"shapely >= 2.0b1",
]

Expand Down
4 changes: 2 additions & 2 deletions xvec/__init__.py
Original file line number Diff line number Diff line change
@@ -1,9 +1,9 @@
from importlib.metadata import PackageNotFoundError, version

from .strtree import ShapelySTRTreeIndex # noqa
from .index import GeoVectorIndex # noqa

try:
__version__ = version("package-name")
__version__ = version("xvec")
martinfleis marked this conversation as resolved.
Show resolved Hide resolved
except PackageNotFoundError: # noqa
# package is not installed
pass
193 changes: 193 additions & 0 deletions xvec/index.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,193 @@
from __future__ import annotations

from typing import Any, Hashable, Iterable, Mapping, Sequence

import numpy as np
import pandas as pd
import shapely
from pyproj import CRS
from xarray import DataArray, Variable
from xarray.core.indexing import IndexSelResult
from xarray.indexes import Index, PandasIndex


class GeoVectorIndex(Index):
"""An CRS-aware, Xarray-compatible index for vector geometries.

This index can be set from any 1-dimensional coordinate of
(shapely 2.0) :class:`shapely.Geometry` elements.

It provides all the basic functionality of an
:class:`xarray.indexes.PandasIndex`. In addition, it allows spatial
filtering based on geometries (powered by :class:`shapely.STRtree`).

Parameters
----------
index : :class:`xarray.indexes.PandasIndex`
An Xarray (pandas) index built from an array-like of
:class:`shapely.Geometry` objects.
crs : object
benbovy marked this conversation as resolved.
Show resolved Hide resolved
The coordinate reference system. Any value accepted by
:meth:`pyproj.crs.CRS.from_user_input`.

"""

_index: PandasIndex
_sindex: shapely.STRtree | None
_crs: CRS

def __init__(self, index: PandasIndex, crs: CRS):
if not np.all(shapely.is_geometry(index.index)):
raise ValueError("array must contain shapely.Geometry objects")

self._crs = CRS.from_user_input(crs)
self._index = index
self._sindex = None

@property
def crs(self) -> CRS:
"""Returns the coordinate reference system of the index as a
:class:`pyproj.crs.CRS` object.
"""
return self._crs

@property
def sindex(self) -> shapely.STRtree:
"""Returns the spatial index, i.e., a :class:`shapely.STRtree` object.

It may build the index before returning it.
benbovy marked this conversation as resolved.
Show resolved Hide resolved
"""
if self._sindex is None:
self._sindex = shapely.STRtree(self._index.index)
return self._sindex

@classmethod
def from_variables(
cls,
variables: Mapping[Any, Variable],
*,
options: Mapping[str, Any],
):
# TODO: try getting CRS from coordinate attrs or GeometryArray
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# TODO: try getting CRS from coordinate attrs or GeometryArray
# TODO: try getting CRS from coordinate attrs or GeometryArray or SRID

Just so we don't forget about this option as well. It is not common but when loading geoms from PostGIS, this may be useful.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we can somehow get xarray to extract the CRS from the GeometryArray and store it on the object? That would probably have to be done in a hook that would be defined here... some sort of array type → hook callable mapping where the hooks return either the full Variable object or a data, attrs tuple, maybe?

Also, that would require some thought into how to store the crs. Not sure if we could copy whatever rioxarray / odc-geo are doing, since those seem to have one global crs per DataArray / Dataset?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it would be nice to extract the CRS from the GeometryArray, but for dimension coordinates Xarray currently converts the GeometryArray into a plain numpy array, therefore losing the CRS info. We need to fix that when refactoring IndexVariable in Xarray.

Also, that would require some thought into how to store the crs. Not sure if we could copy whatever rioxarray / odc-geo are doing, since those seem to have one global crs per DataArray / Dataset?

With a CRSIndex (corteva/rioxarray#588) or RasterIndex built from three coordinates (2D spatial + spatial_ref), having one global crs per DataArray / Dataset rioxarray may not be required anymore and the conventions regarding the coordinate names may even be relaxed.

While for rasters it makes sense to have a scalar coordinate holding the CRS info to avoid duplicating it, for vector data the coordinate <-> index relationship is 1:1 so we could add the CRS info directly as attribute(s) of the vector coordinate.

It may be a good idea to add CRS info (a serializable version) as a coordinate attribute, if not present. So basic CRS info (identifier) would be accessible from the DataArray / Dataset repr, and the full CRS object would be accessible from the index object. E.g.,

>>> ds.geom_coord.attrs["crs"]
'EPSG:4267'
>>> ds.xindexes["geom_coord"].crs
<pyproj.crs.CRS>

if "crs" not in options:
raise ValueError("a CRS must be provided")
benbovy marked this conversation as resolved.
Show resolved Hide resolved

index = PandasIndex.from_variables(variables, options={})
martinfleis marked this conversation as resolved.
Show resolved Hide resolved
return cls(index, crs=options["crs"])

@classmethod
def concat(
cls,
indexes: Sequence[GeoVectorIndex],
dim: Hashable,
positions: Iterable[Iterable[int]] | None = None,
) -> GeoVectorIndex:
crss = [idx.crs for idx in indexes]

if any([s != crss[0] for s in crss]):
raise ValueError("conflicting CRS for coordinates to concat")

indexes_ = [idx._index for idx in indexes]
index = PandasIndex.concat(indexes_, dim, positions)
return cls(index, crss[0])

def create_variables(
self, variables: Mapping[Any, Variable] | None = None
) -> dict[Hashable, Variable]:
return self._index.create_variables(variables)

def to_pandas_index(self) -> pd.Index:
return self._index.index

def isel(self, indexers: Mapping[Any, Any]):
index = self._index.isel(indexers)

if index is not None:
return type(self)(index, self.crs)
else:
return None

def _sel_sindex(self, labels, method, tolerance):
# only one coordinate supported
assert len(labels) == 1
martinfleis marked this conversation as resolved.
Show resolved Hide resolved
label = next(iter(labels.values()))

if method != "nearest":
if not isinstance(label, shapely.Geometry):
raise ValueError(
"selection with another method than nearest only supports "
"a single geometry as input label."
)

if isinstance(label, DataArray):
label_array = label._variable._data
elif isinstance(label, Variable):
label_array = label._data
elif isinstance(label, shapely.Geometry):
label_array = np.array([label])
else:
label_array = np.array(label)

# check for possible CRS of geometry labels
# (by default assume same CRS than the index)
if hasattr(label_array, "crs") and label_array.crs != self.crs:
raise ValueError("conflicting CRS for input geometries")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will also need a change if crs becomes optional.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What should we do when label CRS is not None and index CRS is None? Raise or ignore the label CRS?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In geopandas we warn that the two CRS are not matching but let the operation happen. I'd do the same.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for being consistent with geopandas. Is it for all operations that could depend on a CRS?

(Note: it might be tricky to set the right stacklevel for the warnings here since the index API is called in various places of Xarray's internals, but not a big deal I think).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When two CRS are checked against each other like in overlay or sjoin, if one gdf has a CRS and the other does not, it always warns and executes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I wonder if we shouldn't have a special case for .sel(). I find it convenient to just pass one or more (list/array) shapely geometries as input labels without an explicit CRS. A warning may be annoying.


assert np.all(shapely.is_geometry(label_array))
benbovy marked this conversation as resolved.
Show resolved Hide resolved

if method == "nearest":
indices = self.sindex.nearest(label_array)
else:
indices = self.sindex.query(label, predicate=method, distance=tolerance)

# attach dimension names and/or coordinates to positional indexer
if isinstance(label, Variable):
indices = Variable(label.dims, indices)
elif isinstance(label, DataArray):
indices = DataArray(indices, coords=label._coords, dims=label.dims)

return IndexSelResult({self._index.dim: indices})

def sel(
self, labels: dict[Any, Any], method=None, tolerance=None
) -> IndexSelResult:
if method is None:
return self._index.sel(labels)
else:
# We reuse here `method` and `tolerance` options of
# `xarray.indexes.PandasIndex` as `predicate` and `distance`
# options when `labels` is a single geometry.
# Xarray currently doesn't support custom options
# (see https://github.com/pydata/xarray/issues/7099)
return self._sel_sindex(labels, method, tolerance)

def equals(self, other: Index) -> bool:
if not isinstance(other, GeoVectorIndex):
return False
if other.crs != self.crs:
return False
return self._index.equals(other._index)

def join(
self: GeoVectorIndex, other: GeoVectorIndex, how: str = "inner"
) -> GeoVectorIndex:
index = self._index.join(other._index, how=how)
return type(self)(index, self.crs)
martinfleis marked this conversation as resolved.
Show resolved Hide resolved

def reindex_like(
self, other: GeoVectorIndex, method=None, tolerance=None
) -> dict[Hashable, Any]:
return self._index.reindex_like(
other._index, method=method, tolerance=tolerance
)

def roll(self, shifts: Mapping[Any, int]) -> GeoVectorIndex:
index = self._index.roll(shifts)
return type(self)(index, self.crs)

def rename(self, name_dict, dims_dict):
index = self._index.rename(name_dict, dims_dict)
return type(self)(index, self.crs)

def _repr_inline_(self, max_width):
return f"{self.__class__.__name__}(crs={self.crs.to_string()})"
benbovy marked this conversation as resolved.
Show resolved Hide resolved
72 changes: 0 additions & 72 deletions xvec/strtree.py

This file was deleted.

Loading