From f8c6f0af36c3de3756a375b09d7c89fa6a67e30b Mon Sep 17 00:00:00 2001 From: Simon Ilyushchenko Date: Fri, 5 Jan 2024 13:49:55 -0800 Subject: [PATCH] add public docs for the overall workflow of adding external datasets PiperOrigin-RevId: 596075960 --- README.md | 6 ++- docs/adding_datasets.md | 116 ++++++++++++++++++++++++++++++++++++++++ docs/spaces.md | 1 + 3 files changed, 121 insertions(+), 2 deletions(-) create mode 100644 docs/adding_datasets.md create mode 100644 docs/spaces.md diff --git a/README.md b/README.md index 8077e4e49..a2751c1c7 100644 --- a/README.md +++ b/README.md @@ -46,9 +46,11 @@ You can use these external services to browse the EE STAC catalog: - [gee.stac.cloud](https://gee.stac.cloud/) -# Install / Build instructions +# Local Install Instructions -See [install](docs/install.md) +If you'd like to run validity checks locally (not via GitHub actions), see +[the local installation instructions](docs/install.md). Most people won't +need this. # Non-commercial datasets diff --git a/docs/adding_datasets.md b/docs/adding_datasets.md new file mode 100644 index 000000000..08359aea5 --- /dev/null +++ b/docs/adding_datasets.md @@ -0,0 +1,116 @@ +These instructions outline the workflow of adding a new dataset +to the Earth Engine catalog. See [docs on making simple edits](simple_edits.md) +for pointers on creating pull requests in GitHub. + +## Overview + +Currently, adding user-uploaded assets to the public catalog +involves mirroring these assets into public Earth Engine folders. +However, the source user-uploaded assets still need to be kept in user folders +for as long as the dataset is present in the catalog. + +To add a new dataset: + +* File a request and get confirmation that the dataset will be accepted. +* Write a jsonnet file describing the dataset. Write example JS scripts. +* Create and submit a GitHub pull request with these files. + +See also +[dataset acceptance criteria](https://developers.google.com/earth-engine/help_collection_criteria). + +## Detailed steps for adding a new dataset: + +1. File a bug +[to add a new dataset](https://issuetracker.google.com/issues?q=status:(open%20%7C%20new%20%7C%20assigned%20%7C%20accepted)%20componentid:1161680&p=1) +or +[to update an existing one](https://issuetracker.google.com/issues?q=status:(open%20%7C%20new%20%7C%20assigned%20%7C%20accepted)%20componentid:1161653). +Reference the existing user-uploaded asset id and make sure the asset is +publicly readable. + +1. Get a general confirmation from the Earth Engine Data team that the dataset +will be accepted. + +1. Choose a public dataset id that the data will be mirrored to. + +1. Wait until Earth Engine Data team configures asset mirroring. + +1. If this is your first time creating a pull request (PR) for EE datasets, +create and submit a trivial PR modifying [this file](spaces.md) (for example, +add or remove a space). This will make it easier to run automated checks on +your real PRs later. + +1. Create a jsonnet file describing the dataset using any of the existing files +as a starting point. See also [template files with field +annotations](../catalog/TEMPLATE). + +1. Make sure the `gee:terms_of_use` field describes the data license and the +`links` fields contains `ee.link.license()` pointing at the URL with the +licensing terms. + +1. New dataset will not be activated at first. To indicate that, set +`'gee:skip_indexing': true` at the top level. Don't yet add the new dataset to +the `catalog.jsonnet` file. + +1. In the examples/ directory, create a JavaScript file that will be used as +the main example. + +1. In the same directory, create another JavaScript file that generates a +256x256 preview thumbnail. This thumbnail will be used in the catalog to +identify the dataset, so choose a representative and good-looking +visualization. Make sure to hide the basemap (e.g., by using a single-color +background). + +1. Create a GitHub pull request with all the files you changed or added. + +1. This will trigger automatic syntax and validity checks. Their results can be +seen in the "Checks" section of the pull request UI. Fix as many issues as you +can, and ask the Earth Engine Data team for help with the rest. + +1. When all the checks pass, ask the Earth Engine Data team to review the PR. + +1. When the PR is approved, submit it. Wait for the Earth Engine Data team to +activate the dataset (this usually simply means adding the thumbnails generated +by the preview script). + +1. If your jsonnet file specified classification band colors or image +attributes that you'd like to preserve on catalog datasets, make sure the Earth +Engine Data team runs the mirroring job once again to set those fields. + +1. Review the dataset page in the [HTML catalog](https://developers.google.com/earth-engine/datasets) to make sure everything looks good. + +## Data normalization + +One of the benefits of Earth Engine for end users is uniform presentation of +data with few surprises. This is achieved by making sure the datasets are +normalized to a common form as much as possible during the +ingestion/preparation phase, which means a little bit more work up front for +data producers. + +Here is some advice for data normalization. + +1. For global datasets, prefer single assets over tiled mosaics. + +1. Images with the same band signatures should be in the same image collection. +However, collections should be homogenous - if not all assets in a collection +have the same band names and types, either reingest the assets to make the +bands the same or use multiple collections. + +1. Use human-readable band names, not the default 'b1', 'b2', etc. + +1. Set UTC start and end times on all assets. + +1. Make sure bands with non-continuous values (e.g., classification or bitmask +bands) are ingested with the pyramiding policy MODE, not the default policy +MEAN. + +1. Don't mix continuous and classification values in the same band - create two +separate bands in such cases. + +1. If your datasets have multiple versions, create successor/predecessor links +using the versioning approach [similar to this +one](https://github.com/google/earthengine-catalog/tree/main/catalog/UMD): put +a version map into a file named `dataset.libsonnet`, then use this map in +every jsonnet file. Mark all but the most recent versions with `deprecated: +true`. + +1. Don't create new single-use keywords. If you feel a new keyword would make sense, propose other existing datasets where it should also be added. diff --git a/docs/spaces.md b/docs/spaces.md new file mode 100644 index 000000000..c611cb474 --- /dev/null +++ b/docs/spaces.md @@ -0,0 +1 @@ +Add or remove spaces here. To be used for trivial PRs from first-time contributors.