From f8c6f0af36c3de3756a375b09d7c89fa6a67e30b Mon Sep 17 00:00:00 2001
From: Simon Ilyushchenko <simonf@google.com>
Date: Fri, 5 Jan 2024 13:49:55 -0800
Subject: [PATCH] add public docs for the overall workflow of adding external
 datasets

PiperOrigin-RevId: 596075960
---
 README.md               |   6 ++-
 docs/adding_datasets.md | 116 ++++++++++++++++++++++++++++++++++++++++
 docs/spaces.md          |   1 +
 3 files changed, 121 insertions(+), 2 deletions(-)
 create mode 100644 docs/adding_datasets.md
 create mode 100644 docs/spaces.md

diff --git a/README.md b/README.md
index 8077e4e49..a2751c1c7 100644
--- a/README.md
+++ b/README.md
@@ -46,9 +46,11 @@ You can use these external services to browse the EE STAC catalog:
 - [gee.stac.cloud](https://gee.stac.cloud/)
 
 
-# Install / Build instructions
+# Local Install Instructions
 
-See [install](docs/install.md)
+If you'd like to run validity checks locally (not via GitHub Actions), see
+[the local installation instructions](docs/install.md). Most people won't
+need this.
 
 # Non-commercial datasets
 
diff --git a/docs/adding_datasets.md b/docs/adding_datasets.md
new file mode 100644
index 000000000..08359aea5
--- /dev/null
+++ b/docs/adding_datasets.md
@@ -0,0 +1,116 @@
+These instructions outline the workflow for adding a new dataset
+to the Earth Engine catalog. See [docs on making simple edits](simple_edits.md)
+for pointers on creating pull requests in GitHub.
+
+## Overview
+
+Currently, adding user-uploaded assets to the public catalog
+involves mirroring these assets into public Earth Engine folders.
+However, the source user-uploaded assets still need to be kept in user folders
+for as long as the dataset is present in the catalog.
+
+To add a new dataset:
+
+* File a request and get confirmation that the dataset will be accepted.
+* Write a jsonnet file describing the dataset. Write example JS scripts.
+* Create and submit a GitHub pull request with these files.
+
+See also
+[dataset acceptance criteria](https://developers.google.com/earth-engine/help_collection_criteria).
+
+## Detailed steps for adding a new dataset
+
+1. File a bug
+[to add a new dataset](https://issuetracker.google.com/issues?q=status:(open%20%7C%20new%20%7C%20assigned%20%7C%20accepted)%20componentid:1161680&p=1)
+or
+[to update an existing one](https://issuetracker.google.com/issues?q=status:(open%20%7C%20new%20%7C%20assigned%20%7C%20accepted)%20componentid:1161653).
+Reference the existing user-uploaded asset id and make sure the asset is
+publicly readable.
+
+1. Get a general confirmation from the Earth Engine Data team that the dataset
+will be accepted.
+
+1. Choose a public dataset id that the data will be mirrored to.
+
+1. Wait until the Earth Engine Data team configures asset mirroring.
+
+1. If this is your first time creating a pull request (PR) for EE datasets,
+create and submit a trivial PR modifying [this file](spaces.md) (for example,
+add or remove a space). This will make it easier to run automated checks on
+your real PRs later.
+
+1. Create a jsonnet file describing the dataset using any of the existing files
+as a starting point. See also [template files with field
+annotations](../catalog/TEMPLATE).
+
+1. Make sure the `gee:terms_of_use` field describes the data license and the
+`links` field contains `ee.link.license()` pointing at the URL with the
+licensing terms.
+
+1. A new dataset will not be activated at first. To indicate that, set
+`'gee:skip_indexing': true` at the top level. Don't yet add the new dataset to
+the `catalog.jsonnet` file.
+
+1. In the examples/ directory, create a JavaScript file that will be used as
+the main example.
+
+1. In the same directory, create another JavaScript file that generates a
+256x256 preview thumbnail. This thumbnail will be used in the catalog to
+identify the dataset, so choose a representative and good-looking
+visualization. Make sure to hide the basemap (e.g., by using a single-color
+background).
+
+1. Create a GitHub pull request with all the files you changed or added.
+
+1. This will trigger automatic syntax and validity checks. Their results can be
+seen in the "Checks" section of the pull request UI. Fix as many issues as you
+can, and ask the Earth Engine Data team for help with the rest.
+
+1. When all the checks pass, ask the Earth Engine Data team to review the PR.
+
+1. When the PR is approved, submit it. Wait for the Earth Engine Data team to
+activate the dataset (this usually just means adding the thumbnails generated
+by the preview script).
+
+1. If your jsonnet file specified classification band colors or image
+attributes that you'd like to preserve on catalog datasets, make sure the Earth
+Engine Data team runs the mirroring job once again to set those fields.
+
+1. Review the dataset page in the [HTML catalog](https://developers.google.com/earth-engine/datasets) to make sure everything looks good.
+
+## Data normalization
+
+One of the benefits of Earth Engine for end users is uniform presentation of
+data with few surprises. This is achieved by making sure the datasets are
+normalized to a common form as much as possible during the
+ingestion/preparation phase, which means a little bit more work up front for
+data producers.
+
+Here is some advice for data normalization.
+
+1. For global datasets, prefer single assets over tiled mosaics.
+
+1. Images with the same band signatures should be in the same image collection.
+However, collections should be homogeneous: if not all assets in a collection
+have the same band names and types, either reingest the assets to make the
+bands the same or use multiple collections.
+
+1. Use human-readable band names, not the default 'b1', 'b2', etc.
+
+1. Set UTC start and end times on all assets.
+
+1. Make sure bands with non-continuous values (e.g., classification or bitmask
+bands) are ingested with the pyramiding policy MODE, not the default policy
+MEAN.
+
+1. Don't mix continuous and classification values in the same band; create two
+separate bands in such cases.
+
+1. If your datasets have multiple versions, create successor/predecessor links
+using a versioning approach [similar to this
+one](https://github.com/google/earthengine-catalog/tree/main/catalog/UMD): put
+a version map into a file named `dataset.libsonnet`, then use this map in
+every jsonnet file. Mark all but the most recent version with `deprecated:
+true`.
+
+1. Don't create new single-use keywords. If you feel a new keyword would make sense, propose other existing datasets where it should also be added.
diff --git a/docs/spaces.md b/docs/spaces.md
new file mode 100644
index 000000000..c611cb474
--- /dev/null
+++ b/docs/spaces.md
@@ -0,0 +1 @@
+Add or remove spaces here. To be used for trivial PRs from first-time contributors.