docs: clean up, update, mention all migrate scripts

move taxos section to sub-readme
cca · Jun 27, 2024 · 2674215 · 2674215
1 parent a942d04
commit 2674215
Show file tree

Hide file tree

Showing 2 changed files with 60 additions and 54 deletions.
diff --git a/readme.md b/readme.md
@@ -2,7 +2,7 @@
 
 Tools, ideas, and data.
 
-Semantics: EQUELLA objects are _items_ with _attachments_. Invenio objects are _records_ with _files_. EQUELLA has taxonomies; Invenio has vocabularies. We try to use these terms consistently so it's clear what format an object is in (e.g. `python migrate/record.py item.json > record.json` converts an _item_ into a _record_).
+Semantics: EQUELLA objects are _items_ with _attachments_. Invenio objects are _records_ with _files_. EQUELLA has _taxonomies_; Invenio has _vocabularies_. We use these terms consistently so it's clear what format an object is in (e.g. `python migrate/record.py item.json > record.json` converts an _item_ into a _record_).
 
 ## Setup & Tests
 
@@ -13,68 +13,35 @@ python -m spacy download en_core_web_lg # download spacy model for Named Entity
 pytest -v migrate/tests.py # run tests
 ```
 
-The migrate/record.py file is a class for Invenio records. On the command line, you can pass it JSON EQUELLA item(s) and it will return a JSON Invenio record(s). The migrate/api.py file will create Invenio records from EQUELLA items and attempt to POST them to a local Invenio instance. It requires `INVENIO_TOKEN` or `TOKEN` variable in our environment or .env file. To create a token: sign in as an admin and go to Applications > Personal access tokens.
+Migrate scripts that create records require an `INVENIO_TOKEN` or `TOKEN` variable in our environment or .env file. To create a token: sign in as an admin and go to Applications > Personal access tokens.
 
 ## Vocabularies
 
-Invenio uses vocabularies to represent a number of fixtures beyond just subject headings, like names and user accounts. They're stored in the app_data directory and loaded only when an instance is initialized. Lists in contribution wizards and EQUELLA taxonomies might be mapped to vocabularies.
+Invenio uses [vocabularies](https://inveniordm.docs.cern.ch/customize/vocabularies/) to represent a number of fixtures beyond just subject headings, like names, description types, and creator roles. They're stored under the app_data directory and loaded when an instance is initialized. Many of our controlled lists in contribution wizards and EQUELLA taxonomies will be mapped to vocabularies.
 
-Two ways to export vocabularies:
+The **taxos** dir contains exported EQUELLA taxonomies and tools for working with them. The **vocab** dir contains YAML files for Invenio vocabularies.
 
-- `eq tax --name $NAME --terms` e.g. `eq tax --name 'LIBRARIES - cca/c subject' --terms > taxonomies/libraries-ccac-subject.json`
-- Admin Console > Taxonomies > Select taxo and **Export**. Yields a zip archive which has each term as an XML file in a nested directory structure. It's less easy to work with but exports hierarchical taxonomies in one go.
+### Subjects
 
-```sh
-eq tax --path '?length=500' > all.json # download list of all taxos
-# download all LIBRARIES taxos (top-level terms only)
-for t in (jq -r .results[].name all.json | grep LIBRARIES);
-    eq tax --name $t --terms > $t.json
-end
-```
-
-One way to download a (two-tier) hierarchical taxonomy (archives series). This doesn't quite work because some archives series parent terms (VI. Photographs, VIII. Periodicals and Other Publications, IX. Graduate and Undergraduate Theses) have no children so they end up as empty arrays.
-
-```sh
-eq tax 347c7b42-594a-4a7d-8d73-3054ab05a079 --terms > "Archives Series.json"
-set idx 1
-for term in (jq -r .[].term "Archives Series.json")
-    eq tax 347c7b42-594a-4a7d-8d73-3054ab05a079 --term $term > $idx.tmp.json
-    set idx (math $idx + 1)
-end
-mlr --json cat *.tmp.json > taxo.json
-rm *.tmp.json
-```
-
-Personal names are nested two layers deep in the "LIBRARIES - subject name" taxonomy: authority > personal or conference > name, e.g. "oclc\personal\Ferrea, Elizabeth, 1880-1925".
+We create two subject vocabularies: one for Library of Congress subjects with URIs from one of their authorities and one for CCA local subjects not present in any LC authority.
 
-```sh
-eq tax 657fdbec-c17c-4497-aa31-5b4bc7d9e9e5 --terms > "subject name.json"
-set idx 1
-for term in (jq -r .[].term "subject name.json")
-    eq tax 657fdbec-c17c-4497-aa31-5b4bc7d9e9e5 --term $term\\personal > $idx.tmp.json
-    set idx (math $idx + 1)
-end
-mlr --json cat *.tmp.json > taxo.json
-rm *.tmp.json
-```
-
-`mlr` is [miller](https://miller.readthedocs.io/) (`brew install miller`). It's possible to concatenate the JSON files with `jq` or editing them together, but not as easy.
-
-## Subjects
+Download our [subjects sheet](https://docs.google.com/spreadsheets/d/1la_wsFPOkHLjpv4-f3tWwMsCd0_xzuqZ5xp_p1zAAoA/edit#gid=1465207925) and run `python migrate/mk_subjects.py data/subjects.csv` to create the YAML vocabularies in the vocab dir (lc.yaml and cca_local.yaml) as well as migrate/subjects_map.json which is used to convert the text of VAULT subject terms into Invenio identifiers or ID-less keyword subjects.
 
-Download our [subjects sheet](https://docs.google.com/spreadsheets/d/1la_wsFPOkHLjpv4-f3tWwMsCd0_xzuqZ5xp_p1zAAoA/edit#gid=1465207925) and run `python migrate/mk_subjects.py subjects.csv` to create the YAML vocabularies in the vocab dir (lc.yaml and cca_local.yaml) as well as the migrate/subjects_map.json file which is used to convert the text of VAULT subject terms into Invenio IDs or ID-less keyword subjects.
-
-Copy the YAML files into the app_data/vocabularies directory of our Invenio instance. The site needs to be rebuilt to load the changes (`invenio-cli services destroy` and then `invenio-cli services setup` again). Eventually (Invenio v12) there will be CLI terms to alter vocabularies without rebuilding the site.
+Copy the YAML vocabularies into the app_data/vocabularies directory of our Invenio instance. The site needs to be rebuilt to load the changes (`invenio-cli services destroy` and then `invenio-cli services setup` again). Eventually (Invenio v12) there will be a CLI command to alter vocabularies without rebuilding the site.
 
 ## Creating Records in Invenio
 
-This repo contains scripts to create JSON record files and then `POST` them to a locally running Invenio instance. First, create a personal access token for an administrator account in Invenio:
+- **migrate/record.py**: Converts EQUELLA item JSON into Invenio record JSON
+- **migrate/api.py**: Converts an item and `POST`s it to Invenio to create a record
+- **migrate/import.py**: Imports an item _directory_ (created by [the export tool](https://github.com/cca/equella_scripts/tree/main/collection-export)) with its attachments to Invenio
+
+To use these scripts, we must create a personal access token for an administrator account in Invenio:
 
 1. Sign in as an admin
 2. Go to **Applications** > **Personal access tokens**
-3. Create one—name and the solitary `user:email` scope (as of v11) do not matter
-4. Copy it to your clipboard and **Save**
-5. Set it as an environment variable e.g. `export INVENIO_TOKEN=your_token_here` in bash or `set -x INVENIO_TOKEN=xyz` in fish
+3. Create one—its name and the solitary `user:email` scope (as of v11) do not matter
+4. Copy it to clipboard and **Save**
+5. Paste in .env and/or set it as an env var, e.g. `set -x INVENIO_TOKEN=xyz` in fish
 
 Below, we migrate a VAULT item to an Invenio record and post it to Invenio.
 
@@ -88,8 +55,6 @@ HTTP 202
 https://127.0.0.1:5000/records/k7qk8-fqq15
 ```
 
-The api.py script uses the `Record` class in migrate/record.py to convent the provided VAULT item JSON into Invenio record JSON. It then does dual API calls to create a draft and publish it.
-
 You can sometimes trip over yourself because Poetry automatically loads the `.env` file in the project root, which might contain an outdated personal access token. If API calls fail with 403 errors, check that the `TOKEN` and/or `INVENIO_TOKEN` environment variables are set correctly.
 
 Rerunning the script with the same input creates multiple records, it doesn't update existing ones.
@@ -100,7 +65,7 @@ We could write scripts to directly take an item from EQUELLA using its API, perf
 
 We need to load the necessary fixtures, including user accounts, before adding to Invenio. For instance, the item owner needs to already be in Invenio before we can add them as owner of a record. If we attempt to load a record with a subject `id` that doesn't exist yet, we get a 500 error.
 
-We can download metadata for all items using equella-cli and a script like this:
+We download metadata for all items using equella-cli and a script like this:
 
 ```sh
 #!/usr/bin/env fish
@@ -113,12 +78,11 @@ for i in (seq 0 $pages)
   # NOTE: no attachment info, use "--info all" for both attachments & metadata
   eq search -l $length --info metadata --start $start > json/$i.json
 end
-
 ```
 
 ## Metadata Crosswalk
 
-We can use the `item.metadata` XML of existing VAULT items for testing. Generally, `poetry run python migrate.py items/item.json | jq` to see the JSON Invenio record. See [our crosswalk diagrams](https://cca.github.io/vault_migration/crosswalk.html).
+We can use the `item.metadata` XML of existing VAULT items for testing. Generally, `poetry run python migrate/record.py items/item.json | jq` to see the JSON Invenio record. See [our crosswalk diagrams](https://cca.github.io/vault_migration/crosswalk.html).
 
 Schemas:
 

diff --git a/taxos/readme.md b/taxos/readme.md
@@ -0,0 +1,42 @@
+# Taxonomies
+
+Two ways to export taxonomies:
+
+- `eq tax --name $NAME --terms` e.g. `eq tax --name 'LIBRARIES - cca/c subject' --terms > taxonomies/libraries-ccac-subject.json`
+- Admin Console > Taxonomies > Select taxo and **Export**. Yields a zip archive which has each term as an XML file in a nested directory structure. It's less easy to work with but exports hierarchical taxonomies in one go.
+
+```sh
+eq tax --path '?length=500' > all.json # download list of all taxos
+# download all LIBRARIES taxos (top-level terms only)
+for t in (jq -r .results[].name all.json | grep LIBRARIES);
+    eq tax --name $t --terms > $t.json
+end
+```
+
+One way to download a (two-tier) hierarchical taxonomy (archives series). This doesn't quite work because some archives series parent terms (VI. Photographs, VIII. Periodicals and Other Publications, IX. Graduate and Undergraduate Theses) have no children so they end up as empty arrays.
+
+```sh
+eq tax 347c7b42-594a-4a7d-8d73-3054ab05a079 --terms > "Archives Series.json"
+set idx 1
+for term in (jq -r .[].term "Archives Series.json")
+    eq tax 347c7b42-594a-4a7d-8d73-3054ab05a079 --term $term > $idx.tmp.json
+    set idx (math $idx + 1)
+end
+mlr --json cat *.tmp.json > taxo.json
+rm *.tmp.json
+```
+
+Personal names are nested two layers deep in the "LIBRARIES - subject name" taxonomy: authority > personal or conference > name, e.g. "oclc\personal\Ferrea, Elizabeth, 1880-1925".
+
+```sh
+eq tax 657fdbec-c17c-4497-aa31-5b4bc7d9e9e5 --terms > "subject name.json"
+set idx 1
+for term in (jq -r .[].term "subject name.json")
+    eq tax 657fdbec-c17c-4497-aa31-5b4bc7d9e9e5 --term $term\\personal > $idx.tmp.json
+    set idx (math $idx + 1)
+end
+mlr --json cat *.tmp.json > taxo.json
+rm *.tmp.json
+```
+
+`mlr` is [miller](https://miller.readthedocs.io/) (`brew install miller`). It's possible to concatenate the JSON files with `jq` or editing them together, but not as easy.