Add import and export of OAISTree #5

Open
pdurbin opened this issue May 16, 2019 · 10 comments
Assignees
Labels
info:help wanted Extra attention is needed pkg:models models related activities prio:low status:deferred Will be looked at later. type:feature New feature

Comments

@pdurbin
Member

pdurbin commented May 16, 2019

As I mentioned at IQSS/dataverse#5235 (comment) I'm curious if the "DVTree" (Dataverse Tree) format could be used to upload sample data to a brand new Dataverse installation for use in demos and usability testing.

I would love to see some docs. Or a pointer to the code for now. Thanks! 😄

@skasberger
Member

The structure is not defined yet; it's just a rough idea I had, inspired by @petermr's CTree structure. I definitely want to talk with some of the Dataverse devs about whether this would work, both right now and in the long run. The general idea is that the filenames and the structure of the folders and files alone tell you what should or can be inside and how to treat the different files. For example, every dataverse folder must have a metadata file, and the same goes for datasets. The content of the metadata file does not have to be strictly defined, but it will most likely have mandatory attributes. This can then be used to create a local export that is independent of the OS and of the programming language connecting to it, and that humans can use as well.

Here is my first draft of the structure:

Naming Conventions:

  • Dataverse: dv_IDENTIFIER, prefix dv_, id = alias
  • Dataset: ds_IDENTIFIER, prefix ds_, id = id
  • Datafile: FILENAME
├── dv_harvard/
│     ├── metadata.json
│     └── dv_iqss/
│            ├── metadata.json
│            └── ds_microcensus-2018/
│                    ├── metadata.json
│                    └── datafiles/
│                           ├── documentation.pdf
│                           └── data.csv
│     └── metadata.json
└── dv_aussda/
       └── ds_survey-labour-2016/
              ├── metadata.json
              └── datafiles/
                     ├── docs.pdf
                     └── data.tsv
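
A minimal sketch of how the naming conventions above could be read back from disk, assuming the draft layout exactly as written. `classify` and `walk_dvtree` are hypothetical helper names for illustration, not part of pyDataverse, and the "datafiles" and "metadata.json" names are taken from the draft tree.

```python
from pathlib import Path


def classify(entry: Path) -> str:
    """Classify one entry using only its name and type (draft conventions)."""
    if entry.is_dir():
        if entry.name.startswith("dv_"):
            return "dataverse"
        if entry.name.startswith("ds_"):
            return "dataset"
        if entry.name == "datafiles":
            return "datafiles"
        return "unknown"
    # Assumption: any file called metadata.json is metadata, everything
    # else is treated as a datafile.
    return "metadata" if entry.name == "metadata.json" else "datafile"


def walk_dvtree(root: Path):
    """Yield (POSIX-style relative path, kind) for every entry below root."""
    for entry in sorted(root.rglob("*")):
        yield entry.relative_to(root).as_posix(), classify(entry)


if __name__ == "__main__":
    # Point this at a local directory that follows the draft layout.
    for relative_path, kind in walk_dvtree(Path(".")):
        print(f"{kind:10} {relative_path}")
```

Because the classification depends only on names, a datafile that happens to be called metadata.json would be misread, which is one more reason the conventions need to be pinned down.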

Some open questions:

  • are the file naming conventions compatible? e.g. is it always possible to convert the dataverse alias to a filename string and store it on every operating system?
  • is the filename the best identifier for the datafiles, or is its hash better?
  • how to handle versioning? can a DVTree only hold one version, or should there be another level of folders, like v1/?
  • do we need to separate metadata into 1) general metadata and 2) metadata for API upload (add api.json or so)?
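
The first two questions can be explored with a small sketch. It assumes that "works on every operating system" means passing the Windows filename restrictions (the strictest of the common platforms) and that SHA-256 is an acceptable hash; both are assumptions, and `is_portable_name` and `file_sha256` are hypothetical helper names, not pyDataverse functions.

```python
import hashlib
import re
from pathlib import Path

# Characters that Windows rejects in file names, plus control characters.
_UNSAFE = re.compile(r'[<>:"/\\|?*\x00-\x1f]')
# Device names that Windows reserves, with or without an extension.
_RESERVED = {
    "CON", "PRN", "AUX", "NUL",
    *(f"COM{i}" for i in range(1, 10)),
    *(f"LPT{i}" for i in range(1, 10)),
}


def is_portable_name(name: str) -> bool:
    """Return True if name is usable as a single path component everywhere."""
    if not name or len(name.encode("utf-8")) > 255:
        return False
    # Trailing spaces and dots are dropped or rejected on Windows.
    if _UNSAFE.search(name) or name[-1] in " .":
        return False
    return name.split(".")[0].upper() not in _RESERVED


def file_sha256(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large datafiles need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # The DOI-style string is a made-up example of an id with ':' and '/'.
    for candidate in ("dv_harvard", "ds_microcensus-2018", "doi:10.5072/FK2/ABC"):
        print(candidate, is_portable_name(candidate))
```

A hash identifies content rather than a name, so two datafiles with identical bytes would collide on it while a renamed file would keep its identifier; which behaviour is wanted is part of the open question.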

@pdurbin
Member Author

pdurbin commented May 17, 2019

@skasberger thanks for this great write up! I just posted on the (ancient) "Round tripping the contents of DVN" thread at https://groups.google.com/d/msg/dataverse-community/07h0Ca-Ai1I/qyq3l-lakc0J with a link to this issue. I'm hoping to spur some good discussion. 😄

skasberger added a commit that referenced this issue May 20, 2019
@skasberger skasberger changed the title docs for "DVTree" (Dataverse Tree) format Draft DVTree Jun 22, 2019
@skasberger
Member

As mentioned several times at the Dataverse Community Conference: BagIt seems to be very similar, and could be a good inspiration. https://en.wikipedia.org/wiki/BagIt

@skasberger skasberger added discussion type:review Review info:help wanted Extra attention is needed pkg:models models related activities labels Jun 22, 2019
@skasberger skasberger self-assigned this Jul 2, 2019
@skasberger
Member

After developing the first proof of concept, I recommend renaming it to OAISTree (Open Archival Information System), because the related processes guide the directory structure and its conventions. Here is my current draft.

ROOT_DIR
├── YYYYMMDD_dataverses.csv
├── YYYYMMDD_datasets.csv
├── YYYYMMDD_datafiles.csv
├── terms-of-access.html
├── terms-of-use.html
├── PICKLE_FILE.pickle: Pickle Files
└── OAISTrees/
        ├── DATASET_ID/
        │      ├── DATASET_ID_history.json
        │      ├── SIP/
        │      │      └── RAW_DATA_FILENAME
        │      ├── AIP/
        │      │      ├── DATASET_ID_metadata.json
        │      │      └── DATASET_ID_DATAFILE_ID_metadata.json
        │      └── DIP/
        │             ├── terms-of-use.html
        │             └── terms-of-access.html
        ├── DATASET_ID/
        ├── DATASET_ID/
        └── DATASET_ID/
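
To make the placeholders in this draft concrete, here is a minimal sketch that computes the expected paths for one dataset without touching the filesystem. `oaistree_layout` and the keys of the returned dictionary are hypothetical names chosen for illustration; only the directory and file name patterns come from the draft.

```python
from pathlib import Path


def oaistree_layout(root, dataset_id, datafile_ids=(), raw_filenames=()):
    """Return the paths the draft layout expects for a single dataset."""
    base = Path(root) / "OAISTrees" / dataset_id
    return {
        "history": base / f"{dataset_id}_history.json",
        # SIP holds the raw data files under their original names.
        "sip": [base / "SIP" / name for name in raw_filenames],
        # AIP holds one metadata file for the dataset and one per datafile.
        "aip_dataset": base / "AIP" / f"{dataset_id}_metadata.json",
        "aip_datafiles": [
            base / "AIP" / f"{dataset_id}_{datafile_id}_metadata.json"
            for datafile_id in datafile_ids
        ],
        "dip": [
            base / "DIP" / "terms-of-use.html",
            base / "DIP" / "terms-of-access.html",
        ],
    }


if __name__ == "__main__":
    # "10001" and "42" are made-up ids.
    layout = oaistree_layout("ROOT_DIR", "10001", ["42"], ["data.csv"])
    for key, value in layout.items():
        print(key, value)
```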

@petermr

petermr commented Dec 2, 2019 via email

@skasberger
Member

Hi Peter,

not really dependent. It's a first proposal for a standardized folder/data structure for OAIS related data, which then can be used to convert data from and to different systems. We use Dataverse, but others use iRODS or other software solutions.

@petermr

petermr commented Dec 2, 2019 via email

@pdurbin
Member Author

pdurbin commented Dec 2, 2019

@skasberger this looks great. Have you considered how to support dataverses of arbitrary depth?

For example the dataset below about test taking is in a dataverse called "JOPD" which is inside a dataverse called "Ubiquity Press":

[Screenshot, 2019-12-02 10:48:41 AM]

For our "dataverse-sample-data" repo I ended up using nested directories to support this. The idea is that there can be an arbitrary number of "dataverses" directories to trigger the next level:

[Screenshot, 2019-12-02 10:49:15 AM]

I tend to have the sample data loaded up at https://dev2.dataverse.org if you'd like to take a look.

Here's the "sample data" repo: https://github.com/IQSS/dataverse-sample-data

In it I use pyDataverse! Thanks! 😄

@skasberger skasberger changed the title Draft DVTree Draft OAISTree Dec 4, 2019
@skasberger skasberger added this to the v0.3.0 milestone Jun 19, 2020
@skasberger skasberger modified the milestones: v0.3.0, Later Jun 26, 2020
@skasberger skasberger changed the title Draft OAISTree Add import and export of OAISTree Jun 26, 2020
@skasberger
Member

skasberger commented Jun 26, 2020

Some notes on how to develop the OAISTree. I already have some code for this running locally for AUSSDA purposes, so if you want to contribute, please get in touch with me first.

Workflow

  • Create Directory structure
  • Copy Raw data
  • Create History file
  • Create Dataset JSON
  • Create Datafile JSON
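
The five steps above can be sketched as follows. This is not the AUSSDA code mentioned in this comment; `export_dataset`, its arguments and the shape of the history file are assumptions made for illustration, and the dataset id is assumed to be filesystem-safe already.

```python
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path


def _write_json(path: Path, payload) -> None:
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")


def export_dataset(root, dataset_id, dataset_metadata, datafiles):
    """Write one dataset to an OAISTree-style directory.

    datafiles maps a datafile id to a (source path, metadata dict) pair.
    """
    base = Path(root) / "OAISTrees" / dataset_id

    # 1. Create directory structure
    for part in ("SIP", "AIP", "DIP"):
        (base / part).mkdir(parents=True, exist_ok=True)

    # 2. Copy raw data into SIP, remembering what was copied
    events = []
    for datafile_id, (source, _metadata) in datafiles.items():
        source = Path(source)
        shutil.copy2(source, base / "SIP" / source.name)
        events.append(
            {"action": "create", "datafile": datafile_id, "filename": source.name}
        )

    # 3. Create history file (hypothetical format)
    history = {"created": datetime.now(timezone.utc).isoformat(), "events": events}
    _write_json(base / f"{dataset_id}_history.json", history)

    # 4. Create dataset JSON
    _write_json(base / "AIP" / f"{dataset_id}_metadata.json", dataset_metadata)

    # 5. Create datafile JSON, one file per datafile
    for datafile_id, (_source, metadata) in datafiles.items():
        _write_json(
            base / "AIP" / f"{dataset_id}_{datafile_id}_metadata.json", metadata
        )

    return base
```

Calling `export_dataset(root, "10001", {"title": "Demo"}, {"42": ("data.csv", {"label": "data.csv"})})` with made-up ids would produce one SIP copy, two AIP metadata files and a history file with a single "create" event.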

Functionalities:

  • sub-folders in DIP: use the categories (data, documentation) or file tags/filenames to create sub-folders

Development

  • create classes to organize the OAISTree
  • Function names: from_oaistree(), to_oaistree()
  • Question: can the Dataverse alias, dataset id or datafile id always be used for directory or file naming? Look for a fitting solution for organizing this (look at BagIt for this). Must work on different operating systems.
  • Easy synchronization of pyDataverse objects and OAISTree.
  • Preservation:
    • how to manage delete and destroy of data?
      • new pid allowed?
      • history must be preserved!
  • Manage Create, Update and Delete steps: JSON creation, history.
  • Think together with history feature
  • integrate with history function (#43)

@skasberger skasberger added status:deferred Will be looked at later. status:wip Work in progress and removed discussion labels Jun 26, 2020
@skasberger skasberger added status:confirmed Is a valid issue and will be moved forward soon. and removed status:confirmed Is a valid issue and will be moved forward soon. labels Jul 21, 2020
@skasberger skasberger removed the status:wip Work in progress label Jul 21, 2020
@skasberger skasberger added type:feature New feature and removed type:review Review labels Jan 26, 2021
skasberger added a commit that referenced this issue Apr 7, 2021
@pdurbin
Member Author

pdurbin commented Mar 4, 2024

As discussed during the 2024-02-14 meeting of the pyDataverse working group, we are closing old milestones in favor of a new project board at https://github.com/orgs/gdcc/projects/1 and removing issues (like this one) from those old milestones. Please feel free to join the working group! You can find us at https://py.gdcc.io and https://dataverse.zulipchat.com/#narrow/stream/377090-python

@pdurbin pdurbin removed this from the Later milestone Mar 4, 2024