data_packages
See also discussion on Python datapackage package issues
We propose an API for installing data packages locally, so that libraries can find installed data packages.
We often find that we would like to run code tests or examples against fairly large datasets.
We can't usually put the datasets in the code repository because:
- The datasets may be too large;
- The dataset may be shared across projects, so the data would either have to go in both repositories, or one project would have to depend on the (development) tree of the other;
- The datasets may not be properly licensed, or have a license that is incompatible with the source code.
My own use case is that I need suites of example images in different image formats to test image loading in nibabel and nipy. In particular, I need a library of DICOM images of different manufacturers and modalities to test DICOM image conversion. These images will be too large for the code repository, and it may not be possible to release the images for public distribution.
The data needs to be versioned, because, as the code and examples evolve, the datasets may also evolve, usually adding more data, but maybe fixing broken data.
The code test or example may therefore need to specify a minimum version of the data that is compatible with the test or example, maybe:
>>> from datatool import find_package
>>> templates = find_package('nipy-templates', version='>=0.3')
We want to be able to keep a local copy of the data, because:
- If the data is too large for the repository, it will also likely be too large to do repeated downloads to a temporary directory;
- In the spirit of distributed version control, we would like to be able to work offline.
We would like an API that could do something like:
data-tool install "nipy-templates>=0.3"
This should let us run the code above (repeated here):
>>> from datatool import find_package
>>> templates = find_package('nipy-templates', version='>=0.3')
Here templates would be an object that knows where to find the data.
The install facility could be a later addition. For now we might be content to do something like:
curl -O http://an.example.org/data-packages/nipy-templates-0.3.zip
unzip nipy-templates-0.3.zip  # to nipy-templates-0.3 directory
data-tool pkg-path add nipy-templates-0.3
We might also want to have directories in which data package utilities expect to find data packages. For example:
data-tool container-path add .  # enclosing directory
curl -O http://an.example.org/data-packages/nipy-data-0.2.zip
unzip nipy-data-0.2.zip
curl -O http://an.example.org/data-packages/nipy-other-1.1.zip
unzip nipy-other-1.1.zip
Here nipy-data-0.2 and nipy-other-1.1 would be found by a data package utility because their enclosing directory is a registered container path.
- PACKAGE_PATH: a directory containing a data package. The directory will contain a datapackage.json file (see below);
- CONTAINER_PATH: a path that contains directories that are PACKAGE_PATHs.
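The two definitions above could translate into a small discovery routine. This is only a sketch: the function names is_package_path and packages_in_container are hypothetical, and the only rule taken from the text is that a PACKAGE_PATH is a directory holding a datapackage.json file.

```python
from pathlib import Path


def is_package_path(path):
    """Return True if `path` is a directory holding a datapackage.json file."""
    path = Path(path)
    return path.is_dir() and (path / "datapackage.json").is_file()


def packages_in_container(container_path):
    """List the PACKAGE_PATHs found directly inside a CONTAINER_PATH.

    Only immediate subdirectories are checked; nested containers are an
    open design question that this sketch does not try to answer.
    """
    container = Path(container_path)
    if not container.is_dir():
        return []
    return sorted(p for p in container.iterdir() if is_package_path(p))
```

With the curl and unzip example above, packages_in_container('.') would be expected to return the nipy-data-0.2 and nipy-other-1.1 directories, provided each contains a datapackage.json file.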
Imagine a code package my-code. It could be installed by me, a user, in my home directories, or it could be installed by the system administrator, in the system directories, for all users. I could have installed my own more recent copy on a system with my-code installed system-wide.
Imagine there are some tests and examples in my-code that require data; call this my-data.
If there is a system copy of my-code, there may also be a system copy of my-data. Even if I installed a user copy of my-code, I will still hope to use the system my-data if it has a high enough version to support the new tests and examples. If there is no system copy of my-data I will need to install my-data somewhere in my home space.
If we have install, then I might want to do:
data-tool install --user my-data
or:
data-tool install --system my-data
Similarly:
data-tool pkg-path add --user /path/to/data/my-data
and:
data-tool container-path add --user /path/to/data
We may well want to keep data packages under version control, perhaps using something like git annex.
The data will develop with the code using it, so it will be common for a developer to have the data packages checked out on their local machine.
The data files are large enough that it would be wasteful of space and time to have to make local package archives and then install them, in order to use the local data package repository.
Therefore we would like to be able to use the data package repository as a PACKAGE_PATH. Something like:
git clone git://my-provider.org/my-data.git
data-tool pkg-path add $PWD/my-data
In this case, we would like to be able to get the data package version from version control with tools like git describe:
$ data-tool version my-data
0.3.0-23-g123bca0
We can use the OKFN data package format, defined in detail in the data package spec.
This is a very simple format that makes a data package from a directory containing a datapackage.json file of given format.
The only absolute requirement for datapackage.json is that it should have a URL-usable name.
See Python datapackage for Python code for working with data packages following this format.
We might want to make a suite containing a set of packages for a particular use such as testing with my-code. Git submodules can help:
mkdir my-suite
cd my-suite
git init
git submodule add https://github.com/yarikoptic/nitest-balls1
git submodule add https://github.com/yarikoptic/nitest-balls2
git commit -m "Added some data packages"
cd ..
Register by recording the enclosing path as a CONTAINER_PATH:
data-tool container-path add my-suite
or specify that the record should go in the user configuration (see above):
data-tool container-path add --user my-suite
or in the system configuration (see below):
data-tool container-path add --system .
Unregister with:
data-tool container-path rm --user .
You can instead add the contained directories as PACKAGE_PATHs with:
data-tool pkg-path add my-suite/nitest-balls1
data-tool pkg-path add my-suite/nitest-balls2
For a package named nipy-templates:
>>> from datatool import find_package
>>> templates = find_package('nipy-templates')
>>> templates.path
'/usr/local/share/data/nipy-templates'
>>> templates.version
'0.1'
You can also specify a version string:
>>> templates = find_package('nipy-templates', '>=0.3')
Without a version string, find_package returns the package with the highest version.
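The selection rule described above might be sketched as follows. Everything here is an assumption made for illustration: the names version_key, parse_spec and select_version are hypothetical, the set of supported operators is a guess, and version_key is a simple stand-in ordering rather than the comparison rule the proposal finally adopts.

```python
import operator
import re

# Assumption: these are the comparison operators a version spec may use
_OPS = {">=": operator.ge, "<=": operator.le, "==": operator.eq,
        ">": operator.gt, "<": operator.lt}


def version_key(version):
    """Split a version into numeric and alphabetic parts for ordering.

    Numeric parts compare as integers, and sort before alphabetic parts
    found at the same position.
    """
    parts = re.findall(r"\d+|[a-z]+", version.lower())
    return [(0, int(p), "") if p.isdigit() else (1, 0, p) for p in parts]


def parse_spec(spec):
    """Split a spec such as '>=0.3' into (comparison function, version)."""
    match = re.match(r"\s*(>=|<=|==|>|<)\s*(.+?)\s*$", spec)
    if match is None:
        raise ValueError(f"Cannot parse version spec {spec!r}")
    return _OPS[match.group(1)], match.group(2)


def select_version(available, spec=None):
    """Return the highest version in `available` that satisfies `spec`.

    Returns None when nothing matches. With no spec, every version is a
    candidate, so the highest one wins.
    """
    candidates = list(available)
    if spec is not None:
        compare, wanted = parse_spec(spec)
        candidates = [v for v in candidates
                      if compare(version_key(v), version_key(wanted))]
    if not candidates:
        return None
    return max(candidates, key=version_key)
```

For example, select_version(['0.1', '0.3', '0.2']) picks '0.3', while select_version(['0.1', '0.2'], '>=0.3') finds nothing and returns None.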
You can get a package path from the command line too:
data-tool pkg-path find "nipy-templates>=0.3"
There is a utility to make data packages from files in a directory:
data-tool make-pkg .
This writes a default datapackage.json file (see below).
A data package is a directory with a configuration file called datapackage.json. This must specify the package name:
{ "name" : "nipy-templates" }
It may also specify version:
{ "name" : "nipy-templates", "version" : "0.1" }
If there is no "version", or the version is null, then the library should get this from version control of the package directory, or fail. So this:
{ "name" : "nipy-templates", "version" : null }
would cause datatool to try git describe in the first instance to get the package version, then the equivalent hg command [1]. If all of these fail, the package is not valid.
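The fallback described above could look something like this sketch. The function names vcs_version and read_package_info are hypothetical, only the git describe step is shown (the hg step is left out), and the choice of ValueError for an invalid package is an assumption.

```python
import json
import subprocess
from pathlib import Path


def vcs_version(package_path):
    """Try `git describe` in the package directory; return None on failure."""
    try:
        result = subprocess.run(
            ["git", "describe"], cwd=str(package_path),
            capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        # git is missing, the directory is not a repository, or no tag exists
        return None
    return result.stdout.strip() or None


def read_package_info(package_path):
    """Return (name, version) for the data package at `package_path`."""
    package_path = Path(package_path)
    with open(package_path / "datapackage.json") as fobj:
        info = json.load(fobj)
    name = info["name"]  # the one required field
    version = info.get("version")  # a missing field and null both give None
    if version is None:
        version = vcs_version(package_path)
    if version is None:
        raise ValueError(f"No version found for package {name!r}")
    return name, version
```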
Version comparisons use distutils.version.LooseVersion:
>>> from distutils.version import LooseVersion
>>> LooseVersion('1.3.1') > LooseVersion('1.3.0-519-ga1b925f')
True
By default datatool will strip an initial v before digits from the output of git describe; for example, git describe output of v0.1 will give version 0.1.
If you want a more complicated rule relating git describe to version, use vcs_version_regex:
{ "name" : "nipy-templates", "version" : null, "vcs_version_regex" : "rel-(.*)" }
vcs_version_regex is an extension to the data package spec.
vcs_version_regex is matched against the output of git describe and should capture a single group containing the version string, as in:
>>> import re
>>> git_describe_output = 'rel-0.1-111-g1234567'
>>> re.match('rel-(.*)', git_describe_output).groups()[0]
'0.1-111-g1234567'
This allows the package author to have their own preferred tag naming scheme.
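Putting the default rule and the regex rule together, a helper might look like the sketch below. The name version_from_describe is hypothetical, and raising ValueError when the regex does not match is an assumption rather than specified behaviour.

```python
import re


def version_from_describe(describe_output, vcs_version_regex=None):
    """Turn `git describe` output into a version string."""
    text = describe_output.strip()
    if vcs_version_regex is not None:
        match = re.match(vcs_version_regex, text)
        if match is None:
            raise ValueError(
                f"{text!r} does not match {vcs_version_regex!r}")
        # The first group carries the version string
        return match.group(1)
    # Default rule: drop an initial "v" only when a digit follows it
    return re.sub(r"^v(?=\d)", "", text)
```

So version_from_describe('v0.1') gives '0.1', and version_from_describe('rel-0.1-111-g1234567', 'rel-(.*)') gives '0.1-111-g1234567'.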
datapackage.json can also give MD5 hashes for the files in the archive:
{
  "name" : "nipy-templates",
  "resources" : [
    { "path" : "mni/T1.img", "hash" : "1ea8f4f1e41bc17a94602e48141fdbc8" },
    { "path" : "mni/T2.img", "hash" : "f41f2e1516d880547fbf7d6a83884f0d" }
  ]
}
See the data package spec for more detail on specifying resources.
Paths are always relative paths in Unix (/) format; the data package application will adapt Unix paths when validating MD5 hashes on Windows.
The verify command checks the MD5 sums if present:
data-tool verify nipy-templates
Or, from Python:
>>> from datatool import find_package
>>> templates = find_package('nipy-templates')
>>> result, message = templates.verify()
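A verification step along these lines might work as in the sketch below. The function names md5_of_file and verify_package and the wording of the messages are hypothetical; the (result, message) return shape follows the Python example above.

```python
import hashlib
import json
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_package(package_path):
    """Check the resource hashes listed in datapackage.json.

    Returns (result, message); result is True when every listed hash
    matches. Resources without a hash are skipped.
    """
    package_path = Path(package_path)
    with open(package_path / "datapackage.json") as fobj:
        info = json.load(fobj)
    problems = []
    for resource in info.get("resources", []):
        expected = resource.get("hash")
        if expected is None:
            continue  # nothing to check for this resource
        # Paths are stored in Unix format; joining the parts with pathlib
        # gives the right separator on Windows too
        target = package_path.joinpath(*resource["path"].split("/"))
        if not target.is_file():
            problems.append(f"missing file: {resource['path']}")
        elif md5_of_file(target) != expected.lower():
            problems.append(f"hash mismatch: {resource['path']}")
    if problems:
        return False, "; ".join(problems)
    return True, "all hashes match"
```

Reading in chunks keeps memory use flat, which matters for the large image files these packages are meant to hold.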
A data package will usually have both a Unix register executable and a Windows register.bat executable. Running these will register the PACKAGE_PATH in a specified configuration file (see below). For example, register might be:
#!/bin/bash
# Register the directory containing this script, passing on any options
data-tool pkg-path add "$(dirname "${BASH_SOURCE[0]}")" "$@"
The default locations for configuration files are (in order of decreasing precedence):
- Contents of the file named in the DATATOOL_CONFIG environment variable;
- Contents of datatool.ini in $HOME/.datatool (more generally, the directory returned by datatool.environment.get_user_dir());
- Contents of datatool.ini in /etc/datatool (more generally, the directory returned by datatool.environment.get_system_dir()).
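The lookup order in this list could be computed roughly as follows. This is a sketch: config_file_candidates is a hypothetical name, and the user_dir and system_dir arguments stand in for whatever get_user_dir() and get_system_dir() would return on a given platform.

```python
import os


def config_file_candidates(environ=None, user_dir=None,
                           system_dir="/etc/datatool"):
    """Return configuration file paths, highest precedence first."""
    environ = os.environ if environ is None else environ
    if user_dir is None:
        user_dir = os.path.join(os.path.expanduser("~"), ".datatool")
    candidates = []
    env_file = environ.get("DATATOOL_CONFIG")
    if env_file:
        # The environment variable, when set, outranks both ini files
        candidates.append(env_file)
    candidates.append(os.path.join(user_dir, "datatool.ini"))
    candidates.append(os.path.join(system_dir, "datatool.ini"))
    return candidates
```

Files in the returned list need not exist; a caller would skip the missing ones when reading.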
In general, values in files with higher precedence override values in files with lower precedence.
If values are lists, files with higher precedence prepend values to the list, so the files with higher precedence put values earlier in the list.
The configuration file can have a section data, with optional subfields package_containers and package_paths:
[data]
package_containers : /usr/local/share/nipy/dipy /usr/share/nipy/dipy
package_paths : /usr/local/share/data/nipy-templates /usr/local/share/data/nipy-data
package_paths take precedence over paths found in package_containers, but a path in a package_containers list, in a file with higher precedence, overrides package_paths in files with lower precedence. So, assuming this is a file with lower precedence than the file above:
[data]
package_paths : /usr/share/dipy/nipy-dicom
then if /usr/share/nipy/dipy/ contains the same nipy-dicom package, this package will override a package with the same name and version contained in /usr/share/dipy/nipy-dicom above.
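One way to read the precedence rules above is as a single ordered search list, sketched below. The name search_order and the list_container callback are hypothetical; the callback stands in for whatever lists the package directories inside a container.

```python
def search_order(configs, list_container):
    """Return candidate package directories, highest precedence first.

    `configs` is a list of dicts with optional 'package_paths' and
    'package_containers' lists, ordered from the highest precedence
    file to the lowest. `list_container` maps a container path to the
    package directories found inside it.
    """
    ordered = []
    for config in configs:
        # Within one file, explicit package paths come first ...
        ordered.extend(config.get("package_paths", []))
        # ... then the packages found in that file's containers
        for container in config.get("package_containers", []):
            ordered.extend(list_container(container))
    return ordered
```

With the two example files above, the nipy-dicom copy inside /usr/share/nipy/dipy lands earlier in the list than /usr/share/dipy/nipy-dicom, which is the override the text describes.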
The configuration files can also include other configuration files:
[data]
include : ~/data/other_data.json ~/data/more_data.json
package_paths : /usr/share/dipy/nipy-dicom
Values in included files take lower precedence than values in the file including them.
Tilde ~ will be expanded to the path of the user's home directory for all paths in the configuration file.
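Reading such a file could be done with the standard library configparser module, as in this sketch. The name read_data_section is hypothetical, and splitting entries on whitespace is an assumption about the list syntax, so paths containing spaces would need a different rule.

```python
import configparser
import os


def read_data_section(config_text):
    """Parse the [data] section of a datatool-style configuration text."""
    # Interpolation is switched off so "%" in a path is kept literally
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    fields = {"include": [], "package_containers": [], "package_paths": []}
    if not parser.has_section("data"):
        return fields
    for key in fields:
        value = parser.get("data", key, fallback="")
        # Assumption: entries are separated by whitespace or newlines.
        # Tilde is expanded to the user's home directory for every path.
        fields[key] = [os.path.expanduser(path) for path in value.split()]
    return fields
```

Files named in include would then be read the same way and merged at lower precedence than the including file.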
The default package container paths have the lowest precedence. The default package container paths are:
- $HOME/.datatool/data (more generally, the data subdirectory of the directory returned by datatool.environment.get_user_dir());
- /usr/share/datatool/data and /usr/local/share/datatool/data (more generally, the data subdirectories of the directories returned by datatool.environment.get_share_dirs()).
Footnotes
[1] Apparently the hg equivalent of git describe is something like hg log -r . --template '{latesttag}-{latesttagdistance}-h{node|short}\n'