There are two scripts here: getpypidata.py and build_index.py.
First, run getpypidata.py, which fetches the JSON metadata for every project from PyPI:
$ python getpypidata.py
Fetched: 1475, Total: 52757, Percent: 2.8, Errors: 0
The script writes protobufs, together with some metadata such as the timestamp and the URL used for fetching, into ./raw. After it has run for a few hours, the directory should look like this:
$ ls raw/ | wc -l
52849
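The fetch step can be sketched roughly as follows. This is a minimal illustration, not the code in getpypidata.py: it assumes the public PyPI JSON endpoint (https://pypi.org/pypi/<name>/json), keeps the record as a plain dict instead of a protobuf, and the names fetch_project and progress_line are hypothetical.

```python
import json
import time
import urllib.error
import urllib.request

# Assumed endpoint: the public PyPI JSON API, one document per project.
PYPI_JSON_URL = "https://pypi.org/pypi/{name}/json"


def fetch_project(name, opener=urllib.request.urlopen):
    """Fetch one project's JSON metadata.

    Returns a dict holding the URL used, a fetch timestamp, and the parsed
    payload, or None if the server answers with an HTTP error (for example
    when a project has no JSON metadata).
    """
    url = PYPI_JSON_URL.format(name=name)
    try:
        with opener(url) as response:
            body = response.read()
    except urllib.error.HTTPError:
        return None
    return {"url": url, "timestamp": time.time(), "payload": json.loads(body)}


def progress_line(fetched, total, errors):
    """Format a status line in the style printed by the fetch script."""
    percent = 100.0 * fetched / total
    return (
        f"Fetched: {fetched}, Total: {total}, "
        f"Percent: {percent:.1f}, Errors: {errors}"
    )
```

The opener parameter exists only so the function can be exercised without network access; in normal use it defaults to urllib.request.urlopen.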
There is an example script, build_index.py, which writes an HDF file with two pandas DataFrames: one for the package metadata (license etc.) and one for all the versions plus their distribution type (sdist, wheel, etc.):
$ python build_index.py
Note that, for reasons unknown, a fair number of projects don't have JSON metadata attached. We're ignoring those.
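To illustrate the shape of the two tables, here is a rough sketch of how the index could be assembled. It is not the code in build_index.py. It assumes each payload follows the PyPI JSON layout (an "info" object, and a "releases" object mapping each version to a list of files with a "packagetype" field). The function names, the output file name index.h5, and the HDF keys "packages" and "versions" are all hypothetical.

```python
def flatten_project(payload):
    """Split one project's JSON metadata into a package row and version rows."""
    info = payload["info"]
    package_row = {"name": info["name"], "license": info.get("license")}
    # One row per uploaded file; versions without any files yield no rows.
    version_rows = [
        {
            "name": info["name"],
            "version": version,
            "packagetype": released_file.get("packagetype"),
        }
        for version, files in payload.get("releases", {}).items()
        for released_file in files
    ]
    return package_row, version_rows


def build_rows(payloads):
    """Collect rows for both tables, skipping projects without metadata."""
    packages, versions = [], []
    for payload in payloads:
        if payload is None:  # no JSON metadata attached: ignored
            continue
        package_row, version_rows = flatten_project(payload)
        packages.append(package_row)
        versions.extend(version_rows)
    return packages, versions


def write_index(packages, versions, path="index.h5"):
    """Write both tables into one HDF file (needs pandas and PyTables)."""
    import pandas as pd  # imported here so the helpers above work without it

    pd.DataFrame(packages).to_hdf(path, key="packages")
    # to_hdf appends by default, so the second key lands in the same file.
    pd.DataFrame(versions).to_hdf(path, key="versions")
```

Keeping the flattening separate from the pandas step makes it easy to check the row layout before writing anything to disk.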