warc-metadata-sidecar

This library extracts metadata from WARC/ARC files into a metadata sidecar, converts the sidecar data into a CDXJ file, and then merges that CDXJ with a CDXJ created from the original WARC.

Installation

It is recommended to work in a virtual environment.

From the root folder of warc-metadata-sidecar, install the package:

$ pip install -e .

warc_metadata_sidecar.py

This script consumes a WARC or ARC file, reads each record of type response or resource, and creates a new metadata record for it, which is stored in a sidecar file. Each sidecar record includes the MIME type and PUID, plus the character set and language if they are found. The character set and language are only looked up for 'text' formats. The extension 'meta' is added to the sidecar file name (e.g. filename.warc.gz becomes filename.warc.meta.gz and file.arc.gz becomes file.warc.meta.gz).
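The naming rule above can be sketched in a few lines of Python. This is only an illustration of the two examples given, not the script's actual code: the helper name sidecar_name is hypothetical, and the sketch assumes gzip-compressed input names ending in .warc.gz or .arc.gz.

```python
def sidecar_name(filename: str) -> str:
    """Hypothetical helper: derive the sidecar file name from a WARC/ARC name.

    Assumption: only the two gzip-compressed patterns shown in this README
    are handled; anything else raises ValueError.
    """
    for suffix in (".warc.gz", ".arc.gz"):
        if filename.endswith(suffix):
            stem = filename[: -len(suffix)]
            # ARC input still yields a WARC sidecar, so both map to .warc.meta.gz
            return stem + ".warc.meta.gz"
    raise ValueError(f"unsupported file name: {filename}")


if __name__ == "__main__":
    print(sidecar_name("filename.warc.gz"))
    print(sidecar_name("file.arc.gz"))
```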

For usage instructions run:

$ warc_metadata_sidecar.py --help

Examples:

$ warc_metadata_sidecar.py example_dirname warc_filename.warc.gz

$ warc_metadata_sidecar.py dir_name file.warc.gz --operator 'Operator Name' --publisher 'Name'

sidecar2cdxj.py

This script will take the URI, timestamp, and fields from the payload of each metadata record in a sidecar file and write the data into a file using the CDXJ format.
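As a rough illustration of the output, a CDXJ line is generally a searchable key, a timestamp, and a JSON block separated by single spaces. The sketch below builds one such line. The function name cdxj_line and the field names in the example are assumptions made for illustration, and the key is passed in already converted (the real script may derive it from the URI differently).

```python
import json


def cdxj_line(key: str, timestamp: str, fields: dict) -> str:
    """Hypothetical helper: format one CDXJ line.

    key       -- searchable key for the URI (assumed to be precomputed)
    timestamp -- 14-digit capture timestamp, e.g. "20200101000000"
    fields    -- metadata taken from the sidecar record payload
    """
    return f"{key} {timestamp} {json.dumps(fields)}"


if __name__ == "__main__":
    # Field names here are illustrative, not the script's confirmed output.
    print(cdxj_line("com,example)/", "20200101000000",
                    {"mime": "text/html", "charset": "utf-8"}))
```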

For usage instructions run:

$ sidecar2cdxj.py --help

Example:

$ sidecar2cdxj.py sidecar_filename.warc.meta.gz directory_name

merge_cdxj.py

This script takes a CDXJ created from an original WARC and a metadata sidecar CDXJ, and matches their records by URI and timestamp. For each match, it collects certain fields from the metadata sidecar CDXJ (MIME type, PUID, charset, language, and soft-404), merges those fields with the original CDXJ data, and writes the merged data to a new CDXJ.
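The matching and merging step can be sketched as follows. This is a minimal illustration under stated assumptions, not the script's implementation: the function names, the JSON field names in SIDECAR_FIELDS, and the choice to let sidecar values overwrite same-named fields in the original are all assumptions.

```python
import json

# Assumed field names for illustration; the real sidecar CDXJ may use others.
SIDECAR_FIELDS = ("mime", "puid", "charset", "languages", "soft-404")


def split_cdxj(line: str):
    """Split one CDXJ line into (key, timestamp, parsed JSON fields)."""
    key, timestamp, payload = line.strip().split(" ", 2)
    return key, timestamp, json.loads(payload)


def merge_cdxj_lines(original_lines, sidecar_lines):
    """Return the original CDXJ lines with matching sidecar fields merged in."""
    # Index the sidecar records by (key, timestamp) for direct lookup.
    sidecar = {}
    for line in sidecar_lines:
        if line.strip():
            key, timestamp, fields = split_cdxj(line)
            sidecar[(key, timestamp)] = fields

    merged = []
    for line in original_lines:
        if not line.strip():
            continue
        key, timestamp, fields = split_cdxj(line)
        extra = sidecar.get((key, timestamp), {})
        for name in SIDECAR_FIELDS:
            if name in extra:
                # Assumption: the sidecar value replaces any existing value.
                fields[name] = extra[name]
        merged.append(f"{key} {timestamp} {json.dumps(fields)}")
    return merged
```

Original records with no matching sidecar entry pass through unchanged in this sketch; whether the real script keeps or drops them is not stated here.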

For usage instructions run:

$ merge_cdxj.py --help

Example:

$ merge_cdxj.py -m sidecar.cdxj -w original.cdxj -d directory_name

Testing

Install pytest:

$ pip install pytest

Then run the tests from the root folder:

$ pytest

License

See LICENSE.

Contributors

Gracie Flores-Hays
