
phila.gov-wordpress-scraper

Python CLI app that scrapes the phila.gov WordPress site to generate static HTML pages. Requires a WordPress API endpoint listing all WordPress-generated pages.
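WordPress core exposes a page listing through its standard REST API; a minimal sketch of fetching one page of results, assuming the source site uses the stock /wp-json/wp/v2/pages route (the host placeholder is hypothetical and corresponds to SCRAPER_HOST_FOR_URLS_AND_PAGES below):

  # Fetch the first 100 pages from the stock WordPress REST API.
  # <wordpress-host> is a placeholder; the exact endpoint this scraper
  # consumes is not documented here beyond "a WordPress API endpoint".
  curl "https://<wordpress-host>/wp-json/wp/v2/pages?per_page=100&page=1"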

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system.

Prerequisites

  • Docker
  • pipenv
  • The AWS CLI, with SSO access to the target account

Installing

  1. Set up your env.sh (see the sample env.sh after this list).
  • export SCRAPER_SLACK_URL=
  • export SCRAPER_HOSTNAMES_TO_FIND=
  • export SCRAPER_HOSTNAME_REPLACE=
  • export SCRAPER_HOST_FOR_URLS_AND_PAGES=
  • export SCRAPER_S3_BUCKET=
  • export SCRAPER_CLOUDFRONT_DISTRIBUTION=
  • export SCRAPER_CLOUDFRONT_MAX_INVALIDATIONS=
  • export SCRAPER_CLOUDFRONT_CLOUDWATCH_NAMESPACE=
  2. After installing Docker on your machine, cd into the project directory and run docker build . to create the image.
  3. pipenv shell to activate the shell.
  4. pipenv install to install project dependencies.*
  5. source env.sh to source your environment variables.
  6. aws configure sso inside the shell to connect to AWS.
  7. Configure the default region: export AWS_DEFAULT_REGION=us-east-1
  8. python phila_site_scraper.py to run the scraper locally. Note: running the scraper against production will pull down production resources.

* When updating dependencies, make sure both requirements.txt and the Pipfile are updated. The Dockerfile uses requirements.txt, while pipenv uses the Pipfile when testing locally.
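A minimal sample env.sh, using the example values from the Environment Variables table below (the hostnames, bucket, and distribution ID are the table's illustrative examples, not real deployment values):

  # env.sh: sample values only; substitute your deployment's real settings.
  export SCRAPER_SLACK_URL="https://hooks.slack.com/services/..."   # your alerts-channel webhook
  export SCRAPER_HOSTNAMES_TO_FIND="admin.phila.website|beta.phila.gov"
  export SCRAPER_HOSTNAME_REPLACE="www.phila.website"
  export SCRAPER_HOST_FOR_URLS_AND_PAGES="<wordpress-host>"         # no example given; your WordPress server host
  export SCRAPER_S3_BUCKET="www.phila.website"
  export SCRAPER_CLOUDFRONT_DISTRIBUTION="EAURQRDQU47EO"
  export SCRAPER_CLOUDFRONT_MAX_INVALIDATIONS=50
  export SCRAPER_CLOUDFRONT_CLOUDWATCH_NAMESPACE="test-cloudfront"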

Usage

Using local disk

python phila_site_scraper.py

Using S3

python phila_site_scraper.py --save-s3

Production

python phila_site_scraper.py --save-s3 --invalidate-cloudfront --notifications --publish-stats --heartbeat

Help

> python phila_site_scraper.py --help
Usage: phila_site_scraper.py [OPTIONS]

Options:
  --save-s3                       Save site to S3 bucket.
  --invalidate-cloudfront         Invalidates CloudFront paths that are
                                  updated.
  --logging-config TEXT           Python logging config file in YAML format.
  --num-worker-threads INTEGER    Number of workers.
  --notifications / --no-notifications
                                  Enable Slack/email error notifications.
  --publish-stats / --no-publish-stats
                                  Publish stats to Cloudwatch
  --heartbeat / --no-heartbeat    Cloudwatch heartbeat
  --help                          Show this message and exit.

Deployment

  1. Find phila-gov-wordpress-scraper in AWS ECR Repositories.
  2. Follow the View Push Commands instructions through step 2.
  3. docker tag phila-gov-wordpress-scraper:latest 676612114792.dkr.ecr.us-east-1.amazonaws.com/phila-gov-wordpress-scraper:GITCOMMITSHA - Create and tag a local version of the image. Replace GITCOMMITSHA in the above example with the commit SHA of the latest build. This essentially versions the image, instead of replacing the image tagged as LATEST (as AWS instructs).
  4. docker push 676612114792.dkr.ecr.us-east-1.amazonaws.com/phila-gov-wordpress-scraper:GITCOMMITSHA - Push the image to the ECR repository. Remember to replace GITCOMMITSHA with the SHA from the previous step.
  5. Log in to Terraform Enterprise and update the wordpress_scraper_image variable with the new scraper tag.
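Steps 3 and 4 as one shell sketch. It assumes you have already run the ECR login from the View Push Commands instructions, and derives GITCOMMITSHA with git rev-parse (whether the team uses the short or full SHA is not specified here, so adjust to match existing tags):

  # Tag the freshly built image with the current commit SHA and push it to ECR.
  GITCOMMITSHA=$(git rev-parse --short HEAD)
  docker tag phila-gov-wordpress-scraper:latest \
    676612114792.dkr.ecr.us-east-1.amazonaws.com/phila-gov-wordpress-scraper:$GITCOMMITSHA
  docker push 676612114792.dkr.ecr.us-east-1.amazonaws.com/phila-gov-wordpress-scraper:$GITCOMMITSHA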

Environment Variables

SCRAPER_SLACK_URL
  Example: https://hooks.slack.com/services/...
  A Slack webhook URL for an alerts channel.

SCRAPER_HOSTNAMES_TO_FIND
  Example: "admin.phila.website|beta.phila.gov"
  The hostnames to find for replacement in the scraped page content.

SCRAPER_HOSTNAME_REPLACE
  Example: www.phila.website
  The new website host.

SCRAPER_HOST_FOR_URLS_AND_PAGES
  The WordPress server host to scrape pages from.

SCRAPER_S3_BUCKET
  Example: www.phila.website
  The S3 bucket in which to store scraped pages.

SCRAPER_CLOUDFRONT_DISTRIBUTION
  Example: EAURQRDQU47EO
  For CloudFront cache invalidation, the distribution in front of the S3 bucket.

SCRAPER_CLOUDFRONT_MAX_INVALIDATIONS
  Example: 50
  Maximum number of invalidations to perform per run.

SCRAPER_CLOUDFRONT_CLOUDWATCH_NAMESPACE
  Example: 'test-cloudfront'
  A namespace for the scraper's CloudFront metrics.
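For reference, a manual equivalent of what --invalidate-cloudfront automates, via the AWS CLI. This is a blunt sketch: the scraper invalidates only the paths it updated, capped by SCRAPER_CLOUDFRONT_MAX_INVALIDATIONS, whereas this flushes everything, and the distribution ID is the table's example value:

  # Flush every cached path on the example distribution.
  aws cloudfront create-invalidation \
    --distribution-id EAURQRDQU47EO \
    --paths "/*"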

License

This project is licensed under the MIT License - see the LICENSE file for details.
