Ben Teusch, @Facebook People Analytics, posted on LinkedIn about wanting a thematic text analysis of article content from David Green, an author on LinkedIn. The goal was to extract content from the author, as well as the articles this author linked to in their "top" and "best" article series. Automating this extraction poses a few challenges. To allow maximum flexibility in extracting the content, I opted to drive a Selenium browser and parse the pages with Beautiful Soup. Although Selenium is best known as test automation software, it is well suited to emulating a user in order to scrape content.
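To illustrate the approach, here is a minimal sketch of driving a Selenium browser and handing the rendered page to Beautiful Soup. The URL and parsing below are placeholders for illustration, not the project's actual code:

```python
# Minimal sketch of the Selenium + Beautiful Soup approach (illustrative only;
# the URL below is a placeholder and authentication is handled elsewhere).
from bs4 import BeautifulSoup
from selenium import webdriver

browser = webdriver.Chrome()  # emulate a real user in a full browser
browser.get("https://www.linkedin.com/pulse/some-article")  # placeholder URL

# Hand the rendered page source to Beautiful Soup for parsing
soup = BeautifulSoup(browser.page_source, "html.parser")
article_text = soup.get_text(separator=" ", strip=True)

browser.quit()
```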
This was written in Python 3.7 using PyCharm.
To run the code, you will need to import `from supportingFunctions import *`. The automation imports reside in `supportingFunctions.py`.
### Get Started
- You can simply view the data in the data folder
- Clone the repository
- Ensure you have a Python environment set up (PyCharm Community is fine; you can also use VS Code or another editor)
- Set up your own secrets file and import your secrets as variables (a sketch appears after the note below)
- Run the script to watch the selenium browser scrape data
- You can comment out `supportingFunctions.build_article_file_of_david_green_articles(articles, browser)` in `__init__.py` to skip the creation of the file, as shown in the sketch below
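For reference, skipping that step might look like this in `__init__.py`. This is only a sketch; how `articles` and `browser` are created earlier in the script is assumed, not shown:

```python
# Sketch of __init__.py with the article-file step disabled; the setup of
# `articles` and `browser` earlier in the script is assumed, not shown.
from supportingFunctions import *

# ... build the Selenium browser and collect the article list here ...

# Comment out this line to skip creating the article file:
# supportingFunctions.build_article_file_of_david_green_articles(articles, browser)
```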
Note: The secrets file referenced in `supportingFunctions.py` is used to automate authentication into LinkedIn with account credentials. If you submit a pull request on this code, please don't commit your account credentials. You can remove the secrets file from your staged changes with `git reset HEAD <secretsfile>`.
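A secrets file might look something like this; the variable names and values are assumptions for illustration, not the project's actual layout:

```python
# secrets file -- example layout only; keep it out of version control.
# The variable names here are assumptions, not the project's actual ones.
LINKEDIN_USERNAME = "your-email@example.com"
LINKEDIN_PASSWORD = "your-password"
```

Adding the file to `.gitignore` is another way to keep it from being committed by accident.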
If you would like to get involved, there are a few ways you can do so:
- Fork the code and tinker around with the script on your own
- Help build out a backlog of items that need to be worked on. You can visit the current backlog on the project board
- The project needs a front end, as well as scripting to clean up the article content and remove HTML from the article text (see the cleanup sketch after this list)
- Ask to join the project if you want to contribute with pull requests
- Submit issues if you discover bugs or think there are enhancements that would be beneficial to the project.
- Submit ideas for future projects or datasets
- If you prefer to work on a parallel project done in R, visit the project by Keith McNulty in the links below
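As an example of the HTML cleanup mentioned above, Beautiful Soup's `get_text()` can strip tags from scraped article text. This is a sketch assuming the raw article HTML is available as a string:

```python
from bs4 import BeautifulSoup

def strip_html(raw_article: str) -> str:
    """Remove HTML tags from a scraped article, keeping only the text."""
    soup = BeautifulSoup(raw_article, "html.parser")
    return soup.get_text(separator=" ", strip=True)

# Example:
# strip_html("<p>People analytics is <strong>growing</strong>.</p>")
# -> "People analytics is growing ."
```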
Below are a few resources for this project if you wish to get involved:
- Review the post that started the project
- Project Board to submit issues & requests
- CSV File for David Green's Top/Best Articles
- CSV File for David Green's 108 Articles and their text
- My LinkedIn (if you want to connect/contribute)
- View the R Shiny app in progress by Keith McNulty
@keithmcnulty | Keith McNulty for his R Shiny web scraper