Ben Teusch, @Facebook People Analytics, posted on LinkedIn about wanting a thematic text analysis of article content from David Green, an author on LinkedIn. The goal was to extract content from the author, as well as the articles this author linked to in their "top" and "best" article series. Automating this extraction poses a few challenges. To allow maximum flexibility in extracting the content, I opted to drive a Selenium browser and parse the pages with Beautiful Soup. Although Selenium is best known as test automation software, it is well suited to emulating a user in order to scrape content.
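To illustrate the approach, here is a minimal sketch of driving a Selenium browser and handing the rendered page to Beautiful Soup. The URL and parsing below are placeholders for illustration, not the project's actual code:

```python
# Minimal sketch of the Selenium + Beautiful Soup approach (illustrative only;
# the URL below is a placeholder and authentication is handled elsewhere).
from bs4 import BeautifulSoup
from selenium import webdriver

browser = webdriver.Chrome()  # emulate a real user in a full browser
browser.get("https://www.linkedin.com/pulse/some-article")  # placeholder URL

# Hand the rendered page source to Beautiful Soup for parsing
soup = BeautifulSoup(browser.page_source, "html.parser")
article_text = soup.get_text(separator=" ", strip=True)

browser.quit()
```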
This was written in Python 3.7 using PyCharm.
To run the code, you will need to import `from supportingFunctions import *`. The automation imports reside in `supportingFunctions.py`.
### Get Started
- You can simply view the data in the data folder
- Clone the repository
- Ensure you have a Python environment set up (PyCharm Community is fine; you can also use VS Code or another editor)
- Set up your own secrets file and import your secrets as variables (a sketch appears after the note below)
- Run the script to watch the selenium browser scrape data
- You can comment out `supportingFunctions.build_article_file_of_david_green_articles(articles, browser)` in `__init__.py` to skip the creation of the file, as shown in the sketch below
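For reference, skipping that step might look like this in `__init__.py`. This is only a sketch; how `articles` and `browser` are created earlier in the script is assumed, not shown:

```python
# Sketch of __init__.py with the article-file step disabled; the setup of
# `articles` and `browser` earlier in the script is assumed, not shown.
from supportingFunctions import *

# ... build the Selenium browser and collect the article list here ...

# Comment out this line to skip creating the article file:
# supportingFunctions.build_article_file_of_david_green_articles(articles, browser)
```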
Note: The secrets file referenced in `supportingFunctions.py` is used to automate authentication into LinkedIn with account credentials. If you submit a pull request on this code, please don't commit your account credentials. You can remove the secrets file from your staged changes with `git reset HEAD <secretsfile>`.
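A secrets file might look something like this; the variable names and values are assumptions for illustration, not the project's actual layout:

```python
# secrets file -- example layout only; keep it out of version control.
# The variable names here are assumptions, not the project's actual ones.
LINKEDIN_USERNAME = "your-email@example.com"
LINKEDIN_PASSWORD = "your-password"
```

Adding the file to `.gitignore` is another way to keep it from being committed by accident.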
If you would like to get involved, there are a few ways you can do so:
- Fork the code and tinker around with the script on your own
- Help build out a backlog of items that need to be worked on. You can visit the current backlog on the project board
- The project needs a front end, as well as scripting to clean up the article content and remove HTML from the article text (see the cleanup sketch after this list)
- Ask to join the project if you want to contribute with pull requests
- Submit issues if you discover bugs or think there are enhancements that would be beneficial to the project.
- Submit ideas for future projects or datasets
- If you prefer to work on a parallel project done in R, visit the project by Keith McNulty in the links below
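As an example of the HTML cleanup mentioned above, Beautiful Soup's `get_text()` can strip tags from scraped article text. This is a sketch assuming the raw article HTML is available as a string:

```python
from bs4 import BeautifulSoup

def strip_html(raw_article: str) -> str:
    """Remove HTML tags from a scraped article, keeping only the text."""
    soup = BeautifulSoup(raw_article, "html.parser")
    return soup.get_text(separator=" ", strip=True)

# Example:
# strip_html("<p>People analytics is <strong>growing</strong>.</p>")
# -> "People analytics is growing ."
```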
Below are a few resources for this project if you wish to get involved:
- Review the post that started the project
- Project Board to submit issues & requests
- CSV File for David Green's Top/Best Articles
- CSV File for David Green's 108 Articles and their text
- My LinkedIn (if you want to connect/contribute)
- View the R Shiny app in progress by Keith McNulty
@keithmcnulty | Keith McNulty for his R Shiny web scraper