Article extraction

A collection of functions for extracting the HTML and article text from a URL.

Example 1

Take the following URL:

    URL = "https://www.joelonsoftware.com/2006/09/01/language-wars/"

Using the library, the article text and HTML can be extracted:

    import article_extraction

    # Run the end-to-end extraction for the URL defined above.
    data = article_extraction.full_extraction(URL)

    print("Full HTML: ", data["html"])
    print("Article text: ", data["pattern_extraction"]["article_text"])

Functions

Function	Explanation
get_html(url)	Given a URL, returns the page HTML, fetched with urllib3.
pattern_article_extraction(url)	Extracts the article text using Pattern. Pattern works from the URL, not the HTML.
full_extraction(url)	Runs a complete end-to-end extraction using all the other functions.
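
To make the relationship between these functions concrete, here is a minimal sketch of how an end-to-end extraction could combine a fetch step and an article-extraction step into the result shape used in Example 1. It is not the library's actual implementation. The names `fetch_html_with_urllib3` and `full_extraction_sketch` and the `fetch_html` / `extract_article` parameters are hypothetical, and the Pattern step is left as a pluggable callable rather than a real Pattern call.

```python
from typing import Callable, Optional


def fetch_html_with_urllib3(url: str) -> str:
    """Hypothetical fetch step: download a page with urllib3 and return it as text."""
    import urllib3  # imported lazily so the sketch loads even without urllib3 installed

    http = urllib3.PoolManager()
    response = http.request("GET", url)
    # response.data holds the raw body bytes; decode leniently for display.
    return response.data.decode("utf-8", errors="replace")


def full_extraction_sketch(
    url: str,
    fetch_html: Callable[[str], str] = fetch_html_with_urllib3,
    extract_article: Optional[Callable[[str], str]] = None,
) -> dict:
    """Assumed composition: fetch the HTML, then extract the article text from the URL."""
    result = {"html": fetch_html(url)}
    if extract_article is not None:
        # The extractor receives the URL, not the HTML, matching the table above.
        result["pattern_extraction"] = {"article_text": extract_article(url)}
    return result


if __name__ == "__main__":
    # Stub callables stand in for the network and for Pattern.
    data = full_extraction_sketch(
        "https://example.com/post",
        fetch_html=lambda u: "<html><p>Hello</p></html>",
        extract_article=lambda u: "Hello",
    )
    print("Full HTML: ", data["html"])
    print("Article text: ", data["pattern_extraction"]["article_text"])
```

Passing the steps in as callables keeps the sketch runnable offline. A real extractor would replace the stubs.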

License

MIT

serenpa/article_extraction
