diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 4392c0ea..32bf51e1 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -1,20 +1,22 @@ ## How to contribute -Thank you for considering contributing to trafilatura! +Thank you for considering contributing to Trafilatura! Your contributions make the software and its documentation better. + + +There are many ways to contribute, you could: + + * Improve the documentation: Write tutorials and guides, correct mistakes, or translate existing content. + * Find bugs and submit bug reports: Help making Trafilatura a robust and versatile tool. + * Submit feature requests: Share your feedback and suggestions. + * Write code: Fix bugs or add new features. + Here are some important resources: * [List of currently open issues](https://github.com/adbar/trafilatura/issues) (no pretention to exhaustivity!) + * [Roadmap and milestones](https://github.com/adbar/trafilatura/milestones) * [How to Contribute to Open Source](https://opensource.guide/how-to-contribute/) -There are many ways to contribute, you could: - - * Improve the documentation - * Find bugs and submit bug reports - * Submit feature requests - * Write tutorials or blog posts - * Write code - ## Submitting changes @@ -23,6 +25,9 @@ Please send a [GitHub Pull Request to trafilatura](https://github.com/adbar/traf **Working on your first Pull Request?** See this tutorial: [How To Create a Pull Request on GitHub](https://www.digitalocean.com/community/tutorials/how-to-create-a-pull-request-on-github) +A special thanks to all the [contributors](https://github.com/adbar/trafilatura/graphs/contributors) who have played a part in developing and enhancing Trafilatura. 
+ + ## Testing and evaluating the code @@ -35,7 +40,7 @@ See also the [tests Readme](tests/README.rst) for more information on the evalua -For further questions you can contact me by way of [GitHub issues](https://github.com/adbar/trafilatura/issues), [Twitter](https://twitter.com/adbarbaresi) or [E-Mail](https://adrien.barbaresi.eu/). +For further questions you can contact me by way of [GitHub issues](https://github.com/adbar/trafilatura/issues), [X](https://x.com/adbarbaresi) or [E-Mail](https://adrien.barbaresi.eu/). Thanks, diff --git a/README.rst b/README.rst index f39bf0a8..4443c9ad 100644 --- a/README.rst +++ b/README.rst @@ -1,11 +1,11 @@ -A Python package & command-line tool to gather text on the Web -============================================================== +Trafilatura: Discover and Extract Text Data on the Web +====================================================== .. image:: docs/trafilatura-logo.png - :alt: Logo as PNG image - :align: center - :width: 60% + :alt: Trafilatura Logo + :align: center + :width: 60% | @@ -33,7 +33,6 @@ A Python package & command-line tool to gather text on the Web :target: https://aclanthology.org/2021.acl-demo.15/ :alt: Reference DOI: 10.18653/v1/2021.acl-demo.15 - | .. image:: docs/trafilatura-demo.gif @@ -43,46 +42,52 @@ A Python package & command-line tool to gather text on the Web :target: https://trafilatura.readthedocs.org/ -Description ------------ +Introduction +------------ + -Trafilatura is a **Python package and command-line tool** designed to gather text on the Web. It includes discovery, extraction and text processing components. Its main applications are **web crawling, downloads, scraping, and extraction** of main texts, metadata and comments. It aims at staying **handy and modular**: no database is required, the output can be converted to various commonly used formats. 
+Trafilatura is a cutting-edge **Python package and command-line tool** designed to **gather text on the Web and simplify the process of turning raw HTML into structured, meaningful data**. It includes all necessary discovery and text processing components to perform **web crawling, downloads, scraping, and extraction** of main texts, metadata and comments. It aims at staying **handy and modular**: no database is required, the output can be converted to multiple commonly used formats. -Going from raw HTML to essential parts can alleviate many problems related to text quality, first by avoiding the **noise caused by recurring elements** (headers, footers, links/blogroll etc.) and second by including information such as author and date in order to **make sense of the data**. The extractor tries to strike a balance between limiting noise (precision) and including all valid parts (recall). It also has to be **robust and reasonably fast**, it runs in production on millions of documents. +Smart navigation and going from HTML bulk to essential parts can alleviate many problems related to text quality, first by **focusing on the actual content**, second by **avoiding the noise** caused by recurring elements (headers, footers etc.), and third by **making sense of the data** with information such as author and publication date. The extractor tries to strike a balance between limiting noise and including all valid parts. It also has to be **robust and reasonably fast** as it runs in production on millions of documents. -This tool can be **useful for quantitative research** in corpus linguistics, natural language processing, computational social science and beyond: it is relevant to anyone interested in data science, information extraction, text mining, and scraping-intensive use cases like search engine optimization, business analytics or information security. 
+The tool's versatility makes it useful for a wide range of applications leveraging web content for knowledge discovery such as **quantitative and data-driven approaches**. It is relevant to anyone interested in language modeling, data mining, information extraction. Scraping-intensive use cases include search engine optimization, business analytics and information security. Trafilatura is used in the academic domain, chiefly for data acquisition in corpus linguistics, natural language processing, and computational social science. Features ~~~~~~~~ -- Web crawling and text discovery: - - Focused crawling and politeness rules +- Advanced web crawling and text discovery: + - Focused crawling adhering to politeness rules - Support for sitemaps (TXT, XML) and feeds (ATOM, JSON, RSS) - - URL management (blacklists, filtering and de-duplication) -- Seamless and parallel processing, online and offline: - - URLs, HTML files or parsed HTML trees usable as input - - Efficient and polite processing of download queues - - Conversion of previously downloaded files -- Robust and efficient extraction: - - Main text (with LXML, common patterns and generic algorithms: jusText, fork of readability-lxml) + - Smart navigation and URL management (blacklists, filtering and deduplication) +- Parallel processing of online and offline input: + - Live URLs, efficient and polite processing of download queues + - Previously downloaded HTML files and parsed HTML trees +- Robust and customizable extraction of key elements: + - Main text (common patterns and generic algorithms like jusText and readability) - Metadata (title, author, date, site name, categories and tags) - - Formatting and structural elements: paragraphs, titles, lists, quotes, code, line breaks, in-line text formatting - - Comments (if applicable) -- Output formats: + - Formatting and structure: paragraphs, titles, lists, quotes, code, line breaks, in-line text formatting + - Optional elements: comments, links, images, tables + 
- Extensive configuration options +- Multiple output formats: - Text (minimal formatting or Markdown) - - CSV (with metadata, `tab-separated values `_) + - CSV (with metadata, tab-separated values) - JSON (with metadata) - XML (with metadata, text formatting and page structure) and `TEI-XML `_ -- Optional add-ons: +- Add-ons: - Language detection on extracted content - Graphical user interface (GUI) - Speed optimizations +- Actively maintained with support from the open-source community: + - Regular updates, feature additions, and optimizations + - Comprehensive documentation Evaluation and alternatives ~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Trafilatura consistently outperforms other open-source libraries in text extraction benchmarks, showcasing its efficiency and accuracy in extracting web content. + For more detailed results see the `benchmark `_ and `evaluation script `_. To reproduce the tests just clone the repository, install all necessary packages and run the evaluation script with the data provided in the *tests* directory. =============================== ========= ========== ========= ========= ====== @@ -113,60 +118,66 @@ Other evaluations: Usage and documentation ----------------------- -For more information please refer to `the documentation `_: +`Getting started with Trafilatura `_ is straightforward. For more information and detailed guides, visit `Trafilatura's documentation `_: - `Installation `_ - Usage: `On the command-line `_, `With Python `_, `With R `_ - `Core Python functions `_ -- Python Notebook `Trafilatura Overview `_ -- `Tutorials `_ +- Interactive Python Notebook: `Trafilatura Overview `_ +- `Tutorials and use cases `_ - `Text embedding for vector search `_ - `Custom web corpus `_ - `Word frequency list `_ For video tutorials see this Youtube playlist: -- `Web scraping how-tos and tutorials `_ +- `Web scraping tutorials and how-tos `_ License ------- -*Trafilatura* is distributed under the `GNU General Public License v3.0 `_. 
If you wish to redistribute this library but feel bounded by the license conditions please try interacting `at arms length `_, `multi-licensing `_ with `compatible licenses `_, or `contacting me `_. - -See also `GPL and free software licensing: What's in it for business? `_ +*Trafilatura* is distributed under the `GNU General Public License v3.0 `_. This license promotes collaboration in software development and ensures that Trafilatura's code remains publicly accessible. +If you wish to redistribute this library but are concerned about the license conditions, consider interacting `at arm's length `_, multi-licensing with `compatible licenses `_, or `contacting the author <#author>`_ for more options. - -Context -------- +For insights into GPL and free software licensing with emphasis on a business context, see `GPL and Free Software Licensing: What's in it for Business? `_ Contributing -~~~~~~~~~~~~ +------------ -Contributions are welcome! See `CONTRIBUTING.md `_ for more information. Bug reports can be filed on the `dedicated page `_. +Contributions of all kinds are welcome. Visit the `Contributing page `_ for more information. Bug reports can be filed on the `dedicated issue page `_. -Many thanks to the `contributors `_ who submitted features and bugfixes! +Many thanks to the `contributors `_ who extended the docs or submitted bug reports, features and bugfixes! -Roadmap -~~~~~~~ +Context +------- + +Developed with practical applications of academic research in mind, this software is part of a broader effort to derive information from web documents. Extracting and pre-processing web texts to the exacting standards of scientific research presents a substantial challenge. Web corpus construction involves numerous design decisions, this software package simplifies text data collection and enhances corpus quality. It is currently used to build `text databases for linguistic research `_. -For planned enhancements and relevant milestones see `issues page `_. 
+*Trafilatura* is an Italian word for `wire drawing `_ symbolizing the industrial-grade extraction, refinement and conversion process. Author ~~~~~~ -This effort is part of methods to derive information from web documents in order to build `text databases for research `_ (chiefly linguistic analysis and natural language processing). Extracting and pre-processing web texts to the exacting standards of scientific research presents a substantial challenge for those who conduct such research. Web corpus construction involves numerous design decisions, and this software package can help facilitate text data collection and enhance corpus quality. +Reach out via the `contact page `_ for inquiries, collaborations, or feedback. See also `Twitter/X `_ for the latest updates. +This work started as a PhD project at the crossroads of linguistics and NLP, this expertise has been instrumental in shaping Trafilatura over the years. It has first been released under its current form in 2019, its development is referenced in the following publications: - Barbaresi, A. `Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction `_, Proceedings of ACL/IJCNLP 2021: System Demonstrations, 2021, p. 122-131. - Barbaresi, A. "`Generic Web Content Extraction with Open-Source Software `_", Proceedings of KONVENS 2019, Kaleidoscope Abstracts, 2019. - Barbaresi, A. "`Efficient construction of metadata-enhanced web corpora `_", Proceedings of the `10th Web as Corpus Workshop (WAC-X) `_, 2016. +Citing Trafilatura +~~~~~~~~~~~~~~~~~~ + + +If you use Trafilatura in your research or projects, we kindly ask you to cite this work, here is how: + .. 
image:: https://img.shields.io/badge/DOI-10.18653%2Fv1%2F2021.acl--demo.15-blue :target: https://aclanthology.org/2021.acl-demo.15/ :alt: Reference DOI: 10.18653/v1/2021.acl-demo.15 @@ -175,7 +186,6 @@ This effort is part of methods to derive information from web documents in order :target: https://doi.org/10.5281/zenodo.3460969 :alt: Zenodo archive DOI: 10.5281/zenodo.3460969 - .. code-block:: shell @inproceedings{barbaresi-2021-trafilatura, @@ -189,21 +199,21 @@ This effort is part of methods to derive information from web documents in order } -You can contact me via my `contact page `_ or on `GitHub `_. - - Software ecosystem ~~~~~~~~~~~~~~~~~~ +This software is part of a larger ecosystem. It is employed in a variety of academic and development projects, demonstrating its versatility and effectiveness. Case studies and publications are listed on the `Used By documentation page `_. + +Jointly developed plugins and additional packages also contribute to the field of web data extraction and analysis: + .. image:: docs/software-ecosystem.png - :alt: Software ecosystem + :alt: Software ecosystem :align: center :width: 65% -*Trafilatura*: `Italian word `_ for `wire drawing `_. -`Known uses of the software `_. +Corresponding posts can be found on `Bits of Language `_. The blog covers a range of topics from technical how-tos, updates on new features, to discussions on text mining challenges and solutions. -Corresponding posts on `Bits of Language `_ (blog). +Impressive, you have reached the end of the page: Thank you for your interest! diff --git a/docs/crawls.rst b/docs/crawls.rst index 165fab52..9c272eaf 100644 --- a/docs/crawls.rst +++ b/docs/crawls.rst @@ -3,8 +3,8 @@ Web crawling .. meta:: :description lang=en: - This tutorial shows how to perform web crawling tasks with Python and on the command-line. - The Trafilatura package allows for easy focused crawling. + Dive deep into the web with Python and on the command-line. 
Trafilatura supports + focused crawling, enforces politeness rules, and navigates through websites. @@ -15,15 +15,12 @@ Another use of web crawlers is in Web archiving, which involves large sets of we Other applications include data mining and text analytics, for example building web corpora for linguistic research. -This page shows how to perform certain web crawling tasks with Python and on the command-line. The `trafilatura` package allows for easy focused crawling (see definition below). +Dive deep into the web with crawling techniques. Trafilatura supports focused crawling, adhering to politeness rules, and efficiently navigates through sitemaps and feeds. This page shows how to perform certain web crawling tasks with Python and on the command-line. The `trafilatura` package allows for easy focused crawling (see definition below). .. Web crawlers require resources to run, so companies want to make sure they are using their resources as efficiently as possible, so they must be selective. -*New in version 0.9. Still experimental.* - - Design decisions ---------------- diff --git a/docs/evaluation.rst b/docs/evaluation.rst index 7551ccbf..daffbd95 100644 --- a/docs/evaluation.rst +++ b/docs/evaluation.rst @@ -3,9 +3,10 @@ Evaluation .. meta:: :description lang=en: - This benchmark tests how Python tools work on extraction of text from HTML code. Trafilatura - performs significantly better than the other comparable libraries in internal and external - evaluations. + See how Python tools work on main text extraction from HTML pages (html2txt). + Trafilatura consistently outperforms other open-source libraries, + showcasing its accuracy in extracting web content. + Although text is ubiquitous on the Web, extracting information from web pages can prove to be difficult. 
Should the tooling be adapted to particular news outlets or blogs that are targeted (which often amounts to the development of web scraping tools) or should the extraction be as generic as possible to provide opportunistic ways of gathering information? @@ -50,7 +51,7 @@ Description **Errors**: The *boilerpy3*, *newspaper3k*, and *readabilipy* modules do not work without errors on every HTML file in the test set, probably because of malformed HTML, encoding or parsing bugs. These errors are ignored in order to complete the benchmark. -**Results**: The baseline beats a few systems, showing its interest. *justext* is highly configurable and tweaking its configuration (as it is done here) can lead to better performance than its generic settings. *goose3* is the most precise algorithm, albeit at a significant cost in terms of recall. The packages focusing on raw text extraction *html_text* and *inscriptis* are roughly comparable and achieve the best recall as they try to extract all the text. Rule-based approaches such as *trafilatura*'s obtain balanced results despite a lack of precision. Combined with an algorithmic approach they perform significantly better than the other tested solutions. +**Results**: The baseline beats a few systems, showing its interest. *justext* is highly configurable and tweaking its configuration (as it is done here) can lead to better performance than its generic settings. *goose3* is the most precise algorithm, albeit at a significant cost in terms of recall. The packages focusing on raw text extraction *html_text* and *inscriptis* are roughly comparable and achieve the best recall as they try to extract all the text. Rule-based approaches such as *trafilatura*'s obtain balanced results despite a lack of precision. Combined with an algorithmic approach they perform significantly better than the other tested solutions. 
Trafilatura consistently outperforms other open-source libraries, showcasing its efficiency and accuracy in extracting web content. **Roadmap**: Further evaluations will be run, including additional tools and languages. Comment extraction still has to be evaluated, although most libraries do not offer this functionality. diff --git a/docs/installation.rst b/docs/installation.rst index a83acc8f..3f1e2f07 100644 --- a/docs/installation.rst +++ b/docs/installation.rst @@ -1,6 +1,12 @@ Installation ============ +.. meta:: + :description lang=en: + Setting up Trafilatura is straightforward. This installation guide walks you through the process step-by-step. + + +Setting up Trafilatura is straightforward. This installation guide walks you through the process step-by-step. Python diff --git a/docs/quickstart.rst b/docs/quickstart.rst index 5c9f34da..1bac2a92 100644 --- a/docs/quickstart.rst +++ b/docs/quickstart.rst @@ -2,7 +2,10 @@ Quickstart ========== -Primary installation method is with a Python package manager: ``pip install trafilatura``. See `installation documentation `_. +Trafilatura simplifies the process of turning raw HTML into structured, meaningful data. Getting started with it is straightforward. This page offers a walkthrough of its main functions. + + +Primary installation method is with a Python package manager: ``pip install trafilatura``. For more details see `installation documentation `_. With Python @@ -74,4 +77,12 @@ Extraction options are also available on the command-line, they can be combined: $ < myfile.html trafilatura --json --no-tables + +Further steps +------------- + + For more information please refer to `usage documentation `_ and `tutorials `_. + +.. 
hint:: + Explore Trafilatura's features interactively with this Python Notebook: `Trafilatura overview `_ diff --git a/docs/settings.rst b/docs/settings.rst index 5a2e7b7d..90e39777 100644 --- a/docs/settings.rst +++ b/docs/settings.rst @@ -1,13 +1,13 @@ -Default settings -================ +Settings and customization +========================== .. meta:: :description lang=en: - This documentation page explains how to adjust Trafilatura's default settings - for downloads and text extraction, along with examples for Python and the command-line. + Tailor Trafilatura to your needs. Its modular design and configuration options allow for + extensive customization. See examples for Python and the command-line. -There are two different files which can be edited in order to modify the default download and extraction settings: +Tailor Trafilatura to your needs, its modular design and configuration options allow for extensive customization. In a nutshell, there are two main files which can be edited in order to modify the default download and extraction behavior: 1. ``settings.cfg`` (values designed to be adapted by the user) 2. ``settings.py`` (package-wide settings, advanced) diff --git a/docs/tutorials.rst b/docs/tutorials.rst index 32126735..694b6a42 100644 --- a/docs/tutorials.rst +++ b/docs/tutorials.rst @@ -2,6 +2,9 @@ Tutorials ========= +Learn through practical examples. The following tutorials cover various scenarios, from text embedding for vector search to building custom web corpora and generating word frequency lists. + + .. toctree:: :maxdepth: 2 diff --git a/docs/usage-cli.rst b/docs/usage-cli.rst index 66847f71..989268c5 100644 --- a/docs/usage-cli.rst +++ b/docs/usage-cli.rst @@ -3,14 +3,14 @@ On the command-line .. meta:: :description lang=en: - This tutorial focuses on text extraction from HTML web pages without writing code. - Bulk parallel processing and data mining are also described. + Trafilatura offers a robust CLI. 
Learn how to download and extract text from HTML web pages without writing code, + including parallel processing and data mining capabilities. Introduction ------------ -Trafilatura includes a `command-line interface `_ and can be conveniently used without writing code. +Trafilatura offers a robust `command-line interface `_ and can be conveniently used without writing code. Learn how to perform various tasks and leverage the full power of the tool from the terminal. For the very first steps please refer to this multilingual, step-by-step `Introduction to the command-line interface `_ and this `section of the Introduction to Cultural Analytics & Python `_. diff --git a/docs/usage-r.rst b/docs/usage-r.rst index 469b7723..72fa5097 100644 --- a/docs/usage-r.rst +++ b/docs/usage-r.rst @@ -1,12 +1,19 @@ With R ====== +.. meta:: + :description lang=en: + Trafilatura extends its download and extraction capabilities to the R community. + Discover how to use Trafilatura in your R projects with this dedicated guide. + Introduction ------------ -`R `_ is a free software environment for statistical computing and graphics. The `reticulate `_ package provides a comprehensive set of tools for seamless interoperability between Python and R. It basically allows for execution of Python code inside an R session, so that Python packages can be used with minimal adaptations, which is ideal for those who would rather operate from R than having to go back and forth between languages and environments. +`R `_ is a free software environment for statistical computing and graphics. Trafilatura extends its capabilities to the R community. Discover how to use Trafilatura in your R projects with this dedicated guide. + +The `reticulate `_ package provides a comprehensive set of tools for seamless interoperability between Python and R. 
It basically allows for execution of Python code inside an R session, so that Python packages can be used with minimal adaptations, which is ideal for those who would rather operate from R than having to go back and forth between languages and environments. The package provides several ways to integrate Python code into R projects: diff --git a/docs/used-by.rst b/docs/used-by.rst index e38925f2..0afaafbd 100644 --- a/docs/used-by.rst +++ b/docs/used-by.rst @@ -3,18 +3,23 @@ Uses & citations .. meta:: :description lang=en: - Trafilatura is used at several institutions, included in other software packages and cited in research publications. This page lists projects and publications mentioning the library. + Trafilatura's versatility makes it ideal for a wide range of applications, it is included in other software packages and cited in research publications. Known uses and case studies are listed here. -Trafilatura is used at several institutions, included in other software packages and cited in research publications, especially in linguistics and natural language processing, social science, information science, and in the context of large language models. This page lists projects and publications mentioning the library. +Trafilatura is a gateway to understanding and leveraging web content for knowledge discovery and data-driven insights. The tool's versatility makes it ideal for a wide range of applications, from academic research in linguistics and social sciences to practical uses in SEO, business analytics, and cybersecurity. The tool has received accolades for its effectiveness, including being recognized as the most efficient open-source library in article extraction benchmarks and receiving praise in academic evaluations. -To add further references, please `edit this page `_ and suggest changes. 
+As such, it is used at several institutions, included in other software packages and cited in research publications, especially in linguistics and natural language processing, social science, information science, and in the context of large language models. This page lists projects and publications mentioning the library. + +If you wish to add further references, please `edit this page `_ and suggest changes. Notable projects using this software ------------------------------------ +Trafilatura has been employed in a variety of contexts and projects. Some of the known uses include academic research (e.g. data-driven studies in linguistics and social sciences), refinement of large language models (LLMs), applications like business analytics and search engine optimization, and further inclusion in other open source packages. + + Known institutional users ^^^^^^^^^^^^^^^^^^^^^^^^^ diff --git a/tests/unit_tests.py b/tests/unit_tests.py index 48fef5f0..8ee092dd 100644 --- a/tests/unit_tests.py +++ b/tests/unit_tests.py @@ -6,8 +6,10 @@ import logging import os import sys +import time import pytest + from lxml import etree, html try: @@ -1125,6 +1127,52 @@ class Person:\xa0 def __init__(self, name, age):\xa0\xa0\xa0 assert expected in testresult and 'quote' not in testresult +def test_mixed_content_extraction(): + """ + Test extraction from HTML with mixed content. + """ + html_content = '

Text here