Enhancements to documentation and testing (#456)
* Refined README, Enhanced Tests, and Settings Optimization
  - Updated README with detailed features and guides.
  - Added advanced tests in unit_tests.py for robustness.
  - Optimized settings.py for adaptive performance.

* fix tests

* add psutil to setup (for now)

* fix lang test

* scrap untested lines

* scrap functions not used elsewhere in the code

* first attempt at merging/cleaning the readme

* docs: split content across subpages

* cleanup README

* scrap logger re-defined in settings

* last review prior to merge

---------

Co-authored-by: Adrien Barbaresi <[email protected]>
3 people authored Jan 8, 2024
1 parent 6a79511 commit 9fbf9fb
Showing 13 changed files with 185 additions and 88 deletions.
25 changes: 15 additions & 10 deletions CONTRIBUTING.md
@@ -1,20 +1,22 @@
## How to contribute

Thank you for considering contributing to trafilatura!
Thank you for considering contributing to Trafilatura! Your contributions make the software and its documentation better.


There are many ways to contribute. You could:

* Improve the documentation: Write tutorials and guides, correct mistakes, or translate existing content.
* Find bugs and submit bug reports: Help make Trafilatura a robust and versatile tool.
* Submit feature requests: Share your feedback and suggestions.
* Write code: Fix bugs or add new features.


Here are some important resources:

* [List of currently open issues](https://github.com/adbar/trafilatura/issues) (not an exhaustive list!)
* [Roadmap and milestones](https://github.com/adbar/trafilatura/milestones)
* [How to Contribute to Open Source](https://opensource.guide/how-to-contribute/)

There are many ways to contribute, you could:

* Improve the documentation
* Find bugs and submit bug reports
* Submit feature requests
* Write tutorials or blog posts
* Write code


## Submitting changes

@@ -23,6 +25,9 @@ Please send a [GitHub Pull Request to trafilatura](https://github.com/adbar/traf
**Working on your first Pull Request?** See this tutorial: [How To Create a Pull Request on GitHub](https://www.digitalocean.com/community/tutorials/how-to-create-a-pull-request-on-github)


A special thanks to all the [contributors](https://github.com/adbar/trafilatura/graphs/contributors) who have played a part in developing and enhancing Trafilatura.



## Testing and evaluating the code

@@ -35,7 +40,7 @@ See also the [tests Readme](tests/README.rst) for more information on the evalua



For further questions you can contact me by way of [GitHub issues](https://github.com/adbar/trafilatura/issues), [Twitter](https://twitter.com/adbarbaresi) or [E-Mail](https://adrien.barbaresi.eu/).
For further questions you can contact me by way of [GitHub issues](https://github.com/adbar/trafilatura/issues), [X](https://x.com/adbarbaresi) or [E-Mail](https://adrien.barbaresi.eu/).

Thanks,

110 changes: 60 additions & 50 deletions README.rst
@@ -1,11 +1,11 @@
A Python package & command-line tool to gather text on the Web
==============================================================
Trafilatura: Discover and Extract Text Data on the Web
======================================================


.. image:: docs/trafilatura-logo.png
:alt: Logo as PNG image
:align: center
:width: 60%
:alt: Trafilatura Logo
:align: center
:width: 60%

|
@@ -33,7 +33,6 @@ A Python package & command-line tool to gather text on the Web
:target: https://aclanthology.org/2021.acl-demo.15/
:alt: Reference DOI: 10.18653/v1/2021.acl-demo.15


|
.. image:: docs/trafilatura-demo.gif
@@ -43,46 +42,52 @@ A Python package & command-line tool to gather text on the Web
:target: https://trafilatura.readthedocs.org/


Description
-----------
Introduction
------------


Trafilatura is a **Python package and command-line tool** designed to gather text on the Web. It includes discovery, extraction and text processing components. Its main applications are **web crawling, downloads, scraping, and extraction** of main texts, metadata and comments. It aims at staying **handy and modular**: no database is required, the output can be converted to various commonly used formats.
Trafilatura is a cutting-edge **Python package and command-line tool** designed to **gather text on the Web and simplify the process of turning raw HTML into structured, meaningful data**. It includes all necessary discovery and text processing components to perform **web crawling, downloads, scraping, and extraction** of main texts, metadata and comments. It aims at staying **handy and modular**: no database is required, the output can be converted to multiple commonly used formats.

Going from raw HTML to essential parts can alleviate many problems related to text quality, first by avoiding the **noise caused by recurring elements** (headers, footers, links/blogroll etc.) and second by including information such as author and date in order to **make sense of the data**. The extractor tries to strike a balance between limiting noise (precision) and including all valid parts (recall). It also has to be **robust and reasonably fast**, it runs in production on millions of documents.
Smart navigation and going from HTML bulk to essential parts can alleviate many problems related to text quality, first by **focusing on the actual content**, second by **avoiding the noise** caused by recurring elements (headers, footers etc.), and third by **making sense of the data** with information such as author and publication date. The extractor tries to strike a balance between limiting noise and including all valid parts. It also has to be **robust and reasonably fast** as it runs in production on millions of documents.

This tool can be **useful for quantitative research** in corpus linguistics, natural language processing, computational social science and beyond: it is relevant to anyone interested in data science, information extraction, text mining, and scraping-intensive use cases like search engine optimization, business analytics or information security.
The tool's versatility makes it useful for a wide range of applications leveraging web content for knowledge discovery such as **quantitative and data-driven approaches**. It is relevant to anyone interested in language modeling, data mining, and information extraction. Scraping-intensive use cases include search engine optimization, business analytics and information security. Trafilatura is used in the academic domain, chiefly for data acquisition in corpus linguistics, natural language processing, and computational social science.


Features
~~~~~~~~

- Web crawling and text discovery:
- Focused crawling and politeness rules
- Advanced web crawling and text discovery:
- Focused crawling adhering to politeness rules
- Support for sitemaps (TXT, XML) and feeds (ATOM, JSON, RSS)
- URL management (blacklists, filtering and de-duplication)
- Seamless and parallel processing, online and offline:
- URLs, HTML files or parsed HTML trees usable as input
- Efficient and polite processing of download queues
- Conversion of previously downloaded files
- Robust and efficient extraction:
- Main text (with LXML, common patterns and generic algorithms: jusText, fork of readability-lxml)
- Smart navigation and URL management (blacklists, filtering and deduplication)
- Parallel processing of online and offline input:
- Live URLs, efficient and polite processing of download queues
- Previously downloaded HTML files and parsed HTML trees
- Robust and customizable extraction of key elements:
- Main text (common patterns and generic algorithms like jusText and readability)
- Metadata (title, author, date, site name, categories and tags)
- Formatting and structural elements: paragraphs, titles, lists, quotes, code, line breaks, in-line text formatting
- Comments (if applicable)
- Output formats:
- Formatting and structure: paragraphs, titles, lists, quotes, code, line breaks, in-line text formatting
- Optional elements: comments, links, images, tables
- Extensive configuration options
- Multiple output formats:
- Text (minimal formatting or Markdown)
- CSV (with metadata, `tab-separated values <https://en.wikipedia.org/wiki/Tab-separated_values>`_)
- CSV (with metadata, tab-separated values)
- JSON (with metadata)
- XML (with metadata, text formatting and page structure) and `TEI-XML <https://tei-c.org/>`_
- Optional add-ons:
- Add-ons:
- Language detection on extracted content
- Graphical user interface (GUI)
- Speed optimizations
- Actively maintained with support from the open-source community:
- Regular updates, feature additions, and optimizations
- Comprehensive documentation
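
The URL management item in the feature list above (blacklists, filtering and deduplication) can be pictured with a short standard-library sketch. This is an illustration only and not Trafilatura's implementation: the helper names ``normalize`` and ``filter_urls``, the normalization rules and the example hosts are all assumptions made for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url: str) -> str:
    """Lowercase scheme and host, drop the fragment, strip a trailing slash.

    Hypothetical helper: these normalization rules are an assumption,
    not the rules used by Trafilatura.
    """
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def filter_urls(urls, blacklist=frozenset()):
    """Return normalized URLs in first-seen order, skipping duplicates
    and any URL whose host is in the blacklist."""
    seen, kept = set(), []
    for url in urls:
        norm = normalize(url)
        host = urlsplit(norm).hostname or ""
        if host in blacklist or norm in seen:
            continue
        seen.add(norm)
        kept.append(norm)
    return kept


candidates = [
    "https://Example.org/page/",
    "https://example.org/page#comments",  # same page once normalized
    "https://ads.example.net/banner",     # blacklisted host
    "https://example.org/other?id=1",
]
result = filter_urls(candidates, blacklist={"ads.example.net"})
print(result)
```

Normalizing before comparing is what lets the first two candidates collapse into one entry; a production crawler would likely need stricter rules (tracking parameters, ``www`` prefixes, redirects).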


Evaluation and alternatives
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Trafilatura consistently outperforms other open-source libraries in text extraction benchmarks, showcasing its efficiency and accuracy in extracting web content.

For more detailed results see the `benchmark <https://trafilatura.readthedocs.io/en/latest/evaluation.html>`_ and `evaluation script <https://github.com/adbar/trafilatura/blob/master/tests/comparison.py>`_. To reproduce the tests just clone the repository, install all necessary packages and run the evaluation script with the data provided in the *tests* directory.

=============================== ========= ========== ========= ========= ======
@@ -113,60 +118,66 @@ Other evaluations:
Usage and documentation
-----------------------

For more information please refer to `the documentation <https://trafilatura.readthedocs.io/>`_:
`Getting started with Trafilatura <https://trafilatura.readthedocs.io/en/latest/quickstart.html>`_ is straightforward. For more information and detailed guides, visit `Trafilatura's documentation <https://trafilatura.readthedocs.io/>`_:

- `Installation <https://trafilatura.readthedocs.io/en/latest/installation.html>`_
- Usage: `On the command-line <https://trafilatura.readthedocs.io/en/latest/usage-cli.html>`_, `With Python <https://trafilatura.readthedocs.io/en/latest/usage-python.html>`_, `With R <https://trafilatura.readthedocs.io/en/latest/usage-r.html>`_
- `Core Python functions <https://trafilatura.readthedocs.io/en/latest/corefunctions.html>`_
- Python Notebook `Trafilatura Overview <docs/Trafilatura_Overview.ipynb>`_
- `Tutorials <https://trafilatura.readthedocs.io/en/latest/tutorials.html>`_
- Interactive Python Notebook: `Trafilatura Overview <docs/Trafilatura_Overview.ipynb>`_
- `Tutorials and use cases <https://trafilatura.readthedocs.io/en/latest/tutorials.html>`_
- `Text embedding for vector search <https://trafilatura.readthedocs.io/en/latest/tutorial-epsilla.html>`_
- `Custom web corpus <https://trafilatura.readthedocs.io/en/latest/tutorial0.html>`_
- `Word frequency list <https://trafilatura.readthedocs.io/en/latest/tutorial1.html>`_

For video tutorials see this YouTube playlist:

- `Web scraping how-tos and tutorials <https://www.youtube.com/watch?v=8GkiOM17t0Q&list=PL-pKWbySIRGMgxXQOtGIz1-nbfYLvqrci>`_
- `Web scraping tutorials and how-tos <https://www.youtube.com/watch?v=8GkiOM17t0Q&list=PL-pKWbySIRGMgxXQOtGIz1-nbfYLvqrci>`_


License
-------

*Trafilatura* is distributed under the `GNU General Public License v3.0 <https://github.com/adbar/trafilatura/blob/master/LICENSE>`_. If you wish to redistribute this library but feel bounded by the license conditions please try interacting `at arms length <https://www.gnu.org/licenses/gpl-faq.html#GPLInProprietarySystem>`_, `multi-licensing <https://en.wikipedia.org/wiki/Multi-licensing>`_ with `compatible licenses <https://en.wikipedia.org/wiki/GNU_General_Public_License#Compatibility_and_multi-licensing>`_, or `contacting me <https://github.com/adbar/trafilatura#author>`_.

See also `GPL and free software licensing: What's in it for business? <https://web.archive.org/web/20230127221311/https://www.techrepublic.com/article/gpl-and-free-software-licensing-whats-in-it-for-business/>`_
*Trafilatura* is distributed under the `GNU General Public License v3.0 <https://github.com/adbar/trafilatura/blob/master/LICENSE>`_. This license promotes collaboration in software development and ensures that Trafilatura's code remains publicly accessible.

If you wish to redistribute this library but are concerned about the license conditions, consider interacting `at arm's length <https://www.gnu.org/licenses/gpl-faq.html#GPLInProprietarySystem>`_, multi-licensing with `compatible licenses <https://en.wikipedia.org/wiki/GNU_General_Public_License#Compatibility_and_multi-licensing>`_, or `contacting the author <#author>`_ for more options.


Context
-------
For insights into GPL and free software licensing with emphasis on a business context, see `GPL and Free Software Licensing: What's in it for Business? <https://web.archive.org/web/20230127221311/https://www.techrepublic.com/article/gpl-and-free-software-licensing-whats-in-it-for-business/>`_


Contributing
~~~~~~~~~~~~
------------

Contributions are welcome! See `CONTRIBUTING.md <https://github.com/adbar/trafilatura/blob/master/CONTRIBUTING.md>`_ for more information. Bug reports can be filed on the `dedicated page <https://github.com/adbar/trafilatura/issues>`_.
Contributions of all kinds are welcome. Visit the `Contributing page <https://github.com/adbar/trafilatura/blob/master/CONTRIBUTING.md>`_ for more information. Bug reports can be filed on the `dedicated issue page <https://github.com/adbar/trafilatura/issues>`_.

Many thanks to the `contributors <https://github.com/adbar/trafilatura/graphs/contributors>`_ who submitted features and bugfixes!
Many thanks to the `contributors <https://github.com/adbar/trafilatura/graphs/contributors>`_ who extended the docs or submitted bug reports, features and bugfixes!


Roadmap
~~~~~~~
Context
-------

Developed with practical applications of academic research in mind, this software is part of a broader effort to derive information from web documents. Extracting and pre-processing web texts to the exacting standards of scientific research presents a substantial challenge. Web corpus construction involves numerous design decisions; this software package simplifies text data collection and enhances corpus quality. It is currently used to build `text databases for linguistic research <https://www.dwds.de/d/k-web>`_.

For planned enhancements and relevant milestones see `issues page <https://github.com/adbar/trafilatura/milestones>`_.
*Trafilatura* is an Italian word for `wire drawing <https://en.wikipedia.org/wiki/Wire_drawing>`_ symbolizing the industrial-grade extraction, refinement and conversion process.


Author
~~~~~~

This effort is part of methods to derive information from web documents in order to build `text databases for research <https://www.dwds.de/d/k-web>`_ (chiefly linguistic analysis and natural language processing). Extracting and pre-processing web texts to the exacting standards of scientific research presents a substantial challenge for those who conduct such research. Web corpus construction involves numerous design decisions, and this software package can help facilitate text data collection and enhance corpus quality.
Reach out via the `contact page <https://adrien.barbaresi.eu/>`_ for inquiries, collaborations, or feedback. See also `Twitter/X <https://x.com/adbarbaresi>`_ for the latest updates.

This work started as a PhD project at the crossroads of linguistics and NLP, and this expertise has been instrumental in shaping Trafilatura over the years. It was first released in its current form in 2019; its development is referenced in the following publications:

- Barbaresi, A. `Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction <https://aclanthology.org/2021.acl-demo.15/>`_, Proceedings of ACL/IJCNLP 2021: System Demonstrations, 2021, p. 122-131.
- Barbaresi, A. "`Generic Web Content Extraction with Open-Source Software <https://hal.archives-ouvertes.fr/hal-02447264/document>`_", Proceedings of KONVENS 2019, Kaleidoscope Abstracts, 2019.
- Barbaresi, A. "`Efficient construction of metadata-enhanced web corpora <https://hal.archives-ouvertes.fr/hal-01371704v2/document>`_", Proceedings of the `10th Web as Corpus Workshop (WAC-X) <https://www.sigwac.org.uk/wiki/WAC-X>`_, 2016.


Citing Trafilatura
~~~~~~~~~~~~~~~~~~


If you use Trafilatura in your research or projects, please cite this work as follows:

.. image:: https://img.shields.io/badge/DOI-10.18653%2Fv1%2F2021.acl--demo.15-blue
:target: https://aclanthology.org/2021.acl-demo.15/
:alt: Reference DOI: 10.18653/v1/2021.acl-demo.15
@@ -175,7 +186,6 @@ This effort is part of methods to derive information from web documents in order
:target: https://doi.org/10.5281/zenodo.3460969
:alt: Zenodo archive DOI: 10.5281/zenodo.3460969


.. code-block:: shell
@inproceedings{barbaresi-2021-trafilatura,
@@ -189,21 +199,21 @@ This effort is part of methods to derive information from web documents in order
}
You can contact me via my `contact page <https://adrien.barbaresi.eu/>`_ or on `GitHub <https://github.com/adbar>`_.


Software ecosystem
~~~~~~~~~~~~~~~~~~

This software is part of a larger ecosystem. It is employed in a variety of academic and development projects, demonstrating its versatility and effectiveness. Case studies and publications are listed on the `Used By documentation page <https://trafilatura.readthedocs.io/en/latest/used-by.html>`_.

Jointly developed plugins and additional packages also contribute to the field of web data extraction and analysis:


.. image:: docs/software-ecosystem.png
:alt: Software ecosystem
:alt: Software ecosystem
:align: center
:width: 65%


*Trafilatura*: `Italian word <https://en.wiktionary.org/wiki/trafilatura>`_ for `wire drawing <https://en.wikipedia.org/wiki/Wire_drawing>`_.

`Known uses of the software <https://trafilatura.readthedocs.io/en/latest/used-by.html>`_.
Corresponding posts can be found on `Bits of Language <https://adrien.barbaresi.eu/blog/tag/trafilatura.html>`_. The blog covers a range of topics from technical how-tos, updates on new features, to discussions on text mining challenges and solutions.

Corresponding posts on `Bits of Language <https://adrien.barbaresi.eu/blog/tag/trafilatura.html>`_ (blog).
Impressive, you have reached the end of the page: Thank you for your interest!
9 changes: 3 additions & 6 deletions docs/crawls.rst
@@ -3,8 +3,8 @@ Web crawling

.. meta::
:description lang=en:
This tutorial shows how to perform web crawling tasks with Python and on the command-line.
The Trafilatura package allows for easy focused crawling.
Dive deep into the web with Python and on the command-line. Trafilatura supports
focused crawling, enforces politeness rules, and navigates through websites.



@@ -15,15 +15,12 @@ Another use of web crawlers is in Web archiving, which involves large sets of we
Other applications include data mining and text analytics, for example building web corpora for linguistic research.


This page shows how to perform certain web crawling tasks with Python and on the command-line. The `trafilatura` package allows for easy focused crawling (see definition below).
Dive deep into the web with crawling techniques. Trafilatura supports focused crawling, adheres to politeness rules, and efficiently navigates through sitemaps and feeds. This page shows how to perform certain web crawling tasks with Python and on the command-line. The `trafilatura` package allows for easy focused crawling (see definition below).
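
As a rough picture of what politeness rules involve, the following sketch checks a URL against a site's ``robots.txt`` and reads the requested crawl delay, using only Python's standard ``urllib.robotparser`` module. It is not taken from Trafilatura's code: the ``robots.txt`` content, the ``example-bot`` user agent and the ``allowed`` helper are assumptions made for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it
# from the target site before fetching any page.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url: str, agent: str = "example-bot") -> bool:
    """Return True if the (assumed) user agent may fetch this URL."""
    return parser.can_fetch(agent, url)


# Seconds to wait between requests, or None if the site sets no delay.
delay = parser.crawl_delay("example-bot")

print(allowed("https://example.org/blog/post.html"))
print(allowed("https://example.org/private/data.html"))
print(delay)
```

A polite crawler would skip disallowed URLs and sleep for at least ``delay`` seconds between requests to the same host.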

..
Web crawlers require resources to run, so companies want to make sure they are using their resources as efficiently as possible, so they must be selective.

*New in version 0.9. Still experimental.*


Design decisions
----------------
