From eac4e681dacc99f4db7ad8b5128e5f64959a97e1 Mon Sep 17 00:00:00 2001 From: Adrien Barbaresi Date: Sat, 28 Dec 2024 12:35:09 +0100 Subject: [PATCH] update docs --- CONTRIBUTING.md | 26 +++++++++++--------------- README.md | 9 ++++----- docs/conf.py | 5 ++--- docs/corpus-data.rst | 6 +++--- docs/index.rst | 11 +++++------ docs/requirements.txt | 2 +- docs/sources.rst | 12 ++++-------- docs/troubleshooting.rst | 4 ++-- docs/tutorial-dwds.rst | 2 +- docs/tutorial-epsilla.rst | 9 +-------- docs/tutorial0.rst | 2 +- docs/usage-api.rst | 24 +++++++++++------------- docs/usage-gui.rst | 2 +- docs/usage-r.rst | 2 +- docs/used-by.rst | 16 +++++++--------- 15 files changed, 55 insertions(+), 77 deletions(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 4423a4ec..48eb7fbc 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -1,14 +1,17 @@ ## How to contribute -Your contributions make the software and its documentation better. A special thanks to all the [contributors](https://github.com/adbar/trafilatura/graphs/contributors) who have played a part in Trafilatura. +If you value this software or depend on it for your product, +consider sponsoring it and contributing to its codebase. +Your support will help ensure the sustainability and growth of the project. -There are many ways to contribute, you could: +There are many ways to contribute: - * Improve the documentation: Write tutorials and guides, correct mistakes, or translate existing content. + * Sponsor the project: Show your appreciation [on GitHub](https://github.com/sponsors/adbar) or [ko-fi.com](https://ko-fi.com/adbarbaresi). * Find bugs and submit bug reports: Help make Trafilatura an even more robust tool. + * Write code: Fix bugs or add new features by writing [pull requests](https://docs.github.com/en/pull-requests) with a list of what you have done. + * Improve the documentation: Write tutorials and guides, correct mistakes, or translate existing content. * Submit feature requests: Share your feedback and suggestions. - * Write code: Fix bugs or add new features. Here are some important resources: @@ -16,27 +19,20 @@ Here are some important resources: * [List of currently open issues](https://github.com/adbar/trafilatura/issues) (no claim to exhaustiveness!) * [How to contribute to open source](https://opensource.guide/how-to-contribute/) +A special thanks to all the [contributors](https://github.com/adbar/trafilatura/graphs/contributors) who have played a part in Trafilatura. + ## Testing and evaluating the code -Here is how you can run the tests and code quality checks: +Here is how you can run the tests and code quality checks. Pull requests will only be accepted if the changes are tested and there are no errors. - Install the necessary packages with `pip install trafilatura[dev]` - Run `pytest` from trafilatura's directory, or select a particular test suite, for example `realworld_tests.py`, and run `pytest realworld_tests.py` or simply `python3 realworld_tests.py` - Run `mypy` on the directory: `mypy trafilatura/` -- See also the [tests Readme](tests/README.rst) for information on the evaluation benchmark - -Pull requests will only be accepted if they there are no errors in pytest and mypy. If you work on text extraction it is useful to check if performance is equal or better on the benchmark. - -## Submitting changes - -Please send a pull request to Trafilatura with a list of what you have done (read more about [pull requests](http://help.github.com/pull-requests/)).
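To make the testing requirement above more concrete, here is a minimal, hypothetical sketch of the kind of pytest case a pull request could ship alongside a change. The file name, sample markup and assertions are illustrative and not part of the existing test suite.

```python
# Hypothetical test sketch (e.g. tests/example_contribution_test.py) -- illustrative only.
import trafilatura

SAMPLE_HTML = """
<html><body><article>
  <h1>A short example page</h1>
  <p>Trafilatura is a package for web scraping and text extraction. This sample
  paragraph only exists so that the document is long enough to be treated as
  meaningful content by the extractor, which discards very short documents.</p>
  <p>The extraction step should keep the main text of the article and leave out
  navigation, advertisements and other boilerplate elements that typically
  surround it on real web pages.</p>
</article></body></html>
"""

def test_extract_keeps_main_text():
    # extract() returns the main text as a string, or None if nothing usable is found
    result = trafilatura.extract(SAMPLE_HTML)
    assert result is not None
    assert "main text of the article" in result
```

Running `pytest` and `mypy trafilatura/` as described above should then cover both the new test and the existing suite.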
- -**Working on your first Pull Request?** See this tutorial: [How To Create a Pull Request on GitHub](https://www.digitalocean.com/community/tutorials/how-to-create-a-pull-request-on-github) - +See the [tests Readme](tests/README.rst) for more information. For further questions you can use [GitHub issues](https://github.com/adbar/trafilatura/issues) and discussion pages, or [E-Mail](https://adrien.barbaresi.eu/). diff --git a/README.md b/README.md index f17387a6..28f5ab22 100644 --- a/README.md +++ b/README.md @@ -141,13 +141,12 @@ This work started as a PhD project at the crossroads of linguistics and NLP, this expertise has been instrumental in shaping Trafilatura over the years. Initially launched to create text databases for research purposes at the Berlin-Brandenburg Academy of Sciences (DWDS and ZDL units), -this package continues to be maintained but its future development -depends on community support. +this package continues to be maintained but its future depends on community support. **If you value this software or depend on it for your product, consider -sponsoring it and contributing to its codebase**. Your support will -help maintain and enhance this popular package, ensuring its growth, -robustness, and accessibility for developers and users around the world. +sponsoring it and contributing to its codebase**. Your support +[on GitHub](https://github.com/sponsors/adbar) or [ko-fi.com](https://ko-fi.com/adbarbaresi) +will help maintain and enhance this popular package. *Trafilatura* is an Italian word for [wire drawing](https://en.wikipedia.org/wiki/Wire_drawing) symbolizing the diff --git a/docs/conf.py b/docs/conf.py index d56b9d7c..ae200a7c 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -20,8 +20,8 @@ # -- Project information ----------------------------------------------------- -project = 'trafilatura' -copyright = '2024, Adrien Barbaresi' +project = 'Trafilatura' +copyright = '2025, Adrien Barbaresi' html_show_sphinx = False author = 'Adrien Barbaresi' version = trafilatura.__version__ @@ -88,7 +88,6 @@ ## pydata options html_theme_options = { "github_url": "https://github.com/adbar/trafilatura", - "twitter_url": "https://twitter.com/adbarbaresi", "external_links": [ {"name": "Blog", "url": "https://adrien.barbaresi.eu/blog/tag/trafilatura.html"}, ], diff --git a/docs/corpus-data.rst b/docs/corpus-data.rst index a34cf733..d0a7e661 100644 --- a/docs/corpus-data.rst +++ b/docs/corpus-data.rst @@ -45,7 +45,7 @@ Formats and software used in corpus linguistics Input/Output formats: TXT, XML and XML-TEI are quite frequent in corpus linguistics. -- Han., N.-R. (2022). "`Transforming Data `_", The Open Handbook of Linguistic Data. +- Han, N.-R. (2022). "Transforming Data", The Open Handbook of Linguistic Data. The XML and XML-TEI formats @@ -62,9 +62,9 @@ Corpus analysis tools - `CorpusExplorer `_ supports CSV, TXT and various XML formats - `Corpus Workbench (CWB) `_ uses verticalized texts whose origin can be in TXT or XML format - `LancsBox `_ supports various formats, notably TXT & XML -- `TXM `_ (textometry platform) can take TXT, XML & XML-TEI files as input +- `TXM `_ (textometry platform) can take TXT, XML & XML-TEI files as input - `Voyant `_ supports various formats, notably TXT, XML & XML-TEI -- `Wmatrix `_ can work with TXT and XML +- `Wmatrix `_ can work with TXT and XML - `WordSmith `_ supports TXT and XML Further corpus analysis software can be found on `corpus-analysis.com `_.
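To connect the format overview above with Trafilatura itself, here is a small sketch showing how extraction output can be produced as plain text, XML or XML-TEI for use in such corpus tools. It is only an illustration: the URL is a placeholder and error handling is kept to a minimum.

.. code-block:: python

    # Minimal sketch: produce corpus-ready output with Trafilatura (placeholder URL)
    from trafilatura import fetch_url, extract

    downloaded = fetch_url("https://www.example.org/article")  # raw HTML string or None
    if downloaded is not None:
        plain_text = extract(downloaded)                           # TXT output (default)
        xml_output = extract(downloaded, output_format="xml")      # custom XML output
        tei_output = extract(downloaded, output_format="xmltei")   # XML-TEI output
        print(tei_output)

The resulting XML or XML-TEI files can then be loaded into most of the corpus analysis tools listed above.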
diff --git a/docs/index.rst b/docs/index.rst index fb40e4b3..d3c3427f 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -134,13 +134,12 @@ This work started as a PhD project at the crossroads of linguistics and NLP, this expertise has been instrumental in shaping Trafilatura over the years. Initially launched to create text databases for research purposes at the Berlin-Brandenburg Academy of Sciences (DWDS and ZDL units), -this package continues to be maintained but its future development -depends on community support. +this package continues to be maintained but its future depends on community support. **If you value this software or depend on it for your product, consider -sponsoring it and contributing to its codebase**. Your support will -help maintain and enhance this popular package, ensuring its growth, -robustness, and accessibility for developers and users around the world. +sponsoring it and contributing to its codebase**. Your support +`on GitHub `_ or `ko-fi.com `_ +will help maintain and enhance this popular package. *Trafilatura* is an Italian word for `wire drawing `_ symbolizing the refinement and conversion process. It is also the way shapes of pasta are formed. @@ -225,8 +224,8 @@ Further documentation usage tutorials evaluation - corefunctions used-by + corefunctions background :ref:`genindex` diff --git a/docs/requirements.txt b/docs/requirements.txt index 2c243702..5bb841f9 100644 --- a/docs/requirements.txt +++ b/docs/requirements.txt @@ -1,6 +1,6 @@ # with version specifier sphinx>=8.1.3 -pydata-sphinx-theme>=0.16.0 +pydata-sphinx-theme>=0.16.1 docutils>=0.21.2 sphinx-sitemap>=2.6.0 diff --git a/docs/sources.rst b/docs/sources.rst index 1ae76f1d..f3b786f0 100644 --- a/docs/sources.rst +++ b/docs/sources.rst @@ -39,18 +39,18 @@ Corpora URL lists from corpus linguistic projects can be a starting ground to derive information from, either to recreate existing corpora or to re-crawl the websites and find new content. If the websites do not exist anymore, the links can still be useful as the corresponding web pages can be retrieved from web archives. - `Sources for the Internet Corpora `_ of the Leeds Centre for Translation Studies -- `Link data sets `_ of the COW project +- `Link data sets `_ of the COW project URL directories ~~~~~~~~~~~~~~~ -- `Overview of the Web archiving community `_ +- `Overview of the Web archiving community `_ - `lazynlp list of sources `_ DMOZ (now an archive) and Wikipedia work quite well as primary sources: -- `Qualification of URLs extracted from DMOZ and Wikipedia `_ (PhD thesis section) +- `Qualification of URLs extracted from DMOZ and Wikipedia `_ (PhD thesis section) .. https://www.sketchengine.eu/guide/create-a-corpus-from-the-web/ @@ -130,14 +130,10 @@ Social networks Series of surface scrapers that crawl the networks without even logging in, thus circumventing the API restrictions. Development of such software solutions is fast-paced, so no links will be listed here at the moment. -Previously collected tweet IDs can be “hydrated”, i.e. retrieved from Twitter in bulk. see for instance: - -- `Twitter datasets for research and archiving `_ -- `Search GitHub for Tweet IDs `_ +Previously collected tweet IDs can be “hydrated”, i.e. retrieved from Twitter in bulk. Links can be extracted from tweets with a regular expression such as ``re.findall(r'https?://[^ ]+')``. They probably need to be resolved first to get actual link targets and not just shortened URLs (like t.co/…). - For further ideas from previous projects see references below. 
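As a complement to the note on link extraction above, here is a short sketch that pulls URLs out of tweet text with the regular expression mentioned in the documentation and resolves shortened links by following redirects. The sample tweet, the shortened URL and the timeout value are made up for illustration.

.. code-block:: python

    # Sketch: extract links from tweet text and resolve shortened URLs (illustrative)
    import re
    from urllib.request import urlopen

    tweet = "Great read on web corpora https://t.co/abc123 and more at https://example.org/post"
    links = re.findall(r'https?://[^ ]+', tweet)

    resolved = []
    for link in links:
        try:
            # urlopen follows redirects; geturl() returns the final target URL
            with urlopen(link, timeout=10) as response:
                resolved.append(response.geturl())
        except OSError:
            resolved.append(link)  # keep the shortened link if resolution fails
    print(resolved)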
diff --git a/docs/troubleshooting.rst b/docs/troubleshooting.rst index 5c377dfc..e2dfea07 100644 --- a/docs/troubleshooting.rst +++ b/docs/troubleshooting.rst @@ -34,7 +34,7 @@ Beyond raw HTML While downloading and processing raw HTML documents is much faster, it can be necessary to fully render the web page before further processing, e.g. because a page makes exhaustive use of JavaScript or because content is injected from multiple sources. -In such cases the way to go is to use a browser automation library like `Playwright `_. For available alternatives see this `list of headless browsers `_. +In such cases the way to go is to use a browser automation library like Playwright. For available alternatives see this `list of headless browsers `_. For more refined masking and automation methods, see the `nodriver `_ and `browserforge `_ packages. @@ -43,7 +43,7 @@ For more refined masking and automation methods, see the `nodriver `_ or `this alternative by magnolia1234 `_. +A browser automation library can also be useful to bypass issues related to cookies and paywalls as it can be combined with a corresponding browser extension, e.g. iamdamdev's bypass-paywalls-chrome and available alternatives. diff --git a/docs/tutorial-dwds.rst b/docs/tutorial-dwds.rst index 3572eeeb..2fb9435d 100644 --- a/docs/tutorial-dwds.rst +++ b/docs/tutorial-dwds.rst @@ -82,7 +82,7 @@ So wird die DWDS-Plattform zu einer Art Meta-Suchmaschine. Der Vorteil besteht d Hier finden Sie eine `Liste der Webkorpora auf der DWDS-Plattform `_. -Bei größeren Webkorpora ist die Filterung hinsichtlich der Relevanz und der Textqualität meistens quantitativer Natur, siehe `Barbaresi 2015 (Diss.) Kapitel 4 `_ für Details. Im Übrigen haben wir das Schlimmste aus dem Web manuell ausgegrenzt. +Bei größeren Webkorpora ist die Filterung hinsichtlich der Relevanz und der Textqualität meistens quantitativer Natur, siehe `Barbaresi 2015 (Diss.) Kapitel 4 `_ für Details. Im Übrigen haben wir das Schlimmste aus dem Web manuell ausgegrenzt. Download und Verarbeitung der Daten diff --git a/docs/tutorial-epsilla.rst b/docs/tutorial-epsilla.rst index 8b647363..d613bd7e 100644 --- a/docs/tutorial-epsilla.rst +++ b/docs/tutorial-epsilla.rst @@ -32,11 +32,7 @@ Alternatives include `Qdrant `_, `Redis `_ with a free tier. You can sign up and get a server running in a few steps. - -Alternatively, you can start one locally with a `Docker `_ image. +In this tutorial, we will need an Epsilla database server. You can start one locally with a `Docker `_ image. .. code-block:: bash @@ -155,7 +151,4 @@ We can now perform a vector search to find the most relevant project based on a You will see the returned response is React! That is the correct answer. React is a modern frontend library, but PyTorch and Tensorflow are not. -.. image:: https://static.scarf.sh/a.png?x-pxid=51f549d1-aabf-473c-b971-f8d9c3ac8ac5 - :alt: - diff --git a/docs/tutorial0.rst b/docs/tutorial0.rst index e39a1fee..fb5a469c 100644 --- a/docs/tutorial0.rst +++ b/docs/tutorial0.rst @@ -193,7 +193,7 @@ The output directory can be created on demand, but it must be writable. # output in XML format, backup of HTML files $ trafilatura --xml -i list.txt -o xmlfiles/ --backup-dir htmlfiles/ -The second and third instructions create a collection of `XML files `_ which can be edited with a basic text editor or a full-fledged text-editing software or IDE such as the `Atom editor `_. 
+The second and third instructions create a collection of `XML files `_ which can be edited with a basic text editor or a full-fledged text editor or IDE. .. hint:: diff --git a/docs/usage-api.rst b/docs/usage-api.rst index 1143b8a0..d0990190 100644 --- a/docs/usage-api.rst +++ b/docs/usage-api.rst @@ -3,32 +3,32 @@ API .. meta:: :description lang=en: - See how to use the official Trafilatura API to download and extract data for free or for larger volumes. + See how to use the official Trafilatura API to download and extract data. Introduction ------------ -Simplify the process of turning URLs and HTML into structured, meaningful data! - -Use the last version of the software straight from the application programming interface. The API allows you to access the capabilities of Trafilatura, a web scraping and data extraction library, directly from your applications and projects. - With the Trafilatura API, you can: - Download URLs or provide your own data, including web scraping capabilities - Configure the output format to suit your needs, with support for multiple use cases - This is especially useful if you want to try out the software without installing it or if you want to support the project while saving time. -Endpoints ---------- +.. warning:: The API is currently unavailable; feel free to get in touch for any inquiries. -The Trafilatura API comes in two versions, available from two different gateways: -- `Free for demonstration purposes `_ (including documentation page) -- `For a larger volume of requests `_ (documentation with code snippets and plans) .. Endpoints --------- The Trafilatura API comes in two versions, available from two different gateways: - `Free for demonstration purposes `_ (including documentation page) - `For a larger volume of requests `_ (documentation with code snippets and plans) Making JSON requests @@ -103,5 +103,3 @@ Further information ------------------- Please note that the underlying code is not currently open-sourced; feel free to reach out for specific use cases or collaborations. - -With the API, you can focus on building your applications and projects, while leaving the heavy lifting to Trafilatura. diff --git a/docs/usage-gui.rst b/docs/usage-gui.rst index 8b99896c..18d0dc60 100644 --- a/docs/usage-gui.rst +++ b/docs/usage-gui.rst @@ -39,7 +39,7 @@ Troubleshooting Mac OS X: - ``This program needs access to the screen...`` This problem is related to the way you installed Python or the shell you're running: - 1. Clone the repository and start with "python trafilatura_gui/interface.py" (`source `_) + 1. Clone the repository and start with "python trafilatura_gui/interface.py" (`source `_) 2. `Configure your virtual environment `_ (Python3 and wxpython 4.1.0) diff --git a/docs/usage-r.rst b/docs/usage-r.rst index 8e27b528..c2e949c5 100644 --- a/docs/usage-r.rst +++ b/docs/usage-r.rst @@ -11,7 +11,7 @@ Introduction ------------ -R is a free software environment for statistical computing and graphics. `Reticulate `_ is an R package that enables easy interoperability between R and Python. With Reticulate, you can import Python modules as if they were R packages and call Python functions from R. +R is a free software environment for statistical computing and graphics. `Reticulate `_ is an R package that enables easy interoperability between R and Python. With Reticulate, you can import Python modules as if they were R packages and call Python functions from R.
This allows R users to leverage the vast array of Python packages and tools and basically allows for execution of Python code inside an R session. Python packages can then be used with minimal adaptations rather than having to go back and forth between languages and environments. diff --git a/docs/used-by.rst b/docs/used-by.rst index 021953e8..0e4eae83 100644 --- a/docs/used-by.rst +++ b/docs/used-by.rst @@ -10,9 +10,6 @@ Initially released to collect data for linguistic research and lexicography at t The tool earns accolades as the most efficient open-source library in benchmarks and academic evaluations. It supports language modeling by providing high-quality text data, aids data mining with efficient web data retrieval, and streamlines information extraction from unstructured content. In SEO and business analytics it gathers online data for insights and in information security, it monitors websites for threat detection. -If you wish to add further references, please `edit this page `_ and suggest changes by submitting a pull request. - - Notable projects using this software ------------------------------------ @@ -23,13 +20,13 @@ Companies and research centers - Allen Institute for AI with the `Dolma toolkit `_ used to pre-train the OLMo LLM - HuggingFace with `DataTrove `_ to process, filter and deduplicate text data - IBM's `Data-Prep-Kit `_, a toolkit for data preparation in a LLM context -- `Media Cloud platform `_ for media analysis +- `Media Cloud platform `_ for media analysis - SciencesPo médialab through its `Minet `_ webmining package - Stanford Open Virtual Assistant Lab's `STORM `_, a LLM system that writes Wikipedia-like articles - Swedish national center for applied AI with `SWEB: A large dataset for Scandinavian languages `_ - Technology Innovation Institute Abu Dhabi with Falcon LLM and its underlying `RefinedWeb Dataset `_ - `Teclis search engine `_ (related to Kagi) -- The Internet Archive's `sandcrawler `_ which crawls and processes the scholarly web for the `Fatcat catalog `_ of research publications +- The Internet Archive's `sandcrawler `_ which crawls and processes the scholarly web - Tokyo Institute of Technology with a `Japanese Web Corpus for Large Language Models `_ - Turku University, NLP department with `FinGPT `_ models - University of Munich (LMU), Center for Language and Information Processing, `GlotWeb project `_ @@ -43,12 +40,11 @@ Various software repositories - `Benson `_, to turn a list of URLs into mp3s of the contents of each web page - `CommonCrawl downloader `_, to derive massive amounts of language data - `Ethical ad server `_ on ReadTheDocs (hosting these doc pages) -- `GLAM Workbench `_ for cultural heritage (web archives section) -- `llama-hub `_, a library of data loaders for large language models +- `GLAM Workbench `_ for cultural heritage (web archives section) - `LlamaIndex `_, a data framework for LLM applications -- `Obsei `_, a text collection and analysis tool +- `Obsei `_, a text collection and analysis tool - `Vulristics `_, a framework for analyzing publicly available information about vulnerabilities -- `Website-to-Chatbot `_, a personalized chatbot +- `Website-to-Chatbot `_, a personalized chatbot For more see this list of `software using Trafilatura `_. @@ -119,6 +115,8 @@ The date extraction component ``htmldate`` is referenced in the following public Publications citing Trafilatura ------------------------------- +https://www.degruyter.com/document/doi/10.1515/9783110729603-009/html + - Alakukku, L. (2022). 
"Domain specific boilerplate removal from web pages with entropy and clustering", Master's thesis, University of Aalto. - Alexandrescu, A., & Butincu, C.N. (2023). Decentralized news-retrieval architecture using blockchain technology. Mathematics, 11(21), 4542.