
community[minor]: added new document loaders based on dedoc library #24303

Merged
4 commits merged into langchain-ai:master on Jul 23, 2024

Conversation

alexander1999-hub
Contributor

Description

This pull request adds new document loaders that load documents of various formats using Dedoc:

  • DedocFileLoader (determines the file type automatically and parses it)
  • DedocPDFLoader (for parsing PDF files and images)
  • DedocAPIFileLoader (determines the file type automatically and parses it using the Dedoc API, without installing the library)

Dedoc (https://github.com/ispras/dedoc) is an open-source library/service that extracts text, tables, attached files, and document structure (e.g., titles, list items) from files of various formats. The library is actively developed and maintained by a group of developers.

Dedoc supports DOCX, XLSX, PPTX, EML, HTML, PDF, images and more.
The full list of supported formats can be found at https://dedoc.readthedocs.io/en/latest/#id1.
For PDF documents, Dedoc can determine whether the textual layer is correct and split the document into paragraphs.
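
A minimal usage sketch of DedocFileLoader (the input file name is hypothetical; the split and with_tables arguments follow the __init__ signatures shown later in this review):

from langchain_community.document_loaders import DedocFileLoader

# "example.docx" is a hypothetical input file; any supported format should work
loader = DedocFileLoader(
    "example.docx",
    split="document",   # default from the __init__ signature discussed below
    with_tables=True,   # default from the __init__ signature discussed below
)
docs = loader.load()    # load() comes from the BaseLoader interface
print(docs[0].page_content[:200])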

Issue

This pull request extends the variety of document loaders supported by langchain_community, allowing users to choose the most suitable option for parsing raw documents.

Dependencies

The PR adds a new (optional) dependency dedoc>=2.2.5 (library documentation: https://dedoc.readthedocs.io) to extended_testing_deps.txt.

Twitter handle

None

Add tests and docs

  1. Test for the integration: libs/community/tests/integration_tests/document_loaders/test_dedoc.py
  2. Example notebook: docs/docs/integrations/document_loaders/dedoc.ipynb
  3. Information about the library: docs/docs/integrations/providers/dedoc.mdx

Lint and test

Done locally:

  • make format
  • make lint
  • make integration_tests
  • make docs_build (from the project root)


@dosubot dosubot bot added the size:XXL (This PR changes 1000+ lines, ignoring generated files), community (Related to langchain-community), Ɑ: doc loader (Related to document loader module (not documentation)) and 🤖:improvement (Medium size change to existing code to handle new use-cases) labels Jul 16, 2024
from langchain_community.document_loaders.base import BaseLoader


class DedocBaseLoader(BaseLoader, ABC):
Collaborator

I'd recommend implementing a BlobParser, which will have access to mimetype information and will plug in nicely into the existing blob generators.

See here:
https://python.langchain.com/v0.1/docs/modules/data_connection/document_loaders/custom/#baseblobparser
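
For reference, a minimal sketch of that approach (the DedocBlobParser name and the blob.as_string() body are illustrative placeholders, not code from this PR):

from typing import Iterator

from langchain_core.document_loaders import BaseBlobParser, Blob
from langchain_core.documents import Document


class DedocBlobParser(BaseBlobParser):  # hypothetical name, shown for illustration only
    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        # blob.mimetype and blob.path are available here, so the parser could
        # route the content to the appropriate dedoc reader
        text = blob.as_string()  # placeholder for an actual dedoc parse
        yield Document(page_content=text, metadata={"source": blob.source})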

Contributor

We were thinking about implementing a BlobParser, but in this case it won't always be applicable.
For example:

blob = Blob(data=b"some text")

Here blob.path is None and blob.mimetype is None, so dedoc can't parse the content.

The usage of dedoc is similar to the unstructured library, which is also used in langchain only in document loaders (e.g. UnstructuredFileLoader).

Would it be appropriate if we don't implement a BlobParser?

Collaborator

You're free to keep this as is, though I'd recommend using a blob parser to avoid repeating design mistakes in existing loaders.


Many of the document loaders make an unnecessary assumption that files have to exist on the local file system. This makes them more difficult to use in a production setting, as content can't just be loaded from s3 / google storage into memory but has to be duplicated to the local file system first.

If you create a blob parser, it allows for the content to be loaded from places other than the file system -- you're still free to wrap it in a document loader if you'd like.

As a blob parser, it can also be used with the existing blob loaders that load files from directories on the local file system or from cloud storage:

https://python.langchain.com/v0.2/docs/how_to/document_loader_custom/#generic-loader

https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.blob_loaders.cloud_blob_loader.CloudBlobLoader.html#langchain_community.document_loaders.blob_loaders.cloud_blob_loader.CloudBlobLoader
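
For example, a rough sketch of how such a parser composes with the existing blob loaders (DedocBlobParser is the hypothetical parser sketched above; the directory path and glob are illustrative):

from langchain_community.document_loaders.blob_loaders import FileSystemBlobLoader
from langchain_community.document_loaders.generic import GenericLoader

# Walk a local directory here; a CloudBlobLoader could be substituted to read
# from cloud storage without copying files to the local file system first.
loader = GenericLoader(
    blob_loader=FileSystemBlobLoader("./docs", glob="**/*.docx"),
    blob_parser=DedocBlobParser(),
)
for doc in loader.lazy_load():
    print(doc.metadata["source"])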

Collaborator

Let me know what you decide

Contributor

We consulted and decided to keep this as is

for row in table["cells"]:
    table_html += "<tr>\n"
    for cell in row:
        cell_text = "\n".join(line["text"] for line in cell["lines"])
Collaborator

Could you add HTML escaping here?

Contributor

Do you mean changing < to &lt;, > to &gt;, & to &amp; in cell_text?

Collaborator

>>> import html
>>> html.escape('<script>malicious</script>')
'&lt;script&gt;malicious&lt;/script&gt;'
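
Applied to the table-cell loop above, the change might look roughly like this (the <td> wrapping is assumed for illustration; the html.escape call is the actual suggestion):

import html

for row in table["cells"]:
    table_html += "<tr>\n"
    for cell in row:
        cell_text = "\n".join(line["text"] for line in cell["lines"])
        # escape &, < and > so cell content can't inject markup into the HTML table
        table_html += f"<td>{html.escape(cell_text)}</td>\n"
    table_html += "</tr>\n"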

Contributor

Done

    self, url: str, file_path: str, parameters: dict
) -> Dict[str, Union[list, dict, str]]:
    """Send POST-request to `dedoc` API and return the results"""
    import json
Collaborator

nit: These are standard-library imports; you can push them to the global scope.
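
i.e., something along these lines (a sketch only: the helper name and the /upload endpoint are assumptions about the dedoc API, and requests stands in for whatever HTTP client the loader actually uses):

import json
from typing import Dict, Union

import requests


def _send_to_dedoc_api(  # hypothetical helper name
    url: str, file_path: str, parameters: dict
) -> Dict[str, Union[list, dict, str]]:
    """Send a POST request to the dedoc API and return the parsed results."""
    with open(file_path, "rb") as file:
        # "/upload" is an assumed endpoint, used only to illustrate the shape of the call
        response = requests.post(f"{url}/upload", files={"file": file}, data=parameters)
    response.raise_for_status()
    return json.loads(response.content.decode())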

Contributor

Thank you, we'll fix it!

def __init__(
    self,
    file_path: str,
    url: str = "http://0.0.0.0:1231",
Collaborator

Would suggest using * before the named arguments with defaults to make this bullet-proof for future refactors.
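
i.e., roughly the following (only the * is the suggestion; the parameters are the ones from the snippet above, other arguments omitted):

def __init__(
    self,
    file_path: str,
    *,
    url: str = "http://0.0.0.0:1231",
) -> None:
    ...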

Contributor

Done

    file_path: str,
    split: str = "document",
    with_tables: bool = True,
    **dedoc_kwargs: Union[str, bool],
Collaborator

nit: Expand the kwargs into arguments to make it easier to use -- it's a bit of extra typing for the developer, but a much nicer user experience
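
For instance (a sketch only: the dedoc option names below are illustrative assumptions, not a verified or exhaustive list):

def __init__(
    self,
    file_path: str,
    *,
    split: str = "document",
    with_tables: bool = True,
    with_attachments: bool = False,              # assumed dedoc option, for illustration
    need_header_footer_analysis: bool = False,   # assumed dedoc option, for illustration
) -> None:
    ...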

Contributor

Done

def __init__(
    self,
    file_path: str,
    split: str = "document",
Collaborator

Would suggest using * before the named arguments with defaults here as well, to make this bullet-proof for future refactors.

Contributor

Done

@dosubot dosubot bot added the lgtm (PR looks good. Use to confirm that a PR is ready for merging.) label Jul 18, 2024
@eyurtsev eyurtsev added the waiting-on-author (PR Status: Confirmation from author is required) label Jul 18, 2024
@eyurtsev eyurtsev removed the waiting-on-author (PR Status: Confirmation from author is required) label Jul 19, 2024
@eyurtsev eyurtsev changed the title from "community: added new document loaders based on dedoc library" to "community[minor]: added new document loaders based on dedoc library" Jul 19, 2024
@eyurtsev
Collaborator

Triggering integration docs lint: #22866. Could you update the docstrings to match?

@eyurtsev eyurtsev added the waiting-on-author (PR Status: Confirmation from author is required) label Jul 19, 2024
@eyurtsev eyurtsev enabled auto-merge (squash) July 23, 2024 01:55
@eyurtsev eyurtsev merged commit 2a70a07 into langchain-ai:master Jul 23, 2024
45 checks passed
olgamurraft pushed a commit to olgamurraft/langchain that referenced this pull request Aug 16, 2024