
community[minor]: added new document loaders based on dedoc library #24303

Merged
4 commits merged into langchain-ai:master on Jul 23, 2024

Conversation

alexander1999-hub
Contributor

Description

This pull request adds new document loaders that load documents of various formats using Dedoc:

  • DedocFileLoader (determines the file type automatically and parses it)
  • DedocPDFLoader (for parsing PDF files and images)
  • DedocAPIFileLoader (determines the file type automatically and parses it using the Dedoc API, without installing the library)

Dedoc (https://github.com/ispras/dedoc) is an open-source library/service that extracts text, tables, attached files, and document structure (e.g., titles, list items) from files of various formats. The library is actively developed and maintained by a group of developers.

Dedoc supports DOCX, XLSX, PPTX, EML, HTML, PDF, images and more.
The full list of supported formats can be found at https://dedoc.readthedocs.io/en/latest/#id1.
For PDF documents, Dedoc can determine whether the textual layer is correct and split the document into paragraphs.
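
A minimal usage sketch of DedocFileLoader (the input file name is hypothetical; the split and with_tables arguments follow the __init__ signatures shown later in this review):

from langchain_community.document_loaders import DedocFileLoader

# "example.docx" is a hypothetical input file; any supported format should work
loader = DedocFileLoader(
    "example.docx",
    split="document",   # default from the __init__ signature discussed below
    with_tables=True,   # default from the __init__ signature discussed below
)
docs = loader.load()    # load() comes from the BaseLoader interface
print(docs[0].page_content[:200])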

Issue

This pull request extends the variety of document loaders supported by langchain_community, allowing users to choose the most suitable option for parsing raw documents.

Dependencies

The PR adds a new (optional) dependency dedoc>=2.2.5 (library documentation: https://dedoc.readthedocs.io) to extended_testing_deps.txt.

Twitter handle

None

Add tests and docs

  1. Test for the integration: libs/community/tests/integration_tests/document_loaders/test_dedoc.py
  2. Example notebook: docs/docs/integrations/document_loaders/dedoc.ipynb
  3. Information about the library: docs/docs/integrations/providers/dedoc.mdx

Lint and test

Done locally:

  • make format
  • make lint
  • make integration_tests
  • make docs_build (from the project root)


@dosubot dosubot bot added the size:XXL (This PR changes 1000+ lines, ignoring generated files), community (Related to langchain-community), Ɑ: doc loader (Related to document loader module (not documentation)) and 🤖:improvement (Medium size change to existing code to handle new use-cases) labels Jul 16, 2024
from langchain_community.document_loaders.base import BaseLoader


class DedocBaseLoader(BaseLoader, ABC):
Collaborator

I'd recommend implementing a BlobParser, which will have access to mimetype information and will plug in nicely into the existing blob generators.

See here:
https://python.langchain.com/v0.1/docs/modules/data_connection/document_loaders/custom/#baseblobparser
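
For reference, a minimal sketch of that approach (the DedocBlobParser name and the blob.as_string() body are illustrative placeholders, not code from this PR):

from typing import Iterator

from langchain_core.document_loaders import BaseBlobParser, Blob
from langchain_core.documents import Document


class DedocBlobParser(BaseBlobParser):  # hypothetical name, shown for illustration only
    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        # blob.mimetype and blob.path are available here, so the parser could
        # route the content to the appropriate dedoc reader
        text = blob.as_string()  # placeholder for an actual dedoc parse
        yield Document(page_content=text, metadata={"source": blob.source})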

Contributor

We were thinking about implementing a BlobParser, but in this case it won't always be applicable.
For example:

blob = Blob(data=b"some text")

Here blob.path is None and blob.mimetype is None, so dedoc can't parse the content.

The usage of dedoc is similar to the unstructured library, which is also used in langchain only in document loaders (e.g. UnstructuredFileLoader).

Would it be appropriate if we don't implement a BlobParser?

Collaborator

You're free to keep this as is, though I'd recommend using a blob parser to avoid repeating design mistakes in existing loaders.


Many of the document loaders make an unnecessary assumption that files have to exist on the local file system. This makes them more difficult to use in a production setting, as content can't just be loaded from s3 / google storage into memory but has to be duplicated to the local file system first.

If you create a blob parser, it allows for the content to be loaded from places other than the file system -- you're still free to wrap it in a document loader if you'd like.

As a blob parser, it can also be used with the existing blob loaders that load files from directories on the local file system or from cloud storage:

https://python.langchain.com/v0.2/docs/how_to/document_loader_custom/#generic-loader

https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.blob_loaders.cloud_blob_loader.CloudBlobLoader.html#langchain_community.document_loaders.blob_loaders.cloud_blob_loader.CloudBlobLoader
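
For example, a rough sketch of how such a parser composes with the existing blob loaders (DedocBlobParser is the hypothetical parser sketched above; the directory path and glob are illustrative):

from langchain_community.document_loaders.blob_loaders import FileSystemBlobLoader
from langchain_community.document_loaders.generic import GenericLoader

# Walk a local directory here; a CloudBlobLoader could be substituted to read
# from cloud storage without copying files to the local file system first.
loader = GenericLoader(
    blob_loader=FileSystemBlobLoader("./docs", glob="**/*.docx"),
    blob_parser=DedocBlobParser(),
)
for doc in loader.lazy_load():
    print(doc.metadata["source"])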

Collaborator

Let me know what you decide

Contributor

We consulted and decided to keep this as is

for row in table["cells"]:
    table_html += "<tr>\n"
    for cell in row:
        cell_text = "\n".join(line["text"] for line in cell["lines"])
Collaborator

Could you add HTML escaping here?

Contributor

Do you mean changing < to &lt;, > to &gt;, & to &amp; in cell_text?

Collaborator

>>> import html
>>> html.escape('<script>malicious</script>')
'&lt;script&gt;malicious&lt;/script&gt;'
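
Applied to the table-cell loop above, the change might look roughly like this (the <td> wrapping is assumed for illustration; the html.escape call is the actual suggestion):

import html

for row in table["cells"]:
    table_html += "<tr>\n"
    for cell in row:
        cell_text = "\n".join(line["text"] for line in cell["lines"])
        # escape &, < and > so cell content can't inject markup into the HTML table
        table_html += f"<td>{html.escape(cell_text)}</td>\n"
    table_html += "</tr>\n"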

Contributor

Done

    self, url: str, file_path: str, parameters: dict
) -> Dict[str, Union[list, dict, str]]:
    """Send POST-request to `dedoc` API and return the results"""
    import json
Collaborator

nit: These are standard-library imports; you can push them to the global scope.
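
i.e., something along these lines (a sketch only: the helper name and the /upload endpoint are assumptions about the dedoc API, and requests stands in for whatever HTTP client the loader actually uses):

import json
from typing import Dict, Union

import requests


def _send_to_dedoc_api(  # hypothetical helper name
    url: str, file_path: str, parameters: dict
) -> Dict[str, Union[list, dict, str]]:
    """Send a POST request to the dedoc API and return the parsed results."""
    with open(file_path, "rb") as file:
        # "/upload" is an assumed endpoint, used only to illustrate the shape of the call
        response = requests.post(f"{url}/upload", files={"file": file}, data=parameters)
    response.raise_for_status()
    return json.loads(response.content.decode())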

Contributor

Thank you, we'll fix it!

def __init__(
    self,
    file_path: str,
    url: str = "http://0.0.0.0:1231",
Collaborator

Would suggest using * before the named arguments with defaults to make this bullet-proof for future refactors.
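
i.e., roughly the following (only the * is the suggestion; the parameters are the ones from the snippet above, other arguments omitted):

def __init__(
    self,
    file_path: str,
    *,
    url: str = "http://0.0.0.0:1231",
) -> None:
    ...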

Contributor

Done

    file_path: str,
    split: str = "document",
    with_tables: bool = True,
    **dedoc_kwargs: Union[str, bool],
Collaborator

nit: Expand the kwargs into arguments to make it easier to use -- it's a bit of extra typing for the developer, but a much nicer user experience
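
For instance (a sketch only: the dedoc option names below are illustrative assumptions, not a verified or exhaustive list):

def __init__(
    self,
    file_path: str,
    *,
    split: str = "document",
    with_tables: bool = True,
    with_attachments: bool = False,              # assumed dedoc option, for illustration
    need_header_footer_analysis: bool = False,   # assumed dedoc option, for illustration
) -> None:
    ...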

Contributor

Done

def __init__(
    self,
    file_path: str,
    split: str = "document",
Collaborator

Would suggest using * before the named arguments with defaults here as well, to make this bullet-proof for future refactors.

Contributor

Done

@dosubot dosubot bot added the lgtm (PR looks good. Use to confirm that a PR is ready for merging.) label Jul 18, 2024
@eyurtsev eyurtsev added the waiting-on-author (PR Status: Confirmation from author is required) label Jul 18, 2024
@eyurtsev eyurtsev removed the waiting-on-author (PR Status: Confirmation from author is required) label Jul 19, 2024
@eyurtsev eyurtsev changed the title from "community: added new document loaders based on dedoc library" to "community[minor]: added new document loaders based on dedoc library" Jul 19, 2024
@eyurtsev
Collaborator

Triggering integration docs lint: #22866. Could you update the docstrings to match?

@eyurtsev eyurtsev added the waiting-on-author (PR Status: Confirmation from author is required) label Jul 19, 2024
@eyurtsev eyurtsev enabled auto-merge (squash) July 23, 2024 01:55
@eyurtsev eyurtsev merged commit 2a70a07 into langchain-ai:master Jul 23, 2024
45 checks passed
olgamurraft pushed a commit to olgamurraft/langchain that referenced this pull request Aug 16, 2024