community[minor]: added new document loaders based on dedoc library #24303
from langchain_community.document_loaders.base import BaseLoader
class DedocBaseLoader(BaseLoader, ABC):
I'd recommend implementing this as a BlobParser, which will have access to mimetype information and will plug in nicely with the existing blob generators.
We were thinking about implementing a BlobParser, but in this case it won't always be applicable.
For example:
blob = Blob(data=b"some text")
Here blob.path = None and blob.mimetype = None, and dedoc can't parse the content.
The usage of dedoc is similar to the unstructured library, which is also used in langchain only in document loaders (e.g. UnstructuredFileLoader).
Would it be appropriate if we don't implement a BlobParser?
You're free to keep this as is, though I'd recommend using a blob parser to avoid repeating design mistakes in existing loaders.
Many of the document loaders make an unnecessary assumption that the files have to exist on the local file system. This makes them more difficult to use in a production setting, as content can't just be loaded from S3 / Google Storage into memory but has to be duplicated to the local file system first.
If you create a blob parser, it allows for the content to be loaded from places other than the file system -- you're still free to wrap it in a document loader if you'd like.
With a blob parser, users will also be able to use it with existing blob loaders that load files from directories on the local file system or from cloud storage:
https://python.langchain.com/v0.2/docs/how_to/document_loader_custom/#generic-loader
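A rough sketch of the idea in the comment above, using simplified stand-in classes rather than LangChain's actual Blob and blob parser interfaces. The names SimpleBlob, SimpleDocument, TextBlobParser and the s3:// source string are all hypothetical. The point it tries to show is that a parser consumes bytes plus metadata, so the content can come from memory or cloud storage without being written to the local file system first. It also shows the limitation raised earlier in the thread: without a mimetype the parser cannot decide how to handle the bytes.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class SimpleBlob:
    """Hypothetical stand-in for a blob: raw bytes plus optional metadata."""

    data: bytes
    mimetype: Optional[str] = None
    source: Optional[str] = None


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a parsed document."""

    page_content: str
    metadata: dict


class TextBlobParser:
    """Parses blobs from any origin; it never reads the file system itself."""

    def lazy_parse(self, blob: SimpleBlob) -> Iterator[SimpleDocument]:
        # Without a mimetype there is no way to pick a parsing strategy.
        if blob.mimetype != "text/plain":
            raise ValueError(f"Unsupported mimetype: {blob.mimetype!r}")
        yield SimpleDocument(
            page_content=blob.data.decode("utf-8"),
            metadata={"source": blob.source},
        )


# Bytes already held in memory, e.g. after a download from object storage.
blob = SimpleBlob(
    data=b"some text", mimetype="text/plain", source="s3://bucket/key.txt"
)
docs = list(TextBlobParser().lazy_parse(blob))
```

A document loader can still wrap such a parser for users who prefer passing a file path.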
Let me know what you decide
We consulted and decided to keep this as is
for row in table["cells"]:
    table_html += "<tr>\n"
    for cell in row:
        cell_text = "\n".join(line["text"] for line in cell["lines"])
Could you add HTML escaping here?
Do you mean changing < to &lt;, > to &gt;, and & to &amp; in cell_text?
>>> import html
>>> html.escape('<script>malicious</script>')
'&lt;script&gt;malicious&lt;/script&gt;'
Done
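For illustration, a minimal sketch of how escaping could be applied inside the table-building loop quoted above. Only the loop structure comes from the diff; the sample table data and the surrounding table and td tags are assumptions made for this example, not the code that was merged.

```python
import html

# Hypothetical table in the shape the quoted loop iterates over:
# rows of cells, each cell holding a list of text lines.
table = {
    "cells": [
        [
            {"lines": [{"text": "a < b"}]},
            {"lines": [{"text": "x & y"}]},
        ],
    ]
}

table_html = "<table>\n"
for row in table["cells"]:
    table_html += "<tr>\n"
    for cell in row:
        cell_text = "\n".join(line["text"] for line in cell["lines"])
        # Escape before embedding so cell content cannot inject markup.
        table_html += f"<td>{html.escape(cell_text)}</td>\n"
    table_html += "</tr>\n"
table_html += "</table>"
```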
    self, url: str, file_path: str, parameters: dict
) -> Dict[str, Union[list, dict, str]]:
    """Send POST-request to `dedoc` API and return the results"""
    import json
nit: These are standard library imports, so you can move them to the global scope.
Thank you, we'll fix it!
def __init__(
    self,
    file_path: str,
    url: str = "http://0.0.0.0:1231",
Would suggest using * before the named arguments with defaults to make it bulletproof for future refactors.
Done
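A small sketch of what the suggestion amounts to. The class name ExampleAPILoader and the file name are hypothetical; only the file_path and url parameters mirror the quoted signature. A bare * in the parameter list makes everything after it keyword-only, so later reordering or inserting parameters cannot silently change what a positional call means.

```python
class ExampleAPILoader:
    """Hypothetical loader used only to illustrate keyword-only arguments."""

    def __init__(self, file_path: str, *, url: str = "http://0.0.0.0:1231") -> None:
        self.file_path = file_path
        self.url = url


# Arguments after the bare * must be passed by name.
loader = ExampleAPILoader("report.pdf", url="http://localhost:1231")

# Passing url positionally is rejected with a TypeError.
try:
    ExampleAPILoader("report.pdf", "http://localhost:1231")
    positional_allowed = True
except TypeError:
    positional_allowed = False
```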
    file_path: str,
    split: str = "document",
    with_tables: bool = True,
    **dedoc_kwargs: Union[str, bool],
nit: Expand the kwargs into arguments to make it easier to use -- it's a bit of extra typing for the developer, but a much nicer user experience
Done
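To make the trade-off concrete, here is a hedged sketch of a signature with the catch-all kwargs expanded into explicit parameters. The class name ExampleFileLoader and the options language and delimiter are illustrative placeholders, not the parameter list that was actually merged. Explicit parameters show up in editor completion and documentation, and a misspelled option fails immediately instead of being forwarded silently.

```python
from typing import Dict, Union


class ExampleFileLoader:
    """Hypothetical loader with explicit options instead of **kwargs."""

    def __init__(
        self,
        file_path: str,
        *,
        split: str = "document",
        with_tables: bool = True,
        language: str = "eng",  # illustrative option, assumed for this sketch
        delimiter: str = ",",  # illustrative option, assumed for this sketch
    ) -> None:
        self.file_path = file_path
        self.split = split
        self.with_tables = with_tables
        # Collect the explicit options into the dict passed on to the parser.
        self.parsing_parameters: Dict[str, Union[str, bool]] = {
            "language": language,
            "delimiter": delimiter,
        }


loader = ExampleFileLoader("table.csv", delimiter=";")

# A typo in an option name is caught at call time.
try:
    ExampleFileLoader("table.csv", delimeter=";")
    typo_accepted = True
except TypeError:
    typo_accepted = False
```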
def __init__(
    self,
    file_path: str,
    split: str = "document",
Would suggest using * before the named arguments with defaults to make it bulletproof for future refactors.
Done
Triggering integration docs lint: #22866. Could you update the docstrings to match?
…angchain-ai#24303)
Co-authored-by: Nasty <[email protected]>
Description
This pull request adds new document loaders that load documents of various formats using Dedoc (https://github.com/ispras/dedoc):
- DedocFileLoader (determines file types automatically and parses them)
- DedocPDFLoader (for parsing PDF files and images)
- DedocAPIFileLoader (determines file types automatically and parses them using the Dedoc API, without installing the library)
Dedoc (https://dedoc.readthedocs.io) is an open-source library/service that extracts text, tables, attached files, and document structure (e.g., titles and list items) from files of various formats. The library is actively developed and maintained by a group of developers.
Dedoc supports DOCX, XLSX, PPTX, EML, HTML, PDF, images, and more. The full list of supported formats can be found at https://dedoc.readthedocs.io/en/latest/#id1.
For PDF documents, Dedoc can determine whether the textual layer is correct and split the document into paragraphs.
Issue
This pull request extends the variety of document loaders supported by langchain_community, allowing users to choose the most suitable option for parsing raw documents.
Dependencies
The PR adds a new (optional) dependency, dedoc>=2.2.5 (library documentation: https://dedoc.readthedocs.io), to extended_testing_deps.txt.
Twitter handle
None
Add tests and docs
1. Test for the integration: libs/community/tests/integration_tests/document_loaders/test_dedoc.py
2. Example notebook: docs/docs/integrations/document_loaders/dedoc.ipynb
3. Information about the library: docs/docs/integrations/providers/dedoc.mdx
Lint and test
Done locally:
- make format
- make lint
- make integration_tests
- make docs_build (from the project root)