community[minor]: added new document loaders based on dedoc library #24303

Merged
merged 4 commits into from
Jul 23, 2024
484 changes: 484 additions & 0 deletions docs/docs/integrations/document_loaders/dedoc.ipynb

Large diffs are not rendered by default.

56 changes: 56 additions & 0 deletions docs/docs/integrations/providers/dedoc.mdx
@@ -0,0 +1,56 @@
# Dedoc

>[Dedoc](https://dedoc.readthedocs.io) is an [open-source](https://github.com/ispras/dedoc)
library/service that extracts texts, tables, attached files and document structure
(e.g., titles, list items, etc.) from files of various formats.

`Dedoc` supports `DOCX`, `XLSX`, `PPTX`, `EML`, `HTML`, `PDF`, images and more.
The full list of supported formats can be found [here](https://dedoc.readthedocs.io/en/latest/#id1).

## Installation and Setup

### Dedoc library

You can install `Dedoc` using `pip`.
In this case, you will also need to install its dependencies;
see the [installation guide](https://dedoc.readthedocs.io/en/latest/getting_started/installation.html)
for more information.

```bash
pip install dedoc
```

### Dedoc API

If you are going to use the `Dedoc` API, you don't need to install the `dedoc` library.
Instead, you need a running `Dedoc` service, e.g. in a `Docker` container (please see
[the documentation](https://dedoc.readthedocs.io/en/latest/getting_started/installation.html#install-and-run-dedoc-using-docker)
for more details):

```bash
docker pull dedocproject/dedoc
docker run -p 1231:1231 dedocproject/dedoc
```
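Before pointing a loader at the service, it can help to confirm that something is listening on the published port. The following is a minimal sketch using only the Python standard library. The helper name `is_service_listening` is hypothetical, and the default host and port assume the `-p 1231:1231` mapping shown above. It only checks that a TCP connection can be opened, not that the service is healthy.

```python
import socket


def is_service_listening(
    host: str = "localhost", port: int = 1231, timeout: float = 2.0
) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection raises OSError (refused, timed out, unreachable)
        # when nothing accepts connections on the given address.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `is_service_listening()` after `docker run` should return `True` once the container has started, under the assumptions stated above.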

## Document Loader

* For handling files of any format supported by `Dedoc`, you can use `DedocFileLoader`:

```python
from langchain_community.document_loaders import DedocFileLoader
```

* For handling PDF files (with or without a textual layer), you can use `DedocPDFLoader`:

```python
from langchain_community.document_loaders import DedocPDFLoader
```

* For handling files of any format without installing the library,
you can use the `Dedoc` API with `DedocAPIFileLoader`:

```python
from langchain_community.document_loaders import DedocAPIFileLoader
```
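The three options above amount to a small decision rule, which can be sketched as follows. This is an illustration, not part of the integration: `pick_dedoc_loader_name` and `load_documents` are hypothetical helpers, the `.pdf` suffix check is a simplifying assumption, and `load_documents` assumes that each loader accepts the file path as its first argument and, for the API loader, that its default service URL matches your running container.

```python
from pathlib import Path


def pick_dedoc_loader_name(file_path: str, use_api: bool = False) -> str:
    """Return the name of the loader class suggested by the list above."""
    if use_api:
        # No local dedoc installation needed; a running service is required.
        return "DedocAPIFileLoader"
    if Path(file_path).suffix.lower() == ".pdf":
        return "DedocPDFLoader"
    return "DedocFileLoader"


def load_documents(file_path: str, use_api: bool = False):
    """Instantiate the chosen loader and return its documents."""
    # Imported lazily so the selection helper above works even when
    # langchain_community is not installed.
    import langchain_community.document_loaders as loaders

    loader_cls = getattr(loaders, pick_dedoc_loader_name(file_path, use_api))
    # Assumption: file path is the first positional argument of each loader.
    return loader_cls(file_path).load()
```

For example, `load_documents("report.docx")` would go through `DedocFileLoader`, while `load_documents("report.docx", use_api=True)` would go through `DedocAPIFileLoader`.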

Please see a [usage example](/docs/integrations/document_loaders/dedoc) for more details.
1 change: 1 addition & 0 deletions libs/community/extended_testing_deps.txt
@@ -16,6 +16,7 @@ cloudpickle>=2.0.0
cohere>=4,<6
databricks-vectorsearch>=0.21,<0.22
datasets>=2.15.0,<3
dedoc>=2.2.6,<3
dgml-utils>=0.3.0,<0.4
elasticsearch>=8.12.0,<9
esprima>=4.0.1,<5
11 changes: 11 additions & 0 deletions libs/community/langchain_community/document_loaders/__init__.py
@@ -142,6 +142,10 @@
from langchain_community.document_loaders.dataframe import (
DataFrameLoader,
)
from langchain_community.document_loaders.dedoc import (
DedocAPIFileLoader,
DedocFileLoader,
)
from langchain_community.document_loaders.diffbot import (
DiffbotLoader,
)
@@ -340,6 +344,7 @@
)
from langchain_community.document_loaders.pdf import (
AmazonTextractPDFLoader,
DedocPDFLoader,
MathpixPDFLoader,
OnlinePDFLoader,
PagedPDFSplitter,
@@ -570,6 +575,9 @@
"CubeSemanticLoader": "langchain_community.document_loaders.cube_semantic",
"DataFrameLoader": "langchain_community.document_loaders.dataframe",
"DatadogLogsLoader": "langchain_community.document_loaders.datadog_logs",
"DedocAPIFileLoader": "langchain_community.document_loaders.dedoc",
"DedocFileLoader": "langchain_community.document_loaders.dedoc",
"DedocPDFLoader": "langchain_community.document_loaders.pdf",
"DiffbotLoader": "langchain_community.document_loaders.diffbot",
"DirectoryLoader": "langchain_community.document_loaders.directory",
"DiscordChatLoader": "langchain_community.document_loaders.discord",
@@ -771,6 +779,9 @@ def __getattr__(name: str) -> Any:
"CubeSemanticLoader",
"DataFrameLoader",
"DatadogLogsLoader",
"DedocAPIFileLoader",
"DedocFileLoader",
"DedocPDFLoader",
"DiffbotLoader",
"DirectoryLoader",
"DiscordChatLoader",
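
Note on the `__init__.py` hunks: each new loader is registered in an import block, in a name-to-module mapping, and in a list of exported names. The mapping, together with the module-level `__getattr__` visible in the last hunk header, suggests a lazy-import pattern (PEP 562). A minimal standalone sketch of that pattern follows. It is an illustration and not a copy of the actual file: `_module_lookup` and `lazy_getattr` are assumed names, and standard-library modules stand in for the loader modules.

```python
import importlib
from typing import Any

# Stand-in mapping: public name -> module that defines it.
_module_lookup = {
    "OrderedDict": "collections",
    "dumps": "json",
}


def lazy_getattr(name: str) -> Any:
    """Import the owning module on first access and return the attribute."""
    if name in _module_lookup:
        module = importlib.import_module(_module_lookup[name])
        return getattr(module, name)
    raise AttributeError(f"module has no attribute {name!r}")
```

In a package, a function like this would be assigned to the module-level `__getattr__`, so that accessing a registered name triggers the import only when it is first needed.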