community[minor]: added new document loaders based on dedoc library (#24303)

### Description
This pull request adds new document loaders that load documents of
various formats using [Dedoc](https://github.com/ispras/dedoc):
  - `DedocFileLoader` (determines the file type automatically and parses the file)
  - `DedocPDFLoader` (parses `PDF` files and images)
  - `DedocAPIFileLoader` (determines the file type automatically and parses the
    file through the Dedoc API, without installing the library)

[Dedoc](https://dedoc.readthedocs.io) is an open-source library/service
that extracts texts, tables, attached files and document structure
(e.g., titles, list items, etc.) from files of various formats. The
library is actively developed and maintained by a group of developers.

`Dedoc` supports `DOCX`, `XLSX`, `PPTX`, `EML`, `HTML`, `PDF`, images
and more.
The full list of supported formats can be found
[here](https://dedoc.readthedocs.io/en/latest/#id1).
For `PDF` documents, `Dedoc` can check whether the textual layer is
correct and split the document into paragraphs.
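
As a rough illustration of how one of these loaders might be used, the sketch below wraps `DedocFileLoader` in a small helper. Only the class name and its import path come from this PR; `load()` is the standard LangChain loader interface, and `load_documents`, `preview`, and the `width` parameter are hypothetical names introduced here. The import is deferred so the snippet can be imported without `dedoc` installed.

```python
from typing import Any, Iterable, List


def load_documents(file_path: str) -> List[Any]:
    """Parse a file with Dedoc and return LangChain documents.

    Assumption: `langchain_community` and `dedoc` are installed.
    """
    # Deferred import: only needed when the helper is actually called.
    from langchain_community.document_loaders import DedocFileLoader

    loader = DedocFileLoader(file_path)  # Dedoc detects the file type itself
    return loader.load()


def preview(docs: Iterable[Any], width: int = 60) -> List[str]:
    """Return the first `width` characters of each document's text."""
    return [doc.page_content[:width] for doc in docs]
```

Calling `preview(load_documents(path))` would give a quick look at what was extracted; how the text is split across documents depends on loader options not covered here.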


### Issue
This pull request extends the variety of document loaders supported by
`langchain_community`, allowing users to choose the most suitable option
for parsing raw documents.

### Dependencies
The PR adds a new (optional) dependency, `dedoc>=2.2.6,<3` ([library
documentation](https://dedoc.readthedocs.io)), to
`extended_testing_deps.txt`.

### Twitter handle
None

### Add tests and docs
1. Test for the integration:
`libs/community/tests/integration_tests/document_loaders/test_dedoc.py`
2. Example notebook:
`docs/docs/integrations/document_loaders/dedoc.ipynb`
3. Information about the library:
`docs/docs/integrations/providers/dedoc.mdx`

### Lint and test

Done locally:

  - `make format`
  - `make lint`
  - `make integration_tests`
  - `make docs_build` (from the project root)

---------

Co-authored-by: Nasty <[email protected]>
alexander1999-hub and NastyBoget authored Jul 23, 2024
1 parent 5ac936a commit 2a70a07
Showing 8 changed files with 1,346 additions and 0 deletions.
484 changes: 484 additions & 0 deletions docs/docs/integrations/document_loaders/dedoc.ipynb

Large diffs are not rendered by default.

56 changes: 56 additions & 0 deletions docs/docs/integrations/providers/dedoc.mdx
@@ -0,0 +1,56 @@
# Dedoc

>[Dedoc](https://dedoc.readthedocs.io) is an [open-source](https://github.com/ispras/dedoc)
library/service that extracts texts, tables, attached files and document structure
(e.g., titles, list items, etc.) from files of various formats.

`Dedoc` supports `DOCX`, `XLSX`, `PPTX`, `EML`, `HTML`, `PDF`, images and more.
The full list of supported formats can be found [here](https://dedoc.readthedocs.io/en/latest/#id1).

## Installation and Setup

### Dedoc library

You can install `Dedoc` using `pip`.
This also requires installing additional dependencies;
see the [installation guide](https://dedoc.readthedocs.io/en/latest/getting_started/installation.html)
for more information.

```bash
pip install dedoc
```

### Dedoc API

If you are going to use the `Dedoc` API, you don't need to install the `dedoc` library.
Instead, run the `Dedoc` service, e.g. as a `Docker` container (see
[the documentation](https://dedoc.readthedocs.io/en/latest/getting_started/installation.html#install-and-run-dedoc-using-docker)
for more details):

```bash
docker pull dedocproject/dedoc
docker run -p 1231:1231 dedocproject/dedoc
```
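
Once the container is running, a client could check that the service is reachable before parsing anything. The following is a minimal sketch, not the loader's own logic: `DEDOC_URL`, `host_and_port`, `service_is_reachable`, and `load_via_api` are hypothetical names, the address is an assumption derived from the `1231:1231` port mapping above, and the `url` keyword of `DedocAPIFileLoader` is assumed rather than confirmed by this diff.

```python
import socket
from typing import Any, List, Tuple
from urllib.parse import urlparse

# Assumption: the service is exposed locally on the port mapped above.
DEDOC_URL = "http://localhost:1231"


def host_and_port(url: str) -> Tuple[str, int]:
    """Extract (host, port) from a URL, falling back to the scheme's default port."""
    parsed = urlparse(url)
    default_port = 443 if parsed.scheme == "https" else 80
    return parsed.hostname or "localhost", parsed.port or default_port


def service_is_reachable(url: str = DEDOC_URL, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the service can be opened."""
    try:
        with socket.create_connection(host_and_port(url), timeout=timeout):
            return True
    except OSError:
        return False


def load_via_api(file_path: str, url: str = DEDOC_URL) -> List[Any]:
    """Parse a file through the Dedoc service; no local `dedoc` install needed."""
    # Deferred import so this module loads without langchain_community.
    from langchain_community.document_loaders import DedocAPIFileLoader

    # Assumption: the loader accepts the service address as `url`.
    loader = DedocAPIFileLoader(file_path, url=url)
    return loader.load()
```

A TCP check only shows that something is listening on the port; it does not confirm that the listener is a `Dedoc` service.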

## Document Loader

* For handling files of any format supported by `Dedoc`, you can use `DedocFileLoader`:

```python
from langchain_community.document_loaders import DedocFileLoader
```

* For handling PDF files (with or without a textual layer), you can use `DedocPDFLoader`:

```python
from langchain_community.document_loaders import DedocPDFLoader
```

* For handling files of any format without installing the library,
you can use the `Dedoc` API with `DedocAPIFileLoader`:

```python
from langchain_community.document_loaders import DedocAPIFileLoader
```

Please see a [usage example](/docs/integrations/document_loaders/dedoc) for more details.
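
The choice between the three loaders described above could be captured in a small helper. This is only a sketch of one possible selection rule: `choose_loader_name` and `make_loader` are hypothetical, the rule of routing `.pdf` files to `DedocPDFLoader` is an assumption, and each loader is constructed with just a file path.

```python
import importlib
from pathlib import Path
from typing import Any


def choose_loader_name(file_path: str, use_api: bool = False) -> str:
    """Pick one of the three Dedoc loader class names for a file."""
    if use_api:
        # No local dedoc installation: go through the Dedoc service.
        return "DedocAPIFileLoader"
    if Path(file_path).suffix.lower() == ".pdf":
        # PDFs with or without a textual layer.
        return "DedocPDFLoader"
    # Any other format supported by Dedoc.
    return "DedocFileLoader"


def make_loader(file_path: str, use_api: bool = False) -> Any:
    """Instantiate the chosen loader (requires langchain_community)."""
    module = importlib.import_module("langchain_community.document_loaders")
    loader_cls = getattr(module, choose_loader_name(file_path, use_api))
    return loader_cls(file_path)
```

For example, `choose_loader_name("report.pdf")` selects `DedocPDFLoader`, while passing `use_api=True` always selects `DedocAPIFileLoader`.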
1 change: 1 addition & 0 deletions libs/community/extended_testing_deps.txt
@@ -16,6 +16,7 @@ cloudpickle>=2.0.0
cohere>=4,<6
databricks-vectorsearch>=0.21,<0.22
datasets>=2.15.0,<3
dedoc>=2.2.6,<3
dgml-utils>=0.3.0,<0.4
elasticsearch>=8.12.0,<9
esprima>=4.0.1,<5
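
To see whether a local environment satisfies the `dedoc>=2.2.6,<3` pin, something like the following could be used. It is a simplified sketch that handles only plain numeric versions such as `2.2.6` (no pre-release or local suffixes); `parse_version`, `satisfies_pin`, and `installed_dedoc_version` are hypothetical helper names.

```python
from importlib import metadata
from typing import Optional, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '2.2.6' into (2, 2, 6); only plain numeric versions are handled."""
    return tuple(int(part) for part in version.split("."))


def satisfies_pin(version: str) -> bool:
    """Check a version string against the pin `>=2.2.6,<3`."""
    return (2, 2, 6) <= parse_version(version) < (3,)


def installed_dedoc_version() -> Optional[str]:
    """Return the installed dedoc version, or None if it is not installed."""
    try:
        return metadata.version("dedoc")
    except metadata.PackageNotFoundError:
        return None
```

For real dependency resolution, a dedicated version-parsing library is a better fit than tuple comparison.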
11 changes: 11 additions & 0 deletions libs/community/langchain_community/document_loaders/__init__.py
@@ -142,6 +142,10 @@
    from langchain_community.document_loaders.dataframe import (
        DataFrameLoader,
    )
    from langchain_community.document_loaders.dedoc import (
        DedocAPIFileLoader,
        DedocFileLoader,
    )
    from langchain_community.document_loaders.diffbot import (
        DiffbotLoader,
    )
@@ -340,6 +344,7 @@
    )
    from langchain_community.document_loaders.pdf import (
        AmazonTextractPDFLoader,
        DedocPDFLoader,
        MathpixPDFLoader,
        OnlinePDFLoader,
        PagedPDFSplitter,
@@ -570,6 +575,9 @@
    "CubeSemanticLoader": "langchain_community.document_loaders.cube_semantic",
    "DataFrameLoader": "langchain_community.document_loaders.dataframe",
    "DatadogLogsLoader": "langchain_community.document_loaders.datadog_logs",
    "DedocAPIFileLoader": "langchain_community.document_loaders.dedoc",
    "DedocFileLoader": "langchain_community.document_loaders.dedoc",
    "DedocPDFLoader": "langchain_community.document_loaders.pdf",
    "DiffbotLoader": "langchain_community.document_loaders.diffbot",
    "DirectoryLoader": "langchain_community.document_loaders.directory",
    "DiscordChatLoader": "langchain_community.document_loaders.discord",
@@ -771,6 +779,9 @@ def __getattr__(name: str) -> Any:
    "CubeSemanticLoader",
    "DataFrameLoader",
    "DatadogLogsLoader",
    "DedocAPIFileLoader",
    "DedocFileLoader",
    "DedocPDFLoader",
    "DiffbotLoader",
    "DirectoryLoader",
    "DiscordChatLoader",
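
The hunks above register each new loader in a name-to-module mapping and in the export list, next to a module-level `__getattr__`. The body of that function is not shown in this diff, so the following is only a sketch of the usual lazy-import pattern (PEP 562) that such a mapping supports. It uses standard-library targets as stand-ins, and `_lazy_getattr` and `lazy_pkg` are hypothetical names.

```python
import importlib
import types
from typing import Any, Dict

# Stand-in for the name-to-module mapping: each public name points to the
# module that defines it. Standard-library targets keep the sketch runnable.
_module_lookup: Dict[str, str] = {
    "OrderedDict": "collections",
    "JSONDecoder": "json",
}


def _lazy_getattr(name: str) -> Any:
    """Import the defining module only when `name` is first requested."""
    if name in _module_lookup:
        module = importlib.import_module(_module_lookup[name])
        return getattr(module, name)
    raise AttributeError(f"module has no attribute {name!r}")


# Attach the hook to a throwaway module object. Under PEP 562, Python calls a
# module's `__getattr__` when normal attribute lookup on the module fails.
lazy_pkg = types.ModuleType("lazy_pkg")
lazy_pkg.__getattr__ = _lazy_getattr
lazy_pkg.__all__ = list(_module_lookup)
```

With this pattern, accessing `lazy_pkg.OrderedDict` triggers the import of `collections` on demand, which is why a new loader needs an entry in both the mapping and the export list to be importable from the package root.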