Pdf to Markdown duplication issue #790

Open
AbzyS1 opened this issue Jan 23, 2025 · 3 comments
Labels
PDF parsing pdf question Further information is requested

Comments

@AbzyS1

AbzyS1 commented Jan 23, 2025

Question

I am new to Docling and have been experimenting. This is how I have currently implemented Docling to parse my PDF:

```python
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.pipeline_options import PdfPipelineOptions

pipeline_options = PdfPipelineOptions(force_ocr=True)
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
source = "data/my-document.pdf"
result = converter.convert(source)
print(result.document.export_to_markdown())
```

The issue is in the output: every sentence is duplicated.

Here is an example of what I mean:

4. Characteristics of Investment Agency 4. Characteristics of Investment Agency
* 4/1 Investment agency contracts, whether remunerated or unremuner4/1 Investment agency contracts, whether remunerated or unremunerated, are binding on institutions because they are invariably fixed ated, are binding on institutions because they are invariably fixed term contracts in which both parties agree not to terminate within term contracts in which both parties agree not to terminate within a specified period.
* 4/2 Where the parties agreed to terminate for a specified period, it is 4/2 Where the parties agreed to terminate for a specified period, it is permissible for the contract to stipulate the right of one of the parties permissible for the contract to stipulate the right of one of the parties to terminate the contract unilaterally in specific circumstances. to terminate the contract unilaterally in specific circumstances.
* 4/3 When the term of an agency expires, the agent is required not to enter 4/3 When the term of an agency expires, the agent is required not to enter into new investment activities, but may not liquidate ongoing existing into new investment activities, but may not liquidate ongoing existing investments. investments.

As you can see, it happens in both the titles and the body text. Any ideas why? I have tried a few things (e.g., setting force_ocr=False) but can't pinpoint the cause.

All help would be appreciated.
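
To quantify the duplication rather than judge it by eye, a small check along the following lines could be run on each line of the exported Markdown. This is only a sketch: `find_adjacent_repeats` is a hypothetical helper, not part of Docling, and it assumes the duplicates show up as a run of words immediately followed by the same run, as in the sample output above.

```python
def find_adjacent_repeats(text: str, min_words: int = 3) -> list[str]:
    """Return word runs of at least `min_words` words that are immediately repeated.

    Hypothetical diagnostic helper (not a Docling API). It only detects
    back-to-back repeats within the given text, e.g. "A B C A B C".
    """
    words = text.split()
    repeats = []
    i = 0
    while i < len(words):
        matched = False
        # Try the longest possible run first so whole duplicated lines win
        # over shorter repeats contained inside them.
        max_len = (len(words) - i) // 2
        for n in range(max_len, min_words - 1, -1):
            if words[i:i + n] == words[i + n:i + 2 * n]:
                repeats.append(" ".join(words[i:i + n]))
                i += 2 * n  # skip past both copies
                matched = True
                break
        if not matched:
            i += 1
    return repeats


if __name__ == "__main__":
    sample = (
        "4. Characteristics of Investment Agency "
        "4. Characteristics of Investment Agency"
    )
    for run in find_adjacent_repeats(sample):
        print(f"duplicated run: {run!r}")
```

Applying it line by line to the Markdown output keeps the search cheap and makes it easy to report which lines are affected.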

AbzyS1 added the "question" label (Further information is requested) on Jan 23, 2025
@vagenas
Contributor

vagenas commented Jan 27, 2025

@AbzyS1 I ran your snippet against a document I found on Google that matches your pasted content (I assume it is the same one; otherwise, please share your document if possible), and it converts without any duplication.

Can you share the outputs of `docling --version` and `python -c "import platform; print(platform.platform())"`?

If you are on an older Docling version, I would update to the latest release and try again.
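
If it is easier to collect everything in one go, the same details can be gathered from Python. This is a minimal sketch using only the standard library; the distribution names in `PACKAGES` are assumed to be the PyPI names of the Docling components, and `collect_environment` is a hypothetical helper, not a Docling function.

```python
import platform
from importlib.metadata import PackageNotFoundError, version

# Assumed PyPI distribution names for the Docling components.
PACKAGES = ["docling", "docling-core", "docling-ibm-models", "docling-parse"]


def collect_environment(packages=PACKAGES):
    """Return installed package versions plus Python and platform details."""
    info = {}
    for name in packages:
        try:
            info[name] = version(name)
        except PackageNotFoundError:
            # Report missing packages instead of raising.
            info[name] = "not installed"
    info["python"] = platform.python_version()
    info["platform"] = platform.platform()
    return info


if __name__ == "__main__":
    for key, value in collect_environment().items():
        print(f"{key}: {value}")
```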

vagenas self-assigned this on Jan 27, 2025
@PeterStaar-IBM
Contributor

@AbzyS1 Can you attach an example PDF where this happens?

@AbzyS1
Author

AbzyS1 commented Jan 28, 2025

Shariaa-Standards-extract-ENG.pdf

The above is an extract from the document in question.

Following up on @vagenas's questions, here are the outputs:

Docling version: 2.8.3
Docling Core version: 2.14.0
Docling IBM Models version: 2.0.8
Docling Parse version: 2.1.2


Python 3.11.5


macOS-14.5-arm64-arm-64bit
