I am new to Docling and have been experimenting. This is how I have currently implemented Docling to parse my PDF:
```python
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.pipeline_options import PdfPipelineOptions

# Pipeline options as used in my setup
pipeline_options = PdfPipelineOptions(force_ocr=True)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

source = "data/my-document.pdf"
result = converter.convert(source)
print(result.document.export_to_markdown())
```
The issue is in the output: every sentence is duplicated.
Here is an example of what I mean:
4. Characteristics of Investment Agency 4. Characteristics of Investment Agency
* 4/1 Investment agency contracts, whether remunerated or unremuner4/1 Investment agency contracts, whether remunerated or unremunerated, are binding on institutions because they are invariably fixed ated, are binding on institutions because they are invariably fixed term contracts in which both parties agree not to terminate within term contracts in which both parties agree not to terminate within a specified period.
* 4/2 Where the parties agreed to terminate for a specified period, it is 4/2 Where the parties agreed to terminate for a specified period, it is permissible for the contract to stipulate the right of one of the parties permissible for the contract to stipulate the right of one of the parties to terminate the contract unilaterally in specific circumstances. to terminate the contract unilaterally in specific circumstances.
* 4/3 When the term of an agency expires, the agent is required not to enter 4/3 When the term of an agency expires, the agent is required not to enter into new investment activities, but may not liquidate ongoing existing into new investment activities, but may not liquidate ongoing existing investments. investments.
As you can see, it happens in both the titles and the body text. Any ideas why? I have tried a few things (e.g., setting `force_ocr=False`) but can't seem to pinpoint the issue.
All help would be appreciated.
@AbzyS1 I ran your snippet against a document I found on Google that matches your pasted content (I assume it's the same one; if not, please share your document if possible), and it works without duplication issues.
Can you share the outputs of `docling --version` and `python -c "import platform; print(platform.platform())"`?
In case you have an old Docling version, I would update to the latest and try again.
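If it is easier, the same details can be collected from Python in one go. This is only a minimal sketch using the standard library: `collect_environment_info` is a hypothetical helper name, not part of Docling, and it assumes the package is installed under the distribution name `docling`.

```python
import platform
from importlib import metadata


def collect_environment_info(package: str = "docling") -> dict[str, str]:
    """Return the installed package version plus Python and OS details."""
    try:
        # Reads the version from the installed distribution's metadata.
        package_version = metadata.version(package)
    except metadata.PackageNotFoundError:
        # Assumption: report a placeholder instead of failing when the
        # package is not installed in the current environment.
        package_version = "not installed"
    return {
        package: package_version,
        "python": platform.python_version(),
        "platform": platform.platform(),
    }


if __name__ == "__main__":
    for key, value in collect_environment_info().items():
        print(f"{key}: {value}")
```

Pasting the printed lines into the thread should be enough to tell whether an outdated install is involved.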