Alternative method tried:
From the traceback, you are hitting a KeyError during PDF text extraction: the code reads the 'refs' key from a page dictionary that does not contain it. Adding error handling lets the caller deal gracefully with pages where the expected data is missing.
Here’s a revised version of your original code to enhance its robustness:
from marker.config.parser import ConfigParser
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict


def pdf_to_markdown(file: str) -> str:
    # Same options as the marker_single command line.
    config = {
        "output_format": "markdown",
        "disable_image_extraction": "true",
        "disable_links": "true",
    }
    config_parser = ConfigParser(config)
    converter = PdfConverter(
        config=config_parser.generate_config_dict(),
        artifact_dict=create_model_dict(),
        renderer=config_parser.get_renderer(),
    )
    try:
        rendered = converter(file)
        output = rendered.markdown
        return output
    except KeyError as e:
        # Return an empty string instead of crashing on the missing key.
        print(f"Error occurred during PDF processing: {e}")
        return ""


if __name__ == "__main__":
    file = "file_location"
    output = pdf_to_markdown(file)
    # Print the result once; report the failure when nothing came back.
    if output:
        print(output)
    else:
        print("The output is empty due to an error in PDF processing.")
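Printing only the exception message drops the traceback, which is usually what a maintainer needs to see. Below is a minimal sketch of the same guard that logs the full traceback instead. `safe_convert` and `fake_converter` are hypothetical names used for illustration, and the converter is passed in as a plain callable so the snippet runs without marker installed:

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def safe_convert(convert: Callable[[str], str], path: str) -> str:
    """Run convert(path); on KeyError, log the traceback and return ""."""
    try:
        return convert(path)
    except KeyError:
        # logger.exception records the message plus the full traceback.
        logger.exception("PDF processing failed for %s", path)
        return ""


def fake_converter(path: str) -> str:
    # Hypothetical stand-in for the real converter: mimics the failing lookup.
    page = {"blocks": []}
    return page["refs"]


if __name__ == "__main__":
    logging.basicConfig(level=logging.ERROR)
    print(repr(safe_convert(fake_converter, "file_location")))
```

In real use, the callable would be a small function that wraps the converter and returns its Markdown output.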
Expected behaviour
I expected to get the Markdown file (bash command) or string (Python function) without HTML tags (e.g. , <script>, etc.)
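As a stopgap for this expectation, HTML can be stripped from the Markdown string after conversion. This is a rough sketch using only the standard library. `strip_html` is a hypothetical helper, not part of marker, and regex-based stripping is an assumption that holds only for simple, well-formed tags:

```python
import re

# <script>/<style> elements are dropped together with their contents.
_BLOCK_RE = re.compile(r"<(script|style)\b[^>]*>.*?</\1\s*>", re.DOTALL | re.IGNORECASE)
# Any other tag: a name followed by '>', '/>' or whitespace plus attributes.
# Markdown autolinks such as <https://example.com> do not match.
_TAG_RE = re.compile(r"</?[A-Za-z][A-Za-z0-9-]*(\s[^<>]*)?/?>")


def strip_html(markdown: str) -> str:
    """Remove script/style blocks and remaining HTML tags, keeping inner text."""
    without_blocks = _BLOCK_RE.sub("", markdown)
    return _TAG_RE.sub("", without_blocks)


print(strip_html('Hello <span id="p1">world</span><script>alert(1)</script>!'))
# prints: Hello world!
```

Note that this would also strip tag-like text inside fenced code blocks, so it is only suitable when the document contains no such code.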
Current behaviour
I ran into this error:
File "/root/Debian/test/.venv/lib/python3.11/site-packages/marker/providers/pdf.py", line 235, in pdftext_extraction
self.page_refs[page_id] = page["refs"]
~~~~^^^^^^^^
KeyError: 'refs'
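The failing statement indexes `page["refs"]` directly, so any page dictionary without that key raises. As an illustration only, the sketch below shows the difference between direct indexing and `dict.get` with a default. It is not marker's code, and whether an empty list is the right fallback there is an assumption. `collect_refs` and the sample `pages` data are hypothetical:

```python
def collect_refs(pages: list[dict]) -> dict[int, list]:
    """Map page index to its 'refs' list, defaulting to [] when the key is absent."""
    page_refs = {}
    for page_id, page in enumerate(pages):
        # page["refs"] would raise KeyError here for the second sample page.
        page_refs[page_id] = page.get("refs", [])
    return page_refs


pages = [{"refs": ["link-1"]}, {"blocks": []}]
print(collect_refs(pages))
# prints: {0: ['link-1'], 1: []}
```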
FIX PROPOSAL
Here is a proposal to fix this issue.
Thanks in advance for your time.
Hi and thanks for your work.
Description of the process
marker_single [pdf_file] --disable_image_extraction --output_format markdown --disable_links --output_dir .
COMPLETE LOG
root-IHT/testing-.venv-~/Debian/test - marker_single tests/resources/test_pdf/fox.pdf --disable_image_extraction --output_format markdown --disable_links --output_dir .
Loaded layout model s3://layout/2025_02_18 on device cuda with dtype torch.float16
Loaded texify model s3://texify/2025_02_18 on device cuda with dtype torch.float16
Loaded recognition model s3://text_recognition/2025_02_18 on device cuda with dtype torch.float16
Loaded table recognition model s3://table_recognition/2025_02_18 on device cuda with dtype torch.float16
Loaded detection model s3://text_detection/2025_02_18 on device cuda with dtype torch.float16
Loaded detection model s3://inline_math_detection/2025_02_24 on device cuda with dtype torch.float16
Traceback (most recent call last):
  File "/root/Debian/test/.venv/bin/marker_single", line 8, in <module>
    sys.exit(convert_single_cli())
             ^^^^^^^^^^^^^^^^^^^^
  File "/root/Debian/test/.venv/lib/python3.11/site-packages/click/core.py", line 1161, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/Debian/test/.venv/lib/python3.11/site-packages/click/core.py", line 1082, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/root/Debian/test/.venv/lib/python3.11/site-packages/click/core.py", line 1443, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/Debian/test/.venv/lib/python3.11/site-packages/click/core.py", line 788, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/Debian/test/.venv/lib/python3.11/site-packages/marker/scripts/convert_single.py", line 35, in convert_single_cli
    rendered = converter(fpath)
               ^^^^^^^^^^^^^^^^
  File "/root/Debian/test/.venv/lib/python3.11/site-packages/marker/converters/pdf.py", line 151, in __call__
    document = self.build_document(filepath)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/Debian/test/.venv/lib/python3.11/site-packages/marker/converters/pdf.py", line 140, in build_document
    provider = provider_cls(filepath, self.config)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/Debian/test/.venv/lib/python3.11/site-packages/marker/providers/pdf.py", line 94, in __init__
    self.page_lines = self.pdftext_extraction(doc)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/Debian/test/.venv/lib/python3.11/site-packages/marker/providers/pdf.py", line 235, in pdftext_extraction
    self.page_refs[page_id] = page["refs"]
                              ~~~~^^^^^^^^
KeyError: 'refs'