Error when using marker_single with parameter "--disable_links" #584

Open
fatualux opened this issue Feb 26, 2025 · 0 comments · May be fixed by #585

fatualux commented Feb 26, 2025

Hi and thanks for your work.

Description of the process

  • Conversion of PDF to markdown.
  • Command used: marker_single [pdf_file] --disable_image_extraction --output_format markdown --disable_links --output_dir .
  • Alternative method tried: the Python function below, which calls the converter directly. The failure is a KeyError raised during PDF text extraction, when the code accesses a key named 'refs' in a page dictionary that does not contain it, so the function adds error handling to deal gracefully with the missing data.

from marker.config.parser import ConfigParser
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict


def pdf_to_markdown(file: str) -> str:
    config = {
        "output_format": "markdown",
        "disable_image_extraction": "true",
        "disable_links": "true",
    }
    config_parser = ConfigParser(config)

    converter = PdfConverter(
        config=config_parser.generate_config_dict(),
        artifact_dict=create_model_dict(),
        renderer=config_parser.get_renderer(),
    )

    try:
        rendered = converter(file)
        output = rendered.markdown
        return output
    except KeyError as e:
        # Raised when an expected key (here 'refs') is missing during extraction
        print(f"Error occurred during PDF processing: {e}")
        return ""


if __name__ == "__main__":
    file = "file_location"
    output = pdf_to_markdown(file)

    if output:
        print(output)
    else:
        print("The output is empty due to an error in PDF processing.")

Expected behaviour
I expected to get the Markdown file (bash command) or string (Python function) without HTML tags (e.g. <script>, etc.)
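
To make that expectation checkable, here is a minimal sketch of a helper that lists HTML-like tags left in a Markdown string. The helper name, the regular expression and the sample strings are my own assumptions for illustration; none of them come from marker.

```python
import re

# Hypothetical helper, not part of marker.
# Matches opening, closing and self-closing tags such as <span id="x">, </a>, <br/>.
HTML_TAG_RE = re.compile(r"</?[A-Za-z][A-Za-z0-9]*(?:\s[^<>]*)?/?>")


def find_html_tags(markdown: str) -> list[str]:
    """Return every HTML-like tag found in the given Markdown text."""
    return HTML_TAG_RE.findall(markdown)


if __name__ == "__main__":
    # Made-up sample text, only to exercise the helper.
    sample = 'See <span id="anchor"></span>[the figure](#anchor).'
    leftovers = find_html_tags(sample)
    if leftovers:
        print("HTML tags still present:", leftovers)
    else:
        print("No HTML tags found.")
```

Comparison operators in running text such as "a < b" are not reported, because the pattern requires a letter right after the opening bracket.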

Current behaviour
I ran into this error:

File "/root/Debian/test/.venv/lib/python3.11/site-packages/marker/providers/pdf.py", line 235, in pdftext_extraction
    self.page_refs[page_id] = page["refs"]
                              ~~~~^^^^^^^^
KeyError: 'refs'
COMPLETE LOG

root-IHT/testing-.venv-~/Debian/test - marker_single tests/resources/test_pdf/fox.pdf --disable_image_extraction --output_format markdown --disable_links --output_dir .

Loaded layout model s3://layout/2025_02_18 on device cuda with dtype torch.float16
Loaded texify model s3://texify/2025_02_18 on device cuda with dtype torch.float16
Loaded recognition model s3://text_recognition/2025_02_18 on device cuda with dtype torch.float16
Loaded table recognition model s3://table_recognition/2025_02_18 on device cuda with dtype torch.float16
Loaded detection model s3://text_detection/2025_02_18 on device cuda with dtype torch.float16
Loaded detection model s3://inline_math_detection/2025_02_24 on device cuda with dtype torch.float16
Traceback (most recent call last):
  File "/root/Debian/test/.venv/bin/marker_single", line 8, in <module>
    sys.exit(convert_single_cli())
             ^^^^^^^^^^^^^^^^^^^^
  File "/root/Debian/test/.venv/lib/python3.11/site-packages/click/core.py", line 1161, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/Debian/test/.venv/lib/python3.11/site-packages/click/core.py", line 1082, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/root/Debian/test/.venv/lib/python3.11/site-packages/click/core.py", line 1443, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/Debian/test/.venv/lib/python3.11/site-packages/click/core.py", line 788, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/Debian/test/.venv/lib/python3.11/site-packages/marker/scripts/convert_single.py", line 35, in convert_single_cli
    rendered = converter(fpath)
               ^^^^^^^^^^^^^^^^
  File "/root/Debian/test/.venv/lib/python3.11/site-packages/marker/converters/pdf.py", line 151, in __call__
    document = self.build_document(filepath)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/Debian/test/.venv/lib/python3.11/site-packages/marker/converters/pdf.py", line 140, in build_document
    provider = provider_cls(filepath, self.config)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/Debian/test/.venv/lib/python3.11/site-packages/marker/providers/pdf.py", line 94, in __init__
    self.page_lines = self.pdftext_extraction(doc)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/Debian/test/.venv/lib/python3.11/site-packages/marker/providers/pdf.py", line 235, in pdftext_extraction
    self.page_refs[page_id] = page["refs"]
                              ~~~~^^^^^^^^
KeyError: 'refs'
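
The last frame shows an unconditional page["refs"] lookup. The sketch below reproduces that failure mode on plain dictionaries and shows a tolerant lookup with dict.get. The function name and the page dictionaries are made up for illustration and are not marker's real data structures. That the key is absent only when links are disabled, and that an empty list is a suitable fallback, are both assumptions.

```python
from typing import Any


def collect_page_refs(pages: list[dict[str, Any]]) -> dict[int, list[Any]]:
    """Map each page index to its refs, tolerating pages without a 'refs' key.

    Hypothetical stand-in for the loop seen in the traceback.
    """
    page_refs: dict[int, list[Any]] = {}
    for page_id, page in enumerate(pages):
        # page["refs"] raises KeyError when the key is missing;
        # page.get("refs", []) falls back to an empty list instead (assumed default).
        page_refs[page_id] = page.get("refs", [])
    return page_refs


if __name__ == "__main__":
    # Made-up pages: the second one has no 'refs' key.
    pages = [{"blocks": [], "refs": ["ref-a"]}, {"blocks": []}]
    print(collect_page_refs(pages))
```

Guarding the lookup at the point where the key is read keeps callers unchanged, whereas catching KeyError around the whole conversion, as in the function above, discards the entire document.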

FIX PROPOSAL
Here is a proposal to fix this issue.

Thanks in advance for your time.
