Code usability improvement #190

wants to merge 3 commits into
base: master
52 changes: 51 additions & 1 deletion README.md
@@ -206,4 +206,54 @@ This work would not have been possible without amazing open source models and da
- DocLayNet from IBM
- ByT5 from Google

Thank you to the authors of these models and datasets for making them available to the community!
# How to use this library for Python 3.10
```shell
pip install -r https://raw.githubusercontent.com/VikParuchuri/marker/master/requirements.txt
```
```python
import os

from marker.convert import convert_single_pdf
from marker.logger import configure_logging
from marker.models import load_all_models
from marker.output import save_markdown


def convert_pdf_to_markdown(filename, output_dir, max_pages=470, start_page=1, batch_multiplier=1, langs=["en"]):
    """
    Converts a PDF file to Markdown format using the Marker library.

    Args:
        filename (str): Path to the PDF file.
        output_dir (str): Directory to save the converted Markdown file.
        max_pages (int, optional): Maximum number of pages to convert. Defaults to 470.

Should the start page not default to 0?

I had an error when trying this with a single-page PDF until I changed start_page from 1 to 0.

        start_page (int, optional): Page number to start conversion from. Defaults to 1.
        batch_multiplier (int, optional): Multiplier for batch processing. Defaults to 1.
        langs (list[str], optional): Languages to be detected in the PDF. Defaults to ["en"].

    Returns:
        str: Path to the folder containing the converted Markdown file.

    Raises:
        ImportError: If the Marker library is not installed.
    """
    configure_logging()

    model_lst = load_all_models()
    full_text, images, out_meta = convert_single_pdf(filename, model_lst, max_pages=max_pages, langs=langs,
                                                     batch_multiplier=batch_multiplier, start_page=start_page)
    fname = os.path.basename(filename)
    subfolder_path = save_markdown(output_dir, fname, full_text, images, out_meta)
    print(f"Saved markdown to the {subfolder_path} folder")

    return subfolder_path

filename = "input file.pdf"
output = "output_folder"

# Example usage
converted_folder = convert_pdf_to_markdown(filename, output)

```
Thank you to the authors of these models and datasets for making them available to the community!
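
The review comment on `start_page` is consistent with the page index being 0-based. Below is a minimal sketch of why a single-page PDF would fail under that assumption. `select_pages` is a hypothetical helper written only for illustration; it is not part of Marker's API, and the real page-selection logic may differ.

```python
def select_pages(page_count, start_page=0, max_pages=None):
    """Return the 0-based page indices that would be converted.

    Hypothetical helper for illustration only. It assumes start_page is a
    0-based index, which is what the review comment suggests.
    """
    if start_page < 0:
        raise ValueError("start_page must be non-negative")
    # Stop at the end of the document, or earlier if max_pages limits the run.
    stop = page_count if max_pages is None else min(page_count, start_page + max_pages)
    return list(range(start_page, stop))


# A single-page PDF only has index 0, so starting at 1 selects nothing.
print(select_pages(page_count=1, start_page=1))  # []
print(select_pages(page_count=1, start_page=0))  # [0]
```

If the index really is 0-based, defaulting `start_page` to 0 (or leaving it unset) would avoid the empty selection for short documents.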

66 changes: 66 additions & 0 deletions requirements.txt

The project already contains a Poetry lock file, so there is no need for a requirements.txt.

@@ -0,0 +1,66 @@
aiosignal==1.3.1
analytics-python==1.4.post1
annotated-types==0.7.0
certifi==2024.6.2
charset-normalizer==3.3.2
docutils==0.20.1
environs==9.5.0
filelock==3.14.0
filetype==1.2.0
fsspec==2024.6.0
ftfy==6.2.0
grpcio==1.64.1
huggingface-hub==0.23.3
idna==3.7
Jinja2==3.1.4
joblib==1.4.2
marker-pdf==0.2.13
MarkupSafe==2.1.5
mdurl==0.1.2
mpmath==1.3.0
networkx==3.3
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu11==10.9.0.58
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.5.40
nvidia-nvtx-cu12==12.1.105
opencv-python==4.10.0.82
packaging==24.1
pdftext==0.3.10
pillow==10.3.0
pydub==0.25.1
pypdfium2==4.30.0
python-dateutil==2.8.2
python-dotenv==1.0.1
PyYAML==6.0.1
rapidfuzz==3.9.3
readme-renderer==41.0
regex==2024.5.15
requests==2.32.3
safetensors==0.4.3
scikit-learn==1.5.0
scipy==1.13.1
sniffio==1.3.0
surya-ocr==0.4.12
sympy==1.12.1
tabulate==0.9.0
texify==0.1.9
threadpoolctl==3.5.0
tokenizers==0.19.1
torch==2.3.1
tqdm==4.66.4
transformers==4.41.2
triton==2.3.1
twine==4.0.2
typing_extensions==4.12.2
urllib3==2.2.1
wcwidth==0.2.13
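
The pinned list above includes one `cu11` package (`nvidia-cufft-cu11`) alongside the `cu12` builds. Here is a small sketch of how mixed CUDA suffixes could be flagged before committing a frozen requirements file. `cuda_suffixes` is a hypothetical helper, and the sample pins are copied from the list above.

```python
import re
from collections import defaultdict


def cuda_suffixes(requirement_lines):
    """Group pinned nvidia-* packages by their CUDA suffix (cu11, cu12, ...)."""
    groups = defaultdict(list)
    for line in requirement_lines:
        # Keep only the package name from a "name==version" pin.
        name = line.split("==", 1)[0].strip()
        match = re.fullmatch(r"nvidia-.+-(cu\d+)", name)
        if match:
            groups[match.group(1)].append(name)
    return dict(groups)


sample = [
    "nvidia-cublas-cu12==12.1.3.1",
    "nvidia-cufft-cu11==10.9.0.58",
    "nvidia-cufft-cu12==11.0.2.54",
    "numpy==1.26.4",
]
groups = cuda_suffixes(sample)
if len(groups) > 1:
    print("Mixed CUDA builds pinned:", sorted(groups))
```

Whether the `cu11` pin is intentional is not stated in the pull request; the check only makes the mix visible.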