Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

community[minor]: 04 - Refactoring PDFMiner parser #29526

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

pprados
Copy link
Contributor

@pprados pprados commented Jan 31, 2025

This is one part of a larger Pull Request (PR) that is too large to be submitted all at once. This specific part focuses on updating the XXX parser.

For more details, see PR 28970.

Copy link

vercel bot commented Jan 31, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
langchain ✅ Ready (Inspect) Visit Preview 💬 Add feedback Jan 31, 2025 4:26pm

@pprados pprados mentioned this pull request Jan 31, 2025
2 tasks
@pprados pprados changed the title community[minor]: 03 - Refactoring PDFMiner parser community[minor]: 04 - Refactoring PDFMiner parser Jan 31, 2025
@pprados pprados force-pushed the pprados/04-pdfminer branch 3 times, most recently from 69b07aa to 278c6d2 Compare January 31, 2025 15:57
@pprados pprados force-pushed the pprados/04-pdfminer branch from 278c6d2 to b14176c Compare January 31, 2025 16:02
@pprados pprados marked this pull request as ready for review January 31, 2025 16:27
@dosubot dosubot bot added the size:XXL This PR changes 1000+ lines, ignoring generated files. label Jan 31, 2025
@pprados
Copy link
Contributor Author

pprados commented Jan 31, 2025

@eyurtsev
The next one ;-)

@dosubot dosubot bot added community Related to langchain-community Ɑ: doc loader Related to document loader module (not documentation) labels Jan 31, 2025
@pprados pprados marked this pull request as draft January 31, 2025 16:43
@pprados pprados marked this pull request as ready for review January 31, 2025 16:46
@eyurtsev eyurtsev self-assigned this Jan 31, 2025
pages_delimiter: str = _DEFAULT_PAGES_DELIMITER,
images_parser: Optional[BaseImageBlobParser] = None,
images_inner_format: Literal["text", "markdown-img", "html-img"] = "text",
concatenate_pages: Optional[bool] = None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a breaking change w/ respect due to the change in the default.

Could we restore to how it was, update unit tests w/ an exception for this class (i.e., specifically pass concatenate_pages=False)


For making breaking changes in the API, let's not do this as part of the large PRs since they're hard to catch, and we'll want to document them properly / figure out how to roll them out.

Copy link
Collaborator

@eyurtsev eyurtsev Jan 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's OK if the content of the Document changes due to improvements (e.g., the content is extracted in a format that better matches the PDF). We just don't want to cause silent failures in user code; e.g.,

user assumed it would be a single page and they're code does something like extract only the first document emitted -- so they would end up missing a bunch of content from the PDF and not know it

@eyurtsev
Copy link
Collaborator

need to resolve: #29470 to make sure this and other loaders aren't affected

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community Related to langchain-community Ɑ: doc loader Related to document loader module (not documentation) size:XXL This PR changes 1000+ lines, ignoring generated files.
Projects
Status: In review
Development

Successfully merging this pull request may close these issues.

2 participants