-
Notifications
You must be signed in to change notification settings - Fork 16.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
community[minor]: 04 - Refactoring PDFMiner parser #29526
base: master
Are you sure you want to change the base?
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎
|
69b07aa
to
278c6d2
Compare
278c6d2
to
b14176c
Compare
@eyurtsev |
pages_delimiter: str = _DEFAULT_PAGES_DELIMITER, | ||
images_parser: Optional[BaseImageBlobParser] = None, | ||
images_inner_format: Literal["text", "markdown-img", "html-img"] = "text", | ||
concatenate_pages: Optional[bool] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a breaking change w/ respect due to the change in the default.
Could we restore to how it was, update unit tests w/ an exception for this class (i.e., specifically pass concatenate_pages=False)
For making breaking changes in the API, let's not do this as part of the large PRs since they're hard to catch, and we'll want to document them properly / figure out how to roll them out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's OK if the content of the Document changes due to improvements (e.g., the content is extracted in a format that better matches the PDF). We just don't want to cause silent failures in user code; e.g.,
user assumed it would be a single page and they're code does something like extract only the first document emitted -- so they would end up missing a bunch of content from the PDF and not know it
need to resolve: #29470 to make sure this and other loaders aren't affected |
This is one part of a larger Pull Request (PR) that is too large to be submitted all at once. This specific part focuses on updating the XXX parser.
For more details, see PR 28970.