community[minor]: 04 - Refactoring PDFMiner parser #29526

pprados · 2025-01-31T15:40:31Z

This is one part of a larger pull request (PR) that is too large to submit all at once. This part focuses on updating the PDFMiner parser.

For more details, see PR 28970.

pprados · 2025-01-31T16:27:20Z

@eyurtsev
The next one ;-)

eyurtsev · 2025-01-31T17:35:12Z

libs/community/langchain_community/document_loaders/parsers/pdf.py

+        pages_delimiter: str = _DEFAULT_PAGES_DELIMITER,
+        images_parser: Optional[BaseImageBlobParser] = None,
+        images_inner_format: Literal["text", "markdown-img", "html-img"] = "text",
+        concatenate_pages: Optional[bool] = None,


This is a breaking change because the default value has changed.

Could we restore the previous behavior and update the unit tests with an exception for this class (i.e., explicitly pass concatenate_pages=False)?

Let's not make breaking API changes as part of the large PRs: they're hard to catch there, and we'll want to document them properly and figure out how to roll them out.

It's OK if the content of the Document changes due to improvements (e.g., the content is extracted in a format that better matches the PDF). We just don't want to cause silent failures in user code. For example:

A user assumed the PDF would come back as a single document, and their code extracts only the first document emitted, so they end up missing a bunch of content from the PDF without knowing it.
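To make the failure mode concrete, here is a minimal sketch. It assumes the previous default concatenated all pages into one document, as the comment above implies. `Doc`, `parse_pages`, and `first_document_text` are hypothetical stand-ins written for this illustration, not the real `Document` class or the PDFMiner parser, and the `"\n\n"` delimiter is an arbitrary choice rather than the value of `_DEFAULT_PAGES_DELIMITER`.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""

    page_content: str
    metadata: Dict[str, int] = field(default_factory=dict)


def parse_pages(
    pages: List[str],
    concatenate_pages: bool = True,
    pages_delimiter: str = "\n\n",  # assumed delimiter, for illustration only
) -> List[Doc]:
    """Toy parser: emit one Doc for the whole file, or one Doc per page."""
    if concatenate_pages:
        return [Doc(pages_delimiter.join(pages))]
    return [Doc(text, {"page": i}) for i, text in enumerate(pages)]


def first_document_text(docs: List[Doc]) -> str:
    """User code written against the old default: take the only document."""
    return docs[0].page_content


pages = ["intro", "methods", "results"]

# Old default: a single document holding every page.
old_text = first_document_text(parse_pages(pages, concatenate_pages=True))

# Changed default: one document per page, so docs[0] is only the first page.
# No exception is raised; the remaining pages are silently dropped.
new_text = first_document_text(parse_pages(pages, concatenate_pages=False))

# A test that wants per-page output opts in explicitly instead of relying
# on the default.
per_page_docs = parse_pages(pages, concatenate_pages=False)
```

In this sketch `old_text` contains all three pages while `new_text` contains only `"intro"`, which is why the per-page behavior is safer as an explicit opt-in in tests than as a changed default.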

eyurtsev · 2025-01-31T18:12:01Z

We need to resolve #29470 to make sure this and other loaders aren't affected.

pprados mentioned this pull request Jan 31, 2025

Refactoring PDF loaders: all #28970

Draft

2 tasks

pprados changed the title ~~community[minor]: 03 - Refactoring PDFMiner parser~~ community[minor]: 04 - Refactoring PDFMiner parser Jan 31, 2025

pprados force-pushed the pprados/04-pdfminer branch 3 times, most recently from 69b07aa to 278c6d2 January 31, 2025 15:57

Refactor pdfminer

b14176c

pprados force-pushed the pprados/04-pdfminer branch from 278c6d2 to b14176c January 31, 2025 16:02

Merge branch 'master' into pprados/04-pdfminer

ee61770

vercel bot deployed to Preview January 31, 2025 16:26

pprados marked this pull request as ready for review January 31, 2025 16:27

dosubot bot added the size:XXL label (this PR changes 1000+ lines, ignoring generated files) Jan 31, 2025

dosubot bot added the community (related to langchain-community) and Ɑ: doc loader (related to the document loader module, not documentation) labels Jan 31, 2025

pprados marked this pull request as draft January 31, 2025 16:43

pprados marked this pull request as ready for review January 31, 2025 16:46

eyurtsev self-assigned this Jan 31, 2025

eyurtsev reviewed Jan 31, 2025
