-
Notifications
You must be signed in to change notification settings - Fork 16.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TextSplitters: Refactor HTMLHeaderTextSplitter
for Enhanced Maintainability and Readability
#29397
Open
AhmedTammaa
wants to merge
13
commits into
langchain-ai:master
Choose a base branch
from
AhmedTammaa:html-simple-split
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
dosubot
bot
added
the
size:M
This PR changes 30-99 lines, ignoring generated files.
label
Jan 23, 2025
The latest updates on your projects. Learn more about Vercel for Git ↗︎ 1 Skipped Deployment
|
dosubot
bot
added
size:S
This PR changes 10-29 lines, ignoring generated files.
🤖:nit
Small modifications/deletions, fixes, deps or improvements to existing code or docs
and removed
size:M
This PR changes 30-99 lines, ignoring generated files.
labels
Jan 24, 2025
ccurme
reviewed
Jan 27, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @AhmedTammaa, this appears to just be formatting changes. Was this intended?
dosubot
bot
added
size:L
This PR changes 100-499 lines, ignoring generated files.
and removed
size:S
This PR changes 10-29 lines, ignoring generated files.
labels
Jan 27, 2025
Hi @ccurme, It was not intended, my bad. I have committed the refactored class. Apologies for the inconvenience. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
🤖:nit
Small modifications/deletions, fixes, deps or improvements to existing code or docs
size:L
This PR changes 100-499 lines, ignoring generated files.
Ɑ: text splitters
Related to text splitters package
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Please see PR #27678 for context
Overview
This pull request presents a refactor of the
HTMLHeaderTextSplitter
class aimed at improving its maintainability and readability. The primary enhancements include simplifying the internal structure by consolidating multiple private helper functions into a single private method, thereby reducing complexity and making the codebase easier to understand and extend. Importantly, all existing functionalities and public interfaces remain unchanged.PR Goals
Simplify Internal Logic:
_header_level
,_dom_depth
,_get_elements
) to manage different aspects of HTML parsing and document generation. This fragmentation increased cognitive load and potential maintenance overhead._generate_documents
), the class now offers a more straightforward flow, making it easier for developers to trace and understand the processing steps. (Thanks to @eyurtsev)Enhance Readability:
_generate_documents
, which handles both HTML traversal and document creation in a cohesive manner.Improve Maintainability:
Maintain Backward Compatibility:
split_text
,split_text_from_url
,split_text_from_file
) and their signatures remain unchanged, ensuring that existing integrations and usage patterns are unaffected.Detailed Changes
Removed Redundant Private Methods:
_header_level
,_dom_depth
, and_get_elements
: These methods were merged into the_generate_documents
method, centralizing the logic for HTML parsing and document generation.Consolidated Document Generation Logic:
_generate_documents
: This method now handles the entire process of parsing HTML, tracking active headers, managing document chunks, and yieldingDocument
instances. This consolidation reduces the number of moving parts and simplifies the overall processing flow.Simplified Header Management:
_generate_documents
, ensuring that headers are added or removed from the active headers dictionary in real-time based on their DOM depth and hierarchy.chunk_dom_depth
Attribute: The need to track chunk DOM depth separately has been eliminated, as header scopes are now directly managed within the traversal logic.Streamlined Chunk Finalization:
finalize_chunk
Function: The chunk finalization process has been simplified to directly yield a singleDocument
when needed, without maintaining an intermediate list. This change reduces unnecessary list operations and makes the logic more straightforward.Improved Variable Naming and Flow:
current_chunk
andnode_text
provide clear insights into their roles within the processing logic.Preserved Comprehensive Docstrings:
Testing
All existing test cases from
test_html_header_text_splitter.py
have been executed against the refactored code. The results confirm that:Document
objects with correct metadata.This example remains fully operational and behaves as before, returning a list of
Document
objects with the expected metadata and content splits.Conclusion
This refactor achieves a more maintainable and readable codebase by simplifying the internal structure of the
HTMLHeaderTextSplitter
class. By consolidating multiple private methods into a single, cohesive private method, the class becomes easier to understand, debug, and extend. All existing functionalities are preserved, and comprehensive tests confirm that the refactor maintains the expected behavior. These changes align with LangChain’s standards for clean, maintainable, and efficient code.