msword_backend.py does not parse docx files with headings ("Heading 1", "Heading 2", etc.) properly. When the Outline format is applied to a heading:
and "Numbering style" is activated as follows:
the parser labels that paragraph as a list (which is true) but not as a section_header (which is true as well). That is ambiguous and prone to unnoticed errors (imagine having to process a lot of docx files that you are NOT allowed to edit...).
The consequences are:
Without proper parsing, the errors carry over to the HierarchicalChunker too.
The user cannot properly access the document structure by proceeding as explained here.
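To make the expected behavior concrete, here is a minimal sketch of a labeling rule in which a heading style takes precedence over numbering. It is not the code in msword_backend.py: ParagraphInfo and classify are hypothetical names, and the three fields are a simplified stand-in for whatever the backend actually reads from the docx paragraph.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParagraphInfo:
    """Hypothetical, simplified view of a docx paragraph (not a docling type)."""

    text: str
    style_name: Optional[str]  # e.g. "Heading 1", "List Paragraph", "Normal"
    is_numbered: bool  # True when a numbering/outline format is applied


def classify(paragraph: ParagraphInfo) -> str:
    """Return a label, letting a heading style win over numbering."""
    style = (paragraph.style_name or "").strip().lower()
    if style.startswith("heading"):
        # A numbered heading is still a heading.
        return "section_header"
    if paragraph.is_numbered:
        return "list_item"
    return "paragraph"
```

With this rule, a paragraph styled "Heading 1" that also carries outline numbering would be labeled section_header, while a numbered paragraph with any other style would stay a list_item.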
---- ./bug_example_with_list_headings.docx ----
0: unspecified with name=_root_
1: list with name=list # should be header
2: list_item
3: paragraph
4: paragraph
5: paragraph
6: list with name=list # should be header
7: list_item
8: paragraph
9: paragraph
10: paragraph
11: list with name=list # should be header
12: list_item
13: paragraph
14: paragraph
15: paragraph
16: paragraph
17: paragraph
18: list with name=list # should be header
19: list_item
20: paragraph
21: paragraph
22: paragraph
23: list with name=list # should be header
24: list_item
25: paragraph
26: paragraph
27: paragraph
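Since the problem is easy to miss across many files, a rough way to flag suspicious dumps like the one above is to count the labels. This is only a heuristic sketch: count_labels and looks_affected are hypothetical helpers, the line format is assumed to match the dump shown here, and a document that really has lists and no headings would be flagged as well.

```python
import re
from collections import Counter

# Matches dump lines such as "1: list with name=list" or "3: paragraph"
# and captures the label that follows the index.
_DUMP_LINE = re.compile(r"^\s*\d+:\s+(\w+)")


def count_labels(dump: str) -> Counter:
    """Count how often each label appears in an element-tree dump."""
    counts: Counter = Counter()
    for line in dump.splitlines():
        match = _DUMP_LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def looks_affected(dump: str) -> bool:
    """Heuristic: list groups are present but no section_header at all."""
    counts = count_labels(dump)
    return counts["list"] > 0 and counts["section_header"] == 0
```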
Be aware that in the first example, there is no name (label.name) in section_headers
Yep, indeed, that's what I mean (in the Additional Context section). In fact, the mechanism for assigning label_str and label_level is language-dependent if you rely on "style_id", and it could be fixed by using name instead (the label name seems to be language-invariant, but I didn't have time to check it carefully).
Microsoft Word changes the style_id depending on the language settings. I found a post in Microsoft's forum about that.
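As a sketch of the idea, the heading level could be derived from the style name rather than from style_id. This assumes, as discussed above but not verified, that the style name stays in the "Heading N" form regardless of the Word language settings. heading_level_from_style_name is a hypothetical helper, not a function in msword_backend.py.

```python
import re
from typing import Optional

# Accepts "Heading 1", "heading 2", and so on. The "Heading N" form of the
# style name is an assumption taken from the discussion, not a verified fact.
_HEADING_NAME = re.compile(r"^heading\s+(\d+)$", re.IGNORECASE)


def heading_level_from_style_name(style_name: Optional[str]) -> Optional[int]:
    """Return the heading level encoded in a style name, or None."""
    if not style_name:
        return None
    match = _HEADING_NAME.match(style_name.strip())
    return int(match.group(1)) if match else None
```

A style name such as "List Paragraph" yields None, so the caller can fall back to whatever non-heading handling it already has.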
Bug
Here is a screenshot of what I mean:
Clean headings (No outline format)
Headings with outline format
Steps to reproduce
Here are the docx samples:
bug_example.docx
bug_example_with_list_headings.docx
Here is the output:
Docling version
Python version
Additional Context
Digging into the code, I noticed some other things worth exploring too:
split_text_and_number in docling/docling/backend/msword_backend.py (line 170 in 1976584)
That implies wrong label parsing
Regarding style (line 200): shouldn't it be name instead of style_id?
Hope it helps,
Let me know if you need more information.
Have a nice day!