docx parsing headings with outline format numbered-style applied is not working properly. #795

Open
jbtelice opened this issue Jan 23, 2025 · 3 comments
Labels
bug Something isn't working docx issue related to docx backend

Comments

@jbtelice
Bug

msword_backend.py doesn't properly parse docx files whose headings ("Heading 1", "Heading 2", etc.) have a numbered outline format applied. When Outline format is applied to a heading:

Image

and "Numbering style" is activated as follows:

Image

the parser labels that paragraph as a list (which is true) but not as a section_header (which is true as well). That's ambiguous and prone to unnoticed errors (imagine having to process a lot of docx files that you are NOT allowed to edit...).

The consequences are:

  • Without proper parsing, the errors carry over to the HierarchicalChunker too.
  • The user cannot properly access the document structure by proceeding as explained here.
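
The expected behaviour described above, where a numbered heading is still reported as a section_header, could be sketched as a simple precedence rule. This is only an illustration: ParagraphInfo, classify, and the returned label strings are hypothetical names made up for this example, not docling's API.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


# Hypothetical, simplified view of a docx paragraph (not a docling type).
@dataclass
class ParagraphInfo:
    heading_level: Optional[int]  # level taken from a "Heading N" style, else None
    has_numbering: bool           # True when Word numbering is applied


def classify(par: ParagraphInfo) -> Tuple[str, Optional[int]]:
    """Return (label, level); a heading style wins over list numbering."""
    if par.heading_level is not None:
        return "section_header", par.heading_level
    if par.has_numbering:
        return "list_item", None
    return "paragraph", None


# A numbered "Heading 2" paragraph stays a header instead of becoming a list item.
print(classify(ParagraphInfo(heading_level=2, has_numbering=True)))
```

With this ordering, numbering only decides the label when no heading style is present, so downstream consumers would still see the section hierarchy.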

Here is a screenshot of what I mean:

Clean headings (No outline format)

Image

Headings with outline format

Image

Steps to reproduce

Here are the docx samples:

bug_example.docx

bug_example_with_list_headings.docx

from docling.document_converter import DocumentConverter

DOC_SOURCE_ORIGINAL = "./bug_example.docx"
DOC_SOURCE_MODIFIED = "./bug_example_with_list_headings.docx"

# Convert both samples with the default converter settings.
doc_original = DocumentConverter().convert(source=DOC_SOURCE_ORIGINAL).document
doc_edited = DocumentConverter().convert(source=DOC_SOURCE_MODIFIED).document

# Print the element tree of each document so they can be compared.
print(f"---- {DOC_SOURCE_ORIGINAL} ----")
doc_original.print_element_tree()

print(f"---- {DOC_SOURCE_MODIFIED} ----")
doc_edited.print_element_tree()

Here is the output:

---- ./bug_example.docx ----
 0: unspecified with name=_root_
  1: section with name=header-0
   2: section_header    
    3: paragraph  # empty
    4: paragraph
    5: paragraph  # empty
    6: section_header  
     7: paragraph # empty
     8: paragraph
     9: paragraph # empty
    10: section_header
     11: paragraph   # empty
     12: paragraph
     13: paragraph  # empty
     14: paragraph
     15: paragraph  # empty
     16: section_header
      17: paragraph  # empty
      18: paragraph
      19: paragraph  # empty
   20: section_header
    21: paragraph  # empty
    22: paragraph
    23: paragraph  # empty
---- ./bug_example_with_list_headings.docx ----
 0: unspecified with name=_root_
  1: list with name=list    # should be header
   2: list_item
  3: paragraph
  4: paragraph
  5: paragraph
  6: list with name=list   # should be header
   7: list_item
  8: paragraph
  9: paragraph
  10: paragraph
  11: list with name=list   # should be header
   12: list_item
  13: paragraph
  14: paragraph
  15: paragraph
  16: paragraph
  17: paragraph
  18: list with name=list   # should be header
   19: list_item
  20: paragraph
  21: paragraph
  22: paragraph
  23: list with name=list   # should be header
   24: list_item
  25: paragraph
  26: paragraph
  27: paragraph

Note that in the first example the section_header items have no name (label.name).
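
To compare the two trees above mechanically rather than by eye, one option is to count the labels in the printed dump. The sketch below assumes the output of print_element_tree() has been captured as plain text in the "index: label" form shown above; count_labels is a hypothetical helper, not part of docling.

```python
import re
from collections import Counter

# Matches lines such as "   2: section_header" and captures the label.
_TREE_LINE = re.compile(r"^\s*\d+:\s*(\w+)")


def count_labels(tree_dump: str) -> Counter:
    """Count how often each label appears in an element-tree dump."""
    counts: Counter = Counter()
    for line in tree_dump.splitlines():
        match = _TREE_LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# First lines of the second tree above, pasted as plain text.
modified_excerpt = """\
 0: unspecified with name=_root_
  1: list with name=list
   2: list_item
  3: paragraph
"""
counts = count_labels(modified_excerpt)
print(counts["section_header"], counts["list_item"])  # 0 1
```

A section_header count of zero for a document that visibly contains headings is a quick signal that the headings were parsed as list items.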

Docling version

Docling version: 2.15.1
Docling Core version: 2.15.1
Docling IBM Models version: 3.2.1
Docling Parse version: 3.1.1

Python version

Python 3.11.9

Additional Context

Digging into the code, I noticed some other things worth exploring too:

split_text_and_number (def split_text_and_number(self, input_string):): the regex there does not trim the strings returned by match.groups().

Image

Image

That leads to wrong label parsing.
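
A minimal sketch of the trimming fix, written as a standalone function. The regex used here is an assumption about the general shape of the pattern (text followed by digits, or digits followed by text); the actual pattern in msword_backend.py may differ.

```python
import re
from typing import List


def split_text_and_number(input_string: str) -> List[str]:
    """Split a label such as "Heading 1" into ["Heading", "1"], trimming whitespace."""
    # Assumed pattern: non-digits then digits, or digits then non-digits.
    match = re.match(r"(\D+)(\d+)$|^(\d+)(\D+)", input_string)
    if not match:
        return [input_string]
    # Strip each captured group so "Heading " compares equal to "Heading",
    # and drop groups that are empty or whitespace-only.
    return [group.strip() for group in match.groups() if group and group.strip()]


print(split_text_and_number("Heading 1"))  # ['Heading', '1']
```

Without the strip() call, the first group for "Heading 1" would be "Heading " with a trailing space, so an exact comparison against "Heading" would fail.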

Regarding style

Line 200

Shouldn't it use name instead of style_id?

Image

Hope it helps,

Let me know if you need more information.

Have a nice day!

@jbtelice jbtelice added the bug Something isn't working label Jan 23, 2025
@MiguelAngelTorres
I found the same problem when parsing another file that combines both outline and non-outline formats in the same file.

Image

The presence of the non-outline format causes the outline-format output not to be generated correctly either.

Image

Docling version

Docling version: 2.15.1
Docling Core version: 2.15.1
Docling IBM Models version: 3.2.1
Docling Parse version: 3.1.1

@MiguelAngelTorres
Update

It seems the problem is not the outline. Testing with different outlines and formats doesn't resolve the issue.

The point is the label parsing:

  • There is an additional whitespace on labels.
  • Microsoft Word changes the style_id depending on the language settings. I found a post in Microsoft's forum about that.

Applying the changes suggested by @jbtelice locally worked for me, but it would be nice to include them in the core code.

@PeterStaar-IBM PeterStaar-IBM added the docx issue related to docx backend label Jan 28, 2025
@jbtelice
jbtelice commented Jan 28, 2025

Hi!

There is an additional whitespace on labels.

Yep, indeed, that's what I meant (in the Additional Context section). In fact, the mechanism for assigning label_str and label_level is language-dependent if it relies on style_id, and it could be fixed by using name instead (the style name seems to be language-invariant, but I haven't had time to check it carefully).
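
To illustrate the difference, here is a small sketch of deriving the heading level from a label string. heading_level is a hypothetical helper, and the two sample values are assumptions for illustration: "Ttulo1" stands in for a style_id that a Spanish-language Word installation might write, and "Heading 1" for the corresponding style name.

```python
import re
from typing import Optional


def heading_level(label: Optional[str]) -> Optional[int]:
    """Parse "Heading 1" or "heading1" into 1; return None for anything else."""
    if not label:
        return None
    match = re.fullmatch(r"heading\s*(\d+)", label.strip(), re.IGNORECASE)
    return int(match.group(1)) if match else None


# Illustrative values only (assumptions, not taken from the sample files).
style_id = "Ttulo1"       # localized identifier
style_name = "Heading 1"  # style name

print(heading_level(style_id))    # None: the heading is missed
print(heading_level(style_name))  # 1
```

If the name really is language-invariant, matching on it would keep heading detection independent of the Word language settings; that still needs to be verified against real files.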

Image

Microsoft Word changes the style_id depending on the language settings. I found a post in Microsoft's forum about that.

Thanks @MiguelAngelTorres for pointing out the reference. 👀
