docx parsing headings with outline format numbered-style applied is not working properly. #795

jbtelice · 2025-01-23T19:45:02Z

Bug

msword_backend.py doesn't parse docx files with headings ("Heading 1", "Heading 2", etc.) properly. When Outline format is applied to a heading and "Numbering style" is activated, the parser labels that paragraph as a list (which is true) but not as a section_header (which is true as well). That is ambiguous and prone to unnoticed errors (imagine having to process a lot of docx files that you are NOT allowed to edit...).

The consequences are:

Without proper parsing, the errors carry over to the HierarchicalChunker too.
The user cannot properly access the document structure by proceeding as explained here

Here is a screenshot of what I mean:

Clean headings (No outline format)

Headings with outline format

Steps to reproduce

Here are the docx samples:

bug_example.docx

bug_example_with_list_headings.docx

from docling.document_converter import DocumentConverter

DOC_SOURCE_ORIGINAL = "./bug_example.docx"
DOC_SOURCE_MODIFIED = "./bug_example_with_list_headings.docx"

# Reuse a single converter for both documents
converter = DocumentConverter()
doc_original = converter.convert(source=DOC_SOURCE_ORIGINAL).document
doc_edited = converter.convert(source=DOC_SOURCE_MODIFIED).document

# Print each element tree so the two structures can be compared
print(f"---- {DOC_SOURCE_ORIGINAL} ----")
doc_original.print_element_tree()

print(f"---- {DOC_SOURCE_MODIFIED} ----")
doc_edited.print_element_tree()

Here is the output:

---- ./bug_example.docx ----
 0: unspecified with name=_root_
  1: section with name=header-0
   2: section_header    
    3: paragraph  # empty
    4: paragraph
    5: paragraph  # empty
    6: section_header  
     7: paragraph # empty
     8: paragraph
     9: paragraph # empty
    10: section_header
     11: paragraph   # empty
     12: paragraph
     13: paragraph  # empty
     14: paragraph
     15: paragraph  # empty
     16: section_header
      17: paragraph  # empty
      18: paragraph
      19: paragraph  # empty
   20: section_header
    21: paragraph  # empty
    22: paragraph
    23: paragraph  # empty

---- ./bug_example_with_list_headings.docx ----
 0: unspecified with name=_root_
  1: list with name=list    # should be header
   2: list_item
  3: paragraph
  4: paragraph
  5: paragraph
  6: list with name=list   # should be header
   7: list_item
  8: paragraph
  9: paragraph
  10: paragraph
  11: list with name=list   # should be header
   12: list_item
  13: paragraph
  14: paragraph
  15: paragraph
  16: paragraph
  17: paragraph
  18: list with name=list   # should be header
   19: list_item
  20: paragraph
  21: paragraph
  22: paragraph
  23: list with name=list   # should be header
   24: list_item
  25: paragraph
  26: paragraph
  27: paragraph
Note that in the first example, the section_header items have no name (label.name).

Docling version

Docling version: 2.15.1
Docling Core version: 2.15.1
Docling IBM Models version: 3.2.1
Docling Parse version: 3.1.1

Python version

Python 3.11.9

Additional Context

Digging into the code, I noticed some other things worth exploring too:

split_text_and_number (docling/docling/backend/msword_backend.py, line 170 in 1976584):

def split_text_and_number(self, input_string):

That regex does not trim the match.groups(), which implies wrong label parsing.
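To illustrate the trimming point, here is a minimal sketch of a splitter that strips whitespace from each captured group. It is not the code from msword_backend.py: the regex and the standalone function signature are assumptions made for illustration only.

```python
import re
from typing import List


def split_text_and_number(input_string: str) -> List[str]:
    """Split a style label such as 'Heading 1' into text and number parts.

    Hypothetical stand-in for the backend method; the pattern below is an
    assumption, not the regex used in docling.
    """
    cleaned = input_string.strip()
    # Either text followed by digits, or digits followed by text.
    match = re.match(r"^(\D+)(\d+)$|^(\d+)(\D+)$", cleaned)
    if match is None:
        # No number found: return the whole (trimmed) label.
        return [cleaned]
    # Drop the groups of the alternative that did not match,
    # then strip the whitespace left around each remaining part.
    return [part.strip() for part in match.groups() if part is not None]


print(split_text_and_number("Heading 1"))
```

With the groups stripped, "Heading 1" yields "Heading" rather than "Heading " (with a trailing space), so a later comparison against the expected label string can succeed.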

Regarding style (line 200):

Shouldn't it be name instead of style_id?

Hope it helps,

Let me know if you need more information.

Have a nice day!

MiguelAngelTorres · 2025-01-27T11:20:30Z

I found the same problem when parsing another file that combines both outline and non-outline formats in the same file.

The presence of the non-outline format means the outline-format output is not generated correctly either.

Docling version

Docling version: 2.15.1
Docling Core version: 2.15.1
Docling IBM Models version: 3.2.1
Docling Parse version: 3.1.1

MiguelAngelTorres · 2025-01-27T17:17:54Z

Update

It seems the problem is not the outline. Testing with different outlines and formats doesn't resolve the issue.

The point is the label parsing:

There is an additional whitespace on labels.
Microsoft Word changes the style_id depending on the language settings. I found a post in Microsoft's forum about that.

Applying the changes suggested by @jbtelice locally worked for me, but it would be nice to include them in the core code.

jbtelice · 2025-01-28T07:10:19Z

Hi!

There is an additional whitespace on labels.

Yep, indeed, that's what I meant (in the Additional Context section). In fact, the mechanism for assigning label_str and label_level is language-dependent if you rely on style_id, and it could be fixed by using name instead (name seems to be language-invariant, but I didn't have time to check it carefully).
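A rough sketch of that idea, under the unverified assumption stated above that the style name stays "Heading N" whatever the language of the Word installation. The helper name heading_level_from_style and the localized style ids in the usage lines are hypothetical examples, not values taken from the sample files.

```python
import re
from typing import Optional

# Assumption: built-in heading styles expose a name like "Heading 1" or
# "heading 1" regardless of the UI language, while style_id may be localized.
_HEADING_PATTERN = re.compile(r"^heading\s*(\d+)$", re.IGNORECASE)


def heading_level_from_style(style_name: str, style_id: str) -> Optional[int]:
    """Return the heading level, preferring the style name over the style id."""
    # Try the (supposedly language-invariant) name first, then fall back
    # to the style id for documents where only the id carries the level.
    for candidate in (style_name, style_id):
        match = _HEADING_PATTERN.match(candidate.strip())
        if match:
            return int(match.group(1))
    return None


# Illustrative (hypothetical) localized style ids:
print(heading_level_from_style("Heading 1", "Ttulo1"))
print(heading_level_from_style("Normal", "Normal"))
```

Checking name first and style_id second keeps the existing behavior for English documents while covering the localized case, if the assumption about name holds.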

Microsoft Word changes the style_id depending on the language settings. I found a post in Microsoft's forum about that.

Thanks @MiguelAngelTorres for pointing out the reference. 👀

jbtelice added the bug Something isn't working label Jan 23, 2025

MiguelAngelTorres mentioned this issue Jan 27, 2025

Numbered headings in Word documents appear as list items #612

Open

PeterStaar-IBM added the docx issue related to docx backend label Jan 28, 2025
