rateixei
Contributor

This PR implements a way to import Docx DrawingML objects as PNG images into DoclingDocument objects. This includes diagrams, hand-drawn shapes, and Word/Excel charts.

This is performed with the following steps:

  • When a DrawingML object is found, an empty copy of the docx file is created (this is needed to keep the formatting styles that are defined in the file). One file is created for each paragraph that contains a DrawingML object.
  • The file is populated with the DrawingML object of that paragraph.
  • The docx file is saved temporarily and exported to a temporary PDF with LibreOffice [1], Word+docx2pdf [2], or pypandoc, using the first of these that is available. If none is available, a warning is displayed.
  • The temporary PDF is rendered to a PNG, which is cropped to exclude the page rectangle and stored as a Pillow image.
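
The cropping in the last step can be pictured as finding the bounding box of everything that is not page background. Below is a minimal, dependency-free sketch of that idea on a grayscale pixel grid. It assumes the page renders as pure white (255), and the names `content_bbox` and `crop` are hypothetical, not taken from this PR. With Pillow, a similar result is commonly obtained with `ImageChops.difference` against a blank background, followed by `Image.getbbox()` and `Image.crop()`.

```python
from typing import List, Optional, Tuple

Grid = List[List[int]]  # rows of grayscale values, 0 (black) to 255 (white)
BBox = Tuple[int, int, int, int]  # (left, upper, right, lower); right/lower exclusive


def content_bbox(pixels: Grid, background: int = 255) -> Optional[BBox]:
    """Return the smallest box containing every non-background pixel."""
    rows = [y for y, row in enumerate(pixels) if any(v != background for v in row)]
    if not rows:
        return None  # blank page: there is nothing to crop to
    cols = [x for row in pixels for x, v in enumerate(row) if v != background]
    return min(cols), rows[0], max(cols) + 1, rows[-1] + 1


def crop(pixels: Grid, box: BBox) -> Grid:
    """Cut the grid down to the given box."""
    left, upper, right, lower = box
    return [row[left:right] for row in pixels[upper:lower]]


# A 5x4 "page" that is white except for a 2x2 drawing.
page = [
    [255, 255, 255, 255, 255],
    [255, 255, 0, 40, 255],
    [255, 255, 10, 0, 255],
    [255, 255, 255, 255, 255],
]
box = content_bbox(page)
drawing = crop(page, box)
```

Returning `None` for a blank page lets the caller decide whether to skip the image or keep the full page, which may matter if a conversion produces an empty PDF.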

An example docx file containing diagrams, figures and charts is attached, along with the DoclingDocument export.

Leaving the PR as WIP, as this has only been tested on macOS so far.

Notes:
[1] The LibreOffice executable needs to be in PATH
[2] On macOS, the user needs to explicitly grant Word access to each individual temp file that is created.
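
The fallback order for the PDF export, together with note [1], could be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: the function names `pick_converter` and `libreoffice_command` are hypothetical, LibreOffice is assumed to be reachable as `soffice` or `libreoffice`, and the presence of the `docx2pdf` module is used as a stand-in for "Word+docx2pdf is available" even though it does not prove that Word is installed.

```python
import importlib.util
import shutil
from typing import Callable, List, Optional


def _module_available(name: str) -> bool:
    """True if a top-level module can be imported in this environment."""
    return importlib.util.find_spec(name) is not None


def pick_converter(
    which: Callable[[str], Optional[str]] = shutil.which,
    has_module: Callable[[str], bool] = _module_available,
) -> Optional[str]:
    """Return the first available backend, in the order the description lists."""
    # Note [1]: LibreOffice is only detected when its executable is on PATH.
    if which("soffice") or which("libreoffice"):
        return "libreoffice"
    # Assumption: module presence only; Word itself is not checked here.
    if has_module("docx2pdf"):
        return "docx2pdf"
    if has_module("pypandoc"):
        return "pypandoc"
    return None  # the caller is expected to emit a warning


def libreoffice_command(executable: str, docx_path: str, out_dir: str) -> List[str]:
    """Build a headless LibreOffice invocation that writes a PDF into out_dir."""
    return [executable, "--headless", "--convert-to", "pdf", "--outdir", out_dir, docx_path]
```

Passing `which` and `has_module` as parameters keeps the selection logic testable on machines where none of the three tools is installed.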

Checklist:

  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.

drawingml_example.json
drawingml_example.docx

@rateixei rateixei requested a review from dolfim-ibm August 14, 2025 14:27

mergify bot commented Aug 14, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Enforce conventional commit

This rule is failing.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
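
To see which titles would satisfy this rule, the pattern can be tried locally. The sketch below assumes that `~=` means "the title matches this regular expression" and uses Python's `re` module; the example titles are made up for illustration and are not the title of this PR.

```python
import re

# The pattern from the merge protection rule, copied verbatim.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """True if the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None


# Hypothetical titles and whether they are expected to pass.
examples = {
    "feat(docx): import DrawingML objects as images": True,
    "fix!: handle missing converter": True,
    "Import DrawingML objects as images": False,
    "feature: add charts": False,
}
results = {title: is_conventional(title) for title in examples}
```

A title passes only if it begins with one of the listed types, optionally followed by a scope in parentheses and a `!`, and then a colon.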

Contributor

DCO Check Passed

Thanks @rateixei, all your commits are properly signed off. 🎉


codecov bot commented Aug 14, 2025

Codecov Report

❌ Patch coverage is 36.44860% with 68 lines in your changes missing coverage. Please review.

Files with missing lines                  Patch %   Lines
docling/backend/docx/drawingml/utils.py   33.82%    45 Missing ⚠️
docling/backend/msword_backend.py         41.02%    23 Missing ⚠️
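
The per-file rows and the overall patch figure can be cross-checked with a little arithmetic. The sketch below infers each file's total changed lines from its missing count and coverage percentage; those totals are not stated in the report, so they are an inference, and the helper name `total_from_missing` is hypothetical.

```python
def total_from_missing(missing: int, coverage_pct: float) -> int:
    """Infer total changed lines from missing lines and the coverage percentage."""
    return round(missing / (1 - coverage_pct / 100))


# (patch coverage %, missing lines) as shown in the report above.
report = {
    "docling/backend/docx/drawingml/utils.py": (33.82, 45),
    "docling/backend/msword_backend.py": (41.02, 23),
}

totals = {path: total_from_missing(m, pct) for path, (pct, m) in report.items()}
patch_total = sum(totals.values())
patch_missing = sum(m for _, m in report.values())
patch_coverage = 100 * (patch_total - patch_missing) / patch_total
```

Under this inference the two files account for 68 and 39 changed lines, 107 in total, of which 68 are uncovered, which is consistent with the reported 36.44860% patch coverage.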


@dolfim-ibm dolfim-ibm self-assigned this Aug 18, 2025