A Python tool for converting Microsoft DOCX files to Markdown format with image and table preservation. Optimized for 3GPP specification documents.
- Text Conversion: Convert DOCX text content to Markdown with proper heading detection
- Table Preservation: Convert Word tables to Markdown table syntax
- Image Extraction: Extract images and embed them in Markdown
- Automatic SVG Conversion: Convert WMF/EMF files to SVG using Inkscape or placeholders
- Inkscape Integration: Automatic detection and use of Inkscape for high-quality conversions
- Interactive Path Input: Manual Inkscape path input when automatic detection fails
- Persistent Storage: Remember Inkscape path for future use
- 3GPP Specialization: Enhanced support for 3GPP specification document formatting
- High-Quality Output: Generate clean, readable Markdown files
pip install docx2md
- Clone the repository:
git clone <repository-url>
cd docx2md
- Install in development mode:
pip install -e .
# Basic conversion
docx2md -i input.docx
# Specify output file
docx2md -i input.docx -o output.md
# Skip image extraction
docx2md -i input.docx --no-images
# Custom image directory
docx2md -i input.docx --images-dir my_images
# Verbose output with validation
docx2md -i input.docx -v --validate
# Get help
docx2md --help
# Required arguments
-i, --input INPUT        Input DOCX file path
# Optional arguments
-o, --output OUTPUT      Output Markdown file path (default: input filename with .md extension)
--images-dir DIR         Directory to extract images (default: images)
--inkscape-path PATH     Custom path to Inkscape executable
--no-images              Skip image extraction (only convert text and tables)
--validate               Run validation after conversion
-v, --verbose            Enable verbose output
-h, --help               Show help message
# Convert 3GPP specification document with full validation
docx2md -i nprach.docx -o nprach.md --validate -v
# Convert with custom Inkscape path
docx2md -i document.docx --inkscape-path "C:\Program Files\Inkscape\bin\inkscape.exe"
# Convert multiple documents (batch processing)
for doc in *.docx; do
    docx2md -i "$doc" -o "${doc%.docx}.md"
done
# Convert without images (text and tables only)
docx2md -i document.docx --no-images
# Convert with custom image directory
docx2md -i document.docx --images-dir assets
# Convert with custom image directory and Inkscape path
docx2md -i document.docx --images-dir assets --inkscape-path "/usr/bin/inkscape"
from docx2md.src.convert_embedded_images import EmbeddedImagesDocx2MdConverter
# Basic conversion
converter = EmbeddedImagesDocx2MdConverter()
results = converter.convert("input.docx", "output.md")
# With custom Inkscape path
inkscape_path = r"C:\Program Files\Inkscape\bin\inkscape.exe"
converter = EmbeddedImagesDocx2MdConverter(inkscape_path=inkscape_path)
results = converter.convert("input.docx", "output.md")
# With custom image directory and Inkscape path
converter = EmbeddedImagesDocx2MdConverter(
    base_image_path="my_images",
    inkscape_path=inkscape_path,
)
results = converter.convert("input.docx", "output.md")
# Skip image extraction
results = converter.convert(
    "input.docx",
    "output.md",
    extract_images=False,
)
docx2md/
├── docx2md/                                    # Main package
│   ├── __init__.py                             # Package initialization
│   ├── docx2md.py                              # Command line interface
│   └── src/                                    # Source modules
│       ├── __init__.py
│       ├── convert_embedded_images.py          # Main converter class
│       ├── docx_parser.py                      # DOCX content extraction
│       ├── markdown_generator.py               # Markdown generation
│       ├── paragraph_with_images_extractor.py  # Paragraph and image extraction
│       ├── validate_embedded_images.py         # Validation tools
│       └── xml_based_extractor.py              # XML-based extraction
├── tests/                                      # Unit tests
├── requirements.txt                            # Dependencies
├── setup.py                                    # Package setup
├── LICENSE                                     # License file
└── README.md                                   # This file
- python-docx>=0.8.11 - DOCX file parsing
- Pillow>=9.0.0 - Image processing
- numpy>=1.21.0 - Numerical operations
- pytest>=7.0.0 - Testing framework
Run the test suite:
python -m pytest tests/ -v
- Naming: snake_case for all variables and functions
- Comments: No Chinese in comments or print statements
- Testing: Unit tests for all functions
- Type Support: Full type hinting support
Convert a 3GPP specification document:
docx2md -i nprach.docx -o nprach.md --validate
This will:
- Extract text content with proper heading detection
- Convert tables to Markdown format
- Extract and save images to an images directory
- Generate a high-quality Markdown file with embedded images
- Validate the conversion results
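For batch jobs driven from Python rather than a shell loop, the same command line can be assembled programmatically. The sketch below only builds argument lists from the flags documented above; the helper name `build_command` is hypothetical and is not part of the package.

```python
from pathlib import Path


def build_command(docx_path, validate=True):
    """Build a docx2md argument list for one input file (hypothetical helper)."""
    docx_path = Path(docx_path)
    # Mirror the documented default: same file name with a .md extension.
    output_path = docx_path.with_suffix(".md")
    command = ["docx2md", "-i", str(docx_path), "-o", str(output_path)]
    if validate:
        command.append("--validate")
    return command
```

Each returned list can be passed to `subprocess.run`, for example once per file matched by `Path(".").glob("*.docx")`.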
- PNG, JPEG: Extracted directly from DOCX and embedded in Markdown
- WMF/EMF (Windows Metafile / Enhanced Metafile): Automatically converted to SVG format
- SVG: Preferred format for vector graphics
The tool automatically handles WMF/EMF to SVG conversion:
- With Inkscape: If Inkscape is installed, it performs high-quality conversion
- Without Inkscape: Generates informative placeholder SVGs with installation instructions
- File Size Based: Larger files (>50KB) get detailed placeholders with conversion instructions
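The three rules above amount to a small decision function. The following is a minimal sketch of that logic, not the tool's actual code: the function names, the strategy labels, the placeholder layout, and the reading of 50KB as 50 * 1024 bytes are all assumptions made for illustration.

```python
from xml.sax.saxutils import escape

# Assumption: "50KB" is read as 50 * 1024 bytes.
DETAILED_PLACEHOLDER_THRESHOLD = 50 * 1024


def choose_strategy(file_size_bytes, inkscape_available):
    """Pick how a WMF/EMF image is handled (hypothetical labels)."""
    if inkscape_available:
        return "inkscape"
    if file_size_bytes > DETAILED_PLACEHOLDER_THRESHOLD:
        return "detailed_placeholder"
    return "simple_placeholder"


def make_placeholder_svg(image_name, detailed=False):
    """Return a small SVG that explains why the image is missing."""
    lines = [
        f"WMF/EMF image not converted: {image_name}",
        "Install Inkscape to enable SVG conversion.",
    ]
    if detailed:
        lines.append("Then rerun docx2md to convert this image.")
    # Escape the text so file names containing & or < keep the SVG well-formed.
    text_elements = "".join(
        f'<text x="10" y="{25 + 20 * index}" font-size="12">{escape(line)}</text>'
        for index, line in enumerate(lines)
    )
    return (
        '<svg xmlns="http://www.w3.org/2000/svg" width="480" height="80">'
        '<rect width="100%" height="100%" fill="#eeeeee"/>'
        f"{text_elements}</svg>"
    )
```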
For optimal results, install Inkscape:
# Windows (using chocolatey)
choco install inkscape
# Windows (Microsoft Store)
# Search for "Inkscape" in Microsoft Store
# macOS (using homebrew)
brew install inkscape
# Linux (Ubuntu/Debian)
sudo apt install inkscape
Automatic Detection: The tool automatically detects Inkscape in:
- System PATH
- Common installation directories
- Microsoft Store installations
- Portable installations
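A detection pass along these lines can be sketched with the standard library. This is an illustration under assumptions, not the tool's implementation: `find_inkscape` is a hypothetical name, and the candidate list simply reuses the example paths given elsewhere in this README.

```python
import shutil
from pathlib import Path

# Illustrative candidates only; the tool's real search list may differ.
COMMON_INKSCAPE_PATHS = [
    r"C:\Program Files\Inkscape\bin\inkscape.exe",
    "/Applications/Inkscape.app/Contents/MacOS/inkscape",
    "/usr/bin/inkscape",
]


def find_inkscape(candidates=COMMON_INKSCAPE_PATHS, search_path=None):
    """Return the first Inkscape executable found, or None."""
    # 1. Look on the system PATH (or on search_path, if one is given).
    found = shutil.which("inkscape", path=search_path)
    if found:
        return found
    # 2. Fall back to well-known installation locations.
    for candidate in candidates:
        if Path(candidate).is_file():
            return candidate
    return None
```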
Interactive Path Input: If Inkscape is not found automatically, the tool provides an interactive prompt with four options:
- Option 1: Enter Inkscape executable path manually (use for this session only)
- Option 2: Enter Inkscape path and remember for future use
- Option 3: Skip SVG conversion (use placeholder images)
- Option 4: Cancel conversion
Persistent Storage: When you choose option 2, the Inkscape path is saved to a configuration file and automatically used in future conversions.
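As a rough illustration of how a remembered path could be stored, the sketch below keeps it in a small JSON file. The file location, the `inkscape_path` key, and both function names are assumptions; the README does not specify where the tool keeps its configuration.

```python
import json
from pathlib import Path

# Hypothetical location; the tool's actual configuration file is not documented here.
DEFAULT_CONFIG_FILE = Path.home() / ".docx2md_config.json"


def _read_config(config_file):
    """Read the config as a dict, treating a missing or corrupt file as empty."""
    try:
        data = json.loads(Path(config_file).read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return {}
    return data if isinstance(data, dict) else {}


def save_inkscape_path(inkscape_path, config_file=DEFAULT_CONFIG_FILE):
    """Store the path while preserving any other keys already in the file."""
    config = _read_config(config_file)
    config["inkscape_path"] = str(inkscape_path)
    Path(config_file).write_text(json.dumps(config, indent=2), encoding="utf-8")


def load_inkscape_path(config_file=DEFAULT_CONFIG_FILE):
    """Return the remembered path, or None if nothing has been saved."""
    return _read_config(config_file).get("inkscape_path")
```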
Non-Interactive Support: In batch processing or script environments, the tool automatically detects non-interactive mode and uses placeholder images without prompting.
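One common way to detect a non-interactive run is to check whether standard input is attached to a terminal. The sketch below assumes that approach; the tool may use a different test, and both function names are hypothetical.

```python
import sys


def is_interactive(stream=None):
    """Return True only when the stream is attached to a terminal."""
    stream = sys.stdin if stream is None else stream
    isatty = getattr(stream, "isatty", None)
    try:
        return bool(isatty and isatty())
    except ValueError:  # raised when the stream has already been closed
        return False


def resolve_missing_inkscape(prompt_user, stream=None):
    """Ask the user when interactive; otherwise fall back to placeholders."""
    if not is_interactive(stream):
        return "placeholder"
    return prompt_user()
```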
With Inkscape installed, the tool will automatically convert WMF/EMF files to high-quality SVG format during conversion.
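For reference, a conversion of this kind can be driven through the Inkscape command line. The sketch below assumes Inkscape 1.x, whose CLI accepts `--export-type` and `--export-filename`; the helper names are hypothetical and the exact arguments the tool passes are not documented here.

```python
import subprocess
from pathlib import Path


def build_inkscape_command(inkscape_path, source_path, target_path):
    """Assemble an Inkscape 1.x command that exports the source file as SVG."""
    return [
        str(inkscape_path),
        str(source_path),
        "--export-type=svg",
        f"--export-filename={target_path}",
    ]


def convert_metafile_to_svg(inkscape_path, source_path, timeout=60):
    """Convert a WMF/EMF file to SVG; return the SVG path, or None on failure."""
    source_path = Path(source_path)
    target_path = source_path.with_suffix(".svg")
    command = build_inkscape_command(inkscape_path, source_path, target_path)
    try:
        subprocess.run(command, check=True, capture_output=True, timeout=timeout)
    except (OSError, subprocess.SubprocessError):
        # Covers a missing executable, a non-zero exit code, and a timeout.
        return None
    return target_path if target_path.exists() else None
```

Returning `None` on any failure lets the caller fall back to a placeholder image instead of aborting the whole conversion.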
If Inkscape is not found automatically, the tool provides an interactive prompt:
============================================================
Inkscape not found automatically
Inkscape is required for converting WMF/EMF files to SVG
============================================================
Options:
1. Enter Inkscape executable path manually
2. Skip SVG conversion (use placeholder images)
3. Cancel conversion
Please choose an option (1/2/3):
If you choose to enter the path manually, you'll be prompted for the full path to the Inkscape executable:
Please enter the full path to Inkscape executable:
Example paths:
- Windows: C:\Program Files\Inkscape\bin\inkscape.exe
- Windows (Microsoft Store): C:\Program Files\WindowsApps\25415Inkscape.Inkscape_1.4.21.0_x64__9waqn51p1ttv2\VFS\ProgramFilesX64\Inkscape\bin\inkscape.exe
- macOS: /Applications/Inkscape.app/Contents/MacOS/inkscape
- Linux: /usr/bin/inkscape
The tool will validate the path and confirm if Inkscape is working correctly.
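A path check of this kind can be approximated by running the executable with `--version` and inspecting the output. This is a sketch under that assumption, with a hypothetical function name; the tool's own validation may be stricter or different.

```python
import subprocess
from pathlib import Path


def validate_inkscape_path(inkscape_path, timeout=15):
    """Return True if the path points to a working Inkscape executable."""
    # Tolerate surrounding whitespace and quotes from pasted paths.
    path = Path(str(inkscape_path).strip().strip('"'))
    if not path.is_file():
        return False
    try:
        result = subprocess.run(
            [str(path), "--version"],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.SubprocessError):
        return False
    # Inkscape reports its name and version on standard output.
    return result.returncode == 0 and "inkscape" in result.stdout.lower()
```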
If you choose to skip SVG conversion, the tool generates informative placeholder SVG images with installation instructions instead of converting the originals.
If you choose to cancel, the conversion process stops.
In batch processing, scripts, or other non-interactive environments, the tool automatically detects the non-interactive mode and uses placeholder images without prompting the user.
- File Not Found: Ensure the input DOCX file exists and the path is correct
- Permission Errors: Check write permissions for output directory
- Image Extraction Failures: Verify DOCX file contains embedded images
- Validation Warnings: Some duplicate images are normal in technical documents
- WMF/EMF Images Showing Placeholders: Install Inkscape for automatic high-quality SVG conversion
- Inkscape Not Found: Use the interactive path input feature to manually specify Inkscape location
- Use docx2md --help for command reference
- Review validation output for conversion quality assessment
This software is licensed under a commercial license. Personal, educational, and non-commercial use is free. Commercial use requires a paid license.
See LICENSE for full terms and conditions.
See CONTRIBUTING.md for contribution guidelines.