xiaoshaoning/docx2md

docx2md

A Python tool for converting Microsoft DOCX files to Markdown format with image and table preservation. Optimized for 3GPP specification documents.

Features

  • Text Conversion: Convert DOCX text content to Markdown with proper heading detection
  • Table Preservation: Convert Word tables to Markdown table syntax
  • Image Extraction: Extract images and embed them in Markdown
  • Automatic SVG Conversion: Convert WMF/EMF files to SVG using Inkscape, or generate placeholder SVGs when Inkscape is unavailable
  • Inkscape Integration: Automatic detection and use of Inkscape for high-quality conversions
  • Interactive Path Input: Manual Inkscape path input when automatic detection fails
  • Persistent Storage: Remember Inkscape path for future use
  • 3GPP Specialization: Enhanced support for 3GPP specification document formatting
  • High-Quality Output: Generate clean, readable Markdown files

Installation

From PyPI (Recommended)

pip install docx2md

From Source

  1. Clone the repository:

    git clone <repository-url>
    cd docx2md
  2. Install in development mode:

    pip install -e .

Usage

Command Line Interface

# Basic conversion
docx2md -i input.docx

# Specify output file
docx2md -i input.docx -o output.md

# Skip image extraction
docx2md -i input.docx --no-images

# Custom image directory
docx2md -i input.docx --images-dir my_images

# Verbose output with validation
docx2md -i input.docx -v --validate

# Get help
docx2md --help

Command Line Options

# Required arguments
-i, --input INPUT     Input DOCX file path

# Optional arguments
-o, --output OUTPUT   Output Markdown file path (default: input filename with .md extension)
--images-dir DIR      Directory to extract images (default: images)
--inkscape-path PATH  Custom path to Inkscape executable
--no-images           Skip image extraction (only convert text and tables)
--validate            Run validation after conversion
-v, --verbose         Enable verbose output
-h, --help            Show help message
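
For readers who want to script around the CLI, the option table above can be mirrored with the standard `argparse` module. This is a minimal sketch, not the tool's actual parser: `build_parser` and `default_output_path` are hypothetical names, and the help strings are copied from the table.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="docx2md",
        description="Convert a DOCX file to Markdown.",
    )
    parser.add_argument("-i", "--input", required=True, help="Input DOCX file path")
    parser.add_argument("-o", "--output", help="Output Markdown file path")
    parser.add_argument("--images-dir", default="images", help="Directory to extract images")
    parser.add_argument("--inkscape-path", help="Custom path to Inkscape executable")
    parser.add_argument("--no-images", action="store_true", help="Skip image extraction")
    parser.add_argument("--validate", action="store_true", help="Run validation after conversion")
    parser.add_argument("-v", "--verbose", action="store_true", help="Enable verbose output")
    return parser


def default_output_path(input_path: str) -> str:
    """Derive the documented default: the input filename with a .md extension."""
    return str(Path(input_path).with_suffix(".md"))


# Parse a sample command line instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["-i", "input.docx", "--images-dir", "assets", "-v"])
output = args.output or default_output_path(args.input)
print(output, args.images_dir, args.no_images)
```

`-h, --help` is not declared explicitly because `argparse` adds it by default.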

Advanced Examples

# Convert 3GPP specification document with full validation
docx2md -i nprach.docx -o nprach.md --validate -v

# Convert with custom Inkscape path
docx2md -i document.docx --inkscape-path "C:\Program Files\Inkscape\bin\inkscape.exe"

# Convert multiple documents (batch processing)
for doc in *.docx; do
    docx2md -i "$doc" -o "${doc%.docx}.md"
done

# Convert without images (text and tables only)
docx2md -i document.docx --no-images

# Convert with custom image directory
docx2md -i document.docx --images-dir assets

# Convert with custom image directory and Inkscape path
docx2md -i document.docx --images-dir assets --inkscape-path "/usr/bin/inkscape"

Python API

from docx2md.src.convert_embedded_images import EmbeddedImagesDocx2MdConverter

# Basic conversion
converter = EmbeddedImagesDocx2MdConverter()
results = converter.convert("input.docx", "output.md")

# With custom Inkscape path
inkscape_path = r"C:\Program Files\Inkscape\bin\inkscape.exe"
converter = EmbeddedImagesDocx2MdConverter(inkscape_path=inkscape_path)
results = converter.convert("input.docx", "output.md")

# With custom image directory and Inkscape path
converter = EmbeddedImagesDocx2MdConverter(
    base_image_path="my_images",
    inkscape_path=inkscape_path
)
results = converter.convert("input.docx", "output.md")

# Skip image extraction
results = converter.convert(
    "input.docx",
    "output.md",
    extract_images=False
)
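
The shell loop under Advanced Examples has a straightforward Python counterpart built on the `convert(input_path, output_path)` call shown above. The helper below is a sketch: `convert_folder` is a hypothetical name, and because the structure of the value returned by `convert` is not documented here, the helper stores it untouched.

```python
from pathlib import Path
from typing import Any, Dict


def convert_folder(converter: Any, folder: str) -> Dict[str, Any]:
    """Convert every .docx file in `folder`, writing each .md next to its source.

    `converter` is any object exposing convert(input_path, output_path), such as
    the EmbeddedImagesDocx2MdConverter shown above. Failures are recorded
    rather than raised, so one bad file does not stop the batch.
    """
    outcomes: Dict[str, Any] = {}
    for docx_path in sorted(Path(folder).glob("*.docx")):
        md_path = docx_path.with_suffix(".md")
        try:
            outcomes[docx_path.name] = converter.convert(str(docx_path), str(md_path))
        except Exception as error:  # the exceptions convert() may raise are not documented
            outcomes[docx_path.name] = error
    return outcomes


# Usage (requires docx2md to be installed):
# from docx2md.src.convert_embedded_images import EmbeddedImagesDocx2MdConverter
# outcomes = convert_folder(EmbeddedImagesDocx2MdConverter(), "specs")
```

Passing the converter in as an argument keeps a single instance, and therefore a single Inkscape configuration, for the whole batch.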

Project Structure

docx2md/
├── docx2md/                   # Main package
│   ├── __init__.py            # Package initialization
│   ├── docx2md.py             # Command line interface
│   └── src/                   # Source modules
│       ├── __init__.py
│       ├── convert_embedded_images.py    # Main converter class
│       ├── docx_parser.py          # DOCX content extraction
│       ├── markdown_generator.py   # Markdown generation
│       ├── paragraph_with_images_extractor.py  # Paragraph and image extraction
│       ├── validate_embedded_images.py   # Validation tools
│       └── xml_based_extractor.py  # XML-based extraction
├── tests/                      # Unit tests
├── requirements.txt            # Dependencies
├── setup.py                   # Package setup
├── LICENSE                     # License file
└── README.md                  # This file

Dependencies

  • python-docx>=0.8.11 - DOCX file parsing
  • Pillow>=9.0.0 - Image processing
  • numpy>=1.21.0 - Numerical operations
  • pytest>=7.0.0 - Testing framework

Testing

Run the test suite:

python -m pytest tests/ -v

Code Standards

  • Naming: snake_case for all variables and functions
  • Comments: No Chinese in comments or print statements
  • Testing: Unit tests for all functions
  • Type Support: Full type hinting support

Example

Convert a 3GPP specification document:

docx2md -i nprach.docx -o nprach.md --validate

This will:

  • Extract text content with proper heading detection
  • Convert tables to Markdown format
  • Extract and save images to an images directory
  • Generate a high-quality Markdown file with embedded images
  • Validate the conversion results
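
To make the table step concrete, here is a minimal sketch of rendering rows of cell text as a Markdown pipe table. It does not reproduce the tool's own generator: `rows_to_markdown` is a hypothetical helper, it treats the first row as the header, and it ignores merged cells. With python-docx, the cell text could be gathered from `table.rows`, `row.cells`, and `cell.text`.

```python
from typing import List


def rows_to_markdown(rows: List[List[str]]) -> str:
    """Render rows of cell text as a Markdown pipe table; the first row is the header."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def render(row: List[str]) -> str:
        # Escape pipes, flatten line breaks inside cells, and pad short rows.
        cells = [cell.replace("|", "\\|").replace("\n", " ").strip() for cell in row]
        cells += [""] * (width - len(cells))
        return "| " + " | ".join(cells) + " |"

    lines = [render(rows[0]), "| " + " | ".join(["---"] * width) + " |"]
    lines.extend(render(row) for row in rows[1:])
    return "\n".join(lines)


# Made-up sample data, for illustration only.
print(rows_to_markdown([["Name", "Value"], ["alpha", "1"], ["beta", "2"]]))
```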

Image Handling

Supported Image Formats

  • PNG, JPEG: Extracted directly from DOCX and embedded in Markdown
  • WMF/EMF (Windows Metafile): Automatically converted to SVG format
  • SVG: Preferred format for vector graphics
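
A DOCX file is a ZIP archive, and embedded images normally live under `word/media/`. The sketch below groups those entries by the format categories listed above, using only the standard library. `classify_media` and the group names are assumptions made for illustration, not part of the docx2md API.

```python
import io
import zipfile
from pathlib import PurePosixPath
from typing import Dict, List

RASTER = {".png", ".jpg", ".jpeg"}
METAFILE = {".wmf", ".emf"}


def classify_media(docx_file) -> Dict[str, List[str]]:
    """Group the files under word/media/ by how they would be handled."""
    groups: Dict[str, List[str]] = {"raster": [], "metafile": [], "svg": [], "other": []}
    with zipfile.ZipFile(docx_file) as archive:
        for name in archive.namelist():
            if not name.startswith("word/media/") or name.endswith("/"):
                continue
            suffix = PurePosixPath(name).suffix.lower()
            if suffix in RASTER:
                groups["raster"].append(name)
            elif suffix in METAFILE:
                groups["metafile"].append(name)
            elif suffix == ".svg":
                groups["svg"].append(name)
            else:
                groups["other"].append(name)
    return groups


# Build a tiny stand-in archive in memory so the sketch runs without a real DOCX.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    for name in ("word/document.xml", "word/media/image1.png", "word/media/image2.emf"):
        archive.writestr(name, b"")
buffer.seek(0)
groups = classify_media(buffer)
print(groups)
```

`classify_media` also accepts a file path, so it can serve as a quick check of whether a document contains WMF/EMF images before Inkscape is needed.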

Automatic SVG Conversion

The tool automatically handles WMF/EMF to SVG conversion:

  1. With Inkscape: If Inkscape is installed, it performs high-quality conversion
  2. Without Inkscape: Generates informative placeholder SVGs with installation instructions
  3. File Size Based: Larger files (>50KB) get detailed placeholders with conversion instructions
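
The three rules above can be read as a small decision function. The sketch below is one interpretation, not the tool's implementation. It assumes the size rule applies only when Inkscape is unavailable and that ">50KB" means 50 × 1024 bytes. The names `plan_conversion` and `inkscape_command` are hypothetical. The `--export-filename` flag belongs to Inkscape 1.x; older 0.9x releases use different export options.

```python
from typing import List, Optional

# ">50KB" from the list above, interpreted here as 50 KiB (an assumption).
DETAILED_PLACEHOLDER_THRESHOLD = 50 * 1024


def plan_conversion(file_size: int, inkscape_path: Optional[str]) -> str:
    """Decide how a WMF/EMF file of `file_size` bytes would be handled."""
    if inkscape_path:
        return "inkscape"
    if file_size > DETAILED_PLACEHOLDER_THRESHOLD:
        return "detailed_placeholder"
    return "simple_placeholder"


def inkscape_command(inkscape_path: str, source: str, target: str) -> List[str]:
    """Build an Inkscape 1.x command line that exports `source` to `target`."""
    return [inkscape_path, source, "--export-filename=" + target]


print(plan_conversion(120_000, None))
print(inkscape_command("/usr/bin/inkscape", "image1.emf", "image1.svg"))
```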

Inkscape Integration

For optimal results, install Inkscape:

# Windows (using chocolatey)
choco install inkscape

# Windows (Microsoft Store)
# Search for "Inkscape" in Microsoft Store

# macOS (using homebrew)
brew install inkscape

# Linux (Ubuntu/Debian)
sudo apt install inkscape

Automatic Detection: The tool automatically detects Inkscape in:

  • System PATH
  • Common installation directories
  • Microsoft Store installations
  • Portable installations
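
A lookup along these lines can be sketched with `shutil.which` plus a list of fallback locations. The candidate paths below are the example paths given elsewhere in this README. The directories the tool actually scans are not listed here, so treat both the list and the `find_inkscape` name as assumptions.

```python
import os
import shutil
from typing import Iterable, Optional

# Illustrative candidates only; the tool's real search list may differ.
CANDIDATE_PATHS = [
    r"C:\Program Files\Inkscape\bin\inkscape.exe",
    "/Applications/Inkscape.app/Contents/MacOS/inkscape",
    "/usr/bin/inkscape",
]


def find_inkscape(
    candidates: Iterable[str] = CANDIDATE_PATHS,
    executable: str = "inkscape",
) -> Optional[str]:
    """Return the first usable Inkscape path, checking PATH before the candidates."""
    on_path = shutil.which(executable)
    if on_path:
        return on_path
    for candidate in candidates:
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


print(find_inkscape() or "Inkscape not found")
```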

Interactive Path Input: If Inkscape is not found automatically, the tool provides an interactive prompt with four options:

  • Option 1: Enter Inkscape executable path manually (use for this session only)
  • Option 2: Enter Inkscape path and remember for future use
  • Option 3: Skip SVG conversion (use placeholder images)
  • Option 4: Cancel conversion

Persistent Storage: When you choose option 2, the Inkscape path is saved to a configuration file and automatically used in future conversions.
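
Remembering a path between runs can be as simple as a small JSON file. The sketch below assumes a hypothetical `~/.docx2md_config.json` file with an `inkscape_path` key. The tool's real configuration file name, location, and format are not documented here and may differ.

```python
import json
from pathlib import Path
from typing import Optional

# Hypothetical location and key name, chosen for illustration.
CONFIG_FILE = Path.home() / ".docx2md_config.json"


def save_inkscape_path(path: str, config_file: Path = CONFIG_FILE) -> None:
    """Store the Inkscape path so later runs can reuse it."""
    config_file.write_text(json.dumps({"inkscape_path": path}, indent=2), encoding="utf-8")


def load_inkscape_path(config_file: Path = CONFIG_FILE) -> Optional[str]:
    """Return the stored path, or None if the file is missing, invalid, or stale."""
    try:
        data = json.loads(config_file.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return None
    path = data.get("inkscape_path") if isinstance(data, dict) else None
    # Ignore a remembered path that no longer points at a file.
    return path if isinstance(path, str) and Path(path).is_file() else None
```

Checking that the stored path still exists on load avoids silently reusing a location that an Inkscape upgrade or uninstall has invalidated.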

Non-Interactive Support: In batch processing or script environments, the tool automatically detects non-interactive mode and uses placeholder images without prompting.
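
A common way to detect a non-interactive run is to check whether standard input is attached to a terminal. The sketch below shows that approach. The tool's actual detection logic is not described here, so the `is_interactive` and `choose_mode` names and the mode strings are assumptions.

```python
import sys


def is_interactive(stream=None) -> bool:
    """Return True when `stream` (default: sys.stdin) is attached to a terminal."""
    stream = sys.stdin if stream is None else stream
    isatty = getattr(stream, "isatty", None)
    try:
        return bool(isatty and isatty())
    except ValueError:  # raised when the stream has already been closed
        return False


def choose_mode(inkscape_found: bool, stream=None) -> str:
    """Pick between converting, prompting the user, and falling back to placeholders."""
    if inkscape_found:
        return "convert"
    return "prompt" if is_interactive(stream) else "placeholder"


print(choose_mode(inkscape_found=False))
```

In a pipeline or scheduled job, standard input is typically not a terminal, so this check selects placeholders without blocking on a prompt.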

With Inkscape installed, the tool will automatically convert WMF/EMF files to high-quality SVG format during conversion.

Interactive Inkscape Path Input

When Automatic Detection Fails

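If Inkscape is not found automatically, the tool displays an interactive prompt. The transcript below lists three options; the option to remember the path for future use, described under Inkscape Integration, does not appear in it: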

============================================================
Inkscape not found automatically
Inkscape is required for converting WMF/EMF files to SVG
============================================================

Options:
1. Enter Inkscape executable path manually
2. Skip SVG conversion (use placeholder images)
3. Cancel conversion

Please choose an option (1/2/3):

Option 1: Manual Path Input

If you choose option 1, you'll be prompted to enter the full path to the Inkscape executable:

Please enter the full path to Inkscape executable:

Example paths:

  • Windows: C:\Program Files\Inkscape\bin\inkscape.exe
  • Windows (Microsoft Store): C:\Program Files\WindowsApps\25415Inkscape.Inkscape_1.4.21.0_x64__9waqn51p1ttv2\VFS\ProgramFilesX64\Inkscape\bin\inkscape.exe
  • macOS: /Applications/Inkscape.app/Contents/MacOS/inkscape
  • Linux: /usr/bin/inkscape

The tool will validate the path and confirm if Inkscape is working correctly.
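
One plausible way to perform such a check is to run the executable with `--version` and confirm that the output mentions Inkscape. This is a sketch of that idea only. `inkscape_works` is a hypothetical helper, and the validation docx2md actually performs may differ.

```python
import subprocess


def inkscape_works(path: str, timeout: float = 15.0) -> bool:
    """Return True if `path` runs and identifies itself as Inkscape."""
    try:
        completed = subprocess.run(
            [path, "--version"],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.SubprocessError):
        # Missing file, missing execute permission, or a hung process.
        return False
    return completed.returncode == 0 and "inkscape" in completed.stdout.lower()


print(inkscape_works("/usr/bin/inkscape"))
```

The timeout guards against an executable that opens a window or waits for input instead of printing its version.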

Option 2: Skip SVG Conversion

If you choose option 2, the tool will generate informative placeholder SVG images with installation instructions instead of performing actual SVG conversion.

Option 3: Cancel Conversion

If you choose option 3, the conversion process will be cancelled.

Non-Interactive Environments

In batch processing, scripts, or other non-interactive environments, the tool automatically detects the non-interactive mode and uses placeholder images without prompting the user.

Troubleshooting

Common Issues

  1. File Not Found: Ensure the input DOCX file exists and the path is correct
  2. Permission Errors: Check write permissions for the output directory and the image directory
  3. Image Extraction Failures: Verify that the DOCX file contains embedded images
  4. Validation Warnings: Some duplicate images are normal in technical documents
  5. WMF/EMF Images Showing Placeholders: Install Inkscape for automatic high-quality SVG conversion
  6. Inkscape Not Found: Pass --inkscape-path, or use the interactive path input feature to specify the Inkscape location manually

Getting Help

  • Use docx2md --help for command reference
  • Review validation output for conversion quality assessment

License

This software is licensed under a commercial license. Personal, educational, and non-commercial use is free. Commercial use requires a paid license.

See LICENSE for full terms and conditions.

Contributing

See CONTRIBUTING.md for contribution guidelines.
