Converter from ODT to markdown featuring: splitting into chapters, cross-links and diagram auto-detection.

denyspoltorak/odt2wiki

ODT to wiki converter

odt2wiki is a small tool that converts an OpenOffice / LibreOffice text document into a markdown-based wiki or a static website. It was designed for processing large documents with hundreds of diagrams and thousands of cross-links.

Supported output formats: GitHub wiki and Hugo Book.

Use cases

  • You wrote a document or a book in Google Docs or MS Office, benefiting from their grammar checks and online collaboration. Now you want to publish it online, but the document is too large for a single page. A page per chapter would be much better, but the available tools preserve neither cross-references between chapters nor meaningful diagram names.

  • You have downloaded a large standard and want to make it easily accessible to your team and users.

Features

odt2wiki aims to remain simple to use and to extend while producing best-in-class output.

Killer features

  • odt2wiki preserves cross-references between sections. Your document is processed coherently and is transformed into a cohesive wiki.

  • If you have a folder with the diagrams that were used in your document, odt2wiki can rely on them for the wiki, creating meaningful captions from file names. That works even if the diagrams were resized by Google Docs behind the scenes.

  • odt2wiki adds a navigation bar and a sidebar with a collapsible table of contents to GitHub Wiki (while Hugo Book implements them by itself).

  • Image size is preserved. You won't see a small diagram from your document taking a whole page on the wiki.

  • The output is target-specific and production-ready - you can deploy it to the GitHub wiki or use it with Hugo Book to make a website.

  • Grayed-out sections are converted to quotes, retaining their distinct style.

  • There is a tool to help build light and dark website themes from SVG images.

Ordinary features

  • Support for bold, italic, underlined and strikethrough text.

  • Bulleted and numbered lists, including nested lists.

  • Tables.

  • An option to collapse sections (only with GitHub wiki for now, can be implemented for Hugo Book).

  • Debug modes and customizable analytics.

  • Next to no layout shift in Hugo-generated websites once images are matched, because every image's dimensions are written to the HTML.

Unsupported features

  • Mixed bulleted and numbered lists. If a list uses bullets at level 1, numbers at level 2, then again bullets at level 3, all its levels are output as bulleted. Mixed lists should be easy to implement in case someone uses them in practice.

  • Non-uniform tables (the ones with merged cells). I don't think that the GitHub dialect of markdown supports them.

  • Lists or images inside table cells.

Unsupported with GitHub wiki:

A couple of features are not supported by the GitHub wiki engine:

  • Colorized text - GitHub aggressively removes colors. Though odt2wiki generates color tags, they are ignored.

  • Identical names for multiple wiki pages. This is a limitation of the GitHub wiki engine: it does not support folders, so files with identical names but different paths are treated as duplicates. Please check the Pages sidebar on the GitHub wiki to make sure that you don't have duplicates. Duplicates break cross-references and navigation: all the links lead to the first page.
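
The duplicate check described above can also be run on a list of generated page paths before uploading. The following is a minimal standalone sketch, not the tool's own analytics plugin; the function name `find_duplicate_page_names` and the example paths are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def find_duplicate_page_names(page_paths):
    """Group page paths by file name and keep only names used more than once."""
    by_name = defaultdict(list)
    for path in page_paths:
        # GitHub wiki ignores folders, so only the file name identifies a page.
        by_name[PurePosixPath(path).name].append(path)
    return {name: paths for name, paths in by_name.items() if len(paths) > 1}


# Hypothetical output of a run with one folder per part.
pages = [
    "Part1/Introduction.md",
    "Part2/Introduction.md",
    "Part2/Summary.md",
]
for name, paths in find_duplicate_page_names(pages).items():
    print(f"{name} is used by: {', '.join(paths)}")
```

Any name reported here would need to be changed in the source document (or the pages merged) before the wiki is published.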

Usage

odt2wiki generates markdown files which are the main content for wikis. Below is a general instruction, with target-specific steps outlined in the sections that follow:

  1. You need an ODT document with outline levels set up. If they are, the table of contents in LibreOffice's navbar represents the structure of the document.

    • If it is empty (e.g. because the document was exported from Google Docs), please follow the "Document structure" step of this instruction.
  2. Run the script, for example: ./odt2wiki.py ~/Documents/MyDoc.odt ~/Work/MyWiki -c github -s 2 -i ~/Diagrams/MyDoc -r https://raw.githubusercontent.com/myname/myrepo/main/MyDoc -z my_custom_code

    • Positional arguments are the input ODT file (~/Documents/MyDoc.odt) and the output folder to be created (~/Work/MyWiki). The output folder should not already exist.

    • -c or --convert is the output format. Use github for GitHub wiki.

    • -s or --split-level is where you divide your wiki into pages. If your document is structured into parts (level 1), chapters (level 2) and sections (level 3) and you specify -s 2 you will have a wiki folder per part and a wiki page per chapter.

    • Optionally, you can add -l or --collapse-level to collapse sections of that outline level (GitHub format only).

    • -z or --customize allows you to provide your own code for processing your document: split chapters below the --split-level and set up SEO strings. The value of the argument is the name of a Python module in the custom folder. See custom/metapatterns.py.

    • Matching images:

      • By default, all the images from the document are extracted to the Pictures subfolder in the destination and given names image000, image001, etc.

      • If you want to match images from the document to external images, you need to install Pillow: pip install pillow.

      • -i or --images-folder is the path to a folder on your drive which contains images used throughout your document.

      • -r or --remote-images is the path to the corresponding folder with images on the server where your wiki will run.

      • Any image matched in the local folder (as given via -i argument) will be linked to a corresponding image at the remote -r path.

      • In our example, any chapter that uses ~/Diagrams/MyDoc/ColorDrawings/Foo/Bar.png will translate into a wiki page that references https://raw.githubusercontent.com/myname/myrepo/main/MyDoc/ColorDrawings/Foo/Bar.png with "Bar" for alt text.

      • -v or --use-svg replaces any raster images with SVG files from the same folder. SVG files compress better, so you should use them if you have them. You can also change colors in SVG images with the svgcolor.py tool.

  3. Customize the generated content.

  4. Deploy it.
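
The relationship between the -i and -r arguments in step 2 can be illustrated with a few lines of Python. This is only a sketch of the path arithmetic, assuming POSIX-style absolute paths; the helper name `remote_image_link` is hypothetical and is not part of odt2wiki.

```python
from pathlib import PurePosixPath


def remote_image_link(local_path, images_folder, remote_base):
    """Return (url, alt_text) for an image matched inside the local images folder."""
    # Path of the image relative to the folder given via -i.
    relative = PurePosixPath(local_path).relative_to(PurePosixPath(images_folder))
    # The same relative path is appended to the remote folder given via -r.
    url = remote_base.rstrip("/") + "/" + relative.as_posix()
    # The file name without its extension serves as alt text.
    return url, relative.stem


url, alt = remote_image_link(
    "/home/me/Diagrams/MyDoc/ColorDrawings/Foo/Bar.png",
    "/home/me/Diagrams/MyDoc",
    "https://raw.githubusercontent.com/myname/myrepo/main/MyDoc",
)
print(alt, url)
```

The folder structure below the images folder is kept intact, which is why the remote folder has to mirror the local one.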

Generating a GitHub wiki

Run ./odt2wiki.py ~/Documents/MyDoc.odt ~/Work/MyWiki -c github -s 2 -i ~/Diagrams/MyDoc -r https://raw.githubusercontent.com/myname/myrepo/main/MyDoc -z my_code

Customize your wiki by editing:

  • The generated Home.md which now has the table of contents listing all your wiki pages. You will want to add an introduction.

  • _Sidebar.md is a good place for extra links or your logo. It already contains a generated table of contents.

  • _Footer.md was not generated - it's up to you to fill.

  • And you will likely need to remove the remnants of your title page from the beginning of Introduction.md.

Commit to GitHub.

Generating a Hugo-Book-based website

Install Hugo by running sudo snap install hugo.

Create a project: hugo new site my-wiki.

Set up Hugo Book from the latest release (v12.0.0) or get my fork with a few CSS customizations - see its metapatterns branch.

Run ./odt2wiki.py ~/Documents/MyDoc.odt ~/Work/Hugo -c hugo -s 2 -i ~/Diagrams/MyDoc -r "diagrams" -v -z my_code

Copy the script's output from ~/Work/Hugo to your newly created Hugo project's content folder.

Move the Pictures folder from Hugo's content folder to its static folder.

Copy the original images you ran odt2wiki against from ~/Diagrams/MyDoc to Hugo's static folder.

Edit hugo.toml in the project's root and the generated markdown files.

Check the results by running hugo server in the Hugo project's folder.

See if you can do SEO.

Generate your website: hugo.

Publish it.

Customizing a website

There are several customizable steps in the website generation process. See Customization in plugins.py and custom/metapatterns.py as the real-world example.

SVG color converter

The project includes a tool for batch-transforming SVG images. It allows for color correction and color transformation.

Color analytics

  • svgcolor.py <input_folder> --list lists each color and the number of SVG images that contain that color in the input folder.

  • svgcolor.py <input_folder> --find-images lists all SVG images that contain embedded (likely raster) images.

  • svgcolor.py <input_folder> --list-one <file_name> lists the colors that a given SVG image uses.

  • svgcolor.py <input_folder> --find <color_code> lists images which use the input color.

  • svgcolor.py <input_folder> --no-find <color_code> lists images which don't use the input color.

Color mapping

  • svgcolor.py <input_folder> <output_folder> --remap <color_map> [--infix <string>] Copy all SVG images from the input folder to the output folder (should not exist) while remapping colors. The format of the color_map file is given below. The infix is an optional string which is added between the file name and file extension, for example, --infix dark reads foo.svg and writes the transformed image to foo.dark.svg.

Color map file

The color map file contains two columns of colors in hex notation. For example:

ffffff 000000

changes everything white to black. You can have multiple lines in the file to transform many colors in one run.

Comments with # and blank lines are supported.
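
To make the file format concrete, here is a minimal sketch of a reader for the two-column lines. It is not the code from svgcolor.py: the function name `parse_color_map` is hypothetical, and treating `#` as starting a comment anywhere on a line is an assumption. Lines that are not a pair of 6-digit hex colors are skipped by this sketch.

```python
import re

HEX_COLOR = re.compile(r"[0-9a-fA-F]{6}")


def parse_color_map(text):
    """Return a dict that maps source colors to target colors (lowercase hex)."""
    mapping = {}
    for line in text.splitlines():
        # Assumption: everything after '#' is a comment.
        fields = line.split("#", 1)[0].split()
        if len(fields) != 2:
            continue  # blank line, comment-only line or something else
        source, target = fields
        if HEX_COLOR.fullmatch(source) and HEX_COLOR.fullmatch(target):
            mapping[source.lower()] = target.lower()
    return mapping


example = """
# white becomes black
ffffff 000000

A0A0A0 505050  # gray becomes darker
"""
print(parse_color_map(example))
```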

There are several transformations for colors not matched by the map:

  • * <multiplier> multiplies each of the R, G and B channels by the given number. * 0.5 changes a0a0a0 to 505050.

  • ~ <pivot> inverts the HSL lightness (brightness) of every unmapped color around the pivot point. For example, ~ 0.8 will turn a color with brightness 20% to brightness 95% and another color with brightness of 90% to 40%. This is useful for conversion between the light and dark themes.

  • / <divisor> saturates colors by dividing the distance from the color's HSL saturation to 1. For example, / 2 will turn a color with 40% saturation to 70% saturation. This also helps with making a dark theme with high saturation.
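
The three transformations can be sketched with Python's standard colorsys module. This is an illustration and not the code from svgcolor.py. In particular, the piecewise-linear formula in `invert_lightness` is an assumption: it was chosen because it reproduces both examples given above (20% to 95% and 90% to 40% for a pivot of 0.8). All function names are hypothetical.

```python
import colorsys


def _to_hex(channels):
    """Format three 0..255 integers as a 6-digit hex color."""
    return "".join(f"{c:02x}" for c in channels)


def multiply_channels(color, multiplier):
    """'* <multiplier>': scale the R, G and B channels, clamped to 0..255."""
    channels = [int(color[i:i + 2], 16) for i in (0, 2, 4)]
    return _to_hex(max(0, min(255, round(c * multiplier))) for c in channels)


def invert_lightness(lightness, pivot):
    """'~ <pivot>': flip lightness so that the pivot stays in place (assumed formula)."""
    if not 0.0 < pivot < 1.0:
        raise ValueError("pivot must be strictly between 0 and 1")
    if lightness <= pivot:
        # [0, pivot] is mapped onto [1, pivot]
        return 1.0 - lightness * (1.0 - pivot) / pivot
    # [pivot, 1] is mapped onto [pivot, 0]
    return pivot * (1.0 - lightness) / (1.0 - pivot)


def saturate(saturation, divisor):
    """'/ <divisor>': divide the distance between the saturation and 1."""
    return 1.0 - (1.0 - saturation) / divisor


def invert_color(color, pivot):
    """Apply invert_lightness to a hex color through the HLS color space."""
    r, g, b = (int(color[i:i + 2], 16) / 255 for i in (0, 2, 4))
    h, l, s = colorsys.rgb_to_hls(r, g, b)
    rgb = colorsys.hls_to_rgb(h, invert_lightness(l, pivot), s)
    return _to_hex(round(c * 255) for c in rgb)


print(multiply_channels("a0a0a0", 0.5))
print(invert_color("ffffff", 0.8))
```

Note that colorsys orders the components as hue, lightness, saturation (HLS), not HSL.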

You can find examples of the color maps used for the Architectural Metapatterns website in the custom folder.

Preparing images for OpenGraph

OpenGraph clients vary in their expectations for image dimensions. The following ImageMagick command centers each input image in a 630x630 square (for Twitter), after which the area is expanded to the recommended 1200x630 resolution (for Facebook) by filling the sides of the image with a white background color. This makes the resulting diagram fit any social network, as it looks good in both wide and square previews:

convert InputFolder/*.png -set filename:fn %[basename] -background white -resize 630x630 -gravity center -extent 1200x630 OutputFolder/%[filename:fn].png

Then feed a resized image per page by subclassing plugins.Customization and returning the image's filename from get_preview_image(). See the corresponding method in custom/metapatterns.py.

Troubleshooting

If anything goes wrong (you get a failed assertion or some content from the document does not appear on the wiki), there are several troubleshooting modes:

  • odt2wiki.py MyBook.odt --print=files prints all the files archived inside the ODT (which is merely a ZIP archive). The ones of interest are:

    • content.xml which contains all the text from the document.

    • styles.xml with predefined settings and styles, such as paragraph indents and options for list bullets.

    • Pictures/* - here are all your diagrams.

    • Thumbnails/thumbnail.png - the thumbnail for your document.

  • odt2wiki.py MyBook.odt --print=attrs lists attributes found in the document for each XML tag. Yes, ODT is an archived XML.

  • odt2wiki.py MyBook.odt --print=tags outputs a tree of tags for your document.

  • odt2wiki.py MyBook.odt MyBook.txt --convert=text extracts all the text odt2wiki recognizes in the document to a txt file. This can be useful if some content is missing in the wiki output.

  • _print_doc_tree(doc) in odt2wiki.py prints the tree of headers (DOM) in a document.

  • Finally, you can extract content.xml and styles.xml from the ODT archive by using any unzip software and view them in your browser. There is hardly anything as useful for debugging as looking at the data.
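
Because an ODT file is a ZIP archive with XML inside, the same kind of inspection is possible with the Python standard library alone. The sketch below is independent of odt2wiki's own --print modes; the function names are hypothetical, and MyBook.odt stands for your own file.

```python
import zipfile
from collections import Counter
from xml.etree import ElementTree


def list_odt_files(odt):
    """Return the names of all files stored inside the ODT archive."""
    with zipfile.ZipFile(odt) as archive:
        return archive.namelist()


def count_tags(odt, member="content.xml"):
    """Count XML tags in one member of the archive, with namespaces stripped."""
    with zipfile.ZipFile(odt) as archive:
        root = ElementTree.fromstring(archive.read(member))
    # ElementTree spells tags as '{namespace-uri}name'; keep only the name.
    return Counter(element.tag.rsplit("}", 1)[-1] for element in root.iter())


if __name__ == "__main__":
    import sys

    # Usage: python inspect_odt.py MyBook.odt
    if len(sys.argv) > 1:
        print("\n".join(list_odt_files(sys.argv[1])))
        for tag, count in count_tags(sys.argv[1]).most_common(10):
            print(f"{count:6d} {tag}")
```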

Under the hood

If you decide to fix or extend the script, here are its components:

  • odt2wiki.py - the main file with command-line arguments and application-level logic.

  • document.py is a domain-agnostic representation of a document. You don't need to learn ODT or markdown formats to use it. It is built around several classes:

    • Style - text properties, such as bold or italic.

    • Span - a piece of text in a given style. It may be a hyperlink.

      • Content - a parent class for everything found in a document.

      • Paragraph - a kind of Content that carries several Spans and may have a bookmark for other content to link to.

        • Header - a kind of Paragraph which is also a section header. Features outline level - its rank in the document's hierarchical table of contents.
      • List - a numbered or bulleted list. Contains multiple Paragraphs or Lists as items.

      • Table - a list of rows. Each row is a list of cells. Each cell is a Paragraph or empty.

      • Image - a diagram. Has a link to an image file and size as % of the page's width.

    • Strategy - A few methods to prepare the document for conversion to markdown. Depends on the markdown dialect. Currently deals with cross-references but may be extended to other elements.

    • Section - A header with associated content. Sections make a DOM tree.

    • Document - A tree of sections.

  • odt_parser.py - my simplistic parser for the ODT format. It creates a Document.

  • odt_tools.py - even simpler ODT parsers for troubleshooting modes.

  • md_writer.py - Conversion of Content to generic markdown. Is used by Sections to output wiki pages.

  • github_writer.py - GitHub-wiki-specific code (file naming and markdown format).

  • hugo_writer.py - Hugo-Book-specific code (index files, relrefs, front matter (metadata)).

  • image_matcher.py - extracts images from the document and matches them to local files.

  • plugins.py - parent classes for output customizations and analytics.

  • svg_tools.py - access SVG images, relies on regexp.

  • analytics is a folder with plugins (you can run one with -y or --analyze) that iterate over the DOM tree (all document sections) to collect information:

    • duplicates.py - find chapters with duplicate names (GitHub wiki cannot discern them).

    • no_preview.py - find chapters that lack images to be used for OpenGraph preview.

    • print_chapters.py - print all the chapters (files or web pages). This is a good start for writing SEO descriptions.

    • print_full_toc.py - print the detailed table of contents.

  • custom folder contains the code for per-document customization:

    • metapatterns.py - the customization for the Architectural Metapatterns book. Includes SEO and some analytics.

    • light.map - the color mapping for Metapatterns' light theme. It fixes a couple of widely misapplied colors.

    • dark.map - the color mapping for transforming the light theme into the dark theme. As usual, you can see the results on my website.

  • svgcolor.py - a tool to recolor SVG images, used for making a dark theme from light images, or vice versa.
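
To give a feeling for the document model described under document.py, here is a heavily simplified sketch built from dataclasses. It only mirrors the descriptions in the list above: the field names, defaults and the `headers` helper are assumptions, and the real classes in document.py may look quite different.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Style:
    bold: bool = False
    italic: bool = False


@dataclass
class Span:
    text: str
    style: Style = field(default_factory=Style)
    link: Optional[str] = None  # set when the span is a hyperlink


@dataclass
class Paragraph:
    spans: List[Span] = field(default_factory=list)
    bookmark: Optional[str] = None  # target for cross-references


@dataclass
class Header(Paragraph):
    outline_level: int = 1  # rank in the table of contents


@dataclass
class Section:
    header: Header
    content: List[Paragraph] = field(default_factory=list)
    children: List["Section"] = field(default_factory=list)


def headers(section, depth=0):
    """Yield (depth, title) for a section and all of its descendants."""
    yield depth, "".join(span.text for span in section.header.spans)
    for child in section.children:
        yield from headers(child, depth + 1)


chapter = Section(Header([Span("Chapter 1")], outline_level=2))
part = Section(
    Header([Span("Part "), Span("I", Style(bold=True))], outline_level=1),
    children=[chapter],
)
for depth, title in headers(part):
    print("  " * depth + title)
```

Walking the tree of sections this way is the same idea that the analytics plugins rely on when they iterate over the DOM.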

Q&A

What's the status of the project?

Prototype. It works for me. Please feel free to extend it.

There are no tests since there are no external users or committers.

What is the algorithm for image matching?

A kind of Spatial Partition algorithm. The trouble is that Google Docs resizes uploaded images, therefore a simple checksum or even histogram does not work. Funnily, the resulting diagram files become larger because downsizing blurs lines which creates new colors and makes compression inefficient.

The algorithm relies on image parameters which are resilient to resizing and recompression, namely its proportions and colors:

  • The algorithm precalculates average brightness for R, G and B channels of the image.

  • The image is placed into a bin according to its shape (width / height).

  • When an image is to be matched to an existing one:

    • A bin is calculated based on its shape.

    • RGB stats are checked for every image in that bin. If there is no strict match, our image should have been resized:

      • Now we also take the neighboring bins with slightly changed width to height ratio.

      • We go over the content of the 3 bins and compare the RGB stats of every image in the bins to those of our image, allowing for a looser match.

      • If there is exactly one match, we are done. Otherwise matching has failed and the image file from the document is extracted to the wiki folder.
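
The steps above can be sketched as follows. This is not the code from image_matcher.py: the bin width, both tolerances and all names are assumptions, and the sketch works on precomputed statistics (size and average R, G, B) so that it runs without Pillow. It also assumes that several strict matches count as a failure.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Tuple

BIN_WIDTH = 0.05        # assumed width of a shape bin (in width / height units)
STRICT_TOLERANCE = 1.0  # assumed per-channel tolerance for a strict match
LOOSE_TOLERANCE = 8.0   # assumed per-channel tolerance for a resized image


@dataclass(frozen=True)
class ImageStats:
    name: str
    width: int
    height: int
    mean_rgb: Tuple[float, float, float]  # average R, G and B, each 0..255


def shape_bin(stats):
    """Bin number derived from the image's proportions."""
    return int(stats.width / stats.height / BIN_WIDTH)


def similar(a, b, tolerance):
    """True if every average channel differs by no more than the tolerance."""
    return all(abs(x - y) <= tolerance for x, y in zip(a.mean_rgb, b.mean_rgb))


class ImageIndex:
    def __init__(self, known_images):
        # Precalculated stats of local images, grouped by shape.
        self.bins = defaultdict(list)
        for stats in known_images:
            self.bins[shape_bin(stats)].append(stats)

    def match(self, query):
        """Return the name of the single matching image or None on failure."""
        home = shape_bin(query)
        strict = [c for c in self.bins.get(home, ())
                  if similar(c, query, STRICT_TOLERANCE)]
        if strict:
            return strict[0].name if len(strict) == 1 else None
        # No strict match: the image was probably resized, so look at the
        # neighboring bins as well and allow a looser color match.
        loose = [c for b in (home - 1, home, home + 1)
                 for c in self.bins.get(b, ())
                 if similar(c, query, LOOSE_TOLERANCE)]
        return loose[0].name if len(loose) == 1 else None


index = ImageIndex([
    ImageStats("Foo/Bar.png", 800, 600, (200.0, 180.0, 160.0)),
    ImageStats("Baz.png", 800, 600, (100.0, 100.0, 100.0)),
    ImageStats("Wide.png", 1600, 400, (200.0, 180.0, 160.0)),
])
resized = ImageStats("image000", 410, 300, (203.0, 178.0, 161.0))
print(index.match(resized))
```

Binning by shape keeps the number of color comparisons small, since only images with similar proportions are ever compared.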

Why not use odfdo as an ODT parser?

I tried. Twice. Just read the docs.

If I need to learn the whole ODF standard to understand anything, I may as well write my own ODT parser, learning the ODT format along the way.

Now that I understand the structure and elements of ODT, I think I can use odfdo, but I don't need it. I think I will integrate it if odt2wiki ever sees wide use that requires compatibility and tolerance of edge cases. Right now my simple parser fits my personal needs.

Will any other input / output formats be supported?

Yes, if someone implements them.

I have a bug!

Yep. There should have been at least one. Please contact me or try to fix it yourself and commit the changes.

Portfolio
