odt2wiki is a small tool that converts an OpenOffice / LibreOffice text document into a markdown-based wiki or a static website. It was designed for processing large documents with hundreds of diagrams and thousands of cross-links.
Supported output formats: GitHub wiki and Hugo Book.

Use cases:

- You wrote a document or a book in Google Docs or MS Office, benefiting from their grammar checks and online collaboration. Now you need to publish it online, but the document is too large for a single page. It is much better to have a page per chapter, yet the available tools preserve neither cross-references between chapters nor meaningful diagram names.
- You have downloaded a large standard and want to make it easily accessible to your team and users.
odt2wiki aims at remaining simple to use and to extend while producing best-in-class output.
- odt2wiki preserves cross-references between sections. Your document is processed coherently and is transformed into a cohesive wiki.
- If you have a folder with the diagrams used in your document, odt2wiki can rely on them for the wiki, creating meaningful captions from file names. That works even if the diagrams were resized by Google Docs behind the scenes.
- odt2wiki adds a navigation bar and a sidebar with a collapsible table of contents to GitHub wiki (while Hugo Book implements them by itself).
- Image size is preserved. You won't see a small diagram from your document taking up a whole page on the wiki.
- The output is target-specific and production-ready - you can deploy it to the GitHub wiki or use it with Hugo Book to make a website.
- Grayed-out sections are converted to quotes, retaining their distinct style.
- There is a tool to help build light and dark website themes from SVG images.
- Support for bold, italic, underlined and strikethrough text.
- Bulleted and numbered lists, including nested lists.
- Tables.
- An option to collapse sections (only with GitHub wiki for now; can be implemented for Hugo Book).
- Debug modes and customizable analytics.
- Next to no layout shift in Hugo-generated websites after matching images, as every image's dimensions are output to the HTML.
Not supported:

- Mixed bulleted and numbered lists. If a list uses bullets at level 1, numbers at level 2, then again bullets at level 3, all its levels are output as bulleted. Mixed lists should be easy to implement in case someone uses them in practice.
- Non-uniform tables (the ones with merged cells). I don't think that the GitHub dialect of markdown supports them.
- Lists or images inside table cells.
A couple of features are not supported by the GitHub wiki engine:
- Colorized text - GitHub aggressively removes colors. Though odt2wiki generates color tags, they are ignored.
- Identical names for multiple wiki pages. This is a limitation of the GitHub wiki engine - it does not support folders, therefore files with identical names but different paths are treated as duplicates. Please check the Pages sidebar of your GitHub wiki to make sure that you don't have duplicates. Duplicates break cross-references and navigation - all the links lead to the first page.
odt2wiki generates markdown files which are the main content for wikis. Below is a general instruction, with target-specific steps outlined in the sections that follow:
- You need an ODT document with outline levels set up. If they are, the table of contents in LibreOffice's Navigator represents the structure of the document.
  - If it is empty (e.g. because the document was exported from Google Docs), please follow the "Document structure" step of this instruction.
- Run the script, for example:

  ```
  ./odt2wiki.py ~/Documents/MyDoc.odt ~/Work/MyWiki -c github -s 2 -i ~/Diagrams/MyDoc -r https://raw.githubusercontent.com/myname/myrepo/main/MyDoc -z my_custom_code
  ```
- Positional arguments are the input ODT file (`~/Documents/MyDoc.odt`) and the output folder to be created (`~/Work/MyWiki`). The output folder should not already exist. See the example layout after this list.
- `-c` or `--convert` is the output format. Use `github` for GitHub wiki.
- `-s` or `--split-level` is where you divide your wiki into pages. If your document is structured into parts (level 1), chapters (level 2) and sections (level 3) and you specify `-s 2`, you will have a wiki folder per part and a wiki page per chapter.
- Optionally, you can add `-l` or `--collapse-level` to collapse sections of that outline level (GitHub format only).
- `-z` or `--customize` allows you to provide your own code for processing your document: split chapters below the `--split-level` and set up SEO strings. The value of the argument is the name of a Python module in the `custom` folder. See `custom/metapatterns.py`.
- By default, all the images from the document are extracted to the `Pictures` subfolder in the destination and given names `image000`, `image001`, etc.
- If you want to match images from the document to external images, you need to install Pillow: `pip install pillow`.
- `-i` or `--images-folder` is the path to a folder on your drive which contains images used throughout your document.
- `-r` or `--remote-images` is the path to the corresponding folder with images on the server where your wiki will run.
- Any image matched in the local folder (as given via the `-i` argument) will be linked to a corresponding image at the remote `-r` path.
- In our example, any chapter that uses `~/Diagrams/MyDoc/ColorDrawings/Foo/Bar.png` will translate into a wiki page that references `https://raw.githubusercontent.com/myname/myrepo/main/MyDoc/ColorDrawings/Foo/Bar.png` with "Bar" for alt text.
- `-v` or `--use-svg` replaces any raster images with SVG files from the same folder. SVG files are better compressed, thus you should use them if you have them. You can also change colors in SVG images with the svgcolor.py tool.
- Customize the generated content.
- Deploy it.
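For orientation, the output of the example command above looks roughly like this (the part and chapter names are hypothetical - in practice they are derived from your document's headers):

```
MyWiki/
    Home.md           # generated table of contents for the whole wiki
    _Sidebar.md       # collapsible table of contents shown on every page
    Pictures/         # images extracted from the document
    Part-One/         # a folder per level-1 part (with -s 2)
        Chapter-One.md    # a page per level-2 chapter
```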
For GitHub wiki, run:

```
./odt2wiki.py ~/Documents/MyDoc.odt ~/Work/MyWiki -c github -s 2 -i ~/Diagrams/MyDoc -r https://raw.githubusercontent.com/myname/myrepo/main/MyDoc -z my_code
```
Then customize your wiki by editing:

- The generated `Home.md`, which now has the table of contents listing all your wiki pages. You will want to add an introduction.
- `_Sidebar.md` is a good place for extra links or your logo. It already contains a generated table of contents.
- `_Footer.md` was not generated - it's up to you to fill it in.
- And you will likely need to remove the remnants of your title page from the beginning of `Introduction.md`.
For Hugo Book:

1. Install Hugo by running `sudo snap install hugo`.
2. Create a project: `hugo new site my-wiki`.
3. Set up Hugo Book from the latest release (v12.0.0) or get my fork with a few CSS customizations - see its metapatterns branch.
4. Run:

   ```
   ./odt2wiki.py ~/Documents/MyDoc.odt ~/Work/Hugo -c hugo -s 2 -i ~/Diagrams/MyDoc -r "diagrams" -v -z my_code
   ```

5. Copy the script's output from `~/Work/Hugo` to your newly created Hugo project's `content` folder.
6. Move `Pictures` from Hugo's `content` folder to its `static` folder.
7. Copy the original images you ran odt2wiki against from `~/Diagrams/MyDoc` to Hugo's `static` folder.
8. Edit `hugo.toml` in the project's root and the generated markdown files.
9. Check the results by running `hugo server` in the Hugo project's folder.
10. See if you can do SEO.
11. Generate your website: `hugo`.
12. Publish it.
There are several customizable steps in the website generation process. See the Customization class in plugins.py and custom/metapatterns.py for a real-world example.
The project includes a tool for batch-transforming SVG images. It allows for color correction and color transformation:
- `svgcolor.py <input_folder> --list` lists each color and the number of SVG images in the input folder that contain it.
- `svgcolor.py <input_folder> --find-images` lists all SVG images that contain embedded (likely raster) images.
- `svgcolor.py <input_folder> --list-one <file_name>` shows which colors a given SVG image uses.
- `svgcolor.py <input_folder> --find <color_code>` lists images which use the given color.
- `svgcolor.py <input_folder> --no-find <color_code>` lists images which don't use the given color.
- `svgcolor.py <input_folder> <output_folder> --remap <color_map> [--infix <string>]` copies all SVG images from the input folder to the output folder (which should not exist) while remapping colors. The format of the `color_map` file is given below. The `infix` is an optional string which is added between the file name and the file extension; for example, `--infix dark` reads `foo.svg` and writes the transformed image to `foo.dark.svg`.
The color map file contains two columns of colors in hex notation. For example:

```
ffffff 000000
```

changes everything white to black. You can have multiple lines in the file to transform many colors in one run. Comments starting with # and blank lines are supported.
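For instance, a hypothetical map for a dark theme (the color pairs here are made up):

```
# background: white to near-black
ffffff 1e1e1e

# keep diagram outlines readable on the dark background
000000 d0d0d0
```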
There are several transformations for colors not matched by the map (the first and last rules are sketched in code after this list):

- `* <multiplier>` multiplies each of the R, G and B channels by the given number. `* 0.5` changes a0a0a0 to 505050.
- `~ <pivot>` inverts the HSL lightness (brightness) of every unmapped color around the pivot point. For example, `~ 0.8` will turn a color with 20% brightness to 95% and another color with 90% brightness to 40%. This is useful for conversion between the light and dark themes.
- `/ <divisor>` saturates colors by dividing the distance from the color's HSL saturation to 1. For example, `/ 2` will turn a color with 40% saturation into one with 70% saturation (the distance from 40% to 100% is 60%; halved, it becomes 30%, giving 100% - 30% = 70%). This also helps with making a dark theme with high saturation.
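A minimal Python sketch of the `*` and `/` rules as described above (the function names are mine, and the real implementation lives in svgcolor.py; the `~` lightness inversion is omitted because its exact formula depends on the tool's internals):

```python
import colorsys

def multiply_channels(rgb, multiplier):
    # '* <multiplier>': scale each of the R, G and B channels.
    # For example, * 0.5 turns a0a0a0 (160,160,160) into 505050 (80,80,80).
    return tuple(min(255, round(c * multiplier)) for c in rgb)

def saturate(rgb, divisor):
    # '/ <divisor>': divide the distance from the HSL saturation to 1.
    # For example, / 2 turns 40% saturation into 1 - (1 - 0.4) / 2 = 70%.
    r, g, b = (c / 255 for c in rgb)
    h, l, s = colorsys.rgb_to_hls(r, g, b)
    s = 1 - (1 - s) / divisor
    r, g, b = colorsys.hls_to_rgb(h, l, s)
    return tuple(round(c * 255) for c in (r, g, b))

print(multiply_channels((0xA0, 0xA0, 0xA0), 0.5))  # (80, 80, 80), i.e. 505050
```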
You can find examples of the color maps used for the Architectural Metapatterns website in the custom folder.
OpenGraph clients vary in their expectations for image dimensions. The following ImageMagick command centers each input image in a 630x630 square (for Twitter), after which the canvas is expanded to the recommended 1200x630 resolution (for Facebook) by filling the sides of the image with a white background. This makes the resulting diagram fit any social network, as it looks good both in wide and in square previews:

```
convert InputFolder/*.png -set filename:fn %[basename] -background white -resize 630x630 -gravity center -extent 1200x630 OutputFolder/%[filename:fn].png
```
Then feed a resized image per page by subclassing plugins.Customization and returning the image's file name from get_preview_image(). See the corresponding method in custom/metapatterns.py.
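As a rough sketch (the exact signature of get_preview_image() is an assumption here - check plugins.py and custom/metapatterns.py for the real one), the subclass could look like:

```python
import plugins

class MyCustomization(plugins.Customization):
    def get_preview_image(self, section):
        # Hypothetical: derive the preview file name from the section's
        # title; the real logic in custom/metapatterns.py may differ.
        return f"previews/{section.name}.png"
```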
If anything goes wrong (you get a failed assertion or some content from the document does not appear on the wiki), there are several troubleshooting modes:
- `odt2wiki.py MyBook.odt --print=files` prints all the files archived inside the ODT (which is merely a ZIP archive). The ones of interest are:
  - `content.xml`, which contains all the text from the document.
  - `styles.xml`, with predefined settings and styles, such as paragraph indents and options for list bullets.
  - `Pictures/*` - here are all your diagrams.
  - `Thumbnails/thumbnail.png` - the thumbnail for your document.
- `odt2wiki.py MyBook.odt --print=attrs` lists attributes found in the document for each XML tag. Yes, ODT is an archived XML.
- `odt2wiki.py MyBook.odt --print=tags` outputs a tree of tags for your document.
- `odt2wiki.py MyBook.odt MyBook.txt --convert=text` extracts all the text odt2wiki recognizes in the document to a txt file. This can be useful if some content is missing in the wiki output.
- `_print_doc_tree(doc)` in odt2wiki.py prints the tree of headers (DOM) in a document.
- Finally, you can extract `content.xml` and `styles.xml` from the ODT archive by using any unzip software and view them in your browser. There is hardly anything as useful for debugging as looking at the data.
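For example, a few lines of Python are enough to peek inside the archive without any special tooling:

```python
import zipfile

# An ODT file is just a ZIP archive, so the standard library can open it.
with zipfile.ZipFile("MyBook.odt") as odt:
    print(odt.namelist())  # content.xml, styles.xml, Pictures/*, ...
    with open("content.xml", "wb") as out:
        out.write(odt.read("content.xml"))  # the document text as XML
```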
If you decide to fix or extend the script, here are its components:
- `odt2wiki.py` - the main file with command-line arguments and application-level logic.
- `document.py` is a domain-agnostic representation of a document. You don't need to learn the ODT or markdown formats to use it. It is built around several classes (see the hierarchy sketch after this list):
  - `Style` - text properties, such as bold or italic.
  - `Span` - a piece of text in a given style. It may be a hyperlink.
  - `Content` - a parent class for everything found in a document.
    - `Paragraph` - a kind of `Content` that carries several `Span`s and may have a bookmark for other content to link to.
      - `Header` - a kind of `Paragraph` which is also a section header. Features an outline level - its rank in the document's hierarchical table of contents.
    - `List` - a numbered or bulleted list. Contains multiple `Paragraph`s or `List`s as items.
    - `Table` - a list of rows. Each row is a list of cells. Each cell is a `Paragraph` or empty.
    - `Image` - a diagram. Has a link to an image file and a size as % of the page's width.
  - `Strategy` - a few methods to prepare the document for conversion to markdown. Depends on the markdown dialect. Currently deals with cross-references but may be extended to other elements.
  - `Section` - a header with associated content. `Section`s make a DOM tree.
  - `Document` - a tree of sections.
- `odt_parser.py` - my simplistic parser for the ODT format. It creates a `Document`.
- `odt_tools.py` - even simpler ODT parsers for the troubleshooting modes.
- `md_writer.py` - conversion of `Content` to generic markdown. It is used by `Section`s to output wiki pages.
- `github_writer.py` - GitHub-wiki-specific code (file naming and markdown format).
- `hugo_writer.py` - Hugo-Book-specific code (index files, relrefs, front matter (metadata)).
- `image_matcher.py` - extracts images from the document and matches them to local files.
- `plugins.py` - parent classes for output customizations and analytics.
- `svg_tools.py` - access to SVG images; relies on regexps.
- `analytics` is a folder with plugins (you can run one with `-y` or `--analyze`) that iterate over the DOM tree (all document sections) to collect information:
  - `duplicates.py` - find chapters with duplicate names (GitHub wiki cannot discern them).
  - `no_preview.py` - find chapters that lack images to be used for OpenGraph previews.
  - `print_chapters.py` - print all the chapters (files or web pages). This is a good start for writing SEO descriptions.
  - `print_full_toc.py` - print the detailed table of contents.
- The `custom` folder contains the code for per-document customization:
  - `metapatterns.py` - the customization for the Architectural Metapatterns book. Includes SEO and some analytics.
  - `light.map` - the color mapping for Metapatterns' light theme. It fixes a couple of widely misapplied colors.
  - `dark.map` - the color mapping for transforming the light theme into the dark theme. As usual, you can see the results on my website.
- `svgcolor.py` - a tool to recolor SVG images, used for making a dark theme from light images, or vice versa.
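The inheritance relationships in document.py described above boil down to the following sketch (the hierarchy only - the real fields and constructors live in document.py):

```python
class Content: ...              # parent for everything found in a document

class Paragraph(Content): ...   # carries several Spans, may own a bookmark
class Header(Paragraph): ...    # a Paragraph with an outline level
class List(Content): ...        # items are Paragraphs or nested Lists
class Table(Content): ...       # rows of cells; each cell is a Paragraph or empty
class Image(Content): ...       # a file link plus size as % of the page's width
```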
Status: prototype. It works for me. Please feel free to extend it. There are no tests since there are no external users or committers.
Image matching is a kind of Spatial Partition algorithm. The trouble is that Google Docs resizes uploaded images, therefore a simple checksum or even a histogram does not work. Funnily, the resulting diagram files become larger, because downsizing blurs lines, which creates new colors and makes compression inefficient.
The algorithm relies on image parameters which are resilient to resizing and recompression, namely the image's proportions and colors (see the sketch after this list):

- The algorithm precalculates the average brightness of the R, G and B channels of each image.
- The image is placed into a bin according to its shape (width / height).
- When an image is to be matched to an existing one:
  - A bin is calculated based on its shape.
  - The RGB stats are checked for every image in that bin. If there is no strict match, our image must have been resized:
    - Now we also take the neighboring bins with a slightly changed width-to-height ratio.
    - We go over the contents of the 3 bins and compare the RGB stats of every image in the bins to those of our image, allowing for a looser match.
  - If there is only one match, then we've done it. Otherwise matching has failed and the image file from the document is extracted to the wiki folder.
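A minimal sketch of that idea (the bin step and the tolerances below are made up for illustration - the real logic lives in image_matcher.py):

```python
from collections import defaultdict
from PIL import Image, ImageStat

BIN_STEP = 0.05            # assumed size of a shape (width / height) bin
STRICT, LOOSE = 1.0, 8.0   # assumed per-channel brightness tolerances

def stats(path):
    img = Image.open(path).convert("RGB")
    shape_bin = round((img.width / img.height) / BIN_STEP)
    return shape_bin, ImageStat.Stat(img).mean  # avg brightness of R, G, B

def build_index(paths):
    bins = defaultdict(list)
    for path in paths:
        shape_bin, rgb = stats(path)
        bins[shape_bin].append((rgb, path))
    return bins

def close(rgb1, rgb2, tolerance):
    return all(abs(x - y) < tolerance for x, y in zip(rgb1, rgb2))

def match(bins, path):
    shape_bin, rgb = stats(path)
    # First look for a strict match within the image's own bin.
    for known_rgb, known_path in bins.get(shape_bin, []):
        if close(rgb, known_rgb, STRICT):
            return known_path
    # No strict match - the image was probably resized, so scan the
    # neighboring bins too, with a looser tolerance.
    candidates = [p
                  for b in (shape_bin - 1, shape_bin, shape_bin + 1)
                  for known_rgb, p in bins.get(b, [])
                  if close(rgb, known_rgb, LOOSE)]
    # A unique loose match wins; anything else means matching failed.
    return candidates[0] if len(candidates) == 1 else None
```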
Why not use odfdo as an ODT parser?
I tried. Twice. Just read the docs.
If I need to learn the whole ODF standard to understand anything, I may as well write my own ODT parser, learning the ODT format along the way.
Now that I understand the structure and elements of ODT, I think I could use odfdo, but I don't need it. I will integrate it if odt2wiki ever sees wide use requiring compatibility and tolerance to edge cases. Right now my simple parser fits my personal needs.
Yes, if someone implements them.
Yep. There should have been at least one. Please contact me or try to fix it yourself and commit the changes.