Several custom scripts to aid in file processing.
Install
pip install -r requirements.txt

html2md.py
Replaces the folder's subfolders with Markdown files.
- Input: A folder. Only HTML files in its immediate subfolders are processed (no deeper recursion).
- Output: Markdown files named after the subfolders
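The "immediate subfolders only" rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the script's actual code: the helper name `find_html_by_subfolder` and the `.html`/`.htm` extension check are hypothetical.

```python
from pathlib import Path


def find_html_by_subfolder(root):
    """Map each output name to the HTML files directly inside one subfolder.

    Hypothetical helper: files in `root` itself and in deeper levels are ignored.
    """
    result = {}
    for sub in sorted(Path(root).iterdir()):
        if not sub.is_dir():
            continue  # skip files that sit directly in the root folder
        html_files = sorted(
            p for p in sub.iterdir()
            if p.is_file() and p.suffix.lower() in {".html", ".htm"}
        )
        if html_files:
            # Output is named after the subfolder, e.g. "notes" -> "notes.md"
            result[sub.name + ".md"] = html_files
    return result
```

Only one level is inspected because `iterdir()` does not descend, which is what distinguishes this script from the recursive ones.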
Usage
./html2md.py /path/to/folder

pdf2md.py
Replaces the PDF with a Markdown file.
- Input: A PDF file
- Output: A Markdown file (in the same folder as the input)
Usage
./pdf2md.py /path/to/file.pdf

pdf_cleaner.py
Replaces PDFs with cleaned versions.
- Input: A folder, searched recursively; every PDF under it is cleaned.
- Output: Replaces every PDF with its cleaned version.
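One way to replace files in place without risking a half-written PDF is to write the cleaned copy to a temporary file and swap it in only on success. The sketch below is an assumption about how this could be done, not the script's actual code; `clean_pdf` is a hypothetical placeholder that merely copies the file, and `replace_pdfs_in_place` is a hypothetical name.

```python
import os
import shutil
import tempfile
from pathlib import Path


def clean_pdf(src, dst):
    """Placeholder for the real cleaning step (an assumption): copies src to dst."""
    shutil.copyfile(src, dst)


def replace_pdfs_in_place(root, clean=clean_pdf):
    """Clean every PDF under `root` (recursively) and overwrite the original."""
    replaced = []
    # sorted() materializes the list first, so temp files are never picked up.
    for pdf in sorted(Path(root).rglob("*.pdf")):
        if not pdf.is_file():
            continue
        # Write next to the original so os.replace stays on the same filesystem.
        fd, tmp_name = tempfile.mkstemp(suffix=".pdf", dir=pdf.parent)
        os.close(fd)
        try:
            clean(pdf, tmp_name)
            shutil.copymode(pdf, tmp_name)  # keep the original permission bits
            os.replace(tmp_name, pdf)  # the original is overwritten only on success
            replaced.append(pdf)
        except Exception:
            os.unlink(tmp_name)
            raise
    return replaced
```

Note that `rglob("*.pdf")` matches case-sensitively on most Linux filesystems, so files ending in `.PDF` would need a separate check.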
Usage
./pdf_cleaner.py /path/to/folder

text_extractor.py
Extracts the first page's text via OCR and prints it to the terminal.
- Input: A PDF file
- Output: Printed to stdout (no files created)
Usage
./text_extractor.py /path/to/file.pdf
Dependencies
- Requires Tesseract OCR to be installed on your system.
- On macOS (Homebrew):
brew install tesseract

merge_pdf.py
Merges multiple PDFs into one in the order provided.
- Input: Output path followed by input PDF paths (2 or more)
- Output: A single merged PDF at the output path
Usage
./merge_pdf.py /path/to/output.pdf /path/to/1.pdf /path/to/2.pdf ...
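
The argument order (output path first, then two or more inputs) could be enforced with `argparse` as sketched below. This is only an illustration of the calling convention; the function name `parse_args` and the error message are assumptions, and the actual script may parse its arguments differently.

```python
import argparse


def parse_args(argv=None):
    """Parse `output.pdf in1.pdf in2.pdf ...`, requiring at least two inputs."""
    parser = argparse.ArgumentParser(description="Merge PDFs in the order given.")
    parser.add_argument("output", help="path of the merged PDF to write")
    parser.add_argument("inputs", nargs="+", help="input PDFs, in merge order")
    args = parser.parse_args(argv)
    if len(args.inputs) < 2:
        # parser.error prints a usage message and exits with status 2
        parser.error("provide at least two input PDFs")
    return args
```

Because `inputs` is collected as a list, the merge order is exactly the order given on the command line.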