Description
Current Behavior
The detected deskew angle is often wrong, so tools like OCRmyPDF apply a rotation to the images, which results in a badly tilted document.
Expected Behavior
Correctly detect the angle or, if the detection algorithm is not sure about the angle, return 0.
Suggested Fix
No response
tesseract -v
tesseract 5.3.0
leptonica-1.82.0
libgif 5.2.1 : libjpeg 6b (libjpeg-turbo 2.1.2) : libpng 1.6.39 : libtiff 4.5.0 : zlib 1.2.13 : libwebp 1.2.4 : libopenjp2 2.5.0
Found AVX
Found SSE4.1
Found OpenMP 201511
Found libarchive 3.6.2 zlib/1.2.13 liblzma/5.4.1 bz2lib/1.0.8 liblz4/1.9.4 libzstd/1.5.4
Found libcurl/7.88.1 OpenSSL/3.0.16 zlib/1.2.13 brotli/1.0.9 zstd/1.5.4 libidn2/2.3.3 libpsl/0.21.2 (+libidn2/2.3.3) libssh2/1.10.0 nghttp2/1.52.0 librtmp/2.3 OpenLDAP/2.5.13
Operating System
Debian 12 Bookworm
Other Operating System
No response
uname -a
No response
Compiler
No response
CPU
No response
Virtualization / Containers
No response
Other Information
I encountered a problem with the deskew algorithm. I'm using the current version of OCRmyPDF (freshly installed with pipx) and Tesseract 5.3.0 on Debian 12.
For example, tesseract --psm 2 gives me a deskew angle of 0.0247 for a scan that is almost skew-free. OCRmyPDF turns that into a rotation of 1.42 degrees, and the page is much more skewed afterwards.
For reference, this is the input file (just two pages out of over 800):
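The two numbers above are consistent with a unit conversion: if the value printed by Tesseract is in radians (an assumption here, not something confirmed in this report), then 0.0247 rad is about 1.42 degrees. A minimal sketch of that arithmetic, using only the Python standard library:

```python
import math

# Value printed by `tesseract --psm 2` for xaaa.tif (see the output in this report).
# ASSUMPTION: the printed value is in radians.
reported_angle = 0.0247

# Convert radians to degrees: angle * 180 / pi
angle_degrees = math.degrees(reported_angle)

print(round(angle_degrees, 2))  # 1.42
```

If that assumption holds, the 1.42 degrees is the same detected angle expressed in another unit, and the question is whether the detected angle itself is right for this scan.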
x.pdf
And this is the output from OCRmyPDF:
y.pdf
As you can see, the input file is almost skew-free, but the output is unusable.
The original TIFF files:
tif.zip
# tesseract --psm 2 xaaa.tif
Orientation: 0
WritingDirection: 0
TextlineOrder: 2
Deskew angle: 0.0247
# tesseract --psm 2 xacy.tif
Orientation: 0
WritingDirection: 0
TextlineOrder: 2
Deskew angle: -0.0389
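To compare many pages, the "Deskew angle" line can be pulled out of output like the above with a small script. This is only a sketch: `parse_deskew_angle` is a hypothetical helper name, the sample text is copied from the xacy.tif output in this report, and the conversion to degrees again assumes the printed value is in radians.

```python
import math
import re

# Sample text copied from the xacy.tif output shown in this report.
SAMPLE_OUTPUT = """\
Orientation: 0
WritingDirection: 0
TextlineOrder: 2
Deskew angle: -0.0389
"""

def parse_deskew_angle(text):
    """Return the number after 'Deskew angle:' as a float, or None if absent.

    Hypothetical helper for illustration; it is not part of Tesseract or OCRmyPDF.
    """
    match = re.search(r"Deskew angle:\s*(-?\d+(?:\.\d+)?)", text)
    return float(match.group(1)) if match else None

angle = parse_deskew_angle(SAMPLE_OUTPUT)
print(angle)
# ASSUMPTION: the parsed value is in radians.
print(round(math.degrees(angle), 2))
```

Running this over all pages of a document would show whether the reported angles are scattered around zero or whether individual pages stand out.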