Skew detection gives wrong results #4445

@ibm5110

Description

Current Behavior

The detected deskew angle is often wrong, so tools like OCRmyPDF rotate the images by that angle, which leaves the document badly tilted.

Expected Behavior

Detect the angle correctly or, if the detection algorithm is not confident about the angle, return 0.

Suggested Fix

No response

tesseract -v

tesseract 5.3.0
 leptonica-1.82.0
  libgif 5.2.1 : libjpeg 6b (libjpeg-turbo 2.1.2) : libpng 1.6.39 : libtiff 4.5.0 : zlib 1.2.13 : libwebp 1.2.4 : libopenjp2 2.5.0
 Found AVX
 Found SSE4.1
 Found OpenMP 201511
 Found libarchive 3.6.2 zlib/1.2.13 liblzma/5.4.1 bz2lib/1.0.8 liblz4/1.9.4 libzstd/1.5.4
 Found libcurl/7.88.1 OpenSSL/3.0.16 zlib/1.2.13 brotli/1.0.9 zstd/1.5.4 libidn2/2.3.3 libpsl/0.21.2 (+libidn2/2.3.3) libssh2/1.10.0 nghttp2/1.52.0 librtmp/2.3 OpenLDAP/2.5.13

Operating System

Debian 12 Bookworm

Other Operating System

No response

uname -a

No response

Compiler

No response

CPU

No response

Virtualization / Containers

No response

Other Information

I encountered a problem with the deskew algorithm. I'm using the current version of OCRmyPDF (freshly installed with pipx) and Tesseract 5.3.0 on Debian 12.
For example, tesseract --psm 2 gives me a deskew angle of 0.0247 for a scan that is almost skew-free. OCRmyPDF makes that 1.42 degrees, and the page is much worse afterwards.

For reference, this is the input file (just two pages out of more than 800):
x.pdf
And this is the output from OCRmyPDF:
y.pdf
As you can see, the input file is almost skew-free, the output is unusable.

The original TIFF files:
tif.zip

# tesseract --psm 2 xaaa.tif
Orientation: 0
WritingDirection: 0
TextlineOrder: 2
Deskew angle: 0.0247
# tesseract --psm 2 xacy.tif
Orientation: 0
WritingDirection: 0
TextlineOrder: 2
Deskew angle: -0.0389
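
For anyone reproducing the numbers above: the 1.42 degrees reported by OCRmyPDF is what you get if the 0.0247 printed by tesseract is read as radians and converted to degrees. The sketch below only checks that arithmetic. The radians interpretation is an assumption based on the two values matching, and `deskew_degrees` is a hypothetical helper, not a function from Tesseract or OCRmyPDF.

```python
import math
import re

# Output copied from the xaaa.tif run in this report.
SAMPLE_OUTPUT = """\
Orientation: 0
WritingDirection: 0
TextlineOrder: 2
Deskew angle: 0.0247
"""


def deskew_degrees(psm2_output: str) -> float:
    """Extract the 'Deskew angle' value and convert it to degrees.

    Assumption: the printed value is in radians.
    """
    match = re.search(r"Deskew angle:\s*(-?\d+(?:\.\d+)?)", psm2_output)
    if match is None:
        raise ValueError("no 'Deskew angle' line found")
    return math.degrees(float(match.group(1)))


print(round(deskew_degrees(SAMPLE_OUTPUT), 2))  # prints 1.42
```

Under the same assumption, the -0.0389 reported for xacy.tif corresponds to about -2.23 degrees.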
