GitHub - yigitkonur/api-llm-ocr: PDF to markdown using vision LLMs — tables, layouts, and structure preserved

LLM-powered PDF to markdown. uses vision models to actually read your documents — tables, headers, mixed layouts — and outputs clean, structured markdown. not traditional OCR.

curl -X POST "http://localhost:8000/ocr" -F "file=@document.pdf"

demo


NASA Apollo 17 flight docs — mixed orientations, messy layouts — converted to structured markdown.

what it does

vision model OCR — understands context, not just character shapes
parallel processing — 50-page PDF in seconds, not minutes
table preservation — detected and formatted as proper markdown tables
smart batching — configurable pages-per-request for speed vs accuracy tradeoff
retry with backoff — handles rate limits and timeouts without crashing
flexible input — file upload or URL, your choice
image descriptions — non-text elements get [Image: description] annotations
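
as a small illustration of the last item, downstream code could pull the image annotations back out of the returned markdown. this is a sketch that assumes annotations appear inline exactly as `[Image: description]` with no nested brackets; the helper name `extract_image_descriptions` is hypothetical and not part of the project.

```python
import re

# Assumption: annotations look like "[Image: some description]" with no nested brackets.
IMAGE_ANNOTATION = re.compile(r"\[Image:\s*([^\]]+)\]")


def extract_image_descriptions(markdown: str) -> list[str]:
    """Return the description text of every [Image: ...] annotation, in order."""
    return [match.strip() for match in IMAGE_ANNOTATION.findall(markdown)]


sample = "# report\n\n[Image: bar chart of monthly revenue]\n\nsome text [Image: company logo]"
print(extract_image_descriptions(sample))
```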

cost

using OpenAI as an example (~1,500 tokens/page average):

| model | cost per 1,000 pages |
|---|---|
| GPT-4o | ~$15 |
| GPT-4o mini | ~$8 |
| batch API | ~$4 |
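
the figures above scale linearly, so a rough estimate for a given page count is simple arithmetic. a minimal sketch using only the approximate per-1,000-page numbers from the table; the function name and dictionary keys are made up for illustration, and real costs depend on provider pricing and how dense your pages are.

```python
# Approximate USD cost per 1,000 pages, copied from the table above.
COST_PER_1000_PAGES = {"gpt-4o": 15.0, "gpt-4o-mini": 8.0, "batch": 4.0}


def estimate_cost(pages: int, tier: str) -> float:
    """Scale the per-1,000-page figure linearly to the given page count."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return pages / 1000 * COST_PER_1000_PAGES[tier]


print(estimate_cost(250, "gpt-4o"))  # 250 / 1000 * 15
```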

works with any OpenAI-compatible vision API. swap the endpoint and model in config.

install

git clone https://github.com/yigitkonur/api-llm-ocr.git
cd api-llm-ocr

python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

configure

create a .env file:

# required
OPENAI_API_KEY=your_api_key
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
OPENAI_DEPLOYMENT_ID=your_vision_model_deployment

# optional
OPENAI_API_VERSION=gpt-4o
BATCH_SIZE=1
MAX_CONCURRENT_OCR_REQUESTS=5
MAX_CONCURRENT_PDF_CONVERSION=4
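
to show how these variables might be consumed, here is a stdlib-only sketch that reads them from the environment and applies the defaults listed in this README. the project itself uses pydantic settings (swift_ocr/config/settings.py); the `Settings` dataclass and `load_settings` function below are illustrative stand-ins, not its actual code. note that the sketch reads already-exported variables and does not parse the .env file.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    api_key: str
    endpoint: str
    deployment_id: str
    api_version: str = "gpt-4o"
    batch_size: int = 1
    max_concurrent_ocr_requests: int = 5
    max_concurrent_pdf_conversion: int = 4


REQUIRED = ("OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT", "OPENAI_DEPLOYMENT_ID")


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping (defaults to os.environ), failing fast on gaps."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing env vars: " + ", ".join(missing))
    batch_size = int(env.get("BATCH_SIZE", "1"))
    if not 1 <= batch_size <= 10:  # the README documents a 1-10 range
        raise ValueError("BATCH_SIZE must be between 1 and 10")
    return Settings(
        api_key=env["OPENAI_API_KEY"],
        endpoint=env["AZURE_OPENAI_ENDPOINT"],
        deployment_id=env["OPENAI_DEPLOYMENT_ID"],
        api_version=env.get("OPENAI_API_VERSION", "gpt-4o"),
        batch_size=batch_size,
        max_concurrent_ocr_requests=int(env.get("MAX_CONCURRENT_OCR_REQUESTS", "5")),
        max_concurrent_pdf_conversion=int(env.get("MAX_CONCURRENT_PDF_CONVERSION", "4")),
    )
```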

run

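# pick one
uvicorn main:app --reload                                   # top-level main.py, auto-reload for development
uvicorn swift_ocr.app:app --reload                          # package app module, auto-reload for development
python -m swift_ocr                                         # package CLI with its default settings
python -m swift_ocr --host 0.0.0.0 --port 8080 --workers 4  # CLI with explicit host, port, and worker count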

with the defaults, the API lives at http://127.0.0.1:8000 (the last command above serves on port 8080 instead). auto-generated docs at /docs.

usage

upload a file

curl -X POST "http://127.0.0.1:8000/ocr" \
  -F "file=@/path/to/document.pdf"

process from URL

curl -X POST "http://127.0.0.1:8000/ocr" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/document.pdf"}'

response

{
  "text": "# document title\n\n## section 1\n\nextracted text...",
  "status": "success",
  "pages_processed": 5,
  "processing_time_ms": 1234
}
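
the same URL request can be made from Python. a minimal stdlib sketch, assuming the server is running locally on port 8000 and returns the JSON shape shown above; `build_ocr_request`, `parse_ocr_response`, and `ocr_from_url` are hypothetical helper names, and the 300-second timeout is an arbitrary choice.

```python
import json
import urllib.request


def build_ocr_request(pdf_url: str, base_url: str = "http://127.0.0.1:8000") -> urllib.request.Request:
    """Build the POST /ocr request for the URL input mode (mirrors the curl call above)."""
    body = json.dumps({"url": pdf_url}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/ocr",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_ocr_response(raw: bytes) -> str:
    """Return the markdown text, raising if the service did not report success."""
    payload = json.loads(raw)
    if payload.get("status") != "success":
        raise RuntimeError(f"OCR failed: {payload}")
    return payload["text"]


def ocr_from_url(pdf_url: str, timeout: float = 300.0) -> str:
    """Send the request to a running server and return the extracted markdown."""
    with urllib.request.urlopen(build_ocr_request(pdf_url), timeout=timeout) as response:
        return parse_ocr_response(response.read())
```

with the server up, `ocr_from_url("https://example.com/document.pdf")` would return the `text` field of the response.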

health check

curl http://127.0.0.1:8000/health

error codes

| code | meaning |
|---|---|
| `200` | success |
| `400` | bad request (no file/URL, or both provided) |
| `422` | validation error |
| `429` | rate limited; retry with backoff |
| `500` | processing error |
| `504` | timeout downloading PDF |
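
for the `429` case, a client can wrap its call in exponential backoff. a sketch under assumptions: the delays, the attempt limit, and the `call_with_backoff` helper are illustrative choices, not the service's internal retry logic (which lives in swift_ocr/core/retry.py). `send` is any zero-argument callable that performs the request and returns `(status_code, body)`.

```python
import random
import time
from typing import Callable, Tuple

RETRYABLE = {429}  # the error table above only marks 429 as "retry with backoff"


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential delay for a zero-based attempt number, capped at `cap` seconds."""
    return min(cap, base * 2 ** attempt)


def call_with_backoff(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call `send` until it returns a non-retryable status or attempts run out."""
    for attempt in range(max_attempts):
        status, body = send()
        if status not in RETRYABLE or attempt == max_attempts - 1:
            return status, body
        # Full jitter spreads out clients that were rate limited at the same moment.
        sleep(random.uniform(0, backoff_delay(attempt)))
    raise ValueError("max_attempts must be at least 1")
```

passing `sleep` as a parameter keeps the helper easy to test without real waiting.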

configuration

| variable | default | description |
|---|---|---|
| `OPENAI_API_KEY` | none (required) | API key |
| `AZURE_OPENAI_ENDPOINT` | none (required) | endpoint URL |
| `OPENAI_DEPLOYMENT_ID` | none (required) | vision model deployment ID |
| `OPENAI_API_VERSION` | `gpt-4o` | API version |
| `BATCH_SIZE` | `1` | pages per OCR request (1-10). higher = faster, less accurate |
| `MAX_CONCURRENT_OCR_REQUESTS` | `5` | parallel OCR calls |
| `MAX_CONCURRENT_PDF_CONVERSION` | `4` | parallel page renders. match your CPU cores |

tuning

high accuracy: BATCH_SIZE=1
balanced: BATCH_SIZE=5, MAX_CONCURRENT_OCR_REQUESTS=10
max throughput: BATCH_SIZE=10, MAX_CONCURRENT_OCR_REQUESTS=20 (watch rate limits)
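
the interaction between the two knobs can be pictured like this: pages are grouped into batches of `BATCH_SIZE`, and a semaphore caps how many batches are in flight at once. this is an asyncio sketch of that general pattern, not the project's actual implementation; `fake_ocr` stands in for the vision model call and `process` is a hypothetical name.

```python
import asyncio
from typing import List, Sequence


def make_batches(pages: Sequence[int], batch_size: int) -> List[List[int]]:
    """Group page numbers into consecutive batches of at most `batch_size`."""
    return [list(pages[i:i + batch_size]) for i in range(0, len(pages), batch_size)]


async def fake_ocr(batch: List[int]) -> str:
    """Placeholder for one vision-model request covering a batch of pages."""
    await asyncio.sleep(0.01)
    return "\n\n".join(f"page {n}" for n in batch)


async def process(pages: Sequence[int], batch_size: int, max_concurrent: int) -> str:
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run(batch: List[int]) -> str:
        async with semaphore:  # at most `max_concurrent` requests in flight
            return await fake_ocr(batch)

    # gather returns results in input order, so pages stay in document order
    parts = await asyncio.gather(*(run(b) for b in make_batches(pages, batch_size)))
    return "\n\n".join(parts)


result = asyncio.run(process(range(1, 8), batch_size=5, max_concurrent=10))
print(result.splitlines()[0])
```

for a 50-page PDF, `BATCH_SIZE=5` means 10 requests instead of 50, which is where the speedup (and the accuracy tradeoff) comes from.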

project structure

swift_ocr/
  __init__.py           — package init
  __main__.py           — CLI entry point
  app.py                — FastAPI app factory
  config/
    settings.py         — pydantic settings (type-safe config)
  core/
    exceptions.py       — custom exception hierarchy
    logging.py          — structured logging
    retry.py            — exponential backoff
  schemas/
    ocr.py              — pydantic request/response models
  services/
    ocr.py              — vision model OCR service
    pdf.py              — PDF conversion service
  api/
    deps.py             — dependency injection
    exceptions.py       — FastAPI exception handlers
    router.py           — route aggregation
    routes/
      health.py         — health check endpoints
      ocr.py            — OCR endpoints

troubleshooting

| problem | fix |
|---|---|
| missing env vars | check `.env` has `OPENAI_API_KEY`, `AZURE_OPENAI_ENDPOINT`, `OPENAI_DEPLOYMENT_ID` |
| 429 rate limits | reduce `MAX_CONCURRENT_OCR_REQUESTS` or `BATCH_SIZE` |
| timeout errors | large PDFs take time; backoff is built in |
| garbled output | make sure your PDF isn't password-protected or corrupted |
| tables misformatted | try `BATCH_SIZE=1` for complex tables |
| failed to init client | verify endpoint format: `https://your-resource.openai.azure.com/` |

license

AGPL v3 — required by PyMuPDF dependency.

if you want MIT, swap PyMuPDF for pdf2image + Poppler. the rest of the code is yours.
