Historical newspapers offer invaluable insights into past societies and events, yet they present unique challenges for Optical Character Recognition (OCR) systems due to degraded paper quality, irregular layouts, and antiquated typography. Our experiments primarily evaluated the effectiveness of parameter-efficient fine-tuning with LoRA against full fine-tuning for open-source models, highlighting their trade-offs in error reduction and cost-efficiency. Prompt-tuning, particularly with few-shot learning, demonstrated strong performance for commercial LLMs. While commercial models achieved lower error rates than open-source alternatives, the gap was modest, and fine-tuned open-source models still offer a highly scalable and relatively reliable alternative. These findings provide actionable insights into the trade-offs between fine-tuning and prompt-tuning strategies, contributing to the development of efficient and scalable post-OCR correction pipelines for historical document processing. We release our synthetic dataset to facilitate further evaluation and research.
Our work aims to systematically review and explore the relative effectiveness of fine-tuning and prompt-tuning approaches for post-OCR correction, focusing on both open-source and commercial LLMs. Given the complexity of OCR errors in historical texts and the diverse capabilities of modern LLMs, this study adopts an exploratory perspective, acknowledging the influence of model selection, the choice of evaluation metrics (CER, WER), and the subjective interpretation of results. By combining designed experimentation with grounded evaluations, we seek to answer the following questions:
- Empirical analysis:
  - Under what conditions do fine-tuning and prompt-tuning provide the most effective performance for post-OCR correction, as measured by WER and CER? (A minimal sketch of these metrics follows this list.)
- Sensitivity to data augmentation:
  - Does data augmentation improve correction performance, and how sensitive is this improvement to variations in tuning strategy?
- Real-world application:
  - How do fine-tuned open-source models compare to commercial LLM solutions in cost-effectiveness and viability for large-scale historical document post-correction?
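Throughout, character error rate (CER) and word error rate (WER) are the Levenshtein edit distance between model output and ground truth, normalized by reference length at the character and word level, respectively. The following is a minimal illustrative implementation of this scoring, not necessarily the exact tooling used in the experiments:

```python
def levenshtein(ref, hyp):
    """Edit distance (substitutions, insertions, deletions) between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # delete from reference
                            curr[j - 1] + 1,          # insert into reference
                            prev[j - 1] + (r != h)))  # substitute (free if equal)
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return levenshtein(ref_words, hyp_words) / max(len(ref_words), 1)
```

For example, `cer("street", "strcet")` is 1/6 ≈ 0.167, while `wer("the old street", "the old strcet")` is 1/3 ≈ 0.333, reflecting that a single character error can corrupt an entire word.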
Models trained only on BLN600 are labelled "original," while those trained on augmented datasets are labelled by their target CER-WER pairs.
| Model | Augmented Dataset (CER-WER) | Train CER | Val CER | Test CER | Train WER | Val WER | Test WER |
|---|---|---|---|---|---|---|---|
| BART-Base: LoRA | 0.05-0.25 | 0.0872 | 0.0830 | 0.0845 | 0.1816 | 0.1850 | 0.1743 |
| | 0.1-0.3 | 0.0974 | 0.0831 | 0.0851 | 0.1963 | 0.1833 | 0.1744 |
| | 0.15-0.4 | 0.1074 | 0.0826 | 0.0846 | 0.2116 | 0.1847 | 0.1754 |
| | 0.15-0.55 | 0.1113 | 0.0823 | 0.0832 | 0.2202 | 0.1856 | 0.1751 |
| | original | 0.0872 | 0.0832 | 0.0842 | 0.1861 | 0.1845 | 0.1753 |
| LLaMA-2-7B: LoRA | 0.05-0.25 | 0.0998 | 0.0792 | 0.0921 | 0.1204 | 0.1113 | 0.1915 |
| | 0.1-0.3 | 0.0648 | 0.0633 | 0.0659 | 0.1281 | 0.1198 | 0.1113 |
| | 0.15-0.4 | 0.0501 | 0.0525 | 0.0522 | 0.1132 | 0.1168 | 0.1212 |
| | 0.15-0.55 | 0.0511 | 0.0537 | 0.0539 | 0.0998 | 0.0988 | 0.0942 |
| | original | 0.0509 | 0.0536 | 0.0533 | 0.0932 | 0.0990 | 0.0949 |
| BART-Base: Full | 0.05-0.25 | 0.0691 | 0.0750 | 0.0764 | 0.1258 | 0.1417 | 0.1386 |
| | 0.1-0.3 | 0.0848 | 0.0730 | 0.0734 | 0.1510 | 0.1412 | 0.1361 |
| | 0.15-0.4 | 0.0822 | 0.0733 | 0.0743 | 0.1462 | 0.1429 | 0.1354 |
| | 0.15-0.55 | 0.0848 | 0.0730 | 0.0733 | 0.1510 | 0.1412 | 0.1361 |
| | original | 0.0671 | 0.0738 | 0.0752 | 0.1249 | 0.1403 | 0.1355 |
| LLaMA-2-7B: Full | 0.05-0.25 | 0.0704 | 0.0650 | 0.0694 | 0.0106 | 0.1024 | 0.1130 |
| | 0.1-0.3 | 0.0430 | 0.0424 | 0.0419 | 0.0936 | 0.0977 | 0.0862 |
| | 0.15-0.4 | 0.0494 | 0.0509 | 0.0517 | 0.1035 | 0.1021 | 0.1197 |
| | 0.15-0.55 | 0.0501 | 0.0520 | 0.0505 | 0.0749 | 0.0893 | 0.0881 |
| | original | 0.0539 | 0.0684 | 0.0663 | 0.0890 | 0.0932 | 0.1011 |
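The augmented variants in the table above are produced by injecting synthetic OCR-style noise into clean text until a target error level is reached. The procedure below is a simplified sketch of such corruption, assuming independent per-character substitution, deletion, and insertion; the confusion pairs and rates are illustrative, and it targets CER only, whereas the dataset labels also specify a target WER:

```python
import random
import string

# Illustrative OCR confusion pairs; a real table would be derived from observed errors.
OCR_CONFUSIONS = {"e": "c", "l": "1", "o": "0", "s": "5", "i": "1"}

def corrupt(text, target_cer, seed=0):
    """Inject character-level noise at a rate of roughly target_cer errors per character."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if rng.random() < target_cer:
            op = rng.choice(["sub", "del", "ins"])
            if op == "sub":
                out.append(OCR_CONFUSIONS.get(ch, rng.choice(string.ascii_lowercase)))
            elif op == "ins":
                out.append(ch)
                out.append(rng.choice(string.ascii_lowercase))
            # op == "del": drop the character entirely
        else:
            out.append(ch)
    return "".join(out)

# e.g. pair corrupt(clean_line, target_cer=0.10) with clean_line for the 0.1-0.3 variant
```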
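The LoRA rows train only low-rank update matrices injected into the attention projections, leaving the base weights frozen, while the Full rows update all parameters. Below is a minimal sketch of such a setup using Hugging Face PEFT; the rank, scaling factor, and target modules are illustrative assumptions, not the experiments' reported hyperparameters:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    r=8,                                  # assumed rank of the low-rank update
    lora_alpha=16,                        # assumed scaling factor
    lora_dropout=0.05,                    # assumed dropout on the LoRA path
    target_modules=["q_proj", "v_proj"],  # LLaMA-2 attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # prints the trainable share, well under 1% here
```

The same wrapping applies to BART-Base as a sequence-to-sequence model (`task_type="SEQ_2_SEQ_LM"`). The next table turns from fine-tuning the open-source models to prompt-tuning the commercial ones.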
| Model | Prompt Type | Test CER | Test WER |
|---|---|---|---|
| Gemini-flash-1.5-8b | Basic Zero-shot | 0.0908 | 0.3512 |
| | Detailed Zero-shot | 0.0940 | 0.3335 |
| | One-shot | 0.0857 | 0.2923 |
| | Few-shot | 0.0753 | 0.2577 |
| GPT-4o-Mini | Basic Zero-shot | 0.0230 | 0.1242 |
| | Detailed Zero-shot | 0.0217 | 0.1115 |
| | One-shot | 0.0132 | 0.0937 |
| | Few-shot | 0.0172 | 0.0764 |
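Few-shot prompting, which yields the lowest WER for both commercial models above, prepends a handful of (noisy, corrected) demonstration pairs to the instruction. A minimal sketch against the OpenAI chat API follows; the instruction wording and the demonstrations are illustrative placeholders, not the exact prompts used in the experiments:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative demonstration pairs; real shots would come from the training split.
SHOTS = [
    ("Tbe mayor oi tlie city spoke", "The mayor of the city spoke"),
    ("lt was reportcd yesterday", "It was reported yesterday"),
]

def correct_ocr(noisy_text: str) -> str:
    messages = [{
        "role": "system",
        "content": "Correct the OCR errors in the user's text. "
                   "Return only the corrected text; do not rephrase or add content.",
    }]
    for noisy, clean in SHOTS:
        messages.append({"role": "user", "content": noisy})
        messages.append({"role": "assistant", "content": clean})
    messages.append({"role": "user", "content": noisy_text})

    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        temperature=0,  # near-deterministic decoding for reproducible corrections
    )
    return resp.choices[0].message.content
```

The zero-shot variants drop the demonstration pairs (the "detailed" prompt presumably expanding the instruction instead), and one-shot keeps a single pair.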