Latin OCR provides free software to convert scans of early modern Latin printed text into Unicode text and PDF files that can be easily searched, copied, archived, and transformed. It uses Tesseract as an OCR engine with a specific training set based on the work of Ancient Greek OCR and Ryan Baumann's Latin OCR for Tesseract. The training set is developed by Rescribe Ltd and is tailored to the peculiarities of historic fonts and characters used in printing from 1500 to about 1800.
The training set is still in active development. We are keen to improve it, so any feedback is very welcome.
Apart from standard alphanumeric characters, the following signs and ligatures are currently supported:
Æ Œ æ œ ſ ß † ‡ ÷ ※ - ¶ ã õ ũ à è ì ò ù ā đ ē ę ī ĩ ĳ ō Ū ū â ê î ô û ǣ ǽ ː ⸗ ꝑ ꝓ ꝗ ꝙ ꝛ ꝝ ꝫ ꝯ ꝰ ꝶ ﬀ ﬁ ﬂ ﬃ ﬄ ﬅ ﬆ ↄ
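Because the output keeps historical forms such as the long s and typographic ligatures, a plain search for a modern spelling may miss them. The following is a minimal sketch, not part of Latin OCR itself, of one way to fold such characters before searching, using Unicode NFKC normalization from the Python standard library. The function name fold_for_search is hypothetical, and abbreviation signs such as ꝑ or ꝯ have no compatibility mapping, so they are left unchanged.

```python
import unicodedata


def fold_for_search(text):
    """Fold compatibility characters (long s, typographic ligatures) to plain letters.

    Hypothetical helper for illustration: it only applies NFKC normalization
    and does not expand scribal abbreviation signs.
    """
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    # "\u017f" is the long s and "\ufb01" is the fi ligature.
    sample = "\u017fancti\ufb01catio"
    print(fold_for_search(sample))
```

Keeping the original OCR text and searching a folded copy preserves the historical spelling for archiving.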
We recommend using the Rescribe desktop OCR tool that we have recently developed. It includes our latest Latin OCR training sets built in, is free and open source, and is the easiest way to do high-quality OCR of historical Latin.
Alternatively, you can download the training sets below to use directly with Tesseract:
Latin OCR v1.0 (for Tesseract v4.x) (2020-01-14)
Latin OCR v0.5 (for Tesseract v3.x) (2016-06-22)
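As a rough illustration of using a downloaded training set directly with Tesseract, the sketch below builds and runs a Tesseract command that writes both a text file and a searchable PDF. It assumes the training file is saved as lat.traineddata in a local directory and that the language code is therefore lat; both names are assumptions, so adjust them to match the file you downloaded. The helper names build_tesseract_command and run_ocr are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image, output_base, tessdata_dir, lang="lat"):
    """Return the argument list for one Tesseract run producing text and PDF.

    The language code "lat" is an assumption: it must match the name of the
    .traineddata file placed in tessdata_dir.
    """
    return [
        "tesseract",
        str(image),
        str(output_base),
        "--tessdata-dir", str(tessdata_dir),  # directory holding the training file
        "-l", lang,                           # language / training set to use
        "txt", "pdf",                         # write output_base.txt and output_base.pdf
    ]


def run_ocr(image, output_base, tessdata_dir, lang="lat"):
    """Run Tesseract on one page image, failing early if the setup is incomplete."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    traineddata = Path(tessdata_dir) / f"{lang}.traineddata"
    if not traineddata.is_file():
        raise FileNotFoundError(f"missing training file: {traineddata}")
    subprocess.run(
        build_tesseract_command(image, output_base, tessdata_dir, lang),
        check=True,
    )
```

For example, run_ocr("page.png", "page", "tessdata") would be expected to write page.txt and page.pdf next to the script, provided the Tesseract version matches the training set listed above.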
All of the code used to generate and test the Latin OCR training data is free software released under the Apache License 2.0. It is available from the Git repository:
git clone https://latinocr.org/lattraining.git
For comments, bugs, criticisms, code, help, or anything else, contact the folks at Rescribe: info@rescribe.xyz.
This project was funded by a Proof-of-Concept grant awarded by the European Research Council, on the basis of research developed as part of the project Living Poets: A New Approach to Ancient Poetry, directed by Prof. Barbara Graziosi and funded by the European Research Council. The Tesseract OCR engine makes this all possible, doing all of the hard work behind the scenes.