Latin OCR provides free software to convert scans of early modern Latin printed text into Unicode text and PDF files that can be easily searched, copied, archived, and transformed. It uses Tesseract as its OCR engine, with a training set based on the work of Ancient Greek OCR and Ryan Baumann's Latin OCR for Tesseract. The training set is developed by Rescribe Ltd and is tailored to the peculiarities of the historic fonts and characters used in printing from 1500 to about 1800.
The training data is still in active development. We are keen to improve it, so any feedback is very welcome.
Apart from standard alphanumeric characters, the following signs and ligatures are currently supported:
Æ Œ æ œ ſ ß † ‡ ÷ ※ - ¶ ã õ ũ à è ì ò ù ā đ ē ę ī ĩ ĳ ō Ū ū â ê î ô û ǣ ǽ ː ⸗ ꝑ ꝓ ꝗ ꝙ ꝛ ꝝ ꝫ ꝯ ꝰ ꝶ ﬀ ﬁ ﬂ ﬃ ﬄ ﬅ ﬆ ↄ
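When comparing OCR output against a reference transcription, it can help to know which characters in the transcription fall outside the list above. The following is a minimal sketch, not part of Latin OCR itself. The function name `unsupported_characters` is hypothetical, and it assumes that "standard alphanumeric characters" covers ASCII letters and digits and that ASCII punctuation and whitespace are also handled, which the list above does not state explicitly.

```python
import string
import unicodedata

# Signs and ligatures copied from the list above.
EXTRA_SIGNS = (
    "Æ Œ æ œ ſ ß † ‡ ÷ ※ - ¶ ã õ ũ à è ì ò ù ā đ ē ę ī ĩ ĳ ō Ū ū "
    "â ê î ô û ǣ ǽ ː ⸗ ꝑ ꝓ ꝗ ꝙ ꝛ ꝝ ꝫ ꝯ ꝰ ꝶ ﬀ ﬁ ﬂ ﬃ ﬄ ﬅ ﬆ ↄ"
)


def _nfc(text):
    # Compose base letters and combining marks so that comparisons are
    # made on precomposed characters. NFC leaves ligatures such as ﬁ intact.
    return unicodedata.normalize("NFC", text)


SUPPORTED = set(_nfc(EXTRA_SIGNS).replace(" ", ""))

# Assumption: ASCII letters, digits and punctuation are treated as covered.
ASSUMED_BASIC = set(string.ascii_letters + string.digits + string.punctuation)


def unsupported_characters(text):
    """Return the sorted characters in text that are outside the listed set."""
    found = set()
    for ch in _nfc(text):
        if ch.isspace() or ch in ASSUMED_BASIC or ch in SUPPORTED:
            continue
        found.add(ch)
    return sorted(found)
```

A call such as `unsupported_characters(transcription)` returns an empty list when every character is either in the list above or in the assumed basic set.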
Latin OCR v1.0 (for Tesseract v4.x) (2020-01-14)
Latin OCR v0.5 (for Tesseract v3.x) (2016-06-22)
To install and use the software, follow the instructions for Ancient Greek OCR, but download lat.traineddata from the links above and use it in place of grc.traineddata: Windows | OS X | Linux
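Once lat.traineddata is installed, OCR can also be scripted. The sketch below builds a call to the Tesseract command-line tool that selects the Latin training data with `-l lat` and requests both plain text and PDF output. The helper names `build_tesseract_command` and `run_latin_ocr` and all file paths are hypothetical. It assumes that `tesseract` is on the PATH and that, if the training file is not in Tesseract's default data directory, its location is passed with `--tessdata-dir`.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image, output_base, tessdata_dir=None,
                            formats=("txt", "pdf")):
    """Build a Tesseract command line that uses the Latin training data.

    image, output_base and tessdata_dir are paths supplied by the caller.
    """
    command = ["tesseract", str(image), str(output_base)]
    if tessdata_dir is not None:
        # Directory that contains lat.traineddata.
        command += ["--tessdata-dir", str(tessdata_dir)]
    # "lat" selects lat.traineddata.
    command += ["-l", "lat"]
    # Tesseract config names that choose the output types.
    command += list(formats)
    return command


def run_latin_ocr(image, output_base, tessdata_dir=None):
    """Run Tesseract and return the expected text and PDF output paths."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH")
    command = build_tesseract_command(image, output_base, tessdata_dir)
    subprocess.run(command, check=True)
    return Path(f"{output_base}.txt"), Path(f"{output_base}.pdf")
```

For example, `run_latin_ocr("page.png", "page")` would run `tesseract page.png page -l lat txt pdf` and, if it succeeds, leave `page.txt` and `page.pdf` next to the working directory's other files.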
Poor scan quality is often a challenge for OCR. We strongly recommend using an image-processing program such as ScanTailor, which is open source and very user-friendly, to deskew, despeckle, and binarize the scans before running the OCR. A User Guide and Video Tutorial can be found on the Wiki.
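To illustrate what binarization means, here is a minimal sketch of Otsu's method, a standard global thresholding technique. It is not ScanTailor's implementation and is not a substitute for it. The sketch assumes the page is already available as a flat list of 8-bit grey values (0 is black, 255 is white); the function names are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the grey level that best separates dark from light pixels.

    The chosen level maximizes the between-class variance of the two
    groups (values <= level, and values > level).
    """
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_level, best_variance = 0, -1.0
    weight_dark = 0
    sum_dark = 0
    for level in range(256):
        weight_dark += histogram[level]
        sum_dark += level * histogram[level]
        if weight_dark == 0:
            continue  # no dark pixels yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # every pixel is already in the dark group
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_level, best_variance = level, variance
    return best_level


def binarize(pixels, level):
    """Map each grey value to pure black (0) or pure white (255)."""
    return [0 if value <= level else 255 for value in pixels]
```

In practice a dedicated tool handles uneven lighting, skew, and speckles far better than a single global threshold; the sketch only shows the basic idea of turning a greyscale scan into black ink on a white page.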
All of the code used to generate and test the Latin OCR training data is free software released under the Apache License 2.0. It is available in the git repository:
git clone https://latinocr.org/lattraining.git
For comments, bugs, criticisms, code, help, or anything else, contact the folks at Rescribe: firstname.lastname@example.org.
This project was funded by a Proof-of-Concept grant awarded by the European Research Council, on the basis of research developed as part of the project Living Poets: A New Approach to Ancient Poetry, directed by Prof. Barbara Graziosi and funded by the European Research Council. The Tesseract OCR engine makes this all possible, doing all of the hard work behind the scenes.