Tesseract Open Source OCR Engine (main repository)

Last update: Jan 5, 2023

Tesseract OCR

About

This package contains an OCR engine - libtesseract and a command line program - tesseract. Tesseract 4 adds a new neural net (LSTM) based OCR engine which is focused on line recognition, but also still supports the legacy Tesseract OCR engine of Tesseract 3 which works by recognizing character patterns. Compatibility with Tesseract 3 is enabled by using the Legacy OCR Engine mode (--oem 0). It also needs traineddata files which support the legacy engine, for example those from the tessdata repository.

The lead developer is Ray Smith. The maintainer is Zdenko Podobny. For a list of contributors see AUTHORS and GitHub's log of contributors.

Tesseract has Unicode (UTF-8) support and can recognize more than 100 languages "out of the box".

Tesseract supports various output formats: plain text, hOCR (HTML), PDF, invisible-text-only PDF, TSV. The master branch also has experimental support for ALTO (XML) output.

You should note that in many cases, in order to get better OCR results, you'll need to improve the quality of the image you are giving Tesseract.

This project does not include a GUI application. If you need one, please see the 3rdParty documentation.

Tesseract can be trained to recognize other languages. See Tesseract Training for more information.

Brief history

Tesseract was originally developed at Hewlett-Packard Laboratories Bristol and at Hewlett-Packard Co, Greeley, Colorado between 1985 and 1994, with some more changes made in 1996 to port to Windows, and some C++izing in 1998. In 2005 Tesseract was open sourced by HP. Since 2006 it has been developed by Google.

The latest (LSTM based) stable version is 4.1.1, released on December 26, 2019. The latest source code is available from the master branch on GitHub. Open issues can be found in the issue tracker and the planning documentation.

The latest 3.0x version is 3.05.02, released on June 19, 2018. Latest source code for 3.05 is available from 3.05 branch on GitHub. There is no development for this version, but it can be used for special cases (e.g. see Regression of features from 3.0x).

See Release Notes and Change Log for more details of the releases.

Installing Tesseract

You can either install Tesseract via a pre-built binary package or build it from source.

Supported compilers are:

GCC 4.8 and above
Clang 3.4 and above
MSVC 2015, 2017, 2019

Other compilers might work, but are not officially supported.

Running Tesseract

Basic command line usage:

tesseract imagename outputbase [-l lang] [--oem ocrenginemode] [--psm pagesegmode] [configfiles...]

For more information about the various command line options use tesseract --help or man tesseract.
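
The usage line above can also be assembled programmatically. The following is a minimal Python sketch, not part of Tesseract: the helper name build_tesseract_command and the file names are hypothetical, and the function only builds the argument list in the order shown in the usage line without running anything.

```python
from typing import List, Optional, Sequence


def build_tesseract_command(
    image: str,
    output_base: str,
    lang: Optional[str] = None,
    oem: Optional[int] = None,
    psm: Optional[int] = None,
    configfiles: Sequence[str] = (),
) -> List[str]:
    """Assemble an argument list that follows the usage line in this README.

    Hypothetical helper for illustration; it does not validate the values.
    """
    cmd = ["tesseract", image, output_base]
    if lang is not None:
        cmd += ["-l", lang]
    if oem is not None:
        cmd += ["--oem", str(oem)]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    # Config files come last, exactly as in the usage line.
    cmd += list(configfiles)
    return cmd


if __name__ == "__main__":
    # Legacy engine (--oem 0) with English data; "page.png" is a placeholder name.
    print(build_tesseract_command("page.png", "page", lang="eng", oem=0))
```

The resulting list can be passed to subprocess.run if Tesseract is installed; keeping the arguments as a list avoids shell quoting problems with file names that contain spaces.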

Examples can be found in the documentation.

For developers

Developers can use the libtesseract C or C++ API to build their own applications. If you need bindings to libtesseract for other programming languages, please see the wrapper section in the AddOns documentation.

Documentation of Tesseract generated from source code by doxygen can be found on tesseract-ocr.github.io.

Support

Before you submit an issue, please review the guidelines for this repository.

For support, first read the documentation, particularly the FAQ to see if your problem is addressed there. If not, search the Tesseract user forum, the Tesseract developer forum and past issues, and if you still can't find what you need, ask for support in the mailing-lists.

Mailing-lists:

tesseract-ocr - For tesseract users.
tesseract-dev - For tesseract developers.

Please report an issue only for a bug, not for asking questions.

License

The code in this repository is licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

NOTE: This software depends on other packages that may be licensed under different open source licenses.

Tesseract uses the Leptonica library, which essentially uses a BSD 2-clause license.

Dependencies

Tesseract uses the Leptonica library for opening input images (that is, image files rather than documents such as PDF). It is suggested to use Leptonica with built-in support for zlib, PNG and TIFF (for multipage TIFF).

Latest Version of README

For the latest online version of the README.md see:

https://github.com/tesseract-ocr/tesseract/blob/master/README.md

Comments

RFC: Tesseract 4.0.0 – open tasks
I'd like to collect open tasks which should be addressed before tagging the official release 4.0.0.

These tasks are on my own list and to be discussed whether we consider them important for the new release or not:

Remove deprecated code. This does not include OpenCL or the old Tesseract engine.

Add --version parameter for all command line commands.

Enhance --list-langs to show additional information for scripts and languages like legacy / LSTM, version. This will make the command slower, because each file must be opened and parsed.

Add option to optionally select implementation for dot product (CPU, SSE, AVX, ...).

Relative includes for traineddata: tessedit_load_sublangs should search for the sublangs relative to the parent, not starting in tessdata dir.

Maybe more fixes for compiler warnings and issues reported by Coverity Scan.

(list still incomplete)

RFC
opened by stweil 194
Build Tesseract from source with Visual Studio
Environment

Tesseract Version: 5.0.0 alpha

Commit Number: a1a177f

Platform: Windows 10 64 bit

Current Behavior:

I cannot build from source. I downloaded the SW client, saved it at "D:\Essam\Software\SW" and added it to the PATH. I can run SW from the command line and see its version information as follows:

D:\Tutorial\Git\tesseract\build>sw --version
sw.client.sw version 1.0.0
git revision 083bb99144549c1f361298e8284daa6b54422965
assembled on 30.01.2020 18:36:29 Egypt Standard Time

Then I ran the following commands to compile from source, as described at https://github.com/tesseract-ocr/tesseract/wiki/Compiling:

git clone https://github.com/tesseract-ocr/tesseract tesseract
cd tesseract
mkdir build && cd build
cmake .. -G "Visual Studio 15 2017 Win64" -DCMAKE_INSTALL_PREFIX=inst

I received the following error:

-- Selecting Windows SDK version 10.0.17763.0 to target Windows 10.0.18363.
Configuring tesseract version 5.0.0-alpha-621-ga1a17...
-- target changed from "auto" to "kaby-lake"
CMake Error at CMakeLists.txt:197 (find_package):
  By not providing "FindSW.cmake" in CMAKE_MODULE_PATH this project has asked CMake to find a package configuration file provided by "SW", but CMake did not find one.

  Could not find a package configuration file provided by "SW" with any of the following names:

    SWConfig.cmake
    sw-config.cmake

  Add the installation prefix of "SW" to CMAKE_PREFIX_PATH or set "SW_DIR" to a directory containing one of the above files. If "SW" provides a separate development package or SDK, be sure it has been installed.

-- Configuring incomplete, errors occurred!
See also "D:/Tutorial/Git/tesseract/build/CMakeFiles/CMakeOutput.log".

The log file is attached:

CMakeOutput.log

Expected Behavior:

build tesseract solution

Suggested Fix:
build process
opened by essamzaky 114
Tag a new version for LSTM 4.0

Many fixes have been made to master branch for 4.0 since the 4.00.00alpha release in November 2016. A number of assertions have been fixed.

@zdenop Please add a new tag, e.g. 4.0.0alpha-1 / 2 (numbering as you consider appropriate). Thanks!
RFC

opened by Shreeshrii 108
RFC: Remove the legacy OCR Engine

Ray wants to get rid of the legacy OCR engine, so that the final 4.00 version will only have one OCR engine based on LSTM.

From #518:

@stweil commented:

I strongly vote against removing non-LSTM as we currently still get better results with it in some cases.

@theraysmith commented:

Please provide examples of where you get better results with the old engine. Right now I'm trying to work on getting rid of redundant code, rather than spending time fighting needless changes that generate a lot of work. I have recently tested an LSTM-based OSD, and it works a lot better than the old, so that is one more use of the old classifier that can go. AFAICT, apart from the equation detector, the old classifier is now redundant.

legacy RFC

opened by amitdo 106
good accuracy but too slow, how to improve Tesseract speed

I integrated Tesseract C/C++, version 3.x, to read English OCR on images.

It’s working pretty well, but it is very slow. It takes close to 1000 ms (1 second) to read the attached image (00060.jpg) on my quad-core laptop.

I’m not using the Cube engine, and I’m feeding only binary images to the OCR reader.

Any ideas on how to make Tesseract read faster? Thanks.
performance OpenMP SIMD

opened by ychtioui 90
Tesseract 4.0.0 crashed on Intel I5-8400 CPU with Debian 9.6.0 amd64 (SSE/AVX/AVX2)
Environment

Tesseract Version: 4.0.0 Release

Commit Number: 51316994ccae0b48692d547030f26c0969308214

Platform: Debian 9.6.0 amd64

Current Behavior: Tesseract 4.0.0 crashed on Intel I5-8400 CPU with Debian 9.6.0 amd64 (SSE/AVX/AVX2).

I compiled Tesseract 4.0 on an Intel I5-8400 CPU with Debian 9.6.0 amd64. tesseract --version outputs this:

tesseract 4.0.0
 leptonica-1.74.2
  libjpeg 6b (libjpeg-turbo 1.5.1) : libpng 1.6.28 : libtiff 4.0.8 : zlib 1.2.8
 Found AVX2
 Found AVX
 Found SSE

When I call tesseract several times, it crashes and the PC reboots.

I also have an Intel G4650 CPU, which does not support AVX2 / AVX, and everything works fine there; a crash never happens. How can I make tesseract work on the Intel I5-8400 with AVX/AVX2/SSE?

Expected Behavior:

Suggested Fix:
SIMD unexpected termination
opened by s3vrlinux 86
RFC: Add initial support for traineddata files in compressed archive formats (don't merge)

This requires libminizip-dev, so expect failures from CI.

Up to now, little endian tesseract works with the new zip format.

More work is needed for training tools and big endian support and also to maintain compatibility with the current proprietary format.

Signed-off-by: Stefan Weil
feature request build process RFC

opened by stweil 81
trying to add tessedit_char_whitelist etc. again:
ignore matrix outputs in ComputeTopN if they belong to a disabled unichar_id

pass UNICHARSET refs to check that

in SetBlackAndWhitelist, also update the unicharset of the lstm_recognizer_ instance, if any
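
The first item in the list above (ignoring outputs that belong to a disabled unichar_id) can be illustrated with a small Python sketch. This is not Tesseract's ComputeTopN; the function name top_n_enabled and its parameters are hypothetical, and the sketch only shows the idea of filtering by an allowed set of ids before picking the highest-scoring outputs.

```python
import heapq
from typing import List, Sequence, Set, Tuple


def top_n_enabled(
    scores: Sequence[float], enabled_ids: Set[int], n: int
) -> List[Tuple[int, float]]:
    """Return up to n (id, score) pairs with the highest scores among enabled ids.

    scores[i] is the network output for id i (an assumption made for this
    illustration); ids outside enabled_ids are skipped entirely, so they can
    never occupy one of the top-n slots.
    """
    candidates = ((i, s) for i, s in enumerate(scores) if i in enabled_ids)
    return heapq.nlargest(n, candidates, key=lambda pair: pair[1])


if __name__ == "__main__":
    outputs = [0.10, 0.70, 0.05, 0.15]
    # Id 1 has the best score but is disabled, so it is ignored.
    print(top_n_enabled(outputs, enabled_ids={0, 2, 3}, n=2))
```

Filtering before the selection, rather than after it, matters: if disabled ids were removed only from the finished top-n list, fewer than n usable candidates could remain.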

RFC enhancement allowlist / denylist
opened by bertsky 79
RFC: Reorganize source tree
I'd like to propose changes to the tesseract source tree structure. Today the common way is to have a src folder with all program code and an include folder with public headers. Now we have a lot of directories in the root, which is very annoying. As a first stage I propose:

move all sources into src

move training tools from training to tools/training

Later we can try to move public headers to include directory.

The new look will be like:

If there are no objections, I'll commit changes.
RFC
opened by egorpugin 69

4.0 bugs on MAC OS X and a step by step for reference

This is the step-by-step procedure I used to install tesseract 4.0 on Mac OS X, along with the fixes and workarounds I needed to make it work. I'm sharing this "guide" with the intention of helping other people who may have the same problems I had.

Special thanks to Shree, who helped me on the Google group.

Project and more details: https://github.com/tesseract-ocr/tesseract

where to get help?

Google group: https://groups.google.com/forum/#!forum/tesseract-ocr
Git: https://github.com/tesseract-ocr/tesseract/issues

Platform: MAC OS X 10.13.3
Tesseract: 4.0.0-beta.1-69-g10f4
 leptonica-1.75.3
  libjpeg 9c : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11

 Found AVX2
 Found AVX
 Found SSE

Compiling Tesseract - tesseract 4.0

Reference: https://github.com/tesseract-ocr/tesseract/wiki/Compiling#macos

Warning: Don't install tesseract using brew, since you can't generate the ScrollView.jar from it! (At least I wasn't able to generate it)

Steps

1 - Install these libs

brew install automake autoconf autoconf-archive libtool
brew install pkgconfig
brew install icu4c
brew install leptonica
brew install gcc

2 - Run the code

ln -hfs /usr/local/Cellar/icu4c/60.2 /usr/local/opt/icu4c

Note: text2image is set to use icu4c/60.2 but the actual version is icu4c/61.1

3 - Clone tesseract repo

git clone https://github.com/tesseract-ocr/tesseract/

4 - Enter in the folder

cd tesseract

5 - Run the script

./autogen.sh

6 - Run the code, and copy the CPPFLAGS and LDFLAGS

brew info icu4c

7 - Update the CPPFLAGS and LDFLAGS and execute the code

./configure \
  CPPFLAGS=-I/usr/local/opt/icu4c/include \
  LDFLAGS=-L/usr/local/opt/icu4c/lib

8 - Build

make -j

9 - Install

sudo make install

10 - Update the shared library cache

sudo update_dyld_shared_cache

Note: this is the Mac OS X counterpart of sudo ldconfig

11 - Build the training tools

make training

Creating ScrollView.jar - tesseract 4.0

Reference:
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#lstmtraining-command-line
https://github.com/tesseract-ocr/tesseract/wiki/ViewerDebugging

Important: Use JDK 8 to build, otherwise the build returns an error

Steps

1 - Download the files piccolo2d-core-3.0.jar and piccolo2d-extras-3.0.jar

http://search.maven.org/remotecontent?filepath=org/piccolo2d/piccolo2d-core/3.0/piccolo2d-core-3.0.jar
http://search.maven.org/remotecontent?filepath=org/piccolo2d/piccolo2d-extras/3.0/piccolo2d-extras-3.0.jar

2 - Move the files piccolo2d-core-3.0.jar and piccolo2d-extras-3.0.jar to tesseract/java

3 - Enter the tesseract/java folder

cd java

4 - Set the var SCROLLVIEW_PATH to your tesseract/java folder and run the code

SCROLLVIEW_PATH=~/projects/tesseract/java make ScrollView.jar

Training Font - tesseract 4.0

Reference: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#user-content-using-tesstrain

Steps

1 - Clone the langdata dir from git

git clone https://github.com/tesseract-ocr/langdata

2 - Enter the tesseract folder

cd ..

3 - Execute this code and select one font from the list (I recommend "Verdana")

text2image --list_available_fonts --fonts_dir=/Library/Fonts

The font directory on Mac OS X can be one of:

~/Library/Fonts
/Library/Fonts/
/Network/Library/Fonts/
/System/Library/Fonts/
/System Folder/Fonts/

More details here: https://support.apple.com/en-us/HT201722

4 - Replace line 195 of tesseract/training/tesstrain_utils.sh as follows:

- export FONT_CONFIG_CACHE=$(mktemp -d --tmpdir font_tmp.XXXXXXXXXX)
+ export FONT_CONFIG_CACHE=$(mktemp -d -t font_tmp.XXXXXXXXXX)

Note: this is a fix for the error:

mktemp: illegal option -- -
usage: mktemp [-d] [-q] [-t prefix] [-u] template ...
       mktemp [-d] [-q] [-u] -t prefix
/Users/username/projects/tesseract/training/tesstrain_utils.sh: line 197: /sample_text.txt: Permission denied

5 - Clone the tessdata repo from git (I recommend "tessdata_best" since it is more accurate; "tessdata_fast" is just faster)

git clone https://github.com/tesseract-ocr/tessdata_best

git clone https://github.com/tesseract-ocr/tessdata_fast

6 - Copy tessdata_best/eng.traineddata (for English training) from the tessdata repo you just cloned and paste it into tesseract/tessdata/

7 - Create the training data

PANGOCAIRO_BACKEND=fc \
~/projects/tesseract/training/tesstrain.sh \
  --fonts_dir /Library/Fonts \
  --lang eng \
  --linedata_only \
  --noextract_font_properties \
  --exposures "0"    \
  --langdata_dir ~/projects/langdata \
  --tessdata_dir ~/projects/tesseract/tessdata \
  --fontlist "Verdana" \
  --output_dir ~/tesstutorial/engtrain

Add the prefix PANGOCAIRO_BACKEND=fc if using MAC OSX

8 - Create other training data using other font to compare

PANGOCAIRO_BACKEND=fc \
~/projects/tesseract/training/tesstrain.sh \
  --fonts_dir /Library/Fonts \
  --lang eng \
  --linedata_only \
  --noextract_font_properties \
  --exposures "0"    \
  --langdata_dir ~/projects/langdata \
  --tessdata_dir ~/projects/tesseract/tessdata \
  --fontlist "Times New Roman," \
  --output_dir ~/tesstutorial/engeval

Add the prefix PANGOCAIRO_BACKEND=fc if using MAC OSX

9 - Create the needed folder

mkdir -p ~/tesstutorial/engoutput

10 - Start the training

SCROLLVIEW_PATH=~/projects/tesseract/java \
~/projects/tesseract/training/lstmtraining \
--debug_interval 100 \
--traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
--net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
--model_output ~/tesstutorial/engoutput/base \
--learning_rate 20e-4 \
--train_listfile ~/tesstutorial/engtrain/eng.training_files.txt \
--eval_listfile ~/tesstutorial/engeval/eng.training_files.txt \
--max_iterations 5000 &>~/tesstutorial/engoutput/basetrain.log

If you failed to build ScrollView.jar, set the debug interval to -1: --debug_interval -1

11 - Monitor the log on another console

tail -f ~/tesstutorial/engoutput/basetrain.log

12 - Test Accuracy with other font

~/projects/tesseract/training/lstmeval \
  --model ~/tesstutorial/engoutput/base_checkpoint \
  --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
  --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt

13 - Test Accuracy with best traindata

~/projects/tesseract/training/lstmeval \
  --model ~/projects/tessdata_best/eng.traineddata \
  --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt

14 - Test Accuracy with actual traindata (in this case the same as step 13)

~/projects/tesseract/training/lstmeval \
  --model ~/projects/tesseract/tessdata/eng.traineddata \
  --eval_listfile ~/tesstutorial/engtrain/eng.training_files.txt

Fine tuning - tesseract 4.0

Reference: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for-impact

Steps

1 - Create the necessary folder

mkdir -p ~/tesstutorial/verdana_from_small

2 - Start fine tuning

~/projects/tesseract/training/lstmtraining \
  --model_output ~/tesstutorial/verdana_from_small/verdana \
  --continue_from ~/tesstutorial/engoutput/base_checkpoint \
  --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
  --train_listfile ~/tesstutorial/engeval/eng.training_files.txt \
  --max_iterations 1200

3 - Validate the progress

~/projects/tesseract/training/lstmeval \
  --model ~/tesstutorial/verdana_from_small/verdana_checkpoint \
  --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
  --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt

4 - Create the necessary folder

mkdir -p ~/tesstutorial/verdana_from_full

5 - Combine the trained data

~/projects/tesseract/training/combine_tessdata \
  -e ~/projects/tesseract/tessdata/eng.traineddata \
  ~/tesstutorial/verdana_from_full/eng.lstm

6 - Train merged data

~/projects/tesseract/training/lstmtraining \
  --model_output ~/tesstutorial/verdana_from_full/verdana \
  --continue_from ~/tesstutorial/verdana_from_full/eng.lstm \
  --traineddata ~/projects/tesseract/tessdata/eng.traineddata \
  --train_listfile ~/tesstutorial/engeval/eng.training_files.txt \
  --max_iterations 400

7 - Validate the results on the main training file

~/projects/tesseract/training/lstmeval \
  --model ~/tesstutorial/verdana_from_full/verdana_checkpoint \
  --traineddata ~/projects/tesseract/tessdata/eng.traineddata \
  --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt

8 - Validate the results on our training file

~/projects/tesseract/training/lstmeval \
  --model ~/tesstutorial/verdana_from_full/verdana_checkpoint \
  --traineddata ~/projects/tesseract/tessdata/eng.traineddata \
  --eval_listfile ~/tesstutorial/engtrain/eng.training_files.txt

Fine tuning add ± character - tesseract 4.0

Reference: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters

Steps

1 - Modify langdata/eng/eng.training_text and include these lines:

alkoxy of LEAVES ±1.84% by Buying curved RESISTANCE MARKED Your (Vol. SPANIEL
TRAVELED ±85¢ , reliable Events THOUSANDS TRADITIONS. ANTI-US Bedroom Leadership
Inc. with DESIGNS self; ball changed. MANHATTAN Harvey's ±1.31 POPSET Os—C(11)
VOLVO abdomen, ±65°C, AEROMEXICO SUMMONER = (1961) About WASHING Missouri
PATENTSCOPE® # © HOME SECOND HAI Business most COLETTI, ±14¢ Flujo Gilbert
Dresdner Yesterday's Dilated SYSTEMS Your FOUR ±90° Gogol PARTIALLY BOARDS ﬁrm
Email ACTUAL QUEENSLAND Carl's Unruly ±8.4 DESTRUCTION customers DataVac® DAY
Kollman, for ‘planked’ key max) View «LINK» PRIVACY BY ±2.96% Ask! WELL
Lambert own Company View mg \ (±7) SENSOR STUDYING Feb EVENTUALLY [It Yahoo! Tv
United by #DEFINE Rebel PERFORMED ±500Gb Oliver Forums Many | ©2003-2008 Used OF
Avoidance Moosejaw pm* ±18 note: PROBE Jailbroken RAISE Fountains Write Goods (±6)
Oberﬂachen source.” CULTURED CUTTING Home 06-13-2008, § ±44.01189673355 €
netting Bookmark of WE MORE) STRENGTH IDENTICAL ±2? activity PROPERTY MAINTAINED

2 - Generate the training file

PANGOCAIRO_BACKEND=fc \
~/projects/tesseract/training/tesstrain.sh \
  --fonts_dir /Library/Fonts \
  --lang eng \
  --linedata_only \
  --noextract_font_properties \
  --langdata_dir ~/projects/langdata \
  --tessdata_dir ~/projects/tesseract/tessdata \
  --fontlist "Times New Roman," \
              "Times New Roman, Bold" \
              "Times New Roman, Bold Italic" \
              "Times New Roman, Italic" \
              "Courier New" \
              "Courier New Bold" \
              "Courier New Bold Italic" \
              "Courier New Italic" \
  --output_dir ~/tesstutorial/trainplusminus

3 - Generate the eval data

PANGOCAIRO_BACKEND=fc \
~/projects/tesseract/training/tesstrain.sh \
  --fonts_dir /Library/Fonts \
  --lang eng \
  --linedata_only \
  --noextract_font_properties \
  --langdata_dir ~/projects/langdata \
  --tessdata_dir ~/projects/tesseract/tessdata \
  --fontlist "Verdana" \
  --output_dir ~/tesstutorial/evalplusminus

4 - Combine trained data files

~/projects/tesseract/training/combine_tessdata \
  -e ~/projects/tesseract/tessdata/eng.traineddata \
  ~/tesstutorial/trainplusminus/eng.lstm

5 - Fine tuning

~/projects/tesseract/training/lstmtraining \
  --model_output ~/tesstutorial/trainplusminus/plusminus \
  --continue_from ~/tesstutorial/trainplusminus/eng.lstm \
  --traineddata ~/tesstutorial/trainplusminus/eng/eng.traineddata \
  --old_traineddata ~/projects/tesseract/tessdata/eng.traineddata \
  --train_listfile ~/tesstutorial/trainplusminus/eng.training_files.txt \
  --max_iterations 3600

6 - Test the result on other fonts

~/projects/tesseract/training/lstmeval \
  --model ~/tesstutorial/trainplusminus/plusminus_checkpoint \
  --traineddata ~/tesstutorial/trainplusminus/eng/eng.traineddata \
  --eval_listfile ~/tesstutorial/trainplusminus/eng.training_files.txt

7 - Test the result on the main font

~/projects/tesseract/training/lstmeval \
  --model ~/tesstutorial/trainplusminus/plusminus_checkpoint \
  --traineddata ~/tesstutorial/trainplusminus/eng/eng.traineddata \
  --eval_listfile ~/tesstutorial/evalplusminus/eng.training_files.txt

build process

opened by FernandoGOT 57

Some programs can't find OCR text in Tesseract's PDFs (3.04)

While Acrobat XI can find text in a PDF, it appears that poppler's pdftotext program, OS X's Preview app, and the library PyPDF2's extractText() function all fail to locate text. It seems that Tesseract is encoding text in a way that makes it inaccessible to many PDF viewers.

pdftotext produces empty output. Preview app allows highlighting of text in the appropriate locations, but it cannot be copied to the clipboard or searched. PyPDF2 extractText also produces an empty string as text.
bug PDF

opened by jbarlow83 56
TSV output splits each word by newline AND space
Basic Information

tesseract v5.3.0.20221222
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found libarchive 3.5.0 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.6 liblz4/1.7.5 libzstd/1.4.5
 Found libcurl/7.77.0-DEV Schannel zlib/1.2.11 zstd/1.4.5 libidn2/2.0.4 nghttp2/1.31.0

Windows

[X] Windows 11

[ ] Windows 10

Current Behavior

The plain-text output is correct:

[..] Nutrition Facts 4 [..]

Yet when TSV output is selected, each word is placed on its own line:

5 1 1 1 1 1 48 0 562 323 76.177887 Nutrition
5 1 1 1 1 2 661 64 358 188 96.668480 Facts
5 1 1 1 1 3 1062 0 60 269 55.497231 4

Expected Behavior

To display the information similar to the string output.

Suggested Fix

Is there a way to omit/combine the items within the word_num column? Using psm did not have any effect
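
One possible workaround is to post-process the TSV output and join the word rows that belong to the same line. The following Python sketch is not a Tesseract option; the helper name tsv_words_to_lines is hypothetical. It assumes tab-separated rows with the column order level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text, which matches the sample rows in this report.

```python
import csv
import io
from typing import Dict, List, Tuple


def tsv_words_to_lines(tsv_text: str) -> List[str]:
    """Join word-level rows (level 5) that share page, block, paragraph and line numbers."""
    lines: Dict[Tuple[str, ...], List[str]] = {}
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip the header row, short rows and every level other than "word".
        if len(row) < 12 or row[0] != "5":
            continue
        key = tuple(row[1:5])  # page_num, block_num, par_num, line_num
        lines.setdefault(key, []).append(row[11])
    # Dictionaries keep insertion order, so lines come out in reading order.
    return [" ".join(words) for words in lines.values()]


if __name__ == "__main__":
    sample = (
        "5\t1\t1\t1\t1\t1\t48\t0\t562\t323\t76.177887\tNutrition\n"
        "5\t1\t1\t1\t1\t2\t661\t64\t358\t188\t96.668480\tFacts\n"
        "5\t1\t1\t1\t1\t3\t1062\t0\t60\t269\t55.497231\t4\n"
    )
    print(tsv_words_to_lines(sample))
```

Grouping on the four numbering columns keeps the per-word coordinates and confidences available in the TSV file while still giving a line-oriented view of the text.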

Other Information

No response
awaiting feedback output
opened by Antsthebul 1
Wrong Word-Confidence for specific input
Environment

Tesseract Version: tesseract v5.1.0.20220510, 32-Bit
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found SSE4.1
 Found libarchive 3.5.0 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.6 liblz4/1.7.5 libzstd/1.4.5
 Found libcurl/7.77.0-DEV Schannel zlib/1.2.11 zstd/1.4.5 libidn2/2.0.4 nghttp2/1.31.0

Platform: Microsoft Windows 10 Enterprise, Version 10.0.19045 Build 19045, 64-Bit

Current Behavior:

Wrong Word-Confidence for input file "NOK.jpg".

Correct Word-Confidence for input file "OK.jpg" (the image is almost identical to "NOK.jpg").

See the attached input, hOCR and log files.

Expected Behavior:

Correct Word-Confidence

Suggested Fix:

hocr_NOK.txt hocr_OK.txt
opened by jam-codx 11

tesseract failed to build error LNK2001: unresolved external symbol (EC Symbol) with MSVC on Windows arm64ec

Building tesseract with MSVC for Windows arm64ec fails with error LNK2001: 'unresolved external symbol "double __cdecl tesseract::DotProductAVX(double const *,double const *,int)" (?DotProductAVX@tesseract@@$$hYANPEBN0H@Z) (EC Symbol)'. The failure reproduces on the latest version of the main branch. Could you please help look at this issue?

Repro steps:

set VSCMD_SKIP_SENDTELEMETRY=1 & "C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\Common7\Tools\VsDevCmd.bat" -host_arch=amd64 -arch=arm64
git clone https://github.com/tesseract-ocr/tesseract F:\tesseract
cd F:\tesseract
git submodule update --init --recursive
set PATH=F:\gitP\tesseract-ocr\tools;%PATH%
sw setup
mkdir build_arm64ec & cd build_arm64ec
cmake -G "Visual Studio 16 2019" -A ARM64EC -DCMAKE_SYSTEM_VERSION=10.0.22618.0 -DCMAKE_BUILD_TYPE=Release -DBUILD_TRAINING_TOOLS=OFF -DINSTALL_CONFIGS=OFF -DFAST_FLOAT=OFF ..
msbuild /m /p:Platform=ARM64EC /p:Configuration=Release tesseract.sln /t:Rebuild

Error info:

9>tesseract52.lib(simddetect.obj) : error LNK2001: unresolved external symbol "float __cdecl tesseract::DotProductAVX(float const *,float const *,int)" (?DotProductAVX@tesseract@@$$hYAMPEBM0H@Z) (EC Symbol) [F:\tesseract\build_arm64ec\tesseract.vcxproj]
9>tesseract52.lib(simddetect.obj) : error LNK2019: unresolved external symbol "float __cdecl tesseract::DotProductAVX(float const *,float const *,int)" (?DotProductAVX@tesseract@@YAMPEBM0H@Z) referenced in function "private: __cdecl tesseract::SIMDDetect::SIMDDetect(void)" (??0SIMDDetect@tesseract@@$$hAEAA@XZ) (EC Symbol) [F:\tesseract\build_arm64ec\tesseract.vcxproj]
 9>tesseract52.lib(simddetect.obj) : error LNK2001: unresolved external symbol "float __cdecl tesseract::DotProductAVX512F(float const *,float const *,int)" (?DotProductAVX512F@tesseract@@$$hYAMPEBM0H@Z) (EC Symbol) [F:\tesseract\build_arm64ec\tesseract.vcxproj]
9>tesseract52.lib(simddetect.obj) : error LNK2019: unresolved external symbol "float __cdecl tesseract::DotProductAVX512F(float const *,float const *,int)" (?DotProductAVX512F@tesseract@@YAMPEBM0H@Z) referenced in function "private: __cdecl tesseract::SIMDDetect::SIMDDetect(void)" (??0SIMDDetect@tesseract@@$$hAEAA@XZ) (EC Symbol) [F:\tesseract\build_arm64ec\tesseract.vcxproj]
9>tesseract52.lib(simddetect.obj) : error LNK2001: unresolved external symbol "float __cdecl tesseract::DotProductFMA(float const *,float const *,int)" (?DotProductFMA@tesseract@@$$hYAMPEBM0H@Z) (EC Symbol) [F:\tesseract\build_arm64ec\tesseract.vcxproj]
9>tesseract52.lib(simddetect.obj) : error LNK2019: unresolved external symbol "float __cdecl tesseract::DotProductFMA(float const *,float const *,int)" (?DotProductFMA@tesseract@@YAMPEBM0H@Z) referenced in function "public: static void __cdecl tesseract::SIMDDetect::Update(void)" (?Update@SIMDDetect@tesseract@@$$hSAXXZ) (EC Symbol) [F:\tesseract\build_arm64ec\tesseract.vcxproj]
9>tesseract52.lib(simddetect.obj) : error LNK2001: unresolved external symbol "float __cdecl tesseract::DotProductSSE(float const *,float const *,int)" (?DotProductSSE@tesseract@@$$hYAMPEBM0H@Z) (EC Symbol) [F:\tesseract\build_arm64ec\tesseract.vcxproj]
 9>tesseract52.lib(simddetect.obj) : error LNK2019: unresolved external symbol "float __cdecl tesseract::DotProductSSE(float const *,float const *,int)" (?DotProductSSE@tesseract@@YAMPEBM0H@Z) referenced in function "private: __cdecl tesseract::SIMDDetect::SIMDDetect(void)" (??0SIMDDetect@tesseract@@$$hAEAA@XZ) (EC Symbol) [F:\tesseract\build_arm64ec\tesseract.vcxproj]
 9>tesseract52.lib(simddetect.obj) : error LNK2019: unresolved external symbol "public: static struct tesseract::IntSimdMatrix const tesseract::IntSimdMatrix::intSimdMatrixAVX2" (?intSimdMatrixAVX2@IntSimdMatrix@tesseract@@2U12@B) referenced in function "private: __cdecl tesseract::SIMDDetect::SIMDDetect(void)" (??0SIMDDetect@tesseract@@$$hAEAA@XZ) (EC Symbol) [F:\tesseract\build_arm64ec\tesseract.vcxproj]
 9>tesseract52.lib(simddetect.obj) : error LNK2019: unresolved external symbol "public: static struct tesseract::IntSimdMatrix const tesseract::IntSimdMatrix::intSimdMatrixSSE" (?intSimdMatrixSSE@IntSimdMatrix@tesseract@@2U12@B) referenced in function "private: __cdecl tesseract::SIMDDetect::SIMDDetect(void)" (??0SIMDDetect@tesseract@@$$hAEAA@XZ) (EC Symbol) [F:\tesseract\build_arm64ec\tesseract.vcxproj]
9>F:\tesseract\build_arm64ec\bin\Release\tesseract.exe : fatal error LNK1120: 10 unresolved externals [F:\tesseract\build_arm64ec\tesseract.vcxproj]

Error log: tesseract_build.log

build process

opened by YangYang129 7

Tesseract produces overlapping bounding boxes for clearly separated lines
Environment

Tesseract Version: 5.2.0

Platform: Windows 10, x64

Current Behavior:

The PDF renderer places two different lines on the same line, intermixing their characters.

I cannot post the full original image here because of GDPR, but I can show part of it and the HOCR from that and the resulting text in PDF. Hopefully this is enough, but if not, feel free to contact me.

Part of HOCR;

tesseract.exe "C:\support\redacted.png" "c:\support\redacted" --tessdata-dir "C:\Tesseract\tessdata_best-main" -l eng --psm 4 --oem 1 -c tessedit_create_hocr=1

Selected all text in PDF:

Copying and pasting it into Notepad gives intermixed text:

No: Date: 09.2221420323 09.22

Expected Behavior:

To have two separate lines which can be copy/pasted.
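
To check whether two line boxes from the hOCR output really overlap, the bbox values can be compared directly. The Python sketch below is an illustration only; the helper names parse_bbox and vertical_overlap are hypothetical. It assumes the hOCR convention of a title attribute containing "bbox x0 y0 x1 y1" in pixel coordinates.

```python
import re
from typing import Tuple

BBox = Tuple[int, int, int, int]  # x0, y0, x1, y1


def parse_bbox(title: str) -> BBox:
    """Extract the bbox coordinates from an hOCR title attribute."""
    match = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", title)
    if match is None:
        raise ValueError("no bbox found in title attribute")
    x0, y0, x1, y1 = (int(value) for value in match.groups())
    return x0, y0, x1, y1


def vertical_overlap(a: BBox, b: BBox) -> float:
    """Return the shared height divided by the height of the shorter box (0.0 if disjoint)."""
    overlap = min(a[3], b[3]) - max(a[1], b[1])
    shorter = min(a[3] - a[1], b[3] - b[1])
    if overlap <= 0 or shorter <= 0:
        return 0.0
    return overlap / shorter


if __name__ == "__main__":
    # Placeholder coordinates, not taken from the redacted image in this report.
    first = parse_bbox("bbox 0 0 100 20; baseline 0 -4")
    second = parse_bbox("bbox 0 10 100 30")
    print(vertical_overlap(first, second))
```

A ratio close to 1.0 for two lines that are visibly separated on the page would point to a layout analysis problem rather than a PDF rendering problem.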

Suggested Fix:
bug layout analysis bounding box
opened by bleze 8
TessBaseAPIInit1() should take config file list as const
Environment

Tesseract Version: 4.0 and newer

Commit Number:

Platform: issue is independent of the platform

Current Behavior:

In the C API, the function int TessBaseAPIInit1() accepts a list of configuration files with the parameters char **configs, int configs_size. Although configs is used read-only, it is not correctly const-qualified, which is annoying for me as a user because I have to use const_cast.

char * configs[1];
configs[0] = const_cast<char *>(strConfigFile.c_str());
int r = TessBaseAPIInit1(h, strTessdata.c_str(), "eng", OEM_DEFAULT, configs, 1);

Expected Behavior:

TessBaseAPIInit1 should accept an array of const char pointers.

Suggested Fix:

Change the signature from TESS_API int TessBaseAPIInit1(TessBaseAPI *handle, const char *datapath, const char *language, TessOcrEngineMode oem, char **configs, int configs_size);

to

TESS_API int TessBaseAPIInit1(TessBaseAPI *handle, const char *datapath, const char *language, TessOcrEngineMode oem, const char **configs, int configs_size);

The implementation is the same because the function does not modify the array of strings.
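The effect of the proposed parameter type on calling code can be sketched as follows. This is only an illustration: CountNonEmptyConfigs and CountFromStrings are hypothetical stand-ins that are not part of Tesseract. They mirror the proposed const char **configs, int configs_size parameters to show that a caller can pass strings obtained from c_str() without const_cast.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a C API function that only reads its config list.
// It is NOT part of Tesseract; it only mirrors the proposed parameter types.
int CountNonEmptyConfigs(const char **configs, int configs_size) {
  int count = 0;
  for (int i = 0; i < configs_size; ++i) {
    // The strings are read, never modified.
    if (configs[i] != nullptr && configs[i][0] != '\0') {
      ++count;
    }
  }
  return count;
}

// Caller side: std::string::c_str() returns const char *, which fits directly
// into an array of const char *, so no const_cast is needed.
int CountFromStrings(const std::vector<std::string> &files) {
  std::vector<const char *> configs;
  configs.reserve(files.size());
  for (const std::string &f : files) {
    configs.push_back(f.c_str());
  }
  return CountNonEmptyConfigs(configs.data(),
                              static_cast<int>(configs.size()));
}
```

With the current char **configs parameter, the push_back line would need a const_cast<char *> as in the snippet from the report.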
enhancement API
opened by M-Fabian 1
Continuation of an interrupted training

From issue #3560:

@stweil commented,

There are other training aspects which I consider more important.

One is continuation of an interrupted training. That should start with the line following the last one which was used for training. I'm afraid that it currently starts with the first line, so a training which was interrupted is different compared to a training which runs without any interrupt.
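The requested behaviour can be illustrated with a minimal sketch. It is not Tesseract's trainer or checkpoint format: TrainingCursor and TrainSome are hypothetical names, and "training" is reduced to recording which lines were visited. The idea is that the position of the next line is saved with the checkpoint, so an interrupted run followed by a resumed run sees the same line sequence as one uninterrupted run.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical checkpoint record; Tesseract's real checkpoint data differs.
struct TrainingCursor {
  std::size_t next_line = 0;  // index of the next line to train on
};

// Visits `budget` lines starting at cursor.next_line, wrapping around the
// list, and advances the cursor so a later call continues where this stopped.
std::vector<std::string> TrainSome(const std::vector<std::string> &lines,
                                   TrainingCursor &cursor,
                                   std::size_t budget) {
  std::vector<std::string> visited;
  if (lines.empty()) {
    return visited;
  }
  for (std::size_t i = 0; i < budget; ++i) {
    visited.push_back(lines[cursor.next_line % lines.size()]);
    cursor.next_line = (cursor.next_line + 1) % lines.size();
  }
  return visited;
}
```

Calling TrainSome twice with the same cursor and a budget of 3 visits the same lines, in the same order, as a single call with a budget of 6 on a fresh cursor. Restarting from the first line after an interrupt would not.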

enhancement

opened by amitdo 0

Releases(5.3.0)

5.3.0(Dec 22, 2022)
This is a new minor version of Tesseract 5.

What's Changed

Fix memory issues in ScrollView::MessageReceiver by @p12tic in https://github.com/tesseract-ocr/tesseract/pull/3872

autotools: Add rule for svpaint executable by @stweil in https://github.com/tesseract-ocr/tesseract/pull/3873

Replace call of exit function by return statement in main function by @stweil in https://github.com/tesseract-ocr/tesseract/pull/3878

Fix the build on CodeQL/Analyze by @arseniy-sonar in https://github.com/tesseract-ocr/tesseract/pull/3888

CI: Remove Ubuntu 18.04 by @amitdo in https://github.com/tesseract-ocr/tesseract/pull/3902

configure.ac: fix build on aarch64_be by @ffontaine in https://github.com/tesseract-ocr/tesseract/pull/3907

SW CI: Add paths filter by @amitdo in https://github.com/tesseract-ocr/tesseract/pull/3908

Create .mailmap by @amitdo in https://github.com/tesseract-ocr/tesseract/pull/3910

Fix tesseract.pc from cmake to match autotools by @jeroen in https://github.com/tesseract-ocr/tesseract/pull/3930

Update README.md by @nicholasz2510 in https://github.com/tesseract-ocr/tesseract/pull/3935

Fixed 2 errors by @Gitoffthelawn in https://github.com/tesseract-ocr/tesseract/pull/3938

fix issue #3940 - remove colormap before thresholding by @zdenop in https://github.com/tesseract-ocr/tesseract/pull/3942

Update upload-artifact action by @rettinghaus in https://github.com/tesseract-ocr/tesseract/pull/3949

Update checkout action to version 3 by @rettinghaus in https://github.com/tesseract-ocr/tesseract/pull/3948

Fix Markdownlint by @Saibamen in https://github.com/tesseract-ocr/tesseract/pull/3950

Fix broken links in CONTRIBUTING.md by @doraeric in https://github.com/tesseract-ocr/tesseract/pull/3951

pdfrenderer.cpp: Ignore non-text blocks by @amitdo in https://github.com/tesseract-ocr/tesseract/pull/3959

lstm.train: allow .box from .raw.png too by @bertsky in https://github.com/tesseract-ocr/tesseract/pull/3962

Fix a number of performance issues (reported by Coverity Scan) by @stweil in https://github.com/tesseract-ocr/tesseract/pull/3967

Fix training tools for legacy engine (issue #3925) by @stweil in https://github.com/tesseract-ocr/tesseract/pull/3970

Fix function tesseract::WriteFeature (issue #3925) by @stweil in https://github.com/tesseract-ocr/tesseract/pull/3972

Modernize function ObjectCache::DeleteUnusedObjects (fix issue with s… by @stweil in https://github.com/tesseract-ocr/tesseract/pull/3978

More fixes for issue #3925 by @stweil in https://github.com/tesseract-ocr/tesseract/pull/3977

New Contributors

@p12tic made their first contribution in https://github.com/tesseract-ocr/tesseract/pull/3872

@arseniy-sonar made their first contribution in https://github.com/tesseract-ocr/tesseract/pull/3888

@nicholasz2510 made their first contribution in https://github.com/tesseract-ocr/tesseract/pull/3935

@rettinghaus made their first contribution in https://github.com/tesseract-ocr/tesseract/pull/3949

@Saibamen made their first contribution in https://github.com/tesseract-ocr/tesseract/pull/3950

@doraeric made their first contribution in https://github.com/tesseract-ocr/tesseract/pull/3951

Full Changelog: https://github.com/tesseract-ocr/tesseract/compare/5.2.0...5.3.0
5.3.0-rc1(Dec 13, 2022)
What's Changed

Fix memory issues in ScrollView::MessageReceiver by @p12tic in https://github.com/tesseract-ocr/tesseract/pull/3872

autotools: Add rule for svpaint executable by @stweil in https://github.com/tesseract-ocr/tesseract/pull/3873

Replace call of exit function by return statement in main function by @stweil in https://github.com/tesseract-ocr/tesseract/pull/3878

Fix the build on CodeQL/Analyze by @arseniy-sonar in https://github.com/tesseract-ocr/tesseract/pull/3888

CI: Remove Ubuntu 18.04 by @amitdo in https://github.com/tesseract-ocr/tesseract/pull/3902

configure.ac: fix build on aarch64_be by @ffontaine in https://github.com/tesseract-ocr/tesseract/pull/3907

SW CI: Add paths filter by @amitdo in https://github.com/tesseract-ocr/tesseract/pull/3908

Create .mailmap by @amitdo in https://github.com/tesseract-ocr/tesseract/pull/3910

Fix tesseract.pc from cmake to match autotools by @jeroen in https://github.com/tesseract-ocr/tesseract/pull/3930

Update README.md by @nicholasz2510 in https://github.com/tesseract-ocr/tesseract/pull/3935

Fixed 2 errors by @Gitoffthelawn in https://github.com/tesseract-ocr/tesseract/pull/3938

fix issue #3940 - remove colormap before thresholding by @zdenop in https://github.com/tesseract-ocr/tesseract/pull/3942

Update upload-artifact action by @rettinghaus in https://github.com/tesseract-ocr/tesseract/pull/3949

Update checkout action to version 3 by @rettinghaus in https://github.com/tesseract-ocr/tesseract/pull/3948

Fix Markdownlint by @Saibamen in https://github.com/tesseract-ocr/tesseract/pull/3950

Fix broken links in CONTRIBUTING.md by @doraeric in https://github.com/tesseract-ocr/tesseract/pull/3951

pdfrenderer.cpp: Ignore non-text blocks by @amitdo in https://github.com/tesseract-ocr/tesseract/pull/3959

lstm.train: allow .box from .raw.png too by @bertsky in https://github.com/tesseract-ocr/tesseract/pull/3962

Fix a number of performance issues (reported by Coverity Scan) by @stweil in https://github.com/tesseract-ocr/tesseract/pull/3967

Fix training tools for legacy engine (issue #3925) by @stweil in https://github.com/tesseract-ocr/tesseract/pull/3970

Fix function tesseract::WriteFeature (issue #3925) by @stweil in https://github.com/tesseract-ocr/tesseract/pull/3972

Modernize function ObjectCache::DeleteUnusedObjects (fix issue with s… by @stweil in https://github.com/tesseract-ocr/tesseract/pull/3978

More fixes for issue #3925 by @stweil in https://github.com/tesseract-ocr/tesseract/pull/3977

New Contributors

@p12tic made their first contribution in https://github.com/tesseract-ocr/tesseract/pull/3872

@arseniy-sonar made their first contribution in https://github.com/tesseract-ocr/tesseract/pull/3888

@nicholasz2510 made their first contribution in https://github.com/tesseract-ocr/tesseract/pull/3935

@rettinghaus made their first contribution in https://github.com/tesseract-ocr/tesseract/pull/3949

@Saibamen made their first contribution in https://github.com/tesseract-ocr/tesseract/pull/3950

@doraeric made their first contribution in https://github.com/tesseract-ocr/tesseract/pull/3951

Full Changelog: https://github.com/tesseract-ocr/tesseract/compare/5.2.0...5.3.0-rc1
5.2.0(Jul 6, 2022)
This is a new minor version of Tesseract 5.

Improvements and fixes for continuous integration, autoconf and cmake builds.

Set /Os for some 32 bit MS compilers (fixes #3769).

Improve comments and other documentation.

Add initial support for Intel AVX512F.

Fix for very large PDF files on 32 bit hosts (fixes #3805).

Fix NEON detection on FreeBSD.

Fix regression with UZN files (fixes #3837).

Fix calling delete[] for memory allocated by malloc in C API.

Add an API function to init tesseract with traineddata from memory (fixes #3691).

Replace direct access to Leptonica internal data structures by function calls and support latest releases of Leptonica.

Replace std::regex by std::string functions (fixes issue #3830).

Use compiled-in TESSDATA_PREFIX also on Windows (fixes #3767).

Add new parameter 'invert_threshold', change the default threshold from 0.5 to 0.7 and mark parameter 'tessedit_do_invert' as deprecated.

See also list of all changes.
5.1.0(Mar 1, 2022)
This is a new minor version of Tesseract 5.

Handle image and line regions in output formats ALTO, hOCR and text.

New parameter curl_timeout for curl_easy_setopt.

Build fixes and improvements.

Catch nullptr in PageIterator::Orientation to improve robustness.

Remove unused code.

See also list of all changes.
5.0.1(Jan 7, 2022)
This is a bug fix release of Tesseract 5.0.

Add SPDX-License-Identifier to public include files.

Support redirections when running OCR on a URL.

Lots of fixes and improvements for cmake builds. Distributions should use the autoconf build.

Fix broken msys2 build with gcc 11.

Fix parameter certainty_scale (was duplicated).

Fix some compiler warnings and clean code.

Correctly detect amd64 and i386 on FreeBSD.

Add libarchive and libcurl in continuous integration actions.

Update submodule googletest to release v1.11.0.

See also list of all changes.
5.0.0(Nov 30, 2021)
This is the final stable release of Tesseract 5.0.0.

Limit BCER to interval [0,1]

Improved build process

Cleaned code

See also list of all changes.
5.0.0-rc3(Nov 22, 2021)
This is the third release candidate of Tesseract 5.0.0.

Improve training messages

Add RowAttributes getter to PageIterator

See also list of all changes.
4.1.3(Nov 15, 2021)
This is a new stable release of Tesseract 4.1.

Fix broken autoconf build (issue #3642)

See also list of all changes.
5.0.0-rc2(Nov 14, 2021)
This is the second release candidate of Tesseract 5.0.0.

Fix regression for OCR with more than one model file

Bug fixes

Optimizations

See also list of all changes.
4.1.2(Nov 14, 2021)
This is a new stable release of Tesseract 4.1.

Note: The autoconf build is broken (see issue #3642), so please use 4.1.3.

Allow line images with larger width for training

Bug fixes

Build updates and fixes

See also list of all changes.
5.0.0-rc1(Oct 29, 2021)
This is the first release candidate of Tesseract 5.0.0.

Enable fast float32 LSTM by default

Switch to NFC normalisation everywhere

Remove banner message

Disable music staff detection and removal

Add new command line option --loglevel

Bug fixes

See also list of all changes.
5.0.0-beta-20210916(Sep 16, 2021)
This is a new pre-release of Tesseract 5.0.0.

Bug fixes

Extend URI support for Tesseract with libcurl

Rename processed TIFF output file and add page number if needed

See also list of all changes.
5.0.0-beta-20210815(Aug 15, 2021)
This is a new pre-release of Tesseract 5.0.0.

Bug fixes

Modernize more code

More options for binarization

Improved support for ARM NEON

No longer depends on Abseil for unit tests

Support float for model training and text recognition (faster, requires less RAM)

See also list of all changes.
5.0.0-alpha-20210401(Apr 1, 2021)
This is a new pre-release of Tesseract 5.0.0.

Replaced all remaining STRING by std::string

Replaced lots of GenericVector by std::vector

Replaced all malloc / free by C++ code

Modernized and formatted code

See also list of all changes.
5.0.0-alpha-20201231(Dec 31, 2020)
This is a new pre-release of Tesseract 5.0.0.

It has massive changes in the public API, which is a great step towards a final 5.0.0. All unit tests pass, but because of those changes more practical experience is needed.

the public API no longer uses proprietary data types GenericVector, STRING

pdf.ttf is no longer needed because it is now integrated into the code

See also list of all changes.
5.0.0-alpha-20201224(Dec 24, 2020)
This is a new pre-release of Tesseract 5.0.0.

It is considered to be production ready for end users, but nevertheless not stable because more incompatible API changes are planned.

improved performance (also on ARM / ARM64)

improved unit tests

many fixes

faster flat build with automake

support for latest macOS (including new M1 processor)

See also list of all changes.
4.1.1(Dec 26, 2019)
Implemented sw build (cppan is deprecated)

Improved cmake build

Code cleanup and optimization

A lot of bug fixes...

4.1.0(Jul 7, 2019)
Added new renderers Alto, LSTMBox, WordStrBox.

Added character boxes in hOCR output.

Added Python training scripts (experimental) as an alternative to the shell scripts.

Better support AVX / AVX2 / SSE.

Disable OpenMP support by default (see e.g. #1171, #1081).

Fix for bounding box problem.

Implemented support for whitelist/blacklist in LSTM engine.

Improved cmake configuration.

Code modernization and improvements.

A lot of bug fixes...

A detailed changelog is on the wiki.

Windows installer can be downloaded from https://github.com/UB-Mannheim/tesseract/wiki.
4.0.0(Oct 29, 2018)

Detailed release notes, changelog and documentation can be found in the project wiki.

Windows installer can be downloaded from https://github.com/UB-Mannheim/tesseract/wiki.
3.05.02(Jun 19, 2018)

Bug fix release
3.05.01(Jun 1, 2017)

Bug fix release
3.05.00(Feb 16, 2017)
Fine-tuned the hOCR output.

Added TSV as another optional output format.

Fixed ABI break introduced in 3.04.00 with the AnalyseLayout() method.

text2image tool - Enable all OpenType ligatures available in a font. This feature requires Pango 1.38 or newer.

Training tools - Replaced asserts with tprintf() and exit(1).

Fixed Cygwin compatibility.

Improved multipage tiff processing.

Improved the embedded pdf font (pdf.ttf).

Enable selection of OCR engine mode from command line.

Changed tesseract command line parameter '-psm' to '--psm'.

Added new C API for orientation and script detection, removed the old one.

Increased minimum autoconf version to 2.59.

Removed dead code.

Fixed many compiler warnings.

Fixed memory and resource leaks.

Fixed some issues with the 'Cube' OCR engine.

Fixed some OpenCL issues.

Added option to build Tesseract with CMake build system.

Implemented CPPAN support for easy Windows building.

3.04.01(Feb 16, 2016)

Bug fix release for version 3.04
3.04.00(Jul 24, 2015)
Added OpenCL support (experimental)

Many bug fixes
