This package contains an OCR engine (libtesseract) and a command line program (tesseract). Tesseract 4 adds a new neural net (LSTM) based OCR engine that is focused on line recognition, but it still supports the legacy OCR engine of Tesseract 3, which works by recognizing character patterns. Compatibility with Tesseract 3 is enabled by using the Legacy OCR Engine mode (--oem 0). This mode also needs traineddata files that support the legacy engine, for example those from the tessdata repository.
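As a rough illustration of the engine selection described above, the following Python sketch builds the command line arguments for a given engine mode. Only `--oem 0` is named in this README; the other mode numbers follow the output of `tesseract --help-oem`, and the helper name `engine_args` and the example directory are hypothetical.

```python
from enum import IntEnum


class EngineMode(IntEnum):
    """Values accepted by --oem (per `tesseract --help-oem`)."""
    LEGACY_ONLY = 0      # Tesseract 3 character-pattern engine
    LSTM_ONLY = 1        # neural net (LSTM) line recognizer
    LEGACY_AND_LSTM = 2  # both engines combined
    DEFAULT = 3          # whatever the traineddata supports


def engine_args(mode, tessdata_dir=None):
    """Return the argument fragment that selects an OCR engine.

    tessdata_dir is optional; for the legacy engine it should point at
    traineddata files that still contain the legacy model.
    """
    args = []
    if tessdata_dir is not None:
        args += ["--tessdata-dir", str(tessdata_dir)]
    args += ["--oem", str(int(mode))]
    return args


if __name__ == "__main__":
    # Hypothetical directory holding legacy-capable traineddata files.
    print(engine_args(EngineMode.LEGACY_ONLY, "/path/to/tessdata"))
```

The fragment is meant to be appended after the image name and output base of a `tesseract` invocation.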

The lead developer is Ray Smith. The maintainer is Zdenko Podobny. For a list of contributors see AUTHORS and GitHub's log of contributors.

Tesseract has Unicode (UTF-8) support, and can recognize more than 100 languages "out of the box".

Tesseract supports various output formats: plain text, hOCR (HTML), PDF, invisible-text-only PDF, TSV. The master branch also has experimental support for ALTO (XML) output.
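The output format is normally chosen by naming a config file at the end of the command line. The sketch below maps the formats listed above to the config names and file extensions we believe the 4.1 release uses (`txt`, `hocr`, `pdf`, `tsv`, `alto`, and the `textonly_pdf` variable for invisible-text-only PDF). Treat the table as an assumption to verify against your installed version; the function name `output_command` is hypothetical.

```python
# format name -> (extra command line arguments, extension of the file written)
OUTPUT_FORMATS = {
    "text": (["txt"], ".txt"),
    "hocr": (["hocr"], ".hocr"),
    "pdf": (["pdf"], ".pdf"),
    # Invisible-text-only PDF: the pdf config plus a variable override.
    "pdf-textonly": (["-c", "textonly_pdf=1", "pdf"], ".pdf"),
    "tsv": (["tsv"], ".tsv"),
    "alto": (["alto"], ".xml"),
}


def output_command(image, outputbase, fmt):
    """Return (argv, expected output path) for one output format."""
    try:
        extra_args, extension = OUTPUT_FORMATS[fmt]
    except KeyError:
        raise ValueError(f"unknown output format: {fmt!r}") from None
    argv = ["tesseract", str(image), str(outputbase)] + extra_args
    return argv, str(outputbase) + extension


if __name__ == "__main__":
    for name in OUTPUT_FORMATS:
        argv, path = output_command("scan.png", "result", name)
        print(" ".join(argv), "->", path)
```

Several config names can be given in one invocation to produce more than one output file from a single recognition pass.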

In many cases, you will need to improve the quality of the image you give Tesseract in order to get better OCR results.

This project does not include a GUI application. If you need one, please see the 3rdParty documentation.

Tesseract can be trained to recognize other languages. See Tesseract Training for more information.

Brief history

Tesseract was originally developed at Hewlett-Packard Laboratories Bristol and at Hewlett-Packard Co, Greeley Colorado between 1985 and 1994, with some more changes made in 1996 to port it to Windows, and some C++izing in 1998. In 2005 Tesseract was open sourced by HP. Since 2006 it has been developed by Google.

The latest (LSTM based) stable version is 4.1.1, released on December 26, 2019. The latest source code is available from the master branch on GitHub. Open issues can be found in the issue tracker and in the planning documentation.

The latest 3.0x version is 3.05.02, released on June 19, 2018. The latest source code for 3.05 is available from the 3.05 branch on GitHub. There is no development for this version, but it can be used for special cases (e.g. see Regression of features from 3.0x).

See Release Notes and Change Log for more details of the releases.

Installing Tesseract

You can either install Tesseract via a pre-built binary package or build it from source.

Supported compilers are:

  • GCC 4.8 and above
  • Clang 3.4 and above
  • MSVC 2015, 2017, 2019

Other compilers might work, but are not officially supported.

Running Tesseract

Basic command line usage:

tesseract imagename outputbase [-l lang] [--oem ocrenginemode] [--psm pagesegmode] [configfiles...]

For more information about the various command line options use tesseract --help or man tesseract.

Examples can be found in the documentation.
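For scripted use, the synopsis above can be assembled programmatically. This is a minimal Python sketch, not an official wrapper: `build_command` and `run_ocr` are hypothetical helper names, and the file names are placeholders. It only runs Tesseract when the `tesseract` executable is found on the PATH.

```python
import shutil
import subprocess


def build_command(image, outputbase, lang=None, oem=None, psm=None, configfiles=()):
    """Assemble argv in the order given by the synopsis:
    tesseract imagename outputbase [-l lang] [--oem N] [--psm N] [configfiles...]
    """
    cmd = ["tesseract", str(image), str(outputbase)]
    if lang:
        cmd += ["-l", lang]            # e.g. "eng", or "eng+deu" for two languages
    if oem is not None:
        cmd += ["--oem", str(oem)]     # OCR engine mode
    if psm is not None:
        cmd += ["--psm", str(psm)]     # page segmentation mode
    cmd += [str(name) for name in configfiles]
    return cmd


def run_ocr(image, outputbase, **options):
    """Run Tesseract and raise if it is missing or exits with an error."""
    if shutil.which("tesseract") is None:
        raise FileNotFoundError("the tesseract executable was not found on PATH")
    cmd = build_command(image, outputbase, **options)
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder file names; prints the command instead of running it.
    print(" ".join(build_command("page.png", "page", lang="eng", psm=6)))
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems with file names that contain spaces.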

For developers

Developers can use the libtesseract C or C++ API to build their own applications. If you need bindings to libtesseract for other programming languages, please see the wrapper section in the AddOns documentation.
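To give a feel for the C API without writing C, the sketch below calls it from Python through `ctypes`. It is an illustration under assumptions, not a supported binding: it assumes the shared libraries can be located under the names `tesseract` and `lept` (or `leptonica`), and that the functions `TessBaseAPICreate`, `TessBaseAPIInit3`, `TessBaseAPISetImage2`, `TessBaseAPIGetUTF8Text`, `TessDeleteText`, `TessBaseAPIEnd` and `TessBaseAPIDelete` are exported as declared in `capi.h`. The helper names `load_library` and `ocr_image` are hypothetical.

```python
import ctypes
import ctypes.util
import os


def load_library(name):
    """Return a ctypes handle for a shared library, or None if unavailable."""
    path = ctypes.util.find_library(name)
    if path is None:
        return None
    try:
        return ctypes.CDLL(path)
    except OSError:
        return None


def ocr_image(image_path, lang="eng"):
    """OCR one image via the libtesseract C API; None if libraries are missing."""
    tess = load_library("tesseract")
    lept = load_library("lept")
    if lept is None:
        lept = load_library("leptonica")  # name used by some newer packages
    if tess is None or lept is None:
        return None

    # Declare signatures so pointers are not truncated to C int.
    lept.pixRead.restype = ctypes.c_void_p
    lept.pixRead.argtypes = [ctypes.c_char_p]
    lept.pixDestroy.restype = None
    lept.pixDestroy.argtypes = [ctypes.POINTER(ctypes.c_void_p)]
    tess.TessBaseAPICreate.restype = ctypes.c_void_p
    tess.TessBaseAPICreate.argtypes = []
    tess.TessBaseAPIInit3.restype = ctypes.c_int
    tess.TessBaseAPIInit3.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_char_p]
    tess.TessBaseAPISetImage2.restype = None
    tess.TessBaseAPISetImage2.argtypes = [ctypes.c_void_p, ctypes.c_void_p]
    tess.TessBaseAPIGetUTF8Text.restype = ctypes.c_void_p
    tess.TessBaseAPIGetUTF8Text.argtypes = [ctypes.c_void_p]
    tess.TessDeleteText.restype = None
    tess.TessDeleteText.argtypes = [ctypes.c_void_p]
    tess.TessBaseAPIEnd.restype = None
    tess.TessBaseAPIEnd.argtypes = [ctypes.c_void_p]
    tess.TessBaseAPIDelete.restype = None
    tess.TessBaseAPIDelete.argtypes = [ctypes.c_void_p]

    # Leptonica opens the image; Tesseract itself does not read image files here.
    pix = lept.pixRead(os.fsencode(image_path))
    if not pix:
        raise OSError(f"Leptonica could not read {image_path!r}")

    handle = tess.TessBaseAPICreate()
    try:
        # A NULL datapath lets Tesseract use its default tessdata location.
        if tess.TessBaseAPIInit3(handle, None, lang.encode("ascii")) != 0:
            raise RuntimeError("could not initialize Tesseract")
        tess.TessBaseAPISetImage2(handle, pix)
        text_ptr = tess.TessBaseAPIGetUTF8Text(handle)
        if not text_ptr:
            return ""
        try:
            return ctypes.string_at(text_ptr).decode("utf-8")
        finally:
            tess.TessDeleteText(text_ptr)  # the text buffer is owned by the caller
    finally:
        tess.TessBaseAPIEnd(handle)
        tess.TessBaseAPIDelete(handle)
        pix_ref = ctypes.c_void_p(pix)
        lept.pixDestroy(ctypes.byref(pix_ref))
```

For real projects, one of the maintained wrappers listed in the AddOns documentation is likely a better choice than hand-written `ctypes` declarations.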

Documentation of Tesseract generated from source code by doxygen can be found on tesseract-ocr.github.io.


Before you submit an issue, please review the guidelines for this repository.

For support, first read the documentation, particularly the FAQ to see if your problem is addressed there. If not, search the Tesseract user forum, the Tesseract developer forum and past issues, and if you still can't find what you need, ask for support in the mailing-lists.


Please report an issue only for a bug, not for asking questions.


The code in this repository is licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at


Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

NOTE: This software depends on other packages that may be licensed under different open source licenses.

Tesseract uses the Leptonica library, which essentially uses a BSD 2-clause license.


Tesseract uses the Leptonica library for opening input images (i.e. not documents such as PDF). It is suggested to use Leptonica with built-in support for zlib, png and tiff (for multipage tiff).

Latest Version of README

For the latest online version of the README.md see:


  • 5.0.0-rc3(Nov 22, 2021)

  • 4.1.3(Nov 15, 2021)

  • 4.1.2(Nov 14, 2021)

    This is a new stable release of Tesseract 4.1.

    Note: The autoconf build is broken (see issue #3642), so please use 4.1.3.

    • Allow line images with larger width for training
    • Bug fixes
    • Build updates and fixes

    See also list of all changes.

  • 5.0.0-rc2(Nov 14, 2021)

  • 5.0.0-rc1(Oct 29, 2021)

    This is the first release candidate of Tesseract 5.0.0.

    • Enable fast float32 LSTM by default
    • Switch to NFC normalisation everywhere
    • Remove banner message
    • Disable music staff detection and removal
    • Add new command line option --loglevel
    • Bug fixes

    See also list of all changes.

  • 5.0.0-beta-20210916(Sep 16, 2021)

    This is a new pre-release of Tesseract 5.0.0.

    • Bug fixes
    • Extend URI support for Tesseract with libcurl
    • Rename processed TIFF output file and add page number if needed

    See also list of all changes.

  • 5.0.0-beta-20210815(Aug 15, 2021)

    This is a new pre-release of Tesseract 5.0.0.

    • Bug fixes
    • Modernize more code
    • More options for binarization
    • Improved support for ARM NEON
    • No longer depends on Abseil for unit tests
    • Support float for model training and text recognition (faster, requires less RAM)

    See also list of all changes.

  • 5.0.0-alpha-20210401(Apr 1, 2021)

    This is a new pre-release of Tesseract 5.0.0.

    • Replaced all remaining STRING by std::string
    • Replaced lots of GenericVector by std::vector
    • Replaced all malloc / free by C++ code
    • Modernized and formatted code

    See also list of all changes.

  • 5.0.0-alpha-20201231(Dec 31, 2020)

    This is a new pre-release of Tesseract 5.0.0.

    It has massive changes in the public API which is a great step towards a final 5.0.0. All unit tests pass, but because of those changes more practical experience is needed.

    • the public API no longer uses proprietary data types GenericVector, STRING
    • pdf.ttf is no longer needed because it is now integrated into the code

    See also list of all changes.

  • 5.0.0-alpha-20201224(Dec 24, 2020)

    This is a new pre-release of Tesseract 5.0.0.

    It is considered to be production ready for end users, but nevertheless not stable because more incompatible API changes are planned.

    • improved performance (also on ARM / ARM64)
    • improved unit tests
    • many fixes
    • faster flat build with automake
    • support for latest macOS (including new M1 processor)

    See also list of all changes.

  • 4.1.1(Dec 26, 2019)

  • 4.1.0(Jul 7, 2019)

    • Added new renderers Alto, LSTMBox, WordStrBox.
    • Added character boxes in hOCR output.
    • Added Python training scripts (experimental) as an alternative to the shell scripts.
    • Better support for AVX / AVX2 / SSE.
    • Disable OpenMP support by default (see e.g. #1171, #1081).
    • Fix for bounding box problem.
    • Implemented support for whitelist/blacklist in LSTM engine.
    • Improved cmake configuration.
    • Code modernization and improvements.
    • A lot of bug fixes...

    Detailed changelog is on wiki.

    Windows installer can be downloaded from https://github.com/UB-Mannheim/tesseract/wiki.

  • 4.0.0(Oct 29, 2018)

  • 3.05.02(Jun 19, 2018)

  • 3.05.01(Jun 1, 2017)

  • 3.05.00(Feb 16, 2017)

    • Fine-tuned the hOCR output.
    • Added TSV as another optional output format.
    • Fixed ABI break introduced in 3.04.00 with the AnalyseLayout() method.
    • text2image tool - Enable all OpenType ligatures available in a font. This feature requires Pango 1.38 or newer.
    • Training tools - Replaced asserts with tprintf() and exit(1).
    • Fixed Cygwin compatibility.
    • Improved multipage tiff processing.
    • Improved the embedded pdf font (pdf.ttf).
    • Enable selection of OCR engine mode from command line.
    • Changed tesseract command line parameter '-psm' to '--psm'.
    • Added new C API for orientation and script detection, removed the old one.
    • Increased minimum autoconf version to 2.59.
    • Removed dead code.
    • Fixed many compiler warnings.
    • Fixed memory and resource leaks.
    • Fixed some issues with the 'Cube' OCR engine.
    • Fixed some OpenCL issues.
    • Added option to build Tesseract with CMake build system.
    • Implemented CPPAN support for easy Windows building.
  • 3.04.01(Feb 16, 2016)

  • 3.04.00(Jul 24, 2015)

