This is the step-by-step process I used to install Tesseract 4.0 on Mac OS X, including the fixes and workarounds I needed to make it work.
I'm sharing this "guide" to help other people who may run into the same problems I had.
Special thanks to Shree, who helped me on the Google Group.
Project and more details: https://github.com/tesseract-ocr/tesseract
Where to get help?
Google Group: https://groups.google.com/forum/#!forum/tesseract-ocr
GitHub issues: https://github.com/tesseract-ocr/tesseract/issues
Platform: MAC OS X 10.13.3
Tesseract: 4.0.0-beta.1-69-g10f4
leptonica-1.75.3
libjpeg 9c : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11
Found AVX2
Found AVX
Found SSE
Compiling Tesseract - tesseract 4.0
Reference: https://github.com/tesseract-ocr/tesseract/wiki/Compiling#macos
Warning: Don't install Tesseract using brew, since you can't generate ScrollView.jar from that install! (At least I wasn't able to generate it.)
Steps
1 - Install these libs
brew install automake autoconf autoconf-archive libtool
brew install pkgconfig
brew install icu4c
brew install leptonica
brew install gcc
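Before moving on, it can help to confirm that the build tools really are on the PATH. The sketch below is a minimal check, not part of brew or Tesseract: missing_tools is a made-up helper name, and the command names you pass to it are an assumption about what the formulae above install (for example, the pkgconfig formula provides the pkg-config command).

```shell
# Print every command name from the argument list that is not on the PATH.
# missing_tools is a hypothetical helper made up for this guide.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
```

Calling missing_tools automake autoconf pkg-config prints nothing when all three are found, and prints the missing names otherwise.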
2 - Update the icu4c symlink
ln -hfs /usr/local/Cellar/icu4c/60.2 /usr/local/opt/icu4c
Note: text2image is set to use icu4c/60.2, but the actual version is icu4c/61.1
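Since the version number in that path may not match what is installed on your machine, it may be worth listing the versions present in the Cellar before linking. This is a small sketch under that assumption; list_versions is a made-up helper name, and it only handles simple major.minor folder names.

```shell
# List the version folders found under a Cellar-style directory ($1),
# lowest version first. list_versions is a hypothetical helper.
list_versions() {
  for dir in "$1"/*/; do
    [ -d "$dir" ] && basename "$dir"
  done | sort -t. -k1,1n -k2,2n
}
```

For example, list_versions /usr/local/Cellar/icu4c shows which version folders exist, so you can put the right one in the ln command.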
3 - Clone tesseract repo
git clone https://github.com/tesseract-ocr/tesseract/
4 - Enter the folder
cd tesseract
5 - Run the script
./autogen.sh
6 - Run the command below and copy the CPPFLAGS and LDFLAGS values it prints
brew info icu4c
7 - Update CPPFLAGS and LDFLAGS with the values you copied and run configure
./configure \
CPPFLAGS=-I/usr/local/opt/icu4c/include \
LDFLAGS=-L/usr/local/opt/icu4c/lib
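Instead of copying the two values by hand, the flags can probably be built from the icu4c prefix. The sketch below assumes the include and lib folders sit directly under that prefix, as in the paths above; icu_configure_flags is a made-up helper name.

```shell
# Print the two configure arguments for a given icu4c prefix ($1).
# icu_configure_flags is a hypothetical helper made up for this guide.
icu_configure_flags() {
  printf 'CPPFLAGS=-I%s/include LDFLAGS=-L%s/lib\n' "$1" "$1"
}
```

With brew installed, ./configure $(icu_configure_flags "$(brew --prefix icu4c)") should be equivalent to the command above, as long as the prefix contains no spaces.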
8 - Build
make -j
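Note that make -j without a number puts no limit on the number of parallel jobs, which can make the machine unresponsive during the build. A possible alternative is to pass the number of CPU cores. The sketch below is one way to get that number; cpu_count is a made-up helper name.

```shell
# Print the number of CPU cores: sysctl on Mac OS X, getconf elsewhere,
# and 1 as a last resort. cpu_count is a hypothetical helper.
cpu_count() {
  sysctl -n hw.ncpu 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1
}
```

The build command then becomes make -j "$(cpu_count)".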
9 - Install
sudo make install
10 - Update the dynamic linker cache
sudo update_dyld_shared_cache
Note: this is the Mac OS X counterpart of sudo ldconfig
11 - Build the training tools
make training
Creating ScrollView.jar - tesseract 4.0
Reference:
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#lstmtraining-command-line
https://github.com/tesseract-ocr/tesseract/wiki/ViewerDebugging
Important: Use JDK 8 for the build; otherwise it fails with an error
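To check which JDK is active before building, the major version can be read from the first line of the java -version output. This is only a sketch: java_major is a made-up helper name, and it assumes the version line contains the word version followed by a quoted number, in either the old 1.8.0_161 style or the newer 11.0.2 style.

```shell
# Extract the major Java version from a `java -version` line passed as $1.
# Handles both the old "1.8.0_161" scheme and the newer "11.0.2" scheme.
# java_major is a hypothetical helper made up for this guide.
java_major() {
  printf '%s\n' "$1" | sed -n 's/.*version "\(1\.\)\{0,1\}\([0-9][0-9]*\).*/\2/p'
}
```

Running java_major "$(java -version 2>&1 | head -n 1)" should print 8 when JDK 8 is the active one. On Mac OS X, /usr/libexec/java_home -v 1.8 should print the JDK 8 home folder if one is installed, which can then be exported as JAVA_HOME.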
Steps
1 - Download the files piccolo2d-core-3.0.jar and piccolo2d-extras-3.0.jar
http://search.maven.org/remotecontent?filepath=org/piccolo2d/piccolo2d-core/3.0/piccolo2d-core-3.0.jar
http://search.maven.org/remotecontent?filepath=org/piccolo2d/piccolo2d-extras/3.0/piccolo2d-extras-3.0.jar
2 - Move the files piccolo2d-core-3.0.jar and piccolo2d-extras-3.0.jar to tesseract/java
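The download and the move can probably be combined by fetching both jars straight into tesseract/java with curl. The sketch below builds the two URLs listed in step 1; piccolo_url and fetch_piccolo are made-up helper names, and the download itself needs network access.

```shell
# Build the Maven download URL for one of the two piccolo2d 3.0 jars.
# $1 is "core" or "extras". piccolo_url is a hypothetical helper.
piccolo_url() {
  printf 'http://search.maven.org/remotecontent?filepath=org/piccolo2d/piccolo2d-%s/3.0/piccolo2d-%s-3.0.jar\n' "$1" "$1"
}

# Download both jars into the folder given as $1, following redirects.
# fetch_piccolo is a hypothetical helper; it stops at the first failure.
fetch_piccolo() {
  for part in core extras; do
    curl -L -o "$1/piccolo2d-$part-3.0.jar" "$(piccolo_url "$part")" || return 1
  done
}
```

For the paths used in this guide the call would be fetch_piccolo ~/projects/tesseract/java.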
3 - Enter the tesseract/java folder
cd java
4 - Set SCROLLVIEW_PATH to your tesseract/java folder and build the jar
SCROLLVIEW_PATH=~/projects/tesseract/java make ScrollView.jar
Training Font - tesseract 4.0
Reference: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#user-content-using-tesstrain
Steps
1 - Clone the langdata repo from git
git clone https://github.com/tesseract-ocr/langdata
2 - Go back to the tesseract folder
cd ..
3 - Run this command and pick one font from the list (I recommend "Verdana")
text2image --list_available_fonts --fonts_dir=/Library/Fonts
On a Mac, the font directory can be any of:
~/Library/Fonts
/Library/Fonts/
/Network/Library/Fonts/
/System/Library/Fonts/
/System Folder/Fonts/
More details here: https://support.apple.com/en-us/HT201722
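Not all of these folders exist on every machine, so it may help to filter the list before passing a folder to text2image. A minimal sketch, with existing_dirs as a made-up helper name:

```shell
# Print only the directories from the argument list that exist.
# existing_dirs is a hypothetical helper made up for this guide.
existing_dirs() {
  for dir in "$@"; do
    if [ -d "$dir" ]; then printf '%s\n' "$dir"; fi
  done
}
```

For example, existing_dirs ~/Library/Fonts /Library/Fonts /System/Library/Fonts prints the folders you can try with --fonts_dir.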
4 - Replace line 195 of tesseract/training/tesstrain_utils.sh, changing the first line below into the second
- export FONT_CONFIG_CACHE=$(mktemp -d --tmpdir font_tmp.XXXXXXXXXX)
+ export FONT_CONFIG_CACHE=$(mktemp -d -t font_tmp.XXXXXXXXXX)
Note: this fixes the following error, which appears because the mktemp shipped with Mac OS X does not accept --tmpdir:
mktemp: illegal option -- -
usage: mktemp [-d] [-q] [-t prefix] [-u] template ...
mktemp [-d] [-q] [-u] -t prefix
/Users/username/projects/tesseract/training/tesstrain_utils.sh: line 197: /sample_text.txt: Permission denied
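The line number may shift between versions of the script, so the same change can be made by matching the text instead. This is a sketch, not an official patch: patch_mktemp is a made-up helper name, and it assumes the line still contains the exact text mktemp -d --tmpdir font_tmp.

```shell
# Replace the "--tmpdir" form with "-t" in the file given as $1,
# keeping the original as a .bak backup next to it.
# "-i.bak" (no space) is accepted by both BSD sed (Mac OS X) and GNU sed.
# patch_mktemp is a hypothetical helper made up for this guide.
patch_mktemp() {
  sed -i.bak 's/mktemp -d --tmpdir font_tmp/mktemp -d -t font_tmp/' "$1"
}
```

For the paths used in this guide the call would be patch_mktemp ~/projects/tesseract/training/tesstrain_utils.sh.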
5 - Clone a tessdata repo from git (I recommend "tessdata_best" since it is the most accurate; "tessdata_fast" is just faster)
git clone https://github.com/tesseract-ocr/tessdata_best
or
git clone https://github.com/tesseract-ocr/tessdata_fast
6 - Copy tessdata_best/eng.traineddata (for English training) from the tessdata repo you just cloned and paste it into tesseract/tessdata/
7 - Create the training data
PANGOCAIRO_BACKEND=fc \
~/projects/tesseract/training/tesstrain.sh \
--fonts_dir /Library/Fonts \
--lang eng \
--linedata_only \
--noextract_font_properties \
--exposures "0" \
--langdata_dir ~/projects/langdata \
--tessdata_dir ~/projects/tesseract/tessdata \
--fontlist "Verdana" \
--output_dir ~/tesstutorial/engtrain
Add the PANGOCAIRO_BACKEND=fc prefix if you are on Mac OS X
8 - Create a second set of training data with a different font, to compare against
PANGOCAIRO_BACKEND=fc \
~/projects/tesseract/training/tesstrain.sh \
--fonts_dir /Library/Fonts \
--lang eng \
--linedata_only \
--noextract_font_properties \
--exposures "0" \
--langdata_dir ~/projects/langdata \
--tessdata_dir ~/projects/tesseract/tessdata \
--fontlist "Times New Roman," \
--output_dir ~/tesstutorial/engeval
Add the PANGOCAIRO_BACKEND=fc prefix if you are on Mac OS X
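Steps 7 and 8 run the same command with only the font and the output folder changed, so they could be wrapped in one function. This is a sketch under the paths used in this guide: run_tesstrain, TESS_HOME and DRY_RUN are made-up names, not options of tesstrain.sh.

```shell
# Location of the tesseract checkout; adjust to your own path.
# TESS_HOME is a hypothetical variable made up for this guide.
TESS_HOME="${TESS_HOME:-$HOME/projects/tesseract}"

# Run tesstrain.sh for one font ($1) into one output folder ($2).
# With DRY_RUN=1 the command is printed instead of executed.
run_tesstrain() {
  font=$1
  out=$2
  # Rebuild the positional parameters as the full command line.
  set -- "$TESS_HOME/training/tesstrain.sh" \
    --fonts_dir /Library/Fonts \
    --lang eng \
    --linedata_only \
    --noextract_font_properties \
    --exposures "0" \
    --langdata_dir "$HOME/projects/langdata" \
    --tessdata_dir "$TESS_HOME/tessdata" \
    --fontlist "$font" \
    --output_dir "$out"
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf '%s\n' "$*"
  else
    PANGOCAIRO_BACKEND=fc "$@"
  fi
}
```

The two steps then become run_tesstrain "Verdana" ~/tesstutorial/engtrain and run_tesstrain "Times New Roman," ~/tesstutorial/engeval. Prefixing a call with DRY_RUN=1 shows the command without running it.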
9 - Create the needed folder
mkdir -p ~/tesstutorial/engoutput
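Before starting a long training run, it can be worth checking that the list files created by the previous steps point to files that exist. The sketch below assumes eng.training_files.txt contains one file path per line; missing_from_listfile is a made-up helper name.

```shell
# Print every path listed in the file $1 that does not exist on disk.
# Assumes one path per line; empty lines are skipped.
# missing_from_listfile is a hypothetical helper made up for this guide.
missing_from_listfile() {
  while IFS= read -r path; do
    [ -z "$path" ] || [ -e "$path" ] || printf '%s\n' "$path"
  done < "$1"
}
```

If missing_from_listfile ~/tesstutorial/engtrain/eng.training_files.txt prints nothing, every listed file was found.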
10 - Start the training
SCROLLVIEW_PATH=~/projects/tesseract/java \
~/projects/tesseract/training/lstmtraining \
--debug_interval 100 \
--traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
--net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
--model_output ~/tesstutorial/engoutput/base \
--learning_rate 20e-4 \
--train_listfile ~/tesstutorial/engtrain/eng.training_files.txt \
--eval_listfile ~/tesstutorial/engeval/eng.training_files.txt \
--max_iterations 5000 &>~/tesstutorial/engoutput/basetrain.log
If you failed to build ScrollView.jar, set debug_interval to -1: --debug_interval -1
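That choice can also be made automatically by checking whether the jar exists. A small sketch, with debug_interval_for as a made-up helper name and 100 taken from the command above:

```shell
# Print 100 when ScrollView.jar exists in the folder $1, otherwise -1.
# debug_interval_for is a hypothetical helper made up for this guide.
debug_interval_for() {
  if [ -f "$1/ScrollView.jar" ]; then
    printf '%s\n' 100
  else
    printf '%s\n' -1
  fi
}
```

In the lstmtraining command this would be used as --debug_interval "$(debug_interval_for ~/projects/tesseract/java)".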
11 - Monitor the log on another console
tail -f ~/tesstutorial/engoutput/basetrain.log
12 - Test Accuracy with other font
~/projects/tesseract/training/lstmeval \
--model ~/tesstutorial/engoutput/base_checkpoint \
--traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
--eval_listfile ~/tesstutorial/engeval/eng.training_files.txt
13 - Test Accuracy with the best traineddata
~/projects/tesseract/training/lstmeval \
--model ~/projects/tessdata_best/eng.traineddata \
--eval_listfile ~/tesstutorial/engeval/eng.training_files.txt
14 - Test Accuracy with the current traineddata (in this case the same model as step 13)
~/projects/tesseract/training/lstmeval \
--model ~/projects/tesseract/tessdata/eng.traineddata \
--eval_listfile ~/tesstutorial/engtrain/eng.training_files.txt
Fine tuning - tesseract 4.0
Reference: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for-impact
Steps
1 - Create the necessary folder
mkdir -p ~/tesstutorial/verdana_from_small
2 - Start fine tuning
~/projects/tesseract/training/lstmtraining \
--model_output ~/tesstutorial/verdana_from_small/verdana \
--continue_from ~/tesstutorial/engoutput/base_checkpoint \
--traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
--train_listfile ~/tesstutorial/engeval/eng.training_files.txt \
--max_iterations 1200
3 - Validate the progress
~/projects/tesseract/training/lstmeval \
--model ~/tesstutorial/verdana_from_small/verdana_checkpoint \
--traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
--eval_listfile ~/tesstutorial/engeval/eng.training_files.txt
4 - Create the necessary folder
mkdir -p ~/tesstutorial/verdana_from_full
5 - Extract the LSTM model from the existing traineddata
~/projects/tesseract/training/combine_tessdata \
-e ~/projects/tesseract/tessdata/eng.traineddata \
~/tesstutorial/verdana_from_full/eng.lstm
6 - Fine tune from the extracted model
~/projects/tesseract/training/lstmtraining \
--model_output ~/tesstutorial/verdana_from_full/verdana \
--continue_from ~/tesstutorial/verdana_from_full/eng.lstm \
--traineddata ~/projects/tesseract/tessdata/eng.traineddata \
--train_listfile ~/tesstutorial/engeval/eng.training_files.txt \
--max_iterations 400
7 - Validate the results on the main training file
~/projects/tesseract/training/lstmeval \
--model ~/tesstutorial/verdana_from_full/verdana_checkpoint \
--traineddata ~/projects/tesseract/tessdata/eng.traineddata \
--eval_listfile ~/tesstutorial/engeval/eng.training_files.txt
8 - Validate the results on our training file
~/projects/tesseract/training/lstmeval \
--model ~/tesstutorial/verdana_from_full/verdana_checkpoint \
--traineddata ~/projects/tesseract/tessdata/eng.traineddata \
--eval_listfile ~/tesstutorial/engtrain/eng.training_files.txt
Fine tuning add ± character - tesseract 4.0
Reference: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters
Steps
1 - Modify langdata/eng/eng.training_text and include these lines:
alkoxy of LEAVES ±1.84% by Buying curved RESISTANCE MARKED Your (Vol. SPANIEL
TRAVELED ±85¢ , reliable Events THOUSANDS TRADITIONS. ANTI-US Bedroom Leadership
Inc. with DESIGNS self; ball changed. MANHATTAN Harvey's ±1.31 POPSET Os—C(11)
VOLVO abdomen, ±65°C, AEROMEXICO SUMMONER = (1961) About WASHING Missouri
PATENTSCOPE® # © HOME SECOND HAI Business most COLETTI, ±14¢ Flujo Gilbert
Dresdner Yesterday's Dilated SYSTEMS Your FOUR ±90° Gogol PARTIALLY BOARDS firm
Email ACTUAL QUEENSLAND Carl's Unruly ±8.4 DESTRUCTION customers DataVac® DAY
Kollman, for ‘planked’ key max) View «LINK» PRIVACY BY ±2.96% Ask! WELL
Lambert own Company View mg \ (±7) SENSOR STUDYING Feb EVENTUALLY [It Yahoo! Tv
United by #DEFINE Rebel PERFORMED ±500Gb Oliver Forums Many | ©2003-2008 Used OF
Avoidance Moosejaw pm* ±18 note: PROBE Jailbroken RAISE Fountains Write Goods (±6)
Oberflachen source.” CULTURED CUTTING Home 06-13-2008, § ±44.01189673355 €
netting Bookmark of WE MORE) STRENGTH IDENTICAL ±2? activity PROPERTY MAINTAINED
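After editing the file, a quick count of the lines containing the new character confirms that the additions were saved. A minimal sketch, with count_lines_with as a made-up helper name:

```shell
# Print how many lines of the file $1 contain the literal string $2.
# -F treats the pattern as plain text rather than a regular expression.
# count_lines_with is a hypothetical helper made up for this guide.
count_lines_with() {
  grep -c -F -- "$2" "$1"
}
```

For example, count_lines_with ~/projects/langdata/eng/eng.training_text '±' should print at least 13 once the lines above are included.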
2 - Generate the training file
PANGOCAIRO_BACKEND=fc \
~/projects/tesseract/training/tesstrain.sh \
--fonts_dir /Library/Fonts \
--lang eng \
--linedata_only \
--noextract_font_properties \
--langdata_dir ~/projects/langdata \
--tessdata_dir ~/projects/tesseract/tessdata \
--fontlist "Times New Roman," \
"Times New Roman, Bold" \
"Times New Roman, Bold Italic" \
"Times New Roman, Italic" \
"Courier New" \
"Courier New Bold" \
"Courier New Bold Italic" \
"Courier New Italic" \
--output_dir ~/tesstutorial/trainplusminus
3 - Generate the eval data
PANGOCAIRO_BACKEND=fc \
~/projects/tesseract/training/tesstrain.sh \
--fonts_dir /Library/Fonts \
--lang eng \
--linedata_only \
--noextract_font_properties \
--langdata_dir ~/projects/langdata \
--tessdata_dir ~/projects/tesseract/tessdata \
--fontlist "Verdana" \
--output_dir ~/tesstutorial/evalplusminus
4 - Extract the LSTM model from the existing traineddata
~/projects/tesseract/training/combine_tessdata \
-e ~/projects/tesseract/tessdata/eng.traineddata \
~/tesstutorial/trainplusminus/eng.lstm
5 - Fine tuning
~/projects/tesseract/training/lstmtraining \
--model_output ~/tesstutorial/trainplusminus/plusminus \
--continue_from ~/tesstutorial/trainplusminus/eng.lstm \
--traineddata ~/tesstutorial/trainplusminus/eng/eng.traineddata \
--old_traineddata ~/projects/tesseract/tessdata/eng.traineddata \
--train_listfile ~/tesstutorial/trainplusminus/eng.training_files.txt \
--max_iterations 3600
6 - Test the result on other fonts
~/projects/tesseract/training/lstmeval \
--model ~/tesstutorial/trainplusminus/plusminus_checkpoint \
--traineddata ~/tesstutorial/trainplusminus/eng/eng.traineddata \
--eval_listfile ~/tesstutorial/trainplusminus/eng.training_files.txt
7 - Test the result on the main font
~/projects/tesseract/training/lstmeval \
--model ~/tesstutorial/trainplusminus/plusminus_checkpoint \
--traineddata ~/tesstutorial/trainplusminus/eng/eng.traineddata \
--eval_listfile ~/tesstutorial/evalplusminus/eng.training_files.txt