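Installation
Reference: "Centos5.5 安装Tesseract-OCR" (Installing Tesseract-OCR on CentOS 5.5; local backup)
CentOS 5.5 and 6.7 differ quite a bit, but fortunately the software to install has not changed much. In the end I installed: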
- leptonica-1.69.tar.gz
- tesseract-ocr-3.02.02.tar.gz
- tesseract-ocr-3.02.eng.tar.gz
- tesseract-ocr-3.02.chi_tra.tar.gz
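Follow the installation steps faithfully: install the dependencies first, then compile, and the installation goes through without trouble.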
    yum install libjpeg-devel libpng-devel libtiff-devel zlib-devel
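I am spelling this out because I assumed my environment, with all the software already installed on it, would not be missing such basic components. It turned out they were missing after all .... = =a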
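The build order described above (dependencies first, then Leptonica, then Tesseract) can be sketched as a small shell script. This is only a sketch under assumptions, not the exact steps from the referenced article: it assumes the tarballs sit in the current directory, that they unpack to the directory names passed in, and that the usual `./configure && make && make install` sequence applies. `DRY_RUN`, `run`, `build_tarball` and `install_all` are hypothetical names introduced here so the commands can be previewed before running them as root.

```shell
#!/bin/sh
# Sketch of the build order: dependencies -> Leptonica -> Tesseract.
# DRY_RUN and run() are hypothetical helpers, not part of any tool.
DRY_RUN="${DRY_RUN:-1}"   # default: only print the commands

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# build_tarball TARBALL SRCDIR: unpack, configure, compile, install.
# SRCDIR is the directory the tarball unpacks to; this is an assumption,
# check it first with `tar tzf TARBALL | head`.
build_tarball() {
    run tar xzf "$1" || return 1
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ cd $2 && ./configure && make && make install"
    else
        ( cd "$2" && ./configure && make && make install ) || return 1
    fi
}

install_all() {
    run yum install libjpeg-devel libpng-devel libtiff-devel zlib-devel || return 1
    build_tarball leptonica-1.69.tar.gz leptonica-1.69 || return 1
    build_tarball tesseract-ocr-3.02.02.tar.gz tesseract-ocr || return 1
    run ldconfig   # refresh the shared-library cache after installing
}
```

Source the file and call `install_all`; with `DRY_RUN=1` (the default) it only prints what it would do. The two language tarballs still have to be unpacked so that their `.traineddata` files end up in the installed `tessdata` directory; that step is left out because the target path depends on the configure prefix.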
Usage
Just type the command.
    Usage:tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]

    pagesegmode values are:
    0 = Orientation and script detection (OSD) only.
    1 = Automatic page segmentation with OSD.
    2 = Automatic page segmentation, but no OSD, or OCR
    3 = Fully automatic page segmentation, but no OSD. (Default)
    4 = Assume a single column of text of variable sizes.
    5 = Assume a single uniform block of vertically aligned text.
    6 = Assume a single uniform block of text.
    7 = Treat the image as a single text line.
    8 = Treat the image as a single word.
    9 = Treat the image as a single word in a circle.
    10 = Treat the image as a single character.
    -l lang and/or -psm pagesegmode must occur before anyconfigfile.

    Single options:
      -v --version: version info
      --list-langs: list available languages for tesseract engine
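As an illustration of the `-psm` option in the usage text, the sketch below maps a rough description of the image layout to a page segmentation mode and prints the resulting command line. The function name `psm_for`, the layout keywords and the path `/tmp/line.tif` are hypothetical; only the mode numbers come from the usage text.

```shell
# psm_for LAYOUT: map a (hypothetical) layout keyword to a -psm value
# taken from the usage text.
psm_for() {
    case "$1" in
        column) echo 4 ;;   # single column of text of variable sizes
        block)  echo 6 ;;   # single uniform block of text
        line)   echo 7 ;;   # single text line
        word)   echo 8 ;;   # single word
        char)   echo 10 ;;  # single character
        *)      echo 3 ;;   # fully automatic, no OSD (the default)
    esac
}

# Print the command for a one-line image; remove `echo` to run it.
# -l and -psm go before any config file, as the usage text requires.
echo tesseract /tmp/line.tif /tmp/line -l eng -psm "$(psm_for line)"
```

Newer Tesseract releases spell this option `--psm`, so check the usage text of the installed version before reusing the command.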
--
tesseract <image to recognize> <output text file name> -l <recognition language>
    tesseract /tmp/phototest.tif /tmp/output -l eng
The output file automatically gets the .txt extension.
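Because the tool appends `.txt` to the output base on its own, a batch run only has to strip the image extension. A minimal sketch, assuming a directory of `.tif` images; `outbase_for`, `ocr_dir` and the `OCR_CMD` override are hypothetical names introduced here.

```shell
# outbase_for IMAGE: output base = image path without its extension;
# tesseract itself adds ".txt" to this base.
outbase_for() {
    echo "${1%.*}"
}

# ocr_dir DIR [LANG]: recognize every .tif in DIR (LANG defaults to eng).
# OCR_CMD (hypothetical) lets you swap in `echo tesseract` for a dry run.
ocr_dir() {
    dir="$1"
    lang="${2:-eng}"
    for img in "$dir"/*.tif; do
        [ -e "$img" ] || continue   # no matches: the glob stays literal
        ${OCR_CMD:-tesseract} "$img" "$(outbase_for "$img")" -l "$lang"
    done
}
```

For example, `OCR_CMD="echo tesseract" ocr_dir /tmp eng` prints one command per image without running anything.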
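phototest.tif is the test image bundled with the package; you can view it here.
Since the Traditional Chinese language data is installed as well, the same image can of course be recognized with it instead: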
    tesseract phototest.tif output -l chi_tra
The accuracy is quite poor this way, though. The result looks like this:
    ThiS iS a |0t of T2 point teXt to teSt the
    oc「 c0de and see if it WorkS 0n a|| typeS
    of fi|e f0「mat'

    ˉ|ˉhe quick br0Wn do9 jumped oVe「 the
    |aZy fo)(_ The quick broWn do9 jumped
    oVer the |aZy f0X_ ˉ|ˉhe quick br0Wn do9
    jumped 0Ver the |aZy f0X_ ˉ「he quick
    br0Wn do9 jumped oVe「 the |aZy fo)(_
If you cannot read that, here is the result with eng:
    This is a lot of 12 point text to test the
    ocr code and see if it works on all types
    of file format.

    The quick brown dog jumped over the
    lazy fox. The quick brown dog jumped
    over the lazy fox. The quick brown dog
    jumped over the lazy fox. The quick
    brown dog jumped over the lazy fox.
On improving recognition accuracy
--
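Portable use (no installation)
The files produced by an installation can simply be copied out and used elsewhere. The one problem you run into is that the tessdata path has to be specified: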
    >tesseract.exe 3.jpg 3 --tessdata-dir .\tessdata -l chi_tra
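Under a Unix-like shell the same call can be wrapped in a small function that always points at the `tessdata` directory sitting next to the copied program. This is a sketch under assumptions: the copied files live in one directory that contains the `tesseract` binary and a `tessdata` subdirectory, and the copied version accepts `--tessdata-dir` as in the command above. The function name `portable_cmd` and the path `/opt/tesseract-portable` are hypothetical.

```shell
# portable_cmd DIR IMAGE OUTBASE LANG: print the command line that runs
# the copied binary in DIR against DIR/tessdata.
portable_cmd() {
    echo "$1/tesseract" "$2" "$3" --tessdata-dir "$1/tessdata" -l "$4"
}

# Example with a hypothetical install directory; this only prints the
# command so it can be inspected before being run by hand.
portable_cmd /opt/tesseract-portable 3.jpg 3 chi_tra
```

Printing the command instead of executing it keeps the sketch harmless; a copied binary may also need its shared libraries to be found, which is not handled here.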
--
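Test results
- Language data files cannot be shared between different versions
- Different language data files give different recognition accuracy
- The problems above can be worked around with the portable (no-install) approach
- The 4.0 program can use the 3.05 language data files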
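Since language data is tied to particular versions, it can help to keep one data directory per version and check that the chosen directory really contains the file for the requested language before running OCR. A minimal sketch: it relies on language files being named `<lang>.traineddata`, and the function names `has_lang` and `pick_tessdata` are hypothetical. It only checks that the file is present, not that it is compatible with the program.

```shell
# has_lang TESSDATA_DIR LANG: succeed if LANG.traineddata is present.
has_lang() {
    [ -f "$1/$2.traineddata" ]
}

# pick_tessdata LANG DIR...: print the first directory that holds the
# language file; return non-zero if none does.
pick_tessdata() {
    lang="$1"
    shift
    for d in "$@"; do
        if has_lang "$d" "$lang"; then
            echo "$d"
            return 0
        fi
    done
    return 1
}
```

The printed directory can then be passed to `--tessdata-dir`.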
--
Windows and training
--