{{From|https://help.ubuntu.com/community/OCR}}
{{Languages|UbuntuHelp:OCR}}
== OCR - Optical Character Recognition ==
OCR is a technology that converts scanned images of text into plain text. This lets you save space, edit the text, and search or index it.

== Available OCR tools ==
The Ubuntu Universe repositories contain the following OCR tools:
* [http://code.google.com/p/tesseract-ocr/ tesseract-ocr]
* [http://www.gnu.org/software/ocrad/ocrad.html ocrad]
* [http://jocr.sourceforge.net/ gocr]

=== Tesseract ===
==== Introduction ====
Tesseract arguably produces the most accurate results of these tools. It was initially developed by HP Labs between 1985 and 1995 and was open-sourced in 2005.

Tesseract can recognize text in 7 languages: English, German, French, Italian, Spanish, Brazilian Portuguese and Dutch. You can install more than one dictionary if needed.

It does not support layout analysis, so multi-column text, images, equations and similar content will likely produce garbled output. It also only supports TIFF images as input.

==== Usage ====
Tesseract is currently a command-line-only tool (although an integration with OCROpus for a GUI is being worked on). After successful installation, the command to use is <code><nowiki>tesseract <path to tiff image> <output file></nowiki></code>. It is critical that the TIFF image has a ".tif" extension and not a ".tiff" extension.

The command line should look like this example:

<code><nowiki>$ tesseract ~/input.tif output</nowiki></code>

Here <code><nowiki>input.tif</nowiki></code> is the document to be converted, located in your home folder, and <code><nowiki>output</nowiki></code> is the name of the document that Tesseract will create as <code><nowiki>output.txt</nowiki></code>. Tesseract adds the <code><nowiki>.txt</nowiki></code> file extension automatically.
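Because the ".tif" extension matters, it can help to normalise file names before calling Tesseract. The following is only a minimal sketch, not part of Tesseract itself: the function names <code><nowiki>tif_name</nowiki></code>, <code><nowiki>ocr_base</nowiki></code> and <code><nowiki>ocr_one</nowiki></code> are hypothetical helpers introduced for illustration, and the only Tesseract invocation used is the two-argument form shown above.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Tesseract), shown only as a sketch.

# Print the file name with a trailing ".tiff" rewritten to ".tif";
# any other name is printed unchanged.
tif_name() {
    case "$1" in
        *.tiff) printf '%s\n' "${1%.tiff}.tif" ;;
        *)      printf '%s\n' "$1" ;;
    esac
}

# Print the output base name for a given image: the file name without
# its directory and without its last extension.
ocr_base() {
    base=${1##*/}
    printf '%s\n' "${base%.*}"
}

# Rename the image if needed, then run Tesseract on it.
# The text ends up in <base>.txt in the current directory.
ocr_one() {
    src=$1
    fixed=$(tif_name "$src")
    if [ "$src" != "$fixed" ]; then
        mv -- "$src" "$fixed"
    fi
    tesseract "$fixed" "$(ocr_base "$fixed")"
}
```

With these functions loaded, <code><nowiki>ocr_one ~/scan.tiff</nowiki></code> would first rename the file to <code><nowiki>scan.tif</nowiki></code> and then write the recognized text to <code><nowiki>scan.txt</nowiki></code> in the current directory.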
==== Preparing images for Tesseract ====
Tesseract is not very flexible about the format of its input images. It will only accept TIFF images. According to user reports, compressed TIFF images are quite problematic, and the same goes for grey-scale and colour images. You are therefore better off with single-bit uncompressed TIFF images.

The process to prepare them with GIMP is very simple:
<ol><li>Go to the Image→Mode menu and make sure the image is in RGB or Grayscale mode.
</li><li>Select Tools→Color Tools→Threshold from the menu and choose an adequate threshold value.
</li><li>Select Image→Mode→Indexed from the menu and, from the options, choose 1-bit and no dithering.
</li><li>Save the image in TIFF format with a .tif extension.</li></ol>

==== Using Tesseract With a Multi Page PDF ====
Scanned documents are often stored as raster images in a large PDF document. Using [[UbuntuHelp:ImageMagick|ImageMagick]], the individual pages can be extracted as TIFF files and then processed with Tesseract. The following script can help automate this process:
<pre><nowiki>
#!/bin/sh
PAGES=100          # set to the number of pages in the PDF
SOURCE=book.pdf    # set to the file name of the PDF
OUTPUT=book.txt    # set to the final output file
RESOLUTION=600     # set to the resolution the scanner used (the higher, the better)

# Make sure the output file exists before appending to it
touch "$OUTPUT"
for i in $(seq 1 "$PAGES"); do
    # ImageMagick counts PDF pages from 0, so page i is index i-1
    convert -monochrome -density "$RESOLUTION" "$SOURCE[$((i - 1))]" "page$i.tif"
    # Tesseract writes the recognized text to page$i.txt
    tesseract "page$i.tif" "page$i"
    # Append this page's text to the output, then remove the temporary files
    cat "page$i.txt" >> "$OUTPUT"
    rm "page$i.tif" "page$i.txt"
done
</nowiki></pre>
After running this script, the OCR text should be contained in <code><nowiki>book.txt</nowiki></code> (or whatever you set <code><nowiki>$OUTPUT</nowiki></code> to be).

=== Cuneiform ===
==== Introduction ====
Cuneiform is another OCR system, originally developed and open-sourced by Cognitive Technologies. Its Linux port is being developed on [http://launchpad.net/cuneiform-linux Launchpad].
==== Using pdfocr With a Multi Page PDF ====
pdfocr is a script that uses Cuneiform to perform OCR on multi-page PDF files and to embed the recognized text back into the PDF file as a searchable text layer. The script itself can be obtained from [http://github.com/gkovacs/pdfocr/raw/master/pdfocr.rb Github] or from the [http://launchpad.net/~gezakovacs/+archive/pdfocr PPA]. To use it, run:
<pre><nowiki>
pdfocr -i input.pdf -o output.pdf
</nowiki></pre>

== Further Reading ==
* [http://www.linuxjournal.com/article/9676 A LinuxJournal article on Tesseract]
----
[[category:CategoryCommandLine]] [[category:UbuntuHelp]]