Downloaded and tried tesseract and cuneiform, and both fail to work on any of the pdf images I have. These images are NOT encrypted as they are public documents like from the DMV, ... etc.
Does anyone know of a free OCR program for linux, that WORX :) ?
If you have pdf files with actual characters, the pdftotext tool works well for extracting the text (though not necessarily the layout).
As far as doing OCR from actual image files, I always found tesseract to work better than most (but it was still pretty feeble).
On 12/15/2015 03:00 PM, Tom Horsley wrote:
If you have pdf files with actual characters, the pdftotext tool works well for extracting the text (though not necessarily the layout).
There is an option: -layout. It does a good job of preserving the layout.
David
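For anyone scripting this, here is a minimal sketch of calling pdftotext with -layout from Python. It assumes pdftotext (from poppler-utils) is on the PATH; the function names and file names are made up for illustration and are not from the thread. It only helps with PDFs that contain real text, not scanned images.

```python
import shutil
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, txt_path, keep_layout=True):
    """Build the argument list for poppler's pdftotext (hypothetical helper)."""
    cmd = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to keep the physical layout of the page
        cmd.append("-layout")
    cmd += [str(pdf_path), str(txt_path)]
    return cmd


def extract_text(pdf_path, txt_path, keep_layout=True):
    """Run pdftotext and return the extracted text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; install poppler-utils")
    subprocess.run(pdftotext_command(pdf_path, txt_path, keep_layout), check=True)
    return Path(txt_path).read_text(errors="replace")
```

Usage would be something like extract_text("form.pdf", "form.txt"). If the result is empty, the PDF is probably a scanned image and needs OCR instead.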
On 12/15/2015 02:00 PM, Tom Horsley wrote:
If you have pdf files with actual characters, the pdftotext tool works well for extracting the text (though not necessarily the layout).
As far as doing OCR from actual image files, I always found tesseract to work better than most (but it was still pretty feeble).
Thanx. Downloaded and tested tesseract. It failed TOTALLY on EVERY image file created from the PDFs.
On 12/15/2015 12:45 PM, jd1008 wrote:
Does anyone know of a free OCR program for linux, that WORX
My wife used a tesseract-ocr frontend (gimagereader, on Windows) successfully. There is a list of others: https://code.google.com/p/tesseract-ocr/wiki/3rdParty
On 12/15/2015 02:37 PM, Gordon Messmer wrote:
On 12/15/2015 12:45 PM, jd1008 wrote:
Does anyone know of a free OCR program for linux, that WORX
My wife used a tesseract-ocr frontend (gimagereader, on Windows) successfully. There is a list of others: https://code.google.com/p/tesseract-ocr/wiki/3rdParty
I have to make a decision on this. Of the 2 packages:
  ReadIris 14 Pro
  ABBYY FineReader 12 Corp Edition-2
which one is better for converting pdf document images to text?
On Tue, Dec 15, 2015 at 01:45:20PM -0700, jd1008 wrote:
Downloaded and tried tesseract and cuneiform, and both fail to work on any of the pdf images I have. These images are NOT encrypted as they are public documents like from the DMV, ... etc.
Last time I used tesseract (4 or 5 years ago, perhaps) it processed ONLY TIFF images. You need to convert other forms of image to a TIFF. Having done that, I found that it worked pretty well. In the intervening time it may have undergone improvements, as well.
On Tue, Dec 15, 2015 at 09:44:01PM -0500, Fred Smith wrote:
On Tue, Dec 15, 2015 at 01:45:20PM -0700, jd1008 wrote:
Downloaded and tried tesseract and cuneiform, and both fail to work on any of the pdf images I have. These images are NOT encrypted as they are public documents like from the DMV, ... etc.
Last time I used tesseract (4 or 5 years ago, perhaps) it processed ONLY TIFF images. You need to convert other forms of image to a TIFF. Having done that, I found that it worked pretty well. In the intervening time it may have undergone improvements, as well.
Also, it seemed to work best at certain resolutions; if I recall correctly, it was either 150 dpi or 300 dpi.
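The convert-to-TIFF-then-OCR workflow described above could be sketched roughly as follows. This is only an illustration under some assumptions: it uses pdftoppm (poppler-utils) to rasterize each page to TIFF and the tesseract command line to read it, both assumed to be installed; the function names, the "page" prefix, and the 300 dpi default are hypothetical choices (the thread only recalls "150 dpi or 300 dpi"), so the resolution is worth experimenting with.

```python
import subprocess
from pathlib import Path


def rasterize_command(pdf_path, out_prefix, dpi=300):
    """pdftoppm writes one TIFF per page, named <out_prefix>-<page>.tif."""
    return ["pdftoppm", "-r", str(dpi), "-tiff", str(pdf_path), str(out_prefix)]


def ocr_command(image_path, out_base, lang="eng"):
    """tesseract appends '.txt' to out_base on its own."""
    return ["tesseract", str(image_path), str(out_base), "-l", lang]


def ocr_pdf(pdf_path, work_dir, dpi=300, lang="eng"):
    """Rasterize every page of a PDF, OCR each page, return the joined text."""
    work = Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)
    prefix = work / "page"
    subprocess.run(rasterize_command(pdf_path, prefix, dpi), check=True)
    texts = []
    # Page numbers are zero-padded by pdftoppm, so a plain sort keeps page order.
    for tif in sorted(work.glob("page-*.tif*")):
        out_base = tif.with_suffix("")
        subprocess.run(ocr_command(tif, out_base, lang), check=True)
        texts.append(out_base.with_suffix(".txt").read_text(errors="replace"))
    return "\n".join(texts)
```

Calling ocr_pdf("dmv_form.pdf", "ocr_work", dpi=150) and then again with dpi=300 is a cheap way to see which resolution the scans respond to best.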
Allegedly, on or about 15 December 2015, jd1008 sent:
Downloaded and tried tesseract and cuneiform, and both fail to work on any of the pdf images I have. These images are NOT encrypted as they are public documents like from the DMV, ... etc.
But are they good quality images? OCR needs a reasonable resolution, *and* clean character definition.
On Wed, Dec 16, 2015 at 01:36:40PM +1030, Tim wrote:
Allegedly, on or about 15 December 2015, jd1008 sent:
Downloaded and tried tesseract and cuneiform, and both fail to work on any of the pdf images I have. These images are NOT encrypted as they are public documents like from the DMV, ... etc.
But are they good quality images? OCR needs a reasonable resolution, *and* clean character definition.
When I was using tesseract a few years ago (as mentioned earlier in this thread) I was getting PDFs made of scanned legal documents (from Groklaw, documents from the SCO v IBM case). These were pretty awful quality, as if they had been scanned at some terribly low resolution from what may have been poor quality originals (or copies thereof). They were very messy to look at, but tesseract could read most of it fairly well. But converting to higher-resolution TIFF files actually made the OCR work more poorly, odd as that may seem.
On 12/15/2015 10:22 PM, Doug wrote:
On 12/15/2015 11:35 PM, Fred Smith wrote:
On Wed, Dec 16, 2015 at 01:36:40PM +1030, Tim wrote:
Allegedly, on or about 15 December 2015, jd1008 sent:
Downloaded and tried tesseract and cuneiform, and both fail to work on any of the pdf images I have. These images are NOT encrypted as they are public documents like from the DMV, ... etc.
But are they good quality images? OCR needs a reasonable resolution, *and* clean character definition.
When I was using tesseract a few years ago (as mentioned earlier in this thread) I was getting PDFs made of scanned legal documents (from Groklaw, documents from the SCO v IBM case). These were pretty awful quality, as if they had been scanned at some terribly low resolution from what may have been poor quality originals (or copies thereof). They were very messy to look at, but tesseract could read most of it fairly well. But converting to higher-resolution TIFF files actually made the OCR work more poorly, odd as that may seem.
About 3 or so years ago, I tried tesseract and it was only about 80% accurate on good quality print. I tried the Windows program ABBYY and it was virtually perfect. So if you have a long document, or a bunch of documents, spend the money and find a Windows machine if you don't have one. You'd be surprised how much work it is to go thru and change "1"s to "l"s or vice versa, or "i"s. And that's just one example. And I can almost guarantee you'll miss a couple! (If the document is not too long, it might be worthwhile to have someone read it to you and type it in by hand!)
Looking at Google output, I find a free on-line service, but there might be a problem sending a scanned file due to file size--check with your ISP. The url is: www.onlineocr.net
As they say on the net, YMMV!
--doug
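On the "1" versus "l" cleanup mentioned above: a small script can at least point at the likely trouble spots so they are harder to miss. The following is a rough heuristic sketch, not a fix. It flags words that mix letters and digits and contain one of a few look-alike characters; the character set and the function name are assumptions for illustration, and it will also flag legitimate tokens like "10th", so the output is meant for a human to review.

```python
import re

# Characters OCR output commonly mixes up (an assumed, incomplete set).
AMBIGUOUS = set("1lI|0O")
TOKEN = re.compile(r"\S+")


def suspicious_tokens(text):
    """Return (line_number, token) pairs worth a manual look.

    A token is flagged when it contains both letters and digits and at
    least one character from AMBIGUOUS.
    """
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for tok in TOKEN.findall(line):
            # Drop surrounding punctuation before inspecting the word.
            core = tok.strip(".,;:!?()[]\"'")
            has_alpha = any(c.isalpha() for c in core)
            has_digit = any(c.isdigit() for c in core)
            if has_alpha and has_digit and AMBIGUOUS & set(core):
                hits.append((lineno, core))
    return hits
```

Feeding it the OCR output gives a list of line numbers and words to check against the original scan; it cannot catch an "l" misread as "I" inside an otherwise all-letter word.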
Thanx Doug. Will try the windoze approach with ABBYY. Is ABBYY available for Linux? Guess it might work with Wine.