A reliable ocr program for Fedora - users - Fedora Mailing-Lists

A reliable ocr program for Fedora

JD

Tuesday, 15 December 2015

2:45 p.m.

Downloaded and tried tesseract and cuneiform, and both fail to work on any of the PDF images I have. These images are NOT encrypted, as they are public documents such as those from the DMV, etc. Does anyone know of a free OCR program for Linux that WORX? :)

Tom Horsley

Tuesday, 15 December

3 p.m.

If you have PDF files with actual characters (that is, an embedded text layer rather than just scanned page images), the pdftotext tool works well for extracting the text (though not necessarily the layout). As far as doing OCR from actual image files, I always found tesseract to work better than most (but it was still pretty feeble).

dcw

3:11 p.m.

On 12/15/2015 03:00 PM, Tom Horsley wrote:

> If you have pdf files with actual characters, the pdftotext tool works well for extracting the text (though not necessarily the layout).

There is an option, -layout, that does a good job of preserving the layout.

David

> As far as doing OCR from actual image files, I always found tesseract to work better than most (but it was still pretty feeble).
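A rough sketch of the pdftotext approach described above, in Python. The function names and the 20-character threshold are illustrative assumptions, not anything from the thread; the only things taken from pdftotext itself are the `-layout` option and the use of `-` as the output name to write to standard output. If almost nothing comes out, the PDF is probably a set of scanned images and needs OCR instead.

```python
import shutil
import subprocess
from typing import List


def pdftotext_command(pdf_path: str, keep_layout: bool = True) -> List[str]:
    """Build the pdftotext command line; "-" sends the text to stdout."""
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")  # the option David mentions
    cmd += [pdf_path, "-"]
    return cmd


def has_text_layer(extracted: str, min_chars: int = 20) -> bool:
    """Heuristic (assumed threshold): a scanned PDF yields almost no characters."""
    non_blank = "".join(extracted.split())  # drops spaces, newlines, form feeds
    return len(non_blank) >= min_chars


def extract_text(pdf_path: str, keep_layout: bool = True) -> str:
    """Run pdftotext and return whatever text it finds."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; on Fedora it is in poppler-utils")
    result = subprocess.run(
        pdftotext_command(pdf_path, keep_layout),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Typical use would be `text = extract_text("form.pdf")` followed by `has_text_layer(text)`; a `False` result suggests the file has to go through an OCR tool such as tesseract.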

JD

3:45 p.m.

On 12/15/2015 02:00 PM, Tom Horsley wrote:

> If you have pdf files with actual characters, the pdftotext tool works well for extracting the text (though not necessarily the layout). As far as doing OCR from actual image files, I always found tesseract to work better than most (but it was still pretty feeble).

Thanks. I downloaded and tested tesseract. It failed TOTALLY on EVERY image file created from the PDFs.

Gordon Messmer

3:37 p.m.

On 12/15/2015 12:45 PM, jd1008 wrote:

> Does anyone know of a free OCR program for linux, that WORX

My wife used a tesseract-ocr frontend (gimagereader, on Windows) successfully. There is a list of others: https://code.google.com/p/tesseract-ocr/wiki/3rdParty

JD

Friday, 18 December

4:47 p.m.

On 12/15/2015 02:37 PM, Gordon Messmer wrote:

> On 12/15/2015 12:45 PM, jd1008 wrote:
>> Does anyone know of a free OCR program for linux, that WORX
> My wife used a tesseract-ocr frontend (gimagereader, on Windows) successfully. There are a list of others: https://code.google.com/p/tesseract-ocr/wiki/3rdParty

I have to make a decision on this. Of these two packages:

ReadIris 14 Pro
ABBYY FineReader 12 Corp Edition-2

which one is better for converting PDF document images to text?

Fred Smith

Tuesday, 15 December

8:44 p.m.

On Tue, Dec 15, 2015 at 01:45:20PM -0700, jd1008 wrote:

> Downloaded and tried tesseract and cuneiform, and both fail to work on any of the pdf images I have. These images are NOT encrypted as they are public documents like from the DMV, ... etc.

Last time I used tesseract (4 or 5 years ago, perhaps) it processed ONLY TIFF images. You need to convert other forms of image to a TIFF. Having done that, I found that it worked pretty well. In the intervening time it may have undergone improvements, as well.

--
Fred Smith -- fredex(a)fcshome.stoneham.ma.us
But God demonstrates his own love for us in this: While we were still sinners, Christ died for us.
Romans 5:8 (niv)

Fred Smith

8:54 p.m.

On Tue, Dec 15, 2015 at 09:44:01PM -0500, Fred Smith wrote:

> On Tue, Dec 15, 2015 at 01:45:20PM -0700, jd1008 wrote:
>> Downloaded and tried tesseract and cuneiform, and both fail to
>> work on any of the pdf images I have. These images are NOT encrypted
>> as they are public documents like from the DMV, ... etc.
> Last time I used tesseract (4 or 5 years, perhaps) it processed ONLY TIFF images. You need to convert other forms of image to a TIFF. Having done that, I found that it worked pretty well. In the intervening time it may have undergone improvements, as well.

Also, it seemed to work best at certain resolutions; if I recall correctly, it was either 150 dpi or 300 dpi.

--
Fred Smith
Home: fredex(a)fcshome.stoneham.ma.us
Jude 1:24,25
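For anyone wanting to try the convert-to-TIFF route, here is a minimal Python sketch, assuming pdftoppm (from poppler-utils) and tesseract are installed. The helper names, the `page` file prefix, and the default of 300 dpi are assumptions made for illustration; 150 is the other value recalled above, so the resolution is left as a parameter to experiment with.

```python
import subprocess
from pathlib import Path
from typing import List


def rasterize_command(pdf_path: str, out_prefix: str, dpi: int = 300) -> List[str]:
    """pdftoppm call that writes one grayscale TIFF per page at the given dpi."""
    return ["pdftoppm", "-r", str(dpi), "-gray", "-tiff", pdf_path, out_prefix]


def tesseract_command(image_path: str, out_base: str, lang: str = "eng") -> List[str]:
    """tesseract call; the recognized text is written to <out_base>.txt."""
    return ["tesseract", image_path, out_base, "-l", lang]


def ocr_pdf(pdf_path: str, workdir: str, dpi: int = 300) -> str:
    """Rasterize every page to TIFF, OCR each page, and join the text."""
    work = Path(workdir)
    work.mkdir(parents=True, exist_ok=True)
    subprocess.run(rasterize_command(pdf_path, str(work / "page"), dpi), check=True)

    pages = []
    # pdftoppm names its output page-<number>.tif, zero-padded to equal width,
    # so a plain sort keeps the pages in order.
    for tif in sorted(work.glob("page-*.tif")):
        out_base = str(tif.with_suffix(""))
        subprocess.run(tesseract_command(str(tif), out_base), check=True)
        pages.append(Path(out_base + ".txt").read_text(encoding="utf-8"))
    return "\n".join(pages)
```

Running `ocr_pdf("scan.pdf", "ocr_work", dpi=150)` and again with `dpi=300` is a cheap way to see which resolution suits a particular set of documents.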

Tim

9:06 p.m.

Allegedly, on or about 15 December 2015, jd1008 sent:

> Downloaded and tried tesseract and cuneiform, and both fail to work on any of the pdf images I have. These images are NOT encrypted as they are public documents like from the DMV, ... etc.

But are they good quality images? OCR needs a reasonable resolution, *and* clean character definition.

--
[tim@localhost ~]$ uname -rsvp
Linux 3.9.10-100.fc17.x86_64 #1 SMP Sun Jul 14 01:31:27 UTC 2013 x86_64

All mail to my mailbox is automatically deleted, there is no point trying to privately email me, I will only read messages posted to the public lists.

Next time your service provider asks you to reboot your equipment, ask them to reboot theirs, first.
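One quick way to answer the resolution half of that question is to estimate the effective dpi of a page image from its pixel size. The sketch below is only an illustration: the US Letter page size and the 150 dpi floor are assumptions (the floor borrowed from the figures mentioned earlier in the thread), and it says nothing about how clean the characters are.

```python
def effective_dpi(pixel_width: int, pixel_height: int,
                  page_width_in: float = 8.5,
                  page_height_in: float = 11.0) -> float:
    """Estimate scan resolution, assuming the image covers one full page."""
    return min(pixel_width / page_width_in, pixel_height / page_height_in)


def resolution_ok(dpi: float, minimum: float = 150.0) -> bool:
    """True if the estimate meets the assumed minimum for usable OCR."""
    return dpi >= minimum
```

For example, a 2550 x 3300 pixel image of a Letter page works out to 300 dpi, while an 850 x 1100 pixel image is only 100 dpi and is likely too coarse.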

Fred Smith

10:35 p.m.

On Wed, Dec 16, 2015 at 01:36:40PM +1030, Tim wrote:

> Allegedly, on or about 15 December 2015, jd1008 sent:
>> Downloaded and tried tesseract and cuneiform, and both fail to
>> work on any of the pdf images I have. These images are NOT encrypted
>> as they are public documents like from the DMV, ... etc.
> But are they good quality images? OCR needs a reasonable resolution, *and* clean character definition.

When I was using tesseract a few years ago (as mentioned earlier in this thread) I was getting PDFs made of scanned legal documents (from Groklaw, documents from the SCO v IBM case). These were pretty awful quality, as if they had been scanned at some terribly low resolution from what may have been poor quality originals (or copies thereof). They were very messy to look at, but tesseract could read most of it fairly well. But converting to higher-resolution TIFF files actually made the OCR work more poorly, odd as that may seem.

--
Fred Smith -- fredex(a)fcshome.stoneham.ma.us
I can do all things through Christ who strengthens me.
Philippians 4:13

Doug

11:22 p.m.

JD

Wednesday, 16 December

11:37 a.m.

On 12/15/2015 10:22 PM, Doug wrote:

> On 12/15/2015 11:35 PM, Fred Smith wrote:
>> On Wed, Dec 16, 2015 at 01:36:40PM +1030, Tim wrote:
>>> Allegedly, on or about 15 December 2015, jd1008 sent:
>>>> Downloaded and tried tesseract and cuneiform, and both fail to
>>>> work on any of the pdf images I have. These images are NOT encrypted
>>>> as they are public documents like from the DMV, ... etc.
>>> But are they good quality images? OCR needs a reasonable resolution,
>>> *and* clean character definition.
>> When I was using tesseract a few years ago (as mentioned earlier
>> in this thread) I was getting PDFs made of scanned legal documents
>> (from Groklaw, documents from the SCO v IBM case). These were pretty
>> awful quality, as if they had been scanned at some terribly low
>> resolution from what may have been poor quality originals (or copies
>> thereof). They were very messy to look at, but tesseract could read
>> most of it fairly well. but converting to higher-resolution TIFF
>> files actually made the OCR work more poorly, odd as that may seem.
>
> About 3 or so years ago, I tried tesseract and it was only about 80% on good quality print. I tried the Windows program ABBYY and it was virtually perfect. So if you have a long document, or a bunch of documents, spend the money and find a Windows machine if you don't have one.
>
> You'd be surprised how much work it is to go thru and change "1"s to "l"s or vice versa, or "i"s. And that's just one example. And I can almost guarantee you'll miss a couple! (If the document is not too long, it might be worthwhile to have someone read it to you and type it in by hand!)
>
> Looking at Google output, I find a free on-line service, but there might be a problem sending a scanned file due to file size--check with your ISP. The url is: www.onlineocr.net
>
> As they say on the net, YMMV!
>
> --doug

Thanks, Doug. I will try the Windows approach with ABBYY. Is ABBYY available for Linux? I guess it might work with Wine.
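The "1" versus "l" clean-up Doug describes can be narrowed down a little before proofreading by hand. The following Python sketch is an assumed heuristic, not a feature of any OCR package: it flags words that mix letters and digits and contain a commonly confused character. It will not catch swaps that still look like a real word or a plain number, so it reduces the manual pass rather than replacing it.

```python
from typing import List, Tuple

# Characters that OCR engines commonly confuse with one another (assumed set).
CONFUSABLE = set("1lI|0O")
PUNCTUATION = ".,;:!?()[]\"'"


def suspicious_tokens(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, token) pairs worth checking by eye.

    A token is flagged when it mixes letters and digits and contains at
    least one easily confused character, e.g. "so1d" for "sold".
    """
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for token in line.split():
            core = token.strip(PUNCTUATION)
            has_alpha = any(c.isalpha() for c in core)
            has_digit = any(c.isdigit() for c in core)
            if has_alpha and has_digit and CONFUSABLE & set(core):
                hits.append((lineno, core))
    return hits
```

Run over the OCR output, this gives a short list of line numbers and words to look at first; identifiers that legitimately mix letters and digits will show up as false positives.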
