Dear All
To get a pdf file with the contents of a web-page, I produce a ps file and then I use ps2pdf. However, no text can be copied from the resulting pdf file. Is there another way of producing a pdf file with copyable text?
Thanks in advance,
Paul
I use Adobe and xpdf to read pdf files and can copy text from both ok. What reader are you using?
Dave
D> I would guess it's how the PDF files are created that's causing the problem.
P> Check out the 'Print to PDF' thread here: https://www.redhat.com/archives/fedora-list/2006-July/thread.html
HTH, Chris
The problem is, as Chris points out, in how the pdf files are created. Unfortunately, cups-pdf does not solve it: the text of the pdf file is stored as an image and not as text.
Paul
On 1/9/07, Chris Mohler cr33dog@gmail.com wrote:
How about this one?
Thanks, Chris. Unfortunately, htmldoc does not handle web-pages with utf-8 charsets.
Paul
On 1/9/07, Chris Mohler cr33dog@gmail.com wrote:
Would you be willing to point me to an example html file that you're wanting to convert to PDF?
This one, for instance:
$ htmldoc --webpage -f output.pdf http://ithink.ch/blog/2004/02/20/unicode_ribbon_campaign.html
Paul
I didn't have much luck. I found this PHP class: http://sourceforge.net/projects/tcpdf/
It does seem to partially work, but I didn't see a front-end for it. I hacked up the test file provided and made an imperfect conversion of the link you sent me.
I also saw this: http://sourceforge.net/projects/acrophobia/
But that looks a little dubious. I may have a peek inside the RPM anyway....
Chris
Thanks, Chris. They look a bit complicated to use. There should be a free tool for Linux similar to Acrobat Professional! :-)
Paul
When using Acrobat Professional, the pdf files that I obtain from, e.g., a web-page of a newspaper contain text that one can copy to a word processor. However, in Linux, with
- print to a ps file;
- use ps2pdf to convert from ps to pdf,
the pdf files do not contain copyable text, as the text is bitmapped. Can one obtain, in Linux, pdf files with copyable text? Any ideas?
I have tried htmldoc, as suggested, but it does not support utf-8.
Thanks in advance,
Paul
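The "print to ps, then ps2pdf" route described above can be checked from the command line. A minimal sketch, assuming `ps2pdf` (Ghostscript) and `pdftotext` (xpdf or poppler-utils) are installed; the function names and file names are hypothetical:

```shell
# Succeeds (exit 0) if standard input contains at least one
# non-whitespace character, i.e. some text was extracted.
has_visible_text() {
    grep -q '[^[:space:]]'
}

# Convert a PostScript file to PDF, then report whether any text can be
# extracted from the result. Both file names are placeholders.
# Usage: ps_to_pdf_and_check input.ps output.pdf
ps_to_pdf_and_check() {
    ps_file=$1
    pdf_file=$2
    ps2pdf "$ps_file" "$pdf_file" || return 2
    # "pdftotext FILE -" writes the extracted text to standard output
    if pdftotext "$pdf_file" - | has_visible_text; then
        echo "text layer found"
    else
        echo "no extractable text"
        return 1
    fi
}
```

For example, `ps_to_pdf_and_check page.ps page.pdf` prints `no extractable text` when the page was stored as an image. A positive result only shows that some text layer exists; the text may still come out as gibberish if the font mapping is missing.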
I suppose that depends on the creation method. For instance, you can use OpenOffice to "export" a document as PDF, or you can install a PDF printer driver for CUPS and "print" a PDF. The two routes give different results, so you might want to try both. I found that exporting a document resulted in a PDF with copyable text, whereas printing to the CUPS-PDF driver resulted in uncopyable text. I picked a non-standard font for my test, just to see whether the text would still be copyable if I didn't use the basic fonts commonly employed in PDFs.
I'd imagine how you created your PostScript file would make a difference, too: whether it contains real text or pre-rendered graphics. The use of some fonts might also be a problem.
Info from the cups-pdf RPM:
"cups-pdf" is a backend script for use with CUPS - the "Common UNIX Printing System" (see more for CUPS under http://www.cups.org/). "cups-pdf" uses the ghostscript pdfwrite device to produce PDF Files.
This version has been modified to store the PDF files on the Desktop of the user. This behavior can be changed by editing the configuration file.
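For reference, the kind of conversion that cups-pdf delegates to Ghostscript can be reproduced by hand. A minimal sketch, assuming Ghostscript is installed as `gs`; the function name and file names are placeholders, and the exact options that cups-pdf itself passes may differ:

```shell
# Convert a PostScript file to PDF with Ghostscript's pdfwrite device.
# Usage: ps_to_pdf_pdfwrite input.ps output.pdf
ps_to_pdf_pdfwrite() {
    # -q: quiet, -dNOPAUSE/-dBATCH: run non-interactively and exit when done
    gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile="$2" "$1"
}
```

ps2pdf is a wrapper script around the same pdfwrite device, so both routes depend on what the PostScript input contains.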
Thanks, Tim. Maybe the problem is in the creation of the PS file, as CutePDF (a freeware PDF printer tool for MS Windows) uses ps2pdf to produce PDF documents with copyable text. See
http://www.cutepdf.com/Products/CutePDF/writer.asp
Paul
I have filed an enhancement request:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=251801
Paul
It might be obsolete :)
Try this:
yum install cups-pdf
service cups restart
Then you should be able to print any web page to PDF, by selecting the 'CUPS PDF' printer.
Chris
Not obsolete, Chris!
In fact, cups-pdf does not produce pdf files with *copyable* text.
Paul
It is copyable here. I'm using cups-pdf (F7 updated) and Adobe Reader 7.0.
I first tried your example provided earlier in this thread - I was able to copy/paste the text from the resulting PDF.
Now that I've tried it a few more times, there are some formatting issues here and there when using cups-pdf.
Two other methods:
1. Install this extension for FF: https://addons.mozilla.org/en-US/firefox/addon/4738
Cons: you have to sign up for a third-party service
Pros: free (as in no money), easy to use, formatting is pretty good.
2. "Save As" from FF (format: Web page, complete). Open the file in Openoffice, export PDF.
Cons: formatting issues - extra white space added
Pros: truly free
Both additional methods work for me (as does cups-pdf) - producing PDF files with copyable text.
HTH, Chris
Thanks, Chris. I am using
$ rpm -q cups-pdf
cups-pdf-2.4.6-1.fc7
and my cups-pdf only produces bitmapped pdfs. Could you please send me your cups-pdf.conf?
Paul
Same here:
$ rpm -q cups-pdf
cups-pdf-2.4.6-1.fc7
Attached - I haven't modified it at all...
Chris
On Sat, 2007-08-11 at 22:50 +0100, Paul Smith wrote:
Attached - I haven't modified it at all...
OK - so maybe *not* attached...
Thanks, Chris. Mine is identical to yours. I do not understand why it produces pdfs with copyable text in your case, whereas in my case it only produces bitmapped pdfs.
Paul
Adobe Reader does not allow copying; Document Viewer and other applications do.
Aaron Konstam
I have tried with all available pdf viewers, but the text is not copyable at all. Something seems to be wrong with the method 'print to ps + ps2pdf'.
Paul
Let's try for a common target:
Target: http://fedoraproject.org/wiki/Infrastructure/fedorapeople.org
Methods: 1 - cups-pdf. Produces copyable text in Adobe Reader
2 - LOOP extension for FF. Also produces copyable text
3 - Openoffice - text *is* copyable, but the formatting is horrible!
4 - I did 'select all' in FF, made a new OO doc, and pasted it in. PDF export produces a file with copyable text, and decent formatting.
In summary, I was able to produce a PDF w/copyable text using all four methods. #3 looked like hell, but the text was there. Seems like #4 ought to Just Work. I'm curious - can anyone duplicate my results? I went into further detail about each method previously.
Chris
PS - I've found that not all pages work with method #1 - on some pages (gmail, e.g.), the text appears to be copyable, but it is in fact copying gibberish to the clipboard. By gibberish, I mean those nice little unicode character boxes that you see when your font can't do UTF-8.
Here is the pdf file that I obtain with method 1 (cups-pdf).
Paul
That seems to confirm what Tony just said - I think this is a font issue. I can copy/paste the text from your attached file, but it's all "unicode boxes" instead of legible text.
Chris
Well, I can't in Adobe Reader.
Aaron Konstam
On Sun, 2007-08-12 at 19:09 +0100, Paul Smith wrote:
Here is the pdf file that I obtain with method 1 (cups-pdf).
The only thing I could copy, or even search for, in that file using Evince or XPDF was the text in the gadget: ssh your_fedora_username@fedorapeople.org
On Sun, 2007-08-12 at 13:01 -0500, Chris Mohler wrote:
Methods: 1 - cups-pdf. Produces copyable text in Adobe Reader
The method above does not work on my machine.
Aaron Konstam
I've been able to copy text from PDF files using xpdf when the same file prevented copying in acroread. I consider that xpdf lets you copy to be a feature, not a bug.
On Sat, 2007-08-11 at 22:20 +0100, Paul Smith wrote:
and my cups-pdf only produces bitmapped pdfs.
In the past, I've had mixed results. I don't recall all the ins and outs, but the content created, and what applications I used, had an effect.
Did you try creating a document with the ordinary fonts used in PDF files (Times, etc.), or ones that it'll have to embed or pre-render?
And what programs have you tried to print a PDF file with?
On 8/12/07, Tim ignored_mailbox@yahoo.com.au wrote:
Did you try creating a document with the ordinary fonts used in PDF files (Times, etc.), or ones that it'll have to embed or pre-render?
I just print to a file as ps and then I use ps2pdf. I have just tried this method with an OOo document written with Times New Roman, but again the resulting pdf is bitmapped. The problem may be located in the way Linux produces the ps file and not in ps2pdf.
And what programs have you tried to print a PDF file with?
Firefox and Opera.
Paul
On Sun, 2007-08-12 at 00:53 +0100, Paul Smith wrote:
I just print to a file as ps and then I use ps2pdf. I have just tried this method with an OOo document written with Times New Roman, but again the resulting pdf is bitmapped. The problem may be located in the way Linux produces the ps file and not in ps2pdf.
Hmm, I wonder if the complexity of the document is a factor? Simple margins, etc., would be easy to implement in PostScript and PDF; other things might not be.
You can view PostScript files directly in applications like Evince, so you could try some PostScript testing before adding PDF to the equation.
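One way to do that PostScript testing without a viewer is to look for font-related DSC comments in the file. This is only a heuristic sketch: it assumes the printing application writes DSC comments such as `%%DocumentFonts:`, and the function name is made up for illustration. If nothing is listed, the text may already have been rendered as graphics before it reached ps2pdf.

```shell
# Print the DSC comment lines of a PostScript file that mention fonts.
# Reads the named files, or standard input when no file is given.
# Exit status is non-zero when no such line is found.
list_ps_font_comments() {
    grep '^%%.*[Ff]ont' "$@"
}
```

For example, `list_ps_font_comments page.ps` on a file that declares its fonts would print lines such as `%%DocumentFonts: Times-Roman`.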
At 12:53 AM +0100 8/12/07, Paul Smith wrote:
I just print to a file as ps and then I use ps2pdf. I have just tried this method with an OOo document written with Times New Roman, but again the resulting pdf is bitmapped. The problem may be located in the way Linux produces the ps file and not in ps2pdf.
From what has been said recently in the thread, it seems likely that the fonts you use are the issue: either the font has to be so standard that it is assumed to be always available, or it must be embedded (if that is allowed), or the pdf must be bitmapped where that font is used. The output of pdffonts may tell you what is going on. By default, ps2pdf won't embed the "14 standard fonts" (Courier, Helvetica, Times, Symbol, ZapfDingbats) -- see /usr/share/doc/ghostscript*/Ps2pdf.htm.
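A quick way to read the pdffonts output is to count Type 3 fonts, which are often a sign that glyphs were stored as bitmaps or drawing procedures rather than as a regular font. A minimal sketch, assuming `pdffonts` (xpdf or poppler-utils) is installed and prints its usual two header lines; the function names are hypothetical:

```shell
# Count the fonts reported as "Type 3" in pdffonts output read from
# standard input. The first two lines (column header and ruler) are skipped.
count_type3_fonts() {
    tail -n +3 | grep -c 'Type 3'
}

# Convenience wrapper: run pdffonts on a file and count its Type 3 fonts.
# Usage: pdf_type3_count file.pdf
pdf_type3_count() {
    pdffonts "$1" | count_type3_fonts
}
```

The `uni` column of the same output shows whether a font carries a ToUnicode map; a `no` there is one possible reason for text that copies as boxes or gibberish.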
On Sat, Aug 11, 2007 at 07:16:20PM +0100, Paul Smith wrote:
In fact, cups-pdf does not produce pdf files with *copyable* text.
Paul
What I tend to do (at work, where I am pretty much forced to use MS Word) is print to a PS file (install a HP PS printer driver, configure it for print to file) then hand that file to ps2pdf. Works pretty well.
I could import the Word doc into OOo, but my experience has been that while it imports well, the line spacing is different such that the page breaks all change (in OOo, whether you print to PDF or not) changing the look of the document that you may have carefully tweaked in Word to get tables and pictures where you want them. (No I'm not complaining about OOo, I'd just as soon use it if I could, I'm merely making an observation.)
Fred
Can one install in Linux a non-existent PS printer? Perhaps, it will do the trick.
Paul
On Sat, 2007-08-11 at 23:57 +0100, Paul Smith wrote:
Can one install in Linux a non-existent PS printer? Perhaps, it will do the trick.
You don't need to, just do a print to file. The output is PostScript.
On Sat, Aug 11, 2007 at 11:57:09PM +0100, Paul Smith wrote:
Can one install in Linux a non-existent PS printer? Perhaps, it will do the trick.
Most Linux apps emit PostScript as their "natural" output language, then CUPS hands it to Ghostscript to turn into something the printer won't choke on. So, if your app will let you print to file, which Firefox will, you're all set.