highlight words by regex in pdf files using python [Python]

From: David Boddie on 17 Mar 2010 18:40

On Wednesday 17 March 2010 00:47, Aahz wrote:

> In article
> <af0830ae-1d24-4db9-b721-d6602fedd540(a)15g2000yqi.googlegroups.com>,
> Peng Yu <pengyu.ut(a)gmail.com> wrote:
>>
>>I don't find a general pdf library in python that can do any
>>operations on pdfs.
>>
>>I want to automatically highlight certain words (using regex) in a
>>pdf. Could somebody let me know if there is a tool to do so in python?
>
> Did you Google at all? "python pdf" finds this as the first link, though
> I have no clue whether it does what you want:
>
> http://pybrary.net/pyPdf/

The original poster might also be interested in displaying the highlighted
words without modifying the original file, in which case the Poppler
library is worth investigating:

http://poppler.freedesktop.org/

David

From: TP on 18 Mar 2010 15:36

On Wed, Mar 17, 2010 at 7:53 AM, Peng Yu <pengyu.ut(a)gmail.com> wrote:
> On Tue, Mar 16, 2010 at 11:12 PM, Patrick Maupin <pmaupin(a)gmail.com> wrote:
>> On Mar 4, 6:57 pm, Peng Yu <pengyu...(a)gmail.com> wrote:
>>> I don't find a general pdf library in python that can do any
>>> operations on pdfs.
>>>
>>> I want to automatically highlight certain words (using regex) in a
>>> pdf. Could somebody let me know if there is a tool to do so in python?
>>
>> The problem with PDFs is that they can be quite complicated. There is
>> the outer container structure, which isn't too bad (unless the
>> document author applied encryption or fancy multi-object compression),
>> but then inside the graphics elements, things could be stored as
>> regular ASCII, or as fancy indexes into font-specific tables. Not
>> rocket science, but the only industrial-strength solution for this is
>> probably reportlab's pagecatcher.
>>
>> I have a library which works (primarily with the outer container) for
>> reading and writing, called pdfrw. I also maintain a list of other
>> PDF tools at http://code.google.com/p/pdfrw/wiki/OtherLibraries It
>> may be that pdfminer (link on that page) will do what you want -- it
>> is certainly trying to be complete as a PDF reader. But I've never
>> personally used pdfminer.
>>
>> One of my pdfrw examples at http://code.google.com/p/pdfrw/wiki/ExampleTools
>> will read in preexisting PDFs and write them out to a reportlab
>> canvas. This works quite well on a few very simple ASCII PDFs, but
>> the font handling needs a lot of work and probably won't work at all
>> right now on unicode. (But if you wanted to improve it, I certainly
>> would accept patches or give you commit rights!)
>>
>> That pdfrw example does graphics reasonably well. I was actually
>> going down that path for getting better vector graphics into rst2pdf
>> (both uniconvertor and svglib were broken for my purposes), but then I
>> realized that the PDF spec allows you to include a page from another
>> PDF quite easily (the spec calls it a form xObject), so you don't
>> actually need to parse down into the graphics stream for that. So,
>> right now, the best way to do vector graphics with rst2pdf is either
>> to give it a preexisting PDF (which it passes off to pdfrw for
>> conversion into a form xObject), or to give it a .svg file and invoke
>> it with -e inkscape, and then it will use inkscape to convert the svg
>> to a pdf and then go through the same path.
>
> Thank you for your long reply! But I'm not sure whether you understood my question.
>
> Acrobat can highlight certain words in PDFs, and I can add notes to the
> highlighted words as well. However, I find that I frequently end up
> highlighting words that could be matched by a regular expression.
>
> To improve my productivity, I don't want to do this manually in Acrobat
> but rather automatically, if such a tool is available. People on the
> reportlab mailing list said this is not possible with reportlab, and I
> don't see that pyPdf can do it either. If you know of an API for this
> purpose, please let me know. Thank you!
>
> Regards,
> Peng

Take a look at the Acrobat SDK
(http://www.adobe.com/devnet/acrobat/?view=downloads). In particular
see the Acrobat Interapplication Communication information at
http://www.adobe.com/devnet/acrobat/interapplication_communication.html.

"Spell-checking a document" shows how to spell-check a PDF using
Visual Basic at
http://livedocs.adobe.com/acrobat_sdk/9.1/Acrobat9_1_HTMLHelp/wwhelp/wwhimpl/common/html/wwhelp.htm?context=Acrobat9_HTMLHelp&file=IAC_DevApp_OLE_Support.100.17.html

"Working with annotations" shows how to add an annotation using Visual
Basic at http://livedocs.adobe.com/acrobat_sdk/9.1/Acrobat9_1_HTMLHelp/wwhelp/wwhimpl/common/html/wwhelp.htm?context=Acrobat9_HTMLHelp&file=IAC_DevApp_OLE_Support.100.16.html

Presumably combining the two examples with Python's win32com should
allow you to do what you want.
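
The regex side of this is independent of Acrobat and can be sketched in
plain Python. The following is only a rough illustration, not Acrobat SDK
code: it assumes the words on a page and their bounding boxes have already
been extracted by some other means, and both the find_highlight_rects
function and the sample data are hypothetical names made up for this
example. It returns the rectangles that a highlight annotation would need
to cover.

```python
import re


def find_highlight_rects(words, pattern):
    """Return the bounding boxes of all words that match `pattern`.

    `words` is a hypothetical list of (text, (x0, y0, x1, y1)) tuples,
    i.e. whatever a PDF text extractor would give you for one page.
    """
    regex = re.compile(pattern)
    return [box for text, box in words if regex.search(text)]


# Made-up sample data standing in for one page of extracted text.
sample_words = [
    ("Highlight", (72.0, 700.0, 120.0, 712.0)),
    ("words", (124.0, 700.0, 155.0, 712.0)),
    ("by", (159.0, 700.0, 170.0, 712.0)),
    ("regex", (174.0, 700.0, 203.0, 712.0)),
]

# Select the boxes for the words "words" and "regex".
rects = find_highlight_rects(sample_words, r"^(regex|words)$")
for rect in rects:
    print(rect)
```

Each returned rectangle would then be passed to whatever call actually
creates the annotation (for example, the approach from the "Working with
annotations" page, driven through win32com).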
