From: Eknath Venkataramani on 24 May 2010 09:43
I have around 45 PDFs containing text in _HINDI_ to convert into raw text. When I use the xpdf package, the generated text is very weird, so I'd like to write a program which would convert the PDF text into Unicode text as it is.

The fonts used in the PDFs:

name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
APKAPP+Usha-Bold                     Type 1C           yes yes yes     72  0
APKBBB+Agenda-Light                  Type 1C           yes yes yes     77  0
APKBGF+Usha                          Type 1C           yes yes yes     41  0
APKBKJ+Agenda-Medium                 Type 1C           yes yes yes     46  0
APKBON+Agenda-Bold                   Type 1C           yes yes yes     49  0

For example, in the PDF: आदमी मुसाफिर है
When I use pdftotext, I get some very weird symbols: '... ........'
while I'd like 'आदमी मुसाफिर है' to be the output.

--
Eknath Venkataramani
From: joy99 on 24 May 2010 18:16
On May 24, 6:43 pm, Eknath Venkataramani <eknath.i...(a)gmail.com> wrote:
> I have around 45 pdfs to convert into raw text containing text in _HINDI_ .
> When I use the xpdf package, the generated text is very weird, so I'd like
> to write a program which would convert the pdf text into Unicode text as it
> is.

Hi,

I do not think this can be done. A few months back I heard that someone was trying to convert .pdf, .tiff, and .jpg files into text in Devanagari script. He is most probably from Germany. You can search the net and find out.

I feel it is an open problem in Indian NLP, and if you can do it, please let me know. You will get a red-carpet welcome at any top-notch research organization in India working on Indian NLP/OCR.

Best Regards,
Subhabrata.
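
For the batch step described in the original question (around 45 PDFs run through pdftotext), a minimal Python sketch might look like the following. It only wraps the pdftotext command-line tool with its `-enc UTF-8` option. It assumes pdftotext is on the PATH and that the fonts' Unicode mappings (the "uni = yes" column in the listing) are actually usable, which may not hold for these Hindi fonts. The names `build_command` and `convert_all` and the directory names are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_command(pdf_path, out_dir):
    """Return the pdftotext argument list for one PDF (UTF-8 output)."""
    pdf_path = Path(pdf_path)
    txt_path = Path(out_dir) / (pdf_path.stem + ".txt")
    # -enc UTF-8 asks pdftotext to write its text output encoded as UTF-8.
    return ["pdftotext", "-enc", "UTF-8", str(pdf_path), str(txt_path)]


def convert_all(pdf_dir, out_dir, runner=subprocess.run):
    """Run pdftotext on every *.pdf in pdf_dir and return the output paths.

    `runner` is injectable so the loop can be exercised without pdftotext
    being installed; by default it is subprocess.run.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        cmd = build_command(pdf_path, out_dir)
        # check=True makes subprocess.run raise if pdftotext exits non-zero.
        runner(cmd, check=True)
        outputs.append(Path(cmd[-1]))
    return outputs
```

Calling `convert_all("pdfs", "text")` (both directory names are placeholders) would write one .txt file per PDF. If the output is still garbage, the fonts probably map glyphs to non-Devanagari code points, and this sketch does nothing to correct that.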