From: Eknath Venkataramani on
I have around 45 PDFs containing text in _HINDI_ that I need to convert into
raw text. When I use the xpdf package, the generated text is very weird, so
I'd like to write a program that converts the PDF text into Unicode text as
is.

The fonts used in the pdfs:
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
APKAPP+Usha-Bold                     Type 1C           yes yes yes     72  0
APKBBB+Agenda-Light                  Type 1C           yes yes yes     77  0
APKBGF+Usha                          Type 1C           yes yes yes     41  0
APKBKJ+Agenda-Medium                 Type 1C           yes yes yes     46  0
APKBON+Agenda-Bold                   Type 1C           yes yes yes     49  0

For example, the PDF contains: आदमी मुसाफिर है
When I use pdftotext, I get some very weird symbols: '...
........'
while I'd like 'आदमी मुसाफिर है' to be the output.


--
Eknath Venkataramani
From: joy99 on
On May 24, 6:43 pm, Eknath Venkataramani <eknath.i...(a)gmail.com>
wrote:
> I have around 45 pdfs to convert into raw text containing text in _HINDI_ .
> When I use the xpdf package, the generated text is very weird, so I'd like
> to write a program which would convert the pdf text into Unicode text as it
> is.
>
> The fonts used in the pdfs:
> name                                 type              emb sub uni object ID
> ------------------------------------ ----------------- --- --- --- ---------
> APKAPP+Usha-Bold                     Type 1C           yes yes yes     72  0
> APKBBB+Agenda-Light                  Type 1C           yes yes yes     77  0
> APKBGF+Usha                          Type 1C           yes yes yes     41  0
> APKBKJ+Agenda-Medium                 Type 1C           yes yes yes     46  0
> APKBON+Agenda-Bold                   Type 1C           yes yes yes     49  0
>
> For eg. in the pdf: आदमी मुसाफिर है
>               when I use pdftotext, I get some very weird symbols: '...
> .......'
>              while i'd like 'आदमी मुसाफिर है' to be the output
>
> --
> Eknath Venkataramani

Hi,
I do not think this can be done. A few months back I heard that someone was
trying to convert .pdf, .tiff, and .jpg files into text in Devanagari script.
He is most probably from Germany. You can search the net to find out more. I
feel it is an open problem in Indian NLP, and if you can do it, please
let me know. You will get a red-carpet welcome at any top-notch research
organization in India working on Indian NLP/OCR.
Best Regards,
Subhabrata.
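
For PDFs like the ones described in this thread, where the embedded fonts use their own glyph codes rather than Unicode, one approach that may be worth trying is a per-font lookup table from legacy codes to Unicode, plus a reordering step for the short-i matra (ि), which visual-order legacy fonts typically store before the consonant while Unicode stores it after. The following is only a minimal Python sketch of that idea. The table `GLYPH_TO_UNICODE`, the function `legacy_to_unicode`, and the sample input string are all hypothetical; they are not the real Usha encoding, which would have to be worked out by inspecting the font itself.

```python
# HYPOTHETICAL table: these ASCII codes are placeholders, NOT the real
# Usha font encoding. A real table must be built by inspecting the font.
GLYPH_TO_UNICODE = {
    "A": "\u0906",  # आ
    "d": "\u0926",  # द
    "m": "\u092E",  # म
    "s": "\u0938",  # स
    "f": "\u092B",  # फ
    "r": "\u0930",  # र
    "h": "\u0939",  # ह
    "a": "\u093E",  # ा
    "i": "\u093F",  # ि
    "I": "\u0940",  # ी
    "u": "\u0941",  # ु
    "E": "\u0948",  # ै
}

I_MATRA = "\u093F"  # short-i vowel sign; written left of the consonant


def legacy_to_unicode(text, table=GLYPH_TO_UNICODE):
    """Map legacy glyph codes to Unicode and move the short-i matra
    after the character that follows it.

    Simplification: the matra is moved past exactly one character, so
    conjuncts (consonant + virama + consonant) are not handled here.
    """
    out = []
    pending_i = False
    for ch in text:
        mapped = table.get(ch, ch)  # unknown codes pass through unchanged
        if mapped == I_MATRA:
            pending_i = True        # hold it until the next character
            continue
        out.append(mapped)
        if pending_i:
            out.append(I_MATRA)
            pending_i = False
    if pending_i:                   # stray matra at the end of the input
        out.append(I_MATRA)
    return "".join(out)


# Sample input in the hypothetical encoding above
print(legacy_to_unicode("AdmI musaifr hE"))  # आदमी मुसाफिर है
```

A lookup table is only a starting point: each font in the list above would likely need its own table, and the text still has to be extracted from the PDF in its raw legacy codes before this step can run.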