From: Stefan Behnel on 29 Jul 2010 08:54
William Johnston, 29.07.2010 14:12:
> I have a Python app that parses XML files and then writes to text files.
XML or HTML?
> However, the output text file is "sometimes" encoded in some Asian language.
> Here is my code:
> encoding = "iso-8859-1"
> clean_sent = nltk.clean_html(sent.text)
> clean_sent = clean_sent.encode(encoding, "ignore");
> I also tried "UTF-8" encoding, but received the same results.
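Maybe NLTK cannot determine the encoding of the HTML file (because the
file is broken and/or doesn't correctly specify its own encoding) and thus
fails to decode it? In that case, calling .encode() on the result only
re-encodes text that was already decoded wrongly, which would explain why
switching the output encoding makes no difference.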
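
One way to check this is to decode the raw bytes yourself before handing the text to any cleaning step. The following is only a minimal sketch under assumptions: decode_markup is a hypothetical helper (not part of NLTK), the regular expression is a rough guess at how a charset might be declared, and iso-8859-1 is assumed as the fallback because it is the encoding used in the quoted code.

```python
import re

# Hypothetical helper, not an NLTK function. Rough pattern for a declared
# charset, e.g. encoding="utf-8" in an XML declaration or charset=... in a
# <meta> tag.
_CHARSET_RE = re.compile(
    rb'(?:charset|encoding)\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.I
)

def decode_markup(raw, fallback="iso-8859-1"):
    """Decode raw markup bytes using the declared charset, else a fallback."""
    match = _CHARSET_RE.search(raw[:1024])
    declared = match.group(1).decode("ascii") if match else fallback
    try:
        return raw.decode(declared)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong declaration: iso-8859-1 maps every byte to a
        # character, so this decode cannot fail (it may still be wrong).
        return raw.decode("iso-8859-1")

# Example input with a correct declaration.
raw = '<?xml version="1.0" encoding="utf-8"?><p>caf\u00e9</p>'.encode("utf-8")
text = decode_markup(raw)

# Encode exactly once, when writing the output.
output_bytes = text.encode("utf-8")
```

If the decoded text already looks wrong at this point, the problem is in the input's declared (or missing) encoding rather than in the final .encode() call.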