Decompressing LZW compression from PDF file [Ruby]

From: Ahmad Azizan on 22 Mar 2010 03:32

Hello,

I'm trying to find a Ruby module or code that can decompress
LZW-compressed streams from a PDF file. However, as far as I know, no
such code or module is publicly available.

PDF usually encodes its stream data with the FlateDecode,
ASCIIHexDecode, ASCII85Decode, and LZWDecode filters. In Ruby,
FlateDecode and ASCII85Decode streams can be decoded with the existing
zlib and Ascii85 modules. For ASCIIHexDecode, I just need to convert
hex characters to bytes. My problem arises with LZWDecode, since there
is no module or code to decompress it.
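
As a rough illustration of the two easy filters mentioned above, here is a
minimal sketch. The method names ascii_hex_decode and flate_decode are made
up for this example; only Zlib::Inflate.inflate and Array#pack come from
Ruby itself. It assumes the whole stream is already in memory as a String.

```ruby
require "zlib"

# Hypothetical helper: decode an ASCIIHexDecode stream.
# Whitespace is ignored, ">" marks end of data, and an odd final
# digit is treated as if it were followed by "0".
def ascii_hex_decode(data)
  hex = data.split(">", 2).first.to_s.gsub(/\s+/, "")
  hex << "0" if hex.length.odd?
  [hex].pack("H*")
end

# Hypothetical helper: decode a FlateDecode stream (zlib-wrapped deflate).
def flate_decode(data)
  Zlib::Inflate.inflate(data)
end

hex_text   = ascii_hex_decode("48 65 6C6C\n6F>")
flate_text = flate_decode(Zlib::Deflate.deflate("stream data"))
```

Neither helper handles predictors or other decode parameters.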

Since I could not find an example of LZW decompression in Ruby, I
looked at a Python implementation instead. However, translating Python
into Ruby has turned out to be a painful process.

Example of working LZW decompression in python is here:
http://pastebin.ca/1849009
My translated code in ruby is here: http://pastebin.ca/1849012

With a small input, I can decompress it and get the same output as the
Python code, e.g.:

Python:
data = "\x80\x0b\x60\x50\x22\x0c\x0c\x85\x01"
tmp = LZWDecode(data)
print tmp

Ruby:
data = "\x80\x0b\x60\x50\x22\x0c\x0c\x85\x01"
lzw = LZWDecoder.new(data)
puts lzw.run()

However, with a real stream from a PDF file, I cannot get the
decompressed output. I guess there is an error in my code, or Ruby is
handling special characters improperly.
I've spent many hours, even days, working out how LZW decompression
works and trying to translate the Python code to Ruby, but my effort so
far has not paid off. I really hope someone can point me to a hint or a
solution for this problem.

Thank you
--
Posted via http://www.ruby-forum.com/.

From: Ryan Davis on 22 Mar 2010 04:20

On Mar 22, 2010, at 00:32 , Ahmad Azizan wrote:

> With a small input, I can decompress the it to get the equivalent output
> like the python code.
> e.g:
> Python
> data = "\x80\x0b\x60\x50\x22\x0c\x0c\x85\x01"
> tmp = LZWDecode(data)
> print tmp
>
> data = "\x80\x0b\x60\x50\x22\x0c\x0c\x85\x01"
> lzw = LZWDecoder.new(data)
> puts lzw.run()
>
> However, with a real stream from PDF file, I cannot get the decompressed
> output. I guess it might be some error in the code or improper handling
> of special character in ruby.

Can you get the python code to decode the real stream? That'd be one way to determine if the original data is corrupt or not.

From: Brian Candler on 22 Mar 2010 05:31

> Example of working LZW decompression in python is here:
> http://pastebin.ca/1849009
> My translated code in ruby is here: http://pastebin.ca/1849012

Which version of ruby are you using? If it's 1.9 then your @fp[@inc] may
fall foul of the character encoding rules. Try this in your initialize:

puts @fp.encoding
@fp.force_encoding("ASCII-8BIT")
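
To see why the encoding matters, here is a small sketch for Ruby 1.9 or
later. The sample bytes are made up for illustration: they are the UTF-8
encoding of the euro sign followed by "A". Indexing a string tagged as
UTF-8 yields characters, while the same data tagged as ASCII-8BIT yields
single bytes.

```ruby
# Four raw bytes; Array#pack returns an ASCII-8BIT (binary) string.
raw = [0xe2, 0x82, 0xac, 0x41].pack("C*")

as_text  = raw.dup.force_encoding("UTF-8")
as_bytes = raw.dup.force_encoding("ASCII-8BIT")

text_len        = as_text.length        # counts characters
byte_len        = as_bytes.length       # counts bytes
first_char_size = as_text[0].bytesize   # one index can span several bytes
third_byte      = as_text.getbyte(2)    # getbyte ignores the encoding
```

Here text_len is 2 and byte_len is 4, so a byte offset used as a string
index can land on the wrong data unless the string is binary.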

However if you pass in a StringIO rather than a String then you can just
copy what python is doing:

x = @fp.read(1)                # one byte as a String, or nil at end of input
@buff = x.unpack("C").first    # byte value 0..255; works on 1.8 and 1.9

and read(1) always reads single bytes. This has the advantage of being
able to decompress directly from files, without reading them into RAM
first.
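
A minimal sketch of that idea, assuming codes are packed most significant
bit first as in PDF's LZWDecode. The class name BitReader and the method
read_bits are invented for this example and are not taken from the
pastebin code. It accepts anything that responds to read(1), such as a
StringIO or an open File, and returns nil when the input runs out.

```ruby
require "stringio"

# Hypothetical helper: reads fixed-width codes, MSB first, from an IO.
class BitReader
  def initialize(io)
    @io = io
    @buffer = 0   # bits read from the IO but not yet consumed
    @count = 0    # number of valid bits in @buffer
  end

  # Returns the next nbits-wide integer, or nil if the input ends first.
  def read_bits(nbits)
    while @count < nbits
      chunk = @io.read(1)
      return nil if chunk.nil?
      @buffer = (@buffer << 8) | chunk.unpack("C").first
      @count += 8
    end
    @count -= nbits
    value = @buffer >> @count
    @buffer &= (1 << @count) - 1   # keep only the unconsumed low bits
    value
  end
end

reader = BitReader.new(StringIO.new([0x80, 0x0b, 0x60].pack("C*")))
codes = []
while (code = reader.read_bits(9))
  codes << code
end
```

With these three bytes, codes ends up as [256, 45]; the six leftover bits
are not enough for another 9-bit code, so the reader returns nil.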

Minor suggestion: it might be more rubyish to return nil rather than
raise EOFError, which would simplify your run loop to

result = ""
while code = readbits(@nbits)
  result << feed(code)
end
return result
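
For reference, here is a compact sketch of a whole decoder written in that
style. It is not the pastebin code: the method name lzw_decode and its
early_change parameter are invented for this example. It assumes the PDF
conventions of 9 to 12 bit codes packed most significant bit first,
256 as the clear-table code, 257 as end of data, the default
EarlyChange value of 1, and no predictor.

```ruby
require "stringio"

# Hypothetical helper: decode a PDF LZWDecode stream held in a String.
def lzw_decode(data, early_change = 1)
  io = StringIO.new(data)
  fresh_table = lambda { (0..255).map { |b| [b].pack("C") } + [nil, nil] }
  table  = fresh_table.call   # index = code; 256 and 257 are reserved
  buffer = 0                  # bits read but not yet consumed
  count  = 0                  # number of valid bits in buffer
  nbits  = 9                  # current code width
  prev   = nil                # string emitted for the previous code
  result = String.new         # empty binary string

  loop do
    # Pull whole bytes until one full code is buffered.
    while count < nbits
      chunk = io.read(1)
      return result if chunk.nil?   # input ended without an EOD code
      buffer = (buffer << 8) | chunk.unpack("C").first
      count += 8
    end
    count -= nbits
    code = buffer >> count
    buffer &= (1 << count) - 1

    case code
    when 256                        # clear table, back to 9-bit codes
      table = fresh_table.call
      nbits = 9
      prev = nil
    when 257                        # end of data
      return result
    else
      if prev.nil?
        entry = table[code]         # first code after a clear is a literal
      elsif code < table.size
        entry = table[code]
        table << prev + entry[0, 1]
      elsif code == table.size
        entry = prev + prev[0, 1]   # code refers to the entry being built
        table << entry
      end
      raise ArgumentError, "invalid LZW code #{code}" if entry.nil?
      result << entry
      prev = entry
      # Widen the code size one entry early when early_change is 1.
      nbits += 1 if nbits < 12 && table.size + early_change >= (1 << nbits)
    end
  end
end

data = [0x80, 0x0b, 0x60, 0x50, 0x22, 0x0c, 0x0c, 0x85, 0x01].pack("C*")
decoded = lzw_decode(data)
```

With the nine sample bytes from the original post, decoded is
"-----A---B". Passing 0 as the second argument would cover streams whose
decode parameters set EarlyChange to 0.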

Regards,

Brian.
