From: Michelle on
Peter,

> Try something like this (warning: uncompiled, untested, unoptimized, no
> error-checking, doesn't handle files > 2GB, and assumes that the search
> pattern is always no longer than the length of a single block that's
> read...proof of concept only):

With Tom's help I've solved some errors, so it works :-))
I want to understand and optimize this part first, before I get busy with
reading the previous four bytes.

Why doesn't it handle files > 2GB? (This could be an issue for me; the
files are sometimes even larger than 2 GB.)
Is it correct that it finds only the first occurrence of the pattern? (If so,
how do I find all occurrences?)
About the error-checking (and, as part of that, comparing the search pattern
length and the length of a single block), I think I can solve that.
But can you give me some hints about the optimisation (and what about the
"Boyer-Moore Algorithm")?

Thanks in advance,

Michelle

From: Tom Spink on
Hello, Michelle.

Michelle wrote:
> Why doesn't it handle files > 2GB? (This could be an issue for me; the
> files are sometimes even larger than 2 GB.)

It's the old "there aren't enough atoms in the Universe" problem - but
in this case, there aren't enough bits in an integer.

An 'int' is 32 bits.

2^32 = 4294967296

This is 4 GiB, I hear you think. But, an int is signed, so in actual
fact, the largest value for an int is:

int.MaxValue = (2^31 - 1) = 2147483647 = 2GiB

(it's "minus one" because the maximum value is one less than the
number of possible values)

This is because of the way negative numbers are represented: half the
possible values represent a positive number (0 to 2,147,483,647) and half
represent a negative number (-2,147,483,648 to -1).
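You can see those limits directly in C# (just illustrative):

```csharp
using System;

class IntLimits
{
    static void Main()
    {
        Console.WriteLine(int.MaxValue);   // 2147483647: int offsets top out just under 2 GiB
        Console.WriteLine(long.MaxValue);  // 9223372036854775807: far beyond any real file size

        // FileStream.Position and Seek() already use a long (Int64), which is
        // why switching the offset variables from int to long lifts the limit.
    }
}
```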

> Michelle
--
Tom

From: Peter Duniho on
On Sun, 13 Sep 2009 03:15:36 -0700, Michelle <michelle(a)notvalid.nomail>
wrote:

> Peter,
>
>> Try something like this (warning: uncompiled, untested, unoptimized, no
>> error-checking, doesn't handle files > 2GB, and assumes that the search
>> pattern is always no longer than the length of a single block that's
>> read...proof of concept only):
>
> With Tom's help I've solved some errors, so it works :-))
> I want to understand and optimize this part first, before I get busy with
> reading the previous four bytes.
>
> Why doesn't it handle files > 2GB? (This could be an issue for me; the
> files are sometimes even larger than 2 GB.)

Tom's answered that. If you want larger files to be handled, you need to
use 64-bit integers instead of 32-bit integers anywhere the code has an
int used as an offset within the file.
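As a sketch of what that change looks like (hypothetical names, not the exact code I posted earlier):

```csharp
using System;
using System.IO;

class OffsetSketch
{
    // Hypothetical sketch: track the position within the file as a long,
    // not an int, so files over 2 GB work. FileStream.Length and Position
    // are already Int64; only the local offset variables need to change.
    public static long FindByte(string path, byte target)
    {
        using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            byte[] block = new byte[65536];
            long baseOffset = 0;          // long, not int
            int read;
            while ((read = fs.Read(block, 0, block.Length)) > 0)
            {
                for (int i = 0; i < read; i++)
                {
                    if (block[i] == target)
                        return baseOffset + i;   // long arithmetic, no 2 GB wrap
                }
                baseOffset += read;
            }
            return -1;   // not found
        }
    }
}
```

The offset *within* a block stays an int (a block is small); only the running file offset needs 64 bits.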

> Is it correct that it finds only the first occurrence of the pattern? (If
> so, how do I find all occurrences?)

You are correct...it returns only the first occurrence. To find all
occurrences, instead of returning when a match is found, you'd simply do
some appropriate action there and then continue processing as if you
hadn't found a match. That appropriate action could be to call a delegate
(e.g. Action<int>, where the delegate takes the offset into the file and
does something), or perhaps just add the offset to a List<int> and then
return the List<int>.ToArray() result of that at the end.
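A sketch of that "record and keep going" idea (hypothetical names, simplified to an in-memory buffer):

```csharp
using System;
using System.Collections.Generic;

class AllMatches
{
    // Hypothetical sketch: instead of returning at the first match,
    // record the offset (and/or invoke a callback) and keep scanning.
    public static long[] FindAll(byte[] data, byte[] pattern, Action<long> onMatch = null)
    {
        List<long> offsets = new List<long>();
        for (int i = 0; i + pattern.Length <= data.Length; i++)
        {
            int j = 0;
            while (j < pattern.Length && data[i + j] == pattern[j]) j++;
            if (j == pattern.Length)
            {
                offsets.Add(i);                   // remember the match...
                if (onMatch != null) onMatch(i);  // ...optionally notify the caller...
                // ...and continue as if no match had been found
            }
        }
        return offsets.ToArray();
    }
}
```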

> About the error-checking (and, as part of that, comparing the search
> pattern length and the length of a single block), I think I can solve
> that.
> But can you give me some hints about the optimisation (and what about
> the "Boyer-Moore Algorithm")?

Boyer-Moore is a special case of the finite-state-machine solution I
alluded to earlier.  For a search for a single string, the general-purpose
FSM solution has a small potential to be slightly faster than the current
brute-force approach, and Boyer-Moore has the potential to be slightly
faster than that.  Any speed improvement when searching for a single
string would be highly dependent on the input data.

The reason those might be a little faster is that they both guarantee that
you only ever have to examine a given byte from the file once. The
brute-force solution I posted has the potential for examining a given byte
multiple times, depending on how many times it follows something that
looks like a potential match.

In practice, both the generalized FSM and the Boyer-Moore solution involve
a little bit of overhead themselves.  At the same time, assuming random
data, on average the brute-force algorithm will have a 2-byte false start
(the smallest false start possible) in only 1 out of 256 iterations, and a
longer false start in only 1 out of 65536.  I.e. most of the time even the
brute-force algorithm examines each byte only once, if you're only
searching for one string.

Where Boyer-Moore and the generalized FSM shine is when you have multiple
strings to compare against, because they continue to allow you to examine
each byte of the file only once, whereas the cost of the brute-force
solution grows in proportion to the number of search strings.

If you have only a single string at a time you're ever looking for, I
wouldn't bother with the FSM or Boyer-Moore (or similar).
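For reference, a minimal Boyer-Moore-Horspool variant (a simplified relative of Boyer-Moore) could look like this; a sketch only, not the code I posted:

```csharp
using System;

class Horspool
{
    // Boyer-Moore-Horspool: precompute, for each possible byte value, how
    // far the pattern may safely shift when that byte sits under the last
    // position of the search window. This is what lets it skip input bytes.
    public static long IndexOf(byte[] data, byte[] pattern)
    {
        if (pattern.Length == 0 || pattern.Length > data.Length) return -1;

        int[] shift = new int[256];
        for (int i = 0; i < 256; i++) shift[i] = pattern.Length;      // default: full shift
        for (int i = 0; i < pattern.Length - 1; i++)
            shift[pattern[i]] = pattern.Length - 1 - i;               // distance from the end

        long pos = 0;
        while (pos <= data.Length - pattern.Length)
        {
            int j = pattern.Length - 1;
            while (j >= 0 && data[pos + j] == pattern[j]) j--;        // compare right-to-left
            if (j < 0) return pos;                                    // full match
            pos += shift[data[pos + pattern.Length - 1]];             // skip ahead
        }
        return -1;
    }
}
```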

As far as optimizing the code I posted goes, I haven't really thought
about it much. I would say that if it performs acceptably well for you,
you might as well spend your time on other things. Don't bother trying to
make it faster until you know for sure that making it faster will make a
difference to your user, and then make it faster by measuring the
performance of the code with a profiler so you know what parts are slower
and should be improved.

Pete
From: rossum on
On Sun, 13 Sep 2009 13:01:40 -0700, "Peter Duniho"
<no.peted.spam(a)no.nwlink.spam.com> wrote:

>Where Boyer-Moore and the generalized FSM shine is when you have multiple
>strings to compare against
In that case Rabin-Karp is also a possibility.
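A minimal single-pattern Rabin-Karp sketch (the multi-pattern form checks the same rolling hash against a set of pattern hashes instead of one):

```csharp
using System;

class RabinKarp
{
    // Rabin-Karp: keep a rolling hash of the current window and compare it
    // against the pattern's hash; verify byte-by-byte only on a hash hit.
    public static int IndexOf(byte[] data, byte[] pattern)
    {
        int m = pattern.Length;
        if (m == 0 || m > data.Length) return -1;

        const int B = 257;                     // hash base (arbitrary choice)
        int patHash = 0, winHash = 0, pow = 1; // pow = B^(m-1), mod 2^32 via overflow
        for (int i = 0; i < m; i++)
        {
            patHash = unchecked(patHash * B + pattern[i]);
            winHash = unchecked(winHash * B + data[i]);
            if (i < m - 1) pow = unchecked(pow * B);
        }

        for (int i = 0; ; i++)
        {
            if (winHash == patHash)            // possible match, verify it
            {
                int j = 0;
                while (j < m && data[i + j] == pattern[j]) j++;
                if (j == m) return i;
            }
            if (i + m >= data.Length) return -1;
            // roll the window: drop data[i], append data[i + m]
            winHash = unchecked((winHash - data[i] * pow) * B + data[i + m]);
        }
    }
}
```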

rossum

From: Michelle on
Peter,

> Tom's answered that. If you want larger files to be handled, you need to
> use 64-bit integers instead of 32-bit integers anywhere the code has an
> int used as an offset within the file.

Solved that already.
Once you read the explanation, it is actually very logical and simple.

> . . To find all occurrences, instead of returning when a match is found,
> you'd simply do some appropriate action there and then continue
> processing as if you hadn't found a match.

Solved that already.
Taking a rest and stepping back from it for a while works wonders.

> Boyer-Moore is a special-case of the finite-state-machine solution I
> alluded to earlier. For a search for a single string, the general-purpose

Thanks for the clear explanation.

> .... I would say that if it performs acceptably well for you, you might
> as well spend your time on other things.

It performs much faster than the PowerShell script. For now it is more than
enough.
I am now involved in reading the previous four bytes.

Michelle