From: Peter Duniho on
On Thu, 10 Sep 2009 09:37:52 -0700, Michelle <michelle(a)notvalid.nomail>
wrote:

> [...]
>> Define "large," because you'd be better off if you can read the entire
>> thing into a byte array and then just work off that byte array. Using
>> chunks forces you to deal with the situation where the data you're
>> searching for is split across the end of one chunk and the beginning of
>> the next.
>
> Mostly more than 1 GB.

Jeff is correct that it will be more of a hassle to deal with per-block
reads from the file when searching for a specific string of bytes. But
with a file that large, you probably will want to anyway. A general
purpose solution would involve some kind of list or circular buffer of
blocks you've read, but if you can ensure that the length of the search
string is always less than the size of a single read block, you could get
away with a couple of variables or a two-element array.

Other than dealing with the cross-boundary comparison logic, it should be
straightforward. Just read data and at each byte offset, compare your
search string to the bytes starting at that offset. Just repeat until you
either find the byte string you're looking for or you've reached a byte
offset that is closer to the end of the file than you have bytes in your
search string.
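
As a rough illustration of the block-wise approach described above, here is a minimal sketch. It is only one possible way to handle the boundary case: it carries the last (pattern length - 1) bytes of each block over to the front of the next one, so a match split across two reads is still seen in a single buffer. The names `ChunkedSearch`, `IndexOf` and `blockSize` are made up for this example, and the returned offset is counted from the point where reading starts.

```csharp
using System;
using System.IO;

public static class ChunkedSearch
{
    // Hypothetical helper: returns the offset (relative to where reading
    // starts) of the first occurrence of pattern, or -1 if it is not found.
    public static long IndexOf(Stream s, byte[] pattern, int blockSize = 1 << 20)
    {
        if (pattern.Length == 0) return 0;
        if (blockSize <= 0) throw new ArgumentOutOfRangeException(nameof(blockSize));

        int keep = pattern.Length - 1;           // tail carried across blocks
        byte[] buf = new byte[keep + blockSize];
        int carried = 0;                         // carried bytes sit in buf[0..carried)
        long bufStart = 0;                       // stream offset of buf[0]

        while (true)
        {
            // Read may return fewer bytes than requested; 0 means end of stream.
            int read = s.Read(buf, carried, blockSize);
            if (read <= 0) return -1;
            int valid = carried + read;

            // Compare the pattern at every offset where it fits completely.
            for (int i = 0; i + pattern.Length <= valid; i++)
            {
                int j = 0;
                while (j < pattern.Length && buf[i + j] == pattern[j]) j++;
                if (j == pattern.Length) return bufStart + i;
            }

            // Move the tail to the front so a match that straddles the
            // block boundary is visible after the next read.
            int newCarried = Math.Min(keep, valid);
            Array.Copy(buf, valid - newCarried, buf, 0, newCarried);
            bufStart += valid - newCarried;
            carried = newCarried;
        }
    }
}
```

For example, with `blockSize` set to 4 and the data 01 02 03 04 05 06 07 08 09, the pattern 04 05 06 straddles the first block boundary and is still reported at offset 3. The byte-by-byte comparison is the naive one from the description; a smarter matcher could be dropped into the inner loop without changing the carry logic.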

Pete
From: Tom Spink on
Hi Michelle,

Michelle wrote:

> I need some help to search through a (sometimes large) binary file.
>
> I'd like to search within this binary file for a pattern containing a
> particular hex value (e.g. FF56131A1B087B15610800151E).
> When the pattern is found, I need to know the (start) offset, because then I'd
> like to read the 4 previous bytes (need the hex values).
> I suppose that I need to read the file in blocks (500 KB or 1 MB) because
> it's a large file.
>
> I am a rookie using C#, so if possible please share a piece of code.
>
> Thanks in advance,
>
> Michelle
>

Do you know if the hex value you are searching for is aligned at all?

--
Tom

From: Tom Spink on
Peter Duniho wrote:

> On Thu, 10 Sep 2009 09:37:52 -0700, Michelle <michelle(a)notvalid.nomail>
> wrote:
>
>> [...]
>>> Define "large," because you'd be better off if you can read the entire
>>> thing into a byte array and then just work off that byte array. Using
>>> chunks forces you to deal with the situation where the data you're
>>> searching for is split across the end of one chunk and the beginning of
>>> the next.
>>
>> Mostly more than 1 GB.
>
> Jeff is correct that it will be more of a hassle to deal with per-block
> reads from the file when searching for a specific string of bytes. But
> with a file that large, you probably will want to anyway. A general
> purpose solution would involve some kind of list or circular buffer of
> blocks you've read, but if you can ensure that the length of the search
> string is always less than the size of a single read block, you could get
> away with a couple of variables or a two-element array.
>
> Other than dealing with the cross-boundary comparison logic, it should be
> straightforward. Just read data and at each byte offset, compare your
> search string to the bytes starting at that offset. Just repeat until you
> either find the byte string you're looking for or you've reached a byte
> offset that is closer to the end of the file than you have bytes in your
> search string.
>
> Pete

I'd say it'd probably be easier to use a finite state machine,
and just read one byte at a time. The maximum size of the buffer would
be the length of the array to find.

--
Tom

From: Tom Spink on
Hi All,

Tom Spink wrote:
> I'd say it'd probably be easier to use a finite state machine,
> and just read one byte at a time. The maximum size of the buffer would
> be the length of the array to find.

I haven't tested this properly (I've got to run off to work); I've only
compile-tested it and run it against a very (very) contrived sample.

Let me know what you guys think - and what I've forgotten about, and
please rip it apart and correct it!

It takes a stream, and finds the offset of the first matching sequence
of bytes in 'arr'. Hopefully.

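///
// Requires: using System.IO;
public static long FindArrayOffset(Stream s, byte[] arr)
{
    // fallback[i]: length of the longest proper prefix of arr that is also
    // a suffix of arr[0..i]. It tells the state machine where to resume
    // after a mismatch without re-reading any bytes.
    int[] fallback = new int[arr.Length];
    for (int i = 1, k = 0; i < arr.Length; i++) {
        while (k > 0 && arr[i] != arr[k])
            k = fallback[k - 1];
        if (arr[i] == arr[k])
            k++;
        fallback[i] = k;
    }

    int testByte, testIndex = 0;

    for (testByte = s.ReadByte(); testByte >= 0; testByte = s.ReadByte()) {
        // On a mismatch, drop back to the longest partial match that is
        // still possible, then re-test the current byte from there.
        while (testIndex > 0 && arr[testIndex] != (byte)testByte)
            testIndex = fallback[testIndex - 1];

        if (arr[testIndex] == (byte)testByte)
            testIndex++;

        // Full match: the sequence started arr.Length bytes back.
        if (testIndex == arr.Length)
            return s.Position - arr.Length;
    }

    return -1;
}
///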

I haven't thought through all the code paths, but I've got a brain cell
nagging me about off-by-one errors, so pay particular attention to the
post-increment operator, as I may have cocked that up.

--
Tom

From: Michelle on
Hi Tom,

> Do you know if the hex value you are searching for is aligned at all?

I don't exactly understand what you mean by "aligned" (maybe it's because
English is not my native language).
I need to search through all the rubbish for the pattern
(FF56131A1B087B15610800151E)
and then read the previous 4 bytes. (For example:
0923080709C0224FFF56131A1B087B15610800151E -> 09C0224F)
The pattern is a kind of 'record footer', but the 'records' don't have a
fixed length.

Is this the information you need?

Michelle
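
To make the "read the previous 4 bytes" step concrete, here is a small sketch. It assumes the start offset of the pattern has already been found by some search routine and that the stream is seekable (a `FileStream` is). `FooterReader`, `ParseHex` and `ReadPrevious4` are hypothetical names invented for this illustration, not part of any library.

```csharp
using System;
using System.IO;

public static class FooterReader
{
    // Hypothetical helper: turns a hex string such as "FF56131A" into bytes.
    public static byte[] ParseHex(string hex)
    {
        if (hex.Length % 2 != 0)
            throw new ArgumentException("Hex string must have an even number of digits.");

        byte[] bytes = new byte[hex.Length / 2];
        for (int i = 0; i < bytes.Length; i++)
            bytes[i] = Convert.ToByte(hex.Substring(2 * i, 2), 16);
        return bytes;
    }

    // Hypothetical helper: returns the 4 bytes that precede matchOffset,
    // formatted as upper-case hex without separators.
    public static string ReadPrevious4(Stream s, long matchOffset)
    {
        if (matchOffset < 4)
            throw new ArgumentException("Fewer than 4 bytes precede the match.");

        byte[] prev = new byte[4];
        s.Seek(matchOffset - 4, SeekOrigin.Begin);

        // Read can return fewer bytes than requested, so loop until all 4 arrive.
        int got = 0;
        while (got < prev.Length)
        {
            int n = s.Read(prev, got, prev.Length - got);
            if (n <= 0) throw new EndOfStreamException();
            got += n;
        }

        // BitConverter.ToString gives "09-C0-22-4F"; strip the dashes.
        return BitConverter.ToString(prev).Replace("-", "");
    }
}
```

Using the example from the post: the bytes 0923080709C0224F come first, so the pattern starts at offset 8, and `ReadPrevious4(stream, 8)` on that data returns "09C0224F".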