Search through a (large) binary file. [CSharp]


From: Michelle on 16 Sep 2009 01:26

Peter,

> [. . . ]
> Frankly, the more you explain about the basic problem, the less I feel
> Files have structures; I can guarantee you whatever this kind of file is,
> the intended user code doesn't need to search for things. It simply
> parses the data and knows the precise location of particular kinds of
> data within the file.
The original file has a structure with variable-length records.
There is little documentation (and none of it official), so reverse
engineering is the only option.
We obtained the data through data recovery, so other data is stored
between the records we want.
We can no longer rely on the original file structure.
Because we don't need the complete record information, it's enough to find
the mentioned patterns and read some bytes before or after these patterns.
Of course there is a risk of false positives, but the combination of the
pattern and the bytes found makes it acceptable.
Because we need to do this more than once, and using WinHex is a lot of
work, we decided to (try to) write an application for this (for internal
use only).
Unfortunately I can't share all the details of the "why"s and "what"s
related to our problem.

In summary, it comes down to:
Search for the pattern: 0xFF 0x56 0x13 0x1A 0x1B 0x08 0x7B 0x15 0x61 0x08
0x00 0x15 0x1E
Read the previous 4 bytes (convert them to a Decimal value)

Search for the pattern: 0x07 0x00 0xXX 0x00 0x00 0x00 0x07 0x00 0xYY 0x00
0x00 0x00 0x08 0x00 ( where 0xXX can have 4 and 0xYY 6 different values)
Read the next 8 bytes (convert them to a Decimal value)

> [. . .] You could search for the sub-components individually. Look for
> one, then look for the other in the specific place it should be if you
> find the first. Though, if Regex makes the code simpler, it might well
> be worth it anyway, even if it doesn't perform as well.
Can I do this with the example you created ?

Yesterday I spent my day reading documentation on various algorithms and C#
examples.
The problem is that all the examples I've found on the Internet are
intended to search for strings and not bytes.
So, I've got a challenge :-))

I greatly appreciate your help!

Michelle

From: Michelle on 16 Sep 2009 01:33

Tom,

> This sounds like a highly structured file - *surely* there is some sort
> of descriptor at the start of it that contains a pointer to these
> records.
[ . . . ]

Please read my reply to Peter's post.

The same goes for you, Tom:
I greatly appreciate your help!

Michelle

From: Peter Duniho on 16 Sep 2009 03:54

On Tue, 15 Sep 2009 22:26:06 -0700, Michelle <michelle(a)notvalid.nomail>
wrote:

> [...]
> Summarized it comes down:
> Search for the pattern: 0xFF 0x56 0x13 0x1A 0x1B 0x08 0x7B 0x15 0x61 0x08
> 0x00 0x15 0x1E
> Read the previous 4 bytes (convert them to a Decimal value)
>
> Search for the pattern: 0x07 0x00 0xXX 0x00 0x00 0x00 0x07 0x00 0xYY 0x00
> 0x00 0x00 0x08 0x00 ( where 0xXX can have 4 and 0xYY 6 different values)
> Read the next 8 bytes (convert them to a Decimal value)

I'm concerned that you keep writing "Decimal", when nothing about any of
the description of the problem suggests you have or need Decimal values.
Decimal is a specific type in .NET, a base-10 floating point structure.
As I mentioned before, it takes 16 bytes.

You can, of course, store an Int32 or Int64 value in a Decimal variable
_after_ you've converted the raw bytes to Int32 or Int64 as appropriate.
But that's a step completely independent of the file i/o and searching,
and so any discussion of the Decimal type seems out of place here.

The fact that it keeps coming up makes me concerned that you may not
understand the distinction between Decimal and other numeric types, and/or
the implications regarding how the number is stored.
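
As a rough illustration of the distinction drawn above: the raw bytes are first turned into an Int32 or Int64, and only afterwards (if at all) assigned to a Decimal. The sketch below assumes the values are stored little-endian, which the thread does not state; the class and method names are made up for this example.

```csharp
using System;

static class RawValue
{
    // Assumption: little-endian byte order (least significant byte first).
    public static int ReadInt32LE(byte[] data, int index)
    {
        return data[index]
            | (data[index + 1] << 8)
            | (data[index + 2] << 16)
            | (data[index + 3] << 24);
    }

    // Builds the 64-bit value from two 32-bit halves, low half first.
    public static long ReadInt64LE(byte[] data, int index)
    {
        uint low = (uint)ReadInt32LE(data, index);
        uint high = (uint)ReadInt32LE(data, index + 4);
        return unchecked((long)(((ulong)high << 32) | low));
    }
}
```

Storing the result in a Decimal is then a separate, ordinary conversion, for example `decimal d = RawValue.ReadInt32LE(bytes, 0);`, and has nothing to do with how the bytes are laid out in the file.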

>> [. . .] You could search for the sub-components individually. Look for
>> one, then look for the other in the specific place it should be if you
>> find the first. Though, if Regex makes the code simpler, it might well
>> be worth it anyway, even if it doesn't perform as well.
>
> Can I do this with the example you created ?

Can you do which? You quoted two options: searching for sub-components
individually, and using Regex.

You can do the former by modifying the code I posted. You'll simply have
to come up with a way of representing your search string in a way that can
be translated into calls to the FRangesEqual() method. For example, have
a List<T> of structs where the struct data type stores a reference to a
byte[] containing the byte string you want to find, along with an offset
within the search range for that byte string. Then pass that list to the
"find" method, where it starts the search with the first element in the
list, and then upon finding each element in the list, it executes another
search at the offset relative to the current position in the file for the
next element in the list. Repeat that until you run out of elements in
the list or find a mis-match; if you run out of elements, you've found a
match.

Using your two sample search strings, the first time you search, the
List<T> would have just one element, referencing a single byte string to
look for, "0xFF 0x56 0x13 0x1A 0x1B 0x08 0x7B 0x15 0x61 0x08 0x00 0x15
0x1E", and an offset of 0. The next time you search, the List<T> would
have three elements. The first would reference the byte string "0x07
0x00" and an offset of 0, the next the byte string "0x00 0x00 0x00 0x07
0x00" and an offset of 3, and the last the byte string "0x00 0x00 0x00
0x08 0x00" and an offset of 9.
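
The list-of-parts idea described above might look roughly like the following. This is only a sketch: it searches a byte[] already in memory rather than reading the file block by block, the plain comparison loop stands in for FRangesEqual() (whose signature is not shown in this part of the thread), and PatternPart and PartSearch are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical type: one fixed run of bytes plus its offset within the pattern.
struct PatternPart
{
    public readonly byte[] Bytes;
    public readonly int Offset;

    public PatternPart(byte[] bytes, int offset)
    {
        Bytes = bytes;
        Offset = offset;
    }
}

static class PartSearch
{
    // True if every part matches when the pattern is assumed to begin at 'start'.
    static bool MatchesAt(byte[] data, int start, List<PatternPart> parts)
    {
        foreach (PatternPart part in parts)
        {
            int from = start + part.Offset;
            if (from + part.Bytes.Length > data.Length) return false;
            for (int i = 0; i < part.Bytes.Length; i++)
            {
                if (data[from + i] != part.Bytes[i]) return false;
            }
        }
        return true;
    }

    // Returns the index of the first match at or after 'start', or -1 if none.
    public static int Find(byte[] data, int start, List<PatternPart> parts)
    {
        for (int pos = start; pos < data.Length; pos++)
        {
            if (MatchesAt(data, pos, parts)) return pos;
        }
        return -1;
    }
}
```

For the second search string, the list would hold three parts: { 0x07, 0x00 } at offset 0, { 0x00, 0x00, 0x00, 0x07, 0x00 } at offset 3, and { 0x00, 0x00, 0x00, 0x08, 0x00 } at offset 9. The bytes at offsets 2 and 8 (0xXX and 0xYY) are simply not compared; checking them against their allowed values would be a separate step after a match.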

You can't use the code I posted to do a Regex search, not directly
anyway. IMHO, if you're going to use Regex, you might as well go back to
porting the PowerShell script you found.

> Yesterday I spent my day reading documentation on various algorithms and
> C# examples.
> The problem is that all the examples I've found on the Internet are
> intended to search for strings and not bytes.
> So, I've got a challenge :-))

Any algorithm you find that is specifically for strings, you should be
able to easily modify to handle bytes instead. The main issue you may run
into would be examples that take advantage of string functions in existing
libraries rather than implementing the algorithm themselves. You can
either provide versions of those functions that work with bytes, or just
stick to those algorithm examples that don't depend on library functions,
but instead do all their own processing (in which case, simply changing
any place a string is used to byte[] and any place a char is used to a
byte).
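
To show how small that change can be, here is a sketch of one well-known string-search algorithm, Boyer-Moore-Horspool, written against byte[] instead of string. It is not the code posted earlier in the thread, just an example of swapping char for byte; the names are made up for illustration.

```csharp
using System;

static class ByteSearch
{
    // Boyer-Moore-Horspool with string replaced by byte[] and char by byte.
    // Returns the index of the first occurrence of 'needle', or -1 if absent.
    public static int IndexOf(byte[] haystack, byte[] needle)
    {
        if (needle.Length == 0) return 0;
        if (needle.Length > haystack.Length) return -1;

        // A byte has only 256 possible values, so the shift table is a plain array.
        int[] shift = new int[256];
        for (int i = 0; i < shift.Length; i++) shift[i] = needle.Length;
        for (int i = 0; i < needle.Length - 1; i++)
        {
            shift[needle[i]] = needle.Length - 1 - i;
        }

        int pos = 0;
        while (pos <= haystack.Length - needle.Length)
        {
            // Compare from the end of the needle backwards.
            int j = needle.Length - 1;
            while (j >= 0 && haystack[pos + j] == needle[j]) j--;
            if (j < 0) return pos;

            // Skip ahead based on the byte aligned with the needle's last position.
            pos += shift[haystack[pos + needle.Length - 1]];
        }
        return -1;
    }
}
```

The only byte-specific detail is the 256-entry table; with strings the same table usually has to be a dictionary or a much larger array.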

Pete

From: Michelle on 16 Sep 2009 04:40

Peter,

> I'm concerned that you keep writing "Decimal", when nothing about any of
> the description of the problem suggests you have or need Decimal values.
> Decimal is a specific type in .NET, a base-10 floating point structure.
> As I mentioned before, it takes 16 bytes.

Is 'decimal notation' a better term?

> [. . . ]
> and so any discussion of the Decimal type seems out of place here.

Correct, it's not the main issue.

[. . . ]
> Can you do which? You quoted two options: searching for sub-components
> individually, and using Regex.

I meant search for the sub-components individually.

> You can do the former by modifying the code I posted. You'll simply have
> to come up with a way of representing your search string in a way that can
> be translated into calls to the FRangesEqual() method.
[ . . .]

I'll examine this and see if I can get it done.

> You can't use the code I posted to do a Regex search, not directly
> anyway. IMHO, if you're going to use Regex, you might as well go back to
> porting the PowerShell script you found.

Okay, that's not an option.

[. . . ]
> You can either provide versions of those functions that work with bytes,
> or just stick to those algorithm examples that don't depend on library
> functions, but instead do all their own processing (in which case, simply
> changing any place a string is used to byte[] and any place a char is
> used to a byte).

I'll see if I can get it done.

Michelle

From: Michelle on 16 Sep 2009 11:10

Peter,

[. . .]
> You can use the Position property to adjust from where you're reading in
> the file; save the current position, set the current position to 4 bytes
> earlier than the offset of the found string, read the 4 bytes of
> interest, then restore the current position to the previously saved
> value.
Int64 Offset1 = (ibBaseOffset + ibOffset);  // offset = 41152
Int64 Offset2 = stream.Position;            // offset = 49152

Why is Offset1 not equal to Offset2?
Offset1 is the right offset. Changing the block size has an effect (byte[]
rgbBlockCur = new byte[4096];).

I tried several options with stream.Seek(Offset, SeekOrigin) to set the
current position and restore the previously saved position.
But when I read the previous 4 bytes and restore the previously saved
position, the returned offset is no longer right. The search continues, but
with the wrong offset.

The first 'hit' has the right offset, and the 4 bytes read before it are right.
It goes wrong after the position is restored to the previously saved value.

Michelle
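
The save/seek/read/restore sequence quoted at the top of this post can be sketched as below. The search code itself is not shown in this part of the thread, so this is a guess at the surrounding conditions: it assumes a seekable Stream, and StreamPeek.ReadBefore is a hypothetical helper. One observation that may be relevant: 49152 is exactly 12 x 4096, which would fit stream.Position reporting the end of the last block read rather than the location of the match, so the two offsets would be expected to differ.

```csharp
using System;
using System.IO;

static class StreamPeek
{
    // Reads 'count' bytes ending just before 'matchOffset', then puts the
    // stream back where it was so a block-by-block search can carry on.
    public static byte[] ReadBefore(Stream stream, long matchOffset, int count)
    {
        if (matchOffset < count) throw new ArgumentOutOfRangeException("matchOffset");

        long saved = stream.Position;   // where the search had read up to
        try
        {
            stream.Position = matchOffset - count;
            byte[] result = new byte[count];
            int total = 0;
            while (total < count)
            {
                // Read may return fewer bytes than requested, so loop.
                int read = stream.Read(result, total, count - total);
                if (read == 0) throw new EndOfStreamException();
                total += read;
            }
            return result;
        }
        finally
        {
            stream.Position = saved;    // restored even if the read fails
        }
    }
}
```

If the search derives its base offset from stream.Position, that bookkeeping has to be done before or after the peek, not in the middle of it; otherwise every later offset is shifted.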
