From: Michelle on
Tested the Powershell script.

It's very slow. . . but it works !!
Is it possible to convert the script to C#. I suppose it's much faster in
C#.
Maybe it needs some performance tuning, but it illustrates what I need.

Regards,

Michelle


> -<script.ps1>-
>
> param (
> [string]$pattern = $(throw '-pattern can''t be $null'),
> [string]$file = $(throw '-file can''t be $null'),
> [long]$blockSize = 50kb,
> [int]$prevBytes = 8
> )
> if ($blockSize -lt 1kb) {throw '-blockSize must be at least 1kb'}
> $file = $(if (test-path $file) {(rvpa $file).path} else {
> throw "Cannot find $path"})
> $stream = [io.file]::OpenRead($file)
> if ($stream.length -lt $blockSize) {$blockSize = $stream.length}
> $lastBlock = $stream.length % $blockSize
> $buffer = new-object byte[] $blockSize
> $extraBytes = $pattern.length / 2 + $prevBytes - 1
> $blocks = [math]::ceiling($stream.length / $blockSize)
> $block = 1
> while ($block -le $blocks) {
> if ($block -eq $blocks -and $lastBlock) {$blockSize = $lastBlock
> $buffer = new-object byte[] $blockSize}
>
> [void]$stream.read($buffer, 0, $blockSize)
> $hexStr = [string]::join('', ($rimBytes + $buffer | % {'{0:X2}' -f $_}))
> $rimBytes = $buffer[-$extraBytes..-1]
> [regex]::matches($hexStr, $pattern, 'ignoreCase') |
> % {$i = $_.index - $prevBytes * 2
> [string]::join('', $hexStr[$i..($i + $prevBytes * 2 - 1)]) |
> % {$target = $_
> $hexBytes = 0..($_.length - 1) | ? {!($_ -band 1)} |
> % {"0x$($target.subString($_,2))"}
> }
> }
> $block++
> }
> [void]$stream.close
> [void]$stream.dispose
>
> -<script.ps1>-
>


From: rossum on
On Thu, 10 Sep 2009 16:43:16 +0200, "Michelle"
<michelle(a)notvalid.nomail> wrote:

>I need some help to search through a (sometimes large) binary file.
>
>I'd like to search within this binary file for a pattern containing a
>particular hex value (e.g. FF56131A1B087B15610800151E).
>When the pattern is found, i need to know the (start) offset, because then I'd
>like to read the 4 previous bytes (need the hex values).
>I suppose that i need to read the file in blocks (500 kb or 1 mb) because
>it's a large file.
>
>I 'am a rookie using C#, so if possible please share a piece of code.
>
>Thanks in advance,
>
>Michelle
>
This is essentially a string searching problem. Google for the
"Rabin-Karp Algorithm", the "Knuth-Morris-Pratt Algorithm" and the
"Boyer-Moore Algorithm". If there is more than one possible pattern
to search for then Rabin-Karp is the best as it can search for
multiple patterns in a single pass. If there is only a single pattern
to match then use one of the other two as they will be faster at
finding a single target.

Google will find example C# code for all three algorithms.

rossum

From: Peter Duniho on
On Fri, 11 Sep 2009 10:33:25 -0700, Michelle <michelle(a)notvalid.nomail>
wrote:

> Tested the Powershell script.
>
> It's very slow. . . but it works !!
> Is it possible to convert the script to C#. I suppose it's much faster
> in C#.
> Maybe it needs some performance tuning, but it illustrates what I need.

Actually, the main reason it's slow is the fundamental implementation, not
the language And if you port that logic to C#, you'll have the same
problems. It's converting bytes to characters for the purpose of the
searching, as well as doing string concatenation to deal with the
cross-block issues.

It may be that you gain some speed by moving to a compiled .NET
implementation, but you'll still have the other overhead.

That said, sure...it's possible to port the script to C#. It's just code.

Pete
From: Peter Duniho on
On Fri, 11 Sep 2009 05:21:37 -0700, Michelle <michelle(a)notvalid.nomail>
wrote:

> This is what i've done so far: [...]

Unfortunately, it's not even close. You haven't even successfully ported
the block-read logic, and that's not the interesting part of sample. It
"handles" cross-block boundaries simply by maintaining enough "previous
data" to account for the length of the search pattern, and then prepending
that to the recently read data before each match attempt. It's also
formatting the bytes as text before doing the match, which is very
inefficient.

Try something like this (warning: uncompiled, untested, unoptimized, no
error-checking, doesn't handle files > 2GB, and assumes that the search
pattern is always no longer than the length of a single block that's
read...proof of concept only):

int FindByteString(string strInputFile, byte[] rgbPattern)
{
byte[] rgbBlockCur = new byte[4096],
rgbBlockNext = new byte[rgbBlockCur.Length];
FileStream stream = File.OpenRead(strInputFile);
int cbBlockCur, cbBlockNext, ibOffset = 0, ibBaseOffset = 0;

cbBlockCur = stream.Read(rgbBlockCur, 0, rgbBlockCur.Length);
cbBlockNext = stream.Read(rgbBlockNext, 0, rgbBlockNext.Length);

while (true)
{
if (ibOffset + rgbPattern.Length <= cbBlockCur)
{
if (FRangesEqual(rgbBlockCur, ibOffset, rgbPattern.Length,
rgbPattern))
{
return ibBaseOffset + ibOffset;
}
}
else if (ibOffset + rgbPattern.Length <= cbBlockCur +
cbBlockNext)
{
if (FRangesEqual(rgbBlockCur, ibOffset, cbBlockCur -
ibOffset, rgbPattern) &&
FRangesEqual(rgbBlockNext, 0, ibOffset +
rgbPattern.Length - cbBlockCur, rgbPattern))
{
return ibBaseOffset + ibOffset;
}
}
else
{
return -1;
}

if (++ibOffset < cbBlockCur.Length)
{
continue;
}

byte[] rgbT = rgbBlockCur;
rgbBlockCur = rgbBlockNext;
rgbBlockNext = rgbT;

ibOffset = 0;
ibBaseOffset += cbBlockCur;

cbBlockCur = cbBlockNext;
cbBlockNext = stream.Read(rgbBlockNext, 0,
rgbBlockNext.Length);
}
}

bool FRangesEqual(byte[] rgb, int ibStart, int ibLength, byte[]
rgbPattern)
{
for (int ibPattern = 0; ibPattern < ibLength; ibPattern++)
{
if (rgb[ibStart + ibPattern] != rgbPattern[ibPattern])
{
return false;
}
}

return true;
}

There are a number of refinements one might make to the above. But even
that basic example ought to perform much better than a C# port of the
PowerShell script you posted. I apologize in advance for any typos or
bugs; hopefully it's enough to get you pointed in the right direction, in
spite of any flaws that might be present.

Pete
From: Tom Spink on
Peter Duniho wrote:

> On Fri, 11 Sep 2009 02:59:42 -0700, Michelle <michelle(a)notvalid.nomail>
> wrote:
>
>> Hi Tom,
>>
>>> Do you know if the hex value you are searching for is aligned at all?
>>
>> I don't exactly understand what you mean with "aligned" (maybe it's
>> because
>> English is not my native language)
>
> I believe that Tom's asking if the pattern you're looking for will always
> begin at a byte the offset of which is exactly some multiple of some
> number larger than 1.
>
> For example, a 32-bit-aligned pattern could be found at byte offset 0, 4,
> 8, etc. but never at 1, 2, 3, 5, 6, 7, etc.
>
> If it's aligned, then that obviously can reduce somewhat the overhead of
> non-matching characters, because you have to examine on average 1/cb the
> number of bytes, where "cb" is the byte length of the alignment (e.g. for
> 32-bit alignment, "cb" is "4").
>
> Pete

Thanks Pete,

Those were my thoughts - which would also tie in nicely with your
chunking algorithm - and help to eliminate the block-overlap problem, if
the block size used was some multiple of the alignment value.

--
Tom