From: Alexei A. Frounze on
On Feb 18, 3:00 am, "wolfgang kern" <nowh...(a)never.at> wrote:
> peter asked:
> > Hi
>
> Hello,
>
> >    I am adding "page up" and "page down" buttons to the instruction
> > panel (http://peter-bochs.googlecode.com/files/screendump20100203.png).
> > For the page up button, I don't know how to calculate the address at
> > which to start disassembling. For example, if I am disassembling at
> > address 0x1000, how can I know which address to disassemble from after
> > pressing the "page up" button? It could be 0xff0, 0xff1, 0xff2.
> > Currently I use this method, but it has a bug and gives the correct
> > result only around 50% of the time: save the first 10 instructions
> > into an array, keep trying to disassemble from the previous address
> > (decreasing the address one by one), and if those 10 instructions
> > appear again, that means I have disassembled from the correct address.
> > Thanks,
> > from Peter (cmk...(a)hotmail.com)
>
> I once tried this too and also used backwards byte stepping,
> but it will only be correct if it starts with the max. possible
> instruction length (14/15 bytes depending on mode), and it fails
> badly if code is mixed with data. It only remembered the last
> known start address of the first visible line for matching.
>
> So finally I just remember the address of the previous page start
> (even though page size may vary with the selected display layout) and
> use the cursor keys for moving back one byte at a time.
> This also makes it possible to see otherwise hidden entry points.
>
> __
> wolfgang

However, if we make certain assumptions and arrangements, we can make
it work most of the time. For example, suppose the disassembler knows
that the memory content came from an executable file with separate
sections for code, constants and variables/stack/heap, and it can get
all this info. Then the code section, or maybe even individual
subroutines from it, can be pre-disassembled (or just pre-parsed) from
start to end. That tells the disassembler where every instruction in
the code section begins (for other sections you won't use this
method). Of course, this still won't work for programs with data mixed
into the code section, and it won't help much with code that jumps
into the middle of its own instructions, but for most typical
applications this will work just fine.
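
A rough sketch of the pre-parsing idea, in Python. Everything here is
an assumption for illustration: TOY_LENGTHS is a made-up instruction
set in which the opcode byte alone fixes the length, and it stands in
for a real x86 length decoder. The names index_code_section and
page_up are hypothetical too.

```python
from bisect import bisect_left

# Toy length table standing in for a real x86 length decoder (assumption):
# the opcode byte alone determines the instruction length.
TOY_LENGTHS = {0x01: 1, 0x02: 2, 0x03: 3, 0x05: 5}

def insn_length(mem, offset):
    """Return the toy instruction length at offset, or None if undecodable."""
    length = TOY_LENGTHS.get(mem[offset])
    if length is None or offset + length > len(mem):
        return None
    return length

def index_code_section(mem, base):
    """Linear sweep from the section start; returns sorted instruction addresses."""
    starts = []
    offset = 0
    while offset < len(mem):
        length = insn_length(mem, offset)
        if length is None:
            break  # data or garbage: stop trusting the sweep here
        starts.append(base + offset)
        offset += length
    return starts

def page_up(starts, top_addr, lines):
    """Address for the first line after moving up `lines` instructions."""
    i = bisect_left(starts, top_addr)
    return starts[max(0, i - lines)]
```

With the table built once, "page up" becomes a lookup instead of a
guess, for as long as the sweep itself stayed in sync.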

Alex
From: Jake Waskett on
On Thu, 18 Feb 2010 11:06:19 -0800, Alexei A. Frounze wrote:

> However, if we make certain assumptions and arrangements, we can make it
> work most of the time. For example, if the disassembler knows that the
> memory content has come from an executable file that has separate
> sections for code, constants and variables/stack/heap and it can get all
> this info, then the code section or maybe even individual subroutines
> from it may be pre-disassembled (or just pre-parsed) from start to end
> and that would give the disassembler knowledge of where every
> instruction in the code section begins (for other sections you won't use
> this method). Of course, this still won't work for programs with data
> mixed in the code section and this won't help much with code that jumps
> into the middle of its instructions, but for most typical applications
> this will work just fine.
>
> Alex

I wonder whether a statistical approach might work... Some instructions
are much more likely to occur than others - for example jumps (conditional
and otherwise) are likely every few instructions. So you could try a set
of start addresses, compute some measure of the likelihood of the sequence
of disassembled instructions occurring in normal code, and pick the most
likely. It wouldn't be foolproof, but it would probably be correct most
of the time.
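
A minimal sketch of how such scoring might look, in Python. The
instruction set and the frequencies in TOY_ISA are invented for
illustration; real numbers would have to be measured from real code,
and a real x86 decoder would replace the table lookup. The function
names are hypothetical.

```python
import math

# Hypothetical toy ISA: opcode -> (length in bytes, assumed relative frequency).
TOY_ISA = {
    0x10: (1, 0.30),
    0x20: (2, 0.30),
    0x30: (3, 0.25),
    0xF0: (2, 0.01),
    0xF1: (1, 0.01),
}
# Undecodable bytes are treated as 1-byte instructions with a tiny probability.
INVALID_LOG_P = math.log(1e-6)

def score_start(mem, start, end):
    """Mean log-likelihood of the instructions decoded from start up to end."""
    total, count, offset = 0.0, 0, start
    while offset < end:
        length, p = TOY_ISA.get(mem[offset], (1, None))
        total += math.log(p) if p is not None else INVALID_LOG_P
        count += 1
        offset += length
    return total / count if count else float("-inf")

def most_likely_start(mem, candidates, end):
    """Pick the candidate start whose decoding looks most like normal code."""
    return max(candidates, key=lambda s: score_start(mem, s, end))
```

Averaging per instruction keeps candidates that decode into different
numbers of instructions comparable. This sketch does not check that
the decoding lands on a known instruction boundary; that test could be
combined with the score.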
From: robertwessel2 on
On Feb 17, 10:58 pm, peter <cmk...(a)gmail.com> wrote:
> Hi
>     I am adding a "page up" and "page down" button to the instruction
> panel (http://peter-bochs.googlecode.com/files/screendump20100203.png)
>
> For the page up button, I don't know how to calculate the address at
> which to start disassembling. For example, if I am disassembling at
> address 0x1000, how can I know which address to disassemble from after
> pressing the "page up" button? It could be 0xff0, 0xff1, 0xff2.
>
> Currently I use this method, but it has a bug and gives the correct
> result only around 50% of the time: save the first 10 instructions
> into an array, keep trying to disassemble from the previous address
> (decreasing the address one by one), and if those 10 instructions
> appear again, that means I have disassembled from the correct address.


To add to the other comments: while it's impossible in general, you
can do a pretty good job, unless you run into data or code that jumps
into the middle of instructions, by backing up several dozen bytes
before your candidate location and disassembling from there. The
further you back up, the higher the odds of getting the alignment
correct (again, except in the case of data or jumps into the middle
of instructions). Try disassembling from that address and from each of
the subsequent 14 bytes (since instructions cannot be more than 15
bytes long), up to where you know you had a good instruction
(presumably the top of the page before the "PgUp"). Only those start
positions that align correctly with the known instructions, and
include only good instructions, are candidates. You might still get
more than one possible sequence, but the further back you go, the less
likely that will be (again subject to the above limits).

If you get no hits, you can try a shorter backwards step to start your
scanning.
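
The forward scan described above might be sketched like this in
Python. TOY_LENGTHS is a made-up instruction set (opcode byte alone
gives the length) standing in for a real x86 decoder, and halving the
backstep on failure is just one possible way to "try a shorter
backwards step"; both are assumptions, as are the function names.

```python
MAX_INSN_LEN = 15  # x86 instructions cannot be longer than this

TOY_LENGTHS = {0x01: 1, 0x02: 2, 0x03: 3, 0x05: 5}  # hypothetical toy ISA

def decodes_onto(mem, start, known):
    """True if decoding from start uses only valid opcodes and hits known exactly."""
    offset = start
    while offset < known:
        length = TOY_LENGTHS.get(mem[offset])
        if length is None:
            return False
        offset += length
    return offset == known

def find_aligned_starts(mem, known, backstep):
    """Start offsets near known - backstep whose decoding lines up with known."""
    while backstep > 0:
        first = max(0, known - backstep)
        last = min(known, first + MAX_INSN_LEN)
        hits = [s for s in range(first, last) if decodes_onto(mem, s, known)]
        if hits:
            return hits
        backstep //= 2  # no hits: retry with a shorter backward step
    return []
```

More than one hit can come back; every one of them is a start that
decodes cleanly onto the known instruction, so any of them is a
plausible top line for the page.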

You can also do this backwards, one instruction at a time. Try to
determine how many instructions (or bytes) back you can construct a
plausible sequence (basically constructing a tree of possible
instruction sequences), and then assume that the version of the
last instruction (the one you're trying to back over) that leads to
the longest plausible sequence of preceding instructions is the most
likely one. IOW, at each position n, try positions (n-1)..(n-15) to
see if each decodes as a 1..15 byte instruction.
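
One way to sketch the backward search in Python, replacing the
explicit tree with a table of "longest chain ending here" values (a
dynamic-programming shortcut, not necessarily how anyone in this
thread would build it). TOY_LENGTHS is again a made-up instruction set
standing in for a real decoder, and step_back is a hypothetical name.

```python
MAX_INSN_LEN = 15
TOY_LENGTHS = {0x01: 1, 0x02: 2, 0x03: 3, 0x05: 5}  # hypothetical toy ISA

def step_back(mem, known, count, floor=0):
    """Walk back `count` instructions from `known`, preferring the deepest chains."""
    # depth[n] = longest run of valid instructions that ends exactly at offset n
    depth = {}
    best_pred = {}
    for n in range(floor, known + 1):
        depth[n] = 0
        for p in range(max(floor, n - MAX_INSN_LEN), n):
            # p is a predecessor of n if an instruction at p is exactly n - p long
            if TOY_LENGTHS.get(mem[p]) == n - p and depth[p] + 1 > depth[n]:
                depth[n] = depth[p] + 1
                best_pred[n] = p
    addr = known
    for _ in range(count):
        if addr not in best_pred:
            break  # backed up into something that is not an instruction
        addr = best_pred[addr]
    return addr
```

The walk simply stops when no predecessor decodes, which is the point
where the search has run into non-instruction bytes.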

The going forward approach has the advantage of simplicity in terms of
data structures, but requires more care to deal with the inevitable
situations where you can't find a good sequence from a given starting
point (basically you have to start over and try a shorter backstep,
possibly binary searching to the longest one you can find). The going
backwards approach has the advantage of dealing well with the point
where you back up into non-instruction data, but requires a more
complex search.

You might apply some additional heuristics. For example, most x86-32
and -64 code does not contain data intermixed with code (although it's
certainly possible), and because of the way most OSs, linkers and
program loaders work, you can usually assume that code sections start
on page (4KB) boundaries. And there are certainly some more common
sequences that you might search for (a RET followed by enough NOPs to
get to a 16 byte boundary, or the sequence push ebp, mov ebp,esp, sub
esp,##, for example).
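
A small Python sketch of searching for those byte sequences in 32-bit
code. The encodings used are 55 for push ebp, 89 E5 or 8B EC for
mov ebp,esp, C3 for RET and 90 for NOP. The function name and the
choice to require at least one NOP of padding are assumptions of this
sketch, and the sub esp,## that often follows the prologue is not
required here. Matches are only hints, since the same bytes can occur
in data or in the middle of an instruction.

```python
import re

RET, NOP = 0xC3, 0x90
# push ebp (55) followed by either encoding of mov ebp,esp (89 E5 or 8B EC)
PROLOGUE = re.compile(rb"\x55(?:\x89\xe5|\x8b\xec)")

def likely_function_starts(mem, base=0):
    """Addresses that look like function entry points, as anchor candidates."""
    hits = {base + m.start() for m in PROLOGUE.finditer(mem)}
    for i, byte in enumerate(mem):
        if byte != RET:
            continue
        pad_start = i + 1
        # offset of the next 16-byte-aligned address at or after the padding
        boundary = ((base + pad_start + 15) & ~15) - base
        padding = mem[pad_start:boundary]
        # RET followed only by NOPs (at least one) up to the boundary
        if padding and boundary < len(mem) and set(padding) == {NOP}:
            hits.add(base + boundary)
    return sorted(hits)
```

Any address found this way can serve as a known-good starting point
for a forward disassembly toward the page being displayed.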

And the suggestion to give the user a chance to fiddle the alignment
manually is a good one.
From: James Harris on
On 18 Feb, 06:43, "Alexei A. Frounze" <alexfrun...(a)gmail.com> wrote:
> On Feb 17, 8:58 pm, peter <cmk...(a)gmail.com> wrote:
>
>
>
> > Hi
> >     I am adding a "page up" and "page down" button to the instruction
> > panel (http://peter-bochs.googlecode.com/files/screendump20100203.png)
>
> > For the page up button, I don't know how to calculate the address to
> > start to disassemble. For example, if I am disassembling 0x1000
> > address, how can I know what address I should disassemble after
> > pressing the "page up" button, it could be 0xff0, 0xff1, 0xff2.
>
> > Currently I use this method, but it has a bug and gives the correct
> > result only around 50% of the time: save the first 10 instructions
> > into an array, keep trying to disassemble from the previous address
> > (decreasing the address one by one), and if those 10 instructions
> > appear again, that means I have disassembled from the correct address.

Given the information available ISTM important that the user be in no
doubt that it's the *user's* responsibility to align the disassembly.
Consider that the instruction on which processing is stopped may have
just been jumped to from somewhere else and be preceded by nothing
useful.

> You can't implement this correctly in general when your instructions
> have variable length and even may overlap with data. You may only try
> or let the user do this by allowing him to adjust the address on the
> first line by +/-1 in an easy manner.

Bumping the start byte by one is ideal if the disassembly is quick
enough. When the resulting instruction stream fails to perfectly meet
the known instruction some clear indication would be helpful.

Of course, any jump may really have been to the middle of an
instruction....

James