From: Terje Mathisen on
Skybuck Flying wrote:
> "MitchAlsup" <MitchAlsup(a)aol.com> wrote in message
> news:cb20dc12-415b-4a0c-8bd9-4985a7c04946(a)w74g2000hsh.googlegroups.com...
>> Lets do this in an instruction set that it makes it easy, and use C;
>>
>> void WriteLongwordBits( unsigned long Value, unsignedint BitCount,
>> unsigned long int *DestAddress, unsigned int DestBitIndex )
>> {
>> ASM{ // get stuff from the parameter passing area
>> mov a1,12[sp]
>> mov d2,16[sp]
>> mov a3,20[sp]
>> mov d4,24[sp]
>>
>> BFEXTU d5,d2,[a1] // get the source field
>> BFINS d5,d4,[a3] // put it where it belongs
>> }
[snip]
> Maybe somebody can re-write all of this assembler to be more efficient ? ;)

Mitch already did, see above.
>
> Or simply come up with an alternative assembler implementation.

Your example code is simply horrible, besides the fact that it can trap
on a 64-bit platform since you don't seem to care about doing aligned
64-bit accesses.

You also imply a 32-bit little-endian architecture since you use a array
of 32-bit values as your target address, which together with a random
bit offset means that writes will probably straddle between two 32-bit
words.

However, none of this really matters at all, because the function can
_never_ be time critical: If it is, you simply have no idea at all how
to write fast code.

So, with that out of the way, lets look at your function:

void WriteLongwordBits( unsigned long Value, unsigned int BitCount,
unsigned long int *DestAddress, unsigned int DestBitIndex )
{
/* Assumptions for this function:
* 1. unsigned long is 32-bit
* 2. little endian: bits are numbered from the first/least
significant (0) up to the msb (31)
* 3. Bit 32 == bit 0 in the next 32-bit word
*/
/* Normalize bit index */
DestAddress += DestBitIndex >> 5;
DestBitIndex &= 31;

/* Where is the last bit going to end up? */
unsigned int lastbit = BitCount + DestBitIndex - 1;

/* This mask will be zero starting at the first inserted bit */
unsigned int bottom_mask = ((unsigned int) -1) >> DestBitIndex;
/* Zero mask for bits after the last bit inserted */
/* The masked shift count is implicit on x86! */
unsigned int top_mask = ((unsigned int) -1) >> (lastbit & 31);

/* One or two affected words? */
if (lastbit >= 32) {
DestAddress[0] = (DestAddress[0] & bottom_mask) |
(Value << DestBitIndex);
DestAddress[1] = (DestAddress[1] & ~top_mask) |
(Value >> (32-DestBitIndex));
}
else { /* It all fits in a single word! */
DestAddress[0] = (DestAddress[0] & (bottom_mask | ~top_mask)) |
(Value << DestBitIndex);
}
}

Terje
--
- <Terje.Mathisen(a)hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
From: Skybuck Flying on
Nice try, except your routine is flawed.

Here is the delphi translation with comments for your flaws:

procedure WriteLongwordBitsTerje( Value : longword; BitCount : longword;
DestAddress : pointer; DestBitIndex : longword );
var
vLastBit : longword;
vBottomMask : longword;
vTopMask : longword;
begin
// Normalize bit index
longword(DestAddress) := longword(DestAddress) + DestBitIndex shr 5; // div
32
DestBitIndex := DestBitIndex and 31; // mod 32, resulting range 0 to 31

// Where is the last bit going to end up ?
vLastBit := BitCount + DestBitIndex - 1; // resulting range -1 to 62

// This mask will be zero starting at the first inserted bit
vBottomMask := longword(-1) shr DestBitIndex;

// Zero mask for bits after the last bit inserted
// The masked shift count is implicit on x86!
vTopMask := ( longword(-1) ) shr (vLastBit and 31); // vLastBit resulting
range 0 to 31

// One or two affected words?
if (vLastBit >= 32) then // could malfunction for -1 ;)
begin
// DestAddress[0] := (DestAddress[0] and vBottomMask) or (Value shl
DestBitIndex);
longword(DestAddress^) := (longword(DestAddress^) and vBottomMask) or
(Value shl DestBitIndex);

// DestAddress[1] := (DestAddress[1] and not vTopMask) or (Value shr
(32-DestBitIndex));
longword( pointer(longword(DestAddress)+4)^ ) :=
( longword( pointer(longword(DestAddress)+4)^ ) and not vTopMask) or
(Value shr (32-DestBitIndex)); // shr 32 will malfunction
end else
begin
// It all fits in a single word!
// DestAddress[0] := (DestAddress[0] and (vBottomMask or not vTopMmask)) or
(Value shl DestBitIndex);
longword(DestAddress^) := (longword(DestAddress^) and (vBottomMask or not
vTopMask)) or (Value shl DestBitIndex);
end;
end;

Your routine fails for my current test value:

Value := 4294967295; // all bits set
BitCount := 31; // write only 31 bits
BitPointer := 0; // at index zero
WriteLongWordBitsTerje( Value, BitCount, @Buffer, BitPointer );

Performance wise your routine uses a branch and a branch could potentially
stall the pipeline.

So you see writing a flawless fast 32 bit version is not as easy as you
thought it would be.

Now it's up to you to correct and possibly speed up your routine, otherwise
I'll have to remove it from the contest.

Since flawed routines are not allowed ofcourse.

I'll shall help you understand your flaw and the problem in general:

Value (shr 32 - DestBitIndex) won't work if DestBitIndex is zero.

The shr instruction is limited to a range of 0 to 31.

You are trying to shift with 32 which means the shr won't happen.

Good luck with fixing it.

Thanks for trying anyway your code does have some interesting concepts which
might or might not come in handy in writing a correct version ;) :)

Bye,
Skybuck.


From: Skybuck Flying on
Your new version is better, but still has a serious shortcoming:

the DestBitIndex should be able to have any range.

Your method assumes parameter DestBitIndex has range 0 to 31, this is too
restricted.

See specification in first posting.

Furthermore your method is less desireable, currently it falls in
classification A2B1

The best classification is A2B2.

Your method's lower classification explained:

Your method writes garbage bits from the value as well...

Suppose the value is:

LSB ------------------------------> MSB
| |
1111 1111 1111 0000 0000 0000 0000 0000

Suppose the bit count is: 3

Your method copies all set bits from value to output.

Buffer before call:

0000 0000 0000 0000 0000 0000 0000 0000

Buffer after call:

1111 1111 1111 0000 0000 0000 0000 0000

So you see, your method ignores the bit count in this case.

It would have been better/more desireable if your method preserves
the remaining bits in the buffer, and treats accessive bits in the value as
garbage which should not be written to the buffer.

So the best/most desired result would be:

1110 0000 0000 0000 0000 0000 0000 0000

Your method will be classified as: A2B1

While my methods are classified as: A2B2

See first post to understand classifications :)

Here is your version translated to Delphi.

Your method cannot enter the contest because it fails a basic requirement as
explained in the top of this posting and in the failure comment below and in
the specification in the first posting ;)

// 59 instructions.
// Failure: writes to wrong memory locations when DestBitIndex is larger
than 31
procedure WriteLongwordBitsTerje2
(
Value : longword;
BitCount : longword;
DestAddress : pointer;
DestBitIndex : longword
);
var
vLastBit : longword;
vBottomMask : longword;
vTopMask : longword;
begin
// Normalize bit index
longword(DestAddress) := longword(DestAddress) + (DestBitIndex shr 5);
DestBitIndex := DestBitIndex and 31;

// Index of last inserted bit:
vLastBit := DestBitIndex + BitCount - 1;

// Generate a mask of 0 bits below the first inserted bit */
vBottomMask := ( longword(-1) ) shl DestBitIndex;

// There must be 0 to 31 unmodified bits after the last inserted: */
vTopMask := ( longword(-1) ) shr (31-(vLastBit and 31));

// One or two affected words? */
if (vLastBit >= 32) then
begin
Plongword(DestAddress)^ := (Plongword(DestAddress)^ and not vBottomMask)
or (Value shl DestBitIndex);

Longword(DestAddress) := Longword(DestAddress) + 4;

// Since we modify two words, DestBitIndex MUST be > 0! */
Plongword(DestAddress)^ := (Plongword(DestAddress)^ and not vTopMask) or
(Value shr (32-DestBitIndex));
end else
begin
// It all fits in a single word !
Plongword(DestAddress)^ := (Plongword(DestAddress)^ and not (vBottomMask
and vTopMask)) or (Value shl DestBitIndex); // topmask incorrect.
end;
end;

// Generated Assembler:

{

Project1.dpr.1833: begin
00409058 55 push ebp
00409059 8BEC mov ebp,esp
0040905B 83C4F0 add esp,-$10
0040905E 53 push ebx
0040905F 56 push esi
00409060 57 push edi
00409061 8BF1 mov esi,ecx
00409063 8955F8 mov [ebp-$08],edx
00409066 8945FC mov [ebp-$04],eax
00409069 8B5D08 mov ebx,[ebp+$08]
Project1.dpr.1835: longword(DestAddress) := longword(DestAddress) +
(DestBitIndex shr 5);
0040906C 8BC3 mov eax,ebx
0040906E C1E805 shr eax,$05
00409071 01C6 add esi,eax
Project1.dpr.1836: DestBitIndex := DestBitIndex and 31;
00409073 83E31F and ebx,$1f
Project1.dpr.1839: vLastBit := DestBitIndex + BitCount - 1;
00409076 8B45F8 mov eax,[ebp-$08]
00409079 03C3 add eax,ebx
0040907B 48 dec eax
Project1.dpr.1842: vBottomMask := ( longword(-1) ) shl DestBitIndex;
0040907C 8BCB mov ecx,ebx
0040907E 83CAFF or edx,-$01
00409081 D3E2 shl edx,cl
00409083 8955F4 mov [ebp-$0c],edx
Project1.dpr.1845: vTopMask := ( longword(-1) ) shr (31-(vLastBit and 31));
00409086 8BD0 mov edx,eax
00409088 83E21F and edx,$1f
0040908B B91F000000 mov ecx,$0000001f
00409090 2BCA sub ecx,edx
00409092 83CAFF or edx,-$01
00409095 D3EA shr edx,cl
00409097 8955F0 mov [ebp-$10],edx
Project1.dpr.1848: if (vLastBit >= 32) then
0040909A 83F820 cmp eax,$20
0040909D 7236 jb $004090d5
Project1.dpr.1850: Plongword(DestAddress)^ := (Plongword(DestAddress)^ and
not vBottomMask) or (Value shl DestBitIndex);
0040909F 8BCB mov ecx,ebx
004090A1 8B55FC mov edx,[ebp-$04]
004090A4 D3E2 shl edx,cl
004090A6 8BC6 mov eax,esi
004090A8 8B08 mov ecx,[eax]
004090AA 8B7DF4 mov edi,[ebp-$0c]
004090AD F7D7 not edi
004090AF 23CF and ecx,edi
004090B1 0BD1 or edx,ecx
004090B3 8910 mov [eax],edx
Project1.dpr.1852: Longword(DestAddress) := Longword(DestAddress) + 4;
004090B5 83C604 add esi,$04
Project1.dpr.1855: Plongword(DestAddress)^ := (Plongword(DestAddress)^ and
not vTopMask) or (Value shr (32-DestBitIndex));
004090B8 B920000000 mov ecx,$00000020
004090BD 2BCB sub ecx,ebx
004090BF 8B55FC mov edx,[ebp-$04]
004090C2 D3EA shr edx,cl
004090C4 8BC6 mov eax,esi
004090C6 8B08 mov ecx,[eax]
004090C8 8B5DF0 mov ebx,[ebp-$10]
004090CB F7D3 not ebx
004090CD 23CB and ecx,ebx
004090CF 0BD1 or edx,ecx
004090D1 8910 mov [eax],edx
004090D3 EB19 jmp $004090ee
Project1.dpr.1859: Plongword(DestAddress)^ := (Plongword(DestAddress)^ and
not (vBottomMask and vTopMask)) or (Value shl DestBitIndex);
004090D5 8BCB mov ecx,ebx
004090D7 8B55FC mov edx,[ebp-$04]
004090DA D3E2 shl edx,cl
004090DC 8BC6 mov eax,esi
004090DE 8B08 mov ecx,[eax]
004090E0 8B5DF4 mov ebx,[ebp-$0c]
004090E3 235DF0 and ebx,[ebp-$10]
004090E6 F7D3 not ebx
004090E8 23CB and ecx,ebx
004090EA 0BD1 or edx,ecx
004090EC 8910 mov [eax],edx
Project1.dpr.1861: end;
004090EE 5F pop edi
004090EF 5E pop esi
004090F0 5B pop ebx
004090F1 8BE5 mov esp,ebp
004090F3 5D pop ebp
004090F4 C20400 ret $0004

}

Thanks for trying anyway...

Bye,
Skybuck.


From: Skybuck Flying on
Skybuck to the resque ;)

I thought long,

and

I thought hard,

and

Now I might have a pretty simple branchless little fix/patch/hack for it.

Since DestBitIndex is modded to 32, it's range is 0 to 31.

The problem code is:

shr 32 - DestBitIndex.

So the range of this instruction will always be:

shr 32 - 0 = 32
shr 32 - 31 = 1

So a possible fix is to split the shr into two shr's.

Simply do something like:

shr 1

This is valid because the shr is always 1 or greater.

Then to correct the situation, add +1 to DestBitIndex...
then one more will be subtracted from shr 32... which will fix it.

shr 32 - (DestBitIndex+1)

Now let's examine the other case:

DestBitIndex = 31.

shr 32 - (31+1) = shr 0

shr 0 is ok... it doesn't do anything.

The original code should have done

shr 1...

But it will do that because of the extra shift shr 1.

So the fix is:

shr 1
shr 32 - (DestBitIndex+1)

One bug KILLED :) =D

Bye,
Skybuck.


From: Skybuck Flying on
Well, where did you learn to program son ? LOL.

There are other bugs in your code as well.

For example, you not calculating the top mask correctly.

Bye,
Skybuck.