|
Prev: Parallel execution of carry dependent instructions ?
Next: Skybuck presents FastGetBit ( Value, BitPosition ) (latency 3)
From: Terje Mathisen on 4 May 2008 11:30 Skybuck Flying wrote: > "MitchAlsup" <MitchAlsup(a)aol.com> wrote in message > news:cb20dc12-415b-4a0c-8bd9-4985a7c04946(a)w74g2000hsh.googlegroups.com... >> Lets do this in an instruction set that it makes it easy, and use C; >> >> void WriteLongwordBits( unsigned long Value, unsignedint BitCount, >> unsigned long int *DestAddress, unsigned int DestBitIndex ) >> { >> ASM{ // get stuff from the parameter passing area >> mov a1,12[sp] >> mov d2,16[sp] >> mov a3,20[sp] >> mov d4,24[sp] >> >> BFEXTU d5,d2,[a1] // get the source field >> BFINS d5,d4,[a3] // put it where it belongs >> } [snip] > Maybe somebody can re-write all of this assembler to be more efficient ? ;) Mitch already did, see above. > > Or simply come up with an alternative assembler implementation. Your example code is simply horrible, besides the fact that it can trap on a 64-bit platform since you don't seem to care about doing aligned 64-bit accesses. You also imply a 32-bit little-endian architecture since you use a array of 32-bit values as your target address, which together with a random bit offset means that writes will probably straddle between two 32-bit words. However, none of this really matters at all, because the function can _never_ be time critical: If it is, you simply have no idea at all how to write fast code. So, with that out of the way, lets look at your function: void WriteLongwordBits( unsigned long Value, unsigned int BitCount, unsigned long int *DestAddress, unsigned int DestBitIndex ) { /* Assumptions for this function: * 1. unsigned long is 32-bit * 2. little endian: bits are numbered from the first/least significant (0) up to the msb (31) * 3. Bit 32 == bit 0 in the next 32-bit word */ /* Normalize bit index */ DestAddress += DestBitIndex >> 5; DestBitIndex &= 31; /* Where is the last bit going to end up? */ unsigned int lastbit = BitCount + DestBitIndex - 1; /* This mask will be zero starting at the first inserted bit */ unsigned int bottom_mask = ((unsigned int) -1) >> DestBitIndex; /* Zero mask for bits after the last bit inserted */ /* The masked shift count is implicit on x86! */ unsigned int top_mask = ((unsigned int) -1) >> (lastbit & 31); /* One or two affected words? */ if (lastbit >= 32) { DestAddress[0] = (DestAddress[0] & bottom_mask) | (Value << DestBitIndex); DestAddress[1] = (DestAddress[1] & ~top_mask) | (Value >> (32-DestBitIndex)); } else { /* It all fits in a single word! */ DestAddress[0] = (DestAddress[0] & (bottom_mask | ~top_mask)) | (Value << DestBitIndex); } } Terje -- - <Terje.Mathisen(a)hda.hydro.com> "almost all programming can be viewed as an exercise in caching"
From: Skybuck Flying on 4 May 2008 15:25 Nice try, except your routine is flawed. Here is the delphi translation with comments for your flaws: procedure WriteLongwordBitsTerje( Value : longword; BitCount : longword; DestAddress : pointer; DestBitIndex : longword ); var vLastBit : longword; vBottomMask : longword; vTopMask : longword; begin // Normalize bit index longword(DestAddress) := longword(DestAddress) + DestBitIndex shr 5; // div 32 DestBitIndex := DestBitIndex and 31; // mod 32, resulting range 0 to 31 // Where is the last bit going to end up ? vLastBit := BitCount + DestBitIndex - 1; // resulting range -1 to 62 // This mask will be zero starting at the first inserted bit vBottomMask := longword(-1) shr DestBitIndex; // Zero mask for bits after the last bit inserted // The masked shift count is implicit on x86! vTopMask := ( longword(-1) ) shr (vLastBit and 31); // vLastBit resulting range 0 to 31 // One or two affected words? if (vLastBit >= 32) then // could malfunction for -1 ;) begin // DestAddress[0] := (DestAddress[0] and vBottomMask) or (Value shl DestBitIndex); longword(DestAddress^) := (longword(DestAddress^) and vBottomMask) or (Value shl DestBitIndex); // DestAddress[1] := (DestAddress[1] and not vTopMask) or (Value shr (32-DestBitIndex)); longword( pointer(longword(DestAddress)+4)^ ) := ( longword( pointer(longword(DestAddress)+4)^ ) and not vTopMask) or (Value shr (32-DestBitIndex)); // shr 32 will malfunction end else begin // It all fits in a single word! // DestAddress[0] := (DestAddress[0] and (vBottomMask or not vTopMmask)) or (Value shl DestBitIndex); longword(DestAddress^) := (longword(DestAddress^) and (vBottomMask or not vTopMask)) or (Value shl DestBitIndex); end; end; Your routine fails for my current test value: Value := 4294967295; // all bits set BitCount := 31; // write only 31 bits BitPointer := 0; // at index zero WriteLongWordBitsTerje( Value, BitCount, @Buffer, BitPointer ); Performance wise your routine uses a branch and a branch could potentially stall the pipeline. So you see writing a flawless fast 32 bit version is not as easy as you thought it would be. Now it's up to you to correct and possibly speed up your routine, otherwise I'll have to remove it from the contest. Since flawed routines are not allowed ofcourse. I'll shall help you understand your flaw and the problem in general: Value (shr 32 - DestBitIndex) won't work if DestBitIndex is zero. The shr instruction is limited to a range of 0 to 31. You are trying to shift with 32 which means the shr won't happen. Good luck with fixing it. Thanks for trying anyway your code does have some interesting concepts which might or might not come in handy in writing a correct version ;) :) Bye, Skybuck.
From: Skybuck Flying on 6 May 2008 14:04 Your new version is better, but still has a serious shortcoming: the DestBitIndex should be able to have any range. Your method assumes parameter DestBitIndex has range 0 to 31, this is too restricted. See specification in first posting. Furthermore your method is less desireable, currently it falls in classification A2B1 The best classification is A2B2. Your method's lower classification explained: Your method writes garbage bits from the value as well... Suppose the value is: LSB ------------------------------> MSB | | 1111 1111 1111 0000 0000 0000 0000 0000 Suppose the bit count is: 3 Your method copies all set bits from value to output. Buffer before call: 0000 0000 0000 0000 0000 0000 0000 0000 Buffer after call: 1111 1111 1111 0000 0000 0000 0000 0000 So you see, your method ignores the bit count in this case. It would have been better/more desireable if your method preserves the remaining bits in the buffer, and treats accessive bits in the value as garbage which should not be written to the buffer. So the best/most desired result would be: 1110 0000 0000 0000 0000 0000 0000 0000 Your method will be classified as: A2B1 While my methods are classified as: A2B2 See first post to understand classifications :) Here is your version translated to Delphi. Your method cannot enter the contest because it fails a basic requirement as explained in the top of this posting and in the failure comment below and in the specification in the first posting ;) // 59 instructions. // Failure: writes to wrong memory locations when DestBitIndex is larger than 31 procedure WriteLongwordBitsTerje2 ( Value : longword; BitCount : longword; DestAddress : pointer; DestBitIndex : longword ); var vLastBit : longword; vBottomMask : longword; vTopMask : longword; begin // Normalize bit index longword(DestAddress) := longword(DestAddress) + (DestBitIndex shr 5); DestBitIndex := DestBitIndex and 31; // Index of last inserted bit: vLastBit := DestBitIndex + BitCount - 1; // Generate a mask of 0 bits below the first inserted bit */ vBottomMask := ( longword(-1) ) shl DestBitIndex; // There must be 0 to 31 unmodified bits after the last inserted: */ vTopMask := ( longword(-1) ) shr (31-(vLastBit and 31)); // One or two affected words? */ if (vLastBit >= 32) then begin Plongword(DestAddress)^ := (Plongword(DestAddress)^ and not vBottomMask) or (Value shl DestBitIndex); Longword(DestAddress) := Longword(DestAddress) + 4; // Since we modify two words, DestBitIndex MUST be > 0! */ Plongword(DestAddress)^ := (Plongword(DestAddress)^ and not vTopMask) or (Value shr (32-DestBitIndex)); end else begin // It all fits in a single word ! Plongword(DestAddress)^ := (Plongword(DestAddress)^ and not (vBottomMask and vTopMask)) or (Value shl DestBitIndex); // topmask incorrect. end; end; // Generated Assembler: { Project1.dpr.1833: begin 00409058 55 push ebp 00409059 8BEC mov ebp,esp 0040905B 83C4F0 add esp,-$10 0040905E 53 push ebx 0040905F 56 push esi 00409060 57 push edi 00409061 8BF1 mov esi,ecx 00409063 8955F8 mov [ebp-$08],edx 00409066 8945FC mov [ebp-$04],eax 00409069 8B5D08 mov ebx,[ebp+$08] Project1.dpr.1835: longword(DestAddress) := longword(DestAddress) + (DestBitIndex shr 5); 0040906C 8BC3 mov eax,ebx 0040906E C1E805 shr eax,$05 00409071 01C6 add esi,eax Project1.dpr.1836: DestBitIndex := DestBitIndex and 31; 00409073 83E31F and ebx,$1f Project1.dpr.1839: vLastBit := DestBitIndex + BitCount - 1; 00409076 8B45F8 mov eax,[ebp-$08] 00409079 03C3 add eax,ebx 0040907B 48 dec eax Project1.dpr.1842: vBottomMask := ( longword(-1) ) shl DestBitIndex; 0040907C 8BCB mov ecx,ebx 0040907E 83CAFF or edx,-$01 00409081 D3E2 shl edx,cl 00409083 8955F4 mov [ebp-$0c],edx Project1.dpr.1845: vTopMask := ( longword(-1) ) shr (31-(vLastBit and 31)); 00409086 8BD0 mov edx,eax 00409088 83E21F and edx,$1f 0040908B B91F000000 mov ecx,$0000001f 00409090 2BCA sub ecx,edx 00409092 83CAFF or edx,-$01 00409095 D3EA shr edx,cl 00409097 8955F0 mov [ebp-$10],edx Project1.dpr.1848: if (vLastBit >= 32) then 0040909A 83F820 cmp eax,$20 0040909D 7236 jb $004090d5 Project1.dpr.1850: Plongword(DestAddress)^ := (Plongword(DestAddress)^ and not vBottomMask) or (Value shl DestBitIndex); 0040909F 8BCB mov ecx,ebx 004090A1 8B55FC mov edx,[ebp-$04] 004090A4 D3E2 shl edx,cl 004090A6 8BC6 mov eax,esi 004090A8 8B08 mov ecx,[eax] 004090AA 8B7DF4 mov edi,[ebp-$0c] 004090AD F7D7 not edi 004090AF 23CF and ecx,edi 004090B1 0BD1 or edx,ecx 004090B3 8910 mov [eax],edx Project1.dpr.1852: Longword(DestAddress) := Longword(DestAddress) + 4; 004090B5 83C604 add esi,$04 Project1.dpr.1855: Plongword(DestAddress)^ := (Plongword(DestAddress)^ and not vTopMask) or (Value shr (32-DestBitIndex)); 004090B8 B920000000 mov ecx,$00000020 004090BD 2BCB sub ecx,ebx 004090BF 8B55FC mov edx,[ebp-$04] 004090C2 D3EA shr edx,cl 004090C4 8BC6 mov eax,esi 004090C6 8B08 mov ecx,[eax] 004090C8 8B5DF0 mov ebx,[ebp-$10] 004090CB F7D3 not ebx 004090CD 23CB and ecx,ebx 004090CF 0BD1 or edx,ecx 004090D1 8910 mov [eax],edx 004090D3 EB19 jmp $004090ee Project1.dpr.1859: Plongword(DestAddress)^ := (Plongword(DestAddress)^ and not (vBottomMask and vTopMask)) or (Value shl DestBitIndex); 004090D5 8BCB mov ecx,ebx 004090D7 8B55FC mov edx,[ebp-$04] 004090DA D3E2 shl edx,cl 004090DC 8BC6 mov eax,esi 004090DE 8B08 mov ecx,[eax] 004090E0 8B5DF4 mov ebx,[ebp-$0c] 004090E3 235DF0 and ebx,[ebp-$10] 004090E6 F7D3 not ebx 004090E8 23CB and ecx,ebx 004090EA 0BD1 or edx,ecx 004090EC 8910 mov [eax],edx Project1.dpr.1861: end; 004090EE 5F pop edi 004090EF 5E pop esi 004090F0 5B pop ebx 004090F1 8BE5 mov esp,ebp 004090F3 5D pop ebp 004090F4 C20400 ret $0004 } Thanks for trying anyway... Bye, Skybuck.
From: Skybuck Flying on 5 May 2008 09:11 Skybuck to the resque ;) I thought long, and I thought hard, and Now I might have a pretty simple branchless little fix/patch/hack for it. Since DestBitIndex is modded to 32, it's range is 0 to 31. The problem code is: shr 32 - DestBitIndex. So the range of this instruction will always be: shr 32 - 0 = 32 shr 32 - 31 = 1 So a possible fix is to split the shr into two shr's. Simply do something like: shr 1 This is valid because the shr is always 1 or greater. Then to correct the situation, add +1 to DestBitIndex... then one more will be subtracted from shr 32... which will fix it. shr 32 - (DestBitIndex+1) Now let's examine the other case: DestBitIndex = 31. shr 32 - (31+1) = shr 0 shr 0 is ok... it doesn't do anything. The original code should have done shr 1... But it will do that because of the extra shift shr 1. So the fix is: shr 1 shr 32 - (DestBitIndex+1) One bug KILLED :) =D Bye, Skybuck.
From: Skybuck Flying on 5 May 2008 09:21
Well, where did you learn to program son ? LOL. There are other bugs in your code as well. For example, you not calculating the top mask correctly. Bye, Skybuck. |