From: Phil Carmody on
"Terry Russell" <trochilus(a)RoEpMtOuVsEnet.com.au> writes:
> 256 byte lookup table
> fill algorithm...30-400 clocks per byte

Whoa! You need a better fill algorithm.

Can I offer you a 16-byte lookup table, and some back-to-basics
shift-by-4 and add?

Or the old paired LFSR technique. I say 'old', but as far as
I know no-one's ever used it. When searching the literature for
one possible application, I was unable to find any reference to
it amongst dozens of techniques in hundreds of papers. Spin
2 in opposite directions, et voila. x86 Parity flag useful.

Phil
--
"Home taping is killing big business profits. We left this side blank
so you can help." -- Dead Kennedys, written upon the B-side of tapes of
/In God We Trust, Inc./.
From: Terry Russell on

"Phil Carmody" <thefatphil_demunged(a)yahoo.co.uk> wrote in message
news:87sljri9gz.fsf(a)nonospaz.fatphil.org...
> "Terry Russell" <trochilus(a)RoEpMtOuVsEnet.com.au> writes:
>> 256 byte lookup table
>> fill algorithm...30-400 clocks per byte
>
> Whoa! You need a better fill algorithm.
>
> Can I offer you a 16-byte lookup table, and some back-to-basics
> shift-by-4 and add?
>
> Or the old paired LFSR technique. I say 'old', but as far as
> I know no-one's ever used it. When searching the literature for
> one possible application, I was unable to find any reference to
> it amongst dozens of techniques in hundreds of papers. Spin
> 2 in opposite directions, et voila. x86 Parity flag useful.

The fill was a guess at the range of algorithms proffered, probably more
like 50 without trying too hard. I was mostly noting that it is comparable
to the time to load a precalculated table from disk, and that a lookup
becomes independent of the fill algorithm very quickly.
Of course you could speed up the filler, but if it is used seldom it isn't
worth it, and if it is used often it isn't worth it.

128k of common byte operations precalculated in the exe is getting towards
1 cent of RAM and 0.01 cents of disk space.

Not a real budget breaker.

Byte/bit optimisation of microcent-per-day resources is fine, but the greater
efficiency gain may be in allowing the kilobuck-a-week user to make full use
of their time, multikilobuck resources, and kilobuck-a-quarter support
costs.

Take window shifting and bitmaps: huge chunks of data. Do you write a lot
of slow tiling code for generic systems, or specify (demand) at least a $50
later-model card with gigapixel bitmap support and have the hardware
do your 20k x 20k bitmap transform?



From: Crazy on
OK....

So the reason this is even needed is because of the Motorola 68000
series CPUs.

I found this to be required when working with older versions of Novell
NetWare API's


Intel = LSB first (little-endian)
Motorola = MSB first (big-endian)

In nearly all written notations the bits of a byte go MSB first, but note
that endianness proper describes BYTE order within a word, not bit order.

Read THESE:

http://www.cs.umass.edu/~verts/cs32/endian.html
or
http://www.webopedia.com/TERM/b/big_endian.html
or
http://en.wikipedia.org/wiki/Endianness


Now I haven't tested this, but this should work, I think.


Function ReturnBE(value : word) : word ;

var
  MyWord  : word ;
  StoreBE : word ;

begin

  MyWord := 10241 ;  { example value; in real use this would be 'value' }

  StoreBE := Lo(MyWord) ;        { take the low byte }
  StoreBE := StoreBE shl 8 ;     { shift it into the high byte position }
  StoreBE := StoreBE + Hi(MyWord) ;

  ReturnBE := StoreBE ;          { return the byte-swapped word }

end;


So ( I think ) this would look like in Binary

--> MyWord := 10241 ;

result in bin = 00101000:00000001

--> StoreBE = $0000 ;

result in bin = 00000000:00000000

--> StoreBE := Lo(MyWord) ;

result in bin = 00000000:00000001

--> StoreBE := StoreBE shl 8 ;

result in bin = 00000001:00000000

--> StoreBE := StoreBE + HI(MyWord);

result in bin = 00000001:00101000


Now if you get into 32-bit numbers this won't work, because you have:

------ LSB ------:-------MSB ------
---LSB--:---MSB--:---LSB--:---MSB--
00000000:00000000:00000000:00000000


Now if you get into 64-bit numbers this won't work, because you have:

----------------LSB----------------:----------------MSB ---------------
------ LSB ------:-------MSB ------:------ LSB ------:-------MSB ------
---LSB--:---MSB--:---LSB--:---MSB--:---LSB--:---MSB--:---LSB--:---MSB--
00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000


At least >> I THINK << but I could damn well be wrong on the 32 and 64
bit part. I am not sure if the processor sees them as pure bits or as
groups of bits ( bytes ) that make up a pure 32-bit or 64-bit number.

Man am I sleepy.

- Crazy

Skybuck wrote:
> Woops, sorry, I did not know comp.lang.asm.x86 is a moderated
> newsgroup, as I don't like moderated newsgroups I post it again and
> exclude comp.lang.asm.x86:
>
> Reversing bit order in delphi ?
>
> Hello,
>
> I think intel uses most significant bit first ?
>
> However there are some computers which use least significant bit first.
>
> So what would be the fastest way in pure delphi code to reverse the
> bits ?
>
> For example:
>
> [intel system, most significant bit first]
>
> Bit order:
>
> 7, 6, 5, 4, 3, 2, 1, 0
>
> 128, 64, 32, 16, 8, 4, 2, 1
> 0, 1, 1, 0, 1, 1, 0, 1 = 109
>
> Reversing the bits would give:
> 1, 0, 1, 1, 0, 1, 1, 0 = 182
>
> [ibm system or something else, least significant bit first]
>
> Bit order:
>
> 0, 1, 2, 3, 4, 5, 6, 7
>
> 1, 2, 4, 8, 16, 32, 64,128
> 0, 1, 1, 0, 1, 1, 0, 1 = 182
>
> Reversing the bits would give:
> 1, 0, 1, 1, 0, 1, 1, 0 = 109
>
> Reversing the bits in delphi code might be too slow, if that's the case
> is there a special (intel/amd) assembler instruction which might do the
> trick ?
>
> Note:
>
> This message/thread is cross posted to:
>
> alt.comp.lang.borland-delphi, alt.lang.asm
>
> Bye,
> Skybuck.
>
From: [Jongware] on
"Skybuck" <skybuck2000(a)hotmail.com> wrote in message
news:1156016326.599168.43620(a)75g2000cwc.googlegroups.com...
>> xlat.
>> At the cost of a 256-byte table, it is easiest and fastest to just look
>> the
>> byte up. It even used to be a well-known trick (you can mirror bitmap
>> images
>> this way).
>> As a side note I might add that though 'xlat' *is* a special instruction
>> for
>> table lookup, it's usually faster (!) to use the regular memory read
>> functions. (That's from hear-say!)
>>
>> [Jongware]
>
> Euhm can you provide some assembler code ?
>
> Then maybe I can enter you into the competition ! ;)

This is a winner every time.

.mirrortable db 0, 80h, 40h, 0c0h, 20h, 0a0h, 60h, 0e0h, 10h, 90h
... (etc. -- rest of look-up table omitted)

-- and wherever you need the mirrored byte of AL
xor ebx,ebx
mov bl,al
mov al, ds:[mirrortable+ebx]

-- or, whatever register you need the data in. You don't have to clear the
offset register ebx every time, but since it can be a source of weird bugs,
I've included it for clarity.
Requirement: the table itself (256 bytes)
Cost: As near to zero as you can imagine.
I used this particular trick in my ZX Spectrum era, with its monochrome
screen. Drawing any image mirrored took only 1 CPU clock tick per byte more
than drawing the original.

The Intel xlat command does the same in a single instruction, but I
understand it's being 'phased out' and translated internally to the sequence
I describe above (and at the cost of a higher clock count).

[Jongware]


From: Skybuck on

[Jongware] schreef:

> "Skybuck" <skybuck2000(a)hotmail.com> wrote in message
> news:1156016326.599168.43620(a)75g2000cwc.googlegroups.com...
> >> xlat.
> >> At the cost of a 256-byte table, it is easiest and fastest to just look
> >> the
> >> byte up. It even used to be a well-known trick (you can mirror bitmap
> >> images
> >> this way).
> >> As a side note I might add that though 'xlat' *is* a special instruction
> >> for
> >> table lookup, it's usually faster (!) to use the regular memory read
> >> functions. (That's from hear-say!)
> >>
> >> [Jongware]
> >
> > Euhm can you provide some assembler code ?
> >
> > Then maybe I can enter you into the competition ! ;)
>
> This is a winner every time.

Are you sure ?

I need proof :)

>
> .mirrortable db 0, 80h, 40h, 0c0h, 20h, 0a0h, 60h, 0e0h, 10h, 90h
> ... (etc. -- rest of look-up table omitted)
>
> -- and whereever you need the mirrored byte of AL
> xor ebx,ebx
> mov bl,al
> mov al, ds:[mirrortable+ebx]

Delphi only uses 2 instructions for

SwapBits := LookUp[SomeByte];

It generates something like:

movzx eax, SomeByte
mov al, byte ptr [LookUp+eax]

or something like that.

That's two instructions versus your three instructions.

So I doubt your three instructions would be faster, unless they can be
paired?

But maybe the other two are paired as well... I don't know... probably
not, since they rely on each other ;)

> The Intel xlat command does the same in a single instruction, but I
> understand it's being 'phased out' and translated internally to the sequence
> I describe above (and at the cost of a higher clock count).

xlat is slower than the two delphi generated instructions, see test
program ;)

Of course more testing should be done.

For example in the future I might test "swapping" a whole array of
bytes... to see if the performance chart remains the same ;)

For now I have utter urgent matters to attend to, so I am not going to
test your method for now... maybe in the future though... just for
curiosity sake ;)

I expect your method to be 1% slower than the leading method :) However,
maybe your method will be even slower because you are using EBX, which
needs to be pushed and popped from the stack... so let's make that 2%
;)

Bye,
Skybuck.
