From: Skybuck Flying on 7 Jun 2010 23:25
I have finished implementing a variation of Skybuck's Universal Code in
It turns out it needs BSR for decoding... according to the documents this
instruction is quite slow on AMD's processors.
The latency goes up to 16 to 17 clock cycles ?!? which is a bit too
high for my taste...
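For readers who have not used it: BSR (bit scan reverse) returns the index of the highest set bit of its source operand, and its result is undefined when the source is zero. The sketch below is only a portable C stand-in for that behaviour, not the decoder discussed in this thread; the name bsr32_emulated is hypothetical, and the loop is far slower than the real instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-in for what BSR computes on a 32-bit value:
 * the index (0..31) of the highest set bit.
 * The real instruction leaves its destination undefined for x == 0,
 * so this sketch simply requires a nonzero input. */
static unsigned bsr32_emulated(uint32_t x)
{
    unsigned index = 0;

    assert(x != 0);          /* mirrors BSR's undefined case */
    while (x >>= 1) {        /* shift until only the top bit has passed */
        index++;
    }
    return index;
}
```

For example, bsr32_emulated(6) yields 2, because 6 is binary 110 and its highest set bit is bit 2.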
In the future LZCNT might be available to cover BSR's role in a
slightly different way... but this would require supporting this new
instruction and adding two code paths and so forth... I'd rather just use BSR for now.
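To make the "slightly different way" concrete: LZCNT counts leading zero bits (and returns the operand width for a zero input), so for a nonzero 32-bit value the BSR result equals 31 minus the LZCNT result. The sketch below shows that relation and what a two-path setup could look like. Everything here is an assumption for illustration: both instructions are emulated in portable C, and the names lzcnt32_emulated, bsr32_emulated, highest_bit_via_lzcnt and select_highest_bit, as well as the cpu_has_lzcnt flag, are hypothetical. A real build would use the actual instructions and a CPUID check instead.

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-in for LZCNT: number of leading zero bits, 32 for x == 0. */
static unsigned lzcnt32_emulated(uint32_t x)
{
    unsigned count = 0;

    if (x == 0) {
        return 32;
    }
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        count++;
    }
    return count;
}

/* Portable stand-in for BSR: index of the highest set bit, x must be nonzero. */
static unsigned bsr32_emulated(uint32_t x)
{
    unsigned index = 0;

    assert(x != 0);
    while (x >>= 1) {
        index++;
    }
    return index;
}

/* The LZCNT code path: same answer as BSR for every nonzero input. */
static unsigned highest_bit_via_lzcnt(uint32_t x)
{
    assert(x != 0);
    return 31u - lzcnt32_emulated(x);
}

typedef unsigned (*highest_bit_fn)(uint32_t);

/* Pick one of the two code paths once, e.g. at program start.
 * cpu_has_lzcnt is a hypothetical flag the caller would fill in. */
static highest_bit_fn select_highest_bit(int cpu_has_lzcnt)
{
    return cpu_has_lzcnt ? highest_bit_via_lzcnt : bsr32_emulated;
}
```

Selecting the function pointer once keeps the feature check out of the decoding loop, which is one way to limit the cost of carrying two code paths.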
So I hope AMD will make the BSR instruction faster in their next
processor... just in case... ;)
I am about to finish my implementation and test its speed... it will
probably be fast enough to do what I want... but the problem is of course
that I might want to do more in the future... and BSR is eating into my
"instruction budget" ;) :) Bring down the costs of BSR ! ;) :)
From: wolfgang kern on 16 Jun 2010 13:19
The 'Flying Bucket' posted the same question one more time:
> I have finished implementing a variation of Skybuck's Universal Code in
> something :)
What might the Bucket have done against pure logic this time ?
AMD introduced a fast (faster than Intel's) BSR/BSF instruction
with their very first 80486 CPU variants already.
Because AMD doesn't need to shift, they used bit addressing instead.
Attempts to use this fast opcode within compilers may have
been delayed or ignored at that time. Today, even newer Intel
CPUs may bypass the barrel shift with bit-address logic.
So where could this 'Bucket's Universal Code' apply ?
I think to the "museum of never taken opportunities" :)
The 'Flying Bucket' usually resides in my 'quiet cage',
but just for fun I may release him for some time...