From: randyhyde@earthlink.net on

o//annabee wrote:
> On Thu, 16 Mar 2006 16:57:09 +0100, randyhyde(a)earthlink.net
> <randyhyde(a)earthlink.net> wrote:
>
> > I see you've got your head buried in the same whole in the sand that
> > Rene does. Ignoring reality just because you don't like it is a sign of
> > insanity, you know?
>
> Well, I don't know, but I think it's spelled "hole" anyway. "Whole" means
> something more like "complete".

Believe me, proof-reading posts to this newsgroup would be a waste of
time. But if you want to play grammar police, you're in a very weak
position to do so.

>
> Which reminds me. Where can we download the 6 non-trivial, masterful
> applications you have written in assembly?

Grab the examples download from the HLA downloads page on Webster.

> I looked at Webster, but
> couldn't find anything but Christian resources, which I thought was odd.

Anything but, eh?
Learn the basics of a web browser, dude.

> >
> > So why are you complaining if you actually believe this?
>
> I'm not complaining. But I think when you have such a vivid imagination, you
> should make this daydream more realistic. Unless of course this "breaking
> other people's code" component is important to you.

Yes. Making sure that HLA is the best possible assembly language, even
if it means breaking a *few* source files out there that use constructs
that, by experience, have been shown to be "less than the best" is
important to me. Unlike Rene, I'm not so egotistical as to believe that I
got everything right the first time around. And I'm certainly not so
stubborn as to argue that obvious design flaws (such as the inability
to link static library modules in RosAsm) shouldn't be corrected. Of
course, before you go off on how bad the HLA design might be, I might
also point out that although there *have* been some changes to the
language, most of them have been additions, rather than deletions.
Though breaking user code is *definitely* a possibility in any change,
in reality this occurs very rarely. Indeed, the latest go-round that
Sevag is talking about is *not* in the language, but in the standard
library; and that's a different animal altogether.
Cheers,
Randy Hyde

From: sevagK on

randyhyde(a)earthlink.net wrote:
>
> >
> > This was tested on an AMD 64, 3700+ running win2000,
>
> And mine was tested on a PIV running XP.
>
> Randy Hyde

I have an AMD 64 3400+, and the two routines run about the same, with your
routine just *slightly* faster.

After 100 iterations of each loop, Wannabee's version falls behind by
an average of about 400.

Of course, when dealing with billions of iterations, the score adds up,
but if you're pinching cycles, you probably won't be using macros!

-sevag.k
www.geocities.com/kahlinor

From: o//annabee on
On Thu, 16 Mar 2006 17:57:44 +0100, randyhyde(a)earthlink.net
<randyhyde(a)earthlink.net> wrote:

> Well, I don't have your particular CPU, but when I run this code on a
> PIV, here are the results that I get:
>
> My code    Your Code
> 8418       c76c
> 8600       c76c
> 8590       c7d0
> 8568       c9b4
> 8648       c970


I just ran the tests with the identical code on the AthlonXP as well. Those
tests show that you are indeed correct on this one.

The timings I get for the Athlon are:
4f65 7549

So this is true for an old AMD, and it must have changed for the new one.
What's the ticks-per-second rate for the PIV?

> Your version seems to be about 50% slower than mine on a PIV. Again, I
> don't have access to your CPU, so I can't verify your numbers, but
> if you look at the actual code generated by RosAsm for the two
> routines:

> Well, the difference becomes pretty obvious. What you're trying to tell
> me is that a loop with 50% more instructions
> is actually *faster*? Hmmm... It sure seems like *my* measurements are
> a lot more intuitive. That is, the code with 50% more instructions
> (yours) runs 50% slower. That AMD CPU is quite amazing indeed, if
> this is really the case.

Yes. In this case you're right. On older CPUs like the PIV and Athlon your
code is faster. But who can explain the timings on the 64-bit CPU?

> If you look at the two pieces of disassembled code, I think that this
> alone should scare people away from using macros if they want the
> fastest possible code. And, btw, I want to emphasize *macros*, not
> *RosAsm macros*. You get the same problem whether the macro was written
> for RosAsm, MASM, HLA, FASM, or whatever.

OK. Anyway, I am a bit surprised that an old Athlon could give better
clockings than a PIV.

> What I *have* claimed is that MASM's implementation of "if" statements
> is *better* than the macros that come with RosAsm. This is because MASM
> is a bit smarter about this stuff. You will also discover that HLA's
> "while" loop generates the "test for loop at the end" rather than the
> same code that RosAsm generates. Now perhaps that fails to be better
> code on your particular AMD CPU; I cannot verify that, as I do not have
> access to that CPU. But an inspection of the code and measurements that
> I've made suggest that putting the branch at the bottom of the loop and
> removing an extra jump is *much* better coding indeed.

Yes. But on newer hardware it doesn't seem to make much difference.
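
For anyone comparing, the two shapes being argued about come down to roughly
this, written in RosAsm-style syntax with the loop body from these tests (a
sketch only, not the exact code either assembler emits):

; condition tested at the top (roughly what the While macro expands to,
; if I read Randy right); note the extra jmp on every pass:
L1:
cmp ecx 0
jbe L2>
add eax ecx
dec ecx
jmp L1<
L2:

; condition tested at the bottom (the shape Randy says HLA's while
; generates); one conditional branch per pass and no extra jmp:
cmp ecx 0
jbe L2>
L1:
add eax ecx
dec ecx
jnz L1<
L2: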

> Yes, not to mention your failure to serialize before the second rdtsc
> in each example.

I tried this, of course, but it does nothing for the timings either way, so
I felt it unneeded to have them there.
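
For the record, what I tried was roughly this (a sketch; cpuid trashes eax,
ebx, ecx and edx, which is why the start value has to sit on the stack, and
the second cpuid adds its own overhead to the reading):

cpuid
rdtsc | push eax
mov ecx D$n
xor eax eax
jecxz L0>
Align 16
L0:
add eax ecx
dec ecx
jnz L0<
L0:
cpuid               ; the extra serialization before the second read
rdtsc | pop ebx
sub eax ebx
int 3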

> But that still doesn't explain the 50% difference that
> *I* see on a PIV. And the difference I see is right in line with the
> number of instructions. Imagine that.

Yes. You are correct. But who will explain why this larger code runs
faster on an AMD64?

> On *your* CPU, things like pairable instructions and branch prediction
> *could* be why the two loops execute in a similar amount of time. It's
> not like the PIV is a paragon of great microcoding. But it *really*
> smells like you've made an error somewhere. I'd suggest that you try
> putting several additional instructions in the loop and see what
> happens then. That would counter any bizarre instruction-pairing
> phenomenon that is going on.

Ok. Code posted below.

>> This was tested on an AMD 64, 3700+ running win2000,

> And mine was tested on a PIV running XP.


>> TestProc:
>>
>> cpuid
>> rdtsc | push eax
>> mov ecx D$n
>> xor eax eax
>
> ;You've just discovered the problem with
> ; relative local labels here. Do you see
> ; the problem in this code? This is
> ; *exactly* why I refused to put this
> ; lame form of local labels into HLA.


> ; Earlier assemblers I'd written
> ; had relative local labels and I
> ; saw this problem *far* too often.

Yes, but this means nothing here. This code is not running in a library, and
it is guaranteed never to execute with ecx = 0. Actually, this is a
guarantee: it _will_ never execute that way.
But yes, I should have seen it anyhow.

>
>> jecxz L0>
>> Align 16
>> L0:
>>
>> add eax ecx
>> dec ecx
>> jnz L0<
>> L0:
>>

> Another issue: caching effects are not allowed for in this code. The
> way you executed it, by running and then stopping, guarantees that the
> code will *not* be in the cache when you run it.

Unless I misunderstand what you mean, this is what happens to code like
this in a real situation.

> What you should
> *really* do is run each code fragment in a loop a couple of times and
> then use the last measurement.

Ok. Posted below.
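
(To be clear, the code at the bottom of this post only adds instructions to
the loop body. What I understand by the warm-up suggestion is more like this
sketch: wrap the whole measured fragment in a small outer loop and keep only
the reading from the last pass, when code and data are already in the cache.
esi is just a free register I picked for the outer counter.)

mov esi 3            ; run the whole measurement three times
L1:
cpuid
rdtsc | push eax
mov ecx D$n
xor eax eax
jecxz L0>
Align 16
L0:
add eax ecx
dec ecx
jnz L0<
L0:
rdtsc | pop ebx
sub eax ebx          ; ticks for this pass
dec esi
jnz L1<              ; by the last pass everything is warm
int 3                ; inspect eax: the reading from the warm pass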

> That way, everything is in cache and
> you'll get more realistic readings. Indeed, the reason your timings may
> be so close is because the memory subsystem on your PC is sub-par and
> what you're really measuring is the amount of time it takes to read
> data from main memory.

I use TwinMOS DDR 3200 dual 512 MB chips on an ABIT KN8 Ultra board with an
nVidia chipset. Surely not the fastest money can buy, but it's what I could
manage at the time.

> Cheers,
> Randy Hyde


Code for timings with more instructions inside the loop. And again, you
are correct:
when adding more instructions to the loop, your code gains a foothold and
wins.

Amazing.


TestProc:

cpuid
rdtsc | push eax
mov ecx 10000
xor eax eax
Align 16
while ecx > 0
add eax ecx

add edx ebx
add esi edi
add edx ebx
add esi edi
add edx ebx
add esi edi
add edx ebx
add esi edi

dec ecx
End_While
rdtsc | pop ebx
sub eax ebx
int 3
;AFEF

cpuid
rdtsc | push eax
mov ecx D$n
xor eax eax
jecxz L0>
Align 16
L0:
add eax ecx

add edx ebx
add esi edi
add edx ebx
add esi edi
add edx ebx
add esi edi
add edx ebx
add esi edi

dec ecx
jnz L0<
L0:

rdtsc | pop ebx
sub eax ebx
int 3

;9CD7
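
(So with the extra work in the body, the While-macro fragment comes in at
0xAFEF = 45,039 ticks against 0x9CD7 = 40,151 for the explicit-label
fragment, which makes the macro loop roughly 12% slower here.)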


From: o//annabee on
On Thu, 16 Mar 2006 18:15:56 +0100, Betov <betov(a)free.fr> wrote:

> o//annabee <fack(a)szmyggenpv.com> wrote in news:op.s6ik2cnqce7g4q(a)bonus:

> Courage: Kill him ! Kill him !

I think this one will not stop talking even in death.

> At the end, he will point
> you to the pathetic "HLA Adventures" game, which we have suffered
> through for, now,... how many _years_ exactly?...

That's not assembly, so it doesn't count.

>
> :]]]]]
>
> Betov.
>
> < http://rosasm.org >
>
>

From: o//annabee on
On Fri, 17 Mar 2006 03:25:27 +0100, sevagK <kahlinor(a)yahoo.com> wrote:

>
> randyhyde(a)earthlink.net wrote:
>>
>> >
>> > This was tested on an AMD 64, 3700+ running win2000,
>>
>> And mine was tested on a PIV running XP.
>>
>> Randy Hyde
>
> I have an AMD 64 3400+, and the two routines run about the same, with
> your routine just *slightly* faster.

That sounds more and more incredible. In all tests here, on the new
machine, the RosAsm code ran faster at every iteration count, and the same
difference applied for 10000 and for 2 iterations.
For the Athlon, the RosAsm version was nearly twice as slow.

> After 100 iterations of each loop, Wannabee's version falls behind by
> an average of about 400.

What does this mean? Where is the code you tested, and are you saying that
the code ran slower and slower?

> Of course, when dealing with billions of iterations, the score adds up,
> but if you're pinching cycles, you probably won't be using macros!

These are my timings on the AMD64 3700+ for 1_000_000_000 iterations:

Randall: 774c_e5f7
Mine: 774C_C9B3
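
(If I subtract right, that gap is 774C_E5F7 - 774C_C9B3 = 1C44 hex, about
7,200 ticks spread over the whole 1_000_000_000 iterations, so per iteration
the two are practically a dead heat on this machine.)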

>
> -sevag.k
> www.geocities.com/kahlinor
>