From: //o//annabee on
On Thu, 10 Jan 2008 12:35:06 +0100, Wolfgang Kern <nowhere(a)never.at> wrote:

>
> shikamuk asked:
>
>> Hello
>> I wonder which way is faster while doing arithmetical expressions.
>> For example, I can add to allocated variables, and I can move them to
>> the registers and then add.
>> Probably, addition in the registers is faster.
>> But what about time to move them to the registers?
>
> Every memory access takes its time, also if already cached or on stack.
>
> ADD [mem],... is a good example of a READ-MODIFY-WRITE sequence,
> and it is faster than its discrete replacement:
>
> MOV eax,[var1]
> MOV ebx,[var2]
> ADD eax,ebx
> MOV [var1],eax
>
> is much slower than:
>
> MOV eax,[var2]
> ADD [var1],eax

I measured this on a single-core AMD64.
I read 8 cycles, inclusive, using the technique you posted previously.
However, there is a difference when changing
variables such as the length of the timed code, the weather outside,
whether the window is open, or whether I took a leak before timing.
E.g. when doing it twice (code just doubled), the last variant wins by
a magnificent 1 cycle.

A gigantic win :D

Let's say the 8 cycles are overhead?

:D

(K7 reads 0F cycles - for both, inclusive)

Why do you give advice on code optimization when it appears
to me so utterly useless in real life? If your strategy is not
100% correct at the start of coding, you will have to repeat your rigorous
coding when you rewrite, and you have to TIME EACH CHANGE.
Given the errors I see myself make each time I do that, I need to
re-verify my assumptions at least three times,
and even then I am not sure I got everything right.....

Would it not be more fruitful to post more on the strategy of things?
Does anyone really need a fast hexstring to binary routine?
Very fast typists perhaps?

Why not post something interesting about AI? You once said that
the main reason AI had not advanced further was that people were
"not that bright". I would love to read something you wrote that
concerns AI.....

Code optimizations are, as Betov said a hundred times, one of the
bad arguments that work against the assembly community, because
assembly is mostly viewed as having a purpose only for this kind of thing.
(Which I figure must be viewed as a very weak argument.)

What are your thoughts on that?

For the 100th time, I'd like to repeat that all the problems I have come
from finding solutions to problems, and not from asm itself.

> but it also depends on where you want the result
> ie:
>
> MOV eax,[var2]
> ADD eax,[var1]
>
> is a few micro-cycles faster than with [mem] as result destination.
>
> __
> wolfgang
>
>
>



--
http://www.youtube.com/watch?v=pZ6zzE8JUGY
From: Wolfgang Kern on

Wannabee wrote:
....
>> ADD [mem],... is a good example of a READ-MODIFY-WRITE sequence,
>> and it is faster than its discrete replacement:

>> MOV eax,[var1]
>> MOV ebx,[var2]
>> ADD eax,ebx
>> MOV [var1],eax

>> is much slower than:

>> MOV eax,[var2]
>> ADD [var1],eax

> I mean, single core amd64.
> 8 cycles I read, inclusive, Using your previously posted technic.
> However, there is a diffrence, when changing
> variables such as length of timed code, the weather outside,
> if the window is open, or having taken a leak before timing,
> eg. doing it twice (code just doubled) and the last variant wins by
> the magnificent 1 cycle.
>
> A gigantic win :D

And now try it again with the two variables (let's say 4KB) apart
from each other ...

> Lets say the 8 cycles are overhead?
> :D
> (K7 reads 0F cycles - for both, inclusive)

Sure, but only if the two vars share one cache line.
My K7 shows only 8 cycles for the first and 4 cycles for the shorter one.
I haven't compared it on AMD64 yet, but it should be the same there.
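
Whether two variables share a cache line can be checked from their addresses. The following is only a sketch in C: the 64-byte line size is an assumption (typical for K7 and AMD64, but check your CPU), and the name same_cache_line is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed cache line size in bytes; 64 is typical for K7 and AMD64. */
#define CACHE_LINE_SIZE 64u

/* Hypothetical helper: returns 1 if both addresses fall into the
   same cache line, 0 otherwise. */
static int same_cache_line(const void *a, const void *b)
{
    /* The line size must be a power of two for this to make sense. */
    assert((CACHE_LINE_SIZE & (CACHE_LINE_SIZE - 1u)) == 0);

    /* Dividing an address by the line size gives its line number. */
    uintptr_t line_a = (uintptr_t)a / CACHE_LINE_SIZE;
    uintptr_t line_b = (uintptr_t)b / CACHE_LINE_SIZE;
    return line_a == line_b;
}
```

Two adjacent dwords that start on a 64-byte boundary land in one line under this assumption, while two variables placed 4KB apart never do.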

> Why do you give advice on codeoptimization,

Because of the question in the topic? :)

> when it is appear
> to me so utterly useless in the real life? If your strategy is not
> 100% correct at the start of coding, you will have to repeat you rigorous
> coding when you rewrite, and you have to TIME EACH CHANGE.

No, I mainly rely on the info from the manuals, plus experience, when
I write or modify my code. Once it is finished, I compare the speed of
the whole function with what I previously noted for it.
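
Timing a whole function rather than single instructions could look roughly like the C sketch below. It is not the timing technique referred to in this thread. It uses the standard C11 wall clock instead of a cycle counter, calls the function once untimed as a warm-up, and keeps the fastest of several runs to filter out noise. All names (now_ns, best_time_ns, demo_work, g_calls) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: current time in nanoseconds (C11 timespec_get). */
static uint64_t now_ns(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Calls fn() once untimed, then `runs` times timed, and returns the
   fastest timed run in nanoseconds. */
static uint64_t best_time_ns(void (*fn)(void), int runs)
{
    assert(fn != 0 && runs > 0);

    fn(); /* warm-up: loads caches and takes first-touch page faults */

    uint64_t best = UINT64_MAX;
    for (int i = 0; i < runs; i++) {
        uint64_t start = now_ns();
        fn();
        uint64_t elapsed = now_ns() - start;
        if (elapsed < best)
            best = elapsed;
    }
    return best;
}

/* Hypothetical workload: var1 += var2 on memory variables. */
static volatile int var1 = 1, var2 = 2;
static int g_calls; /* counts how often demo_work ran */

static void demo_work(void)
{
    var1 += var2;
    g_calls++;
}
```

A call such as best_time_ns(demo_work, 1000) gives one number to note down and compare after each rewrite. A workload this small mostly measures the clock overhead, so a real function under test should do considerably more work per call.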

> Given the errors I see I make each time I do that, and then I need to
> re-verify my assumtions at least tree times,
> and even then I am not sure I got everything right.....

I think learning the ASM instructions together with their timing issues
helps a lot in later work.

> Would it not be more fruitful to post more on the strategy of things?

My strategy is 'small/fast/smart', so it is often just a compromise.

> Does anyone really need a fast hexstring to binary routine?
> Very fast typers perhaps?

Application code contains many small code parts, and 'a few wasted cycles'
here and there may not look 'that' relevant.
But the effect multiplies ...
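
The multiplication can be put into numbers. This C sketch only does the arithmetic; the call rate and clock speed in the example are made-up figures, and wasted_fraction is a hypothetical name.

```c
#include <assert.h>

/* Hypothetical helper: fraction of one CPU's time (0.0 to 1.0) that is
   lost when every call wastes `wasted_cycles` cycles. */
static double wasted_fraction(double wasted_cycles,
                              double calls_per_sec,
                              double clock_hz)
{
    assert(clock_hz > 0.0);
    return wasted_cycles * calls_per_sec / clock_hz;
}
```

With 4 wasted cycles per call, an assumed 1,000,000 calls per second, and an assumed 2 GHz clock, wasted_fraction(4.0, 1e6, 2e9) gives 0.002, i.e. 0.2% of the CPU. The figure grows linearly with the number of such code parts and with how often each is called.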

> Why not post something interessting about AI? You once said that
> the main reason that AI was not advanced more, was that ppl wore
> "not that bright". Would love to read something you wrote that
> conserns AI.....

I have just played around with several ideas, but there is no AI project
on my table yet.
So even if some of my OS features may look like AI, they are all just
automated configuration adjustments: keeping track of the user's typing
speed, or counting how often he hits BS or DEL in a text session and
responding with a funny message if this exceeds his average count per page.

> Codeoptimizations are, as Betov said a hundred times, one of the
> ill-arguments that work against the assembly community, because
> assembly is mostly viewed to have a purpose for this kind of thing.
> (Which must be viewed as a very weak argument I figure)

> Whats your thoughts on that?

As above, the multiplying effect ...
An ASM-programmer who is aware of timing and instruction size
will always write fast and short code.

> For the 100th time I like to repeat that all the problems I have comes
> from finding solutions to problems, and not with asm itself.

IF problem CAUSE problem ITERATE IF ??

If you can't 'find' a solution, then create one ;)
__
wolfgang



From: //o//annabee on
On Thu, 10 Jan 2008 15:14:59 +0100, Wolfgang Kern <nowhere(a)never.at> wrote:

>
> Wannabee wrote:
> ...
>>> ADD [mem],... is a good example of a READ-MODIFY-WRITE sequence,
>>> and it is faster than its discrete replacement:
>
>>> MOV eax,[var1]
>>> MOV ebx,[var2]
>>> ADD eax,ebx
>>> MOV [var1],eax
>
>>> is much slower than:
>
>>> MOV eax,[var2]
>>> ADD [var1],eax
>
>> I mean, single core amd64.
>> 8 cycles I read, inclusive, Using your previously posted technic.
>> However, there is a diffrence, when changing
>> variables such as length of timed code, the weather outside,
>> if the window is open, or having taken a leak before timing,
>> eg. doing it twice (code just doubled) and the last variant wins by
>> the magnificent 1 cycle.
>>
>> A gigantic win :D
>
> And now try it again with the two variables (let's say 4KB) apart
> from each other ...

OK. But this is either

a) a page fault, or
b) a cache miss,
and doesn't relate to the instruction sequence or the CPU?
A page fault, as I have measured it, is in the 2000-cycle range.
A cache miss, maybe 100 cycles or less?

>> Lets say the 8 cycles are overhead?
>> :D
>> (K7 reads 0F cycles - for both, inclusive)
>
> Sure, but only if the two vars share one cache line.
> my K7 shows only 8 cycles for the first and 4 cycles for the shorter,
> I haven't compared it on AMD64 yet, but it should be equal here.

>> Why do you give advice on codeoptimization,
>
> Because of the question in the topic ? :)
>
>> when it is appear
>> to me so utterly useless in the real life? If your strategy is not
>> 100% correct at the start of coding, you will have to repeat you
>> rigorous
>> coding when you rewrite, and you have to TIME EACH CHANGE.
>
> No, I mainly use the info from the manuals beside experience when
> I modify or write my code and compare the speed of a whole
> function with what I previously noted for it after finished.

OK. I was more or less convinced that you always wrote 100% optimized code,
that you fitted each instruction to the entire sequence.
Btw, what is it with you and the stack? It is sometimes helpful, you know,
for temporary data. Did you once have a terrible experience with it? :D
It's almost as if you treat the stack as some sort of evil monster.

>> Given the errors I see I make each time I do that, and then I need to
>> re-verify my assumtions at least tree times,
>> and even then I am not sure I got everything right.....
>
> I think learning ASM instruction together with timing issues
> helps a lot on the later work.

>> Would it not be more fruitful to post more on the strategy of things?
>
> My stategy is 'small/fast/smart', so often just a compromise.

>> Does anyone really need a fast hexstring to binary routine?
>> Very fast typers perhaps?
>
> Application code contain many small code parts and 'a few wasted cycles'
> here and there may not look 'that' relevant.
> But the effect is of multiplying nature ...

Agreed. When things get generalized as rules. And then forgotten.
But I must say that with so few registers, and two always occupied,
the stack is rather handy for temporary data. It also sometimes makes the
code less bloated (i.e. cleaner looking, I believe, especially if it is
part of a long sequence in a complicated stateful monster).

>> Why not post something interessting about AI? You once said that
>> the main reason that AI was not advanced more, was that ppl wore
>> "not that bright". Would love to read something you wrote that
>> conserns AI.....
>
> I just played around with several ideas, but there is no AI-project
> on my table yet.
> So even some of my OS-features may look like AI, this are all just
> automated configuration adjustments on track keeping of users typing
> speed or count how often he hit BS,DEL in a text-session and respond
> with a funny message if this exceeds his average count per page.

:) Hex version of that Word feature Beth used to talk about?

>> Codeoptimizations are, as Betov said a hundred times, one of the
>> ill-arguments that work against the assembly community, because
>> assembly is mostly viewed to have a purpose for this kind of thing.
>> (Which must be viewed as a very weak argument I figure)
>
>> Whats your thoughts on that?
>
> As above, the multiplying effect ...
> An ASM-programmer who is aware of timing and instruction size
> will always write fast and short code.

Nearly all my cycles are taken by drawing.
Even writing a single char to a graphic screen costs more than counting
the entire string.

>> For the 100th time I like to repeat that all the problems I have comes
>> from finding solutions to problems, and not with asm itself.
>
> IF problem CAUSE problem ITERATE IF ??
>
> if you can't 'find' a solution then create one ;)

:) I mean finding out / discovering solutions is where _my_ cycles go.
You cannot create a solution until the problem is properly mapped.
So finding it out is a necessary part of the "creating", imo.

> __
> wolfgang
>
>
>

From: //o//annabee on
On Thu, 10 Jan 2008 15:14:59 +0100, Wolfgang Kern <nowhere(a)never.at> wrote:

>
> Wannabee wrote:
> ...
>>> ADD [mem],... is a good example of a READ-MODIFY-WRITE sequence,
>>> and it is faster than its discrete replacement:
>
>>> MOV eax,[var1]
>>> MOV ebx,[var2]
>>> ADD eax,ebx
>>> MOV [var1],eax
>
>>> is much slower than:
>
>>> MOV eax,[var2]
>>> ADD [var1],eax
>
>> I mean, single core amd64.
>> 8 cycles I read, inclusive, Using your previously posted technic.
>> However, there is a diffrence, when changing
>> variables such as length of timed code, the weather outside,
>> if the window is open, or having taken a leak before timing,
>> eg. doing it twice (code just doubled) and the last variant wins by
>> the magnificent 1 cycle.
>>
>> A gigantic win :D
>
> And now try it again with the two variables (let's say 4KB) apart
> from each other ...

I just did.
Both of them, of course, suffer the exact same page fault.



> wolfgang
>
>
>

From: Wolfgang Kern on

"//\\o//\\annabee" <w(a)www.akow> wrote in message
news:op.t4qnyvspwzh472(a)cyh1axtn1428g42...
> On Thu, 10 Jan 2008 15:14:59 +0100, Wolfgang Kern <nowhere(a)never.at> wrote:
>
> >
> > Wannabee wrote:
> > ...
> >>> ADD [mem],... is a good example of a READ-MODIFY-WRITE sequence,
> >>> and it is faster than its discrete replacement:
> >
> >>> MOV eax,[var1]
> >>> MOV ebx,[var2]
> >>> ADD eax,ebx
> >>> MOV [var1],eax
> >
> >>> is much slower than:
> >
> >>> MOV eax,[var2]
> >>> ADD [var1],eax
> >
> >> I mean, single core amd64.
> >> 8 cycles I read, inclusive, Using your previously posted technic.
> >> However, there is a diffrence, when changing
> >> variables such as length of timed code, the weather outside,
> >> if the window is open, or having taken a leak before timing,
> >> eg. doing it twice (code just doubled) and the last variant wins by
> >> the magnificent 1 cycle.
> >>
> >> A gigantic win :D
> >
> > And now try it again with the two variables (let's say 4KB) apart
> > from each other ...
>
> Ok. But this is either
>
> a) a pagefault,
> b) a cachemiss
> and doesnt relate to instruction sequence or the CPU?
> A pagefault I have messured is in the 2000 cycle range.
> A cachemiss, maybe as much as 100 or less cycles?

Yes. A cache-miss penalty for sure
(~35 on K7 500/33)
(~60 on AMD64 2000/100)
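
To compare these penalties across the two machines, the cycle counts can be converted to time. The C sketch below assumes that the first number in each pair above is the core clock in MHz (the post does not spell this out); cycles_to_ns is a hypothetical name.

```c
#include <assert.h>

/* Hypothetical helper: converts a cycle count to nanoseconds for a
   given core clock in MHz. One cycle at 1000 MHz lasts 1 ns. */
static double cycles_to_ns(double cycles, double core_mhz)
{
    assert(core_mhz > 0.0);
    return cycles * 1000.0 / core_mhz;
}
```

Under that assumption, cycles_to_ns(35, 500) gives 70 ns for the K7 and cycles_to_ns(60, 2000) gives 30 ns for the AMD64, so the newer machine pays more cycles per miss but less wall-clock time.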

Windoze seems to occupy all the cache all the time, so we are lucky if we
get one free line for our tests.

[why try to write optimised...]
>> Application code contain many small code parts and 'a few wasted cycles'
>> here and there may not look 'that' relevant.
>> But the effect is of multiplying nature ...

> Agree. When things gets generalized, as rules. And then forgotten.
> But I must say that with so few registers, and two allways occupied,
> the stack is rather handy for temporal data. It also sometimes makes the
> code less bloated.
> (eg, more clean looking, i belive, espesially if it is part
> of a long sequence of a complicated stateful monster)

>>> Why not post something interessting about AI? You once said that
>>> the main reason that AI was not advanced more, was that ppl wore
>>> "not that bright". Would love to read something you wrote that
>>> conserns AI.....

>> I just played around with several ideas, but there is no AI-project
>> on my table yet.
>> So even some of my OS-features may look like AI, this are all just
>> automated configuration adjustments on track keeping of users typing
>> speed or count how often he hit BS,DEL in a text-session and respond
>> with a funny message if this exceeds his average count per page.

> :) Hex version of that Word feature Beth used to talk about?

Oh yeah, Beth posted many good ideas within her novels :)

....
>> An ASM-programmer who is aware of timing and instruction size
>> will always write fast and short code.

> nearly all my cycles are taken by drawing.
> even writing a single char to a graphic screen cost more then counting
> the entire string.

I need to compare my code with windoze one more time.
My screen routines write directly, unbuffered, to VRAM, and the
last upgrade of the text display shows an average of 33 cycles per dot.
But it still works on single characters, and I plan to improve
this to work on whole strings, so it may end up below 30.

>>> For the 100th time I like to repeat that all the problems I have comes
>>> from finding solutions to problems, and not with asm itself.
>> IF problem CAUSE problem ITERATE IF ??
>> if you can't 'find' a solution then create one ;)

> :) I mean finding out / discovering solutions is where _my_ cycles go.
> You cannot create a solution until the problem is propperly mapped.
> So finding it out, is a needed part of the "creating", imo.

:) Boot an old DOS 6.00 and run your code under test there?
The problem with timing in windoze is just a windoze problem ...
We end up measuring cache penalties and page faults, and our code may
perform so fast that we don't even see any difference.
__
wolfgang


