From: //o//annabee on
On Thu, 10 Jan 2008 21:10:40 +0100, Wolfgang Kern <nowhere(a)never.at> wrote:

>
> "//\\o//\\annabee" <w(a)www.akow> wrote in message
> news:op.t4qnyvspwzh472(a)cyh1axtn1428g42...
>> On Thu, 10 Jan 2008 15:14:59 +0100, Wolfgang Kern
> <nowhere(a)never.at> wrote:
>>
>> >
>> > Wannabee wrote:
>> > ...
>> >>> ADD [mem],... is a good example of a READ-MODIFY-WRITE sequence,
>> >>> and it is faster than its discrete replacement:
>> >
>> >>> MOV eax,[var1]
>> >>> MOV ebx,[var2]
>> >>> ADD eax,ebx
>> >>> MOV [var1],eax
>> >
>> >>> is much slower than:
>> >
>> >>> MOV eax,[var2]
>> >>> ADD [var1],eax
>> >
>> >> I mean, single core amd64.
>> >> 8 cycles I read, inclusive, Using your previously posted technic.
>> >> However, there is a diffrence, when changing
>> >> variables such as length of timed code, the weather outside,
>> >> if the window is open, or having taken a leak before timing,
>> >> eg. doing it twice (code just doubled) and the last variant wins by
>> >> the magnificent 1 cycle.
>> >>
>> >> A gigantic win :D
>> >
>> > And now try it again with the two variables (let's say 4KB) apart
>> > from each other ...
>>
>> Ok. But this is either
>>
>> a) a pagefault,
>> b) a cachemiss
>> and doesnt relate to instruction sequence or the CPU?
>> A pagefault I have messured is in the 2000 cycle range.
>> A cachemiss, maybe as much as 100 or less cycles?
>
> Yes. One cache one penalty for sure
> (~35 on K7 500/33)
> (~60 on AMD64 2000/100)
>
> Windoze seem to occupy all the cache anytime, so we are lucky if we
> got one free line for our tests.

No. You get a whole timeslice, which can be very long if you run in
realtime mode. The reason Windows seems so slow is the constant paging
of memory-hungry programs, plus (I guess) I/O scheduling. Windows is
simply slow because hard drives and I/O are slow. And because certain
Windows apps pay no attention to avoiding this problem (BitTorrent for
instance, Windows itself, and also Opera is dog slow in this regard).
Windows is often just I/O bound. As long as you only talk to memory,
Windows is not significantly slow at all.
The kernel is _not_ faster than user apps. I think we should confirm it.
Write an app in DOS that floats a 256*256 bitmap across a 32-bit
VESA canvas. Then I write one using only GDI. Then we can
compare the framerates. It would also be interesting to see for other
reasons. I have never heard of speed comparisons between DOS and Windows
(even though I guess they exist).
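
To make the proposed test concrete, here is a minimal sketch in C of the
inner step, one frame of "floating" the bitmap. It assumes the canvas is
ordinary memory laid out as 1024*768 32-bit pixels (the mode mentioned
later in the thread); the names blit32, CANVAS_W and so on are made up
for illustration, and neither the VESA setup nor the GDI calls are shown.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed sizes: 1024x768 canvas, 256x256 bitmap, 32 bits per pixel. */
enum { CANVAS_W = 1024, CANVAS_H = 768, BMP_W = 256, BMP_H = 256 };

/* Copy the bitmap onto the canvas with its top-left corner at (x, y),
   clipping against the canvas edges. Returns the pixels written. */
static long blit32(uint32_t *canvas, const uint32_t *bmp, int x, int y)
{
    assert(canvas != NULL && bmp != NULL);

    int x0 = x < 0 ? -x : 0;   /* first visible bitmap column */
    int y0 = y < 0 ? -y : 0;   /* first visible bitmap row */
    int x1 = x + BMP_W > CANVAS_W ? CANVAS_W - x : BMP_W; /* one past last */
    int y1 = y + BMP_H > CANVAS_H ? CANVAS_H - y : BMP_H;
    if (x0 >= x1 || y0 >= y1)
        return 0;              /* completely off screen */

    /* One memcpy per visible row: a whole scanline segment at a time. */
    for (int row = y0; row < y1; row++) {
        memcpy(canvas + (size_t)(y + row) * CANVAS_W + (x + x0),
               bmp + (size_t)row * BMP_W + x0,
               (size_t)(x1 - x0) * sizeof(uint32_t));
    }
    return (long)(x1 - x0) * (y1 - y0);
}
```

A frame loop would call blit32 once per frame with a moving (x, y) and
count frames per second; restoring the background between frames is left
out here.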


> [why try to write optimised...]
>>> Application code contain many small code parts and 'a few wasted
>>> cycles'
>>> here and there may not look 'that' relevant.
>>> But the effect is of multiplying nature ...
>
>> Agree. When things gets generalized, as rules. And then forgotten.
>> But I must say that with so few registers, and two allways occupied,
>> the stack is rather handy for temporal data. It also sometimes makes the
>> code less bloated.
>> (eg, more clean looking, i belive, espesially if it is part
>> of a long sequence of a complicated stateful monster)
>
>>>> Why not post something interessting about AI? You once said that
>>>> the main reason that AI was not advanced more, was that ppl wore
>>>> "not that bright". Would love to read something you wrote that
>>>> conserns AI.....
>
>>> I just played around with several ideas, but there is no AI-project
>>> on my table yet.
>>> So even some of my OS-features may look like AI, this are all just
>>> automated configuration adjustments on track keeping of users typing
>>> speed or count how often he hit BS,DEL in a text-session and respond
>>> with a funny message if this exceeds his average count per page.
>
>> :) Hex version of that Word feature Beth used to talk about?
>
> Oh yeah, Beth posted many good ideas within her novels :)

Yup. She was kind of the health insurance in this ng.

:))

> ...
>>> An ASM-programmer who is aware of timing and instruction size
>>> will always write fast and short code.
>
>> nearly all my cycles are taken by drawing.
>> even writing a single char to a graphic screen cost more then counting
>> the entire string.
>
> I need to compare my code with windoze one more time.
> My screen routines write direct unbuffered to the VRAM and the
> last upgrade on text display show an average of 33 cycles per dot,
> but it still works on single characters and I think to improve
> this and work on whole strings, so it may end up below 30.

Didn't understand anything after "but".
btw, did you like the youtube link I posted?

(Yes, on the 3rd reading. Yes. Good idea. Writing a whole scanline at a
time will remove a bunch of cache misses. I guess you are just toying
with me now, eh?)

>
>>>> For the 100th time I like to repeat that all the problems I have comes
>>>> from finding solutions to problems, and not with asm itself.
>>> IF problem CAUSE problem ITERATE IF ??
>>> if you can't 'find' a solution then create one ;)
>
>> :) I mean finding out / discovering solutions is where _my_ cycles go.
>> You cannot create a solution until the problem is propperly mapped.
>> So finding it out, is a needed part of the "creating", imo.
>
> :) boot an old DOS6.00 and run your code under test there ?
> the problem with timing in windoze is just a windoze-problem ...
> we measure cache penalties and page faults, and our code could perform
> that fast, that we don't even see any difference.

Well, I did manage to time your 40-cycle code at 48 cycles
(and what if we remove the overhead from that?).
If you want, let's do the bitmap test I noted above and see what comes out.

> __
> wolfgang
>
>
>



--
From: Frank Kotler on
//\\o//\\annabee wrote:

....
> Write an app in DOS that floats a 256*256 bitmap
> across a 32-bit VESA canvas. Then I write one using only GDI.
> Then we can compare the framerates. It would also be interesting to see
> for other reasons. I have never heard of speed comparisons between DOS
> and Windows (even though I guess they exist).

What are you calling "dos"? A "dos box", it seems to me, can't
*possibly* be any faster than Windows, since Windows is providing "fake
dos". If we start from "real dos", so we can go into "flat real mode",
and use a Linear Frame Buffer instead of "bank-switched" hi-res mode, we
can go pretty fast. But Windows has access to "hardware acceleration".
No theoretical reason you couldn't do it from dos (AFAIK), but you'd
need a routine to detect what card you had, and routines to implement
hardware acceleration for as many cards as you'd like to support. This
is where Windows has got the jump on us - mfg's provide drivers for
Windows, and don't tell *us* squat... If we *did* do it, dos ought to be
faster, since it's single-user, single-tasking, unpaged... Timer
interrupt every 18th of a second or so, and we can turn that off... I
don't think I've ever seen hardware acceleration from dos, though - no
idea what it would look like.

If hardware acceleration were "outlawed" in the Windows version, I think
dos could be made to go faster, just from the lack of "competition".

Best,
Frank

From: //o//annabee on
On Fri, 11 Jan 2008 02:11:40 +0100, Frank Kotler
<fbkotler(a)verizon.net> wrote:

> //\\o//\\annabee wrote:
>
> ...
>> Write an app in DOS that floats a 256*256 bitmap across a 32-bit
>> VESA canvas. Then I write one using only GDI. Then we can
>> compare the framerates. It would also be interesting to see for other
>> reasons. I have never heard of speed comparisons between DOS and Windows
>> (even though I guess they exist).
>
> What are you calling "dos"? A "dos box", it seems to me, can't
> *possibly* be any faster than Windows, since Windows is providing "fake
> dos". If we start from "real dos", so we can go into "flat real mode",
> and use a Linear Frame Buffer instead of "bank-switched" hi-res mode, we
> can go pretty fast. But Windows has access to "hardware acceleration".
> No theoretical reason you couldn't do it from dos (AFAIK), but you'd
> need a routine to detect what card you had, and routines to implement
> hardware acceleration for as many cards as you'd like to support. This
> is where Windows has got the jump on us - mfg's provide drivers for
> Windows, and don't tell *us* squat... If we *did* do it, dos ought to be
> faster, since it's single-user, single-tasking, unpaged... Timer
> interrupt every 18th of a second or so, and we can turn that off... I
> don't think I've ever seen hardware acceleration from dos, though - no
> idea what it would look like.

Well, I didn't mean "dos", I meant bare metal, i.e. non-Windows
(whichever is chosen).
As far as I know, GDI does not support hardware acceleration. It does
not start DirectX in the background and use that for the GDI blits
(DirectX takes ~half a second to start the first time), and GDI blits
are too slow for me to think it uses some specific shortcut to the
hardware acceleration, though I am not 100% sure.

When I checked the line APIs against a DIB section (backbuffer = memory),
they had been optimized pretty damn well since the books I read on the
subject were written. I was unable to write faster code until I started
playing with the prefetch instructions. And the GDI blit is roughly as
fast as that, but not as fast as hardware. (Maybe Herbert knows better
on this.)

When it comes to multitasking, I just have to repeat that this is not a
problem. Both Windows and Linux use preemptive multitasking, which means
that you can preempt other processes. It is pictured as rings of
priority, where the tasks in each ring are linked to each other, and
where no lower ring gets a chance to run while there is still work to be
done in a higher ring. You can preempt the system and steal nearly all
cycles for yourself. I provided a demonstration here earlier, where a
GDI app can preempt both the mouse and keyboard messages and keep going.
Not very useful, but possible.
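
The "rings of priority" picture can be sketched as a strict-priority,
round-robin picker. This is only a simplified model of the description
above, not the actual Windows or Linux scheduler (real schedulers add
priority boosts, quantum accounting and more); Level, pick_next and the
fixed sizes are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stddef.h>

enum { LEVELS = 4, MAX_TASKS = 8 };   /* illustrative sizes only */

typedef struct {
    int    ids[MAX_TASKS];  /* runnable task ids in this ring */
    size_t count;           /* how many of them are runnable */
    size_t next;            /* round-robin cursor within the ring */
} Level;

/* Pick the task that gets the next timeslice. Level 0 is the highest
   priority. A lower ring is only reached when every higher ring is
   empty, so one always-runnable high-priority task starves the rest.
   Returns -1 if nothing is runnable. */
static int pick_next(Level levels[LEVELS])
{
    assert(levels != NULL);
    for (int p = 0; p < LEVELS; p++) {
        Level *l = &levels[p];
        if (l->count == 0)
            continue;                       /* ring is idle, look lower */
        int id = l->ids[l->next % l->count];
        l->next = (l->next + 1) % l->count; /* rotate within the ring */
        return id;
    }
    return -1;
}
```

In this model, an app that never blocks and sits alone in the top ring
is returned by every call, which is the "steal nearly all cycles" effect
described above.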

The reason, I guess, why some people don't think it works that way is
that Windows is an event-driven OS. That means that until your app gets
some kind of message, it is simply a dead piece of memory, doing nothing.
But once it has a timeslice and has been given a message (it's not enough
to just have a timeslice), you can set your app to realtime priority and
call PeekMessage instead of GetMessage. PeekMessage does not block, so
you can continue running the app after it returns, and with full
priority you will receive all the timeslices for the ring. With that you
can totally preempt the system, leaving your app the only survivor. Even
though the system still reads and takes care of priority interrupts, it
will not be able to send you any messages. But you will have the slices,
and your app will "work". Basically, at this point you can start
DirectInput and read the keyboard buffer directly to see if there are
keys being pressed (not tested), etc.

(This is some guesswork, and some observation and some of the material
from the SDK).


> If hardware acceleration were "outlawed" in the Windows version, I think
> dos could be made to go faster, just from the lack of "competition".

That's what I'd like to find out.
It could be interesting to see the difference, if any.

>
> Best,
> Frank
>


From: Robert Redelmeier on
Frank Kotler <fbkotler(a)verizon.net> wrote in part:
> What are you calling "dos"? A "dos box", it seems to me,
> can't *possibly* be any faster than Windows, since Windows
> is providing "fake dos".

Not on any sort of consistent running basis. But what if your
MS-Win32 app measurements include startup CoW pagefaults that the
MS-Win*-dosbox incurs on start-up by zeroing 640kB? You cannot
measure startup time because your app doesn't have control.

-- Robert

From: Wolfgang Kern on

Wannabee wrote:
....
>> Windoze seem to occupy all the cache anytime, so we are lucky if we
>> got one free line for our tests.

> No. You get a whole timeslice, which can be very long if you run in
> realtime mode. The reason Windows seems so slow is the constant
> paging of memory-hungry programs, plus (I guess) I/O scheduling.
> Windows is simply slow because hard drives and I/O are slow.

Don't KESYS and L'unix have to deal with the same hardware as windoze?

> And because certain Windows apps pay no attention
> to avoiding this problem (BitTorrent for instance, Windows itself, and also
> Opera is dog slow in this regard). Windows is often just I/O bound.
> As long as you only talk to memory, Windows is not significantly slow
> at all.
> The kernel is _not_ faster than user apps. I think we should confirm it.

The problem with an 'event-driven' OS is that an IRQ may directly trigger
a huge program part assigned to it, blocking other events and threads
(well known as the hourglass and the frozen mouse) for some time.

> Write an app in DOS that floats a 256*256 bitmap across a 32-bit
> VESA canvas. Then I write one using only GDI. Then we can
> compare the framerates. It would also be interesting to see for other
> reasons.

Ye olde DOS is awfully slow at this, because it needs to detour through
INT 10h for every dot or has to use the little 32 KB frame at A0000
and switch its pages all the time, ...

> I have never heard of speed comparisons between DOS and Windows
> (even though I guess they exist).

...but with any 32-bit DOS extender like KEMM (still found in demos)
and flat-framed VRAM, the screen memory can be written as fast as
the bus allows (not counting AGP- and HW-specific acceleration yet).
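
The page-switching cost can be illustrated with a little address
arithmetic in C. This is a sketch under assumptions: the window size is
passed in as a parameter (real hardware reports its own window size and
granularity), a 1024-pixel-wide 32-bit mode has a pitch of 4096 bytes,
and the names banked_addr and count_bank_switches are invented here.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t bank;    /* which window-sized page of VRAM must be mapped */
    uint32_t offset;  /* byte offset of the pixel inside the window */
} BankedAddr;

/* Split the linear VRAM address of pixel (x, y) into bank and offset.
   With a flat frame buffer the linear address is used directly. */
static BankedAddr banked_addr(uint32_t x, uint32_t y, uint32_t pitch_bytes,
                              uint32_t bytes_per_pixel, uint32_t window_bytes)
{
    assert(window_bytes != 0);
    uint32_t linear = y * pitch_bytes + x * bytes_per_pixel;
    BankedAddr a = { linear / window_bytes, linear % window_bytes };
    return a;
}

/* Count the bank switches needed to write a w*h rectangle at (0, 0),
   row by row, starting with no bank mapped. */
static uint32_t count_bank_switches(uint32_t w, uint32_t h,
                                    uint32_t pitch_bytes,
                                    uint32_t bytes_per_pixel,
                                    uint32_t window_bytes)
{
    uint32_t current = UINT32_MAX;   /* sentinel: nothing mapped yet */
    uint32_t switches = 0;
    for (uint32_t y = 0; y < h; y++) {
        for (uint32_t x = 0; x < w; x++) {
            BankedAddr a = banked_addr(x, y, pitch_bytes,
                                       bytes_per_pixel, window_bytes);
            if (a.bank != current) {
                current = a.bank;
                switches++;
            }
        }
    }
    return switches;
}
```

For a 256*256 bitmap at the top-left corner with a 4096-byte pitch, a
32 KB window holds 8 scanlines, so this model needs 32 switches per
frame; with flat-framed VRAM it needs none.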

....
>>> nearly all my cycles are taken by drawing.
>>> even writing a single char to a graphic screen cost more then
>>> counting the entire string.

A solid graphic character needs e.g. 16*8 = 128 dots to draw,
and foreground/background colours are the obstacles here ;)
So it is faster to 'fill' the background rectangle with lines, if
required at all, and output the characters in transparent mode
(at least it is faster on KESYS).
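
The difference between the two modes can be sketched in C as follows.
It assumes an 8*16 glyph stored as 16 bytes with the most significant
bit as the leftmost dot, and a 32-bit canvas in ordinary memory; this is
not KESYS code, and the function names are made up. Each function
returns the number of dots it wrote, so the saving is visible.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { GLYPH_W = 8, GLYPH_H = 16 };   /* assumed glyph cell: 8x16 dots */

/* Opaque mode: every one of the 8*16 = 128 dots gets fg or bg. */
static int draw_glyph_opaque(uint32_t *dst, size_t pitch_px,
                             const uint8_t *glyph, uint32_t fg, uint32_t bg)
{
    assert(dst != NULL && glyph != NULL);
    int written = 0;
    for (int row = 0; row < GLYPH_H; row++) {
        uint32_t *line = dst + (size_t)row * pitch_px;
        for (int col = 0; col < GLYPH_W; col++) {
            line[col] = (glyph[row] & (0x80u >> col)) ? fg : bg;
            written++;
        }
    }
    return written;
}

/* Transparent mode: only the set bits are written, the background
   (already filled, if it is needed at all) is left untouched. */
static int draw_glyph_transparent(uint32_t *dst, size_t pitch_px,
                                  const uint8_t *glyph, uint32_t fg)
{
    assert(dst != NULL && glyph != NULL);
    int written = 0;
    for (int row = 0; row < GLYPH_H; row++) {
        uint32_t *line = dst + (size_t)row * pitch_px;
        for (int col = 0; col < GLYPH_W; col++) {
            if (glyph[row] & (0x80u >> col)) {
                line[col] = fg;
                written++;
            }
        }
    }
    return written;
}
```

Whether the transparent path is actually faster depends on the per-dot
test versus the saved writes; the claim above is for KESYS, and this
sketch only shows where the dot count comes from.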

>> I need to compare my code with windoze one more time.
>> My screen routines write direct unbuffered to the VRAM and the
>> last upgrade on text display show an average of 33 cycles per dot,
>> but it still works on single characters and I think to improve
>> this and work on whole strings, so it may end up below 30.

> Didn't understand anything after "but".
> btw, did you like the youtube link I posted?

> (Yes, on the 3rd reading. Yes. Good idea. Writing a whole scanline at a
> time will remove a bunch of cache misses. I guess you are just toying
> with me now, eh?)

Not yet :)
I'd like to save the call/ret pair for every character and loop over the
string length in the core instead, even if this slows single characters down.

....
>> :) boot an old DOS6.00 and run your code under test there ?
>> the problem with timing in windoze is just a windoze-problem ...
>> we measure cache penalties and page faults, and our code could perform
>> that fast, that we don't even see any difference.

> Well, I did manage to time your 40-cycle code at 48 cycles
> (and what if we remove the overhead from that?).

Which one do you have in mind here, the 32-bit ASCIIh2bin?
Yes, the BCD (BCH) packing could be improved here.

> If you want, let's do the bitmap test I noted above
> and see what comes out.

Fine. Even though I don't have vendor-specific accelerating DirectX
drivers for it, I'll time a bitmap of this size using the DEMO-KESYS
under DOS, but I need to create a 256 KB picture in 32-bit format first,
because KESYS isn't a game console and its standard is 8-bit colour.
I assume we both use 1024*768, 32-bit (@100 Hz, if this is relevant at all).

__
wolfgang


