From: //o//annabee on
On Mon, 14 Jan 2008 15:01:52 +0100, Wolfgang Kern <nowhere(a)never.at> wrote:

>
> Wannabee wrote:
>
> [about time]

:) I still keep making these "brainos" (n1 frank)

> < http://szmyggenpv.com/downloads/ >
>
> if the peekmessage.dispatch-using bitblit is the one you mean here,
> then I won't be surprised that it is slow ...
> Call the API for every single dot ?

It's ANTI ANTI Aliasing. Just 1/4 of a dot every run.

> I get a numeric figure of 970 (+/-2) on this test.

Then your VESA isn't fast at all. But your gc and Windows are.

> btw: I needed a three-finger salute to end it.

ALT+F4

>>>> Where is yours?
>>> To direct access a flat VideoRAM it needs a 32-bit OS which allow
>>> to write to this memory range (best without paging issues).
>
>> ok. But I still dont understand why you cannot just extract that code
>> insert it in the dos file, go to 32 bit flat mode and just do the blts
>> and write the numbers. After 26 years building an OS I imagine you
>> could do that inside of 10 minutes?
>
> It is possible during a DOS600 session, but rarely within a windoze DOSbox.
> First it needs to scan the VESA-BIOS for supported modes and create
> a list with mode specific data and capabilities.
> Then it must have:
> full PM32 support with forward/backward links
> any kind of memory manager which allows access to PCI-ranges,
> which also asks for a PCI device detector ...
> So what you ask me here would be a tiny OS on top of DOS.

For next "x-mas" then. :))

>> I am pretty sure I could do that in a couple of hours, or a day,
>> if I had the info.
>> (even I never did any dos programming)
>
> You can try ... ;)

Yes without sorted info it would have been.

>> then how can I verify your findings?
>> The difference for my app between an AMD64 and a 1500 MHz Athlon XP
>> is 4900 copies per second versus just 460+ per second.
>
>> using the OS BitBlt which I have considered fast, and which I have few
>> alternatives to unless using hardware acceleration.
>
>> Yours runs on a much slower computer, but achieves 1/5 of the AMD64
>> running at >2 GHz
>
>> I can hardly believe it. Your code is 6+ times faster than the Athlon XP
>
>>> ok, the 32-bit colour part looks like:
>
>>> usage:
>>> MOV eax,00001019h |INT 7F ;set VESAmode to 1024*768,32
>>> ecx= 0100 Ysize
>>> ebx= 0100 Xsize
>>> edx= 0 X+Yposition (Y in hw)
>>> eax= 0 colour mask
>>> esi= source ;btw: KESYS.bitmaps aren't stored upside down!
>>> AND [vflag],0f0 ;clear all options
>>> CALL draw_bmp
>>> MOV eax,00001009h |INT 7F ;set VESAmode to 1024*768,8 again
>>> _________
>>> draw_bmp:
>>> OR edi,ebx
>>> OR edi,ecx
>
>> what the heck is this? (above)
>
> Just initialising regs and set Vmode,
> or if you mean the two ORs, they check if both x+y are zero.

(YSize/XSize) i guess.

>>> |JZ ret ;just in case
>>> PUSH ebx
>>> PUSH edx ;[esp]=Xpos [esp+2]=Ypos
>
>>> ;clip_it:
>>> MOVZX eax,w[esp+2] ;eax= Ypos
>
>> Stack abuse? :D
> Yes, classical 'LOCALs' in here ;)
> I could replace it by MOV eax,edx |SHR eax,010
> but the value is needed later on too.

>>> MOV edx,0300 ;max lines (altered by Vmode)
>>> ADD eax,ebx
>>> CMP eax,edx |Jc L1>
>>> SUB eax,ebx |MOV ebx,edx |SUB ebx,eax |JS L9>
>>> L1:
>>> MOVZX eax,w[esp] ;Xpos
>>> MOV edx,01000 ;scan line size (altered by Vmode)
>
>> the Vmode change recode this one? SMC
>
> Yes, these immediate constant values are altered on Vmode changes.

Why?

>
>>> ADD eax,ecx
>>> CMP eax,edx |Jc L2>
>>> SUB eax,ecx |MOV ecx,edx |SUB ecx,eax |JS L9>
>>> L2:
>>> MOVZX eax,w[esp+2]
>>> IMUL eax,edx ;y*line size
>>> LEA edi,[eax+screen_start] ;from VESA-info,(altered by Vmode)
>
>> nice.
>
>>> MOVZX eax,w[esp] ;+x for 8-bit, +4*x for 32bit
>>> TEST[Vflag]40h ;indicates 8/32 bit colours
>>> JZ draw8 ;not shown yet
>>> TEST[Vmode]04h ;indicates colour mask active
>>> JNZ draw_32_eax ;not shown yet
>>> LEA edi,[edi+eax*4];
>
>>> ;draw it:
>>> PUSH ecx
>>> L3:MOV eax,edi ;keep the line start
>>> REP MOVSD
>>> ADD eax,edx ;add scan line size
>>> DEC ebx |MOV ecx,[esp]
>>> MOV edi,eax |JNZ L3<
>>> POP ecx
>>> L9: POP edx |POP ebx
>>> ret: RET
>>> ___________
>>> You see it's not optimised at all,
>
>> ? :D Looks very nice to me. short and excellent code I gather.
>
>>> I could try to improve the loop
>>> with MOVD/MOVNTQ or SSE 128-bit moves, even then any unaligned parts
>>> may destroy the gain.
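
For readers who do not read RosAsm syntax, the routine quoted above can be paraphrased at a high level. This is only a sketch in Python under stated assumptions, not Wolfgang's code: `draw_bmp` and `clip_extent` are hypothetical names, a bytearray stands in for video RAM, and the clipping is done in pixels and skips the clipped-off tail of each source row, which the posted listing does not obviously do.

```python
def clip_extent(pos, size, limit):
    """Return how much of `size` starting at `pos` fits below `limit` (0 if none)."""
    if pos >= limit:
        return 0
    return min(size, limit - pos)


def draw_bmp(screen, pitch, max_lines, src, x, y, width, height, bpp=4):
    """Copy a width*height bitmap into a flat frame buffer, one scan line at a time.

    screen    -- bytearray standing in for video RAM (assumption)
    pitch     -- bytes per scan line (01000h for 1024 pixels at 32 bit)
    max_lines -- visible scan lines (0300h = 768)
    src       -- bitmap bytes, rows stored top-down without padding
    Returns the number of pixels actually written.
    """
    rows = clip_extent(y, height, max_lines)
    cols = clip_extent(x, width, pitch // bpp)
    if rows == 0 or cols == 0:
        return 0
    dst = y * pitch + x * bpp              # y * line size, plus 4*x for 32 bit
    src_off = 0
    for _ in range(rows):                  # one pass per line, like DEC ebx / JNZ
        # copy one clipped row, the job REP MOVSD does in the listing
        screen[dst:dst + cols * bpp] = src[src_off:src_off + cols * bpp]
        dst += pitch                       # add scan line size
        src_off += width * bpp             # skip source pixels that were clipped
    return rows * cols


# Usage: a 256*256 bitmap drawn near the bottom-right corner of 1024*768,32
screen = bytearray(0x1000 * 0x300)
bitmap = bytes(256 * 256 * 4)
written = draw_bmp(screen, 0x1000, 0x300, bitmap, 900, 600, 256, 256)
```

The per-line work is a single bulk copy, which is why the discussion keeps coming back to bus and memory speed rather than instruction count.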
>
>>>> If you want your OS out of the picture. why dont you just write
>>>> it as a dos image instead?
>
>>> It wont work in plain DOS because it must use 32-bit code to
>>> access a flat VRAM (usually above 2GB).
>>> EMM and XMS wont do well here, because IRQs become disabled for too
>>> long and may lock up some hardware then.
>
>> But it shouldn't be all that hard still? To run a .com and break the barrier
>> with your own code? Or am I speaking out of ignorance here?
>> I can't imagine it would be much of a job for you?
>
> As said above, it would need to write a tiny OS on top of DOS,
> I've planned to release a new DOS6 based DEMO soon anyway

>>>> btw, I still have the copy of your demo. Will it run on that?
>
>>> I think this DEMO was a version.000 or 001, so it wont contain
>>> the bitmap draw nor any 32-bit colour support.
>
>> ok. I like your code, but would very much like to see it running
>> with printed numbers (fps). (as fast as it can run) Since we cant do
>> that
>> I would just have to trust you ... (I am not really hardwired for that)
>> :)
>
>> So you pushing 1/5 of a AMD64 400mhz fsb performance on a 500mhz
>> antique AMD?
>> And 6 times that of a 266mhz fsb athlon?
>> hmmm.....(teeth gnizzeling sounds) ..... Get out of here!
>
> I don't know how to interpret your '970' message.
> My estimation was about to be three times faster than windoze.

That is 970 blits of the bitmap per second.
What resolution?

> What have you expected when you compare a HLL-driven peek&pokeOS solution
> with one written in machine code running in 'un'-protected mode without
> paging ?

I had expected that the results would even out.
Given what you just said, maybe they do, if you use the same resolution
that I do (1024x768x32) for the K7. I now even think that Windows may be
faster than your VESA,
or very close. For all I know, it uses VESA, why not.

But that's hard to tell when you aren't able to provide code.


>> Rewrite it to a dos image, that set up the flat mode, and vesa,
>> and runs the app.
>> I know you can do that easily. And I promise you, if you do that i read
>> the code in hex.
>
> Again as above, perhaps I do it one day.

You could even do it in windows.

>> And I also will then restart the testing of the demo, if you want to.
>> (I now have enough hardware for dedicating a machine to testing).
>
> This old demo is almost obsolete already, I'd wait for the new version.
>
>> You want to prove a point, you have the means, (easily) so whats
>> stopping you?
>
> I don't need to sell my solution in this NG, and for me it's enough
> to know that my code performs much faster than windoze/L'unix/or else.

Until you post the resolution you ran my app in, that's unknown.

At this level, with code like yours, the code plays a lesser part and the
hardware is the only limiting factor. This is not where asm has
advantages. If your times are correct, then the code only accounts for 3.1% of
the speed.

__
> wolfgang
From: Wolfgang Kern on

Wannabee wrote:

[...about timing]

>> if the peekmessage.dispatch-using bitblit is the one you mean here,
>> then I won't be surprised that it is slow ...
>> Call the API for every single dot ?

> It's ANTI ANTI Aliasing. Just 1/4 of a dot every run.

mmh? Then it is actually 4 times faster than it looks.

>> I get a numeric figure of 970 (+/-2) on this test.
> Then your VESA isn't fast at all. But your gc and Windows are.

You reported 460 fps on K7 ?


>> btw: I needed a three-finger salute to end it.
> ALT+F4
I see, I rarely use these windoze shortcut keys ;)


.....
>> So what you ask me here would be a tiny OS on top of DOS.
> For next "x-mas" then. :))

Perhaps I do it earlier.

....
>>> the Vmode change recode this one? SMC
>> Yes, these immediate constant values are altered on Vmode changes.
> Why?

It saves me from having several variants of similar code;
besides, 'mov reg,imm' is faster than 'mov reg,[mem]'.

....
>>>> ;draw it:
>>>> PUSH ecx
>>>> L3:MOV eax,edi ;1 ;keep the line start
>>>> REP MOVSD ;15+ecx*4/3 = 357
>>>> ADD eax,edx ;1 ;add scan line size
>>>> DEC ebx ;1
>>>> MOV ecx,[esp] ;3
>>>> MOV edi,eax ;1
>>>> JNZ L3< ;1 ;total (357+8)*256 = 93440
>>>> POP ecx

All the main time is consumed by the above loop and
you can calculate the theoretical timing and compare it to the
actual reported timing to figure out how slow the hardware is.

Let's say this code shows only ~100000 CPU cycles.
Now we add all the cache penalties [~35 cycles per 64 bytes on a 500/66 MHz K7]
(only for reads here, I have VRAM non-cacheable on my PCs),
and these 256 KB are 4096 cache lines, so we add 143360 and get
about 250000 CPU cycles for the whole story.

So why did RDTSC report ~2'000'000 instead of ~250'000, which is
about eight times more?
The answer might be that the BUS-speed is limited to 66 MHz
while the CPU runs on 500MHZ... and 500/66 = 7.57 = near 8.
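
The estimate above is easy to re-run. A minimal sketch in Python that simply repeats the arithmetic from this post; the constant names are made up for illustration, and the per-instruction cycle counts and the 35-cycle penalty per 64-byte line are the figures stated above, not independent measurements.

```python
import math

ROWS = 256                 # lines copied (ebx = 0100h)
DWORDS_PER_ROW = 256       # dwords per line (ecx = 0100h)
LOOP_OVERHEAD = 1 + 1 + 1 + 3 + 1 + 1   # the six instructions around REP MOVSD
MISS_PENALTY = 35          # cycles per 64-byte cache line, as stated in the post
CACHE_LINE = 64            # bytes

# REP MOVSD estimate: 15 + ecx*4/3, rounded up to whole cycles
rep_movsd = math.ceil(15 + DWORDS_PER_ROW * 4 / 3)
code_cycles = (rep_movsd + LOOP_OVERHEAD) * ROWS

# Source reads: one penalty per cache line of the 256 KB bitmap
bytes_copied = ROWS * DWORDS_PER_ROW * 4
cache_lines = bytes_copied // CACHE_LINE
miss_cycles = cache_lines * MISS_PENALTY

estimate = code_cycles + miss_cycles    # the post rounds this up to ~250000
measured = 2_000_000                    # RDTSC figure from the post

print("rep movsd per line:", rep_movsd)
print("code cycles:", code_cycles)
print("cache lines:", cache_lines, "miss cycles:", miss_cycles)
print("measured / estimate:", round(measured / estimate, 2))
print("CPU clock / bus clock:", round(500 / 66, 2))
```

With the unrounded estimate the measured figure comes out a little over eight times larger, so the 500/66 bus explanation fits only roughly.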

I can try to improve my code and/or buy a top-speed graphics card,
but it won't ever be faster when the copy comes from a slow bus.

....
> That is 970 blits of the bitmap per second.
> What resolution?

1024*768,32

>> What have you expected when you compare a HLL-driven peek&pokeOS solution
>> with one written in machine code running in 'un'-protected mode without
>> paging ?

> I had expected that the result would equal out.
> Given what you just said, maybe they do, if you use the same
> resolution that I do (1024x768x32) for the K7.

These ~33 cycles per dot were measured on 1024*768,32
and I posted only this code part.
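
To put the cycle counts on the same footing as the blits-per-second numbers in this thread, the conversion can be sketched as below. This is an illustration under assumptions: it takes the ~2'000'000 RDTSC cycles as the cost of one 256*256 blit (the 0100h sizes from the usage example) on the 500 MHz CPU, and `blit_stats` is a hypothetical helper, not part of any posted code.

```python
def blit_stats(cycles_per_blit, width, height, cpu_hz, bytes_per_pixel=4):
    """Convert a cycle count for one blit into per-pixel and per-second rates."""
    pixels = width * height
    blits_per_second = cpu_hz / cycles_per_blit
    return {
        "cycles_per_pixel": cycles_per_blit / pixels,
        "blits_per_second": blits_per_second,
        "bytes_per_second": blits_per_second * pixels * bytes_per_pixel,
    }


# Assumed inputs: 2'000'000 cycles per 256*256 blit at 500 MHz
stats = blit_stats(2_000_000, 256, 256, 500_000_000)
for name, value in stats.items():
    print(name, round(value, 1))
```

Under these assumptions the result is about 30 cycles per dot, in the same range as the ~33 quoted, and about 250 such blits per second.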

> I even now think that windows may be faster than your VESA.

Probably, this card can do 200 MHz on AGP, but it won't work faster
than the given hardware environment allows.

> or very close. For all I know, it uses vesa, why not.

> But that's hard to tell when you aren't able to provide code.

Pushing on my ego? eh?

> You could even do it in windows.

Sure, if you can tell me how to get direct access to the VRAM.
I actually tried:
mov esi,0e000_0000 ;(screen address found in device mgr)
mov D$esi 0 ;access violation !!

>> I don't need to sell my solution in this NG, and for me it's enough
>> to know that my code performs much faster than windoze/L'unix/or else.

> Until you post the resolution you ran my app in, that's unknown.

Ok you cannot confirm it with your own eyes, but I posted it :)

> At this level, with code like yours, the code plays a lesser part and the
> hardware is the only limiting factor. This is not where asm has
> advantages. If your times are correct, then the code only accounts for
> 3.1% of the speed.

Yes, I'm on the HW-limit on this machine already.
So I need to check this on the 2GHz PCs soon.
__
wolfgang



From: Robert Redelmeier on
Wolfgang Kern <nowhere(a)never.at> wrote in part:
>>>>> PUSH ecx
>>>>> L3:MOV eax,edi ;1 ;keep the line start
>>>>> REP MOVSD ;15+ecx*4/3 = 357
>>>>> ADD eax,edx ;1 ;add scan line size
>>>>> DEC ebx ;1
>>>>> MOV ecx,[esp] ;3
>>>>> MOV edi,eax ;1
>>>>> JNZ L3< ;1 ;total (357+8)*256 = 93440
>>>>> POP ecx
>
>
> So why did RDTSC report ~2'000'000 instead of ~250'000, which is
> about eight times more?
> The answer might be that the BUS-speed is limited to 66 MHz
> while the CPU runs on 500MHZ... and 500/66 = 7.57 = near 8.

Perhaps, but if ESI is pointing at VRAM,
you are asking REP MOVSD to _read_ it.

Reading VRAM is _KNOWN_ to be extremely slow. Writing
VRAM is bad enough, but reading may wait for retrace.

Proper video drivers issue scrolling commands to the
GPU which crunches everything in an optimized manner.
And image loads from DRAM are done by the GPU busmastering.
Again, optimized with other graphics chores like rendering.
Unfortunately, I don't think the docs are easily available.

-- Robert

From: //o//annabee on
On Tue, 15 Jan 2008 20:37:26 +0100, Robert Redelmeier
<redelm(a)ev1.net.invalid> wrote:

> Wolfgang Kern <nowhere(a)never.at> wrote in part:
>>>>>> PUSH ecx
>>>>>> L3:MOV eax,edi ;1 ;keep the line start
>>>>>> REP MOVSD ;15+ecx*4/3 = 357
>>>>>> ADD eax,edx ;1 ;add scan line size
>>>>>> DEC ebx ;1
>>>>>> MOV ecx,[esp] ;3
>>>>>> MOV edi,eax ;1
>>>>>> JNZ L3< ;1 ;total (357+8)*256 = 93440
>>>>>> POP ecx
>>
>>
>> So why did RDTSC report ~2'000'000 instead of ~250'000, which is
>> about eight times more?
>> The answer might be that the BUS-speed is limited to 66 MHz
>> while the CPU runs on 500MHZ... and 500/66 = 7.57 = near 8.
>
> Perhaps, but if ESI is pointing at VRAM,
> you are asking REP MOVSD to _read_ it.

No, it's pointing at a "KESYS bitmap". I must assume that's RAM.

"
>> eax= 0 colour mask
>> esi= source ;btw: KESYS.bitmaps aren't stored upside down!
>> AND [vflag],0f0 ;clear all options
"


> Reading VRAM is _KNOWN_ to be extremely slow. Writing
> VRAM is bad enough, but reading may wait for retrace.

>
> -- Robert

From: //o//annabee on
On Tue, 15 Jan 2008 21:02:12 +0100, //\\o//\\annabee <w(a)www.akow> wrote:

> On Tue, 15 Jan 2008 20:37:26 +0100, Robert Redelmeier
> <redelm(a)ev1.net.invalid> wrote:
>
>> Wolfgang Kern <nowhere(a)never.at> wrote in part:
>>>>>>> PUSH ecx
>>>>>>> L3:MOV eax,edi ;1 ;keep the line start
>>>>>>> REP MOVSD ;15+ecx*4/3 = 357
>>>>>>> ADD eax,edx ;1 ;add scan line size
>>>>>>> DEC ebx ;1
>>>>>>> MOV ecx,[esp] ;3
>>>>>>> MOV edi,eax ;1
>>>>>>> JNZ L3< ;1 ;total (357+8)*256 = 93440
>>>>>>> POP ecx
>>>
>>>
>>> So why did RDTSC report ~2'000'000 instead of ~250'000, which is
>>> about eight times more?
>>> The answer might be that the BUS-speed is limited to 66 MHz
>>> while the CPU runs on 500MHZ... and 500/66 = 7.57 = near 8.
>>
>> Perhaps, but if ESI is pointing at VRAM,
>> you are asking REP MOVSD to _read_ it.
>
> No, it's pointing at a "KESYS bitmap". I must assume that's RAM.

Whoops. Could it be a bitmap stored on-chip?

>
> "
>>> eax= 0 colour mask
>>> esi= source ;btw: KESYS.bitmaps aren't stored upside down!
>>> AND [vflag],0f0 ;clear all options
> "
>
>
>> Reading VRAM is _KNOWN_ to be extremely slow. Writing
>> VRAM is bad enough, but reading may wait for retrace.
>
>>
>> -- Robert
