From: Wolfgang Kern on 16 Jan 2008 06:26

Wannabee 'in discussion':

:)
.....
>> Woops. Could it be a bitmap stored onchip?
> no.

Yes it could, I have my mouse-images, the mouse-background
icons and the zoom stuff in the inactive VRAM trail.
__
wolfgang

From: Wolfgang Kern on 16 Jan 2008 06:48

Dirk Wolfgang Glomp posted:

>> you'll need to scan the VESA info and find/set a linear mode,
....

Yes, this code creates the required list and finds the VESA mode number
for the desired resolution if it's supported.
I do something similar before entering PM, except that I compare against
my 'supported modes list' and keep only matching modes in the final list.
__
wolfgang

>     push  ds
>     pop   es
>     mov   di, OFFSET VINF        ; get VESA info (buffer to store VESA info)
>     mov   ax, 4F00h              ; es:di 256 byte
>     int   10h
>     cmp   ax, 4Fh
>     jnz   ERROR
>     cmp   BYTE PTR[di+5], 2      ; major version number of VESA < 2?
>     jb    ERROR
>     lfs   si, [di+0Eh]           ; copy all VESA modes (pointer to mode list)
>     mov   bx, OFFSET MODI        ; buffer to store VESA modes
> P1: mov   ax, fs:[si]            ; get the mode number
>     add   si, 2
>     mov   [bx], ax               ; store the mode number
>     add   bx, 2
>     cmp   ax, 0FFFFh             ; end of mode list?
>     jnz   P1
>     mov   si, OFFSET MODI        ; scan for specific mode (with 32 bit)
> P2: mov   cx, [si]               ; get mode
>     add   si, +2
>     cmp   cx, 0FFFFh             ; end of mode list?
>     jz    ERROR
>     mov   ax, 4F01h              ; get mode info
>     add   cx, 4000h              ; mode + linear access
>     mov   bp, cx                 ; store mode number
>     int   10h                    ; es:di 256 byte
>     cmp   ax, 4Fh
>     jnz   ERROR
>     cmp   WORD PTR[di+12h], MaxX
>     jnz   P2
>     cmp   WORD PTR[di+14h], MaxY
>     jnz   P2
>     cmp   BYTE PTR[di+19h], 20h  ; 32 bits per pixel?
>     jnz   P2
>     and   BYTE PTR[di], 80h      ; linear access enabled?
>     jz    ERROR
>     cmp   DWORD PTR[di+28h], 0   ; linear offset exists?
>     jz    ERROR
> ...
> -------------------------------------
> ATI 9800 PRO VESA Modi(128MB)
> -------------------------------------
> 4 8 15 16 24 32 Matrix
> -------------------------------------
> 109 132 x 25
> 10A 132 x 43
> 130 132 x 44
> 182 10D 10E 10F 120 320 x 200
> 192 193 194 195 196 320 x 240
> 1A2 1A3 1A4 1A5 1A6 400 x 300
> 1B2 1B3 1B4 1B5 1B6 512 x 384
> 1C2 1C3 1C4 1C5 1C6 640 x 350
> 100 183 184 185 186 640 x 400
> 101 110 111 112 121 640 x 480
> 102 103 113 114 115 122 800 x 600
> 104 105 106 107 108 123 1024 x 768
> 107 119 11A 11B 124 1280 x 1024
> -------------------------------------
> NVIDIA GF 4 Ti4200 VESA Modi(64MB)
> -------------------------------------
> 4 8 15 16 24 32 Matrix
> -------------------------------------
> 108 80 x 60
> 109 132 x 25
> 10A 132 x 43
> 10B 132 x 50
> 10C 132 x 60
> 130 10E 10F 320 x 200
> 134 135 136 320 x 240
> 131 132 133 320 x 400
> 100 13D 13E 640 x 400
> 101 111 112 640 x 480
> 102 103 114 115 800 x 600
> 104 105 117 118 1024 x 768
> 106 107 11A 1280 x 1024
> 147 148 1400 x 1050
> 145 146 1600 x 1200
> -------------------------------------
>
> Dirk

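The selection logic in the quoted routine (copy the mode list up to 0FFFFh, then test each mode-info block at offsets 12h, 14h, 19h, 0 and 28h) can be sketched outside real mode as follows. This is only an illustration in Python on synthetic buffers: no BIOS is called, `make_info`, `FAKE_CARD` and `RAW_LIST` are hypothetical helpers and test data, and the mode numbers in the fake table are assumptions chosen for the demo. Unlike the assembly, which jumps to ERROR when a matching mode lacks linear access, this sketch simply skips such a mode.

```python
import struct

END_OF_LIST = 0xFFFF   # terminator word of the mode list
LINEAR_FLAG = 0x4000   # added to the mode number to request linear access


def read_mode_list(raw):
    """Collect 16-bit mode numbers up to the 0FFFFh terminator (like loop P1)."""
    modes = []
    for (mode,) in struct.iter_unpack("<H", raw):
        if mode == END_OF_LIST:
            break
        modes.append(mode)
    return modes


def mode_matches(info, width, height, bpp):
    """Apply the field checks of loop P2 to one 256-byte mode-info block."""
    (attributes,) = struct.unpack_from("<H", info, 0x00)
    x_res, y_res = struct.unpack_from("<HH", info, 0x12)
    bits_per_pixel = info[0x19]
    (linear_address,) = struct.unpack_from("<I", info, 0x28)
    return (x_res == width and y_res == height and bits_per_pixel == bpp
            and (attributes & 0x80) != 0 and linear_address != 0)


def find_mode(modes, get_info, width, height, bpp):
    """Return the first matching mode number with the linear flag set, or None."""
    for mode in modes:
        if mode_matches(get_info(mode), width, height, bpp):
            return mode | LINEAR_FLAG
    return None


def make_info(width, height, bpp, linear_address=0xE0000000, attributes=0x009B):
    """Hypothetical helper: build a synthetic mode-info block for the demo."""
    block = bytearray(256)
    struct.pack_into("<H", block, 0x00, attributes)
    struct.pack_into("<HH", block, 0x12, width, height)
    block[0x19] = bpp
    struct.pack_into("<I", block, 0x28, linear_address)
    return bytes(block)


# Synthetic stand-in for a card; not read from any real hardware.
FAKE_CARD = {
    0x101: make_info(640, 480, 8),
    0x121: make_info(640, 480, 32),
    0x122: make_info(800, 600, 32),
    0x123: make_info(1024, 768, 32),
}
RAW_LIST = struct.pack("<5H", 0x101, 0x121, 0x122, 0x123, END_OF_LIST)

modes = read_mode_list(RAW_LIST)
chosen = find_mode(modes, FAKE_CARD.__getitem__, 1024, 768, 32)
print([hex(m) for m in modes], hex(chosen) if chosen is not None else None)
```

Passing the lookup as `get_info` keeps the field checks separate from however the blocks are obtained, which is the part that needs int 10h on real hardware.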
From: Wolfgang Kern on 16 Jan 2008 08:26

Wannabee wrote:

[...about timing]
....
>>>> So what you ask me here would be a tiny OS on top of DOS.
>>> For next "x-mas" then. :))
>> Perhaps I do it earlier.
> I hope so, because I would really like to see it running and
> verify the code.

What are you searching for? The code is still in this post.

[SMC]
>> It saves me from having several variants of similar code
>> beside that 'mov reg,imm' is faster than 'mov reg,[mem]'
> With a bit of work, it may sometimes be a good idea in fact.
> (I assume)
> I got an idea (not tested yet) for usermode code.
> Since I am allowed to rewrite the stack in that case,
> maybe there could be opportunities for pairing the data and the stack?
>
> _________________________
> Data.Stack 32 DWORDS
> Data (plenty)
> _________________________
>
> Prefetch b$Data.Stack
> mov D$OldStack esp
> mov esp Data.Stack
> call DoSomeThingWithData //and make use of the local local stack
> mov esp D$OldStack
> Do you think this could sometimes help speed up code?
> Or is this redundant on new hardware?

I don't understand the reason for this.
Stack is just memory, even with a better chance to reside cached
most of the time because it's frequently used (or even abused)
within a certain range. So excessive stack usage will end up
with the same delaying cache issues as 'normal' memory.
Prefetch may help on data used more often in a following loop,
not much more than to suffer on the cache-miss earlier ;)

>>>>>> ;draw it:
>>>>>> PUSH ecx
>>>>>> L3:MOV eax,edi ;1 ;keep the line start
>>>>>> REP MOVSD ;15+ecx*4/3 = 357
>>>>>> ADD eax,edx ;1 ;add scan line size
>>>>>> DEC ebx ;1
>>>>>> MOV ecx,[esp] ;3
>>>>>> MOV edi,eax ;1
>>>>>> JNZ L3< ;1 ;total (357+8)*256 = 93440
>>>>>> POP ecx
>> All the main time is consumed by the above loop and
>> you can calculate the theoretical timing and compare it to the
>> actual reported timing to figure out how slow the hardware is.
> I thought it easier to just compare the DWORD write (0-1-2 cycle?)
> to your 32 cycles per dot number. (not sure here...)
>> Let's say this code show only ~100000 CPU cycles
>> now we add all the cache penalties [~35/64 byte on 500/66MHz K7]
>> (only for RD here, I have VRAM non-cachable on my PCs)
> What function has that VRAM-caching? I thought it just for bios variables?

Also possible to cache the video-RAM (BIOS-setup option),
it helped a bit on RD-modify-WR cycles (like XOR) on old VGA-cards.

>> and this 256KB are 4096 cache lines, so we add 143360 and get
>> about 250000 CPU cycles for the whole story.
> You are multiplying 35cycles penalty for each 64 byte, with
> the number of cache lines?

Yes.

>> So why RDTSC reported ~2'000'000 instead of ~250'000, this is
>> about eight times more ?
>> The answer might be that the BUS-speed is limited to 66 MHz
>> while the CPU runs on 500MHZ... and 500/66 = 7.57 = near 8.
> 1000 copies a second is 256*256 dots /s = (65536 dots) * 1000 per sec.
> that is 65_536_000 dots / sec. or 65 mega pixels.
> For 32 cycles per pixel this gives a total of 2_097_152_000 or two Giga
> cycles Which is impossible on a machine like that.

Something in your calculator seems to be stuck ...? :)
I never said it was counted over 1000 copies. I reported ~2e6 cycles
which turns out to be 3..4 times slower than your 'supported' variant.
It makes me sick to see it, but I cannot write my own hw-support
for every new graphic card (like I had in the past for a few cards,
which turned out to be three times faster than DirectX), perhaps
you totally misunderstood and that's why you insist on more code?

>> Pushing on my ego? eh?
> No. This is the very _beginning_ of comparing notes.
> An absolute rule that all coders would know and appreciate.
> It's the _first_ rule.

OK, let's follow another absolute rule right now:
* give windoze the same vendor support as I've got ['goanix','nada']
  (I have only the information found in the VBIOS aka VESA)
and then let's compare again:
* set screen mode to 1024*768,32bit-colour
* draw any picture or pattern (not just equal dots)
* draw it to an unaligned position ie: x=15 y=13

Oh, I see, the three latter above won't work, because windoze defaults
to the old VGA-mode: 1024*768,4bit-colour and cannot set any other
mode beside VGA 03 (blue screen of death) without a vendor's driver
or any other third party support.

>>> You could even do it in windows.
>> Sure, if you can tell me how to get direct access to the VRAM.
> Write a driver, get ring 0 access, and take over the OS.

If this would be that easy or possible, we would have heard already.
__
wolfgang

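The cycle estimate discussed in the post above can be reproduced step by step. The following is a minimal sketch of that arithmetic only; all numbers come from the thread (the per-instruction counts in the loop comments, ~35 cycles per 64-byte cache line, a 500 MHz CPU on a 66 MHz bus), while the constant and function names are made up for the illustration. Rounding the REP MOVSD term up is an assumption that happens to reproduce the 357 in the loop comment.

```python
import math

WIDTH = HEIGHT = 256        # bitmap size in pixels
BYTES_PER_PIXEL = 4         # 32-bit colour
CACHE_LINE = 64             # bytes per cache line
MISS_PENALTY = 35           # cycles per cache line, figure quoted in the thread
LOOP_OVERHEAD = 8           # 1+1+1+3+1+1 cycles for the non-REP instructions
CPU_MHZ, BUS_MHZ = 500, 66


def rep_movsd_cycles(dwords):
    """15 + ecx*4/3 from the loop comment, rounded up to whole cycles."""
    return 15 + math.ceil(dwords * 4 / 3)


line_cycles = rep_movsd_cycles(WIDTH)                        # one scan line
loop_cycles = (line_cycles + LOOP_OVERHEAD) * HEIGHT         # whole bitmap
cache_lines = WIDTH * HEIGHT * BYTES_PER_PIXEL // CACHE_LINE
penalty_cycles = cache_lines * MISS_PENALTY
cpu_estimate = loop_cycles + penalty_cycles                  # "about 250000"
bus_scaled = cpu_estimate * CPU_MHZ / BUS_MHZ                # CPU/bus ratio applied

print(line_cycles, loop_cycles, cache_lines, penalty_cycles)
print(cpu_estimate, round(bus_scaled))
```

Scaling by the CPU-to-bus clock ratio lands in the same range as the ~2e6 cycles reported via RDTSC, which is the point the post is making; it is an explanation offered in the thread, not a measurement.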
From: //o//annabee on 16 Jan 2008 11:45

On Wed, 16 Jan 2008 12:26:07 +0100, Wolfgang Kern <nowhere(a)never.at> wrote:

> Wannabee 'in discussion':
>
> :)
> ....
>>> Woops. Could it be a bitmap stored onchip?
>
>> no.
>
> Yes it could, I have my mouse-images, the mouse-background
> icons and the zoom stuff in the inactive VRAM trail.

Not with that code you don't.

> __
> wolfgang

From: //o//annabee on 16 Jan 2008 13:48

On Wed, 16 Jan 2008 14:26:12 +0100, Wolfgang Kern <nowhere(a)never.at> wrote:

> Wannabee wrote:
>
> [...about timing]
> ...
>>>>> So what you ask me here would be a tiny OS on top of DOS.
>>>> For next "x-mas" then. :))
>>> Perhaps I do it earlier.
>> I hope so, because I would really like to see it running and
>> verify the code.
>
> What are you searching for? the code is still in this post.

What is in this post is all but the code that needs to be verified.
(nothing novel about this code, it has been available to cut and paste
from the internet for 20 years)

> [SMC]

Does not explain VESA or 32 bit flat setup.
I would have clipped the code and pasted it directly into the post.
And explained every part in detail. The full code for 32 bit mode setup,
from DOS, with detailed explanation. Then the VESA setup, and the code for
generating the bitmap and basically all the needed code (should not need
to be that much). Then I would comment it heavily, like you did well with
the Bitmap code. Then it would be useful. The code posted is nice but not
useful in order to verify the numbers. (not in any sense of the meaning,
at all) Then I would add a link to the binary.

But I guess words are easier....

>>> It saves me from having several variants of similar code
>>> beside that 'mov reg,imm' is faster than 'mov reg,[mem]'
>
>> With a bit of work, it may sometimes be a good idea in fact.
>> (I assume)
>
>> I got an idea (not tested yet) for usermode code.
>> Since I am allowed to rewrite the stack in that case,
>> maybe there could be opportunities for pairing the data and the stack?
>>
>> _________________________
>> Data.Stack 32 DWORDS
>> Data (plenty)
>> _________________________
>>
>> Prefetch b$Data.Stack
>> mov D$OldStack esp
>> mov esp Data.Stack
>> call DoSomeThingWithData //and make use of the local local stack
>> mov esp D$OldStack
>
>> Do you think this could sometimes help speed up code?
>> Or is this redundant on new hardware?
>
> I don't understand the reason of this.

It was just a wild idea. As you say, it makes no difference. (AMD64)
(and for short sequences it is slower)

> Stack is just memory, even with a better chance to reside cached
> most of the time because it's frequently used (or even abused)
> within a certain range. So excessive stack usage will end up
> with the same delaying cache issues like 'normal' memory.
> Prefetch may help on data used more often in a following loop,
> not much more than to suffer on the cache-miss earlier ;)

True. Also this code above won't work due to the way the stack works.

>>>>>>> ;draw it:
>>>>>>> PUSH ecx
>>>>>>> L3:MOV eax,edi ;1 ;keep the line start
>>>>>>> REP MOVSD ;15+ecx*4/3 = 357
>>>>>>> ADD eax,edx ;1 ;add scan line size
>>>>>>> DEC ebx ;1
>>>>>>> MOV ecx,[esp] ;3
>>>>>>> MOV edi,eax ;1
>>>>>>> JNZ L3< ;1 ;total (357+8)*256 = 93440
>>>>>>> POP ecx
>
>>> All the main time is consumed by the above loop and
>>> you can calculate the theoretical timing and compare it to the
>>> actual reported timing to figure out how slow the hardware is.
>
>> I thought it easier to just compare the DWORD write (0-1-2 cycle?)
>> to your 32 cycles per dot number. (not sure here...)
>
>>> Let's say this code show only ~100000 CPU cycles
>>> now we add all the cache penalties [~35/64 byte on 500/66MHz K7]
>>> (only for RD here, I have VRAM non-cachable on my PCs)
>
>> What function has that VRAM-caching? I thought it just for bios
>> variables?
>
> Also possible to cache the video-RAM (BIOS-setup option),
> it helped a bit on RD-modify-WR cycles (like XOR) on old VGA-cards.
>
>>> and this 256KB are 4096 cache lines, so we add 143360 and get
>>> about 250000 CPU cycles for the whole story.
>
>> You are multiplying 35cycles penalty for each 64 byte, with
>> the number of cache lines?
>
> Yes.
>
>>> So why RDTSC reported ~2'000'000 instead of ~250'000, this is
>>> about eight times more ?
>>> The answer might be that the BUS-speed is limited to 66 MHz
>>> while the CPU runs on 500MHZ... and 500/66 = 7.57 = near 8.
>
>> 1000 copies a second is 256*256 dots /s = (65536 dots) * 1000 per sec.
>> that is 65_536_000 dots / sec. or 65 mega pixels.
>> For 32 cycles per pixel this gives a total of 2_097_152_000 or two Giga
>> cycles Which is impossible on a machine like that.
>
> Something in your calculator seems to be stuck ...? :)

You said to me that the app I posted a link to printed 970 on the screen
when run in 1024*768*32? So the calculation is correct for that case.
970 _copies_ of a bitmap that is 256*256 pixels ~ roughly 1000 * (65536)
(dots = pixels).

If each pixel in your own code takes 33 cycles, then for my code to be
slower, it will have to use at least 33 cycles per dot. So when multiplied
out, 1000*65536*33 = 2_162_688_000 cycles _per second_ = your machine must
be able to process 2 giga cycles per second, or 2 GHz.

My own K7 gives only around 460 copies a second, and runs at 1543
megahertz currently. The machine you ran my code on is processing the
pixels at twice the efficiency for memory, gc, and CPU, while you claim
it is an antiquated piece of garbage from about the stone age.

(btw, the 1/4 dot was a joke - bitblt processes the entire bitmap per call,
65536 pixels)

> I never said it was counted over 1000 copies. I reported ~2e6 cycles
> which turns out to be 3..4 times slower than your 'supported' variant.

Yes. But my code did 970 copies a second. On the same machine. (You said)

> It makes me sick to see it, but I cannot write my own hw-support
> for every new graphic card (like I had in the past for a few cards,
> which turned out to be three times faster than DirectX), perhaps
> you totally misunderstood and that's why you insist on more code?

I misunderstand a lot of things. But I do know that in order to verify a
claim about the speed of some code, we need to see the two codes running,
on the same machine, and have the code so we can verify what it does.

>>> Pushing on my ego? eh?
>> No. This is the very _beginning_ of comparing notes.
>> An absolute rule that all coders would know and appreciate.
>> It's the _first_ rule.
>
> OK, let's follow another absolute rule right now:
> * give windoze the same vendor support as I've got ['goanix','nada']
>   (I have only the information found in the VBIOS aka VESA)
> and then let's compare again:
> * set screen mode to 1024*768,32bit-colour
> * draw any picture or pattern (not just equal dots)
> * draw it to an unaligned position ie: x=15 y=13

I have the deepest sympathy for that, and I think it is a crime. But it is
not related to the VESA claim. GDI doesn't use hardware acceleration.

> Oh, I see, the three latter above won't work, because windoze defaults
> to the old VGA-mode: 1024*768,4bit-colour and cannot set any other
> mode beside VGA 03 (blue screen of death) without a vendor's driver
> or any other third party support.
>
>>>> You could even do it in windows.
>>> Sure, if you can tell me how to get direct access to the VRAM.
>> Write a driver, get ring 0 access, and take over the OS.
>
> If this would be that easy or possible, we would have heard already.

At least explain why it's not possible.

> __
> wolfgang
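The two sets of figures traded in this exchange are in different units (cycles per copy versus copies per second), so a small conversion may help when comparing them. This is a rough sketch, assuming the 500 MHz clock from the quoted "500/66MHz K7" estimate applies to the machine that printed 970; whether both figures really come from the same machine and setup is exactly what the posters disagree about. All names are hypothetical.

```python
PIXELS_PER_COPY = 256 * 256          # 65536 dots per bitmap


def required_hz(copies_per_second, cycles_per_pixel):
    """Clock rate needed to sustain a copy rate at a given per-pixel cost."""
    return copies_per_second * PIXELS_PER_COPY * cycles_per_pixel


def cycles_per_copy(cpu_hz, copies_per_second):
    """Cycles available for each copy at a given clock and copy rate."""
    return cpu_hz / copies_per_second


CPU_HZ = 500_000_000                 # assumption: clock from the quoted estimate
VESA_CYCLES_PER_COPY = 2_000_000     # the ~2e6 cycles reported via RDTSC
GDI_COPIES_PER_SECOND = 970          # the number the linked program printed

gdi_cycles_per_copy = cycles_per_copy(CPU_HZ, GDI_COPIES_PER_SECOND)
vesa_cycles_per_pixel = VESA_CYCLES_PER_COPY / PIXELS_PER_COPY
gdi_cycles_per_pixel = gdi_cycles_per_copy / PIXELS_PER_COPY
slowdown = VESA_CYCLES_PER_COPY / gdi_cycles_per_copy
vesa_copies_per_second = CPU_HZ / VESA_CYCLES_PER_COPY

print(required_hz(1000, 33))                         # the 2 GHz figure above
print(round(vesa_cycles_per_pixel, 1), round(gdi_cycles_per_pixel, 1))
print(round(slowdown, 2), vesa_copies_per_second)
```

Under these assumptions, ~2e6 cycles per copy corresponds to roughly 250 copies a second rather than about 1000, which is consistent with the "3..4 times slower" statement; the 2 GHz figure only follows if the 33 cycles per pixel are applied to the 970 copies a second of the other program.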