From: windenntw on 4 May 2008 16:09

On May 2, 12:06 am, Terence <tbwri...(a)cantv.net> wrote:
> On Apr 28, 5:07 pm, "Skybuck Flying" <BloodySh...(a)hotmail.com> wrote:
>
> > Idea was to:
> >
> > Take 1 bit from the first pixel.
> > Take 1 bit from the second pixel.
> > Take 1 bit from the third pixel.
> > Take 1 bit from the fourth pixel.
> >
> > Etc.
> >
> > Up to 32 bits.
> >
> > 32 bits fit in a longword.
> >
> > Then write the longword to the output.
> >
> > Repeat until all pixels processed.
> >
> > Then repeat again for the next bit position.
> >
> > So starts with first bit of each pixel.
> >
> > Then repeat the loop again for the second bit of each pixel.
> >
> > (Repeat this loop for all 24 bit positions).
> >
> > This way finally all bit planes would be extracted.
> >
> > Hope this clarifies the idea a bit ;) :)
> >
> > Bye,
> > Skybuck.
>
> I know I came in late.
> The above is a description of a bitwise transpose of a matrix from row
> order to column order.
> I faced this problem in robotic vision, extracting colour plane data
> from camera input buffers.
> One solution was old ferrite core memories and both-ways access.
> I suspect there will be on-chip operations in graphics cards for this
> now; and a LOT faster.
> The interesting thing is that this is also a problem in market research
> data processing.

High-speed bit transposing is a solved problem: do a search for c2p or chunky to planar.

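A minimal sketch of the loop described in the quoted post, written in Python for readability rather than speed. The function name `extract_bit_planes` and its parameters are hypothetical, and the bit order inside each output word (pixel i goes to bit i, least significant bit first) is an assumption; the post does not say which end of the longword is filled first.

```python
def extract_bit_planes(pixels, bits_per_pixel=24, word_bits=32):
    """Return one list of packed words per bit position (one list per bit plane).

    pixels: sequence of non-negative integers, one per pixel.
    Assumption: pixel i of each group lands in bit i of the output word.
    """
    planes = []
    for bit in range(bits_per_pixel):              # one full pass per bit position
        words = []
        for start in range(0, len(pixels), word_bits):
            word = 0
            group = pixels[start:start + word_bits]
            for i, pixel in enumerate(group):
                # Take one bit from this pixel and place it in the word.
                word |= ((pixel >> bit) & 1) << i
            words.append(word)                     # "write the longword to the output"
        planes.append(words)
    return planes


if __name__ == "__main__":
    # Four 2-bit pixels: plane 0 collects the low bits, plane 1 the high bits.
    print(extract_bit_planes([0b01, 0b11, 0b10, 0b00], bits_per_pixel=2))
```

This makes one pass over the whole picture per bit position, so a 24-bit image is read 24 times. That repeated traversal is the cost the later replies about C2P and cache blocking are trying to reduce.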
From: Skybuck Flying on 4 May 2008 16:54

<windenntw(a)gmail.com> wrote in message news:2f4fe1c1-c069-4bc1-9427-335817d7a36f(a)c58g2000hsc.googlegroups.com...
> high-speed bit-transposing is a solved problem: do a search for c2p or
> chunky to planar

Thanks, more (ancient/Amiga) material to study =D

I love it =D LOL.

Could come in handy ;)

Bye,
Skybuck.

From: Nils on 4 May 2008 17:11

That C2P stuff is well worth understanding.

However, it's tuned for CPU cycles, not for performance on machines with caches. What you have to do is split your working set of data into portions that fit nicely into the data cache. Check your algorithm's target platform and try to do your processing in blocks of 64x64 pixels or so.

You will see a huge speed improvement simply due to better cache usage.

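The blocking suggested above can be sketched as a tile-by-tile traversal. The helper names `iter_tiles` and `process_in_tiles` are hypothetical, and the image is assumed to be a flat, row-major list. Python itself will not show the cache effect; the sketch only illustrates the visiting order that a compiled implementation would use so that each block stays in the data cache while it is being worked on.

```python
def iter_tiles(width, height, tile=64):
    """Yield (x0, y0, x1, y1) rectangles that cover a width x height image."""
    for y0 in range(0, height, tile):
        for x0 in range(0, width, tile):
            # Clamp the last tile in each row/column to the image edge.
            yield x0, y0, min(x0 + tile, width), min(y0 + tile, height)


def process_in_tiles(image, width, height, work, tile=64):
    """Apply work(pixel) to every pixel, visiting the image one tile at a time.

    image: flat row-major list of pixels (an assumption of this sketch).
    """
    out = [None] * (width * height)
    for x0, y0, x1, y1 in iter_tiles(width, height, tile):
        for y in range(y0, y1):
            row = y * width
            for x in range(x0, x1):
                out[row + x] = work(image[row + x])
    return out


if __name__ == "__main__":
    # A 130x64 image splits into two full tiles and one narrow edge tile.
    print(list(iter_tiles(130, 64)))
```

The 64x64 default follows the figure given in the post; the right size depends on the target machine's cache and on how many bytes each pixel and each output word occupy.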
From: Skybuck Flying on 4 May 2008 19:40

"Nils" <n.pipenbrinck(a)cubic.org> wrote in message news:686n42F2q3salU1(a)mid.uni-berlin.de...
>
> That C2P stuff is well worth understanding.
>
> However, it's tuned for CPU cycles, not performance on machines with
> caches. What you have to do is to split your working-set of data into a
> portion that fits nicely into the data-cache. Check your algorithm target
> platform and try to do your processing in blocks of 64x64 pixels or so.
>
> You will see an huge speed improvment simply due to better cache usage.

I am starting to have doubts about the claim that C2P solves the problem.

In fact it probably doesn't really solve the problem fully.

As you mention, it just turns a block of pixels in r,g,b format into bit plane format, but only for that block of pixels.

So:

R0R1R2R3R4R5R6R7, G0G1G2G3G4G5G6G7, B0B1B2B3B4B5B6B7, A0A1A2A3A4A5A6A7
R0R1R2R3R4R5R6R7, G0G1G2G3G4G5G6G7, B0B1B2B3B4B5B6B7, A0A1A2A3A4A5A6A7
R0R1R2R3R4R5R6R7, G0G1G2G3G4G5G6G7, B0B1B2B3B4B5B6B7, A0A1A2A3A4A5A6A7
R0R1R2R3R4R5R6R7, G0G1G2G3G4G5G6G7, B0B1B2B3B4B5B6B7, A0A1A2A3A4A5A6A7

Will be transformed to:

Memory Address X:
R0R0R0R0 G0G0G0G0 B0B0B0B0 A0A0A0A0
R1R1R1R1 G1G1G1G1 B1B1B1B1 A1A1A1A1
etc

However, the next pixel block will end up here:

Memory Address Y:
R0R0R0R0 G0G0G0G0 B0B0B0B0 A0A0A0A0
R1R1R1R1 G1G1G1G1 B1B1B1B1 A1A1A1A1

This means there is still a large gap to overcome between R0 of pixel block 0 and R0 of pixel block 1.

Gap = Memory Y - Memory X.

Thus, if I am correct, C2P does not solve the problem fully because it simply doesn't return a true bit plane.

A true bit plane is a plane which has all the bits in order/sequential.

In other words, C2P returns a fragmented/chunky bit plane.

Which is not really that desirable since it complicates things further.

The whole idea was to get a nice sequential bit plane to make processing it easy, which now might even become harder ?!? ;) :)

Anyway, the tutorials do not mention this shortcoming in the C2P...

The tutorials use a nice square-looking input and output, so it's misleading at best. It doesn't show the fragmentation/gaps that will occur, because it just shows one pixel block.

Had it shown multiple pixel blocks, the gaps would have become apparent ;)

If you think about it, it makes perfect sense.

It's simply impossible to transform 64x64 pixels into 64x64 pixels with the same memory addresses.

The bit planes don't belong there... they must be moved elsewhere to their true positions.

Conclusion:

It's simply impossible to prevent random memory access.

Final conclusion:

C2P might help by first converting to fragmented/chunky bit planes.

Then 8 bits can be read in sequence with gaps and be moved to their final destinations, which could still give some speed-ups ;)

Bye,
Skybuck.

From: Skybuck Flying on 4 May 2008 19:55

Let's clarify the (linear) memory layouts:

RGB Picture:

RGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGB

(R,G,B, R,G,B, R,G,B interleaved)

Now, if I am not mistaken, a C2P picture will look like:

R0group, G0group, B0group, R1group, G1group, B1group, R2group, G2group, B2group, etc

then

R0group, G0group, B0group, R1group, etc

(Bit group interleaved)

The group is the block width/height that was processed at once.

So the picture will end up being bit group interleaved.

Not really desirable for (easy) compression processing.

What I wanted was:

All R0, All R1, All R2, All R3, All R4, All R5, All R6, All R7, All G0, All G1, All G2, All G3, etc. End of picture.

(True sequential bit planes/one-bit pictures)

So there is definitely a difference.

Bye,
Skybuck.
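
The difference between the two layouts can be sketched as two index formulas. This assumes, purely for illustration, that each processed block contributes exactly one output word per bit plane; the names `interleaved_index`, `planar_index` and `to_true_planes` are hypothetical and do not come from any C2P tutorial.

```python
def interleaved_index(block, plane, num_planes):
    """Word position in the bit-group-interleaved layout (all planes of block 0, then block 1, ...)."""
    return block * num_planes + plane


def planar_index(block, plane, num_blocks):
    """Word position in the true planar layout (all of plane 0, then all of plane 1, ...)."""
    return plane * num_blocks + block


def to_true_planes(words, num_planes):
    """Reorder bit-group-interleaved words into sequential bit planes."""
    num_blocks = len(words) // num_planes
    out = [None] * len(words)
    for block in range(num_blocks):
        for plane in range(num_planes):
            src = interleaved_index(block, plane, num_planes)
            dst = planar_index(block, plane, num_blocks)
            out[dst] = words[src]
    return out


if __name__ == "__main__":
    # Two blocks, two planes: labels show which block and plane each word came from.
    print(to_true_planes(["b0p0", "b0p1", "b1p0", "b1p1"], num_planes=2))
```

Under this assumption, the same plane of two consecutive blocks sits `num_planes` words apart in the interleaved layout and one word apart in the planar layout. A block-based converter could also write each word directly to `planar_index(...)` instead of reordering afterwards, at the cost of scattered writes.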