|
From: James Van Buskirk on 6 Apr 2008 20:46 "Terence" <tbwright(a)cantv.net> wrote in message news:022f6f0d-6e84-492b-845b-d7b42cb3d87a(a)s13g2000prd.googlegroups.com... >I think popcount is just a bit count of '1's in a string of bits. Correct. > The standard way in market research, which uses this a lot, is to > generate a table of bit counts for all possible 8-bit words (or use a > pre-calculated one). Better is use an assembler fucntion. Well, the lookup table method is quite slow. I recalled that AMD provided algorithms for efficient POPCNT. In their phenom docs, http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf they just recommend the POPCNT instruction in section 8.8. So we have to go back in time to http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF and we find a nifty divide and conquer algorithm in section 8.6. For bigger bit strings using a PSADBW instruction could clean up a set of 8-bit sums quickly. There is another trick that involves using m n-bit registers (actually 2*m-1) as n m-bit adders and knocks out large bit arrays ridiculously rapidly, but let's not get into that today. -- write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, & 6.0134700243160014d-154/),(/'x'/)); end
From: Dan on 6 Apr 2008 21:18 On Apr 6, 8:22 pm, Walter Spector <w6ws_xthiso...(a)earthlink.net> wrote: > All it does is the same as the Fortran-90 PACK intrinsic. Just use > the intrinsic. If it isn't fast enough, complain to your compiler > vendor Thank you - I will check this out immediately. > And it won't work on a Cray-1. It uses the compressed index and hardware > gather instructions. Which means mid-life X-MP, onwards. True. The code posted was second generation, after the lab replaced all the Cray-1s with 4-processor X and later 8-processor Ys. The first generation was from a lab-specific operating system and a lab-specific compiler and math library. The later machines used UNICOS and (usually) standard math libraries. The was much wailing and rending of garments when that happened. > The Cray compilers were eventually able to vectorize this idiom > automatically. The CAL routine was unnecessary, and probably slower > than letting the compiler generate inline code. I'm not sure. The first generation CAL was much faster than the fortran subroutine. However, as you noted, that was on the Cray-1. We just carried over onto the later Crays and I certainly never checked. Certainly the fortran compiler on the Y did a much better job (judged by instrumenting the code) than the 1 and X. However, since the routine did not use both "sides" of the vector array (you probably remember the V arrays were split between processors) management reasonably required us to stick to using the 1 or the X. Again, thanks very much for your comments.
From: glen herrmannsfeldt on 6 Apr 2008 21:24 Walter Spector wrote: (snip) > All it does is the same as the Fortran-90 PACK intrinsic. Just use > the intrinsic. If it isn't fast enough, complain to your compiler > vendor. > And it won't work on a Cray-1. It uses the compressed index and hardware > gather instructions. Which means mid-life X-MP, onwards. I had thought he meant gather (pack), also, but looking at the CRAY manual on bitsavers it does look more like merge. Without the vector registers on Cray machines I wouldn't expect the intrinsic to be a lot faster. (That is, I would guess less than twice as fast.) Remember Amdahl's law. -- glen
From: glen herrmannsfeldt on 7 Apr 2008 05:28 Dan wrote: > Yes, raw speed is needed. That's why we went from fortran to assembler > in the first place. Since the code was written, the amount of data has > increased by three, perhaps four, orders of magnitude. > I just looked over the standard's definition of integer, and it > implies all integers are signed. Is that true? If there are unsigned > integers, then there should be no problem. They are signed, but you can use them with the bit manipulation instructions, anyway. Funny things might happen on ones complement machines, but that should be rare. I offer two population count methods in Fortran that should be fairly fast. The second seems to generate lots of moves to/from memory on g95 without optimization, but keeps intermediates in registers with -O2. For g95 on IA32 the -O2 case it takes 29 instructions for the second method, straight line code with no branches or intermediate stores. For a vector population count it might be possible to optimize it a little more to overlap more of the additions. That might help on a machine with more registers than IA32 to keep intermediate values. do i=1,100000000 j=i k=0 do while(j.ne.0) j=iand(j,j-1) k=k+1 end do l1=iand(i,z'55555555')+ishft(iand(i,z'aaaaaaaa'),-1) l2=iand(l1,z'33333333')+ishft(iand(l1,z'cccccccc'),-2) l4=iand(l2,z'0f0f0f0f')+ishft(iand(l2,z'f0f0f0f0'),-4) l8=iand(l4,z'0f0f0f0f')+ishft(iand(l4,z'f0f0f0f0'),-4) l16=iand(l8,z'00ff00ff')+ishft(iand(l8,z'ff00ff00'),-8) l32=iand(l16,z'0000ffff')+ishft(iand(l16,z'ffff0000'),-16) m=l32 if(k.ne.m) write(*,*) i,k,m enddo end -- glen
From: Dan on 7 Apr 2008 08:18
Glen noted: > > Yes, including the ambiguity codes for all 2**4 combinations > including or excluding each base. Usually, you won't do a > direct comparison of ambiguity codes, though. But you probably > know that. The full code, called d-squared (no, not my initials, "distance squared"), is a sequence comparison algorithm rather than a sequence alignment algorithm. Without adding context by segmenting and windowing the library of sequences, all it tells you is that there is some region in sequence A whose composition (possibly with inversions, deletions, and unknown base codes via the full IUPAC set) matches sequence B within a certain tolerance. Only in extremely rare, or completely meaningless cases can the two sequences be aligned. It's a very fast way of finding Alu and L7 sequences in the human genome (only 1.8 million copies per cell!) which have many internal insertions, deletions, and inversions. Some newer and faster sequencing technologies produce an enormous amount of cruft initially, so this code is getting reworked. dan |