From: James Van Buskirk on
"Terence" <tbwright(a)cantv.net> wrote in message
news:022f6f0d-6e84-492b-845b-d7b42cb3d87a(a)s13g2000prd.googlegroups.com...

>I think popcount is just a bit count of '1's in a string of bits.

Correct.

> The standard way in market research, which uses this a lot, is to
> generate a table of bit counts for all possible 8-bit words (or use a
> pre-calculated one). Better is use an assembler fucntion.

Well, the lookup table method is quite slow. I recalled that AMD
provided algorithms for efficient POPCNT. In their phenom docs,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf

they just recommend the POPCNT instruction in section 8.8. So we
have to go back in time to

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF

and we find a nifty divide and conquer algorithm in section 8.6.
For bigger bit strings using a PSADBW instruction could clean up
a set of 8-bit sums quickly. There is another trick that involves
using m n-bit registers (actually 2*m-1) as n m-bit adders and knocks
out large bit arrays ridiculously rapidly, but let's not get into
that today.

--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end


From: Dan on
On Apr 6, 8:22 pm, Walter Spector <w6ws_xthiso...(a)earthlink.net>
wrote:

> All it does is the same as the Fortran-90 PACK intrinsic. Just use
> the intrinsic. If it isn't fast enough, complain to your compiler
> vendor

Thank you - I will check this out immediately.


> And it won't work on a Cray-1. It uses the compressed index and hardware
> gather instructions. Which means mid-life X-MP, onwards.


True. The code posted was second generation, after the lab replaced
all the Cray-1s with 4-processor X and later 8-processor Ys. The first
generation was from a lab-specific operating system and a lab-specific
compiler and math library. The later machines used UNICOS and
(usually) standard math libraries. The was much wailing and rending of
garments when that happened.

> The Cray compilers were eventually able to vectorize this idiom
> automatically. The CAL routine was unnecessary, and probably slower
> than letting the compiler generate inline code.

I'm not sure. The first generation CAL was much faster than the
fortran subroutine. However, as you noted, that was on the Cray-1. We
just carried over onto the later Crays and I certainly never checked.
Certainly the fortran compiler on the Y did a much better job (judged
by instrumenting the code) than the 1 and X. However, since the
routine did not use both "sides" of the vector array (you probably
remember the V arrays were split between processors) management
reasonably required us to stick to using the 1 or the X.

Again, thanks very much for your comments.
From: glen herrmannsfeldt on
Walter Spector wrote:
(snip)

> All it does is the same as the Fortran-90 PACK intrinsic. Just use
> the intrinsic. If it isn't fast enough, complain to your compiler
> vendor.

> And it won't work on a Cray-1. It uses the compressed index and hardware
> gather instructions. Which means mid-life X-MP, onwards.

I had thought he meant gather (pack), also, but looking at the CRAY
manual on bitsavers it does look more like merge. Without the
vector registers on Cray machines I wouldn't expect the intrinsic
to be a lot faster. (That is, I would guess less than twice as fast.)

Remember Amdahl's law.

-- glen

From: glen herrmannsfeldt on
Dan wrote:

> Yes, raw speed is needed. That's why we went from fortran to assembler
> in the first place. Since the code was written, the amount of data has
> increased by three, perhaps four, orders of magnitude.

> I just looked over the standard's definition of integer, and it
> implies all integers are signed. Is that true? If there are unsigned
> integers, then there should be no problem.

They are signed, but you can use them with the bit manipulation
instructions, anyway. Funny things might happen on ones complement
machines, but that should be rare. I offer two population
count methods in Fortran that should be fairly fast.
The second seems to generate lots of moves to/from memory
on g95 without optimization, but keeps intermediates in
registers with -O2.

For g95 on IA32 the -O2 case it takes 29 instructions for the
second method, straight line code with no branches or
intermediate stores. For a vector population count it might
be possible to optimize it a little more to overlap more
of the additions. That might help on a machine with more
registers than IA32 to keep intermediate values.

do i=1,100000000
j=i
k=0
do while(j.ne.0)
j=iand(j,j-1)
k=k+1
end do
l1=iand(i,z'55555555')+ishft(iand(i,z'aaaaaaaa'),-1)
l2=iand(l1,z'33333333')+ishft(iand(l1,z'cccccccc'),-2)
l4=iand(l2,z'0f0f0f0f')+ishft(iand(l2,z'f0f0f0f0'),-4)
l8=iand(l4,z'0f0f0f0f')+ishft(iand(l4,z'f0f0f0f0'),-4)
l16=iand(l8,z'00ff00ff')+ishft(iand(l8,z'ff00ff00'),-8)
l32=iand(l16,z'0000ffff')+ishft(iand(l16,z'ffff0000'),-16)
m=l32
if(k.ne.m) write(*,*) i,k,m
enddo
end

-- glen

From: Dan on
Glen noted:
>
> Yes, including the ambiguity codes for all 2**4 combinations
> including or excluding each base. Usually, you won't do a
> direct comparison of ambiguity codes, though. But you probably
> know that.

The full code, called d-squared (no, not my initials, "distance
squared"), is a sequence comparison algorithm rather than a sequence
alignment algorithm. Without adding context by segmenting and
windowing the library of sequences, all it tells you is that there is
some region in sequence A whose composition (possibly with inversions,
deletions, and unknown base codes via the full IUPAC set) matches
sequence B within a certain tolerance. Only in extremely rare, or
completely meaningless cases can the two sequences be aligned.

It's a very fast way of finding Alu and L7 sequences in the human
genome (only 1.8 million copies per cell!) which have many internal
insertions, deletions, and inversions. Some newer and faster
sequencing technologies produce an enormous amount of cruft initially,
so this code is getting reworked.

dan