From: Dan on
Hi,

I have some legacy code which includes calls to the CAL2 vector mask
and vector population count instructions. Lacking a nearby Cray, does
anyone know of subroutines that implement these instructions?

I've searched comp.lang.fortran, sent email to Cray (no reply;
probably no one is left writing in CAL for speed), searched with
Google (all the hits were about Photoshop, which I did not understand),
and checked the '95, '03, and proposed '08 Fortran standards, but no luck.

The Intel manuals have MMX/SSE{1234} instructions which might work,
and AMD has the "something-now" which are apparently the same, but
I've not seen any demo code emulating the mask and popcount
instructions.

Any suggestions would be welcome.

TIA,
dan
From: Tim Prince on
Dan wrote:

>
> I have some legacy code which includes calls to the CAL2 vector mask
> and vector population count instructions. Lacking a nearby Cray, does
> anyone know of subroutines that implement these instructions?
>

Maybe it would help to show some of the code, if it isn't secret like much
usage of popcount.
Apparently, GMP supports popcount for those processors which implement it.

> The Intel manuals have MMX/SSE{1234} instructions which might work,
> and AMD has the "something-now" which are apparently the same, but
> I've not seen any demo code emulating the mask and popcount
> instructions.

SSE popcount instruction isn't available in currently released hardware,
but it will be along soon.

SSE auto-vectorization of conditional operations is often done with vector
masks. Fortran MERGE for SSE2 is vectorized with masks; the latest
additions to SSE include more direct merge support.
IAND and IEOR support masking operations. If your Fortran doesn't
auto-vectorize them in the context you require, there are SSE intrinsic
operators, supported by C (and possibly 1 or 2 Fortran) compilers.
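
As a rough illustration of the mask idea described above, here is a minimal scalar C sketch of a MERGE-style select built only from AND and XOR (the operations Fortran exposes as IAND and IEOR). The names mask_from_nonzero, merge_masked and merge_array are made up for this example; a vectorizing compiler or SSE intrinsics would apply the same pattern to several elements at once.

```c
#include <stdint.h>

/* Hypothetical helper: all-ones mask if x is nonzero, all-zeros otherwise. */
uint32_t mask_from_nonzero(uint32_t x)
{
    return (uint32_t)0 - (uint32_t)(x != 0);
}

/* Branchless counterpart of Fortran MERGE(tsource, fsource, mask):
   takes tsource bits where mask bits are 1, fsource bits where they are 0. */
uint32_t merge_masked(uint32_t tsource, uint32_t fsource, uint32_t mask)
{
    return fsource ^ ((tsource ^ fsource) & mask);
}

/* Element-by-element select driven by a condition array. A simple loop
   of this shape is a candidate for masked auto-vectorization. */
void merge_array(uint32_t *out, const uint32_t *t, const uint32_t *f,
                 const uint32_t *cond, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = merge_masked(t[i], f[i], mask_from_nonzero(cond[i]));
}
```

The XOR form avoids a branch per element, which is the property that lets the same expression map onto whole vector registers.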

From: Tim Prince on
Tim Prince wrote:
> Dan wrote:
>
>>
>> I have some legacy code which includes calls to the CAL2 vector mask
>> and vector population count instructions. Lacking a nearby Cray, does
>> anyone know of subroutines that implement these instructions?
>>
>
> Maybe it would help to show some of the code, if it isn't secret like
> much usage of popcount.
> Apparently, GMP supports popcount for those processors which implement it.
>
>> The Intel manuals have MMX/SSE{1234} instructions which might work,
>> and AMD has the "something-now" which are apparently the same, but
>> I've not seen any demo code emulating the mask and popcount
>> instructions.
>
> SSE popcount instruction isn't available in currently released hardware,
> but it will be along soon.

Wrong, it's in the original SSE4 supported by Penryn processors:
http://softwarecommunity.intel.com/articles/eng/1193.htm
(no compiler support, except for a C compatible library function).
From: FX on
> no compiler support

I'm not sure exactly what you mean by that:

$ cat a.c
int foo(int i) { return __builtin_popcount (i); }
$ gcc -S a.c -msse4
$ cat a.s
        .text
.globl _foo
_foo:
        pushl %ebp
        movl %esp, %ebp
        subl $8, %esp
        movl 8(%ebp), %eax
        popcntl %eax, %eax
        leave
        ret
        .subsections_via_symbols

There clearly is a popcntl opcode used. If you don't specify -msse4, you get a
call to a function in the GCC support library (libgcc).
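
For targets without a hardware instruction, the fallback is an ordinary software routine. The following is a minimal sketch of one well-known bit-parallel way to count the set bits in a 64-bit word (the Cray word size). It is not claimed to be what libgcc does, and the name popcount64 is invented for this example.

```c
#include <stdint.h>

/* Count set bits in a 64-bit word without a hardware popcount. */
int popcount64(uint64_t x)
{
    /* Each 2-bit field now holds the count of its own two bits. */
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    /* Sum adjacent 2-bit counts into 4-bit fields. */
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    /* Sum adjacent 4-bit counts into bytes. */
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    /* The multiply adds all eight byte counts into the top byte. */
    return (int)((x * 0x0101010101010101ULL) >> 56);
}
```

With GCC, __builtin_popcountll is the 64-bit counterpart of the builtin shown in a.c, so a routine like this is only needed where no builtin is available.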


$ gcc -v
Using built-in specs.
Target: i386-apple-darwin8.10.1
Configured with: /tmp/gfortran-20080221/ibin/../gcc/configure --prefix=/usr/local/gfortran --enable-languages=c,fortran --with-gmp=/tmp/gfortran-20080221/gfortran_libs --enable-bootstrap
Thread model: posix
gcc version 4.4.0 20080221 (experimental) [trunk revision 132519] (GCC)

--
FX
From: Dan on
Hi all,

Yes, raw speed is needed. That's why we went from Fortran to assembler
in the first place. Since the code was written, the amount of data has
increased by three, perhaps four, orders of magnitude.

I just looked over the standard's definition of integer, and it
implies all integers are signed. Is that true? If there are unsigned
integers, then there should be no problem.

For those with serious assembler leanings, appended below is a brief
description of the CAL routine, then the CAL routine itself. I added
"<++++" at the vector mask instructions.

For those lucky enough not to have to program in CAL, S registers are
scalar, V are vector, T are transfer, A and B are data registers. This
code was originally written for a Cray-1, so it is simpler than X and
Y assembler. The code is uncommented deliberately. Sorry about that.

thanks all,
dan davison



CAL SUBROUTINES:
c
c CMPRS1N (number-1,element1 array1,element1 array2)
c
c The number of consecutive elements examined is specified,
c as well as the addresses of the first elements in each array.
c If the element of array2 is nonzero, the corresponding
c element of array1 is kept. The kept elements are written
c without space between and in their original order in
c the array1 commencing with the position of the first
c element specified. The value of the first argument is changed
c to the number of kept elements minus one.
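
The behaviour described in the comment above can be sketched in plain scalar C as follows. This is a reference sketch written from the description only, not a translation of the CAL. The name cmprs1n_ref and the choice of int64_t for the 64-bit Cray word are assumptions of this example.

```c
#include <stdint.h>

/* Keep array1[i] wherever array2[i] is nonzero, packing the kept
   elements to the front of array1 in their original order.
   On entry *count_minus_1 is (number of elements - 1); on return it is
   (number of kept elements - 1), so it is -1 when nothing is kept. */
void cmprs1n_ref(int64_t *count_minus_1, int64_t *array1,
                 const int64_t *array2)
{
    int64_t n = *count_minus_1 + 1;
    int64_t kept = 0;

    for (int64_t i = 0; i < n; i++) {
        /* kept never exceeds i, so compacting in place is safe. */
        if (array2[i] != 0)
            array1[kept++] = array1[i];
    }
    *count_minus_1 = kept - 1;
}
```

A scalar version like this is useful mainly as a correctness check for any faster vectorized replacement.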


* Copyright, 1989, The Regents of the University of California.
* This software was produced under a U. S. Government contract
* (W-7405-ENG-36) by the Los Alamos National Laboratory, which
* is operated by the University of California for the U. S.
* Department of Energy. The U. S. Government is licensed to use,
* reproduce, and distribute this software. Permission is granted
* to the public to copy and use this software without charge,
* provided that this Notice and any statement of authorship are
* reproduced on all copies. Neither the Government nor the
* University makes any warranty, express or implied, or assumes
* any liability or responsibility for the use of this software.
*
         IDENT     CMPRS1N                 (ju,iz(0),ii(0))
CMPRS1N  ENTER     PRELOAD=0,NP=3,ALIGN=ON
         ARGADD    A1,1,ARGPTR=A6          address of ju
         ARGADD    A2,2,ARGPTR=A6          address of iz(0)
         ARGADD    A6,3,ARGPTR=A6          address of ii(0)
         B71       A1
         A1        0,A1
         S2        A1
         A1        A1+1
         VL        A1
         A7        VL
         S2        S2>6
         A1        S2
         A1        -A1
         A3        A2
         B72       A2
JLUP     =         *
         A0        A6
         A6        A6+A7
         V6        ,A0,1
         V1,VM     V6,N                    <++++++++ vector mask
         S1        VM                      <++++++++ vector mask
         B77       A7
         A7        PS1
         A0        A7
         VL        A7
         JAZ       SKIP
         A0        A2
         V2        ,A0,V1
         A0        A3
         A3        A3+A7
         ,A0,1     V2
SKIP     =         *
         A7        B77
         A2        A2+A7
         A0        A1
         A1        A1+1
         A7        ZS0
         VL        A7
         JAM       JLUP
         A2        B72
         A2        A3-A2
         A2        A2-1
         A1        B71
         ,A1       A2
         EXIT
         END
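
One possible way to emulate the listing's overall shape in C is sketched below: take up to 64 flag elements at a time, form a 64-bit mask of the nonzero ones, count its set bits, and copy the selected data elements. This is one reading of the routine, not a verified translation. All function names are invented here, and the bit order within the mask (bit k for element k) is an arbitrary choice that need not match the Cray's. The listing appears to handle the partial chunk first, whereas this sketch handles it last, which does not change the result.

```c
#include <stdint.h>

/* Emulated "vector mask": bit k is set when flags[k] is nonzero.
   Requires 0 <= len <= 64. Bit numbering is an assumption of this sketch. */
uint64_t vector_mask_nonzero(const int64_t *flags, int len)
{
    uint64_t vm = 0;
    for (int k = 0; k < len; k++)
        vm |= (uint64_t)(flags[k] != 0) << k;
    return vm;
}

/* Emulated population count: clears the lowest set bit on each pass. */
int pop_count(uint64_t w)
{
    int n = 0;
    while (w != 0) {
        w &= w - 1;
        n++;
    }
    return n;
}

/* Chunked compress with the same contract as the CMPRS1N description:
   *ju is (count - 1) on entry and (kept count - 1) on return. */
void cmprs1n_chunked(int64_t *ju, int64_t *iz, const int64_t *ii)
{
    int64_t n = *ju + 1;
    int64_t out = 0;

    for (int64_t base = 0; base < n; base += 64) {
        int len = (n - base < 64) ? (int)(n - base) : 64;
        uint64_t vm = vector_mask_nonzero(ii + base, len);
        int kept = pop_count(vm);

        if (kept == 0)
            continue;               /* nothing selected in this chunk */

        int64_t j = out;
        for (int k = 0; k < len; k++)
            if ((vm >> k) & 1u)
                iz[j++] = iz[base + k];
        out += kept;                /* advance by the mask's popcount */
    }
    *ju = out - 1;
}
```

On x86 the two helpers are the natural places to substitute a compare-and-movemask sequence and a hardware popcount, leaving the driver loop unchanged.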