From: Dan on 6 Apr 2008 11:01

Hi,

I have some legacy code which includes calls to the CAL2 vector mask
and vector population count instructions. Lacking a nearby Cray, does
anyone know of subroutines that implement these instructions?

I've searched comp.lang.fortran, sent email to Cray (no reply,
probably no one left writing in CAL for speed), searched with Google
(all Photoshop hits which I did not understand), and checked the '95, '03,
and proposed '08 Fortran standards, but no luck.

The Intel manuals have MMX/SSE{1234} instructions which might work,
and AMD has the "something-now" which are apparently the same, but
I've not seen any demo code emulating the mask and popcount
instructions.

Any suggestions would be welcome.

TIA,
dan
From: Tim Prince on 6 Apr 2008 11:38

Dan wrote:
> I have some legacy code which includes calls to the CAL2 vector mask
> and vector population count instructions. Lacking a nearby Cray, does
> anyone know of subroutines that implement these instructions?

Maybe it would help to show some of the code, if it isn't secret like
much usage of popcount.
Apparently, GMP supports popcount for those processors which implement it.

> The Intel manuals have MMX/SSE{1234} instructions which might work,
> and AMD has the "something-now" which are apparently the same, but
> I've not seen any demo code emulating the mask and popcount
> instructions.

SSE popcount instruction isn't available in currently released hardware,
but it will be along soon.

SSE auto-vectorization of conditional operations is often done with
vector masks. Fortran MERGE for SSE2 is vectorized with masks; the
latest additions to SSE include more direct merge support. IAND and IEOR
support masking operations. If your Fortran doesn't auto-vectorize them
in the context you require, there are SSE intrinsic operators, supported
by C (and possibly 1 or 2 Fortran) compilers.
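The masking idiom mentioned in the post above can be illustrated with a minimal C sketch. It shows a branch-free select equivalent in spirit to Fortran MERGE on integer words, built from AND, OR and NOT. The names mask_from_flag, merge_u64 and merge_array are hypothetical, chosen for this illustration only; they do not come from the thread or from any library, and whether a given compiler vectorizes the loop is not guaranteed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: turn a condition word into a full-width mask.
 * Nonzero flag -> all ones, zero flag -> all zeros. */
static uint64_t mask_from_flag(uint64_t flag)
{
    return (uint64_t)0 - (uint64_t)(flag != 0);
}

/* Branch-free select on one word: returns t where mask bits are 1,
 * f where mask bits are 0 (the same role MERGE plays in Fortran). */
static uint64_t merge_u64(uint64_t t, uint64_t f, uint64_t mask)
{
    return (t & mask) | (f & ~mask);
}

/* Element-wise select over arrays:
 * out[i] = t[i] if cond[i] != 0, otherwise f[i]. */
void merge_array(uint64_t *out, const uint64_t *t, const uint64_t *f,
                 const uint64_t *cond, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = merge_u64(t[i], f[i], mask_from_flag(cond[i]));
}
```

Because the loop body contains no branch, it is the kind of code an auto-vectorizer may be able to map onto SIMD mask operations, though that depends on the compiler and flags used.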
From: Tim Prince on 6 Apr 2008 15:10

Tim Prince wrote:
> Dan wrote:
>
>> I have some legacy code which includes calls to the CAL2 vector mask
>> and vector population count instructions. Lacking a nearby Cray, does
>> anyone know of subroutines that implement these instructions?
>
> Maybe it would help to show some of the code, if it isn't secret like
> much usage of popcount.
> Apparently, GMP supports popcount for those processors which implement it.
>
>> The Intel manuals have MMX/SSE{1234} instructions which might work,
>> and AMD has the "something-now" which are apparently the same, but
>> I've not seen any demo code emulating the mask and popcount
>> instructions.
>
> SSE popcount instruction isn't available in currently released hardware,
> but it will be along soon.

Wrong, it's in the original SSE4 supported by Penryn processors:
http://softwarecommunity.intel.com/articles/eng/1193.htm
(no compiler support, except for a C compatible library function).
From: FX on 6 Apr 2008 15:48

> no compiler support

I'm not sure exactly what you mean by that:

    $ cat a.c
    int foo(int i) { return __builtin_popcount (i); }
    $ gcc -S a.c -msse4
    $ cat a.s
            .text
    .globl _foo
    _foo:
            pushl   %ebp
            movl    %esp, %ebp
            subl    $8, %esp
            movl    8(%ebp), %eax
            popcntl %eax, %eax
            leave
            ret
            .subsections_via_symbols

There clearly is a popcntl opcode used. If you don't specify -msse4, you
get a call to a function in the GCC support library (libgcc).

    $ gcc -v
    Using built-in specs.
    Target: i386-apple-darwin8.10.1
    Configured with: /tmp/gfortran-20080221/ibin/../gcc/configure --prefix=/usr/local/gfortran --enable-languages=c,fortran --with-gmp=/tmp/gfortran-20080221/gfortran_libs --enable-bootstrap
    Thread model: posix
    gcc version 4.4.0 20080221 (experimental) [trunk revision 132519] (GCC)

-- 
FX
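For compilers or targets without a hardware popcount, a software fallback of the kind a support library might provide can be sketched as below. This is a standard bit-twiddling (SWAR) population count, not the actual libgcc implementation. The names popcount64_portable and popcount64 are hypothetical; __builtin_popcountll is the GCC/Clang builtin for 64-bit operands, a sibling of the __builtin_popcount shown above.

```c
#include <stdint.h>

/* Portable 64-bit population count using parallel bit sums (SWAR).
 * Works on any C99 compiler; no special instructions needed. */
unsigned popcount64_portable(uint64_t x)
{
    /* Sum adjacent bits into 2-bit fields. */
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    /* Sum adjacent 2-bit fields into 4-bit fields. */
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    /* Sum adjacent 4-bit fields into bytes. */
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    /* Add all eight byte counts; the total lands in the top byte. */
    return (unsigned)((x * 0x0101010101010101ULL) >> 56);
}

/* Prefer the compiler builtin when it exists (GCC and Clang);
 * otherwise fall back to the portable version. */
unsigned popcount64(uint64_t x)
{
#if defined(__GNUC__) || defined(__clang__)
    return (unsigned)__builtin_popcountll(x);
#else
    return popcount64_portable(x);
#endif
}
```

With the builtin, the compiler decides whether to emit a popcnt instruction or a library call, depending on the target options it is given.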
From: Dan on 6 Apr 2008 17:48

Hi all,

Yes, raw speed is needed. That's why we went from Fortran to assembler
in the first place. Since the code was written, the amount of data has
increased by three, perhaps four, orders of magnitude.

I just looked over the standard's definition of integer, and it implies
all integers are signed. Is that true? If there are unsigned integers,
then there should be no problem.

For those with serious assembler leanings, appended below is a brief
description of the CAL routine, then the CAL routine itself. I added
"<++++" at the vector mask instructions.

For those lucky enough not to have to program in CAL, S registers are
scalar, V are vector, T are transfer, A and B are data registers. This
code was originally written for a Cray-1, so it is simpler than X and Y
assembler. The code is uncommented deliberately. Sorry about that.

thanks all,
dan davison

CAL SUBROUTINES:
c
c   CMPRS1N (number-1,element1 array1,element1 array2)
c
c   The number of consecutive elements examined is specified,
c   as well as the addresses of the first elements in each array.
c   If the element of array2 is nonzero, the corresponding
c   element of array1 is kept. The kept elements are written
c   without space between and in their original order in
c   the array1 commencing with the position of the first
c   element specified. The value of the first argument is changed
c   to the number of kept elements minus one.

* Copyright, 1989, The Regents of the University of California.
* This software was produced under a U. S. Government contract
* (W-7405-ENG-36) by the Los Alamos National Laboratory, which
* is operated by the University of California for the U. S.
* Department of Energy. The U. S. Government is licensed to use,
* reproduce, and distribute this software. Permission is granted
* to the public to copy and use this software without charge,
* provided that this Notice and any statement of authorship are
* reproduced on all copies. Neither the Government nor the
* University makes any warranty, express or implied, or assumes
* any liability or responsibility for the use of this software.
*
         IDENT   CMPRS1N        (ju,iz(0),ii(0))
CMPRS1N  ENTER   PRELOAD=0,NP=3,ALIGN=ON
         ARGADD  A1,1,ARGPTR=A6     address of ju
         ARGADD  A2,2,ARGPTR=A6     address of iz(0)
         ARGADD  A6,3,ARGPTR=A6     address of ii(0)
         B71     A1
         A1      0,A1
         S2      A1
         A1      A1+1
         VL      A1
         A7      VL
         S2      S2>6
         A1      S2
         A1      -A1
         A3      A2
         B72     A2
JLUP     =       *
         A0      A6
         A6      A6+A7
         V6      ,A0,1
         V1,VM   V6,N               <++++++++ vector mask
         S1      VM                 <++++++++ vector mask
         B77     A7
         A7      PS1
         A0      A7
         VL      A7
         JAZ     SKIP
         A0      A2
         V2      ,A0,V1
         A0      A3
         A3      A3+A7
         ,A0,1   V2
SKIP     =       *
         A7      B77
         A2      A2+A7
         A0      A1
         A1      A1+1
         A7      ZS0
         VL      A7
         JAM     JLUP
         A2      B72
         A2      A3-A2
         A2      A2-1
         A1      B71
         ,A1     A2
         EXIT
         END
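The behaviour described in the CMPRS1N comment block can be approximated by the following C sketch, which follows the prose description rather than translating the CAL line by line. It processes the data in chunks of up to 64 elements, builds a bit mask of the nonzero flags for each chunk (loosely mirroring the vector mask step), and copies the kept elements forward in order. Everything here is an assumption made for illustration: the names nonzero_mask and compress_nonzero are hypothetical, the elements are taken to be 64-bit integers, bit 0 of the mask stands for the first element of a chunk (which may not match the Cray bit ordering), and the function takes and returns plain counts instead of the "count minus one" convention of the original interface. __builtin_ctzll is a GCC/Clang builtin.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: bit i of the result is set when flags[i] != 0.
 * len must be at most 64. */
static uint64_t nonzero_mask(const int64_t *flags, size_t len)
{
    uint64_t m = 0;
    for (size_t i = 0; i < len; ++i)
        m |= (uint64_t)(flags[i] != 0) << i;
    return m;
}

/* Compress data[0..n-1] in place: keep data[i] where flags[i] != 0,
 * preserving order. Returns the number of kept elements. */
size_t compress_nonzero(int64_t *data, const int64_t *flags, size_t n)
{
    size_t out = 0;  /* next write position; never ahead of the read position */

    for (size_t base = 0; base < n; base += 64) {
        size_t len = (n - base < 64) ? (n - base) : 64;
        uint64_t m = nonzero_mask(flags + base, len);

        /* Visit the set bits from lowest to highest. The number of
         * iterations equals the population count of the mask. */
        while (m != 0) {
            unsigned bit = (unsigned)__builtin_ctzll(m); /* index of lowest set bit */
            data[out++] = data[base + bit];
            m &= m - 1;                                  /* clear that bit */
        }
    }
    return out;
}
```

In Fortran itself, the PACK intrinsic with a mask such as `array2 /= 0` expresses the same compression, though whether it meets the speed requirement would have to be measured.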