From: Richard Maine on
Ron Shepard <ron-shepard(a)NOSPAM.comcast.net> wrote:

> http://sourceforge.net/projects/math-atlas/
....
> This code ...The hard part of hand-tuning assembly
> is eliminated through brute force tuning of the various parameters

Note that if you don't do the hard part, hand-coded assembly might quite
likely be slower than the code from high-level languages. There is a
long and well established history of people allegedly hand optimizing
codes only to find that they made the codes actually run slower instead
of faster. I've certainly done such things myself. This history goes
back almost 50 years now, but it is probably more likely to happen now
than 50 years ago.

This is the case for many attempts at hand optimization - not just
assembly. Such things as loop unrolling, for example, can slow things
down in some scenarios because it can inhibit the compiler's ability to
do its own optimization or parallelization.

I suppose I should repeat the cliche about testing such things. It
shouldn't need saying, but it does turn out to need saying... a lot. If
you try to optimize code and don't test the results, you haven't done
much of a job of optimization. There have been many cases where people
didn't bother to do such testing, but claimed they had achieved wondrous
optimizations, only to have someone else find that the code could be
substantially sped up by removing the claimed optimizations.
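
As a rough illustration of the kind of test meant here, the following C sketch times a plain summation loop against a hand-unrolled one. The function names (sum_plain, sum_unrolled, time_sum) are invented for this example, and clock() only gives coarse CPU time, so treat it as a starting point under those assumptions rather than a rigorous benchmark.

```c
#include <stddef.h>
#include <time.h>

/* Straightforward loop: unrolling and vectorization are left to the compiler. */
double sum_plain(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Hand-unrolled by 4, with a cleanup loop for the leftover elements. */
double sum_unrolled(const double *x, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++)
        s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}

/* CPU seconds spent in `reps` calls of f. Each result is added to *sink
   so that the calls have a visible effect and are not discarded. */
double time_sum(double (*f)(const double *, size_t),
                const double *x, size_t n, int reps, volatile double *sink)
{
    clock_t t0 = clock();
    for (int r = 0; r < reps; r++)
        *sink += f(x, n);
    clock_t t1 = clock();
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

Calling time_sum once with sum_plain and once with sum_unrolled, on the same array and at the optimization level actually used in production, shows which version wins on a given compiler and machine. The answer can differ from one setup to the next, which is the whole point of measuring.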

--
Richard Maine | Good judgment comes from experience;
email: last name at domain . net | experience comes from bad judgment.
domain: summertriangle | -- Mark Twain
From: glen herrmannsfeldt on
Richard Maine <nospam(a)see.signature> wrote:
(snip, someone wrote)

>> This code ...The hard part of hand-tuning assembly
>> is eliminated through brute force tuning of the various parameters

> Note that if you don't do the hard part, hand-coded assembly might quite
> likely be slower than the code from high-level languages. There is a
> long and well established history of people allegedly hand optimizing
> codes only to find that they made the codes actually run slower instead
> of faster. I've certainly done such things myself. This history goes
> back almost 50 years now, but it is probably more likely to happen now
> than 50 years ago.

There are stories back to the first Fortran compiler. After writing
the compiler, the developers looked at the generated code, and were
surprised at some of the things it did.

Then there was OS/360 Fortran H, reported to generate code
"as good as an experienced assembler programmer".

For RISCier processors, it is even harder to hand optimize the
code, IA64 being one of the harder ones. The interaction between
instructions is so strong that only computers can do it fast enough.

> This is the case for many attempts at hand optimization - not just
> assembly. Such things as loop unrolling, for example, can slow things
> down in some scenarios because it can inhibit the compiler's ability to
> do its own optimization or parallelization.

-- glen
From: Nick Maclaren on
In article <i43qcf$mj0$1(a)speranza.aioe.org>,
glen herrmannsfeldt <gah(a)ugcs.caltech.edu> wrote:
>Richard Maine <nospam(a)see.signature> wrote:
>
>>> This code ...The hard part of hand-tuning assembly
>>> is eliminated through brute force tuning of the various parameters
>
>> Note that if you don't do the hard part, hand-coded assembly might quite
>> likely be slower than the code from high-level languages. There is a
>> long and well established history of people allegedly hand optimizing
>> codes only to find that they made the codes actually run slower instead
>> of faster. I've certainly done such things myself. This history goes
>> back almost 50 years now, but it is probably more likely to happen now
>> than 50 years ago.
>
>There are stories back to the first Fortran compiler. After writing
>the compiler, the developers looked at the generated code, and were
>surprised at some of the things it did.
>
>Then there was OS/360 Fortran H, reported to generate code
>"as good as an experienced assembler programmer".

Experienced, yes - skilled, no. It was a ghastly compiler. Now,
there WERE others that did just that - though I can't remember
which ones they were (not on a System/370, anyway).

>For RISCier processors, it is even harder to hand optimize the
>code, IA64 being one of the harder ones. The interaction between
>instructions is so strong that only computers can do it fast enough.

And Terje :-)


Regards,
Nick Maclaren.
From: Vincenzo Mercuri on
Ron Shepard wrote:

> There is something in between using high-level language constructs
> and hand-coding assembly. An example of this is the ATLAS BLAS
> library
>
> http://sourceforge.net/projects/math-atlas/
>
> This code uses a high-level language, C, but it is used in a very
> low-level primitive way. Basically, it is writing assembly language
> in C. There is relatively little compiler optimization that can, or
> should be done on that code. The hard part of hand-tuning assembly
> is eliminated through brute force tuning of the various parameters
> (in ATLAS, that includes tuning for the number of registers, the
> size of cache, loop unrolling, matrix subblocking, and things like
> that). After a piece of code is written, it is run for hours at a
> time on the target architecture in order to search for the optimal
> set of tuning parameters, and then that final result is distributed
> for use.
>
> Why was ATLAS done in C? I don't know definitely, but I think it is
> simply because it relies heavily on use of the C preprocessor. If
> you look at some of the routines, there are more lines of
> preprocessor code than there are executable code. The low-level C
> code that is there is simple and could have been done just as easily
> (or maybe even easier) in fortran. In fact, considering the
> aliasing problems with C (look at the code, it is explicitly
> checked), fortran is probably the more natural language for things
> like ATLAS. But the C preprocessor has always been an integral part
> of the C language, and as all of us fortran programmers here know,
> the fortran standards process failed to produce anything similarly
> useful for fortran over the past 30+ years. So ATLAS (and many
> other similar low-level utility programs) is written in C rather
> than fortran.
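
To make the quoted description a little more concrete, here is a toy C sketch of the general idea: the preprocessor stamps out one kernel per candidate block size, and a brute-force loop times each one and keeps the fastest. This is not ATLAS code. The macro DEFINE_BLOCKED_MATMUL, the function pick_block_size and the candidate sizes 8, 16 and 32 are all assumptions made up for illustration, and a real search would cover far more parameters and repeat each timing.

```c
#include <stddef.h>
#include <time.h>

typedef void (*matmul_fn)(size_t n, const double *a, const double *b,
                          double *c);

/* Generate one blocked C += A*B kernel (row-major, n x n) per block size. */
#define DEFINE_BLOCKED_MATMUL(NB)                                         \
static void matmul_nb##NB(size_t n, const double *a, const double *b,    \
                          double *c)                                      \
{                                                                         \
    for (size_t ii = 0; ii < n; ii += NB)                                 \
    for (size_t kk = 0; kk < n; kk += NB)                                 \
    for (size_t jj = 0; jj < n; jj += NB) {                               \
        size_t imax = ii + NB < n ? ii + NB : n;                          \
        size_t kmax = kk + NB < n ? kk + NB : n;                          \
        size_t jmax = jj + NB < n ? jj + NB : n;                          \
        for (size_t i = ii; i < imax; i++)                                \
        for (size_t k = kk; k < kmax; k++) {                              \
            double aik = a[i * n + k];                                    \
            for (size_t j = jj; j < jmax; j++)                            \
                c[i * n + j] += aik * b[k * n + j];                       \
        }                                                                 \
    }                                                                     \
}

DEFINE_BLOCKED_MATMUL(8)
DEFINE_BLOCKED_MATMUL(16)
DEFINE_BLOCKED_MATMUL(32)

static const struct { int nb; matmul_fn fn; } candidates[] = {
    { 8, matmul_nb8 }, { 16, matmul_nb16 }, { 32, matmul_nb32 },
};

/* Time every candidate once on the given inputs and return the block
   size of the fastest. c is zeroed before each run and is left holding
   the product computed by the last candidate. */
int pick_block_size(size_t n, const double *a, const double *b, double *c)
{
    int best_nb = candidates[0].nb;
    double best_t = -1.0;
    for (size_t v = 0; v < sizeof candidates / sizeof candidates[0]; v++) {
        for (size_t i = 0; i < n * n; i++)
            c[i] = 0.0;
        clock_t t0 = clock();
        candidates[v].fn(n, a, b, c);
        double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
        if (best_t < 0.0 || t < best_t) {
            best_t = t;
            best_nb = candidates[v].nb;
        }
    }
    return best_nb;
}
```

Because the block size is a compile-time constant inside each generated kernel, the compiler sees fixed loop bounds for the inner blocks, which is one plausible reason to lean on the preprocessor rather than pass the size in at run time.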

Thank you, that is a valuable article and link.
I think that the use of the C language instead
of assembly is due as much to the extensive use of its
preprocessor as to the demand for portability.
I haven't looked at the ATLAS code yet, and maybe
I am wrong, since there are many ways to write
non-portable code even in C, but portability is
something that should not be underestimated. We
cannot talk about assembly without reference to
the target machine. Also, a single assembly version
would not be optimal for every target, and the
compiler could not optimize it any further.
A library in C (or Fortran) is meant to make the
most of the host compiler's optimization capabilities.




--
Vincenzo Mercuri
From: glen herrmannsfeldt on
Nick Maclaren <nmm(a)gosset.csi.cam.ac.uk> wrote:
(snip on compiled code vs. hand generated assembly code)

>>For RISCier processors, it is even harder to hand optimize the
>>code, IA64 being one of the harder ones. The interaction between
>>instructions is so strong that only computers can do it fast enough.

> And Terje :-)

Hmmm.

Many compilers that I know of now use dynamic programming to
select an optimal set of instructions. Given the appropriate
weights (instruction times), dynamic programming chooses the
lowest-cost sequence of instructions.
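
For readers who have not seen it, here is a minimal C sketch of that style of instruction selection on an expression tree. It assumes a made-up machine with plain add, add-immediate, multiply and fused multiply-add instructions, and the costs are invented for the example. Real compilers use much larger pattern sets, so this only shows the shape of the technique.

```c
#include <stddef.h>

enum op { OP_REG, OP_CONST, OP_ADD, OP_MUL };

struct node {
    enum op op;
    struct node *l, *r;  /* children; NULL for OP_REG and OP_CONST */
    int cost;            /* memoized best cost; -1 = not yet computed */
};

/* Hypothetical instruction costs, chosen only for this sketch. */
enum { COST_LOADI = 1, COST_ADD = 1, COST_ADDI = 1, COST_MUL = 3, COST_FMA = 3 };

static int min_int(int a, int b) { return a < b ? a : b; }

/* Cheapest way to get the value of n into a register: try every
   instruction pattern that matches at n and add the best costs of the
   subtrees that pattern leaves uncovered. */
int best_cost(struct node *n)
{
    if (n->cost >= 0)
        return n->cost;

    int best = 0;
    switch (n->op) {
    case OP_REG:
        best = 0;                       /* already in a register */
        break;
    case OP_CONST:
        best = COST_LOADI;              /* load immediate */
        break;
    case OP_MUL:
        best = COST_MUL + best_cost(n->l) + best_cost(n->r);
        break;
    case OP_ADD:
        best = COST_ADD + best_cost(n->l) + best_cost(n->r);
        if (n->r->op == OP_CONST)       /* add-immediate absorbs the constant */
            best = min_int(best, COST_ADDI + best_cost(n->l));
        if (n->l->op == OP_MUL)         /* fused multiply-add absorbs the multiply */
            best = min_int(best, COST_FMA + best_cost(n->l->l)
                                 + best_cost(n->l->r) + best_cost(n->r));
        break;
    }
    n->cost = best;
    return best;
}
```

With these costs, a tree for a*b + c comes out at 3 (one fused multiply-add) instead of 4 (multiply then add). Change the weights and the selection changes with them, which is the sense in which the instruction times drive the choice.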

I have an actual IA64 machine, and the books describing the
instruction set. I haven't even thought about trying to do any
assembly programming for it.

For those who don't follow such things, IA64 instructions
are grouped into 128-bit bundles, with three 41-bit instructions
and a 5-bit template field in each bundle. There are five
different types of instructions, and 24 different combinations
of those types that can go into a bundle.
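
The arithmetic is 5 + 3*41 = 128. As a small sketch in C of how such a bundle could be taken apart, the code below assumes the template sits in the lowest five bits with the three slots following in order, which is how I understand the Intel manuals but is worth checking against them. The struct and function names are made up for this example.

```c
#include <stdint.h>

/* A 128-bit bundle held as two 64-bit halves. */
struct ia64_bundle {
    uint64_t lo;  /* bundle bits 0..63 */
    uint64_t hi;  /* bundle bits 64..127 */
};

#define SLOT_MASK ((UINT64_C(1) << 41) - 1)

/* Template field: assumed to be bundle bits 0..4. */
unsigned bundle_template(struct ia64_bundle b)
{
    return (unsigned)(b.lo & 0x1f);
}

/* Return the 41-bit instruction in slot 0, 1 or 2 (any other value is
   treated as slot 2). */
uint64_t bundle_slot(struct ia64_bundle b, int slot)
{
    switch (slot) {
    case 0:
        return (b.lo >> 5) & SLOT_MASK;                    /* bits 5..45 */
    case 1:
        return ((b.lo >> 46) | (b.hi << 18)) & SLOT_MASK;  /* bits 46..86 */
    default:
        return (b.hi >> 23) & SLOT_MASK;                   /* bits 87..127 */
    }
}

/* Inverse of the two functions above: build a bundle from its fields. */
struct ia64_bundle bundle_pack(unsigned tmpl, uint64_t s0, uint64_t s1,
                               uint64_t s2)
{
    struct ia64_bundle b;
    s0 &= SLOT_MASK;
    s1 &= SLOT_MASK;
    s2 &= SLOT_MASK;
    b.lo = (uint64_t)(tmpl & 0x1f) | (s0 << 5) | (s1 << 46);
    b.hi = (s1 >> 18) | (s2 << 23);
    return b;
}
```

Note that slot 1 straddles the two 64-bit halves, so even pulling the raw fields out takes a little care before any question of what the template value means.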

Much of the possible interaction between instructions that
most pipelined processors have to figure out for you is done
by the compiler for IA64. With most processors, you can
assume that the instructions are executed in order, with the
exception of branch delay slots on many RISC processors.
As far as I know, you can't make such assumptions for IA64.

-- glen