From: glen herrmannsfeldt on
nmm1(a)cam.ac.uk wrote:

(snip)

< Actually, not even on older hardware. Flipping between the integer
< and floating-point pipelines is slow, but often used to be done by
< storing to memory and reloading. That is one of the reasons that I
< think that the separation of the pipelines is a mistake - but see
< comp.arch for where I proposed a heretical solution :-)

The separation was an important part of the design of the 360/91
and the Cray-1. For IBM, the convenience of separate floating
point and general registers allowed for it.

The OS/360 Fortran library uses a fixed point initial approximation
for SQRT and such (without the exponent). EXP and LOG use fixed
point to move exponent bits around after or before a polynomial
expansion. Other than processors with hardware transcendental
functions, that seems likely the way it is done.
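
As a rough illustration of splitting off the exponent before a series expansion, here is a minimal sketch in Python. It is not the OS/360 routine: it works on binary doubles rather than S/360 hexadecimal floating point, and the function name, the range reduction and the choice of series are all assumptions made for the example.

```python
import math

def log_via_exponent_split(x, terms=12):
    """Natural log of x > 0: peel off the exponent, expand on the mantissa."""
    if x <= 0.0:
        raise ValueError("x must be positive")
    # x = m * 2**e with 0.5 <= m < 1 (the exponent handling step).
    m, e = math.frexp(x)
    # Recentre the mantissa around 1 so the series argument stays small.
    if m < math.sqrt(0.5):
        m *= 2.0
        e -= 1
    # log(m) = 2 * (t + t**3/3 + t**5/5 + ...), with t = (m - 1) / (m + 1).
    t = (m - 1.0) / (m + 1.0)
    t2 = t * t
    total = 0.0
    power = t
    for k in range(terms):
        total += power / (2 * k + 1)
        power *= t2
    # Put the exponent back in: log(x) = e * log(2) + log(m).
    return e * math.log(2.0) + 2.0 * total
```

The exponent only contributes the cheap `e * log(2)` term; all the series work is done on a mantissa confined to a narrow range.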

I still remember seeing long ago the timing for the OS/360
Fortran library routines on different S/360 models. On the
360/91, DSQRT is faster than SQRT. The initial approximation
for SQRT uses fixed point divide (and two Newton-Raphson iterations).
DSQRT uses no fixed point divide, but four iterations. On the 91,
the extra time for the fixed point divide was more than the
cost of the two additional iterations. Not so for the other
S/360 models.
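
For anyone unfamiliar with the iteration, a minimal sketch in Python follows. It does not reproduce the library code: the function name, the crude linear starting guess and the use of binary doubles are assumptions for the example. It only shows why a poorer starting guess needs more iterations.

```python
import math

def nr_sqrt(x, iterations):
    """Approximate sqrt(x) for x > 0 with Newton-Raphson steps."""
    if x <= 0.0:
        raise ValueError("x must be positive")
    # Split x = m * 2**e with 0.5 <= m < 1.
    m, e = math.frexp(x)
    if e % 2:
        # Make the exponent even so it can be halved exactly.
        m *= 2.0
        e -= 1
    # Now 0.5 <= m < 2. Crude linear starting guess (an assumption,
    # not the library's approximation); it is within about 6 percent.
    y = 0.5 * (1.0 + m)
    for _ in range(iterations):
        # Newton-Raphson step for y*y = m; the error roughly squares.
        y = 0.5 * (y + m / y)
    # Reattach half of the (even) exponent.
    return math.ldexp(y, e // 2)
```

With this starting guess, two iterations get to roughly single precision and four reach double precision, which is the same trade-off as a better starting value versus more iterations.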

-- glen
From: Dan Nagle on
Hello,

On 2009-09-12 06:45:06 -0400, nmm1(a)cam.ac.uk said:

> It can be slower, too. Many modern chips have separate addition
> and multiplication floating-point units, or a fused multiply-add,
> or both, and the time for N consecutive additions, N consecutive
> multiplications or N consecutive alternations of multiplication
> and addition is about the same.

Agree. Even on a machine as old as the Cray-1,
floating add was 6 clocks, floating multiply was 7.
Even when a shortstop is available, awaiting results
to reload a pipe was time wasted.

The time to do all the masks, and, add, and merge
would be (IIRC) 1 to load the mask, 1 to and,
3 to shift, 3 to add, 3 to shift, 1 to load a mask,
and 3 (or 2?) to merge (this uses the Cray scalar merge,
not available on all chips). There is some overlap,
but I don't want to bother resolving all the concurrency
possible. The result is longer than the 7 clock
multiply (the shift, add, shift sequence delays the most).
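
The sequence can be illustrated roughly as follows. This is a sketch in Python on IEEE 754 binary64 values, not Cray code; the function name, the masks and the fallback are assumptions for the example, and the tally simply adds the clock counts quoted above with no overlap at all.

```python
import struct

EXP_MASK = 0x7FF0000000000000   # the 11 exponent bits of a binary64
EXP_SHIFT = 52                  # position of the exponent field

def halve_via_exponent(x):
    """Halve a double by decrementing its biased exponent field."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    exponent = (bits & EXP_MASK) >> EXP_SHIFT             # mask, and, shift
    if exponent <= 1 or exponent == 0x7FF:
        # Zero, subnormals, the smallest normals, infinities and NaNs
        # cannot be handled by a plain decrement; fall back to multiply.
        return x * 0.5
    exponent -= 1                                         # add (of -1)
    bits = (bits & ~EXP_MASK) | (exponent << EXP_SHIFT)   # shift, mask, merge
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

# Clock counts as quoted in the post (merge taken as 3), summed serially.
STEP_CLOCKS = [
    ("load mask", 1), ("and", 1), ("shift", 3), ("add", 3),
    ("shift", 3), ("load mask", 1), ("merge", 3),
]
serial_clocks = sum(clocks for _, clocks in STEP_CLOCKS)
MULTIPLY_CLOCKS = 7
```

Even allowing for some overlap between the steps, the serial total is about double the multiply time, because each step needs the result of the one before it.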

Note that in the OP's issue, there is no accuracy question,
as divide-by-two == multiply-by-one-half is (almost certainly)
exact.
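
A quick way to check this, sketched in Python with IEEE 754 doubles (the helper names and sample values are assumptions for the example):

```python
import math

def halve_three_ways(x):
    """Return x/2, x*0.5 and ldexp(x, -1) for comparison."""
    return x / 2.0, x * 0.5, math.ldexp(x, -1)

def all_equal(values):
    first = values[0]
    return all(v == first for v in values)

samples = [1.0, 3.0, 0.1, -7.25, 1e300, 1e-300]

# All three forms of halving agree, and doubling the result gives x back,
# so nothing was rounded away (these samples stay clear of underflow).
halving_agrees = all(all_equal(halve_three_ways(x)) for x in samples)
halving_round_trips = all((x * 0.5) * 2.0 == x for x in samples)

# By contrast, 1/10 has no exact binary representation, so dividing by 10
# and multiplying by the rounded reciprocal can differ in the last bit.
tenth_differs = (3.0 / 10.0) != (3.0 * 0.1)
```

The "almost" covers underflow: halving the smallest subnormal cannot be exact. The contrast with 10 shows why replacing division by a reciprocal multiply is only unconditionally safe when the reciprocal is itself exactly representable, as 0.5 is.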

--
Cheers!

Dan Nagle

From: nmm1 on
In article <h8gkss$6hl$4(a)naig.caltech.edu>,
glen herrmannsfeldt <gah(a)ugcs.caltech.edu> wrote:
>
>< Actually, not even on older hardware. Flipping between the integer
>< and floating-point pipelines is slow, but often used to be done by
>< storing to memory and reloading. That is one of the reasons that I
>< think that the separation of the pipelines is a mistake - but see
>< comp.arch for where I proposed a heretical solution :-)
>
>The separation was an important part of the design of the 360/91
>and the Cray-1. For IBM, the convenience of separate floating
>point and general registers allowed for it.

Yeah. Now, when was it that they were designed?

>The OS/360 Fortran library uses a fixed point initial approximation
>for SQRT and such (without the exponent). EXP and LOG use fixed
>point to move exponent bits around after or before a polynomial
>expansion. Other than processors with hardware transcendental
>functions, that seems likely the way it is done.

It varies, but that's usually how it starts.


Regards,
Nick Maclaren.
From: robin on
"Dan Nagle" <dannagle(a)verizon.net> wrote in message news:h8fsfb$ier$1(a)news.eternal-september.org...
| Hello,
|
| On 2009-09-11 21:27:10 -0400, "robin" <robin_v(a)bigpond.com> said:
|
| > Multiplication? Who said anything about multiplication?
| > The operation is DIVISION here.
|
| Division by a constant, on modern compilers, is almost always
| replaced by multiplication by the reciprocal.

They might if you ask for it.

| I thought you knew that.

I know that some optimisations can do that.

| > It may interest you to know that those operations take place
| > for any arithmetic operation, whether it be +, -, * or divide.
| > With halving, division requires no actual division so that is
| > saved. That is why the operation is about 10 times faster than
| > actual division.
|
| Agreed, on older hardware. On modern pipelined chips,
| the repeated operations of masking, anding, and so on,
| all require awaiting the results of a previous operation
| in order to be inserted into a pipeline.

The operations required for halving are no different from ordinary
division, except that no division is required. Since there is no
division, that time is saved.

| That takes a long time.

No longer than addition.

| > And IF the operation were multiplication by 2, why that can
| > be achieved by simple addition. Again, faster than multiplication.
|
| Not on most modern chips.

Overgeneralisation, like your earlier comments.

| > | > 2.0 can be treated as 2 for operations like x*2 and x/2,
| > | > and those operations (* or div) are done at run time of course
| > | > (the * being performed as x+x, again with considerable increase
| > | > in speed).
| > |
| > | On modern hardware, multiply is often (at least almost)
| > | as fast as addition.
| >
| > Often not.


From: robin on
<nmm1(a)cam.ac.uk> wrote in message news:h8fu3i$1b2$1(a)smaug.linux.pwf.cam.ac.uk...
| In article <h8fsfb$ier$1(a)news.eternal-september.org>,
| Dan Nagle <dannagle(a)verizon.net> wrote:
| >On 2009-09-11 21:27:10 -0400, "robin" <robin_v(a)bigpond.com> said:
| >
| >> Multiplication? Who said anything about multiplication?
| >> The operation is DIVISION here.
| >
| >Division by a constant, on modern compilers, is almost always
| >replaced by multiplication by the reciprocal.
| >
| >I thought you knew that.
|
| Well, only if you enable serious optimisation :-) If you don't, why
| are you worrying about performance?
|
| >> It may interest you to know that those operations take place
| >> for any arithmetic operation, whether it be +, -, * or divide.
| >> With halving, division requires no actual division so that is
| >> saved. That is why the operation is about 10 times faster than
| >> actual division.
| >
| >Agreed, on older hardware. On modern pipelined chips,
| >the repeated operations of masking, anding, and so on,
| >all require awaiting the results of a previous operation
| >in order to be inserted into a pipeline.
| >
| >That takes a long time.
|
| Actually, not even on older hardware. Flipping between the integer
| and floating-point pipelines is slow, but often used to be done by
| storing to memory and reloading.

Halving (using a hardware instruction) an FPN does not require
use of integer registers.