From: Mok-Kong Shen on

This is a layman's question: if I'm not mistaken, multi-precision integer
arithmetic is done with algorithms employing a single accumulator. Would
parallel processing exploiting multiple accumulators be advantageous
and, if so, could someone give some references?

Thanks.

M. K. Shen
From: Walter Banks on


Mok-Kong Shen wrote:
>
> This is a layman's question: if I'm not mistaken, multi-precision integer
> arithmetic is done with algorithms employing a single accumulator. Would
> parallel processing exploiting multiple accumulators be advantageous
> and, if so, could someone give some references?
>

The question doesn't have a completely clear answer. My experience
suggests that on current processors multi-precision integer
arithmetic would normally be done on a single ALU. This is not a
direct answer to your question. An implementation may, however, use
different processor registers for the LS and MS calculations if there
is a time advantage in doing so. The main advantage of using multiple
registers connected to the same ALU is to prepare the second set of
calculations while waiting for the results of the first.

With current processors multiprecision calculations would stay on
a single ALU, because it would take too much time to transfer the
condition-code (carry) information between ALUs.
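That carry chain can be sketched in a few lines. The limb width and
helper names below are illustrative, not taken from any particular
library; the point is only that each limb's result depends on the carry
from the limb below it:

```python
# Sketch of why multiprecision addition is inherently serial: each limb's
# sum depends on the carry (the "condition code") produced by the limb
# below, so the limbs cannot be computed independently on separate ALUs.

LIMB_BITS = 32
LIMB_MASK = (1 << LIMB_BITS) - 1

def mp_add(a_limbs, b_limbs):
    """Add two equal-length little-endian limb arrays; return (limbs, carry)."""
    result = []
    carry = 0
    for a, b in zip(a_limbs, b_limbs):
        s = a + b + carry             # carry from the previous limb feeds in here,
        result.append(s & LIMB_MASK)  # so limb i cannot start before limb i-1
        carry = s >> LIMB_BITS        # finishes: a serial carry chain.
    return result, carry

def to_limbs(n, count):
    """Split a non-negative integer into `count` little-endian limbs."""
    return [(n >> (LIMB_BITS * i)) & LIMB_MASK for i in range(count)]

def from_limbs(limbs):
    """Reassemble an integer from little-endian limbs."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))
```

The loop makes the data dependency explicit: `carry` threads through every
iteration, which is exactly the condition-code traffic that keeps the
calculation on one ALU.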

There are quite a few considerations when deciding whether to create
a parallel implementation. The first is a fundamental decision: when
parallelizing an application, do you parallelize the program at the
function level or at the instruction level (coarse- or fine-grained
parallelism, respectively)?

For the last decade or so many researchers have been working on
fine-grained parallelism, where applications distribute a function for
execution across several execution units. This can happen in two
places. It can happen during software creation, where the code
generation tools create co-operating execution threads on more than
one processor. It can happen in a similar way automatically in some
processors during code execution. In the latter case, for example, when
a conditional branch is executed both paths are executed on separate
processors until the branch is resolved, at which point the results of
one of the execution paths are discarded. Randy Allen and others have
published several papers and books on identifying blocking cases
for separating out fine-grained execution.

Coarse-grained execution is essentially breaking an application
at the function level and making function calls between processors.
At one level we see that happen when a system is implemented that
uses a co-processor programmed to perform some specialized set
of functions. There are several ways this can be implemented.
In the simplest case it can be done by creating separate
application code for each processor. This was how we implemented
the four-PDP-11 computer network protocol simulator in the mid '70s
(yep, it's been around for a long time). By the mid '80s some
personal computers were using separate processors to do detailed
work on peripherals, like the processors found in keyboards or
mice that handled key rollover, auto-repeat, quadrature
accumulation, and button presses. Applications can be split in a
similar way to execute in parallel, either through an execution
environment like IEC 61131 or IEC 61499 block-structured programming,
or through tools that create multiprocessor execution from a
single source.
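A minimal sketch of that function-level split, using threads as
stand-ins for the separate peripheral processors; all function names
and the workloads are hypothetical, chosen only to mirror the
keyboard/mouse example above:

```python
# Illustrative sketch of coarse-grained parallelism: the application is
# split at function boundaries and whole function calls are farmed out
# to separate workers (threads here stand in for separate processors).
from concurrent.futures import ThreadPoolExecutor

def handle_keyboard(events):
    # Stand-in for a keyboard controller's job (rollover, auto-repeat, ...).
    return [e.upper() for e in events]

def handle_mouse(deltas):
    # Stand-in for quadrature accumulation on a mouse controller.
    x = y = 0
    for dx, dy in deltas:
        x += dx
        y += dy
    return (x, y)

def main_application():
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Each "processor" runs its own function-level unit of work.
        kb = pool.submit(handle_keyboard, ["a", "b", "c"])
        ms = pool.submit(handle_mouse, [(1, 2), (3, -1)])
        return kb.result(), ms.result()
```

The granularity is the whole function call, not individual
instructions, which is what distinguishes this from the fine-grained
case described earlier.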

A single source can be turned into multiprocessor execution
automatically by some compiler tools. Byte Craft has developed a number
of algorithms and processes to do this in some of our tools. In
automotive engine controllers the divisions between processors in
a heterogeneous system are well defined, and when compiling for one
processor an interprocessor interface was exported as part of the
build of the second processor. In many consumer applications
the compilers used some form of geographical center of reference to
automatically or semi-automatically divide an application over several
processors. Byte Craft has done a lot of coarse-grained parallel
development tools work.


Regards,

--
Walter Banks
Byte Craft Limited
http://www.bytecraft.com

--- news://freenews.netfront.net/ - complaints: news(a)netfront.net ---
From: Mok-Kong Shen on
Walter Banks wrote:
[snip]
> With current processors multiprecision calculations would stay on
> a single ALU, because it would take too much time to transfer the
> condition-code information between ALUs.
[snip]

Just an additional question for my clear understanding: for the special
case of multiplication of two large integers, does that mean that the
cost of the hardware needed for parallel processing isn't justified by
the net processing efficiency gained?

M. K. Shen
From: Walter Banks on


Mok-Kong Shen wrote:
>
> Walter Banks wrote:
> [snip]
> > With current processors multiprecision calculations would stay on
> > a single ALU, because it would take too much time to transfer the
> > condition-code information between ALUs.
> [snip]
>
> Just an additional question for my clear understanding: for the special
> case of multiplication of two large integers, does that mean that the
> cost of the hardware needed for parallel processing isn't justified by
> the net processing efficiency gained?

Yes.

However, multiply may be a bad example, because there are some very fast
multipliers and partial products don't need to wait for condition codes.
It would probably work well in a fine-grained parallel environment.

Depending on the application and processor, a pipelined software
implementation might also be very efficient.
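The independence of the partial products can be sketched with a
schoolbook multiply; the limb width below is illustrative:

```python
# Sketch of why multiply parallelizes better than add: each partial
# product a_i * b_j is independent of the others (no carry/condition-code
# chain between them), so they could all be formed concurrently; only the
# final accumulation is serial.

LIMB_BITS = 32
LIMB_MASK = (1 << LIMB_BITS) - 1

def to_limbs(n, count):
    """Split a non-negative integer into `count` little-endian limbs."""
    return [(n >> (LIMB_BITS * i)) & LIMB_MASK for i in range(count)]

def mp_mul(a_limbs, b_limbs):
    """Schoolbook multiply of limb arrays via independent partial products."""
    partials = []
    for i, a in enumerate(a_limbs):
        for j, b in enumerate(b_limbs):
            # Each of these terms depends only on one limb pair.
            partials.append((a * b) << (LIMB_BITS * (i + j)))
    return sum(partials)  # the only serial step is this final accumulation
```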


Regards,


Walter..
--
Walter Banks
Byte Craft Limited
http://www.bytecraft.com

From: Ira Baxter on

"Mok-Kong Shen" <mok-kong.shen(a)t-online.de> wrote in message
news:hub3ak$i34$02$1(a)news.t-online.com...
> Walter Banks wrote:
> [snip]
>> With current processors multiprecision calculations would stay on
>> a single ALU, because it would take too much time to transfer the
>> condition-code information between ALUs.
> [snip]
>
> Just an additional question for my clear understanding: for the special
> case of multiplication of two large integers, does that mean that the
> cost of the hardware needed for parallel processing isn't justified by
> the net processing efficiency gained?
>
> M. K. Shen

For really large multiprecision multiplication, you might be able to
partition a single multiplication into computing partial products in
(task) parallel, use an adder tree to reduce/sum the partial products,
and get some performance.
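A hedged sketch of that idea, with an arbitrary chunk size and Python
threads standing in for task-parallel workers:

```python
# Partition one large multiplication into partial products computed in
# (task) parallel, then combine them with a balanced adder tree. The
# chunk size and executor choice are illustrative, not prescriptive.
from concurrent.futures import ThreadPoolExecutor

CHUNK_BITS = 64

def split(n, chunks):
    """Split a non-negative integer into little-endian fixed-width chunks."""
    mask = (1 << CHUNK_BITS) - 1
    return [(n >> (CHUNK_BITS * i)) & mask for i in range(chunks)]

def big_mul(x, y, chunks=4):
    xs, ys = split(x, chunks), split(y, chunks)
    # One job per chunk pair: each partial product is independent work.
    jobs = [(i + j, a, b) for i, a in enumerate(xs) for j, b in enumerate(ys)]
    with ThreadPoolExecutor() as pool:
        partials = list(
            pool.map(lambda t: (t[1] * t[2]) << (CHUNK_BITS * t[0]), jobs)
        )
    # Adder tree: pairwise reduction, log2(len(partials)) levels deep,
    # rather than one long serial accumulation.
    while len(partials) > 1:
        partials = [sum(partials[k:k + 2]) for k in range(0, len(partials), 2)]
    return partials[0]
```

In CPython the threads won't actually overlap on pure-integer work, so
this is a shape sketch only; the same structure maps onto real task
parallelism in a language or runtime without that restriction.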

If I were doing lots of this kind of arithmetic, I'd be tempted instead to
find a way to call the multiprecision routines as units of parallel work.
They are big enough to make nice units of parallelism.

We have a parallel programming language, PARLANSE
(www.semanticdesigns.com/Products/PARLANSE), that is designed to
handle modest-size units of code in task parallel for lots
of combinations (pure parallel, partial order, arbitrary forking).
It just so happens to have a multiprecision library (integers and
rationals), and in fact calls on that library *do* get used in parallel
in our everyday use of PARLANSE for program-analysis purposes. In
practice we don't see a lot of instances of this occurring, but it
does occur.

-- IDB