From: Roger Ivie on
On 2010-04-26, Quadibloc <jsavard(a)ecn.ab.ca> wrote:
> On Apr 26, 7:03 am, Bernd Paysan <bernd.pay...(a)gmx.de> wrote:
>
>> I remember that the HP Fortran compiler compiled a hand-optimized matrix
>> multiplication whenever it found something resembling a matrix
>> multiplication (more than 15 years ago), and I'm quite ok with that
>> approach.
>
> Well, I'm not. Not because it's cheating on benchmarks. But because it
> should only replace Fortran code with a routine that performs a matrix
> multiplication if, in fact, what it found *really is* a matrix
> multiplication.

I have actually been fighting just this sort of battle on an Itanium
machine. Not specifically matrix multiplication, but...

In my situation, we're doing real-time work on an Itanium VMS box using
FORTRAN code that's been around since VAX-11/750s roamed the earth.
Since the only thing these particular boxes do is run our application,
we do things like create global sections to specific hardware addresses
to allow our FORTRAN code to get at the registers.

Had a bit of trouble a while ago with a Bit3 PCI-to-VMEbus adapter:
code that creates mappings on the VMEbus was unable to map more than
one region. The code worked by walking the scatter/gather map looking
for an unused region in which the mapping could be performed. In this
specific case, that meant walking through an array of longwords
looking for an entry with bit 0 clear.
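In outline, the scan is just this sort of loop (a minimal sketch with
made-up names, not the actual driver code):

! Hypothetical sketch: find the first scatter/gather map register
! with bit 0 clear, i.e. one not currently holding a mapping.
integer*4 function find_free( map, n )
implicit none
integer*4 n, i
integer*4, volatile :: map(n)
find_free = 0                ! zero means no free register found
do i = 1, n
   if ( iand( map(i), 1 ) .eq. 0 ) then
      find_free = i
      return
   end if
end do
end function find_free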

The array is declared as INTEGER*4,volatile:: (although the code is
FORTRAN and has been around since the /750, that doesn't mean it
hasn't had a few facelifts over the years).

The Itanium compiler noticed that I was only looking at bit 0, so it
performed *byte* fetches from the scatter/gather map.

The Bit3 hardware doesn't support byte accesses to the map. I suspect it
uses the size of an access to decide between the CSR data path (bytes
only) and the scatter/gather map data path (longwords). As a result, I
was seeing 0x0f (an unaddressed byte CSR) always returned for the first
map register, resulting in my code always believing the first register
was available.

Similarly, once it's allocated a chunk of map registers, it clears
them to mark them as in use. This involves walking through the array,
plunking a zero in each longword.
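The loop itself is about as plain as FORTRAN gets (again a sketch with
made-up names):

! Hypothetical sketch: plunk a zero in each longword of the chunk.
do i = first, last
   map(i) = 0
end do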

The compiler noticed that this was a block clear and replaced my code
with an unrolled block clear loop that did either byte or word clears,
depending on alignment.

Furthermore, the compiler saw through all my simple-minded attempts to
trick it. And compiling it /NOOPTIMIZE didn't fix the first problem,
which involved using a byte fetch to snag only the "interesting" portion
of a longword.

I wound up having to do map accesses through a function similar to this:

! Force a full longword read of a volatile location.
integer*4 function peek( address )
implicit none
integer*4, volatile :: address
peek = address
end function peek
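Call sites then read the map only through it, e.g. val = peek( map(i) ).
Presumably this works because, as long as peek isn't inlined, the
volatile longword dummy argument forces the compiler to issue a full
32-bit load.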

But this makes me worry about *all* of the other CSR accesses in the
system, *especially* those that go through the scatter/gather map to a
bus that has another byte order. Using the *one* byte-swapping mode
that makes stuff at the other end of the system look enough like
memory to tolerate changes in access size still moves the bits around.
--
roger ivie
rivie(a)ridgenet.net
From: Rick Jones on
Anne & Lynn Wheeler <lynn(a)garlic.com> wrote:
> HP: last Itanium man standing
> http://www.theregister.co.uk/2010/04/26/itanium_hp_last_standing/

> from above:

> Make no mistake: If Hewlett-Packard had not coerced chip maker Intel
> into making Itanium into something it never should have been, the
> point we have come to in the history of the server business would
> have got here a hell of a lot sooner than it has. But the flip side
> is that a whole slew of chip innovation outside of Intel might never
> have happened.

I read the article, but clearly not closely enough - what was it that
HP coerced Intel into making Itanium into that it never should have
been?

Also, the bit about emulators is a little off:

"(The Itanium chips had an x86 emulator, you will remember, and also
emulated some PA-RISC instructions that HP-UX needed)"

I have never been a HW guy, but I don't recall there being any sort of
PA-RISC instruction emulation in Itanium chips. There is the Aries
PA-RISC emulator *SW* available with HP-UX to allow customers to run
PA-RISC binaries.

rick jones
--
The glass is neither half-empty nor half-full. The glass has a leak.
The real question is "Can it be patched?"
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...
From: MitchAlsup on
On Apr 26, 12:43 am, Brett Davis <gg...(a)yahoo.com> wrote:
> In article
> <b24c8bb2-fcc3-4f4a-aa0d-0d18601b0...(a)11g2000yqr.googlegroups.com>,

>  MitchAlsup <MitchAl...(a)aol.com> wrote:
> > I think there are a number of semi-fundamental issues to be resolved;
>
> > The realization that "one can synchronize" a hundred thousand threads
> > running in a system the size of a basketball court

> ATI chips already have ~2000 processors, simple scaling over the next
> decade states that the monitor in your iMac a decade from now will
> have 100,000 CPUs. Which means that a desktop server will have a
> million CPUs. One for each 10 pixels on your monitor.

These ATI chips are the size of a basketball court? I suspect you mean
pipelines or pipeline stages in a single chip.

The problem I was alluding to was one of size versus speed (equivalent
to time and distance): when the size of the system, measured in signal
propagation time, is massively bigger (>1000X) than the clock period
of a pipeline stage, nothing can happen simultaneously, and getting
everyone to agree on exactly what time it is becomes unsolvable.

BTW a basketball court is about the size of some really large
supercomputer systems, so I was not talking about systems in-the-small
with those assertions.

Mitch
From: MitchAlsup on
On Apr 26, 12:22 pm, Robert Myers <rbmyers...(a)gmail.com> wrote:
> On Apr 25, 10:15 pm, MitchAlsup <MitchAl...(a)aol.com> wrote:
>
>
>
> > Perhaps along with the notion of the "Memory Wall" and the "Power
> > Wall" we have (or are about to) run into the "Multi-Processing" Wall.
> > That is, we think we understand the problem of getting applications
> > and their necessary data and disk structures parallel-enough and
> distributed-enough. And we remain under the impression that we are
> "expression limited" in applying our techniques to the machines that
> > have been built; but in reality we are limited by something entirely
> > more fundamental, and one we do not yet grasp or cannot yet enumerate.
>
> A misbegotten imaginary generalization of the Turing machine is at the
> root of all this, along with a misbegotten conception of how
> intelligence works.
>
> One of these days, we'll recognize a Turing machine as an interesting
> first step, but ultimately a dead end.  Along with it, we'll
> eventually grasp that the entire notion of "programming" is a very
> limiting concept.  Eventually, the idea of a "programmer", as we now
> conceive it, will seem preposterously dated and strange.

I, personally, blame the von Neumann programming model. But it is so
intimately intertwined with the Turing Machine fundamentals that
little is bought by making the distinction.

But questions of blame aside, I entirely agree with you.

> Nature has evolved very sophisticated ways of coding the design for an
> organism that will interact with an environment with certain expected
> characteristics to evolve into a very sophisticated mature organism
> that it is hard to believe arose from such compact code--and it
> didn't.  It evolved from that compact code through interaction with an
> appropriate environment, from which it "learned."

One could say the same about LISP programs....or the proponents of
LISP programs
{LISP = all languages derived from the notions first established by
LISP}

<snip>
> Does any of this have to do with hardware?  I think it does.  So long
> as processes are so limited and clumsy in the way they communicate,
> we'll wind up with machines that are at best an outer product of
> Turing machines.  

It is not just communications; it's the fundamental nature of one step
(instruction) at a time that must die to break out of the Turing/
von Neumann bottleneck. Parallelism in the memory interconnect
(communications mechanism) is entirely stifled by "Memory Models" and
"Cache coherence". Parallelism in system-to-system communication
mechanisms is entirely stifled by physical distance (latency), data
rate (BW), and the notion that I/O is too dangerous for user-level
programs to manage, therefore the OS (device drivers) must do it all,
and this requires synchronization(s) on the scale of "threads in a
box".

Mitch
From: Robert Myers on
Quadibloc wrote:

> On Apr 26, 11:22 am, Robert Myers <rbmyers...(a)gmail.com> wrote:
>
>> One of these days, we'll figure out how to mimic that magic.
>
> Well, "genetic algorithms" are already used to solve certain types of
> problem.
>

Genetic programming is only one possible model.

The current programming model is to tell the computer in detail what to do.

The proposed paradigm is to shift from explicitly telling the computer
what to do to telling the computer what you want and letting it figure
out the details of how to go about it, with appropriate environmental
feedback, which could include human intervention.

This changes what is now programming into more of a systems engineering
problem, since it is unlikely that, in the foreseeable future, computers
will be able to write "programs" without significant help, and telling
the computer what to focus on will remain the province of the human user.

The envisioned outcome is a way of using computers that is less brittle,
that is less sensitive to the kind of timing issues that Mitch has
identified, that is naturally parallel, and that will produce more
reusable "software."

Robert.