From: Nick Maclaren on

In article <puGdncja4PrkpqDYnZ2dnUVZ_rGdnZ2d(a)comcast.com>,
"Chris Thomasson" <cristom(a)comcast.net> writes:
|>
|> > But I can easily believe that there is such a barrier at the hardware
|> > level for that CPU. Let's assume that, until and unless we see evidence
|> > to the contrary.
|>
|> So far, so good... However, IMHO, it seems that this kind of stuff could be
|> fairly easily documented in an explicit fashion... Humm...

The reason that it is not is that hardware vendors, quite reasonably, do
not want to handicap their CPU implementors by baking too many of the
nasty details into the architecture. In my view, they could do better
by specifying an abstract architectural model together with more explicit
controls, but even that would not eliminate the problem.

|> Well, Okay... Let's say I read the arch docs for NewCPUFoo and they happen
|> to explicitly and clearly state that if you want loads/stores to be atomic,
|> you simply have to ensure that the variable you are loading from or storing
|> to is exactly equal to the size of a system pointer, and must be aligned on
|> a boundary that is a multiple of the size of a system pointer. For example,
|> a 32-bit pointer would mean that the variable has to be exactly 32-bits
|> wide, and it has to be aligned and be all by itself on a boundary that is a
|> multiple of 32-bits...

There are two problems with that description, which are related:

1) Most architectures were written for serial codes and CPUs, and the
designers do not want to constrain them, or break compatibility, by
imposing restrictions that matter only when multiple threads run on
multiple CPUs.

2) A lot of 64-bit objects need to be only 32-bit aligned, and most
128-bit objects need to be only 64-bit aligned.
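To make the quoted rule concrete, here is a sketch in C of the check it
implies. NewCPUFoo, the rule, and the function name are all part of the
hypothetical under discussion, not a guarantee of any real architecture
- and point 2) above is precisely that real architectures are laxer:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical NewCPUFoo rule: a plain load or store is atomic only if
 * the object is exactly one pointer wide and sits on a pointer-sized
 * boundary.  Illustrative only. */
int is_plain_access_atomic(const void *addr, size_t size)
{
    if (size != sizeof(void *))                 /* exactly pointer-sized */
        return 0;
    if ((uintptr_t)addr % sizeof(void *) != 0)  /* pointer-aligned */
        return 0;
    return 1;
}
```

On a 64-bit system this accepts an aligned 8-byte object and rejects
anything smaller, larger, or misaligned - including the 32-bit-aligned
64-bit objects of point 2).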


Regards,
Nick Maclaren.
From: Nick Maclaren on

In article <4q4ledFlff5uU1(a)individual.net>,
Del Cecchi <cecchinospam(a)us.ibm.com> writes:
|>
|> If I interpret this article http://www.linuxjournal.com/article/8211
|> correctly, expecting those stores to not be reordered as seen from
|> another cpu is unrealistic unless steps are taken in the software to
|> make it so. It is realistic to expect them to occur in order as seen
|> from the cpu where they originate.

Further on that: the article seems to be entirely about banking, but I
suspect that interrupt handling is actually a cause of the more obscure
problems. Yes, I know that it is my pet hobby-horse, but that is the
result of experience with stepping in its excrement [*] :-(

Consider a simple, serial CPU executing a series of memory accesses,
one of which gets a TLB miss. Few architectures say that TLB miss
handling is sequentially consistent with ALL memory operations, but
I suspect that it generally is. Certainly, my measurements of its
cost are consistent with that.

But, if an implementation gets that wrong, we have a really lethal trap,
because it means that a TLB miss on a location used in inter-CPU
communication will SOMETIMES result in an unexpected access reversal.
And that is what I am pretty sure I have seen - but was this the cause,
or was it just a banking issue?

Now let's move on to ECC handling. The interrupt can no longer be
taken when the instruction is executed, but has to be delayed until
the memory is actually accessed. And what does it do then? That is
NEVER documented in architectures, as far as I know, and obviously it
can cause an access reversal unless it is synchronised.

But at least decent systems log such things (unlike TLB misses!), so
they can be cross-checked with unexpected problems. I have never seen
such a correlation.
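In software terms, the "steps" Del alludes to amount to a release/acquire
pairing (or an explicit fence) around the communication. A minimal sketch
in C11 atomics, with names of my own invention; the point is that without
the ordering annotations, another CPU may legitimately observe the flag
before the payload:

```c
#include <stdatomic.h>

static atomic_int payload;   /* the data being published */
static atomic_int ready;     /* the flag another CPU polls */

/* Run on CPU A: the release store forbids the payload store from
 * being reordered after the flag store, as seen by other CPUs. */
void publish(int value)
{
    atomic_store_explicit(&payload, value, memory_order_relaxed);
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Run on CPU B: the acquire load pairs with the release store, so
 * if we see ready == 1 we are guaranteed to see the payload too. */
int try_consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return 0;                     /* not published yet */
    *out = atomic_load_explicit(&payload, memory_order_relaxed);
    return 1;
}
```

Without the release/acquire pair, the stores occur in order only as seen
from the CPU where they originate, which is exactly the article's point.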


[*] Which is not quite as rare as that of rocking horses :-(


Regards,
Nick Maclaren.
From: Joe Seigh on
Nick Maclaren wrote:
> In article <4q4ledFlff5uU1(a)individual.net>,
> Del Cecchi <cecchinospam(a)us.ibm.com> writes:
> |>
> |> If I interpret this article http://www.linuxjournal.com/article/8211
> |> correctly, expecting those stores to not be reordered as seen from
> |> another cpu is unrealistic unless steps are taken in the software to
> |> make it so. It is realistic to expect them to occur in order as seen
> |> from the cpu where they originate.
>
> Thanks for finding that; at a quick glance, I agree, and it justifies
> a more thorough perusal. The performance issue is why I have always
> been somewhat suspicious of salesmen and others who claim complete
> sequential consistency on the memory of a modern SMP system. I can't
> think of how it can be done, efficiently, even in theory ....
>

Transactional memory, aka magic at this point.


--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.
From: Chris Thomasson on
"Joe Seigh" <jseigh_01(a)xemaps.com> wrote in message
news:hvGdnXAAUoqlaqDYnZ2dnUVZ_rGdnZ2d(a)comcast.com...
> Nick Maclaren wrote:
>> In article <4q4ledFlff5uU1(a)individual.net>,
>> Del Cecchi <cecchinospam(a)us.ibm.com> writes:
>> |> If I interpret this article http://www.linuxjournal.com/article/8211
>> |> correctly, expecting those stores to not be reordered as seen from
>> |> another cpu is unrealistic unless steps are taken in the software to
>> |> make it so. It is realistic to expect them to occur in order as seen
>> |> from the cpu where they originate.
>>
>> Thanks for finding that; at a quick glance, I agree, and it justifies
>> a more thorough perusal. The performance issue is why I have always
>> been somewhat suspicious of salesmen and others who claim complete
>> sequential consistency on the memory of a modern SMP system. I can't
>> think of how it can be done, efficiently, even in theory ....
>>
>
> Transactional memory, aka magic at this point.

The cache coherency mechanism that could properly support a robust
transactional memory scheme would have to be really strict IMHO... Think
of scenarios in which there are lots of reader threads iterating over
large shared linked data-structures in parallel... The transactional
coherency system could wind up having to track the many thousands of read
transactions generated by the frequently reading threads... It would die
from livelock; ahh, the explicit contention manager to the rescue! So,
when the times get tough for TM, it is basically forced to fold under its
security blanket and wait for things to cool way down for a while... This
is because TM behaves rather like obstruction-free algorithms... One
small, trivial interference from another thread, and the operation dies
and asks the contention manager if it can retry...

I don't think TM can scale very well at all wrt the scenario I just
briefly described...
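Here is a seqlock-style sketch in C11 of the validate-and-retry pattern
that makes this pathological: every reader must re-run its whole
"transaction" whenever any writer interferes. The names and structure
are purely illustrative, not any particular TM implementation:

```c
#include <stdatomic.h>

static atomic_uint version;     /* even = quiescent, odd = writer active */
static atomic_int  shared_data; /* stand-in for the shared structure */

/* Writer: bumping the version invalidates every in-flight reader. */
void tm_write(int value)
{
    atomic_fetch_add(&version, 1);      /* -> odd: "transaction open" */
    atomic_store(&shared_data, value);
    atomic_fetch_add(&version, 1);      /* -> even: "committed" */
}

/* Reader "transaction": snapshot, then validate; any interference
 * from a writer aborts the attempt, and we retry from scratch. */
int tm_read(void)
{
    unsigned before, after;
    int snapshot;

    do {
        before   = atomic_load(&version);
        snapshot = atomic_load(&shared_data);
        after    = atomic_load(&version);
    } while ((before & 1u) != 0 || before != after);   /* abort + retry */

    return snapshot;
}
```

With thousands of readers in flight, a single writer aborts all of them
at once; that is the livelock-prone behaviour the contention manager is
supposed to paper over.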


Any thoughts on this?



http://groups.google.com/group/comp.programming.threads/browse_frm/thread/f6399b3b837b0a40

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/9c572b709248ae64


:O




From: Nick Maclaren on

In article <hvGdnXAAUoqlaqDYnZ2dnUVZ_rGdnZ2d(a)comcast.com>,
Joe Seigh <jseigh_01(a)xemaps.com> writes:
|> >
|> > Thanks for finding that; at a quick glance, I agree, and it justifies
|> > a more thorough perusal. The performance issue is why I have always
|> > been somewhat suspicious of salesmen and others who claim complete
|> > sequential consistency on the memory of a modern SMP system. I can't
|> > think of how it can be done, efficiently, even in theory ....
|>
|> Transactional memory, aka magic at this point.

Sigh. Indeed :-(

Like 90% of the other problems with getting parallelism right in
hardware, the real problem is moving people to a coding paradigm in
which the problem is soluble. I don't see any difficulty in implementing
that model for BSP, transactional databases etc.!

But, to do it for C++ ....


Regards,
Nick Maclaren.