Larrabee delayed: anyone know what's happening? [Computer Architecture]


From: Terje Mathisen on 26 Dec 2009 11:04

nmm1(a)cam.ac.uk wrote:
> In article<jg0d07-mtb.ln1(a)ntp.tmsw.no>,
> Terje Mathisen<"terje.mathisen at tmsw.no"> wrote:
>> Andy "Krazy" Glew wrote:
>>>>>
>>>>> I.e. I am suspecting that full cache coherency is overkill, but that
>>>>> completely eliminating cache coherency is underkill.
>
> That is probably correct.
>
>>>> I agree, and I think most programmers will be happy with word-size
>>>> tracking, i.e. we assume all char/byte operations happen on private
>>>> memory ranges.
>>>
>>> Doesn't that fly in the face of the Alpha experience, where originally
>>> they did not have byte memory operations, but were eventually forced to?
>>>
>>> Why? What changed? What is different?
>>
>> What's different is easy:
>>
>> I am not proposing we get rid of byte-sized memory operations, "only"
>> that we don't promise they will be globally consistent, i.e. you only
>> use 8/16-bit operations on private memory blocks.
>
> And it is that which flies in the face of the Alpha experience. Sorry,
> Terje, but you have (unusually for you) missed the key requirements.
>
>>> I suspect that there is at least some user level parallel code that
>>> assumes byte writes are - what is the proper term? Atomic? Non-lossy?
>>> Not implemented via a non-atomic RMW?
>>
>> There might be some such code somewhere, but only by accident, I don't
>> believe anyone is using it intentionally: You want semaphores to be
>> separated at least by a cache line if you care about performance, but I
>> guess it is conceivable some old coder decided to pack all his lock
>> variables into a single byte range.
>
> Andy is right, I am afraid. You aren't thinking of the right problem;
> it's not the atomic/synchronisation requirements that are the issue.

As soon as you let multiple cpus access the same cache line at the same
time, you have a serious performance problem, which is why I suggested
that only in the case of atomic/synch primitives should you ever do it.
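
A minimal sketch of the layout point above, not taken from the thread: it checks which lock variables would land on the same cache line when packed byte-by-byte versus padded to one line each. The 64-byte line size, the line-aligned base address, and all function names are illustrative assumptions; real line sizes vary by hardware.

```python
CACHE_LINE = 64  # assumed line size in bytes; real hardware varies


def cache_line_of(offset, line_size=CACHE_LINE):
    """Index of the cache line holding a byte offset (base assumed line-aligned)."""
    return offset // line_size


def lock_offsets(n_locks, stride):
    """Byte offsets of n_locks lock variables laid out `stride` bytes apart."""
    return [i * stride for i in range(n_locks)]


def shared_lines(offsets, line_size=CACHE_LINE):
    """Return the set of cache lines that hold more than one lock."""
    counts = {}
    for off in offsets:
        line = cache_line_of(off, line_size)
        counts[line] = counts.get(line, 0) + 1
    return {line for line, n in counts.items() if n > 1}


# Eight one-byte locks packed into a single byte range.
packed = lock_offsets(8, stride=1)
# Eight locks, each given its own cache line.
padded = lock_offsets(8, stride=CACHE_LINE)

print("packed, contended lines:", shared_lines(packed))
print("padded, contended lines:", shared_lines(padded))
```

With the packed layout every lock shares line 0, so CPUs taking different locks still fight over one line; with the padded layout no line holds more than one lock.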

I accept however that if both you and Andy think this is bad, then it
probably isn't such a good idea to allow programmers to be surprised by
the difference between one size of data objects and another, both of
which can be handled inside a register and with size-specific load/store
operations available.
:-(

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From: nmm1 on 26 Dec 2009 11:44

In article <2poh07-rui.ln1(a)ntp.tmsw.no>,
Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>
>As soon as you let multiple cpus access the same cache line at the same
>time, you have a serious performance problem, which is why I suggested
>that only in the case of atomic/synch primitives should you ever do it.

If any access is for update, and with a few reservations, I agree.

>I accept however that if both you and Andy think this is bad, then it
>probably isn't such a good idea to allow programmers to be surprised by
>the difference between one size of data objects and another, both of
>which can be handled inside a register and with size-specific load/store
>operations available.
>:-(

No, that's NOT my position! I am not disagreeing with your approach,
not for a minute, but merely saying that it's NOT an issue for the
hardware architecture. To get there, we need radical changes in the
language architectures, so that they don't rely on separate bytes
being independent.
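
A minimal sketch of the hazard being discussed, not taken from the thread: it simulates two CPUs that each store to a different byte of the same 32-bit word on a machine where a byte store is implemented as a non-atomic load-word / merge / store-word sequence. The memory model, the fixed interleaving, and all function names are hypothetical and chosen only to show the lost update deterministically.

```python
WORD_BYTES = 4  # assumed word size for this model


def load_word(mem, byte_addr):
    """Load the whole word that contains byte_addr."""
    return mem[byte_addr // WORD_BYTES]


def merge_byte(word, byte_addr, value):
    """Return `word` with the byte at byte_addr replaced by `value`."""
    shift = 8 * (byte_addr % WORD_BYTES)
    return (word & ~(0xFF << shift) & 0xFFFFFFFF) | ((value & 0xFF) << shift)


def store_word(mem, byte_addr, word):
    """Store a whole word back to the word that contains byte_addr."""
    mem[byte_addr // WORD_BYTES] = word


def read_byte(mem, byte_addr):
    shift = 8 * (byte_addr % WORD_BYTES)
    return (mem[byte_addr // WORD_BYTES] >> shift) & 0xFF


def racy_byte_stores():
    """CPU A writes byte 0, CPU B writes byte 1; both load before either stores."""
    mem = [0]
    word_a = load_word(mem, 0)  # CPU A loads the word
    word_b = load_word(mem, 1)  # CPU B loads the same word
    store_word(mem, 0, merge_byte(word_a, 0, 0xAA))  # A stores its merge
    store_word(mem, 1, merge_byte(word_b, 1, 0xBB))  # B overwrites A's byte
    return read_byte(mem, 0), read_byte(mem, 1)


def serial_byte_stores():
    """Same two stores, but each read-modify-write completes before the next."""
    mem = [0]
    store_word(mem, 0, merge_byte(load_word(mem, 0), 0, 0xAA))
    store_word(mem, 1, merge_byte(load_word(mem, 1), 1, 0xBB))
    return read_byte(mem, 0), read_byte(mem, 1)


print("racy:  ", [hex(b) for b in racy_byte_stores()])
print("serial:", [hex(b) for b in serial_byte_stores()])
```

In the racy interleaving CPU A's byte is silently lost even though the two CPUs never touched the same byte. A language that treats adjacent bytes as independent objects cannot tolerate that, and no lock or atomic is involved.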

I can think of how it would be done, but not starting from C/C++,
Java or even Fortran. There might be a chance starting from Haskell,
IBM X10 or even Python, but I should need to study some aspects in
more detail to be sure.

Actually, I am being too hard on Fortran. I think that it could be
done starting from modern Fortran, but the restrictions would be such
that I don't think that Fortran programmers would accept the result.

Regards,
Nick Maclaren.

From: Robert Myers on 26 Dec 2009 14:55

On Dec 26, 6:44 am, n...(a)cam.ac.uk wrote:

> Most modern disks are actually disk subsystems, and have massive
> buffers in the 'disk' itself - and, in this context, even a few
> tracks is 'massive'. With IBM channels, you could get performance
> without ANY extra buffering. That is one of the reasons that, for
> some decades, old, 'slow' mainframes beat the hell out of the new,
> 'fast' RISC systems. I.e. during that period, memory wasn't cheap
> enough to put the buffering in with the disk. Now it is, so what
> the hell?

Well, for me at least, you have answered a question that I didn't know
how to formulate correctly. Thank you, by the way, for spelling my
name correctly.

Actually, you answered several questions.

IBM tended to price its hardware by some measure of performance so
that it could produce ads that showed that its uber-expensive
mainframes were not merely cost-competitive, but *less* expensive than
competing unix hardware. In fact, I'm pretty sure that IBM's pricing
model has always been driven so that they could stay in the ballpark
or even win. You didn't even have to pay extra for the legendary
reliability, or at least not much extra.

How, I always wondered, could they do that, and I think you just
answered that question. For the disk-intensive benchmarks in
question, IBM off-loaded the pain from the uber-expensive CPU time to
the channels. IBM learned the "let it stall" wisdom long before
Patterson made it explicit. As to how much of the win was real (the
system was faster in transactions per real time and total cost of
ownership because it was so much more efficient about disk I/O) and
how much of it was choice of benchmark and pricing models is beyond
the scope of this forum.

The other question you answered is: what was so magic about channels.
I think you have correctly identified a big piece of the actual magic,
while at the same time acknowledging that the magic isn't so powerful
any more.

As to my ability to flatter, I can only say that I have had the
opportunity to learn from the best.

Robert.

From: Anne & Lynn Wheeler on 26 Dec 2009 15:58

Robert Myers <rbmyersusa(a)gmail.com> writes:
> IBM tended to price its hardware by some measure of performance so
> that it could produce ads that showed that its uber-expensive
> mainframes were not merely cost-competitive, but *less* expensive than
> competing unix hardware. In fact, I'm pretty sure that IBM's pricing
> model has always been driven so that they could stay in the ballpark
> or even win. You didn't even have to pay extra for the legendary
> reliability, or at least not much extra.

two decades ago ... one of the senior people in san jose disk division
got a talk scheduled at the annual world-wide internal communication
division conference. He started off the talk saying that the head of the
communication business was going to be responsible for the demise of the
disk division.

early in the introduction of PC ... something that contributed
significantly to early uptake was 3270 terminal emulation ... basically
a corporation that already business justified tens of thousands of
3270s, could get a PC for about the same price, and in single desktop
footprint, do both 3270 to mainframe operation as well as some local
computing (almost no brainer business justification ... more function
for same price as something that was already justified).

moving later into the decade ... the communication group had large
terminal emulation install base that it was attempting to protect
... however the technology was moving on ... and the terminal emulation
paradigm was becoming a major bottleneck between all the desktops and
the datacenter. as a result ... data was leaking out of the datacenter
at an alarming rate ... significantly driving commodity desktop and
server disk market.

the disk division had attempted to bring a number of products to market
that would have provided channel-speed like thruput and a lot more
function between the desktops and the datacenter (attempting to maintain
role for the datacenter in modern distributed environment) ... but was
constantly blocked by the communication business unit (attempting to
preserve the terminal emulation install base). misc. past posts
mentioning terminal emulation
http://www.garlic.com/~lynn/subnetwork.html#emulation

this is somewhat related to earlier battles that my wife had with the
communication group when she was con'ed into going to POK (center of
high-end mainframe) to be in charge of loosely-coupled architecture.
She was constantly battling with the communication group over using
their terminal-oriented products for high-speed multiple processor
operation. They would have temporary truce where she would be allowed to
use whatever she wanted within the walls of the datacenter ... but the
communication group's terminal-oriented products had to be used for
anything that crossed the datacenter walls. misc. past post
mentioning my wife doing stint in POK in charge of loosely-coupled
architecture
http://www.garlic.com/~lynn/submain.html#shareddata

... anyway ... and so it came to pass ... san jose disk division is long
gone.

--
40+yrs virtualization experience (since Jan68), online at home since Mar1970

From: Robert Myers on 26 Dec 2009 17:34

On Dec 26, 3:58 pm, Anne & Lynn Wheeler <l...(a)garlic.com> wrote:

> ... data was leaking out of the datacenter
> at an alarming rate ... significantly driving commodity desktop and
> server disk market.

Not having to deal with RJE emulation and HASP was more important to
bringing computing in-house than was the cost of computation. Even if
I had to do a computation on a Cray, I wanted the data on my own
hardware as quickly as possible, to end the back and forth.

Robert.
