From: "Andy "Krazy" Glew" on
On 4/17/2010 3:58 PM, Brett Davis wrote:
> In article<Kw6yn.224427$wr5.103281(a)newsfe22.iad>,
> EricP<ThatWouldBeTelling(a)thevillage.com> wrote:
>

> The future of computing (this decade) is lots of simple in-order CPUs.
> Rules of die size, heat and efficiency kick in. Like ATI chips.

This remains to be seen. I am tempted to say "That is so last decade 200x".

Wrt GPUs, perhaps.

However, in the last few months I have been seeing evidence of a trend the other way:

The ARM Cortex A9 CPU is out-of-order, and is becoming more and more widely used in things like cell phones and iPads.
The last processor from Apple's PA Semi team was a low-power PowerPC.

I suspect that we will soon see out-of-order processors in the Qualcomm SnapDragon family and the Intel Atom family.

Intel has delayed Larrabee, in-order vector SIMD (as opposed to GPU style threaded SIMD). I would not be surprised to
see an out-of-order flavor of such an x86 vector SIMD. De-facto, AVX is that, although not in the same space as Larrabee.

I suspect that we will end up in a bifurcated market: out-of-order for the high performance general purpose computation
in cell phones and other important portable computers, in-order in the SIMD/SIMT/CoherentThreading GPU style
microarchitectures.

The annoying thing about such bifurcation is that it leads to hybrid heterogeneous architectures - and you never know how
much to invest in either half. Whatever resource allocation you make to in-order SIMD vs. ooo scalar will be wrong for
some workloads.

I think that the most interesting thing going forward will be microarchitectures that are hybrids, but which are
homogeneous: where ooo code can run reasonably efficiently on a microarchitecture that can run GPU-style threaded SIMD /
Coherent threading as well. Or vice versa. Minimizing the amount of hardware that can only be used for one class of
computation.
From: John W Kennedy on
On 2010-04-17 04:44:49 -0400, robin said:
> Now 3 octets will be 9 bits,

Robin, will you /please/ stop blithering on about things you don't
understand?! Buy a Data Processing dictionary, for God's sake!

The rest of this is addressed to the original poster: I don't
understand why you're using int variables for octets; they should be
char. For the rest, I'd do the following:

typedef struct __Pixel {
    unsigned char red, green, blue;
} Pixel;

Pixel src[W][H];
Pixel dest[H][W];

for (int i = 0; i < W; ++i)
    for (int j = 0; j < H; ++j) {
        dest[j][i].red = src[i][j].red;
        dest[j][i].green = src[i][j].green;
        dest[j][i].blue = src[i][j].blue;
    }

I'd also try:

for (int i = 0; i < W; ++i)
    for (int j = 0; j < H; ++j)
        memcpy(&dest[j][i], &src[i][j], sizeof (Pixel));

and see which one is faster; it will depend on the individual compiler.
(Make sure you test at the same optimization level you plan to use.)

--
John W Kennedy
"There are those who argue that everything breaks even in this old dump
of a world of ours. I suppose these ginks who argue that way hold that
because the rich man gets ice in the summer and the poor man gets it in
the winter things are breaking even for both. Maybe so, but I'll swear
I can't see it that way."
-- The last words of Bat Masterson

From: nmm1 on
In article <ggtgp-2839D7.17581617042010(a)news.isp.giganews.com>,
Brett Davis <ggtgp(a)yahoo.com> wrote:
>In article <Kw6yn.224427$wr5.103281(a)newsfe22.iad>,
> EricP <ThatWouldBeTelling(a)thevillage.com> wrote:
>
>> Oh, you are avoiding read after write data
>> dependency pipeline stalls on in-order cpus.
>
>No, nice guess.
>I don't know of any CPU that stalls on a read after write; instead
>they try to forward the data, and in the rare case where a violation
>occurs the CPU will throw an interrupt and restart the instructions.
>This is an important comp.arch point, so someone will correct me
>if I am wrong.

There used to be a fair number, and I suspect there still are.
However, I doubt that stalling when the read and write are on a
single CPU will return. I would expect that at least some multi-core
CPUs stall at least sometimes, because there will not be cache-to-cache
links at all levels, and they will have to wait until the
write reaches the next cache (or memory) with a link.

But that's guessing on the basis of history and general principles,
not actual knowledge.


Regards,
Nick Maclaren.
From: nmm1 on
In article <4BCA775A.8040604(a)patten-glew.net>,
Andy \"Krazy\" Glew <ag-news(a)patten-glew.net> wrote:
>On 4/17/2010 3:58 PM, Brett Davis wrote:
>> In article<Kw6yn.224427$wr5.103281(a)newsfe22.iad>,
>> EricP<ThatWouldBeTelling(a)thevillage.com> wrote:
>>
>
>> The future of computing (this decade) is lots of simple in-order CPUs.
>> Rules of die size, heat and efficiency kick in. Like ATI chips.
>
>This remains to be seen. I am tempted to say "That is so last decade 200x".
>
>Wrt GPUs, perhaps.
>
>However, in the last few months I have been seeing evidence of a trend the other way:
>
>The ARM Cortex A9 CPU is out-of-order, and is becoming more and more widely used in things like cell phones and iPads.
>Apple's PA-Semi's team's last processor was a low power PowerPC.
>
>I suspect that we will soon see out-of-order processors in the Qualcomm SnapDragon family and the Intel Atom family.
>
>Intel has delayed Larrabee, in-order vector SIMD (as opposed to GPU style threaded SIMD). I would not be surprised to
>see an out-of-order flavor of such an x86 vector SIMD. De-facto, AVX is that, although not in the same space as Larrabee.
>
>I suspect that we will end up in a bifurcated market: out-of-order for
>the high performance general purpose computation in cell phones and
>other important portable computers, in-order in the
>SIMD/SIMT/CoherentThreading GPU style microarchitectures.
>
>The annoying thing about such bifurcation is that it leads to hybrid
>heterogeneous architectures - and you never know how much to invest in
>either half. Whatever resource allocation you make to in-order SIMD
>vs. ooo scalar will be wrong for some workloads.

Well, yes, but that's no different from any other choice. As I have
posted before, I favour a heterogeneous design on-chip:

    Essentially uninterruptible, user-mode only, out-of-order CPUs
for applications etc.
    Interruptible, system-mode capable, in-order CPUs for the kernel
and its daemons.

Most programs could be run on either, whichever there was more of,
but affinity could be used to select which. CPUs designed for HPC
would be many:one; ones designed for file serving etc would be
one:many. But all systems would run on all CPUs.


Regards,
Nick Maclaren.
From: bartc on

"John W Kennedy" <jwkenne(a)attglobal.net> wrote in message
news:4bca7f2a$0$22520$607ed4bc(a)cv.net...
> On 2010-04-17 04:44:49 -0400, robin said:
>> Now 3 octets will be 9 bits,
>
> Robin, will you /please/ stop blithering on about things you don't
> understand?! Buy a Data Processing dictionary, for God's sake!
>
> The rest of this is addressed to the original poster: I don't understand
> why you're using int variables for octets; they should be char. For the
> rest, I'd do the following:
>
> typedef struct __Pixel {
>     unsigned char red, green, blue;
> } Pixel;
>
> Pixel src[W][H];
> Pixel dest[H][W];
>
> for (int i = 0; i < W; ++i)
>     for (int j = 0; j < H; ++j) {
>         dest[j][i].red = src[i][j].red;
>         dest[j][i].green = src[i][j].green;
>         dest[j][i].blue = src[i][j].blue;
>     }
....
> memcpy(&dest[j][i], &src[i][j], sizeof (Pixel));

And perhaps dest[j][i] = src[i][j];

But in practice W,H might only be known at runtime, making the code rather
different. Depending on the exact format of the image data, there might also
be padding bytes (nothing to do with C's struct padding), for example at the
end of each row.
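With runtime dimensions and padded rows, the natural shape is a function taking explicit byte strides. A sketch, assuming the common convention that a stride is the byte distance between the starts of consecutive rows (the parameter names are mine, not from the thread):

```c
#include <stddef.h>

typedef struct __Pixel {
    unsigned char red, green, blue;
} Pixel;

/* Transpose a w-by-h pixel image. src_stride and dst_stride are the
   byte distances between row starts; because of padding they need
   not equal w * sizeof (Pixel). */
static void transpose(const unsigned char *src, size_t src_stride,
                      unsigned char *dst, size_t dst_stride,
                      size_t w, size_t h)
{
    for (size_t y = 0; y < h; ++y)
        for (size_t x = 0; x < w; ++x) {
            const Pixel *s = (const Pixel *)(src + y * src_stride) + x;
            Pixel *d = (Pixel *)(dst + x * dst_stride) + y;
            *d = *s;   /* plain struct assignment */
        }
}
```

When src_stride happens to equal w * sizeof (Pixel) this reduces to the fixed-size loops earlier in the thread; the stride form merely skips over any trailing padding bytes in each row.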

--
Bartc