From: Skybuck Flying on
I am thinking 1920x1200x24 bit colors at 60 frames per second.

There is simply no way that any CPU can achieve that with plain x86...
maybe with SSE etc... but I wouldn't hold my breath! ;)

^ Even if it's just a single loop...

Add a few branches in the loop and it's game over 100% for sure ;)

Bye,
Skybuck.



From: Wolfgang Draxinger on
On Thu, 5 Aug 2010 01:31:39 +0200
"Skybuck Flying" <IntoTheFuture(a)hotmail.com> wrote:

> I am thinking 1920x1200x24 bit colors at 60 frames per second.

The following calculation is extremely pessimistic; in the real world
things look much, much better :)

1920 * 1200 * 8 bytes * 60/s = 1,105,920,000 bytes/s
~= 1.106 GByte/s

Let's just assume you need 10 instructions per result for making your
picture. And let's say you're using a modern CPU, clocked at about
2.5GHz. And due to out-of-order execution and pipelining, 4
instructions deliver their results within one clock cycle.

So within one second you get
2.5 * 10^9 * 4 / 10 = 10^9 results

Now let's say you're clever and operate on 32bit registers, so you're
actually producing 4 * 10^9 bytes/s = 4 GByte/s, which is roughly 4
times the data rate needed to fill that screen at 60FPS. In a
pessimistic view.

You may also see it from this perspective: on contemporary CPUs you can
watch FullHD video at 30Hz with no problem whatsoever. That requires
decompressing the stream and sending it to the graphics card (color
space conversion usually happens there, in the video overlay); it's
just that your CPU is then completely bogged down with that task.
That's why modern GPUs help with the decompression, but such help is
not absolutely required to watch FullHD video on a PC.

> There is simply no way that any cpu can achieve that with simply
> x86... very maybe sse etc... but I wouldn't hold my breath ! ;)

2D compositing is so simple that contemporary CPUs can do it with
ease, even if you don't use the fancy SIMD instructions, though of
course those really speed things up.

The problem arises only if the CPU is supposed to do anything else
besides that.

But frankly: the very first SIMD instructions (MMX) have been around
for more than 12 years, and today every x86 CPU supports at least SSE2;
on AMD chips there are also the 3DNow! extensions.

Today you can be pretty sure to have some kind of SIMD available.

> ^ Even if it's just a single loop...

It doesn't matter how deeply you nest the loops, as long as you stay
within the working set. And in some cases (on current CPUs, most
cases!) it's even better to keep the loops and not unroll them.

Today's branch predictors are very, very good at recognising loops.
And unlike switch/if statements, the code path within a loop can be
predicted from the very beginning (that's why loop unrolling works in
the first place). Of course, branching within a loop can cause
problems, if the branch selects between complex pieces of code. But
simple things like

if (a < WHATEVER)
    a = b * c;
else
    a = b / c;

don't even appear to modern CPUs as two different code paths. Heck, a
whole architecture emerged from that observation and made it the core
feature of its instruction set (ARM).

In fact, loop unrolling can even make your program slower. One of the
x264 developers has a nice blog entry on the topic:
http://x264dev.multimedia.cx/?p=201

> Add a few branches in the loop and it's game over 100% for sure ;)

Branches are a problem only if each of the code paths performs a lot
of operations on very different parts of memory; those are real cache
killers. If your branches are short, rejoin quickly and stay within
the working set, you won't even notice they're there, performance-wise.


From: Wolfgang Draxinger on
On Thu, 5 Aug 2010 10:59:12 +0200
Wolfgang Draxinger <wdraxinger(a)darkstargames.de> wrote:

> On Thu, 5 Aug 2010 01:31:39 +0200
> "Skybuck Flying" <IntoTheFuture(a)hotmail.com> wrote:
>
> > I am thinking 1920x1200x24 bit colors at 60 frames per second.
>
> The following calculating is extremely pessimistic, in the real world
> things look much, much better :)

Sorry, I forgot a factor of 3 here, but still...

> 1920 * 1200 * 8 bytes * 60/s = 1,105,920,000 bytes/s
> ~= 1.106 GByte/s

*3 -> 3.32GByte/s

> ...
> (...) which is roughly 4 times the data rate needed to fill
> that screen at 60FPS. In a pessimistic view.

Yet 3.32 < 4, so the whole thing still holds.


Wolfgang

From: Skybuck on
My hard disk, which is pretty speedy, can read at about 180
Megabyte/sec.

The screen data that needs to be decompressed per second is:

1920 x 1200 x 3 x 60 = 414.720.000 bytes

Assume a typical compression ratio of about 200.

Which means:

414.720.000 / 200 = 2.073.600 bytes remain per second.

This means roughly:

2.073.600 x 200 instructions = 414.720.000 instructions for the
decompression. However, a decoder that spends only a single
instruction per output byte probably doesn't exist... so it's pretty
safe to multiply this by 2 or 3 or maybe even 10.

Let's take 10.

414.720.000 * 10 = 4.147.200.000 instructions per second.

Some instructions might require 2 to 15 cycles... let's say 3 or so:
12.441.600.000 cycles/s = a 12 GigaHertz processor needed to run
smoothly with simple instructions.

Computers are near 2.0 GHz per core, maybe 4.0 GHz at best... but
nowhere near 12 GHz.

Tricks like SSE might speed it up here and there... but probably/
definitely not enough to run smoothly at 60 fps.

So conclusion is:

The CPU can't handle it, and even if it could, no CPU processing
power would be left to do anything else.

You seem to understand that yourself as well ;)

If you doubt this is true, try writing a simple video codec yourself
and you will quickly find out that your codec is limited not only by
the CPU but also by the memory system, which cannot sustain random
access at such high frequencies. The CPU will stall on memory accesses
whenever the access pattern is random. This is where GPUs are better:
they can do something else while waiting on memory.

Bye,
Skybuck.



From: Skybuck on
On Aug 5, 11:13 am, Wolfgang Draxinger <wdraxin...(a)darkstargames.de>
wrote:
> On Thu, 5 Aug 2010 10:59:12 +0200
>
> Wolfgang Draxinger <wdraxin...(a)darkstargames.de> wrote:
> > On Thu, 5 Aug 2010 01:31:39 +0200
> > "Skybuck Flying" <IntoTheFut...(a)hotmail.com> wrote:
>
> > > I am thinking 1920x1200x24 bit colors at 60 frames per second.
>
> > The following calculating is extremely pessimistic, in the real world
> > things look much, much better :)
>
> Sorry, I forgot a factor of 3 here, but still...
>
> > 1920 * 1200 * 8 bytes * 60/s = 1,105,920,000 bytes/s
> > ~= 1.106 GByte/s
>
> *3 -> 3.32GByte/s
>
> > ...
> > (...) which is more than 4 times the data rate, than needed to fill
> > that screen at 60FPS. In a pessimistic view.
>
> Yet 3.32 < 4, so the whole thing still holds.

This assumes a single copy... if anywhere in the system additional
copies have to be made, this of course breaks apart.

And it's highly likely that somewhere in a driver a copy is being
made...

Perhaps because of a memory copy between the application-space and
kernel-space parts of the driver.

Bye,
Skybuck.