From: Skybuck Flying on
Hello,

First the bad news:

The bad news is probably that with my current design, too many passes are
needed, which means lots of shader switching, texture switching,
framebuffer switching and possibly even vertex buffer switching (which
would need to happen for each pass).

The number of passes is something like:

for Rounds := 1 to 100 do
begin
  PassA;
  for Cycles := 1 to 80000 do
  begin
    PassB;
    for Warriors := 1 to 10 do
    begin
      PassQ;
      PassX;
      PassY;
      PassZ;
    end;
    PassC;
  end;
  PassD;
end;

So that's roughly: 100 x 80,000 x 10 x (4 + ?) = 320,000,000 passes.
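Quick sanity check on that multiplication, as a tiny Free Pascal sketch
(only the 4 known inner passes are counted, the "?" ones are left out):

program CountPasses;
var
  Total: Int64;
begin
  { 100 rounds x 80000 cycles x 10 warriors x 4 passes (Q, X, Y, Z) }
  Total := Int64(100) * 80000 * 10 * 4;
  WriteLn('Total inner passes: ', Total); { prints 320000000 }
end.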

I did a little benchmark test with just the cpu, and some almost empty
routines... and the cpu does it in about 2 seconds or so... I did a
benchmark test some time ago with opengl... and opengl would be somewhere
between 300 and 300,000 fps... so that's nowhere near the number that's
necessary... so I am pretty much losing faith in the current design.

So that's the bad news ;) :) (I guess this is also nvidia's dirty little
secret ?! maybe cuda suffers the same fate... and needs to switch a lot
between "kernels/warps" for more advanced algorithms... this could be the
reason why nvidia might be developing a cpu to be integrated into the gpu...
so that these "pass" switches could be done by a "special cpu-like" chip on
the gpu... to accelerate it... how it would work... I have almost no idea
except... I guess it would upload some kind of special cpu-like program...
to it... just like a shader goes to the gpu. This has been predicted by a
recent Tom's Hardware article ;) :))

Ok, such bad news cannot go without good news ! LOL. Unfortunately... I
now have to come up with some good news... so I am going to go back over all
the numbers to see if I can come up with a new design/algorithm that takes
more advantage of the gpu itself... without requiring so many fricking
passes !

The general idea is to use the registers of the cores to keep track of data
values... like a tiny, minimal cpu cache.

However I am not sure if each core inside the gpu has its own registers ?!?
I would guess so...

However... these registers need to be freed... before the core can move on
to the next input/output data ???
(Or maybe the gpu has a stack for the registers ??? I think not... but I'm
not sure... ;))

So this would mean... that by the end of the shader... the register
contents need to be written out to the output.

So this means I have to go back to one of my postings/ideas from the
past... where I wrote something like: "I can see a possible future for gpu
computing" ;) :)

The idea was to treat the cores of the gpu and the registers of the
shaders... like tiny little intelligent cells... which only do work if the
work is addressed to them.

So the idea (here) I guess is to "attach" intelligent little cells to each
pixel/index.

These cells then all get iterated by a single pass/shader which is
hopefully executed 300,000 times or so... or 3,000 times or 300 times...
but at least it will be just one shader, maybe two or so, but that's it...
maybe a vertex shader plus a fragment shader... maybe two times.
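Here is a little cpu-side sketch of that cell idea in Free Pascal (this is
not shader code; TCell and all the names/numbers are made up for
illustration): the same routine visits every cell each pass, like one
shader over every pixel, but only the addressed cells actually do work.

program IntelligentCells;

type
  { Hypothetical per-cell state; names are made up. }
  TCell = record
    Active: Boolean;   { is work addressed to this cell? }
    Value: Integer;    { the state the cell keeps "in registers" }
  end;

var
  Cells: array[0..7999] of TCell;
  Pass, I: Integer;

{ One pass: the same routine runs over every cell, but only the
  addressed cells do real work. }
procedure RunPass;
var
  J: Integer;
begin
  for J := Low(Cells) to High(Cells) do
    if Cells[J].Active then
      Inc(Cells[J].Value); { placeholder for the real per-cell work }
end;

begin
  for I := Low(Cells) to High(Cells) do
  begin
    Cells[I].Active := (I mod 10 = 0); { pretend 1 in 10 cells is addressed }
    Cells[I].Value := 0;
  end;
  for Pass := 1 to 100 do
    RunPass;
  WriteLn('Cell 0 after 100 passes: ', Cells[0].Value); { prints 100 }
end.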

So the idea is as follows:

Each pixel has the following data attached:
Core[Pixel].Instruction
Warrior[0].P-Space[Pixel].Value
Warrior[1].P-Space[Pixel].Value
Warrior[0].ProcessQueue[Pixel].Value
Warrior[1].ProcessQueue[Pixel].Value

Additional pixels could keep track of:
Simulator[S].WarriorsAlive
Simulator[S].Warrior[W].Score
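To make that layout concrete, here is what the per-pixel state could look
like as Free Pascal records (a sketch only; the type names and field sizes
are my own guesses, assuming 2 warriors and ~6 bytes per core instruction):

program PixelLayout;

type
  { Hypothetical per-pixel state; sizes are guesses. }
  TPixelState = packed record
    Instruction: array[0..5] of Byte;        { Core[Pixel].Instruction }
    PSpaceValue: array[0..1] of Word;        { Warrior[0..1].P-Space value }
    ProcessQueueValue: array[0..1] of Word;  { Warrior[0..1].ProcessQueue }
  end;

  TSimulatorState = packed record
    WarriorsAlive: Byte;                     { Simulator[S].WarriorsAlive }
    Score: array[0..1] of Integer;           { Simulator[S].Warrior[W].Score }
  end;

begin
  WriteLn('Bytes per pixel: ', SizeOf(TPixelState));         { prints 14 }
  WriteLn('Bytes per simulator: ', SizeOf(TSimulatorState)); { prints 9 }
end.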


Ultimately the pixel shaders/vertex shader could simply treat the input
memory as one giant byte array in rgba32 format... by using unpack/pack
functions to extract and store bytes in it.
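Not sure yet what the shader versions would look like exactly (Cg has
pack/unpack style functions for this, if I remember correctly), but on the
cpu side the helpers could look something like this (function names made
up):

program PackUnpack;

{ Treat one 32-bit rgba texel as 4 bytes of the giant byte array. }
function PackRGBA(R, G, B, A: Byte): LongWord;
begin
  PackRGBA := (LongWord(R) shl 24) or (LongWord(G) shl 16)
           or (LongWord(B) shl 8) or LongWord(A);
end;

function UnpackByte(Texel: LongWord; Index: Byte): Byte;
begin
  { Index 0 = R (highest byte) ... Index 3 = A (lowest byte). }
  UnpackByte := (Texel shr ((3 - Index) * 8)) and $FF;
end;

var
  Texel: LongWord;
begin
  Texel := PackRGBA(1, 2, 3, 4);
  WriteLn(UnpackByte(Texel, 2)); { prints 3 }
end.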

Then two textures exist... an input texture and an output texture which are
more or less the same...

These would be close to 256 MB each to fit in the 512 MB ram...
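The two textures would then be used ping-pong style: render from the input
texture into the output texture, and swap the two handles after each pass.
A sketch (hypothetical handle values, no real opengl calls):

program PingPong;

var
  InputTex, OutputTex, Temp: LongWord; { hypothetical gl texture handles }
  Pass: Integer;
begin
  InputTex := 1;  { would come from texture creation in the real thing }
  OutputTex := 2;
  for Pass := 1 to 100 do
  begin
    { Here a RenderPass(InputTex, OutputTex) would read from InputTex
      and write into OutputTex via a framebuffer object. }
    Temp := InputTex;       { swap, so this pass's output becomes }
    InputTex := OutputTex;  { the next pass's input }
    OutputTex := Temp;
  end;
  WriteLn('Input handle after 100 passes: ', InputTex); { prints 1 again }
end.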

If the shaders are to do updates on these textures then it could possibly
take lots of bandwidth... but that could be prevented by doing smart
updates... but let's see what a dumb algorithm would be like if one had to
copy all of this all the time:

roughly 50 x 1024 MB/sec available / 256 MB per cycle = 200 cycles/sec...
whoopsie ! ;) :)

That's very bad lol.

Let's see how much bandwidth we can actually burn to achieve cpu-like
capabilities:

100 x 80,000 x 10 = 80,000,000 cycles

50 x 1024 x 1024 x 1024 B/sec / 80,000,000 = 53,687,091,200 / 80,000,000
= roughly 671 bytes per cycle.
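Same numbers double-checked in a quick Free Pascal sketch (both figures
hardcoded from above):

program BandwidthBudget;

var
  BytesPerSec, CyclesPerSec: Int64;
begin
  BytesPerSec := Int64(50) * 1024 * 1024 * 1024;  { ~50 GB/sec }
  CyclesPerSec := Int64(100) * 80000 * 10;        { 80,000,000 cycles }

  { Dumb algorithm: copy the full 256 MB texture every cycle. }
  WriteLn('Full 256 MB copies per second: ',
          BytesPerSec div (Int64(256) * 1024 * 1024));  { prints 200 }

  { Smart updates: how many bytes may be touched per cycle. }
  WriteLn('Byte budget per cycle: ',
          BytesPerSec div CyclesPerSec);                { prints 671 }
end.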

Wow that's not a lot... quite surprising really...

Then again, per cycle maybe 5 x 6 bytes of updates are needed for
instructions or so... which is 30 bytes... for pspace maybe another 2
bytes... for the process queue maybe another 4 bytes... plus some
additional process queue overhead/head/tail/processes thingies... so this
is well within range... however the address would need to be specified as
well... then all this stuff copied... but it should be doable...

However I said "cpu like capabilities"... so this means the gpu is actually
as fast as the cpu at this point... which is quite weird...

So all in all... this means something like 671 / ? = ? performance benefit
over cpu...

So this number could actually be quite important... I estimated the number
to be 58 bytes, and this would need to be copied twice, so that's 116.

So 671 / 116 = roughly a 5.8x speedup over the cpu.

This is the worst case scenario though... actual performance might be
better. Though this is kinda saddening ;) :)

Lesson learned during this posting:

GPU GB/sec / CPU GB/sec = Speedup of GPU over CPU.

Now according to the figures/specs:
the gpu is roughly: 50 GB/sec
the cpu is roughly: 16 GB/sec

So speedup when both running at maximum efficiency: 50 / 16 = 3.125

This does not include the bandwidth limit of the pci-express lane... which
could bottleneck the gpu for some bigger algorithms... if that's the case
the opposite could happen:
CPU speedup:
16 / 2 = 8x... the cpu could be 8 times faster if pci-express is the
bottleneck ?! ;) :) and the data remains inside the cpu, which is unlikely
?!? so maybe not a fair comparison ;)

With main memory the cpu could be at 2 GB/sec, so the speedup of the gpu
is:
50 / 2 = 25x max.

But since it needs to do double buffering/swapping etc...
25 / 2 = 12.5x max.

Which is kinda the number reported by others...
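All those ratios in one quick sketch (just the rough spec numbers from
above, so take the outputs with a grain of salt):

program SpeedupRatios;

const
  GpuGBs = 50.0;  { gpu memory bandwidth, rough spec }
  CpuGBs = 16.0;  { cpu bandwidth, rough spec }
  BusGBs = 2.0;   { assumed pci-express / cpu main memory bandwidth }
begin
  WriteLn('gpu vs cpu at max efficiency: ', GpuGBs / CpuGBs : 0 : 3); { 3.125 }
  WriteLn('cpu vs pci-e limited gpu:     ', CpuGBs / BusGBs : 0 : 1); { 8.0 }
  WriteLn('gpu vs cpu main memory:       ', GpuGBs / BusGBs : 0 : 1); { 25.0 }
  WriteLn('halved for double buffering:  ', GpuGBs / BusGBs / 2 : 0 : 1); { 12.5 }
end.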

Access time seems to be a totally different matter though... here the gpu
could prevail... once the memory is uploaded.

So I guess there is very little good news... and only some lessons
learned...

However the good news could be: I learned a lot about opengl/cg shaders and
the gpu and its capabilities...

I could now give up on the corewars executor on the gpu because it would
probably not run fast enough ?!?
(opengl does not achieve a high enough number of iterations/passes, and
bandwidth could be an issue too ;))

A 5x speedup is not enough for the effort, a 12x speedup is not enough for
the effort, a 30x speedup is not enough for the effort ! ;) :) I want a
9000x speedup ! LOL ;) :)

So I think it's time to give up on this pipe dream of corewars on the
gpu... I mainly did it because others on a forum dreamed about it and I
thought it would be nice if I could make their dream a reality... however
it remains a pipe dream methinks... ;) :)

I can now go back to my older software and focus on that instead... I could
also create new other software... and/or I could also try and create new ai
for my game which is what actually got me into corewars in the first place !
;) :)

However I think I have spent enough time on this "entertainment/educational"
thing ;) :)

So I think it's now time for me to switch to something else... and leave it
for now ! =D

Bye,
Skybuck ;) =D


From: Skybuck Flying on
Oh well... I came this far... and I have been positively surprised by some
performance benchmarks... maybe there is some magic in there after all...
like caching effects...

And going back to just 2 cores kinda sux anyway...

So maybe I will continue the project... and finish it... just to see what
the final performance would be ;) :)

Bye,
Skybuck =D