From: Skybuck Flying on
Hello,

Soon I will attempt some GPGPU development/shader development... I have
already done some tests, and so far the GPU and its memory seem to be
working just fine... (and fast ;) :)) No bit errors so far.

However, the thought of bit errors creeping in is indeed a bit scary...
The hardware is older, from 2006: an NVIDIA 7900 GTX with 512 MB of RAM.
(OpenGL and Cg shaders will be used for development.)

It seems this GPU works with 4 floating point fields per register, either 16
bit or 32 bit. Maybe it's always 32 bit? I am not sure... probably not,
because 16 bit performance is twice as high.

I plan on using 3x16 bits, so that needs a "vector" register with 4x16 bits.

This means only 48 bits are used, for example the .x, .y and .z components,
and this leaves the .w to be used for something else.

For me it could be interesting to make two modes of operations for the final
product:

First mode which uses 3x16 bits in registers and memory.
Second mode which uses 4x16 bits in registers and memory.

This would make the second mode slightly less memory efficient.

I have a feeling that the gpu is very fast and has plenty of processing
power/instruction power.

So I might get away with adding some extra "data integrity check" to the
.x, .y, .z components and storing it in the .w component.

So the idea is basically to apply a 16 bit "data integrity check" to 48 bits
of data, using Shader Model 3.0 instructions/the Cg language for now...

I wonder what is a good data integrity checking algorithm to detect bit
errors in this situation ?

The data integrity algorithm should not use too many branches... hopefully
just one branch to compare results ?! ;)

It shouldn't use too many memory lookups either... that would be bad for
performance ?! ;)...

The 3x16 bits could be unpacked into 6x8-bit quantities, which are then
stored in 32 bit floating point registers to do further calculations on...

If the algorithm is in floating point format and limits itself to 8 to 16
bit precision, then that should work just fine... algorithms designed for
integers can be converted to 16 bit floating point ;) I could even convert
to 64 bit floating point, but that would require 64 bit software math and
would slow things down, so better not to do that...

The following operations are definitely available for such an algorithm,
each a 32 bit floating point operation that can double as 8 bit or 16 bit
integer arithmetic (exact up to 24 bits):
addition
subtraction
multiplication
division

and of course special graphics instructions:
dot operations
interpolation operations. (But I have never used those and don't quite
understand them on the GPU at least, though I am willing to learn ;) :))

Some ideas in my head for now:

1. A simple, weak "checksum" where everything is summed together... seems
like a very bad algorithm, since bit flips might go undetected.
2. A CRC32 ? But the algorithm I have requires a large table and thus memory
lookups... doesn't seem too smart... and CRC32 is like a large division ?
Maybe overkill for just 48 bits of data ?
3. I can vaguely remember something about parity ? Is that the same as a
checksum or different ? I think it's different... parity counts the bits
set and then stores that ? Doesn't seem so strong ?
4. I can vaguely remember an error correcting code which could correct a
1 bit error ? By using two parities or so ? One vertical, one horizontal ?

So I ask you software programmers/developers and hardware designers and
algorithm designers :) out there the following question:

What kind of error detection algorithms, or maybe even error correction
algorithms are out there that you think would be suited for this special
situation ?

Also maybe you can design something specially for this situation ?

(The algorithm could later also be applied to slightly smaller data sizes,
like only 32 bits, 16 bits, or maybe even just 8 bits stored in 16 bits...
that would be nice.)

Bye,
Skybuck.


From: Skybuck Flying on
Here is an interesting thread about error detection on GPUs:

http://gpgpu.org/forums/viewtopic.php?p=18648&sid=c7bf701c2deed980c0a8745f15a630b8

The first guy says: "run twice, see if same results..." I think this is
dangerous and wasteful: if a bit is truly damaged, the same bit error
might simply occur twice. It's also wasting resources big time ;) :)

Another guy says: people do weird things to their systems, like overclocking
or not cooling properly.

Another guy says: memory chips might become warmer over time and might start
producing bit errors.

So I am thinking: it's now winter... the computer is cool... thus no bit
errors... but what happens in the summer when it's fricking hot ? Maybe bit
errors will creep in... I will probably not be running my PC intensively
during the summer, but my "future" products' users might be... it's good to
protect them from possible bit errors, methinks ;) :) So I kinda like this
idea of adding some bit error detection capabilities... just in case ! ;)

Then the user can decide if he wants it or not by choosing the mode... ;) :)
So I hope that by protecting the bits like that, it will detect most if not
all problems with GPU/memory corruption ?

Bye,
Skybuck.




From: Skybuck Flying on
Also I just realized something... CRC32 is too big... it requires 32 bits.

There are only 16 bits available for storing an integrity code...

Also it seems CRC32 does 1 memory lookup per byte... there are 3x2 = 6
bytes, which would mean 6 memory lookups... which is a bit much for my
taste... it might be acceptable anyway for a first version... but I would
rather have something better... something that doesn't require a memory
lookup and fits in 16 bits... that would be nice ! ;) :)

Bye,
Skybuck.


From: Dave -Turner on
so use crc16


From: Skybuck Flying on
"Dave -Turner" <admin(a)127.0.0.1> wrote in message
news:suOdnXyGC-Hf3-jWnZ2dnUVZ7v-dnZ2d(a)westnet.com.au...
> so use crc16

I will try...

However, I still need an error correcting code for mode 3 (see new
posting)...

Any ideas ?

I have 0% experience with error correcting codes, I am afraid ! ;) :)

Bye,
Skybuck.