From: austin on
Symon,

Well, Cypress, Xilinx, IBM, and many others have made it no secret that
neutrons at sea level are causing upsets, and we have done something
about it (and presented the papers, and shown our results).

Intel has also been working very quietly on this, with much less press.

I suggest that if you are not thinking about single event effects, you
should be, and you should demand that your vendor show you proof of their
design efforts in this regard.

Virtex 5 is (as of today) at 144 FIT/Mb for the config bits, with a 95%
confidence interval of 100 to 200 FIT/Mb. This is from our 400 devices
located on mountain tops in France (31.029 Giga-bit-years of test time,
35 events).
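
For anyone wondering where an interval like "100 to 200" around 144 can come from with only 35 events: the sketch below assumes the events are Poisson distributed and that the interval is the exact (Garwood) Poisson interval on the event count, scaled onto the quoted point estimate. The post does not say how the interval was actually computed, so treat this as one plausible reconstruction. The helper names (poisson_cdf, solve_decreasing, poisson_interval) are made up for illustration.

```python
import math

def poisson_cdf(k, mu):
    """P(X <= k) for X ~ Poisson(mu), summed term by term."""
    term = math.exp(-mu)
    total = term
    for i in range(1, k + 1):
        term *= mu / i
        total += term
    return total

def solve_decreasing(f, lo, hi, iters=200):
    """Bisection for the root of a decreasing function f on [lo, hi]."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if f(mid) > 0:
            lo = mid   # root lies to the right
        else:
            hi = mid   # root lies to the left
    return (lo + hi) / 2.0

def poisson_interval(n, conf=0.95):
    """Exact two-sided interval for a Poisson mean, given n >= 1 observed events."""
    alpha = 1.0 - conf
    # Lower limit: the mean at which seeing n or more events has probability alpha/2.
    lower = solve_decreasing(
        lambda mu: poisson_cdf(n - 1, mu) - (1.0 - alpha / 2.0), 0.0, float(n))
    # Upper limit: the mean at which seeing n or fewer events has probability alpha/2.
    upper = solve_decreasing(
        lambda mu: poisson_cdf(n, mu) - alpha / 2.0, float(n), 10.0 * n + 10.0)
    return lower, upper

events = 35        # from the post
point_fit = 144.0  # FIT/Mb, from the post

lower, upper = poisson_interval(events)
# Assumption: the rate interval is the count interval scaled by the point estimate.
fit_low = point_fit * lower / events
fit_high = point_fit * upper / events
print("95%% interval: %.0f to %.0f FIT/Mb" % (fit_low, fit_high))
```

Under those assumptions the relative width of the interval depends only on the event count, which is why more device-years on the mountain tighten it.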

Compare this to a 65nm ASSP or ASIC, which is at least 1000 FIT/Mb or
1000 FIT/million gates(!). Do nothing, and it gets worse. Do
something, and it gets back to where it should be. These numbers are
from the SELSE II conference a few years back; the real industry numbers
are a lot worse, but no one will admit it.

There is a reason why Xilinx FPGA devices are finding their way into
many high availability and high reliability applications: we are the
only choice -- there is no competition whatsoever.

Austin
From: austin on
Symon,

First of all, there is no such thing as a single particle detector.

Secondly, detecting the current spike (from a strike) requires a very
complex circuit, itself subject to spikes (I know, we designed them for
the USAF...)

Thirdly, Intel has done far more than this, and deserves better PR.

Perhaps they should fire the PR firm?

Austin

Symon wrote:
> "austin" <austin(a)xilinx.com> wrote in message
> news:ftg25m$p2m2(a)cnn.xsj.xilinx.com...
>> Intel has also been working very quietly on this, with much less press.
>>
> Hi Austin,
> I wondered what were your thoughts on their patent where "The cosmic ray
> detector [built into the device] is therefore designed to spot when rays
> have caused interference and then tell the chip to repeat the command." ? I
> guess in an FPGA it could trigger a readback to ensure the device was still
> correctly configured and/or issue a user logic reset.
> Cheers, Syms.
>
>
From: austin on
And,

Yes, in S3A, S3AN, S3D, V4, V5 we are able to either reconfigure on
detection of an upset, notify the user (and they decide what to do), or
in V4 and V5, correct the flipped bit without having to reconfigure (or
even go to the config flash/prom).

Basically, in our road show, it is detailed how the user needs to decide
what to do, and at what levels, in order to meet their availability and
reliability numbers.

Mitigation is part hardware, part system architecture, and part
software. Depending on what you are doing, and how long you can
tolerate being "off-line" there are different solutions.

They are:
-just reconfigure and start fresh
-just fix the bit flip and continue on (a flip does nothing 90% of the
time, and seldom causes anything to 'crash')
-fix the bit flip, then reset or go back to a checkpoint/known state
-use dual redundancy and check for agreement (when a fault cannot be
tolerated, as in banking or accounting); repeat if there is no agreement
-use full triple modular redundancy (when it must be correct and 100%
available), and also scrub to fix flipped bits so that flips are not
allowed to accumulate
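
To make the last option concrete, here is a toy software model of triple modular redundancy with scrubbing. It is only a sketch: real TMR lives in the fabric, and the names (majority, scrub, golden) are invented for illustration. It shows why scrubbing matters: one upset is outvoted, but two upsets in the same bit position of different copies win the vote unless the first was repaired in between.

```python
def majority(a, b, c):
    """Bitwise 2-of-3 majority vote over three redundant copies."""
    return (a & b) | (a & c) | (b & c)

def scrub(copies):
    """Rewrite all three copies with the voted value, repairing any single flip."""
    voted = majority(copies[0], copies[1], copies[2])
    copies[:] = [voted, voted, voted]
    return voted

golden = 0b10110010  # hypothetical protected state

# With scrubbing: two upsets at the same bit position, separated by a scrub.
scrubbed = [golden, golden, golden]
scrubbed[1] ^= 1 << 4            # first upset, copy 1
scrub(scrubbed)                  # repair before the next upset arrives
scrubbed[2] ^= 1 << 4            # second upset, copy 2
with_scrub = majority(*scrubbed)

# Without scrubbing: the same two upsets accumulate and outvote the good copy.
unscrubbed = [golden, golden, golden]
unscrubbed[1] ^= 1 << 4
unscrubbed[2] ^= 1 << 4
without_scrub = majority(*unscrubbed)

print(with_scrub == golden, without_scrub == golden)
```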

All methods are used by our customers, and they all work. We have
reference designs and support for these models. And they can be tested
by reconfiguring to flip bits while operating. One heck of a lot cheaper
than using a proton beam, or neutron beam .... and more complete (we
have folks who flip each bit, one by one, and prove their system meets
its requirements).
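
The "flip each bit, one by one" campaign can be sketched in software as follows. This is a toy model, not any vendor's tool flow: the "design" is a single hypothetical 4-input LUT used as a 2-input AND with its two upper inputs tied low, and lut_eval, reachable_inputs and critical_bits are names made up for the example. Each configuration bit is flipped in turn and the design is re-checked against every input it can actually see.

```python
from itertools import product

def lut_eval(config, inputs):
    """Evaluate a LUT: the input bits select one bit of the config word."""
    index = sum(bit << i for i, bit in enumerate(inputs))
    return (config >> index) & 1

def reachable_inputs():
    """Hypothetical design: inputs 2 and 3 are tied low, so only 4 entries are used."""
    return [(a, b, 0, 0) for a, b in product((0, 1), repeat=2)]

# AND of inputs 0 and 1, replicated across the unused upper inputs.
GOLDEN = 0x8888

def critical_bits(golden, n_bits=16):
    """Flip each config bit alone and report the flips that change visible behavior."""
    expected = [lut_eval(golden, v) for v in reachable_inputs()]
    critical = []
    for bit in range(n_bits):
        faulty = golden ^ (1 << bit)  # inject exactly one upset
        observed = [lut_eval(faulty, v) for v in reachable_inputs()]
        if observed != expected:
            critical.append(bit)
    return critical

crit = critical_bits(GOLDEN)
print("%d of 16 config bits are critical: %s" % (len(crit), crit))
```

In this toy case only the LUT entries the design can address matter; flips in the other entries are never observed, which is the same effect that makes most upsets harmless in a real design.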

Austin
From: Jon Elson on


Symon wrote:
> "austin" <austin(a)xilinx.com> wrote in message
> news:ftg25m$p2m2(a)cnn.xsj.xilinx.com...
>
>>Intel has also been working very quietly on this, with much less press.
>>
>
> Hi Austin,
> I wondered what were your thoughts on their patent where "The cosmic ray
> detector [built into the device] is therefore designed to spot when rays
> have caused interference and then tell the chip to repeat the command." ? I
> guess in an FPGA it could trigger a readback to ensure the device was still
> correctly configured and/or issue a user logic reset.
> Cheers, Syms.
>
>

Boy, I saw that text, too, and really wondered how reliable such a
procedure would be. If the state of flip-flops or dynamic memories is
altered, repeating the previous instruction would be worthless. There
is SO much more area in high-end CPUs devoted to memory, and so much
less to logic functions, that I would expect memory corruption to be
the most probable fault.

Jon

From: Colin Paul Gloster on
Austin posted:
|------------------------------------------------------------------------|
|"[..] |
| |
|[..] they can be tested |
|by reconfiguring to flip bits while operating. One heck of a lot cheaper|
|than using a proton beam, or neutron beam .... and more complete (we |
|have folks who flip each bit, one by one, and prove their system meets |
|its requirements)." |
|------------------------------------------------------------------------|

Logical fault injection is no substitute for checking whether real
radiation respects your model of the system. A single transient can
corrupt the outcome of clocked triple modular redundant voters.

Sincerely,
Colin Paul Gloster,
unemployed and cold