From: Brian Gordon on
Greetings,

I work in the aerospace industry and one of the considerations
that occurs in aerospace is a phenomenon called Single Event Upsets
(SEU). I'm not an expert on the physics behind this phenomenon, but
the end result is that bits in RAM change state due to high energy
particles passing through the device. This phenomenon happens more
often at higher altitudes (aircraft) and is a very serious
consideration for space vehicles.

When these SEU can be detected some action may be taken to improve
the behaviour of the system (log a fault and reset in order to
refresh things from scratch?). So the first question becomes how to
detect an SEU. Flash is considered somewhat safer than RAM. When
executables run in linux, do the .text and .ro sections get copied
into RAM? If so, can a background task monitor the RAM copy of .text
and .ro for corruption? Tripwire seems to offer this kind of
detection as a means for detecting tampering by a malicious attacker
in the filesystem, but I am not convinced that it would detect
modifications to copies of the ELF in RAM.

My understanding is that the way Linux does "on-demand" loading of
executables may be a problem here. But this SEU detection capability would seem
to have some applicability to intrusion detection, so I have to think
some mechanism already exists.

Thank you to anyone for any pointers on where I can look to learn
more about detecting SEU in linux.

legerde at gmail com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Andi Kleen on
Brian Gordon <legerde(a)gmail.com> writes:
> I work in the aerospace industry and one of the considerations
> that occurs in aerospace is a phenomenon called Single Event Upsets
> (SEU). I'm not an expert on the physics behind this phenomenon, but
> the end result is that bits in RAM change state due to high energy
> particles passing through the device. This phenomenon happens more
> often at higher altitudes (aircraft) and is a very serious
> consideration for space vehicles.

It's also a serious consideration for standard servers.

> When these SEU can be detected some action may be taken to improve
> the behaviour of the system (log a fault and reset in order to
> refresh things from scratch?). So the first question becomes how to
> detect an SEU. Flash is considered somewhat safer than RAM. When
> executables run in linux, do the .text and .ro sections get copied
> into RAM? If so, can a background task monitor the RAM copy of .text
> and .ro for corruption?

On server-class systems with ECC memory, the hardware does that.

The hardware stores the RAM contents using an error correcting
code that can normally correct one bit errors and detect multi-bit
errors.
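
To illustrate the "correct one bit, detect multi-bit" behaviour described
above, here is a minimal software sketch of a SECDED code (Hamming(7,4) plus
an overall parity bit). It is only an illustration: real ECC DIMMs use wider
codes implemented in the memory controller, and the function names `encode`
and `decode` are made up for this example.

```python
def encode(nibble: int) -> int:
    """Encode 4 data bits into an 8-bit SECDED codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    bits = [0] * 8
    # Data bits occupy positions 3, 5, 6, 7; parity bits sit at 1, 2, 4.
    bits[3], bits[5], bits[6], bits[7] = d
    bits[1] = bits[3] ^ bits[5] ^ bits[7]
    bits[2] = bits[3] ^ bits[6] ^ bits[7]
    bits[4] = bits[5] ^ bits[6] ^ bits[7]
    # Position 0 holds overall parity, which separates 1-bit from 2-bit errors.
    bits[0] = sum(bits[1:]) & 1
    return sum(b << i for i, b in enumerate(bits))


def decode(word: int):
    """Return (nibble, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    bits = [(word >> i) & 1 for i in range(8)]
    # The syndrome is the XOR of the positions of all set bits in 1..7.
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos
    parity_bad = sum(bits) & 1  # 1 means an odd number of bits flipped
    if syndrome == 0 and not parity_bad:
        status = "ok"
    elif parity_bad:
        # Assume a single flip: the syndrome names the bad position
        # (0 means the overall parity bit itself was hit).
        bits[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return None, "uncorrectable"
    nibble = bits[3] | (bits[5] << 1) | (bits[6] << 2) | (bits[7] << 3)
    return nibble, status


if __name__ == "__main__":
    word = encode(0b1011)
    print(decode(word))                # no error injected
    print(decode(word ^ 0b00100000))   # one flipped bit
    print(decode(word ^ 0b00100100))   # two flipped bits
```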

There are various more or less sophisticated variations of
this around, from simple ECC, through chipkill to handle failing DIMMs,
up to various variants of full memory mirroring.

> Thank you to anyone for any pointers on where I can look to learn
> more about detecting SEU in linux.

Normally server class hardware handles this and the kernel then reports
memory errors (e.g. through mcelog or through EDAC)

Hardware also stops the system before it would consume corrupted
data.

Newer Linux kernels also have special code that can recover
from this in some circumstances, or use predictive failure analysis
with page offlining to prevent future problems. This requires
suitable hardware support.

Lower end systems which are optimized for cost generally ignore the
problem though and any flipped bit in memory will result
in a crash (if you're lucky) or silent data corruption (if you're unlucky)

-Andi

--
ak(a)linux.intel.com -- Speaking for myself only.
From: Brian Gordon on
> It's also a serious consideration for standard servers.
Yes. Good point.

> On server-class systems with ECC memory, the hardware does that.

> Normally server class hardware handles this and the kernel then reports
> memory errors (e.g. through mcelog or through EDAC)

Agreed. EDAC is a good and sane solution and most companies do this.
Some do not, due to naivety or cost reduction. EDAC doesn't cover
processor registers, and I have fairly good solutions for dealing
with that in tiny "home-grown" tasking systems.

On the more exotic end, I have also seen systems that have dual
redundant processors / memories. Then they add compare logic between
the redundant processors that compares most pins each clock cycle. If
any pins are not identical at a clock cycle, then something has gone
wrong (SEU, hardware failure, etc.).

> Lower end systems which are optimized for cost generally ignore the
> problem though and any flipped bit in memory will result
> in a crash (if you're lucky) or silent data corruption (if you're unlucky)

Right! And this is the area that I am interested in. Some people
insist on lowering the cost of the hardware without considering these
issues. One thing I want to do is to be as diligent as possible (even
in these low cost situations) and do the best job I can in spite of
the low cost hardware.

So, some pages of RAM are going to be read-only and the data in those
pages came from some source (file system?). Can anyone describe a
high-level strategy to occasionally provide some coverage of this data?

So far I have thought about adding an MD5 hash to the page descriptor
when a read-only page is first loaded/mapped, so that a background
daemon could occasionally verify it. Does tripwire
accomplish this kind of detection by monitoring the underlying
filesystem (I don't think so)?
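
To make the idea concrete, here is a rough user-space sketch of the
hash-then-verify scheme. It is not kernel code and does not touch real page
descriptors: a plain buffer stands in for a read-only mapping, SHA-256 is
swapped in for MD5, and `PAGE_SIZE`, `page_digests` and `find_corrupt_pages`
are names invented for this example.

```python
import hashlib

PAGE_SIZE = 4096  # assumed page size; a real tool would query the system


def page_digests(buf, page_size=PAGE_SIZE):
    """Return one digest per page of a bytes-like buffer."""
    view = memoryview(buf)
    return [
        hashlib.sha256(view[off:off + page_size]).digest()
        for off in range(0, len(view), page_size)
    ]


def find_corrupt_pages(buf, baseline, page_size=PAGE_SIZE):
    """Return indices of pages whose digest no longer matches the baseline."""
    current = page_digests(buf, page_size)
    return [i for i, (now, then) in enumerate(zip(current, baseline)) if now != then]


if __name__ == "__main__":
    # A bytearray stands in for the RAM copy of a read-only region.
    region = bytearray(b"\x90" * (3 * PAGE_SIZE))
    baseline = page_digests(region)        # taken when the data is first loaded

    region[PAGE_SIZE + 10] ^= 0x04         # simulate a single flipped bit
    print(find_corrupt_pages(region, baseline))
```

A background task would call `find_corrupt_pages` periodically and log a
fault or trigger a reset when the list is non-empty. The digests themselves
live in RAM too, so they would need their own protection.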
From: Chris Friesen on
On 06/10/2010 11:29 AM, Brian Gordon wrote:

> When these SEU can be detected some action may be taken to improve
> the behaviour of the system (log a fault and reset in order to
> refresh things from scratch?). So the first question becomes how to
> detect an SEU.

I do work in telco stuff. We use ECC RAM, turn on ECC/parity on the
various buses, enable error-checking in the hardware, etc.

At higher abstraction levels you can checksum the data being stored and
validate it when you access it.
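
As a small sketch of what that can look like at the application level, the
following stores a CRC32 next to each value and checks it on every read. The
class and exception names (`CheckedStore`, `ChecksumError`) are hypothetical,
and CRC32 is just one possible choice of checksum.

```python
import zlib


class ChecksumError(Exception):
    """Raised when stored data no longer matches its checksum."""


class CheckedStore:
    """Key/value store that checksums data on write and validates on read."""

    def __init__(self):
        self._items = {}

    def put(self, key, data: bytes) -> None:
        # Keep the payload and its CRC32 side by side.
        self._items[key] = (bytearray(data), zlib.crc32(data))

    def get(self, key) -> bytes:
        data, crc = self._items[key]
        if zlib.crc32(data) != crc:
            raise ChecksumError(f"checksum mismatch for {key!r}")
        return bytes(data)


if __name__ == "__main__":
    store = CheckedStore()
    store.put("cfg", b"altitude_limit=12000")
    print(store.get("cfg"))

    store._items["cfg"][0][0] ^= 0x01   # simulate a flipped bit in RAM
    try:
        store.get("cfg")
    except ChecksumError as exc:
        print("detected:", exc)
```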

Some of the errors are "soft" and can be corrected, others are "hard"
and uncorrectable. If you get enough "soft" errors in a short enough
time it may be desirable to treat it as a "hard" error and reset.
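
One possible way to express that policy is a sliding-window counter. This is
only a sketch: the class name `SoftErrorMonitor` and the threshold and window
values are assumptions, and timestamps are passed in explicitly to keep the
example deterministic.

```python
from collections import deque


class SoftErrorMonitor:
    """Escalate when too many corrected errors occur within a time window."""

    def __init__(self, max_errors: int, window_seconds: float):
        self.max_errors = max_errors
        self.window_seconds = window_seconds
        self._times = deque()

    def record(self, now: float) -> bool:
        """Record a corrected error at time `now`; return True to escalate."""
        self._times.append(now)
        # Drop events that have aged out of the window.
        while self._times and now - self._times[0] > self.window_seconds:
            self._times.popleft()
        return len(self._times) > self.max_errors


if __name__ == "__main__":
    monitor = SoftErrorMonitor(max_errors=3, window_seconds=10.0)
    for t in (0.0, 1.0, 2.0, 3.0):
        if monitor.record(t):
            print(f"t={t}: too many soft errors, treat as hard error and reset")
```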

> Thank you to anyone for any pointers on where I can look to learn
> more about detecting SEU in linux.

You might start by taking a look at the "edac" code in the kernel.
Linux in general doesn't normally enable all the fault detection code,
so you may need to start looking at datasheets.

Chris

--
The author works for GENBAND Corporation (GENBAND) who is solely
responsible for this email and its contents. All enquiries regarding
this email should be addressed to GENBAND. Nortel has provided the use
of the nortel.com domain to GENBAND in connection with this email solely
for the purpose of connectivity and Nortel Networks Inc. has no
liability for the email or its contents. GENBAND's web site is
http://www.genband.com
From: Brian Gordon on
> I do work in telco stuff. We use ECC RAM, turn on ECC/parity on the
> various buses, enable error-checking in the hardware, etc.

Excellent stuff when you have it. :)

> At higher abstraction levels you can checksum the data being stored and
> validate it when you access it.

What about the .ro and .text sections of an executable? I would think
kernel support for that would be required. If it's application data,
then all sorts of things are possible, as you described. I've also
seen critical RAM variables stored in triplicate and then
compared/voted on, just to ensure no silent SEU corruption.
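
For reference, the triplicate-and-vote approach looks roughly like this. It
is an illustrative sketch only: `TripleVar` is a made-up name, and in a real
system the three copies would be placed in separate memory regions rather
than in one Python list.

```python
class TripleVar:
    """Hold three copies of a value and majority-vote on every read."""

    def __init__(self, value):
        self.copies = [value, value, value]

    def write(self, value) -> None:
        self.copies = [value, value, value]

    def read(self):
        a, b, c = self.copies
        if a == b or a == c:
            winner = a
        elif b == c:
            winner = b
        else:
            # No two copies agree, so the value cannot be recovered.
            raise RuntimeError("all three copies disagree")
        # Scrub: rewrite all copies so a single upset does not accumulate.
        self.copies = [winner, winner, winner]
        return winner


if __name__ == "__main__":
    v = TripleVar(42)
    v.copies[1] ^= 0x04      # simulate an SEU in one copy
    print(v.read())          # majority vote still yields the original value
    print(v.copies)          # the damaged copy has been rewritten
```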

> You might start by taking a look at the "edac" code in the kernel.
> Linux in general doesn't normally enable all the fault detection code,
> so you may need to start looking at datasheets.

Thank you for the suggestion. If the memory device supports EDAC/ECC
then definitely enabling it is a good strategy.