From: girish on
hello.
our hardware team seems to have almost concluded that the TLBs are the
primary culprit. the countermeasure(s) such as parity bit, will for
sure lead to cut down on some other feature, to balance the die size
and all that. impacting software/kernel to some extent.
please help me understand - why TLBs? is this the longest and un-
checked/un-correctable path?

thanks in advance.
girish.gulawani
PS. this is not a course assignment.
From: MitchAlsup on
I cannot think of any particular reason that the TLB CAMs and data
cannot be covered by either parity or ECC. Neither check has to be on
the critical path as long as you have a means to machine check before
the accessed data damages some permanent data structure.
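
To make the parity option concrete, here is a minimal software model in C. It is only a sketch, not a description of any real TLB: the entry layout and the names tlb_entry, parity64, tlb_fill and tlb_entry_ok are all hypothetical. It assumes a single even-parity bit covering both the tag (CAM) and the data, computed at fill time and rechecked when the entry is used.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one TLB entry; field names are illustrative only. */
typedef struct {
    uint64_t vpn;     /* virtual page number (the CAM tag) */
    uint64_t ppn;     /* physical page number (the data) */
    bool     valid;
    uint8_t  parity;  /* even parity over vpn and ppn, stored at fill time */
} tlb_entry;

/* XOR-fold 64 bits down to one parity bit. */
static uint8_t parity64(uint64_t x)
{
    x ^= x >> 32;
    x ^= x >> 16;
    x ^= x >> 8;
    x ^= x >> 4;
    x ^= x >> 2;
    x ^= x >> 1;
    return (uint8_t)(x & 1u);
}

/* Load an entry and record the parity of tag and data together. */
static void tlb_fill(tlb_entry *e, uint64_t vpn, uint64_t ppn)
{
    e->vpn = vpn;
    e->ppn = ppn;
    e->valid = true;
    e->parity = parity64(vpn ^ ppn);
}

/* Recompute parity and compare with the stored bit. In hardware this
 * comparison could run alongside the use of the translation, with a
 * mismatch raising a machine check rather than stalling the lookup. */
static bool tlb_entry_ok(const tlb_entry *e)
{
    return parity64(e->vpn ^ e->ppn) == e->parity;
}
```

A single parity bit detects any odd number of flipped bits in the entry but cannot correct them, so in this model the only safe response to a mismatch is to drop the entry and refetch it, or to signal a machine check.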

If you would like to understand why this is the case, contact me via
e-mail. I am available for consultations.

Mitch Alsup

From: "Andy "Krazy" Glew" on
On 3/29/2010 7:27 PM, girish wrote:
> hello.
> our hardware team seems to have almost concluded that the TLBs are the
> primary culprit. the countermeasure(s) such as parity bit, will for
> sure lead to cut down on some other feature, to balance the die size
> and all that. impacting software/kernel to some extent.
> please help me understand - why TLBs? is this the longest and un-
> checked/un-correctable path?


I agree with Mitch - there is no excuse for not having EDC/ECC on your TLBs (and nearly everything else).

But to address your question: why FITs in the TLB and not, say, in the cache? It may be that your TLBs are not being
accessed often enough. Some workloads simply do not access many pages. A TLB entry may be loaded, and may then be left
untouched, unrefreshed, for a long time while its bits degrade. Especially if you have the equivalent of the G global
bit - OS TLB entries may endure forever if not thrashed out. Especially if you have separate TLBs for small and large
pages (superpages) - the latter tend to endure forever.

Perhaps a periodic TLB scrub, e.g. a state machine invalidating TLB entries? You might test it by doing a global
TLB invalidate in a timer interrupt. But a state machine would be better; and EDC/ECC better yet.
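
A rough sketch of what such a scrubbing state machine might do, written as a C model. The TLB size, the one-entry-per-tick policy, and the names tlb_model and tlb_scrub_step are assumptions made for illustration, not taken from any real design.

```c
#include <stdbool.h>
#include <stddef.h>

#define TLB_ENTRIES 64  /* assumed TLB size, for illustration only */

/* Hypothetical model: only the valid bits matter for scrubbing. */
typedef struct {
    bool   valid[TLB_ENTRIES];
    size_t scrub_idx;   /* next entry the scrubber will invalidate */
} tlb_model;

/* One scrubber step: invalidate a single entry, then advance the pointer
 * with wraparound. An invalidated entry is simply refetched from the page
 * tables on its next use, so no entry stays resident for more than
 * TLB_ENTRIES steps without being reloaded. */
static void tlb_scrub_step(tlb_model *t)
{
    t->valid[t->scrub_idx] = false;
    t->scrub_idx = (t->scrub_idx + 1) % TLB_ENTRIES;
}
```

Walking one entry per step spreads the refill cost over time, whereas a global invalidate in a timer interrupt pays for all the misses at once.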

Or, well, there is a long history of circuit problems in the TLB, at many companies.
From: Noob on
Andy "Krazy" Glew wrote:

> Or, well, there is a long history of circuit problems in the TLB,
> at many companies.

e.g. recently AMD's Barcelona core.

http://en.wikipedia.org/wiki/AMD_Barcelona#TLB_Bug
http://anandtech.com/show/2477/2