From: Jeffrey R. Carter on
Robert A Duff wrote:

> That's hard to believe. Dr. Wrigley said the hardware failures turned
> from "insidious" to "catastrophic" when he changed some sort of Things to
> pointers-to-Things. I take that to mean, he got wrong answers before,
> and crashes after. Is that right, Dr. Wrigley?

Thinking about it a little more, I get the idea that before he had 1 pointer to
a 1 GB data structure on the heap. Then he changed to a large number of pointers
to smaller data structures on the heap, equivalent to the previous 1 GB structure.

With more pointers, there's a greater chance that one of the pointers might have
a bit flip, resulting in the occasional crash (1 crash every few months, IIRC).

Of course, this is sheer speculation.

--
Jeff Carter
"There's no messiah here. There's a mess all right, but no messiah."
Monty Python's Life of Brian
From: Dr. Adrian Wrigley on
On Mon, 31 Oct 2005 06:21:37 +0000, Jeffrey R. Carter wrote:

> Robert A Duff wrote:
>
>> That's hard to believe. Dr. Wrigley said the hardware failures turned
>> from "insidious" to "catastrophic" when he changed some sort of Things to
>> pointers-to-Things. I take that to mean, he got wrong answers before,
>> and crashes after. Is that right, Dr. Wrigley?

I think so, but the error rate was very low, so it is hard to tell.

> Thinking about it a little more, I get the idea that before he had 1 pointer to
> a 1 GB data structure on the heap. Then he changed to a large number of pointers
> to smaller data structures on the heap, equivalent to the previous 1 GB structure.

This is exactly the situation. The "Things" were about 36 bytes each
and I changed to having around 20 million pointers to things.
Single bit errors in pointers had a much more significant effect
than single bit errors in "Things" (which tended to be ignored
for various reasons).
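A minimal sketch of that asymmetry, in Python rather than the original program's language. Everything here is hypothetical: the pointer is modelled as a list index, the "Thing" as a single float, and the choice of which bit flips is an assumption made purely for illustration (a flip in a high-order bit of the data would of course be far more visible).

```python
import struct

def flip_bit(value: int, bit: int) -> int:
    """Return value with one bit inverted, as a single-bit memory error would."""
    return value ^ (1 << bit)

def corrupted_value_error(price: float) -> float:
    """Flip the lowest mantissa bit of a float and return the absolute error."""
    raw = struct.unpack("<Q", struct.pack("<d", price))[0]
    damaged = struct.unpack("<d", struct.pack("<Q", flip_bit(raw, 0)))[0]
    return abs(damaged - price)

def dereference_with_flip(things: list, index: int, bit: int) -> str:
    """Flip one bit in an index (standing in for a pointer) and try to use it."""
    bad_index = flip_bit(index, bit)
    try:
        things[bad_index]
    except IndexError:
        return "crash"          # the lookup lands outside the table
    return "wrong Thing"        # the lookup silently hits another entry

things = [float(i) for i in range(1000)]
print(corrupted_value_error(101.25))          # tiny, easy to overlook
print(dereference_with_flip(things, 42, 20))  # crash
print(dereference_with_flip(things, 42, 3))   # wrong Thing
```

With a real pointer the "crash" case would typically be an access to unmapped memory rather than a tidy exception.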

> With more pointers, there's a greater chance that one of the pointers might have
> a bit flip, resulting in the occasional crash (1 crash every few months, IIRC).

Yes.

> Of course, this is sheer speculation.

Well speculated! (Was it so unclear?)

Another "feature" I observed was that files could stay cached by the OS
for months, and accumulate the occasional single-bit error. But when you
evict the cached pages and read the data again, the errors disappear.
Plenty of scope for very rare Heisenbugs.

The warning I give is that these error rates are "normal" for modern SDRAM,
but they generally go unnoticed because they only show up if
you have several GB of memory and care about every bit 24/7.
Operating a financial business, I care about this!
--
Adrian

From: Jeffrey R. Carter on
Dr. Adrian Wrigley wrote:

> This is exactly the situation. The "Things" were about 36 bytes each
> and I changed to having around 20 million pointers to things.
> Single bit errors in pointers had a much more significant effect
> than single bit errors in "Things" (which tended to be ignored
> for various reasons).

Right. I don't think I'd ever want to consider a continually running program
with 20 million pointers. What did that buy you over the single-pointer version?

> Well speculated! (Was it so unclear?)

Thanks. I wasn't clear what was causing the crashes at 1st. With a little more
thought, it seemed likely it was dereferencing a pointer with a flipped bit.

> Another "feature" I observed was that files could stay cached by the OS
> for months, and accumulate the occasional single-bit error. But when you
> evict the cached pages and read the data again, the errors disappear.
> Plenty of scope for very rare Heisenbugs.

Interesting. It's not something you normally have to think about; most programs
don't run for that long. I remember a noticeable # of bit errors during a solar
maximum about 1980, but don't recall it repeating in 1991 or 2002.

--
Jeff Carter
"In the frozen land of Nador they were forced to
eat Robin's minstrels, and there was much rejoicing."
Monty Python & the Holy Grail
From: Dr. Adrian Wrigley on
On Wed, 02 Nov 2005 03:46:41 +0000, Jeffrey R. Carter wrote:

> Dr. Adrian Wrigley wrote:
>
>> This is exactly the situation. The "Things" were about 36 bytes each
>> and I changed to having around 20 million pointers to things.
>> Single bit errors in pointers had a much more significant effect
>> than single bit errors in "Things" (which tended to be ignored
>> for various reasons).
>
> Right. I don't think I'd ever want to consider a continually running program
> with 20 million pointers. What did that buy you over the single-pointer version?

It was a simple 2-D array of stock data (time/date, ticker). But some
tickers had large gaps (no data for long periods). I had been hitting
various memory limits. Adding a level of indirection allowed only the
valid data to consume memory, keeping the program within those limits.
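A rough sketch of the indirection idea, in Python rather than the original program's language. The names `Thing` and `IndirectGrid` and the fields `price` and `volume` are made up for illustration; the real record layout is not given in this thread. Each (time, ticker) slot holds either nothing or a reference to a record, so gaps cost only the slot itself.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Thing:
    # Hypothetical fields; the thread only says a Thing is about 36 bytes.
    price: float
    volume: int

class IndirectGrid:
    """2-D (time, ticker) table of references; empty slots hold None."""

    def __init__(self, n_times: int, n_tickers: int) -> None:
        self.n_times = n_times
        self.n_tickers = n_tickers
        self._slots: List[Optional[Thing]] = [None] * (n_times * n_tickers)

    def _index(self, t: int, k: int) -> int:
        # Explicit range check so a bad coordinate fails loudly.
        if not (0 <= t < self.n_times and 0 <= k < self.n_tickers):
            raise IndexError((t, k))
        return t * self.n_tickers + k

    def put(self, t: int, k: int, thing: Thing) -> None:
        self._slots[self._index(t, k)] = thing

    def get(self, t: int, k: int) -> Optional[Thing]:
        return self._slots[self._index(t, k)]

    def allocated(self) -> int:
        """Number of slots that actually point at a record."""
        return sum(1 for s in self._slots if s is not None)

grid = IndirectGrid(n_times=4, n_tickers=3)
grid.put(0, 1, Thing(price=10.0, volume=100))
print(grid.get(0, 1), grid.get(2, 2), grid.allocated())
```

The trade-off is the one discussed above: the table of references is itself a large block of pointers, and each of them is a place where a flipped bit does more damage than it would inside a record.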

It's still a rather simple architecture, compared to a more traditional
SQL database system for this application. But it is *very* fast
and (now) very robust.
--
Adrian

From: Robert A Duff on
"Dr. Adrian Wrigley" <amtw(a)linuxchip.demon.co.uk.uk.uk> writes:

> On Mon, 31 Oct 2005 06:21:37 +0000, Jeffrey R. Carter wrote:
>
> > Robert A Duff wrote:
> >
> >> That's hard to believe. Dr. Wrigley said the hardware failures turned
> >> from "insidious" to "catastrophic" when he changed some sort of Things to
> >> pointers-to-Things. I take that to mean, he got wrong answers before,
> >> and crashes after. Is that right, Dr. Wrigley?
>
> I think so, but the error rate was very low, so it is hard to tell.

Interesting.

Some years ago I used a computer that developed a hardware problem.
It would randomly flip the fifth bit of some bytes, once in a while.
It took a long time to even notice the problem, because that changes
letters to/from upper/lower case, in ASCII. So we would look at
a text file, and fix some "typos" -- change "hEllo" to "hello".
Until it started happening more and more.
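The case effect can be reproduced in a few lines of Python. This assumes the bit in question is the one with value 0x20 (bit 5 when counting from zero), which is the bit that distinguishes upper and lower case letters in ASCII; `flip_case_bit` is a made-up helper name.

```python
CASE_BIT = 0x20  # differs between 'A' (0x41) and 'a' (0x61)

def flip_case_bit(text: str, position: int) -> str:
    """Invert the 0x20 bit of one byte, as the faulty hardware might have."""
    data = bytearray(text.encode("ascii"))
    data[position] ^= CASE_BIT
    return data.decode("ascii")

print(flip_case_bit("hello", 1))  # hEllo
print(flip_case_bit("hEllo", 1))  # hello
```

For non-letters the same flip gives less innocent-looking results, for example a space (0x20) becomes a NUL byte.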

- Bob