From: Jeffrey R. Carter on 31 Oct 2005 01:21

Robert A Duff wrote:
> That's hard to believe. Dr. Wrigley said the hardware failures turned
> from "insidious" to "catastrophic" when he changed some sort of Things to
> pointers-to-Things. I take that to mean, he got wrong answers before,
> and crashes after. Is that right, Dr. Wrigley?

Thinking about it a little more, I get the idea that before he had 1 pointer
to a 1 GB data structure on the heap. Then he changed to a large number of
pointers to smaller data structures on the heap, equivalent to the previous
1 GB structure.

With more pointers, there's a greater chance that one of the pointers might
have a bit flip, resulting in the occasional crash (1 crash every few months,
IIRC).

Of course, this is sheer speculation.

--
Jeff Carter
"There's no messiah here. There's a mess all right, but no messiah."
Monty Python's Life of Brian
84

From: Dr. Adrian Wrigley on 1 Nov 2005 19:52

On Mon, 31 Oct 2005 06:21:37 +0000, Jeffrey R. Carter wrote:

> Robert A Duff wrote:
>
>> That's hard to believe. Dr. Wrigley said the hardware failures turned
>> from "insidious" to "catastrophic" when he changed some sort of Things to
>> pointers-to-Things. I take that to mean, he got wrong answers before,
>> and crashes after. Is that right, Dr. Wrigley?

I think so, but the error rate was very low, so it is hard to tell.

> Thinking about it a little more, I get the idea that before he had 1 pointer
> to a 1 GB data structure on the heap. Then he changed to a large number of
> pointers to smaller data structures on the heap, equivalent to the previous
> 1 GB structure.

This is exactly the situation. The "Things" were about 36 bytes each
and I changed to having around 20 million pointers to Things.
Single-bit errors in pointers had a much more significant effect
than single-bit errors in "Things" (which tended to be ignored
for various reasons).

> With more pointers, there's a greater chance that one of the pointers might
> have a bit flip, resulting in the occasional crash (1 crash every few months,
> IIRC).

Yes.

> Of course, this is sheer speculation.

Well speculated! (Was it so unclear?)

Another "feature" I observed was that files could stay cached by the OS
for months, and accumulate the occasional single-bit error. But when you
evict the cached pages and read the data again, the errors disappear.
Plenty of scope for very rare Heisenbugs.

The warning I give is that these error rates are "normal" for modern
SDRAM, but aren't usually noticed because they usually only show up if
you have several GB of memory and care about every bit 24/7.
Operating a financial business, I care about this!
--
Adrian
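
[Editorial illustration] The contrast described in this post, between a bit error in a "Thing" and a bit error in a pointer to one, can be sketched in a few lines. This is only a toy model under stated assumptions: Python list indices stand in for machine pointers, an `IndexError` stands in for a crash, and the names `flip_bit`, `things`, and `pointers` are hypothetical and do not come from the original program.

```python
def flip_bit(value: int, bit: int) -> int:
    """Return value with one bit inverted, as a single-bit memory error would."""
    return value ^ (1 << bit)


# Hypothetical data: 8 "Things" (here just integers) and a table of
# indices standing in for the pointers-to-Things.
things = [1000 + i for i in range(8)]
pointers = list(range(8))

# Case 1: the error lands in a Thing. The program keeps running and
# silently computes with a slightly wrong value.
corrupted_things = things.copy()
corrupted_things[3] = flip_bit(corrupted_things[3], 0)
silent_total = sum(corrupted_things[p] for p in pointers)

# Case 2: the error lands in a pointer. Flipping a high bit sends the
# reference far outside the valid range, and the lookup fails loudly.
corrupted_pointers = pointers.copy()
corrupted_pointers[3] = flip_bit(corrupted_pointers[3], 20)
try:
    loud_total = sum(things[p] for p in corrupted_pointers)
    crashed = False
except IndexError:
    crashed = True
```

In this sketch the first case yields a total that is off by one with no error raised, while the second case fails immediately. A flip in a low bit of the index would instead select a different valid Thing, so not every pointer error need be loud.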

From: Jeffrey R. Carter on 1 Nov 2005 22:46

Dr. Adrian Wrigley wrote:
> This is exactly the situation. The "Things" were about 36 bytes each
> and I changed to having around 20 million pointers to Things.
> Single-bit errors in pointers had a much more significant effect
> than single-bit errors in "Things" (which tended to be ignored
> for various reasons).

Right. I don't think I'd ever want to consider a continually running program
with 20 million pointers. What did that buy you over the single-pointer
version?

> Well speculated! (Was it so unclear?)

Thanks. I wasn't clear what was causing the crashes at first. With a little
more thought, it seemed likely it was dereferencing a pointer with a flipped
bit.

> Another "feature" I observed was that files could stay cached by the OS
> for months, and accumulate the occasional single-bit error. But when you
> evict the cached pages and read the data again, the errors disappear.
> Plenty of scope for very rare Heisenbugs.

Interesting. It's not something you normally have to think about; most
programs don't run for that long. I remember a noticeable number of bit
errors during a solar maximum about 1980, but don't recall it repeating in
1991 or 2002.

--
Jeff Carter
"In the frozen land of Nador they were forced to
eat Robin's minstrels, and there was much rejoicing."
Monty Python & the Holy Grail
70

From: Dr. Adrian Wrigley on 2 Nov 2005 06:16

On Wed, 02 Nov 2005 03:46:41 +0000, Jeffrey R. Carter wrote:

> Dr. Adrian Wrigley wrote:
>
>> This is exactly the situation. The "Things" were about 36 bytes each
>> and I changed to having around 20 million pointers to things.
>> Single bit errors in pointers had a much more significant effect
>> than single bit errors in "Things" (which tended to be ignored
>> for various reasons).
>
> Right. I don't think I'd ever want to consider a continually running program
> with 20 million pointers. What did that buy you over the single-pointer
> version?

It was a simple 2-D array of stock data (time/date, ticker). But some
tickers had large gaps (no data for long periods). I had been hitting
various memory limits. Adding a level of indirection allowed only
the valid data to consume memory, keeping the program within those limits.

It's still a rather simple architecture, compared to a more traditional
SQL database system for this application. But it is *very* fast
and (now) very robust.
--
Adrian
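
[Editorial illustration] The level of indirection described above might look roughly like the following. This is a minimal sketch, not the original Ada program: the `Thing` fields, the `build_table` and `lookup` helpers, and the sample records are all hypothetical, and `None` plays the role of a null pointer for a (date, ticker) cell with no data.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Thing:
    """Hypothetical stand-in for one stock record (about 36 bytes in the post)."""
    price: float
    volume: int


Table = List[List[Optional[Thing]]]


def build_table(n_dates: int, n_tickers: int,
                records: Dict[Tuple[int, int], Thing]) -> Table:
    """Build a 2-D grid of references; only cells with data hold a Thing."""
    table: Table = [[None] * n_tickers for _ in range(n_dates)]
    for (date, ticker), thing in records.items():
        table[date][ticker] = thing
    return table


def lookup(table: Table, date: int, ticker: int) -> Optional[Thing]:
    """Return the record for (date, ticker), or None where there is a gap."""
    return table[date][ticker]


# Two valid records in a 3 x 2 grid; the other four cells are gaps.
records = {(0, 0): Thing(10.0, 100), (2, 1): Thing(20.5, 50)}
table = build_table(3, 2, records)
allocated = sum(cell is not None for row in table for cell in row)
```

Every cell still costs one reference, but a full record is allocated only where data exists, which is the memory saving the post describes. The price is that each reference is now a place where a single-bit error can do more damage.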

From: Robert A Duff on 2 Nov 2005 08:39

"Dr. Adrian Wrigley" <amtw(a)linuxchip.demon.co.uk.uk.uk> writes:

> On Mon, 31 Oct 2005 06:21:37 +0000, Jeffrey R. Carter wrote:
>
> > Robert A Duff wrote:
> >
> >> That's hard to believe. Dr. Wrigley said the hardware failures turned
> >> from "insidious" to "catastrophic" when he changed some sort of Things to
> >> pointers-to-Things. I take that to mean, he got wrong answers before,
> >> and crashes after. Is that right, Dr. Wrigley?
>
> I think so, but the error rate was very low, so it is hard to tell.

Interesting.

Some years ago I used a computer that developed a hardware problem.
It would randomly flip the fifth bit of some bytes, once in a while.
It took a long time to even notice the problem, because that changes
letters to/from upper/lower case, in ASCII. So we would look at a text
file, and fix some "typos" -- change "hEllo" to "hello". Until it
started happening more and more.

- Bob