From: ANTant on
>> Wrong, Linux implements the configuration features also. Some
>> machines, probably newer laptops, can't be configured without ACPI.
> Very frustrating when `lspci` shows it. I presume some sort of device code IPL is required.

FYI if it is related to my issues:
$ lspci
00:00.0 Memory controller: nVidia Corporation CK804 Memory Controller (rev a3)
00:01.0 ISA bridge: nVidia Corporation CK804 ISA Bridge (rev a3)
00:01.1 SMBus: nVidia Corporation CK804 SMBus (rev a2)
00:02.0 USB Controller: nVidia Corporation CK804 USB Controller (rev a2)
00:02.1 USB Controller: nVidia Corporation CK804 USB Controller (rev a3)
00:04.0 Multimedia audio controller: nVidia Corporation CK804 AC'97 Audio Controller (rev a2)
00:06.0 IDE interface: nVidia Corporation CK804 IDE (rev f2)
00:07.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)
00:08.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)
00:09.0 PCI bridge: nVidia Corporation CK804 PCI Bridge (rev a2)
00:0a.0 Bridge: nVidia Corporation CK804 Ethernet Controller (rev a3)
00:0b.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
00:0c.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
00:0d.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
00:0e.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
01:09.0 Ethernet controller: Intel Corporation 82559 InBusiness 10/100 (rev 08)
05:00.0 VGA compatible controller: nVidia Corporation G92 [GeForce 8800 GT] (rev a2)


> There have been recent changes to the kernel in this area --
> perhaps a roll-back to an earlier kernel (that gave good service
> on the hardware) would be a good test. Newer is not always better.

I was using the same Kernel 2.6.30 before and after the PSU incident. I
never had problems before, but started having problems after. Unless
something else like related kernel updates (modules or whatever) started
them.

--
"We are anthill men upon an anthill world." --Ray Bradbury
/\___/\
/ /\ /\ \ Phillip (Ant) @ http://antfarm.ma.cx (Personal Web Site)
| |o o| | Ant's Quality Foraged Links (AQFL): http://aqfl.net
\ _ / Please remove ANT if replying by e-mail.
( )
From: ANTant on
>> Does anyone know what these ticks do to cause
>> these rare and random machine errors and kernel panics?
>
> No but everything about those errors looks hardware related so I'd be looking at
> replacing the cpu at the very least. That looks like the most likely component
> but it's not necessarily the right one - other bits that spring to mind are
> motherboard, PSU and RAM.

Yeah, it is probably my CPU, since my PSU and video card went dead and a
512 MB RAM stick showed memory errors in memtest86+ v4.00 before these
problems appeared. After replacing all of them, memtest86+ v4.00 passed
a few times, running for several hours each, over a few days of testing
(including its test #9).
From: Robert Redelmeier on
ANTant(a)zimage.com wrote in part:
> I was using the same Kernel 2.6.30 before and after the PSU
> incident. I never had problems before, but started having
> problems after. Unless something else like related kernel
> updates (modules or whatever) started them.

This really points towards a hardware failure. As a general
rule, the modules are only updated when the kernel changes.
I suppose someone could try the MS approach of "device drivers"
on a more-or-less static kernel, but that historically has not been
the Linux approach. New kernels come out relatively frequently, so
it is not a big deal to wait and upgrade everything. Note this does
not apply for foreign modules (like nvidia), but you did not mention
upgrading -- or did you do something when you changed vidcard?

-- Robert



From: Jerry Peters on
Robert Redelmeier <redelm(a)ev1.net.invalid> wrote:
> Jerry Peters <jerry(a)example.invalid> wrote in part:
>> Wrong, Linux implements the configuration features also. Some
>> machines, probably newer laptops, can't be configured without ACPI.
>
> While I cannot say that _none_ of the 1000s of device modules use ACPI,
> I can say that most do not need it. Not to say BIOS didn't use it.
> I've compiled lots of kernels and never needed CONFIG_ACPI_*. Nor did
> it help when I couldn't get a device working -- something fairly
> frequent under Linux, especially for wireless. Very frustrating when
> `lspci` shows it. I presume some sort of device code IPL is required.
>
> I have no problem squirting arbitrary bytes at known PCI addr[s], nor
> do I imagine Linus does either, although Stallman might. But giving
> execution over to foreign code in ring0 is a recipe for insecurity.
> You wanna get Theo de Raadt even hotter under the collar? :)

IIRC some of the newer systems need it to enumerate multiple CPUs. On
some laptops ACPI is needed to control screen brightness (then there
are the laptops that report ACPI events for screen brightness *and*
change it via firmware).

Yeah, it's a really crappy design, APM was much simpler.
What about SMI? That's even scarier. Or the trusted computing stuff.
ACPI is typical over-design engaged in by large companies. IBM used to
be famous for it.

>
>
>> I'd doubt that the OP's problem is caused by ACPI though. The TLB on
>> x86 is mostly hardware maintained, the OS's sole responsibility is to
>> purge the TLB when it changes the page tables. He's getting a parity
>> error in the associative array, that's a hardware problem.
>
> Agreed it looks like a hardware problem. But the fact it arises
> almost exclusively at one code address is very suspicious. Some code
> there seems to be triggering some hardware "sensitivity". Especially
> since the OP did not have this problem prior to a known PSU fry-fest.
>
> There have been recent changes to the kernel in this area --
> perhaps a roll-back to an earlier kernel (that gave good service
> on the hardware) would be a good test. Newer is not always better.

If I had to guess, it might be that the kernel uses large page
mappings for itself rather than the standard 4k page size.
Another possibility is only a particular bit pattern triggers the MC.
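
For anyone who wants to check the large-page guess on their own box, here
is a minimal sketch in Python. It assumes the kernel exposes DirectMap*
lines in /proc/meminfo (x86 kernels from around this era do, as far as I
know). The function names and the SAMPLE text are made up for
illustration and are not output from the OP's machine.

```python
import re

# Hypothetical sample text, used only when /proc/meminfo is unavailable.
SAMPLE = """MemTotal:        2059392 kB
DirectMap4k:        8192 kB
DirectMap2M:     2088960 kB
"""

def direct_map_sizes(meminfo_text):
    """Return {page size label: kB mapped} from DirectMap* lines."""
    sizes = {}
    for line in meminfo_text.splitlines():
        m = re.match(r"DirectMap(\w+):\s+(\d+)\s+kB", line)
        if m:
            sizes[m.group(1)] = int(m.group(2))
    return sizes

def uses_large_pages(sizes):
    """True if any direct-mapped memory uses a page size other than 4k."""
    return any(kb > 0 for label, kb in sizes.items() if label != "4k")

if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            text = f.read()
    except OSError:
        text = SAMPLE  # fall back to the illustrative sample
    sizes = direct_map_sizes(text)
    print(sizes, "large pages in use:", uses_large_pages(sizes))
```

A nonzero DirectMap2M (or DirectMap4M on non-PAE i386) only shows that
the kernel maps part of memory with large pages. It does not prove that
those mappings are what trips the machine check.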

No one seems to be complaining on LKML about machine checks in the TLB
with recent kernels, and the fact that other hardware was damaged,
probably by overvoltage, would cause me to think it's hardware.
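
Robert's point that the error arises almost exclusively at one code
address can be checked by tallying the instruction pointers in the
machine check reports. A rough sketch in Python follows. It assumes each
report contains a line with "RIP" followed by the address in angle
brackets; that format, the function name and the sample log lines are
assumptions for illustration, not the OP's actual messages.

```python
import re
from collections import Counter

# Assumed format: "RIP" followed later on the line by "<hex address>".
RIP_RE = re.compile(r"RIP\b.*?<([0-9a-fA-F]+)>")

def tally_rip(log_text):
    """Count how often each RIP address appears in the log text."""
    counts = Counter()
    for line in log_text.splitlines():
        m = RIP_RE.search(line)
        if m:
            counts[m.group(1).lower()] += 1
    return counts

# Hypothetical log excerpt, for illustration only.
SAMPLE_LOG = """CPU 0: Machine Check Exception: 4 Bank 1: 0000000000000000
RIP 10:<ffffffff8020a1b2> {example_a+0x12/0x40}
CPU 0: Machine Check Exception: 4 Bank 1: 0000000000000000
RIP 10:<ffffffff8020a1b2> {example_a+0x12/0x40}
CPU 0: Machine Check Exception: 4 Bank 1: 0000000000000000
RIP 10:<ffffffff80301c00> {example_b+0x0/0x20}
"""

if __name__ == "__main__":
    for addr, n in tally_rip(SAMPLE_LOG).most_common():
        print(addr, n)
```

If one address dominates the tally, that supports the idea that a
particular code path or bit pattern is exercising the damaged hardware.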

Jerry
From: Ant on
On 3/17/2010 1:08 PM PT, Robert Redelmeier typed:

> ANTant(a)zimage.com wrote in part:
>> I was using the same Kernel 2.6.30 before and after the PSU
>> incident. I never had problems before, but started having
>> problems after. Unless something else like related kernel
>> updates (modules or whatever) started them.
>
> This really points towards a hardware failure. As a general
> rule, the modules are only updated when the kernel changes.
> I suppose someone could try the MS approach of "device drivers"
> on a more-or-less static kernel, but that historically has not been
> the Linux approach. New kernels come out relatively frequently, so
> it is not a big deal to wait and upgrade everything. Note this does
> not apply for foreign modules (like nvidia), but you did not mention
> upgrading -- or did you do something when you changed vidcard?

Interesting. I am still having difficulties reproducing the errors and
kernel panics outside of my Debian. I tried memtest86+ v4.00 three times
and KNOPPIX v6.2.1 CD so far, and nothing. I am going to try Ubuntu v9.1
i386 CD next, maybe this weekend when I don't need to use this box much.
--
"An ant hole may collapse an embankment." --Japanese
/\___/\
/ /\ /\ \ Phil./Ant @ http://antfarm.ma.cx (Personal Web Site)
| |o o| | Ant's Quality Foraged Links: http://aqfl.net
\ _ / Nuke ANT from e-mail address: philpi(a)earthlink.netANT
( ) or ANTant(a)zimage.com
Ant is currently not listening to any songs on his home computer.