From: Borislav Petkov on
From: Jeffrey Merkey <jeffmerkey(a)gmail.com>
Date: Tue, Jun 29, 2010 at 03:13:03PM -0600

> On a 4 x Opteron HP Proliant Server with a CCISS array controller in
> x86_64 mode, under very heavy (saturated) disk IO, 2.6.34 reports the
> following error:
>
> Jun 29 02:02:08 kernel: Northbridge Error, node 0, core: 0
> Jun 29 02:02:08 kernel: ECC/ChipKill ECC error.
> Jun 29 02:02:08 kernel: EDAC amd64 MC0: CE ERROR_ADDRESS= 0xc7358280
> Jun 29 02:02:08 kernel: EDAC amd64: get_channel_from_ecc_syndrome:
> error reading F3x180.
> Jun 29 02:02:08 kernel: EDAC MC0: CE page 0xc7358, offset 0x280,
> grain 0, syndrome 0xa4c1, row 3, channel 0, label "": amd64_edac
> Jun 29 02:03:21 kernel: Northbridge Error, node 0
> Jun 29 02:03:21 kernel: ECC/ChipKill ECC error.
> Jun 29 02:03:21 kernel: EDAC amd64 MC0: CE ERROR_ADDRESS= 0xc7358280
> Jun 29 02:03:21 kernel: EDAC amd64: get_channel_from_ecc_syndrome:
> error reading F3x180.

It looks like you don't have extended PCI config space accesses enabled
on that machine. Can you send me the whole dmesg?

> Jun 29 02:03:21 kernel: EDAC MC0: CE page 0xc7358, offset 0x280,
> grain 0, syndrome 0xa4c1, row 3, channel 0, label "": amd64_edac
>
> The error is reproducible by subjecting the server to excessive disk
> loads > 350 MB/S stream to disk.

These are DRAM ECC errors. Most probably the first DIMM on node 0,
whichever that is, is slowly failing.
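
For reference, the page and offset in the "CE page" line are the reported
ERROR_ADDRESS split at the page boundary. A minimal sketch, assuming 4 KiB
pages (which matches the values in the log above); the helper names are made
up for illustration and are not the driver's own:

```c
#include <stdint.h>

/* Assumption: 4 KiB pages, i.e. the low 12 bits are the in-page offset. */
#define SKETCH_PAGE_SHIFT 12

/* Page frame number of the reported error address. */
static inline uint64_t ce_page(uint64_t error_address)
{
    return error_address >> SKETCH_PAGE_SHIFT;
}

/* Byte offset of the reported error address within its page. */
static inline uint64_t ce_offset(uint64_t error_address)
{
    return error_address & ((UINT64_C(1) << SKETCH_PAGE_SHIFT) - 1);
}
```

With ERROR_ADDRESS 0xc7358280 this gives page 0xc7358 and offset 0x280, the
same values the EDAC core prints.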

Pinpointing it is not that straightforward, but here's what you can do:

Try to figure out which one it is by looking at the silkscreen labels on
the motherboard. They're normally named "DIMM_Ax", where x is in (1,
2, ...), or "DIMM_Bx", or follow a similar scheme. If the layout on the
mobo is sane, I'm guessing the first DIMM in that naming scheme should be it.
Try swapping it out to see if the errors disappear.
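
Before and after swapping, it can help to confirm that the errors keep
landing on the same row and channel. A rough sketch that pulls those fields
out of a raw (unwrapped) syslog line in the "EDAC MCn: CE page ..." format
shown above; the struct and function names are hypothetical, and the format
string only assumes the layout seen in this log:

```c
#include <stdio.h>
#include <string.h>

/* Fields of one corrected-error (CE) record, as printed in the log. */
struct ce_event {
    unsigned int mc;
    unsigned long page;
    unsigned long offset;
    unsigned int syndrome;
    unsigned int row;
    unsigned int channel;
};

/*
 * Returns 1 and fills *ev if `line` contains an "EDAC MCn: CE page ..."
 * record, 0 otherwise. Lines such as "EDAC amd64 MC0: CE ERROR_ADDRESS= ..."
 * do not contain the "EDAC MC" prefix and are rejected.
 */
static int parse_ce_line(const char *line, struct ce_event *ev)
{
    const char *p = strstr(line, "EDAC MC");
    int n;

    if (!p)
        return 0;

    n = sscanf(p,
               "EDAC MC%u: CE page 0x%lx, offset 0x%lx, grain %*u, "
               "syndrome 0x%x, row %u, channel %u",
               &ev->mc, &ev->page, &ev->offset,
               &ev->syndrome, &ev->row, &ev->channel);
    return n == 6;
}
```

Feeding it each line of the log and counting hits per (row, channel) pair
shows whether a single location is responsible; in the excerpt above both
records report row 3, channel 0.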

--
Regards/Gruss,
Boris.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Borislav Petkov on
From: Jeffrey Merkey <jeffmerkey(a)gmail.com>
Date: Wed, Jun 30, 2010 at 01:21:04PM -0600

> >
> > It looks like you don't have extended PCI config space accesses enabled
> > on that machine. Can you send me the whole dmesg?
> >
>
> Here is the complete dmesg log of the Northbridge chip error messages.
> The drives report IO problems before the chip error message happens.

Jun 29 02:02:08 cloudstream kernel: Northbridge Error, node 0, core: 0
Jun 29 02:02:08 cloudstream kernel: ECC/ChipKill ECC error.
Jun 29 02:02:08 cloudstream kernel: EDAC amd64 MC0: CE ERROR_ADDRESS= 0xc7358280
Jun 29 02:02:08 cloudstream kernel: EDAC amd64: get_channel_from_ecc_syndrome: error reading F3x180.
Jun 29 02:02:08 cloudstream kernel: EDAC MC0: CE page 0xc7358, offset 0x280, grain 0, syndrome 0xa4c1, row 3, channel 0, label "": amd64_edac
Jun 29 02:03:21 cloudstream kernel: Northbridge Error, node 0
Jun 29 02:03:21 cloudstream kernel: ECC/ChipKill ECC error.
Jun 29 02:03:21 cloudstream kernel: EDAC amd64 MC0: CE ERROR_ADDRESS= 0xc7358280
Jun 29 02:03:21 cloudstream kernel: EDAC amd64: get_channel_from_ecc_syndrome: error reading F3x180.
Jun 29 02:03:21 cloudstream kernel: EDAC MC0: CE page 0xc7358, offset 0x280, grain 0, syndrome 0xa4c1, row 3, channel 0, label "": amd64_edac

Right, these are the ECC errors being reported. I asked about the dmesg
because of the "error reading F3x180", but you have a K8 machine, so there
is no extended PCI config space there. The error message is wrong in that
case; I will move that F3x180 read behind a family check since it makes no
sense to access that register on K8.
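
Roughly, the idea of such a family check could look like the sketch below.
This is not the actual patch: the helper name and constants are hypothetical,
and it assumes that families newer than K8 (0xf) provide extended config
space, while offsets below 0x100 sit in the legacy 256-byte PCI config space
and are always reachable.

```c
#include <stdbool.h>

#define K8_FAMILY     0x0f   /* K8 CPUs report family 0xf */
#define EXT_CFG_START 0x100  /* first offset past the legacy 256-byte config space */

/*
 * Hypothetical helper: decide whether a northbridge register at `offset`
 * should be read on a CPU of the given family.
 */
static bool nb_reg_readable(unsigned int family, unsigned int offset)
{
    if (offset < EXT_CFG_START)
        return true;            /* legacy config space, fine everywhere */

    return family > K8_FAMILY;  /* assumption: extended space only after K8 */
}
```

With this, a read of F3x180 (offset 0x180) would simply be skipped on K8
instead of failing and printing a misleading error.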

Thanks.

--
Regards/Gruss,
Boris.