From: Gary Mills on
We had a reboot recently that was a result of this hardware fault:

--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Jun 03 14:41:55 f7fe7526-3295-c81b-ff45-a996faed8072 INTEL-8000-WS Critical

Fault class : fault.cpu.intel.nb.fsb
FRU : hc://:product-id=-zMemory-Scrub:chassis-id=(stuck:server-id=electra/motherboard=0/chip=4
faulty

How do I determine which CPU or core is at fault? This is on an E4450
with four four-core CPUs. `psrinfo -vp' says:

The physical processor has 4 virtual processors (0 4-6)
x86 (chipid 0x0 GenuineIntel family 6 model 15 step 11 clock 2933 MHz)
Intel(r) Xeon(r) CPU X7350 @ 2.93GHz
The physical processor has 4 virtual processors (1 7-9)
x86 (chipid 0x2 GenuineIntel family 6 model 15 step 11 clock 2933 MHz)
Intel(r) Xeon(r) CPU X7350 @ 2.93GHz
The physical processor has 4 virtual processors (2 10-12)
x86 (chipid 0x4 GenuineIntel family 6 model 15 step 11 clock 2933 MHz)
Intel(r) Xeon(r) CPU X7350 @ 2.93GHz
The physical processor has 4 virtual processors (3 13-15)
x86 (chipid 0x6 GenuineIntel family 6 model 15 step 11 clock 2933 MHz)
Intel(r) Xeon(r) CPU X7350 @ 2.93GHz
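
One way to line the two outputs up is sketched below. It rests on an assumption the thread does not confirm: that the `chip=4` component of the FRU name refers to the same number `psrinfo` reports as `chipid 0x4`. The function names are made up for illustration, and the FRU string is shortened to its last two components.

```python
import re

# psrinfo -vp output as posted above.
PSRINFO_OUTPUT = """\
The physical processor has 4 virtual processors (0 4-6)
x86 (chipid 0x0 GenuineIntel family 6 model 15 step 11 clock 2933 MHz)
Intel(r) Xeon(r) CPU X7350 @ 2.93GHz
The physical processor has 4 virtual processors (1 7-9)
x86 (chipid 0x2 GenuineIntel family 6 model 15 step 11 clock 2933 MHz)
Intel(r) Xeon(r) CPU X7350 @ 2.93GHz
The physical processor has 4 virtual processors (2 10-12)
x86 (chipid 0x4 GenuineIntel family 6 model 15 step 11 clock 2933 MHz)
Intel(r) Xeon(r) CPU X7350 @ 2.93GHz
The physical processor has 4 virtual processors (3 13-15)
x86 (chipid 0x6 GenuineIntel family 6 model 15 step 11 clock 2933 MHz)
Intel(r) Xeon(r) CPU X7350 @ 2.93GHz
"""


def expand_ids(spec):
    """Turn a list such as '2 10-12' into [2, 10, 11, 12]."""
    ids = []
    for token in spec.replace(",", " ").split():
        if "-" in token:
            low, high = token.split("-")
            ids.extend(range(int(low), int(high) + 1))
        else:
            ids.append(int(token))
    return ids


def parse_psrinfo(text):
    """Map each chipid (as an int) to its virtual processor IDs."""
    chips = {}
    pending = None
    for line in text.splitlines():
        match = re.search(r"virtual processors? \(([^)]*)\)", line)
        if match:
            pending = expand_ids(match.group(1))
            continue
        match = re.search(r"chipid (0x[0-9a-fA-F]+)", line)
        if match and pending is not None:
            chips[int(match.group(1), 16)] = pending
            pending = None
    return chips


def chip_from_fru(fru):
    """Pull the trailing chip=N number out of an FRU name, or None."""
    match = re.search(r"/chip=(\d+)", fru)
    return int(match.group(1)) if match else None


# ASSUMPTION: the FRU chip number equals the psrinfo chipid.
chip = chip_from_fru("motherboard=0/chip=4")
print(chip, parse_psrinfo(PSRINFO_OUTPUT).get(chip))
```

If that assumption holds, the fault would point at the third physical processor in the listing (virtual processors 2 and 10-12). It says nothing about which physical socket that is on the board.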

--
-Gary Mills- -Unix Group- -Computer and Network Services-
From: Cydrome Leader on
Gary Mills <mills(a)cc.umanitoba.ca> wrote:
> We had a reboot recently that was a result of this hardware fault:
>
> --------------- ------------------------------------ -------------- ---------
> TIME EVENT-ID MSG-ID SEVERITY
> --------------- ------------------------------------ -------------- ---------
> Jun 03 14:41:55 f7fe7526-3295-c81b-ff45-a996faed8072 INTEL-8000-WS Critical
>
> Fault class : fault.cpu.intel.nb.fsb
> FRU : hc://:product-id=-zMemory-Scrub:chassis-id=(stuck:server-id=electra/motherboard=0/chip=4
> faulty
>
> How do I determine which CPU or core is at fault? This is on an E4450
> with four four-core CPUs. `psrinfo -vp' says:

While you can disable cores or processors on Solaris x86, it's not clear
that doing so really does anything. On a SPARC platform, yes, you can
really disable memory and processors, and it's for real.
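
For reference, taking processors offline on Solaris is normally done with `psradm -f`. The sketch below only builds the command lines and does not run them. The helper name is hypothetical, and the processor IDs are the ones `psrinfo` lists under chipid 0x4 in the original post, used purely as an example.

```python
def offline_commands(processor_ids):
    """Return one 'psradm -f <id>' argument list per processor ID.

    Nothing is executed here; the caller decides whether to run them.
    """
    return [["psradm", "-f", str(pid)] for pid in processor_ids]


# Example IDs only: the virtual processors shown under chipid 0x4.
for argv in offline_commands([2, 10, 11, 12]):
    print(" ".join(argv))
```

Whether offlining has any real effect on x86 hardware is exactly the open question raised here; `psrinfo` should at least show the resulting status of each processor.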

I've seen Xeon processors (really cores) fail in Solaris before, and in
real life there's nothing wrong at all with the CPU. For Intel hardware
just rebooting seems to be the fix. I suspect it's some sort of software
issue.


From: Gary Mills on
In <huru7f$ing$1(a)reader1.panix.com> Cydrome Leader <presence(a)MUNGEpanix.com> writes:

>Gary Mills <mills(a)cc.umanitoba.ca> wrote:
>> We had a reboot recently that was a result of this hardware fault:
>>
>> --------------- ------------------------------------ -------------- ---------
>> TIME EVENT-ID MSG-ID SEVERITY
>> --------------- ------------------------------------ -------------- ---------
>> Jun 03 14:41:55 f7fe7526-3295-c81b-ff45-a996faed8072 INTEL-8000-WS Critical
>>
>> Fault class : fault.cpu.intel.nb.fsb
>> FRU : hc://:product-id=-zMemory-Scrub:chassis-id=(stuck:server-id=electra/motherboard=0/chip=4
>> faulty
>>
>> How do I determine which CPU or core is at fault? This is on an E4450
>> with four four-core CPUs. `psrinfo -vp' says:

In this instance, I'd really like to know which CPU was faulty.
I can guess, but I might be wrong. (It was actually an X4450.)

>I've seen Xeon processors (really cores) fail in Solaris before, and in
>real life there's nothing wrong at all with the CPU. For Intel hardware
>just rebooting seems to be the fix. I suspect it's some sort of software
>issue.

This server needed a power-cycle before it came back to normal. A
reboot wasn't sufficient. Either something didn't get reset fully
or it was a real hardware failure.

--
-Gary Mills- -Unix Group- -Computer and Network Services-
From: Cydrome Leader on
Gary Mills <mills(a)cc.umanitoba.ca> wrote:
> In <huru7f$ing$1(a)reader1.panix.com> Cydrome Leader <presence(a)MUNGEpanix.com> writes:
>
>>Gary Mills <mills(a)cc.umanitoba.ca> wrote:
>>> We had a reboot recently that was a result of this hardware fault:
>>>
>>> --------------- ------------------------------------ -------------- ---------
>>> TIME EVENT-ID MSG-ID SEVERITY
>>> --------------- ------------------------------------ -------------- ---------
>>> Jun 03 14:41:55 f7fe7526-3295-c81b-ff45-a996faed8072 INTEL-8000-WS Critical
>>>
>>> Fault class : fault.cpu.intel.nb.fsb
>>> FRU : hc://:product-id=-zMemory-Scrub:chassis-id=(stuck:server-id=electra/motherboard=0/chip=4
>>> faulty
>>>
>>> How do I determine which CPU or core is at fault? This is on an E4450
>>> with four four-core CPUs. `psrinfo -vp' says:
>
> In this instance, I'd really like to know which CPU was faulty.
> I can guess, but I might be wrong. (It was actually an X4450.)
>
>>I've seen Xeon processors (really cores) fail in Solaris before, and in
>>real life there's nothing wrong at all with the CPU. For Intel hardware
>>just rebooting seems to be the fix. I suspect it's some sort of software
>>issue.
>
> This server needed a power-cycle before it came back to normal. A
> reboot wasn't sufficient. Either something didn't get reset fully
> or it was a real hardware failure.

If you have any core files, Sun might be able to tell you which CPU it
thinks faulted. Since you're running on Sun hardware they should probably
be able to help with this.

If you can, running VTS for a few days might be a good idea.
From: Gary Mills on
In <huv042$2rn$2(a)reader1.panix.com> Cydrome Leader <presence(a)MUNGEpanix.com> writes:

>Gary Mills <mills(a)cc.umanitoba.ca> wrote:
>> In <huru7f$ing$1(a)reader1.panix.com> Cydrome Leader <presence(a)MUNGEpanix.com> writes:
>>
>>>Gary Mills <mills(a)cc.umanitoba.ca> wrote:
>>>> We had a reboot recently that was a result of this hardware fault:
>>>>
>>>> --------------- ------------------------------------ -------------- ---------
>>>> TIME EVENT-ID MSG-ID SEVERITY
>>>> --------------- ------------------------------------ -------------- ---------
>>>> Jun 03 14:41:55 f7fe7526-3295-c81b-ff45-a996faed8072 INTEL-8000-WS Critical
>>>>
>>>> Fault class : fault.cpu.intel.nb.fsb
>>>> FRU : hc://:product-id=-zMemory-Scrub:chassis-id=(stuck:server-id=electra/motherboard=0/chip=4
>>>> faulty
>>
>> This server needed a power-cycle before it came back to normal. A
>> reboot wasn't sufficient. Either something didn't get reset fully
>> or it was a real hardware failure.

>If you have any core files, Sun might be able to tell you which CPU it
>thinks faulted. Since you're running on Sun hardware they should probably
>be able to help with this.

There was no core file or traceback, just a sudden reboot. Oracle/Sun
is going to replace one of the CPUs. I just wanted an independent
way to verify which one it was.

--
-Gary Mills- -Unix Group- -Computer and Network Services-