From: Andi Kleen on
> FYI, this commit triggered a -tip test failure:
>
> arch/x86/kernel/cpu/mcheck/mce-xeon75xx.c: In function 'xeon75xx_mce_init':
> arch/x86/kernel/cpu/mcheck/mce-xeon75xx.c:340: error: implicit declaration of function 'pci_match_id'

Does this patch fix it? (I assume your config is missing CONFIG_PCI)

-Andi

---

commit 748e9c041da32f173f20289c71ef0462b8e97234
Author: Andi Kleen <ak(a)linux.intel.com>
Date: Sat Jan 23 12:34:10 2010 +0100

Make xeon75xx memory driver dependent on PCI

Found by Ingo Molnar's automated tester

Signed-off-by: Andi Kleen <ak(a)linux.intel.com>

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f62db24..ab2d0a8 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -834,7 +834,7 @@ config X86_MCE_INTEL

config X86_MCE_XEON75XX
tristate "Intel Xeon 7500 series corrected memory error driver"
- depends on X86_MCE_INTEL
+ depends on X86_MCE_INTEL && PCI
---help---
Add support for a Intel Xeon 7500 series specific memory error driver.
This allows to report the DIMM and physical address on a corrected
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Mauro Carvalho Chehab on
Ingo Molnar wrote:
> * Borislav Petkov <petkovbb(a)googlemail.com> wrote:
>
>> (Adding some more interested parties to Cc:)
>>
>> On Sat, Jan 23, 2010 at 06:17:17AM +0100, Ingo Molnar wrote:
>>> * tip-bot for H. Peter Anvin <hpa(a)zytor.com> wrote:
>>>
>>>> Commit-ID: f91c4d2649531cc36e10c6bc0f92d0f99116b209
>>>> Gitweb: http://git.kernel.org/tip/f91c4d2649531cc36e10c6bc0f92d0f99116b209
>>>> Author: H. Peter Anvin <hpa(a)zytor.com>
>>>> AuthorDate: Thu, 21 Jan 2010 18:31:54 -0800
>>>> Committer: H. Peter Anvin <hpa(a)zytor.com>
>>>> CommitDate: Thu, 21 Jan 2010 18:31:54 -0800
>>>>
>>>> x86, mce: Rename cpu_specific_poll to mce_cpu_specific_poll
>>>>
>>>> cpu_specific_poll is a global variable, and it should have a global
>>>> namespace name. Since it is MCE-specific (it takes a struct mce *),
>>>> rename it mce_cpu_specific_poll.
>>>>
>>>> Signed-off-by: H. Peter Anvin <hpa(a)zytor.com>
>>>> Cc: Andi Kleen <andi(a)firstfloor.org>
>>>> LKML-Reference: <20100121221711.GA8242(a)basil.fritz.box>
>>> FYI, this commit triggered a -tip test failure:
>>>
>>> arch/x86/kernel/cpu/mcheck/mce-xeon75xx.c: In function 'xeon75xx_mce_init':
>>> arch/x86/kernel/cpu/mcheck/mce-xeon75xx.c:340: error: implicit declaration of function 'pci_match_id'
>>>
>>> I'm excluding it from tip:master.
>>>
>>> But the bigger problem with this commit is structure of it - or the lack
>>> thereof.
>>>
>>> It just blindly goes into the direction the MCE code has been going for some
>>> time, minimally enabling the hardware, ignoring both the new EDAC design and
>>> the new performance monitoring related design i outlined some time ago.
>> I completely agree - from what I see this is adding vendor- or rather
>> vendor-and-machine-specific hooks to read out (1) the position of the the
>> memory translation table from PCI config space (0x8c), (2) then to read out
>> the offset from the first MCA status register in order to (3) rdmsr the
>> status information.
>>
>> In AMD's case, we need similar hooks too, in order to evaluate correctable
>> MCEs for different RAS reasons like for example L3 cache or data arrays
>> errors for disabling L3 indices. I was looking into adding hooks into
>> machine_check_poll() and cpu_specific_poll() interface could work.
>>
>> Furthermore, lets leave mcheck be mcheck and do error decoding in EDAC
>> modules. For example, there was a core i7 EDAC module submission from Mauro
>> and the Xeon75xx-specific decoding bits could be added to it or even as a
>> new machine-specific module instead of mcelog.
>>
>> With the evergrowing complexity of memory controller design I don't think
>> that the userspace mcelog approach will scale - you need the whole decoding
>> in the kernel where the module knows the exact memory controllers setup and
>> which DRAM addresses belong to which nodes and whether you do memory
>> hoisting and whether you interleave, if yes, how and on what granilarity you
>> interleave and on and on...
>
> Yep. Could you give a few pointers to Andi where exactly you'd like to see the
> Intel Xeon functionality added to the EDAC code? There is some Intel
> functionality there already, but the current upstream code does not look very
> uptodate. I've looked at e752x_edac.c. (there's some Corei7 work pending,
> right?) In any case there's a lot of fixing to be done to the Intel code
> there.

The i7core_edac driver covers the Nehalem chipsets. It is at:
http://git.kernel.org/?p=linux/kernel/git/mchehab/i7core.git

The error decoding is done by i7core_mce_output_error(). If Xeon 75xx needs to do
something different, it is just a matter of calling a function there, to retrieve
the values from the proper registers.

In order to avoid duplication at the code, the driver is receiving the error events
from the mcelog driver, and having two drivers fighting to access the same error
registers at the same time, this driver relies on mce driver to send the events.

I expect to allocate some time to put the driver into a better shape for
upstream submission soon.

> Memory controller functionality integrated into the kernel proper is a good
> idea partly for similar reasons an on-die memory controller is a good idea in
> the hardware space.

I agree. The EDAC module handles memory errors for several memory controllers,
including some on non-x86 architectures. Letting this decode to happen outside
the kernel, just for a very few set of memory controllers, seems a bad idea.

Cheers,
Mauro.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Mauro Carvalho Chehab on
Ingo Molnar wrote:
> * Borislav Petkov <petkovbb(a)googlemail.com> wrote:
>
>> From: Ingo Molnar <mingo(a)elte.hu>
>> Date: Tue, Feb 16, 2010 at 10:02:15PM +0100
>> Hi,
>>
>>> I like it.
>>>
>>> You can do it as a 'perf hw' subcommand - or start off a fork as the 'hw'
>>> utility, if you'd like to maintain it separately. It would have a daemon
>>> component as well, to receive and log hardware events continuously, to
>>> trigger policy action, etc.
>>>
>>> I'd suggest you start to do it in small steps, always having something that
>>> works - and extend it gradually.
>> I had the chance to meditate over the weekend a bit more on the whole
>> RAS thing after rereading all the discussion points more carefully.
>> Here are some aspects I think are important which I'd like to drop here
>> rather sooner than later so that we're in sync and don't waste time
>> implementing the wrong stuff:
>>
>> * Critical errors: we need to switch to a console and dump decoded error
>> there at least, before panicking. Nowadays, almost everyone has a camera
>> with which that information can be extracted from the screen. I'm afraid we
>> won't be able to send the error over a network since climbing up the TCP
>> stack takes relatively long and we cannot risk error propagation...? We
>> could try to do it on a core which is not affected by the error though as a
>> last step in the sequence...
>>
>> I think this is much more user-friendly than the current panicking which is
>> never seen when running X except when the user has a serial/netconsole
>> sending to some other machine.
>
> Yep.

If the user has set up kexec/kdump, currently, the error is dumped
to the crash log at the local machine or to another machine, depending on
the config.

I like the idea of switching to console before panicking, but this is sometimes
problematic, as there are some video drivers that switching to console don't
always work. Also, this could interfere at the kdump setup, if configured.
So, if kdump or serial/netconsole is enabled, the better is to not try to switch
the video mode. So, I think the better is to allow userspace to select one or
another mode.

>> All other non-that-critical errors are copied to userspace over a mmapped
>> buffer and then the uspace daemon is being poked with a uevent to dump the
>> error/signal over network/parse its contents and do policy stuff.

It seems interesting, but, if the userspace daemon is not running, it should
fallback to write the errors via dmesg.

> If you use perf here you get the events and can poll() the event channel.
> User-space can decide which events to listen in on. uevent/user-notifier is a
> bit clumsy for that.
>
>> * receive commands by syscall, also for hw config: I like the idea of
>> sending commands to the kernel over a syscall, we can reuse perf
>> functionality here and make those reused bits generic.

Why? I think we should keep using sysfs for hw config, and it seems that you also
agree. Sysfs work fine and fits nice for enumerating hierarchical data.
What's the rationale to add also a syscall API?

The EDAC data model needs some discussion, as, currently, the memory is represented
per csrow, and modern MCU don't allow such level of control (and it doesn't
make much sense on representing this way, as you can't replace a csrow). The
better is to use DIMM as the minumum unit.

>>
>> * do not bind to error format etc: not a big fan of slaving to an error
>> format - just dump error info into the buffer and let userspace format it.
>> We can do the formatting if we absolutely have to.

The format code is needed for critical errors anyway.

> If you use perf and tracepoints to shape the event log format then this is all
> taken care of already, you get structured event format descriptors in
> /debug/tracing/events/*. For example there's already an MCE tracepoint in the
> upstream kernel today (for thermal events):
>
> phoenix:/home/mingo> cat /debug/tracing/events/mce/mce_record/format
> name: mce_record
> ID: 28
> format:
> field:unsigned short common_type; offset:0; size:2; signed:0;
> field:unsigned char common_flags; offset:2; size:1; signed:0;
> field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
> field:int common_pid; offset:4; size:4; signed:1;
> field:int common_lock_depth; offset:8; size:4; signed:1;
>
> field:u64 mcgcap; offset:16; size:8; signed:0;
> field:u64 mcgstatus; offset:24; size:8; signed:0;
> field:u8 bank; offset:32; size:1; signed:0;
> field:u64 status; offset:40; size:8; signed:0;
> field:u64 addr; offset:48; size:8; signed:0;
> field:u64 misc; offset:56; size:8; signed:0;
> field:u64 ip; offset:64; size:8; signed:0;
> field:u8 cs; offset:72; size:1; signed:0;
> field:u64 tsc; offset:80; size:8; signed:0;
> field:u64 walltime; offset:88; size:8; signed:0;
> field:u32 cpu; offset:96; size:4; signed:0;
> field:u32 cpuid; offset:100; size:4; signed:0;
> field:u32 apicid; offset:104; size:4; signed:0;
> field:u32 socketid; offset:108; size:4; signed:0;
> field:u8 cpuvendor; offset:112; size:1; signed:0;
>
> print fmt: "CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, ADDR/MISC: %016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: %x", REC->cpu, REC->mcgcap, REC->mcgstatus, REC->bank, REC->status, REC->addr, REC->misc, REC->cs, REC->ip, REC->tsc, REC->cpuvendor, REC->cpuid, REC->walltime, REC->socketid, REC->apicid
>
> tools/perf/util/trace-event-parse.c contains the above structured format
> descriptor parsing code, and can turn it into records that you can read out
> from C code - and provides all sorts of standard functionality over it.
>
> I'd strongly suggest to reuse that - we _really_ want health monitoring and
> general system performance monitoring to share a single facility: as they are
> both one and the same thing, just from different viewpoints.
>
> In other words: 'system component failure' is another metric of 'system
> performance', so there's strong synergies all around.

Agreed.

>
>> * can also configure hw: The tool can also send commands over the syscall to
>> configure certain aspects of the hardware, like:
>>
>> - disable L3 cache indices which are faulty
>> - enable/disable MCE error sources: toggle MCi_CTL, MCi_CTL_MASK bits
>> - disable whole DIMMs: F2x[1, 0][5C:40][CSEnable]
>> - control ECC checking
>> - enable/disable powering down of DRAM regions for power savings
>> - set memory clock frequency
>> - some other relevant aspects of hw/CPU configuration

I agree that configuring the hw is interesting, while I still think that we should
use sysfs for it.

> Once the hardware's structure is enumerated (into a tree/hiearchy), and events
> are attached to individual components, then 'commands' are the next logical
> step: they are methods of a given component/object.
>
> One such method could be 'injection' functionality btw: to simulate rare
> hardware failures and to make sure policy logic is ready for all
> eventualities.

The i7core, amd64 and a very few other EDAC drivers already implements memory
error injection errors, via sysfs. The point with error injection is that this
feature is hardware dependent.

So, while you can use the same hierarchy/tree for hw description, things like
error injection will require an specific hierarchy. That's one of the reasons
why I think we should keep using sysfs: it can easily be used to represent data
that are hardware dependent.

The way I mapped this on the i7core_edac is that I kept using the standard EDAC
sysfs hierarchy for memory, and added a generic code that allows describing
the error injection bits found in Nehalem.

> But ... while that is clearly the 'big grand' end goal, the panacea of RAS
> design, i'd suggest to start with a small but useful base and pick up low
> hanging fruits - then work towards this end goal. This is how perf is
> developed/maintained as well.
>
> So i'd suggest to start with _something_ that other people can try and have a
> look at and extend, for example something that replaces basic mcelog
> functionality. That alone should be fairly easy and immediately gives it a
> short-term purpose. It would also be highly beneficial to the x86 code to get
> rid of the mcelog abonimation.
>
>> * keep all info in sysfs so that no tool is needed for accessing it,
>> similar to ftrace: All knobs needed for user interaction should appear
>> redundantly as sysfs files/dirs so that configuration/query can be done
>> "by hand" even when the hw tool is missing
>
> Please share this code with perf. Profiling needs the same kind of 'hardware
> structure' enumeration - combined with 'software component enumeration'.
>
> Currently we have that info /debug/tracing/events/. Some hw structure is in
> there as well, but not much - most of it is kernel subsystem event structure.
>
> sysfs would be an option but IMO it's even better to put ftrace's
> /debug/tracing/events/ hiearchy into a separate eventfs - and extend it with
> 'hardware structure' details.
>
> This would not only crystalise the RAS purpose, but would nicely extend perf
> as well. With every hardware component you add from the RAS angle we'd get new
> events for tracing/profiling use as well - and vice versa. There's no reason
> why RAS should be limited to hw component failure events: a RAS policy action
> could be defined over OOM events too for example, or over checksum failures in
> network packets - etc.
>
> RAS is not just about hardware, and profiling isnt just about software. We
> want event logging to be a unified design - there's big advantages to that.
>
> So please go for an integrated design. The easiest and most useful way for
> that would be to factor out /debug/tracing/events/ into /eventfs.
>
>> * gradually move pieces of RAS code into kernel proper: important
>> codepaths/aspects from the HW which are being queried often (e.g., DIMM
>> population and config) should be moved gradually into the kernel proper.
>
> Yeah. Good plans.
>
> Ingo


--

Cheers,
Mauro
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Mauro Carvalho Chehab on
Mauro Carvalho Chehab wrote:
> The EDAC data model needs some discussion, as, currently, the memory is represented
> per csrow, and modern MCU don't allow such level of control (and it doesn't
> make much sense on representing this way, as you can't replace a csrow). The
> better is to use DIMM as the minumum unit.

Just to start the data model, this is what a typical EDAC driver presents:

/sys/devices/system/edac/mc/mc0/
|-- ce_count
|-- ce_noinfo_count
|-- csrow0
| |-- ce_count
| |-- ch0_ce_count
| |-- ch0_dimm_label
| |-- ch1_ce_count
| |-- ch1_dimm_label
| |-- ch2_ce_count
| |-- ch2_dimm_label
| |-- ch3_ce_count
| |-- ch3_dimm_label
| |-- dev_type
| |-- edac_mode
| |-- mem_type
| |-- size_mb
| `-- ue_count
|-- csrow1
| |-- ce_count
| |-- ch0_ce_count
| |-- ch0_dimm_label
| |-- ch1_ce_count
| |-- ch1_dimm_label
| |-- ch2_ce_count
| |-- ch2_dimm_label
| |-- ch3_ce_count
| |-- ch3_dimm_label
| |-- dev_type
| |-- edac_mode
| |-- mem_type
| |-- size_mb
| `-- ue_count
|-- device -> ../../../../pci0000:3f/0000:3f:03.0
|-- mc_name
|-- reset_counters
|-- sdram_scrub_rate
|-- seconds_since_reset
|-- size_mb
|-- ue_count
`-- ue_noinfo_count

In the case of i7core_edac, there's no way to identify csrows by using
the public registers (I've no idea is is there any non-documented register
for it). So, the driver maps one dimm per "edac csrow".

It would be good to see a better struct for it.

With respect to error injection, this is the way i7core maps it:

|-- inject_addrmatch
| |-- bank
| |-- channel
| |-- col
| |-- dimm
| |-- page
| `-- rank
|-- inject_eccmask
|-- inject_enable
|-- inject_section
|-- inject_type

The inject_addrmatch leaves control a match filter for the error
where the error will be inject. I doubt we would find a way to standardize it.

Cheers,
Mauro
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/