From: Greg KH on
On Wed, May 19, 2010 at 09:14:36AM +0200, Peter Zijlstra wrote:
> On Tue, 2010-05-18 at 19:48 -0700, Greg KH wrote:
> > Again, why do you need/want anything in sysfs in the first place?
> > What problem is it going to solve? Who is going to benifit? Why do
> > they care? What is this whole thing about?
>
>
> OK, so all of this is about perf_event. The story starts with CPUs
> adding a PMU (Performance Monitor Unit) which allows the user to
> count/sample cpu state.
>
> The whole perf_counter subsystem was created to abstract this piece of
> hardware and provide an kernel interface to it.
>
> Then we realized that a generalization of the PMU exists in pretty much
> everything that generates 'events' of interest and so we started adding
> software PMUs that allowed us to do the same for tracepoints etc.
>
> So we ended up with perf_events. A subsystem dedicated to counting
> events and event based sampling.
>
> Now the problem this patch set tries to solve; more hardware than the
> CPU has such capabilities. There are memory controllers, bus controllers
> and devices with similar capabilities.
>
> So we need a way to identify and locate these things, and since sysfs
> has the full machine topology in it, the idea was to represent these
> things in sysfs as an event_source class.
>
> Since the CPU and memory controllers are (assumed) symmetric on the
> system, we get to add things like:
>
>
> /sys/devices/system/cpu/cpu_event_source/

Wouldn't that really be:
/sys/devices/system/cpu/cpu0/cpu_event_source/
?

/sys/devices/system/cpu is a "type" of devices in the system here, and
isn't an event source specific to the device itself?

Or is it for all cpus together?

> /sys/devices/system/node/node_event_source/
>
> Devices like GPUs can do:
>
> /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/radeon_event_source/
>
> Hooking them into sysfs at the proper device/machine topology location
> allows us to quickly locate and identify these 'event_sources'.

Ok, this all makes a lot more sense now, thanks.

> Since all hardware wants to keep life interesting they all work
> differently and programming PMUs is no different, they count different
> things, have different ways to program them etc. But for each class
> there is a useful subset of things that is pretty uniform.
>
> CPU based PMUs all can count things like clock-cycles and instructions,
> Memory controllers can count things like local/remote memory accesses
> etc.
>
> So each class has a number of actual events that are worthy of
> abstracting. The idea was to place these events in the event_source,
> like:
>
> /sys/devices/system/cpu/cpu_event_source/cycles/
> /sys/devices/system/cpu/cpu_event_source/instructions/
>
>
>
> And then there are the software event_sources that expose kernel events
> (through tracepoints), currently tracepoints live
> in /debug/tracing/events/ (or /sys/kernel/debug/tracing/events/ for
> those so inclined). But the above abstraction would suggest we expose
> them similarly.
>
> I'm not sure where we'd want them to live, we could add them to:
>
> /sys/kernel/tracepoint_event_source/
>
> and have them live there, but I'm open to alternatives :-)

Once you go outside of /sys/devices/ you aren't playing with devices
properly, so you might just want to stick to a "class" and have
/sys/class/tracepoint_event_source/ where all of the devices would end
up symlinking to.

> [ With event_source's being a sysfs-class, we also get a nice flat
> collection in /sys/class/event_source/ helping those who get lost
> in the device topology, me :-) ]

Yes, but isn't the fact that you can have different types of
event sources lend itself to different classes of event sources?

> The next issue seems to be the interface between this sysfs
> representation and the perf_event syscall, how do we go about creating
> an actual perf_event object from this rich sysfs event_source class
> object.
>
> The sys_perf_event_open() call takes a struct perf_event_attr pointer
> which describes the event and its properties. The current event
> classification goes through:
>
> struct perf_event_attr {
> __u32 type;
> __u64 config;
>
> ...
> };
>
> So my initial idea was to let each event_source have a type_id and let
> each of its events have a config field and read those and insert them in
> your structure.
>
> So we'd get:
>
> /sys/devices/system/cpu/cpu_event_source/type_id
> /sys/devices/system/cpu/cpu_event_source/instructions/config
>
> cat those to get: .type = 0, .config = 1
> (PERF_TYPE_HARDWARE:PERF_COUNT_HW_INSTRUCTIONS).
>
> Then Ingo objected and said, if we need to open and read those file, you
> might as well just open one file and pass the fd along, saves some
> syscalls.
>
> So you'd end up doing:
>
> fd = open("/sys/devices/system/cpu/cpu_event_source/instructions/config");
> attr->type = fd | PERF_TYPE_FD;
> event_fd = perf_event_open(attr, ... );
> close(fd);
>
> From that one fd we can find to which 'event_source' it belongs and what
> particular config we need to use.

Ah, pass the fd of a sysfs file to sysfs to get the kobject. Ick,
that's just, well, something that I never even considered someone would
need/want to do...

sysfs exports single values just fine. If you are starting to do more
complex things, like you currently are, maybe you shouldn't be in
sysfs...

I can always knock up a eventfs for you do mount at /sys/kernel/events/
or something if you want :)

thanks,

greg k-h
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Peter Zijlstra on
On Thu, 2010-05-20 at 11:42 -0700, Greg KH wrote:
> On Wed, May 19, 2010 at 09:14:36AM +0200, Peter Zijlstra wrote:

> > Since the CPU and memory controllers are (assumed) symmetric on the
> > system, we get to add things like:
> >
> >
> > /sys/devices/system/cpu/cpu_event_source/
>
> Wouldn't that really be:
> /sys/devices/system/cpu/cpu0/cpu_event_source/
> ?
>
> /sys/devices/system/cpu is a "type" of devices in the system here, and
> isn't an event source specific to the device itself?
>
> Or is it for all cpus together?

All CPUs are assumed identical, and the perf syscall has task/cpu
monitor targets. If the CPUs would not be identical (like Paul Mundt
said SH might do) then it would make sense to have different
event_sources for each cpu.

> > I'm not sure where we'd want them to live, we could add them to:
> >
> > /sys/kernel/tracepoint_event_source/
> >
> > and have them live there, but I'm open to alternatives :-)
>
> Once you go outside of /sys/devices/ you aren't playing with devices
> properly, so you might just want to stick to a "class" and have
> /sys/class/tracepoint_event_source/ where all of the devices would end
> up symlinking to.

Sure, that would work.

> > [ With event_source's being a sysfs-class, we also get a nice flat
> > collection in /sys/class/event_source/ helping those who get lost
> > in the device topology, me :-) ]
>
> Yes, but isn't the fact that you can have different types of
> event sources lend itself to different classes of event sources?

I'm not quite sure adding another abstraction level buys us much.

> > fd = open("/sys/devices/system/cpu/cpu_event_source/instructions/config");
> > attr->type = fd | PERF_TYPE_FD;
> > event_fd = perf_event_open(attr, ... );
> > close(fd);
> >
> > From that one fd we can find to which 'event_source' it belongs and what
> > particular config we need to use.
>
> Ah, pass the fd of a sysfs file to sysfs to get the kobject. Ick,
> that's just, well, something that I never even considered someone would
> need/want to do...

No, we don't pass the fd to sysfs, we pass the fd into a syscall.

> sysfs exports single values just fine. If you are starting to do more
> complex things, like you currently are, maybe you shouldn't be in
> sysfs...

Well, like said, I'm fine with it actually being single values, its just
that Ingo suggested skipping a few syscalls.

Robert just suggested we could use the sysfs files as device nodes and
have then open() return a perf_event fd. Its just that that would
require we add a ioctl to change the perf_event_attr structure and
attach it to a context.

> I can always knock up a eventfs for you do mount at /sys/kernel/events/
> or something if you want :)

But that won't get us the nice device linkage, right?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Ingo Molnar on

* Greg KH <greg(a)kroah.com> wrote:

> [...]
>
> I can always knock up a eventfs for you do mount at /sys/kernel/events/ or
> something if you want :)

eventfs was my first idea, until Peter convinced me that we want sysfs :-)

One important aspect would be to move it into the physical topology. Graphics
card? It might have events. PCI device? It might have events. Southbridge? It
might have a PMU and events. CPU? It has a PMU.

Especially when it comes to complex physical topologies on larger systems, we
eventually want to visualize things in tooling as well - as a tree of the
physical topology. Also, physical topologies will only become more complex, so
we dont want to detach events from them.

> sysfs exports single values just fine. If you are starting to do more
> complex things, like you currently are, maybe you shouldn't be in sysfs...

This is really like a read-only attributes, and it would be multi-line only
for the event format descriptor - a genuinely new aspect: a flexible ABI
descriptor.

It's an attribute for a very good purpose: flexible ABI with a user-space that
interprets new format descriptions automatically. This is not just theory, for
example perf trace does this today, and you can write scripts with old tools
for a new event that shows up in a new kernel, without rebuilding the tools.

Here is an example of a format descriptor:

# cat /debug/tracing/events/sched/sched_wakeup/format
name: sched_wakeup
ID: 59
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:int common_lock_depth; offset:8; size:4; signed:1;

field:char comm[TASK_COMM_LEN]; offset:12; size:16; signed:1;
field:pid_t pid; offset:28; size:4; signed:1;
field:int prio; offset:32; size:4; signed:1;
field:int success; offset:36; size:4; signed:1;
field:int target_cpu; offset:40; size:4; signed:1;

print fmt: "comm=%s pid=%d prio=%d success=%d target_cpu=%03d", REC->comm, REC->pid, REC->prio, REC->success, REC->target_cpu

Also, we already have quite a few multi-line files in sysfs, for example:

$ cat /sys/devices/pnp0/00:09/options
Dependent: 00 - Priority preferred
port 0x378-0x378, align 0x0, size 0x8, 16-bit address decoding
port 0x778-0x778, align 0x0, size 0x8, 16-bit address decoding
irq 7 High-Edge
dma 3 8-bit compatible
Dependent: 01 - Priority acceptable
port 0x378-0x378, align 0x0, size 0x8, 16-bit address decoding
port 0x778-0x778, align 0x0, size 0x8, 16-bit address decoding
irq 3,4,5,6,7,10,11,12 High-Edge
dma 0,1,2,3 8-bit compatible
Dependent: 02 - Priority acceptable
port 0x278-0x278, align 0x0, size 0x8, 16-bit address decoding
port 0x678-0x678, align 0x0, size 0x8, 16-bit address decoding
irq 3,4,5,6,7,10,11,12 High-Edge
dma 0,1,2,3 8-bit compatible
Dependent: 03 - Priority acceptable
port 0x3bc-0x3bc, align 0x0, size 0x4, 16-bit address decoding
port 0x7bc-0x7bc, align 0x0, size 0x4, 16-bit address decoding
irq 3,4,5,6,7,10,11,12 High-Edge
dma 0,1,2,3 8-bit compatible

$ cat /sys/devices/pci0000:00/0000:00:1a.7/pools
poolinfo - 0.1
ehci_sitd 0 0 96 0
ehci_itd 0 0 160 0
ehci_qh 4 42 96 1
ehci_qtd 4 42 96 1
buffer-2048 0 0 2048 0
buffer-512 0 0 512 0
buffer-128 0 0 128 0
buffer-32 1 128 32 1

In fact uevents have multi-line attributes as well:

$ cat /sys/devices/pci0000:00/0000:00:1a.1/usb4/uevent
MAJOR=189
MINOR=384
DEVNAME=bus/usb/004/001
DEVTYPE=usb_device
DRIVER=usb
DEVICE=/proc/bus/usb/004/001
PRODUCT=1d6b/1/206
TYPE=9/0/0
BUSNUM=004
DEVNUM=001

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Greg KH on
On Thu, May 20, 2010 at 09:52:32PM +0200, Peter Zijlstra wrote:
> On Thu, 2010-05-20 at 11:42 -0700, Greg KH wrote:
> > On Wed, May 19, 2010 at 09:14:36AM +0200, Peter Zijlstra wrote:
>
> > > Since the CPU and memory controllers are (assumed) symmetric on the
> > > system, we get to add things like:
> > >
> > >
> > > /sys/devices/system/cpu/cpu_event_source/
> >
> > Wouldn't that really be:
> > /sys/devices/system/cpu/cpu0/cpu_event_source/
> > ?
> >
> > /sys/devices/system/cpu is a "type" of devices in the system here, and
> > isn't an event source specific to the device itself?
> >
> > Or is it for all cpus together?
>
> All CPUs are assumed identical, and the perf syscall has task/cpu
> monitor targets. If the CPUs would not be identical (like Paul Mundt
> said SH might do) then it would make sense to have different
> event_sources for each cpu.

Ah, ok.

> > > fd = open("/sys/devices/system/cpu/cpu_event_source/instructions/config");
> > > attr->type = fd | PERF_TYPE_FD;
> > > event_fd = perf_event_open(attr, ... );
> > > close(fd);
> > >
> > > From that one fd we can find to which 'event_source' it belongs and what
> > > particular config we need to use.
> >
> > Ah, pass the fd of a sysfs file to sysfs to get the kobject. Ick,
> > that's just, well, something that I never even considered someone would
> > need/want to do...
>
> No, we don't pass the fd to sysfs, we pass the fd into a syscall.

Sorry, yes, that is what I was trying to say. You then take that fd,
pass it to the sysfs core within the kernel, and get a kobject back.

> > sysfs exports single values just fine. If you are starting to do more
> > complex things, like you currently are, maybe you shouldn't be in
> > sysfs...
>
> Well, like said, I'm fine with it actually being single values, its just
> that Ingo suggested skipping a few syscalls.
>
> Robert just suggested we could use the sysfs files as device nodes and
> have then open() return a perf_event fd. Its just that that would
> require we add a ioctl to change the perf_event_attr structure and
> attach it to a context.

Nope, sorry, that's not going to happen, that is not what sysfs is for.
No ioctls or device nodes in there please.

> > I can always knock up a eventfs for you do mount at /sys/kernel/events/
> > or something if you want :)
>
> But that won't get us the nice device linkage, right?

True, you would loose that.

thanks,

greg k-h
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Greg KH on
On Thu, May 20, 2010 at 10:14:18PM +0200, Ingo Molnar wrote:
>
> * Greg KH <greg(a)kroah.com> wrote:
>
> > [...]
> >
> > I can always knock up a eventfs for you do mount at /sys/kernel/events/ or
> > something if you want :)
>
> eventfs was my first idea, until Peter convinced me that we want sysfs :-)
>
> One important aspect would be to move it into the physical topology. Graphics
> card? It might have events. PCI device? It might have events. Southbridge? It
> might have a PMU and events. CPU? It has a PMU.
>
> Especially when it comes to complex physical topologies on larger systems, we
> eventually want to visualize things in tooling as well - as a tree of the
> physical topology. Also, physical topologies will only become more complex, so
> we dont want to detach events from them.

Ok, yes, physical topology would be nice to have, I agree.

> > sysfs exports single values just fine. If you are starting to do more
> > complex things, like you currently are, maybe you shouldn't be in sysfs...
>
> This is really like a read-only attributes, and it would be multi-line only
> for the event format descriptor - a genuinely new aspect: a flexible ABI
> descriptor.

Oh no...

> It's an attribute for a very good purpose: flexible ABI with a user-space that
> interprets new format descriptions automatically. This is not just theory, for
> example perf trace does this today, and you can write scripts with old tools
> for a new event that shows up in a new kernel, without rebuilding the tools.
>
> Here is an example of a format descriptor:
>
> # cat /debug/tracing/events/sched/sched_wakeup/format
> name: sched_wakeup
> ID: 59
> format:
> field:unsigned short common_type; offset:0; size:2; signed:0;
> field:unsigned char common_flags; offset:2; size:1; signed:0;
> field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
> field:int common_pid; offset:4; size:4; signed:1;
> field:int common_lock_depth; offset:8; size:4; signed:1;
>
> field:char comm[TASK_COMM_LEN]; offset:12; size:16; signed:1;
> field:pid_t pid; offset:28; size:4; signed:1;
> field:int prio; offset:32; size:4; signed:1;
> field:int success; offset:36; size:4; signed:1;
> field:int target_cpu; offset:40; size:4; signed:1;
>
> print fmt: "comm=%s pid=%d prio=%d success=%d target_cpu=%03d", REC->comm, REC->pid, REC->prio, REC->success, REC->target_cpu

Hm, kind of like a "sane" xml, right?

> Also, we already have quite a few multi-line files in sysfs, for example:

These are all aborations, please don't perputuate it.

> $ cat /sys/devices/pnp0/00:09/options
> Dependent: 00 - Priority preferred
> port 0x378-0x378, align 0x0, size 0x8, 16-bit address decoding
> port 0x778-0x778, align 0x0, size 0x8, 16-bit address decoding
> irq 7 High-Edge
> dma 3 8-bit compatible
> Dependent: 01 - Priority acceptable
> port 0x378-0x378, align 0x0, size 0x8, 16-bit address decoding
> port 0x778-0x778, align 0x0, size 0x8, 16-bit address decoding
> irq 3,4,5,6,7,10,11,12 High-Edge
> dma 0,1,2,3 8-bit compatible
> Dependent: 02 - Priority acceptable
> port 0x278-0x278, align 0x0, size 0x8, 16-bit address decoding
> port 0x678-0x678, align 0x0, size 0x8, 16-bit address decoding
> irq 3,4,5,6,7,10,11,12 High-Edge
> dma 0,1,2,3 8-bit compatible
> Dependent: 03 - Priority acceptable
> port 0x3bc-0x3bc, align 0x0, size 0x4, 16-bit address decoding
> port 0x7bc-0x7bc, align 0x0, size 0x4, 16-bit address decoding
> irq 3,4,5,6,7,10,11,12 High-Edge
> dma 0,1,2,3 8-bit compatible

That should be a debugfs file.

> $ cat /sys/devices/pci0000:00/0000:00:1a.7/pools
> poolinfo - 0.1
> ehci_sitd 0 0 96 0
> ehci_itd 0 0 160 0
> ehci_qh 4 42 96 1
> ehci_qtd 4 42 96 1
> buffer-2048 0 0 2048 0
> buffer-512 0 0 512 0
> buffer-128 0 0 128 0
> buffer-32 1 128 32 1

Odd, I hadn't noticed that one before. I can't figure out what that
file is, who creates it?

Ick, mm/dmapool.c? Hm, not good, that's a debugging file only, and
really does not belong in sysfs. It seems to predate 2.6.12, so it made
it in before debugfs was around. I'll work on moving it out of sysfs...

> In fact uevents have multi-line attributes as well:
>
> $ cat /sys/devices/pci0000:00/0000:00:1a.1/usb4/uevent
> MAJOR=189
> MINOR=384
> DEVNAME=bus/usb/004/001
> DEVTYPE=usb_device
> DRIVER=usb
> DEVICE=/proc/bus/usb/004/001
> PRODUCT=1d6b/1/206
> TYPE=9/0/0
> BUSNUM=004
> DEVNUM=001

Yes, that's the environment variables that are sent to userspace in the
uevent. I don't like the multi-line stuff for this one, but we couldn't
think of a better way at the time.

Anyway, back to your original issue, multi-line sysfs files.

I really don't want to do something like that, in sysfs, if at all
possible. We have been working very hard to keep the sysfs file format
simple, and to follow the one-value-per-file rule, so we don't end up
repeating the same mistakes we did in /proc.

Now one could argue that we are not entirely successful, especially
based on your examples above. However, those are the rare exception,
not the rule by far.

So, where do we do something like this? I don't know. I still like the
idea of eventfs, and we could pass in a kobject to it to have it create
the tree if needed. Yeah, that would be a replication of some of the
sysfs structure, but you could have a custom file format, like you show
above, which would you could control and keep in step with your
userspace tools.

How deep in the device tree are you really going to be caring about? It
sounds like the large majority of events are only going to be coming
from the "system" type objects (cpu, nodes, memory, etc.) and very few
would be from things that we consider a 'struct device' today (like a
pci, usb, scsi, or input, etc.)

thanks,

greg k-h
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/