x86: Export tsc related information in sysfs [Kernel]


From: Arjan van de Ven on 15 May 2010 15:20

On Sat, 15 May 2010 06:29:25 -0700 (PDT)
Dan Magenheimer <dan.magenheimer(a)oracle.com> wrote:

> > It would be better to fix them to use the vsyscalls instead.
> > Or if they can't use the vsyscalls for some reason today fix them.
>
> The problem is from an app point-of-view there is no vsyscall.
> There are two syscalls: gettimeofday and clock_gettime. Sometimes,
> if it gets lucky, they turn out to be very fast and sometimes
> it doesn't get lucky and they are VERY slow (resulting in a
> performance hit of 10% or more), depending on a number of factors
> completely out of the control of the app and even undetectable to the
> app.

But the point is.. in the case you get that 10% hit.... that is exactly
the case where tsc would not work either!!!

> If tsc_reliable is 1, the system and the kernel are guaranteeing
> to the app that nothing will change in the TSC. In an Invariant
> TSC system that has passed Ingo's warp test (to eliminate the
> possibility of a fixed interprocessor TSC gap due to a broken BIOS
> in a multi-node NUMA system), if anything changes in the clock

just when we're trying to get rid of this constraint by allowing a per
cpu offset... (this is needed to cope with cpus not powering on at the
exact same time... including hotplug cpu etc etc)

oh and.. what notification mechanism do you have to notify the
application that the tsc now is no longer reliable? Such conditions
can exist... for example due to a CPU being hotplugged, or some SMM
screwing around and the kernel detecting that or .. or ...

really. Use the vsyscall. If the vsyscall does not do exactly what you
want, make a better vsyscall.

But friends don't let friends use rdtsc in application code.

--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
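
To see which of the two cases a given machine falls into (fast vsyscall path or slow fallback), one option is simply to time the call from userland. The following is a minimal sketch, not something posted in the thread; the helper names ts_to_ns() and clock_gettime_cost_ns() are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Convert a timespec to a flat nanosecond count. */
static uint64_t ts_to_ns(const struct timespec *ts)
{
    return (uint64_t)ts->tv_sec * 1000000000ull + (uint64_t)ts->tv_nsec;
}

/*
 * Average cost, in nanoseconds, of one clock_gettime(CLOCK_MONOTONIC)
 * call, measured over `iterations` back-to-back calls. The two calls
 * that bracket the loop add a small constant error to the total.
 */
static double clock_gettime_cost_ns(unsigned long iterations)
{
    struct timespec start, end, scratch;
    unsigned long i;

    assert(iterations > 0);

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < iterations; i++)
        clock_gettime(CLOCK_MONOTONIC, &scratch);
    clock_gettime(CLOCK_MONOTONIC, &end);

    return (double)(ts_to_ns(&end) - ts_to_ns(&start)) / (double)iterations;
}
```

A result that is much larger on one box than on another suggests the call is falling back to a real system call and a slow platform timer there. This only measures the cost at the moment it runs; it says nothing about whether the kernel will switch clock sources later.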

From: Dan Magenheimer on 15 May 2010 18:40

> From: Arjan van de Ven [mailto:arjan(a)infradead.org]
(Arjan comments reordered somewhat)

> But friends don't let friends use rdtsc in application code.

Um, I realize that many people have been burned by this
many times over the years so it is a "hot stove". I also
realize that there are many environments where using
rdtsc is risking stepping on landmines. But I (we?) also
know there are many environments now where using rdtsc is
NOT risky at all... and with the vast majority of new
systems soon shipping with Invariant TSC and a single socket
(and even most multiple-socket systems with non-broken
BIOSes passing a warp test), why should past burns outlaw
userland use of a very fast, very useful CPU feature? After
all, CPU designers at both Intel and AMD have spent
a great deal of design effort and transistors to FINALLY
provide an Invariant TSC.

> > The problem is from an app point-of-view there is no vsyscall.
> > There are two syscalls: gettimeofday and clock_gettime. Sometimes,
> > if it gets lucky, they turn out to be very fast and sometimes
> > it doesn't get lucky and they are VERY slow (resulting in a
> > performance hit of 10% or more), depending on a number of factors
> > completely out of the control of the app and even undetectable to the
> > app.
>
> But the point is.. in the case you get that 10% hit.... that is exactly
> the case where tsc would not work either!!!

Yes, understood. But the kernel doesn't expose a "gettimeofday
performance sucks" flag either. If it did (or in the case of
the patch, if tsc_reliable is zero) the application could at least
choose to turn off the 10000-100000 timestamps/second and log
a message saying "you are running on old hardware so you get
fewer features".

> just when we're trying to get rid of this constraint by allowing a per
> cpu offset... (this is needed to cope with cpus not powering on at the
> exact same time... including hotplug cpu etc etc)
>
> oh and.. what notification mechanism do you have to notify the
> application that the tsc now is no longer reliable? Such conditions
> can exist... for example due to a CPU being hotplugged, or some SMM
> screwing around and the kernel detecting that or .. or ...

The proposal doesn't provide a notification mechanism (though I'm
not against it)... if the tsc can EVER become unreliable,
tsc_reliable should be 0.

A CPU-hotpluggable system is a good example of a case where
the kernel should expose that tsc_reliable is 0. (I've heard
anecdotally that CPU hotplug into a QPI or Hypertransport system
will have some other interesting challenges, so may require some
special kernel parameters anyway.) Even if tsc_reliable were
only enabled if a "no-cpu_hotplug" kernel parameter is set,
that is still useful. And with cores-per-socket (and even
nodes-per-socket) going up seemingly every day, multi-socket
systems will likely be an ever smaller percentage of new
systems.

A virtual machine where live migration to another physical machine
may occur is another good example where tsc_reliable should be 0.
Xen now has a VM config feature that says "migration is disallowed"
for this reason; the Invariant TSC flag is always off for a VM
unless this "no_migrate" flag is set (or rdtsc is emulated).

> really. Use the vsyscall. If the vsyscall does not do exactly what you
> want, make a better vsyscall.

If this discussion results in a better vsyscall and/or a way
for applications to easily determine (and report loudly) that
the system does NOT provide a good way to do a fast timestamp,
that may be sufficient. But please propose how that will be done
as the current software choices are inadequate and the CPU
designers have finally fixed the problem for the vast majority
of systems. I am already aware of some enterprise software
that is doing its best to guess whether TSC is reliable by
looking at CPU families and socket counts, but this is doomed
to failure in userland and is something that the kernel knows
and should now expose.

Thanks,
Dan
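
For reference, the "guessing" that userland can do today amounts to something like the sketch below, which checks the Invariant TSC bit (CPUID leaf 0x80000007, EDX bit 8). The function names are illustrative only. Note that this bit only says the TSC ticks at a constant rate on one CPU; it says nothing about warp between sockets, SMM, or hotplug, which is exactly the information being argued about in this thread.

```c
#include <assert.h>
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Bit 8 of EDX in CPUID leaf 0x80000007 advertises Invariant TSC. */
#define INVARIANT_TSC_BIT 8u

/* Pure helper: test the Invariant TSC bit in a raw EDX value. */
static bool edx_has_invariant_tsc(unsigned int edx)
{
    return ((edx >> INVARIANT_TSC_BIT) & 1u) != 0;
}

/*
 * Returns true if the CPU advertises Invariant TSC.
 * Returns false on non-x86 builds or if the leaf is not supported.
 */
static bool cpu_has_invariant_tsc(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(0x80000007u, &eax, &ebx, &ecx, &edx))
        return false;
    return edx_has_invariant_tsc(edx);
#else
    return false;
#endif
}
```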

From: Arjan van de Ven on 16 May 2010 01:50

On Sat, 15 May 2010 15:32:51 -0700 (PDT)
Dan Magenheimer <dan.magenheimer(a)oracle.com> wrote:

> > From: Arjan van de Ven [mailto:arjan(a)infradead.org]
> (Arjan comments reordered somewhat)
>
> > But friends don't let friends use rdtsc in application code.
>
> Um, I realize that many people have been burned by this
> many times over the years so it is a "hot stove". I also
> realize that there are many environments where using
> rdtsc is risking stepping on landmines.

> But I (we?) also
> know there are many environments now where using rdtsc is
> NOT risky at all...

I see a lot of Intel hardware.. (stuff that you likely don't see yet ;-)
and I have not yet seen a system where the kernel would be able to give
the guarantee as you describe it in your email.

If you want a sysfs variable that is always 0... go wild.

> and with the vast majority of new
> systems soon shipping with Invariant TSC and a single socket
> (and even most multiple-socket systems with non-broken
> BIOSes passing a warp test),

(the warp test is going away)

on multisocket that passes a warp test you can still get skew over
time.. due to things like SMM, thermal throttling etc etc.

> why should past burns outlaw
> userland use of a very fast, very useful CPU feature? After
> all, CPU designers at both Intel and AMD have spent
> a great deal of design effort and transistors to FINALLY
> provide an Invariant TSC.

sadly even with all these transistors no system that I know of today
can give that guarantee under the rules you state.

> > oh and.. what notification mechanism do you have to notify the
> > application that the tsc now is no longer reliable? Such conditions
> > can exist... for example due to a CPU being hotplugged, or some SMM
> > screwing around and the kernel detecting that or .. or ...
>
> The proposal doesn't provide a notification mechanism (though I'm
> not against it)... if the tsc can EVER become unreliable,
> tsc_reliable should be 0.

then it should be 0 always on all of today's hardware.
SMM, thermal overload, etc etc ... you name it.
Things the kernel will get notified about...

> A CPU-hotpluggable system is a good example of a case where
> the kernel should expose that tsc_reliable is 0. (I've heard
> anecdotally that CPU hotplug into a QPI or Hypertransport system
> will have some other interesting challenges, so may require some
> special kernel parameters anyway.)

eh no.
hot add works just fine.

(hot remove is a very different ballgame)

> > really. Use the vsyscall. If the vsyscall does not do exactly what
> > you want, make a better vsyscall.
>
> If this discussion results in a better vsyscall and/or a way
> for applications to easily determine (and report loudly) that
> the system does NOT provide a good way to do a fast timestamp,
> that may be sufficient. But please propose how that will be done
> as the current software choices are inadequate and the CPU
> designers have finally fixed the problem for the vast majority
> of systems.

*cough*

> I am already aware of some enterprise software
> that is doing its best to guess whether TSC is reliable by
> looking at CPU families and socket counts, but this is doomed
> to failure in userland and is something that the kernel knows
> and should now expose.

can you name said "enterprise" software by name please? We need a huge
advertisement to let people know not to trust their important data to
it..

--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

From: Thomas Gleixner on 16 May 2010 05:30

On Sat, 15 May 2010, Arjan van de Ven wrote:
> On Sat, 15 May 2010 15:32:51 -0700 (PDT)
> Dan Magenheimer <dan.magenheimer(a)oracle.com> wrote:
>
> > > From: Arjan van de Ven [mailto:arjan(a)infradead.org]
> > (Arjan comments reordered somewhat)
> >
> > > But friends don't let friends use rdtsc in application code.
> >
> > Um, I realize that many people have been burned by this
> > many times over the years so it is a "hot stove". I also
> > realize that there are many environments where using
> > rdtsc is risking stepping on landmines.
>
> > But I (we?) also
> > know there are many environments now where using rdtsc is
> > NOT risky at all...
>
> I see a lot of Intel hardware.. (stuff that you likely don't see yet ;-)
> and I have not yet seen a system where the kernel would be able to give
> the guarantee as you describe it in your email.
>
> If you want a sysfs variable that is always 0... go wild.

Nah, there are systems which will have it set to 1:

Dig out your good old Pentium-I box and enjoy.

> > > oh and.. what notification mechanism do you have to notify the
> > > application that the tsc now is no longer reliable? Such conditions
> > > can exist... for example due to a CPU being hotplugged, or some SMM
> > > screwing around and the kernel detecting that or .. or ...
> >
> > The proposal doesn't provide a notification mechanism (though I'm
> > not against it)... if the tsc can EVER become unreliable,
> > tsc_reliable should be 0.
>
> then it should be 0 always on all of today's hardware.
> SMM, thermal overload, etc etc ... you name it.
> Things the kernel will get notified about...

What we could expose is an estimate about the performance of
gettimeofday/clock_gettime. The kernel has all the information to do
that, but this still does not solve the notification problem when we
need to switch to a different clock source.

Thanks,

tglx
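
To illustrate the notification problem: the kernel already exposes the active clock source in sysfs, but an application can only poll it. The sketch below reads /sys/devices/system/clocksource/clocksource0/current_clocksource; the helper names are made up for this example, and a value read now may be stale by the time it is used.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define CLOCKSOURCE_PATH \
    "/sys/devices/system/clocksource/clocksource0/current_clocksource"

/*
 * Read the first line of `path` into `buf` (at most len - 1 characters)
 * and strip the trailing newline. Returns 0 on success, -1 on failure,
 * in which case `buf` is left as an empty string.
 */
static int read_first_line(const char *path, char *buf, size_t len)
{
    FILE *f;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';

    f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(buf, (int)len, f)) {
        fclose(f);
        buf[0] = '\0';
        return -1;
    }
    fclose(f);

    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

/* Returns 1 if the current clock source is "tsc", 0 if not, -1 if unknown. */
static int clocksource_is_tsc(void)
{
    char name[64];

    if (read_first_line(CLOCKSOURCE_PATH, name, sizeof(name)) != 0)
        return -1;
    return strcmp(name, "tsc") == 0;
}
```

Nothing wakes the application when this file changes, so a check at startup does not cover a later switch away from the TSC.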

From: Dan Magenheimer on 16 May 2010 12:50

> From: Thomas Gleixner [mailto:tglx(a)linutronix.de]
> What we can talk about is a vget_tsc_raw() interface along with a
> vconvert_tsc_delta() interface, where vget_tsc_raw() returns a
> nasty error code for everything which is not usable.

I'm open to something like that provided:

1) It works (whenever possible) without changing privilege levels
or causing vmexits or other "hidden slowness" problems when
used both in bare-metal Linux and in a virtual machine.
2) The "transformation" performed by the kernel on the TSC
does not require some hidden pcpu number that won't work
in a virtual machine.

If TSC is indeed reliable (see below), it is both faster AND
meets the above constraints.

> > From: Arjan van de Ven [mailto:arjan(a)infradead.org]
> > If you want a sysfs variable that is always 0... go wild.
>
> From: Thomas Gleixner [mailto:tglx(a)linutronix.de]
> Nah, there are systems which will have it set to 1:
> Dig out your good old Pentium-I box and enjoy.

Hot stove syndrome again? Are you truly saying that there
are NO single-socket multi-core systems that don't have
stupid firmware (SMI and/or BIOS)? Or are you saying that
significant TSC clock skew occurs even between the cores
on a single-socket Nehalem system?

If things are this bad, why on earth would the kernel itself
EVER use TSC even as its own internal clocksource? Or
even to provide additional precision to a slow platform timer?

Or are you saying that many systems (and especially large
multi-socket systems) DO exist where the kernel isn't able
to proactively determine that the firmware is broken and/or
significant thermal variation may occur across sockets?
This I believe.

I understand that you both are involved in pushing the
limits of large systems and that time synchronization is
a very hard problem, perhaps effectively unsolvable,
in these systems.

But that doesn't mean the vast majority of latest generation
single-socket systems can't set "tsc_reliable" to 1. Or that
the kernel is responsible for detecting and/or correcting
every system with buggy firmware.

Maybe the best way to solve the "buggy firmware problem"
is exactly by encouraging enterprise apps to use TSC
and to expose and *blacklist* systems and/or system vendors
who ship boxes with crappy firmware!

> From: Thomas Gleixner [mailto:tglx(a)linutronix.de]
> What we could expose is an estimate about the performance of
> gettimeofday/clock_gettime. The kernel has all the information to do
> that, but this still does not solve the notification problem when we
> need to switch to a different clock source.

This would at least be a big step in the right direction.

But if we go with a vget_tsc_raw() or direct TSC solution,
you have convinced me of the need for notification.
Maybe this is a perfect use for (at least one bit in)
the TSC_AUX register and the rdtscp instruction?

And I do agree with Venki that some user library (or at
least published sample code) should be made available
to demonstrate proper usage and to dampen out the worst
of the "broken user problem".

> > From: Arjan van de Ven [mailto:arjan(a)infradead.org]
> > can you name said "enterprise" software by name please? We need a huge
> > advertisement to let people know not to trust their important data to
> > it..

For obvious reasons I can't do that, but I can point to
enterprise *operating systems* that have long since solved
this same problem one way or another: Solaris on x86 and
HP-UX (the latter admittedly on ia64). Enterprise app
vendors are quite happy with requiring conformance to a
very completely specified software/hardware/firmware stack
before providing support to an app customer. I'm just trying
to ensure that Linux can be part of that spec.
