From: Andi Kleen on
Hi dan,

On Sat, May 15, 2010 at 06:29:25AM -0700, Dan Magenheimer wrote:
> The problem is from an app point-of-view there is no vsyscall.
> There are two syscalls: gettimeofday and clock_gettime. Sometimes,
> if it gets lucky, they turn out to be very fast and sometimes
> it doesn't get lucky and they are VERY slow (resulting in a performance
> hit of 10% or more), depending on a number of factors completely
> out of the control of the app and even undetectable to the app.

What would the application do in the 10% case?

(Assuming modern kernels, I know older kernels had trouble sometimes):

That's the case when the TSC doesn't work reliably, so if it
uses it anyways it won't get good time.

It seems to me you're bordering on violating Steinberg's rule
of system programming here :-)

>
> Note also that even vsyscall with TSC as the clocksource will
> still be significantly slower than rdtsc, especially in the
> common case where a timestamp is directly stored and the
> delta between two timestamps is later evaluated; in the
> vsyscall case, each timestamp is a function call and a convert
> to nsec but in the TSC case, each timestamp is a single
> instruction.

First the single instruction is typically quite slow. Then
to really get monotonous time you need a barrier anyways.

When I originally wrote vsyscalls that overhead wasn't that big
with all that compared to open coding. The only thing that could
be stripped might be the unit conversion. In principle
a new vsyscall could be added for that (what units do you need?)

I remember when they were converted to clocksources they got
somewhat slower, but I suspect with some tuning work that
could be also fixed again.

I think glibc also still does a unnecessary indirect jump
(might hurt you if your CPU cannot predict that), but that could
be fixed too. I think I have an old patch for that in fact,
if you're still willing to use the old style vsyscalls.

>
> > This way if anything changes again in TSC the kernel could
> > shield the applications.
>
> If tsc_reliable is 1, the system and the kernel are guaranteeing
> to the app that nothing will change in the TSC. In an Invariant
> TSC system that has passed Ingo's warp test (to eliminate the
> possibility of a fixed interprocessor TSC gap due to a broken BIOS
> in a multi-node NUMA system), if anything changes in the clock

That only handles cases visible at boot. If the TSC breaks
longer term the kernel catches it with its watchdog, but your
user application won't.

> signal that drives the TSC, the system is badly broken and far
> worse things -- like inter-processor cache incoherency -- may happen.

I don't think that's true. There are various large systems with
non synchronized TSC and I haven't heard of any unique cache coherency
problems on that.

Also often the TSC is actually synchronized, but unfortunately
runs with a offset.

>
> Is it finally possible to get past the horrible SMP TSC problems
> of the past and allow apps, under the right conditions, to be able
> to use rdtsc again? This patch argues "yes".

Yes but why not let them use vsyscalls?

I know vsyscalls still have some issues today, but these
would be better fixed than worked around like this.

e.g.

If the idea is to use the TSC on not fully synchronized systems?

I haven't fully kept track, but at some point there was an attempt
to have more POSIX clocks with loser semantics (like per thread
monotonous). If you use that you'll get fast time (well not day time,
but perhaps useful time) which might be good enough without
hacks like this?

If the semantics are not exactly right I think more POSIX clocks
could be added too.

Or if the time conversion is a problem we could add a posix_gettime_otherunit()
or so (e.g. with a second vsyscall that converts units so you don't
need to do it in the fast path)

A long time ago there was also the idea to export the information
if gettimeofday()/clock_gettime() was fast or not. If this helps this could
be probably revisited. But I'm not sure what the application
should really do in this case.

32bit doesn't have a fast ring 3 gtod() today but that could be also fixed.

-Andi
--
ak(a)linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Dan Magenheimer on
(note important alternate proposal suggested below, please
don't get bored and stop reading yet ;-)

> From: Thomas Gleixner [mailto:tglx(a)linutronix.de]
> > "won't let my friends use rdtsc" :-) Anything I can do
> > to help?
>
> Yes, stop trying to convince me that rdtsc in apps is a good idea. :)

You've convinced ME that it is NOT a good idea in many many cases
and that some of those cases are very very hard to detect. Unfortunately,
convincing ME is not sufficient. The problem is that Pandora's
box is already open and it is getting ever more wide open. (Each
of you has implied that TSC on future systems is getting even
more likely to be reliable.) Not only is rdtsc unprivileged
but so is the CPUID Invariant TSC bit. And the kernel already
publishes current_clocksource and puts a tempting MHz rate in
its logs. While we all might hope that every system programmer
will read this long thread and also be convinced never to use
rdtsc, the problem is there is a high and increasing chance that
any given systems programmer will write code that uses rdtsc and,
in all of his/her test machines will NEVER see a problem. But,
sadly, some of the customers using that app WILL see the problem
BUT neither the customers nor the systems programmer may ever know
what really went wrong (unless they hire Thomas to debug it :-)

So let me suggest inverting the logic of the patch and maybe
it will serve all of us better. (see proposal below)

> From: Arjan van de Ven [mailto:arjan(a)infradead.org]
> If you want to find out yourself if the tsc is good enough for you
> that is one thing.... but if you want the kernel to have an official
> interface for it.... the kernel has to live by that commitment.
> We cannot put in that interface "oh and you need to implement the same
> workarounds, scaling and offsets as the kernel does", because that's
> in a huge flux, and will change from kernel version to kernel version.
> The only shot you could get is some vsyscall/vdso function that gives
> you a unit (but that is not easy given per cpu offset/frequency/etc..
> but at least the kernel can try)

> From: Thomas Gleixner [mailto:tglx(a)linutronix.de]
> Please understand that once we expose that tsc_reliable information we
> are responsible for its correctness. People will use it whether the
> enterprise entity who wants this feature has qualified that particular
> piece of hardware or not. And while the support of that enity refuses
> to help on non qualified hardware (your own words), we'll end up with
> the mess which was created to help that very entity.

OK, so let's invert the sense of the sysfs file and call it (for now)
"tsc_detected_as_UNreliable". Then anytime the kernel detects
a failed warp test (or any other suspicious condition), it changes
the bit from 0 to 1 effectively saying "if you are using rdtsc
against our recommendation, we told you that it might go bad and
it has, so consider yourself warned that some of the timestamps
you've taken since the last time you've checked this flag
may be b*rked"

IMHO, addressing the issue directly and clearly documenting it
(instead of trying to hide the dirty laundry in the kernel)
will result in far better education of systems programmers
and far fewer end user problems. Which raises another good analogy:

You are telling teenagers to abstain and I am proposing that we
instead encourage them to use a condom.

You are simply not going to stop systems programmers from using
rdtsc... let's at least allow them to practice "safe TSC usage"
which, to extend the analogy, we all know is still not 100.0%
effective but, if they are going to do "it" anyway, is still
far better than completely unprotected TSC usage.

(The remainder of this response is discussion of individual
points raised, so can be skipped by most readers.)

====

> From: Arjan van de Ven [mailto:arjan(a)infradead.org]
> and 3) the kernel gets thermal interrupts and the app does not
> and 4) the kernel decides which power management to use when
> and 5) the kernel can find out if SMI's happened, and the app cannot.
> and 6) the kernel can access tsc and a per cpu offset/frequency
> data atomically, without being scheduled to another CPU. The app cannot
> [well it can ask the kernel to be pinned, and that's a 99.99% thing,
> but still]

These are all very good reasons where the kernel might turn
on a "tsc_detected_as_unreliable" bit. And CPU-hotplug_add
too.

And, by the way, are any of these valid in a virtual machine?

> From: Andi Kleen [mailto:andi(a)firstfloor.org]
> What would the application do in the 10% case?

(and in the "tsc_detected_as_unreliable" == 1 case)
Well, some possibilities are:
- Disable the heavy TSC usage and log/notify that it has been
disabled because the TSC is not reliable.
- Switch to gettimeofday and log/notify that, sorry, overall
performance has become slower due to the fact that the
TSC has become reliable
- Log a message telling the user to confirm that the hardware
they are using is on the app's supported/compatibility list
and that they are using the latest firmware and that their
thermal logs are OK, because something might not be quite
right with their system

> From: Andi Kleen [mailto:andi(a)firstfloor.org]
> 32bit doesn't have a fast ring 3 gtod() today but that could be also
> fixed.

That might have helped, because the enterprise app I mentioned
earlier was 32-bit... but I'll bet the genie is out of the
bottle now and the app has already shipped using rdtsc.

> From: Andi Kleen [mailto:andi(a)firstfloor.org]
> It seems to me you're bordering on violating Steinberg's rule
> of system programming here :-)

<Embarrassed to not know this rule, Dan goes off and googles
but fails to find any good matches before his TSC goes bad>

> From: Andi Kleen [mailto:andi(a)firstfloor.org]
> First the single instruction is typically quite slow. Then
> to really get monotonous time you need a barrier anyways.

Agreed, I've measured 30-ish and 60-ish cycles on a
couple of machines.

> From: Andi Kleen [mailto:andi(a)firstfloor.org]
> When I originally wrote vsyscalls that overhead wasn't that big
> with all that compared to open coding. The only thing that could
> be stripped might be the unit conversion. In principle
> a new vsyscall could be added for that (what units do you need?)
>
> I remember when they were converted to clocksources they got
> somewhat slower, but I suspect with some tuning work that
> could be also fixed again.
>
> I think glibc also still does a unnecessary indirect jump
> (might hurt you if your CPU cannot predict that), but that could
> be fixed too. I think I have an old patch for that in fact,
> if you're still willing to use the old style vsyscalls.

I think vsyscall is an excellent idea and I'm very much in
favor of continuing to improve it and encouraging people
to use it. But until it either "always" "just works" in "all"
of the software/hardware environments used by a systems
programmer in their development and testing and in the systems
the apps get deployed on (including virtual machines) OR until
it clearly provides an indicator that it is hiding dirty performance
laundry, IMHO it won't convince the (admittedly undereducated)
pro-TSC crowd.

> > signal that drives the TSC, the system is badly broken and far
> > worse things -- like inter-processor cache incoherency -- may happen.
>
> From: Andi Kleen [mailto:andi(a)firstfloor.org]
> I don't think that's true. There are various large systems with
> non synchronized TSC and I haven't heard of any unique cache coherency
> problems on that.

Here I was referring to clock skew/drift, not the "fixed offset"
problem. I'm far from an expert in PLL's etc, but I think if
the clock signal is delayed far enough to cause TSC to skew
significantly, eventually some critical cache coherency protocol
is eventually going to miss a beat and screw up and corrupt data.

> If the idea is to use the TSC on not fully synchronized systems?

That has never been my intent, though others might be interested
in that.

> From: Andi Kleen [mailto:andi(a)firstfloor.org]
> I haven't fully kept track, but at some point there was an attempt
> to have more POSIX clocks with loser semantics (like per thread
> monotonous). If you use that you'll get fast time (well not day time,
> but perhaps useful time) which might be good enough without
> hacks like this?
>
> If the semantics are not exactly right I think more POSIX clocks
> could be added too.
>
> Or if the time conversion is a problem we could add a
> posix_gettime_otherunit()
> or so (e.g. with a second vsyscall that converts units so you don't
> need to do it in the fast path)
>
> A long time ago there was also the idea to export the information
> if gettimeofday()/clock_gettime() was fast or not. If this helps this
> could
> be probably revisited. But I'm not sure what the application
> should really do in this case.
>
> 32bit doesn't have a fast ring 3 gtod() today but that could be also
> fixed.

I believe (and this is strictly a personal opinion based on my
view of human psychology) that adding more obscure interfaces and
more obscure options is a losing battle.

> From: Thomas Gleixner [mailto:tglx(a)linutronix.de]
> I think you understand that I have no intention to put a ticking time
> bomb into the code I'm responsible for. I really have better things to
> do than shooting myself in the foot.

As we both know, the ticking time bomb is already there. IMHO this
revised proposal provides you with "plausible deniability"**
"The kernel now advertises when it detects the TSC is bad or has gone
bad... did your app vendor check the kernel-provided info?"

** http://en.wikipedia.org/wiki/Plausible_deniability
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Andi Kleen on
On Mon, May 17, 2010 at 08:23:59AM -0700, Dan Magenheimer wrote:
> will read this long thread and also be convinced never to use
> rdtsc, the problem is there is a high and increasing chance that
> any given systems programmer will write code that uses rdtsc and,
> in all of his/her test machines will NEVER see a problem. But,

It would be probably a good idea to document all this
somewhere central, agreed.

In fact I remember wanting to write such a paper once. Possibly
even with some title borrowing from Swift. It seems it never happened :)

> > From: Andi Kleen [mailto:andi(a)firstfloor.org]
> > 32bit doesn't have a fast ring 3 gtod() today but that could be also
> > fixed.
>
> That might have helped, because the enterprise app I mentioned
> earlier was 32-bit... but I'll bet the genie is out of the
> bottle now and the app has already shipped using rdtsc.

Well the enterprise will have to live with wrong timing now and then
(or perhaps totally broken timing on some system) then.

> > From: Andi Kleen [mailto:andi(a)firstfloor.org]
> > It seems to me you're bordering on violating Steinberg's rule
> > of system programming here :-)
>
> <Embarrassed to not know this rule, Dan goes off and googles
> but fails to find any good matches before his TSC goes bad>

Sorry it's Steinbach's rule. Never test for an error condition
you don't know how to handle.

>
> > From: Andi Kleen [mailto:andi(a)firstfloor.org]
> > First the single instruction is typically quite slow. Then
> > to really get monotonous time you need a barrier anyways.
>
> Agreed, I've measured 30-ish and 60-ish cycles on a
> couple of machines.

It can be worse.

Also for correct behaviour you need the barrier on many systems.
And some other workarounds.

>
> I think vsyscall is an excellent idea and I'm very much in
> favor of continuing to improve it and encouraging people
> to use it. But until it either "always" "just works" in "all"

On 64bit it should just work, although it could be made somewhat
(but not dramatically) faster.

On older kernels vsyscall got a bad name because it was not always
as clever as it could be to chose when to use TSC and when not,
but that is long fixed. It worked in any case, just was slower[1]

[1] That is some specific TSC misbehaviours were only fixed in later
kernels, but it's very unlikely that any other RDTSC user got
that right either.


> > From: Andi Kleen [mailto:andi(a)firstfloor.org]
> > I don't think that's true. There are various large systems with
> > non synchronized TSC and I haven't heard of any unique cache coherency
> > problems on that.
>
> Here I was referring to clock skew/drift, not the "fixed offset"
> problem. I'm far from an expert in PLL's etc, but I think if
> the clock signal is delayed far enough to cause TSC to skew
> significantly, eventually some critical cache coherency protocol
> is eventually going to miss a beat and screw up and corrupt data.

Generally small systems run on the same clock (and any drift
you see has other reasons), but large systems built out of multiple
boards run on different clocks and the interconnect
does appropiate timing between the different domains for itself.

The interconnects also have checksums and other error checking
and recovery mechanisms and tend to do something appropiate when there
is a transmission problem. It does not usually lead to data corruption.

> I believe (and this is strictly a personal opinion based on my
> view of human psychology) that adding more obscure interfaces and
> more obscure options is a losing battle.

It sounds like you're arguing against your own patch here.

-Andi
--
ak(a)linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Thomas Gleixner on
On Mon, 17 May 2010, Dan Magenheimer wrote:
>
> OK, so let's invert the sense of the sysfs file and call it (for now)
> "tsc_detected_as_UNreliable". Then anytime the kernel detects
> a failed warp test (or any other suspicious condition), it changes
> the bit from 0 to 1 effectively saying "if you are using rdtsc
> against our recommendation, we told you that it might go bad and
> it has, so consider yourself warned that some of the timestamps
> you've taken since the last time you've checked this flag
> may be b*rked"
>
> IMHO, addressing the issue directly and clearly documenting it
> (instead of trying to hide the dirty laundry in the kernel)
> will result in far better education of systems programmers
> and far fewer end user problems. Which raises another good analogy:
>
> You are telling teenagers to abstain and I am proposing that we
> instead encourage them to use a condom.

Wrong. A vsyscall _is_ the protection which you want them to pull over
the rdtsc.

You are basically telling them: Go ahead, but keep in mind to look for
that well hidden tag behind the left earlobe which might change
suddenly from "no disease" to "infectous".

Thanks,

tglx
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Dan Magenheimer on
> From: Thomas Gleixner [mailto:tglx(a)linutronix.de]
>
> On Mon, 17 May 2010, Dan Magenheimer wrote:
> >
> > OK, so let's invert the sense of the sysfs file and call it (for now)
> > "tsc_detected_as_UNreliable". Then anytime the kernel detects
> > a failed warp test (or any other suspicious condition), it changes
> > the bit from 0 to 1 effectively saying "if you are using rdtsc
> > against our recommendation, we told you that it might go bad and
> > it has, so consider yourself warned that some of the timestamps
> > you've taken since the last time you've checked this flag
> > may be b*rked"
> >
> > IMHO, addressing the issue directly and clearly documenting it
> > (instead of trying to hide the dirty laundry in the kernel)
> > will result in far better education of systems programmers
> > and far fewer end user problems. Which raises another good analogy:
> >
> > You are telling teenagers to abstain and I am proposing that we
> > instead encourage them to use a condom.
>
> Wrong. A vsyscall _is_ the protection which you want them to pull over
> the rdtsc.
>
> You are basically telling them: Go ahead, but keep in mind to look for
> that well hidden tag behind the left earlobe which might change
> suddenly from "no disease" to "infectous".

OK, well ignoring the metaphor, it's clear we disagree on a point
that neither one of us can prove: You think your decision to avoid
sharing kernel information will stop system programmers from using
rdtsc, and I think some are going to use rdtsc anyway and blame
Linux when something eventually and silently breaks.

Given that I'm not going to win this argument even by pointing
to examples, let's move forward with your solution. Can you say
more about your vget_tsc_raw() directions? Or at least describe
the API if there is anything special? I'm getting beaten on for
an answer *today*, they say (and I quote from an email today)
they "have been using vsyscalls for a while and still have a
performance headache", and rdtsc looks awfully tempting, even
if it is not 100% perfect. So what can I give them as an
alternative?

(And, yes, I understand this is not your problem, but I do worry
that it is perceived as *Linux's* problem. And as we all have
painfully learned in our careers, it's almost impossible to
beat back a perception by presenting a set of byzantine facts
that only a handful of people in the world truly understand.)

Thanks!
Dan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/