Q: sched_clock() vs. clocksource, how to implement correctly [Kernel]

Prev: add usb_add_bus() function for usb bus list
Next: [PATCH] net: change recvform to return same address length as getsockname on unnamed unix sockets

From: Johannes Stezenbach on 23 Apr 2010 11:10

Hi,

I'm trying to figure out how to correctly implement sched_clock()
for an ARM board. However, looking at existing implementations
leaves me rather confused.

E.g. arch/arm/mach-u300/timer.c has

static cycle_t u300_get_cycles(struct clocksource *cs)
{
return (cycles_t) readl(U300_TIMER_APP_VBASE + U300_TIMER_APP_GPT2CC);
}

static struct clocksource clocksource_u300_1mhz = {
.name = "GPT2",
.rating = 300, /* Reasonably fast and accurate clock source */
.read = u300_get_cycles,
.mask = CLOCKSOURCE_MASK(32), /* 32 bits */
/* 22 calculated using the algorithm in arch/mips/kernel/time.c */
.shift = 22,
.flags = CLOCK_SOURCE_IS_CONTINUOUS,
};

unsigned long long notrace sched_clock(void)
{
return clocksource_cyc2ns(clocksource_u300_1mhz.read(
&clocksource_u300_1mhz),
clocksource_u300_1mhz.mult,
clocksource_u300_1mhz.shift);
}

Thus, sched_clock() is based on a 1MHz 32bit counter which wraps
after about 71 minutes. There are a few similar sched_clock()
implementations in the tree.

Questions:

- Isn't sched_clock() supposed to be extended to 64bit so
that it practically never wraps?
(old implementations use cnt32_to_63())

- What would be the effect on scheduling when sched_clock() wraps?

- What is the effect of the sched_clock() frequency on scheduling?
Is there a benefit from setting the freq as high as possible?

- Is struct timecounter + struct cyclecounter + timecounter_read()
designated way to implement sched_clock() with a 32bit hw counter?

arch/microblaze/kernel/timer.c seems to be the only user
of timecounter/cyclecounter in arch/, but I don't get what it does.

Or is it better to stick with cnt32_to_63()?

- Also regarding the clocksource.shift value, I found
http://lkml.org/lkml/2008/5/8/270 and it seems to suggest
to use a low shift value, whereas arch/mips/kernel/time.c
seems to result in a large one. Is the posting correct?

Thanks,
Johannes
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Martin Schwidefsky on 23 Apr 2010 12:40

On Fri, 23 Apr 2010 17:09:17 +0200
Johannes Stezenbach <js(a)sig21.net> wrote:

> I'm trying to figure out how to correctly implement sched_clock()
> for an ARM board. However, looking at existing implementations
> leaves me rather confused.
>
> E.g. arch/arm/mach-u300/timer.c has
>
> static cycle_t u300_get_cycles(struct clocksource *cs)
> {
> return (cycles_t) readl(U300_TIMER_APP_VBASE + U300_TIMER_APP_GPT2CC);
> }
>
> static struct clocksource clocksource_u300_1mhz = {
> .name = "GPT2",
> .rating = 300, /* Reasonably fast and accurate clock source */
> .read = u300_get_cycles,
> .mask = CLOCKSOURCE_MASK(32), /* 32 bits */
> /* 22 calculated using the algorithm in arch/mips/kernel/time.c */
> .shift = 22,
> .flags = CLOCK_SOURCE_IS_CONTINUOUS,
> };
>
> unsigned long long notrace sched_clock(void)
> {
> return clocksource_cyc2ns(clocksource_u300_1mhz.read(
> &clocksource_u300_1mhz),
> clocksource_u300_1mhz.mult,
> clocksource_u300_1mhz.shift);
> }
>
> Thus, sched_clock() is based on a 1MHz 32bit counter which wraps
> after about 71 minutes. There are a few similar sched_clock()
> implementations in the tree.

Not good.

> Questions:
>
> - Isn't sched_clock() supposed to be extended to 64bit so
> that it practically never wraps?
> (old implementations use cnt32_to_63())

Yes, sched_clock() is supposed to return a monotonic timestamp.

> - What would be the effect on scheduling when sched_clock() wraps?

It would confuse the process accounting and the scheduling I guess.

> - What is the effect of the sched_clock() frequency on scheduling?
> Is there a benefit from setting the freq as high as possible?

With a higher frequency the precision of the accounting increases and
the scheduling fairness increases.

> - Is struct timecounter + struct cyclecounter + timecounter_read()
> designated way to implement sched_clock() with a 32bit hw counter?
>
> arch/microblaze/kernel/timer.c seems to be the only user
> of timecounter/cyclecounter in arch/, but I don't get what it does.
>
> Or is it better to stick with cnt32_to_63()?

cnt32_to_63() is a way to extend the 32 bit clocksource to a useable
sched_clock() provider. One way or the other you have to extend the 32
bits of the 1MHz counter to something bigger. The obvious way is to
count the number of wraps and use this number as the upper 32 bits
just like cnt32_to_63() does. You just have to make sure that the
clocksource is read frequently enough to get all the wraps. A non-idle
cpu will tick with HZ frequency and the clock will be read often
enough, for an idle cpu the maximum sleep time needs to be limited.

> - Also regarding the clocksource.shift value, I found
> http://lkml.org/lkml/2008/5/8/270 and it seems to suggest
> to use a low shift value, whereas arch/mips/kernel/time.c
> seems to result in a large one. Is the posting correct?

Low shift values have the advantage that the 64-bit calculations do not
overflow so easily. But there are fairly large shift values in use
(20, 24, even 32 bit) in the current kernel tree, so I wouldn't worry
about bigger shift values.

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Johannes Stezenbach on 24 Apr 2010 17:00

On Fri, Apr 23, 2010 at 06:29:33PM +0200, Martin Schwidefsky wrote:
> On Fri, 23 Apr 2010 17:09:17 +0200 > Johannes Stezenbach <js(a)sig21.net> wrote:
> >
> > - Isn't sched_clock() supposed to be extended to 64bit so
> > that it practically never wraps?
> > (old implementations use cnt32_to_63())
>
> Yes, sched_clock() is supposed to return a monotonic timestamp.

OK, searching LXR for sched_clock, I think serveral
sched_clock() implementations across architectures
have the wrapping issue.

BTW, wrapping sched_clock() also causes printk() timestamps to
wrap, so the problem should be easy to spot.

> > - What would be the effect on scheduling when sched_clock() wraps?
>
> It would confuse the process accounting and the scheduling I guess.

That's a bit vague ;-/
Let's try that: Could it cause processes calling e.g. nanosleep()
to sleep much longer than intended (e.g. until sched_clock() wraps
again)? Or can it only cause a momentary scheduler hickup,
i.e. the scheduler might make one wrong scheduling decision
per sched_clock() wrap? Given that there are serveral platforms
with wrapping sched_clock() I guess the latter.

> > - Is struct timecounter + struct cyclecounter + timecounter_read()
> > designated way to implement sched_clock() with a 32bit hw counter?
> >
> > arch/microblaze/kernel/timer.c seems to be the only user
> > of timecounter/cyclecounter in arch/, but I don't get what it does.
> >
> > Or is it better to stick with cnt32_to_63()?
>
> cnt32_to_63() is a way to extend the 32 bit clocksource to a useable
> sched_clock() provider. One way or the other you have to extend the 32
> bits of the 1MHz counter to something bigger. The obvious way is to
> count the number of wraps and use this number as the upper 32 bits
> just like cnt32_to_63() does. You just have to make sure that the
> clocksource is read frequently enough to get all the wraps. A non-idle
> cpu will tick with HZ frequency and the clock will be read often
> enough, for an idle cpu the maximum sleep time needs to be limited.

Somehow I got the impression that cnt32_to_63() is old-style,
and using clocksource_cyc2ns() is new-style. Essentially I wanted to
avoid the do_div() in sched_clock() (like e.g. in
arm/mach-versatile/core.c), and just using clocksource_cyc2ns()
looked simple and elegant. Bummer that it's wrong.

It seems that arch/arm/plat-orion/time.c is about the only
one which gets it completely right according to your description.
I'll quote it here for discussion:

/*
* Orion's sched_clock implementation. It has a resolution of
* at least 7.5ns (133MHz TCLK) and a maximum value of 834 days.
*
* Because the hardware timer period is quite short (21 secs if
* 200MHz TCLK) and because cnt32_to_63() needs to be called at
* least once per half period to work properly, a kernel timer is
* set up to ensure this requirement is always met.
*/
#define TCLK2NS_SCALE_FACTOR 8

static unsigned long tclk2ns_scale;

unsigned long long sched_clock(void)
{
unsigned long long v = cnt32_to_63(0xffffffff - readl(TIMER0_VAL));
return (v * tclk2ns_scale) >> TCLK2NS_SCALE_FACTOR;
}

static struct timer_list cnt32_to_63_keepwarm_timer;

static void cnt32_to_63_keepwarm(unsigned long data)
{
mod_timer(&cnt32_to_63_keepwarm_timer, round_jiffies(jiffies + data));
(void) sched_clock();
}

static void __init setup_sched_clock(unsigned long tclk)
{
unsigned long long v;
unsigned long data;

v = NSEC_PER_SEC;
v <<= TCLK2NS_SCALE_FACTOR;
v += tclk/2;
do_div(v, tclk);
/*
* We want an even value to automatically clear the top bit
* returned by cnt32_to_63() without an additional run time
* instruction. So if the LSB is 1 then round it up.
*/
if (v & 1)
v++;
tclk2ns_scale = v;

data = (0xffffffffUL / tclk / 2 - 2) * HZ;
setup_timer(&cnt32_to_63_keepwarm_timer, cnt32_to_63_keepwarm, data);
mod_timer(&cnt32_to_63_keepwarm_timer, round_jiffies(jiffies + data));
}

BTW, even though this uses TCLK2NS_SCALE_FACTOR of 8, the same file
uses a shift vaue of 20 for the orion_clksrc...

Thanks,
Johannes
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Peter Zijlstra on 25 Apr 2010 10:40

On Sat, 2010-04-24 at 22:50 +0200, Johannes Stezenbach wrote:
> Somehow I got the impression that cnt32_to_63() is old-style,

Not really, the biggest problem it has is that is not SMP aware. I did
some patches for that at one time, but people objected for some reason.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Johannes Stezenbach on 26 Apr 2010 09:20

On Sat, Apr 24, 2010 at 10:50:11PM +0200, Johannes Stezenbach wrote:
> /*
> * Orion's sched_clock implementation. It has a resolution of
> * at least 7.5ns (133MHz TCLK) and a maximum value of 834 days.
> *
> * Because the hardware timer period is quite short (21 secs if
> * 200MHz TCLK) and because cnt32_to_63() needs to be called at
> * least once per half period to work properly, a kernel timer is
> * set up to ensure this requirement is always met.
> */
> #define TCLK2NS_SCALE_FACTOR 8

I found the following discussion of the sched_clock()
implementation trade-offs very informative:
http://lkml.org/lkml/2010/4/15/299

It mentions clocksource_calc_mult_shift() which was added in 2.6.33,
however I had some difficulties understanding the meaning of the minsec
parameter, especially since all existing callers use a value of 4.
But when using minsec = 365*24*60*60 (1 year) it results in the shift value of 8.

So finally the pieces connect together :-)

> BTW, even though this uses TCLK2NS_SCALE_FACTOR of 8, the same file
> uses a shift vaue of 20 for the orion_clksrc...

It seems clocksource_cyc2ns() is used in kernel/time/timekeeping.c
and kernel/time/clocksource.c only on relatively
small delta values, so there's no need to worry
about overflow and a large clocksource.shift and .mult is OK.

(Apparently the minsec value of 4 mentioned above is suitable
for timekeeping? Where does the 4 come from?)

Thanks,
Johannes
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

| Next | Last
Pages: 1 2
Prev: add usb_add_bus() function for usb bus list
Next: [PATCH] net: change recvform to return same address length as getsockname on unnamed unix sockets