From: Patrick J. LoPresti on
For concreteness, let me start with the patch I have in mind. Call it
"patch version 1".


--- linux-2.6.32.13-0.4/kernel/time.c.orig 2010-08-13 10:52:50.000000000 -0700
+++ linux-2.6.32.13-0.4/kernel/time.c 2010-08-13 10:53:20.000000000 -0700
@@ -229,7 +229,8 @@ SYSCALL_DEFINE1(adjtimex, struct timex _
  */
 struct timespec current_fs_time(struct super_block *sb)
 {
-	struct timespec now = current_kernel_time();
+	struct timespec now;
+	getnstimeofday(&now);	/* fills *ts in place; it does not return a value */
 	return timespec_trunc(now, sb->s_time_gran);
 }
 EXPORT_SYMBOL(current_fs_time);

....

I recently spent nearly a week tracking down an NFS cache coherence
problem in an application:

http://www.spinics.net/lists/linux-nfs/msg14974.html

Here is what caused my problem:

1) File dir/A is created locally on NFS server.
2) NFS client does LOOKUP on file dir/B, gets ENOENT.
3) File dir/B is created locally on NFS server.

In my case, these all happened in less than 4 milliseconds (much less,
actually). Since HZ on my system is 250, the file creation in step
(3) failed to update the ctime/mtime on the directory. The result is
that the NFS client's "dentry lookup cache" became stale, but did not
know it was stale (since it relies on the directory ctime/mtime to
detect that). Worse, the staleness persists even if additional
changes are made to the directory from the NFS client, thanks to NFS
v3's "weak cache consistency" optimizations.

Why did this take me a week to diagnose? Because I am using XFS, and
I know XFS and NFS use nanosecond resolution for file timestamps. It
never occurred to me that, here in 2010, Linux would have an actual
file timestamp resolution 6.5 orders of magnitude worse.

I know, I know, "use NFS v4 and i_version". But that is not the
point. The point is that 4 milliseconds is a very long time these
days; an awful lot of file system operations can happen in such an
interval.

I am guessing the objection to the above patch will be: "Waaah it's
slow!" My responses would be:

1) Anybody who cares about file system performance is already using
"noatime" or "relatime", which mitigates the hit greatly.

2) Correctness is more important than performance, and 4 milliseconds
is just embarrassing.

3) On the 99.99% of Linux systems that are post-1990 x86, it is not
slow at all, and the performance difference will be utterly
undetectable in the real world.

When was XFS designed? It has nanosecond timestamps. When was NFS
designed? It has nanosecond timestamps. Even ext4 has nanosecond
timestamps... But what is the point if 22 bits' worth will forever be
meaningless?
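
For reference, the arithmetic behind the "6.5 orders of magnitude" and "22 bits" figures, assuming a timestamp only advances once per tick of 1/HZ with HZ = 250, against a nominal unit of 1 ns:

```latex
\begin{align*}
\text{tick} &= \frac{1\,\mathrm{s}}{\mathrm{HZ}} = \frac{1\,\mathrm{s}}{250}
             = 4\,\mathrm{ms} = 4 \times 10^{6}\,\mathrm{ns} \\
\log_{10}\!\left(4 \times 10^{6}\right) &\approx 6.6
  \quad \text{(orders of magnitude coarser than 1 ns)} \\
\log_{2}\!\left(4 \times 10^{6}\right) &\approx 21.9
  \quad \text{(so about 22 low-order bits carry no information)}
\end{align*}
```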

If the above patch is too slow for some architectures, how about
making it a configuration option? Call it "CONFIG_1980S_FILE_TICK",
have it default to YES on the architectures that care and NO on
anything remotely modern and sane.

OK that's my proposal. Bash away.

- Pat
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: john stultz on
On Fri, Aug 13, 2010 at 11:25 AM, Patrick J. LoPresti
<lopresti(a)gmail.com> wrote:
> 3) On the 99.99% of Linux systems that are post-1990 x86, it is not
> slow at all, and the performance difference will be utterly
> undetectable in the real world.

Your stats are off here. The only fast clocksource on x86 is the TSC,
and it's busted on many, many systems. The CPU vendors have only
recently taken it seriously and resolved the majority of problems
(however, issues still remain on large NUMA systems, but it's much
better than the story was 3-7 years ago).

On those TSC-broken systems that use the hpet or acpi_pm, a
getnstimeofday call can take 0.5-1.3us, so the penalty can be quite
severe. And even with the TSC, expect some performance impact, as
reading hardware and doing the multiply is more costly than just
fetching a value from memory.

thanks
-john
From: Patrick J. LoPresti on
On Fri, Aug 13, 2010 at 11:45 AM, john stultz <johnstul(a)us.ibm.com> wrote:
>
> Your stats are off here. The only fast clocksource on x86 is the TSC,
> and it's busted on many, many systems. The CPU vendors have only
> recently taken it seriously and resolved the majority of problems
> (however, issues still remain on large NUMA systems, but it's much
> better than the story was 3-7 years ago).

Thank you for the correction. Still, the number of systems where TSC
works is large, it is growing over time, and.... Really now,
milliseconds? In 2010? On some Apple iToy, perhaps...

> On those TSC broken systems that use the hpet or acpi_pm, a
> getnstimeofday call can take 0.5-1.3us, so the penalty can be quite
> severe.

So you are saying my proposal is a bad idea forever? (But then why
even bother having nanosecond resolution on ext4?)

Or that it is a bad idea for now?

Or that it needs to be refined? Maybe use hi-res precision on systems
where it is known to be fast?

> And even with the TSC, expect some performance impact, as
> reading hardware and doing the multiply is more costly than just
> fetching a value from memory.

Relative to file system operations? Seriously? What performance hit
would you expect on real-world applications?
Something like 0.1% (10 nsec / 10 usec) worst case?

- Pat
From: john stultz on
On Fri, 2010-08-13 at 11:57 -0700, Patrick J. LoPresti wrote:
> On Fri, Aug 13, 2010 at 11:45 AM, john stultz <johnstul(a)us.ibm.com> wrote:
> > On those TSC broken systems that use the hpet or acpi_pm, a
> > getnstimeofday call can take 0.5-1.3us, so the penalty can be quite
> > severe.
>
> So you are saying my proposal is a bad idea forever? (But then why
> even bother having nanosecond resolution on ext4?)
>
> Or that it is a bad idea for now?

I'm not judging the idea as good/bad, just providing information for
context.

> Or that it needs to be refined? Maybe use hi-res precision on systems
> where it is known to be fast?
>
> > And even with the TSC, expect some performance impact, as
> > reading hardware and doing the multiply is more costly than just
> > fetching a value from memory.
>
> Relative to file system operations? Seriously? What performance hit
> would you expect on real-world applications?
> Something like 0.1% (10 nsec / 10 usec) worst case?

If you can show this does not affect performance in benchmarks, etc., I'm
sure it will be easier to push the patch. Outside of performance, I
don't think there's much of an issue with the change.

So other than "show some numbers", my only thought that might make the
patch more attractive is that rather than a global change, or a static
CONFIG_ option, would it maybe make more sense as a mount option?

thanks
-john

From: Patrick J. LoPresti on
On Fri, Aug 13, 2010 at 12:09 PM, john stultz <johnstul(a)us.ibm.com> wrote:
>
> So other than "show some numbers", my only thought that might make the
> patch more attractive is that rather than a global change, or a static
> CONFIG_ option, would it maybe make more sense as a mount option?

I really like this idea.

Consider the following "revision 2" of my proposal:

1) Add a function pointer "current_fs_time" to struct super_block.

2) Replace all calls of the form:

current_fs_time(sb);

with

sb->current_fs_time(sb);

3) Arrange for the default value to point to the current implementation.

These first three could be one patch. They change no functionality;
they just enable the next step.

Finally:

4) Add a mount option to cause sb->current_fs_time(sb) to use the
hi-res implementation.

Comments?

- Pat