From: Johannes Hirte on
On Tuesday, 13 July 2010, 14:23:58, Johannes Hirte wrote:
> On the Opteron system I now get csum errors. I synced some data from the
> netbook to the Opteron yesterday. After hitting ENOSPC with 4GB free, I
> ran 'btrfs-vol -b' on this fs in the hope of getting some more free space. It
> worked but the command failed and I found in dmesg:
>
> btrfs csum failed ino 339 off 935280640 csum 337776576 private 337776575
> btrfs csum failed ino 339 off 935280640 csum 337776576 private 337776575
> btrfs csum failed ino 339 off 935280640 csum 337776576 private 337776575
> btrfs csum failed ino 339 off 935280640 csum 337776576 private 337776575
>
> So I tested the newly synced data by syncing it to another disk on the
> Opteron system (XFS). As I expected (or rather feared), some data wasn't
> readable and I found more csum errors in dmesg:
>
> btrfs csum failed ino 1849137 off 368640 csum 3354885689 private 3354885688
> btrfs csum failed ino 1849137 off 368640 csum 3354885689 private 3354885688
> btrfs csum failed ino 1849137 off 368640 csum 3354885689 private 3354885688
> btrfs csum failed ino 1849137 off 368640 csum 3354885689 private 3354885688
> btrfs csum failed ino 1849137 off 368640 csum 3354885689 private 3354885688
> btrfs csum failed ino 1849137 off 368640 csum 3354885689 private 3354885688
> btrfs csum failed ino 1849137 off 368640 csum 3354885689 private 3354885688
> btrfs csum failed ino 1849137 off 368640 csum 3354885689 private 3354885688
> btrfs csum failed ino 1849137 off 368640 csum 3354885689 private 3354885688
> btrfs csum failed ino 1849137 off 368640 csum 3354885689 private 3354885688
> btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547
> btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547
> btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547
> btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547
> btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547
> btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547
> btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547
> btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547
> btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547
> btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547
> btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547
> btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547
> btrfs csum failed ino 1959333 off 252362752 csum 686735346 private 686735345
> btrfs csum failed ino 1959333 off 252362752 csum 686735346 private 686735345
> btrfs csum failed ino 1959333 off 252362752 csum 686735346 private 686735345
> btrfs csum failed ino 1959333 off 252362752 csum 686735346 private 686735345
> btrfs csum failed ino 1959333 off 252362752 csum 686735346 private 686735345
> btrfs csum failed ino 1959333 off 252362752 csum 686735346 private 686735345
> btrfs csum failed ino 1959333 off 651108352 csum 2851505977 private 2851505976
> btrfs csum failed ino 1959333 off 651108352 csum 2851505977 private 2851505976
> btrfs csum failed ino 1959333 off 651108352 csum 2851505977 private 2851505976
> btrfs csum failed ino 1959333 off 651108352 csum 2851505977 private 2851505976
> btrfs csum failed ino 1959333 off 651108352 csum 2851505977 private 2851505976
> btrfs csum failed ino 1959333 off 651108352 csum 2851505977 private 2851505976
> btrfs csum failed ino 1959333 off 898342912 csum 4271223884 private 4271223883
> btrfs csum failed ino 1959333 off 898342912 csum 4271223884 private 4271223883
> btrfs csum failed ino 1959333 off 898342912 csum 4271223884 private 4271223883
> btrfs csum failed ino 1959333 off 898342912 csum 4271223884 private 4271223883
> btrfs csum failed ino 1959333 off 898342912 csum 4271223884 private 4271223883
> btrfs csum failed ino 1959333 off 898342912 csum 4271223884 private 4271223883

I think this is a different error. I've only seen them on filesystems from my
Opteron system. It seems that the recorded csums are wrong and it looks to me
like rounding errors. The data itself should be correct, as I've tested one
affected file via md5sum against the original on another filesystem.
Any ideas what is going wrong here?
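
The "off by one" pattern in the dmesg lines can be checked mechanically. The following is a minimal userspace sketch, not kernel code: it parses one "btrfs csum failed" line and returns the difference between the two values. The struct and function names are made up for illustration, and it assumes that "csum" is the value computed from the data just read while "private" is the value the filesystem had recorded.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* One parsed "btrfs csum failed" record (hypothetical helper type). */
struct csum_record {
	uint64_t ino;		/* inode number */
	uint64_t off;		/* byte offset within the file */
	uint32_t computed;	/* "csum": assumed to be computed from the data read */
	uint32_t stored;	/* "private": assumed to be the recorded checksum */
};

/* Parse one dmesg line; returns 1 on success, 0 if the line does not match. */
int parse_csum_line(const char *line, struct csum_record *rec)
{
	int n = sscanf(line,
		       "btrfs csum failed ino %" SCNu64 " off %" SCNu64
		       " csum %" SCNu32 " private %" SCNu32,
		       &rec->ino, &rec->off, &rec->computed, &rec->stored);

	return n == 4;
}

/* Signed difference computed - stored, widened so it cannot wrap. */
int64_t csum_delta(const struct csum_record *rec)
{
	return (int64_t)rec->computed - (int64_t)rec->stored;
}
```

Feeding each distinct line from the log above through parse_csum_line() and csum_delta() gives a difference of 1 in every case.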

regards,
Johannes
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Chris Mason on
On Thu, Jul 15, 2010 at 08:30:17PM +0200, Johannes Hirte wrote:
> On Tuesday, 13 July 2010, 14:23:58, Johannes Hirte wrote:
> > ino 1959333 off 898342912 csum 4271223884 private 4271223883
>
> I think this is a different error. I've only seen them on filesystems from my
> Opteron system. It seems that the recorded csums are wrong and it looks to me
> like rounding errors. The data itself should be correct, as I've tested one
> affected file via md5sum against the original on another filesystem.
> Any ideas what is going wrong here?

Are you doing data mirroring?

We can map that block and do a raw read off the device to see what the
data blocks actually contain.
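
As a rough sketch of the raw-read half of that suggestion: once the physical byte offset of the block is known, a plain pread() on the device returns the on-disk contents without going through the filesystem. Mapping the file offset to a physical offset is not shown here. The function name, the 4096-byte block size and any path or offset passed in are assumptions for illustration only.

```c
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define RAW_BLOCK_SIZE 4096	/* assumed block size */

/*
 * Read one block from the device (or image file) at `path`, starting at
 * byte offset `phys`. Returns 0 on success, -1 on error or short read.
 */
int read_raw_block(const char *path, off_t phys,
		   unsigned char buf[RAW_BLOCK_SIZE])
{
	size_t done = 0;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;

	/* pread() may return fewer bytes than requested, so loop. */
	while (done < RAW_BLOCK_SIZE) {
		ssize_t got = pread(fd, buf + done, RAW_BLOCK_SIZE - done,
				    phys + (off_t)done);
		if (got <= 0)
			break;
		done += (size_t)got;
	}
	close(fd);
	return done == RAW_BLOCK_SIZE ? 0 : -1;
}
```

The buffer can then be hex-dumped or checksummed in userspace and compared with what the filesystem returns for the same file range.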

-chris
From: Chris Mason on
On Thu, Jul 15, 2010 at 09:32:12PM +0200, Johannes Hirte wrote:
> On Thursday, 15 July 2010, 21:03:09, Chris Mason wrote:
> > On Thu, Jul 15, 2010 at 08:30:17PM +0200, Johannes Hirte wrote:
> > > On Tuesday, 13 July 2010, 14:23:58, Johannes Hirte wrote:
> > > > ino 1959333 off 898342912 csum 4271223884 private 4271223883
> > >
> > > I think this is a different error. I've only seen them on filesystems
> > > from my Opteron system. It seems that the recorded csums are wrong and
> > > it looks to me like rounding errors. The data itself should be correct,
> > > as I've tested one affected file via md5sum against the original on
> > > another filesystem. Any ideas what is going wrong here?
> >
> > Are you doing data mirroring?
>
> No, I'm not.
>
> > We can map that block and do a raw read off the device to see what the
> > data blocks actually contain.
>
> I've modified the btrfs-source a little to get the data. In inode.c I've
> changed the code to:

Great. The bad csums are all just one bit off; that can't be an
accident. When were they written (which kernel)? Did you boot a 32-bit
kernel on there at any time?
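
"Off by one" can be read two ways: as a numeric difference of 1, or as a single flipped bit. The sketch below tells the two readings apart for the six distinct (csum, private) pairs from the log. The helper and array names are illustrative only.

```c
#include <stdint.h>

/* Number of bit positions in which a and b differ (Hamming distance). */
unsigned hamming32(uint32_t a, uint32_t b)
{
	uint32_t x = a ^ b;
	unsigned n = 0;

	while (x) {
		x &= x - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}

/* The six distinct { csum, private } pairs reported in the thread. */
const uint32_t csum_pairs[6][2] = {
	{  337776576u,  337776575u },
	{ 3354885689u, 3354885688u },
	{  847944548u,  847944547u },
	{  686735346u,  686735345u },
	{ 2851505977u, 2851505976u },
	{ 4271223884u, 4271223883u },
};
```

For these values csum - private is exactly 1 in every pair, while hamming32() ranges from 1 (second pair) to 7 (first pair), so the mismatch is a numeric off-by-one and not always a single-bit difference.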

-chris
From: Johannes Hirte on
On Thursday, 15 July 2010, 21:03:09, Chris Mason wrote:
> On Thu, Jul 15, 2010 at 08:30:17PM +0200, Johannes Hirte wrote:
> > On Tuesday, 13 July 2010, 14:23:58, Johannes Hirte wrote:
> > > ino 1959333 off 898342912 csum 4271223884 private 4271223883
> >
> > I think this is a different error. I've only seen them on filesystems
> > from my Opteron system. It seems that the recorded csums are wrong and
> > it looks to me like rounding errors. The data itself should be correct,
> > as I've tested one affected file via md5sum against the original on
> > another filesystem. Any ideas what is going wrong here?
>
> Are you doing data mirroring?

No, I'm not.

> We can map that block and do a raw read off the device to see what the
> data blocks actually contain.

I've modified the btrfs-source a little to get the data. In inode.c I've
changed the code to:


	csum = btrfs_csum_data(root, kaddr + offset, csum, end - start + 1);
	btrfs_csum_final(csum, (char *)&csum);
	if (csum != private)
		if (printk_ratelimit()) {
			printk(KERN_INFO "csum != private; ino %lu off %llu "
			       "csum %u private %llu\n", page->mapping->host->i_ino,
			       (unsigned long long)start, csum,
			       (unsigned long long)private);
		}
	// goto zeroit;

	kunmap_atomic(kaddr, KM_USER0);

This way I could read the files with wrong csums too. As I wrote, I compared
the md5sum of one file with a copy on another filesystem. As they are the
same, the data should be correct, at least for this file. The big question is:
why do the csums differ?
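
One way to narrow this down is to recompute the checksum of an affected block outside the kernel and see whether the result matches "csum" or "private". The following is a minimal sketch, assuming that btrfs data checksums are CRC-32C (Castagnoli) seeded with ~0 and inverted at the end. That assumption fits the btrfs_csum_data()/btrfs_csum_final() pair above but should be verified against the kernel in use. The function names are illustrative, and the bitwise loop favours clarity over speed.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Raw CRC-32C update over `len` bytes, without pre- or post-inversion.
 * Uses the reflected Castagnoli polynomial 0x82F63B78.
 */
uint32_t crc32c_update(uint32_t crc, const void *data, size_t len)
{
	const unsigned char *p = data;

	while (len--) {
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

/* Standard CRC-32C of a buffer: seed with ~0, invert the result. */
uint32_t crc32c(const void *data, size_t len)
{
	return ~crc32c_update(~(uint32_t)0, data, len);
}
```

Running crc32c() over the contents of a block that triggers the message, for example the 4096 bytes at file offset 368640 of inode 1849137 (assuming 4096-byte blocks), shows which of the two logged values an independent implementation agrees with.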

regards,
Johannes
From: Johannes Hirte on
On Thursday, 15 July 2010, 21:35:51, Chris Mason wrote:
> On Thu, Jul 15, 2010 at 09:32:12PM +0200, Johannes Hirte wrote:
> > On Thursday, 15 July 2010, 21:03:09, Chris Mason wrote:
> > > On Thu, Jul 15, 2010 at 08:30:17PM +0200, Johannes Hirte wrote:
> > > > On Tuesday, 13 July 2010, 14:23:58, Johannes Hirte wrote:
> > > > > ino 1959333 off 898342912 csum 4271223884 private 4271223883
> > > >
> > > > I think this is a different error. I've only seen them on
> > > > filesystems from my Opteron system. It seems that the recorded csums
> > > > are wrong and it looks to me like rounding errors. The data itself
> > > > should be correct, as I've tested one affected file via md5sum
> > > > against the original on another filesystem. Any ideas what is going
> > > > wrong here?
> > >
> > > Are you doing data mirroring?
> >
> > No, I'm not.
> >
> > > We can map that block and do a raw read off the device to see what the
> > > data blocks actually contain.
> >
> > I've modified the btrfs-source a little to get the data. In inode.c I've
> > changed the code to:
>
> Great. The bad csums are all just one bit off; that can't be an
> accident. When were they written (which kernel)? Did you boot a 32-bit
> kernel on there at any time?

No, I don't have a bootable 32-bit installation on this system. I've now tested
it with a 32-bit system by dumping the whole filesystem to an external drive
and mounting it on a 32-bit system. The result was the same.

The affected files were written by different kernels. I think at least 2.6.34,
2.6.35-rc3 and 2.6.35-rc4 were involved, perhaps 2.6.33 too. I'll try to
figure it out more exactly.

regards,
Johannes