Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs) [Kernel]

Prev: (none)
Next: 2.6.35-rc3: System unresponsive under load

From: Ric Wheeler on 26 Jun 2010 10:00

On 06/26/2010 08:34 AM, Daniel Shiels wrote:
>> 25.06.2010 22:58, Ric Wheeler wrote:
>>
>>> On 06/24/2010 06:06 PM, Daniel Taylor wrote:
>>>
>> []
>>
>>>>> On Wed, Jun 23, 2010 at 8:43 PM, Daniel Taylor
>>>>> <Daniel.Taylor(a)wdc.com> wrote:
>>>>>
>>>>>
>>>>>> Just an FYI reminder. The original test (2K files) is utterly
>>>>>> pathological for disk drives with 4K physical sectors, such as
>>>>>> those now shipping from WD, Seagate, and others. Some of the
>>>>>> SSDs have larger (16K) or smaller (2K) blocks. There is also
>>>>>> the issue of btrfs over RAID (which I know is not entirely
>>>>>> sensible, but which will happen).
>>>>>>
>> Why it is not sensible to use btrfs on raid devices?
>> Nowadays raid is just everywhere, from 'fakeraid' on AHCI to
>> large external arrays on iSCSI-attached storage. Sometimes
>> it is nearly impossible to _not_ use RAID; many servers
>> come with a built-in RAID card which can't be turned off or
>> disabled. And hardware raid is faster (at least in theory),
>> if only because it puts less load on various system busses.
>>
>> To many "enterprise folks" a statement "we don't need hw raid,
>> we have better solution" sounds like "we're just a toy, don't
>> use".
>>
>> Hmm? ;)
>>
>> /mjt, who always used and preferred _software_ raid due to
>> multiple reasons, and never used btrfs so far.
>>
> It's not that you shouldn't use it on raid; it's just that you lose some
> of the value of the file system.
>
> Two nice features that btrfs provides are checksums and mirroring. If a
> disk corrupts a block then btrfs will realize due to the strong checksum
> and use the mirrored block. If you are using a raid system the raid won't
> know the data is corrupted and raid doesn't provide a way for the file
> system to get to the redundant block.
>
> I read a paper from Sun a while back saying that the undetected read
> failure rates of modern disks have not changed for many years. Disks are
> so large now that undetected failures are unacceptably likely for many
> systems. Hence zfs implementing similar raid schemes inside the file system.
>
> In my lab I used dd to clobber data in some of my mirrors. Btrfs logged
> lots of checksum errors but never corrupted a file. Doing the same on a
> classic raid with a classic filesystem (solaris with veritas volume
> manager) silently gave me bad data, depending on which disk it felt like
> reading from.
>
> Daniel.
>
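
[Illustration, not part of the original message.] The checksum-plus-mirror behaviour Daniel describes can be sketched in a few lines of Python. This is a toy model under stated assumptions, not btrfs code: the class name MirroredStore, its methods, and the in-memory "disks" are all hypothetical, and zlib.crc32 stands in for the checksum btrfs actually uses. It only shows why a file system that owns both the checksum and the mirrors can route around a silently corrupted copy, where a RAID layer without checksums cannot tell which copy is bad.

```python
import zlib


class MirroredStore:
    """Toy two-way mirror with per-block checksums stored apart from the data.

    Hypothetical model for illustration only; zlib.crc32 is a stand-in
    checksum, and the "disks" are plain Python lists.
    """

    def __init__(self, num_blocks, block_size=4096):
        self.block_size = block_size
        empty = bytes(block_size)
        # Two independent copies of every block (the "mirrors").
        self.mirrors = [[empty] * num_blocks for _ in range(2)]
        # One checksum per block, kept as file-system metadata.
        self.checksums = [zlib.crc32(empty)] * num_blocks
        # Log of (block index, mirror number) pairs that failed verification.
        self.checksum_errors = []

    def write(self, index, data):
        if len(data) != self.block_size:
            raise ValueError("data must be exactly one block long")
        self.checksums[index] = zlib.crc32(data)
        for mirror in self.mirrors:
            mirror[index] = data

    def read(self, index):
        """Return the first copy whose checksum verifies; log the bad ones."""
        expected = self.checksums[index]
        for mirror_no, mirror in enumerate(self.mirrors):
            data = mirror[index]
            if zlib.crc32(data) == expected:
                return data
            self.checksum_errors.append((index, mirror_no))
        raise IOError(f"block {index}: no mirror passed checksum verification")

    def read_without_checksum(self, index, mirror_no):
        """What a checksum-less mirror does: return whichever copy it picked."""
        return self.mirrors[mirror_no][index]
```

Overwriting a block in one of the two lists (the analogue of the dd experiment above) makes read() log an error and return the intact copy, while read_without_checksum() returns good or bad data depending on which mirror it happens to be pointed at.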

I was one of the many people who worked at EMC on designing storage
arrays. If you are using any high-end, external hardware array, it will
detect data corruption proactively for you. Most arrays continually
scan for latent errors and have internal data integrity checks that are
used for this.

Note that DIF/DIX adds an extra 8 bytes of data integrity information
per sector on disks built to the newer standards. We don't do anything
with that today in btrfs, but you could imagine ways to use it to get
even better data integrity protection.

If you are using software RAID (MD), you should also use its internal
checks to do this kind of proactive detection of latent errors on a
regular basis (say once every week or two).
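
[Illustration, not part of the original message.] A rough sketch of how such a periodic MD check might be triggered from a script, assuming the md sysfs attributes sync_action and mismatch_cnt under /sys/block/<array>/md/. The function names, the sysfs_root parameter (there so the code can be pointed at a test directory), and the array name md0 in the usage note are hypothetical. Writing to sync_action normally requires root.

```python
from pathlib import Path


def md_attr_dir(md_name, sysfs_root="/sys"):
    """Directory holding the md attributes for one array, e.g. /sys/block/md0/md."""
    return Path(sysfs_root) / "block" / md_name / "md"


def start_md_check(md_name, sysfs_root="/sys"):
    """Ask the md driver to start a read-only consistency check of the array."""
    (md_attr_dir(md_name, sysfs_root) / "sync_action").write_text("check\n")


def current_md_action(md_name, sysfs_root="/sys"):
    """Return the array's current sync action, e.g. 'idle' or 'check'."""
    return (md_attr_dir(md_name, sysfs_root) / "sync_action").read_text().strip()


def read_mismatch_count(md_name, sysfs_root="/sys"):
    """Return the mismatch counter md reports after a check has run."""
    text = (md_attr_dir(md_name, sysfs_root) / "mismatch_cnt").read_text()
    return int(text.strip())
```

Calling start_md_check("md0") from a weekly cron job or timer, then reading read_mismatch_count("md0") once current_md_action("md0") reports idle again, would give the kind of regular latent-error scan described above. Many distributions already ship a scheduled job that does the equivalent.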

Ric
