From: Dan Williams on
On 9/14/06, Jakob Oestergaard <jakob(a)unthought.net> wrote:
> On Wed, Sep 13, 2006 at 12:17:55PM -0700, Dan Williams wrote:
> ...
> > >Out of curiosity; how does accelerated compare to non-accelerated?
> >
> > One quick example:
> > 4-disk SATA array rebuild on iop321 without acceleration - 'top'
> > reports md0_resync and md0_raid5 dueling for the CPU each at ~50%
> > utilization.
> >
> > With acceleration - 'top' reports md0_resync cpu utilization at ~90%
> > with the rest split between md0_raid5 and md0_raid5_ops.
> >
> > The sync speed reported by /proc/mdstat is ~40% higher in the accelerated
> > case.
>
> Ok, nice :)
>
> >
> > That being said, array resync is a special case, so your mileage may
> > vary with other applications.
>
> Everyday usage I/O performance data would be nice indeed :)
>
> > I will put together some data from bonnie++, iozone, maybe contest,
> > and post it on SourceForge.
>
> Great!
>
I have posted some Iozone data and graphs showing the performance
impact of the patches across the three IOP processors: iop321, iop331,
and iop341. The general takeaway from the data is that using DMA
engines extends the region that Iozone calls the "buffer cache
effect". Write performance benefited the most, as expected, but read
performance showed some modest gains as well. There are some regions
(smaller file sizes and record lengths) that show a performance
disadvantage, but it is typically less than 5%.

The graphs map the relative performance multiplier that the raid
patches generate ('2.6.18-rc6 performance' x 'performance multiplier'
= '2.6.18-rc6-raid performance'). A value of '1' designates equal
performance. The large cliff that drops to zero is a "not measured"
region, i.e. where the record length is larger than the file size.
Iozone outputs to Excel, but I have also made PDFs of the graphs
available. Note: OpenOffice Calc can open the data, but it does not
support the 3D surface graphs that Iozone uses.
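
As a worked example (hypothetical numbers, for illustration only): if
the baseline 2.6.18-rc6 kernel measured 100 MB/s at a given file
size/record length and the multiplier at that point is 1.4, then the
patched kernel measured 140 MB/s there; a multiplier below 1.0 marks
the regions where the patched kernel was slower.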

Excel:
http://prdownloads.sourceforge.net/xscaleiop/iozone_raid_accel.xls?download

PDF Graphs:
http://prdownloads.sourceforge.net/xscaleiop/iop-iozone-graphs-20061010.tar.bz2?download

Regards,
Dan
From: Neil Brown on
[dropped akpm from the Cc: as current discussion isn't directly
relevant to him]
On Tuesday October 10, dan.j.williams(a)intel.com wrote:
> On 10/8/06, Neil Brown <neilb(a)suse.de> wrote:
>
> > Is there something really important I have missed?
> No, nothing important jumps out. Just a follow-up question/note about
> the details.
>
> You imply that the async path and the sync path are unified in this
> implementation. I think it is doable, but it will add some complexity
> since the sync case is not simply a subset of the async case. For
> example, "Clear a target cache block" is required for the sync case,
> but it can go away when using hardware engines. Engines typically
> have their own accumulator buffer to store the temporary result,
> whereas software only operates on memory.
>
> What do you think of adding async tests for these situations?
> test_bit(XOR, &conf->async)
>
> Where a flag is set if calls to async_<operation> may be routed to a
> hardware engine? Otherwise skip any async-specific details.
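
As a minimal sketch of the flag-gated split being proposed (the
'async' bitmask and XOR_OFFLOAD bit below are illustrative names only,
not existing md/raid5 fields):

#include <stddef.h>
#include <string.h>

#define XOR_OFFLOAD 0	/* hypothetical bit in the proposed flag word */

struct demo_conf {
	unsigned long async;	/* which ops may be routed to a dma engine */
};

static void compute_block(struct demo_conf *conf, unsigned char *target,
			  unsigned char **srcs, int src_cnt, size_t len)
{
	if (!(conf->async & (1UL << XOR_OFFLOAD))) {
		/* software-only path: clear the target block, then
		 * accumulate each source into it with xor */
		memset(target, 0, len);
		for (int i = 0; i < src_cnt; i++)
			for (size_t j = 0; j < len; j++)
				target[j] ^= srcs[i][j];
	} else {
		/* dma-engine path: the engine accumulates internally,
		 * so the memset is unnecessary (submission omitted) */
	}
}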

I'd rather try to come up with an interface that is equally
appropriate to both offload and inline. I appreciate that it might
not be possible to get an interface that gets the best performance out
of both, but I'd like to explore that direction first.

I'd guess from what you say that the dma engine is given a bunch of
sources and a destination and it xor's all the sources together into
an accumulation buffer, and then writes the accum buffer to the
destination. Would that be right? Can you use the destination as one
of the sources?

That can obviously be done inline as well with some changes to the
xor code, and avoiding the initial memset might be good for
performance in its own right.

So I would suggest we drop the memset idea, and define the async_xor
interface to xor a number of sources into a destination, where the
destination is allowed to be the same as the first source, but
doesn't need to be.
Then the inline version could use a memset followed by the current xor
operations, or could use newly written xor operations, and the offload
version could equally do whatever is appropriate.
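
A minimal user-space sketch of that interface shape (illustrative
names, not a final API): the destination is seeded from the first
source instead of being memset to zero, and passing dest == srcs[0]
turns it into an in-place accumulation:

#include <stddef.h>
#include <string.h>

/* xor src_cnt sources into dest; dest may be the same buffer as
 * srcs[0], and no initial memset is needed in either case */
static void xor_into(unsigned char *dest, unsigned char **srcs,
		     int src_cnt, size_t len)
{
	if (dest != srcs[0])
		memcpy(dest, srcs[0], len);	/* seed from first source */

	for (int i = 1; i < src_cnt; i++)
		for (size_t j = 0; j < len; j++)
			dest[j] ^= srcs[i][j];
}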

Another place where combining operations might make sense is copy-in
and post-xor. In some cases it might be more efficient to read the
source only once, writing it to the destination and xoring it into the
target in a single pass. Would your DMA engine be able to optimise
this combination? I think current processors could certainly do
better if the two were combined.
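
A sketch of that combined copy-in + post-xor (again with illustrative
names): each source byte is read once and used for both the copy and
the parity update in the same pass:

#include <stddef.h>

static void copy_and_xor(unsigned char *copy_dest, unsigned char *parity,
			 const unsigned char *src, size_t len)
{
	for (size_t j = 0; j < len; j++) {
		unsigned char v = src[j];	/* single read of the source */
		copy_dest[j] = v;		/* copy-in */
		parity[j] ^= v;			/* post-xor */
	}
}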

So there is definitely room to move, but I would rather avoid flags
if I could.

NeilBrown