From: Andy Glew on
On 7/13/2010 2:46 PM, Morten Reistad wrote:
> In article<4f8hg7-mcg2.ln1(a)ntp.tmsw.no>,
> Terje Mathisen<"terje.mathisen at tmsw.no"> wrote:
>> Brett Davis wrote:
>>> In article<3l9eg7-j98.ln1(a)laptop.reistad.name>,
>>> Morten Reistad<first(a)last.name> wrote:
>>>> Seems to give large speedups to some fundamental algorithms.
>>>
>>> "You are Doing It Wrong"
>>> http://queue.acm.org/detail.cfm?id=1814327
>>
> The trick is to be choosy about which locality is important. Suddenly
> a vertical vs. horizontal organisation of b-trees becomes very important
> for performance.
>
>>> Ten times faster than B-Tree with B-Heap.
>>
>> The problem here is that phk had two sorts of cached data: Web articles
>> with images and the index, and they were both allowed to fight for the
>> same limited resource (real DRAM pages).
>
> This isn't so much about RAM, it is about cache. We see this effect
> creep in all over the place now.

I could not help but wonder if the B-Heap speedups might also be due in
part to TLB locality.
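
A rough way to see why that is plausible (my sketch, not phk's code;
assuming 4 KiB pages and 8-byte entries): in a classic implicit binary
heap, every root-to-leaf step past index pagesize/entrysize lands on a
new page, so each deep level costs its own TLB entry, while a B-heap
packs page-sized subtrees so several levels share one page.

    // Count distinct 4 KiB pages touched on one root-to-leaf walk of a
    // plain implicit binary heap (node i has children 2i and 2i+1).
    #include <cstddef>
    #include <cstdio>
    #include <set>

    int main() {
        const std::size_t kPage  = 4096;                  // assumed page size
        const std::size_t kElem  = sizeof(long);          // 8-byte entries
        const std::size_t kItems = std::size_t(1) << 24;  // 128 MiB heap

        std::size_t levels = 0;
        std::set<std::size_t> pages;
        for (std::size_t i = 1; i < kItems; i *= 2) {  // follow left child
            ++levels;
            pages.insert(i * kElem / kPage);           // page holding node i
        }
        std::printf("%zu levels, %zu distinct pages\n", levels, pages.size());
        // Prints "24 levels, 16 distinct pages": the last 15 levels each
        // need their own page, hence their own TLB entry.
        return 0;
    }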
From: Terje Mathisen on
Morten Reistad wrote:
> In article<4f8hg7-mcg2.ln1(a)ntp.tmsw.no>,
> Terje Mathisen<"terje.mathisen at tmsw.no"> wrote:
>> It still makes sense to pack the index so that all the higher levels
>> will fit in the lower cache levels.
>
> This isn't so much about RAM, it is about cache. We see this effect
> creep in all over the place now.

Isn't that what I wrote in the previous paragraph?

Anyway, CPU cache is very important (witness my .sig), but in this
particular case phk was specifically targeting RAM used as a disk cache:
same kind of problem, ~3 orders of magnitude slower.
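
To make the packing argument concrete (a sketch, with assumed numbers:
64-byte cache lines, 32-bit keys): size each index node to exactly one
line, and a lookup touches one line per level, with the top levels
small enough to stay resident in L1/L2.

    // One-cache-line B-tree index node: 7 separator keys, 8 children.
    // With fanout 8, the top three levels are 1 + 8 + 64 = 73 nodes,
    // under 5 KB, so they sit comfortably in L1.
    #include <cstdint>

    struct alignas(64) IndexNode {
        std::uint32_t key[7];    // separator keys
        std::uint32_t child[8];  // indices into a flat node array
        std::uint32_t count;     // keys currently in use
    };
    static_assert(sizeof(IndexNode) == 64, "one node per cache line");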

> And the actual cache effects can be pretty surprising at times.
>
> Watch an 8-processor (@2.4 GHz) HP Xeon with 4 GB RAM outperform a
> 36-processor (@2.6 GHz) (*) Sun with 40 GB RAM by a factor of 6.
>
> It was all in the L2/L3 cache design.

Fun stuff!

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: Piotr Wyderski on
Morten Reistad wrote:

> Watch an 8-processor (@2.4 GHz) HP Xeon with 4 GB RAM outperform a
> 36-processor (@2.6 GHz) (*) Sun with 40 GB RAM by a factor of 6.

I can confirm your observations, as we have encountered exactly the same
issue.

> It was all in the L2/L3 cache design.

Not only that: SPARCs also have much worse memory disambiguation mechanisms.
Even such a simple loop shows stunning differences:

for(volatile unsigned int i = 0; i != 1000000000; ++i);

Investigation at the assembler level has shown that the SPARC is good
(i.e. Xeon-like) at loads and stores, but when there is a long
store-load dependence, the performance loss is a total disaster, to
put it mildly. Xeons do not exhibit this kind of problem.
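
For anyone who wants to reproduce the comparison, here is the loop in a
self-contained harness (the std::chrono timing is my addition): the
volatile induction variable forces a store and a dependent reload of
the same stack slot on every iteration, so throughput is bounded by the
store-to-load path.

    #include <chrono>
    #include <cstdio>

    int main() {
        const unsigned kIters = 1000000000;
        auto t0 = std::chrono::steady_clock::now();
        for (volatile unsigned int i = 0; i != kIters; ++i);
        auto t1 = std::chrono::steady_clock::now();
        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
        std::printf("%.2f ns per iteration\n", ns / kIters);
        return 0;
    }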

Best regards
Piotr Wyderski

From: Morten Reistad on
In article <4C3D1CB0.8000704(a)andy.glew.ca>,
Andy Glew <giganews(a)andy.glew.ca> wrote:
>On 7/13/2010 2:46 PM, Morten Reistad wrote:
>> In article<4f8hg7-mcg2.ln1(a)ntp.tmsw.no>,
>> Terje Mathisen<"terje.mathisen at tmsw.no"> wrote:
>>> Brett Davis wrote:
>>>> In article<3l9eg7-j98.ln1(a)laptop.reistad.name>,
>>>> Morten Reistad<first(a)last.name> wrote:
>>>>> Seems to give large speedups to some fundamental algorithms.
>>>>
>>>> "You are Doing It Wrong"
>>>> http://queue.acm.org/detail.cfm?id=1814327
>>>
>> The trick is to be choosy about which locality is important. Suddenly
>> a vertical vs. horizontal organisation of b-trees becomes very important
>> for performance.
>>
>>>> Ten times faster than B-Tree with B-Heap.
>>>
>>> The problem here is that phk had two sorts of cached data: Web articles
>>> with images and the index, and they were both allowed to fight for the
>>> same limited resource (real DRAM pages).
>>
>> This isn't so much about RAM, it is about cache. We see this effect
>> creep in all over the place now.
>
>I could not help but wonder if the B-Heap speedups might also be in some
>part due to TLB locality.

That too. But the cache seems to be the really important bit,
including snoops into the caches of the other processors.

The cache counters are accessible through a loadable module
starting in Linux 2.6.31, so you can have a look at what the MMU
is actually doing. It is educational to see the cache activity.
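
Something like this reads one of those counters from user space via
the perf_event_open(2) syscall that arrived with 2.6.31 (a sketch,
error handling trimmed; the generic event here may map differently on
your particular CPU):

    #include <linux/perf_event.h>
    #include <asm/unistd.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    int main() {
        perf_event_attr attr;
        std::memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_CACHE_MISSES;
        attr.disabled = 1;
        attr.exclude_kernel = 1;

        // Count for the calling process, on any CPU.
        int fd = (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        /* ... the workload being measured goes here ... */
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        std::uint64_t misses = 0;
        read(fd, &misses, sizeof(misses));
        std::printf("cache misses: %llu\n", (unsigned long long)misses);
        close(fd);
        return 0;
    }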

-- mrr
From: MitchAlsup on
On Jul 14, 5:37 am, "Piotr Wyderski"
<piotr.wyder...(a)mothers.against.spam.gmail.com> wrote:
> Not only that: SPARCs also have much worse memory disambiguation mechanisms.
> Even such a simple loop shows stunning differences:
>
>     for(volatile unsigned int i = 0; i != 1000000000; ++i);
>
> Investigation at the assembler level has shown that the SPARC is good
> (i.e. Xeon-like) at loads and stores, but when there is a long
> store-load dependence, the performance loss is a total disaster, to
> put it mildly. Xeons do not exhibit this kind of problem.

x86 grew up in an environment where stores were reloaded in a rather
short amount of time (arguments get pushed, then accessed a couple of
cycles later in the called subroutine). Thus, there are mechanisms to
forward store data to loads when the addresses match, even when the
store instruction has not retired. This normally goes under the
moniker of Store-to-Load Forwarding (STLF).
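
In source terms, the pattern is simply a store followed closely by a
load of the same location (a minimal illustration; the volatile
counter is mine, but it is the same store/reload pair Piotr's loop
creates each iteration):

    volatile unsigned counter;

    void bump() {
        counter = counter + 1;  // store to counter's memory slot
        unsigned v = counter;   // dependent load of the just-stored value:
        (void)v;                // with STLF it reads the store buffer;
                                // without, it waits for the store to drain.
    }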

Mitch