From: Terje Mathisen
Tim McCaffrey wrote:
> In article<rrOdncImYb8uBS_XnZ2dnUVZ8tqdnZ2d(a)lyse.net>, Terje.Mathisen(a)tmsw.no
> says...
>>
>> Tim McCaffrey wrote:
>>> Here is one you might find interesting.
>> [snip]
>>> do {
>>>     old_head = b->free_list_head;
>>>     if (old_head == b->free_list_tail)
>>>         return NULL; // list empty
>>>     index = b->free_queue[old_head & MAX_BUFFER_CHUNKS];
>>>     new_head = _InterlockedCompareExchange(&(b->free_list_head),
>>>                                            old_head + 1,
>>>                                            old_head);
>>> } while (new_head != old_head); // compare failed, try again
>>
>> This is yet another O(n*n) (or worse!) algorithm in the face of heavy
>> contention, afaik?
>>
>> Terje
>
> Well, no. The while only executes once if the circular buffer has more
> than #CPUs things in it (for what I developed it for, it did; it was a
> list of free buffers).
>
> In heavy use cases (where the circular buffer doesn't go empty), yes,
> everybody is pounding on the head and tail pointers. But at least they aren't
> spinning on them (usually).

I'm worried about the situation where multiple cores are all loading the
list header, then using ICAS to install the new value: only one of them can
win that exchange, and all the rest must loop, right?

The good thing is that there _will_ be a winner, so you'll see forward
progress.
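
For concreteness, here is a minimal sketch of that retry loop with exponential
backoff added for the contended case. The struct layout, the MAX_BUFFER_CHUNKS
mask value, and the tuning constants are assumptions for illustration, not
Tim's actual code; only the CAS loop itself follows his post.

#include <intrin.h>

#define MAX_BUFFER_CHUNKS 255  /* assumed: power of two minus 1, used as mask */

struct buffer_pool {
    volatile long free_list_head;
    volatile long free_list_tail;
    int free_queue[MAX_BUFFER_CHUNKS + 1];
};

/* Returns a free buffer index, or -1 if the list is empty. */
static int get_free_buffer(struct buffer_pool *b)
{
    long old_head, new_head;
    int index;
    unsigned delay = 1;

    do {
        old_head = b->free_list_head;
        if (old_head == b->free_list_tail)
            return -1;                              /* list empty */
        index = b->free_queue[old_head & MAX_BUFFER_CHUNKS];
        new_head = _InterlockedCompareExchange(&b->free_list_head,
                                               old_head + 1, old_head);
        if (new_head != old_head) {                 /* lost the race */
            for (unsigned i = 0; i < delay; i++)
                _mm_pause();                        /* spread the losers out */
            if (delay < 64)
                delay <<= 1;                        /* exponential backoff */
        }
    } while (new_head != old_head);
    return index;
}

The losers still loop, but the backoff spreads their retries out in time, so
the head's cache line isn't hammered by every core on every iteration.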

>
> I made sure the head & tail pointers had their own cache lines.

That's pretty much a given. :-)
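
A minimal sketch of such a layout, assuming 64-byte cache lines (the line
size varies by CPU) and explicit padding rather than compiler-specific
alignment attributes:

/* Padded variant of the pool layout sketched above: head and tail each
 * occupy their own cache line, so the enqueuers and dequeuers don't
 * false-share.  The allocation itself should also be line-aligned,
 * e.g. via _aligned_malloc(). */
struct buffer_pool {
    volatile long free_list_head;
    char pad0[64 - sizeof(long)];
    volatile long free_list_tail;
    char pad1[64 - sizeof(long)];
    int free_queue[MAX_BUFFER_CHUNKS + 1];
};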

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: Tim McCaffrey
In article <Pdidnba4R9Mkqy7XnZ2dnUVZ8h2dnZ2d(a)lyse.net>, Terje.Mathisen(a)tmsw.no
says...
>
>[snip]
>
>I'm worried about the situation where multiple cores are all loading the
>list header, then using ICAS to install the new value: only one of them can
>win that exchange, and all the rest must loop, right?
>
>The good thing is that there _will_ be a winner, so you'll see forward
>progress.
>


Yes, you're right. For the application I used this in, it almost never
(according to kernrate) had a contention issue there. There was too much
other work going on, so it was unlikely two (or more) threads hit that
problem.

The performance problem I did have was at the CMPXCHG & XADD instructions: I
still had cache thrashing, so changing this from a spin-lock protected list to
this "lockless" circular buffer didn't really get me anything.

Live & learn.

- Tim

From: Chris M. Thomasson
"Tim McCaffrey" <timcaffrey(a)aol.com> wrote in message
news:h90ovp$m9k$1(a)USTR-NEWS.TR.UNISYS.COM...
> [snip]
>
> Yes, you're right. For the application I used this in, it almost never
> (according to kernrate) had a contention issue there. There was too much
> other work going on, so it was unlikely two (or more) threads hit that
> problem.
>
> The performance problem I did have was at the CMPXCHG & XADD instructions:
> I still had cache thrashing, so changing this from a spin-lock protected
> list to this "lockless" circular buffer didn't really get me anything.
>
> Live & learn.

I agree.


If contended accesses to a particular resource are few and far between, then
a clever lock-based setup should work out fine.

From: Brett Davis
In article <5cOdnSMDsdaFmDvXnZ2dnUVZ8qydnZ2d(a)lyse.net>,
Terje Mathisen <Terje.Mathisen(a)tmsw.no> wrote:

> Robert Myers wrote:
> > Prefetch is hugely important, but how it actually works must involve a
> > great deal of reverse-engineering on the part of competitors, because
> > meaningful details never seem to be forthcoming from manufacturers.
> > I'm assuming that Microsoft's compiler designers, for example, know
> > lots of useful things that most others don't, and that they got them
> > from the horse's mouth under an NDA.
>
> I haven't seen a single x86-type CPU, from any manufacturer, where
> letting the compiler issue PREFETCH instructions turns out to be a
> general win.

Let me chime in and say the PREFETCH instructions are useless on MIPS
and PowerPC as well. As currently implemented by hardware they act as
ordinary read instructions, and block one of the two read ports.

The one benefit of PREFETCH is that it will not raise an access violation
when running off the end of an array and into someone else's data, so it
is slightly better than an ordinary read.

Of course when adding PREFETCH slows down your code, that benefit is
academic.
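
To make the pattern concrete, here is a minimal sketch using GCC/Clang's
__builtin_prefetch; the prefetch distance of 8 elements is an arbitrary
assumption. Near the end of the loop the prefetch address runs past a[n-1],
which is exactly the off-the-end access a prefetch tolerates and a real load
would not:

long sum_with_prefetch(const long *a, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++) {
        __builtin_prefetch(&a[i + 8]);  /* may point past the array: no fault */
        sum += a[i];
    }
    return sum;
}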

> Yes, they obviously do help a little with SPEC, particularly after the
> compiler has been carefully tuned for these benchmarks, but in real life
> I haven't seen this.
>
> OTOH, hardware-based prefetch, in the form of stream detection in the
> memory interface is indeed a huge win, but the compiler isn't involved
> at all.
From: nmm1
In article <qejbb594tjah6s64vff144lickg1m5erat(a)4ax.com>,
Emil Naepflein <netnewsegn(a)kabelmail.de> wrote:
>On Sun, 20 Sep 2009 06:25:09 GMT, Brett Davis <ggtgp(a)yahoo.com> wrote:
>
>>Of course when adding PREFETCH slows down your code, that benefit is
>>academic.
>
>I don't agree here. About 10 years ago I did a lot of performance
>optimizations for TCP checksum and bcopy on R10K CPUs. I got performance
>improvements for these functions of up to 90%, just by adding PREF
>instructions. In total this reduced CPU consumption per transferred TCP
>byte by about 30%.

And the date's the point. Up until recently, the main CPU bottleneck
in most programs was memory latency, and only a small proportion of
the available bandwidth was actually used - except, of course, in
tuned HPC programs (cue John McCalpin). But, with the increasing
number of cores, bandwidth is returning as the main bottleneck.

Prefetching uses extra bandwidth to save on latency, and thus works
only when you have a surplus of the former and the latter is the
problem.


Regards,
Nick Maclaren.