From: Mayan Moudgill on
Tim McCaffrey wrote:
> In article <gL6dnWc2Y61AOzPXnZ2dnUVZ_j-dnZ2d(a)bestweb.net>, mayan(a)bestweb.net says...
>
>>Tim McCaffrey wrote:
>
>
> That doesn't really get rid of the looong pause when a CPU has to read a cache
> line in from memory. If I could push it to another CPU, then there might be some
> situations that could be sped up.

Depends on the implementation:
- Does it snoop bus-traffic?
- If it does, does it just invalidate hits, or does it update the data?
[It might do both if it has an L2 cache; invalidate it out of the L1 and
update it into the L2].

That might get rid of (part of) the long pause.

What architecture are you working on? If it's MP-able, and non-x86,
non-POWER, is it one of the ARM MP implementations? [Use e-mail if you
don't want to broadcast it to the world].

>
> Here is another case:
>
> Device has a queue of requests in memory. Unfortunately, the queue entries
> are not cache-line sized, so you get the following:
>
> CPU #1:
>     Spin locks the queue (cache line miss if spin lock was grabbed by
>     another CPU).
>     Update pointers (ditto if another CPU wrote to the queue previously).
>     Write to the queue (ditto).
>     Write word to the device telling it has something in the queue.
> Device:
>     Reads queue.
>     - CPU #1 is forced to write back any data that device wants to
>       read, including queue pointers and queue data.
>
> CPU #2:
>     Spin locks the queue ....

Any chance you have locked transaction buses, and can use them? That
might get rid of the need for locking.
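
To make the ping-pong in the queue scenario above concrete, here is a minimal
sketch of the pattern; every name in it (dev_queue_t, request_t, QDEPTH,
DOORBELL) is made up for illustration, and real driver code would differ:

#include <intrin.h>

#define QDEPTH 64   /* hypothetical depth, power of two */

typedef struct { void *buf; int len; } request_t;  /* hypothetical entry,
                                                      smaller than a cache line */
typedef struct {
    volatile long lock;           /* spin lock word                     */
    volatile int  tail;           /* queue pointer                      */
    request_t     ring[QDEPTH];   /* entries share cache lines          */
} dev_queue_t;

static volatile int * const DOORBELL = (volatile int *)0xFEDC0000; /* invented */

static void post_request(dev_queue_t *q, const request_t *r)
{
    while (_InterlockedExchange(&q->lock, 1))   /* miss if the lock was   */
        ;                                       /* last grabbed by a peer */
    q->ring[q->tail] = *r;              /* miss if a peer wrote this line */
    q->tail = (q->tail + 1) & (QDEPTH - 1);     /* ditto for the pointers */
    q->lock = 0;                                /* release                */
    *DOORBELL = 1;   /* device now reads the queue, forcing this CPU to
                        write back the dirty lines it just touched */
}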

From: Tim McCaffrey on
In article <xfKdnXDSocaFui_XnZ2dnUVZ_qednZ2d(a)bestweb.net>, mayan(a)bestweb.net says...
>
>Tim McCaffrey wrote:
>> In article <gL6dnWc2Y61AOzPXnZ2dnUVZ_j-dnZ2d(a)bestweb.net>, mayan(a)bestweb.net says...
>>
>>>Tim McCaffrey wrote:
>>
>>
>> That doesn't really get rid of the looong pause when a CPU has to read a cache
>> line in from memory. If I could push it to another CPU, then there might be some
>> situations that could be sped up.
>
>Depends on the implementation:
>- Does it snoop bus-traffic?
>- If it does, does it just invalidate hits, or does it update the data?
>[It might do both if it has an L2 cache; invalidate it out of the L1 and
>update it into the L2].
>
>That might get rid of (part of) the long pause.
>
>What architecture are you working on? If it's MP-able, and non-x86,
>non-POWER, is it one of the ARM MP implementations? [Use e-mail if you
>don't want to broadcast it to the world].

Your basic Windows box (well, PCI-e x4, which isn't "typical"). Most x86
just invalidate the cache line, except, AFAIK, Intel's latest.

I'm not really thrilled with most ARM implementations because they are not
cache coherent, but that is beside the point in this case.
>
>>
>> Here is another case:
>>
>> Device has a queue of requests in memory. Unfortunately, the queue entries
>> are not cache-line sized, so you get the following:
>>
>> CPU #1:
>>     Spin locks the queue (cache line miss if spin lock was grabbed by
>>     another CPU).
>>     Update pointers (ditto if another CPU wrote to the queue previously).
>>     Write to the queue (ditto).
>>     Write word to the device telling it has something in the queue.
>> Device:
>>     Reads queue.
>>     - CPU #1 is forced to write back any data that device wants to
>>       read, including queue pointers and queue data.
>>
>> CPU #2:
>>     Spin locks the queue ....
>
>Any chance you have locked transaction buses, and can use them? That
>might get rid of the need for locking.
>
Yeah, and an OS that used them (Windows ain't it).

- Tim

From: Tim McCaffrey on
Here is one you might find interesting.

The structure is a circular buffer that contains indexes into an array that
contains the real data (buffer addresses in this case). The number of entries
in the array is guaranteed never to fill the circular buffer. However, the
circular buffer can go empty.

#define MAX_BUFFER_CHUNKS <some power of 2 minus 1>
typedef struct
{
    int chunks;
    int chunk_size;
    int real_size;     /* chunk size + tag size */

    Uint8 *base;
    Uint8 *address[MAX_BUFFER_CHUNKS];
    volatile Uint16 free_queue[MAX_BUFFER_CHUNKS+1];  /* power-of-2 ring */
    next_cache_line pad1;           /* keep the head and tail pointers   */
    volatile int free_list_head;    /* on cache lines of their own       */
    next_cache_line pad2;
    volatile int free_list_tail;
    next_cache_line pad3;
} circular_buffer_t;

/*********************************************************
* Function : put_index
*
* Arguments : circular_buffer_t *, int
* Returns : void
*
* This function puts the buffer index back into the
* free buffer pool
*********************************************************/
static void
put_index(circular_buffer_t *b, int index)
{
    int old_tail;

    /* claim a slot by bumping the tail, then publish the index tagged
       with the wrap bit of the tail value we claimed */
    old_tail = _InterlockedExchangeAdd(&(b->free_list_tail), 1);
    b->free_queue[old_tail & MAX_BUFFER_CHUNKS] =
        (Uint16) (index | (old_tail & (MAX_BUFFER_CHUNKS+1)));
}
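
/* Worked example of the tag encoding (assuming, purely for illustration,
 * MAX_BUFFER_CHUNKS = 255, so the index mask is 0xFF and the wrap bit
 * is MAX_BUFFER_CHUNKS+1 = 0x100):
 *
 *    old_tail = 258:  slot = 258 & 0xFF = 2,  wrap bit = 258 & 0x100 = 0x100
 *    put_index(b, 7) stores 7 | 0x100 = 0x107 into free_queue[2].
 *
 * The wrap bit flips each time the tail laps the ring; get_addr() below
 * spins until the entry's wrap bit matches that of the head it claimed,
 * so it never consumes a slot the producers haven't refilled yet.
 */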

/*********************************************************
* Function : get_addr
*
* Arguments : void *, circular_buffer_t *
* Returns : Returns the address of the buffer
* of type Uint8 * or NULL
*
* This function gets a buffer (which has been requested)
* from the preallocated free pool
*
*********************************************************/
static Uint8 *
get_addr(void *RESERVED, circular_buffer_t *b)
{
    int index;
    Uint8 *ret = NULL;
    int old_head, new_head;

    do {
        old_head = b->free_list_head;
        if (old_head == b->free_list_tail)
            return NULL;                        // list empty
        index = b->free_queue[old_head & MAX_BUFFER_CHUNKS];
        new_head = _InterlockedCompareExchange(&(b->free_list_head),
                                               old_head+1,
                                               old_head);
    } while (new_head != old_head);             // compare failed, try again

    while ((index ^ old_head) & (MAX_BUFFER_CHUNKS+1))
    {
        // too quick, hope somebody is putting values in...
        index = b->free_queue[old_head & MAX_BUFFER_CHUNKS];
    }
    index &= MAX_BUFFER_CHUNKS;
    ret = b->address[index];
    return ret;
}
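
A minimal usage sketch, not part of the original post: how the pool might be
seeded and a get/put round trip. init_pool(), the zeroed head/tail, and the
buffer-to-index mapping are all assumptions here.

static void
init_pool(circular_buffer_t *b)
{
    int i;

    b->free_list_head = 0;
    b->free_list_tail = 0;
    /* seed one index short of the ring size, so the free queue can
       never fill (the guarantee stated above) */
    for (i = 0; i < MAX_BUFFER_CHUNKS; i++)
        put_index(b, i);
}

static void
example_round_trip(circular_buffer_t *b)
{
    Uint8 *buf = get_addr(NULL, b);
    if (buf != NULL) {
        /* ... use the buffer ... */
        /* the caller must recover the index it drew; one guess, given
           the base/real_size fields of the struct: */
        put_index(b, (int)((buf - b->base) / b->real_size));
    }
}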

From: Terje Mathisen on
Tim McCaffrey wrote:
> Here is one you might find interesting.
[snip]
> do {
> old_head = b->free_list_head;
> if (old_head == b->free_list_tail)
> return NULL; // list empty
> index = b->free_queue[old_head & MAX_BUFFER_CHUNKS];
> new_head = _InterlockedCompareExchange(&(b->free_list_head),
> old_head+1,
> old_head);
> }while (new_head != old_head);//compare failed, try again

This is yet another O(n*n) (or worse!) algorithm in the face of heavy
contention, afaik?

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: Tim McCaffrey on
In article <rrOdncImYb8uBS_XnZ2dnUVZ8tqdnZ2d(a)lyse.net>, Terje.Mathisen(a)tmsw.no
says...
>
>Tim McCaffrey wrote:
>> Here is one you might find interesting.
>[snip]
>> do {
>> old_head = b->free_list_head;
>> if (old_head == b->free_list_tail)
>> return NULL; // list empty
>> index = b->free_queue[old_head & MAX_BUFFER_CHUNKS];
>> new_head = _InterlockedCompareExchange(&(b->free_list_head),
>> old_head+1,
>> old_head);
>> }while (new_head != old_head);//compare failed, try again
>
>This is yet another O(n*n) (or worse!) algorithm in the face of heavy
>contention, afaik?
>
>Terje
>
>--
>- <Terje.Mathisen at tmsw.no>
>"almost all programming can be viewed as an exercise in caching"


Well, no. The while loop only executes once if the circular buffer has more
than #CPUs entries in it (for what I developed it for, it did; it was a list
of free buffers).

In heavy use cases (where the circular buffer doesn't go empty), yes,
everybody is pounding on the head and tail pointers. But at least they aren't
spinning on them (usually).

I made sure the head & tail pointers had their own cache lines.
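
The post never shows the next_cache_line type; one way to get that layout
(a sketch, assuming 64-byte lines) is a spacer struct:

/* a 64-byte spacer between free_list_head and free_list_tail guarantees
   the two counters can never share a 64-byte cache line; with MSVC,
   __declspec(align(64)) on the fields is another route */
typedef struct { char pad[64]; } next_cache_line;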

- Tim