From: Terje Mathisen on
Chris M. Thomasson wrote:
> FWIW, I did a bounded multi-producer/consumer queue a while back using
> nothing but XADD on the fast-path's. Here is the rough idea:
>
>
> < pseudo-code typed in newsreader, please forgive typos! ;^) >

Sure!
> ________________________________________________________________________
> #define DEPTH 256 // power of 2

This is crucial, since you must allow the counters/indices to simply
wrap around, so the queue length _must_ be a divisor of the number of
possible counter values, i.e. a power of two in a binary system.
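For anyone who hasn't been bitten by this, here is a minimal sketch of
mine showing why the usual mask trick depends on it, with a 32-bit
counter:
________________________________________________________________________
#define DEPTH 256          // power of 2
#define MASK (DEPTH - 1)

unsigned ring_index(unsigned counter)
{
    // DEPTH divides 2^32, so the masked index steps 0..DEPTH-1 and back
    // to 0 even when the counter itself wraps from 0xFFFFFFFF to 0.
    return counter & MASK;
}
// With a non-power-of-2 depth, say 100, `counter % 100' jumps from
// 0xFFFFFFFF % 100 == 95 straight to 0 at the wrap, skipping slots
// 96..99 and corrupting the ring.
________________________________________________________________________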

> void push(void* ptr)
> {
> if (XADD(&slots_free, -1) < 1)

I would start slots_free at (DEPTH-1) so I could use a negative value as
the all_full flag, but it doesn't really matter: The compiler is free to
convert the '< 1' test into '<= 0', which still saves the explicit
compare against 1.
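
For reference, here is the initial state those two tests imply (my
reconstruction, keeping your `< 1' convention rather than my DEPTH-1
variant):
________________________________________________________________________
#define DEPTH 256
#define MASK (DEPTH - 1)

static void*    slots[DEPTH];        // all NULL to start
static long     slots_free = DEPTH;  // push XADDs -1, pop XADDs +1
static long     slots_used = 0;      // push XADDs +1, pop XADDs -1
static unsigned push_idx = 0;
static unsigned pop_idx = 0;

// push_sem and pop_sem both start at 0: a thread only blocks after
// driving a counter negative, and the opposite side posts exactly once
// for each such thread.
________________________________________________________________________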

> {
> semaphore_wait(&push_sem);
> }
>
> unsigned idx = XADD(&push_idx, 1) & MASK;
>
> assert(! slots[idx]);
>
> slots[idx] = ptr;
>
> if (XADD(&slots_used, 1) < 0)
> {
> semaphore_post(&pop_sem);
> }
> }

There is at least one crucial issue here: If a number of consumer
threads get hung up after updating the counter but before they can clear
their slots[] entry, while all of those following them do clear theirs,
then it is quite possible for push_idx to wrap around and hit a slot
where your assert() will fire!

I.e., you do need a spin loop here, just like the one you use in the
pop() function below, instead of the assert() macro.
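
For concreteness, one interleaving that fires it, with DEPTH == 4:

1. Producers fill slots 0..3, so push_idx == 4 and slots_free == 0.
2. Consumer C1 passes the slots_used test, XADDs pop_idx (getting idx 0),
   and is then preempted before it can store NULL into slots[0].
3. Consumer C2 claims idx 1, empties it, and XADDs slots_free back to 1.
4. A producer now passes the slots_free test, XADDs push_idx (getting 4,
   i.e. idx 0 after masking) and lands on slots[0], which is still
   occupied: the free slot is 1, but the ring index points at 0, so the
   assert() fires.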

> void* pop()
> {
> if (XADD(&slots_used, -1) < 1)
> {
> semaphore_wait(&pop_sem);
> }
>
> unsigned idx = XADD(&pop_idx, 1) & MASK;
>
> void* ptr = slots[idx];
>
> while (! ptr)
> {
> yield(); // SwitchToThread(); Sleep(1); asm("PAUSE"); whatever
>
> ptr = slots[idx];
> }
>
> slots[idx] = NULL;
>
> if (XADD(&slots_free, 1) < 0)
> {
> semaphore_post(&push_sem);
> }
>
> return ptr;
> }

This code is _very_ similar to something I posted here quite a few years
ago, not too long after the new XADD opcode was announced.

Not too surprising probably, as it seems like this particular pattern is
more or less what XADD was designed for. :-)

> I hope I did not screw that up. Anyway, thanks to XADD, the algorithm

I think you did well. I usually mess up in a couple of places when
programming in my news reader. :-(

> has 100% wait-free fast-paths, which is pretty darn good for any
> multi-producer/consumer queue! Heck, it's even loop-free on the
> slow-paths. Those are pretty nice properties indeed.

Right, which is why I've stated several times lately that XADD is the
best primitive we currently have available; the key is to use it in
patterns where the return value is useful, even if you don't "win" the
initial access:

This reduces the contention scaling from O(n*n) to something _much_
closer to O(n), since there will be (at the hardware layer) one winner
of each attempt to simultaneously update the counter, so after n bus
iterations all n threads will get a return value, with the early winners
being able to go on with the actual work a little bit sooner, and
thereby free up the contended cache line for the rest.

Even if you have a single protected resource, not a queue, you can still
use XADD to good effect by using the return value as input to a wait
loop before the next attempt to grab the lock: This will tend to order
the non-winning threads so that they wait nicely in line and get close
to zero contention on their second attempt (modulo new incoming threads
that didn't take part in the first collision).
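
A minimal sketch of that pattern (my pseudo-C, same XADD/LOAD/backoff
conventions as the queue code above):
________________________________________________________________________
int next_ticket = 0;  // place in line, handed out by XADD
int now_serving = 0;  // ticket currently allowed in

void ticket_lock()
{
    // the XADD always succeeds: this _is_ the queueing step, and the
    // distance (my - now_serving) tells you how long to back off before
    // even looking at the line again
    int my = XADD(&next_ticket, 1);

    while (LOAD(&now_serving) != my) backoff();
}

void ticket_unlock()
{
    XADD(&now_serving, 1); // only the lock holder writes this
}
________________________________________________________________________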

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: Chris M. Thomasson on
"Terje Mathisen" <Terje.Mathisen(a)tmsw.no> wrote in message
news:EbWdnQqTes6fEi3XnZ2dnUVZ8sOdnZ2d(a)lyse.net...
> Chris M. Thomasson wrote:
>> FWIW, I did a bounded multi-producer/consumer queue a while back using
>> nothing but XADD on the fast-path's. Here is the rough idea:
>>
>>
>> < pseudo-code typed in newsreader, please forgive typos! ;^) >
>
> Sure!
>> ________________________________________________________________________
>> #define DEPTH 256 // power of 2
>
> This is crucial, since you must allow the counters/indices to simply wrap
> around, so the queue length _must_ be a divisor of the number of possible
> counter values, i.e. a power of two in a binary system.

Indeed. I did not want to use any modulo operation.

;^)




>> void push(void* ptr)
>> {
>> if (XADD(&slots_free, -1) < 1)
>
> I would start slots_free at (DEPTH-1) so I could use a negative value as
> the all_full flag, but it doesn't really matter: The compiler is free to
> convert the '< 1' test into '<= 0', which still saves the explicit compare
> against 1.

Agreed




>> {
>> semaphore_wait(&push_sem);
>> }
>>
>> unsigned idx = XADD(&push_idx, 1) & MASK;
>>
>> assert(! slots[idx]);
>>
>> slots[idx] = ptr;
>>
>> if (XADD(&slots_used, 1) < 0)
>> {
>> semaphore_post(&pop_sem);
>> }
>> }
>
> There is at least one crucial issue here: If a number of consumer threads
> get hung up after updating the counter but before they can clear their
> slots[] entry, while all of those following them do clear theirs, then it
> is quite possible for push_idx to wrap around and hit a slot where your
> assert() will fire!
>
> I.e., you do need a spin loop here, just like the one you use in the pop()
> function below, instead of the assert() macro.

BINGO!!!

I knew I royally screwed something up in the damn translation. I had to dig
up the original implementation in the source code of an older software
library I created, where I found an explicit note describing exactly this
scenario. My dumb as% forgot all about it. So, thank you for jarring my
failing memory!

;^/


In fact, here is some sample code I just created in Relacy Race Detector
which explicitly shows this race in action:


http://relacy.pastebin.com/f7c0fd785


If you un-define the `TURN_OFF_RACER' macro, the bug shows up in a
fraction of a second, even when I am using all sequential consistency memory
barriers. BTW, you can download Relacy Race Detector and use it free for
non-commercial purposes here:


http://groups.google.com/group/relacy


IMVHO, it's the best multi-threaded algorithm verification tool out there. It
checks synchronization algorithms for strict compliance with the C++0x
memory model. If you have any questions on how to use it, either ask me, or
contact the creator, Dmitriy Vyukov, directly at

dvyukov<no-damn-spam>@<please-no-spam>gmail.com


remove carets and all in between for legitimate address... ;^)


Before I use any exotic, or even not so exotic, synchronization algorithm, I
ALWAYS run it through Relacy.
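
For anyone who has never seen it, a minimal Relacy test suite looks
roughly like this; I am sketching from memory of Dmitriy's documented
example, NOT reproducing my pastebin code above, so treat the details as
approximate:
________________________________________________________________________
#include <relacy/relacy_std.hpp>

// two threads race on a plain variable published through a relaxed
// flag; rl::simulate() explores the interleavings and reports the race
struct race_test : rl::test_suite<race_test, 2>
{
    std::atomic<int> flag;
    rl::var<int> data;

    void before()
    {
        flag($) = 0;
        data($) = 0;
    }

    void thread(unsigned index)
    {
        if (0 == index)
        {
            data($) = 1;
            flag($).store(1, rl::memory_order_relaxed);
        }
        else if (1 == flag($).load(rl::memory_order_relaxed))
        {
            int d = data($);
            (void)d;
        }
    }
};

int main()
{
    rl::simulate<race_test>();
}
________________________________________________________________________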




BTW, just for clarity, here is the non-bugged version of the push function:
________________________________________________________________________
void push(void* ptr)
{
    if (XADD(&slots_free, -1) < 1)
    {
        semaphore_wait(&push_sem);
    }

    unsigned idx = XADD(&push_idx, 1) & MASK;

    while (slots[idx])
    {
        yield(); // SwitchToThread(); Sleep(1); asm("PAUSE"); whatever
    }

    slots[idx] = ptr;

    if (XADD(&slots_used, 1) < 0)
    {
        semaphore_post(&pop_sem);
    }
}
________________________________________________________________________




Sorry about that nonsense.




[...]
>
> This code is _very_ similar to something I posted here quite a few years
> ago, not too long after the new XADD opcode was announced.

I tried searching for your code for a while, but failed miserably. Damn
Google search! Grrr.




> Not too surprising probably, as it seems like this particular pattern is
> more or less what XADD was designed for. :-)

Oh yes. One can definitely do a whole lot with XADD.




>> I hope I did not screw that up. Anyway, thanks to XADD, the algorithm
>
> I think you did well.

Except for forgetting about that major nasty race condition! I should have
known I was missing something important when I explicitly put that darn
assertion in there...

;^)




> I usually mess up in a couple of places when programming in my news
> reader. :-(

Coding in the news reader is definitely a bad habit of mine!




>> has 100% wait-free fast-paths, which is pretty darn good for any
>> multi-producer/consumer queue! Heck, it's even loop-free on the
>> slow-paths. Those are pretty nice properties indeed.
>
> Right, which is why I've stated several times lately that XADD is the best
> primitive we currently have available; the key is to use it in patterns
> where the return value is useful, even if you don't "win" the initial
> access:

I find XADD, along with XCHG, to be very useful in creating 100% loop-less
algorithms. Those types of algorithms tend to have great scalability and
performance characteristics overall.
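
As a trivial example of the loop-less flavor, here is a little "latest
value wins" mailbox sketch of mine (same pseudo-ops as before) where
both sides are a single swap:
________________________________________________________________________
void* mailbox = NULL;

void* publish(void* msg)  // wait-free: one XCHG, no loop
{
    // returns the previous message (if any) so the caller can recycle
    // it instead of leaking it
    return XCHG(&mailbox, msg);
}

void* collect()  // wait-free: returns NULL when the mailbox is empty
{
    return XCHG(&mailbox, NULL);
}
________________________________________________________________________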




> This reduces the contention scaling from O(n*n) to something _much_ closer
> to O(n), since there will be (at the hardware layer) one winner of each
> attempt to simultaneously update the counter, so after n bus iterations
> all n threads will get a return value, with the early winners being able
> to go on with the actual work a little bit sooner, and thereby free up the
> contended cache line for the rest.

Agreed.




> Even if you have a single protected resource, not a queue, you can still
> use XADD to good effect by using the return value as input to a wait loop
> before the next attempt to grab the lock: This will tend to order the
> non-winning threads so that they wait nicely in line and get close to zero
> contention on their second attempt (modulo new incoming threads that
> didn't take part in the first collision).

They're absolutely great for bakery algorithms! Joe Seigh did a very
elegant and simple FIFO-ordered, 100% starvation-free read/write
spinlock using nothing but XADD. I cannot seem to find the damn post right
now, but let's see if I can recreate it from memory...
_________________________________________________________________
#define WRITE 1
#define READ 0x10000


int next = 0; // next ticket
int cur = 0;  // currently serviced ticket


void write_lock()
{
    int ticket = XADD(&next, WRITE);

    while (ticket != LOAD(&cur)) backoff();
}


void write_unlock()
{
    XADD(&cur, WRITE);
}


void read_lock()
{
    // only the write field of the ticket matters to a reader
    int ticket = XADD(&next, READ) % READ;

    while (ticket != LOAD(&cur) % READ) backoff();
}


void read_unlock()
{
    XADD(&cur, READ);
}
_________________________________________________________________




There is a caveat: The total number of waiting threads at any one time has
to be less than the capacity of each field (0x10000 here). One nice
thing... It works even with wrap and carry-out from the write field into
the read field.
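
A quick worked example with the constants above (write tickets in the
low 16 bits, read tickets in the high bits): three readers arrive first,
so `next' steps 0 -> 0x10000 -> 0x20000 -> 0x30000, each reader's
ticket % READ is 0, and cur % READ == 0, so all three enter at once. A
writer then gets ticket = XADD(&next, WRITE) == 0x30000 and spins until
all three read_unlock() calls have driven `cur' up to 0x30000. A fourth
reader arriving behind the writer gets ticket % READ == 1 and waits
until the write_unlock() bumps cur % READ to 1.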




I also did a read/write synchronization primitive using XADD, but it was not
a spinlock. Here is the code:


http://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/65822/reply/87849


It too is 100% starvation-free, and has bounded waiting times. I could not
have created this algorithm without fetch-and-add. Notice how there are no
loops of any kind?


;^)

From: Chris M. Thomasson on
"Chris M. Thomasson" <no(a)spam.invalid> wrote in message
news:h8qpq5$1gv$1(a)news.ett.com.ua...
> "Terje Mathisen" <Terje.Mathisen(a)tmsw.no> wrote in message
[...]
>> This code is _very_ similar to something I posted here quite a few years
>> ago, not too long after the new XADD opcode was announced.
>
> I tried searching for your code for a while, but failed miserably. Damn
> Google search! Grrr.
[...]

I think I finally found it:

http://groups.google.com/group/comp.arch/msg/08558cea5ff7fc43

http://groups.google.com/group/comp.arch/msg/76a2b463b26e0a56


From: Terje Mathisen on
Chris M. Thomasson wrote:
> "Chris M. Thomasson" <no(a)spam.invalid> wrote in message
> news:h8qpq5$1gv$1(a)news.ett.com.ua...
>> "Terje Mathisen" <Terje.Mathisen(a)tmsw.no> wrote in message
> [...]
>>> This code is _very_ similar to something I posted here quite a few
>>> years ago, not too long after the new XADD opcode was announced.
>>
>> I tried searching for your code for a while, but failed miserably.
>> Damn Google search! Grrr.
> [...]
>
> I think I finally found it:
>
> http://groups.google.com/group/comp.arch/msg/08558cea5ff7fc43
>
> http://groups.google.com/group/comp.arch/msg/76a2b463b26e0a56
>
>
You did, and it seems like "quite a few years" was just over 10 years
ago. Since my posts didn't mention XADD at all I understand that they
were hard to locate. :-(

The only real difference in the code is that I didn't specify any
OS-level fallback in order to handle full/empty situations; the core
code is the same. I did mention how it is very useful to make the queue
depth greater than or equal to the maximum number of producer threads;
that way they will all find a free slot to write into.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: Chris M. Thomasson on
"Terje Mathisen" <Terje.Mathisen(a)tmsw.no> wrote in message
news:mq6dnWSyh88SnSzXnZ2dnUVZ8lidnZ2d(a)lyse.net...
> Chris M. Thomasson wrote:
>> "Chris M. Thomasson" <no(a)spam.invalid> wrote in message
>> news:h8qpq5$1gv$1(a)news.ett.com.ua...
>>> "Terje Mathisen" <Terje.Mathisen(a)tmsw.no> wrote in message
>> [...]
>>>> This code is _very_ similar to something I posted here quite a few
>>>> years ago, not too long after the new XADD opcode was announced.
>>>
>>> I tried searching for your code for a while, but failed miserably.
>>> Damn Google search! Grrr.
>> [...]
>>
>> I think I finally found it:
>>
>> http://groups.google.com/group/comp.arch/msg/08558cea5ff7fc43
>>
>> http://groups.google.com/group/comp.arch/msg/76a2b463b26e0a56
>>
>>
> You did, and it seems like "quite a few years" was just over 10 years ago.
> Since my posts didn't mention XADD at all I understand that they were hard
> to locate. :-(
>
> The only real difference in the code is that I didn't specify any OS-level
> fallback in order to handle full/empty situations; the core code is the
> same.

I found some messages from you which mentioned XADD, and I think I came
across one in which you did explicitly mention an OS-level fallback,
although exactly what comprised the so-called "fallback" was rather vague.
I used fast-pathed semaphores. Well, I guess it could be semaphores,
futexes, eventcounts, well, whatever.

:^)




> I did mention how it is very useful to make the queue depth greater than
> or equal to the maximum number of producer threads; that way they will
> all find a free slot to write into.

Right.




Anyway, FWIW, here is the crude pseudo-code of an unbounded non-intrusive
multi-producer/consumer queue I *half created:
________________________________________________________________
struct node
{
    struct node* next;
    void* state;
};


struct dwcas_anchor
{
    int32 aba;
    struct node* node;
};


struct mpmcq
{
    struct dwcas_anchor tail;
    struct node* head;
};


void
mpmcq_init(struct mpmcq* const self,
           struct node* dummy)
{
    dummy->next = NULL;
    self->head = dummy;
    self->tail.aba = 0;
    self->tail.node = dummy;
}


void
mpmcq_push(struct mpmcq* const self,
           struct node* node)
{
    struct node* prev;

    node->next = NULL;
    prev = ATOMIC_SWAP(&self->head, node);
    ATOMIC_STORE(&prev->next, node);
}


struct node*
mpmcq_pop(
    struct mpmcq* const self)
{
    void* state;
    struct dwcas_anchor cmp, xchg;

    cmp = self->tail;

    do
    {
        struct node* next = ATOMIC_LOAD(&cmp.node->next);
        if (! next) return NULL;
        state = next->state;
        xchg.node = next;
        xchg.aba = cmp.aba + 1;

    } while (! ATOMIC_DWCAS(&self->tail, &cmp, &xchg));

    // the old dummy is returned, carrying the state of the node consumed
    cmp.node->state = state;

    return cmp.node;
}
________________________________________________________________
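
Here is a quick usage sketch (same pseudo-ops; `some_payload' and
consume() are just stand-ins). The subtle part is the dummy node
shuffle:
________________________________________________________________
struct mpmcq q;
mpmcq_init(&q, malloc(sizeof(struct node)));

struct node* n = malloc(sizeof(struct node));
n->state = some_payload;
mpmcq_push(&q, n);

// pop returns the *previous* dummy carrying `some_payload'; the node
// just pushed stays behind as the new dummy and comes back on a later
// pop, so a consumer must never assume it gets back the very node a
// producer put in.
struct node* popped = mpmcq_pop(&q);
consume(popped->state);
free(popped);
________________________________________________________________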




For an unbounded queue, DWCAS aside for a moment, this has some fairly
interesting properties: a single CAS loop for the consumer! That's pretty
nice. Also, it has a 100% wait-free producer. However, there is a caveat...
This queue can be busted to hell if a producer somehow "dies" after the
atomic swap and before the final store. Well, not really busted in the
sense of corrupt data, but deadlocked, which is way better than corrupting
anything. Well, IMVHO that is...

:^)




BTW, this beats the heck out of the Michael and Scott queue...


:^o


I cannot seem to come up with a wait-free consumer side to this algorithm...
Can you give me any ideas? Perhaps we can come up with something neat.




(*)

I have to give credit to Dmitriy Vyukov for the excellent producer side of
this algorithm because he invented it!