| 	
Prev: VMWare tools killed my Mac OS X ? Next: Software vs hardware floating-point [was Re: What happened ...] 	
		 From: Terje Mathisen on 16 Sep 2009 03:08 Chris M. Thomasson wrote: > FWIW, I did a bounded multi-producer/consumer queue a while back using > nothing but XADD on the fast-path's. Here is the rough idea: > > > < pseudo-code typed in newsreader, please forgive typos! ;^) > Sure! > ________________________________________________________________________ > #define DEPTH 256 // power of 2 This is crucial, since you must allow the counters/indices to simply wrap around, so the queue length _must_ be a divisor of the number of possible counter values, i.e. a power of two in a binary system. > void push(void* ptr) > { > if (XADD(&slots_free, -1) < 1) I would start slots_free at (DEPTH-1) so I could use a negative value as the all_full flag, but it doesn't really matter: The compiler is free to convert the '< 1' test into '<= 0' which still saves the explicit compare aginst 1. > { > semaphore_wait(&push_sem); > } > > unsigned idx = XADD(&push_idx, 1) & MASK; > > assert(! slots[idx]); > > slots[idx] = ptr; > > if (XADD(&slots_used, 1) < 0) > { > semaphore_post(&pop_sem); > } > } There is at least one crucial issue here: If a number of consumer threads get hung up after updating the counter but before they can free the slot[] variable, while all of those following do free them up, then it is quite possible for the push_idx to wrap around and hit a slot where your assert() will fire! I.e. you do need a spin loop here just like you use in the pop() function below, instead of the assert() macro. > void* pop() > { > if (XADD(&slots_used, -1) < 1) > { > semaphore_wait(&pop_sem); > } > > unsigned idx = XADD(&pop_idx, 1) & MASK; > > void* ptr = slots[idx]; > > while (! ptr) > { > yield(); // SwitchToThread(); Sleep(1); asm("PAUSE"); whatever > > ptr = slots[idx]; > } > > slots[idx] = NULL; > > if (XADD(&slots_free, 1) < 0) > { > semaphore_post(&push_sem); > } > > return ptr; > } This code is _very_ similar to something I posted here quite a few years ago, not too long after the new XADD opcode was announced. Not too surprising probably, as it seems like this particular pattern is more or less what XADD was designed for. :-) > I hope I did not screw that up. Anyway, thanks to XADD, the algorithm I think you did well. I usually mess up in a couple of places when programming in my news reader. :-( > has 100% wait-free fast-paths, which is pretty darn good for any > multi-producer/consumer queue! Heck, it's even loop-free on the > slow-paths. Those are pretty nice properties indeed. Right, which is why I've stated several times lately that XADD is the best primitive we currently have available, the key is to use it in patterns where the return value is useful, even if you don't "win" the initial access: This reduces the contention scaling from O(n*n) to something _much_ closer to O(n), since there will be (at the hardware layer) one winner of each attempt to simultaneously update the counter, so after n bus iterations all n threads will get a return value, with the early winners being able to go on with the actual work a little bit sooner, and thereby free up the Even if you have a single protected resource, not a queue, you can still use XADD to effect by using the return value as input to a wait loop before the next attempt to grab the lock: This will tend to order the non-winning threads so that they wait nicely in line and get close to zero contention on their second attempt (modulo new incoming threads that didn't take part in the first collision). Terje -- - <Terje.Mathisen at tmsw.no> "almost all programming can be viewed as an exercise in caching" 	
		 From: Chris M. Thomasson on 16 Sep 2009 09:40 "Terje Mathisen" <Terje.Mathisen(a)tmsw.no> wrote in message news:EbWdnQqTes6fEi3XnZ2dnUVZ8sOdnZ2d(a)lyse.net... > Chris M. Thomasson wrote: >> FWIW, I did a bounded multi-producer/consumer queue a while back using >> nothing but XADD on the fast-path's. Here is the rough idea: >> >> >> < pseudo-code typed in newsreader, please forgive typos! ;^) > > > Sure! >> ________________________________________________________________________ >> #define DEPTH 256 // power of 2 > > This is crucial, since you must allow the counters/indices to simply wrap > around, so the queue length _must_ be a divisor of the number of possible > counter values, i.e. a power of two in a binary system. Indeed. I did not want to use any modulo operation. ;^) >> void push(void* ptr) >> { >> if (XADD(&slots_free, -1) < 1) > > I would start slots_free at (DEPTH-1) so I could use a negative value as > the all_full flag, but it doesn't really matter: The compiler is free to > convert the '< 1' test into '<= 0' which still saves the explicit compare > aginst 1. Agreed >> { >> semaphore_wait(&push_sem); >> } >> >> unsigned idx = XADD(&push_idx, 1) & MASK; >> >> assert(! slots[idx]); >> >> slots[idx] = ptr; >> >> if (XADD(&slots_used, 1) < 0) >> { >> semaphore_post(&pop_sem); >> } >> } > > There is at least one crucial issue here: If a number of consumer threads > get hung up after updating the counter but before they can free the slot[] > variable, while all of those following do free them up, then it is quite > possible for the push_idx to wrap around and hit a slot where your > assert() will fire! > > I.e. you do need a spin loop here just like you use in the pop() function > below, instead of the assert() macro. BINGO!!! I knew I royally screwed something up in the damn translation. I had to look for the original implementation in the source code of an older software library I created in order to find that I made an explicit note describing this scenario. My dumb as% forgot all about it. So, thank you for jarring my failing memory! ;^/ In fact, here is some sample code I just created in Relacy Race Detector which explicitly shows this moment in action: http://relacy.pastebin.com/f7c0fd785 If you un-define the `TURN_OFF_RACER' macro, the bug shows up in about a fraction of a second even when I am using all sequential consistency memory barriers. BTW, you can download Relacy Race Detector and use it free for non-commercial purposes here: http://groups.google.com/group/relacy IMVHO, it's the best multi-thread algorithm verification tools out there. It checks synchronization algorithms for strict compliance with the C++0x memory model. If you have any questions on how to use it, either ask me, or contact the creator, Dmitriy Vyukov, directly at dvyukov<no-damn-spam>@<please-no-spam>gmail.com remove carets and all in between for legitimate address... ;^) Before I use any exotic, or even not so exotic, synchronization algorithm, I ALWAYS run it through Relacy. BTW, just for clarity, here is the non-bugged version of the push function: ________________________________________________________________________ void push(void* ptr) { if (XADD(&slots_free, -1) < 1) { semaphore_wait(&push_sem); } unsigned idx = XADD(&push_idx, 1) & MASK; while (slots[idx]) { yield(); // SwitchToThread(); Sleep(1); asm("PAUSE"); whatever } slots[idx] = ptr; if (XADD(&slots_used, 1) < 0) { semaphore_post(&pop_sem); } } ________________________________________________________________________ Sorry about that non-sense. [...] > > This code is _very_ similar to something I posted here quite a few years > ago, not too long after the new XADD opcode was announced. I tried searching for you're code for a while, but failed miserably. Damn Google search! Grrr. > Not too surprising probably, as it seems like this particular pattern is > more or less what XADD was designed for. :-) Oh yes. One can definitely do a whole lot with XADD. >> I hope I did not screw that up. Anyway, thanks to XADD, the algorithm > > I think you did well. Except for forgetting about that major nasty race condition! I should have known I was missing something important when I explicitly put that darn assertion in there... ;^) > I usually mess up in a couple of places when programming in my news > reader. :-( Coding in the news reader is definitely a bad habit of mine! >> has 100% wait-free fast-paths, which is pretty darn good for any >> multi-producer/consumer queue! Heck, it's even loop-free on the >> slow-paths. Those are pretty nice properties indeed. > > Right, which is why I've stated several times lately that XADD is the best > primitive we currently have available, the key is to use it in patterns > where the return value is useful, even if you don't "win" the initial > access: I find XADD, along with XCHG, to be very useful in creating 100% loop-less algorithms. Those types of algorithms tend to have great scalability and performance characteristics overall. > This reduces the contention scaling from O(n*n) to something _much_ closer > to O(n), since there will be (at the hardware layer) one winner of each > attempt to simultaneously update the counter, so after n bus iterations > all n threads will get a return value, with the early winners being able > to go on with the actual work a little bit sooner, and thereby free up the Agreed. > Even if you have a single protected resource, not a queue, you can still > use XADD to effect by using the return value as input to a wait loop > before the next attempt to grab the lock: This will tend to order the > non-winning threads so that they wait nicely in line and get close to zero > contention on their second attempt (modulo new incoming threads that > didn't take part in the first collision). There absolutely great for bakery algorithms! Joe Seigh did an very excellent, elegant and simple FIFO ordered, 100% starvation-free read/write spinlock using nothing but XADD. I cannot seem to find the damn post right now, but let's see if I can recreate it from memory... _________________________________________________________________ #define WRITE 1 #define READ 0x10000 int next = 0; // next ticket int cur = 0; // current request void write_lock() { int ticket = XADD(&next, WRITE); while (ticket != LOAD(&cur)) backoff(); } void write_unlock() { XADD(&cur, WRITE); } void read_lock() { int ticket = XADD(&next, READ) % READ; while (ticket != LOAD(&cur) % READ) backoff(); } void read_unlock() { XADD(&cur, READ); } _________________________________________________________________ There is a caveat: The total number of waiting threads at any one time has to be less than the field sizes. One nice thing... It works even with wrap and carry out from the write field into the read field. I also did a read/write synchronization primitive using XADD, but it was not a spinlock. Here is the code: http://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/65822/reply/87849 It too is 100% starvation free, and has bounded time. I could not have created this algorithm without fetch-and-add. Notice how there are no loops of any kind? ;^) 	
		 From: Chris M. Thomasson on 16 Sep 2009 10:06 "Chris M. Thomasson" <no(a)spam.invalid> wrote in message news:h8qpq5$1gv$1(a)news.ett.com.ua... > "Terje Mathisen" <Terje.Mathisen(a)tmsw.no> wrote in message [...] >> This code is _very_ similar to something I posted here quite a few years >> ago, not too long after the new XADD opcode was announced. > > I tried searching for you're code for a while, but failed miserably. Damn > Google search! Grrr. [...] I think I finally found it: http://groups.google.com/group/comp.arch/msg/08558cea5ff7fc43 http://groups.google.com/group/comp.arch/msg/76a2b463b26e0a56 	
		 From: Terje Mathisen on 16 Sep 2009 11:13 Chris M. Thomasson wrote: > "Chris M. Thomasson" <no(a)spam.invalid> wrote in message > news:h8qpq5$1gv$1(a)news.ett.com.ua... >> "Terje Mathisen" <Terje.Mathisen(a)tmsw.no> wrote in message > [...] >>> This code is _very_ similar to something I posted here quite a few >>> years ago, not too long after the new XADD opcode was announced. >> >> I tried searching for you're code for a while, but failed miserably. >> Damn Google search! Grrr. > [...] > > I think I finally found it: > > http://groups.google.com/group/comp.arch/msg/08558cea5ff7fc43 > > http://groups.google.com/group/comp.arch/msg/76a2b463b26e0a56 > > You did, and it seems like "quite a few years" was just over 10 years ago. Since my posts didn't mention XADD at all I understand that they were hard to locate. :-( The only real difference in the code is that I didn't specify any OS-level fallback in order to handle full/empty situations, the core code is the same. I did mention how it is very useful to make the queue depth greater_or_equal to the maximum number of producer threads, that way they will all find a free slot to write into. Terje -- - <Terje.Mathisen at tmsw.no> "almost all programming can be viewed as an exercise in caching" 	
		 From: Chris M. Thomasson on 16 Sep 2009 23:39 "Terje Mathisen" <Terje.Mathisen(a)tmsw.no> wrote in message news:mq6dnWSyh88SnSzXnZ2dnUVZ8lidnZ2d(a)lyse.net... > Chris M. Thomasson wrote: >> "Chris M. Thomasson" <no(a)spam.invalid> wrote in message >> news:h8qpq5$1gv$1(a)news.ett.com.ua... >>> "Terje Mathisen" <Terje.Mathisen(a)tmsw.no> wrote in message >> [...] >>>> This code is _very_ similar to something I posted here quite a few >>>> years ago, not too long after the new XADD opcode was announced. >>> >>> I tried searching for you're code for a while, but failed miserably. >>> Damn Google search! Grrr. >> [...] >> >> I think I finally found it: >> >> http://groups.google.com/group/comp.arch/msg/08558cea5ff7fc43 >> >> http://groups.google.com/group/comp.arch/msg/76a2b463b26e0a56 >> >> > You did, and it seems like "quite a few years" was just over 10 years ago. > Since my posts didn't mention XADD at all I understand that they were hard > to locate. :-( > > The only real difference in the code is that I didn't specify any OS-level > fallback in order to handle full/empty situations, the core code is the > same. I found some messages from you which mentioned XADD, and I think I came across one in which you did explicitly mention an os-level fallback. Although, exactly what comprised the so-called "fallback" was rather vague. I used fast-pathed semaphores. Well, I guess it could be semaphores, futexs, eventcounts, well, whatever. :^) > I did mention how it is very useful to make the queue depth > greater_or_equal to the maximum number of producer threads, that way they > will all find a free slot to write into. Right. Anyway, FWIW, here is the crude pseudo-code of an unbounded non-intrusive multi-producer/consumer queue I *half created: ________________________________________________________________ struct node { node* next; void* state; }; struct dwcas_anchor { int32 aba; struct node* node; }; struct mpmcq { struct dwcas_anchor tail; struct node* head; }; void mpmcq_init(struct mpmcq* const self, struct node* dummy) { dummy->next = NULL; self->head = dummy; self->tail.node = dummy; } void mpmcq_push(struct mpmcq* const self, struct node* node) { struct node* prev; node->m_next = NULL; prev = ATOMIC_SWAP(&self->head, node); ATOMIC_STORE(&prev->next, node); } struct node* mpmcq_pop( struct mpmcq* const self) { void* state; struct dwcas_anchor cmp, xchg; cmp = self->tail; do { struct node* next = ATOMIC_LOAD(&cmp.node->next); if (! next) return NULL; state = next->state; xchg.node = next; xchg.aba = cmp.aba + 1; } while (! ATOMIC_DWCAS(&self->tail, &cmp, &xchg)); cmp.node->state = state; return cmp.node; } ________________________________________________________________ For an unbounded queue, DWCAS aside for a moment, this has some fairly interesting properties. 1 simple CAS-loop for the consumer! That's pretty nice. Also, it has a 100% wait-free producer. However, there is a caveat... This queue can be busted to hell if the producer somehow "dies" after the atomic swap, and before the final store. Well, not really busted in the sense of corrupt data, but deadlocked which is way better than corrupting anything. Well, IMVHO that is... :^) BTW, this beat's the heck out of the Michael and Scott queue... :^o I cannot seem to come up with a wait-free consumer side to this algorithm... Can you give me any ideas? Perhaps we can come up with something neat. (*) I have to give credit to Dmitriy Vyukov for the excellent producer side of this algorithm because he invented it! |