From: Terje Mathisen on
Mayan Moudgill wrote:
> I guess you're implementing a mutex, not a barrier, right? If so, here's
> what we do in the case of the SB3500 (somewhat simplified).
> 1. There is a set of memory regions (I believe it's 1K addresses in the
> current implementation) that return a 32-bit value and auto-increment,
> initialized to return 0 on the first read, 1 on the second, and so on.
> Call these locations the ticket-window. Each mutex has a ticket-window
> location.

So this is a hardware-based XADD implementation, avoiding all the cache
acquire/update/release traffic.
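
(In software on a cache-coherent machine, the closest equivalent would be an
atomic fetch-and-add; a one-line sketch using the GCC/Clang builtin, purely
for comparison, not the SB3500 mechanism:)

    /* shared: uint32_t ticket_window = 0; */
    uint32_t my_ticket = __atomic_fetch_add(&ticket_window, 1, __ATOMIC_ACQ_REL);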

> 2. For each mutex, for each thread that can access the mutex there is a
> memory location containing the current value of the serving ticket,
> initially 0. [Note that this per-thread serving-ticket location is
> intended to be in close-to-thread-memory]
> 3. To implement a mutex:
> - a thread goes and reads the ticket-window, getting a ticket.
> - if the value of the serving ticket is the same as its ticket, it
> enters the mutex, otherwise it spin-waits for its ticket to be the
> serving-ticket
> - when it exits the mutex, it updates all threads' serving tickets to be
> the next higher number. This is a remote write into the other threads' local
> memory.
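
A rough C sketch of the scheme as described above (NTHREADS, ticket_mutex
and the serving[] array are placeholders, not actual SB3500 code; the
per-thread serving slots would really live in each thread's local memory):

#include <stdint.h>

#define NTHREADS 8                        /* hardware threads sharing the mutex */

typedef struct {
    volatile uint32_t *window;            /* auto-incrementing ticket-window address */
    volatile uint32_t serving[NTHREADS];  /* per-thread serving tickets, initially 0 */
} ticket_mutex;

void mutex_lock(ticket_mutex *m, int self)
{
    uint32_t ticket = *m->window;         /* hardware returns 0, 1, 2, ... */
    while (m->serving[self] != ticket)
        ;                                 /* spin on the local serving ticket */
}

void mutex_unlock(ticket_mutex *m, int self)
{
    uint32_t next = m->serving[self] + 1;
    for (int i = 0; i < NTHREADS; i++)    /* remote write into every thread's */
        m->serving[i] = next;             /* local serving-ticket location    */
}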

Which means that each lock must have an associated list of registered
users, right?

I presume updates to the lock user list are _very_ infrequent? :-)

This also means that you have a hw limit on the number of locks you can
support; this is (probably) OK for an embedded solution, but not so good for
a general-purpose OS.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: Mayan Moudgill on
Terje Mathisen wrote:
> Mayan Moudgill wrote:
>
>> I guess you're implementing a mutex, not a barrier, right? If so, here's
>> what we do in the case of the SB3500 (somewhat simplified).
>> 1. There is a set of memory regions (I believe it's 1K addresses in the
>> current implementation) that return a 32-bit value and auto-increment,
>> initialized to return 0 on the first read, 1 on the second, and so on.
>> Call these locations the ticket-window. Each mutex has a ticket-window
>> location.
>
>
> So this is a hardware-based XADD implementation, avoiding all the cache
> acquire/update/release traffic.

Yeah; in a cached system, this would be a non-cacheable load.

>> 2. For each mutex, for each thread that can access the mutex there is a
>> memory location containing the current value of the serving ticket,
>> initially 0. [Note that this per-thread serving-ticket location is
>> intended to be in close-to-thread-memory]
>> 3. To implement a mutex:
>> - a thread goes and reads the ticket-window, getting a ticket.
>> - if the value of the serving ticket is the same as its ticket, it
>> enters the mutex, otherwise it spin-waits for its ticket to be the
>> serving-ticket
>> - when it exits the mutex, it updates all threads' serving tickets to be
>> the next higher number. This is a remote write into the other threads' local
>> memory.
>
>
> Which means that each lock must have an associated list of registered
> users, right?

In the default implementation, all participants are known a priori. The
default case is 1:1 with the number of _hardware_ threads; whether or not
a given hardware thread is actually using that lock is irrelevant. Note
that this is a consequence of using spin-waits.

> I presume updates to the lock user list are _very_ infrequent? :-)

Well, I'm guessing if you allow hot-upgrades, it might happen :)

> This also means that you have a hw limit on the number of locks you can
> support; this is (probably) OK for an embedded solution, but not so good for
> a general-purpose OS.

Not really; it depends on the memory space/controller. Basically, in the
general case, there are two possible solutions based on the memory
controller:
- a region of memory can be given two address ranges for the same storage,
one of which auto-increments the memory on a read, while the other treats
it as normal memory;
- certain pages can be marked as auto-increment.
For instance, if you have an L3 controller, that controller can implement
the read-modify-write for auto-increment (note that if you use a cache
for auto-increment locations, you don't actually have to do the
writeback immediately).
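
For the first option, a sketch of what the software side might look like
(the base addresses here are invented purely for illustration; the real
address map is implementation-specific):

#include <stdint.h>

#define NORMAL_BASE   0x40000000u   /* plain view of the ticket memory (made-up) */
#define AUTOINC_BASE  0x50000000u   /* auto-incrementing alias of the same memory (made-up) */

/* read-and-increment: take the next ticket */
static inline uint32_t take_ticket(uint32_t offset)
{
    return *(volatile uint32_t *)(uintptr_t)(AUTOINC_BASE + offset);
}

/* ordinary read: inspect the counter without consuming a ticket */
static inline uint32_t peek_ticket(uint32_t offset)
{
    return *(volatile uint32_t *)(uintptr_t)(NORMAL_BASE + offset);
}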
From: nmm1 on
In article <ZJ6dnTKJq8hXdzDXnZ2dnUVZ8sednZ2d(a)lyse.net>,
Terje Mathisen <Terje.Mathisen(a)tmsw.no> wrote:
>
>This is the crux of the matter:
>
>It is _easy_ to provide sync when contention is (very close to) zero; it
>is _hard_ to do the same when lots of cores/threads try to do it
>simultaneously.
>
>The usual answer to this kind of situation is "Don't do that!".

Yes. Don't try to solve the problem you actually have; give it up
and choose one that is easier to solve. Fine for ivory tower
computer scientists; impractical for everyone else.

>To me, when I see an O(n*n) situation, I try to look for O(n*log(n))
>or O(n):
>
>One obvious approach would be to have a hierarchy of locks:
>
>Just having two levels with sqrt(n) first-level barriers all protecting
>access to the single "real" lock would turn the problem into something
>that takes twice as long in the zero contention situation, but where the
>worst case becomes O(sqrt(n)*sqrt(n)) * 2 => O(n), right?

Well, yes, but .... You are also increasing the contention for the
higher level lock in the case where there are a lot of low-contention
locks. So it could be worse than twice as slow even in that case.


Regards,
Nick Maclaren.
From: Terje Mathisen on
nmm1(a)cam.ac.uk wrote:
> In article<ZJ6dnTKJq8hXdzDXnZ2dnUVZ8sednZ2d(a)lyse.net>,
> Terje Mathisen<Terje.Mathisen(a)tmsw.no> wrote:
>> One obvious approach would be to have a hierarchy of locks:
>>
>> Just having two levels with sqrt(n) first-level barriers all protecting
>> access to the single "real" lock would turn the problem into something
>> that takes twice as long in the zero contention situation, but where the
>> worst case becomes O(sqrt(n)*sqrt(n)) * 2 => O(n), right?
>
> Well, yes, but .... You are also increasing the contention for the
> higher level lock in the case where there are a lot of low-contention
> locks. So it could be worse than twice as slow even in that case.

I can't see that:

In the best case, i.e. no contention at all, it is of course twice as
slow, but (except for the test&branch on the result of the first-level lock)
not worse than that.

At the other end of the spectrum, where every single thread is spinning
on the same resource, we've gotten that O(n*n) -> O(2n) == O(n) improvement.

The worst-case intermediate position would be a single thread in each of
the first-level locks arriving simultaneously, acquiring it, and then all
hitting the second-level (real/original) lock at the same time:

In this situation we have sqrt(n) threads accessing the same lock,
taking O(n) time.
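
In code, the two-level scheme might look something like this (just a sketch;
spinlock_t, spin_lock() and spin_unlock() stand in for whatever primitive
the platform provides, and the first-level lock is picked by hashing the
thread id):

#define NFIRST 16                 /* roughly sqrt(number of threads) */

typedef struct {
    spinlock_t first[NFIRST];     /* first-level locks */
    spinlock_t real;              /* the single "real" lock */
} hier_lock;

void hier_acquire(hier_lock *l, int tid)
{
    spin_lock(&l->first[tid % NFIRST]);   /* at most ~sqrt(n) contenders each */
    spin_lock(&l->real);                  /* at most NFIRST contenders */
}

void hier_release(hier_lock *l, int tid)
{
    spin_unlock(&l->real);
    spin_unlock(&l->first[tid % NFIRST]);
}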

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: nmm1 on
In article <JPmdnSLOCMhlzTPXnZ2dnUVZ8t6dnZ2d(a)lyse.net>,
Terje Mathisen <Terje.Mathisen(a)tmsw.no> wrote:
>
>>> One obvious approach would be to have a hierarchy of locks:
>>>
>>> Just having two levels with sqrt(n) first-level barriers all protecting
>>> access to the single "real" lock would turn the problem into something
>>> that takes twice as long in the zero contention situation, but where the
>>> worst case becomes O(sqrt(n)*sqrt(n)) * 2 => O(n), right?
>>
>> Well, yes, but .... You are also increasing the contention for the
>> higher level lock in the case where there are a lot of low-contention
>> locks. So it could be worse than twice as slow even in that case.
>
>I can't see that:
>
>In the best case, i.e. no contention at all, it is of course twice as
>slow, but (except for the test&branch on the result of the first-level lock)
>not worse than that.

I am talking about a good but not unrealistically perfect case.
The O(N) analysis isn't useful when one is talking about constant
factors, of course.

Consider a large number M of locks with usage rate P per lock.
Actual clashes are K.M.P^2, for some constant K. With your new scheme,
there are sqrt(M) higher-level locks, each with a usage rate of sqrt(M).P.
Actual clashes are then K.sqrt(M)^3.P^2.
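
To put numbers on it: with M = 256 and the same P, that is K.256.P^2
clashes before, versus K.16.(16.P)^2 = K.4096.P^2 after, i.e. a factor
of sqrt(M) = 16 more.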


Regards,
Nick Maclaren.