From: Mayan Moudgill on
MitchAlsup wrote:
> On Sep 10, 5:39 am, Terje Mathisen <Terje.Mathi...(a)tmsw.no> wrote:
>
>>MitchAlsup wrote:
>>
>>>Nobody has solved the synchronization problem that will embolden cores
>>>and threads to really deliver on their promise
>>>The closest solution to synchronization has been lobotomized by its
>>>implementation
>>
>>Hmmm..., do I hear the bitter sound of an architect thwarted? :-)
>
>
> More like the sound of the architect ignored--with no actual
> bitterness.
>
> But then again, I could have been talking about transactional
> memory.....
>
> Mitch

Well, synchronization can be pretty easy to implement - it depends on
what you are trying to accomplish with it (barriers, exclusion, queues,
etc.). Just as a thought exercise - assume every unit of control that
is trying to interact is simultaneously executing in the system (i.e.
no swapping). Use spinning plus hardware support to implement whatever
coordination between threads/processes/etc. you require. Synchronization
can then become a low-overhead activity.

For instance: on the SB3500, we can write a barrier so that (up to 4)
threads on the same core exit a barrier within 4-6 cycles of the last
thread entering it. Across (up to 3) cores, we can exit a barrier within
about 30 cycles of the last thread (I'll have to double-check that
number). Those numbers are extremely important, given that we have to
hit real-time deadlines. BTW: we could pull the cross-core synch number
way lower if we had to, but it's not required for our target apps.

But think of the OS/runtime changes that are required to guarantee that
we can use spin-locks efficiently. If threads are pre-emptively switched
on 10ms boundaries, and the threads associated with a synchronization
entity are not guaranteed to be switched in quickly (or to be already
running), then, for efficiency, you're forced to move to queue-on-wait,
signal-on-release structures, with the accompanying inefficiencies.

As the number of HW contexts increases, gang-scheduling,
hardware-to-thread-group-bindings etc. become more practical. But, it
will require a change in OS.

Sorry if I'm a little incoherent. The point is, in the appropriate
environment, with the appropriate (cheap) H/W assistance, we can
implement synchronization efficiently. It's trying to "virtualize"
synchronization structures so that they can be used on context-switching
single-processor environments that makes it expensive.
From: MitchAlsup on
On Sep 10, 10:04 pm, Mayan Moudgill <ma...(a)bestweb.net> wrote:
> Well, synchronization can be pretty easy to implement - depends on what
> you are trying to accomplish with it (barriers, exclusion, queues,
> etc.).

If it is so easy to implement, then why are (almost) all
synchronization models at least BigO( n**2 ) in time per unit of
observation? That is, it takes a minimum of n**2 memory accesses for 1
processor to recognize that it is the processor that can attempt to
make forward progress amongst n contending processors/threads.

> But think of the OS/runtime changes that are required to guarantee that
> we can use spin-locks efficiently.

Don't want spin-locks--I want a means whereby everyone trying to
synchronize recognizes their place in line in BigO( log(n) ) time and
then another means whereby they can use their place in line to access
the queue (or whatever) without (or with minimal) interference with the
other threads doing likewise.

> Sorry if I'm a little incoherent. Point is, in the appropriate
> environment, with the appropriate (cheap) H/W assistance, we can
> implement synchronization efficiently.

No, it does not and cannot, and it's a problem of the primitives and
the memory-ordering and cache-coherence models these primitives have to
live within.

To adequately solve the problem you have to make the BigO term smaller
than n**2.

Mitch
From: nmm1 on
In article <da524b6d-bc4d-4ad7-9786-3672f7e9e52c(a)j19g2000yqk.googlegroups.com>,
MitchAlsup <MitchAlsup(a)aol.com> wrote:
>On Sep 10, 10:04 pm, Mayan Moudgill <ma...(a)bestweb.net> wrote:
>> Well, synchronization can be pretty easy to implement - depends on what
>> you are trying to accomplish with it (barriers, exclusion, queues,
>> etc.).
>
>If it is so easy to implement, then why are (almost) all
>synchronization models at least BigO( n**2 ) in time per unit of
>observation? That is, it takes a minimum of n**2 memory accesses for 1
>processor to recognize that it is the processor that can attempt to
>make forward progress amongst n contending processors/threads.

Ah, but that's only worst case. If the program doesn't do much
synchronisation, then it can be implemented efficiently. Which
is a strong contender for the most useless result in computer
science.

That isn't denying that algorithms designed for low contention
aren't useful; they are. What they aren't is a general solution.

>> But think of the OS/runtime changes that are required to guarantee that
>> we can use spin-locks efficiently.
>
>Don't want spin-locks--I want a means whereby everyone trying to
>synchronize recognizes their place in line in BigO( log(n) ) time and
>then another means whereby they can use their place in line to access
>the queue (or whatever) without (or with minimal) interference with the
>other threads doing likewise.

Quite.

>> Sorry if I'm a little incoherent. Point is, in the appropriate
>> environment, with the appropriate (cheap) H/W assistance, we can
>> implement synchronization efficiently.
>
>No, it does not and cannot, and it's a problem of the primitives and
>the memory-ordering and cache-coherence models these primitives have to
>live within.

Precisely. Synchronisation isn't purely about the synchronisation
of the synchronisation primitives (that's not useful), but about the
synchronisation of other actions as well. So, even if the cost is
less than O(N^2), cheap hardware assistance will not deliver good
efficiency.

Unless, of course, "the appropriate environment" means that using
a synchronisation function does not need to synchronise anything
else, which is equivalent to specifying a message-passing system.
I would be happy with that, but thousands wouldn't.


Regards,
Nick Maclaren.
From: Mayan Moudgill on
MitchAlsup wrote:
> On Sep 10, 10:04 pm, Mayan Moudgill <ma...(a)bestweb.net> wrote:
>
>>Well, synchronization can be pretty easy to implement - depends on what
>>you are trying to accomplish with it (barriers, exclusion, queues,
>>etc.).
>
>
> If it is so easy to implement, then why are (almost) all
> synchronization models at least BigO( n**2 ) in time per unit of
> observation? That is, it takes a minimum of n**2 memory accesses for 1
> processor to recognize that it is the processor that can attempt to
> make forward progress amongst n contending processors/threads.
>

I guess you're implementing a mutex, not a barrier, right? If so, here's
what we do in the case of the SB3500 (somewhat simplified).
1. There are a set of memory regions (I believe it's 1K addresses in the
current implementation) that return a 32-bit value and auto-increment,
initialized to return 0 on the first read, 1 on the second, and so on.
Call these locations the ticket-window. Each mutex has a ticket-window
location.
2. For each mutex, for each thread that can access the mutex there is a
memory location containing the current value of the serving ticket,
initially 0. [Note that this per-thread serving-ticket location is
intended to be in close-to-thread-memory]
3. To implement a mutex:
- a thread goes and reads the ticket-window, getting a ticket.
- if the value of the serving ticket is the same as its ticket, it
enters the mutex, otherwise it spin-waits for its ticket to be the
serving-ticket
- when it exits the mutex, it updates all threads' serving tickets to be
the next higher number. This is a remote write into other threads' local
memory.

Basically, the Bakery algorithm with a hardware ticket provider.
This appears to take 1 ticket-window read (comparatively long latency
access), 1 serving ticket read (local, low latency read) to enter in the
non-contending case, and N serving updates (writes to other people's
local memories) to exit the mutex.

There's a paper lying around that shows the exact numbers and scaling.

Thinking about the problem, I'm not sure how the remote write issue will
interact with cache-coherence. I can think of ways of playing games with
the coherence mechanism, but I'm guessing that it might be more
efficient to just allow push-modifies to lines.

>>But think of the OS/runtime changes that are required to guarantee that
>>we can use spin-locks efficiently.
>
>
> Don't want spin-locks--I want a means whereby everyone trying to
> synchronize recognizes their place in line in BigO( log(n) ) time and
> then another means whereby they can use their place in line to access
> the queue (or whatever) without (or with minimal) interference with the
> other threads doing likewise.

That's a different problem; if you want a thread under contention to go
to sleep, and then be woken up when its turn comes around, you're not
solving the synchronization problem - you're really solving a load
sharing/event communication problem.

There is an exact analog: how do we deliver an interrupt to a process?
If the process is guaranteed to be running at the time of the interrupt,
we can use a process-specific interrupt handler. If the process can,
however, be swapped out, then the interrupt handling must somehow be
virtualized. And then you get into a complicated cascade of OS private
FLIH/SLIH intercepting the interrupt, determining the process for which
this interrupt is destined, then converting it into a message for later
delivery to the process. This can possibly include moving the process
from a waiting-for-event-queue into the schedule-when-appropriate-queue.

>
>>Sorry if I'm a little incoherent. Point is, in the appropriate
>>environment, with the appropriate (cheap) H/W assistance, we can
>>implement synchronization efficiently.
>
>
> No, it does not and cannot, and it's a problem of the primitives and
> the memory-ordering and cache-coherence models these primitives have to
> live within.
>
> To adequately solve the problem you have to make the BigO term smaller
> than n**2.
>
> Mitch

Again, I beg to differ. Please explain how what I'm proposing uses
O(n**2) memory accesses.
From: Terje Mathisen on
MitchAlsup wrote:
> On Sep 10, 10:04 pm, Mayan Moudgill <ma...(a)bestweb.net> wrote:
>> Well, synchronization can be pretty easy to implement - depends on what
>> you are trying to accomplish with it (barriers, exclusion, queues,
>> etc.).
>
> If it is so easy to implement, then why are (almost) all
> synchronization models at least BigO( n**2 ) in time per unit of
> observation? That is, it takes a minimum of n**2 memory accesses for 1
> processor to recognize that it is the processor that can attempt to
> make forward progress amongst n contending processors/threads.

This is the crux of the matter:

It is _easy_ to provide sync when contention is (very close to) zero; it
is _hard_ to do the same when lots of cores/threads try to do it
simultaneously.

The usual answer to this kind of situation is "Don't do that!".

To me, when I see an O(n*n) situation, I try to look for O(n*log(n)) or
O(n):

One obvious approach would be to have a hierarchy of locks:

Just having two levels with sqrt(n) first-level barriers all protecting
access to the single "real" lock would turn the problem into something
that takes twice as long in the zero contention situation, but where the
worst case becomes O(sqrt(n)*sqrt(n)) * 2 => O(n), right?

Each thread could use a hash of its thread ID to determine which
first-level lock to use.

With log(n) levels, each lock would only have two users, making the time
for each of them constant, and the total becomes O(log(n)).

However, if I ever found myself in a situation where this was a real
consideration, I'd say that the solution must still be to make sure
contention doesn't happen quite so often!

>> But think of the OS/runtime changes that are required to guarantee that
>> we can use spin-locks efficiently.
>
> Don't want spin-locks--I want a means whereby everyone trying to
> synchronize recognizes their place in line in BigO( log(n) ) time and
> then another means whereby they can use their place in line to access
> the queue (or whatever) without (or with minimal) interference with the
> other threads doing likewise.

This is indeed the proper solution.

As I've stated previously, XADD is IMHO the best of the currently
available primitives, as long as you use the return value as your index
into the work table: A single access/update per thread to get forward
progress for all of them, so at least we avoid looping by all those who
didn't win the first exchange.

>
>> Sorry if I'm a little incoherent. Point is, in the appropriate
>> environment, with the appropriate (cheap) H/W assistance, we can
>> implement synchronization efficiently.
>
> No, it does not and cannot, and it's a problem of the primitives and
> the memory-ordering and cache-coherence models these primitives have to
> live within.
>
> To adequately solve the problem you have to make the BigO term smaller
> than n**2.

Right.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"