From: Ken Hagan on
On Mon, 25 Jan 2010 21:48:20 -0000, Stephen Fuld
<SFuld(a)alumni.cmu.edu.invalid> wrote:

> ISTM that the time lag is pretty small at least for clusters within the
> same room and incurring that delay is worth the guarantee of
> sequentiality. As for the scaling issues, even with current technology,
> given that a single CPU register can be accessed easily multiple times
> per ns, I just don't see a scaling issue for any reasonable usage. Does
> anyone have a feel for how often the clock/counter needs to be accessed
> for any typical/reasonable use?

Depends whether it is a clock or a counter. If you are timing code, the
chances are you will be comparing two values read on the same CPU, so the
issue really doesn't arise.

If you want a counter to assign a unique order to transactions, then I'd
have thought it was really rather likely that two transactions might be
pulled from the "inbox" in quick succession, dispatched to separate
processors, then take roughly the same length of time to get started, and
consequently both request their "order number" at roughly the same time.
Then it is just a matter of time, in practice, before the system turns
"roughly" into "exactly".

This is one of those "inevitable coincidences" that make real parallel
systems so much more exciting than time-sliced ones.
From: nmm1 on
In article <op.u64wluxbss38k4(a)khagan.ttx>,
Ken Hagan <K.Hagan(a)thermoteknix.com> wrote:
>On Mon, 25 Jan 2010 21:48:20 -0000, Stephen Fuld
><SFuld(a)alumni.cmu.edu.invalid> wrote:
>
>> ISTM that the time lag is pretty small at least for clusters within the
>> same room and incurring that delay is worth the guarantee of
>> sequentiality. As for the scaling issues, even with current technology,
>> given that a single CPU register can be accessed easily multiple times
>> per ns, I just don't see a scaling issue for any reasonable usage. Does
>> anyone have a feel for how often the clock/counter needs to be accessed
>> for any typical/reasonable use?
>
>Depends whether it is a clock or a counter. If you are timing code, the
>chances are you will be comparing two values read on the same CPU, so the
>issue really doesn't arise.

Once you start to program in parallel, and not merely tack serial code
together with a bit of parallel glue, that ceases to be the case.
You can't do any serious tuning (or even quite a lot of debugging) of
parallel code without knowing whether events in one domain[*] happened
before events in another. The problem is that the question is easy
to pose, but can't be answered without (a) a more precise definition
of what "happened before" means and (b) accepting that parallel time
is not a monotonic scalar.
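
One standard way to make "happened before" precise without pretending
there is a global scalar time is a logical clock in the Lamport style.
A minimal sketch, offered purely as an illustration of the definitional
point, not as anything proposed above:

  /* A Lamport-style logical clock, one per domain (illustration only).
   * "Happened before" is then the partial order generated by local
   * program order plus message send/receive. */
  #include <stdint.h>

  struct lclock { uint64_t t; };

  /* A purely local event: advance the local counter. */
  static uint64_t lc_event(struct lclock *c)
  {
      return ++c->t;
  }

  /* Stamp an outgoing message with the sender's time. */
  static uint64_t lc_send(struct lclock *c)
  {
      return lc_event(c);
  }

  /* On receipt, jump past whatever the sender had already seen. */
  static uint64_t lc_recv(struct lclock *c, uint64_t msg_time)
  {
      if (msg_time > c->t)
          c->t = msg_time;
      return lc_event(c);
  }

The stamps respect the partial order (if event a happened before event b,
then a's stamp is smaller), but the converse does not hold; causally
unrelated events get stamped anyway. That is the precise sense in which
parallel time is not a single monotonic scalar.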

For example, one of the basic questions to answer when parallel code
is running far slower than expected is whether the problem is the time
taken to communicate data from one domain to another. I had to
write a clock synchroniser to track down one such problem with MPI.
When I located it, I realised that I had completely misunderstood
where the problem was.

[*] The word "domain" means core, thread or other concept, depending
on the program and system details.
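
A minimal sketch of such a clock synchroniser, in the NTP style over MPI
itself and assuming a symmetric link delay (an illustration with function
names of my own choosing, not the synchroniser described above):

  #include <mpi.h>

  /* Estimate (remote clock - local clock) between rank 0 and rank r,
   * measured on rank 0, keeping the least-delayed sample. */
  double estimate_offset(int r, int rounds)
  {
      double best_rtt = 1e30, offset = 0.0;
      for (int i = 0; i < rounds; i++) {
          double t0 = MPI_Wtime();
          MPI_Send(&t0, 1, MPI_DOUBLE, r, 0, MPI_COMM_WORLD);
          double tr;        /* remote timestamp taken on rank r */
          MPI_Recv(&tr, 1, MPI_DOUBLE, r, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          double t1 = MPI_Wtime();
          double rtt = t1 - t0;
          if (rtt < best_rtt) {
              best_rtt = rtt;
              offset = tr - (t0 + rtt / 2.0);   /* assume tr was taken at the RTT midpoint */
          }
      }
      return offset;
  }

  /* Rank r side: echo its own clock back for each probe. */
  void offset_responder(int rounds)
  {
      for (int i = 0; i < rounds; i++) {
          double t0;
          MPI_Recv(&t0, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          double tr = MPI_Wtime();
          MPI_Send(&tr, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
      }
  }

With the per-domain offsets in hand, a timestamp taken in one domain can
be compared with one taken in another, to within the residual error,
which is what the communication-cost question actually needs.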


Regards,
Nick Maclaren.
From: Morten Reistad on
In article <g2j237-the1.ln1(a)ntp.tmsw.no>,
Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>Morten Reistad wrote:
>> In article<u5d137-vtc1.ln1(a)ntp.tmsw.no>,
>> Terje Mathisen<"terje.mathisen at tmsw.no"> wrote:
>>> Stephen Fuld wrote:
>>
>> Voila, "synchronous" clock on a lot of cpus. From my old theory, such latches
>> and registers should be able to run at about a fourth of the fundamental
>> switching speed, which would beat even the fastest instruction
>> engines by at least a factor of two. So every read would get a
>> different value.
>
>Such a global reference, possibly in the form of a PPS (or higher
>frequency) signal going to each cpu is a way to sync up the individual
>counters, it still allows two "simultaneous" reads on two independent
>cores/cpus/boards to return the identical value, right?

Yes, that is possible. I am just trying to point out that it does not
take all that much hardware design to build slave clocks that run at
speeds faster than CPUs can realistically utilise.

>This means that you still need to add a cpu ID number to make them
>globally unique, and at that point you're really back to the same
>problem (and solution), although your high-frequency reference signal
>allows for much easier synchronization of all the individual timers.

I am usually advocating software solutions, but this question just
screams out for a simple hardware solution.

>The key is that as soon as you can sync them all to better than the
>maximum read speed, there is no way to make two near-simultaneous reads
>in two locations and still be able to guarantee that they will be
>different, so you need to handle this is some way.

That is correct.

The raw switching speeds of the transistors in modern computers
should be at or close to single-digit picoseconds. So building
elementary logic like shift registers and clock drivers to operate
at the 100 ps / 0.1 ns / 10 GHz level should be doable without
jumping through too many hoops.

That should still be at least a factor of two faster than the best cpu
clock speeds, which means the clock outruns the instruction decode that
has to read it. Adding a cpu id beyond this precision should not be
problematic. The timing jitter of instruction decode would be orders of
magnitude higher.

At these speeds the "clock event horison" between each tick is
around 30 centimeters. So, if the distance between processors is
bigger than that you cannot have causality between events, that
is prohibited by relativity. Database ordering kind of breaks down
from there.

>Appending a cpu ID is by far the easiest solution. :-)

Yep.

You can define a sequence from a clock synchronised to a single source
plus uniquely time-staggered units. That gives us a sequence, but one
that is physically rather meaningless. It does give us unique
transaction IDs, though.
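
A minimal sketch of that construction, with the field split chosen
arbitrarily for illustration:

  /* The "append a cpu ID" scheme: a 64-bit transaction ID built from the
   * synchronised tick count and a unit number. */
  #include <stdint.h>

  #define UNIT_BITS 12                     /* up to 4096 units/cpus */

  static inline uint64_t txn_id(uint64_t tick, uint32_t unit)
  {
      return (tick << UNIT_BITS) | (unit & ((1u << UNIT_BITS) - 1u));
  }

Two units that read the identical tick still get distinct IDs; within a
tick the order is by unit number, which is unique but, as said,
physically rather meaningless.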

-- mrr
From: Larry on
On Jan 20, 9:33 am, n...(a)cam.ac.uk wrote:
> In article <b6gj27-5bn....(a)ntp.tmsw.no>,
> Terje Mathisen  <"terje.mathisen at tmsw.no"> wrote:
>
> >n...(a)cam.ac.uk wrote:
> >> In article<n7dj27-n7n....(a)ntp.tmsw.no>,
> >> Terje Mathisen<"terje.mathisen at tmsw.no">  wrote:
> >>> If you instead use a memory-mapped timer chip register, then you've
> >>> still got the cost of a real bus transaction instead of a couple of
> >>> core-local instructions.
>
> >> Eh?  But how are you going to keep a thousand cores synchronised?
> >> You can't do THAT with a couple of core-local instructions!
>
> >You and I have both written NTP-type code, so as I wrote in another
> >message: Separate motherboards should use NTP to stay in sync, with or
> >without hw assists like ethernet timing hw and/or a global PPS source.
>
> Yes, but I thinking of a motherboard with a thousand cores on it.
> While it could use NTP-like protocols between cores, and for each
> core to maintain its ownclock, that's a fairly crazy approach.
>
> All right, realistically, it would be 64 groups of 16 cores, or
> whatever, but the point stands.  Having to use TWO separate
> protocols on a single board isn't nice.
>
> Regards,
> Nick Maclaren.

For what it's worth, the SiCortex machines, even at the 5800 core
level, had synchronous and nearly synchronized cycle counters.

The entire system, ultimately, ran off a single 100 MHz or so clock,
with PLLs on each multicore chip to upconvert that to the proper
internal rates. On those cores, the cycle counters started from zero
when reset was released, so they were not synchronized at boot time.
There was a low level timestamp-the-interconnect scheme that would
then synchronize all the cycle counters within a few counts, giving
~10 nanosecond synchronization across all 5832 cores.

This was used to create MPI_WTIME and other system-wide timestamps,
and it was very handy for large-scale performance tuning, but not
useful for UIDs.
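
A minimal sketch of how such counters end up as a system-wide timestamp,
namely a calibrated per-core offset added to the raw count. The names
here are hypothetical and the x86 TSC is used purely as a stand-in for
whatever per-core counter the hardware provides; on the real machines
the offsets came from the timestamp-the-interconnect calibration above.

  #include <stdint.h>
  #include <x86intrin.h>

  struct core_time {
      int64_t offset_cycles;      /* measured against a reference core */
      double  seconds_per_cycle;  /* 1 / core clock rate */
  };

  static double wall_time(const struct core_time *ct)
  {
      uint64_t raw = __rdtsc();   /* raw per-core cycle count */
      return (double)((int64_t)raw + ct->offset_cycles) * ct->seconds_per_cycle;
  }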

By the way, once your applications get to large scale (over 1000
cores), problems of synchronization and load balancing start to
dominate, and in that regime, I suspect variable speed clocks make the
situation worse. Better to turn off cores to save power than to let
them run at variable speed.

-Larry (ex SiCortex)

From: nmm1 on
In article <0b40dbdb-53c0-4c5c-a19b-e68316f3d9c4(a)p17g2000vbl.googlegroups.com>,
Larry <lstewart2(a)gmail.com> wrote:
>
>For what it's worth, the SiCortex machines, even at the 5800 core
>level, had synchronous and nearly synchronized cycle counters.
>
>The entire system, ultimately, ran off a single 100 MHz or so clock,
>with PLLs on each multicore chip to upconvert that to the proper
>internal rates. On those cores, the cycle counters started from zero
>when reset was released, so they were not synchronized at boot time.
>There was a low level timestamp-the-interconnect scheme that would
>then synchronize all the cycle counters within a few counts, giving
>~10 nanosecond synchronization across all 5832 cores.

That's impressive. The demise of SiCortex was very sad :-(

>This was used to create MPI_WTIME and other system-wide timestamps,
>and it was very handy for large-scale performance tuning, but not
>useful for UIDs.

Yes.

>By the way, once your applications get to large scale (over 1000
>cores), problems of synchronization and load balancing start to
>dominate, and in that regime, I suspect variable speed clocks make the
>situation worse. Better to turn off cores to save power than to let
>them run at variable speed.

Oh, gosh, YES! The more I think about tuning parallel codes in a
variable clock context, the more I think that I don't want to go
there. And that's independent of whether I have an application or
an implementor hat on.


Regards,
Nick Maclaren.