From: Stephen Fuld on
On 1/25/2010 12:56 PM, Terje Mathisen wrote:
> Stephen Fuld wrote:
>>
>> Perhaps I am missing something, but I don't think that, by itself, works.
>> If you have multiple timers, doesn't that require a much smaller
>> granularity timer, i.e. say 10 ns versus 1 us? If you stick with the 1
>> us granularity, nothing prevents two calls within the same us from the
>> same processor from getting the same value. But if you try to maintain
>> multiple clocks in different chips in sync with each other within 10 ns,
>> you run into other problems which make that hard.
>
> The trick is simply to add a bunch of bits below the least significant
> timer bit, and then use those as a cpu/core ID.
>
> I.e. each time cpu 0 and cpu 1 happen to record exactly the same real
> timestamp, cpu 0 will be considered to have happened before cpu 1, since
> those trailing bits will be ...000 for cpu 0 and ...001 for cpu 1.
>
> With 16 such bits you can handle a 64K cluster and still guarantee that
> all timestamps will be globally unique.

Yes, I understand that. But I am still missing something. If the idea
of this is to guarantee uniqueness across a cluster when using a single
clock register, then the other mechanisms seem to provide that and more.
If the idea is to allow multiple clocks/registers (perhaps one per
board) in order to reduce potential scaling issues or reduce the time
lag required to go across the interconnect network, then ISTM that you
lose the guarantee of sequentiality, that is, that a timer call that
occurs before a second one gets a lower number. That is, the numbers
will be unique, but not necessarily in time order.

ISTM that the time lag is pretty small, at least for clusters within the
same room, and incurring that delay is worth the guarantee of
sequentiality. As for the scaling issues, even with current technology,
given that a single CPU register can easily be accessed multiple times
per ns, I just don't see a scaling issue for any reasonable usage. Does
anyone have a feel for how often the clock/counter needs to be accessed
for any typical/reasonable use?

So, in summary, the static ID seems to me to be a sub-optimal solution
in all situations.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)
From: Morten Reistad on
In article <u5d137-vtc1.ln1(a)ntp.tmsw.no>,
Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>Stephen Fuld wrote:
>>
>> Perhaps I am missing something, but I don't think that, by itself, works.
>> If you have multiple timers, doesn't that require a much smaller
>> granularity timer, i.e. say 10 ns versus 1 us? If you stick with the 1
>> us granularity, nothing prevents two calls within the same us from the
>> same processor from getting the same value. But if you try to maintain
>> multiple clocks in different chips in sync with each other within 10 ns,
>> you run into other problems which make that hard.
>
>The trick is simply to add a bunch of bits below the least significant
>timer bit, and then use those as a cpu/core ID.
>
>I.e. each time cpu 0 and cpu 1 happen to record exactly the same real
>timestamp, cpu 0 will be considered to have happened before cpu 1, since
>those trailing bits will be ...000 for cpu 0 and ...001 for cpu 1.
>
>With 16 such bits you can handle a 64K cluster and still guarantee that
>all timestamps will be globally unique.

And, just distributing a master clock using a serial wire with some
RLL-like code that ticks out timing plus some info to mark major epochs,
like seconds, should be a reasonably trivial thing to implement. Just
make all the wires have the same delay, and you are set. The wire
drives a counter, a shift register, and two latches. Every tick,
the counter increments and loads the first latch. At a much lower
speed, the shift register loads the absolute value that is shifted
into the latches on an event pulse, say once a second. The lower bits
are the cpu ID.

Voila, "synchronous" clock on a lot of cpus. From my old theory, such latches
and registers should be able to run about at a fourth of the fundamental
switching speed, which would beat even the fastest instruction
engines at least by a factor of two. So every read would get a
different value.
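
Sketched in C rather than HDL (a toy model only; on_tick/on_epoch stand
in for the events arriving on the wire, and the names are made up):

#include <stdint.h>

#define ID_BITS 16

/* Toy model of one node's clock receiver: in hardware this is a
   counter, a shift register and two latches; here the "latch" is
   simply the value a local read returns. */
struct node_clock {
    uint64_t counter;   /* incremented on every tick of the wire */
    uint64_t cpu_id;    /* wired-in node ID, becomes the low bits */
};

/* Every tick: the counter increments and is latched. */
static void on_tick(struct node_clock *c)
{
    c->counter++;
}

/* Epoch pulse, say once a second: the absolute value shifted in
   over the wire replaces the count, so all nodes snap to the
   same value. */
static void on_epoch(struct node_clock *c, uint64_t absolute)
{
    c->counter = absolute;
}

/* A local read: latched count on top, cpu ID in the low bits. */
static uint64_t read_clock(const struct node_clock *c)
{
    return (c->counter << ID_BITS) | c->cpu_id;
}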

-- mrr

Yes, the cable will contain data. Lots of data.

But it is junior to the transatlantic cables, which actually contain
a few hundred gigabytes in transit across the ocean.


From: Stefan Monnier on
> Perhaps I am missing something, but I don't think that, by itself, works. If
> you have multiple timers, doesn't that require a much smaller granularity
> timer, i.e. say 10 ns versus 1 us?

Yes, of course.


Stefan
From: Terje Mathisen "terje.mathisen at tmsw.no" on
Stephen Fuld wrote:
> On 1/25/2010 12:56 PM, Terje Mathisen wrote:
>> With 16 such bits you can handle a 64K cluster and still guarantee that
>> all timestamps will be globally unique.
>
> Yes, I understand that. But I am still missing something. If the idea of
> this is to guarantee uniqueness across a cluster when using a single
> clock register, then the other mechanisms seem to provide that and more.
> If the idea is to allow multiple clocks/registers (perhaps one per board)
> in order to reduce potential scaling issues or reduce the time lag
> required to go across the interconnect network, then ISTM that you lose
> the guarantee of sequentiality, that is, that a timer call that occurs
> before a second one gets a lower number. That is, the numbers will be
> unique, but not necessarily in time order.

That is a feature, not a bug!

When we time sufficiently small intervals, i.e. intervals smaller than
the minimum time to get from one node to its nearest neighbor, there is
no way to globally determine the "real" order, simply because any such
global timer would have to give coarser resolution than what each
cpu/core-based counter can do.

Adding the core ID makes each timestamp unique, so the idea is simply to
be able to compare them after the fact, using the numeric order as
the effective time order.
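
In C this is a one-liner (a minimal sketch, assuming x86 and glibc;
sched_getcpu() stands in for whatever globally unique node/core ID the
cluster actually assigns):

#define _GNU_SOURCE
#include <stdint.h>
#include <sched.h>      /* sched_getcpu(), glibc */
#include <x86intrin.h>  /* __rdtsc() */

#define ID_BITS 16      /* room for a 64K cluster */

/* Composite timestamp: cycle counter shifted up, cpu/core ID in the
   low bits. Two cores that read the very same tick still return
   distinct values, and numeric order is the effective time order. */
static inline uint64_t unique_timestamp(void)
{
    uint64_t ticks = __rdtsc();
    uint64_t id    = (uint64_t)sched_getcpu() & ((1u << ID_BITS) - 1);
    return (ticks << ID_BITS) | id;
}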

> So, in summary, the static ID seems to me to be a sub-optimal solution
> in all situations.

Except that it carries an order of magnitude less overhead, scales
perfectly, and allows timing resolution down to whatever the local core
can do (~ns).

Otherwise it might be sub-optimal. :-)

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: Terje Mathisen "terje.mathisen at tmsw.no" on
Morten Reistad wrote:
> In article <u5d137-vtc1.ln1(a)ntp.tmsw.no>,
> Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>> Stephen Fuld wrote:
>>>
>>> Perhaps I am missing something, but I don't think that, by itself, works.
>>> If you have multiple timers, doesn't that require a much smaller
>>> granularity timer, i.e. say 10 ns versus 1 us? If you stick with the 1
>>> us granularity, nothing prevents two calls within the same us from the
>>> same processor from getting the same value. But if you try to maintain
>>> multiple clocks in different chips in sync with each other within 10 ns,
>>> you run into other problems which make that hard.
>>
>> The trick is simply to add a bunch of bits below the least significant
>> timer bit, and then use those as a cpu/core ID.
>>
>> I.e. each time cpu 0 and cpu 1 happen to record exactly the same real
>> timestamp, cpu 0 will be considered to have happened before cpu 1, since
>> those trailing bits will be ...000 for cpu 0 and ...001 for cpu 1.
>>
>> With 16 such bits you can handle a 64K cluster and still guarantee that
>> all timestamps will be globally unique.
>
> And, just distributing a master clock using a serial wire with some
> RLL-like code that ticks out timing plus some info to mark major epochs,
> like seconds, should be a reasonably trivial thing to implement. Just
> make all the wires have the same delay, and you are set. The wire
> drives a counter, a shift register, and two latches. Every tick,
> the counter increments and loads the first latch. At a much lower
> speed, the shift register loads the absolute value that is shifted
> into the latches on an event pulse, say once a second. The lower bits
> are the cpu ID.
>
> Voila, "synchronous" clock on a lot of cpus. From my old theory, such latches
> and registers should be able to run at about a fourth of the fundamental
> switching speed, which would beat even the fastest instruction
> engines by at least a factor of two. So every read would get a
> different value.

Such a global reference, possibly in the form of a PPS (or higher
frequency) signal going to each cpu, is a way to sync up the individual
counters, but it still allows two "simultaneous" reads on two independent
cores/cpus/boards to return the identical value, right?

This means that you still need to add a cpu ID number to make them
globally unique, and at that point you're really back to the same
problem (and solution), although your high-frequency reference signal
allows for much easier synchronization of all the individual timers.

The key is that as soon as you can sync them all to better than the
maximum read speed, there is no way to make two near-simultaneous reads
in two locations and still be able to guarantee that they will be
different, so you need to handle this in some way.

Appending a cpu ID is by far the easiest solution. :-)
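
Comparing after the fact is then just an unsigned compare, with the low
bits telling you who produced the stamp (same illustrative 16-bit layout
as in my earlier sketch):

#include <stdint.h>

#define ID_BITS 16

/* Effective time order == plain numeric order on the composite value;
   ties on the real timestamp resolve by node ID. */
static int happened_before(uint64_t a, uint64_t b)
{
    return a < b;
}

/* Split a stamp back into its pieces, e.g. for logging. */
static uint64_t stamp_ticks(uint64_t t) { return t >> ID_BITS; }
static unsigned stamp_node(uint64_t t)  { return (unsigned)(t & ((1u << ID_BITS) - 1)); }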

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"