From: Peter Olcott

"Ersek, Laszlo" <lacos(a)caesar.elte.hu> wrote in message
news:Pine.LNX.4.64.1004051406400.9882(a)login01.caesar.elte.hu...
> On Mon, 5 Apr 2010, Paul wrote:
>
>> Sorry to interrupt the thread. You had a question about
>> variability in performance of a small test program. (Your
>> "memory bandwidth" test...)
>>
>> Maybe you've figured it out already. It isn't a problem
>> with TLB performance. The notion of a random walk is
>> wrong.
>
>> In other words, for roughly the first 4143 steps,
>> the program is doing a random walk. Then, the array walk
>> gets stuck in a repeating, 23-step loop.
>
>> In any event, the random walk is not random. It is
>> getting stuck in a (short) loop. The TLB might still be a
>> factor, but you'll need a better test design, to evaluate
>> that.
>
> Would you comment on the code herein?
>
> http://groups.google.com/group/comp.os.linux.development.apps/msg/aaf49b54b2501740?dmode=source
>
> Thanks,
> lacos

uint32 GetRandom(uint32 size) {
    // 64-bit arithmetic: rand() * (RAND_MAX + 1) overflows int when
    // RAND_MAX == INT_MAX, which is undefined behavior.
    return (uint32)(((unsigned long long)rand() * ((unsigned long long)RAND_MAX + 1) + rand()) % size);
}

The only thing I care about is making this function
produce sequences of values that are more widely
distributed. I want to force a cache-busting pattern of
memory accesses.
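
A minimal sketch of one alternative, assuming a C++ build; the name
GetRandomIndex and the use of <random> are illustrative, not part of
the original program. It avoids both the integer overflow and the
modulo bias of the rand()-based version:

#include <cstdint>
#include <random>

// Hypothetical replacement: a 64-bit Mersenne Twister yields a much
// wider, better-distributed index stream than combining two rand()
// calls, and uniform_int_distribution removes the "% size" bias.
uint32_t GetRandomIndex(uint32_t size) {
    static std::mt19937_64 gen(std::random_device{}());
    std::uniform_int_distribution<uint32_t> dist(0, size - 1);
    return dist(gen);
}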


From: Ersek, Laszlo
On Mon, 5 Apr 2010, Peter Olcott wrote:

> "Paul" <nospam(a)needed.com> wrote in message
> news:hpc817$9kn$1(a)news.eternal-september.org...

>> Sorry to interrupt the thread. You had a question about variability in
>> performance of a small test program. (Your "memory bandwidth" test...)
>>
>> Maybe you've figured it out already. It isn't a problem with TLB
>> performance. The notion of a random walk is wrong.
>>
>> Your array was initialized with this.
>>
>> Random = rand() * rand();
>> Random %= size;
>> Data[N] = Random;
>>
>> And tested for a billion cycles with this.
>>
>> for (uint32 N = 0; N < Max; N++)
>>     num = Data[num];
>>
>> (In my previous posting, I'd missed the fact that you were doing a
>> constant number of steps in this loop while varying the array size. In
>> that case, the time to execute this walk should be constant, as long
>> as the time per access stayed the same across the 1 billion accesses.)
>
> No. The time should (and does) generally increase as the array grows,
> because spatial locality of reference decreases and the cache becomes
> less effective. The random generator was already modified:
>
> uint32 GetRandom(uint32 size) {
>     // 64-bit arithmetic: rand() * (RAND_MAX + 1) overflows int when
>     // RAND_MAX == INT_MAX, which is undefined behavior.
>     return (uint32)(((unsigned long long)rand() * ((unsigned long long)RAND_MAX + 1) + rand()) % size);
> }
>
> This is still not random enough but was not worth the time to fix on
> this quick and dirty little proof of concept program. Simply run the
> test ten times on every memory size and ignore the fast times.

I'm looking at the source in
<http://groups.google.ca/group/comp.os.linux.development.apps/msg/0d869538b2dc4d39?dmode=source>
assuming Initialize() calls the GetRandom() function above to populate
Data[N].

IMO the problem is not the quality of the PRNG. If you were simply
accessing array elements at offsets returned by the PRNG, the emergent
pattern might perfectly well suit your needs. You are, however, placing
"random pointers" at increasing offsets in Initialize(), then traverse
this linked list in Process(). No matter how good and uniformly
distributed the PRNG is, this may (and as Nicolas has shown, *will*)
result in a short cycle somewhere.

Suppose m, n, k are pairwise distinct, and x[m] == k and x[n] == k. The
chain starting at x[k] will lead to either m or n (supposing it doesn't
fall into an even shorter loop first). Whichever of the two comes first,
the other one will never be reached. And if n is the one that is never
reached, no node on any path leading to n is reached either.
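
One standard remedy (sketched here; it is not something proposed in the
thread itself) is to initialize Data as a single cycle over all
elements with Sattolo's algorithm, which generates a uniformly random
cyclic permutation. The pointer chase then has exactly one cycle, of
length size, so it can never collapse into a short loop. This assumes
the program's uint32 typedef and that GetRandom(n) returns a uniform
value in [0, n):

// Sketch: build Data as one cycle that visits every element once.
void InitializeFullCycle(uint32 *Data, uint32 size) {
    for (uint32 i = 0; i < size; i++)
        Data[i] = i;                    // start from the identity
    for (uint32 i = size - 1; i > 0; i--) {
        uint32 j = GetRandom(i);        // j in [0, i), strictly below i
        uint32 tmp = Data[i];           // swap Data[i] and Data[j]
        Data[i] = Data[j];
        Data[j] = tmp;
    }
}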

Perhaps picking the worst-performing of ten consecutive test runs could
be shown to be "good enough", but I'm not confident.

lacos
From: Peter Olcott

"Ersek, Laszlo" <lacos(a)caesar.elte.hu> wrote in message
news:Pine.LNX.4.64.1004051935200.27152(a)login01.caesar.elte.hu...
> On Mon, 5 Apr 2010, Peter Olcott wrote:
>
>> "Paul" <nospam(a)needed.com> wrote in message
>> news:hpc817$9kn$1(a)news.eternal-september.org...
>
>>> Sorry to interrupt the thread. You had a question about
>>> variability in performance of a small test program.
>>> (Your "memory bandwidth" test...)
>>>
>>> Maybe you've figured it out already. It isn't a problem
>>> with TLB performance. The notion of a random walk is
>>> wrong.
>>>
>>> Your array was initialized with this.
>>>
>>> Random = rand() * rand();
>>> Random %= size;
>>> Data[N] = Random;
>>>
>>> And tested for a billion cycles with this.
>>>
>>> for (uint32 N = 0; N < Max; N++)
>>>     num = Data[num];
>>>
>>> (In my previous posting, I'd missed the fact that you
>>> were doing a constant number of steps in this loop while
>>> varying the array size. In that case, the time to
>>> execute this walk should be constant, as long as the
>>> time per access stayed the same across the 1 billion accesses.)
>>
>> No. The time should (and does) generally increase as the
>> array grows, because spatial locality of reference decreases
>> and the cache becomes less effective. The random generator
>> was already modified:
>>
>> uint32 GetRandom(uint32 size) {
>>     // 64-bit arithmetic: rand() * (RAND_MAX + 1) overflows int when
>>     // RAND_MAX == INT_MAX, which is undefined behavior.
>>     return (uint32)(((unsigned long long)rand() * ((unsigned long long)RAND_MAX + 1) + rand()) % size);
>> }
>>
>> This is still not random enough but was not worth the
>> time to fix on this quick and dirty little proof of
>> concept program. Simply run the test ten times on every
>> memory size and ignore the fast times.
>
> I'm looking at the source in
> <http://groups.google.ca/group/comp.os.linux.development.apps/msg/0d869538b2dc4d39?dmode=source>
> assuming Initialize() calls the GetRandom() function
> above to populate Data[N].
>
> IMO the problem is not the quality of the PRNG. If you
> were simply accessing array elements at offsets returned
> by the PRNG, the emergent pattern might perfectly well
> suit your needs. You are, however, placing "random
> pointers" at increasing offsets in Initialize(), then
> traversing this linked list in Process(). No matter how good
> and uniformly distributed the PRNG is, this may (and as
> Nicolas has shown, *will*) result in a short cycle
> somewhere.

Not if you force every access to be at least (size * 0.25)
away from the prior access.

>
> Suppose m, n, k are pairwise distinct, and x[m] == k and
> x[n] == k. The chain starting at x[k] will lead to either
> m or n (supposing it doesn't lead to an even shorter
> loop). Whichever comes first, the other one won't be
> reached anymore. Supposing n is the one reached no more,
> any node of any path leading to n isn't reached either.
>
> Perhaps picking the worst performing one of ten subsequent
> test runs could be shown to be "good enough", but I'm not
> confident.
>
> lacos


From: Nicolas George
"Peter Olcott" wrote in message
<j-qdnXbpHJorrSfWnZ2dnUVZ_jWdnZ2d(a)giganews.com>:
> Not if you force every access to be at least (size * 0.25)
> away from the prior access.

Distance has nothing to do with it. You can cycle between the first and
last element of the array; that gives a distance of size - epsilon, yet
it is still a very short cycle.
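
To make the point concrete (an illustrative snippet, not code from the
thread), two elements a whole array apart still form a two-step cycle:

// Maximal distance between links, yet a 2-step cycle.
void MakeTwoStepCycle(uint32 *Data, uint32 size) {
    Data[0]        = size - 1;   // jump to the far end...
    Data[size - 1] = 0;          // ...and straight back
    // Chasing num = Data[num] from 0 now ping-pongs forever.
}
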
From: Peter Olcott

"Nicolas George" <nicolas$george(a)salle-s.org> wrote in
message news:4bba35de$0$29842$426a74cc(a)news.free.fr...
> "Peter Olcott" wrote in message
> <j-qdnXbpHJorrSfWnZ2dnUVZ_jWdnZ2d(a)giganews.com>:
>> Not if you force every access to be at least (size * 0.25)
>> away from the prior access.
>
> Distance has nothing to do with it. You can cycle between
> the first and last element of the array; that gives a
> distance of size - epsilon, yet it is still a very short
> cycle.

I am trying to find an algorithm that makes the cache as
ineffective as possible. I want to get an accurate measure
of the worst-case performance of my deterministic finite
automaton. The only thing that I can think of is to make
sure that each memory access lands more than the maximum
cache size away from the prior one. That should eliminate
spatial locality of reference.
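
One way to get both properties at once (a sketch of a standard
technique, not something settled in the thread): link each element to
the one a fixed stride away, with the stride chosen coprime to the
array size. The chase i -> (i + stride) % size is then a single cycle
of length size, and consecutive accesses are always stride elements
apart, so the stride can simply be chosen larger than the cache:

#include <cstdint>
#include <numeric>   // std::gcd (C++17)

// Sketch: caller must pick stride with std::gcd(stride, size) == 1,
// e.g. a prime larger than the cache footprint. Every pass then
// touches all 'size' elements exactly once.
void InitializeStrideCycle(uint32_t *Data, uint32_t size, uint32_t stride) {
    for (uint32_t i = 0; i < size; i++)
        Data[i] = (i + stride) % size;
}

One caveat: a constant stride is exactly the pattern hardware
prefetchers are built to detect, so a full-cycle random permutation
(Sattolo's algorithm, sketched above) is usually the stronger
cache-buster.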