From: Paul on
Peter Olcott wrote:

>
> Yeah so now that I understand how pipes work a little
> better, and that the function named poll() is a misnomer and
> does not actually do polling, I can provide a much better
> solution.
>

Sorry to interrupt the thread. You had a question about
variability in performance of a small test program.
(Your "memory bandwidth" test...)

Maybe you've figured it out already. It isn't a problem
with TLB performance. The notion of a random walk is
wrong.

Your array was initialized with this.

Random = rand() * rand();
Random %= size;
Data[N] = Random;

And tested for a billion cycles with this.

for (uint32 N = 0; N < Max; N++)
    num = Data[num];

(In my previous posting, I'd missed the fact that you were
doing a constant number of steps in this loop and varying the
array size. In which case, the time to execute this walk should
be constant, as long as the time to do the 1 billion memory accesses
was the same.)

So I tried a technique to lock the array in memory, and these
are the execution time results with the new storage method.
(Results collected on Debian 5.04 AMD64, 2GB RAM system, 2.6GHz Core2 Duo)

Size in bytes--->400000000 12.31 Seconds
Size in bytes--->440000000 12.67 Seconds
Size in bytes--->480000000 10.45 Seconds
Size in bytes--->520000000 13.30 Seconds
Size in bytes--->560000000 12.83 Seconds
Size in bytes--->600000000 13.57 Seconds
Size in bytes--->640000000 18.03 Seconds
Size in bytes--->680000000 5.38 Seconds 761.34 Megabytes Per Second
Size in bytes--->720000000 21.61 Seconds
Size in bytes--->760000000 13.99 Seconds
Size in bytes--->800000000 28.38 Seconds
Size in bytes--->840000000 13.28 Seconds
Size in bytes--->880000000 92.86 Seconds
Size in bytes--->920000000 11.78 Seconds
Size in bytes--->960000000 15.34 Seconds
Size in bytes--->1000000000 15.63 Seconds
Size in bytes--->1040000000 67.98 Seconds
Size in bytes--->1080000000 13.08 Seconds
Size in bytes--->1120000000 12.81 Seconds
Size in bytes--->1160000000 98.32 Seconds
Size in bytes--->1200000000 14.66 Seconds

You see, there is really no pattern to this at all, and the (constant)
time expected varies from 5.38 seconds to 92.86 seconds, which is
a huge variation. So I figured it was time for another hypothesis.

So the next modification to the program was to print out every value of

num = Data[num];

and see exactly where it was walking. This created an approximately 8GB
text file, to record all one billion random steps in the table. I printed
the numbers in hex to the file, to try to keep the size down a bit. I particularly
wanted to study this case, because of its ridiculously short execution time.

Size in bytes--->680000000 5.38 Seconds

It turns out that there is a repeating pattern of 23 steps in that
program run. In other words, for about the first (roughly) 4143 steps,
the program is doing a random walk. Then the array walk gets stuck
in a repeating 23-step loop, which means it is no longer random
and is visiting the same locations over and over again. For
example, in the 1 billion step test, location 0x9EC3A1E is
visited 46,684,248 times. And that is why that particular run
"went like stink...". By visiting the same 23 locations over and
over, that test case stayed in cache, and didn't need to visit
the memory subsystem after step ~4143 of 1 billion.

You might consider printing the "num = Data[num]" numbers in a run,
then taking that resulting 8GB text file and sorting the numbers
into frequency bins, to get some idea whether all locations are getting
hit at about the same rate. Perhaps the 92.86-second run would be a good
one to choose, as I suspect the loop size in that one must be fairly large.
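
Something like the following would give the same frequency information
without the 8GB trace file, by counting hits per bin directly. (A sketch
only; it assumes the Data/size/Max names and the uint32 typedef from
your March 25 source.)

#include <stdint.h>
#include <stdio.h>

typedef uint32_t uint32;          /* assumed to match the original typedef */

#define BINS (1u << 20)           /* 2^20 equal-sized bins of the index space */

static unsigned long long bin_hits[BINS];

void count_bins(const uint32 *Data, uint32 size, uint32 Max)
{
    uint32 num = 0;
    for (uint32 N = 0; N < Max; N++) {
        num = Data[num];
        bin_hits[(unsigned long long)num * BINS / size]++;
    }
    /* Print only the bins that were hit; a heavily looping walk shows
       up as a handful of bins with enormous counts. */
    for (uint32 b = 0; b < BINS; b++)
        if (bin_hits[b] != 0)
            printf("%u %llu\n", (unsigned)b, bin_hits[b]);
}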

In any event, the random walk is not random. It is getting stuck in
a (short) loop. The TLB might still be a factor, but you'll need
a better test design to evaluate that.
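
If you just want the cycle length without any logging at all, Floyd's
tortoise-and-hare cycle detection can be run directly on the walk.
(Again a sketch, same assumptions about Data and the uint32 typedef.)

uint32 cycle_length(const uint32 *Data)
{
    /* The sequence num = Data[num] is eventually periodic, so the two
       pointers below are guaranteed to meet somewhere inside the cycle. */
    uint32 slow = Data[0];
    uint32 fast = Data[Data[0]];
    while (slow != fast) {
        slow = Data[slow];              /* one step  */
        fast = Data[Data[fast]];        /* two steps */
    }
    uint32 len = 1;                     /* walk once around the cycle */
    fast = Data[slow];
    while (fast != slow) {
        fast = Data[fast];
        len++;
    }
    return len;
}

For the 680000000-byte run this should come back with 23 almost immediately.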

(Your source code of March 25, 2010)

http://groups.google.ca/group/comp.os.linux.development.apps/msg/0d869538b2dc4d39?dmode=source

HTH,
Paul



From: Ersek, Laszlo on
On Mon, 5 Apr 2010, Paul wrote:

> Sorry to interrupt the thread. You had a question about variability in
> performance of a small test program. (Your "memory bandwidth" test...)
>
> Maybe you've figured it out already. It isn't a problem with TLB
> performance. The notion of a random walk is wrong.

> In other words, for about the first (roughly) 4143 steps, the program is
> doing a random walk. Then, the array walk gets stuck in a repeating, 23
> step loop.

> In any event, the random walk is not random. It is getting stuck in a
> (short) loop. The TLB might still be a factor, but you'll need a better
> test design, to evaluate that.

Would you comment on the code herein?

http://groups.google.com/group/comp.os.linux.development.apps/msg/aaf49b54b2501740?dmode=source

Thanks,
lacos
From: Nicolas George on
Paul wrote in message <hpc817$9kn$1(a)news.eternal-september.org>:
> In any event, the random walk is not random. It is getting stuck in
> a (short) loop.

Please note that this short loop is not proof of a flaw in the libc's
PRNG. It is merely a flaw in the algorithm used to generate the random walk.

For each cell, you (pseudo-)randomly select the next cell: with such an
algorithm, the probability of small cycles is very high. For example, the
probability of having at least one fixed point is (asymptotically) 1-1/e, or
about 63%. Of course, this by no means implies that you will indeed land on
a fixed point, but once you take all the possibilities of small cycles into
account, the probability rises.

You are, in fact, running into the birthday paradox: in small groups of
people, you will surprisingly often find two who share the same birthday.
Indeed, probability tells us that the odds go above 50% for groups of just 23
people (assuming births are uniform over the year; they are not, and
that only raises the odds).

In the general case, if you draw random integers until you see one for a
second time (which is equivalent to what you are doing; just imagine you
initialize your array lazily), then the odds of a repeat rise significantly
when the number of drawn values is about the square root of the number of
possible values. And the standard deviation is huge.

In other words, when you try your random walk on one billion cells, you are
very probably walking on fewer than 40000 of them, with non-negligible
odds of far fewer.
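
(For scale: sqrt(1 000 000 000) is about 31 600, which is roughly where
the "less than 40000" figure comes from; and since the spread around that
value is huge, a walk that gets trapped after only a few thousand steps,
as in the 680000000-byte run, is entirely unremarkable.)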

There is a solution to that in this case: it is possible to select the
transitions so as to avoid short cycles completely, so that the random walk
is a single cycle around the whole array.

To do that you can work from the standard random permutation algorithm:

for (i = 0; i < N; i++)
    s[i] = i;
for (i = 0; i < N - 1; i++) {
    j = random_between(i, N);
    swap(s[i], s[j]);
}

After that algorithm, s holds all the numbers between 0 and N - 1 exactly once,
in a completely random order. To turn that into a transition array, you can
just state that for all i, next[s[i]] = s[i + 1], plus next[s[N - 1]] = s[0]
to close the cycle. I am not sure if you can do without the auxiliary array.
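
A sketch of that construction in C (assuming 4-byte cells; rand() is used
only for brevity and has modulo bias, so substitute a better generator for
real measurements):

#include <stdlib.h>
#include <stdint.h>

typedef uint32_t uint32;    /* assumed to match the test program's typedef */

/* Build "next" so that following next[] from any cell visits every cell
   exactly once before returning to the start: one big cycle, no short
   loops.  Uses the auxiliary permutation array s described above. */
void build_single_cycle(uint32 *next, uint32 *s, uint32 N)
{
    for (uint32 i = 0; i < N; i++)
        s[i] = i;
    for (uint32 i = 0; i + 1 < N; i++) {             /* random shuffle */
        uint32 j = i + (uint32)(rand() % (N - i));   /* index in [i, N) */
        uint32 tmp = s[i]; s[i] = s[j]; s[j] = tmp;
    }
    for (uint32 i = 0; i + 1 < N; i++)
        next[s[i]] = s[i + 1];
    next[s[N - 1]] = s[0];                           /* close the cycle */
}

Walking num = next[num] then touches all N cells before it repeats, which
removes the short-cycle artifact from the timing.
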
From: Peter Olcott on
It is a "given" (something I can't change) that the web
server will have one thread per HTTP request. It is also a
"given" architectural decision to provide optimal
performance that I will have at least one thread per
processor core that doe the OCR processing. I might have two
threads per processor core to do OCR processing because I
will have two level of OCR processing priority, and the
first level of priority has absolute priority over the
second level.

I now understand much better how several IPC mechanisms
work, so I will probably use one of these very efficient
mechanisms. I am still leaning towards named pipes. Do you
know about the design tradeoffs between named pipes and
other forms of IPC? I guess it would also be a given that I
would have to force all disk writes to take place
immediately. It might also be good practice to use the
SQLite fault-recovery design pattern for disk writes.
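
(For forcing writes to disk, the usual POSIX approach is fsync() after
each write, or opening the log with O_SYNC; a minimal sketch, with a
made-up log path:)

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Minimal sketch: append one transaction record and force it to disk
   before continuing.  The log path and record format are made up. */
int append_record(const char *record)
{
    int fd = open("/var/log/ocr_transactions.log",
                  O_WRONLY | O_CREAT | O_APPEND, 0600);
    if (fd < 0)
        return -1;
    size_t len = strlen(record);
    int ok = (write(fd, record, len) == (ssize_t)len) && (fsync(fd) == 0);
    close(fd);
    return ok ? 0 : -1;
}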

The primary means of IPC in this case would be a sequential
transaction log file; the secondary means of IPC (to avoid a
polled interface) could either report changes to this file
or use named pipes.
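
(To sketch the named-pipe option concretely: the OCR process could block
in read() on a FIFO, and the web server could write one byte to it
whenever it appends a transaction. The path and function names below are
made up.)

#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Writer side, called by the web server after appending to the log. */
void notify_worker(void)
{
    int fd = open("/tmp/ocr_wakeup", O_WRONLY);   /* blocks until a reader exists */
    if (fd >= 0) {
        char c = 1;
        write(fd, &c, 1);
        close(fd);
    }
}

/* Reader side, the OCR process's wait loop. */
void wait_for_work(void)
{
    mkfifo("/tmp/ocr_wakeup", 0600);       /* harmless if it already exists */
    /* O_RDWR keeps a writer open on our own side (works on Linux), so
       read() blocks instead of hitting EOF when the web server closes
       its end between notifications. */
    int fd = open("/tmp/ocr_wakeup", O_RDWR);
    char c;
    while (read(fd, &c, 1) == 1) {
        /* a new transaction was logged; go read the log entries here */
    }
    close(fd);
}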

What do you think of these ideas?

"David Schwartz" <davids(a)webmaster.com> wrote in message
news:c714a588-2e5c-4022-be2b-3404d18e96e8(a)10g2000yqq.googlegroups.com...
On Apr 4, 5:33 am, "Peter Olcott" <NoS...(a)OCR4Screen.com>
wrote:

> So then are you saying that the previous respondent is
> wrong?

Not wrong, just incomplete.

> I will only have at most two file descriptors that will be
> waited on.

Then your best bet is either to have two threads, one blocking in
'read' on each file descriptor (blocking sockets), or one thread
blocking in 'poll' or 'select' (non-blocking sockets).
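
For illustration, a minimal poll() loop over two descriptors might look
like this (fd_a and fd_b are placeholders):

#include <poll.h>

/* Block until either descriptor is readable; no busy-waiting involved. */
void wait_for_input(int fd_a, int fd_b)
{
    struct pollfd fds[2] = {
        { .fd = fd_a, .events = POLLIN },
        { .fd = fd_b, .events = POLLIN },
    };
    if (poll(fds, 2, -1) < 0)            /* timeout -1 == wait forever */
        return;                          /* real code would check errno */
    if (fds[0].revents & POLLIN) {
        /* data available on fd_a; read it here */
    }
    if (fds[1].revents & POLLIN) {
        /* data available on fd_b; read it here */
    }
}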

> I really want to avoid checking to see if there is input
> available a billion times per second in a tight loop. The
> approach that I am considering is to check to see if input
> is available 100 times per second in a tight loop that
> sleeps for 10 ms using nanosleep().

There is no need to do that.

DS


From: Peter Olcott on

"Paul" <nospam(a)needed.com> wrote in message
news:hpc817$9kn$1(a)news.eternal-september.org...
> Peter Olcott wrote:
>
>>
>> Yeah so now that I understand how pipes work a little
>> better, and that the function named poll() is a misnomer
>> and does not actually do polling, I can provide a much
>> better solution.
>>
>
> Sorry to interrupt the thread. You had a question about
> variability in performance of a small test program.
> (Your "memory bandwidth" test...)
>
> Maybe you've figured it out already. It isn't a problem
> with TLB performance. The notion of a random walk is
> wrong.
>
> Your array was initialized with this.
>
> Random = rand() * rand();
> Random %= size;
> Data[N] = Random;
>
> And tested for a billion cycles with this.
>
> for (uint32 N = 0; N < Max; N++)
> num = Data[num];
>
> (In my previous posting, I'd missed the fact that you were
> doing a constant number of steps in this loop and varying the
> array size. In which case, the time to execute this walk should
> be constant, as long as the time to do the 1 billion memory accesses
> was the same.)

No. The time should (and does) generally increase because of
decreasing cache spatial locality of reference. The random
generator was already modified:

uint32 GetRandom(uint32 size) {
    /* 64-bit arithmetic so that RAND_MAX + 1 does not overflow int */
    unsigned long long r = (unsigned long long)rand() * ((unsigned long long)RAND_MAX + 1)
                         + (unsigned long long)rand();
    return r % size;
}

This is still not random enough but was not worth the time
to fix on this quick and dirty little proof of concept
program. Simply run the test ten times on every memory size
and ignore the fast times.

>
> So I tried a technique to lock the array in memory, and these
> are the execution time results with the new storage method.
> (Results collected on Debian 5.04 AMD64, 2GB RAM system, 2.6GHz Core2 Duo)
>
> Size in bytes--->400000000 12.31 Seconds
> Size in bytes--->440000000 12.67 Seconds
> Size in bytes--->480000000 10.45 Seconds
> Size in bytes--->520000000 13.30 Seconds
> Size in bytes--->560000000 12.83 Seconds
> Size in bytes--->600000000 13.57 Seconds
> Size in bytes--->640000000 18.03 Seconds
> Size in bytes--->680000000 5.38 Seconds 761.34 Megabytes Per Second
> Size in bytes--->720000000 21.61 Seconds
> Size in bytes--->760000000 13.99 Seconds
> Size in bytes--->800000000 28.38 Seconds
> Size in bytes--->840000000 13.28 Seconds
> Size in bytes--->880000000 92.86 Seconds
> Size in bytes--->920000000 11.78 Seconds
> Size in bytes--->960000000 15.34 Seconds
> Size in bytes--->1000000000 15.63 Seconds
> Size in bytes--->1040000000 67.98 Seconds
> Size in bytes--->1080000000 13.08 Seconds
> Size in bytes--->1120000000 12.81 Seconds
> Size in bytes--->1160000000 98.32 Seconds
> Size in bytes--->1200000000 14.66 Seconds

Yes. Because the random numbers are not distributed evenly enough
across memory, you must run the test ten times for every memory size
and only look at the slower times for each memory size.

>
> You see, there is really no pattern to this at all, and the (constant)
> time expected varies from 5.38 seconds to 92.86 seconds, which is
> a huge variation. So I figured it was time for another hypothesis.

There is a huge difference when testing 1-10 MB against 100-1500 MB.
Because of cache spatial locality of reference, there is about a
ten-fold difference in speed for the small sizes as compared to the
huge sizes.

> [rest of Paul's post, quoted in full above, snipped]