From: Ersek, Laszlo on
On Mon, 29 Mar 2010, M�ns Rullg�rd wrote:

> David Schwartz <davids(a)webmaster.com> writes:
>
>> On Mar 27, 7:21�am, "Peter Olcott" <NoS...(a)OCR4Screen.com> wrote:
>>
>>> That may slow things down too much, if I want true random I
>>> an get a hardware random number generator dongle. For my
>>> purposes I only need the numbers to be very widely
>>> dispersed. I could generated my own pseudo random numbers
>>> based on a lookup table.
>>
>> On Linux, you can open /dev/urandom and read an unlimited supply of
>> random bits. So long as your distribution sets things up properly,
>> they should pass most statistical tests for randomness and should be
>> cryptographically strong.
>
> The problem here isn't so much the true randomness of the sequence as
> the distribution and the avoiding of accesses turning into a short
> loop.


(The local nntp server I used to use seems to have a problem with
forwarding my messages outwards. At least I was unable to find my messages
with google, and they appear absent also from news.eternal-september.org.
I went kind of crazy seeing this question discussed in multiple newsgroups
and nobody reflecting on my pertinent(?) message -- reposted below. I'm
switching to a different news client and the aforementioned news server,
in the hope that I'll be able to repost Sattolo's name to all relevant
newsgroups. Excuse the massive cross-posting. I set Followup-To to c.u.p.,
because that's where I saw the topic first.)

lacos


X-News: ludens comp.unix.programmer:167386
From: lacos(a)ludens.elte.hu (Ersek, Laszlo)
Subject: Re: Can extra processing threads help in this case?
Date: 24 Mar 2010 02:49:32 +0100
Message-ID: <RfITPGAL3QUd(a)ludens>

In article <pe9k77-gjk.ln1(a)wilbur.25thandClement.com>, William Ahern <william(a)wilbur.25thandClement.com> writes:
> Peter Olcott <NoSpam(a)ocr4screen.com> wrote:
>> "Eric Sosman" <esosman(a)ieee-dot-org.invalid> wrote in
>> message news:ho5tof$lon$1(a)news.eternal-september.org...
> <snip>
>> > But if there's another CPU/core/strand/pipeline, it's possible that one
>> > processor's stall time could be put to productive use by another if
>> > there were multiple execution threads.
> <snip>
>> is there any possible way that this app is not memory bound that you can
>> provide a specific concrete example of?
>
> Your question was answered.
>
> You're hung up on your numbers and preconceived ideas. Your application
> could be BOTH memory bound AND able to benefit from multiple CPUs. But it's
> nearly impossible to guess without knowing at least the algorithm; more
> specifically, the code.

This thread has inspired me to write some code. I wrote it last night in
a sleep deprivation induced delirium. (I almost modified the upper limit
of LOG2_NEL from 31 to 32 before posting, but then decided to post the
version with 31 so that it works where "size_t" is a 32-bit unsigned
integer type with conversion rank at least that of "int".) I was fairly
sure about knowing what it measures. Now I'm fairly sure I was stupid.
So please tell me, I ask you kindly, what does the following code
measure? Thank you.

First some test results (inserting whitespace generously):

$ time -p ./jump_around 1

starting threads
14.5209 s/thread

real 16.43
user 16.37
sys 0.02


$ time -p ./jump_around 2

starting threads
15.571 s/thread

real 17.64
user 32.95
sys 0.05


$ time -p ./jump_around 4

starting threads
15.587 s/thread

real 33.25
user 64.16
sys 0.08


CPU (from <http://mysite.verizon.net/pchardwarelinks/current_cpus.htm>):

Athlon 64 X2 6000+ MMX 3DNow! SSE SSE2 SSE3 (Windsor)
(128-bit on-Die unbuffered DDR2 PC6400 mem controller, AMD-V)
(dual core)

940 pins, 3000MHz (200x15), (64-bit dual-pumped bus), 1.4v

2x 64KB data (2-way)
2x 64KB instruction (2-way)
2x 1MB on-Die unified L2 (16-way exclusive)


Here cometh the code. Please excuse the bugs, both the blatant and the
insidious.

Thanks,
lacos


/*
$Id: jump_around.c,v 1.4 2010/03/23 00:02:22 lacos Exp $

This program creates a single-cycle permutation of [0, 2**LOG2_NEL - 1] in a
static array with Sattolo's algorithm. Then it starts the specified number of
threads. Each sub-thread will traverse the permutation NROUND times. Each
thread starts at a different position in the loop (at offsets increasing
simply one by one in the buffer). Since the shuffle is driven by a uniform
distribution PRNG, the sub-threads should divide the loop more or less evenly
among themselves. The program measures and prints the user CPU time consumed
by a single sub-thread on average. Waiting for memory should be included in
this value.

Compile and link with eg.

$ gcc -ansi -pedantic -Wall -Wextra -o jump_around jump_around.c -l pthread

uint32_t and uint64_t are used only when their sizes are considered
significant for some reason.
*/

#define _XOPEN_SOURCE 500


#include <stdlib.h> /* size_t, strtol(), EXIT_FAILURE, jrand48() */
#include <inttypes.h> /* uint32_t, uint64_t */
#include <assert.h> /* assert() */
#include <stdio.h> /* fprintf(), fflush() */
#include <sys/resource.h> /* getrusage() */
#include <pthread.h> /* pthread_create(), pthread_join() */


/* Number of rounds a single sub-thread traverses the cycle. */
#define NROUND 10u

/* Highest number of sub-threads allowed on the command line. */
#define MAXTHR 4L

/*
Base two logarithm of the number of elements in the array (that is, log2 of
the cycle length). Make sure it is in [log2(MAXTHR), 31].
*/
#define LOG2_NEL 24


/* Number of elements; IOW, cycle length. */
#define NEL ((size_t)1 << LOG2_NEL)


/* 28 == LOG2_NEL will result in a 1GB array. */
static uint32_t buf[NEL];


/*
"*arg" points to a size_t object holding the start offset in the buffer.
*/
static void *
meander(void *arg)
{
unsigned round;

for (round = 0u; round < NROUND; ++round) {
size_t pos,
cnt;

pos = *(const size_t *)arg;
for (cnt = 0u; cnt < NEL; ++cnt) {
pos = buf[pos];
}
assert(pos == *(const size_t *)arg);
}

return 0;
}


int
main(int argc, char **argv)
{
long nthr;

/* Get number of sub-threads. */
{
char *endptr;

if (2 != argc || (nthr = strtol(argv[1], &endptr, 10)) < 1L
|| nthr > MAXTHR || '\0' != *endptr) {
(void)fprintf(stderr, "pass number of threads (1-%ld)\n", MAXTHR);
return EXIT_FAILURE;
}
}

/* Initialize array with Sattolo's algorithm. */
{
size_t cnt;
short unsigned xsubi[3] = { 1u, 2u, 3u };

/* Identity permutation. */
for (cnt = 0u; cnt < NEL; ++cnt) {
buf[cnt] = cnt;
}

/* Swaps. */
for (cnt = 0u; cnt < NEL - 1u; ++cnt) {
uint32_t save;
size_t rnd;

save = buf[cnt];
/*
Choose a pseudo-random number in [0, NEL - 2u - cnt] with uniform
distribution.

(uint32_t)jrand48(xsubi) covers [0, 2**32 - 1].

See also the constraint on LOG2_NEL.
*/
rnd = (uint64_t)(uint32_t)jrand48(xsubi) * ((NEL - 2u - cnt) + 1u) >> 32;
buf[cnt] = buf[cnt + 1u + rnd];
buf[cnt + 1u + rnd] = save;
}
}

(void)fprintf(stderr, "starting threads\n");

{
int ret;
struct rusage start, stop;

ret = getrusage(RUSAGE_SELF, &start);
assert(0 == ret);

{
long widx;
size_t startpos[MAXTHR];
pthread_t worker[MAXTHR];

for (widx = 0L; widx < nthr; ++widx) {
startpos[widx] = widx;
ret = pthread_create(worker + widx, 0, &meander, startpos + widx);
assert(0 == ret);
}

for (widx = 0L; widx < nthr; ++widx) {
ret = pthread_join(worker[widx], 0);
assert(0 == ret);
}
}

ret = getrusage(RUSAGE_SELF, &stop);
assert(0 == ret);

(void)fprintf(stdout, "%Lg s/thread\n",
(((long double)stop.ru_utime.tv_usec - start.ru_utime.tv_usec)
/ 1000000.0L + stop.ru_utime.tv_sec - start.ru_utime.tv_sec) / nthr);
if (EOF == fflush(stdout)) {
return EXIT_FAILURE;
}
}

return EXIT_SUCCESS;
}