From: David Schwartz on
On Apr 5, 9:07 pm, "Peter Olcott" <NoS...(a)OCR4Screen.com> wrote:

> The only lock that I will need is when the customer's
> account balance is updated, this could be an exception to
> preempting.

So your threads will never allocate memory? Never access files?

> I must do the 3.5 minute job, but every 50 ms job can
> preempt it. It is more important that the 50 ms job gets
> done (in as best as can be accomplished) within 500 ms
> response time, than it is that the 3.5 minute job achieve
> less than 12 hour response time.

So code that. Don't try to use some clever trick. If you want the 50ms
job to run and the 3.5 minute job to yield, then write the code that
runs the 3.5 minute job such that it waits if there's a higher-
priority task.
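
For example, here is a rough sketch of what I mean (the counter, the
mutex/condvar and the job functions are invented names, not your code --
plug in whatever your real job structure is):

#include <pthread.h>

struct big_job;                                /* your 3.5 minute unit of work     */
int  big_job_done(struct big_job *job);        /* supplied by you (invented names) */
void big_job_do_one_chunk(struct big_job *job);

static pthread_mutex_t lock    = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  drained = PTHREAD_COND_INITIALIZER;
static int high_prio_pending = 0;              /* how many 50 ms jobs are waiting  */

void run_big_job(struct big_job *job)
{
    while (!big_job_done(job)) {
        big_job_do_one_chunk(job);             /* keep each chunk down to a few ms */

        pthread_mutex_lock(&lock);
        while (high_prio_pending > 0)          /* yield until the 50 ms work drains */
            pthread_cond_wait(&drained, &lock);
        pthread_mutex_unlock(&lock);
    }
}

The threads handling the 50 ms jobs increment high_prio_pending (under the
same mutex) when a request arrives and broadcast 'drained' when the count
drops back to zero. The 3.5 minute job then gets out of the way by itself,
and the scheduler is free to give the 50 ms work a whole core -- without
you pinning anything.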

> I always want the 50 ms jobs to have all of the CPU
> resources, one whole CPU core to itself.

Right, but that's the wrong way to think about it. Don't think "I have
to make the 50 ms job get a CPU by itself"; think "I have to write my
threads so that they do the work I want done, and the scheduler will
find the most efficient way to do that (which I suspect is giving it
its own CPU)".

You're trying to design across levels of abstraction. That makes
things much harder and is much less likely to work.

For example, suppose you have four cores and five high-priority tasks.
Suppose the memory allocator lock is held by a lower-priority task. If
you have locked all four cores exclusively to the high-priority tasks,
and all high-priority tasks are blocked on memory allocation, further
forward progress will be *impossible*.

So you typically want the higher-priority tasks to get dedicated
cores, but you definitely don't want to actually enforce that. Because
that's a *means* to the end you want and not the end you want. Code
the end, not the means.

DS
From: David Schwartz on
On Apr 5, 9:09 pm, "Peter Olcott" <NoS...(a)OCR4Screen.com> wrote:

> If I have 8MB of cache then random access to a 2 GB memory
> space will have little benefit from the 8MB cache. I know
> this from empirical testing and analysis. Do you disagree?

I totally disagree. In fact, you presented your own counter example.
You yourself had a case where what the programmer thought was random
access actually degenerated into a tight loop that completely fit in
cache. Your access was random. Your access was to a 2 GB memory space.
Yet it benefited massively from cache.

DS
From: Paul on
Peter Olcott wrote:
> If I have 8MB of cache then random access to a 2 GB memory
> space will have little benefit from the 8MB cache. I know
> this from empirical testing and analysis. Do you disagree?
>
> "David Schwartz" <davids(a)webmaster.com> wrote in message
> news:61b0d032-d1ad-43d5-94ac-ff02efd2eca7(a)g30g2000yqc.googlegroups.com...
> On Apr 5, 12:43 pm, "Peter Olcott" <NoS...(a)OCR4Screen.com>
> wrote:
>
>> I am trying to find algorithm that makes cache as
>> ineffective as possible. I want to get an accurate measure
>> of the worst case performance of my deterministic finite
>> automaton. The only thing that I can think of is to make
>> sure that each memory access is more than max cache size
>> away from the prior one. This should eliminate spatial
>> locality of reference.
>
> You have some fundamental misunderstandings about how cache works. If
> you have, say, a 512KB cache, there is no spatial locality difference
> between two accesses 511KB apart and two accesses 513KB apart. A 512KB
> cache will treat those accesses precisely the same.
>
> DS
>

[First - apologies for the code below. I'm a hardware guy, and write
code about once a year, so never have a clue what I'm doing :-) ]

I have some more test results.

To solve the need to walk the memory array, I decided to use a
maximal-length LFSR. What a maximal-length LFSR gets you is a guarantee
that you visit every location in the power-of-two memory array, save one
location (BigArr[0]), without ever getting stuck in a loop. What it does
not give you is any statement about the statistics of the walk (whether
hops are big or small, how hop distances are distributed, and so on).

http://www.edn.com/archives/1996/010496/01df4.htm

For example, I set my memory array size to a power of two. This is from
my source file "26.c". (67,108,864 uint32 entries is 268,435,456 bytes,
i.e. the 268435456 row in the results table below; the 26-bit LFSR
period is 67,108,863.)

const uint32 size = 67108864;
typedef uint32 BigArr[67108864]; /* allocated from my locked memory pages */

I modified the Initialize routine.

int Initialize2(BigArr Data) {
    uint32 N;
    uint32 lfsr = 0x03FFFFFF; /* 26 bit LFSR, 67,108,863=3FFFFFF, [0,1,5,25] */
                              /* X**25 + X**5 + X + 1                        */
                              /* bits are stored X**0 ... X**25              */
    uint32 bit;
    uint32 period = 0;

    N = lfsr; /* Compute value to store in seed location first */
    /* All zeros is illegal state for this LFSR, so all 1's is chosen as seed */
    do {
        /*      X**25          X**5           X**1           X**0 */
        bit = ((lfsr >> 0) ^ (lfsr >> 20) ^ (lfsr >> 24) ^ (lfsr >> 25)) & 1;
        lfsr = (lfsr >> 1) | (bit << 25);

        Data[N] = lfsr;
        N = lfsr;
        period++;
    } while (lfsr != 0x03FFFFFF); /* stop when the sequence starts to repeat */

    printf("period = %x\n", period); /* should be 3FF FFFF in this example */

    char M;
    printf("Hit any key to Continue:");
    scanf("%c", &M);
    return 0;
}

The Process routine gets a slight modification to the starting
location used in the memory array. If you start at location 0,
it gets stuck in a tight loop testing location 0. Location 0 in
the array is not to be used, except to show what an "array walk
of size one" would look like.

double Process(BigArr Data) {

    uint32 num = 0x03FFFFFF;
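
For anyone reproducing this without the full source: the measurement is
just a timed num = Data[num] pointer chase. The bandwidth figures below
work out to 4096MB of uint32 loads per run, i.e. about 2^30 dependent
loads. Roughly this shape (sketch only -- the timing calls, iteration
count and names here are my own, not necessarily what is in 26.c):

#include <stdio.h>
#include <time.h>

double Process2(BigArr Data) {
    uint32 num = 0x03FFFFFF;           /* start at the seed, not at location 0   */
    uint32 i;
    struct timespec t0, t1;
    double secs;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < 1073741824u; i++)  /* ~2^30 accesses = 4GB of uint32 loads   */
        num = Data[num];               /* dependent load: defeats prefetching    */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.2f Seconds %.2f Megabytes Per Second\n", secs, 4096.0 / secs);
    return (double)num;                /* keep the chase from being optimized out */
}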

These are the results. My E4700 processor has 2MB of L2 cache and 64KB of
L1, and I didn't reduce the memory array size all the way down to L1 size.
I was using hugetlbfs (hugepagesz=2M, hugepages=600) for a total of 1200MB
of locked memory. The memory allocation remained the same for each test,
and only the memory array (a section of that allocation) was tested. So
even when I'm testing a 524288-byte memory array, I still have 1200MB of
locked pages allocated.

The first line here is testing num = data[num], where data[num] contains
0 and num = 0, so the test keeps cycling on the same location. Memtest86+
reports about 3GB/sec memory bandwidth for the memory subsystem of my
computer, so compare the 945.96MB/sec to that figure. Since this loop
should be running in L1, I don't understand why it isn't faster; the L1
and L2 bandwidths reported by memtest86+ are higher than 3GB/sec.

(Access the same location via data[0]=0) 4.33 Seconds 945.96 Megabytes Per Second

Size in bytes---> 524288 period = 1ffff 8.62 Seconds 475.17 Megabytes Per Second 17.c
Size in bytes---> 1048576 period = 3ffff 8.76 Seconds 467.58 Megabytes Per Second 18.c
Size in bytes---> 2097152 period = 7ffff 9.12 Seconds 449.12 Megabytes Per Second 19.c
Size in bytes---> 4194304 period = fffff 57.40 Seconds 71.36 Megabytes Per Second 20.c
Size in bytes---> 8388608 period = 1fffff 80.50 Seconds 50.88 Megabytes Per Second 21.c
Size in bytes---> 16777216 period = 3fffff 91.85 Seconds 44.59 Megabytes Per Second 22.c
Size in bytes---> 33554432 period = 7fffff 97.30 Seconds 42.10 Megabytes Per Second 23.c
Size in bytes---> 67108864 period = ffffff 100.36 Seconds 40.81 Megabytes Per Second 24.c
Size in bytes---> 134217728 period = 1ffffff 104.07 Seconds 39.36 Megabytes Per Second 25.c
Size in bytes---> 268435456 period = 3ffffff 105.87 Seconds 38.69 Megabytes Per Second 26.c
Size in bytes---> 536870912 period = 7ffffff 106.93 Seconds 38.31 Megabytes Per Second 27.c
Size in bytes--->1073741824 period = fffffff 107.67 Seconds 38.04 Megabytes Per Second 28.c

My processor is supposed to have two TLB arrays internally. The huge page
one is supposed to have 32 entries. If each one is 2MB (I'm running a 64 bit
OS), then in theory 64MB of memory array walking should stay in the TLB. But
I'm not really sure whether the gradual change is due to a TLB effect
or not.

The L2 cache is beginning to "cut in the boosters" at the 2097152 memory array size
mark. So at least the transition from 57.40 seconds down to 9.12 seconds makes
some sense.

Anyway, that is a first cut at a test method. At least the test
times aren't all over the place, like my previous table.

I was not testing your "std::vector<uint32> Data;" method, so you'll
have to take a crack at that yourself, to see how much of a difference
that approach makes.

HTH,
Paul
From: Ersek, Laszlo on
On Mon, 5 Apr 2010, David Schwartz wrote:

> If you don't want a thread to be doing the 3.5 minute job, because
> there's something more important to do, for the love of god DON'T CODE
> IT TO DO THAT JOB.
>
> What you're trying to do is code the thread to do one thing and then use
> some clever manipulation to get it to do something else. Just code the
> thread to do the work you want done and you won't have to find some way
> to pre-empt it or otherwise "trick" it.

One way to do this, I guess, is to enable all threads to handle all kinds
of tasks, both low priority and high priority. A thread executing a low
priority job should quickly notice the availability of a high priority
job, put the half-baked low prio job back on the low prio work queue,
and switch to the high prio job.

This would mean the following:

- The state of the low prio job cannot be kept on the stack (or more
precisely, in auto variables) for the entire duration between the start of
that job and the completion of the job. When the switch(es) to the high
prio job(s) occur(s), the low prio job must be "serialized" (at least
conceptually) and put back to the low prio work queue.

- A way must be devised to notify, more or less asynchronously, the
thread(s) working on low prio jobs:

(a) Asynchronous thread cancellation would kind of defeat the whole idea.

(b) Checking periodically for a locked resource (e.g. the head of the high
prio work queue) could work on Linux without incurring a huge performance
penalty, because I'm aware (on a gossip level) of mutexes being
implemented there with futexes, i.e. in user space. I guess this would be a
good use case for the pthread_mutex_trylock() function (see the sketch
after this list).

(c) Atomic (lockless) access to a boolean variable could be specified in a
non-portable way. (Eg. in assembly or maybe with gcc-specific types or
attributes.)

(d) A signal could be generated for the worker thread specifically. The
handler would set a volatile sig_atomic_t object with boolean meaning. The
handler setting the flag asynchronously and the code checking the flag
would have to be executed by the same thread, see
<http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2016.html>,
"Existing portable uses" / 3.

(e) Perhaps thread-specific data might be summoned in (d). I'm led to the
idea by a recent reddit comment
<http://www.reddit.com/r/programming/comments/bmnkl/pthreads_as_a_case_study_of_good_api_design/c0nj62j>
stating how fast accesses to thread-specific data on common platforms are.
Perhaps they exhibit some atomic properties too.
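
To make the shape of (b) concrete, a rough sketch (the queue and job
types are placeholders, not a real API):

#include <pthread.h>
#include <stddef.h>

struct job;                                           /* opaque unit of work          */
struct work_queue {
    pthread_mutex_t lock;
    struct job *head;                                 /* NULL when the queue is empty */
};

extern struct work_queue high_prio_q, low_prio_q;

struct job *queue_pop_locked(struct work_queue *q);   /* caller holds q->lock; NULL if empty */
void queue_push(struct work_queue *q, struct job *j); /* takes q->lock itself                */
int  job_step(struct job *j);                         /* one small slice; 0 when done        */
void run_to_completion(struct job *j);

void work_on_low_prio(struct job *j)
{
    while (job_step(j)) {                             /* one slice of the low prio job */
        if (pthread_mutex_trylock(&high_prio_q.lock) == 0) {
            struct job *hp = queue_pop_locked(&high_prio_q);
            pthread_mutex_unlock(&high_prio_q.lock);
            if (hp != NULL) {
                queue_push(&low_prio_q, j);           /* "serialize" and put it back  */
                run_to_completion(hp);                /* do the high prio job now     */
                return;                               /* some thread picks j up later */
            }
        }
        /* EBUSY from trylock just means: check again after the next slice */
    }
}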

lacos
From: Noob on
Peter Olcott wrote:

> I will be receiving up to 100 web requests per second (that
> is the maximum capacity) and I want to begin processing them
> as soon as they arrive hopefully without polling for them.
> If I have to I will poll every 10 ms, in a thread that
> sleeps for 10 ms.

You might enjoy the following article.

The C10K problem, by Dan Kegel
http://www.kegel.com/c10k.html
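
For what it's worth, the non-polling approach that article describes boils
down to blocking in the kernel until a request actually arrives, e.g. with
epoll on Linux. A minimal sketch (it assumes a listening socket listen_fd
has already been set up elsewhere, and leaves out all error handling and
the actual request processing):

#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

void serve(int listen_fd)
{
    struct epoll_event ev, events[64];
    int epfd = epoll_create1(0);

    ev.events  = EPOLLIN;
    ev.data.fd = listen_fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

    for (;;) {
        /* Blocks until a connection is ready -- no 10 ms sleep/poll loop. */
        int n = epoll_wait(epfd, events, 64, -1);
        for (int i = 0; i < n; i++) {
            if (events[i].data.fd == listen_fd) {
                int conn = accept(listen_fd, NULL, NULL);
                /* hand conn off to a worker thread, or add it to epfd ... */
                close(conn);                          /* placeholder       */
            }
        }
    }
}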

Regards.