From: Andy 'Krazy' Glew on
On 5/5/2010 5:41 AM, Torben Ægidius Mogensen wrote:
> Andy 'Krazy' Glew<ag-news(a)patten-glew.net> writes:
>> On 5/4/2010 1:23 AM, Torben Ægidius Mogensen wrote:
>>> Since the function result is hard to predict,
>>> it would make sense to just block the thread
>>> when it needs the function result
>>
>> I have done this. It cuts the limit study performance down quite a bit.
>>
>> Instead I try to skip around the often quite small amounts of code that depend on the function value.
>
> It should be possible to statically schedule the code so code that
> depends on the return value is pushed as far back as possible. This
> way, you can keep the hardware simpler. This doesn't solve the
> error-value problem, though.

My SpMT threads would often run for thousands of instructions after they forked. I have rarely seen compilers able to
statically move code that far.

In fact, I would often use length as a criterion: since forking threads has overhead, only fork threads that will run
for a certain minimum time.

E.g. if running on a multicore with simple, non-OOO, non-renaming cores, fork overhead is higher, so you need longer
threads. Whereas on my preferred substrate, a multicore capable of using renaming tricks and "runt
threads" to reduce fork overhead, you can use shorter threads.
I refused, however, to allow myself to have a "magic wand", and magically transfer state to a distant, or even
neighboring core. Limit studies along this line can be done.
By the way: the fewer the registers, the faster the fork. As x86 has added more and more registers (SSE, etc.), this
has become worse. Tracking which instruction set extensions are actually in use helps.
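For instance, a sketch of such tracking; the register classes and the copy routine are illustrative, not any real ISA's:

    /* Per-thread mask of register classes actually written, so a fork
       copies only the classes the parent has touched. */
    enum reg_class { USED_GPR = 1, USED_X87 = 2, USED_SSE = 4, USED_AVX = 8 };

    struct regs { unsigned used_mask; /* ... register contents ... */ };

    extern void copy_class(struct regs *dst, const struct regs *src, int cls);

    void fork_copy_registers(struct regs *child, const struct regs *parent)
    {
        for (int cls = USED_GPR; cls <= USED_AVX; cls <<= 1)
            if (parent->used_mask & cls)   /* skip untouched classes */
                copy_class(child, parent, cls);
        child->used_mask = parent->used_mask;
    }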

While I am suspicious of compilers' ability to move the function-result-dependent code as far as is needed, compiler
cooperation can accomplish a complementary goal: instead of forking at the instruction right after the CALL, the
compiler could indicate where the code after the CALL that is independent of the function result begins. That is really just a
special case of a generic hint instruction that says where control- and data-independent code starts.
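Something like the following, where __spmt_independent_begin() is a hypothetical hint intrinsic, not an existing compiler feature:

    extern int f(int);                          /* hard-to-predict result */
    extern int expensive_work(int);
    extern void __spmt_independent_begin(void); /* hypothetical hint */

    void caller(int x, int *out)
    {
        int r = f(x);
        __spmt_independent_begin();   /* code below does not read r */
        int y = expensive_work(x);    /* the speculative thread starts here */
        *out = r + y;                 /* first true use of r; join here */
    }
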
By the way: it is trivial for hardware to determine the register-dataflow-independent parts, if hardware knows the
calling convention. (Of course, x86 has several.) The biggest compiler contribution would be the memory dataflow
analysis, since the mechanisms already in the processor let hardware do this only for a small window, not for
the scope of a large tree of nested function calls. Hardware to do memory dataflow analysis can be imagined, but it is
new hardware not needed for other purposes, and I always want to reduce the amount of new hardware.
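To make the register side concrete, here is a sketch of the check, using the x86-64 System V convention purely for illustration: an instruction after the CALL whose sources avoid every register the callee may clobber cannot depend on the return value.

    /* Registers the callee may write under x86-64 System V (GPRs only;
       a real check would also cover XMM and x87 state). */
    enum { RAX, RCX, RDX, RBX, RSP, RBP, RSI, RDI, R8, R9, R10, R11 };

    static const unsigned CALLEE_MAY_WRITE =
        (1u << RAX) | (1u << RCX) | (1u << RDX) | (1u << RSI) |
        (1u << RDI) | (1u << R8)  | (1u << R9)  | (1u << R10) | (1u << R11);

    /* src_regs: bitmask of the registers an instruction reads */
    int register_independent_of_call(unsigned src_regs)
    {
        return (src_regs & CALLEE_MAY_WRITE) == 0;
    }
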
The same goes for control flow analysis. Hardware to find control-flow independence is easier than memory-dataflow
hardware, but it is still new. My preferred "hardware" mechanism is to have a state machine or microcontroller browse over the BTB - which
is essentially "software" or "firmware". Compiler static analysis can have bigger scope, although less dynamic information.
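A toy sketch of that firmware idea: walk the predicted path from both successors of the branch, and take the first address that shows up on both walks as the control-independent reconvergence point. btb_next() is an illustrative stand-in for reading the predicted next fetch address out of the BTB.

    #include <stddef.h>

    extern unsigned long btb_next(unsigned long pc);  /* 0 = no BTB entry */

    #define MAX_WALK 64

    unsigned long find_reconvergence(unsigned long taken, unsigned long fallthru)
    {
        unsigned long a[MAX_WALK], b[MAX_WALK];
        size_t na = 0, nb = 0;
        unsigned long pa = taken, pb = fallthru;

        while ((na < MAX_WALK && pa) || (nb < MAX_WALK && pb)) {
            if (na < MAX_WALK && pa) { a[na++] = pa; pa = btb_next(pa); }
            if (nb < MAX_WALK && pb) { b[nb++] = pb; pb = btb_next(pb); }
            for (size_t i = 0; i < na; i++)        /* any common address yet? */
                for (size_t j = 0; j < nb; j++)
                    if (a[i] == b[j]) return a[i];
        }
        return 0;   /* no reconvergence found within the window */
    }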

From: Stefan Monnier on
>> By the way, that reminds me: malloc was often the limiter on
>> performance. Needlessly so. Many malloc implementations are
>> serialized, incrementing pointers, etc.
> Many malloc implementations are needlessly costly, period. If you
> allocate and deallocate a lot of small objects, GC is nearly always
> faster. And if you don't, the cost of either GC or malloc/free is
> unimportant.

The issue is not GC vs non-GC, but just that memory allocation libraries
tend to have the property that the result of a call depends on all the
preceding calls, and that the malloc call has special semantics: the
precise value returned doesn't matter.

When adding support for parallelism in a language implementation,
providing per-thread memory allocation is one of the first steps; but in
the legacy-SpMT case, that's not as easy.


Stefan
From: Andy 'Krazy' Glew on
On 5/5/2010 12:00 PM, Stefan Monnier wrote:
>>> By the way, that reminds me: malloc was often the limiter on
>>> performance. Needlessly so. Many malloc implementations are
>>> serialized, incrementing pointers, etc.
>> Many malloc implementations are needlessly costly, period. If you
>> allocate and deallocate a lot of small objects, GC is nearly always
>> faster. And if you don't, the cost of either GC or malloc/free is
>> unimportant.
>
> The issue is not GC vs non-GC, but just that memory allocation libraries
> tend to have the property that the result of a call depends on all the
> preceding calls, and that the malloc call has special semantics: the
> precise value returned doesn't matter.
>
> When adding support for parallelism in a language implementation,
> providing per-thread memory allocation is one of the first steps; but in
> the legacy-SpMT case, that's not as easy.

And if trying to make malloc friendly to an aggressively speculative machine, you ILP-ize malloc.

E.g. instead of

void *malloc(int nbytes) {
    ...
    /* one global bump pointer: each call's result depends on the
       pointer updated by every preceding call */
    if( nbytes < current_chunk_size ) {
        char* ret = current_ptr;
        current_ptr += nbytes;
        current_chunk_size -= nbytes;
        return ret;
    }
    ...
}

or some variant such as the one below, which keeps a separate pool per thread and per quantized size class

void *malloc(int nbytes) {
    ...
    /* pool selected by thread and by size class */
    allocation_pool = &(allocation_list[current_thread].chunk_list_by_size[sized_chunk_index(nbytes)]);
    if( nbytes < allocation_pool->bytes_left ) {
        char* ret = allocation_pool->current_ptr;
        allocation_pool->current_ptr += nbytes;
        allocation_pool->bytes_left -= nbytes;
        return ret;
    }
    ...
}

instead, do

void *malloc(int nbytes) {
    ...
    /* pool selected by hashing thread and size together */
    allocation_pool = &(hash_table_of_allocation_lists[hash(current_thread,nbytes)]);
    if( nbytes < allocation_pool->bytes_left ) {
        char* ret = allocation_pool->current_ptr;
        allocation_pool->current_ptr += nbytes;
        allocation_pool->bytes_left -= nbytes;
        return ret;
    }
    ...
}

(Note: I have omitted locking / synchronization.)
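For concreteness, here is a sketch of the pool record and hash function the last version assumes; the names and sizes are illustrative:

    struct allocation_pool {
        char *current_ptr;   /* bump pointer into the current chunk */
        int   bytes_left;    /* space remaining in the current chunk */
    };

    #define N_POOLS 1024
    struct allocation_pool hash_table_of_allocation_lists[N_POOLS];

    extern int sized_chunk_index(int nbytes);  /* size-class map, as above */

    unsigned hash(unsigned current_thread, int nbytes)
    {
        return (current_thread * 31u + (unsigned)sized_chunk_index(nbytes)) % N_POOLS;
    }
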

The last version parallelizes for multiple threads and cores, without having to dedicate completely separate
allocation pools per thread. It also parallelizes for ILP, like SpMT.

Malloc's not-really-needed serializations are rarely a problem for present-day OOO machines, but may well become a
problem if instruction windows continue to grow. SpMT, of course, is really just OOO with a very big instruction window.