From: George Neuner on
On Fri, 09 Jul 2010 22:03:31 +0200, Terje Mathisen <"terje.mathisen at
tmsw.no"> wrote:

>MitchAlsup wrote:
>> But beyond this, the predictors are really swimming upstream against
>> the current of random placement of semi-random sized objects allocated
>> on a heap. Especially when subject to garbage collection.
>
>I think this is a key issue:
>
>Garbage collectors seem to me to be _the_ proper spot to implement data
>location optimizations:
>
>Since most garbage-collectable languages allow the collector to have
>full knowledge of all pointer data items, it should also be capable of
>detecting left/right pointers in a tree structure, right?
>
>If so, one possible approach would be to try to minimize the total
>distance covered by said pointers, measured as the offset between the
>address of the pointer itself and the object it points to.

Coalescing tree data structures during GC has been studied quite a
bit. There seems to be consensus that, where possible, the best thing
to do is to maintain the address order of object allocations because
that tends to be a good predictor of object lifetimes. But
maintaining object address ordering is relatively complex due to the
need to sort addresses, so where it is infeasible, regrouping
according to depth first traversal has been shown to yield slightly
better locality than breadth first (at least for the test cases that
have been studied).

Most copying collectors already do regroup data during collection. It
is easier to maintain object allocation order in mark-compact
collectors than in Cheney-style copiers, but Cheney's - which are
implicitly breadth-first - can relatively easily incorporate secondary
depth-first structure traversals.
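A minimal sketch of why Cheney's scheme is implicitly breadth-first: the scan pointer chasing the allocation pointer turns to-space into a FIFO work queue. The object model here (dicts with a 'children' list, indices as forwarded pointers) is purely illustrative, not any real collector's layout:

```python
# Toy Cheney-style copying collection over a hypothetical object model.
# Evacuation order in to_space is breadth-first because the scan index
# walks to_space like a FIFO queue while evacuate() appends to its tail.

def cheney_collect(roots):
    to_space = []          # new copies, in the order they were evacuated
    forward = {}           # id(old object) -> index of its copy in to_space

    def evacuate(obj):
        if id(obj) not in forward:
            forward[id(obj)] = len(to_space)
            to_space.append({'data': obj['data'],
                             'children': obj['children']})
        return forward[id(obj)]

    new_roots = [evacuate(r) for r in roots]

    scan = 0               # copies below 'scan' have had their children fixed
    while scan < len(to_space):
        copy = to_space[scan]
        copy['children'] = [evacuate(c) for c in copy['children']]
        scan += 1
    return new_roots, to_space

# Small tree: root -> (a, b), a -> (c)
c = {'data': 'c', 'children': []}
a = {'data': 'a', 'children': [c]}
b = {'data': 'b', 'children': []}
root = {'data': 'root', 'children': [a, b]}

new_roots, heap = cheney_collect([root])
print([o['data'] for o in heap])    # ['root', 'a', 'b', 'c']
```

A depth-first secondary traversal would instead finish an object's subtree before its siblings, giving root, a, c, b for the same heap.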


>If we could also have hit counters of some form, it would be possible to
>cluster heavily-used object together, but that would probably require
>either HW or virtual machine support. Marking pages as not present just
>in order to detect accesses is probably too heavy-duty, right?
>
>It seems to me that this could tend to cluster useful data?

There have been a few studies on regrouping data dynamically according
to mutator access patterns, but AFAIHS the extra complexity, versus
maintaining some simple traversal order, just isn't worthwhile.

This is the same problem as VMM page replacement. The ideal page
policy is L)east L)ikely T)o B)e N)eeded - which is, in general,
impossible to implement. L)east R)ecently U)sed is an achievable
compromise, but LRU's approximation of LLTBN is highly dependent on
the system's dynamic program mix.

Similarly, you'd want to group clusters of data according to LLTBN ...
but you can't and LRU may not be a good approximation.
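For illustration, a toy LRU "resident set" in a few lines (the class and names are mine, not any real VMM's): the least recently touched page is the eviction victim when the set is full.

```python
from collections import OrderedDict

# Toy LRU resident set: OrderedDict keeps pages oldest-access-first,
# so the eviction victim is simply the item at the front.
class LRUResidentSet:
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()

    def touch(self, page):
        """Access a page; return the evicted victim, if any."""
        evicted = None
        if page in self.pages:
            self.pages.move_to_end(page)            # now most recently used
        else:
            if len(self.pages) >= self.capacity:
                evicted, _ = self.pages.popitem(last=False)  # LRU victim
            self.pages[page] = True
        return evicted

rs = LRUResidentSet(2)
victim = None
for p in ['A', 'B', 'A', 'C']:
    victim = rs.touch(p) or victim
print(victim)     # 'B': A was touched more recently than B when C arrived
```

LRU's weakness, as above, is exactly the text's point: recency of use is only a proxy for likelihood of future use.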

Grouping program objects by age is possible, though, and since the VMM
already tracks resident page accesses, these can be taken into
account. One problem with using VMM is that non-resident pages cannot
be differentiated because their age data is not retained. They can
uniformly be considered "ancient", but this would be wrong for pages
pushed out in thrashing due to an overloaded system.

Another issue is that many GC implementations already do not (and do
not want to) use VMM mechanisms due to high OS trap latencies, the
coarse granularity of processing (relatively large) VM pages and the
difficulty of dealing with page-spanning objects incrementally. The
result is that a lot of collectors are pure software implementations
which don't rely much at all on VMM hardware.

Not to mention that GC itself, regardless of algorithm(s), is a fairly
bad fit to LRU paging because the collection process may
touch and drag in large amounts of non-resident data - displacing data
actively in use. There have been proposals for various extensions to
VMM APIs to give processes more direct control over their own paging,
but, of course, these may be at odds with the global policies the VMM
system may be trying to implement.


>Terje

George
From: jacko on
Yes: a reference-count GC that frees on a count of zero, and at a count
of 1 tests whether the remaining reference is a loop back to the object.

This will not chase many pointers the way a mark-and-release GC does,
but it makes the list operations slightly more complicated.

The question is how much list-node turnover there has to be before this
becomes the more efficient GC.
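A minimal sketch of the free-on-zero scheme, and of the reference cycle that the extra loop-back test would have to catch. All names here are illustrative, not any real GC's API:

```python
# Toy reference counting: free on zero, recursively decrementing children.
# The a <-> b cycle below is the classic case plain refcounting leaks.

class Obj:
    freed = []
    def __init__(self, name):
        self.name, self.rc, self.refs = name, 0, []

def add_ref(src, dst):
    dst.rc += 1
    src.refs.append(dst)

def dec_ref(obj):
    obj.rc -= 1
    if obj.rc == 0:                    # free on zero
        Obj.freed.append(obj.name)
        for child in obj.refs:
            dec_ref(child)
        obj.refs = []

root = Obj('root'); a = Obj('a'); b = Obj('b')
root.rc = 1                            # held by a hypothetical stack slot
add_ref(root, a)
add_ref(a, b)
add_ref(b, a)                          # cycle: a <-> b
dec_ref(root)                          # drop the stack reference
print(Obj.freed)                       # ['root']: a and b leak with rc == 1
```

No pointer chasing happens except along the freed subgraph, but every list insert and delete now carries the count bookkeeping, which is the complication mentioned above.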
From: Andy Glew on
On 7/10/2010 12:12 PM, Robert Myers wrote:
> nmm1(a)cam.ac.uk wrote:

>> The same is NOT true of branch and preload prediction. If you can
>> reduce the misprediction by only 5% (relatively), it isn't exciting
>> enough to change to an experimental design.
>>
>
> To quote the learned Andrew Glew, from this same thread in this same
> august forum:
>
> <quote>
>
> I warned you: at UWisc, I wanted to work on big instruction windows. But
> Jim Smith said "big instruction windows cannot
> be justified, because branch predictors are not good enough". Now, I
> knew something that Jim didn't: I knew that
> Willamette's branch predictors were significantly better than any that
> had been published in academia (I don't know the
> exact numbers off the top of my head, but: if you go from 94% to 97%, you have
> doubled the instructions between mispredicted
> branches on average. Often more so. Also, I knew that main memory
> latencies of 300 cycles were coming.
>
> </quote>
>
> I did that not to prove you wrong, Nick, but to show Andy that I am
> paying attention. ;-)

I think that Nick may have meant a 5% relative improvement in branch
prediction: going, e.g., from 94% by improving 5% of (100% - 94%) = 6%,
i.e. 0.30%, to a net branch prediction rate of 94.30%.
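Both readings can be checked with a few lines of arithmetic. With hit rate p, the mean run length between mispredicted branches scales as 1 / (1 - p):

```python
# Nick's reading: a 5% relative reduction of the 6% miss rate.
p = 0.94
miss = 1 - p
relative = p + 0.05 * miss
print(round(relative, 4))                   # 0.943, i.e. 94.30%

# The Willamette point: 94% -> 97% halves the miss rate, so the
# average distance between mispredicts doubles (1/0.03 vs 1/0.06).
print(round((1 / 0.03) / (1 / 0.06), 1))    # 2.0
```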

But in case anyone cares, even a 0.30% increase in performance is the
sort of thing Intel and AMD would kill for now.

When I started as a computer architect, my boss said "don't waste
time on anything that doesn't get a 2X improvement". Then 20%. Now, we
are scraping the bottom of the barrel looking for 1%ers.

That's why multicore is so attractive.

As y'all know, I think there are big improvements possible in single
thread CPU performance - 10%ers at least, and NXers in some areas.


From: nmm1 on
In article <4C3A995D.7040503(a)andy.glew.ca>,
Andy Glew <giganews(a)andy.glew.ca> wrote:
>
>I think that Nick may have meant a 5% relative improvement in branch
>prediction: going, e.g., from 94% by improving 5% of (100% - 94%) = 6%,
>i.e. 0.30%, to a net branch prediction rate of 94.30%.

I did.

>But in case anyone cares, even a 0.30% increase in performance is the
>sort of thing Intel and AMD would kill for now.
>
>When I started as a computer architect, my boss said "don't waste
>time on anything that doesn't get a 2X improvement". Then 20%. Now, we
>are scraping the bottom of the barrel looking for 1%ers.

That's bonkers. But you know that.

What really baffles me is that I know there is a market for LESS
performance - provided that it comes in a lot less power hungry.
Not a small market, either. And it could be done.

>That's why multicore is so attractive.
>
>As y'all know, I think there are big improvements possible in single
>thread CPU performance - 10%ers at least, and NXers in some areas.

And it's those areas that are interesting.


Regards,
Nick Maclaren.