From: Kaz Kylheku on
On 2009-12-16, Tamas K Papp <tkpapp(a)gmail.com> wrote:
> On Wed, 16 Dec 2009 18:04:14 +0000, Kaz Kylheku wrote:
>
>> Without cycles, you can't express graph structures, so an entire branch
>> of computer science goes out the window. Many useful algorithms
>
> I disagree. Graphs can be represented, for example, in a matrix
> (a[i,j] is nonzero iff i->j, etc). You can use this to reconstruct a
> lookup table, basically emulating pointers.
>
> Which would, of course, be a pretty pointless thing to do -- but
> the OP's questions have been pretty pointless so far, so maybe this
> would be a harmonious outcome :-)

If we have a graph structure, it means that given some node object N, we can
navigate to some adjacent object. For instance, if the edges are numbered from
zero, we can call a function like (neighbor N 3) to obtain the fourth neighbor
of N.

In order to be able to do this with the matrix representation, the object
N has to have a backpointer to the matrix. And that creates the
cyclical reference at the storage level: N points to the array, and the array
contains N somewhere.
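
Here is a rough sketch, in Lisp, of what I mean; the structure and function
names are made up for illustration, not taken from any particular library:

(defstruct graph
  (matrix nil)   ; adjacency matrix: (aref matrix i j) nonzero iff i -> j
  (nodes nil))   ; vector mapping index -> node object

(defstruct node
  (index 0)
  (graph nil))   ; backpointer: this is what closes the storage-level cycle

(defun neighbor (n k)
  "Return N's Kth neighbor, counting from zero, by scanning N's matrix row."
  (let* ((g (node-graph n))
         (m (graph-matrix g))
         (i (node-index n))
         (found -1))
    (dotimes (j (array-dimension m 1))
      (unless (zerop (aref m i j))
        (when (= (incf found) k)
          (return-from neighbor (aref (graph-nodes g) j)))))))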

The only way you can eliminate the circularity is if the nodes do not know who
their neighbors are (do not have a backreference to the matrix). But that's a
different kind of graph then.

I don't want to be constrained into working only with that kind of graph, just
because my memory management is dumb.
From: Kaz Kylheku on
On 2009-12-16, George Neuner <gneuner2(a)comcast.net> wrote:
> On Wed, 16 Dec 2009 18:04:14 +0000 (UTC), Kaz Kylheku
><kkylheku(a)gmail.com> wrote:
>
>>Refcounting can't be deferred,
>
> There are lazy refcounters which defer recycling until the memory is

But do they defer the counting? The actual bumping up and down of refcounts?

That's a cost.

My point was that in short-lived programs, GC has zero cost, because it never
happens, and the program code itself contains vanishingly few traces of
anything that caters to the task of memory management.

Refcounted programs have to be refcounting all the time. The code to do so is
compiled in. If you want a refcounted program to have zero overhead in cases
where it is expected to be short-lived, you have to compile a separate image
of that program, in which the inline refcounting instructions are removed.
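
To make that concrete, here is a rough sketch of the kind of bookkeeping a
refcounted implementation bakes into every reference assignment. INCREF,
DECREF and RC-SETF are hypothetical names, and the sketch ignores thread
safety entirely:

(defstruct rc-object
  (refcount 1)
  (payload nil))

(defun reclaim (obj)
  ;; stand-in for handing the object's storage back to the allocator
  (setf (rc-object-payload obj) nil))

(defun incref (obj)
  (when obj (incf (rc-object-refcount obj)))
  obj)

(defun decref (obj)
  (when (and obj (zerop (decf (rc-object-refcount obj))))
    (reclaim obj)))

(defmacro rc-setf (place new)
  "Assign NEW to PLACE, maintaining refcounts. This is the code that gets
compiled in everywhere, whether the program runs for two seconds or two days."
  (let ((tmp (gensym)))
    `(let ((,tmp ,new))
       (incref ,tmp)
       (decref ,place)
       (setf ,place ,tmp))))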
From: Kaz Kylheku on
On 2009-12-16, Pillsy <pillsbury(a)gmail.com> wrote:
> On Dec 16, 1:04 pm, Kaz Kylheku <kkylh...(a)gmail.com> wrote:
>
>> On 2009-12-16, Bennu Strelitzia <bennu.strelit...(a)gmail.com> wrote:
>
>> > i.e. reference counting could do the trick for an efficient minimal
>> > implementation if there were no referential cycles, I hate the real
>
>> Reference counting is not efficient.  Except in some highly contrived
>> benchmarks that don't reflect real-world use, it is slower than garbage
>> collection. Even simple mark-and-sweep can beat reference counting in
>> realistic situations.
>
> The following is a very naive observation.
>
> The one thing it seems like reference counting could allow you to do
> is immediate recycling of space. For example, if I want to sum two
> vectors, using
>
> (map 'vector #'+ a-big-vector another-big-vector)
>
> I'll get a third big vector that has to be consed. If I know that A-
> BIG-VECTOR isn't being used by anything else, I can recycle it using
> MAP-INTO instead, but in general I have no idea. However, if there
> were a reference count on those big vectors, the implementation could
> make the decision for me at run-time. I'm primarily thinking of using
> using reference counting as a way of augmenting garbage collection,
> not as a replacement for it; indeed, you could limit it to only work
> with some data types or objects, like big simple arrays of floats or
> characters.
>
> There's no way this is a new idea, and it's entirely possible I'm
> overlooking some trade-offs that keep it from being worthwhile.

Your idea depends on the assumption that ``it is a big deal that a large
vector has been allocated for temporary use and is now garbage.''

Why is that a big deal?

A large vector is no more difficult to hunt down and reclaim than a small
one. One way vectors can be implemented is as small control records which are
themselves in a garbage-collected heap, while the vector data is in an
ordinary heap that isn't garbage-collected. Then it's a large /number/ of
vectors that contributes to GC times, rather than large vector size. Under
this representation, allocating a vector contributes only a small record to
the garbage-collected heap, which must be traversed in preparation for
marking, and then traversed again to sweep it for unreachable objects.
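
Here is a rough sketch of that representation. I'm assuming a foreign
allocator along the lines of CFFI's FOREIGN-ALLOC for the bulk storage;
the names are illustrative, not any implementation's actual internals:

(defstruct vec-header
  (length 0)
  (data nil))    ; handle to element storage outside the collected heap

(defun make-big-vector (n)
  ;; only the small header lands in the GC'd heap; the N doubles do not
  (make-vec-header :length n
                   :data (cffi:foreign-alloc :double :count n)))

(defun free-big-vector (v)
  ;; called when the collector (or a finalizer) finds the header dead;
  ;; the bulk storage itself never had to be traversed by the GC
  (cffi:foreign-free (vec-header-data v))
  (setf (vec-header-data v) nil
        (vec-header-length v) 0))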

So the remaining reason why we might be concerned about a large vector being
garbage is that we are holding on to a large amount of memory longer than
necessary, preventing it from being used elsewhere.

This can be solved without reference counting. Simply provide an operation
which lets the program explicitly shrink the vector. If a ten million element
vector is reduced to size zero (or to a tiny, nonzero size, like 1), the memory
can be reclaimed instantly.
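
In Common Lisp terms, ADJUST-ARRAY already provides that operation; whether
a given implementation actually hands the bulk storage back to the system
right away is, of course, up to the implementation:

(let ((v (make-array 10000000 :element-type 'double-float
                              :initial-element 0d0
                              :adjustable t)))
  ;; ... use V ...
  ;; done with the contents: give the ten million elements back
  (setf v (adjust-array v 0)))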

A shrunk vector is better than one which is prematurely deleted, because it
isn't garbage. It's safe. The worst that can happen is that some code thinks
the vector is larger than it really is; but that only results in
out-of-bounds array references, which can be checked. We can disallow direct
interior pointers into vector data, so that everything goes through indexing
which is checked against the current size (even for displaced-to arrays).
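
A sketch of that checked access path, reusing the illustrative VEC-HEADER
record from above (VREF and the condition name are made up as well):

(define-condition vector-index-error (error)
  ((index :initarg :index)
   (limit :initarg :limit)))

(defun vref (v i)
  ;; every element access consults the header's current length, so code
  ;; that still believes the old, larger size gets a catchable error
  ;; instead of a wild reference into reclaimed storage
  (if (< -1 i (vec-header-length v))
      (cffi:mem-aref (vec-header-data v) :double i)
      (error 'vector-index-error :index i :limit (vec-header-length v))))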
From: Kaz Kylheku on
On 2009-12-17, George Neuner <gneuner2(a)comcast.net> wrote:
> On Wed, 16 Dec 2009 11:29:01 -0800 (PST), Pillsy <pillsbury(a)gmail.com>
> wrote:
>
>>On Dec 16, 1:04 pm, Kaz Kylheku <kkylh...(a)gmail.com> wrote:
>>
>>> On 2009-12-16, Bennu Strelitzia <bennu.strelit...(a)gmail.com> wrote:
>>
>>> > i.e. reference counting could do the trick for an efficient minimal
>>> > implementation if there were no referential cycles, I hate the real
>>
>>> Reference counting is not efficient.  Except in some highly contrived
>>> benchmarks that don't reflect real-world use, it is slower than garbage
>>> collection. Even simple mark-and-sweep can beat reference counting in
>>> realistic situations.
>>
>>The following is a very naive observation.
>>
>>The one thing it seems like reference counting could allow you to do
>>is immediate recycling of space. For example, if I want to sum two
>>vectors, using
>>
>>(map 'vector #'+ a-big-vector another-big-vector)
>>
>>I'll get a third big vector that has to be consed. If I know that A-
>>BIG-VECTOR isn't being used by anything else, I can recycle it using
>>MAP-INTO instead, but in general I have no idea. However, if there
>>were a reference count on those big vectors, the implementation could
>>make the decision for me at run-time. I'm primarily thinking of using
>>using reference counting as a way of augmenting garbage collection,
>>not as a replacement for it; indeed, you could limit it to only work
>>with some data types or objects, like big simple arrays of floats or
>>characters.
>>
>>There's no way this is a new idea, and it's entirely possible I'm
>>overlooking some trade-offs that keep it from being worthwhile.
>
> Refcounters (even lazy ones) eventually have to scan all the garbage
> blocks.

That's right. Suppose pointer P is the program's last reference to a root
object R of a huge DAG of millions of objects. When the program, through P,
drops its last reference on R, the entire DAG becomes garbage. What this means
is that each object in the DAG is visited and its refcount is dropped.

So this action may cause a drastic pause in the execution of the program:
the very kind of pause that garbage collection is accused of, and which
refcounting is supposed to eliminate.
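
A sketch of the cascade; the names are hypothetical, and a real
implementation would use an explicit stack rather than recursion:

(defstruct dag-node
  (refcount 1)
  (children '()))

(defun reclaim-storage (node)
  ;; stand-in for giving the node's memory back
  (setf (dag-node-children node) '()))

(defun release (node)
  ;; dropping the last reference to the root visits every node in the
  ;; DAG and decrements its count, right now, in the middle of whatever
  ;; the program was doing
  (when (zerop (decf (dag-node-refcount node)))
    (dolist (child (dag-node-children node))
      (release child))
    (reclaim-storage node)))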
From: Kaz Kylheku on
On 2009-12-16, Rahul Jain <rjain(a)nyct.net> wrote:
> Kaz Kylheku <kkylheku(a)gmail.com> writes:
>
>> Every time an object's refcount is modified, an entire cache line
>> becomes dirty and has to be flushed to main memory. When another
>> processor in the system wants to access anything which lands in that
>> cache line, having noticed that it's dirty, it has to fetch the latest
>> version of that cache line from main memory. This means that two or
>> more processors can't work with the same refcounted object in a way that
>> is scalable, if at all they touch the refcounts---even if they are
>> treating the objects as read-only! As the refcounts are bumped, you have
>> performance-wasting cache ping-pong, and this can happen even with
>> separate objects (false sharing) if they land wholly or partially into
>> the same cache line. Refcounts have to be updated in a thread-safe way,
>> using expensive atomic operations that can cause stalls of thousands of
>> cycles. The consequences is that a refcounted program simply will not
>> scale well to a multiprocessor system.
>
> Thanks, Kaz. I hadn't even thought of that issue. It's a huge one for
> modern programs!

Here is another one. The most naive garbage collection has a sweep phase. But
let me present an argument that this sweep phase is better than refcounting.

Whenever refcounting reclaims an object, it is because a refcount has reached
zero. These zero refcounts are discovered dynamically by chasing the references
among objects: essentially, the order in which objects are discovered to be
unreachable is random.

What's worse, sometimes refcounted objects are shared. An unreachable object
may have a refcount that is greater than one, and has to be visited that many
times before its count drops to zero. These visits may be spread apart such
that the object is evacuated from the cache and fetched again.

Under naive mark-and-sweep GC, objects are discovered to be unreachable in a
nice, predictable order: they are visited by marching a pointer through the
heap (often, a heap of identically-sized objects, like conses, or the header
records of vectors, etc.). Each object is visited once. It is discovered that
it was not marked with the ``reachable'' flag, and so it is updated in some
way to recycle it.
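
To see why the sweep's access pattern is so friendly, here is a toy model of
the sweep loop, with the heap stood in for by a flat vector of
identically-sized records (a real collector marches over raw memory, of
course, not a Lisp vector):

(defstruct heap-cell
  (marked nil)   ; set by the mark phase for reachable cells
  (car nil)
  (cdr nil))

(defun sweep (heap free-list)
  "HEAP is a vector of HEAP-CELLs, visited in address order. Unmarked cells
are threaded onto FREE-LIST; the updated free list is returned."
  (dotimes (i (length heap) free-list)
    (let ((cell (aref heap i)))
      (cond ((heap-cell-marked cell)
             (setf (heap-cell-marked cell) nil))  ; clear for the next cycle
            (t
             ;; dead: thread it onto the free list through its CDR slot
             (setf (heap-cell-cdr cell) free-list
                   free-list cell))))))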

Marching through memory in a linear way is something that processors can do
fast.

Firstly, we avoid fetching any of the data into the L1 cache more than once,
since we are making one pass; moreover, the access pattern is susceptible to
speedup by prefetching. Some modern processors have L2-to-L1 prefetch, which
is similar to prefetching the blocks of a file under the assumption that it's
being accessed sequentially. This means that while you are working with one
cache line full of data, the processor can be generating the cache refill for
the next one.

Secondly, on some processors, there are cache hint instructions: you can tell
the processor that you expect to be accessing a particular address. If you're
sweeping through objects in a linear heap, you know the address of any number
of objects in advance by simple pointer arithmetic. Using cache hints, we can
tell the processor to load several cache lines ahead. If you're traversing
objects under refcounting, you can generate a cache hint only for the
immediately reachable objects; to get at any objects which are two hops away,
you have to load the first-hop objects into the cache and check their type,
so you know where their references are.

Thirdly, small objects are packed several to a cache line. If your processor
has 64-byte L1 cache lines, and you have 8-byte conses, you get 8 conses to a
cache line. It's much faster to access several objects in the same cache line
than several objects scattered all over the place. This is similar to
traversing a disk-based database in record-bucket order, compared to fetching
records in random order.

Fourthly, DRAM memories support burst mode access, which works only for
consecutive addresses. The memory bandwidth of linear access is much higher
than that of random access, and burst mode allows fast main memory -> L2
cache fills. Again, this works in your favor if you're linearly scanning. We
had decently high resolution displays with decent refresh rates even in the
dark ages of slow memory with 1000+ nanosecond access times: high resolution
video memories have for decades demonstrated that predictable, linear access
to memory can be trivially optimized with a bit of parallelism, and that high
bandwidth can be achieved even though the latency of random accesses is high.

Summary: visiting dead objects in random order with refcounting: bad. A dumb
sweep through dead objects in address order: better.