From: Yao Qi on

We want to record the stack backtrace of some points in program on
runtime. In some real application, a huge number of stack backtraces
are recorded, and it is expensive on space.

Is it possible to optimize the method to record all the stack backtrace?
We found that some of the stack backtrace are them same, or partially
same. For example,

A call B, B call C, C call D, C call E,

In function D, we get stack backtrace like D, C, B, A, while in function
E, we get stack backtrace E, C, B, A. We could see that the last three
stack element are the same.

Any comments on this? Thanks.


Best Regards

--
Yao Qi <qiyaoltc AT gmail DOT com> GNU/Linux Developer
http://duewayqi.googlepages.com/

If a listener nods his head when you're explaining your program, wake him up.
From: Logan Shaw on
Yao Qi wrote:
> We want to record the stack backtrace of some points in program on
> runtime. In some real application, a huge number of stack backtraces
> are recorded, and it is expensive on space.
>
> Is it possible to optimize the method to record all the stack backtrace?
> We found that some of the stack backtrace are them same, or partially
> same. For example,
>
> A call B, B call C, C call D, C call E,
>
> In function D, we get stack backtrace like D, C, B, A, while in function
> E, we get stack backtrace E, C, B, A. We could see that the last three
> stack element are the same.
>
> Any comments on this? Thanks.

What form are you storing them in? Arrays of StackTraceElement, like
Thread.getStackTrace() and Throwable.getStackTrace() return? Or some
other format? Are you storing them in memory, or writing them to
external storage like a file or database?

If you're storing them in memory, you could create your own data structure
to represent them. It would be relatively easy to store them in a graph
that eliminates the redundancies. Imagine storing the traces in a linked
list where each node is a StackTraceElement plus a reference to the following
node. If two linked lists shared a common suffix, you could store that
common suffix only once, having both prior nodes point to the same suffix.

So, you could have a data structure that looks like this:

D -> C -> B -> A
E ---^

Finding the largest already-stored suffix seems a bit tricky at first,
but is probably not that bad. One way to do it is to use a hash table
to store every possible sublist in the tree. (That is, every list that
results from starting at a node and going all the way to the finish
from that point.)

So for example, in this case, you would first insert the trace
D -> C -> B -> A. That would mean the following operations:

1. Create a node A.
2. Add A to the hash table.
3. Create a node B, and point it at A.
4. Add B -> A to the hash table.
5. Create a node C, and point it at B.
6. Add C -> B -> A to the hash table.
7. Create a node D, and point it at C.
8. Add D -> C -> B -> A to the hash table.

Now it comes time to add E -> C -> B -> A. The steps are:

1. Check if A is in the hash table. It is.
2. Check if B -> A is in the hash table. It is.
3. Check if C -> B -> A is in the hash table. It is.
4. Check if E -> C -> B -> A is in the hash table. It is not.
5. Since it is not, create a node E, and point it at C.
6. Add E -> C -> B -> A to the hash table.

This may look like it takes a lot of space, but when you are adding
to the hash table, the only thing being stored in the hash table is
a single reference to a node, so the size of the hash table is
proportional to the number of nodes, which is not bad.

Note that the hash table is only an internal data structure that's
useful for finding the longest possible tail to reuse. You should
maintain some other data structure external to all of this which
contains all the stack traces you care about. With any luck, this
external data structure will contain a lot of duplicate references.

This approach is only useful if you can compress your data by eliminating
redundancy in storing a lot of common suffixes. If your stack traces
don't follow that pattern, it won't help much. For example, you might
have stack traces of a recursive application which has a lot of really
deep call stacks. Those aren't that likely to share common suffixes.


An entirely different approach might useful depending on the answer to
a question. StackTraceElements contain strings. If two StackTraceElement
instances represent the same or related points in the code, do they share
a reference to the same string, i.e. the string for the method name or
class name? They might, but if they do not, these duplicate but identical
strings will be responsible for a lot of the wasted space. You could save
space by creating a pool of strings and re-creating new StackTraceElement
instances to replace the old ones, but with the new ones referring to
strings from the pool.

- Logan
From: Gene on
On May 4, 12:40 pm, Yao Qi <qiyao...(a)gmail.com> wrote:
> We want to record the stack backtrace of some points in program on
> runtime.  In some real application, a huge number of stack backtraces
> are recorded, and it is expensive on space.
>
> Is it possible to optimize the method to record all the stack backtrace?
> We found that some of the stack backtrace are them same, or partially
> same.  For example,
>
> A call B, B call C, C call D, C call E,
>
> In function D, we get stack backtrace like D, C, B, A, while in function
> E, we get stack backtrace E, C, B, A.  We could see that the last three
> stack element are the same.
>
> Any comments on this?  Thanks.

Zlib ought to do a pretty good job of compressing data like this.
From: Yao Qi on
Logan Shaw <lshaw-usenet(a)austin.rr.com> writes:

> Yao Qi wrote:
>> We want to record the stack backtrace of some points in program on
>> runtime. In some real application, a huge number of stack backtraces
>> are recorded, and it is expensive on space.
>>
>> Is it possible to optimize the method to record all the stack
> backtrace?
>> We found that some of the stack backtrace are them same, or partially
>> same. For example,
>>
>> A call B, B call C, C call D, C call E,
>>
>> In function D, we get stack backtrace like D, C, B, A, while in
> function
>> E, we get stack backtrace E, C, B, A. We could see that the last
> three
>> stack element are the same.
>>
>> Any comments on this? Thanks.
>
> What form are you storing them in? Arrays of StackTraceElement, like
> Thread.getStackTrace() and Throwable.getStackTrace() return? Or some
> other format? Are you storing them in memory, or writing them to
> external storage like a file or database?

An array of StackTraceElement from Thread.getStackTrace(). Storing them
in memory.

>
> If you're storing them in memory, you could create your own data
> structure
> to represent them. It would be relatively easy to store them in a graph
> that eliminates the redundancies. Imagine storing the traces in a
> linked

Yeah, graph is a good candidate.

> list where each node is a StackTraceElement plus a reference to the
> following
> node. If two linked lists shared a common suffix, you could store that
> common suffix only once, having both prior nodes point to the same
> suffix.
>
> So, you could have a data structure that looks like this:
>
> D -> C -> B -> A
> E ---^
>
> Finding the largest already-stored suffix seems a bit tricky at first,
> but is probably not that bad. One way to do it is to use a hash table
> to store every possible sublist in the tree. (That is, every list that
> results from starting at a node and going all the way to the finish
> from that point.)
>
> So for example, in this case, you would first insert the trace
> D -> C -> B -> A. That would mean the following operations:
>
> 1. Create a node A.
> 2. Add A to the hash table.
> 3. Create a node B, and point it at A.
> 4. Add B -> A to the hash table.
> 5. Create a node C, and point it at B.
> 6. Add C -> B -> A to the hash table.
> 7. Create a node D, and point it at C.
> 8. Add D -> C -> B -> A to the hash table.
>
> Now it comes time to add E -> C -> B -> A. The steps are:
>
> 1. Check if A is in the hash table. It is.
> 2. Check if B -> A is in the hash table. It is.
> 3. Check if C -> B -> A is in the hash table. It is.
> 4. Check if E -> C -> B -> A is in the hash table. It is not.
> 5. Since it is not, create a node E, and point it at C.
> 6. Add E -> C -> B -> A to the hash table.
>
> This may look like it takes a lot of space, but when you are adding
> to the hash table, the only thing being stored in the hash table is
> a single reference to a node, so the size of the hash table is
> proportional to the number of nodes, which is not bad.
>
> Note that the hash table is only an internal data structure that's
> useful for finding the longest possible tail to reuse. You should
> maintain some other data structure external to all of this which
> contains all the stack traces you care about. With any luck, this
> external data structure will contain a lot of duplicate references.
>
> This approach is only useful if you can compress your data by
> eliminating
> redundancy in storing a lot of common suffixes. If your stack traces
> don't follow that pattern, it won't help much. For example, you might
> have stack traces of a recursive application which has a lot of really
> deep call stacks. Those aren't that likely to share common suffixes.
>

I think my stack traces follow this pattern, and I could have a try.
Thanks for you explanation in such details.

>
> An entirely different approach might useful depending on the answer to
> a question. StackTraceElements contain strings. If two
> StackTraceElement
> instances represent the same or related points in the code, do they
> share
> a reference to the same string, i.e. the string for the method name or
> class name? They might, but if they do not, these duplicate but
> identical
> strings will be responsible for a lot of the wasted space. You could
> save
> space by creating a pool of strings and re-creating new
> StackTraceElement
> instances to replace the old ones, but with the new ones referring to
> strings from the pool.

This is useful, because I find that huge number of String objects are
created when we are tracking the stack back trace, however, most of the
String objects are identical. Thanks again for your advice.


Best Regards

--
Yao Qi <qiyaoltc AT gmail DOT com> GNU/Linux Developer
http://duewayqi.googlepages.com/

No directory.