From: Janis Papanagnou on
Thomas 'PointedEars' Lahn wrote:
> Janis Papanagnou wrote:
>
>> Thomas 'PointedEars' Lahn wrote:
>>> Janis Papanagnou wrote:
>>>> Maxwell Lol wrote:
>>>>> Thomas 'PointedEars' Lahn <PointedEars(a)web.de> writes:
>>>>>> Janis Papanagnou wrote:
>>>> [To make sure; I didn't write the subsequent two lines; that was Mr
>>>> Lahn.]
>>> Isn't it a bit pointless to include that attribution line then?
>> Tell that Maxwell if you're happy playing the control freak.
>
> I am telling it *you* because I am referring to *your* reply. This has
> nothing to do with me being a control freak. Point, kettle, black.

You can tell that to whomever you like; the point was that I don't want to
be associated with what YOU wrote.

>
>> (Stripping context is fine if done right and if it's not misleading.) It
>> certainly is *not* pointless to make clear that I haven't written
>> anything that YOU, specifically, wrote. No big issue, but clarification
>> is in order.
>
> Why, you could have simply removed that attribution line and spared yourself
> the explanation.

I thought I had already explained that twice. Here it is again, just for
you, more explicitly: there was a posting with a misleading attribution, and
not reacting might leave a wrong impression and leave it documented that way
in the archives. By clarifying it in a reply, the correction is documented
as well.

> Since computational complexity appears to be the most
> important to you, you could have scored there.

What I consider important is my own business, and you are certainly in no
position to know it. Simply put: speak for yourself, and don't pretend to
know anything about others' agendas.

>
>>>>>> Temporary files and arrays are unnecessary. You can read and print
>>>>>> the columns as rows one by one.
>>>> I didn't say they are necessary; rather I suggested to use one of the
>>>> two approaches, depending on the actual problem domain. Your suggestion
>>>> seems to me to be much inferior WRT performance: O(N^2) instead of
>>>> O(N), which is unacceptable if you're processing gigabytes of data as
>>>> the OP actually does.
>>> AISB, it is a tradeoff: Memory vs. runtime efficiency/complexity. The OP
>>> asked for a solution that would save memory, hence my suggestion.
>> Suggestions, especially sensible ones, are always welcome.
>
> You should face the simple fact that what you consider sensible might not be
> considered sensible by others, depending on the circumstances. In this
> case, one might consider it sensible to do what I suggested because one does
> not have the choice.

And this triviality is supposed to be of any value in response to the
friendly line of comment that I gave about the welcomeness of sensible
suggestions?

>
>>> As to whether that iterative approach would be inacceptable, that depends
>>> on the machine speed, and the other processes it is supposed to be
>>> running. ISTM nowadays processing a large text file multiple times is to
>>> be preferred over using a multiple of the amount of memory required to do
>>> the former in order to store the user data of the text file.
>> You seem to not know or haven't thought much about the computational
>> complexity, about the _practical relevance_ between O(N^2) and O(N),
>> I suppose?
>
> No, you are mistaken. As you could have noticed, complexity is exactly what
> I have in mind in this case. You don't seem to understand that reducing
> computational complexity is not always the ultimate goal. It certainly
> is not in this particular case.

Well, if that's your conclusion I really cannot help you.

>
>>> Using an implementation of my suggestion, transposing a 1000×1000 matrix
>>> in a 6.6 MiB text file and storing the result in another text file took
>>> only 84 seconds on an Intel Core2 Duo SU9400 CPU (1.4 GHz, 800 MHz FSB)
>>> that already had an estimated combined load of 50-75% primarily due to a
>>> multimedia application running in the background. The shell process that
>>> did it and its child processes combined never required more than 4 MiB of
>>> memory.
>> You've tried "toy data"; 6.6 MiB is nothing compared to 9.4 GB if you're
>> going to compare an O(N^2) algorithm to an O(N) algorithm.
>
> I am well aware that it is only a fraction of the data that the OP wants to
> process. You could multiply the time required to get an idea of how long it
> would probably take with 9.4 GiB on the same machine (ca. 17 to 34 standard
> hours);

Then you've done your math wrong. In an O(N^2) algorithm the data-size
factor enters the runtime quadratically, not linearly; you would have to
multiply your 84 seconds by roughly 1358^2, not by 1358, as you have
effectively done. But you removed that quote from the reply and preferred
to dismiss it as "pointless babbling" instead of thinking about it. It's
of course completely up to you how you behave and how you communicate.
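
To make that concrete, a back-of-the-envelope sketch (taking your 84 s for
the 6.6 MiB file and the OP's 9.4 GB at face value, and assuming the runtime
grows with the square of the data size):

  awk 'BEGIN {
    k = 9.4e9 / (6.6 * 2^20)     # size factor, roughly 1358
    t = 84 * k * k               # quadratic scaling of the 84 s measurement
    printf "ca. %.1f years\n", t / (3600 * 24 * 365)
  }'

That prints something on the order of five years; that is what I refer to
as "a couple years" further down.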

> however, you should also note that the matrix was described to be
> rectangular, not quadratic. So it might have a lot of columns and
> measurably fewer rows, which would reduce the number of rows that cut(1) or
> awk(1) would have to process considerably.

Whether the matrix has a few columns and many lines, or vice versa, is
something we can speculate about all day long - *that's* pointless, if
anything.

I will agree that we can make up corner cases, like a matrix of only two
lines of data, each line 4 GB large, or vice versa, two columns with many
(US-)billions or (US-)trillions of lines. But from a more realistic point of
view it is preferable not to assume corner cases when the primary problem of
a proposal is its inappropriate algorithmic/computational basis. YMMV.

It would be easier to just set up that test case on your computer and try
it, as I suggested upthread; then you would see where that naive reasoning
leads in real-world applications.
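
If it helps, such a test could be set up roughly like this (a sketch only;
the 1000x1000 size mirrors your own measurement, and matrix.txt /
transposed.txt are just placeholder names):

  # Generate a 1000x1000 toy matrix of random integers:
  awk 'BEGIN { srand()
    for (r = 1; r <= 1000; r++)
      for (c = 1; c <= 1000; c++)
        printf "%d%s", int(rand() * 100), (c < 1000 ? " " : "\n")
  }' > matrix.txt

  # The one-pass-per-column approach under discussion, timed as a whole
  # (cut reads the entire file once for every column):
  cols=$(awk 'NR == 1 { print NF; exit }' matrix.txt)
  time sh -c '
    c=1
    while [ "$c" -le "$1" ]; do
      cut -d " " -f "$c" matrix.txt | paste -s -d " " -
      c=$((c + 1))
    done' sh "$cols" > transposed.txt

Scale the row and column counts up from there and the cost of re-reading
the whole file once per column shows soon enough.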

> Compare this against the
> computational complexity of storing the entire matrix in memory.

I did not suggest storing a data set of many GB in memory; rather, I made a
proposal for a sparsely populated matrix, where that could be an option.
OTOH, it has been pointed out, not only by me, that you'll run into a
complexity issue if you try an O(N^2) algorithm on gigabytes of data. If
you don't understand that, don't want to try it out, and belittle the
consequences, well.., then I have nothing more to add.
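
For the record, a minimal sketch of what I meant by that sparse proposal
(assuming whitespace-separated numeric cells, zero meaning "empty", and the
non-zero cells fitting into memory; the file names are placeholders):

  awk '
    # One pass over the input: keep only the non-zero cells, with row and
    # column indices swapped, in an associative array.
    { if (NF > maxc) maxc = NF
      for (c = 1; c <= NF; c++)
        if ($c != 0) cell[c, NR] = $c
    }
    END {
      # Emit the transposed matrix; absent cells come out as 0.
      for (r = 1; r <= maxc; r++)
        for (c = 1; c <= NR; c++)
          printf "%s%s", (((r, c) in cell) ? cell[r, c] : 0),
                         (c < NR ? OFS : ORS)
    }
  ' matrix.txt > transposed.txt

One pass over the data, and the memory grows only with the number of
non-zero cells, not with the size of the file.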

>
> In any case, the point that you don't seem to get is that the OP does not
> want to store 9.4 GiB of data in memory (or a database), maybe because they
> just cannot do that.

If you had been a bit less aggressive and more observant, you might have
seen that I made a suggestion for sparse matrices and did not suggest
keeping all those gigabytes in memory. And I think other approaches are
also feasible, like spreading the contents across many files - _as long as
the algorithm used is O(N)_, not O(N^2)!
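
A rough sketch of the many-files idea, just so that we are talking about
the same thing (the block size and the file names are arbitrary
placeholders, and paste's open-file limit caps the number of blocks):

  #!/bin/sh
  # Block-wise transpose of a dense matrix that does not fit into memory:
  # split the input into blocks of rows, transpose each block in memory,
  # then paste the partial results side by side. awk reads each input row
  # exactly once.
  split -l 1000 matrix.txt block.

  for b in block.*; do
    awk '
      { for (c = 1; c <= NF; c++) a[c, NR] = $c
        if (NF > cols) cols = NF }
      END {
        for (r = 1; r <= cols; r++)
          for (c = 1; c <= NR; c++)
            printf "%s%s", a[r, c], (c < NR ? OFS : ORS)
      }
    ' "$b" > "$b.t"
  done

  paste -d ' ' block.*.t > transposed.txt
  rm -f block.*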

But even if the matrix is not sparse - we still don't know - I would not
refrain from pointing out that an O(N^2) algorithm on a data set that huge
will just not work in practice.

If you are still convinced that your approach works, provide the math (if
you think what I wrote is wrong), or provide test cases that prove me
wrong, but with appropriate data. Toy data are okay as well, if you
extrapolate with the correct math; but getting both wrong does not lead to
results that are of any worth.

> So they could not care less how long it takes (i.e.,
> the computational complexity of the algorithm) as long as it can be done.

Your algorithm would require a couple of years on those gigabytes on your
box; it is unlikely that this would be considered "can be done" for a
practical application that will more likely need results in hours.

Janis

>
>> [snip more pointless babbling]
>
>
> PointedEars

WRT further replies, a personal note to you:
I've made this last effort to respond to you. Frankly, I don't think it
will be of any help to you, judging from what I have observed of your
behaviour since you recently joined here. I would really have preferred it
if you had put me in your personal killfile, as you seem to have publicly
announced a couple of days ago; that would have saved me some more time
explaining what should already have been obvious enough. Since your words,
both topical and meta, are apparently unreliable, you'll now make it into
my killfile as its first and only member; you're unique here in c.u.s. -
Good luck!