From: Janis Papanagnou on
pk wrote:
> Janis Papanagnou wrote:
>
>> It would be interesting if there'd be a tool to directly access entities
>> in a file by seek'ing to the respective position to avoid such performance
>> degradation; say, as an intelligent tail(1) program may operate on character
>> entities. But that seems impossible without having a "structured file"; a
>> database, maybe, or using file concepts of higher abstraction levels which
>> I haven't heard to be in use in practice.
>
> Well, to nitpick a bit, tail can certainly be told to count bytes rather
> than lines (using -c), and if all the "cells" are the same width, it may be
> possible to calculate the offsets for a certain [row,column] cell.

I was referring to something else. I've heard (though I haven't checked
the tail(1) source code) that modern implementations make a guessed seek
close to the point where the requested data might start. I was
specifically not speaking of equal-width elements, which would certainly
make the task quite trivial.
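
Just to illustrate the idea in shell terms (a rough sketch only; the
4096-byte window is an arbitrary assumption, and real tail
implementations are certainly smarter about it):

  # print the last line of "file" without reading the whole file:
  # seek close to the end and scan only the final chunk
  size=$(wc -c < file)
  skip=$(( size > 4096 ? size - 4096 : 0 ))
  dd if=file bs=1 skip="$skip" 2>/dev/null | awk 'END { print }'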

Janis

> In the
> same way, dd can seek to arbitrary positions in the file, so (just for fun!)
> one could do, say
>
> # assuming each element is 2 characters wide, 7 rows, 5 columns
>
> $ cat matrix
> 01 02 03 04 05
> 06 07 08 09 10
> 11 12 13 14 15
> 16 17 18 19 20
> 21 22 23 24 25
> 26 27 28 29 30
> 31 32 33 34 35
>
> # matrix[row,column] is at offset row * 15 + col * 3
> # in the file, if row and col are 0-based
>
> # bash
> for col in {0..4}; do
>   line=""
>   sep=""
>   for row in {0..6}; do
>     offset=$(( row * 15 + col * 3 ))
>     elem=$(dd if=matrix skip=$offset bs=1 count=2 2>/dev/null)
>     line="$line$sep$elem"
>     sep=" "
>   done
>   printf "%s\n" "$line"
> done > transposed
>
> $ cat transposed
> 01 06 11 16 21 26 31
> 02 07 12 17 22 27 32
> 03 08 13 18 23 28 33
> 04 09 14 19 24 29 34
> 05 10 15 20 25 30 35
>
> Again, it's just for fun. With a large file that ends up calling dd possibly
> millions of times. And although seeks may be fast, I would not bet that in
> terms of pure runtime it would be faster than the plain O(N^2) approach (but
> I haven't tried).
From: Janis Papanagnou on
Thomas 'PointedEars' Lahn wrote:
> Janis Papanagnou wrote:
>
>> Maxwell Lol wrote:
>>> Thomas 'PointedEars' Lahn <PointedEars(a)web.de> writes:
>>>> Janis Papanagnou wrote:
>> [To make sure; I didn't write the subsequent two lines; that was Mr Lahn.]
>
> Isn't it a bit pointless to include that attribution line then?

Tell that to Maxwell if you're happy playing the control freak. (Stripping
context is fine if done right and if it's not misleading.) It certainly
is *not* pointless to make clear that I didn't write anything that YOU,
specifically, wrote. No big issue, but the clarification is in order.

>
>>>> Temporary files and arrays are unnecessary. You can read and print the
>>>> columns as rows one by one.
>> I didn't say they are necessary; rather I suggested to use one of the two
>> approaches, depending on the actual problem domain. Your suggestion seems
>> to me to be much inferior WRT performance; O(N^2) instead of O(N), which
>> is unacceptable if you're processing gigabytes of data as the OP actually
>> does.
>
> AISB, it is a tradeoff: Memory vs. runtime efficiency/complexity. The OP
> asked for a solution that would save memory, hence my suggestion.

Suggestions, especially sensible ones, are always welcome.

>
> As to whether that iterative approach would be unacceptable, that depends on
> the machine speed, and the other processes it is supposed to be running.
> ISTM nowadays processing a large text file multiple times is to be preferred
> over using a multiple of the amount of memory required to do the former in
> order to store the user data of the text file.

You seem not to know, or not to have thought much about, computational
complexity; specifically, about the _practical relevance_ of the
difference between O(N^2) and O(N), I suppose?

>
> Using an implementation of my suggestion, transposing a 1000×1000 matrix in
> a 6.6 MiB text file and storing the result in another text file took only 84
> seconds on an Intel Core2 Duo SU9400 CPU (1.4 GHz, 800 MHz FSB) that already
> had an estimated combined load of 50-75% primarily due to a multimedia
> application running in the background. The shell process that did it and
> its child processes combined never required more than 4 MiB of memory.

You've tried "toy data"; 6.6 MiB is nothing compared to 9.4 GB if you're
going to compare an O(N^2) algorithm to an O(N) algorithm. If you've not
yet learned about computational complexity, I suggest trying a linear
O(N) algorithm and comparing the timing with your O(N^2) algorithm for a
data set of size 10^6 bytes, then increasing that to 4*10^6 bytes, and
finally increasing it to the data set that is the topic here; it is
approx. 1358 times larger than your toy data, and that factor enters the
equation squared (1358^2). You won't see any results soon[*] with your
approach or with any other approach of complexity O(N^2).
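
For comparison, a linear, single-pass variant of the temporary-file
approach can be sketched roughly as follows (just a sketch; whitespace
separated fields and the file names are assumptions, and the number of
columns must stay below the awk implementation's open-file limit; GNU
awk works around that limit, at some cost):

  # one pass over the input: append column c of every row to its own
  # temporary file; each temporary file then holds one row of the
  # transposed matrix
  awk '{ for (c = 1; c <= NF; c++) print $c > ("col." c) }' matrix
  ncols=$(awk 'NR == 1 { print NF; exit }' matrix)
  for (( c = 1; c <= ncols; c++ )); do
      paste -s -d' ' "col.$c"    # join the column's values into one line
  done > transposed
  rm -f col.[0-9]*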

Another point you may want to recognize: the "Core2"/"GHz"/... terminology
that you are celebrating above is good if you want to sell computers,
not if you're looking for a good algorithm that performs well and scales
well with real-world application data. Don't expect commercial system
development/evolution to solve real-world performance issues of a
significant magnitude; if you want to compare algorithms of *different*
complexity classes, raw hardware figures are not that helpful. OTOH, your
data *is* helpful if you make the effort to calculate, or try out, how it
extrapolates to the data size relevant in this thread; you've omitted
this crucial step. This just BTW.

Janis

[*] And while you wait for the results you can do the math; it's surely
an eye opener, I promise.
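
(To spell the math out, under the assumption that the measured 84 seconds
scale with the square of the size factor:

  84 s * 1358^2  =  84 s * 1844164  ~=  1.5 * 10^8 s,

which is on the order of five years; compare that to reading and writing
the 9.4 GB just once, as a linear approach would.)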

>
>
> PointedEars
From: Maxwell Lol on
Janis Papanagnou <janis_papanagnou(a)hotmail.com> writes:

> It would be interesting if there'd be a tool to directly access entities
> in a file by seek'ing to the respective position to avoid such performance
> degradation; say, as an intelligent tail(1) program may operate on character
> entities. But that seems impossible without having a "structured file"; a
> database, maybe, or using file concepts of higher abstraction levels which
> I haven't heard to be in use in practice.

There are databases that are designed to do this. I was reading an
article in the July 2010 Linux Journal that talks about the NoSQL
databases (BigTable, HBase, Cassandra, Redis, MongoDB, Voldemort,
CouchDB, Dynomite, HyperTable, etc.).

They just use a friggin' large database and a simple key-to-value
lookup.
As a simple form, use awk's associative array:
a[row "," column]=value;
In perl (which might perform better), it's
$a{"$row $column"}=$value;
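
Spelled out, the awk variant might look roughly like this (just a sketch;
whitespace-separated fields and the file name "matrix" are assumptions):

  # store every cell under the key "row,column", then print column-major
  awk '
      { for (c = 1; c <= NF; c++) a[NR "," c] = $c
        if (NF > maxc) maxc = NF }
      END {
          for (c = 1; c <= maxc; c++) {
              line = a[1 "," c]
              for (r = 2; r <= NR; r++)
                  line = line " " a[r "," c]
              print line
          }
      }
  ' matrix > transposed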

This keeps all of the data in memory, but with virtual memory it
should work; you just get pages swapped in and out. I have no idea of
the performance. But if you have a 4GB file, and 4GB of memory...

Well, you can also get a machine with BIG ram. It might be cheaper
than a MATLAB license.



From: Maxwell Lol on
Janis Papanagnou <janis_papanagnou(a)hotmail.com> writes:

> Tell that to Maxwell if you're happy playing the control freak.

Oops. If I screwed up, I apologize.
From: Thomas 'PointedEars' Lahn on
Janis Papanagnou wrote:

> Thomas 'PointedEars' Lahn wrote:
>> Janis Papanagnou wrote:
>>> Maxwell Lol wrote:
>>>> Thomas 'PointedEars' Lahn <PointedEars(a)web.de> writes:
>>>>> Janis Papanagnou wrote:
>>> [To make sure; I didn't write the subsequent two lines; that was Mr
>>> Lahn.]
>>
>> Isn't it a bit pointless to include that attribution line then?
>
> Tell that to Maxwell if you're happy playing the control freak.

I am telling it to *you* because I am referring to *your* reply. This has
nothing to do with me being a control freak. Point, kettle, black.

> (Stripping context is fine if done right and if it's not misleading.) It
> certainly is *not* pointless to make clear that I didn't write anything
> that YOU, specifically, wrote. No big issue, but the clarification is in
> order.

Why, you could have simply removed that attribution line and spared yourself
the explanation. Since computational complexity appears to be most
important to you, you could have scored there.

>>>>> Temporary files and arrays are unnecessary. You can read and print
>>>>> the columns as rows one by one.
>>> I didn't say they are necessary; rather I suggested to use one of the
>>> two approaches, depending on the actual problem domain. Your suggestion
>>> seems to me to be much inferior WRT performance; O(N^2) instead of
>>> O(N), which is unacceptable if you're processing gigabytes of data as
>>> the OP actually does.
>>
>> AISB, it is a tradeoff: Memory vs. runtime efficiency/complexity. The OP
>> asked for a solution that would save memory, hence my suggestion.
>
> Suggestions, especially sensible ones, are always welcome.

You should face the simple fact that what you consider sensible might not be
considered sensible by others, depending on the circumstances. In this
case, one might consider it sensible to do what I suggested because one does
not have a choice.

>> As to whether that iterative approach would be unacceptable, that depends
>> on the machine speed, and the other processes it is supposed to be
>> running. ISTM nowadays processing a large text file multiple times is to
>> be preferred over using a multiple of the amount of memory required to do
>> the former in order to store the user data of the text file.
>
> You seem not to know, or not to have thought much about, computational
> complexity; specifically, about the _practical relevance_ of the
> difference between O(N^2) and O(N), I suppose?

No, you are mistaken. As you could have noticed, complexity is exactly what
I have in mind in this case. You don't seem to understand that reducing
computational complexity is not always the ultimate goal. It certainly
is not in this particular case.

>> Using an implementation of my suggestion, transposing a 1000×1000 matrix
>> in a 6.6 MiB text file and storing the result in another text file took
>> only 84 seconds on an Intel Core2 Duo SU9400 CPU (1.4 GHz, 800 MHz FSB)
>> that already had an estimated combined load of 50-75% primarily due to a
>> multimedia application running in the background. The shell process that
>> did it and its child processes combined never required more than 4 MiB of
>> memory.
>
> You've tried "toy data"; 6.6 MiB is nothing compared to 9.4 GB if you're
> going to compare an O(N^2) algorithm to an O(N) algorithm.

I am well aware that it is only a fraction of the data that the OP wants to
process. You could multiply the time required to get an idea of how long it
would probably take with 9.4 GiB on the same machine (ca. 17 to 34 standard
hours, i.e. on the order of the measured 84 seconds multiplied by the
roughly 1400-fold increase in data size); however, you should also note
that the matrix was described as rectangular, not square. So it might have
a lot of columns and measurably fewer rows, which would considerably reduce
the number of rows that cut(1) or awk(1) would have to process. Compare
this against the computational complexity of storing the entire matrix in
memory.

In any case, the point that you don't seem to get is that the OP does not
want to store 9.4 GiB of data in memory (or a database), maybe because they
just cannot do that. So they could not care less how long it takes (i.e.,
the computational complexity of the algorithm) as long as it can be done.
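
For concreteness, the column-by-column idea can be sketched as follows (a
sketch only, not necessarily the exact script I used; whitespace-separated
fields and the file name are assumptions):

  # constant memory: re-read the input once per column and print that
  # column as one row of the output
  ncols=$(head -n 1 matrix | wc -w)
  for (( c = 1; c <= ncols; c++ )); do
      awk -v c="$c" '{ printf "%s%s", (NR > 1 ? " " : ""), $c }
                     END { print "" }' matrix
  done > transposed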

> [snip more pointless babbling]


PointedEars