From: Ed Morton on
On 6/4/2010 3:12 AM, Greg Russell wrote:
> "Janis Papanagnou"<janis_papanagnou(a)hotmail.com> wrote in message
> news:hua1jh$4j3$1(a)news.m-online.net...
>
>> You're already aware of the problem that you have in principle with
>> this type of task. One more thought; are your matrices fully filled
>> or sparsely populated?
>
> Excellent point.
>
>> In the latter case you might be able to use
>> the all-in-memory approach anyway, because you'd need just allocate
>> the actual values and leave the zero-values away.
>
> Are you assuming that zero values constitute "sparse" elements of the array?
> The OP made no such indication, only that "... columns [are] separated by
> tabs, rows [are] separated by newlines."
>
> Null values e.g. ...\t\t... are far more reasonable for the sparse entries,
> but of course the OP can define the data structure that is of question.
>
>> (In any language, like awk, that supports associative arrays this
>> would require just a few lines of code.)
>
> Please educate us in such simplicity ... please.
>
> Even such well-known routines as the R/S/SPlus http://cran.r-project.org
> transpose function, or the Fortran transpose routines from http://ornl.gov
> (U.S. Oak Ridge National Laboratories) that we used 20+ years ago on such
> arrays, are more than "a few lines of code."
>
> If you can provide the "few lines" of awk necessary then you too can be
> famous rather than merely flippant.
>

I don't think Janis was being flippant; as he showed in a followup, it's a
trivial solution as long as we can identify a "null" value for input fields
(e.g. either zero or a null string).
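For the record, a sketch of what those "few lines" might look like, assuming
tab-separated input where zero marks a null cell and the non-null entries fit
in memory (the file names and the choice of zero as the null value are my
assumptions, not from the thread):

```shell
# Build a tiny sparse matrix, then transpose it with awk: store only the
# non-null cells in an associative array keyed by (column, row), and in
# the END block print each input column as an output row.
printf '1\t0\t2\n0\t3\t0\n4\t0\t5\n' > matrix.txt
awk -F'\t' -v OFS='\t' '
{
    for (i = 1; i <= NF; i++)
        if ($i != "" && $i != 0) a[i, NR] = $i   # keep sparse entries only
    if (NF > maxc) maxc = NF
}
END {
    for (i = 1; i <= maxc; i++) {                # each input column ...
        line = ""
        for (j = 1; j <= NR; j++)                # ... becomes an output row
            line = line (j > 1 ? OFS : "") (((i, j) in a) ? a[i, j] : 0)
        print line
    }
}' matrix.txt > transposed.txt
cat transposed.txt
```

Only the non-zero cells are actually stored, which is the whole point for a
sparsely populated matrix.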

I didn't read the references - if they make it sound difficult then they must be
dealing with a different problem or coded in a language that doesn't support
associative arrays or addressing input that has no specific null value (?) or.....

Ed.

From: Janis Papanagnou on
Maxwell Lol wrote:
> Thomas 'PointedEars' Lahn <PointedEars(a)web.de> writes:
>
>> Janis Papanagnou wrote:
[To make sure; I didn't write the subsequent two lines; that was Mr Lahn.]
>>
>>
>> Temporary files and arrays are unnecessary. You can read and print the
>> columns as rows one by one.

I didn't say they are necessary; rather I suggested to use one of the two
approaches, depending on the actual problem domain. Your suggestion seems
to me to be much inferior WRT performance; O(N^2) instead of O(N), which
is unacceptable if you're processing gigabytes of data as the OP actually
does.

>
> Yes, by re-reading the input file each time you want a new row.

Indeed.

It would be interesting if there were a tool to directly access entities
in a file by seek'ing to the respective position to avoid such performance
degradation; say, as an intelligent tail(1) program may operate on character
entities. But that seems impossible without having a "structured file"; a
database, maybe, or file concepts of higher abstraction levels which
I haven't heard to be in use in practice.

Janis
From: pk on
Janis Papanagnou wrote:

> It would be interesting if there were a tool to directly access entities
> in a file by seek'ing to the respective position to avoid such performance
> degradation; say, as an intelligent tail(1) program may operate on character
> entities. But that seems impossible without having a "structured file"; a
> database, maybe, or file concepts of higher abstraction levels which
> I haven't heard to be in use in practice.

Well, to nitpick a bit, tail can certainly be told to count bytes rather
than lines (using -c), and if all the "cells" are the same width, it may be
possible to calculate the offsets for a certain [row,column] cell. In the
same way, dd can seek to arbitrary positions in the file, so (just for fun!)
one could do, say

# assuming each element is 2 characters wide, 7 rows, 5 columns

$ cat matrix
01 02 03 04 05
06 07 08 09 10
11 12 13 14 15
16 17 18 19 20
21 22 23 24 25
26 27 28 29 30
31 32 33 34 35

# matrix[row,column] is at offset row * 15 + col * 3
# in the file, if row and col are 0-based

# bash
for col in {0..4}; do
    line=""
    sep=""
    for row in {0..6}; do
        offset=$(( row * 15 + col * 3 ))   # 15 bytes per row, 3 per element
        elem=$(dd if=matrix skip=$offset bs=1 count=2 2>/dev/null)
        line="$line$sep$elem"
        sep=" "
    done
    printf "%s\n" "$line"
done > transposed

$ cat transposed
01 06 11 16 21 26 31
02 07 12 17 22 27 32
03 08 13 18 23 28 33
04 09 14 19 24 29 34
05 10 15 20 25 30 35

Again, it's just for fun. With a large file, that ends up calling dd possibly
millions of times. And although seeks may be fast, I would not bet that in
terms of pure runtime it would be faster than the plain O(N^2) approach (but I
haven't tried).
From: pk on
pk wrote:

> Again, it's just for fun. With a large file, that ends up calling dd
> possibly millions of times. And although seeks may be fast, I would not
> bet that in terms of pure runtime it would be faster than the plain
> O(N^2) approach (but I haven't tried).

Just to follow up a bit on that, still assuming fixed widths, the same
algorithm could also be implemented in a programming language like Perl or
Python, which would probably be much more efficient as it won't have all the
overhead of spawning processes. Here's an example in Perl:

open(M, "<", $ARGV[0]);
for $col (0..4) {                          # 5 columns become 5 output rows
    $sep = "";
    for $row (0..6) {                      # 7 rows become 7 output columns
        seek(M, $row * 15 + $col * 3, 0);  # 15 bytes/row, 3 bytes/element
        read(M, $elem, 2);
        print $sep, $elem;
        $sep = " ";
    }
    print "\n";
}
close(M);

(all error checking omitted)
From: Thomas 'PointedEars' Lahn on
Janis Papanagnou wrote:

> Maxwell Lol wrote:
>> Thomas 'PointedEars' Lahn <PointedEars(a)web.de> writes:
>>> Janis Papanagnou wrote:
> [To make sure; I didn't write the subsequent two lines; that was Mr Lahn.]

Isn't it a bit pointless to include that attribution line then?

>>> Temporary files and arrays are unnecessary. You can read and print the
>>> columns as rows one by one.
>
> I didn't say they are necessary; rather I suggested to use one of the two
> approaches, depending on the actual problem domain. Your suggestion seems
> to me to be much inferior WRT performance; O(N^2) instead of O(N), which
> is unacceptable if you're processing gigabytes of data as the OP actually
> does.

AISB, it is a tradeoff: Memory vs. runtime efficiency/complexity. The OP
asked for a solution that would save memory, hence my suggestion.

As to whether that iterative approach would be unacceptable, that depends on
the machine speed and the other processes it is supposed to be running.
ISTM that nowadays processing a large text file multiple times is preferable
to holding a multiple of the file's size in memory just to store its user
data.
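For illustration, one way the multi-pass idea could be sketched with standard
tools (this is a sketch, not the script actually timed below; tab-separated
input and the file names are my assumptions):

```shell
# Re-read the input once per column and print that column as a row.
# Memory use stays constant; the file is scanned as many times as it
# has columns.
printf '1\t2\t3\n4\t5\t6\n' > m.txt
ncols=$(head -n 1 m.txt | awk -F'\t' '{ print NF }')
for c in $(seq 1 "$ncols"); do
    cut -f "$c" m.txt | paste -s -      # one column -> one tab-joined row
done > t.txt
cat t.txt
```

Each pass holds at most one line of the file, which is where the small,
constant memory footprint comes from.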

Using an implementation of my suggestion, transposing a 1000×1000 matrix in
a 6.6 MiB text file and storing the result in another text file took only 84
seconds on an Intel Core2 Duo SU9400 CPU (1.4 GHz, 800 MHz FSB) that already
had an estimated combined load of 50-75% primarily due to a multimedia
application running in the background. The shell process that did it and
its child processes combined never required more than 4 MiB of memory.


PointedEars