From: kj on



(1) I have a relatively large file (9.4GB) that contains a
rectangular matrix (columns separated by tabs, rows separated by
newlines). I want to generate a file that contains the transpose
of this matrix, and do so without slurping the entire matrix into
memory all at once.

Is there a utility that would be helpful for this task?

(2) The only approach I can think of is to write temporary files
containing the transposes of the submatrices corresponding to
"strips" of n consecutive rows, and then using /usr/bin/paste to
glue all these submatrices into a single file.

Still, even this strategy requires transposing the n rows. I can
do this easily with a Python or Perl script, but I was wondering
if there is some Unix utility to do it?

Any suggestions for accomplishing (1) or (2) from the command line
using Unix utilities would be appreciated.

(FWIW, I use zsh.)

TIA!

~K
From: Thomas 'PointedEars' Lahn on
kj wrote:

> (1) I have a relatively large file (9.4GB) that contains a
> rectangular matrix (columns separated by tabs, rows separated by
> newlines). I want to generate a file that contains the transpose
> of this matrix, and do so without slurping the entire matrix into
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> memory all at once.
  ^^^^^^^^^^^^^^^^^^^

The only way to do this that I can see is to cut out the i-th column,
convert newlines to tabs, print newline, and repeat that for every column
with increasing `i'. IOW, you would be trading memory efficiency against
runtime efficiency, since you would have to read the entire file as often as
you have columns. Utilities that would come in handy then are cut(1) or
awk(1), and tr(1). Shell redirection can optionally write the result into a
new file.
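
A minimal sketch of the column-at-a-time idea described above, assuming
tab-separated input with the same number of fields in every row. The
function name transpose_by_columns is made up for illustration, and
paste -s is swapped in for the tr step so that each output row ends
without a trailing tab.

```shell
# Hypothetical helper: transpose a tab-separated file by re-reading it
# once per column.  Usage: transpose_by_columns INPUT > OUTPUT
transpose_by_columns() {
    src=$1
    # Count the columns from the first row (assumes a rectangular matrix).
    ncols=$(head -n 1 "$src" | awk -F '\t' '{ print NF }')
    i=1
    while [ "$i" -le "$ncols" ]; do
        # cut extracts column i (tab is its default delimiter);
        # paste -s joins those lines with tabs into one output row.
        cut -f "$i" "$src" | paste -s -
        i=$((i + 1))
    done
}
```

For a matrix with C columns this reads the input C times, so memory use
stays tiny but the run time grows with the column count.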


HTH

PointedEars
From: pk on
kj wrote:

> (1) I have a relatively large file (9.4GB) that contains a
> rectangular matrix (columns separated by tabs, rows separated by
> newlines). I want to generate a file that contains the transpose
> of this matrix, and do so without slurping the entire matrix into
> memory all at once.
>
> Is there a utility that would be helpful for this task?
>
> (2) The only approach I can think of is to write temporary files
> containing the transposes of the submatrices corresponding to
> "strips" of n consecutive rows, and then using /usr/bin/paste to
> glue all these submatrices into a single file.
>
> Still, even this strategy requires transposing the n rows. I can
> do this easily with a Python or Perl script, but I was wondering
> if there is some Unix utility to do it?
>
> Any suggestions for accomplishing (1) or (2) from the command line
> using Unix utilities would be appreciated.

AFAICT, you can't even write a single complete line of the transposition
without having read up to the last line of the original matrix.

While on 64-bit machines with enough RAM a process could probably keep 9.4GB
of data in memory, I think the better approach is to read the file line by
line, append each item to its own "column" file, and then just cat all these
files together to get the transposed matrix.

For example, with a small matrix using awk:

$ cat matrix
a b c d
e f g h
i j k l
m n o p
q r s t

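$ awk 'NR==1{
    # build the per-column file names once and remember the column count
    for(i=1;i<=NF;i++)names[i]=sprintf("column%03d", i)
    ncols=NF
  }
  {
    # append each field to its column file; s[i] is empty on the first row
    for(i=1;i<=NF;i++){
      printf "%s%s", s[i], $i > names[i]
      s[i]=FS
    }
  }
  END{
    # add terminating newlines (older awks do not keep NF in END, so use ncols)
    for(i=1;i<=ncols;i++)
      print "" > names[i]
  }' matrix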

(adapt to your actual number of columns, separator, etc.)
To avoid the overhead of calling sprintf() for every field, the NR==1 block
saves the file names in an array at the beginning, and the output is then
redirected to names[i]. When that has run,

$ cat column001
a e i m q
$ cat column002
b f j n r

$ cat column* > transposed
$ cat transposed
a e i m q
b f j n r
c g k o s
d h l p t
From: Seebs on
On 2010-06-03, kj <no.email(a)please.post> wrote:
> (1) I have a relatively large file (9.4GB) that contains a
> rectangular matrix (columns separated by tabs, rows separated by
> newlines). I want to generate a file that contains the transpose
> of this matrix, and do so without slurping the entire matrix into
> memory all at once.

> Is there a utility that would be helpful for this task?

There are utilities that would be helpful, but I don't think it
can be done without temporary files.

Quite simply: So far as I can tell, before you can finish the first
line of your output, you have to have read the last line, so either
you're jumping around a lot, which will be excruciatingly slow, or
you have it all in memory, or...

> (2) The only approach I can think of is to write temporary files
> containing the transposes of the submatrices corresponding to
> "strips" of n consecutive rows, and then using /usr/bin/paste to
> glue all these submatrices into a single file.

Well, pragmatically, the shortest path is probably to do something
much like this.
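
A rough sketch of that strip-and-paste approach, assuming tab-separated
rectangular input and a system with split, mktemp -d and awk available.
The function name transpose_by_strips and the strip. file prefix are
invented for illustration. Only n rows are held in memory at a time.

```shell
# Hypothetical helper: transpose SRC in strips of N rows, then glue the
# transposed strips side by side.  Usage: transpose_by_strips SRC N > OUT
transpose_by_strips() {
    src=$1
    n=$2
    work=$(mktemp -d) || return 1
    # split writes consecutive N-line pieces named strip.aa, strip.ab, ...
    split -l "$n" "$src" "$work/strip."
    for piece in "$work"/strip.*; do
        # Transpose one strip in memory: cell[col, row] holds the field.
        awk -F '\t' '
            { for (i = 1; i <= NF; i++) cell[i, NR] = $i; cols = NF }
            END {
                for (i = 1; i <= cols; i++) {
                    line = cell[i, 1]
                    for (r = 2; r <= NR; r++) line = line "\t" cell[i, r]
                    print line
                }
            }' "$piece" > "$piece.t"
        rm -f "$piece"
    done
    # The glob sorts the names, so strips are pasted in original row order.
    paste "$work"/strip.*.t
    rm -rf "$work"
}
```

With the default two-letter suffixes split can produce at most 676
pieces, and paste opens every piece at once, so n has to be chosen
large enough to stay under both that count and the open-file limit.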

A thought: Do you have any expectations about the contents of the
fields? Say, are they of fixed length? Could they be padded to a
fixed length without undue hardship? Do you know in advance the number
of rows and columns? It wouldn't be exceptionally hard to write a
new file containing a fixed-size grid of fixed-size fields, then
write a tiny little C program to go through populating fields
appropriately.

Finally, last but not least: Use sqlite, shove everything into a
table, extract from the table. It will use a ton of disk space and
CPU time, but it will work within available memory and be surprisingly
zippy, I'd guess.

-s
--
Copyright 2010, all wrongs reversed. Peter Seebach / usenet-nospam(a)seebs.net
http://www.seebs.net/log/ <-- lawsuits, religion, and funny pictures
http://en.wikipedia.org/wiki/Fair_Game_(Scientology) <-- get educated!
From: Janis Papanagnou on
kj wrote:
> (1) I have a relatively large file (9.4GB) that contains a
> rectangular matrix (columns separated by tabs, rows separated by
> newlines). I want to generate a file that contains the transpose
> of this matrix, and do so without slurping the entire matrix into
> memory all at once.
>
> Is there a utility that would be helpful for this task?
>
> (2) The only approach I can think of is to write temporary files
> containing the transposes of the submatrices corresponding to
> "strips" of n consecutive rows, and then using /usr/bin/paste to
> glue all these submatrices into a single file.
>
> Still, even this strategy requires transposing the n rows. I can
> do this easily with a Python or Perl script, but I was wondering
> if there is some Unix utility to do it?
>
> Any suggestions for accomplishing (1) or (2) from the command line
> using Unix utilities would be appreciated.

You're already aware of the problem that you have in principle with
this type of task. One more thought; are your matrices fully filled
or sparsely populated? In the latter case you might be able to use
the all-in-memory approach anyway, because you'd only need to store
the actual values and leave out the zero values. (In any language,
like awk, that supports associative arrays this would require just
a few lines of code.) If your matrices are not sparsely populated and
your memory is limited, then go the way of using temporary files.
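
A small sketch of the sparse, all-in-memory variant with awk's
associative arrays, assuming tab-separated input in which empty cells
are written as a numeric zero. The function name transpose_sparse is
made up for illustration. Note that every zero is written back as a
plain 0, whatever its original spelling (0.0, 00, ...).

```shell
# Hypothetical helper: keep only non-zero cells in memory, then print
# the transpose.  Usage: transpose_sparse INPUT > OUTPUT
transpose_sparse() {
    awk -F '\t' '
        {
            # Store only the non-zero cells; zero cells cost no memory.
            for (i = 1; i <= NF; i++)
                if ($i != 0) cell[i, NR] = $i
            cols = NF
        }
        END {
            for (i = 1; i <= cols; i++) {
                for (r = 1; r <= NR; r++) {
                    # "in" tests membership without creating the element.
                    v = ((i, r) in cell) ? cell[i, r] : 0
                    printf "%s%s", v, (r < NR ? "\t" : "\n")
                }
            }
        }' "$1"
}
```

Memory use is proportional to the number of non-zero cells, so this
only helps when the matrix really is sparse.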

Janis

>
> (FWIW, I use zsh.)
>
> TIA!
>
> ~K