From: pg on
I use the approach with intermediate files, using Unix "split", "tr",
then "paste", admittedly on much smaller data sets. This is faster
than "cut", because cutting out each column iteratively means reading

n lines, each up to the first delimiter, then
n lines up to the second delimiter,
...
n lines to the end of the line.

Thus it requires reading O(n + 2n + ... + mn) = O(n m^2) worth of data
to produce the m output rows, then another O(mn) to read those and
write them out as a single file.
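
For comparison, a rough sketch of that per-column "cut" approach
(assuming tab-delimited input in "$1", writing the transpose to "$1".t):

#!/bin/sh
m=$(head -n 1 "$1" | awk -F '\t' '{ print NF }')   # number of columns
: > "$1".t
i=1
while [ "$i" -le "$m" ]; do
    # One full pass over the input per column: pull out column i and
    # serialize it into one tab-separated row of the transpose.
    cut -f "$i" "$1" | paste -s -d '\t' - >> "$1".t
    i=$((i + 1))
done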

Using "split" requires 1 pass O(mn) to produce n rows, 1 pass to read
each row, transpose it and write to a column file, then 1 pass for
"paste" to read one line from each column file and concatenate it to
the final output file.

On datasets of several MB, split was also 5-6 times faster than a
"while read" loop to achieve the same thing. So here is the code I use
(more or less):

#!/bin/sh
d1=$(mktemp -d)
d2=$(mktemp -d)
split -l 1 -a 20 "$1" "$d1"/   # one file per input row; assumes at most 26^20 rows
for f in "$d1"/*; do
    # Turn the row into a column: one value per line, under the same name in $d2.
    tr '\t' '\n' < "$f" > "$d2/${f##*/}"
done
paste "$d2"/* > "$1".t
rm -rf "$d1" "$d2"
From: Robert Bonomi on
In article <86rqqmF14dU1(a)mid.individual.net>,
Greg Russell <grussell(a)example.con> wrote:
>"Janis Papanagnou" <janis_papanagnou(a)hotmail.com> wrote in message
>news:hua1jh$4j3$1(a)news.m-online.net...
>
>> You're already aware of the problem that you have in principle with
>> this type of task. One more thought; are your matrices fully filled
>> or sparsely populated?
>
>Excellent point.
>
>> In the latter case you might be able to use
>> the all-in-memory approach anyway, because you'd need just allocate
>> the actual values and leave the zero-values away.
>
>Are you assuming that zero values constitute "sparse" elements of the array?
>The OP made no such indication, only that "... columns [are] separated by
>tabs, rows [are] separated by newlines."
>
>Null values e.g. ...\t\t... are far more reasonable for the sparse entries,
>but of course the OP can define the data structure that is of question.
>
>> (In any language, like awk, that supports associative arrays this
>> would require just a few lines of code.)
>
>Please educate us in such simplicity ... please.
>
>Even such well-known routines as the R/S/SPlus http://cran.r-project.org
>transpose function, or the Fortran transpose routines from http://ornl.gov
>(U.S. Oak Ridge National Laboratories) that we used 20+ years ago on such
>arrays, are more than "a few lines of code."

Yup, they don't handle arrays the way AWK does, either.

Awk has *BUILT-IN* support for sparse arrays.

>If you can provide the "few lines" of awk necessary then you too can be
>famous rather than merely flippant.
>

In awk it *IS* trivial: you pull the input value into a simple variable,
then compare the value of that variable against the 'missing data' value;
on a *mismatch* you assign it to array[row,col]. Awk arrays are _dynamic_
in size: when you reference a previously un-referenced element, the array
grows by *one* element, regardless of whether it is contiguous to another
element or not.
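
Something along these lines -- a minimal, untested sketch, assuming
tab-separated input in a hypothetical file matrix.tsv and treating an
empty field as the 'missing data' value:

awk -F '\t' '
    {
        # Store only the real values; empty cells are never assigned,
        # so the array stays as sparse as the data.
        for (col = 1; col <= NF; col++)
            if ($col != "")
                a[col, NR] = $col
        if (NF > maxcol) maxcol = NF
    }
    END {
        # Emit column c of the input as row c of the output.
        for (col = 1; col <= maxcol; col++) {
            line = ""
            for (row = 1; row <= NR; row++)
                line = line (row > 1 ? "\t" : "") (((col, row) in a) ? a[col, row] : "")
            print line
        }
    }
' matrix.tsv > matrix.transposed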


From: Robert Bonomi on
In article <hub0s6$fkp$1(a)news.m-online.net>,
Janis Papanagnou <janis_papanagnou(a)hotmail.com> wrote:
>Maxwell Lol wrote:
>> Thomas 'PointedEars' Lahn <PointedEars(a)web.de> writes:
>>
>>> Janis Papanagnou wrote:
>[To make sure; I didn't write the subsequent two lines; that was Mr Lahn.]
>>>
>>>
>>> Temporary files and arrays are unnecessary. You can read and print the
>>> columns as rows one by one.
>
>I didn't say they are necessary; rather, I suggested using one of the two
>approaches, depending on the actual problem domain. Your suggestion seems
>to me to be much inferior WRT performance; O(N^2) instead of O(N), which
>is unacceptable if you're processing gigabytes of data as the OP actually
>does.
>
>>
>> Yes, by re-reading the input file each time you want a new row.
>
>Indeed.
>
>It would be interesting if there were a tool to directly access entities
>in a file by seek'ing to the respective position to avoid such performance
>degradation; say, as an intelligent tail(1) program may operate on character
>entities. But that seems impossible without having a "structured file"; a
>database, maybe, or using file concepts of higher abstraction levels which
>I haven't heard of being in use in practice.

Actually, 'dd' will do that. IF the fields are fixed-width (which they
probably are *NOT*, in the OP's case) one could do
"dd bs={fieldwidth} iskip={fieldindex} sourcefile"
to extract a single field. I suspect that this would be _seriously_ slower
than the 'read a row, write to per-column tempfiles. Would eliminate the
need for additional 9+ gigs of disk-space for the tempfile, though.
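
For instance -- purely an illustration, with a made-up file name, width
and index; count=1 limits it to the one field:

  dd if=matrix.dat bs=8 skip=2 count=1 2>/dev/null   # 3rd 8-byte field of row 1

and field k (0-based) of row r would sit at block r*{fields_per_row} + k,
so looping over r with k fixed would pull out one whole column.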

From: Janis Papanagnou on
On 24/06/10 20:20, Robert Bonomi wrote:
> In article <hub0s6$fkp$1(a)news.m-online.net>,
> Janis Papanagnou <janis_papanagnou(a)hotmail.com> wrote:
>> Maxwell Lol wrote:
>>> Thomas 'PointedEars' Lahn <PointedEars(a)web.de> writes:
>>>
>>>> Janis Papanagnou wrote:
>> [To make sure; I didn't write the subsequent two lines; that was Mr Lahn.]
>>>>
>>>>
>>>> Temporary files and arrays are unnecessary. You can read and print the
>>>> columns as rows one by one.
>>
>> I didn't say they are necessary; rather, I suggested using one of the two
>> approaches, depending on the actual problem domain. Your suggestion seems
>> to me to be much inferior WRT performance; O(N^2) instead of O(N), which
>> is unacceptable if you're processing gigabytes of data as the OP actually
>> does.
>>
>>>
>>> Yes, by re-reading the input file each time you want a new row.
>>
>> Indeed.
>>
>> It would be interesting if there were a tool to directly access entities
>> in a file by seek'ing to the respective position to avoid such performance
>> degradation; say, as an intelligent tail(1) program may operate on character
>> entities. But that seems impossible without having a "structured file"; a
>> database, maybe, or using file concepts of higher abstraction levels which
>> I haven't heard of being in use in practice.
>
> Actually, 'dd' will do that. IF the fields are fixed-width

That's exactly the crux of why it doesn't apply in the general case we have.

Janis

> (which they
> probably are *NOT*, in the OP's case) one could do
> "dd bs={fieldwidth} iskip={fieldindex} sourcefile"
> to extract a single field. I suspect that this would be _seriously_ slower
> than the 'read a row, write to per-column tempfiles. Would eliminate the
> need for additional 9+ gigs of disk-space for the tempfile, though.
>

From: Robert Bonomi on
In article <i00b2s$cof$1(a)news.m-online.net>,
Janis Papanagnou <janis_papanagnou(a)hotmail.com> wrote:
>On 24/06/10 20:20, Robert Bonomi wrote:
>> In article <hub0s6$fkp$1(a)news.m-online.net>,
>> Janis Papanagnou <janis_papanagnou(a)hotmail.com> wrote:
>>> Maxwell Lol wrote:
>>>> Thomas 'PointedEars' Lahn <PointedEars(a)web.de> writes:
>>>>
>>>>> Janis Papanagnou wrote:
>>> [To make sure; I didn't write the subsequent two lines; that was Mr Lahn.]
>>>>>
>>>>>
>>>>> Temporary files and arrays are unnecessary. You can read and print the
>>>>> columns as rows one by one.
>>>
>>> I didn't say they are necessary; rather, I suggested using one of the two
>>> approaches, depending on the actual problem domain. Your suggestion seems
>>> to me to be much inferior WRT performance; O(N^2) instead of O(N), which
>>> is unacceptable if you're processing gigabytes of data as the OP actually
>>> does.
>>>
>>>>
>>>> Yes, by re-reading the input file each time you want a new row.
>>>
>>> Indeed.
>>>
>>> It would be interesting if there were a tool to directly access entities
>>> in a file by seek'ing to the respective position to avoid such performance
>>> degradation; say, as an intelligent tail(1) program may operate on character
>>> entities. But that seems impossible without having a "structured file"; a
>>> database, maybe, or using file concepts of higher abstraction levels which
>>> I haven't heard of being in use in practice.
>>
>> Actually, 'dd' will do that. IF the fields are fixed-width
>
>That's exactly the crux of why it doesn't apply in the general case we have.

Just playing devil's advocate -- it would be "somewhat" painful (maybe a
record for understatement!), but one *could* still use dd, if you had
another tool that made a pass through the file and recorded the start
position and length of each value.
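
For instance, such an index pass could look roughly like this sketch
(assuming tab-separated, single-byte text in a hypothetical matrix.tsv;
it prints row, column, byte offset, and length for every value):

awk -F '\t' '
    BEGIN { offset = 0 }
    {
        for (i = 1; i <= NF; i++) {
            print NR, i, offset, length($i)
            offset += length($i) + 1    # +1 for the tab (or the newline)
        }
    }
' matrix.tsv > matrix.index

dd if=matrix.tsv bs=1 skip=<offset> count=<length> would then pull any
single value back out.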

Seriously, your best bet is going to be to take a 'strip' of rows that will
fit into memory, transpose it, and output it to a column file. Repeat for as
many additional strips as it takes to process the file, then join the column
files line by line (paste rather than cat) into the final output.
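
A minimal sketch of that, assuming tab-separated input in "$1" and a
strip size of 1000 rows (the per-strip results are joined line-wise
with paste at the end):

#!/bin/sh
d=$(mktemp -d)
split -l 1000 "$1" "$d/strip."
for f in "$d"/strip.*; do
    # Transpose one strip in memory: line i of the result holds the
    # strip's values for source column i.
    awk -F '\t' '
        { for (i = 1; i <= NF; i++) a[i, NR] = $i; if (NF > m) m = NF }
        END {
            for (i = 1; i <= m; i++) {
                line = ""
                for (j = 1; j <= NR; j++)
                    line = line (j > 1 ? "\t" : "") a[i, j]
                print line
            }
        }
    ' "$f" > "$f".t
done
paste "$d"/strip.*.t > "$1".t
rm -rf "$d"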

Or, if you know the maximum width of a field, you can write a little program
that reads sequential values from the source file, and plugs them into a
calculated position in a transposition file, being careful to include an
appropriate delimiter with each field written. Then use another hand-crafted
program that reads that transposition file, and 'squeezes out' all the 'null'
values (this converts back to the variable-width fields of the original).
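
And a sketch of that second idea (the assumptions are all mine:
tab-separated input in "$1", single-byte characters, values that never
contain NUL bytes, and a known maximum field width; it runs one dd per
field, so it is painfully slow and only meant to show the layout):

#!/bin/sh
w=16                       # slot width in bytes; must exceed the longest value
n=$(wc -l < "$1")          # source rows = fields per transposed row
slots="$1".slots
: > "$slots"
row=0
while IFS= read -r line; do
    col=0
    printf '%s\n' "$line" | tr '\t' '\n' | while IFS= read -r val; do
        # Inside a transposed row the delimiter is a tab; at its end, a newline.
        if [ "$row" -eq "$((n - 1))" ]; then d='\n'; else d='\t'; fi
        # Slot number col*n + row is this value's calculated position in the
        # transpose; conv=sync pads the short write out to a full w-byte slot
        # with NUL bytes.
        printf "%s$d" "$val" |
            dd of="$slots" bs="$w" seek="$((col * n + row))" count=1 \
               conv=notrunc,sync 2>/dev/null
        col=$((col + 1))
    done
    row=$((row + 1))
done < "$1"
# 'Squeeze out' the NUL padding to get the variable-width fields back.
tr -d '\0' < "$slots" > "$1".t
rm -f "$slots"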