From: lbrtchx on
~
I was wondering if you know of a more efficient way to stratify values in a column of a file.
~
Say you have a file with one data item on each line and you would like to know how many times each item occurs in it. The file could look like this:
~
// __ f00.txt
~
6456
qweRt
aAbCC
aAbCC
aabCC
96
qwert
96
645
aAbCC
~
1) the way to go (I think) would be to first sort the initial file:
~
knoppix(a)Microknoppix:~/tmp$ sort f00.txt -o f02.txt
knoppix(a)Microknoppix:~/tmp$ cat f02.txt
645
6456
96
96
aAbCC
aAbCC
aAbCC
aabCC
qweRt
qwert
~
2) then compare the file line by line, counting runs of identical lines, with a snippet looking like this:
~
#!/bin/sh

## reads the input file (first argument) line by line and appends
## "value",count to the output file (second argument) for every value
## that occurs more than once

# input file
_IFl="$1"

# output file
_OFl="$2"

# start from a clean output file (-f so a missing file is not an error)
rm -f "${_OFl}"

# split on newlines only, so lines containing spaces are read whole
ORIGIFS=$IFS
IFS=$(printf '\n\b')

exec 3<&0
exec 0<"${_IFl}"

_ln00=""
# total number of lines read
_icntttl=0
_icnt=1

#
while read -r line
do
_ln02=$line
# debug output: current and previous line
echo \"${_ln02}\" \"${_ln00}\"
if [ "${_ln00}" = "${_ln02}" ]; then
_icnt=`expr $_icnt + 1`
else
if [ "$_icnt" -gt 1 ]; then
echo \"$_ln00\",$_icnt >> "${_OFl}"
_icnt=1
fi
fi
_ln00=$_ln02
_icntttl=`expr $_icntttl + 1`
done

# flush the last run; without this a trailing group of duplicates is lost
if [ "$_icnt" -gt 1 ]; then
echo \"$_ln00\",$_icnt >> "${_OFl}"
fi

echo "~"
echo "// __ output file: ${_OFl}"
cat "${_OFl}"

exec 0<&3

IFS=$ORIGIFS
~
sh ./comp00.sh f02.txt f04.txt
~
3) to get as result:
~
knoppix(a)Microknoppix:~/tmp$ cat f04.txt
"96",2
"aAbCC",3
~
lbrtchx
From: Ben Bacarisse on
lbrtchx(a)gmail.com writes:

> ~
> I was wondering if you know of a more efficient way to stratify values in a column of a file.
> ~
> Say you have a file with one data item on each line and you would like to know how many times each item occurs in it. The file could look like this:
> ~
> // __ f00.txt
> ~
> 6456
> qweRt
> aAbCC
> aAbCC
> aabCC
> 96
> qwert
> 96
> 645
> aAbCC
> ~

I'd use awk:

awk '{c[$1]++} END {for (k in c) if (c[k] > 1) print k, c[k] }'

(the name c is not a good one but it does make the code a one-liner).
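
For instance, run against the f00.txt above it should print something like
this (the order in which "for (k in c)" visits the keys is not guaranteed):

awk '{c[$1]++} END {for (k in c) if (c[k] > 1) print k, c[k] }' f00.txt
96 2
aAbCC 3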

--
Ben.
From: Stephane CHAZELAS on
2010-02-04, 17:06(+00), lbrtchx(a)gmail.com:
[...]
> Say you have a file with data on each line and you would like
> to know how many times the data was found in it. File could
> look like this:
[...]

Maybe uniq -c?

sort < file | uniq -c

sorted by number of occurrence:

sort < file | uniq -c | sort -rn

If you only want the duplicated ones:

sort < file | uniq -c | sort -n | awk '$1>1,0'
or with GNU uniq:
sort < file | uniq -D | uniq -c
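
On the f00.txt from the original post, the first pipeline should give
something like this (the width of the count column may vary between
uniq implementations):

sort < f00.txt | uniq -c
      1 645
      1 6456
      2 96
      3 aAbCC
      1 aabCC
      1 qweRt
      1 qwert

and piping that through "sort -rn" would put "3 aAbCC" and "2 96" at the top.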

--
Stéphane
From: Albretch Mueller on
~
OK, we have three algorithms which need two passes through the
original file (even if Ben's looks like a one-liner ;-) and mine looks
lengthier).
~
If you use a high-level programming language, say Java, you will
effectively be looping twice anyway, once for the sort and once for
the accumulation/counting. Even if you recreate the illusion of having
only one loop, for example by using a hash table, the hash table would
still internally do the sorting part.
~
I can't recall now exactly how it is you can log what the OS is doing
in these three cases, but sometimes what looks like a shorter way to
do things is not the most efficient regarding speed and footprint.
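~
One quick way to check, assuming GNU time and strace are installed
(that is an assumption on my side, not something you will find on
every box), would be something like:

# per-syscall summary of what the OS is doing for the pipeline
strace -c -f sh -c 'sort f00.txt | uniq -c' > /dev/null

# wall-clock time and peak memory footprint of each approach
/usr/bin/time -v sh -c 'sort f00.txt | uniq -c' > /dev/null
/usr/bin/time -v awk \
 '{c[$1]++} END {for (k in c) if (c[k] > 1) print k, c[k]}' f00.txt > /dev/null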
~
Database algorithms do this all the time and I am curious as to how
they do it, I mean whether they actually use any slick optimizations
instead of going the procedural monkey way as I did.
~
lbrtchx
From: Seebs on
On 2010-02-04, Ben Bacarisse <ben.usenet(a)bsb.me.uk> wrote:
> I'd use awk:
>
> awk '{c[$1]++} END {for (k in c) if (c[k] > 1) print k, c[k] }'
>
> (the name c is not a good one but it does make the code a one-liner).

That was my first thought, actually. But then it occurred to me that
you could also do
sort | uniq -c

Which is probably (?) faster. I'm actually not sure; I think it depends
on the size of the file and number of duplicates. If you have only a few
words which occur millions of times, sort | uniq -c will probably be
slower. For the vast majority of real-world data sets, I imagine that both
will finish fast enough that you don't actually have to wait for the prompt
to come back.

-s
--
Copyright 2010, all wrongs reversed. Peter Seebach / usenet-nospam(a)seebs.net
http://www.seebs.net/log/ <-- lawsuits, religion, and funny pictures
http://en.wikipedia.org/wiki/Fair_Game_(Scientology) <-- get educated!