From: Jesper Rønn-Jensen on
Hi there.

I'm using this fine script to find all duplicate files in a project:

OUTF=rem-duplicates.sh;
echo "#! /bin/sh" > $OUTF;
find "$@" -type f -print0 |
xargs -0 -n1 md5sum |
sort --key=1,32 | uniq -w 32 -d --all-repeated=separate |
sed -r 's/^[0-9a-f]*( )*//;s/([^a-zA-Z0-9./_-])/\\\1/g;s/(.+)/#rm \1/' >> $OUTF;
chmod a+x $OUTF; ls -l $OUTF


[from http://elonen.iki.fi/code/misc-notes/remove-duplicate-files/ ]


However, it must ignore .svn folders (because basically there is a
duplicate file in the hidden svn folder for every versioned file)

So my idea was to pipe the find into "grep -v .svn" and then add the
--null flag to make sure xargs -0 will get the appropriate input.

However, I can't get it to work. Running grep via -exec makes it
search the contents of each file, not the filename:
#! /bin/sh
OUTF=rem-duplicates.sh;
echo "#! /bin/sh" > $OUTF;
find "$@" -type f -print0 -exec grep -v .svn '{}' \; |
xargs -0 -n1 md5sum |
sort --key=1,32 | uniq -w 32 -d --all-repeated=separate |
sed -r 's/^[0-9a-f]*( )*//;s/([^a-zA-Z0-9./_-])/\\\1/g;s/(.+)/#rm \1/' >> $OUTF;
chmod a+x $OUTF; ls -l $OUTF


How do I change this to grep the filename and not the content of
the file?

PS. I also tried to grep the final file, but by then it's too late:
grep only removes the copy in the svn folder, not the file it
duplicates, which leaves me with a long list of all the files in the
directory structure.
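
For reference, filtering on the file names themselves (rather than on file contents) could look roughly like the sketch below. It assumes GNU grep, whose -z (--null-data) option reads and writes NUL-terminated records; that is a different option from --null (-Z), which only affects file names that grep prints. The directory layout and file names are made up for the demonstration.

```shell
# Throwaway tree with one versioned file and its pristine copy under .svn
# (hypothetical names, for illustration only).
demo=$(mktemp -d)
mkdir -p "$demo/src/.svn/text-base"
echo hello > "$demo/src/a.txt"
echo hello > "$demo/src/.svn/text-base/a.txt.svn-base"

# find emits NUL-terminated names; grep -z treats each name as one record
# and -v drops every name that has a /.svn/ path component.
# tr turns the NULs into newlines only so the result can be stored and shown.
names=$(cd "$demo" && find . -type f -print0 |
        grep -z -v '/\.svn/' |
        tr '\0' '\n' | sort)

printf '%s\n' "$names"
rm -rf "$demo"
```

In the real pipeline the tr step would be dropped and the grep output fed straight into xargs -0. Note that find still walks every .svn directory in this variant; the names are only thrown away afterwards.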

Any help appreciated!

Thanks!

/Jesper Rønn-Jensen
blog: http://justaddwater.dk/
From: Stephane CHAZELAS on
2008-06-18, 06:01(-07), Jesper Rønn-Jensen:
[...]
> I'm using this fine script to find all duplicate files in a project:
>
> OUTF=rem-duplicates.sh;
> echo "#! /bin/sh" > $OUTF;
> find "$@" -type f -print0 |
> xargs -0 -n1 md5sum |
> sort --key=1,32 | uniq -w 32 -d --all-repeated=separate |
> sed -r 's/^[0-9a-f]*( )*//;s/([^a-zA-Z0-9./_-])/\\\1/g;s/(.+)/#rm \1/' >> $OUTF;
> chmod a+x $OUTF; ls -l $OUTF
>
>
> [from http://elonen.iki.fi/code/misc-notes/remove-duplicate-files/ ]
>
>
> However, it must ignore .svn folders (because basically there is a
> duplicate file in the hidden svn folder for every versioned file)
[...]

Change the find cmd to

find "$@" \( ! -name .svn -o -prune \) -type f -print0

It prevents find from descending into the .svn directories.
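
A small way to see the effect, as a sketch on a throwaway tree (the directory and file names are invented for the demonstration; -print is used instead of -print0 only so the output is readable):

```shell
# Build a scratch tree with a fake .svn directory holding a pristine copy.
demo=$(mktemp -d)
mkdir -p "$demo/src/.svn/text-base"
echo hello > "$demo/src/a.txt"
echo hello > "$demo/src/.svn/text-base/a.txt.svn-base"

# Without pruning: both the file and its copy under .svn are listed.
all=$(find "$demo" -type f -print | sort)

# With pruning: anything named .svn hits -prune, so find does not descend
# into it, and only a.txt reaches -type f -print.
pruned=$(find "$demo" \( ! -name .svn -o -prune \) -type f -print | sort)

printf 'all:\n%s\n\npruned:\n%s\n' "$all" "$pruned"
rm -rf "$demo"
```

Because the .svn copies never reach md5sum, the working files no longer show up as duplicates of their own pristine copies.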

--
Stéphane
From: Jesper Rønn-Jensen on
Stephane CHAZELAS>
> find "$@" \( ! -name .svn -o -prune \) -type f -print0
>
> It prevents find from descending in the .svn directories.

Works like a charm! Thanks a lot for your precise and quick answer!


/Jesper

From: Dan Stromberg on
On Wed, 18 Jun 2008 13:06:52 +0000, Stephane CHAZELAS wrote:

> 2008-06-18, 06:01(-07), Jesper Rønn-Jensen: [...]
>> I'm using this fine script to find all duplicate files in a project:
>>
>> OUTF=rem-duplicates.sh;
>> echo "#! /bin/sh" > $OUTF;
>> find "$@" -type f -print0 |
>> xargs -0 -n1 md5sum |
>> sort --key=1,32 | uniq -w 32 -d --all-repeated=separate |
>> sed -r 's/^[0-9a-f]*( )*//;s/([^a-zA-Z0-9./_-])/\\\1/g;s/(.+)/#rm \1/' >> $OUTF;
>> chmod a+x $OUTF; ls -l $OUTF
>>
>>
>> [from http://elonen.iki.fi/code/misc-notes/remove-duplicate-files/ ]
>>
>>
>> However, it must ignore .svn folders (because basically there is a
>> duplicate file in the hidden svn folder for every versioned file)
> [...]
>
> Change the find cmd to
>
> find "$@" \( ! -name .svn -o -prune \) -type f -print0
>
> It prevents find from descending in the .svn directories.

I often use the slightly more concise:

find "$@" -name .svn -prune -o -type f -print0

From: Stephane CHAZELAS on
2008-06-18, 19:24(+00), Dan Stromberg:
[...]
>> find "$@" \( ! -name .svn -o -prune \) -type f -print0
>>
>> It prevents find from descending in the .svn directories.
>
> I often use the slightly more concise:
>
> find "$@" -name .svn -prune -o -type f -print0

The less concise way does print the non-directory files called
.svn though.
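
The difference only shows up when a regular file happens to be named .svn. A sketch that makes it visible, using invented names in a scratch directory:

```shell
demo=$(mktemp -d)
mkdir -p "$demo/proj/.svn" "$demo/other"
echo meta > "$demo/proj/.svn/entries"   # lives inside a .svn directory
echo data > "$demo/proj/real.txt"
echo oops > "$demo/other/.svn"          # regular file that is merely named .svn

# Longer form: -prune is evaluated for anything named .svn and returns true,
# then -type f is still tested, so the regular file other/.svn is printed.
long_form=$(cd "$demo" && find . \( ! -name .svn -o -prune \) -type f -print | sort)

# Shorter form: anything named .svn takes the -prune branch of the -o,
# so the -type f -print branch is never reached for it.
short_form=$(cd "$demo" && find . -name .svn -prune -o -type f -print | sort)

printf 'long form:\n%s\n\nshort form:\n%s\n' "$long_form" "$short_form"
rm -rf "$demo"
```

Both forms skip everything inside .svn directories; which one is preferable depends on whether a stray regular file named .svn should be included in the listing.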


--
Stéphane