Efficiently searching multiple files [Perl]

Prev: FAQ 4.74 How do I print out or copy a recursive data structure?
Next: Where to get the document for a library function in command line (linux)?

From: Jorge on 20 May 2010 15:19

I have a Linux directory full of shell scripts (~100 scripts with
average
size of ~60 lines) with many of them calling other scripts as
depencencies.

Not being the author of the scripts and with the immediate need to
become familiar with
the overall scope of this script directory, I needed to create a cross-
reference chart
that shows what script depends on what other script.

My Perl script (below) works in terms of doing what I need, however,
IMO, it appears to run a bit slow. I timed it at ~3.5 sec. Possibly
that is it's max performance but something tells me (from other Perl
script experiences) this one just isn't up to speed.

I would appreciate any ideas or comments.

Thanks, Jorge

#!/usr/bin/perl -l

use strict;
use warnings;

&scripts_dir('/some/dir/that/holds/scripts');

sub scripts_dir {

# set some vars
my $dir = shift;
my(@paths, $paths, @scripts, $scripts, $string);
my($lines, $leaf, $header, $header2);
local($_);

# check dir can be opened for read
unless (opendir(DIR, $dir)) {
die "can't open $dir $!\n";
closedir(DIR);
return;
}

#
# read dir, skip over system files and bak files
# build array of script names
# build array of full-path script names
#
foreach (readdir(DIR)) {
next if $_ eq '.' || $_ eq '..' || $_ =~ /\.bak$/;
push(@scripts, $_);
$paths = $dir."/".$_;
push(@paths, $paths);
}
closedir(DIR);

# open ouput file, format and print headers
open OUT, ">output" or die "cannot open output for writing $! ...
\n";
$header = sprintf("%-25s%-25s%s", "SCRIPT NAME", "FOUND IN",
"USAGE");
$header2 = sprintf("%-25s%-25s%s", "===========", "========",
"=====");
print OUT $header;
print OUT $header2;

# loop through each script name
foreach $scripts(@scripts){

# loop through each script in directory
foreach my $paths(@paths){

# get last leaf of script being searched --
# if it matches itself; skip
$leaf = get_leaf($paths);
if($scripts eq $leaf) { next;}

# open each script for searching
open F, "$paths" or die "cannot open $paths for reading
$! ...\n";

while(my $lines = <F>) {

# -l switch in place
chomp($lines);

# search for matches to the commonly-used command
syntax
if($lines =~ /\$\($scripts / || $lines =~ /\`
$scripts /){

# format to line up with headers
$string = sprintf("%-25s%-25s%s", $scripts, $leaf,
$lines);

# print to file
print OUT $string;
}
}
}
}
# close I/O streams
close(F);
close(OUT);
}

#======================
# subroutine get_leaf
#======================
sub get_leaf
{
# get arg(s)
my($pathname) = @_;

# split on leafs
my @split_pathname = split( /[\\\/]/, $pathname);

# grab last leaf
my $leaf = pop( @split_pathname );

# return bare script name (leaf)
return( $leaf );
}

__END__

From: Uri Guttman on 20 May 2010 15:47

>>>>> "J" == Jorge <awkster(a)yahoo.com> writes:

J> #!/usr/bin/perl -l

J> use strict;
J> use warnings;

good

J> &scripts_dir('/some/dir/that/holds/scripts');

don't call subs with & as it isn't needed with () and it is perl4
style. there are other subtle issues that can bite you.

J> sub scripts_dir {

J> # set some vars
J> my $dir = shift;
J> my(@paths, $paths, @scripts, $scripts, $string);
J> my($lines, $leaf, $header, $header2);

don't declare vars before you need them. in many cases you can declare
then when first used.

J> local($_);

why? actually i recommend against using $_ as much as possible (it does
have its places) so you can used named vars which are easier to read and
make the code better

J> # check dir can be opened for read
J> unless (opendir(DIR, $dir)) {

use a lexical handle

unless (opendir(my $dirh, $dir)) {

J> die "can't open $dir $!\n";
J> closedir(DIR);

why close the handle if it never opened?

J> return;

you can just exit here instead.

J> }

better yet, use File::Slurp's read_dir.

J> #
J> # read dir, skip over system files and bak files
J> # build array of script names
J> # build array of full-path script names
J> #
J> foreach (readdir(DIR)) {
J> next if $_ eq '.' || $_ eq '..' || $_ =~ /\.bak$/;

read_dir skips . and .. for you.

J> push(@scripts, $_);
J> $paths = $dir."/".$_;
J> push(@paths, $paths);
J> }
J> closedir(DIR);

that can all be done so much simpler like this (untested):

my @paths = map "$dir/$_", grep !/\.bak$/, read_dir $dir ;

J> # open ouput file, format and print headers
J> open OUT, ">output" or die "cannot open output for writing $!
J> ...

ditto for lexical handles

J> \n";
J> $header = sprintf("%-25s%-25s%s", "SCRIPT NAME", "FOUND IN",
J> "USAGE");
J> $header2 = sprintf("%-25s%-25s%s", "===========", "========",
J> "=====");

you use the same sprintf format three times. put that into a variable
above so you can change it in one place.

J> print OUT $header;
J> print OUT $header2;

no need for two print calls.

print OUT $header, $header2 ;

J> # loop through each script name
J> foreach $scripts(@scripts){

foreach my $scripts (@scripts) {

that is how you can declare vars locally. and use more horizontal
whitespace. others need to read your code so think about them when you
write it. and YOU are an other too! :)

i noticed this below but the point is true here. don't use plural names
for singular things. $scripts has a single script name

J> # loop through each script in directory
J> foreach my $paths(@paths){

you used my there. why not in the previous loop?

J> # get last leaf of script being searched --
J> # if it matches itself; skip
J> $leaf = get_leaf($paths);
J> if($scripts eq $leaf) { next;}

slightly faster as you don't need a block entry on next. also a better
style for simple flow control like this.

next if $scripts eq $leaf ;

J> # open each script for searching
J> open F, "$paths" or die "cannot open $paths for reading

don't quote scalar vars as it can lead to subtle bugs. it is not needed here.

J> $! ...\n";

J> while(my $lines = <F>) {

$lines is a single line so make that singular. using a plural name
implies an array or array ref

J> # -l switch in place

that only affects one liners using -p or -n. no effect on regular code

J> chomp($lines);

J> # search for matches to the commonly-used command
J> syntax
J> if($lines =~ /\$\($scripts / || $lines =~ /\`
J> $scripts /){

this doesn't make sense. are you checking if the current script refers
to itself??

J> # format to line up with headers
J> $string = sprintf("%-25s%-25s%s", $scripts, $leaf,
J> $lines);

J> # print to file
J> print OUT $string;

since you just print the string, you can use printf directly.

J> }
J> }
J> }
J> }
J> # close I/O streams
J> close(F);
J> close(OUT);
J> }

J> #======================
J> # subroutine get_leaf
J> #======================
J> sub get_leaf
J> {
J> # get arg(s)
J> my($pathname) = @_;

J> # split on leafs
J> my @split_pathname = split( /[\\\/]/, $pathname);

gack!!

File::Basename does this for you and simpler.

J> # grab last leaf
J> my $leaf = pop( @split_pathname );

you could just return the popped value directly. hell, you can do that
in the split line too:

return( (split( /[\\\/]/, $pathname)[-1] );

as for speedups, i can't help much since i don't have your input
data. nothing seems oddly slow looking but your core loops are deep and
can be done better. slurping in the entire file (File::Slurp) and
scanning for those lines in one regex would be noticeably faster. and
your core logic is suspect as it doesn't seem to check for calling other
scripts from a given one.

uri

--
Uri Guttman ------ uri(a)stemsystems.com -------- http://www.sysarch.com --
----- Perl Code Review , Architecture, Development, Training, Support ------
--------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------

From: J. Gleixner on 20 May 2010 16:02

Jorge wrote:
> I have a Linux directory full of shell scripts (~100 scripts with
> average
> size of ~60 lines) with many of them calling other scripts as
> depencencies.
>
> Not being the author of the scripts and with the immediate need to
> become familiar with
> the overall scope of this script directory, I needed to create a cross-
> reference chart
> that shows what script depends on what other script.
>
> My Perl script (below) works in terms of doing what I need, however,
> IMO, it appears to run a bit slow. I timed it at ~3.5 sec. Possibly
> that is it's max performance but something tells me (from other Perl
> script experiences) this one just isn't up to speed.

3 seconds isn't fast enough??.. just write it, run it, and move on..
sheesh..

>
> I would appreciate any ideas or comments.
>
> Thanks, Jorge
>
> #!/usr/bin/perl -l
>
> use strict;
> use warnings;
>
> &scripts_dir('/some/dir/that/holds/scripts');
Remove the '&' there.
>
> sub scripts_dir {
>
> # set some vars
> my $dir = shift;
> my(@paths, $paths, @scripts, $scripts, $string);
> my($lines, $leaf, $header, $header2);
> local($_);

Declare your vars in the smallest possible scope.
>
> # check dir can be opened for read
> unless (opendir(DIR, $dir)) {
> die "can't open $dir $!\n";

After die is called, nothing after this point is called..
> closedir(DIR);
> return;
> }
>
> #
> # read dir, skip over system files and bak files
> # build array of script names
> # build array of full-path script names
> #
> foreach (readdir(DIR)) {
> next if $_ eq '.' || $_ eq '..' || $_ =~ /\.bak$/;
> push(@scripts, $_);
> $paths = $dir."/".$_;
> push(@paths, $paths);
> }
> closedir(DIR);
>
> # open ouput file, format and print headers
> open OUT, ">output" or die "cannot open output for writing $! ...
> \n";
use 3 argument open.

open( my $out, '>', 'output' ) || die "can't open output: $!";
> $header = sprintf("%-25s%-25s%s", "SCRIPT NAME", "FOUND IN",
> "USAGE");
> $header2 = sprintf("%-25s%-25s%s", "===========", "========",
> "=====");
> print OUT $header;
> print OUT $header2;

Just use printf.

printf OUT "%-25s%-25s%s\n", "SCRIPT NAME", "FOUND IN", "USAGE";

or printf $out...
>
> # loop through each script name
> foreach $scripts(@scripts){
for my $script( @scripts )
>
> # loop through each script in directory
> foreach my $paths(@paths){
$path seems better than $paths
>
> # get last leaf of script being searched --
> # if it matches itself; skip
> $leaf = get_leaf($paths);
my $leaf = ( split( /\//, $paths ) )[-1];

> if($scripts eq $leaf) { next;}
next if $scripts eq $leaf;
>
> # open each script for searching
> open F, "$paths" or die "cannot open $paths for reading
> $! ...\n";
open( my $file, '<', $path ) || die "can't open $path: $!;
>
> while(my $lines = <F>) {
my $line instead of my $lines.
>
> # -l switch in place
> chomp($lines);
>
> # search for matches to the commonly-used command
> syntax
> if($lines =~ /\$\($scripts / || $lines =~ /\`
> $scripts /){
>
> # format to line up with headers
> $string = sprintf("%-25s%-25s%s", $scripts, $leaf,
> $lines);
>
> # print to file
> print OUT $string;
again printf
> }
> }

Should close F or $file here.

> }
> }
> # close I/O streams
> close(F);
^^^ should be just after the end of the while.. above.
> close(OUT);
> }
>
> #======================
> # subroutine get_leaf
> #======================
> sub get_leaf
> {
> # get arg(s)
> my($pathname) = @_;
>
> # split on leafs
> my @split_pathname = split( /[\\\/]/, $pathname);
>
> # grab last leaf
> my $leaf = pop( @split_pathname );
>
> # return bare script name (leaf)
> return( $leaf );
> }
>
> __END__
>
>

From: Jorge on 20 May 2010 16:35

I wasn't expecting a full critique but I'll take it -- learning is
good so thanks for your effort and time.

Most of your points are obvious (once read) and some I will have to
digest.

Thanks again

On May 20, 12:47 pm, "Uri Guttman" <u...(a)StemSystems.com> wrote:
> >>>>> "J" == Jorge <awks...(a)yahoo.com> writes:
>
> J> #!/usr/bin/perl -l
>
> J> use strict;
> J> use warnings;
>
> good
>
> J> &scripts_dir('/some/dir/that/holds/scripts');
>
> don't call subs with & as it isn't needed with () and it is perl4
> style. there are other subtle issues that can bite you.
>
> J> sub scripts_dir {
>
> J> # set some vars
> J> my $dir = shift;
> J> my(@paths, $paths, @scripts, $scripts, $string);
> J> my($lines, $leaf, $header, $header2);
>
> don't declare vars before you need them. in many cases you can declare
> then when first used.
>
> J> local($_);
>
> why? actually i recommend against using $_ as much as possible (it does
> have its places) so you can used named vars which are easier to read and
> make the code better
>
> J> # check dir can be opened for read
> J> unless (opendir(DIR, $dir)) {
>
> use a lexical handle
>
> unless (opendir(my $dirh, $dir)) {
>
> J> die "can't open $dir $!\n";
> J> closedir(DIR);
>
> why close the handle if it never opened?
>
> J> return;
>
> you can just exit here instead.
>
> J> }
>
> better yet, use File::Slurp's read_dir.
>
> J> #
> J> # read dir, skip over system files and bak files
> J> # build array of script names
> J> # build array of full-path script names
> J> #
> J> foreach (readdir(DIR)) {
> J> next if $_ eq '.' || $_ eq '..' || $_ =~ /\.bak$/;
>
> read_dir skips . and .. for you.
>
> J> push(@scripts, $_);
> J> $paths = $dir."/".$_;
> J> push(@paths, $paths);
> J> }
> J> closedir(DIR);
>
> that can all be done so much simpler like this (untested):
>
> my @paths = map "$dir/$_", grep !/\.bak$/, read_dir $dir ;
>
> J> # open ouput file, format and print headers
> J> open OUT, ">output" or die "cannot open output for writing $!
> J> ...
>
> ditto for lexical handles
>
> J> \n";
> J> $header = sprintf("%-25s%-25s%s", "SCRIPT NAME", "FOUND IN",
> J> "USAGE");
> J> $header2 = sprintf("%-25s%-25s%s", "===========", "========",
> J> "=====");
>
> you use the same sprintf format three times. put that into a variable
> above so you can change it in one place.
>
> J> print OUT $header;
> J> print OUT $header2;
>
> no need for two print calls.
>
> print OUT $header, $header2 ;
>
> J> # loop through each script name
> J> foreach $scripts(@scripts){
>
> foreach my $scripts (@scripts) {
>
> that is how you can declare vars locally. and use more horizontal
> whitespace. others need to read your code so think about them when you
> write it. and YOU are an other too! :)
>
> i noticed this below but the point is true here. don't use plural names
> for singular things. $scripts has a single script name
>
> J> # loop through each script in directory
> J> foreach my $paths(@paths){
>
> you used my there. why not in the previous loop?
>
> J> # get last leaf of script being searched --
> J> # if it matches itself; skip
> J> $leaf = get_leaf($paths);
> J> if($scripts eq $leaf) { next;}
>
> slightly faster as you don't need a block entry on next. also a better
> style for simple flow control like this.
>
> next if $scripts eq $leaf ;
>
> J> # open each script for searching
> J> open F, "$paths" or die "cannot open $paths for reading
>
> don't quote scalar vars as it can lead to subtle bugs. it is not needed here.
>
> J> $! ...\n";
>
> J> while(my $lines = <F>) {
>
> $lines is a single line so make that singular. using a plural name
> implies an array or array ref
>
> J> # -l switch in place
>
> that only affects one liners using -p or -n. no effect on regular code
>
> J> chomp($lines);
>
> J> # search for matches to the commonly-used command
> J> syntax
> J> if($lines =~ /\$\($scripts / || $lines =~ /\`
> J> $scripts /){
>
> this doesn't make sense. are you checking if the current script refers
> to itself??
>
> J> # format to line up with headers
> J> $string = sprintf("%-25s%-25s%s", $scripts, $leaf,
> J> $lines);
>
> J> # print to file
> J> print OUT $string;
>
> since you just print the string, you can use printf directly.
>
> J> }
> J> }
> J> }
> J> }
> J> # close I/O streams
> J> close(F);
> J> close(OUT);
> J> }
>
> J> #======================
> J> # subroutine get_leaf
> J> #======================
> J> sub get_leaf
> J> {
> J> # get arg(s)
> J> my($pathname) = @_;
>
> J> # split on leafs
> J> my @split_pathname = split( /[\\\/]/, $pathname);
>
> gack!!
>
> File::Basename does this for you and simpler.
>
> J> # grab last leaf
> J> my $leaf = pop( @split_pathname );
>
> you could just return the popped value directly. hell, you can do that
> in the split line too:
>
> return( (split( /[\\\/]/, $pathname)[-1] );
>
> as for speedups, i can't help much since i don't have your input
> data. nothing seems oddly slow looking but your core loops are deep and
> can be done better. slurping in the entire file (File::Slurp) and
> scanning for those lines in one regex would be noticeably faster. and
> your core logic is suspect as it doesn't seem to check for calling other
> scripts from a given one.
>
> uri
>
> --
> Uri Guttman ------ u...(a)stemsystems.com -------- http://www.sysarch.com--
> ----- Perl Code Review , Architecture, Development, Training, Support ------
> --------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com---------

From: Jim Gibson on 20 May 2010 17:30

In article
<6bf31a76-1cd6-48be-8bea-a91b1c6b83b6(a)y21g2000vba.googlegroups.com>,
Jorge <awkster(a)yahoo.com> wrote:

> I have a Linux directory full of shell scripts (~100 scripts with
> average
> size of ~60 lines) with many of them calling other scripts as
> depencencies.
>
> Not being the author of the scripts and with the immediate need to
> become familiar with
> the overall scope of this script directory, I needed to create a cross-
> reference chart
> that shows what script depends on what other script.
>
> My Perl script (below) works in terms of doing what I need, however,
> IMO, it appears to run a bit slow. I timed it at ~3.5 sec. Possibly
> that is it's max performance but something tells me (from other Perl
> script experiences) this one just isn't up to speed.
>
> I would appreciate any ideas or comments.
>
> Thanks, Jorge
>

Uri and J. have given you thorough critiques of your posted program. I
can only add a few additional suggestions.

Concerning speedup, you have implemented an O(N**2) algorithm. You are
reading each of N files N-1 times, looking for a different script name
each time you read each file. You should read each file only once, and
look for all of your script names in each line. You shouldn't exclude
the script your are reading from this search, since scripts can call
themselves. I would also not look in comments, since it is very common
to put script names in comments, and you don't want those.

For how to search for many things at once, see the advice in 'perldoc
-q many' "How do I efficiently match many regular expressions at once?"

You are concatenating the directory name and file name into a path,
then taking many steps to strip the directory name from the path to
retrieve the file name. Don't bother, since you are only working in one
directory and the $dir variable only contains one value. Add $dir to
the open statement or a temp variable:

my $filename = "$dir/$scripts";
open( my $f, '<', $filename ) ...

Then you don't need the @paths array at all.

[program snipped]

--
Jim Gibson

| Next | Last
Pages: 1 2 3
Prev: FAQ 4.74 How do I print out or copy a recursive data structure?
Next: Where to get the document for a library function in command line (linux)?