From: ela on


I'm new to database programming and until now have used loops to look up
and enrich information with the code below. However, when the tables are
large, I find this process very slow. Then somebody told me I could build
a database for one of the files at run time, so there would be no need to
read the file from beginning to end again and again. However, Perl DBI
has a lot of sophisticated functions, and my tables are merely large but
otherwise nothing special, just linked by an ID. Is there any simple way
to achieve the same purpose? I just wish the ID could be indexed so that
every time I access a record it comes from memory rather than from disk
I/O...


#!/usr/bin/perl

my ($listfile, $format, $accfile, $infofile) = @ARGV;
print '($listfile, $accfile, $infofile)'; <STDIN>;

print "Working on $listfile...\n";
$outname = $listfile . "_" . $infofile . ".xls";

open (OFP, ">$outname");

open(FP, $listfile);
$line = <FP>;
chomp $line;

if ($format ne "") {
    @fields = split(/\t/, $line);
    for ($i=0; $i<@fields; $i++) {
        ############## check fields ###############################
        if ( $fields[$i] =~ /accession/) {
            $acci = $i;
        }
    }
}

print OFP "$line\tgene info\n";

$nl = '\n';

while (<FP>) {
    $line = $_;
    if ($line eq "\n") {
        print OFP $line;
        next;
    }
    chomp $line;

    if ($format eq "") {
        @cells = split (/:/, $line);
        $tag = $cells[0];
    } else {
        @cells = split (/\t/, $line);
        $tag = $cells[$acci];
    }

    open(AFP, $accfile);

    while (<AFP>) {
        @cells = split (/\t/, $_);
        if ($cells[5] =~ /$tag/) {
            $des = $cells[1];
            last;
        }
    }
    close AFP;

    if ($found == 0) {
        print OFP "$line\tNo gene info available\n";
    }
}
close FP;


From: Jens Thoms Toerring on
ela <ela(a)yantai.org> wrote:
> I'm new to database programming and until now have used loops to look up
> and enrich information with the code below. However, when the tables are
> large,

Which tables? Do you mean 'files'?

> I find this process very slow. Then somebody told me I could build a
> database for one of the files at run time, so there would be no need to
> read the file from beginning to end again and again. However, Perl DBI
> has a lot of sophisticated functions, and my tables are merely large but
> otherwise nothing special, just linked by an ID. Is there any simple way
> to achieve the same purpose? I just wish the ID could be indexed so that
> every time I access a record it comes from memory rather than from disk
> I/O...

> #!/usr/bin/perl

Please, please use

use strict;
use warnings;

It will tell you about a lot of potential problems.

> my ($listfile, $format, $accfile, $infofile) = @ARGV;
> print '($listfile, $accfile, $infofile)'; <STDIN>;

What's that at the end of the line good for?

> print "Working on $listfile...\n";
> $outname = $listfile . "_" . $infofile . ".xls";

> open (OFP, ">$outname");

Better use the three-argument form of open and normal variables for
file handles; this isn't Perl 4 anymore...

open my $ofp, '>', $outname
    or die "Can't open $outname for writing\n";

Also checking that opening a file succeeded shouldn't be left
out without very good reasons...

> open(FP, $listfile);
> $line = <FP>;
> chomp $line;

> if ($format ne "") {
> @fields = split(/\t/, $line);
> for ($i=0; $i<@fields; $i++) {
> ############## check fields ###############################
> if ( $fields[$i] =~ /accession/) {

Are you aware that this will also match e.g. 'disaccession_123'?
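A word-boundary or anchored pattern such as /\baccession\b/ would avoid
that (just a suggestion; it depends on what your header fields actually
look like).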

> $acci = $i;
> }
> }
> }

> print OFP "$line\tgene info\n";

> $nl = '\n';

> while (<FP>) {
> $line = $_;

Why don't you read directly into '$line' instead of making an
additional copy?

> if ($line eq "\n") {
> print OFP $line;
> next;
> }
> chomp $line;

> if ($format eq "") {
> @cells = split (/:/, $line);
> $tag = $cells[0];
> } else {
> @cells = split (/\t/, $line);
> $tag = $cells[$acci];
> }

> open(AFP, $accfile);

> while (<AFP>) {
> @cells = split (/\t/, $_);
> if ($cells[5] =~ /$tag/) {
> $des = $cells[1];
> last;
> }
> }
> close AFP;

> if ($found == 0) {
> print OFP "$line\tNo gene info available\n";
> }

Huh? '$found' isn't set or used anywhere else in your program. With
'use warnings' you would have gotten a warning that you are using
the value of an uninitialized variable...

> }
> close FP;

Probably the most time-consuming part of your program is that for
each line of the file named '$listfile' you read in at least a
certain portion of '$accfile', again and again. To get around that
you don't need a database; you just have to read it in once and
store the relevant information e.g. in a hash. If you do something
like

open my $afp, '<', $accfile
    or die "Can't open $accfile for reading\n";

my %ahash;
while ( my $line = <$afp> ) {
    chomp $line;
    my @cells = split /\t/, $line;
    $ahash{ $cells[ 5 ] } = $cells[ 1 ];
}
close $afp;

somewhere at the beginning, then you would have all the information
you use from the '$accfile' file in the %ahash hash and there would
be no need to read the file again and again:

while ( my $line = <$fp> ) {
    if ( $line eq "\n" ) {
        print $ofp "\n";
        next;
    }
    chomp $line;

    if ( $format eq "" ) {
        @cells = split /:/, $line;
        $tag = $cells[ 0 ];
    } else {
        @cells = split /\t/, $line;
        $tag = $cells[ $acci ];
    }

    $des = $ahash{ $tag } if exists $ahash{ $tag };
}

close $fp;
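
Presumably you also want to write the gene description to the output
file, as your original loop tried to do; with the hash that last line
of the loop could look roughly like this (just a sketch of one way to
do it):

    if ( exists $ahash{ $tag } ) {
        print $ofp "$line\t$ahash{ $tag }\n";
    } else {
        print $ofp "$line\tNo gene info available\n";
    }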

Putting things in a database won't do too much good here
since, unless you have an in-memory database, the database
will also put the information on disk and has to retrieve
it from there (though for sure a lot faster than re-reading
a file for a bit of information lots of times ;-)
The only case I can think of where using a database may be
beneficial here is when '$accfile' is extremely large and
'%ahash' would use up all the memory you have. In that case
putting things in a database (on disk then, of course) for
relatively fast lookup of the value for a key (i.e. what you
have in the '$tag' variable) might be a reasonable
alternative.
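
If you go that route, a tied DBM hash keeps the same lookup syntax but
stores the data on disk. A minimal sketch (DB_File needs Berkeley DB
installed, and 'acc.db' is just an example file name):

    use DB_File;
    use Fcntl;

    # ties %ahash to the on-disk file 'acc.db' instead of keeping
    # everything in memory
    tie my %ahash, 'DB_File', 'acc.db', O_RDWR | O_CREAT, 0666, $DB_HASH
        or die "Can't tie acc.db: $!\n";

    # after that, %ahash can be filled and read exactly as above
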
Regards, Jens
--
\ Jens Thoms Toerring ___ jt(a)toerring.de
\__________________________ http://toerring.de
From: wolf on
ela wrote:
> I'm new to database programming and until now have used loops to look up
> and enrich information with the code below. However, when the tables are
> large, I find this process very slow. Then somebody told me I could build
> a database for one of the files at run time, so there would be no need to
> read the file from beginning to end again and again. However, Perl DBI
> has a lot of sophisticated functions, and my tables are merely large but
> otherwise nothing special, just linked by an ID. Is there any simple way
> to achieve the same purpose? I just wish the ID could be indexed so that
> every time I access a record it comes from memory rather than from disk
> I/O...
>
>
> #!/usr/bin/perl
>
> my ($listfile, $format, $accfile, $infofile) = @ARGV;
> print '($listfile, $accfile, $infofile)'; <STDIN>;
>
> print "Working on $listfile...\n";
> $outname = $listfile . "_" . $infofile . ".xls";
>
> open (OFP, ">$outname");
>
> open(FP, $listfile);
> $line = <FP>;
> chomp $line;
>
> if ($format ne "") {
> @fields = split(/\t/, $line);
> for ($i=0; $i<@fields; $i++) {
> ############## check fields ###############################
> if ( $fields[$i] =~ /accession/) {
> $acci = $i;
> }
> }
> }
>
> print OFP "$line\tgene info\n";
>
> $nl = '\n';
>
> while (<FP>) {
> $line = $_;
> if ($line eq "\n") {
> print OFP $line;
> next;
> }
> chomp $line;
>
> if ($format eq "") {
> @cells = split (/:/, $line);
> $tag = $cells[0];
> } else {
> @cells = split (/\t/, $line);
> $tag = $cells[$acci];
> }
>
> open(AFP, $accfile);
>
> while (<AFP>) {
> @cells = split (/\t/, $_);
> if ($cells[5] =~ /$tag/) {
> $des = $cells[1];
> last;
> }
> }
> close AFP;
>
> if ($found == 0) {
> print OFP "$line\tNo gene info available\n";
> }
> }
> close FP;
>
>

Hi ela,

without going too deeply into your code, let's just say that you should
always start your Perl scripts with

#!/usr/bin/perl
use warnings;
use strict;

and if you can't make it run with these restrictions, there is something
seriously flaky about the approach you are pursuing.

Apart from the perl aspect, there are some serious information issues
you need to address.

From what I can gather from your description, you are reading in a file
that contains some kind of gene information, and you want to index that
information so that retrieval is much faster, rather than iterating
SEQUENTIALLY over the whole file (or series of files) every time you
need an answer.

Is my assumption thus far right?


But to assess that, some real-life info on what you are actually
trying to do is needed :p
How big is/are the files - that is, how big will that index be?

What is the actual index going to be, etc.?

Only after that part becomes clear is a solution possible. And you need
to communicate that.


cheers, wolf




From: Jürgen Exner on
"ela" <ela(a)yantai.org> wrote:
>
>
>I'm new to database programming and until now have used loops to look up
>and enrich information with the code below. However, when the tables are
>large, I find this process very slow. Then somebody told me I could build
>a database for one of the files at run time, so there would be no need to
>read the file from beginning to end again and again.

What I gathered from your code without going into details is that for
each line of FP you are opening, reading through, and closing AFP.

I/O operations are by far the slowest operations and there is a trivial
solution that will probably speed up your program dramatically: instead
of reading AFP again and again and again just read it into an array once
at the beginning of your program and then loop over that array instead
of over the file.
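
A minimal sketch of that idea (keeping your own variable names):

    open AFP, '<', $accfile or die "Can't open $accfile: $!";
    my @acc_lines = <AFP>;    # slurp the file once
    close AFP;

    # then, for every line of the list file:
    for my $acc_line (@acc_lines) {
        my @cells = split /\t/, $acc_line;
        if ($cells[5] =~ /$tag/) {
            $des = $cells[1];
            last;
        }
    }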

Only if AFP is too large for that (several GB) would you need to look
for a better algorithmic solution. This requires knowledge and
experience, and a database may or may not help, depending upon what you
are actually trying to achieve.

jue
From: ccc31807 on
On Aug 10, 3:39 am, "ela" <e...(a)yantai.org> wrote:
> I'm new to database programming and until now have used loops to look up
> and enrich information with the code below. However, when the tables are
> large, I find this process very slow. Then somebody told me I could build
> a database for one of the files at run time, so there would be no need to
> read the file from beginning to end again and again. However, Perl DBI
> has a lot of sophisticated functions, and my tables are merely large but
> otherwise nothing special, just linked by an ID. Is there any simple way
> to achieve the same purpose? I just wish the ID could be indexed so that
> every time I access a record it comes from memory rather than from disk
> I/O...

You have input, which you want to process and turn into output.

Your input consists of data contained in some kind of file. This is
exactly the kind of task that Perl excels at.

You have two choices: (1) you can use a database to store and query
your data, or (2) you can use your computer's memory to store and
query your data.

If you have a large amount of permanent data that you need to add to,
delete from, and change, your best strategy is to use a database. Read
your data file into your database. Most databases have external
commands (i.e., not SQL) for doing that, so it should be
straightforward and easy -- note that you do not use Perl for this,
and probably shouldn't.

If you have a small to moderate amount of data, whether permanent or
temporary, that you don't need to add to, delete from, or modify, your
best strategy is to use your computer's memory to store and query your
data. Simply open the file, read each line, destructure each line into
a key and value, and stuff it into a hash.

For example, suppose your data looks like this:
12345,George,Washington,First
23456,John,Adams,Second
34567,Thomas,Jefferson,Third
45678,James,Madison,Fourth

You can do this:
my %pres;
open PRES, '<', 'data.csv' or die "$!";
while(<PRES>)
{
    chomp;
    my ($id, $first, $last, $place) = split /,/;
    $pres{$place} = "$id, $first, $last";
}
close PRES;

If you need a multilevel data structure, see the documentation, starting
maybe with lists of lists.
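
For instance, storing each row as a hash reference instead of a joined
string (just an illustration):

    $pres{$place} = { id => $id, first => $first, last => $last };
    print $pres{'First'}{last};    # prints "Washington"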

CC.