From: Thomas Andersson on
Hmm, been playing around a bit and gotten further than I had thought.
I open a file and read in the next webpage to be processed (an ID number) and
set up the page count to 1 (each ID to process can have any number of
pages).
I create my URL from page count and current ID (pid).
The idea I have is that it will loop as long as there is a page to grab by
increasing the page count (this plan was flawed I realised though, but
that's another problem).
As it is now it keeps grabbing the same page over and over thousands of
times (creating new files for each loop).

#Create URL for sid list from pid and page count.
my $pcnt = 1;
my $page = get
"http://csr.wwiionline.com/scripts/services/persona/sorties.jsp?page=$pcnt&pid=$pid";
while ($page) {
if ($page) {
print "Site is alive\n";
}
else {
print "Site is not accessible\n";
};

#Create filename and write file, then save grabbed webpage into it.
open FILE, ">", "c:\\scr\\$pid-pg$pcnt.txt" or die $!;
print FILE $page;
$pcnt += 1;
};

I guess the URL doesn't get updated by the increased pagecount, any
suggestions on how to fix that part?


From: Sherm Pendley on
"Thomas Andersson" <thomas(a)tifozi.net> writes:

> As it is now it keeps grabbing the same page over and over thousands of
> times (creating new files for each loop).

Not quite - the get() is outside of the loop, so it's grabbing the page
only once, and saving it over and over.

> #Create URL for sid list from pid and page count.
> my $pcnt = 1;

I'd put the "base" URL in a separate variable, to avoid repetition:

my $base = 'http://csr.wwiionline.com/scripts/services/persona/sorties.jsp';

> my $page = get
> "http://csr.wwiionline.com/scripts/services/persona/sorties.jsp?page=$pcnt&pid=$pid";

So, using the "base" url, this would become:

my $page = get "$base?page=$pcnt&pid=$pid";

> while ($page) {

The if() is redundant here; if $page is false, the while() will exit
and the if() won't be reached.

> print "Site is alive\n";

> #Create filename and write file, then save grabbed webpage into it.
> open FILE, ">", "c:\\scr\\$pid-pg$pcnt.txt" or die $!;

You can use forward slashes on Windows too - it's only the command
shell (aka "DOS Box") that requires backslashes. Also, it's a good idea
to include the filename you're trying to open when reporting an error,
because that can help you figure out why it failed.

my $outfile = "c:/scr/$pid-pg$pcnt.txt";
open FILE, ">", $outfile or die "Could not open $outfile: $!";

> print FILE $page;
> $pcnt += 1;

Now that you've updated $pcnt, you need to fetch the next page and
store it in $page.

$page = get "$base?page=$pcnt&pid=$pid";

> };
>
> I guess the URL doesn't get updated by the increased pagecount

Right. When you interpolate a variable into a string, it's a one-time
deal. The current value of the interpolated variable is used, but no
long-lasting relationship exists between them, so the string is not
updated when the interpolated variable's value changes.

For example, this will print the same thing ten times:

#!/usr/bin/perl
use warnings;
use strict;

my $num = 0;
my $string = "Num: $num\n";
for $num (1 .. 10) {
    print $string;
}

Compare that with this, where a new value is assigned to $string each
time around the loop:

#!/usr/bin/perl
use warnings;
use strict;

for my $num (1 .. 10) {
    my $string = "Num: $num\n";
    print $string;
}

sherm--

--
Sherm Pendley <www.shermpendley.com>
<www.camelbones.org>
Cocoa Developer

From: Ben Morrow on

Quoth "Thomas Andersson" <thomas(a)tifozi.net>:
> Hmm, been playing around a bit and gotten further than I had thought.
> I open a file and read in the next webpage to be processed (an ID number) and
> set up the page count to 1 (each ID to process can have any number of
> pages).
> I create my URL from page count and current ID (pid).
> The idea I have is that it will loop as long as there is a page to grab by
> increasing the page count (this plan was flawed I realised though, but
> that's another problem).
> As it is now it keeps grabbing the same page over and over thousands of
> times (creating new files for each loop).
>
> #Create URL for sid list from pid and page count.
> my $pcnt = 1;
> my $page = get
> "http://csr.wwiionline.com/scripts/services/persona/sorties.jsp?page=$pcnt&pid=$pid";

This happens once, before the loop, when $pcnt = 1.

> while ($page) {
> if ($page) {
> print "Site is alive\n";
> }
> else {
> print "Site is not accessible\n";
> };
>
> #Create filename and write file, then save grabbed webpage into it.
> open FILE, ">", "c:\\scr\\$pid-pg$pcnt.txt" or die $!;

This happens every time around the loop, with different values of $pcnt.

> print FILE $page;
> $pcnt += 1;
> };
>
> I guess the URL doesn't get updated by the increased pagecount, any
> suggestions on how to fix that part?

You seem to be expecting Perl variables to act like macros; they don't.
If you want to recreate the URL and re-fetch the new page every time you
go round the loop, you need the 'my $page = get...' line *inside* the
loop.
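
As a rough sketch of that advice (not the original poster's final code), the
loop below rebuilds the URL and fetches again on every pass. The names
page_url, grab_pages, the $max_pages cap and the pid 12345 are made up for
illustration; the fetcher is passed in as a code ref so the loop can be tried
without network access. With LWP::Simple loaded, \&LWP::Simple::get should
work as the fetcher, since get() returns undef on failure.

```perl
#!/usr/bin/perl
use warnings;
use strict;

my $base = 'http://csr.wwiionline.com/scripts/services/persona/sorties.jsp';

# Hypothetical helper: build the URL from the *current* values each time.
sub page_url {
    my ($pid, $pcnt) = @_;
    return "$base?page=$pcnt&pid=$pid";
}

# Fetch pages 1 .. $max_pages for one pid. $fetch is a code ref that takes
# a URL and returns the page body, or undef on failure. The cap is an
# assumption added here so the loop always terminates.
sub grab_pages {
    my ($fetch, $pid, $max_pages) = @_;
    my @pages;
    for my $pcnt (1 .. $max_pages) {
        my $page = $fetch->(page_url($pid, $pcnt));   # re-fetched every pass
        last unless defined $page;                    # stop on a failed fetch
        push @pages, $page;
    }
    return @pages;
}

# Offline demo with a stub fetcher that just echoes the URL it was given.
print "$_\n" for grab_pages(sub { "fetched $_[0]" }, 12345, 2);
```

For real use, something like grab_pages(\&LWP::Simple::get, $pid, 50) after
"use LWP::Simple;" would replace the stub.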

Also: get into the habit, now, of keeping your filehandles in proper
variables. It will make life easier later.

open my $FILE, ">", "..." or ...;

Ben

From: Thomas Andersson on
Sherm Pendley wrote:

> I'd put the "base" URL in a separate variable, to avoid repetition:
> my $base =
> 'http://csr.wwiionline.com/scripts/services/persona/sorties.jsp';

Excellent idea, just realised that the links I will collect from the page
also use the same base. Thanks for the examples, they help me a lot!

>> while ($page) {
> The if() is redundant here; if $page is false, the while() will exit
> and the if() won't be reached.

Sorry, didn't quite get what you were saying here?
One problem I've realised that kinda breaks this is that if you just up the
page count, it will never fail and exit, as you just keep getting empty sortie
pages back with an ever higher page number. (There's a string "No more
sorties found" on them, though, that I guess could be detected and used to
exit the loop.)
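
A minimal sketch of that idea, assuming the marker appears on the page as
plain text exactly as quoted above (the real markup may differ). The names
is_last_page and collect_pages, and the fetcher code ref that takes a page
number, are hypothetical; the demo uses canned strings instead of the real
site.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# True when the page is missing or carries the end marker.
# Assumption: the marker is plain text; \s+ tolerates wrapped whitespace.
sub is_last_page {
    my ($page) = @_;
    return 1 unless defined $page;
    return $page =~ /No\s+more\s+sorties\s+found/ ? 1 : 0;
}

# $fetch is a hypothetical code ref: it takes a page number and returns
# the page body, or undef on failure.
sub collect_pages {
    my ($fetch) = @_;
    my @pages;
    for (my $pcnt = 1; ; $pcnt++) {
        my $page = $fetch->($pcnt);
        last if is_last_page($page);   # leave the loop at the marker page
        push @pages, $page;
    }
    return @pages;
}

# Offline demo: two pages of data, then the marker page.
my @fake  = ("sorties page 1", "sorties page 2", "No more sorties found");
my @pages = collect_pages(sub { $fake[ $_[0] - 1 ] });
print scalar(@pages), " page(s) collected\n";
```

Against the live site it would probably be wise to add an upper bound on the
page count as well, in case the marker text ever changes.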

> You can use forward slashes on Windows too - it's only the command
> shell (aka "DOS Box") that requires backslashes. Also, it's a good
> idea to include the filename you're trying to open when reporting an
> error, because that can help you figure out why it failed.

Ah, didn't realize, good to know, will definitely follow your suggestion
(might as well pick up good habits early on).
Thanks for your good advice, I really appreciate it (and will likely come
back time and again for more ;) ).

Best Wishes
Thomas


From: Thomas Andersson on
> Also: get into the habit, now, of keeping your filehandles in proper
> variables. It will make life easier later.
>
> open my $FILE, ">", "..." or ...;

Will definitely try to pick up good habits on coding and formatting, so
thanks for the advice.
But if I create a variable for the filehandle like this, won't it contain the
filepath, so that when I do the print $FILE it will print the filepath
instead of the content of the file as I want? Or am I misunderstanding?
(quite likely).

Best Wishes
Thomas
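
On that last question: after a successful open, the variable holds the
filehandle itself, not the path string, so printing to it writes into the
file. A small sketch to try it out; save_page and slurp are made-up helper
names, and a temporary directory stands in for c:/scr so it runs anywhere.

```perl
#!/usr/bin/perl
use warnings;
use strict;
use File::Temp qw(tempdir);
use File::Spec;

# Write $page into $outfile through a lexical filehandle.
sub save_page {
    my ($outfile, $page) = @_;
    open my $fh, '>', $outfile or die "Could not open $outfile: $!";
    print {$fh} $page;    # $fh is the handle; this writes $page to the file
    close $fh or die "Could not close $outfile: $!";
    return;
}

# Read a whole file back, to check what was actually written.
sub slurp {
    my ($file) = @_;
    open my $fh, '<', $file or die "Could not open $file: $!";
    local $/;             # undef record separator: read everything at once
    my $content = <$fh>;
    close $fh;
    return $content;
}

# Demo in a throwaway directory (removed when the script exits).
my $dir     = tempdir(CLEANUP => 1);
my $outfile = File::Spec->catfile($dir, 'demo-pg1.txt');
save_page($outfile, "page content\n");
print slurp($outfile);
```

The path only lives in $outfile; $fh is just the connection to the opened
file, which is also why it is handy to keep the path around for error
messages.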