From: MRAB on
Brian D wrote:
> Thanks MRAB as well. I've printed all of the replies to retain with my
> pile of essential documentation.
>
> To follow up with a complete response, I'm ripping out of my mechanize
> module the essential components of the solution I got to work.
>
> The main body of the code passes a URL to the scrape_records function.
> The function attempts to open the URL five times.
>
> If the URL is opened, a values dictionary is populated and returned to
> the calling statement. If the URL cannot be opened, a fatal error is
> printed and the module terminates. There's a little sleep call in the
> function to leave time for any errant connection problem to resolve
> itself.
>
> Thanks to all for your replies. I hope this helps someone else:
>
> import urllib2, time
> from mechanize import Browser
>
> def scrape_records(url):
>     maxattempts = 5
>     br = Browser()
>     user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:
> 1.9.0.16) Gecko/2009120208 Firefox/3.0.16 (.NET CLR 3.5.30729)'
>     br.addheaders = [('User-agent', user_agent)]
>     for count in xrange(maxattempts):
>         try:
>             print url, count
>             br.open(url)
>             break
>         except urllib2.URLError:
>             print 'URL error', count
>             # Pretend a failed connection was fixed
>             if count == 2:
>                 url = 'http://www.google.com'
>             time.sleep(1)
>             pass

'pass' isn't necessary.

>     else:
>         print 'Fatal URL error. Process terminated.'
>         return None
>     # Scrape page and populate valuesDict
>     valuesDict = {}
>     return valuesDict
>
> url = 'http://badurl'
> valuesDict = scrape_records(url)
> if valuesDict == None:

When checking whether or not something is a singleton, such as None, use
"is" or "is not" instead of "==" or "!=".

>     print 'Failed to retrieve valuesDict'
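For anyone who hasn't met the for/else construct used above: the else suite
runs only when the loop finishes without hitting a break, which is exactly
why it serves as the "all attempts failed" branch here. A minimal sketch of
the same pattern (find_even is a hypothetical stand-in for the retry loop):

```python
def find_even(numbers):
    # Scan for the first even number, mirroring the retry loop above.
    for n in numbers:
        if n % 2 == 0:
            break  # success: the else clause is skipped
    else:
        # Loop ran to completion without a break: every "attempt" failed.
        return None
    return n
```

So find_even([1, 3, 4]) finds 4 via break, while find_even([1, 3, 5])
falls through to the else and returns None.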

From: Brian D on
On Dec 30, 7:08 pm, MRAB <pyt...(a)mrabarnett.plus.com> wrote:
> Brian D wrote:
> > Thanks MRAB as well. I've printed all of the replies to retain with my
> > pile of essential documentation.
>
> > To follow up with a complete response, I'm ripping out of my mechanize
> > module the essential components of the solution I got to work.
>
> > The main body of the code passes a URL to the scrape_records function.
> > The function attempts to open the URL five times.
>
> > If the URL is opened, a values dictionary is populated and returned to
> > the calling statement. If the URL cannot be opened, a fatal error is
> > printed and the module terminates. There's a little sleep call in the
> > function to leave time for any errant connection problem to resolve
> > itself.
>
> > Thanks to all for your replies. I hope this helps someone else:
>
> > import urllib2, time
> > from mechanize import Browser
>
> > def scrape_records(url):
> >     maxattempts = 5
> >     br = Browser()
> >     user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:
> > 1.9.0.16) Gecko/2009120208 Firefox/3.0.16 (.NET CLR 3.5.30729)'
> >     br.addheaders = [('User-agent', user_agent)]
> >     for count in xrange(maxattempts):
> >         try:
> >             print url, count
> >             br.open(url)
> >             break
> >         except urllib2.URLError:
> >             print 'URL error', count
> >             # Pretend a failed connection was fixed
> >             if count == 2:
> >                 url = 'http://www.google.com'
> >             time.sleep(1)
> >             pass
>
> 'pass' isn't necessary.
>
> >     else:
> >         print 'Fatal URL error. Process terminated.'
> >         return None
> >     # Scrape page and populate valuesDict
> >     valuesDict = {}
> >     return valuesDict
>
> > url = 'http://badurl'
> > valuesDict = scrape_records(url)
> > if valuesDict == None:
>
> When checking whether or not something is a singleton, such as None, use
> "is" or "is not" instead of "==" or "!=".
>
> >     print 'Failed to retrieve valuesDict'
>
>

I'm definitely acquiring some well-deserved schooling -- and it's
really appreciated. I'd seen the "is/is not" preference before, but it
just didn't stick.

I see now that "pass" is redundant -- thanks for catching that.

Cheers.
From: Steve Holden on
Brian D wrote:
[...]
> I'm definitely acquiring some well-deserved schooling -- and it's
> really appreciated. I'd seen the "is/is not" preference before, but it
> just didn't stick.
>
Yes, a lot of people have acquired the majority of their Python
education from this list - I have certainly learned a thing or two from
it over the years, and had some very interesting discussions.

is/is not are about object identity. Saying

a is b

is pretty much the same thing as saying

id(a) == id(b)

so it's a test that two expressions are references to the exact same
object. So it works with None, since there is only ever one value of
<type 'NoneType'>.

Be careful not to use it when there can be several different but equal
values, though.
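A quick illustration of that caveat, using two equal but distinct lists:

```python
a = [1, 2, 3]
b = [1, 2, 3]
c = a

# Equal values, but two separate objects:
assert a == b
assert a is not b

# c is simply another name bound to the same object as a:
assert a is c
assert id(a) == id(c)
```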

> I see now that "pass" is redundant -- thanks for catching that.
>
regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
PyCon is coming! Atlanta, Feb 2010 http://us.pycon.org/
Holden Web LLC http://www.holdenweb.com/
UPCOMING EVENTS: http://holdenweb.eventbrite.com/

From: Aahz on
In article <mailman.233.1262197919.28905.python-list(a)python.org>,
Philip Semanchuk <philip(a)semanchuk.com> wrote:
>
>While I don't fully understand what you're trying to accomplish by
>changing the URL to google.com after 3 iterations, I suspect that some
>of your trouble comes from using "while True". Your code would be
>clearer if the while clause actually stated the exit condition. Here's
>a suggestion (untested):
>
>MAX_ATTEMPTS = 5
>
>count = 0
>while count <= MAX_ATTEMPTS:
>    count += 1
>    try:
>        print 'attempt ' + str(count)
>        request = urllib2.Request(url, None, headers)
>        response = urllib2.urlopen(request)
>        if response:
>            print 'True response.'
>    except URLError:
>        print 'fail ' + str(count)

Note that you may have good reason for doing it differently:

MAX_ATTEMPTS = 5
def retry(url):
    count = 0
    while True:
        count += 1
        try:
            print 'attempt', count
            request = urllib2.Request(url, None, headers)
            response = urllib2.urlopen(request)
            if response:
                print 'True response'
        except URLError:
            if count < MAX_ATTEMPTS:
                time.sleep(5)
            else:
                raise

This structure is required in order for the raise to do a proper
re-raise.
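To see why: a bare raise inside an except block re-raises the exception
currently being handled, original traceback intact, so the failure must
still be "live" when you give up. A small self-contained sketch of the
pattern (flaky() is a hypothetical stand-in for the urlopen call, and
IOError stands in for URLError):

```python
MAX_ATTEMPTS = 3
calls = []

def flaky():
    # Stand-in for urllib2.urlopen(); always fails in this sketch.
    calls.append(1)
    raise IOError("connection refused")

def retry():
    count = 0
    while True:
        count += 1
        try:
            return flaky()
        except IOError:
            if count >= MAX_ATTEMPTS:
                raise  # bare raise: re-raise the active exception as-is

try:
    retry()
except IOError:
    pass  # the third failure propagated out of retry()

assert len(calls) == MAX_ATTEMPTS
```

If you instead caught the error, left the except block, and raised a new
exception later, you would lose the original traceback.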

BTW, your code is rather oddly indented, please stick with PEP8.
--
Aahz (aahz(a)pythoncraft.com) <*> http://www.pythoncraft.com/

"If you think it's expensive to hire a professional to do the job, wait
until you hire an amateur." --Red Adair
From: Aahz on
In article <hilruv$nuv$1(a)panix5.panix.com>, Aahz <aahz(a)pythoncraft.com> wrote:
>In article <mailman.233.1262197919.28905.python-list(a)python.org>,
>Philip Semanchuk <philip(a)semanchuk.com> wrote:
>>
>>While I don't fully understand what you're trying to accomplish by
>>changing the URL to google.com after 3 iterations, I suspect that some
>>of your trouble comes from using "while True". Your code would be
>>clearer if the while clause actually stated the exit condition. Here's
>>a suggestion (untested):
>>
>>MAX_ATTEMPTS = 5
>>
>>count = 0
>>while count <= MAX_ATTEMPTS:
>>    count += 1
>>    try:
>>        print 'attempt ' + str(count)
>>        request = urllib2.Request(url, None, headers)
>>        response = urllib2.urlopen(request)
>>        if response:
>>            print 'True response.'
>>    except URLError:
>>        print 'fail ' + str(count)
>
>Note that you may have good reason for doing it differently:
>
>MAX_ATTEMPTS = 5
>def retry(url):
>    count = 0
>    while True:
>        count += 1
>        try:
>            print 'attempt', count
>            request = urllib2.Request(url, None, headers)
>            response = urllib2.urlopen(request)
>            if response:
>                print 'True response'
                 ^^^^^
Oops, that print should have been a return.

>        except URLError:
>            if count < MAX_ATTEMPTS:
>                time.sleep(5)
>            else:
>                raise
>
>This structure is required in order for the raise to do a proper
>re-raise.

--
Aahz (aahz(a)pythoncraft.com) <*> http://www.pythoncraft.com/

"If you think it's expensive to hire a professional to do the job, wait
until you hire an amateur." --Red Adair