From: Chris Rebert on
On Thu, Oct 15, 2009 at 12:39 AM, Raji Seetharaman <sraji.me(a)gmail.com> wrote:
> Hi all,
>
> I'm learning web scraping with Python from the following link
> http://www.packtpub.com/article/web-scraping-with-python
>
> To work with it, mechanize needs to be installed.
> I installed mechanize using
>
> sudo apt-get install python-mechanize
>
> As given in the tutorial, I tried the code below:
>
> import mechanize
> BASE_URL = "http://www.packtpub.com/article-network"
> br = mechanize.Browser()
> data = br.open(BASE_URL).get_data()
>
> Received the following error
>
> File "webscrap.py", line 4, in <module>
>     data = br.open(BASE_URL).get_data()
>   File "/usr/lib/python2.6/dist-packages/mechanize/_mechanize.py", line 209,
> in open
>     return self._mech_open(url, data, timeout=timeout)
>   File "/usr/lib/python2.6/dist-packages/mechanize/_mechanize.py", line 261,
> in _mech_open
>     raise response
> mechanize._response.httperror_seek_wrapper: HTTP Error 403: request
> disallowed by robots.txt

Apparently that website's tutorial and robots.txt are not in sync.
robots.txt is part of the Robot Exclusion Standard
(http://en.wikipedia.org/wiki/Robots_exclusion_standard) and is the
standard way websites specify which webpages should and should not be
accessed programmatically. In this case, that site's robots.txt is
forbidding access to the webpage in question from autonomous programs.
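
You can check what that robots.txt allows using just the standard
library's robotparser module (quick sketch, untested; the URLs are
taken straight from your traceback):

import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.packtpub.com/robots.txt")
rp.read()
# can_fetch() returns False when the given user agent is disallowed
# from fetching that URL -- which is what mechanize is reporting.
print rp.can_fetch("*", "http://www.packtpub.com/article-network")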

There's a way to tell mechanize to skip its robots.txt check, though,
given that the standard is not enforced server-side; programs follow
it voluntarily.
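
Something along these lines (an untested sketch, reusing the URL from
your snippet) should get past the 403:

import mechanize

BASE_URL = "http://www.packtpub.com/article-network"
br = mechanize.Browser()
# Tell mechanize not to consult robots.txt before opening pages.
br.set_handle_robots(False)
data = br.open(BASE_URL).get_data()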

Cheers,
Chris
--
http://blog.rebertia.com