Wikipedia forbidden
To enhance my Theme Maps App I needed to fetch some data from Wikipedia. This suddenly stopped working. The error on the console was "Error: could not open url" followed by the URL that failed.
I was puzzled because when I clicked the URL, it opened fine in the browser. Reading a Wikipedia URL
is done quite similarly in my three scripts wiki2geojson,
wiki2image, and wiki2table. Also,
running the scripts with the --help switch did not show any errors.
The error message came from a try/except block, so I had to figure out what actually happened
in the try block. My first assumption was that urllib was not installed anymore (perhaps it
got lost or was replaced with some other library when I updated my system).
However, as soon as I put the call fp = urllib.request.urlopen(url) outside the try block, the error appeared
on stderr and revealed that an HTTP 403 response had been received.
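For reference, the bare call that exposed the real error looked roughly like this (the URL is just a placeholder here):

    import urllib.request

    url = 'https://en.wikipedia.org/wiki/Example'   # placeholder article URL
    fp = urllib.request.urlopen(url)                # without a try block this terminates with
                                                    # urllib.error.HTTPError: HTTP Error 403: Forbidden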
Apparently, Wikipedia wants to keep crawlers and bots away and protects its site so that automated content grabbing is not possible. Since access from the browser worked fine, I tried setting a custom user agent to pretend to be a normal browser. Gladly, that worked.
So the old code in the three scripts where the URL is fetched was changed in the following manner:
    try:
        req = urllib.request.Request(
            url,
            data=None,
            headers={
                # pretend to be a regular browser so Wikipedia does not answer with 403
                'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
            }
        )
        fp = urllib.request.urlopen(req)
        data = fp.read()
        return data
    except Exception as e:
        # only urllib.error.HTTPError carries a .code attribute, so fall back gracefully
        self.error = 'Error: could not open url {u} Code: {c}, Message: {m}'.format(
            u=url, c=getattr(e, 'code', 'n/a'), m=str(e))
        return False
The return value is a bit different in the three scripts, but the procedure is the same: we create a
request object, set a custom header, in this case the User-Agent, and then send the request to the server.
Also, when an HTTP error or whatever other exception is raised, I want it displayed on the
console. Therefore, I catch the Exception object and use its information in the except block.
Two scripts used a check whether the HTTP response was not 200. I deleted that block: since an exception is thrown earlier in that case, the check can never trigger, and the status information is already contained in the exception.
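The reasoning, as a minimal sketch (placeholder URL, shortened user agent string):

    import urllib.error
    import urllib.request

    req = urllib.request.Request('https://en.wikipedia.org/wiki/Example',   # placeholder URL
                                 headers={'User-Agent': 'Mozilla/5.0'})     # shortened user agent
    try:
        fp = urllib.request.urlopen(req)           # only returns for successful responses
        data = fp.read()
    except urllib.error.HTTPError as e:
        print('HTTP status:', e.code, e.reason)    # 403, 404 etc. end up here, so a separate "!= 200" check never fires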
There were a few other glitches to fix. The images are now located in a different HTML structure, so the search for images changed from
for aimage in soap.select('a.image'):
into
for aimage in soap.select('.infobox-image, figure'):
and when there is no alt attribute, we now look for a figcaption element instead.
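Put together, the new image lookup is roughly the following sketch; soap is the BeautifulSoup object for the fetched page, and the exact attribute handling in the scripts may differ slightly:

    from bs4 import BeautifulSoup

    soap = BeautifulSoup(html, 'html.parser')            # html is the fetched page content
    for aimage in soap.select('.infobox-image, figure'):
        img = aimage.find('img')
        if img is None:
            continue
        # prefer the alt attribute, fall back to the figcaption text
        caption = img.get('alt')
        if not caption:
            figcaption = aimage.find('figcaption')
            caption = figcaption.get_text(strip=True) if figcaption else ''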
Regular expressions with a special character group now need two backslashes instead of one; the single backslash only produced a warning and still worked.
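An illustrative example (not the actual expression from the scripts):

    import re

    # '\d' inside a normal string literal is an invalid escape sequence;
    # Python only warns about it and the regex still works.
    re.findall('[\d]+', '48.2082, 16.3738')      # old form, warns
    re.findall('[\\d]+', '48.2082, 16.3738')     # doubled backslash, no warning
    re.findall(r'[\d]+', '48.2082, 16.3738')     # a raw string works as well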
In general, it is probably better to use the Wikipedia API instead of relying on the output of the rendered page: structural changes in the wiki source are less likely than changes in the rendered output.
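For example, the MediaWiki API can return article data as JSON; here is a rough sketch (the chosen query parameters, the article title, and the User-Agent contact are just placeholders, not what the scripts use today):

    import json
    import urllib.parse
    import urllib.request

    params = urllib.parse.urlencode({
        'action': 'query',
        'prop': 'extracts',       # plain-text extract of the article
        'explaintext': 1,
        'titles': 'Vienna',       # placeholder article title
        'format': 'json',
    })
    req = urllib.request.Request('https://en.wikipedia.org/w/api.php?' + params,
                                 headers={'User-Agent': 'theme-maps-app/1.0 (contact: example@example.org)'})
    with urllib.request.urlopen(req) as fp:
        pages = json.load(fp)['query']['pages']
    for page in pages.values():
        print(page.get('extract', '')[:200])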