Download images from Wikipedia

Quiz questionnaires look better when they contain images and not just text. In the quiz that I take part in, we play three rounds of questions. One of these rounds usually contains images, and the audience must name what is shown on them. An example would be a list of several well-known sportsmen like the following:

DOD photo by D. Myles Cullen - http://www.defense.gov/dodcmsshare/newsphoto/2014-04/hires_140417-D-VO565-002c.jpg
CC BY-SA 2.0: Tatiana from Moscow, Russia, https://www.flickr.com/people/94741637@N02
CC BY-SA 3.0: Кирилл Венедиктов - https://www.soccer.ru/galery/1055457/photo/733439

The audience is asked to write down the name of each sportsman below the image or on a separate sheet with a numbered list. In the past I would check Wikipedia, manually look up the page of each sportsman on my list, search for an image, download it and then use it in the questionnaire. This is a task that can be automated.

The script can deal with wiki article names and complete wiki URLs, and it can also read these from a file. This makes it possible to parse several articles at once and download their images. The user may choose whether to download the thumbnail that is shown on the wiki page or the original uploaded image file. The latter may not work for all images, because on the wiki page a click on the thumbnail triggers some JavaScript, which this Python script cannot execute.

In case of doubt, the script can also list all images with their captions, so that the user can decide which images to download after looking at the listing.

Many wiki pages have a box at the top right that summarizes some facts about the subject. Pages about a person nearly always have this table. It contains facts like date and place of birth, nationality, occupation or what the person is famous for. If the person is famous enough, and not just a B- or C-list celebrity, there is also a portrait of that person. This is the image that I want to download. In the structure of the page, this image is usually also the first image in the article block. A good guess is therefore to simply download the first image and to check the contents only later, after downloading a bunch of files. This is also the default behaviour of this script.

Let's assume I want to fetch the portrait of Michael Jordan. The straightforward approach is this:

./wiki2image.py "Michael Jordan"

By default, article names are looked up in the English Wikipedia. The article name can also be provided in the following ways:

  • Michael_Jordan (space to underscore)
  • Michael%20Jordan (space url encoded - but not with +)
  • https://en.wikipedia.org/wiki/Michael_Jordan (complete url)
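All of these forms can be reduced to the same article URL. A minimal sketch of such a normalization follows; the helper name article_to_url is hypothetical and not part of the script, and the base URL is the English Wikipedia default mentioned above:

```python
import urllib.parse

BASE_URL = "https://en.wikipedia.org/wiki/"  # default wiki (English Wikipedia)


def article_to_url(article, base_url=BASE_URL):
    """Hypothetical helper: turn an article name or a complete url into an article url."""
    if article.lower().startswith(("http://", "https://")):
        # a complete url is used as it is
        return article
    # decode first, so that an already encoded name is not encoded twice,
    # then encode again (a space becomes %20, an underscore stays as it is)
    name = urllib.parse.unquote(article)
    return base_url + urllib.parse.quote(name)


for form in ("Michael Jordan", "Michael_Jordan", "Michael%20Jordan"):
    print(article_to_url(form))
```

Note that urllib.parse.unquote does not treat + as a space, which matches the remark in the list above.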

With any of these forms, the script downloads the first image of the article. There are additional options:

  • --list prints a list of the image links together with the image caption, in case there is one.
  • --num takes the number of the image to download if it is not the first one; the numbers are the ones shown by --list. Several numbers can be separated by commas. To download all images, --num all can be used.
  • --dir writes the downloaded files into the given directory instead of the current one. If the directory doesn't exist, it will be created.
  • --orig downloads the original image and not the thumbnail.
  • --wiki sets the wiki base URL where the articles are located. This would probably also work with any other MediaWiki installation, although I could not test this yet.
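The script itself parses sys.argv by hand. Purely as an illustration, the same options could be declared with the argparse module from the standard library. This is an alternative technique, not what the script does, and the function name build_parser and the help texts are assumptions:

```python
import argparse


def build_parser():
    """Hypothetical argparse declaration of the options described above."""
    parser = argparse.ArgumentParser(prog="wiki2image.py")
    parser.add_argument("articles", nargs="+",
                        help="article names, complete urls or files with one article per line")
    parser.add_argument("--list", action="store_true",
                        help="list image links and captions, do not download")
    parser.add_argument("--num", default="1",
                        help="image number(s) separated by comma, or 'all'")
    parser.add_argument("--dir", default="",
                        help="output directory, created if it does not exist")
    parser.add_argument("--orig", action="store_true",
                        help="download the original image instead of the thumbnail")
    parser.add_argument("--verbose", action="store_true",
                        help="print some information during the run")
    parser.add_argument("--wiki", default="https://en.wikipedia.org/wiki/",
                        help="base url of the wiki")
    return parser


args = build_parser().parse_args(["Michael Jordan", "--num", "2,3", "--dir", "sportsmen"])
print(args.articles, args.num, args.dir, args.orig)
```

With argparse, --help is generated automatically from the declarations, so no separate help text has to be maintained.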

The script is organized into a class WorkLog, which parses the wiki page, extracts the information and downloads the images, and a main routine, which handles the command line arguments and sets up a WorkLog instance accordingly.

Before fetching an image I first have to download the wiki article. This is an HTML page with all the image links included. In the HTML I first have to look for all <img> elements. This is done in WorkLog.fetchImages(self, url). The url parameter contains the URL of the current wiki article. I need this later. To find the image elements I use the library BeautifulSoup here. This library supports selectors that are similar to CSS selectors. Instead of looking for the images directly, I look for the container elements that embed them in the article; the selector in the script below matches elements with the class infobox-image and figure elements. This excludes all images that are not part of the article itself. Also, when listing the images, I can locate and fetch the image caption more easily.
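The selection step can be illustrated without any third-party dependency. The following sketch uses html.parser from the standard library as a stand-in for BeautifulSoup, so it is not the script's own method. It mirrors the selector of the script (figure elements and elements with the class infobox-image); the class name ImageCollector and the sample HTML are made up for illustration:

```python
from html.parser import HTMLParser

# elements without an end tag must not change the nesting depth
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ImageCollector(HTMLParser):
    """Collect the src of the first <img> inside each image container."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # 0 means: currently outside of any container
        self.found = False  # an image was already taken from this container
        self.sources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth == 0:
            classes = (attrs.get("class") or "").split()
            is_container = tag == "figure" or "infobox-image" in classes
            if is_container and tag not in VOID_TAGS:
                self.depth = 1
                self.found = False
            return
        if tag == "img" and not self.found and attrs.get("src"):
            self.sources.append(attrs["src"])
            self.found = True
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth > 0 and tag not in VOID_TAGS:
            self.depth -= 1


SAMPLE = """
<table><tr><td class="infobox-image">
  <a href="/wiki/File:a.jpg"><img src="//upload.example.org/thumb/a.jpg/220px-a.jpg"></a>
</td></tr></table>
<img src="/static/logo.png">
<figure>
  <a href="/wiki/File:b.jpg"><img src="//upload.example.org/thumb/b.jpg/250px-b.jpg"></a>
  <figcaption>Caption of b</figcaption>
</figure>
"""

collector = ImageCollector()
collector.feed(SAMPLE)
print(collector.sources)
```

The logo image outside of the two containers is skipped, which is the effect the container-based selection is meant to achieve. The sketch assumes well-formed HTML in which every opened element is closed again.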

Inside the container element, I now look for the image element. The src attribute contains the link to the image file. In the case of Wikipedia the URL usually points to upload.wikimedia.org. If it is a complete URL, we can take the content of the src attribute as it is; if only the protocol is missing (the link starts with //), https: is prepended. In case it's an absolute path without the domain or even a relative link, I first need to compose the download URL from the URL and path of the current page. This is done in WorkLog.getRealUrl(self, url, imgurl). This is why I need the URL of the article that is currently being parsed.
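These cases can also be handled by urllib.parse.urljoin from the standard library, which resolves a link against the URL of the page it was found on. This is an alternative to the hand-written logic in WorkLog.getRealUrl, not the method the script uses; the function name real_url and the image paths are made up for illustration:

```python
from urllib.parse import urljoin

ARTICLE_URL = "https://en.wikipedia.org/wiki/Michael_Jordan"


def real_url(article_url, imgurl):
    """Resolve the src attribute of an image against the article url."""
    return urljoin(article_url, imgurl)


examples = [
    "https://upload.wikimedia.org/x/portrait.jpg",  # complete url
    "//upload.wikimedia.org/x/portrait.jpg",        # protocol is missing
    "/static/images/portrait.jpg",                  # absolute path, domain is missing
    "images/portrait.jpg",                          # relative path
]
for src in examples:
    print(src, "->", real_url(ARTICLE_URL, src))
```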

If the original image is supposed to be downloaded, I remove the /thumb/ part of the path, together with the last path segment that holds the scaled file name, in the hope that this leads to the original image. This is a good guess and works in most cases, but it may fail with svg files because their thumbnails are raster images.
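The guess can be written down as a small function. This is a sketch under the assumption that a thumbnail URL has the form .../thumb/<path>/<file>/<width>px-<file>; the function name original_url and the example URL are made up:

```python
def original_url(thumb_url):
    """Guess the url of the original image from a thumbnail url."""
    if "/thumb/" not in thumb_url:
        # the image is already embedded in its original size
        return thumb_url
    # drop the last path segment (the scaled file name) ...
    without_scaled_name = thumb_url[:thumb_url.rfind("/")]
    # ... and remove the /thumb/ directory from the path
    return without_scaled_name.replace("/thumb/", "/", 1)


thumb = "https://upload.wikimedia.org/wikipedia/commons/thumb/a/ab/Example.jpg/220px-Example.jpg"
print(original_url(thumb))
```

A URL without /thumb/ is returned unchanged, so an image that is embedded at its original size is not damaged by the guess.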

To know which image to download, all matches found by BeautifulSoup must be counted. The attribute WorkLog.download contains a list with the numbers of the images to download. If the list contains only one element, there are two special cases:

  • the element is -1, download all images
  • the element is 0, do not download the images but list them only (the list mode that is set with the argument --list).

In the normal case counting starts at 1. Every image that is found increments the counter, and if the current count is in the list, the image is downloaded. By default, the list contains only the element 1, which means that the first image found in the wiki article is downloaded.
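The selection rule described above can be sketched in a few lines. The helper names parse_num and wanted are hypothetical and only illustrate the logic; in the script the same decision is made inside WorkLog.fetchImages:

```python
def parse_num(value):
    """Hypothetical helper: turn the value of --num into the download list."""
    if value == "all":
        return [-1]
    return [int(part) for part in value.split(",")]


def wanted(cnt, download):
    """Return True if image number cnt (counting starts at 1) is visited."""
    if download == [-1] or download == [0]:
        # download all images / list all images: every image is visited
        return True
    return cnt in download


download = parse_num("2,4")
selected = [cnt for cnt in range(1, 6) if wanted(cnt, download)]
print(selected)
```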

For a quiz about US presidents, the approach would be slightly different. Wikipedia already has a page List of presidents of the United States. Fortunately, the page contains a table that lists each president together with his portrait. In this case we tell the script to download all the images at once:

./wiki2image.py https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States --num all --dir us_presidents

This creates a new directory us_presidents and downloads all images into it. Luckily, the file name also contains the president's name, so there is no guessing which portrait shows which president after the files have been downloaded. There are more images on that page, like the US flag or the coat of arms. These images can be ignored or deleted. Another good aspect of these images is that the thumbnails all have a very similar size, which makes it easy to combine them on a questionnaire.

The whole script follows here. It contains a lot of documentation, which makes the code grow. Some parts are there for convenience, to vary the download modes: command line arguments let the user set several options, so the script code itself does not have to be changed.

wiki2image.py
#!/usr/bin/env python3
# -*- coding: UTF-8 -*-

"""
This script tries to retrieve one or more images that are embedded in the
given Wikipedia article.

General call is: wiki2image.py <wiki_url> | <wiki_article>

If you want to retrieve an image of Winston Churchill, the article could
be provided by the complete url from wikipedia which is
https://en.wikipedia.org/wiki/Winston_Churchill or just by the article name
Winston_Churchill. You may also write "Winston Churchill" but must enclose
the name with quotes then.

You can provide several article names at once, images of these will be fetched
subsequently. You may also provide a local file name with the articles listed,
that should be fetched. Each article or the complete wiki url must be on a
separate line.

Optional parameters are:
--dir               Set output directory where to store the downloaded images.
--help              Print this help screen.
--list              List all image links that are found on the wiki page. Do
                    not download them.
--num <N>,[<N>]|all number of the image(s) in the article that should be
                    fetched. This is useful when you run --list before which
                    returns an enumerated list of all images found in that
                    wiki article. Several images can be downloaded at once by
                    separating the numbers with comma. Use "all" for down-
                    loading all images at once.
                    By default the first image is fetched.
--orig              Do not fetch the embedded thumbnail image but the original
                    image with the full size (default is thumbnail).
--verbose           Some verbose output during the script run.
--wiki <base url>   Base url of the wiki, for the English Wikipedia this
                    would be https://en.wikipedia.org/wiki. The article name
                    is appended to that URL automatically.
"""

import os, sys, re, json
import urllib.request, urllib.parse
from bs4 import BeautifulSoup

usage = ("Usage: " + os.path.basename(sys.argv[0]) +
         " wiki_url|article\n" +
         "Type --help for more details."
        )


def dieNice(errMsg = ""):
    """Die nice with an error message and return exit code 1"""
    print("Error: {0}\n{1}".format(errMsg, usage))
    sys.exit(1)

class WorkLog:
    """
    WorkLog class to download images from a Wikipedia page.
    This basically holds all parameters that can be specified on the
    command line, processes a wiki article or a list of articles
    and fetches the images.
    """

    def __init__(self):
        """Initialize the object

        Attributes:

        url (string): with the wiki url, default is https://en.wikipedia.org/wiki/
        articles (list): of wiki articles to process
        verbose (bool): to output more information during the processing
        download (list): of integers, which images to download
        mode (string): whether to download the thumbnail or original image

        """

        self.url = 'https://en.wikipedia.org/wiki/'
        """ Default wiki url is from the english Wikipedia"""

        self.articles = []
        """List of articles to process is empty at first"""

        self.verbose = False
        """Only report errors but nothing else"""

        self.download = [ 1 ]
        """Which images to download. By default download the first image
        that we can find. If download is [ -1 ] then download all images
        that are on the wiki page. If download is [ 0 ] then list all image
        urls and caption (if defined) on STDOUT but do not download the
        images"""

        self.mode = 'thumb'
        """By default download the thumbnails. Set it to "orig" to download
        the original image"""

        self.outputDir = ''
        """When set, the dir is created (in case it's not yet there) and all
        downloaded files will be stored in that directory"""

    def setUrl(self, url):
        """Set wiki url from parameter `url` in case later wiki article
        names are used only. Then this url is prepended.

        Parameters:
        url (string): wiki url.

        Returns:
        self:

        """

        self.url = url if url[-1:] == '/' else url + '/'
        return self

    def setDownload(self, num):
        """Set download number which image to download

        Parameter `num` containing a list which images to download.
        If the list is [ 0 ] then list images only. If list is [ -1 ]
        then download all images.

        Parameters:
        num (list): list of integers which images to download

        Return:
        self:

        """

        self.download = num if isinstance(num, list) else [ num ]
        return self

    def setMode(self, mode):
        """
        Set `mode` which image type to download (orig or thumb)

        Parameters:
        mode (string): type of image
        Returns:
        self:
        Exception:
        NameError: if an invalid value is submitted.

        """

        if mode == 'thumb' or mode == 'orig':
            self.mode = mode
            return self
        raise NameError('Invalid argument ' + mode)

    def setOutputDir(self, dir):
        """Set output `dir` where to store the downloaded images.
        If the directory does not exist, it will be created

        Parameters:
        dir (string): name of directory

        Returns:
        self:

        """

        self.outputDir = dir
        return self

    def addArticle(self, article):
        """Add `article` to list to be processed. This can be a wiki
        article name or a complete wiki url

        Parameters:
        article (string): name or url of wiki article

        Returns:
        self:
        """

        self.articles.append(article)
        return self

    def setVerbose(self):
        """Enable verbose mode"""

        self.verbose = True
        return self

    def readFile(self, filename):
        """Read the file with the list of wiki articles.
        Parameter `filename` contains the file to read.

        Parameters:
        filename (string): filename with wiki articles

        Returns:
        self:

        """

        try:
            # the with statement closes the file even if reading fails
            with open(filename, "r") as fp:
                for line in fp:
                    self.articles.append(line.strip())
        except OSError:
            dieNice('could not open file {0}'.format(filename))
        return self

    def process(self):
        """Process the list of wiki articles"""

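        urlPattern = re.compile('^https?://', re.IGNORECASE)
        for article in self.articles:
            if urlPattern.match(article):
                self.fetchImages(article)
            else:
                # decode first so that an already url encoded name such as
                # Michael%20Jordan is not encoded a second time
                name = urllib.parse.unquote(article)
                self.fetchImages(self.url + urllib.parse.quote(name))
        return self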

    def getDataFromUrl(self, url):
        """Get data from a given `url`

        Parameters:
        url (string): url to download

        Returns:
        binary: data that is downloaded

        """
        try:
            req = urllib.request.Request(
                url, 
                data = None, 
                headers = {
                    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
                }
            )
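            with urllib.request.urlopen(req) as fp:
                return fp.read()
        except Exception as e:
            # only an HTTPError carries a status code, other exceptions
            # such as URLError do not have the attribute `code`
            code = getattr(e, 'code', 'n/a')
            print('Error: could not open url {u} Code: {c}, Message: {m}'.format(u = url, c = code, m = str(e)))
            return False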


    def getRealUrl(self, url, imgurl):
        """Get the real url from the image url that is found in the article.

        Parameters:
        url (string): of the article
        imgurl (string): url of the image location inside the article

        Returns:
        string: of the absolute url that can be used to download the image

        """

        if imgurl[0:2] == '//': # just add the protocol identifier
            return 'https:' + imgurl
        if imgurl[0:1] == '/': # it's an absolute path, add the domain of the article
            return url[0:url.index('/', 10)] + imgurl
        if imgurl[0:4] != 'http':
            # it's a relative path replace the article name with the real path
            return url[0:url.rfind('/')] + '/' + imgurl
        # it's a complete url with protocol, no need to manipulate it
        return imgurl

    def downloadFile(self, url):
        """Download the content from the given `url` and store it in a file
        that has the same name as on the server (last part of the url).

        Parameters:
        url (string): url to download

        Returns:
        bool: True if download and writing to file was successful.
        """

        # create output dir in case it does not exist
        if len(self.outputDir) > 0:
            if not os.path.isdir(self.outputDir):
                try:
                    os.mkdir(self.outputDir)
                except OSError:
                    print("Could not create output directory {0}".format(self.outputDir))
                    sys.exit(1)

        # from the last part of the url, take the file name
        name = url[url.rfind('/') + 1:]
        # and download the data
        data = self.getDataFromUrl(url)
        # we downloaded something
        if data != False and len(data) > 0:
            # try to save it in the dir and file
            fullpath = name if len(self.outputDir) == 0 else os.path.join(self.outputDir, name)
            try:
                with open(fullpath, "wb") as fp:
                    fp.write(data)
                if self.verbose:
                    print("Image downloaded successfully and stored in {0}".format(fullpath))
                return True
            except OSError:
                print("Error: could not store data into file {0}".format(fullpath))
        return False

    def fetchImages(self, url):
        """Fetch images from a given article url. Parameter `url` contains
        the string with the wiki article url. This article is parsed and all
        image links are extracted.

        Parameters:
        url (string): of wiki article that is parsed for image links.

        Returns:
        bool: Success or Failure

        """

        name = url[url.rfind('/') + 1:]
        downloadSuccess = True

        if len(name) == 0:
            print("Error: empty article name, skip line")
            return False

        html = self.getDataFromUrl(url)
        if html == False:
            return False

        if self.verbose:
            print("Parsing url {0}".format(url))

        soap = BeautifulSoup(html.decode('utf-8'), 'html.parser')
        cnt = 0
        for aimage in soap.select('.infobox-image, figure'):
            cnt += 1

            # check if we have selected a single image > 0
            if len(self.download) == 1 and self.download[0] > 0 and self.download[0] != cnt:
                continue
            # we do have a list, check if the current image is in the list that we want to have
            if len(self.download) > 1 and cnt not in self.download:
                continue

            # a container without an image element cannot be processed, skip it
            img = aimage.find('img')
            if img is None or not img.get('src'):
                continue

            # proceed with downloading the image
            imgurl = img.get('src')
            if self.mode == 'orig' and '/thumb/' in imgurl:
                # drop the scaled file name at the end and the /thumb/ directory
                imgurl = imgurl[0:imgurl.rfind('/')].replace('/thumb/', '/')

            # list only but do not download the image, therefore look for the image caption
            if self.download[0] == 0:
                alt = img.get('alt')
                if alt is None or len(alt) == 0:
                    caption = aimage.find('figcaption')
                    alt = caption.get_text() if caption is not None else ''

                if len(alt) > 0:
                    print("{0}: {1}\n\t{2}\n".format(cnt, self.getRealUrl(url, imgurl), alt))
                else:
                    print("{0}: {1}\n".format(cnt, self.getRealUrl(url, imgurl)))

                continue
            # download the current image, list is either [ -1 ] for all images
            # or we just happen to have cnt being in the list of images to download.
            # Remember a failure so that a later success does not hide it.
            if not self.downloadFile(self.getRealUrl(url, imgurl)):
                downloadSuccess = False

        return downloadSuccess


def main():
    """Evaluate command line arguments, build up worklog and start
    processing the wiki articles"""

    # available options that can be changed via the command line
    options = ['wiki', 'orig', 'num', 'list', 'verbose', 'help', 'dir']

    # the worklog that handles the wiki articles and processes them.
    worklog = WorkLog()

    # try to fetch the command line args
    currentCmd = ''
    # skip the first element, it contains the script name
    for arg in sys.argv[1:]:
        # we have a command identified by -- remember it in currentCmd
        # in case this command needs an argument, or just set the
        # appropriate parameter in the worklog or execute some action
        if arg[0:2] == '--':
            currentCmd = arg[2:]
            if not(currentCmd in options):
                dieNice("Invalid argument %s" % currentCmd)
            if currentCmd == 'help':
                print(__doc__)
                sys.exit(0)
            elif currentCmd == 'verbose':
                worklog.setVerbose()
                currentCmd = ''
            elif currentCmd == 'list':
                worklog.setDownload(0)
                currentCmd = ''
            elif currentCmd == 'orig':
                worklog.setMode('orig')
                currentCmd = ''
        # we have an argument, what was the previous command, do this
        # action in the worklog.
        elif len(currentCmd) > 0:
            if currentCmd == 'wiki':
                worklog.setUrl(arg)
            elif currentCmd == 'num':
                try:
                    if arg == 'all':
                        worklog.setDownload(-1)
                    else:
                        # a single number or several numbers separated by comma
                        worklog.setDownload([int(num) for num in arg.split(',')])
                except ValueError:
                    dieNice("Invalid value {0} for --num".format(arg))
            elif currentCmd == 'dir':
                worklog.setOutputDir(arg)
            currentCmd = ''
        else:
            # check if current arg is a file
            if os.path.isfile(arg):
                worklog.readFile(arg)
            else:
                worklog.addArticle(arg)

    # process the data now
    worklog.process()

if __name__ == "__main__":
    main()