4000er summits in the Swiss Alps

Some time ago I added Switzerland to the Theme Maps App. However, the topic list for Switzerland is still quite empty, and I would like to fill it. As Switzerland lies in the middle of the Alps, a list of mountains that exceed an altitude of 4000 m was an obvious choice.

The problem remains: where to get the data from? Wikipedia is always a great place to start looking. Indeed, there is a page titled List of mountains of the Alps over 4000 metres. Let's start from there.

Wikitables

Wikipedia is full of lists and tables. MediaWiki has a special syntax for tables, and the resulting HTML always looks the same. Therefore, I started with the idea of fetching the tables within a page.

The next step is to bring the table content into a format that can be processed further. Also, I am often not interested in all the data, but only in certain columns or rows.

In the past (in other articles) I have written scripts that extract data from Wikipedia. This time I am writing a script that fetches the table content from a page.

wiki2table.py script

This script works similarly to others I have written before, so the style and structure should look familiar. The script knows two modes. First, just give it an article and get back a list of tables and their column names. Once we know these, the second mode is to tell it which table to fetch, with a few parameters specifying which columns to fetch and in which format the content should be returned.

When fetching tables, we basically need to look for <table> elements. Inside a table there are table headers <th> and table cells <td>, both embedded in table rows <tr>. The first tricky part I encountered: the table header may span several rows (mostly not more than two), and a table cell may span several rows via its rowspan attribute. Also, table header cells can appear at the beginning of a table row. All of this can be seen in the table in the paragraph Number of Alpine four-thousanders and distribution.

Therefore, the script identifies the table header (i.e. the column names) by looking at the table rows: if no <td> is found, the row is assumed to be a header row. Also (at the moment in the header only) we check whether a cell has a colspan attribute; if so, the cell content is copied to the neighbouring cell(s) included in the span. Header cells with the same index are merged, their contents concatenated with a newline character.

Whenever we find a <td> in a table row, we assume that we have left the header section. Still, we need to look for both <td> and <th> elements, to include the header cells at the beginning of a row.
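As an illustration, a simplified (and made-up) wikitable could look like this. The script would detect two header rows, copy Height into both spanned columns, and treat the row with Dufourspitze as the first data row:

<table class="wikitable">
  <tr>                           <!-- header row 1: no <td> inside -->
    <th>Name</th>
    <th colspan="2">Height</th>  <!-- copied into the spanned neighbour column -->
  </tr>
  <tr>                           <!-- header row 2: still no <td> -->
    <th></th>
    <th>m</th>
    <th>ft</th>
  </tr>
  <tr>                           <!-- first data row: it contains <td> cells -->
    <th>Dufourspitze</th>        <!-- a header cell at the beginning of the row -->
    <td>4,634</td>
    <td>15,203</td>
  </tr>
</table>

The merged column names would then be Name, Height + m and Height + ft, each pair joined by the newline character.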

Inside a cell there might be arbitrary HTML. By default, tags are stripped but their content is preserved. For our particular use case we actually need the anchor tags, but we want to get rid of the footnotes, hence the <sup> tags. The script supports this in a granular way: for each tag you can decide whether it should be preserved, or deleted including its children.

To be versatile, and because it came almost for free, I added a few formatters: json, html and csv. The latter was a bit more complicated to get right.

wiki2table.py
#!/usr/bin/env python3
# -*- coding: UTF-8 -*-

"""
This script fetches table contents from a given Wikipedia article.

General call is: wiki2table.py <wiki_url> | <wiki_article>

Optional parameters are:
--del <delimiter>   Delimiter for CSV output, default is ';'. Use "tab" for the
                    tab character. Other allowed values are ',', '|', ';' and ':'.
--format <format>   Output format, can be one of 'csv', 'html', 'json'.
                    Default is 'json'.
--cols <N>[,<N>]    Index number(s) of the column(s) to be fetched. Index starts
                    with 1. The order of the columns cannot be selected here;
                    they are always returned from lowest to highest index.
                    If this option is not given, all columns are fetched.
--help              Print this help screen.
--noquotes          Do not use quotes in the CSV output. Be careful, this may lead
                    to weird behaviour when a column value contains the delimiter
                    char in its value.
--rmtag <t>[,<t>]   Tags to remove from the output. The entire tag and its content
                    will be removed. If you want to keep the content of the tag,
                    use the option --tag.
--table <N>         Index number of the table to be fetched. Index starts with 1.
                    If this option is not given, a list of all tables with
                    their columns will be printed out.
--tag <t>[,<t>]     Tags to keep in the output. Default is to delete all tags but
                    keep their content. <br> will be replaced by a space.
                    If you want to delete a tag with its content (e.g. the <sup>
                    tag), then use the --rmtag option.
--tagtext <t>[,<t>] Tags whose content (child tags or text) is kept while the tag
                    itself is deleted. This is useful for <span> tags.
--wiki <base url>   Base url of the wiki; for the English Wikipedia this
                    would be https://en.wikipedia.org/wiki. The article name is
                    appended to that URL automatically.
"""

from bs4 import BeautifulSoup
import urllib.request, urllib.parse
import sys
import json

class Table:
    """
    This class handles the table html and derives columns and rows from it.
    """

    def __init__(self, html: BeautifulSoup):
        """
        Initialize the Table object with the html of the table.

        Parameters:
        html (BeautifulSoup): The html of the table.
        """

        self.html = html
        self.cols = None
        self.rows = None
        self.colsToFetch = []
        self.keepTags = []
        self.keepTagsText = []
        self.deleteTags = []

    def setColToFetch(self, col: int):
        """
        Adds a column number to be fetched.
        
        Parameters:
        col (int): The column number to be fetched.

        Returns:
        Self: The current object.
        """
        if col not in self.colsToFetch:
            self.colsToFetch.append(col)
        return self
    
    def setKeepTag(self, tag: str):
        """
        Adds a tag to be kept in the output.

        Parameters:
        tag (str): The tag to be kept.

        Returns:
        Self: The current object.
        """
        tag = tag.lower()
        if (tag not in self.keepTags):
            self.keepTags.append(tag)
        return self
    
    def setKeepTagText(self, tag: str):
        """
        Adds a tag which is itself deleted but the children are kept in the output.

        Parameters:
        tag (str): The tag from which the content is kept.

        Returns:
        Self: The current object.
        """
        tag = tag.lower()
        if (tag not in self.keepTagsText):
            self.keepTagsText.append(tag)
        return self
    
    def setTagToDelete(self, tag: str):
        """
        Adds a tag to be deleted (including children) from the output.

        Parameters:
        tag (str): The tag to be deleted.

        Returns:
        Self: The current object.
        """
        tag = tag.lower()
        if (tag not in self.deleteTags):
            self.deleteTags.append(tag)
        return self
    
    def cleanHtml(self, html: BeautifulSoup) -> str:
        """
        Cleans the cell content from tags that are not wanted in the output.

        Parameters:
        html (BeautifulSoup): The html of the cell.

        Returns:
        str: The cleaned cell content.
        """
        for tag in html.find_all():
            # By default we actually replace the <br> with a space.
            if tag.name == 'br' and 'br' not in self.keepTags:
                tag.replace_with(' ')
                continue
            if tag.name in self.keepTagsText:
                # Replace the tag with its content (e.g. innerHTML).
                tag.replace_with(BeautifulSoup(tag.decode_contents(), 'html.parser'))
                continue
            # If we do not want to keep the tag, we have to handle it in one of two ways.
            if tag.name not in self.keepTags:
                # Delete the tag and *all* its content.
                if tag.name in self.deleteTags:
                    tag.extract()
                # Replace the tag with its textual content; this removes child tags
                # but preserves their text.
                else:
                    tag.replace_with(tag.get_text().strip())

        return html.decode_contents().strip()

    def fetchAllCols(self) -> list:
        """
        Fetch all columns of the table.

        Returns:
        list: A list of column names.
        """
        if self.cols is not None:
            return self.cols
        self.cols = []
        mergedCols = {}
        soup = BeautifulSoup(self.html, 'html.parser')
        for row in soup.find_all('tr'):
            # If there are any normal cells, we have left the section of the header row(s).
            if (len(row.find_all('td')) > 0):
                break
            cnt = 0
            # Loop through the header cells.
            for col in row.find_all('th'):
                colValue = self.cleanHtml(col)
                cnt += 1
                # If we have several column headers, merge the content of the same column header.
                if cnt not in mergedCols:
                    mergedCols[cnt] = colValue
                else:
                    mergedCols[cnt] += "\n" + colValue
                # Check if the column has a colspan attribute, then copy the same content to neighboring cells.
                if col.get('colspan') != None and int(col.get('colspan')) > 1:
                    for i in range(1, int(col.get('colspan'))):
                        cnt += 1
                        if cnt not in mergedCols:
                            mergedCols[cnt] = colValue
                        else:
                            mergedCols[cnt] += "\n" + colValue

        # Put the merged columns into the cols list.
        for key in mergedCols:
            self.cols.append(mergedCols[key])
        return self.cols
    
    def fetchCols(self) -> list:
        """
        Filter the columns that are fetched. This is used to filter the columns that are fetched
        by the user.

        Returns:
        list: A list of column names.
        """
        selectedCols = []
        cols = self.fetchAllCols()
        for i, col in enumerate(cols):
            if len(self.colsToFetch) == 0 or (i + 1) in self.colsToFetch:
                selectedCols.append(col)
        return selectedCols

    def fetchRows(self) -> list:
        """
        Fetches the rows of the table.
        
        Returns:
        list: A list of rows with their cell contents.
        """
        if self.rows is not None:
            return self.rows
        self.rows = []
        rowspan = {}
        soup = BeautifulSoup(self.html, 'html.parser')
        for row in soup.find_all('tr'):
            # Skip rows that do not contain any normal cells.
            if len(row.find_all('td')) == 0:
                continue
            # Sometimes the first row is a header row. However, we are in the table rows now.
            cells = row.find_all(['th', 'td'])
            if len(cells) == 0:
                continue
            rowdata = []
            for cnt in range(0, len(self.fetchAllCols())):
                if cnt in rowspan and len(rowspan[cnt]) > 0:
                    rowdata.append(rowspan[cnt].pop())
                    continue
                if (len(cells) == 0):
                    rowdata.append(BeautifulSoup('', 'html.parser'))
                    continue
                cell = cells[0]
                cells = cells[1:]
                rowdata.append(cell)
                if cell.get('rowspan') is not None and int(cell.get('rowspan')) > 1:
                    rowspan[cnt] = [cell] * (int(cell.get('rowspan')) - 1)
                if cell.get('colspan') is not None and int(cell.get('colspan')) > 1:
                    # Insert colspan - 1 copies, so the cell fills all spanned columns.
                    for i in range(1, int(cell.get('colspan'))):
                        cells.insert(0, cell)
            
            selectedCells = []
            for i, cell in enumerate(rowdata):
                if len(self.colsToFetch) == 0 or (i + 1) in self.colsToFetch:
                    selectedCells.append(self.cleanHtml(cell))
            self.rows.append(selectedCells)
        return self.rows
    

class Wiki2Table:
    """
    This class is the base class for the wiki2table script. It fetches the html of the wiki article,
    parses the html to find all tables or one specific table, and passes the whole table html to the
    Table class.
    """

    def __init__(self, article: str = None):
        """I
        Initialize the Wiki2Table object.

        Parameters:
        article (str): The article name to be fetched.
        """
        self.article = article
        self.url = 'https://en.wikipedia.org/wiki/'
        self.tables = []
        self.error = ''

    def setArticle(self, article: str):
        """
        Set the article to be fetched.
        
        Parameters:
        article (str): The article name to be fetched.
    
        Returns:
        Self: The current object
        
        """
        self.article = article
        return self

    def setUrl(self, url):
        """
        Set the base URL of the wiki.

        Parameters:
        url (str): The base URL of the wiki.

        Returns:
        Self: The current object
        """
        self.url = url
        return self

    def hasError(self) -> bool:
        """
        Check if an error occurred when fetching the table(s) of an article.

        Returns:
        bool: True if an error occurred, otherwise False
        """
        return len(self.error) > 0
    
    def getError(self):
        """
        Get the error message if an error occurred.

        Returns:
        str: The error message
        """
        return self.error
    
    def getHtml(self) -> bytes:
        """Get data from a given wikiarticle

        Returns:
        bytes: data that is downloaded.

        """

        if not self.article:
            self.error = 'No article given'
            return b''

        url = self.article if self.article.lower().find('https://') == 0 else self.url + urllib.parse.quote(self.article)
        try:
            req = urllib.request.Request(
                url, 
                data = None, 
                headers = {
                    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
                }
            )
            fp = urllib.request.urlopen(req)
            data = fp.read()
            return data
        except Exception as e:
            # Not every exception carries an HTTP status code (e.g. URLError), hence getattr.
            self.error = 'Error: could not open url {u} Code: {c}, Message: {m}'.format(u = url, c = getattr(e, 'code', 'n/a'), m = str(e))
            return b''


    def findTables(self) -> list:
        """
        Find all tables in the article.

        Returns:
        list: A list of Table objects.
        """
        html = self.getHtml()
        if not html:
            return []
        soup = BeautifulSoup(html.decode('utf-8'), 'html.parser')
        for atable in soup.select('table.wikitable'):
            table = Table(str(atable))
            self.tables.append(table)

        if len(self.tables) == 0:
            self.error = 'No tables found in {0}'.format(self.article)

        return self.tables
    
    def getTable(self, index: int):
        """
        Fetches the html of one table, specified by the index.

        Parameters:
        index (int): The index of the table to be fetched.

        Returns:
        Table: The Table object or None if the table was not found.
        """
        index -= 1
        self.findTables()
        if index >= 0 and index < len(self.tables):
            return self.tables[index]
        self.error = 'Table {0} not found in {1}'.format(index+1, self.article)
        return None

class Formatter:
    """
    This class is the base class for the formatters.
    """

    def __init__(self, table: Table):
        self.table = table
    
    def out(self) -> str:
        return ''

class FormatterHtml(Formatter):
    """
    This class formats the table object into an HTML table.
    """

    def out(self) -> str:
        """
        Return the HTML table as a string.

        Returns:
        str: The HTML table as a string.
        """
        html = '<table><thead><tr>' + ''.join(['<th>{0}</th>'.format(col) for col in self.table.fetchCols()]) + '</tr></thead><tbody>'
        html += ''.join(['<tr>' + ''.join(['<td>{0}</td>'.format(cell) for cell in row]) + '</tr>' for row in self.table.fetchRows()])
        return html + '</tbody></table>'

class FormatterJson(Formatter):
    """
    This class formats the table object into a JSON object. Column names are the keys
    for each element.
    """

    def out(self) -> str:
        """
        Return the json object as a string.

        Returns:
        str: The json object as a string.
        """
        obj = []
        for row in self.table.fetchRows():
            obj.append(dict(zip(self.table.fetchCols(), row)))
        return json.dumps(obj)
    
class FormatterCsv(Formatter):
    """
    This class formats the table object into a CSV file. A delimiter char can be selected. Quote
    char is always the double quote char. The values are put into quotes automatically unless this
    is switched off.
    """

    def __init__(self, table: Table):
        """
        Initialize the FormatterCsv object.
        
        Parameters:
        table (Table): The table object to be formatted.
        """
        super().__init__(table)
        self.delimiter = ';'
        self.noQuotes = False

    def setDelimiter(self, delimiter: str):
        """
        Set the delimiter char for the CSV output.
        
        Parameters:
        delimiter (str): The delimiter char to be used.

        Returns:
        Self: The current object.
        """
        self.delimiter = delimiter
        return self
    
    def setNoQuotes(self, noQuotes: bool):
        """
        Set the noQuotes flag. If set to True, no quotes are used in the CSV output.

        Parameters:
        noQuotes (bool): True if no quotes should be used, otherwise False.

        Returns:
        Self: The current object.
        """
        self.noQuotes = noQuotes
        return self

    def getColValue(self, col: str) -> str:
        """
        Format the column value. Check if quotes are needed and add them if necessary.
        
        Parameters:
        col (str): The column value to be formatted.

        Returns:
        str: The formatted column value.
        """
        quotes = ''
        if self.noQuotes:
            return col
        if col.find('"') > -1:
            col = col.replace('"', '""')
            quotes = '"'
        if col.find(self.delimiter) > -1 or col.find('\n') > -1:
            quotes = '"'
        return quotes + col + quotes

    def out(self) -> str:
        """
        Return the CSV output as a string.

        Returns:
        str: The CSV data.
        """
        # Join the values per line, so no trailing delimiter is produced.
        lines = [self.delimiter.join(self.getColValue(col) for col in self.table.fetchCols())]
        for row in self.table.fetchRows():
            lines.append(self.delimiter.join(self.getColValue(col) for col in row))
        return '\n'.join(lines) + '\n'

def main():
    """
    The main function of the script. It parses the command line arguments and invokes the
    classes to fetch and process the table data.
    """

    # available options that can be changed via the command line
    options = ['format', 'help', 'wiki', 'table', 'cols', 'del', 'tag', 'tagtext', 'rmtag', 'noquotes']

    wiki = Wiki2Table()
    tableNum = 0
    selectCols = ''
    output = 'json'
    csvdel = ';'
    currentCmd = ''
    keepTags = ''
    keepTagsText = ''
    rmTags = ''
    noCsvQuotes = False

    for arg in sys.argv[1:]:
        # We have a command identified by -- remember it in currentCmd
        # in case this command needs an argument, or just set the
        # appropriate parameter without argument.
        if arg[0:2] == '--':
            currentCmd = arg[2:]
            if currentCmd not in options:
                print("Invalid argument %s" % currentCmd)
                sys.exit(1)
            if currentCmd == 'help':
                print(__doc__)
                sys.exit(0)
            if currentCmd == 'noquotes':
                noCsvQuotes = True
                currentCmd = ''
        # We have an argument: check which command preceded it and
        # perform the corresponding action.
        elif len(currentCmd) > 0:
            if currentCmd == 'wiki':
                wiki.setUrl(arg)
            elif currentCmd == 'cols':
                selectCols = arg
            elif currentCmd == 'del':
                if arg not in [';', ',', '|', 'tab', ':']:
                    print("Invalid delimiter %s" % arg)
                    sys.exit(1)
                csvdel = arg if arg != 'tab' else '\t'
            elif currentCmd == 'table':
                tableNum = int(arg)
            elif currentCmd == 'format':
                if arg not in ['csv', 'html', 'json']:
                    print("Invalid format %s" % arg)
                    sys.exit(1)
                output = arg
            elif currentCmd  == 'tag':
                keepTags = arg
            elif currentCmd == 'tagtext':
                keepTagsText = arg
            elif currentCmd == 'rmtag':
                rmTags = arg

            currentCmd = ''
        else:
            wiki.setArticle(arg)

    # A specific table is requested. Fetch the table and output it.
    if tableNum > 0:
        table = wiki.getTable(tableNum)
        if wiki.hasError():
            print(wiki.getError())
            sys.exit(1)
        
        if selectCols != '':
            for num in selectCols.split(','):
                table.setColToFetch(int(num))
        if keepTags != '':
            for tag in keepTags.split(','):
                table.setKeepTag(tag)
        if keepTagsText != '':
            for tag in keepTagsText.split(','):
                table.setKeepTagText(tag)
        if rmTags != '':
            for tag in rmTags.split(','):
                table.setTagToDelete(tag)
        if output == 'csv':
            formatter = FormatterCsv(table)
            formatter.setDelimiter(csvdel)
            formatter.setNoQuotes(noCsvQuotes)
        elif output == 'html':
            formatter = FormatterHtml(table)
        else:
            formatter = FormatterJson(table)
        print(formatter.out())
    # No specific table is requested. Print out the list of tables.
    else:
        tables = wiki.findTables()
        if wiki.hasError():
            print(wiki.getError())
            sys.exit(1)

        for cnt, table in enumerate(tables, start = 1):
            print("Table {0} has columns: {1}".format(cnt, table.fetchCols()))
            
if __name__ == '__main__':
    main()

The script is self-explanatory and also has a --help switch that shows all possible arguments.

Fetch the summits

To fetch the summits, we use the first table of that page. As explained before, we do not need all columns. What we basically need is the name of the mountain, the altitude and where it is located (because we want the mountains of Switzerland only).
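To find the index of the table (and of its columns), the script can first be run in its list mode, i.e. without the --table option; it then prints every table of the article together with its column names:

python3 wiki2table.py List_of_mountains_of_the_Alps_over_4000_metres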

Therefore, I use the following command to fetch the data:

python3 wiki2table.py List_of_mountains_of_the_Alps_over_4000_metres \
  --table 1 --cols 1,3,4,6 --format csv --noquotes \
  --tag a --rmtag sup --tagtext span > summits.csv

From the first table we fetch the leading number, the name of the mountain, the altitude and the country. For easier handling we export the data as csv. Also, we want to keep the anchor tags and delete the footnotes. The anchor tag is required to fetch the geographic coordinates from the wiki page of the mountain itself: because the page title sometimes differs from the mountain's name, we need the anchor and not just its text.
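A line of the resulting summits.csv then looks roughly like the following (rank, markup and exact column contents are illustrative here, not copied from the article):

12;<a href="/wiki/Matterhorn" title="Matterhorn">Matterhorn</a>;4,478;Switzerland, Italy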

Since in our case the csv delimiter is not contained in any cell value, we omit the quotes around the values in the csv file.

Reformat the list

The csv file produced by that command now contains all the mountains. First, we filter for Switzerland.

cat summits.csv | grep Switzerland > summits-ch.csv

To check that we have all mountains of Switzerland, we count the lines; there should be 48 entries:

wc -l summits-ch.csv

The next part is trickier. From the anchor tag in the second column we need to extract the article url of the mountain, fetch the coordinates, take the mountain name itself, and put all this together with the altitude into another list.

The following shell script helps here:

cat summits-ch.csv | cut -d\; -f1,2,3 | while read l; do
  num=$(echo $l | cut -d\; -f 1)
  name=$(echo $l | cut -d\; -f 2 | perl -lpe 's/.*>([^<]*)<\/a>.*/\1/g')
  url=$(echo $l | cut -d\; -f 2 | perl -lpe 's/.*href="([^"]*)".*/\1/g')
  ele=$(echo $l | cut -d\; -f 3)
  eleint=$(echo $ele | tr -d ,)
  coords='[0, 0, 0]'
  if [[ $url =~ "/wiki/" ]]; then 
    out=$(python3 wiki2geojson.py --wiki https://en.wikipedia.org $url | \
      perl -lpe 's/.*"coordinates": (\[.*?\]).*/\1/g')
    if [[ ! $out =~ ^Error ]]; then
      coords=$(echo $out | perl -lpe "s/0\]$/${eleint}]/")
    fi
  fi
  cat <<END_JSON
    {
      "type": "Feature",
      "geometry": {
        "type": "Point", "coordinates": $coords
      }, "properties": {
        "rank": $num,
        "name": "$name",
        "elevation": "$ele m",
        "wikipedia": "https://en.wikipedia.org$url"
      }
    },
END_JSON
done

Put this content into a shell script, and run it like:

bash summits.sh > summits.geojson

The result is more or less geojson, but not yet valid. To make it a valid json file, add the following at the top of the file:

{ "type": "FeatureCollection",
  "features": [

At the bottom of the file, replace the last trailing , by ]}. Try to load the file in your browser, and you should not get any errors.
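If you prefer not to edit the file by hand, a small helper can do the wrapping. This is a minimal sketch of the idea; the file name wrap_geojson.py is made up, it is not one of the scripts above:

#!/usr/bin/env python3
# Minimal sketch: wrap the feature snippets produced by summits.sh
# into a valid FeatureCollection. Reads from stdin, writes to stdout.
import sys

fragment = sys.stdin.read().rstrip()
# Drop the trailing comma after the last feature.
if fragment.endswith(','):
    fragment = fragment[:-1]
print('{ "type": "FeatureCollection",')
print('  "features": [')
print(fragment)
print(']}')

It would be called like: bash summits.sh | python3 wrap_geojson.py > summits.geojson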

Reformatting explained

What happens in that shell script? The starting point was a csv file. The first cat, together with cut, extracts the first three columns (omitting the country), and the loop reads the file line by line. Inside the loop we extract each column value and transform it a bit (especially the name) to fill our variables:

  • $num for the overall rank.
  • $name for the name of the mountain, taken from the text inside the anchor (see the Python illustration after this list). If there is no anchor element, the whole string is matched and returned.
  • $url for the wiki url, taken from the anchor's href attribute. If there is no anchor element, the whole string is matched, which gives us an invalid url.
  • $ele for the elevation, as it comes from the original table.
  • $eleint: for the geojson coordinates we need an integer value for the elevation, so we remove the thousands separator, i.e. the ,.
  • $coords is prefilled as a placeholder with a valid coordinates string.
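To illustrate what the two Perl substitutions extract, here is the same logic in Python (for illustration only, the shell script itself uses Perl; the cell value is made up):

import re

# A made-up second-column value as it could appear in summits-ch.csv.
cell = '<a href="/wiki/Dufourspitze" title="Dufourspitze">Dufourspitze</a>'

name = re.sub(r'.*>([^<]*)</a>.*', r'\1', cell)   # the text inside the anchor
url = re.sub(r'.*href="([^"]*)".*', r'\1', cell)  # the href attribute

print(name)  # Dufourspitze
print(url)   # /wiki/Dufourspitze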

With the above variables we now try to fetch the coordinates from the wiki page referred to in $url. The variable contains the path only, so we need to prefix it with the wiki domain. The script wiki2geojson.py that fetches the coordinates is described in the article Fetch geo data from Wikipedia, where I have done this before. The output of that script is some geojson on a single line, which makes it easy to extract the coordinate string; see the fetch geo data article for how exactly that string looks. If this was successful, the real elevation is set in place of the 0 that the script returns. The coordinate string can then be used as it is when writing the whole geojson for that item later.

Finally, we use a "heredoc" in bash to define a string template; when the string is evaluated, the values of the variables are interpolated into it.

I used Perl instead of sed because the regular expressions are easier to write in Perl once they get a bit more complicated. Perl allows more shortcuts, like \d, and needs less escaping of special chars that would otherwise be interpreted differently by the shell.

Missing data

Once we have the valid geojson file, we check how many coordinates could not be fetched. Unfortunately, there are a few:

cat summits.geojson | grep "\[0, 0, 0\]" | wc -l

This returned 17. Looking at the failures, I was a little puzzled: the wiki article https://en.wikipedia.org/wiki/T%C3%A4schhorn looks fine and actually contains coordinates. Soon I found out that my script wiki2geojson.py was trying to encode the submitted url a second time. Hence, when the provided article name or url already contains the percent char, I now assume it is already encoded and do not encode it again. After fixing this, running the above shell script a second time and fixing the geojson again, the result was not much better.
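The fix boils down to a guard like the following. This is a minimal sketch of the idea; the function name encodeArticle is made up, and the actual code in wiki2geojson.py may look different:

import urllib.parse

def encodeArticle(article: str) -> str:
    # If the name already contains a percent char, assume it is
    # percent-encoded already and leave it untouched.
    if '%' in article:
        return article
    return urllib.parse.quote(article)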

I figured out that some table cells contained a <span> tag, which was removed by the wiki2table.py script as intended. However, that also removed the enclosed anchor tag. So I added the switch --tagtext to the fetch-table script, which allows us to eliminate the span tag itself but keep all of its contents, like "give me the innerHTML of this tag". This solved the problem.
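For illustration, this is what the combination of switches does to a made-up cell, using the Table class of the script directly:

from bs4 import BeautifulSoup
from wiki2table import Table

# A made-up cell: the anchor is wrapped in a <span> and followed by a footnote.
cell = BeautifulSoup(
    '<td><span><a href="/wiki/Piz_Bernina">Piz Bernina</a></span><sup>[2]</sup></td>',
    'html.parser').td

table = Table('')
table.setKeepTag('a').setTagToDelete('sup').setKeepTagText('span')

# The <span> is unwrapped, the <sup> footnote is deleted, the anchor is kept:
print(table.cleanHtml(cell))  # <a href="/wiki/Piz_Bernina">Piz Bernina</a>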

Running all the steps from above once more and checking for 0,0 coordinates now finally returned 0 lines. Voilà, the geojson should be complete and contain all mountains with their coordinates.