Nichtsemester - Data mining with shell tools

Because of the coronavirus and the public shutdown, preschools, schools, universities and other public educational institutions are affected as well. They are closed to the public, so normal face-to-face teaching cannot continue. On the other hand, YouTube has been around for far more than a decade, and most (tech) conferences, such as the Chaos Communication Congress or FOSDEM, record every session, stream it over the internet and archive it for later viewing. This is an established way to go. However, some university lecturers seem to be unable or unwilling to record their lectures and would rather take a break: the summer semester of 2020 shall not be counted as an official semester in the curriculum and no exams are to be taken. Critical voices claim that, because there is little content in many non-STEM subjects, lecturers want to avoid criticism, since recorded and archived sessions would preserve the evidence.

On the website nichtsemester.de, some institutions at different universities started a petition not to count this semester in the students' curriculum. The supporters who signed the petition are listed on that website. In this article I want to see whether there is an accumulation of people from non-STEM subjects and whether certain universities are represented more than others. To do so I will do some very light data mining, taking the data from the supporters page and showing how to fetch and reformat it to get some insight into it.

First we fetch a copy of the particular web page. This can be done easily with wget:

wget -O website.html https://www.nichtsemester.de/cbxpetition/offener-brief/

The following script extracts the data from the webpage and transforms it into an easy-to-use format. The list of persons who signed the petition is an ordered list in the HTML. Each entry usually consists of a number (the position in the list), a name including title, a university name, and a faculty name of that university. The parts of an entry are separated by commas. At first, I came up with some bash code to extract the relevant data from the HTML:

less website.html | \
grep '<li dir="ltr" role="presentation">' | \
sed -e 's/<li dir="ltr" role="presentation">//g;s/<\/li>//g' > list.csv

This worked well but did not get me all 1379 entries. I discovered that at some point the data is no longer the plain text inside the <li> element; there is an additional <span> inside the list element. At this point I wrote a little script that also fixes some issues in the data (some entries contained too many commas, some were badly formatted, some had the title separated from the name by a comma). The resulting CSV file looks much cleaner now. The Python script fetch_data.py below does the data extraction from the website HTML:

fetch_data.py
#!/usr/bin/env python3
# -*- coding: UTF-8 -*-

"""
This script reads the content of
https://www.nichtsemester.de/cbxpetition/offener-brief/
and evaluates the list of persons that signed the public call.
"""

import os, sys, re
import urllib.request, urllib.parse
from bs4 import BeautifulSoup

list = []

try:
    #fp = urllib.request.urlopen('https://www.nichtsemester.de/cbxpetition/offener-brief/')
    fp = open('website.html', 'rb')
    data = fp.read()
    
except:
    print("Error: could not open url")
    sys.exit(1)
# uncomment the following lines when downloading to check the http response code
#if fp.getcode() != 200:
#    print("Error: url {0} returned error code {1}".format(url, fp.getcode()))
#    sys.exit(1)

soap = BeautifulSoup(data.decode('utf-8'), 'html.parser')
cnt = 0
for li in soap.select('ol li'):
    cnt += 1

    # no matter if there are tags inside the <li> element, here we get
    # all text without the tags, these are stripped    
    line = li.get_text()
        
    # fix some individual errors
    line = line.replace('Jun.-Prof. Dr. Irina Gradinari FernUniversität in Hagen', 'Jun.-Prof. Dr. Irina Gradinari, FernUniversität in Hagen')
    line = line.replace('Prof. Jon Hughes, BSc, PhD, Justus Liebig University, Giessen, Plant Physiology', 'Prof. Jon Hughes BSc PhD, Justus Liebig University Giessen, Plant Physiology')
    line = line.replace('Ev. Hochschule Rheinland-Westfalen-Lippe.', 'Ev. Hochschule Rheinland-Westfalen-Lippe,')
    line = line.replace('Dr. Nicole Hirschfelder, Abteilung für Amerikanistik, Eberhard Karls Universität Tübingen', 'Dr. Nicole Hirschfelder, Eberhard Karls Universität Tübingen, Abteilung für Amerikanistik')
    line = line.replace('Rose-Anabel Beermann, wissenschaftliche', 'Rose-Anabel Beermann, "keine Angabe" , wissenschaftliche')
    line = line.replace('Minkenberg. HS Düsseldorf', 'Minkenberg, HS Düsseldorf')
    
    # create a new list with the rebuild entry of the current list element
    # first add the ordered number. This makes it easy to compare the output
    # with the original data from the webpage.
    entry = [str(cnt)]
    
    # split the list item into pieces, delimiter is the comma
    tuple = line.split(',')

    # add name, should be the first field
    entry.append(tuple.pop(0).strip())
    
    # no further comma? missing university and position/faculty?
    if len(tuple) == 0:
       entry.append('')
       entry.append('')
       list.append(entry)
       continue
       
    # add university, if too short or contains at least two . then apparently this
    # is still part of the name
    if len(tuple[0]) < 8 or tuple[0].count('.') > 1:
        entry[1] += ' ' + tuple.pop(0).strip()
        # no further information?
        if len(tuple) == 0:
           entry.append('')
           list.append(entry)
           continue

    # add university
    entry.append(tuple.pop(0).strip())
    
    # all further elements will be added to position/faculty
    entry.append(' '.join(tuple))
    entry[3] = entry[3].strip()

    list.append(entry)

# print out the whole list, this time with the column delimiter being a |
for i in list:
    print('|'.join(i))

During development I checked the output constantly and added some lines that had not been there at first, only to work around the badly formatted data. After executing the script, the extracted data is stored in a file for further use:

./fetch_data.py > list.csv
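
A quick sanity check is to count the lines of the resulting file; the number should match the 1379 entries mentioned above:

wc -l list.csv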

Query the data

With the neatly formatted data we are now able to run some queries to view and count results. The general processing of the file may look like this:

less list.csv | cut -d \| -fX | grep pattern | wc -l

What happens here is that less writes the file to stdout. We pipe the result to the cut command. Its -d argument says that the delimiter between the fields is a pipe character (which we escape with a backslash on the command line), and with -f we tell it which field to extract. The X stands for the number of the column to extract; a quick test with a made-up entry follows after the list. The columns are:

  • 1 for the counter
  • 2 for the name field
  • 3 for the university field and
  • 4 for the faculty/position field.
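
To see what cut does with such a line, here is a quick test with a single made-up entry in the same format (the name, university and faculty are invented, not taken from the real data):

echo '42|Prof. Dr. Erika Mustermann|Beispiel-Universität|Institut für Beispielkunde' | cut -d \| -f 3

This prints only the third field, Beispiel-Universität.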

With grep we filter only the lines that match our search term. You may use -i here for a case-insensitive search, -E for an extended regular expression, and many other options. Finally, we use wc (word count) with the argument -l to count the matching lines. A very useful habit is to build the pipeline from left to right, adding one filter at a time and running it after each step. This way you can verify that you used the correct column and that the intermediate results contain the expected matches before finally counting them. You can even chain several grep commands one after another to refine the lines matched by the previous grep.
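
For example, the Anglistik/Amerikanistik query from the examples below can be assembled step by step; the head commands only keep the intermediate output short:

less list.csv | cut -d \| -f 4 | head
less list.csv | cut -d \| -f 4 | grep -iE 'Anglist|Amerikanist' | head
less list.csv | cut -d \| -f 4 | grep -iE 'Anglist|Amerikanist' | wc -l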

Here are a few examples:

# how many professors
less list.csv | cut -d \| -f 2 | grep 'Prof.' | wc -l
# how many doctors
less list.csv | cut -d \| -f 2 | grep 'Dr.' | wc -l
# how many from the Humboldt University of Berlin
less list.csv | cut -d \| -f 3 | grep Berlin | grep Humbold | wc -l
# in contrast to a university that teaches mostly STEM subjects
less list.csv | cut -d \| -f 3 | grep KIT | wc -l
# Anglistik or Amerikanistik
less list.csv | cut -d \| -f 4 | grep -iE 'Anglist|Amerikanist' | wc -l

If you want to see the whole matching line, e.g. to know to which faculties the petitioners from the Humboldt University belong, then run this query:

less list.csv | grep Berlin | grep Humbold

Run this command first with wc -l attached at the end of the pipeline to check that you get the same number of results as above. In this case grep tries to match all columns. Naturally, Berlin and Humboldt should only appear in the university column. However, if a late descendant of Wilhelm or Alexander von Humboldt worked at some other institution and signed the petition, that row would match as well, this time because of column 2, which contains the name.
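
If you want to restrict the match to the university column but still print the whole line, awk can do this in one step; a small sketch on the same data:

awk -F'|' '$3 ~ /Humbold/' list.csv

Here -F'|' sets the field separator and the pattern only has to match the third field, so a Humboldt appearing in the name column would not be counted.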

Finally, we want a ranking of the institutions from which the most signers come:

less list.csv | cut -d \| -f 3 | sort | uniq -c | sort -nr > uni_ranking.csv

A short explanation of the new commands: after extracting the university column we sort the entries alphabetically. The sorted list is piped to the uniq tool, which prints every line only once. However, this only works when lines with the same content follow each other, which is why the list has to be sorted first. The argument -c prepends the number of occurrences to each line. The result is a list of counts followed by the institution strings, which is then sorted again, numerically and in reverse order, to get a ranking with the most frequent universities first.
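
To take a quick look at the top of the ranking:

head -n 10 uni_ranking.csv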

The result is not entirely correct because if the same institution is spelled differently, the counts end up on different lines. The entries:

  • 1 Goethe-University Frankfurt
  • 1 Goethe-Universität Frankfurt/M. / HSD Düsseldorf
  • 1 Goethe-Universität Frankfurt am Main

all refer to the same university. You also notice here that for people who are employed at two different universities, the automatic counting doesn't work either. However, for a general overview this is sufficient.
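
A crude improvement is to normalise a few known spelling variants with sed before counting. The rules below are only a sketch for the Goethe-Universität example above and would have to be extended by hand; the entry with two affiliations would still need manual treatment:

less list.csv | cut -d \| -f 3 | \
sed -e 's/Goethe-University/Goethe-Universität/;s/ am Main$//' | \
sort | uniq -c | sort -nr > uni_ranking_normalised.csv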

Students

Below the list of petitioners there is also a list of supporters (mostly students) that we can evaluate as well. The tricky part here is that the original page only shows the last 500 entries. A click on a link below the list loads the next 500 supporters via an Ajax request and appends the result to the page. You can click the link a couple of times, but I noticed that Firefox became really slow once the list grew beyond a certain size. With Firebug I looked at the Ajax request to see what happens there and copied the request from the list of HTTP requests in the network tab.

The following curl command "simulates" the Ajax request from the browser. I actually don't know what the security parameter is used for, but I send it along with my requests exactly as I found it in the request captured with Firebug:

#!/bin/bash
# content of the script ajax.sh

if [[ -z $1 ]]; then
    echo "page number missing"
    exit
fi

page=$1

curl -XPOST \
-H "X-Request-Width: XMLHttpRequest" \
-a "User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0" \
-d "action=cbxpetition_load_more_signs&security=73ce5279e8&petition_id=5&page=${page}&perpage=500&order=DESC&orderby=id" \
--url https://www.nichtsemester.de/wp-admin/admin-ajax.php > page_${page}.json

To run the requests for all Ajax calls you can run the script in a loop. Note that ajax.sh already redirects the curl output to page_${page}.json, so the loop does not need its own redirect:

for i in `seq 1 31`; do ./ajax.sh $i ; done

The result stored in these JSON files is the response to the Ajax request. On the webpage some JavaScript takes this JSON and inserts the contained string into the DOM tree of the page. I need to transform this JSON into a data structure that can be processed more easily later on.
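
If you have jq installed you can peek into a response directly on the command line. The HTML fragment with the entries sits in the "listing" field of the JSON (the Python script below reads the same field), and counting the occurrences of the signature-card class gives a rough idea of how many entries one page contains:

jq -r '.listing' page_1.json | head -c 300
jq -r '.listing' page_1.json | grep -o 'signature-card' | wc -l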

The following Python script fetch_sig.py transforms the data into a CSV file. It works very similarly to the script above, just a little simpler:

fetch_sig.py
#!/usr/bin/env python3
# -*- coding: UTF-8 -*-

"""
This script reads a JSON file that was fetched via the Ajax endpoint
https://www.nichtsemester.de/wp-admin/admin-ajax.php
and evaluates the list of supporters that signed the public call.
"""

import os, sys, re, json
from bs4 import BeautifulSoup

list = []

try:
    fp = open(sys.argv[1], 'rb')
    data = json.load(fp)
    
except:
    print("Error: could not open json file")
    sys.exit(1)

soap = BeautifulSoup(data['listing'], 'html.parser')
cnt = 0
for item in soap.select('.signature-card'):
    cnt += 1

    person = item.select('.signature-name')
    
    # select() returns a list, so check for an empty result
    if not person:
        continue
        
    # create a new list with the rebuild entry of the current list element
    entry = [str(cnt)]
    
    tuple = person[0].get_text().split(',')
    entry.append(tuple.pop(0).strip())
    if len(tuple) == 0:
        entry.append('')
    else:
        entry.append(','.join(tuple).strip())
    
    date = item.select('.signature-date-time')
    if date:
        entry.append(date[0].get_text().replace(' um', '').replace(' Uhr.', ':00'))
    else:
        entry.append('')
    
    list.append(entry)

# print out the whole list, this time with the column delimiter being a |
for i in list:
    print('|'.join(i))

The script is now used in a loop to transform all JSON files at once. The output is appended to one CSV file:

for i in `seq 1 31`; do ./fetch_sig.py page_$i.json >> students.csv ; done

The csv file contains the following columns:

  • 1 = ascending number for each entry
  • 2 = name of the person
  • 3 = institution where the person is occupied
  • 4 = date of sign up

Unfortunately, column 3 contains a mix of the institution together with the faculty and/or position. This makes it impossible to build a ranking of the universities as was done with the other data. Other queries, e.g. for a certain university, will still work, though perhaps not as smoothly as above. To get more detailed information about the institutions, some semantic analysis of the content in column 3 would be needed to figure out which part is the name of an institution and which part is an occupation or position. This could be done e.g. with regular expressions and dictionary lookups matching the content against a list of known universities. However, that is much more work, which I am not interested in for this data at this point.
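
Simple queries are still possible, though. Two examples; the search term 'tübingen' here is only a placeholder for whichever university you are interested in:

# total number of extracted supporter entries
wc -l students.csv
# rough count of supporters from one university
less students.csv | cut -d \| -f 3 | grep -ci 'tübingen'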

Counting my list of supporters and the list of petitioners, I am also missing 501 entries compared to the numbers on the website. I didn't worry much about this, because for the student evaluation you would have to invest some more time to get good and reliable figures anyway. I wanted to show a simple approach to getting hold of the data and gaining a first insight without spending much time on it.