Using Immo Scout to find an apartment with the help of Python and Bash

To find a new place to live, Immo Scout is one of the sites that comes to mind when investigating the real estate market. Depending on the region, the market can be very volatile: offers that were online one day may be gone the next. Therefore, when you are looking at offers, store the link to each ad, keep it in a list, and check whether the URI is still valid. I personally had a text file with the URI of an ad on one line and additional information on separate lines; the format varied. To check whether an offer is still online, a quick HTTP HEAD request is enough: if the response code is 200, the offer is still valid. Trying to access ads that have been taken down results in an HTTP response with code 410 GONE.

grep "https:" list.txt | while read -r i
do
    # curl -I sends a HEAD request, -s suppresses the progress output.
    # Match only the status code: HTTP/2 servers answer "HTTP/2 200" without "OK".
    if curl -Is "$i" | head -n 1 | grep -q " 200"; then echo "$i"; fi
done

This prints all URIs that are still valid. The remaining problem is that you need to clear obsolete URIs from the file manually. The good part is that the URIs are listed in the same order as in the original file.
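The manual cleanup can be automated. Here is a minimal sketch (a hypothetical helper, not part of my original workflow) that assumes each block in list.txt starts with the ad's URI and that all lines up to the next URI belong to that ad; blocks whose URI no longer answers with 200 are dropped, and the order is preserved:

#!/usr/bin/env python3
# Prune list.txt: keep only the blocks whose URI still answers with 200.
import urllib.request
import urllib.error

def is_online(uri):
    # a HEAD request is enough; removed ads answer with 410 GONE
    req = urllib.request.Request(uri, method='HEAD')
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        return False

with open('list.txt') as f:
    lines = f.readlines()

keep, valid = [], True
for line in lines:
    if line.startswith('https:'):  # a new block starts with the ad's URI
        valid = is_online(line.strip())
    if valid:
        keep.append(line)

with open('list.txt', 'w') as f:
    f.writelines(keep)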

Fetch details of an ad

To actually fetch the details of an ad, I wrote a small Python script that is fed the URI of the ad, fetches the HTML, and tries to extract information about the apartment or house. The output can be written to a CSV file, which in turn can be imported into a spreadsheet to get a nice table.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

"""
This script fetches information about real estate object from immoscout.
The argument can be one or more urls to a single object for rental or
sale. Instead of an url also a local file can be provided that contains
the content of a url to an object.

The following arguments can be provided:
-f format    the output format of the data. By default this is json.
             Other formats are gpx and csv.
-o filename  name of the file where to write the output. If no file is
             given, the output is written to stdout.
"""

import urllib.request
import sys
import re
import json
from os.path import exists
from lxml import etree
from io import StringIO

def fetchUrlAndParse(url: str) -> etree.ElementTree:
    if exists(url):
        # the argument is a local file containing the saved HTML of an ad
        with open(url, "r") as f:
            data = StringIO(f.read())
    else:
        opener = urllib.request.build_opener()
        opener.addheaders = [('User-agent', 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0')]
        html = opener.open(url).read()
        data = StringIO(html.decode('utf-8'))

    parser = etree.HTMLParser()
    tree   = etree.parse(data, parser)
    return tree

def fetchPageData(url: str) -> dict:
    tree = fetchUrlAndParse(url)
    factsheet = {}
    address = ''
    name = ''
    point = None
    link = url
    for table in tree.xpath('//table[contains(@class, "DataTable")]'):
        for tr in table.xpath('.//tr'):
            tds = tr.xpath('.//td')
            if len(tds) > 1:
                key = tds[0].text.strip() if tds[0].text is not None else ''
                if len(key) > 0:
                    val = tds[1].text.strip() if tds[1].text is not None else ''
                    factsheet[key] = val
    for h2 in tree.xpath('//h2'):
        if h2.text is not None and h2.text.strip() == 'Standort':
            address = etree.tostring(h2.getnext()).decode("utf-8").strip().replace('<br/>', '\n')
            address = re.sub('<[^<]+?>', '', address)
            break
    name = tree.xpath('//title')[0].text.strip()
    # the ad links to Google Maps; the coordinates are in the "q=" parameter
    for a in tree.xpath('//a'):
        href = a.get('href', '')
        if href.find('maps') > -1:
            href = href[href.find('q=') + 2:]
            loc = href.split(',')
            if len(loc) == 2:
                point = {'lat': float(loc[0]), 'lon': float(loc[1])}
            break
    # prefer the canonical URL over the one given on the command line
    for l in tree.xpath('//link[@rel="canonical"]'):
        link = l.attrib['href']
        break

    return {
        'url': link,
        'name': name,
        'loc': point,
        'address': address,
        'facts': factsheet
    }

def formatGpx(output):
    gpx = '<?xml version="1.0" encoding="UTF-8" standalone="no" ?>' \
    '<gpx version="1.1" creator="canjanix.net">' \
    '<metadata> <!-- Metadata --> </metadata>'
    for obj in output:
        if obj['loc'] is None:
            continue  # skip objects where no coordinates could be extracted
        gpx += '<wpt lat="%f" lon="%f"><src>%s</src><name>%s</name><desc>%s</desc><cmt><![CDATA[%s]]></cmt></wpt>' % \
            (obj['loc']['lat'], obj['loc']['lon'], obj['url'], obj['name'], obj['address'], obj['facts'])
    gpx += '</gpx>'
    return gpx

def formatCsv(output, delimiter='|'):
    regex = re.compile(r"\r?\n", re.MULTILINE)
    cols = ['url', 'name', 'lat', 'lon', 'address'] # collect all possible column names
    for obj in output:
        for key in obj['facts'].keys():
            if key not in cols:
                cols.append(key)
    csvData = delimiter.join(cols) + "\n"
    for obj in output:
        for k in cols:
            if k == 'lat' or k == 'lon':
                value = str(obj['loc'][k]) if obj['loc'] is not None else ''
            elif k in ['url', 'name', 'address']:
                value = re.sub(regex, ' ', obj[k]).replace(delimiter, ' ')
            else:
                try:
                    value = re.sub(regex, ' ', obj['facts'][k]).replace(delimiter, ' ')
                    if value == '':
                        value = 'x'  # mark facts that are present but empty
                except KeyError:
                    value = ''
            csvData += value + delimiter
        # replace the trailing delimiter with a newline
        csvData = csvData[0:-1] + "\n"
    return csvData

def main():
    urls       = []
    outputFmt  = 'json'
    outputFile = None
    currentArg = ''
    for arg in sys.argv[1:]:  # skip the program name
        # we have a command identified by - remember it in currentArg
        # in case this command needs an argument, or just set the
        # appropriate variable if this is a switch only.
        if arg == '--help':
            print(__doc__)
            sys.exit(0)
        elif arg.startswith('-'):
            if arg in [ '-f', '-o' ]:
                currentArg = arg
            else:
                print('Invalid argument %s' % arg)
                sys.exit(1)
        elif len(currentArg) > 0:
            if currentArg == '-o':
                outputFile = arg
            else:
                if arg in [ 'json', 'gpx', 'csv' ]:
                    outputFmt = arg
                else:
                    print('Invalid argument %s' % arg)
                    sys.exit(1)

            currentArg = ''
        else:
            urls.append(arg)
    # argument processing is done here
    output = []
    for url in urls:
        data = fetchPageData(url)
        output.append(data)

    if outputFmt == 'gpx':
        output = formatGpx(output)
    elif outputFmt == 'csv':
        output = formatCsv(output)
    else:
        output = json.dumps(output, indent=2)  # default: proper JSON, not the Python repr
    if outputFile is not None:
        with open(outputFile, 'w') as f:
            print(output, file=f)
    else:
        print(output)

if __name__ == "__main__":
    main()

The script can be used with several URLs at once. The output is written to stdout or to a file given by the -o argument. This is quite useful when you use the -f argument to format the output as csv or gpx and you wish to collect information about several objects at once.
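For example, collecting two ads into one CSV file could look like this (immo.py is whatever name you saved the script under):

python3 immo.py -f csv -o objects.csv <url-of-first-ad> <url-of-second-ad>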

[Screenshot: one of the DataTable fact tables on an Immo Scout ad page]

There are several tables that have the class DataTable. Inside each table we look at the rows and, within each row, at the first and second cell (the td elements). The data is stored in a dictionary: the key is the left cell, the value is the right cell. Apart from this dictionary, which contains all the facts about the property, we collect some information that is scattered elsewhere on the page: the title of the ad, the location (geo coordinates) and the canonical URL.
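To see the table parsing in isolation, here is a self-contained snippet; the HTML sample is invented, but it mirrors the structure the script expects:

from io import StringIO
from lxml import etree

html = '''<table class="DataTable">
  <tr><td>Rooms</td><td>3.5</td></tr>
  <tr><td>Living space</td><td>82 m²</td></tr>
</table>'''

tree = etree.parse(StringIO(html), etree.HTMLParser())
facts = {}
for tr in tree.xpath('//table[contains(@class, "DataTable")]//tr'):
    tds = tr.xpath('.//td')
    if len(tds) > 1 and tds[0].text:
        facts[tds[0].text.strip()] = tds[1].text.strip() if tds[1].text else ''
print(facts)  # {'Rooms': '3.5', 'Living space': '82 m²'}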

The GPX output is useful when you want to export the geographic locations and import them into a GPS device. Navigation becomes easier when the targets are already on your device and you don't need to type in the addresses.

The CSV export must join the dictionary keys of all objects into one list and use it as the header row. If an object does not define a key, the corresponding cell remains empty.
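The column-union logic is easy to demonstrate in isolation (the sample data is invented):

objs = [
    {'facts': {'Rooms': '3.5', 'Floor': '2'}},
    {'facts': {'Rooms': '4.5', 'Year built': '1995'}},
]
cols = []
for o in objs:
    for k in o['facts']:
        if k not in cols:
            cols.append(k)
print('|'.join(cols))  # Rooms|Floor|Year built
for o in objs:
    print('|'.join(o['facts'].get(c, '') for c in cols))
# 3.5|2|
# 4.5||1995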

I tested the script on the immoscout24.ch site. The German Immo Scout24 looks a bit different in its layout, so the Python script above would not fetch all information correctly there. Furthermore, running the script against the German site returns an HTTP 405 response. Immo Scout tries to make screen scraping difficult; accordingly, they do not use any structured data elements on their pages.

There is a nice project at https://pypi.org/project/immoscrapy/ that tries to query Immo Scout. However, I haven't tried it yet.