Working with XML in PHP

Most developers know what a plain XML file looks like. XML is also the basis of several other formats, such as RSS feeds, OpenOffice files, KML files and many more. It is also used for many data transfers (e.g. database exports) where JSON lacks some details. That doesn't mean that XML is superior to JSON, only that in some cases XML is the more natural fit. For RSS feeds or sitemaps there is simply a standard that describes exactly what the XML must look like. An additional advantage of XML is that a DTD can describe the structure of a document precisely. You can validate an XML document against a DTD and see whether it is valid in this context.
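
The DTD validation mentioned above can be tried with the DOM extension. This is only a minimal sketch: the helper name isValidAgainstDtd() and the tiny inline DTD are made up for illustration, and the DTD is embedded in the document itself rather than loaded from a separate file.

```php
<?php
// Collect libxml errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Hypothetical helper: true if the document is valid against its own DTD.
function isValidAgainstDtd(string $xmlString): bool
{
    $doc = new DOMDocument();
    if (!$doc->loadXML($xmlString)) {
        return false; // not even well-formed
    }
    return $doc->validate();
}

// Illustrative DTD: every <comment> must carry an author attribute.
$dtd = <<<'DTD'
<!DOCTYPE comments [
  <!ELEMENT comments (comment*)>
  <!ELEMENT comment (#PCDATA)>
  <!ATTLIST comment author CDATA #REQUIRED>
]>
DTD;

$valid   = $dtd . "\n" . '<comments><comment author="julia">Hi</comment></comments>';
$invalid = $dtd . "\n" . '<comments><comment>Hi</comment></comments>';

var_dump(isValidAgainstDtd($valid));   // bool(true)
var_dump(isValidAgainstDtd($invalid)); // bool(false), the author attribute is missing
```

The details of why a document is invalid can be read with libxml_get_errors().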

All examples from this article are bundled in a ZIP archive; after downloading and unpacking, they should run out of the box.

SimpleXML

The SimpleXML extension of PHP is the easiest way to deal with XML. It is straightforward to use, and a few lines of code are enough to parse and handle the XML.

This sample reads the URLs of a sitemap.xml file in PHP:

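$xmlfile = 'http://www.sampledomain.de/sitemap.xml';
// pass the URL as is; rawurlencode() would also encode ":" and "/" and break it
$xml = simplexml_load_file($xmlfile);
if ($xml === false) {
    die('Could not load ' . $xmlfile);
}
// note: '//url' only matches if the sitemap declares no default namespace (see "Namespaces")
$res = $xml->xpath('//url');
if (is_array($res)) {
    foreach ($res as $node) {
        // escape the URL before it is written into the HTML
        $loc = htmlspecialchars((string)$node->loc);
        echo '<a href="' . $loc . '">' . $loc . '</a><br/>' . "\n";
    }
}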

The script reads the <url> elements of the sitemap. Each <url> element contains a <loc> element that holds the URL. The sample script prints a list of HTML links with these URLs.
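
A sitemap fetched from a remote server may be broken or incomplete, so it can help to handle parser errors explicitly. The following is a minimal sketch, not part of the original examples: the helper name sitemapUrls() is hypothetical, and it works on an XML string so that it runs without network access.

```php
<?php
// Collect libxml errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Hypothetical helper: returns the <loc> values of a sitemap string
// or throws with the first parser message if the XML is malformed.
function sitemapUrls(string $xmlString): array
{
    $xml = simplexml_load_string($xmlString);
    if ($xml === false) {
        $errors = libxml_get_errors();
        libxml_clear_errors();
        $message = $errors ? trim($errors[0]->message) : 'unknown error';
        throw new RuntimeException('Invalid XML: ' . $message);
    }
    $urls = [];
    foreach ($xml->url as $url) {
        $urls[] = (string) $url->loc;
    }
    return $urls;
}

try {
    print_r(sitemapUrls('<urlset><url><loc>http://example.com/a.html</loc></url></urlset>'));
    sitemapUrls('<urlset><url>'); // not well-formed, throws
} catch (RuntimeException $e) {
    echo $e->getMessage() . PHP_EOL;
}
```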

Namespaces

XML files may contain namespaces. Parsing the XML or evaluating an XPath expression then becomes a bit trickier, and the results will be unexpected if you do not handle the namespaces correctly.

This is a sitemap that is stored in the file sitemap1.xml:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
  xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>http://example.com/example.html</loc>
    <image:image>
      <image:loc>http://example.com/img1.jpg</image:loc>
    </image:image>
  </url> 
  <url>
    <loc>http://example.com/example2.html</loc>
    <image:image>
      <image:loc>http://example.com/photo.jpg</image:loc>
   </image:image>
  </url> 
</urlset>

and we want to fetch all URLs.

Solution 1, get rid of the default namespace:

$data = file_get_contents('sitemap1.xml');
$data = str_replace('xmlns=', 'ns=', $data);
$xml = simplexml_load_string($data);
foreach ($xml->xpath('//url') as $node) {
    echo $node->loc . PHP_EOL;
}
foreach ($xml->xpath('//image:loc') as $img) {
    echo $img . PHP_EOL;
}

Solution 2, register the namespaces:

$data = file_get_contents('sitemap1.xml');
$xml = simplexml_load_string($data);
foreach ($xml->getDocNamespaces() as $k => $v) {
    if ($k == '') $k = 'c'; // hack to fill the empty prefix
    $xml->registerXPathNamespace($k, $v);
}
foreach ($xml->xpath('//c:url') as $node) {
    echo $node->loc . PHP_EOL;
}
foreach ($xml->xpath('//image:loc') as $img) {
    echo $img . PHP_EOL;
}
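
A third option avoids XPath altogether. This is a sketch under the assumption that the namespace URI of the image extension is known in advance: SimpleXML property access reaches the elements in the default namespace, and children() with a namespace URI selects the prefixed ones.

```php
<?php
$sitemap = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
  xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>http://example.com/example.html</loc>
    <image:image>
      <image:loc>http://example.com/img1.jpg</image:loc>
    </image:image>
  </url>
</urlset>
XML;

$imageNs = 'http://www.google.com/schemas/sitemap-image/1.1';
$xml = simplexml_load_string($sitemap);

$pages = []; // page URL => list of image URLs
foreach ($xml->url as $url) {
    $images = [];
    // children($imageNs) returns only the child elements in the image namespace
    foreach ($url->children($imageNs)->image as $image) {
        $images[] = (string) $image->loc;
    }
    $pages[(string) $url->loc] = $images;
}
print_r($pages);
```

Using the namespace URI instead of the prefix keeps the code working even if a sitemap generator picks a different prefix than "image".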

Dashes

XML element names may contain hyphens. Because a hyphen is not allowed in PHP's property syntax ($xml->iso3166-1 would be read as a subtraction), such names must be written as a quoted string in curly braces.

Given an XML snippet like:

  <registrationDate>2012-10-25T11:53:49-04:00</registrationDate>
  <ref>http://whois.arin.net/rest/customer/C03191699</ref>
  <city>Piscataway</city>
  <iso3166-1>
    <code2>US</code2>
    <code3>USA</code3>
    <name>UNITED STATES</name>
    <e164>1</e164>
  </iso3166-1>
  <handle>C03191699</handle>
  <name>Ahrefs Inc.</name>
  <postalCode>08854</postalCode>
  <iso3166-2>NJ</iso3166-2>

The nodes can be accessed the following way:

$whoisArr['city']       = (string)$xml->city;
$whoisArr['name']       = (string)$xml->name;
$whoisArr['postalCode'] = (string)$xml->postalCode;
$whoisArr['country']    = (string)$xml->{'iso3166-1'}->code2;
$whoisArr['state']      = (string)$xml->{'iso3166-2'};
$streetAddress= '';
foreach ($xml->streetAddress as $node) {
    $streetAddress .= $node->line . "\n";
}
$whoisArr['streetAddress'] = trim($streetAddress);

Attributes

XML elements often contain attributes. Their content, or even their mere existence, can matter when parsing the XML.

Given the following snippet:

<?xml version="1.0"?>
<comments>
  <comment id="973544" author="julia" created="Tue, 17 Jan 2023 01:31:04 +0800">
    I love You!
  </comment>
  <comment id="973658" author="romeo" created="Tue, 17 Jan 2023 16:41:42 +0800">
    I love you, too! Julia
  </comment>
  <comment id="973665" author="julia" created="Tue, 17 Jan 2023 17:17:16 +0800">
    Thanks Romeo
  </comment>
</comments>

The chat can be displayed like:

Julia: I love You!
Romeo: I love you, too! Julia
Julia: Thanks Romeo

by using the following PHP code:

$xml = simplexml_load_string(file_get_contents('chat.xml'));

foreach ($xml->comment as $comment) {
  echo ucfirst((string)$comment['author']) . ': ' . trim((string)$comment) . PHP_EOL;
}
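
Since the existence of an attribute can matter as well, here is a small sketch that checks for an attribute with isset() and lists all attributes of an element with attributes(). The second comment without an author and the fallback name "Unknown" are made up for this illustration.

```php
<?php
$xml = simplexml_load_string(
    '<comments>'
    . '<comment id="1" author="julia">Hi</comment>'
    . '<comment id="2">Anonymous</comment>'
    . '</comments>'
);

$lines = [];
foreach ($xml->comment as $comment) {
    // isset() tells whether the attribute exists at all
    $author = isset($comment['author']) ? ucfirst((string) $comment['author']) : 'Unknown';

    // attributes() iterates over all attributes as name => value
    $attributes = [];
    foreach ($comment->attributes() as $name => $value) {
        $attributes[$name] = (string) $value;
    }

    $lines[] = $author . ' (' . implode(', ', array_keys($attributes)) . '): ' . trim((string) $comment);
}
echo implode(PHP_EOL, $lines) . PHP_EOL;
```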

Document Object Model

An alternative to SimpleXML is the Document Object Model (DOM). Developers who are familiar with JavaScript probably know the DOM functions very well; in PHP they work much the same way. The chat example above could be parsed like this:

$doc = new DOMDocument();
$doc->load('chat.xml');
$comments = $doc->getElementsByTagName('comment');
for ($i = 0; $i < $comments->length; $i++) {
    echo $comments->item($i)->getAttribute('author') . ': ' . trim($comments->item($i)->textContent) . PHP_EOL;
}

echo "or..\n";

foreach ($doc->getElementsByTagName('comment') as $comment) {
    echo $comment->getAttribute('author') . ': ' . trim($comment->textContent) . PHP_EOL;
}

There are a few things to be aware of when working with the DOM. While I am talking about XML only, HTML is a special case: everything that can be done with XML can also be done with HTML documents. The problem is that most HTML is not clean XML, i.e. constructs that are fine in HTML are syntax errors in XML. The DOM extension also does not handle HTML5 documents well, and the encoding of special characters can get corrupted. Likewise, if you write out a partial XML document with $doc->saveXML(), the resulting string contains the XML declaration <?xml ...?>. When merging such strings into a larger XML document, these declarations need to be removed.
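
One way to avoid the XML declaration in the first place is to pass the node you want to serialize to saveXML(). A minimal sketch with a made-up two-element document:

```php
<?php
$doc = new DOMDocument();
$doc->loadXML('<comments><comment author="julia">Hi</comment></comments>');

// Without an argument the whole document is written, including the XML declaration.
$whole = $doc->saveXML();

// With a node as argument only that node is written, without the declaration.
$fragment = $doc->saveXML($doc->documentElement);

echo $whole;
echo $fragment . PHP_EOL;
```

The string in $fragment can be embedded into a larger document without any further clean-up.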

Parsing large XML Documents

SimpleXML loads the XML at once and parses it. The parsed structure is kept in memory. The bigger the XML gets, the lower the chances that the script finishes without running into a "Cannot allocate memory" error.

At one of my jobs, we got an export of a database and needed to import the data into our system. While the JSON file of the data was only about 0.5 GB, the XML export was significantly larger (more than 1 GB). Here, SimpleXML simply was no option. Neither was loading the JSON at once.

Fortunately PHP has other methods to parse XML. I will use the simple chat.xml file from above and try to achieve the same output in dialogue style (like in the plot of a play).

XMLReader

The first variant uses XMLReader, which doesn't look much more complicated than the SimpleXML version:

$xml = new XMLReader();
$xml->open('chat.xml');

while ($xml->read()) {
    if ($xml->nodeType === XMLReader::ELEMENT && $xml->name === 'comment') {
        echo $xml->getAttribute('author') . ': ' . trim($xml->readInnerXml()) . PHP_EOL;
    }
}

The trick here is that $xml->read() only reads chunks of the file and moves the cursor forward to the next node (an element, a text node and so on); attributes are read from the current element. You can build up stacks with the required information, or even split the huge XML file into smaller portions: focus on the children of the root node (or at least on elements near the top of the hierarchy) and write everything inside these elements into separate files.
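
A common way to combine both worlds is to let XMLReader walk through the file and hand each small record to SimpleXML. The following is a sketch only: the function name readChat() is hypothetical, and the temporary file merely stands in for a large chat.xml.

```php
<?php
// Hypothetical helper: streams through the file and converts
// one <comment> at a time into a SimpleXMLElement.
function readChat(string $file): array
{
    $lines = [];
    $reader = new XMLReader();
    if (!$reader->open($file)) {
        throw new RuntimeException('Cannot open ' . $file);
    }
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'comment') {
            // Only this single element is held in memory as a tree.
            $comment = new SimpleXMLElement($reader->readOuterXml());
            $lines[] = ucfirst((string) $comment['author']) . ': ' . trim((string) $comment);
        }
    }
    $reader->close();
    return $lines;
}

// Small stand-in for chat.xml, written to a temporary file.
$file = tempnam(sys_get_temp_dir(), 'chat');
file_put_contents(
    $file,
    '<comments>'
    . '<comment author="julia"> I love You! </comment>'
    . '<comment author="romeo"> I love you, too! Julia </comment>'
    . '</comments>'
);
$chatLines = readChat($file);
unlink($file);

echo implode(PHP_EOL, $chatLines) . PHP_EOL;
```

Memory usage then depends on the size of the largest record, not on the size of the whole file.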

XMLParser

XMLParser is very similar to XMLReader. The main difference is that the parser works as an event-driven system: you set up handler functions that are triggered when an opening tag, a closing tag or a text node is found. By default the parser folds element and attribute names to upper case, which is why the handlers below compare against 'COMMENT' and 'AUTHOR'.

The HTMLParser in Python, used in the script that fetches geographic information from a Wikipedia page, works in the same way.

<?php
$author = null; // author of the comment that is currently being parsed

function startTag(XMLParser $parser, string $name, array $attr) {
    global $author;
    if ($name === 'COMMENT') {
        $author = $attr['AUTHOR'];
    }
}

function endTag(XMLParser $parser, string $name) {
    global $author;
    if ($name === 'COMMENT') {
        $author = null;
    }
}

function printComment(XMLParser $parser, string $data) {
    global $author;
    if (!empty($author)) {
        echo $author . ': ' . trim($data) . PHP_EOL;
    }
}

$stream = fopen('chat.xml', 'r');
$parser = xml_parser_create();
// set up the handlers here
xml_set_element_handler($parser, 'startTag', 'endTag');
xml_set_character_data_handler($parser, 'printComment');
while (($data = fread($stream, 4096))) {
    xml_parse($parser, $data); // parse the current chunk
}
xml_parse($parser, '', true); // finalize parsing
xml_parser_free($parser);
fclose($stream);

Again, as with XMLReader, you must keep track of what you have already parsed and what comes next; you are implementing a state machine here. The global $author variable doesn't look that nice. Therefore, I wrote a second version that encapsulates the custom parsing in its own class, which also keeps track of the depth of an element.

<?php
class MyParser {

    protected $output;
    protected $inComment;
    protected $level;

    public function flush()
    {
        $this->output = '';
        $this->inComment = null;
        $this->level = 0;

    }

    protected function handleStartTag(XMLParser $parser, string $name, array $attr) :void
    {
        $this->level++;
        if ($name === 'COMMENT') {
            $this->output .= $attr['AUTHOR'];
            $this->inComment = true;
        }
    }

    protected function handleEndTag(XMLParser $parser, string $name) :void
    {
        $this->level--;
        if ($name === 'COMMENT') {
            $this->inComment = null;
        }
    }

    protected function handleData(XMLParser $parser, string $data) :void
    {
        if ($this->level === 2 && $this->inComment) {
            $this->output .= ': ' . trim($data) . PHP_EOL;
        }
    }

    public function parse(string $xmlFile) :string
    {
        $this->flush();
        $stream = fopen($xmlFile, 'r');
        // create new XMLParser instance
        $parser = xml_parser_create();
        // and setup event handlers for the parser.
        xml_set_element_handler($parser, [$this, 'handleStartTag'], [$this, 'handleEndTag']);
        xml_set_character_data_handler($parser, [$this, 'handleData']); 
        // start parsing
        while (($data = fread($stream, 4096))) {
            xml_parse($parser, $data);
        }
        // finalize parsing
        xml_parse($parser, '', true);
        // free resources
        xml_parser_free($parser);
        fclose($stream);
        // return the collected data string
        return $this->output;
    }
}
echo (new MyParser())->parse('chat.xml');
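
One detail worth keeping in mind with this kind of parser: the character data handler may be called more than once for a single text node, for example when the text is split across two chunks of the input. The following sketch is a variation of the examples above, not the version from the archive. It buffers the text and only emits a line when the closing tag arrives; the function name parseChat() and the deliberately tiny chunk size are assumptions for the demonstration.

```php
<?php
// Hypothetical helper: parses the chat XML in small chunks and
// collects one "author: text" line per <comment>.
function parseChat(string $xmlString, int $chunkSize = 16): array
{
    $lines  = [];
    $author = null; // set while we are inside a <comment>
    $buffer = '';   // text collected for the current <comment>

    $parser = xml_parser_create();
    xml_set_element_handler(
        $parser,
        function ($parser, string $name, array $attr) use (&$author, &$buffer) {
            if ($name === 'COMMENT') {
                $author = $attr['AUTHOR'] ?? '';
                $buffer = '';
            }
        },
        function ($parser, string $name) use (&$lines, &$author, &$buffer) {
            if ($name === 'COMMENT') {
                // The text is complete only now.
                $lines[] = $author . ': ' . trim($buffer);
                $author = null;
            }
        }
    );
    xml_set_character_data_handler(
        $parser,
        function ($parser, string $data) use (&$author, &$buffer) {
            if ($author !== null) {
                $buffer .= $data; // may be just a piece of the text
            }
        }
    );

    foreach (str_split($xmlString, $chunkSize) as $chunk) {
        xml_parse($parser, $chunk);
    }
    xml_parse($parser, '', true); // finalize parsing

    return $lines;
}

$chat = '<comments>'
    . '<comment author="julia">I love You!</comment>'
    . '<comment author="romeo">I love you, too! Julia</comment>'
    . '</comments>';
print_r(parseChat($chat));
```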

Merge XML documents

SimpleXML alone is not sufficient to merge XML documents; this must be done using the DOM functions. Here is a small sample:

<?php
$thing = simplexml_load_string(
    '<thing>
       <node n="1"><child f="1"/></node>
    </thing>'
);

$dom_thing = dom_import_simplexml($thing);
$dom_node  = dom_import_simplexml($thing->node->child);
$dom_new   = $dom_thing->appendChild($dom_node->cloneNode(true));

$new_node  = simplexml_import_dom($dom_new);
$new_node['n'] = 2;

echo $thing->asXML();

In the archive there is a more complex example that merges two XML files from phpunit into one result.
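
The sample above copies a node within one document. For two separate documents the node has to be imported into the target document first. This is a minimal sketch with a hypothetical helper mergeChildren() and made-up input; it is not the phpunit example from the archive.

```php
<?php
// Hypothetical helper: appends all child elements of the source root
// to the root of the target document and returns the merged XML.
function mergeChildren(string $targetXml, string $sourceXml): string
{
    $target = new DOMDocument();
    $target->loadXML($targetXml);
    $source = new DOMDocument();
    $source->loadXML($sourceXml);

    foreach ($source->documentElement->childNodes as $child) {
        if ($child->nodeType === XML_ELEMENT_NODE) { // skip whitespace text nodes
            // importNode() creates a deep copy that belongs to the target document
            $copy = $target->importNode($child, true);
            $target->documentElement->appendChild($copy);
        }
    }
    // serialize the root element only, without the XML declaration
    return $target->saveXML($target->documentElement);
}

$merged = mergeChildren(
    '<comments><comment id="1"/></comments>',
    '<comments><comment id="2"/></comments>'
);
echo $merged . PHP_EOL;
```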

Real World example: Query the Microsoft CRM

At one of my jobs I had to deal with the Microsoft CRM. We needed to query the CRM by company name to see whether a company was already known to us and contained in the CRM.

This GET query is sent to the Microsoft CRM:

http://crm.example.de/XRMServices/2011/OrganizationData.svc/AccountSet?$filter=substringof(%27example%27,Name)&$select=Name,AccountNumber,Address1_Composite,StatusCode

and results in this XML:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<feed xml:base="http://crm.example.de/XRMServices/2011/OrganizationData.svc/" xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices" xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata" xmlns="http://www.w3.org/2005/Atom">
  <title type="text">AccountSet</title>
  <id>http://crm.example.de/XRMServices/2011/OrganizationData.svc/AccountSet</id>
  <updated>2016-12-14T15:41:29Z</updated>
  <link rel="self" title="AccountSet" href="AccountSet" />
  <entry>
    <id>http://crm.example.de/XRMServices/2011/OrganizationData.svc/AccountSet(guid'f6c308ea-5301-e211-adde-201f28c98549')</id>
    <title type="text">Example AG</title>
    <updated>2016-12-14T15:41:29Z</updated>
    <author>
      <name />
    </author>
    <link rel="edit" title="Account" href="AccountSet(guid'f6c308ea-5301-e211-adde-201f28c98549')" />
    <category term="Microsoft.Crm.Sdk.Data.Services.Account" scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme" />
    <content type="application/xml">
      <m:properties>
        <d:StatusCode m:type="Microsoft.Crm.Sdk.Data.Services.OptionSetValue">
          <d:Value m:type="Edm.Int32">1</d:Value>
        </d:StatusCode>
        <d:Name>Example AG</d:Name>
        <d:AccountNumber>12141643021</d:AccountNumber>
        <d:Address1_Composite>Kaiserstr. 12&#xD;
&#xD;
76133 Karlsruhe&#xD;
Deutschland</d:Address1_Composite>
      </m:properties>
    </content>
  </entry>
  <entry>
    <id>http://crm.example.de/XRMServices/2011/OrganizationData.svc/AccountSet(guid'800d2feb-3060-e411-b535-000056905fd8')</id>
    <title type="text">Example GmbH</title>
    <updated>2016-12-14T15:41:29Z</updated>
    <author>
      <name />
    </author>
    <link rel="edit" title="Account" href="AccountSet(guid'800d2feb-3060-e411-b535-000056905fd8')" />
    <category term="Microsoft.Crm.Sdk.Data.Services.Account" scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme" />
    <content type="application/xml">
      <m:properties>
        <d:StatusCode m:type="Microsoft.Crm.Sdk.Data.Services.OptionSetValue">
          <d:Value m:type="Edm.Int32">2</d:Value>
        </d:StatusCode>
        <d:Name>Example GmbH</d:Name>
        <d:AccountNumber>112233345</d:AccountNumber>
        <d:Address1_Composite>Wilhelmstraße 5&#xD;
&#xD;
76133 Karlsruhe&#xD;
Deutschland</d:Address1_Composite>
      </m:properties>
    </content>
  </entry>
</feed>

that is stored in crm_result.xml.

The PHP array should now contain the data from the properties and the guid. Because the result uses namespaces, we need to register them first before parsing the XML and extracting the data we need. The following code works:

$result = array();
$xml = simplexml_load_file('crm_result.xml');
if (is_object($xml)) {
    foreach ($xml->getDocNamespaces() as $k => $v) {
        if ($k == '') $k = 'c'; // hack to fill the empty prefix
        $xml->registerXPathNamespace($k, $v);
    }
    $entries = $xml->xpath('//c:entry');
    if (is_array($entries)) {
        foreach ($entries as $entry) {
            // use the content of the $entry to build a new XML tree
            $node = simplexml_load_string(
                //  get rid of the namespaces in the elements like in <d:StatusCode>
                preg_replace('~(</?)\w+:~', '$1',
                    // and in the attributes like m:type="Edm.Int32"
                    preg_replace('~ \w+:([\w\d]+=")~', ' $1', $entry->asXML())
                )
            );
            $result[] = array(
                'guid' => mb_substr((string)$entry->id, -38, 36),  // guids have a fixed size
                'name' => (string)$entry->title,
                'status' => (string)$node->content->properties->StatusCode->Value,
                'id' => $node->content->properties->AccountNumber->__toString(),
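                // the parser has already turned &#xD; into a carriage return, so normalize the line breaks
                'address' => str_replace(["\r\n", "\r"], "\n", (string)$node->content->properties->Address1_Composite),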
            );
        }
    }
    var_dump($result);
}

Very important: the results of an XPath query and the nodes accessed through a SimpleXMLElement are again SimpleXMLElement objects. Therefore, such an object must be cast to a string before it is used later on.

This approach delivers a reasonable result for this XML but may fail on other documents. It is not a general solution, but it shows one way to access the nodes.

The following script achieves the same result, except that it eliminates all namespace prefixes in the XML string before loading it with SimpleXML:

$result = array();
$xmlData = preg_replace(
    // get rid of the namespaces in the elements like in <d:StatusCode>
    '~(</?)\w+:~',
    '$1',
    // use the callback to remove all attributes like m:type="Edm.Int32"
    // but keep the namespace declaration at the root element
    preg_replace_callback(
        '~ (\w+):([\w\d]+=")~',
        function($match) {
            return ($match[1] == 'xmlns')
                ? $match[0] : ' '.$match[2];
        },
        file_get_contents('crm_result.xml')
    )
);
$xml = simplexml_load_string($xmlData);

if (is_object($xml)) {
    if (isset($xml->entry)) {
        foreach ($xml->entry as $i => $entry) {
            $result[] = array(
                'guid' => mb_substr((string)$entry->id, -38, 36), // guids have a fixed size
                'name' => (string)$entry->title,
                'status' => (string)$entry->content->properties->StatusCode->Value,
                'id' => $entry->content->properties->AccountNumber->__toString(),
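                // the parser has already turned &#xD; into a carriage return, so normalize the line breaks
                'address' => str_replace(["\r\n", "\r"], "\n", (string)$entry->content->properties->Address1_Composite),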
            );
        }
    }
    var_dump($result);
}
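
Both scripts strip the namespace prefixes with regular expressions. As an alternative sketch, the prefixed elements can also be reached with children() and the namespace URIs taken from the document itself. The shortened feed below is made up from the result shown earlier and only carries the fields needed for the demonstration.

```php
<?php
$feedXml = <<<'XML'
<feed xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices" xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata" xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title type="text">Example AG</title>
    <content type="application/xml">
      <m:properties>
        <d:StatusCode><d:Value m:type="Edm.Int32">1</d:Value></d:StatusCode>
        <d:Name>Example AG</d:Name>
        <d:AccountNumber>12141643021</d:AccountNumber>
      </m:properties>
    </content>
  </entry>
</feed>
XML;

$feed = simplexml_load_string($feedXml);
$ns = $feed->getDocNamespaces(); // prefix => namespace URI, as declared on the root

$accounts = [];
foreach ($feed->entry as $entry) {
    // <m:properties> lives in the "m" namespace, its children in the "d" namespace
    $props = $entry->content->children($ns['m'])->properties->children($ns['d']);
    $accounts[] = [
        'name'   => (string) $entry->title,
        'status' => (string) $props->StatusCode->Value,
        'id'     => (string) $props->AccountNumber,
    ];
}
var_dump($accounts);
```

This keeps the XML untouched, at the price of having to know which element belongs to which namespace.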
