Working with XML as an Array, Version 2

Hello. I want to share my experience of parsing XML files up to four gigabytes in size and show how to do it quickly.

In a nutshell: for fast file parsing, use XMLReader together with yield.

My implementation of this combination is described below.

XMLReader

XMLReader is the only one of PHP's XML classes that can read a file in parts. PHP also has SimpleXML and DOMDocument, but they load the entire file.

While the file is being read from disk, the processor is idle; while the file is being turned into a SimpleXML instance, the database is idle. Yet the whole point of parsing the XML file is to insert records into the database, so all of this downtime is wasted time.
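To make the idea of reading "in parts" concrete, here is a minimal sketch that uses only the built-in XMLReader class. The document and the element name item are made up for illustration; a small inline string stands in for a large file.

```php
<?php

declare(strict_types=1);

// Illustrative document; a real multi-gigabyte file would be opened
// with XMLReader::open($path) instead of XMLReader::XML($string).
$xml = <<<XML
<items>
    <item id="1"/>
    <item id="2"/>
    <item id="3"/>
</items>
XML;

$reader = XMLReader::XML($xml);

$ids = [];
// read() moves the cursor forward by one node at a time,
// so the whole document is never held in memory at once.
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'item') {
        $ids[] = $reader->getAttribute('id');
    }
}
$reader->close();

echo implode(',', $ids), PHP_EOL; // prints "1,2,3"
```

The loop touches every node (including whitespace text nodes), which is why the node type is checked before the name.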

yield

The next thing that helps us is the yield statement. We read one element from the file, parse it, build the SQL INSERT command, execute it, then read the next element, and so on until the end of the file. Nothing sits idle, and the load is spread roughly evenly.
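A rough sketch of that read-parse-insert cycle, using only built-in PHP features. The helper readElements(), the item elements and the items table are assumptions made for this illustration and are not part of the library. No database is involved: the sketch only builds the parameterized command that a real consumer would execute straight away.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper, not part of any library: yields the attributes of
 * every element called $name, one element at a time.
 *
 * @return Generator<int, array<string, string>>
 */
function readElements(XMLReader $reader, string $name): Generator
{
    while ($reader->read()) {
        if ($reader->nodeType !== XMLReader::ELEMENT || $reader->name !== $name) {
            continue;
        }

        $attributes = [];
        // Walk over the attributes of the current element.
        while ($reader->moveToNextAttribute()) {
            $attributes[$reader->name] = $reader->value;
        }
        // Put the cursor back on the element that owns the attributes.
        $reader->moveToElement();

        // Execution pauses here until the consumer asks for the next element.
        yield $attributes;
    }
}

$xml = '<items><item id="1" title="first"/><item id="2" title="second"/></items>';
$reader = XMLReader::XML($xml);

$commands = [];
foreach (readElements($reader, 'item') as $item) {
    // Column names are taken from the XML here only to keep the sketch short;
    // real code should use a fixed list of columns.
    $columns = implode(', ', array_keys($item));
    $placeholders = implode(', ', array_fill(0, count($item), '?'));

    // A real consumer would execute the statement here instead of storing it.
    $commands[] = [
        "INSERT INTO items ($columns) VALUES ($placeholders)",
        array_values($item),
    ];
}
$reader->close();

foreach ($commands as [$sql, $parameters]) {
    echo $sql, ' <- ', json_encode($parameters), PHP_EOL;
}
```

Because the generator hands over one element per iteration, the consumer can start inserting before the rest of the file has been read.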

Put the two together and you get FastXmlToArray.

FastXmlToArray

FastXmlToArray is a class with a static method convert(). As input you can pass either a link to an XML file (an XML URI) or the XML string itself. The output is a PHP array with all the attributes and values of the root element and all nested elements. The example below calls prettyPrint(), another static method of the same class.

$xml = <<<XML
<outer any_attrib="attribute value">
    <inner>element value</inner>
    <nested nested-attrib="nested attribute value">nested element value</nested>
</outer>
XML;
$result =
    \SbWereWolf\XmlNavigator\FastXmlToArray::prettyPrint($xml);
echo json_encode($result, JSON_PRETTY_PRINT);

Console output

{
  "outer": {
    "@attributes": {
      "any_attrib": "attribute value"
    },
    "inner": "element value",
    "nested": {
      "@value": "nested element value",
      "@attributes": {
        "nested-attrib": "nested attribute value"
      }
    }
  }
}

The killer feature of this class is the static extractElements() method, which takes an XMLReader and returns the same kind of result: an array with all the attributes and values of the current element and all nested elements.

    /**
     * @param XMLReader $reader
     * @param string $valueIndex index for element value
     * @param string $attributesIndex index for element attributes collection
     * @return array
     */
    public static function extractElements(
        XMLReader $reader,
        string $valueIndex = IFastXmlToArray::VALUE,
        string $attributesIndex = IFastXmlToArray::ATTRIBUTES,
    ): array;

The essential difference is that convert() always processes the root element of the XML document, whereas extractElements() works with an arbitrary element: you can position the reader on any element, not only the root.

Usage example

Let's parse the following document:

<?xml version="1.0" encoding="utf-8"?>
<CARPLACES>
    <CARPLACE
            ID="11361653"
            OBJECTID="20326793"
    />
    <CARPLACE
            ID="94824"
            OBJECTID="101032823"
    />
</CARPLACES>

Suppose we are only interested in the CARPLACE elements in the entire document.

Move XMLReader to the first CARPLACE element:

$reader = XMLReader::XML($xml);
$mayRead = true;
while ($mayRead && $reader->name !== 'CARPLACE') {
    $mayRead = $reader->read();
}

Loop through all CARPLACE elements until we reach an element with a different name or the document ends:

while ($mayRead && $reader->name === 'CARPLACE') {
    $elementsCollection = FastXmlToArray::extractElements(
        $reader,
    );
    $result = FastXmlToArray::createTheHierarchyOfElements(
        $elementsCollection,
    );
    echo json_encode([$result], JSON_PRETTY_PRINT);

    while (
        $mayRead &&
        $reader->nodeType !== XMLReader::ELEMENT
    ) {
        $mayRead = $reader->read();
    }
}

What are we doing here?

First, we get the element with all its attributes and a list of its nested elements (in this example there are none):

$elementsCollection = FastXmlToArray::extractElements(
    $reader,
);

Then we build a hierarchical array (in this example, from a single element and its attributes):

$result = FastXmlToArray::createTheHierarchyOfElements(
    $elementsCollection,
);

Print the resulting array to the console:

echo json_encode([$result], JSON_PRETTY_PRINT);

Advance the reader to the next element or to the end of the document:

while (
    $mayRead &&
    $reader->nodeType !== XMLReader::ELEMENT
) {
    $mayRead = $reader->read();
}

The console will have something like this:

[
    {
        "n": "CARPLACE",
        "a": {
            "ID": "11361653",
            "OBJECTID": "20326793"
        }
    }
][
    {
        "n": "CARPLACE",
        "a": {
            "ID": "94824",
            "OBJECTID": "101032823"
        }
    }
]
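As an aside, when only sibling elements with a known name are needed, the built-in XMLReader::next() method can do the skipping by itself. The sketch below is an alternative written with plain XMLReader, not the way the library is used above. It reads the same CARPLACES document and collects only the ID attributes.

```php
<?php

declare(strict_types=1);

$xml = <<<XML
<CARPLACES>
    <CARPLACE ID="11361653" OBJECTID="20326793"/>
    <CARPLACE ID="94824" OBJECTID="101032823"/>
</CARPLACES>
XML;

$reader = XMLReader::XML($xml);

// Descend node by node until the first CARPLACE element is reached.
while ($reader->read() && $reader->name !== 'CARPLACE') {
    // nothing to do, just keep reading
}

$ids = [];
// next('CARPLACE') jumps over whitespace and subtrees to the following
// CARPLACE; once there are no more, the current name is no longer CARPLACE.
while ($reader->name === 'CARPLACE') {
    $ids[] = $reader->getAttribute('ID');
    $reader->next('CARPLACE');
}
$reader->close();

echo implode(',', $ids), PHP_EOL;
```

If the document contains no CARPLACE element at all, the first loop runs to the end of the document and the second loop is never entered.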

Another example

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<QueryResult
        xmlns="urn://x-artefacts-smev-gov-ru/services/service-adapter/types">
    <smevMetadata
            b="2">
        <MessageId
                c="re">c0f7b4bf-7453-11ed-8f6b-005056ac53b6
        </MessageId>
        <Sender>CUST01</Sender>
        <Recipient>RPRN01</Recipient>
    </smevMetadata>
    <Message
            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            xsi:type="RequestMessageType">
        <RequestMetadata>
            <clientId>a0efcf22-b199-4e1c-984a-63fd59ed9345</clientId>
            <linkedGroupIdentity>
                <refClientId>a0efcf22-b199-4e1c-984a-63fd59ed9345</refClientId>
            </linkedGroupIdentity>
            <testMessage>false</testMessage>
        </RequestMetadata>
        <RequestContent>
            <content>
                <MessagePrimaryContent>
                    <ns:Query
                            xmlns:ns="urn://rpn.gov.ru/services/smev/cites/1.0.0"
                            xmlns="urn://x-artefacts-smev-gov-ru/services/message-exchange/types/basic/1.2"
                    >
                        <ns:Search>
                            <ns:SearchNumber
                                    Number="22RU006228DV"/>
                        </ns:Search>
                    </ns:Query>
                </MessagePrimaryContent>
            </content>
        </RequestContent>
    </Message>
</QueryResult>

This is an example of an information request received via SMEV. We are only interested in the ns:Query element here.

The code will be similar:

$mayRead = true;
$reader = XMLReader::XML($xml);
while ($mayRead && $reader->name !== 'ns:Query') {
    $mayRead = $reader->read();
}

while ($mayRead && $reader->name === 'ns:Query') {
    $elementsCollection = FastXmlToArray::extractElements(
        $reader,
    );
    $result = FastXmlToArray::createTheHierarchyOfElements(
        $elementsCollection,
    );

    echo json_encode([$result], JSON_PRETTY_PRINT);

    while (
        $mayRead &&
        $reader->nodeType !== XMLReader::ELEMENT
    ) {
        $mayRead = $reader->read();
    }
}
$reader->close();

Console output:

[
    {
        "n": "ns:Query",
        "a": {
            "xmlns:ns": "urn:\/\/rpn.gov.ru\/services\/smev\/cites\/1.0.0",
            "xmlns": "urn:\/\/x-artefacts-smev-gov-ru\/services\/message-exchange\/types\/basic\/1.2"
        },
        "s": [
            {
                "n": "ns:Search",
                "s": [
                    {
                        "n": "ns:SearchNumber",
                        "a": {
                            "Number": "22RU006228DV"
                        }
                    }
                ]
            }
        ]
    }
]

In fact, of course, we are only interested in "Number": "22RU006228DV", and production code would use while ($reader->name === 'ns:SearchNumber'). For the sake of clarity, I extracted a larger fragment here.
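For illustration, here is a minimal sketch of reading just that one attribute with plain XMLReader, without the library. The document is a trimmed copy of the request above, and the variable names are made up.

```php
<?php

declare(strict_types=1);

// Trimmed version of the request: only the branch with ns:SearchNumber is kept.
$xml = <<<XML
<MessagePrimaryContent>
    <ns:Query xmlns:ns="urn://rpn.gov.ru/services/smev/cites/1.0.0">
        <ns:Search>
            <ns:SearchNumber Number="22RU006228DV"/>
        </ns:Search>
    </ns:Query>
</MessagePrimaryContent>
XML;

$reader = XMLReader::XML($xml);

$number = null;
while ($reader->read()) {
    // $reader->name holds the qualified name, prefix included.
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'ns:SearchNumber') {
        $number = $reader->getAttribute('Number');
        break; // nothing else in the document is of interest
    }
}
$reader->close();

echo $number, PHP_EOL; // prints "22RU006228DV"
```

Matching on the prefixed name works as long as the sender always uses the ns prefix. Comparing $reader->localName and $reader->namespaceURI instead does not depend on the prefix.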

Performance measurements

Before rewriting my library for working with XML, I looked at what open source has to offer.

There are fewer than a hundred such packages on Packagist. I looked at the first 30 and read their source code: the approach was about the same everywhere, but the timings differed. As it turned out later, that was a mistake in my benchmark.

In fact, the time is about the same everywhere. The difference is that functions run faster than static methods, static methods run faster than instance methods, and plain PHP scripts are the slowest.

During development I tried all four options, and the same code in different “formats” differs by microseconds. If that matters to you, write your own XML converter in a procedural style and you will gain a couple of microseconds.
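A comparison like this can be reproduced with a small timing harness. The sketch below is not the benchmark behind the numbers in this article. All names in it are hypothetical, and every variant does the same trivial work, so that mostly the call overhead is measured.

```php
<?php

declare(strict_types=1);

// Hypothetical variants of the same operation.
function reverseAsFunction(string $value): string
{
    return strrev($value);
}

final class Reverser
{
    public static function asStatic(string $value): string
    {
        return strrev($value);
    }

    public function asInstance(string $value): string
    {
        return strrev($value);
    }
}

/**
 * Average duration of one call in nanoseconds.
 */
function measure(callable $callee, int $iterations = 100000): float
{
    // Warm-up call, so that one-time costs stay out of the measurement.
    $callee('payload');

    $start = hrtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $callee('payload');
    }

    return (hrtime(true) - $start) / $iterations;
}

$reverser = new Reverser();
// Every variant is wrapped in the same kind of closure,
// so the wrapper overhead is equal for all three.
$results = [
    'function' => measure(fn (string $v): string => reverseAsFunction($v)),
    'static method' => measure(fn (string $v): string => Reverser::asStatic($v)),
    'instance method' => measure(fn (string $v): string => $reverser->asInstance($v)),
];

foreach ($results as $label => $nanoseconds) {
    printf("%-16s %8.1f ns per call\n", $label, $nanoseconds);
}
```

The absolute values depend on the machine and the PHP version, and differences this small easily drown in noise, so it is worth running the script several times before drawing conclusions.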

My performance measurements:

 91.2 µs  \Mtownsend\XmlToArray\XmlToArray::convert()
 82.0 µs  xmlstr_to_array()
139.6 µs  getNextElement
 95.7 µs  \SbWereWolf\XmlNavigator\Converter->prettyPrint
105.2 µs  \SbWereWolf\XmlNavigator\FastXmlToArray::prettyPrint
107.0 µs  \SbWereWolf\XmlNavigator\Converter->xmlStructure
 91.9 µs  \SbWereWolf\XmlNavigator\FastXmlToArray::convert

The numbers differ from launch to launch, but the overall picture is something like this.

My use case was to turn 280 gigabytes of XML files into a database of the Federal Information Address System (FIAS DB).

I hope now it is clear why I was worried about the time of parsing XML files.

Surprisingly, 280 gigabytes of XML files turn into a 190-gigabyte database. It feels as if the DBMS does no optimization at all: if you throw away the element names, the attribute names and all the markup, what is left? To my taste, the volume should have been at least halved, but no. And 190 gigabytes is without indexes; with indexes it will probably be around 220.

In the first versions, the parser consumed 4 gigabytes of RAM and loaded the database to 25% of the processor; I don't remember how much RAM the PHP process used. Inserting 100,000 records took 2 minutes or more, and reading a file could take up to 10 minutes.

The latest version of the parser inserts another 100,000 records every 9 seconds. PHP and the database each use no more than 8 megabytes of RAM and no more than 12% of the processor.

Conclusion

If you want to try this parser in action, you can install the package through Composer:

composer require sbwerewolf/xml-navigator

More information is available in README.md.
