Working with XML as an Array, Version 2
Hello. I want to share my experience of parsing XML files up to four gigabytes in size, and to show you how to do it quickly.
In a nutshell: for fast file parsing, you need to use XMLReader in conjunction with yield.
Read about my implementation of this bundle below.
XMLReader
XMLReader is the only class in PHP that can read a file in parts. PHP also has SimpleXML and DOMDocument, but they work on the entire file at once.
While we are reading the file from disk, the processor is idle; while we are turning the file into a SimpleXML instance, the database is idle (and the XML file is being parsed precisely in order to insert records into the database). All this downtime is wasted time.
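For illustration, here is a minimal sketch in plain core PHP (the sample document is invented) of how XMLReader walks a document node by node, so memory stays flat no matter how large the input is:

```php
<?php
// Minimal sketch: XMLReader streams the document node by node,
// whereas SimpleXML/DOMDocument would materialize it whole.
// The sample document and the element name 'item' are invented.
$xml = '<items><item id="1"/><item id="2"/></items>';

$reader = XMLReader::XML($xml);
$count = 0;
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT
        && $reader->name === 'item'
    ) {
        $count++; // process one element at a time
    }
}
$reader->close();
echo $count; // 2
```

With a real multi-gigabyte file you would use XMLReader::open() with a file path instead of XMLReader::XML() with a string, but the loop is the same.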
yield
The next thing to help us is the yield statement. We read one element from the file, parse it, form the SQL insert command, execute it, read the next element from the file, and so on until the end. Nothing sits idle; every part of the pipeline carries roughly the same load.
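The pipeline described above can be sketched as a generator: the function yields one parsed element at a time, so the consumer can insert each record while the next one is being read. This is a hedged illustration in core PHP; the element name 'item' and the array shape are made up:

```php
<?php
// Sketch of the XMLReader + yield pipeline: a generator that hands
// one parsed element at a time to the consumer, so reading, parsing
// and (for example) database inserts interleave instead of queueing.
function readItems(string $xml): Generator
{
    $reader = XMLReader::XML($xml);
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT
            && $reader->name === 'item'
        ) {
            // yield suspends here until the consumer asks for the next one
            yield ['id' => $reader->getAttribute('id')];
        }
    }
    $reader->close();
}

$ids = [];
foreach (readItems('<items><item id="1"/><item id="2"/></items>') as $row) {
    $ids[] = $row['id']; // production code would run an SQL INSERT here
}
echo implode(',', $ids); // 1,2
```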
Now put it all together and get FastXmlToArray.
FastXmlToArray
FastXmlToArray is a class with a static method convert(); as input you can pass either a link to an XML file (an XML URI) or the XML itself as a string. The output is a PHP array with all the attributes and values of the root element and of all nested elements.
$xml = <<<XML
<outer any_attrib="attribute value">
<inner>element value</inner>
<nested nested-attrib="nested attribute value">nested element value</nested>
</outer>
XML;
$result =
\SbWereWolf\XmlNavigator\FastXmlToArray::prettyPrint($xml);
echo json_encode($result, JSON_PRETTY_PRINT);
Console output
{
"outer": {
"@attributes": {
"any_attrib": "attribute value"
},
"inner": "element value",
"nested": {
"@value": "nested element value",
"@attributes": {
"nested-attrib": "nested attribute value"
}
}
}
}
The killer feature of this class is the static extractElements() method: it takes an XMLReader and likewise returns an array with all the attributes and values of the current element and of all its nested elements.
/**
* @param XMLReader $reader
* @param string $valueIndex index for element value
* @param string $attributesIndex index for element attributes collection
* @return array
*/
public static function extractElements(
XMLReader $reader,
string $valueIndex = IFastXmlToArray::VALUE,
string $attributesIndex = IFastXmlToArray::ATTRIBUTES,
): array;
The essential difference is that convert() immediately processes the root element of the XML document, while extractElements() works with an arbitrary element: it does not have to be the root, you can pass any element.
Usage example
Let's parse the following document:
<?xml version="1.0" encoding="utf-8"?>
<CARPLACES>
<CARPLACE
ID="11361653"
OBJECTID="20326793"
/>
<CARPLACE
ID="94824"
OBJECTID="101032823"
/>
</CARPLACES>
Suppose we are only interested in the CARPLACE elements in the entire document.
Let's advance the XMLReader to the first CARPLACE element:
$reader = XMLReader::XML($xml);
$mayRead = true;
while ($mayRead && $reader->name !== 'CARPLACE') {
$mayRead = $reader->read();
}
Then loop through all the CARPLACE elements until we reach an element with a different name or until the document runs out:
while ($mayRead && $reader->name === 'CARPLACE') {
$elementsCollection = FastXmlToArray::extractElements(
$reader,
);
$result = FastXmlToArray::createTheHierarchyOfElements(
$elementsCollection,
);
echo json_encode([$result], JSON_PRETTY_PRINT);
while (
$mayRead &&
$reader->nodeType !== XMLReader::ELEMENT
) {
$mayRead = $reader->read();
}
}
What are we doing here?
We get the element with all its properties and a list of nested elements (in this example there are none):
$elementsCollection = FastXmlToArray::extractElements(
$reader,
);
We form a hierarchical array (in this example, from one element and its attributes)
$result = FastXmlToArray::createTheHierarchyOfElements(
$elementsCollection,
);
Output the resulting array to the console
echo json_encode([$result], JSON_PRETTY_PRINT);
Advance the "file" to the next element, or to the end of the "file":
while (
$mayRead &&
$reader->nodeType !== XMLReader::ELEMENT
) {
$mayRead = $reader->read();
}
The console output will look something like this:
[
{
"n": "CARPLACE",
"a": {
"ID": "11361653",
"OBJECTID": "20326793"
}
}
][
{
"n": "CARPLACE",
"a": {
"ID": "94824",
"OBJECTID": "101032823"
}
}
]
Another example
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<QueryResult
xmlns="urn://x-artefacts-smev-gov-ru/services/service-adapter/types">
<smevMetadata
b="2">
<MessageId
c="re">c0f7b4bf-7453-11ed-8f6b-005056ac53b6
</MessageId>
<Sender>CUST01</Sender>
<Recipient>RPRN01</Recipient>
</smevMetadata>
<Message
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:type="RequestMessageType">
<RequestMetadata>
<clientId>a0efcf22-b199-4e1c-984a-63fd59ed9345</clientId>
<linkedGroupIdentity>
<refClientId>a0efcf22-b199-4e1c-984a-63fd59ed9345</refClientId>
</linkedGroupIdentity>
<testMessage>false</testMessage>
</RequestMetadata>
<RequestContent>
<content>
<MessagePrimaryContent>
<ns:Query
xmlns:ns="urn://rpn.gov.ru/services/smev/cites/1.0.0"
xmlns="urn://x-artefacts-smev-gov-ru/services/message-exchange/types/basic/1.2"
>
<ns:Search>
<ns:SearchNumber
Number="22RU006228DV"/>
</ns:Search>
</ns:Query>
</MessagePrimaryContent>
</content>
</RequestContent>
</Message>
</QueryResult>
This is an example of a request for information that arrived via SMEV; here we are only interested in the ns:Query element.
The code is similar:
$mayRead = true;
$reader = XMLReader::XML($xml);
while ($mayRead && $reader->name !== 'ns:Query') {
$mayRead = $reader->read();
}
while ($mayRead && $reader->name === 'ns:Query') {
$elementsCollection = FastXmlToArray::extractElements(
$reader,
);
$result = FastXmlToArray::createTheHierarchyOfElements(
$elementsCollection,
);
echo json_encode([$result], JSON_PRETTY_PRINT);
while (
$mayRead &&
$reader->nodeType !== XMLReader::ELEMENT
) {
$mayRead = $reader->read();
}
}
$reader->close();
Console output:
[
{
"n": "ns:Query",
"a": {
"xmlns:ns": "urn:\/\/rpn.gov.ru\/services\/smev\/cites\/1.0.0",
"xmlns": "urn:\/\/x-artefacts-smev-gov-ru\/services\/message-exchange\/types\/basic\/1.2"
},
"s": [
{
"n": "ns:Search",
"s": [
{
"n": "ns:SearchNumber",
"a": {
"Number": "22RU006228DV"
}
}
]
}
]
}
]
In fact, of course, we are only interested in "Number": "22RU006228DV", and in production code the condition would be while ($reader->name === 'ns:SearchNumber'); for the sake of clarity I extracted a larger piece.
Performance measurements
Before rewriting my library for working with XML, I looked at what Open Source has to offer.
On Packagist there are close to a hundred such packages. I looked at the first 30 and at their source code; everything was about the same, but the timings differed, which later turned out to be a mistake in my benchmark.
In fact, the time is about the same everywhere, with the difference that functions are faster than static methods, static methods are faster than class instance methods, and plain PHP scripts are the slowest.
During development I tried all four options; the same code in different "formats" differs by microseconds. If that matters to you, write your own XML converter in a procedural style and win a couple of microseconds.
My performance measurements:
91.2 µs \Mtownsend\XmlToArray\XmlToArray::convert()
82.0 µs xmlstr_to_array()
139.6 µs getNextElement()
95.7 µs \SbWereWolf\XmlNavigator\Converter->prettyPrint()
105.2 µs \SbWereWolf\XmlNavigator\FastXmlToArray::prettyPrint()
107.0 µs \SbWereWolf\XmlNavigator\Converter->xmlStructure()
91.9 µs \SbWereWolf\XmlNavigator\FastXmlToArray::convert()
The numbers differ from launch to launch, but the overall picture is something like this.
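If you want to reproduce measurements like these, a minimal timing sketch with hrtime() could look like the following; the measured function here is just an illustration, not one of the converters from the list above:

```php
<?php
// Minimal micro-benchmark sketch using hrtime() (nanosecond clock).
// The measured callable is illustrative; substitute your own converter.
function measure(callable $fn, int $runs = 1000): float
{
    $start = hrtime(true);
    for ($i = 0; $i < $runs; $i++) {
        $fn();
    }
    // average cost of one call, in microseconds
    return (hrtime(true) - $start) / $runs / 1000;
}

$xml = '<outer><inner>element value</inner></outer>';
$perCall = measure(fn() => simplexml_load_string($xml));
printf("%.1f microseconds per call\n", $perCall);
```

Averaging over many runs smooths out scheduler noise, but the numbers will still vary from launch to launch, as noted above.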
My use case was to turn 280 gigabytes of XML files into a database of the Federal Information Address System (FIAS DB).
I hope now it is clear why I was worried about the time of parsing XML files.
What is surprising is that 280 gigabytes of XML files turn into 190 gigabytes of database; it feels like the DBMS does no optimization at all. If you throw out the element names, the attribute names and all the markup from the XML, what is left? To my taste the volume should have shrunk at least by half, but no: 190 gigabytes without indexes, and probably all 220 with indexes.
In the first versions, the parser ate 4 gigabytes of RAM and loaded the database to 25% of the processor (I don't remember how much RAM the PHP process used). Inserting 100,000 records took two minutes or more, and reading a file could take up to 10 minutes.
The latest version of the parser inserts each next batch of 100,000 records in exactly 9 seconds. PHP and the database each use no more than 8 megabytes of RAM and no more than 12% of the processor.
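The read-one/insert-one loop behind those numbers can be sketched as follows. This is a hedged illustration, not the library's actual code: the PDO connection, table and column names are made up, and the generator stands in for elements streamed from XMLReader:

```php
<?php
// Sketch: consume a generator of parsed elements and insert each row
// via a prepared statement inside one transaction. The SQLite database,
// table and column names are invented for the illustration.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE carplace (id INTEGER, objectid INTEGER)');

function rows(): Generator
{
    // stand-in for elements streamed from XMLReader, as in the examples above
    yield [11361653, 20326793];
    yield [94824, 101032823];
}

$stmt = $pdo->prepare('INSERT INTO carplace (id, objectid) VALUES (?, ?)');
$pdo->beginTransaction();
foreach (rows() as $row) {
    $stmt->execute($row); // one element read, one row inserted
}
$pdo->commit();

echo $pdo->query('SELECT COUNT(*) FROM carplace')->fetchColumn(); // 2
```

Preparing the statement once and wrapping the batch in a transaction keeps both the PHP process and the database busy without either waiting on the other.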
Conclusion
If you want to try this parser in action, you can install the package via Composer:
composer require sbwerewolf/xml-navigator
More information is in the README.md.