Parsing XML in Golang
XML may look dated in 2022, but many legacy systems still deliver data in this format, so we have to work with it. XML is especially popular in the travel industry. For example, GDS (international booking systems; you can read more about them on Wikipedia) and Darwin, the information system of the UK association of rail transport companies, both use it actively. So I hope this article will be useful to someone. It covers two approaches to parsing XML in Go, regular and streaming, as well as custom field parsing and working with different encodings. We will use the encoding/xml package from the standard library. If you have already worked with encoding/json, you will find many similarities, though there are some differences.
Generating Go structures from XML
First of all, we need to describe Go structures that correspond to the XML files. You can do this manually using struct tags (the documentation for the Marshal function describes the tag syntax for XML in detail). But for speed and simplicity, we will use one of the many XML-to-Go structure generators, for example https://github.com/miku/zek/ (there is also an online version of this generator: https://www.onlinetool.io/xmltogo/). If this is your first time encountering this approach, I recommend first reading an introduction such as the article “Using struct tags in Go”.
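To give a rough idea of what hand-written tags look like, here is a minimal sketch. The Booking type, its fields, the parseBooking helper, and the sample document are hypothetical and exist only for this illustration; the tag options themselves (attr, a>b paths, omitempty, and "-") are the ones described in the encoding/xml documentation.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Booking is a hypothetical structure used only to illustrate tag options.
type Booking struct {
	XMLName   xml.Name `xml:"booking"`
	ID        string   `xml:"id,attr"`        // attribute: <booking id="...">
	Passenger string   `xml:"passenger>name"` // nested element: <passenger><name>
	Note      string   `xml:"note,omitempty"` // omitted by Marshal when empty
	Internal  string   `xml:"-"`              // ignored in both directions
}

// parseBooking decodes a single <booking> document.
func parseBooking(src string) (Booking, error) {
	var b Booking
	err := xml.Unmarshal([]byte(src), &b)
	return b, err
}

func main() {
	b, err := parseBooking(`<booking id="42"><passenger><name>Ann</name></passenger></booking>`)
	if err != nil {
		fmt.Printf("error: %v\n", err)
		return
	}
	fmt.Printf("id=%s passenger=%s\n", b.ID, b.Passenger)
}
```

A generator produces the same kind of tags automatically, so writing them by hand is mostly useful for small documents or for tidying up generated code.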
Parsing XML into Go structure
Let’s start with a simple XML file and a regular Unmarshal into a Go structure. I took the example file from the w3schools site. As a reminder, we first need to describe the Go struct that corresponds to the XML structure. To do this, we will use struct tags (they work much like the tags for JSON; if you have not worked with JSON in Go before, it is worth reading about that first).
Let’s look at an example XML file:
<?xml version="1.0" encoding="UTF-8"?>
<breakfast_menu>
  <food>
    <name>Belgian Waffles</name>
    <price>$5.95</price>
    <description>Two of our famous Belgian Waffles with plenty of real maple syrup</description>
    <calories>650</calories>
  </food>
  <food>
    <name>Strawberry Belgian Waffles</name>
    <price>$7.95</price>
    <description>Light Belgian waffles covered with strawberries and whipped cream</description>
    <calories>900</calories>
  </food>
  <food>
    <name>Berry-Berry Belgian Waffles</name>
    <price>$8.95</price>
    <description>Light Belgian waffles covered with an assortment of fresh berries and whipped cream</description>
    <calories>900</calories>
  </food>
  <food>
    <name>French Toast</name>
    <price>$4.50</price>
    <description>Thick slices made from our homemade sourdough bread</description>
    <calories>600</calories>
  </food>
  <food>
    <name>Homestyle Breakfast</name>
    <price>$6.95</price>
    <description>Two eggs, bacon or sausage, toast, and our ever-popular hash browns</description>
    <calories>950</calories>
  </food>
</breakfast_menu>
And here is the Go structure that describes it:
type BreakfastMenu struct {
	XMLName xml.Name `xml:"breakfast_menu"`
	Food    []struct {
		Name        string `xml:"name"`
		Price       string `xml:"price"`
		Description string `xml:"description"`
		Calories    string `xml:"calories"`
	} `xml:"food"`
}
The rest is fairly straightforward. The code for Unmarshal (parsing XML into a Go structure) looks like this:
menu := new(BreakfastMenu)
err := xml.Unmarshal([]byte(data), menu)
if err != nil {
	fmt.Printf("error: %v\n", err)
	return
}
And for Marshal (the reverse operation, from a Go structure to XML):
xmlText, err := xml.MarshalIndent(menu, " ", " ")
if err != nil {
	fmt.Printf("error: %v\n", err)
	return
}
You can see the full text of the program below and try running it yourself. Note that we use MarshalIndent instead of Marshal: this function produces more readable XML by adding indentation and line breaks.
Golang XML Marshal/Unmarshal example
package main

import (
	"encoding/xml"
	"fmt"
)

type BreakfastMenu struct {
	XMLName xml.Name `xml:"breakfast_menu"`
	//Text string `xml:",chardata"`
	Food []struct {
		//Text string `xml:",chardata"`
		Name        string `xml:"name"`
		Price       string `xml:"price"`
		Description string `xml:"description"`
		Calories    string `xml:"calories"`
	} `xml:"food"`
}

func main() {
	menu := new(BreakfastMenu)
	err := xml.Unmarshal([]byte(data), menu)
	if err != nil {
		fmt.Printf("error: %v\n", err)
		return
	}

	fmt.Printf("--- Unmarshal ---\n\n")
	for _, foodNode := range menu.Food {
		fmt.Printf("Name: %s\n", foodNode.Name)
		fmt.Printf("Price: %s\n", foodNode.Price)
		fmt.Printf("Description: %s\n", foodNode.Description)
		fmt.Printf("Calories: %s\n", foodNode.Calories)
		fmt.Printf("---\n")
	}

	xmlText, err := xml.MarshalIndent(menu, " ", " ")
	if err != nil {
		fmt.Printf("error: %v\n", err)
		return
	}

	fmt.Printf("\n--- Marshal ---\n\n")
	fmt.Printf("xml: %s\n", string(xmlText))
}

var data = `
<?xml version="1.0" encoding="UTF-8"?>
<breakfast_menu>
<food>
<name>Belgian Waffles</name>
<price>$5.95</price>
<description>Two of our famous Belgian Waffles with plenty of real maple syrup</description>
<calories>650</calories>
</food>
<food>
<name>Strawberry Belgian Waffles</name>
<price>$7.95</price>
<description>Light Belgian waffles covered with strawberries and whipped cream</description>
<calories>900</calories>
</food>
<food>
<name>Berry-Berry Belgian Waffles</name>
<price>$8.95</price>
<description>Light Belgian waffles covered with an assortment of fresh berries and whipped cream</description>
<calories>900</calories>
</food>
<food>
<name>French Toast</name>
<price>$4.50</price>
<description>Thick slices made from our homemade sourdough bread</description>
<calories>600</calories>
</food>
<food>
<name>Homestyle Breakfast</name>
<price>$6.95</price>
<description>Two eggs, bacon or sausage, toast, and our ever-popular hash browns</description>
<calories>950</calories>
</food>
</breakfast_menu>
`
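The structure above keeps every value as a string. As a small variation (a sketch, not part of the original program), numeric elements can be decoded straight into Go numeric types, and Unmarshal then reports an error when the text is not a valid number. The typedFood and typedMenu types, the totalCalories helper, and the shortened sample document are assumptions made for this illustration.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// typedFood mirrors a <food> node but stores calories as a number.
type typedFood struct {
	Name     string `xml:"name"`
	Calories int    `xml:"calories"` // parsed as a decimal integer
}

type typedMenu struct {
	Food []typedFood `xml:"food"`
}

// totalCalories decodes the menu and sums the calories of all dishes.
func totalCalories(src string) (int, error) {
	var m typedMenu
	if err := xml.Unmarshal([]byte(src), &m); err != nil {
		return 0, err
	}
	total := 0
	for _, f := range m.Food {
		total += f.Calories
	}
	return total, nil
}

// sample is a shortened version of the breakfast menu.
const sample = `<breakfast_menu>
<food><name>Belgian Waffles</name><calories>650</calories></food>
<food><name>French Toast</name><calories>600</calories></food>
</breakfast_menu>`

func main() {
	total, err := totalCalories(sample)
	if err != nil {
		fmt.Printf("error: %v\n", err)
		return
	}
	fmt.Printf("total calories: %d\n", total)
}
```

The price stays a string in this sketch because values such as $5.95 carry a currency sign and cannot be parsed as a number directly.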
So far everything is simple, but there are a couple of problems: if the XML file is large, we need a large amount of RAM, and if we receive the file over the network, we have to wait for the full content to arrive before we can start parsing. Let’s look at a second approach that solves these problems.
XML Stream Parsing
For streaming XML parsing we can use the Decoder type, which parses an XML file as a stream and expects the stream to be encoded in UTF-8 (a literal quote from the documentation: “A Decoder represents an XML parser reading a particular input stream. The parser assumes that its input is encoded in UTF-8.”).
The full text of the program is below.
An example of streaming XML parsing in Golang
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
	"io"
)

const foodElementName = "food"

type BreakfastMenu struct {
	Food []Food `xml:"food"`
}

type Food struct {
	Name        string `xml:"name"`
	Price       string `xml:"price"`
	Description string `xml:"description"`
	Calories    string `xml:"calories"`
}

func main() {
	var menu BreakfastMenu

	xmlData := bytes.NewBufferString(data)
	d := xml.NewDecoder(xmlData)

	for {
		// Token returns a nil token and io.EOF once the input is exhausted.
		t, err := d.Token()
		if err == io.EOF {
			break
		}
		if err != nil {
			fmt.Printf("error: %v\n", err)
			return
		}

		switch se := t.(type) {
		case xml.StartElement:
			if se.Name.Local == foodElementName {
				// A fresh value per element keeps fields from leaking between nodes.
				var food Food
				if err := d.DecodeElement(&food, &se); err != nil {
					fmt.Printf("error: %v\n", err)
					return
				}
				menu.Food = append(menu.Food, food)
			}
		}
	}

	fmt.Printf("--- Unmarshal ---\n\n")
	for _, foodNode := range menu.Food {
		fmt.Printf("Name: %s\n", foodNode.Name)
		fmt.Printf("Price: %s\n", foodNode.Price)
		fmt.Printf("Description: %s\n", foodNode.Description)
		fmt.Printf("Calories: %s\n", foodNode.Calories)
		fmt.Printf("---\n")
	}
}

var (
data = `
<?xml version="1.0" encoding="UTF-8"?>
<breakfast_menu>
<food>
<name>Belgian Waffles</name>
<price>$5.95</price>
<description>Two of our famous Belgian Waffles with plenty of real maple syrup</description>
<calories>650</calories>
</food>
<food>
<name>Strawberry Belgian Waffles</name>
<price>$7.95</price>
<description>Light Belgian waffles covered with strawberries and whipped cream</description>
<calories>900</calories>
</food>
<food>
<name>Berry-Berry Belgian Waffles</name>
<price>$8.95</price>
<description>Light Belgian waffles covered with an assortment of fresh berries and whipped cream</description>
<calories>900</calories>
</food>
<food>
<name>French Toast</name>
<price>$4.50</price>
<description>Thick slices made from our homemade sourdough bread</description>
<calories>600</calories>
</food>
<food>
<name>Homestyle Breakfast</name>
<price>$6.95</price>
<description>Two eggs, bacon or sausage, toast, and our ever-popular hash browns</description>
<calories>950</calories>
</food>
</breakfast_menu>
`
)
Let’s look at the main changes compared with the Unmarshal version and analyze them in more detail:
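d := xml.NewDecoder(xmlData)

for {
	// Token returns a nil token and io.EOF once the input is exhausted.
	t, err := d.Token()
	if err == io.EOF {
		break
	}
	if err != nil {
		fmt.Printf("error: %v\n", err)
		return
	}

	switch se := t.(type) {
	case xml.StartElement:
		if se.Name.Local == foodElementName {
			var food Food
			if err := d.DecodeElement(&food, &se); err != nil {
				fmt.Printf("error: %v\n", err)
				return
			}
			menu.Food = append(menu.Food, food)
		}
	}
}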
First, we create an xml.Decoder with the xml.NewDecoder function. Next, we iterate over the XML tokens using the Token method. Strictly speaking, the method returns two values: a Token and an error, if any. When the end of the input is reached, it returns a nil Token and io.EOF as the error.
The xml.Token type is an interface declared as ‘type Token any’ (any, in turn, is an alias for the empty interface: ‘type any = interface{}’). Before Go 1.18, which introduced generics and the any alias, xml.Token was declared directly as an empty interface: ‘type Token interface{}’. It can therefore hold any type of data and, according to the documentation, is one of the following types: StartElement, EndElement, CharData, Comment, ProcInst, or Directive. We are only interested in the start of an element, i.e. the StartElement type. As soon as we meet one, we check whether it is the “food” node. If so, we decode it into a Go structure using the DecodeElement method.
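To make the list of token types more concrete, here is a small sketch that walks a document and records what kind of token each call to Token returns. The describeTokens helper, its output format, and the tiny input document are assumptions made for this illustration.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// describeTokens returns a short label for every token in src,
// skipping character data that consists only of whitespace.
func describeTokens(src string) ([]string, error) {
	d := xml.NewDecoder(strings.NewReader(src))
	var out []string
	for {
		tok, err := d.Token()
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return out, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			out = append(out, "start:"+t.Name.Local)
		case xml.EndElement:
			out = append(out, "end:"+t.Name.Local)
		case xml.CharData:
			if s := strings.TrimSpace(string(t)); s != "" {
				out = append(out, "text:"+s)
			}
		case xml.Comment:
			out = append(out, "comment")
		case xml.ProcInst:
			out = append(out, "procinst:"+t.Target)
		case xml.Directive:
			out = append(out, "directive")
		}
	}
}

func main() {
	labels, err := describeTokens(`<?xml version="1.0"?><food><!-- hot --><name>Toast</name></food>`)
	if err != nil {
		fmt.Printf("error: %v\n", err)
		return
	}
	for _, l := range labels {
		fmt.Println(l)
	}
}
```

For the sample document this yields a processing instruction, the start of food, a comment, the start of name, its text, and then the two end elements.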
Custom parsing (custom unmarshal)
Sometimes you need to write your own decoder for a particular field. This often happens when parsing dates, times, or enums. It can be done with a custom data type that implements the Unmarshaler interface from the encoding/xml package. The interface contains a single method, UnmarshalXML. Let’s look at an example of its implementation:
type userDate time.Time

const userDateFormat = "2006-01-02"

func (ud *userDate) UnmarshalXML(d *xml.Decoder, start xml.StartElement) error {
	dateString := ""
	err := d.DecodeElement(&dateString, &start)
	if err != nil {
		return err
	}
	dat, err := time.Parse(userDateFormat, dateString)
	if err != nil {
		return err
	}
	*ud = userDate(dat)
	return nil
}
In short, the method receives the current xml.Decoder and the XML element being decoded (we already used the xml.StartElement type in stream parsing), and uses them to decode the element into a string. Then we parse the string into time.Time (using our own date layout, userDateFormat), convert the result to userDate, and assign it to the receiver ud. The full text of the program is below.
Custom unmarshal XML Golang
package main

import (
	"encoding/xml"
	"fmt"
	"time"
)

type userDate time.Time

const userDateFormat = "2006-01-02"

type FilmsDB struct {
	XMLName xml.Name `xml:"films"`
	Film    []Film   `xml:"film"`
}

type Film struct {
	Title       string   `xml:"title"`
	ReleaseDate userDate `xml:"releaseDate"`
}

func (ud *userDate) UnmarshalXML(d *xml.Decoder, start xml.StartElement) error {
	dateString := ""
	err := d.DecodeElement(&dateString, &start)
	if err != nil {
		return err
	}
	dat, err := time.Parse(userDateFormat, dateString)
	if err != nil {
		return err
	}
	*ud = userDate(dat)
	return nil
}

func (ud userDate) String() string {
	return time.Time(ud).Format(time.RFC822)
}

func main() {
	filmsDB := new(FilmsDB)
	err := xml.Unmarshal([]byte(data), filmsDB)
	if err != nil {
		fmt.Printf("error: %v\n", err)
		return
	}

	fmt.Printf("--- Unmarshal ---\n\n")
	for _, film := range filmsDB.Film {
		fmt.Printf("Title: %s\n", film.Title)
		fmt.Printf("Release Date: %s\n", film.ReleaseDate)
		fmt.Printf("---\n")
	}
}

var (
data = `
<?xml version="1.0" encoding="UTF-8"?>
<films>
<film>
<title>Johnny Mnemonic</title>
<releaseDate>1995-05-26</releaseDate>
</film>
</films>
`
)
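The program above only covers decoding. If the same type also has to be written back to XML, the counterpart is the Marshaler interface with its MarshalXML method. The sketch below pairs the two methods so that a date survives a round trip; the single-film structure and the roundTrip helper are assumptions made for this illustration and are not part of the original program.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"time"
)

type userDate time.Time

const userDateFormat = "2006-01-02"

// UnmarshalXML parses the element text using userDateFormat.
func (ud *userDate) UnmarshalXML(d *xml.Decoder, start xml.StartElement) error {
	var s string
	if err := d.DecodeElement(&s, &start); err != nil {
		return err
	}
	t, err := time.Parse(userDateFormat, s)
	if err != nil {
		return err
	}
	*ud = userDate(t)
	return nil
}

// MarshalXML writes the date back in the same layout.
func (ud userDate) MarshalXML(e *xml.Encoder, start xml.StartElement) error {
	return e.EncodeElement(time.Time(ud).Format(userDateFormat), start)
}

type Film struct {
	XMLName     xml.Name `xml:"film"`
	Title       string   `xml:"title"`
	ReleaseDate userDate `xml:"releaseDate"`
}

// roundTrip decodes a <film> document and encodes it again.
func roundTrip(src string) (string, error) {
	var f Film
	if err := xml.Unmarshal([]byte(src), &f); err != nil {
		return "", err
	}
	out, err := xml.Marshal(f)
	return string(out), err
}

func main() {
	out, err := roundTrip(`<film><title>Johnny Mnemonic</title><releaseDate>1995-05-26</releaseDate></film>`)
	if err != nil {
		fmt.Printf("error: %v\n", err)
		return
	}
	fmt.Println(out)
}
```

Without MarshalXML, encoding/xml would have nothing useful to write for userDate, because the underlying time.Time has no exported fields.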
Text encodings
I would also like to say a few words about encodings in XML. Rarely, but occasionally, you may encounter an encoding other than UTF-8. In that case you can supply a converter through the Decoder’s CharsetReader field: a function that is expected to convert from the XML file’s encoding to UTF-8 (signature: ‘CharsetReader func(charset string, input io.Reader) (io.Reader, error)’).
The easiest way to set CharsetReader is to use NewReaderLabel from the golang.org/x/net/html/charset package. It takes the passed charset (called label in the NewReaderLabel signature) and uses the Lookup function to find the matching encoding in the package’s table of encodings. The charset argument is taken from the encoding attribute of the XML declaration. The code will be something like this:
filmsDB := new(FilmsDB)
r := bytes.NewReader([]byte(data))
d := xml.NewDecoder(r)
d.CharsetReader = charset.NewReaderLabel
// filmsDB is already a pointer, so it is passed to Decode directly.
err := d.Decode(filmsDB)
if err != nil {
	fmt.Printf("error: %v\n", err)
	return
}
The full code is below. Note that the XML declaration specifies encoding="windows-1251" and that the title itself is encoded in windows-1251.
Working with XML Encodings in Golang
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
	"time"

	"golang.org/x/net/html/charset"
)

type userDate time.Time

const userDateFormat = "2006-01-02"

type FilmsDB struct {
	XMLName xml.Name `xml:"films"`
	Film    []Film   `xml:"film"`
}

type Film struct {
	Title       string   `xml:"title"`
	ReleaseDate userDate `xml:"releaseDate"`
}

func (ud *userDate) UnmarshalXML(d *xml.Decoder, start xml.StartElement) error {
	dateString := ""
	err := d.DecodeElement(&dateString, &start)
	if err != nil {
		return err
	}
	dat, err := time.Parse(userDateFormat, dateString)
	if err != nil {
		return err
	}
	*ud = userDate(dat)
	return nil
}

func (ud userDate) String() string {
	return time.Time(ud).Format(time.RFC822)
}

func main() {
	filmsDB := new(FilmsDB)
	r := bytes.NewReader([]byte(data))
	d := xml.NewDecoder(r)
	d.CharsetReader = charset.NewReaderLabel
	// filmsDB is already a pointer, so it is passed to Decode directly.
	err := d.Decode(filmsDB)
	if err != nil {
		fmt.Printf("error: %v\n", err)
		return
	}

	fmt.Printf("--- Unmarshal ---\n\n")
	for _, film := range filmsDB.Film {
		fmt.Printf("Title: %s\n", film.Title)
		fmt.Printf("Release Date: %s\n", film.ReleaseDate)
		fmt.Printf("---\n")
	}
}

var (
	// The title "Джонни-Мнемоник" as windows-1251 bytes (not ASCII, despite the variable name).
	jhonnyMnemonicASCII = []byte{0xc4, 0xe6, 0xee, 0xed, 0xed, 0xe8, 0x2d, 0xcc, 0xed, 0xe5, 0xec, 0xee, 0xed, 0xe8, 0xea}
)

var (
data = `
<?xml version="1.0" encoding="windows-1251"?>
<films>
<film>
<title>` + string(jhonnyMnemonicASCII) + `</title>
<releaseDate>1995-05-26</releaseDate>
</film>
</films>
`
)
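When adding a dependency on golang.org/x/net is not an option, a CharsetReader can also be written by hand for simple single-byte encodings. The sketch below handles only ISO-8859-1 (Latin-1), where every byte maps to the Unicode code point with the same value; it is not a replacement for the charset package and does not cover windows-1251. The latin1Reader type, the latin1CharsetReader function, and the decodeTitle helper are hypothetical names introduced for this illustration.

```go
package main

import (
	"bufio"
	"encoding/xml"
	"fmt"
	"io"
	"strings"
	"unicode/utf8"
)

// latin1Reader converts ISO-8859-1 bytes to UTF-8 on the fly.
type latin1Reader struct {
	src *bufio.Reader
	err error // first error returned by src, reported once all output is delivered
}

func (l *latin1Reader) Read(p []byte) (int, error) {
	n := 0
	// Every Latin-1 byte becomes at most two UTF-8 bytes.
	for l.err == nil && len(p)-n >= 2 {
		// After the first byte, stop instead of blocking on the source.
		if n > 0 && l.src.Buffered() == 0 {
			break
		}
		b, err := l.src.ReadByte()
		if err != nil {
			l.err = err
			break
		}
		n += utf8.EncodeRune(p[n:], rune(b))
	}
	if n == 0 && l.err != nil {
		return 0, l.err
	}
	return n, nil
}

// latin1CharsetReader has the signature expected by Decoder.CharsetReader.
func latin1CharsetReader(label string, input io.Reader) (io.Reader, error) {
	switch strings.ToLower(label) {
	case "iso-8859-1", "latin1":
		return &latin1Reader{src: bufio.NewReader(input)}, nil
	}
	return nil, fmt.Errorf("unsupported charset %q", label)
}

type film struct {
	Title string `xml:"title"`
}

// decodeTitle decodes a single <film> document and returns its title.
func decodeTitle(doc string) (string, error) {
	d := xml.NewDecoder(strings.NewReader(doc))
	d.CharsetReader = latin1CharsetReader
	var f film
	if err := d.Decode(&f); err != nil {
		return "", err
	}
	return f.Title, nil
}

func main() {
	// \xe9 is the Latin-1 byte for "é".
	doc := "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><film><title>Am\xe9lie</title></film>"
	title, err := decodeTitle(doc)
	if err != nil {
		fmt.Printf("error: %v\n", err)
		return
	}
	fmt.Println(title)
}
```

For any other label the function returns an error, which the decoder passes on to the caller instead of silently misreading the text.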
Conclusion
In this article, I tried to cover the main ways of parsing XML and some related issues. It does not claim to be comprehensive, but it is usually enough to solve most typical problems. I hope these examples will make life a little easier for someone and speed up your development. The next section lists some useful links.
Useful links
https://pkg.go.dev/encoding/xml#Marshal – documentation for the Marshal function; here you can read how to describe Go structures for XML
https://www.digitalocean.com/community/tutorials/how-to-use-struct-tags-in-go-ru – about structure tags in Golang
https://www.onlinetool.io/xmltogo/ – online go structure generator from xml file
https://habr.com/ru/company/vk/blog/463063/ – a good article about interfaces in general and about an empty interface in particular