Monday, September 18, 2017

A short rant on XML - the Good, the Bad, and the Ugly

[editor's note: This blog post has been in "Drafts" for 11 years. In the spirit of just getting stuff out there, I'm publishing it basically as-is. Look for a follow-up blog post next week with some additional observations on structured data transfer from the 21st century]

So, let's see if I can keep myself to less than ten pages of text this time...

XML is the eXtensible Markup Language. It's closely related to both HTML, the markup language used to make the World Wide Web, and SGML, a document format that you've probably never dealt with unless you're either a government contractor, or you used the Internet back in the days before the Web. For the pedants out there, I do know that HTML is actually an SGML "application" and that XML is a proper subset of SGML. Let's not get caught up in the petty details at this point.

XML is used for a variety of different tasks these days, but the most common by far is as a kind of "neutral" format for exchanging structured data between different applications. To keep this short and simple, I'm going to look at XML strictly from the perspective of a data storage and interchange format.


The good

Unicode support

XML documents can be encoded in Unicode (UTF-8 is the default, and every parser must also accept UTF-16), which means that nearly any written character in any language can be easily represented in an XML document.

Uniform hierarchical structure

XML defines a simple tree structure for all the elements in a file - there's one root element, it has zero or more children, which each have zero or more children, ad infinitum. All elements must have an open and close tag, and elements can't overlap. This simple structure makes it relatively easy to parse XML documents.
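
As a rough illustration of how little code that uniform structure demands, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The sample document and the outline helper are my own hypothetical names, borrowed loosely from the pants example later in this post.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: one root, each element has zero or more children.
SAMPLE = """<pants color="blue">
  <pocket location="back-right">
    <wallet color="brown"/>
  </pocket>
</pants>"""

def outline(element, depth=0):
    """Return one indented line per element, visiting children recursively."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

root = ET.fromstring(SAMPLE)
print("\n".join(outline(root)))
```

Because every element looks the same to the parser, one short recursive function is enough to visit the whole document.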

Human-readable (more or less)

XML is a text format, so it's possible to read and edit an XML document "by hand" in a text editor. This is often useful when you're learning the format of an XML document in order to write a program to read or translate it. Actually writing or modifying XML documents in a text editor can be incredibly tedious, though a syntax-coloring editor makes it easier.

Widely supported

Modern languages like C# and Java have XML support "built in" in their standard libraries. Most other languages have well-supported free libraries for working with XML. Chances are, whatever messed up environment you have to work in, there's an XML reader/writer library available.


The bad

Legacy encoding support

XML documents can also be encoded in whatever wacky character set your nasty legacy system uses. You can put a simple encoding="Ancient-Elbonian-EBCDIC" attribute in the XML declaration at the top of the file, and you can write well-formed XML documents in your favorite character encoding. You probably shouldn't expect that anyone else will actually be able to read it, though.
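
To make the mechanism concrete, here is a small sketch in Python of a document that declares a (much tamer) legacy encoding. ISO-8859-1 is my choice for the example because the standard library parser is known to handle it; a more exotic encoding may well fail on the reader's side, which is the whole problem.

```python
import xml.etree.ElementTree as ET

# The declaration names the encoding; the bytes that follow must match it.
text = '<?xml version="1.0" encoding="ISO-8859-1"?><name>Zo\u00eb</name>'
latin1_bytes = text.encode("iso-8859-1")

# The parser reads the declaration and decodes the content accordingly.
root = ET.fromstring(latin1_bytes)
print(root.text)

# The same character takes different bytes in UTF-8, so the declaration matters.
print(latin1_bytes == text.encode("utf-8"))
```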

Strictly hierarchical format

Not every data set you might want to interchange between two systems is structured hierarchically. In particular, representing a relational database or an in-memory graph of objects is problematic in XML. A number of approaches are used to get around this issue, but they're all outside the scope of standardized XML (obviously), and different systems tend to solve this problem in different ways, neatly turning the "standardized interchange format for data" into yet another proprietary format, which is only readable by the software that created it.

XML is verbose

A typical XML document can be 30% markup, sometimes more. This makes it larger than desired in many cases. There have been several attempts to define a "binary XML" format (most recently by a W3C working group), but they really haven't caught on yet. For most applications where size or transmission speed is an issue, you probably ought to look into compressing the XML document using a standard compression algorithm (gzip, or zlib, or whatever), then decompressing it on the other end. You'll save quite a bit more that way than by trying to make the XML itself less wordy.
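
A quick sketch of the compress-then-send approach, using Python's standard gzip module. The document here is a made-up, deliberately repetitive one, so the savings it shows are far better than you should expect from real data.

```python
import gzip

# Hypothetical document: 1000 identical <bill> elements inside one <wallet>.
xml_bytes = (
    "<wallet>"
    + '<bill currency="USD" value="10"></bill>' * 1000
    + "</wallet>"
).encode("utf-8")

compressed = gzip.compress(xml_bytes)    # sender side
restored = gzip.decompress(compressed)   # receiver side

print(len(xml_bytes), "bytes raw,", len(compressed), "bytes gzipped")
```

The repeated tag names that make XML wordy are exactly the kind of redundancy a general-purpose compressor removes well.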

Some XML processing libraries are extremely memory-intensive

There are two basic approaches to reading an XML document. The first is to read the whole thing into memory and reconstruct the structure of the file as a tree of nodes. The application can then use standard pointer manipulation to scan through the tree, looking for whatever information it needs, or transform the tree further into the program's native data structures. One XML processing library I've used loaded the whole file into memory all at once, then created a second copy of all the data in the tags. In the worst case, it could end up using the size of the file, plus twice the combined size of all the tags.

Alternatively, the reader can take a more stream-oriented approach, scanning through the file from beginning to end, and calling into the application code whenever an element starts or ends. This can be implemented with a callback to your code for every tag start/end, which gives you a simple interface, and doesn't require holding large amounts of data in memory during the parsing.
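
The callback style described above might look something like this with Python's standard xml.sax module. The BillCounter class and the sample document are hypothetical; the point is only that the handler keeps a few counters and never holds the document itself.

```python
import xml.sax

class BillCounter(xml.sax.ContentHandler):
    """Counts <bill> elements and tracks nesting depth without building a tree."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0
        self.bills = 0

    def startElement(self, name, attrs):
        # Called by the parser each time an element opens.
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)
        if name == "bill":
            self.bills += 1

    def endElement(self, name):
        # Called by the parser each time an element closes.
        self.depth -= 1

SAMPLE = b"""<pants><pocket><wallet>
<bill currency="USD" value="10"></bill>
<bill currency="EURO" value="5"></bill>
</wallet></pocket></pants>"""

handler = BillCounter()
xml.sax.parseString(SAMPLE, handler)
print(handler.bills, handler.max_depth)
```

The price of the low memory use is that your code has to remember where it is in the document, which is what the depth counter is doing here.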

No random access

This is just fallout from the strict hierarchy, but it's extremely labor-intensive to do any kind of data extraction from a large XML document. If you only want a subset of nodes from a couple levels down in the hierarchy, you've still got to step your way down there, and keep scanning through the rest of the file to figure out when you've gone up a level.
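
Here is a sketch of what that extraction looks like in practice, using iterparse from Python's standard xml.etree.ElementTree module. The bill_values helper and the sample data are hypothetical. Note that even though only the bill elements are wanted, the loop still sees every element in the file.

```python
import io
import xml.etree.ElementTree as ET

SAMPLE = b"""<pants>
  <pocket location="back-right">
    <wallet>
      <bill currency="USD" value="10"></bill>
      <bill currency="EURO" value="5"></bill>
    </wallet>
  </pocket>
</pants>"""

def bill_values(stream):
    """Collect (currency, value) pairs from every <bill>, scanning the whole stream."""
    found = []
    for event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "bill":
            found.append((elem.get("currency"), int(elem.get("value"))))
        # Release this element's attributes, text, and children once handled.
        elem.clear()
    return found

print(bill_values(io.BytesIO(SAMPLE)))
```

There is no way to seek straight to the wallet; the best you can do is throw away what you don't need as it goes by.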


The ugly

By far, the biggest problems with XML don't have anything to do with the technology itself, but with the often perverse ways in which it's misapplied to the wrong problems. Here are a couple of examples from my own experience.

Archiving an object graph, and the UUID curse

XML is a fairly reasonable format for transferring "documents", as humans understand them. That is, a primarily linear bunch of text, with some attributes that apply to certain sections of the text.

These days, a lot of data interchange between computer programs is in the form of relational data (databases), or complex graphs of objects, where you'll frequently need to make references back to previous parts of the document, or forward to parts that haven't come across yet.

The obvious way to solve this problem is by having a unique ID that you can reference to find one entity from another. Unfortunately, the "obvious" way to ensure that a key is unique is to generate a globally-unique key, and so you end up with a bunch of 64-bit or 128-bit GUIDs stuck in your XML, which makes it really difficult to follow the links, and basically impossible to "diff" the files visually.

One way to avoid UUID proliferation is to use "natural" unique IDs, if your data has some attribute that needs to be unique anyway.

What's the worst possible way to represent a tree?

I doubt anybody's ever actually asked this question, but I have seen some XML structures that make a pretty good case that that's how they were created. XML, by its hierarchical nature, is actually a really good fit for hierarchical data. Here is one way to store a tree in XML:

<pants color="blue" material="denim">
  <pocket location="back-right">
    <wallet color="brown" material="leather">
      <bill currency="USD" value="10"></bill>
      <bill currency="EURO" value="5"></bill>
    </wallet>
  </pocket>
</pants>

And here's another:

<object>
  <id>
    20D06E38-60C1-433C-8D37-2FDBA090E197
  </id>
  <class>
    pants
  </class>
  <color>
    blue
  </color>
  <material>
    denim
  </material>
</object>
<object>
  <id>
    1C728378-904D-43D8-8441-FF93497B10AC
  </id>
  <parent>
    20D06E38-60C1-433C-8D37-2FDBA090E197
  </parent>
  <class>
    pocket
  </class>
  <location>
    back-right
  </location>
</object>
<object>
  <id>
    AFBD4915-212F-4B47-B6B8-A2663025E350
  </id>
  <parent>
    1C728378-904D-43D8-8441-FF93497B10AC
  </parent>
  <class>
    wallet
  </class>
  <color>
    brown
  </color>
  <material>
    leather
  </material>
</object>
<object>
  <id>
    E197AA8D-842D-4434-AAC9-A57DF4543E43
  </id>
  <parent>
    AFBD4915-212F-4B47-B6B8-A2663025E350
  </parent>
  <class>
    bill
  </class>
  <currency>
    USD
  </currency>
  <denomination>
    10
  </denomination>
</object>
<object>
  <id>
    AF723BDD-80A1-4DAB-AD16-5B37133941D0
  </id>
  <parent>
    AFBD4915-212F-4B47-B6B8-A2663025E350
  </parent>
  <class>
    bill
  </class>
  <currency>
    EURO
  </currency>
  <denomination>
    10
  </denomination>
</object>

So, which one of those is easier to read? And did you notice that I added another 5 Euro to my wallet, while translating the structure? Key point here: try to have the structure of your XML follow the structure of your data.
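
For what it's worth, here is a sketch of the work a reader has to do to get from the second form back to the first, in Python with the standard xml.etree.ElementTree module. Everything in it is an assumption for illustration: the IDs are shortened to single letters, the objects are wrapped in an <objects> root I made up so the input parses as one document, and the nest function is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical version of the flat format, in an assumed <objects> root.
FLAT = """<objects>
  <object><id>A</id><class>pants</class><color>blue</color></object>
  <object><id>B</id><parent>A</parent><class>pocket</class></object>
  <object><id>C</id><parent>B</parent><class>wallet</class></object>
  <object><id>D</id><parent>C</parent><class>bill</class><currency>USD</currency></object>
</objects>"""

def nest(flat_root):
    """Rebuild the nested form: <class> becomes the tag, other fields become attributes."""
    nodes, parents = {}, {}
    # First pass: create one element per object, indexed by its ID.
    for obj in flat_root.findall("object"):
        fields = {child.tag: (child.text or "").strip() for child in obj}
        obj_id = fields.pop("id")
        parents[obj_id] = fields.pop("parent", None)
        nodes[obj_id] = ET.Element(fields.pop("class"), fields)
    # Second pass: attach each element to its parent; parentless ones are roots.
    roots = []
    for obj_id, node in nodes.items():
        parent_id = parents[obj_id]
        if parent_id is None:
            roots.append(node)
        else:
            nodes[parent_id].append(node)
    return roots

(tree,) = nest(ET.fromstring(FLAT))
print(ET.tostring(tree, encoding="unicode"))
```

That's two passes and two lookup tables just to recover a structure the first form gives you for free, and this sketch doesn't even handle a parent ID that points at nothing.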

