Another Day In The Code Mines: Java

Showing posts with label Java. Show all posts

Monday, September 18, 2017

A short rant on XML - the Good, the Bad, and the Ugly

[editor's note: This blog post has been in "Drafts" for 11 years. In the spirit of just getting stuff out there, I'm publishing it basically as-is. Look for a follow-up blog post next week with some additional observations on structured data transfer from the 21st century]

So, let's see if I can keep myself to less than ten pages of text this time...

XML is the eXtensible Markup Language. It's closely related to both HTML, the markup language used to make the World Wide Web, and SGML, a document format that you've probably never dealt with unless you're either a government contractor, or you used the Internet back in the days before the Web. For the pedants out there, I do know that HTML is actually an SGML "application" and that XML is a proper subset of SGML. Let's not get caught up in the petty details at this point.

XML is used for a variety of different tasks these days, but the most common by far is as a kind of "neutral" format for exchanging structured data between different applications. To keep this short and simple, I'm going to look at XML strictly from the perspective of a data storage and interchange format.

The good

Unicode support

XML Documents can be encoded using the Unicode character encoding, which means that nearly any written character in any language can be easily represented in an XML document.

Uniform hierarchical structure

XML defines a simple tree structure for all the elements in a file - there's one root element, it has zero or more children, which each have zero or more children, ad infinitum. All elements must have an open and close tag, and elements can't overlap. This simple structure makes it relatively easy to parse XML documents.

Human-readable (more or less)

XML is a text format, so it's possible to read and edit an XML document "by hand" in a text editor. This is often useful when you're learning the format of an XML document in order to write a program to read or translate it. Actually writing or modifying XML documents in a text editor can be incredibly tedious, though a syntax-coloring editor makes it easier.

Widely supported

Modern languages like C# and Java have XML support "built in" in their standard libraries. Most other languages have well-supported free libraries for working with XML. Chances are, whatever messed up environment you have to work in, there's an XML reader/writer library available.

The bad

Legacy encoding support

XML Documents can also be encoded in whatever wacky character set your nasty legacy system uses. You can put a simple encoding="Ancient-Elbonian-EBCDIC" attribute in the XML declaration element, and you can write well-formed XML documents in your favorite character encoding. You probably shouldn't expect that anyone else will actually be able to read it, though.

Strictly hierarchical format

Not every data set you might want to interchange between two systems is structured hierarchically. In particular, representing a relational database or an in-memory graph of objects is problematic in XML. A number of approaches are used to get around this issue, but they're all outside the scope of standardized XML (obviously), and different systems tend to solve this problem in different ways, neatly turning the "standardized interchange format for data" into yet another proprietary format, which is only readable by the software that created it.

XML is verbose

A typical XML document can be 30% markup, sometimes more. This makes it larger than desired in many cases. There have been several attempts to define a "binary XML" format (most recently by the W3C group), but they really haven't caught on yet. For most applications where size or transmission speed is an issue, you probably ought to look into compressing the XML document using a standard compression algorithm (gzip, or zlib, or whatever), then decompressing it on the other end. You'll save quite a bit more that way than by trying to make the XML itself less wordy.

Some XML processing libraries are extremely memory-intensive

There are two basic approaches to reading an XML document. You can read the whole thing into memory and re-construct the structure of the file into a tree of nodes in memory, and then the application can use standard pointer manipulation to scan through the tree of nodes, looking for whatever information it needs, or further-transforming the tree into the program's native data structures. One XML processing library I've used loaded the whole file into memory all at once, then created a second copy of all the data in the tags. Actually, it could end up using up to the size of the file, plus twice the combined size of all the tags.

Alternatively, the reader can take a more stream-oriented approach, scanning through the file from beginning to end, and calling into the application code whenever an element starts or ends. This can be implemented with a callback to your code for every tag start/end, which gives you a simple interface, and doesn't require holding large amounts of data in memory during the parsing.

No random access

This is just fallout from the strict hierarchy, but it's extremely labor intensive to do any kind of data extraction from a large XML document. If you only want a subset of nodes from a couple levels down in the hierarchy, you've still got to step your way down there, and keep scanning throught the rest of the file to figure out when you've gone up a level.

The ugly

By far, the biggest problems with XML don't have anything to do with the technology itself, but with the often perverse ways in which it's misapplied to the wrong problems. Here are a couple of examples from my own experience.

Archiving an object graph, and the UUID curse

XML is a fairly reasonable format for transferring "documents", as humans understand them. That is, a primarily linear bunch of text, with some attributes that apply to certain sections of the text.

These days, a lot of data interchange between computer programs is in the form of relational data (databases), or complex graphs of objects, where you'll frequently need to make references back to previous parts of the document, or forward to parts that haven't come across yet.

The obvious way to solve this problem is by having a unique ID that you can reference to find one entity from another. Unfortunately, the "obvious" way to ensure that a key is unique is to generate a globally-unique key, and so you end up with a bunch of 64-bit or 128-bit GUIDs stuck in your XML, which makes it really difficult to follow the links, and basically impossible to "diff' the files, visually.

One way to avoid UUID proliferation is to use "natural unique IDs, if your data has some attribute that needs to be unique anyway.

What's the worst possible way to represent a tree?

I doubt anybody's ever actually asked this question, but I have seen some XML structures that make a pretty good case that that's how they were created. XML, by its heirarchical nature, is actually a really good fit for hierarchical data. Here is one way to store a tree in XML:

<pants color="blue" material="denim">
  <pocket location="back-right">
    <wallet color="brown" material="leather">  
      <bill currency="USD" value="10"></bill>  
      <bill currency="EURO" value="5"></bill>  
    </wallet>
  </pocket>  
</pants>

And here's another:

<object>
  <id>
    20D06E38-60C1-433C-8D37-2FDBA090E197
  </id>
  <class>
    pants
  </class>
  <color>
    blue
  </color>
  <material>
     denim
  </material>
</object>
<object>
  <id>
    1C728378-904D-43D8-8441-FF93497B10AC
  </id>
  <parent>
    20D06E38-60C1-433C-8D37-2FDBA090E197
  </parent>
  <class>
    pocket
  </class>
  <location>
    right-back
  </location>
</object>
<object>
  <id>
    AFBD4915-212F-4B47-B6B8-A2663025E350
  </id>
  <parent>
    1C728378-904D-43D8-8441-FF93497B10AC
  </parent>
  <class>
    wallet
  </class>
  <color>
    brown
  </color>
  <material>
    leather
  </material>
</object>
<object>
  <id>
    E197AA8D-842D-4434-AAC9-A57DF4543E43
  </id>
  <parent>
    AFBD4915-212F-4B47-B6B8-A2663025E350
  </parent>
  <class>
    bill
  </class>
  <currency>
    USD
  </currency>
  <denomination>
    10
  </denomination>
</object>
<object>
  <id>
    AF723BDD-80A1-4DAB-AD16-5B37133941D0
  </id>
  <parent>
    AFBD4915-212F-4B47-B6B8-A2663025E350
  </parent>
  <class>
    bill
  </class>
  <currency>
    EURO
  </currency>
  <denomination>
    10
  </denomination>
</object>

So, which one of those is easier to read? And did you notice that I added another 5 Euro to my wallet, while translating the structure? Key point here: try to have the structure of your XML follow the structure of your data.

Tuesday, November 11, 2008

Just In Time compilation vs. the desktop and embedded worlds

Okay, rant mode on. As I was waiting for Eclipse to launch again today, it occured to me that one of the enduring mysteries of Java (and C#/.NET) for me is the continued dominance of just-in-time compilation as a runtime strategy for these languages, wherever they're found. We've all read the articles that claim that Java is "nearly as fast as C++", we also all know that that's a bunch of hooey, particularly with regard to startup time. Of course, if Eclipse wasn't periodically crashing on me with out-of-memory errors, then I'd care less about the startup time - but that's another rant. Back to startup time and JIT compilation...

If you're creating a server-based application, the overhead of the JIT compiler is probably pretty nominal - the first time through the code, it's a little pokey, but after that, it's plenty fast, and you're likely throttled by network I/O or database performance, anyway. And in theory, the JIT compiler can make code that's optimal for your particular hardware, though in practice, device-specific optimizations are pretty minimal.

On the other hand, if you're writing a desktop application (or worse yet, a piece of embedded firmware), then startup time, and first-time through performance, matters. In many cases, it matters rather a lot.

There are a number of advantages to writing code in a "managed", garbage-collected language, especially for desktop applications - memory leaks are much reduced, you eliminate buffer overflows, and there is the whole Java standard library full of useful code that you don't have to write for yourself. I'm willing to put up with many of the disadvantages of Java to gain the productivity and safety advantages. But waiting for the Java interpreter to recompile the same application over and over offends me on some basic level.

On a recent project, we used every trick in the book to speed up our startup time, including a "faked" startup splash screen, lazy initialization of everything we could get away with, etc, etc. Despite all that effort (and unecessary complication in the code base), startup time was still one of the most common complaints from users.

Quite a bit of profiling was done, and in our case, much of the startup time was taken up deep inside the JIT, where there was little we could do about it. Why oh why doesn't anybody make a Java (or .NET) implementation that keeps the safe runtime behavior, and implements a simple all-at-once compilation to high-performance native code? Maybe somebody does, but I haven't heard of them.

For that matter, why don't the reference implementations of these language runtimes just save all that carefully-compiled native code so they can skip all that effort the next time? The .NET framework even has a global cache for shared assemblies. Why those, at least, aren't pre-compiled during installation, I can't even imagine.

Update:
I was helpfully reminded of NGen, which will pre-compile .NET assemblies to native code. I had forgotten about that, since so much of my most recent C# work was hosted on Mono, which does things a bit differently. Mono has an option for AOT (ahead of time) compilation, which works, sort of, but could easily be the subject of another long article.

Sunday, October 19, 2008

Returning to Java, after 10 years away

I'm once again writing Java code professionally, something that I haven't done in nearly 10 years (no, really - I had to stop and think it through because I didn't believe it, either). A couple of thoughts did occur to me, after I'd figured out the time frames involved.

I was a little taken aback by the very idea that Java is more than 10 years old. It just seems weird that a new programming language could go from introduction to being a major part of the world's IT infrastructure and college curriculums, in less time than I've been living here in California.

Java sure has evolved a lot in the last 10 years. There have been major changes to the language, the libraries, and the tools. I'd bet that some of my 10-year old Java code would throw deprecation warnings for nearly every line of code...

On the other hand, my final thought is along the lines of "Oh, my god. So much has changed, but Java is still irritating in nearly all the ways that made me crazy ten years ago! What have these people been up to for the last decade?"

Oh, and I was the first person I knew to "quit" Java, much like I was the first person to "quit" World of Warcraft. Hopefully backsliding on the Java thing doesn't mean I'm about to backslide on the WoW thing - I can't afford the lost time. I've got to learn about how you do things in Java again.

One good thing for my loyal readers (if any exist) is that I have a bunch of stored-up vitriol about Java that I can just uncork and pour out, so I should be updating more frequently.