Archive for the ‘XML’ Category

Entmoot of markup

Monday, May 26th, 2008

Balisage is the Entmoot of markup. The 2008 schedule is available at http://www.balisage.net/At-A-Glance.html.

When Ant is subsidiary to <oXygen/>

Friday, May 16th, 2008

The <oXygen/> documentation has an example of setting up an “External Tool” to run Ant. The example is simple enough to illustrate its point, but there’s more that can be done, especially if you write the Ant build file knowing that it will be run from <oXygen/>.

This example is from the <oXygen/> “project” that I used for organising the exercises for my “Testing XSLT” tutorial at XTech 2008.

BOM in UTF-8: good, bad, or ugly?

Wednesday, October 3rd, 2007

The usefulness or otherwise of U+FEFF (ZERO WIDTH NO-BREAK SPACE, also used as the BYTE ORDER MARK) in UTF-8 has been subject to reinterpretation over the years. It wasn’t mentioned in the original XML 1.0 Recommendation but was added later, much as its use with UTF-8 was added to the Unicode Standard.

In the Unicode Standard 2.0, there was no mention of U+FEFF with UTF-8, either in the section on the BOM or in the appendix defining UTF-8.

In the Unicode Standard 3.0, section 13.6, “Specials”, includes:

Although there are never any questions of byte-order with UTF-8 text, this sequence can serve as signature for UTF-8 encoded text where the character set is unmarked.

In the Unicode Standard 5.0, section 3.10, “Unicode Encoding Schemes”, includes:

While there is obviously no need for a byte order signature when using UTF-8, there are occasions when processes convert UTF-16 or UTF-32 data containing a byte order mark into UTF-8. When represented in UTF-8, the byte order mark turns into the byte sequence <EF BB BF>. Its usage at the beginning of a UTF-8 data stream is neither required nor recommended by the Unicode Standard, but its presence does not affect conformance to the UTF-8 encoding scheme. Identification of the <EF BB BF> byte sequence at the beginning of a data stream can, however, be taken as a near-certain indication that the data stream is using the UTF-8 encoding scheme.

So in the Unicode Standard it’s gone from irrelevant to useful to “Oh, if you must”.

(BTW, in other reinterpretations, “Unicode Encoding Scheme” results from splitting the meaning of “UTF”, and the use of U+FEFF to indicate non-breaking is deprecated these days.)

The Unicode FAQ both lists its use as a signature and says to avoid its use where “byte oriented protocols expect ASCII characters at the beginning of a file”. However, I don’t think that XML necessarily counts as one such byte oriented protocol.
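To make the "signature" idea concrete, here is a minimal Python sketch of checking for the <EF BB BF> sequence that the Unicode 5.0 passage describes. The function name split_utf8_signature and the sample document are made up for illustration; codecs.BOM_UTF8 and the "utf-8-sig" codec come from the Python standard library.

```python
import codecs
from typing import Tuple


def split_utf8_signature(data: bytes) -> Tuple[bool, bytes]:
    """Return (has_signature, payload) for a byte string.

    Hypothetical helper: it only tests for the three bytes EF BB BF
    at the very start of the data and strips them if present.
    """
    if data.startswith(codecs.BOM_UTF8):
        return True, data[len(codecs.BOM_UTF8):]
    return False, data


# Illustrative sample data, not taken from any real file.
doc = "<?xml version='1.0'?><doc/>".encode("utf-8")
with_bom = codecs.BOM_UTF8 + doc

has_sig, payload = split_utf8_signature(with_bom)
print(has_sig, payload == doc)

# U+FEFF encoded as UTF-8 is exactly the signature byte sequence.
print("\ufeff".encode("utf-8") == codecs.BOM_UTF8)

# The "utf-8-sig" codec drops a leading signature while decoding.
print(with_bom.decode("utf-8-sig") == doc.decode("utf-8"))
```

A leading signature is only a strong hint, as the quoted passage says: data without it may still be UTF-8, so the check can confirm but not rule out the encoding.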

Windows drive names with Cygwin xsltproc & xmllint

Monday, September 24th, 2007

Cygwin may be the only way to stay sane while using Windows, but it has its own Unix-like notion for drive names, e.g., “/cygdrive/c/” instead of “c:”. Which is fine, except when you want to use both Java XML tools, which understand only the “c:” form, and Cygwin tools, which tend to understand only the “/cygdrive/c/” form.

The Cygwin xsltproc and xmllint complain when you use them with files containing Windows drive names in system identifiers, so the second time it happened, I wrote a simple XML catalog file to map the Windows drive names to the Cygwin paths.

Put this as the contents of /etc/xml/catalog (not catalog.xml!) and the Cygwin xsltproc, etc., will handle Windows drive names:

<?xml version="1.0"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <rewriteSystem
      systemIdStartString="file:///C:/"
      rewritePrefix="file:///cygdrive/c/"/>
</catalog>

You will have to add a suitable rewriteSystem for each additional drive that you use.
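If you use several drives, the entries can be generated rather than typed. The following is only a sketch: make_catalog is a hypothetical helper, and it assumes every drive follows the same pattern as the C: entry above (upper-case letter on the Windows side, lower-case letter under /cygdrive/).

```python
from xml.sax.saxutils import quoteattr

CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"


def make_catalog(drives):
    """Build catalog text with one rewriteSystem entry per drive letter.

    Hypothetical helper; the upper/lower-case convention mirrors the
    hand-written C: entry and is an assumption for other drives.
    """
    lines = ['<?xml version="1.0"?>',
             '<catalog xmlns=%s>' % quoteattr(CATALOG_NS)]
    for drive in drives:
        start = "file:///%s:/" % drive.upper()
        prefix = "file:///cygdrive/%s/" % drive.lower()
        lines.append("  <rewriteSystem")
        lines.append("      systemIdStartString=%s" % quoteattr(start))
        lines.append("      rewritePrefix=%s/>" % quoteattr(prefix))
    lines.append("</catalog>")
    return "\n".join(lines) + "\n"


# Example: entries for drives C: and D: (the drive list is illustrative).
catalog_text = make_catalog(["c", "d"])
print(catalog_text)
```

The output would still need to be saved as /etc/xml/catalog by hand; the sketch deliberately does not write to that path.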

…a linear syntax for unranked, ordered and labeled trees.

Thursday, April 5th, 2007

It’s often interesting to see people’s sound-bite definitions of XML. The following, from Section 2.2 of the Static Validation of XSLT Master’s thesis by Mads Kristian Østerby Olesen, isn’t as forthright as Richard Gabriel’s but is interesting for assuming that you know about trees (in the computer sense) even if you don’t know XML:

XML is just a linear syntax for unranked, ordered and labeled trees.

XML—…fundamental Lisp data structures reinterpreted by people with bad taste brainwashed by inflexibility

Tuesday, August 15th, 2006

I was most impressed by this characterisation of XML in “The Art of Lisp & Writing” by Richard Gabriel (http://www.dreamsongs.com/ArtOfLisp.html):

XML—which amounts to some fundamental Lisp data structures reinterpreted by people with bad taste brainwashed by inflexibility.