These tools convert XML and HTML to and from a line-oriented format that is more amenable to classic Unix pipeline tools such as grep, sed, awk, cut, and shell scripts.
[Documentation](reference.md) is available, and [examples](examples.md) are illustrative.
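
As a rough illustration of the idea, the sketch below runs ordinary text tools over a few lines of flattened output. The sample lines are only an assumption about what the flat format looks like for a small HTML page (one `path=value` line per text node, attributes written as `@name`); the reference describes the real format. `sample_flat` and `list_links` are hypothetical helper names, not part of the tools.

```shell
# Hypothetical sample of flattened output, assumed for illustration only.
sample_flat() {
    cat <<'EOF'
/html/head/title=Example page
/html/body/p=See the
/html/body/p/a/@href=http://example.com/
/html/body/p/a=example site
EOF
}

# Print every link target: keep the @href lines, then drop the path
# by cutting everything up to and including the first "=".
list_links() {
    grep '/@href=' | cut -d= -f2-
}

sample_flat | list_links
```

With the tools installed, the same filter could presumably be fed real input, for example `html2 < page.html | list_links`, assuming `html2` reads standard input and writes standard output.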

- Fetch and install the gnome-xml library (libxml). I'm using version 1.8.6. Other versions might or might not work. Make sure `xml-config` is on your path.
- Fetch and unpack the source tarball for my tools from this repository. Look for a file named `xml2-version.tar.gz`.
- Run `make`. You should now have several binaries: `xml2`, `2xml`, `csv2`, `2csv`. Symbolic links are used to offer alternative names: `html2` and `2html`.
- Copy the binaries and links somewhere.
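
The build and copy steps above can be sketched as a few shell functions. This is a minimal sketch, not the author's install procedure: the function names and the destination directory are hypothetical, and it assumes you run it from the unpacked source directory.

```shell
# Return success if the named command is on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Build the tools; the build expects libxml's xml-config to be on PATH.
build_tools() {
    if ! have_cmd xml-config; then
        echo "xml-config not found on PATH; install libxml first" >&2
        return 1
    fi
    make
}

# Copy the binaries and their alternative-name links.
# $1: destination directory (hypothetical, e.g. "$HOME/bin")
install_tools() {
    dest=$1
    mkdir -p "$dest" || return 1
    # -P copies symbolic links as links instead of following them
    cp -P xml2 2xml csv2 2csv html2 2html "$dest"/
}
```

A typical run would then be `build_tools && install_tools "$HOME/bin"`, with the destination being any directory on your path.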

- Namespace support is absent.
- Whitespace isn't always preserved, and the rules for preserving and generating whitespace are complex.
  - It's possible to preserve all whitespace, but the resulting flat files are big and ugly. In most cases, whitespace is meaningless, used only to make the XML human-readable. Even in HTML, whitespace is sometimes significant and sometimes not, with no easy way to tell which is which.
- XML is fundamentally hierarchical, not record-oriented.
  - The usefulness of record-oriented Unix tools to this domain will always be limited to simple operations like basic search and replacement, no matter how many syntactic transformations we make. More complex processing requires XML-specific tools like XSLT.
- The transformation is complex.
  - The syntax used by these tools is relatively intuitive, but difficult to describe precisely. (My own documentation relies only on examples.) This makes it difficult to formally reason about data, so subtle errors are easy to make.
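
The kind of simple replacement that does fit the record-oriented model can be sketched as a line filter. This is only an illustration under the assumption that each flat line has the form `path=value`; `set_value` is a hypothetical helper, not part of the tools.

```shell
# Replace the value on every line whose path matches exactly;
# all other lines pass through unchanged.
# $1: element path (hypothetical, e.g. /doc/title)
# $2: new value (awk -v interprets backslash escapes in it)
set_value() {
    awk -v path="$1" -v val="$2" '
        BEGIN { n = length(path) + 1 }
        # Compare the literal prefix "path=" so /doc/title
        # does not also match /doc/titlex.
        substr($0, 1, n) == path "=" { print path "=" val; next }
        { print }
    '
}
```

Assuming the converters read standard input and write standard output, this might be used as `xml2 < in.xml | set_value /doc/title 'New title' | 2xml > out.xml`. The filter rewrites every matching line independently and knows nothing about the surrounding hierarchy, which is the limitation described above.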
Author: Dan Egnor (ofb.net/~egnor)
Converted manually by Lorenzo L. Ancora, from HTML to MarkDown. All legal rights remain with the original author and this documentation is distributed non-profit.