semmtech/xml-to-rdf

DOM4 XML → RDF Transformer

Tool for converting XML Documents into Linked Data (RDF)

A focused, zero-configuration utility built on Apache Jena and the W3C DOM specification that deterministically transforms any well-formed XML file into a structured RDF graph.

Overview

This tool solves the challenge of integrating structured XML data into the Semantic Web. By using a deterministic DOM4 mapping, it converts the physical structure of your XML document (elements, attributes, text) into discoverable, linkable, and queryable RDF triples.

Key Use Cases

  • Semantic Integration: Make existing XML data consumable by triple stores and knowledge graphs.
  • Structural Analysis: Use SPARQL to query the exact structure (node positions, attributes, namespaces) of your XML documents.
  • Pipeline Integration: Easily integrate into data processing workflows as a standalone CLI or Docker image.

Scope

This converter is a domain-agnostic structural transformer: it focuses purely on the physical representation of the XML document.

It parses an input XML document and produces an RDF model that represents:

  • Core Structure: DOM nodes as RDF resources (elements, attributes, comments, text).
  • Metadata: XML declaration, encoding, node relationships, and simple typed metadata (indexes, URIs).
  • Vocabularies: Dedicated vocabularies are used for the mapping:
    • W3CDOM4: Terms for the DOM structure (elements, attributes, etc.).
    • W3CXML: Terms for XML declaration and related metadata.
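The XML declaration metadata listed above is exposed by the standard W3C DOM API, so a DOM-based converter can read it directly from the parsed document. The following is a minimal sketch using only the JDK parser; the class name XmlDeclarationSketch and the map keys are assumptions made for illustration and are not taken from this repository or from the W3CXML vocabulary.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class XmlDeclarationSketch {

    /** Reads the XML declaration fields that the DOM exposes on the Document node. */
    public static Map<String, String> declarationOf(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Parse from bytes so the parser sees the declared encoding.
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

            // Hypothetical keys; a real converter would use vocabulary terms instead.
            Map<String, String> meta = new LinkedHashMap<>();
            meta.put("version", doc.getXmlVersion());
            meta.put("encoding", doc.getXmlEncoding()); // null if not declared
            meta.put("standalone", String.valueOf(doc.getXmlStandalone()));
            return meta;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?><root/>";
        declarationOf(xml).forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

How these values end up as RDF terms is determined by the project's own vocabularies; the sketch only shows where the raw values come from.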

Note: This tool is intended as a structural converter, not an ontology mapper or domain-specific extractor. It maps structure, not meaning.

Structure

Inner workings

  • TransformCommand parses an XML file into a W3C DOM and calls the transform() method on the transformer.
  • Xml2RdfTransformer initializes a Jena model, adds ontology imports for the DOM4/XML vocabularies, and then walks the DOM:
    • Elements → DOM4 element resources
    • Attributes → DOM4 attr resources
    • Text / Comments → character data nodes
  • Vocabularies are defined within the com.semmtech.transform.xml2rdf.vocabulary package.
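The walk described above can be illustrated with a small, self-contained sketch. This is not the project's Xml2RdfTransformer: it does not use Jena, it emits N-Triples-style strings instead of building a model, and the http://example.org/dom# namespace, the property names, and the blank-node labelling scheme are all assumptions made for illustration rather than terms from the DOM4 vocabulary.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DomWalkSketch {
    // Hypothetical namespace, NOT the project's DOM4 vocabulary.
    static final String EX = "http://example.org/dom#";

    /** Parses the XML text and returns one N-Triples-style line per statement. */
    public static List<String> toTriples(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> out = new ArrayList<>();
            walk(doc.getDocumentElement(), "_:n0", out);
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Depth-first walk; node ids are derived from the position in the tree,
    // so the same input always yields the same labels.
    private static void walk(Node node, String id, List<String> out) {
        switch (node.getNodeType()) {
            case Node.ELEMENT_NODE: {
                emit(out, id, "kind", iri("Element"));
                emit(out, id, "localName", literal(node.getLocalName()));
                NamedNodeMap attrs = node.getAttributes();
                for (int i = 0; i < attrs.getLength(); i++) {
                    Node attr = attrs.item(i);
                    String attrId = id + "_a" + i;
                    emit(out, id, "attribute", attrId);
                    emit(out, attrId, "kind", iri("Attr"));
                    emit(out, attrId, "localName", literal(attr.getLocalName()));
                    emit(out, attrId, "value", literal(attr.getNodeValue()));
                }
                NodeList children = node.getChildNodes();
                for (int i = 0; i < children.getLength(); i++) {
                    String childId = id + "_" + i;
                    emit(out, id, "child", childId);
                    emit(out, childId, "index", literal(String.valueOf(i)));
                    walk(children.item(i), childId, out);
                }
                break;
            }
            case Node.TEXT_NODE:
            case Node.CDATA_SECTION_NODE:
                emit(out, id, "kind", iri("Text"));
                emit(out, id, "data", literal(node.getNodeValue()));
                break;
            case Node.COMMENT_NODE:
                emit(out, id, "kind", iri("Comment"));
                emit(out, id, "data", literal(node.getNodeValue()));
                break;
            default:
                break; // other node types are skipped in this sketch
        }
    }

    private static void emit(List<String> out, String s, String p, String o) {
        out.add(s + " <" + EX + p + "> " + o + " .");
    }

    private static String iri(String localName) {
        return "<" + EX + localName + ">";
    }

    // Escapes the characters that must not appear raw in a quoted literal.
    private static String literal(String value) {
        String escaped = value.replace("\\", "\\\\").replace("\"", "\\\"")
                .replace("\n", "\\n").replace("\r", "\\r");
        return "\"" + escaped + "\"";
    }

    public static void main(String[] args) {
        String xml = "<book id=\"b1\"><!-- draft --><title>Dune</title></book>";
        toTriples(xml).forEach(System.out::println);
    }
}
```

Deriving each node label from its position in the tree keeps the output repeatable for a given input, which is one simple way to get the deterministic behaviour this README describes; the actual transformer may mint its resources differently.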

Contributions

Suggestions are welcome in the discussions.

Please follow the existing code structure.

Legal and license

The code within this repository is made available under the GNU Affero General Public License version 3.0.

Contact

Open a discussion topic in this repository with reproduction steps, sample input, and expected output. This project is maintained on a best-effort basis, so expect delays in answers to discussion posts and in reviews of requests and issues.

For consultancy on implementation, additional features or a full data-to-linked-data integration process, please contact a sales representative through the form on www.semmtech.com.
