A focused, zero-configuration utility built on Apache Jena and the W3C DOM specification to deterministically transform any XML file into a structured RDF graph.
This tool solves the challenge of integrating structured XML data into the Semantic Web. By using a deterministic DOM4 mapping, it converts the physical structure of your XML document (elements, attributes, text) into discoverable, linkable, and queryable RDF triples.
- Semantic Integration: Make existing XML data consumable by triple stores and knowledge graphs.
- Structural Analysis: Use SPARQL to query the exact structure (node positions, attributes, namespaces) of your XML documents.
- Pipeline Integration: Easily integrate into data processing workflows as a standalone CLI or Docker image.
This converter is a domain-agnostic structural transformer that focuses purely on the physical representation of the XML document.
It parses an input XML document and produces an RDF model that represents:
- Core Structure: DOM nodes as RDF resources (elements, attributes, comments, text).
- Metadata: XML declaration, encoding, node relationships, and simple typed metadata (indexes, URIs).
- Vocabularies: Dedicated vocabularies are used for the mapping:
  - W3CDOM4: Terms for the DOM structure (elements, attributes, etc.).
  - W3CXML: Terms for the XML declaration and related metadata.
Note: This tool is intended as a structural converter, not an ontology mapper or domain-specific extractor. It maps structure, not meaning.
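To make the list above concrete, here is a minimal sketch of the structural information a W3C DOM exposes for a small document: the XML declaration metadata, the kind of each node, and its position among its siblings. It uses only the JDK's built-in DOM parser. The class DomInventorySketch and its methods are hypothetical names chosen for illustration and are not part of this repository; the real converter records this information as RDF using the W3CDOM4 and W3CXML vocabularies, which this sketch does not attempt to reproduce.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical illustration class; not part of the xml2rdf code base.
final class DomInventorySketch {

    // Parse an XML string into a namespace-aware W3C DOM document.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Summarize the XML declaration metadata exposed by DOM Level 3.
    static String declaration(Document doc) {
        return "version=" + doc.getXmlVersion()
                + " encoding=" + doc.getXmlEncoding()
                + " standalone=" + doc.getXmlStandalone();
    }

    // Map the DOM node type constant to a readable label.
    static String kindOf(Node node) {
        switch (node.getNodeType()) {
            case Node.ELEMENT_NODE: return "element";
            case Node.TEXT_NODE: return "text";
            case Node.CDATA_SECTION_NODE: return "cdata";
            case Node.COMMENT_NODE: return "comment";
            default: return "other";
        }
    }

    // List every node in document order with its kind, name and sibling index.
    static List<String> inventory(Document doc) {
        List<String> out = new ArrayList<>();
        NodeList top = doc.getChildNodes();
        for (int i = 0; i < top.getLength(); i++) {
            collect(top.item(i), i, out);
        }
        return out;
    }

    private static void collect(Node node, int index, List<String> out) {
        out.add(kindOf(node) + " " + node.getNodeName() + " index=" + index);
        // getAttributes() returns null for anything that is not an element.
        NamedNodeMap attrs = node.getAttributes();
        if (attrs != null) {
            for (int i = 0; i < attrs.getLength(); i++) {
                // Attributes are unordered in the DOM, so no index is reported.
                out.add("attribute " + attrs.item(i).getNodeName());
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), i, out);
        }
    }
}
```

Calling DomInventorySketch.inventory(DomInventorySketch.parse(xml)) on a one-line document such as `<menu><!--daily--><food id="1">Waffles</food></menu>` lists the menu element, the comment, the food element, its id attribute, and the text node, in that order. Note that pretty-printed XML also contains whitespace-only text nodes, which a purely structural mapping would see as well.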
Key components
- Converter entrypoint: com.semmtech.transform.xml2rdf.TransformCommand, a CLI program that parses arguments, calls the converter, and writes RDF output.
- Core translator: com.semmtech.transform.xml2rdf.Xml2RdfTransformer, which builds the Jena Model, traverses the DOM, and emits RDF.
- Vocabularies:
  - com.semmtech.transform.xml2rdf.vocabulary.W3CDOM4: DOM4 RDF terms used in the output.
  - com.semmtech.transform.xml2rdf.W3CXML: XML declaration and related terms.
Files you will care about
- Running instructions: RUNNING.md — build and run examples (Docker and JAR).
- Example input: example.xml - example of a food menu in XML (from www.w3schools.com)
- Maven build / packaging: pom.xml
- Container: Dockerfile
- TransformCommand parses an XML file into a W3C DOM and calls the transform() method on the transformer.
- Xml2RdfTransformer initializes a Jena model, adds ontology imports for the DOM4/XML vocabularies, and walks the DOM:
  - Elements → DOM4 element resources
  - Attributes → DOM4 attr resources
  - Text / Comments → character data nodes
- Vocabularies are defined within the com.semmtech.transform.xml2rdf.vocabulary package.
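The flow described above (parse to a DOM, then walk it and emit one resource per node) can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation: it uses only the JDK, writes N-Triples-style strings instead of building a Jena Model, and the namespace urn:example:dom# together with the property names (type, name, value, data, ownerElement, parentNode) are placeholders rather than the actual W3CDOM4 terms. The class name DomToTriplesSketch is likewise hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration class; the real tool builds a Jena Model instead.
final class DomToTriplesSketch {
    // Placeholder namespace, NOT the W3CDOM4 vocabulary used by the converter.
    static final String EX = "urn:example:dom#";

    private final List<String> triples = new ArrayList<>();
    private int counter = 0;

    // Parse the XML, walk the DOM from the root element, return the triples.
    static List<String> transform(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            DomToTriplesSketch sketch = new DomToTriplesSketch();
            sketch.visit(doc.getDocumentElement(), null);
            return sketch.triples;
        } catch (Exception e) {
            throw new IllegalStateException("Could not transform XML", e);
        }
    }

    // Emit triples for one node, then recurse into its children.
    private void visit(Node node, String parentId) {
        String id = "_:n" + counter++; // one blank node per DOM node
        switch (node.getNodeType()) {
            case Node.ELEMENT_NODE:
                emit(id, "type", "<" + EX + "Element>");
                emit(id, "name", literal(node.getNodeName()));
                NamedNodeMap attrs = node.getAttributes();
                for (int i = 0; i < attrs.getLength(); i++) {
                    Node attr = attrs.item(i);
                    String attrId = "_:n" + counter++;
                    emit(attrId, "type", "<" + EX + "Attr>");
                    emit(attrId, "name", literal(attr.getNodeName()));
                    emit(attrId, "value", literal(attr.getNodeValue()));
                    emit(attrId, "ownerElement", id);
                }
                NodeList children = node.getChildNodes();
                for (int i = 0; i < children.getLength(); i++) {
                    visit(children.item(i), id);
                }
                break;
            case Node.TEXT_NODE:
            case Node.CDATA_SECTION_NODE:
                emit(id, "type", "<" + EX + "Text>");
                emit(id, "data", literal(node.getNodeValue()));
                break;
            case Node.COMMENT_NODE:
                emit(id, "type", "<" + EX + "Comment>");
                emit(id, "data", literal(node.getNodeValue()));
                break;
            default:
                return; // other node kinds are skipped in this sketch
        }
        if (parentId != null) {
            emit(id, "parentNode", parentId);
        }
    }

    private void emit(String subject, String property, String object) {
        triples.add(subject + " <" + EX + property + "> " + object + " .");
    }

    // Quote a string as an N-Triples literal, escaping special characters.
    private static String literal(String value) {
        return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"")
                .replace("\n", "\\n").replace("\r", "\\r") + "\"";
    }
}
```

For an input like `<food id="1">Waffles</food>`, DomToTriplesSketch.transform produces one blank node each for the element, the attribute, and the text node, linked by the placeholder ownerElement and parentNode properties. Because node identifiers are assigned in document order, the same input always yields the same output, which is the sense in which such a mapping is deterministic.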
Suggestions are welcome in the discussions.
Please follow the existing code structure:
- CLI in com.semmtech.transform.xml2rdf.TransformCommand; and
- core logic in com.semmtech.transform.xml2rdf.Xml2RdfTransformer.
The code within this repository is made available under the GNU Affero General Public License version 3.0.
Open a discussion topic in this repository with reproduction steps and sample input & expected output. Keep in mind that this project is maintained on a best-effort basis whenever someone has time available. This also means that there will likely be delays in answering discussion posts and reviewing requests and issues.
For consultancy on implementation, additional features or a full data-to-linked-data integration process, please contact a sales representative through the form on www.semmtech.com.
- Running guide: RUNNING.md
- Docker image: https://hub.docker.com/r/semmtech/xml-to-rdf
- Java code: src/main/java/com/semmtech/transform/xml2rdf/
- Example input/output:
- Apache Jena: https://jena.apache.org/index.html