metaphacts ETL pipeline

The Extract-Transform-Load (ETL) pipeline provides a means to convert structured data to RDF, perform post-processing steps, and ingest it into a graph database.

The pipeline follows the principles described in Concepts and is based on an opinionated selection of components and tools:

  • Amazon Web Services (AWS) as cloud environment
  • a selection of AWS services such as S3, CloudFormation, StepFunctions, Lambda, EC2, etc. for various parts
  • RDF Mapping Language (RML) as declarative mapping language with Carml as mapping engine
  • Ontotext GraphDB as RDF database. Please note that a valid GraphDB license is required to run the pipeline end-to-end; the license is needed for the data ingestion part of the pipeline (i.e. to load the data into GraphDB).
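To illustrate the declarative mapping step, the following is a minimal RML sketch (not taken from this repository) of the kind of mapping Carml executes. The file name products.csv, the example.org namespace, and the id/name column names are hypothetical:

```turtle
@prefix rr:  <http://www.w3.org/ns/r2rml#> .
@prefix rml: <http://semweb.mmlab.be/ns/rml#> .
@prefix ql:  <http://semweb.mmlab.be/ns/ql#> .
@prefix ex:  <http://example.org/> .

# Map each row of a hypothetical products.csv to an ex:Product resource.
ex:ProductMapping a rr:TriplesMap ;
  rml:logicalSource [
    rml:source "products.csv" ;
    rml:referenceFormulation ql:CSV
  ] ;
  rr:subjectMap [
    rr:template "http://example.org/product/{id}" ;
    rr:class ex:Product
  ] ;
  rr:predicateObjectMap [
    rr:predicate ex:name ;
    rr:objectMap [ rml:reference "name" ]
  ] .
```

Each row yields one subject IRI built from the id column, typed as ex:Product, with the name column attached as a literal.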

Features

The ETL pipeline has the following features:

  • read source files from an S3 bucket
  • convert source files to RDF using RML mappings
  • supported formats are CSV, XML, JSON, and JSONL, also in compressed (gzipped) form
  • the RDF files are written to an S3 bucket, one RDF file per source file
  • the RDF files are ingested into a graph database using the GraphDB Preload tool
  • files added to the source bucket after the initial ingestion are processed as incremental updates

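As a sketch of how an incremental update could be triggered, the commands below compress a new source file and copy it into the source bucket with the AWS CLI. The bucket name and key prefix are hypothetical; use the source bucket and layout from your own deployment:

```shell
# Hypothetical bucket name; substitute the source bucket created
# for your deployment.
SOURCE_BUCKET=s3://example-etl-source-bucket

# Compress the new source file (keeping the original with -k) and
# upload it; the pipeline detects it and ingests it incrementally.
gzip -k data/products.csv
aws s3 cp data/products.csv.gz "$SOURCE_BUCKET/csv/"
```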
Setup and Operation

See ETL Pipeline Setup for how to set up and run the pipeline.

Architecture

See Architecture for a diagram and a detailed description of the ETL pipeline's architecture.

Copyright

All content in this repository is (c) 2023 by metaphacts.
