This module provides tools for ingesting structured datasets from YAML data hierarchy files through the OpenGIN services. It processes hierarchical data structures (ministers, departments, categories, subcategories, and datasets) and creates corresponding entities and relationships in the OpenGIN system.
The ingestion system reads YAML data hierarchy files that describe the structure of datasets organized in a flexible hierarchy:
- Ministers → Categories → Subcategories → Datasets
- Ministers → Categories → Datasets
- Ministers → Departments → Categories → Subcategories → Datasets
- Ministers → Departments → Categories → Datasets
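A minimal manifest following the first shape might look like the sketch below; the key names are illustrative assumptions, since the actual schema is defined by the script's YAML parser:

```yaml
# Hypothetical manifest sketch for the Ministers → Categories → Subcategories → Datasets shape.
# Key names are assumptions, not the parser's actual schema.
ministers:
  - name: Minister of Health
    categories:
      - name: Public Health
        subcategories:
          - name: Immunization
            datasets:
              - statistics/immunization_coverage
```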
It also supports ingesting citizen profiles: personal datasets attached to individual citizen entities. Profiles are year-independent and are processed separately using the --profiles flag.
Each dataset is stored as a JSON file and is ingested as an attribute on the appropriate parent entity (category, subcategory, or citizen).
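For illustration, a dataset JSON file might hold tabular rows as sketched below; the exact schema is an assumption here (the real one is defined in models/schema.py):

```json
{
  "rows": [
    { "district": "Colombo", "value": "1234" },
    { "district": "Galle", "value": "987" }
  ]
}
```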
Before running the ingestion script, ensure you have completed the following setup steps:
Make sure the OpenGIN services are up and running. The ingestion script requires:
- Read Service: For querying existing entities and relationships
- Ingestion Service: For creating and updating entities
Restore the 0.0.1 data backup to ensure you have the base entities (ministers, departments, etc.) that the ingestion script will reference and build upon.
Create a virtual environment and install the required dependencies.
Option 1: Using Mamba or Conda (Recommended)
# Create the environment from environment.yml
mamba env create -f environment.yml
# Or using Conda:
# conda env create -f environment.yml
# Activate the environment
mamba activate datasets_env
# Or using Conda:
# conda activate datasets_env

Option 2: Using Python venv
# Create a virtual environment
python -m venv venv
# Activate the virtual environment
# On macOS/Linux:
source venv/bin/activate
# On Windows:
# venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt

The ingestion script requires the following environment variables:
- READ_BASE_URL: Base URL for the OpenGin Read Service
- INGESTION_BASE_URL: Base URL for the OpenGin Ingestion Service
You can set these in your environment or create a .env file in the ingestion/ directory:
export READ_BASE_URL="http://localhost:8081"
export INGESTION_BASE_URL="http://localhost:8080"

Or create a .env file:
READ_BASE_URL=http://localhost:8081
INGESTION_BASE_URL=http://localhost:8080
If using a .env file, make sure you have python-dotenv installed (included in requirements.txt).
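As a sketch, loading these settings might look like the following; the try/except fallback and the default URLs are assumptions mirroring the examples above:

```python
import os

try:
    # python-dotenv is listed in requirements.txt; a missing install simply
    # falls back to variables already exported in the shell.
    from dotenv import load_dotenv
    load_dotenv()  # reads a .env file from the current working directory
except ImportError:
    pass

# Defaults here are assumptions mirroring the examples above.
READ_BASE_URL = os.getenv("READ_BASE_URL", "http://localhost:8081")
INGESTION_BASE_URL = os.getenv("INGESTION_BASE_URL", "http://localhost:8080")
```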
Once all prerequisites are met, you can run the ingestion script:
# From the project root directory
python -m ingestion.ingest_data_yaml data/statistics/2020/data_hierarchy_2020.yaml
# Or with an explicit year override
python -m ingestion.ingest_data_yaml data/statistics/2020/data_hierarchy_2020.yaml --year 2020
# Ingest citizen profiles (no year required)
python -m ingestion.ingest_data_yaml data/people/profiles_hierarchy.yaml --profiles

| Argument | Type | Description |
|---|---|---|
| yaml_file | required | Path to the YAML manifest file |
| --year | optional | Override the year extracted from the filename |
| --profiles | optional flag | Process a profiles YAML; skips year extraction and processes only citizen entries |
Used for ingesting statistical datasets organised by ministers, departments, and categories. A year is required (either extracted from the filename or provided via --year).
# Ingest 2020 data
python -m ingestion.ingest_data_yaml data/statistics/2020/data_hierarchy_2020.yaml
# Ingest 2021 data
python -m ingestion.ingest_data_yaml data/statistics/2021/data_hierarchy_2021.yaml

In this mode the script will:
- Extract the year from the filename (or use --year)
- Process governments, ministers, departments, categories, and their datasets
Used for ingesting citizen profile datasets. Profiles are personal data records (e.g. name, political party, date of birth) attached to existing citizen entities. They are not year-specific.
# Ingest citizen profiles
python -m ingestion.ingest_data_yaml data/people/profiles_hierarchy.yaml --profiles

In this mode the script will:
- Skip year extraction entirely
- Skip government and minister processing
- Only process citizen entries and attach their profile datasets as attributes
Each citizen entry in the profiles YAML can be identified by name or id. If both are provided, the id takes priority.
citizen:
  # By name: looks up the citizen entity by name
  - name: Harini Amarasuriya
    profile:
      - profiles/Harini Amarasuriya

  # By id: looks up the citizen entity by ID, resolves the name from the returned entity
  - id: 2151-38_cit_14
    profile:
      - profiles/Anura Karunathilaka

  # Both: id takes priority, name is ignored for lookup
  - id: 2151-38_cit_14
    name: Anura Karunathilaka
    profile:
      - profiles/Anura Karunathilaka

Note
The citizen entity must already exist in the system. If it cannot be found by the given name or id, the entry is skipped with an error log. When looking up by id, the entity must be of kind Person/citizen; entries of any other kind are also skipped.
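The lookup-priority rules above can be sketched as a small helper; the function and field names here are assumptions, not the script's actual code:

```python
def resolve_citizen(entry, lookup_by_id, lookup_by_name):
    """Resolve a citizen entry from the profiles YAML (illustrative helper).

    id takes priority over name; entities found by id must be of kind
    Person/citizen. Returns None when the entry should be skipped.
    """
    if entry.get("id"):
        entity = lookup_by_id(entry["id"])
        if entity is None or entity.get("kind") != "Person/citizen":
            return None  # not found, or wrong kind: skipped with an error log
        return entity
    if entry.get("name"):
        return lookup_by_name(entry["name"])  # may also return None → skipped
    return None
```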
- Parse YAML Manifest: Reads the YAML file to extract the hierarchical structure
- Find Entities: Uses the Read Service to find existing ministers and departments by name and year
- Create Categories: Creates category and subcategory entities as needed
- Process Datasets: Reads dataset JSON files and adds them as attributes to parent entities
- Create Relationships: Establishes relationships between entities (e.g., AS_CATEGORY)
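The steps above can be sketched roughly as follows; the callables stand in for the Read and Ingestion service clients, and all field names are illustrative assumptions rather than the script's real API:

```python
import json

def ingest_hierarchy(manifest, year, find_entity, ensure_category, link, add_attribute):
    """Illustrative walk of the statistics-mode flow (not the actual script)."""
    for minister in manifest.get("ministers", []):
        minister_id = find_entity(minister["name"], year)   # step 2: look up by name and year
        if minister_id is None:
            continue                                        # unresolved ministers are skipped
        for category in minister.get("categories", []):
            cat_id = ensure_category(category["name"])      # step 3: create if absent
            link(minister_id, cat_id, "AS_CATEGORY")        # step 5: relationship
            for path in category.get("datasets", []):
                with open(path) as f:                       # step 4: dataset JSON becomes an attribute
                    add_attribute(cat_id, json.load(f))
```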
- Parse YAML Manifest: Reads the profiles YAML file to extract citizen entries
- Find Citizens: Uses the Read Service to look up existing citizen entities, by id if provided, otherwise by name
- Process Profiles: Reads each citizen's profile data.json and attaches it as an attribute on the citizen entity using the citizen's own start/end time period
ingestion/
├── ingest_data_yaml.py # Main ingestion script
├── .env # Environment variables
├── exception/ # Exception handling
│ └── exceptions.py
├── models/ # Data models and schemas
│ └── schema.py
├── services/ # Service layer
│ ├── entity_resolver.py # Entity lookup and resolution
│ ├── ingestion_service.py # OpenGin Ingestion API client
│ ├── read_service.py # OpenGin Read API client
│ └── yaml_parser.py # YAML parsing utilities
├── utils/ # Utility functions
│ ├── date_utils.py # Date/time calculations
│ ├── http_client.py # HTTP client for API calls
│ └── util_functions.py # General utilities
└── requirements.txt # Python dependencies
Make sure you're running the command from the project root directory, not from within the ingestion/ folder.
If you encounter import errors, ensure all dependencies are installed:
pip install -r requirements.txt

The script will exit with an error if READ_BASE_URL or INGESTION_BASE_URL is not set. Make sure these are configured before running.
If you see connection errors, verify that:
- OpenGin services are running
- The base URLs in your environment variables are correct
- Your network/firewall allows connections to these services
- The script processes ministers and citizens sequentially
- Datasets are validated before ingestion; null values in rows are automatically converted to empty strings
- The script handles time period calculations for attributes based on parent entity time ranges and dataset years
- Categories and subcategories are checked for existence before creation to avoid duplicates
- In profiles mode, citizen entities must already exist in the system — if a citizen is not found, it is skipped with an error log
- If multiple citizen entities are found with the same name, a warning is logged and the first result is used
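Two of these behaviours, the null-value conversion and the duplicate-name rule, can be sketched as small helpers (illustrative only; names are not the script's actual API):

```python
import logging

def normalize_rows(rows):
    """Replace null (None) cell values with empty strings before ingestion."""
    return [
        {key: ("" if value is None else value) for key, value in row.items()}
        for row in rows
    ]

def pick_citizen(results, name):
    """Apply the duplicate-name rule: warn and use the first match."""
    if not results:
        logging.error("Citizen %r not found; skipping entry", name)
        return None
    if len(results) > 1:
        logging.warning("Multiple citizens named %r; using the first match", name)
    return results[0]
```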