[DCP] Adds JSON-LD Support to the ingestion pipeline#495
[DCP] Adds JSON-LD Support to the ingestion pipeline#495gmechali wants to merge 13 commits intodatacommonsorg:masterfrom
Conversation
Not up to standards ⛔🔴 Issues
|
| Category | Results |
|---|---|
| BestPractice | 2 medium |
| CodeStyle | 7 minor |
| Complexity | 6 medium |
🟢 Metrics 73 complexity · 6 duplication
Metric Results Complexity 73 Duplication 6
TIP This summary will be updated as you push new changes. Give us feedback
There was a problem hiding this comment.
Code Review
This pull request adds support for JSON-LD and JSON-LD templates to the ingestion pipeline and command-line tools. Key changes include the introduction of JsonLdParser and JsonLdTemplateParser, refactoring FileGroup to support multiple formats, and updating the Beam pipeline to handle JSON-LD inputs. Review feedback identifies critical concerns regarding memory usage with large files, resource leaks from unclosed streams, and the need to handle single JSON-LD objects. Suggestions also include improving logging and ensuring file path checks accommodate sharded files.
pipeline/util/src/main/java/org/datacommons/ingestion/util/PipelineUtils.java
Show resolved
Hide resolved
pipeline/util/src/main/java/org/datacommons/ingestion/util/PipelineUtils.java
Outdated
Show resolved
Hide resolved
util/src/main/java/org/datacommons/util/parser/jsonld/JsonLdParser.java
Outdated
Show resolved
Hide resolved
pipeline/ingestion/src/main/java/org/datacommons/ingestion/pipeline/GraphIngestionPipeline.java
Outdated
Show resolved
Hide resolved
util/src/main/java/org/datacommons/util/parser/jsonld/JsonLdParser.java
Outdated
Show resolved
Hide resolved
util/src/main/java/org/datacommons/util/jsonld/JsonLdParser.java
Outdated
Show resolved
Hide resolved
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces comprehensive support for JSON-LD and JSON-LD Template + CSV files within the ingestion pipeline and linting tool. Key changes include refactoring the FileGroup mechanism using a strategy pattern to handle different input file types, adding new parsers (JsonLdParser, JsonLdTemplateParser), and updating pipeline logic to process these new formats. New tests for JSON-LD parsing are also included. The review identified several issues, including resource leaks in PipelineUtils and Processor when handling file streams, a package declaration mismatch, and the presence of debug/error print statements that should be replaced with proper logging. Additionally, a redundant JsonLdOptions object was noted.
pipeline/util/src/main/java/org/datacommons/ingestion/util/PipelineUtils.java
Outdated
Show resolved
Hide resolved
util/src/main/java/org/datacommons/util/parser/jsonld/JsonLdParser.java
Outdated
Show resolved
Hide resolved
util/src/main/java/org/datacommons/util/parser/jsonld/JsonLdParser.java
Outdated
Show resolved
Hide resolved
util/src/main/java/org/datacommons/util/parser/jsonld/JsonLdParser.java
Outdated
Show resolved
Hide resolved
util/src/main/java/org/datacommons/util/parser/jsonld/JsonLdTemplateParser.java
Outdated
Show resolved
Hide resolved
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces support for JSON-LD and JSON-LD templates across the ingestion pipeline and command-line tools. It includes new parsers, a refactored file grouping logic using a strategy pattern, and updates to the Beam pipeline to process these formats. Feedback highlights a critical issue in the template parser where local IDs in property values are not correctly resolved with row-specific suffixes, which would break node relationships. Other improvements include addressing potential encoding inconsistencies in the processor, removing redundant property assignments, and hardening the namespace stripping logic to prevent invalid empty property names.
util/src/main/java/org/datacommons/util/parser/jsonld/JsonLdTemplateParser.java
Outdated
Show resolved
Hide resolved
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces support for JSON-LD input files by implementing a new JsonLdParser and refactoring the file grouping logic into a strategy-based pattern. Key changes include updates to the ingestion pipeline to handle .jsonld extensions, the addition of JsonLdFileGroup and McfFileGroup subclasses, and new unit tests for JSON-LD parsing. Feedback focuses on fixing a potential ClassCastException in Processor.java when handling non-MCF groups, correcting package declarations that do not match the directory structure, and restoring the logic to automatically detect tab delimiters for .tsv files.
| } | ||
|
|
||
| GraphSupplier parser; | ||
| McfFileGroup mcfGroup = (McfFileGroup) args.fileGroup; |
There was a problem hiding this comment.
This unconditional cast to McfFileGroup will cause a ClassCastException if the input includes a .jsonld file, as FileGroup.build will return a JsonLdFileGroup in that case. Since JsonLdFileGroup can also contain CSV files (as seen in JsonLdParserStrategy), this method should handle both types or verify the type before casting. When verifying if a file is an MCF file, ensure you use contains('.mcf') instead of endsWith('.mcf') to correctly handle sharded files. Additionally, if a .jsonld file is intended to be used as a template for the CSV, the logic to retrieve it and pass it to the parser is currently missing for JsonLdFileGroup.
References
- To handle sharded files, check for MCF files using
contains('.mcf')on the file path, notendsWith('.mcf').
There was a problem hiding this comment.
We arent yet support .jsonld in the tool processor and unclear that we ever will.
| @@ -0,0 +1,30 @@ | |||
| package org.datacommons.util; | |||
There was a problem hiding this comment.
The package declaration org.datacommons.util does not match the directory structure org/datacommons/util/parser/strategy/. This violates standard Java project structure and can lead to issues with build tools and IDEs. Please update the package declaration to org.datacommons.util.parser.strategy and add the necessary imports in FileGroup.java.
| package org.datacommons.util; | |
| package org.datacommons.util.parser.strategy; |
| } | ||
| } | ||
|
|
||
| char delim = overrideDelimiter == null ? ',' : overrideDelimiter; |
There was a problem hiding this comment.
This change introduces a regression where the default delimiter is always a comma, even if the input files are .tsv. The previous logic in FileGroup.build correctly detected .tsv files and set the default delimiter to a tab character. Note that contains() should be used instead of endsWith() to support sharded files.
| char delim = overrideDelimiter == null ? ',' : overrideDelimiter; | |
| boolean hasTsv = files.stream().anyMatch(f -> f.getName().toLowerCase().contains(".tsv")); | |
| char delim = overrideDelimiter == null ? (hasTsv ? '\t' : ',') : overrideDelimiter; |
References
- To handle sharded files, use contains() for extension checks instead of endsWith().
util/src/main/java/org/datacommons/util/parser/strategy/McfParserStrategy.java
Outdated
Show resolved
Hide resolved
util/src/main/java/org/datacommons/util/parser/strategy/McfParserStrategy.java
Outdated
Show resolved
Hide resolved
util/src/main/java/org/datacommons/util/parser/strategy/ParserStrategy.java
Outdated
Show resolved
Hide resolved
dwnoble
left a comment
There was a problem hiding this comment.
Nit on the input JSON-LD files: all property names and non-literal values should either be namespaced or have the values explicitly defined in the @context
in the linked example, name , value, and StatVarObservation are missing @context definitions, and the "schema" and "dcid" namespaces should be defined in the context. I also think we should use the datacommons.org definitions for observationDate, observationAbout, variableMeasured, and value because there are slight differences between the schema.org and datacommons.org definitions 😅
the way you define properties like "variableMeasured": "http://schema.org/variableMeasured", is totally fine (minus the nitpick about using datacommons.org instead of schema.org), but personally i like to just define the namespaces and use those everywhere in the @graph definition like this:
{
"@context": {
"schema": "https://schema.org/",
"dcid": "https://datacommons.org/browser/"
},
"@graph": [
{
"@id": "dcid:TestNode",
"@type": "schema:Thing",
"schema:name": "Test Node Name"
},
{
"@id": "dcid:TestObservation_2020",
"@type": "dcid:StatVarObservation",
"dcid:observationAbout": {
"@id": "dcid:TestNode"
},
"dcid:variableMeasured": {
"@id": "dcid:Count_Person"
},
"dcid:value": 1000,
"dcid:observationDate": "2020"
}
]
(you can see the differences between those properties by going to, for example, https://schema.org/observationAbout and https://datacommons.org/browser/observationAbout)
…move the namespace in JSON-LD, note the MCF import does not yet support the namespace. We should ensure these will be supported together.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces support for JSON-LD input files within the ingestion pipeline and the command-line tool. Key changes include refactoring the FileGroup logic into a strategy-based pattern, adding a new JsonLdParser, and updating the GraphIngestionPipeline and Processor to handle multiple input formats. Review feedback highlights a potential OutOfMemoryError when parsing large JSON-LD files in memory and an unsafe type cast in the Processor class that could lead to a ClassCastException. Additionally, the reviewer suggests using contains instead of endsWith for file path checks to support sharded files, expanding prefix stripping to include non-HTTPS schema URIs, and ensuring that @type and @language tags are preserved during JSON-LD value processing.
| public void processElement( | ||
| @Element FileIO.ReadableFile file, OutputReceiver<McfGraph> receiver) { | ||
| try (InputStream is = Channels.newInputStream(file.open())) { | ||
| McfGraph graph = JsonLdParser.parse(is); |
| } | ||
|
|
||
| GraphSupplier parser; | ||
| McfFileGroup mcfGroup = (McfFileGroup) args.fileGroup; |
There was a problem hiding this comment.
This cast to McfFileGroup is unsafe. If the input files are JSON-LD with associated CSVs, args.fileGroup will be an instance of JsonLdFileGroup, leading to a ClassCastException.
if (!(args.fileGroup instanceof McfFileGroup)) {
throw new IllegalArgumentException("processTable requires an McfFileGroup for TMCF/CSV processing.");
}
McfFileGroup mcfGroup = (McfFileGroup) args.fileGroup;
pipeline/util/src/main/java/org/datacommons/ingestion/util/PipelineUtils.java
Outdated
Show resolved
Hide resolved
| private static final String SCHEMA_PREFIX = "https://schema.org/"; | ||
|
|
||
| private static String stripPrefix(String value) { | ||
| if (value == null) { | ||
| return null; | ||
| } | ||
| if (value.startsWith(DCID_PREFIX)) { | ||
| return value.substring(DCID_PREFIX.length()); | ||
| } | ||
| if (value.startsWith(SCHEMA_PREFIX)) { | ||
| return value.substring(SCHEMA_PREFIX.length()); | ||
| } | ||
| return value; | ||
| } |
There was a problem hiding this comment.
The SCHEMA_PREFIX should also account for http://schema.org/ as many datasets still use the non-HTTPS version of the URI.
private static final String SCHEMA_PREFIX_HTTPS = "https://schema.org/";
private static final String SCHEMA_PREFIX_HTTP = "http://schema.org/";
private static String stripPrefix(String value) {
if (value == null) {
return null;
}
if (value.startsWith(DCID_PREFIX)) {
return value.substring(DCID_PREFIX.length());
}
if (value.startsWith(SCHEMA_PREFIX_HTTPS)) {
return value.substring(SCHEMA_PREFIX_HTTPS.length());
}
if (value.startsWith(SCHEMA_PREFIX_HTTP)) {
return value.substring(SCHEMA_PREFIX_HTTP.length());
}
return value;
}| Map<?, ?> map = (Map<?, ?>) item; | ||
| if (map.containsKey("@value")) { | ||
| Object val = map.get("@value"); | ||
| addProperty(nodeBuilder, property, val.toString(), Mcf.ValueType.TEXT); |
util/src/main/java/org/datacommons/util/parser/strategy/JsonLdParserStrategy.java
Outdated
Show resolved
Hide resolved
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
This PR adds JSON-LD support for the ingestion pipeline. The first stage of the pipeline is responsible for converting from MCF and TMCF to an MCFGraph object. we now also add the ability to convert from JSON-LD and CSV with a .jsonld template to the same MCFGraph object.
We also refactored the MCF parsing logic into a util/parser subdirectory.
Added some unit tests.
We verified the behavior by running the pipeline with following two MCF, JSONLD.
See the data in the tables under this Spanner DB