Skip to content

[DCP] Adds JSON-LD Support to the ingestion pipeline#495

Open
gmechali wants to merge 13 commits intodatacommonsorg:masterfrom
gmechali:jsonld
Open

[DCP] Adds JSON-LD Support to the ingestion pipeline#495
gmechali wants to merge 13 commits intodatacommonsorg:masterfrom
gmechali:jsonld

Conversation

@gmechali
Copy link
Copy Markdown
Contributor

@gmechali gmechali commented Apr 7, 2026

This PR adds JSON-LD support for the ingestion pipeline. The first stage of the pipeline is responsible for converting from MCF and TMCF to an MCFGraph object. we now also add the ability to convert from JSON-LD and CSV with a .jsonld template to the same MCFGraph object.

We also refactored the MCF parsing logic into a util/parser subdirectory.
Added some unit tests.
We verified the behavior by running the pipeline with following two MCF, JSONLD.

See the data in the tables under this Spanner DB

@codacy-production
Copy link
Copy Markdown

codacy-production bot commented Apr 7, 2026

Not up to standards ⛔

🔴 Issues 8 medium · 7 minor

Alerts:
⚠ 15 issues (≤ 0 issues of at least minor severity)

Results:
15 new issues

Category Results
BestPractice 2 medium
CodeStyle 7 minor
Complexity 6 medium

View in Codacy

🟢 Metrics 73 complexity · 6 duplication

Metric Results
Complexity 73
Duplication 6

View in Codacy

TIP This summary will be updated as you push new changes. Give us feedback

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds support for JSON-LD and JSON-LD templates to the ingestion pipeline and command-line tools. Key changes include the introduction of JsonLdParser and JsonLdTemplateParser, refactoring FileGroup to support multiple formats, and updating the Beam pipeline to handle JSON-LD inputs. Review feedback identifies critical concerns regarding memory usage with large files, resource leaks from unclosed streams, and the need to handle single JSON-LD objects. Suggestions also include improving logging and ensuring file path checks accommodate sharded files.

@gmechali
Copy link
Copy Markdown
Contributor Author

gmechali commented Apr 7, 2026

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces comprehensive support for JSON-LD and JSON-LD Template + CSV files within the ingestion pipeline and linting tool. Key changes include refactoring the FileGroup mechanism using a strategy pattern to handle different input file types, adding new parsers (JsonLdParser, JsonLdTemplateParser), and updating pipeline logic to process these new formats. New tests for JSON-LD parsing are also included. The review identified several issues, including resource leaks in PipelineUtils and Processor when handling file streams, a package declaration mismatch, and the presence of debug/error print statements that should be replaced with proper logging. Additionally, a redundant JsonLdOptions object was noted.

@gmechali
Copy link
Copy Markdown
Contributor Author

gmechali commented Apr 7, 2026

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for JSON-LD and JSON-LD templates across the ingestion pipeline and command-line tools. It includes new parsers, a refactored file grouping logic using a strategy pattern, and updates to the Beam pipeline to process these formats. Feedback highlights a critical issue in the template parser where local IDs in property values are not correctly resolved with row-specific suffixes, which would break node relationships. Other improvements include addressing potential encoding inconsistencies in the processor, removing redundant property assignments, and hardening the namespace stripping logic to prevent invalid empty property names.

@gmechali gmechali requested a review from dwnoble April 7, 2026 21:24
@gmechali
Copy link
Copy Markdown
Contributor Author

gmechali commented Apr 7, 2026

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for JSON-LD input files by implementing a new JsonLdParser and refactoring the file grouping logic into a strategy-based pattern. Key changes include updates to the ingestion pipeline to handle .jsonld extensions, the addition of JsonLdFileGroup and McfFileGroup subclasses, and new unit tests for JSON-LD parsing. Feedback focuses on fixing a potential ClassCastException in Processor.java when handling non-MCF groups, correcting package declarations that do not match the directory structure, and restoring the logic to automatically detect tab delimiters for .tsv files.

}

GraphSupplier parser;
McfFileGroup mcfGroup = (McfFileGroup) args.fileGroup;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This unconditional cast to McfFileGroup will cause a ClassCastException if the input includes a .jsonld file, as FileGroup.build will return a JsonLdFileGroup in that case. Since JsonLdFileGroup can also contain CSV files (as seen in JsonLdParserStrategy), this method should handle both types or verify the type before casting. When verifying if a file is an MCF file, ensure you use contains('.mcf') instead of endsWith('.mcf') to correctly handle sharded files. Additionally, if a .jsonld file is intended to be used as a template for the CSV, the logic to retrieve it and pass it to the parser is currently missing for JsonLdFileGroup.

References
  1. To handle sharded files, check for MCF files using contains('.mcf') on the file path, not endsWith('.mcf').

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We arent yet support .jsonld in the tool processor and unclear that we ever will.

@@ -0,0 +1,30 @@
package org.datacommons.util;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The package declaration org.datacommons.util does not match the directory structure org/datacommons/util/parser/strategy/. This violates standard Java project structure and can lead to issues with build tools and IDEs. Please update the package declaration to org.datacommons.util.parser.strategy and add the necessary imports in FileGroup.java.

Suggested change
package org.datacommons.util;
package org.datacommons.util.parser.strategy;

}
}

char delim = overrideDelimiter == null ? ',' : overrideDelimiter;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This change introduces a regression where the default delimiter is always a comma, even if the input files are .tsv. The previous logic in FileGroup.build correctly detected .tsv files and set the default delimiter to a tab character. Note that contains() should be used instead of endsWith() to support sharded files.

Suggested change
char delim = overrideDelimiter == null ? ',' : overrideDelimiter;
boolean hasTsv = files.stream().anyMatch(f -> f.getName().toLowerCase().contains(".tsv"));
char delim = overrideDelimiter == null ? (hasTsv ? '\t' : ',') : overrideDelimiter;
References
  1. To handle sharded files, use contains() for extension checks instead of endsWith().

@gmechali gmechali requested a review from vish-cs April 8, 2026 13:37
Copy link
Copy Markdown
Contributor

@dwnoble dwnoble left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit on the input JSON-LD files: all property names and non-literal values should either be namespaced or have the values explicitly defined in the @context

in the linked example, name , value, and StatVarObservation are missing @context definitions, and the "schema" and "dcid" namespaces should be defined in the context. I also think we should use the datacommons.org definitions for observationDate, observationAbout, variableMeasured, and value because there are slight differences between the schema.org and datacommons.org definitions 😅

the way you define properties like "variableMeasured": "http://schema.org/variableMeasured", is totally fine (minus the nitpick about using datacommons.org instead of schema.org), but personally i like to just define the namespaces and use those everywhere in the @graph definition like this:

{
  "@context": {
    "schema": "https://schema.org/",
    "dcid": "https://datacommons.org/browser/"
  },
  "@graph": [
    {
      "@id": "dcid:TestNode",
      "@type": "schema:Thing",
      "schema:name": "Test Node Name"
    },
    {
      "@id": "dcid:TestObservation_2020",
      "@type": "dcid:StatVarObservation",
      "dcid:observationAbout": {
        "@id": "dcid:TestNode"
      },
      "dcid:variableMeasured": {
        "@id": "dcid:Count_Person"
      },
      "dcid:value": 1000,
      "dcid:observationDate": "2020"
    }
]

(you can see the differences between those properties by going to, for example, https://schema.org/observationAbout and https://datacommons.org/browser/observationAbout)

gmechali and others added 4 commits April 14, 2026 17:13
…move the namespace in JSON-LD, note the MCF import does not yet support the namespace. We should ensure these will be supported together.
@gmechali
Copy link
Copy Markdown
Contributor Author

/gemini review

@gmechali gmechali requested a review from dwnoble April 14, 2026 21:56
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for JSON-LD input files within the ingestion pipeline and the command-line tool. Key changes include refactoring the FileGroup logic into a strategy-based pattern, adding a new JsonLdParser, and updating the GraphIngestionPipeline and Processor to handle multiple input formats. Review feedback highlights a potential OutOfMemoryError when parsing large JSON-LD files in memory and an unsafe type cast in the Processor class that could lead to a ClassCastException. Additionally, the reviewer suggests using contains instead of endsWith for file path checks to support sharded files, expanding prefix stripping to include non-HTTPS schema URIs, and ensuring that @type and @language tags are preserved during JSON-LD value processing.

public void processElement(
@Element FileIO.ReadableFile file, OutputReceiver<McfGraph> receiver) {
try (InputStream is = Channels.newInputStream(file.open())) {
McfGraph graph = JsonLdParser.parse(is);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Parsing the entire JSON-LD file into memory using JsonLdParser.parse(is) can lead to OutOfMemoryError for large input files. Consider implementing a streaming parser or processing the file in chunks if possible.

}

GraphSupplier parser;
McfFileGroup mcfGroup = (McfFileGroup) args.fileGroup;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This cast to McfFileGroup is unsafe. If the input files are JSON-LD with associated CSVs, args.fileGroup will be an instance of JsonLdFileGroup, leading to a ClassCastException.

    if (!(args.fileGroup instanceof McfFileGroup)) {
      throw new IllegalArgumentException("processTable requires an McfFileGroup for TMCF/CSV processing.");
    }
    McfFileGroup mcfGroup = (McfFileGroup) args.fileGroup;

Comment on lines +24 to +37
private static final String SCHEMA_PREFIX = "https://schema.org/";

private static String stripPrefix(String value) {
if (value == null) {
return null;
}
if (value.startsWith(DCID_PREFIX)) {
return value.substring(DCID_PREFIX.length());
}
if (value.startsWith(SCHEMA_PREFIX)) {
return value.substring(SCHEMA_PREFIX.length());
}
return value;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The SCHEMA_PREFIX should also account for http://schema.org/ as many datasets still use the non-HTTPS version of the URI.

  private static final String SCHEMA_PREFIX_HTTPS = "https://schema.org/";
  private static final String SCHEMA_PREFIX_HTTP = "http://schema.org/";

  private static String stripPrefix(String value) {
    if (value == null) {
      return null;
    }
    if (value.startsWith(DCID_PREFIX)) {
      return value.substring(DCID_PREFIX.length());
    }
    if (value.startsWith(SCHEMA_PREFIX_HTTPS)) {
      return value.substring(SCHEMA_PREFIX_HTTPS.length());
    }
    if (value.startsWith(SCHEMA_PREFIX_HTTP)) {
      return value.substring(SCHEMA_PREFIX_HTTP.length());
    }
    return value;
  }

Map<?, ?> map = (Map<?, ?>) item;
if (map.containsKey("@value")) {
Object val = map.get("@value");
addProperty(nodeBuilder, property, val.toString(), Mcf.ValueType.TEXT);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

When processing @value nodes, the parser currently ignores @type (datatype) and @language tags. This may result in loss of information for typed literals or multi-language strings.

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants