Draft
- Add `click>=8.0` dependency and hatchling build system to pyproject.toml
- Add `[project.scripts]` entry point: `umls-report = "src.scripts.umls_report:main"`
- Add `src/scripts/` package with shared CLI utilities (`common.py`) and `umls_report.py` CLI that queries Edge.parquet + Clique.parquet via DuckDB, filtering on `curie_prefix = 'UMLS'` for efficient lookup, and outputs a CSV with UMLS ID, URL, filename, clique leader, clique leader prefix, and Biolink type

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
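To make the row-level logic of the report concrete, here is a minimal Python sketch of how a CURIE prefix and the UMLS concept URL can be derived. It mirrors the `split_part(e.curie, ':', 2)` expression in the report query; the function names and the sample identifier used in testing are hypothetical and are not taken from the PR.

```python
# Base URL as it appears in the report query; the helper names are illustrative.
UMLS_CONCEPT_URL = "https://uts.nlm.nih.gov/uts/umls/concept/"


def curie_prefix(curie: str) -> str:
    """Return the part of a CURIE before the first colon."""
    return curie.split(":", 1)[0]


def umls_concept_url(curie: str) -> str:
    """Build the concept URL from the second colon-separated field.

    Like SQL split_part(curie, ':', 2), this yields an empty local ID
    when the CURIE has no second field.
    """
    parts = curie.split(":")
    local_id = parts[1] if len(parts) > 1 else ""
    return UMLS_CONCEPT_URL + local_id


def is_umls(curie: str) -> bool:
    """Python equivalent of the filter curie_prefix = 'UMLS'."""
    return curie_prefix(curie) == "UMLS"
```

In the PR itself the prefix is a precomputed column, so the filter does not need to parse each CURIE at query time; the sketch only shows what that column is expected to contain.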
Instead of inserting Edge rows and then doing two ALTER TABLE and a full-table UPDATE to populate clique_leader_prefix and curie_prefix, compute them in a CTE during the initial INSERT. This avoids a three-pass scan and matches how curie_prefix is already handled in the Node table.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
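The single-pass approach described in this commit can be sketched as follows. This is an illustration only: it uses `sqlite3` from the Python standard library instead of DuckDB so that it runs without extra dependencies, the `staging` table and the reduced `Edge` schema are assumptions, and `substr`/`instr` stand in for whatever prefix expression the actual code uses.

```python
import sqlite3


def load_edges(rows):
    """Insert (curie, clique_leader) rows, deriving both prefix columns
    in the same INSERT rather than via ALTER TABLE plus UPDATE."""
    conn = sqlite3.connect(":memory:")
    # Reduced, hypothetical schema: the prefix columns exist from the start.
    conn.execute("""
        CREATE TABLE Edge (
            curie TEXT,
            clique_leader TEXT,
            curie_prefix TEXT,
            clique_leader_prefix TEXT
        )
    """)
    # Hypothetical staging table standing in for the real input source.
    conn.execute("CREATE TEMP TABLE staging (curie TEXT, clique_leader TEXT)")
    conn.executemany("INSERT INTO staging VALUES (?, ?)", rows)
    # The CTE computes the prefixes once, while the rows are being inserted.
    conn.execute("""
        WITH prefixed AS (
            SELECT
                curie,
                clique_leader,
                substr(curie, 1, instr(curie, ':') - 1) AS curie_prefix,
                substr(clique_leader, 1, instr(clique_leader, ':') - 1)
                    AS clique_leader_prefix
            FROM staging
        )
        INSERT INTO Edge (curie, clique_leader, curie_prefix, clique_leader_prefix)
        SELECT curie, clique_leader, curie_prefix, clique_leader_prefix
        FROM prefixed
    """)
    conn.commit()
    return conn
```

Because the derived columns are filled as each row is written, no later statement has to rescan the table to populate them.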
Pull request overview
Copilot reviewed 7 out of 9 changed files in this pull request and generated 4 comments.
Comment on lines +6 to +8

    from src.util import get_config, get_logger

    logger = get_logger(__name__)
Comment on lines +40 to +86

    db = setup_duckdb(duckdb_file, {"memory_limit": memory_limit})

    edges = db.read_parquet(os.path.join(parquet_dir, "**/Edge.parquet"), hive_partitioning=True)
    cliques = db.read_parquet(os.path.join(parquet_dir, "**/Clique.parquet"), hive_partitioning=True)

    logger.info("Building UMLS report table...")
    db.sql("""
        CREATE TABLE umls_report AS
        SELECT
            e.curie AS umls_id,
            'https://uts.nlm.nih.gov/uts/umls/concept/' || split_part(e.curie, ':', 2) AS url,
            e.filename AS filename,
            e.clique_leader AS clique_leader,
            e.clique_leader_prefix AS clique_leader_prefix,
            c.biolink_type AS biolink_type
        FROM edges e
        JOIN cliques c
            ON e.clique_leader = c.clique_leader
            AND e.filename = c.filename
        WHERE e.curie_prefix = 'UMLS'
            AND e.conflation = 'None'
        ORDER BY e.curie, e.filename
    """)

    logger.info(f"Writing UMLS report to {output}...")
    db.sql("SELECT * FROM umls_report").write_csv(output)

    total_rows = db.sql("SELECT COUNT(*) FROM umls_report").fetchone()[0]
    unique_ids = db.sql("SELECT COUNT(DISTINCT umls_id) FROM umls_report").fetchone()[0]
    duplicates = db.sql("""
        SELECT COUNT(*) FROM (
            SELECT umls_id
            FROM umls_report
            GROUP BY umls_id
            HAVING COUNT(*) > 1
        )
    """).fetchone()[0]

    click.echo(f"Output written to: {output}")
    click.echo(f"Total UMLS ID occurrences (rows): {total_rows:,}")
    click.echo(f"Unique UMLS IDs: {unique_ids:,}")
    click.echo(f"UMLS IDs in more than one clique: {duplicates:,}")

    edges.close()
    cliques.close()
    db.close()
Comment on lines +69 to +81

    duplicates = db.sql("""
        SELECT COUNT(*) FROM (
            SELECT umls_id
            FROM umls_report
            GROUP BY umls_id
            HAVING COUNT(*) > 1
        )
    """).fetchone()[0]

    click.echo(f"Output written to: {output}")
    click.echo(f"Total UMLS ID occurrences (rows): {total_rows:,}")
    click.echo(f"Unique UMLS IDs: {unique_ids:,}")
    click.echo(f"UMLS IDs in more than one clique: {duplicates:,}")
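For readers comparing the three reported numbers, the following plain-Python sketch shows what each query counts. It is an assumption-laden illustration, not code from the PR: `summarize` is a hypothetical helper that takes the `umls_id` column as a list, with one entry per report row.

```python
from collections import Counter


def summarize(umls_ids):
    """Return (total_rows, unique_ids, duplicates) for a list of UMLS IDs.

    total_rows -> COUNT(*)
    unique_ids -> COUNT(DISTINCT umls_id)
    duplicates -> number of IDs whose group has more than one row
                  (GROUP BY umls_id HAVING COUNT(*) > 1)
    """
    counts = Counter(umls_ids)
    total_rows = sum(counts.values())
    unique_ids = len(counts)
    duplicates = sum(1 for n in counts.values() if n > 1)
    return total_rows, unique_ids, duplicates
```

Note that `duplicates` counts IDs with more than one row. That matches "UMLS IDs in more than one clique" only if each (UMLS ID, clique leader) pair appears in the report at most once; an ID that shows up with the same clique leader in two files would also be counted.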
WIP
TODO: