Add UMLS report generator#683

Draft
gaurav wants to merge 7 commits into master from add-umls-report-generator

@gaurav gaurav commented Mar 5, 2026

WIP

TODO:

gaurav and others added 7 commits February 23, 2026 18:12
- Add `click>=8.0` dependency and hatchling build system to pyproject.toml
- Add `[project.scripts]` entry point: `umls-report = "src.scripts.umls_report:main"`
- Add `src/scripts/` package with shared CLI utilities (`common.py`) and
  `umls_report.py` CLI that queries Edge.parquet + Clique.parquet via DuckDB,
  filtering on `curie_prefix = 'UMLS'` for efficient lookup, and outputs a CSV
  with UMLS ID, URL, filename, clique leader, clique leader prefix, and Biolink type

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Instead of inserting Edge rows and then doing two ALTER TABLE and a full-table
UPDATE to populate clique_leader_prefix and curie_prefix, compute them in a CTE
during the initial INSERT. This avoids a three-pass scan and matches how
curie_prefix is already handled in the Node table.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
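
A minimal sketch of the single-pass approach described in the commit message above: both prefix columns are derived in a CTE as part of the INSERT, so no later ALTER TABLE or full-table UPDATE is needed. It uses Python's standard-library `sqlite3` rather than DuckDB so that it runs without extra dependencies, and SQLite's `substr`/`instr` stand in for a prefix split. The `staging` table and the sample rows are hypothetical; only the column names come from the commit message.

```python
import sqlite3

# Hypothetical sample rows: (curie, clique_leader). Invented for illustration.
rows = [
    ("UMLS:C0000001", "MONDO:0000001"),
    ("MESH:D000001", "MONDO:0000001"),
    ("UMLS:C0000002", "UMLS:C0000002"),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE staging (curie TEXT, clique_leader TEXT)")
con.executemany("INSERT INTO staging VALUES (?, ?)", rows)

con.execute("""
    CREATE TABLE edge (
        curie TEXT,
        curie_prefix TEXT,
        clique_leader TEXT,
        clique_leader_prefix TEXT
    )
""")

# Single pass: compute both prefixes in a CTE while inserting,
# instead of adding the columns and back-filling them afterwards.
con.execute("""
    WITH prefixed AS (
        SELECT
            curie,
            substr(curie, 1, instr(curie, ':') - 1) AS curie_prefix,
            clique_leader,
            substr(clique_leader, 1, instr(clique_leader, ':') - 1)
                AS clique_leader_prefix
        FROM staging
    )
    INSERT INTO edge
    SELECT curie, curie_prefix, clique_leader, clique_leader_prefix
    FROM prefixed
""")

# The precomputed prefix column makes the UMLS filter a plain equality test.
umls_edges = con.execute("""
    SELECT curie, clique_leader_prefix
    FROM edge
    WHERE curie_prefix = 'UMLS'
    ORDER BY curie
""").fetchall()
print(umls_edges)
```

The same shape should carry over to DuckDB, where the prefix could be taken with `split_part(curie, ':', 1)`, though that variant is not tested here.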
@github-project-automation github-project-automation Bot moved this to Backlog in Babel sprints Mar 5, 2026
@gaurav gaurav requested a review from Copilot March 5, 2026 22:34
@gaurav gaurav moved this from Backlog to In progress in Babel sprints Mar 5, 2026

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.


Copilot AI left a comment


Pull request overview

Copilot reviewed 7 out of 9 changed files in this pull request and generated 4 comments.


Comment thread src/scripts/common.py
Comment on lines +6 to +8
from src.util import get_config, get_logger

logger = get_logger(__name__)
Comment on lines +40 to +86
db = setup_duckdb(duckdb_file, {"memory_limit": memory_limit})

edges = db.read_parquet(os.path.join(parquet_dir, "**/Edge.parquet"), hive_partitioning=True)
cliques = db.read_parquet(os.path.join(parquet_dir, "**/Clique.parquet"), hive_partitioning=True)

logger.info("Building UMLS report table...")
db.sql("""
    CREATE TABLE umls_report AS
    SELECT
        e.curie AS umls_id,
        'https://uts.nlm.nih.gov/uts/umls/concept/' || split_part(e.curie, ':', 2) AS url,
        e.filename AS filename,
        e.clique_leader AS clique_leader,
        e.clique_leader_prefix AS clique_leader_prefix,
        c.biolink_type AS biolink_type
    FROM edges e
    JOIN cliques c
        ON e.clique_leader = c.clique_leader
        AND e.filename = c.filename
    WHERE e.curie_prefix = 'UMLS'
        AND e.conflation = 'None'
    ORDER BY e.curie, e.filename
""")

logger.info(f"Writing UMLS report to {output}...")
db.sql("SELECT * FROM umls_report").write_csv(output)

total_rows = db.sql("SELECT COUNT(*) FROM umls_report").fetchone()[0]
unique_ids = db.sql("SELECT COUNT(DISTINCT umls_id) FROM umls_report").fetchone()[0]
duplicates = db.sql("""
    SELECT COUNT(*) FROM (
        SELECT umls_id
        FROM umls_report
        GROUP BY umls_id
        HAVING COUNT(*) > 1
    )
""").fetchone()[0]

click.echo(f"Output written to: {output}")
click.echo(f"Total UMLS ID occurrences (rows): {total_rows:,}")
click.echo(f"Unique UMLS IDs: {unique_ids:,}")
click.echo(f"UMLS IDs in more than one clique: {duplicates:,}")

edges.close()
cliques.close()
db.close()
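
To make the three summary figures in the hunk above concrete, here is a small sketch that runs the same counting queries on a toy `umls_report` table. It uses the standard-library `sqlite3` in place of DuckDB so it is self-contained; the rows, filenames and clique leaders are hypothetical. One thing it illustrates: the `HAVING COUNT(*) > 1` query counts UMLS IDs that occupy more than one report row, which only corresponds to "more than one clique" if each row represents a distinct clique.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE umls_report (umls_id TEXT, filename TEXT, clique_leader TEXT)"
)
# Hypothetical rows: the first UMLS ID appears under two different leaders.
con.executemany(
    "INSERT INTO umls_report VALUES (?, ?, ?)",
    [
        ("UMLS:C0000001", "Disease.txt", "MONDO:0000001"),
        ("UMLS:C0000001", "Phenotype.txt", "HP:0000001"),
        ("UMLS:C0000002", "Disease.txt", "MONDO:0000002"),
    ],
)

# Every row is one occurrence of a UMLS ID.
total_rows = con.execute("SELECT COUNT(*) FROM umls_report").fetchone()[0]

# Distinct UMLS IDs, regardless of how many rows each occupies.
unique_ids = con.execute(
    "SELECT COUNT(DISTINCT umls_id) FROM umls_report"
).fetchone()[0]

# UMLS IDs that occupy more than one row.
duplicates = con.execute("""
    SELECT COUNT(*) FROM (
        SELECT umls_id
        FROM umls_report
        GROUP BY umls_id
        HAVING COUNT(*) > 1
    )
""").fetchone()[0]

print(f"Total UMLS ID occurrences (rows): {total_rows:,}")
print(f"Unique UMLS IDs: {unique_ids:,}")
print(f"UMLS IDs in more than one row: {duplicates:,}")
```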
Comment on lines +69 to +81
duplicates = db.sql("""
    SELECT COUNT(*) FROM (
        SELECT umls_id
        FROM umls_report
        GROUP BY umls_id
        HAVING COUNT(*) > 1
    )
""").fetchone()[0]

click.echo(f"Output written to: {output}")
click.echo(f"Total UMLS ID occurrences (rows): {total_rows:,}")
click.echo(f"Unique UMLS IDs: {unique_ids:,}")
click.echo(f"UMLS IDs in more than one clique: {duplicates:,}")