feat(scripts): Add dependency version scanner tool #16867

Open
chalmerlowe wants to merge 44 commits into main from feat/add-version-scanner

Contributor

chalmerlowe commented Apr 29, 2026

This adds a utility that scans for common references to dependencies (Python runtimes and package dependencies) to facilitate updating code when runtimes and dependencies change.

  • It can be run against an entire repo OR against specific packages within a monorepo
  • It is customizable with regex patterns and examples here
  • The test suite checks each regex against the examples to ensure the efficacy of the patterns
  • The current patterns account for edge cases, such as finding < 3.8 when searching for references to 3.7, since the two are semantically equivalent even though they differ syntactically.
  • The scanner produces a CSV report with path/filename, package name, line number, matching pattern, full line for context, etc.
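
As a rough illustration of the pattern-plus-examples idea described above, the sketch below pairs two regexes for Python 3.7 with example lines and checks each regex against its examples. It is not the PR's actual pattern set or config format; `PATTERNS`, `EXAMPLES`, and `check_examples` are hypothetical names used only here.

```python
import re

# Hypothetical patterns for references to Python 3.7. The second one catches
# the "< 3.8" spelling, which the PR description treats as equivalent to 3.7.
PATTERNS = {
    "explicit_version": re.compile(r"(?<![\d.])3\.7(?!\d)"),
    "upper_bound_next_minor": re.compile(r"<\s*3\.8(?!\d)"),
}

# Lines each pattern must match (should_match) or must ignore (should_skip).
EXAMPLES = {
    "explicit_version": {
        "should_match": ['python_requires=">=3.7"', "Python :: 3.7"],
        "should_skip": ["version = 13.70", "protobuf==3.75.0"],
    },
    "upper_bound_next_minor": {
        "should_match": ['python_requires=">=3.6,<3.8"', "requires: < 3.8"],
        "should_skip": ['python_requires=">=3.8"', "grpcio<3.80"],
    },
}


def check_examples(patterns, examples):
    """Return (pattern_name, line, problem) tuples; an empty list means all passed."""
    failures = []
    for name, regex in patterns.items():
        for line in examples[name]["should_match"]:
            if not regex.search(line):
                failures.append((name, line, "expected a match"))
        for line in examples[name]["should_skip"]:
            if regex.search(line):
                failures.append((name, line, "unexpected match"))
    return failures


if __name__ == "__main__":
    problems = check_examples(PATTERNS, EXAMPLES)
    print("all examples passed" if not problems else problems)
```

Keeping the examples next to the patterns means a newly added regex can be validated by the same check without writing a dedicated test.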

chalmerlowe changed the title from "feat(scripts): Add dependency version scanner tool" to "feat(scripts): [WIP] Add dependency version scanner tool" Apr 29, 2026
chalmerlowe added the "do not merge" label (indicates a pull request not ready for merge, due to either quality or timing) Apr 29, 2026
Contributor

gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a new dependency version scanner, including a configuration-driven regex scanner, a benchmarking tool, and comprehensive unit and integration tests. The review feedback highlights several areas for improvement: optimizing regex compilation in the scanner to avoid performance bottlenecks, using the tempfile module in the benchmark script to prevent race conditions, removing redundant code, improving test robustness by checking subprocess exit codes, and adhering to PEP 8 by moving imports to the top of files.
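
Two of the review points above, compiling regexes once rather than inside the per-line loop and using the tempfile module for scratch files, can be sketched as follows. This is a generic illustration and not the scanner's or the benchmark's code; `VERSION_RE`, `scan_lines`, and `scan_scratch_copy` are hypothetical names.

```python
import re
import tempfile
from pathlib import Path

# Compiled once at import time and reused for every line scanned.
VERSION_RE = re.compile(r"(?<![\d.])3\.7(?!\d)")


def scan_lines(lines):
    """Yield (line_number, line) for each line matching the precompiled regex."""
    for line_num, line in enumerate(lines, 1):
        if VERSION_RE.search(line):
            yield line_num, line.rstrip("\n")


def scan_scratch_copy(text):
    """Write text to a file in a private temporary directory and scan it.

    TemporaryDirectory creates a uniquely named directory, so concurrent runs
    do not collide on a shared hard-coded path, and it is removed on exit.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = Path(tmp_dir) / "sample.txt"
        path.write_text(text, encoding="utf-8")
        with path.open(encoding="utf-8") as f:
            return list(scan_lines(f))


if __name__ == "__main__":
    print(scan_scratch_copy('name = "demo"\npython_requires=">=3.7"\n'))
```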

Comment thread scripts/version_scanner/version_scanner.py Outdated
Comment thread scripts/version_scanner/benchmark.py Outdated
Comment thread scripts/version_scanner/benchmark.py Outdated
Comment thread scripts/version_scanner/tests/integration/test_scanner_integration.py Outdated
Comment thread scripts/version_scanner/tests/unit/test_version_scanner.py Outdated
Comment thread scripts/version_scanner/tests/unit/test_version_scanner.py Outdated
Comment thread scripts/version_scanner/version_scanner.py Outdated
chalmerlowe marked this pull request as ready for review May 5, 2026 13:03
chalmerlowe requested a review from a team as a code owner May 5, 2026 13:03
chalmerlowe removed the "do not merge" label May 5, 2026
chalmerlowe changed the title from "feat(scripts): [WIP] Add dependency version scanner tool" to "feat(scripts): Add dependency version scanner tool" May 5, 2026
chalmerlowe added this to the "Drop support for 3.7-3.9" milestone May 5, 2026
parthea self-assigned this May 6, 2026
@@ -0,0 +1,34 @@
import csv
Contributor

It looks like the copyright header is missing (applies to all code files).

Contributor Author

Added a license header to code files.
Per convention, did not include license header in:

  • config files such as .gitignore, requirements.txt
  • OR data files used for testing

Run the script from the repository root:

```bash
python3 scripts/version_scanner/version_scanner.py -d <dependency> -v <version> [options]
```
Contributor

When I ran this, I got a ModuleNotFoundError. Is there a requirements.txt or anything else that captures the dependencies?

Contributor Author

Added requirements.txt

This plan outlines the approach to update Python packages to drop support for end-of-life Python runtimes (3.7, 3.8, 3.9) OR for deprecated dependencies, and ensure the packages are configured for modern Python.

#### High-Level Strategy
- **One Branch Per Package**: To keep PRs manageable and isolated, we suggest a dedicated worktree and branch for each package (named `feat/drop-<dependency>-<version>-<package-name>`, e.g. `feat/drop-protobuf-4.25.8-google-cloud-bigquery`).
Contributor

This is only for hand-written packages, right? I assume others would get their updates through the generator?

Should we recommend doing a generator update first, to clean up most of the packages?

Contributor Author

There is a note to the effect that if the templates in the gapic-generator are updated, the changes will trickle down to generated packages. This note is in the README.md in the vicinity of lines 34-38.

@@ -0,0 +1,5 @@
packages/google-cloud-access-context-manager
Contributor

What is this?

Contributor Author

I don't know if you are asking about google-cloud-access-context-manager or about the file.

The file small_package_list.txt is a way to pass a list of packages to the scanner, as opposed to scanning only one package or scanning all packages. The use of this file is explained in the README.md near line 19.

The specific packages listed here are just packages chosen at random for example purposes. I asked Gemini to grab a couple package names to help me do a test of functionality.
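
For illustration, a package list file like the one discussed here could be read as sketched below. The one-path-per-line format is inferred from the diff context in this thread; skipping blank lines and `#` comments is an assumption, and `read_package_list` is a hypothetical helper rather than the scanner's actual function.

```python
import tempfile
from pathlib import Path


def read_package_list(list_path):
    """Return package paths from a text file, one per line.

    Blank lines and lines starting with '#' are skipped (an assumption of
    this sketch), and a trailing slash is dropped for consistent matching.
    """
    packages = []
    for raw in Path(list_path).read_text(encoding="utf-8").splitlines():
        entry = raw.strip()
        if not entry or entry.startswith("#"):
            continue
        packages.append(entry.rstrip("/"))
    return packages


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_dir:
        list_file = Path(tmp_dir) / "small_package_list.txt"
        # The second entry is a made-up package name for demonstration.
        list_file.write_text(
            "packages/google-cloud-access-context-manager\n"
            "packages/example-package/\n",
            encoding="utf-8",
        )
        print(read_package_list(list_file))
```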

        self.variables = self._compute_variables()

    def _compute_variables(self) -> Dict[str, str]:
        """Compute variables for interpolation from version string."""
Contributor

nit: more detailed comments/examples could be helpful for future maintainers. I'm not sure what a variable is or what version string format is expected.

Contributor Author

Added additional comments.
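
To make the reviewer's question concrete, one plausible shape for such variables is sketched below. This is an assumption about what "variables for interpolation" might mean and not the PR's implementation: it assumes a dotted numeric `major.minor[.patch]` version string, hypothetical variable names (`major`, `minor`, `next_minor`, `version_escaped`), and a hypothetical `{name}` placeholder syntax for regex templates.

```python
import re
from typing import Dict


def compute_variables(version: str) -> Dict[str, str]:
    """Derive template variables from a version string such as '3.7'."""
    parts = version.split(".")
    if len(parts) < 2 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected a dotted numeric version, got {version!r}")
    major, minor = parts[0], parts[1]
    return {
        "major": major,
        "minor": minor,
        # Used for patterns like "< 3.8" when the target version is 3.7.
        "next_minor": str(int(minor) + 1),
        # Escaped so the dot matches literally inside a regex.
        "version_escaped": re.escape(version),
    }


def interpolate(template: str, variables: Dict[str, str]) -> str:
    """Fill {name} placeholders in a regex template.

    Note: templates that need literal regex braces (e.g. quantifiers) would
    have to double them, because str.format treats braces as placeholders.
    """
    return template.format(**variables)


if __name__ == "__main__":
    variables = compute_variables("3.7")
    print(interpolate(r"<\s*{major}\.{next_minor}(?!\d)", variables))
```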

try:
    with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
        skip_next = False
        for line_num, line in enumerate(f, 1):
Contributor

Are there any issues with statements that span lines?

Contributor Author

Added a limitation to the README.md clarifying that this is solely a single-line scanning engine, mostly because references to version numbers tend to appear on a single line and because it keeps complexity low. If we determine that the lack of multi-line matching is a problem, we can add it as a feature in a future update.
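
The limitation can be seen in a small, self-contained sketch. It is not the scanner's code; the regex and the `scan_single_line` helper are hypothetical, and the example only shows how a line-by-line search misses a requirement that is split across lines.

```python
import re

# Hypothetical pattern: python_requires followed directly by a quoted ">=3.7".
PATTERN = re.compile(r"python_requires\s*=\s*[\"']>=\s*3\.7")


def scan_single_line(text):
    """Return the 1-based numbers of lines that match PATTERN on their own."""
    return [n for n, line in enumerate(text.splitlines(), 1) if PATTERN.search(line)]


ONE_LINE = 'setup(python_requires=">=3.7")\n'
MULTI_LINE = 'setup(\n    python_requires=(\n        ">=3.7"\n    ),\n)\n'

if __name__ == "__main__":
    print(scan_single_line(ONE_LINE))    # the reference is found on line 1
    print(scan_single_line(MULTI_LINE))  # no single line contains the whole reference
```

A looser pattern that matches the bare version on any line would still flag the split form, at the cost of more false positives, which is one reason a single-line engine can be a reasonable starting point.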

def upload_to_drive(csv_path: str, matches: List[Dict[str, str]], github_repo: str = None, branch: str = "main") -> str:
    """
    Upload matches to a Google Sheet in Drive.
    """
Contributor

Is this necessary? It seems to add extra complexity, dependencies, and test surface area, when Google Sheets already makes it pretty easy to import a CSV natively.

Contributor Author

Great question! During dependency cleanup sessions, I typically ran the scanner multiple times:

  • comparing 'before and after' results
  • rescanning a library when I found that a new regex pattern should be used
  • and doing manual QA cycles

Having an automated --upload feature to instantly publish a shared, readable Google Sheet saves significant toil compared to repeatedly exporting and manually importing CSVs into Sheets.
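
The "before and after" comparison mentioned above can also be done locally on two CSV reports. The sketch below is an illustration only: the column names (`path`, `line_number`, `pattern`) are assumptions loosely based on the report fields listed in the PR description, and `load_keys` and `diff_reports` are hypothetical helpers.

```python
import csv
import io

# Assumed columns that identify a match; the real report headers may differ.
KEY_COLUMNS = ("path", "line_number", "pattern")


def load_keys(csv_file):
    """Read a report and return the set of identifying tuples, one per row."""
    return {tuple(row[c] for c in KEY_COLUMNS) for row in csv.DictReader(csv_file)}


def diff_reports(before_file, after_file):
    """Return matches that disappeared ('resolved') and matches that appeared ('new')."""
    before, after = load_keys(before_file), load_keys(after_file)
    return {"resolved": sorted(before - after), "new": sorted(after - before)}


BEFORE = (
    "path,line_number,pattern\n"
    "setup.py,12,explicit_version\n"
    "noxfile.py,30,explicit_version\n"
)
AFTER = (
    "path,line_number,pattern\n"
    "noxfile.py,30,explicit_version\n"
    "README.rst,5,upper_bound\n"
)

if __name__ == "__main__":
    print(diff_reports(io.StringIO(BEFORE), io.StringIO(AFTER)))
```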

parts = rel_root.split(os.sep)

# Monorepo filtering
if target_packages and parts[0] == "packages":
Contributor

There's talk of separating the packages directory into separate ones for generated and handwritten libraries. Will that be easy to address here?

Contributor Author

We have the ability to handle any aggregating folder from this list:

  • packages/
  • handwritten/
  • generated/
  • third-party/

Other folder names can easily be added to the version_scanner if we end up using different naming conventions during a future monorepo upgrade.

This update can be found near version_scanner.py#L515
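
A minimal sketch of how filtering by aggregating folder might look, assuming the folder names listed above. `AGGREGATING_DIRS` and `is_in_target_package` are hypothetical names, skipping paths outside every aggregating folder is an assumption of this sketch, and the real logic in version_scanner.py may differ.

```python
import os

# Folder names taken from the list above; extend this set for new conventions.
AGGREGATING_DIRS = {"packages", "handwritten", "generated", "third-party"}


def is_in_target_package(rel_root, target_packages):
    """Return True if the directory rel_root should be scanned.

    rel_root is a path relative to the repository root. With no target
    packages selected, every directory is scanned.
    """
    if not target_packages:
        return True
    parts = rel_root.split(os.sep)
    if parts[0] in AGGREGATING_DIRS:
        # The package name is the path component right after the aggregator.
        return len(parts) > 1 and parts[1] in target_packages
    # Assumption: with targets selected, other top-level folders are skipped.
    return False


if __name__ == "__main__":
    targets = {"google-cloud-access-context-manager"}
    path = os.path.join("generated", "google-cloud-access-context-manager", "docs")
    print(is_in_target_package(path, targets))
```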


package_group.add_argument(
    "--package",
    help="Specific subdirectory filter (useful for monorepos)"
Contributor

Is this specific to the structure of the monorepo's package directory? Or is this more of a generic subdirectory filter?

Contributor Author

chalmerlowe left a comment

@daniel-sanche

I responded to the comments with some code updates and with some explanations. Please take a look.

3 participants