feat(scripts): Add dependency version scanner tool#16867
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a new dependency version scanner, including a configuration-driven regex scanner, a benchmarking tool, and comprehensive unit and integration tests. The review feedback highlights several areas for improvement: optimizing regex compilation in the scanner to avoid performance bottlenecks, using the tempfile module in the benchmark script to prevent race conditions, removing redundant code, improving test robustness by checking subprocess exit codes, and adhering to PEP 8 by moving imports to the top of files.
…boundaries for explicit_version_string
… to .scannerignore
| @@ -0,0 +1,34 @@ | |||
| import csv | |||
There was a problem hiding this comment.
It looks like the copytight header is missing (applies to all code files)
There was a problem hiding this comment.
Added a license header to code files.
Per convention, did not include license header in:
- config files such as
.gitignore,requirements.txt - OR data files used for testing
| Run the script from the repository root: | ||
|
|
||
| ```bash | ||
| python3 scripts/version_scanner/version_scanner.py -d <dependency> -v <version> [options] |
There was a problem hiding this comment.
When I ran this, I gt a ModuleNotFound error. is there a requirements.txt or anything that captures the dependencies?
There was a problem hiding this comment.
Added requirements.txt
| This plan outlines the approach to update Python packages to drop support for end-of-life Python runtimes (3.7, 3.8, 3.9) OR for deprecated dependencies, and ensure the packages are configured for modern Python. | ||
|
|
||
| #### High-Level Strategy | ||
| - **One Branch Per Package**: To keep PRs manageable and isolated, we suggest a dedicated worktree and branch for each package (e.g., `feat/drop-<dependency>-<version>-<package-name>` i.e. `feat/drop-protobuf-4.25.8-google-cloud-bigquery`). |
There was a problem hiding this comment.
This is only for hand-written packages, right? I assume others would get their updates through the generator?
Should we recommend doing a generator update first, to clean up most of the packages?
There was a problem hiding this comment.
There is a note to the effect that the if the templates in the gapic-generator are update, then the changes will trickle down to generated packages. This not is in the README.md in the vicinity of lines 34 - 38.
| @@ -0,0 +1,5 @@ | |||
| packages/google-cloud-access-context-manager | |||
There was a problem hiding this comment.
I don't know if you are asking about google-cloud-access-context-manager or about the file.
The file small_package_list.txt is a way to present a list of packages to the scanner versus scanning only one package OR scanning all packages. The use of this file is explained in the README.md near line 19.
The specific packages listed here are just packages chosen at random for example purposes. I asked Gemini to grab a couple package names to help me do a test of functionality.
| self.variables = self._compute_variables() | ||
|
|
||
| def _compute_variables(self) -> Dict[str, str]: | ||
| """Compute variables for interpolation from version string.""" |
There was a problem hiding this comment.
nit: more detailed comments/examples could be helpful for future maintainers. I'm not sure what a variable is, or the expected version string format
There was a problem hiding this comment.
Added additional comments.
| try: | ||
| with open(file_path, 'r', encoding='utf-8', errors='ignore') as f: | ||
| skip_next = False | ||
| for line_num, line in enumerate(f, 1): |
There was a problem hiding this comment.
are there any issues with statements that span lines?
There was a problem hiding this comment.
Added a limitation in the README.md clarifying that this is solely a single-line scanning engine. Mostly because references to version numbers tend to happen on a single line and to keep complexity low. If we determine that not having multi-line versioning is an issue, we can include that as a feature in future updates.
| def upload_to_drive(csv_path: str, matches: List[Dict[str, str]], github_repo: str = None, branch: str = "main") -> str: | ||
| """ | ||
| Upload matches to a Google Sheet in Drive. | ||
| """ |
There was a problem hiding this comment.
Is this necessary? It seems to add extra complexity, dependencies and test surface area, when Google Sheets makes it pretty easy to import a csv natively already
There was a problem hiding this comment.
Great question! During dependency cleanup sessions, I typically ran the scanner multiple times:
- comparing 'before and after' results
- rescanning a library if I detected a new regex pattern should be used
- and doing manual QA cycles
Having an automated --upload feature to instantly publish a shared, readable Google Sheet saves significant toil compared to repeatedly exporting and manually importing CSVs into Sheets.
| parts = rel_root.split(os.sep) | ||
|
|
||
| # Monorepo filtering | ||
| if target_packages and parts[0] == "packages": |
There was a problem hiding this comment.
There's talk of separating the packages directory into separate ones for generated and handwritten libraries. Will that be easy to address here?
There was a problem hiding this comment.
We have the ability to handle any aggregating folder from this list:
packages/handwritten/generated/third-party/
other folder names can be added simply enough to the version_scanner if we end up using different naming conventions during a future mono-repo upgrade.
This update can be found near version_scanner.pyL#515
|
|
||
| package_group.add_argument( | ||
| "--package", | ||
| help="Specific subdirectory filter (useful for monorepos)" |
There was a problem hiding this comment.
Is this specific to the structure of the monorepo's package directory? Os is this more of a generic subdirectory filter?
…and package-file details
…ayout-agnostic package naming
chalmerlowe
left a comment
There was a problem hiding this comment.
I responded to the comments with some code updates and with some explanations. Please take a look.
| @@ -0,0 +1,34 @@ | |||
| import csv | |||
There was a problem hiding this comment.
Added a license header to code files.
Per convention, did not include license header in:
- config files such as
.gitignore,requirements.txt - OR data files used for testing
| Run the script from the repository root: | ||
|
|
||
| ```bash | ||
| python3 scripts/version_scanner/version_scanner.py -d <dependency> -v <version> [options] |
There was a problem hiding this comment.
Added requirements.txt
| This plan outlines the approach to update Python packages to drop support for end-of-life Python runtimes (3.7, 3.8, 3.9) OR for deprecated dependencies, and ensure the packages are configured for modern Python. | ||
|
|
||
| #### High-Level Strategy | ||
| - **One Branch Per Package**: To keep PRs manageable and isolated, we suggest a dedicated worktree and branch for each package (e.g., `feat/drop-<dependency>-<version>-<package-name>` i.e. `feat/drop-protobuf-4.25.8-google-cloud-bigquery`). |
There was a problem hiding this comment.
There is a note to the effect that the if the templates in the gapic-generator are update, then the changes will trickle down to generated packages. This not is in the README.md in the vicinity of lines 34 - 38.
| @@ -0,0 +1,5 @@ | |||
| packages/google-cloud-access-context-manager | |||
There was a problem hiding this comment.
I don't know if you are asking about google-cloud-access-context-manager or about the file.
The file small_package_list.txt is a way to present a list of packages to the scanner versus scanning only one package OR scanning all packages. The use of this file is explained in the README.md near line 19.
The specific packages listed here are just packages chosen at random for example purposes. I asked Gemini to grab a couple package names to help me do a test of functionality.
| self.variables = self._compute_variables() | ||
|
|
||
| def _compute_variables(self) -> Dict[str, str]: | ||
| """Compute variables for interpolation from version string.""" |
There was a problem hiding this comment.
Added additional comments.
| try: | ||
| with open(file_path, 'r', encoding='utf-8', errors='ignore') as f: | ||
| skip_next = False | ||
| for line_num, line in enumerate(f, 1): |
There was a problem hiding this comment.
Added a limitation in the README.md clarifying that this is solely a single-line scanning engine. Mostly because references to version numbers tend to happen on a single line and to keep complexity low. If we determine that not having multi-line versioning is an issue, we can include that as a feature in future updates.
| def upload_to_drive(csv_path: str, matches: List[Dict[str, str]], github_repo: str = None, branch: str = "main") -> str: | ||
| """ | ||
| Upload matches to a Google Sheet in Drive. | ||
| """ |
There was a problem hiding this comment.
Great question! During dependency cleanup sessions, I typically ran the scanner multiple times:
- comparing 'before and after' results
- rescanning a library if I detected a new regex pattern should be used
- and doing manual QA cycles
Having an automated --upload feature to instantly publish a shared, readable Google Sheet saves significant toil compared to repeatedly exporting and manually importing CSVs into Sheets.
| parts = rel_root.split(os.sep) | ||
|
|
||
| # Monorepo filtering | ||
| if target_packages and parts[0] == "packages": |
There was a problem hiding this comment.
We have the ability to handle any aggregating folder from this list:
packages/handwritten/generated/third-party/
other folder names can be added simply enough to the version_scanner if we end up using different naming conventions during a future mono-repo upgrade.
This update can be found near version_scanner.pyL#515
This adds a utility with the ability to scan for common references to dependencies (Python runtimes and package dependencies) to facilitate updating code when runtimes and dependencies change.
< 3.8when searching for references to3.7since they are semantically equivalent even if syntactically different.