Add tokenizer for BigSMILES representation of polymers #8
awadell1
left a comment
- Missing Rust-level tests
- Missing tests for expected unknowns (i.e. 🤗)
- Is there a spec for this? The test set should run against every BigSMILES string in the spec
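A test for expected unknowns could look roughly like the sketch below. The pattern, the `tokenize` helper, and the `[UNK]` marker are all hypothetical stand-ins and far smaller than the PR's real tokenizer; the point is only the shape of the assertion.

```python
import re

# Hypothetical, heavily reduced pattern: bonding descriptors, a few SMILES
# atoms, and some punctuation. This is NOT the regex from the PR.
TOKEN_RE = re.compile(r"\[[$<>]\d*\]|Cl|Br|[BCNOPSFI]|[bcnops]|[=#()\{\},;]|\d")
UNK = "[UNK]"  # assumed unknown-token marker


def tokenize(text):
    """Split text into known tokens, emitting UNK for each unmatched character."""
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match:
            tokens.append(match.group())
            pos = match.end()
        else:
            tokens.append(UNK)
            pos += 1
    return tokens


def test_emoji_is_unknown():
    # The emoji is outside the vocabulary, so exactly one UNK is expected.
    assert tokenize("CC🤗O") == ["C", "C", UNK, "O"]


def test_valid_string_has_no_unknowns():
    assert UNK not in tokenize("{[<]CCO[>]}")
```

The same pair of assertions (one input that must yield an unknown, one that must not) would apply at the Rust level as well.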
awadell1
left a comment
Code looks good; some comments on tests and serialization. But my main concern is: how are labels handled? The spec seems pretty unbounded and, as is, the current tokenizer is open.
Do you plan on handling the special fragment labels?
You need way more tests: things like higher digit counts for bonding descriptors, weird labels, and every facet of the spec. There's no test for labels or the abstract fragment spec.
    return sorted(out)


def merge_tokens_grouped(tokens):
| is a valid character in the weighted BigSMILES representation. The expression output from merge_tokens would capture this, e.g. A| could be a token. I didn't want to alter the original merge_tokens function. I also ended up not supporting the G-BigSMILES extension, where the | character occurs, but it is probably still better not to have that be ambiguous.
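A minimal sketch of the ambiguity, assuming the merged tokens end up joined into a regex alternation. Both helper names are hypothetical and neither is the PR's merge_tokens or merge_tokens_grouped; the sketch only shows why a literal | inside a token needs escaping.

```python
import re


def naive_alternation(tokens):
    # Joins tokens verbatim, so a "|" inside a token becomes a regex operator.
    return "|".join(sorted(tokens, key=len, reverse=True))


def escaped_alternation(tokens):
    # Escapes each token first, so "|" and other metacharacters stay literal.
    return "|".join(re.escape(t) for t in sorted(tokens, key=len, reverse=True))


tokens = ["A|", "Cl", "C"]
naive = re.compile(naive_alternation(tokens))
escaped = re.compile(escaped_alternation(tokens))

# The naive pattern can no longer match the token "A|" as a whole.
assert naive.fullmatch("A|") is None
# The escaped pattern treats "A|" as a single literal token.
assert escaped.findall("A|ClC") == ["A|", "Cl", "C"]
```

Longest-first ordering is a separate assumption here, chosen so that Cl is tried before C.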
r"\(|\)|",
r"\{|\}|", // Stochastic object delimiters
r",|;|", // Repeat unit separator and end group separator
r"[A-Z][A-Za-z0-9']*|", // Fragment and abstract spec labels
This can match basically any name.
- not covered by the test set
- demands an unbounded vocabulary
- How does this interact with the rest of the spec? For example, what's the tokenization of AbCl?
- Aren't labels bracketed?
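To make the concern concrete, here is what the label alternative from the diff above matches when taken in isolation. This is a sketch in Python's `re` rather than the Rust regex engine, and it ignores the ordering of the other alternatives in the real pattern, which may change the actual tokenization.

```python
import re

# The label alternative from the diff, used on its own.
LABEL_RE = re.compile(r"[A-Z][A-Za-z0-9']*")

# Any capitalized alphanumeric run is accepted as a single "label".
assert LABEL_RE.findall("AbCl") == ["AbCl"]

# Adjacent atoms are swallowed the same way, so "CCl" is one match,
# not "C" followed by "Cl".
assert LABEL_RE.findall("CCl") == ["CCl"]

# Punctuation breaks the run, but "OC" is still merged into one match.
assert LABEL_RE.findall("C(=O)OC") == ["C", "O", "OC"]
```

If labels only ever appear inside brackets, anchoring the alternative to the bracketed context would be one way to bound it; whether that fits the spec is for the author to confirm.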
{[<][>]NC(C)C(=O)[<],[>]NCC(=O)[<][>]}O
{[<]NC(C)C(=O),NCC(=O)[>]}O
{[][$]CC(C)([#R])[$][]}.{#R=C(=O)OCC12CC(C3)CC(C1)CC3C2}
C([#Arm])([#Arm])([#Arm])[#Arm].{#Arm=CO{[<][>]CCO[<][>]}}
This produces unknowns, right? Because Arm isn't a token?
At the Python level, this should be used to check that no unknowns are produced.
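One possible shape for that check, as a sketch. `find_unknowns` is a hypothetical helper that takes any encode callable, and `toy_encode` is a character-level stand-in for the real tokenizer, with 0 assumed as the unknown id.

```python
def find_unknowns(encode, strings, unk_id):
    """Return the input strings whose encoded ids contain unk_id."""
    return [s for s in strings if unk_id in encode(s)]


# Toy stand-in for the real tokenizer: one id per character, 0 for unknown.
TOY_VOCAB = {ch: i for i, ch in enumerate("CNO()=[]<>{},#$.", start=1)}


def toy_encode(text):
    return [TOY_VOCAB.get(ch, 0) for ch in text]


SPEC_EXAMPLES = [
    "{[<][>]NC(C)C(=O)[<],[>]NCC(=O)[<][>]}O",
    "{[<]NC(C)C(=O),NCC(=O)[>]}O",
]

# Every example from the test set should encode without unknowns.
assert find_unknowns(toy_encode, SPEC_EXAMPLES, unk_id=0) == []

# With this toy vocabulary the label "Arm" is not covered, so it is reported.
assert find_unknowns(toy_encode, ["C([#Arm])"], unk_id=0) == ["C([#Arm])"]
```

Returning the offending strings rather than a boolean makes a failing test show which inputs produced unknowns.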
awadell1
left a comment
This is shaping up nicely. I'm planning on just hitting merge when you re-request the review.
Do you want a release soonish?
type_field = Some(map.next_value()?);
}
"bigsmiles_version" => {
let _: serde::de::IgnoredAny = map.next_value()?;
serde::de::IgnoredAny kinda undoes the point of a version tag, no?
Can it either:
- Store a Version (probably excessive)
- Reject strings that aren't 1.1 (this is fine)
Code to handle a version mismatch is out of scope, but we do want to be set up for that.
Implementing a fix where the deserializer rejects anything except "1.1" from the JSON, unless you are referring to checking the syntax of the input strings.
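The intended behavior can be sketched in Python with the standard `json` module. This is an illustration of the reject-on-mismatch rule only, not the serde implementation; `load_config` and the `type` value are hypothetical, and treating a missing version tag as an error is an assumption.

```python
import json

SUPPORTED_VERSION = "1.1"


def load_config(raw):
    """Parse serialized tokenizer JSON, rejecting unsupported version tags."""
    data = json.loads(raw)
    version = data.get("bigsmiles_version")
    # A missing tag yields None and is rejected as well (an assumption).
    if version != SUPPORTED_VERSION:
        raise ValueError(
            f"unsupported bigsmiles_version {version!r}; "
            f"expected {SUPPORTED_VERSION!r}"
        )
    return data


ok = load_config('{"type": "BigSmiles", "bigsmiles_version": "1.1"}')
assert ok["bigsmiles_version"] == "1.1"

try:
    load_config('{"type": "BigSmiles", "bigsmiles_version": "2.0"}')
except ValueError as err:
    print(err)
```

Failing loudly on any other value keeps the door open for real version handling later without silently accepting data written under a different spec.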