Add tokenizer for BigSMILES representation of polymers #8

Open
anoushka2000 wants to merge 53 commits into main from feat/big-smiles

Conversation

@anoushka2000
Collaborator

No description provided.

@anoushka2000 anoushka2000 requested a review from awadell1 April 23, 2026 23:50
Member

@awadell1 awadell1 left a comment

  • Missing Rust-level tests.
  • Missing tests for expected unknowns (e.g. 🤗).
  • Is there a spec for this? The test set should run against every BigSMILES string in the spec.

Comment thread pyproject.toml Outdated
Member

@awadell1 awadell1 left a comment

More comments. Also, this needs an entry in the changelog.

Comment thread python/smirk/__init__.py
Comment thread python/smirk/__init__.py
Comment thread python/smirk/vocab_bigsmiles.json
Comment thread docs/big_smirk_demo.ipynb
Comment thread src/pre_tokenizers/split_bigsmiles.rs
Comment thread src/wrapper.rs
@anoushka2000 anoushka2000 requested a review from awadell1 April 26, 2026 17:40
Member

@awadell1 awadell1 left a comment

Code looks good; I have some comments on tests and serialization. My main concern is how labels are handled. The spec seems pretty unbounded and, as is, the current tokenizer is open (it needs an unbounded vocabulary).

Do you plan on handling the special fragment labels?

You need way more tests: things like higher digit counts for bonding descriptors, weird labels, and every facet of the spec. There's no test for labels or the abstract fragment spec.
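A minimal sketch of what a "higher digit counts" test could look like. It assumes a bonding descriptor has the shape `[`, one of `<`, `>`, `$`, optional decimal digits, `]`; the `DESCRIPTOR` pattern and `parse_descriptor` helper are hypothetical and are not taken from this PR.

```python
import re

# Assumed shape of a bonding descriptor: "[", one of < > $, optional digits, "]".
# The empty descriptor "[]" is deliberately not covered by this pattern.
DESCRIPTOR = re.compile(r"\[([<>$])(\d*)\]")


def parse_descriptor(text):
    """Return (kind, index) for a bonding descriptor, or None if it does not match."""
    match = DESCRIPTOR.fullmatch(text)
    if match is None:
        return None
    kind, digits = match.groups()
    return (kind, int(digits) if digits else None)


# Cases with zero, one, two, and three digits.
CASES = {
    "[<]": ("<", None),
    "[>1]": (">", 1),
    "[$12]": ("$", 12),
    "[<123]": ("<", 123),
}

for text, expected in CASES.items():
    assert parse_descriptor(text) == expected
```

A table of cases like `CASES` keeps each facet of the spec visible as one row, so a missing facet is easy to spot in review.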

Comment thread opt/build_vocab.py
return sorted(out)


def merge_tokens_grouped(tokens):
Member

This looks like merge_tokens

Collaborator Author

@anoushka2000 anoushka2000 Apr 28, 2026

| is a valid character in the weighted BigSMILES representation. The expression output from merge_tokens would capture this, e.g. A| could be a token. I didn't want to alter the original merge_tokens function. I also ended up not supporting the G-BigSMILES extension, where the | character occurs, but it is probably still better to not have that be ambiguous.

https://github.com/InnocentBug/G-BigSMILES/blob/2fbbeb7879dc9c15c67178d1399b0a9bc9a21f38/README.md?plain=1#L11
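The ambiguity can be illustrated with a small sketch. It assumes, without having seen the real merge_tokens, that tokens end up joined into a regex alternation with `|`; `build_alternation` is a hypothetical helper, not the function in opt/build_vocab.py.

```python
import re


def build_alternation(tokens):
    """Join tokens into a regex alternation, escaping regex metacharacters.

    Longer tokens are placed first so that "A|" wins over "A".
    """
    ordered = sorted(set(tokens), key=lambda t: (-len(t), t))
    return "|".join(re.escape(t) for t in ordered)


tokens = ["A", "A|", "|"]

# Naive join: the literal "|" inside a token is read as the alternation operator.
naive = "|".join(tokens)

# Escaped join: "|" inside a token is matched literally.
safe = build_alternation(tokens)

print(re.fullmatch(naive, "|"))    # None: the naive pattern cannot match a literal "|"
print(re.findall(safe, "A|B|"))    # ['A|', '|']
```

Escaping each token before joining removes the ambiguity regardless of whether the G-BigSMILES extension is supported.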

Comment thread src/pre_tokenizers/split_bigsmiles.rs Outdated
r"\(|\)|",
r"\{|\}|", // Stochastic object delimiters
r",|;|", // Repeat unit separator and end group separator
r"[A-Z][A-Za-z0-9']*|", // Fragment and abstract spec labels
Member

This can match basically any name.

  1. not covered by the test set
  2. demands an unbounded vocabulary

Member

  1. How does this interact with the rest of the spec? For example, what's the tokenization of AbCl?
  2. Aren't labels bracketed?
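A quick way to probe both questions is to run the label alternative on its own. The sketch below uses Python's `re` rather than the Rust regex in split_bigsmiles.rs, and it does not reproduce the full alternation, whose behavior depends on the order of its alternatives. `BRACKETED_LABEL` is a hypothetical stricter variant, not something proposed in the PR.

```python
import re

# The label alternative quoted in the review thread, taken on its own.
LABEL = re.compile(r"[A-Z][A-Za-z0-9']*")

# Hypothetical stricter variant: only accept a label inside "[#...]".
BRACKETED_LABEL = re.compile(r"\[#([A-Z][A-Za-z0-9']*)\]")

print(LABEL.findall("AbCl"))                    # ['AbCl']
print(LABEL.findall("CCl"))                     # ['CCl']
print(BRACKETED_LABEL.findall("C([#Arm])CCl"))  # ['Arm']
```

On its own, the unbracketed pattern swallows ordinary atom runs such as CCl as a single label, whereas anchoring on the `[#` prefix only picks up the bracketed name.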

Comment thread test/bigsmiles.smi
{[<][>]NC(C)C(=O)[<],[>]NCC(=O)[<][>]}O
{[<]NC(C)C(=O),NCC(=O)[>]}O
{[][$]CC(C)([#R])[$][]}.{#R=C(=O)OCC12CC(C3)CC(C1)CC3C2}
C([#Arm])([#Arm])([#Arm])[#Arm].{#Arm=CO{[<][>]CCO[<][>]}}
Member

This produces unknowns, right? Because Arm isn't a token?

Comment thread test/bigsmiles.smi
Member

At the Python level, this file should be used to check that tokenization produces no unknowns.
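A minimal sketch of such a check, using a toy character-level tokenizer in place of the real one. `VOCAB`, `UNK`, `toy_tokenize`, and `lines_with_unknowns` are all hypothetical; the actual test would load the smirk tokenizer and read test/bigsmiles.smi instead of the inline `CORPUS`.

```python
UNK = "[UNK]"

# Hypothetical single-character vocabulary. It deliberately has no entry for
# the letters of the label "Arm", to mimic a closed vocabulary.
VOCAB = set("{}[]<>()$#=.,;CNOR123")


def toy_tokenize(text):
    """Split into characters and map anything outside VOCAB to UNK."""
    return [ch if ch in VOCAB else UNK for ch in text]


def lines_with_unknowns(lines, tokenize=toy_tokenize):
    """Return the input lines whose tokenization contains UNK."""
    return [line for line in lines if UNK in tokenize(line)]


# The four lines quoted from test/bigsmiles.smi in this thread.
CORPUS = [
    "{[<][>]NC(C)C(=O)[<],[>]NCC(=O)[<][>]}O",
    "{[<]NC(C)C(=O),NCC(=O)[>]}O",
    "{[][$]CC(C)([#R])[$][]}.{#R=C(=O)OCC12CC(C3)CC(C1)CC3C2}",
    "C([#Arm])([#Arm])([#Arm])[#Arm].{#Arm=CO{[<][>]CCO[<][>]}}",
]

offenders = lines_with_unknowns(CORPUS)
print(offenders)  # only the "Arm" line under this toy vocabulary
```

In a real test the assertion would be that `offenders` is empty; here the toy vocabulary flags the Arm line, which is the situation the earlier comment asks about.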

Comment thread src/tokenizer.rs
Comment thread test/test_tokenize_bigsmiles.py
Comment thread opt/build_vocab.py Outdated
Comment thread opt/build_vocab.py
Member

@awadell1 awadell1 left a comment

This is shaping up nicely. I'm planning on just hitting merge when you re-request the review.

Do you want a release soonish?

Comment thread docs/big_smirk_demo.ipynb Outdated
Comment thread src/pre_tokenizers/bigsmirk.rs Outdated
type_field = Some(map.next_value()?);
}
"bigsmiles_version" => {
let _: serde::de::IgnoredAny = map.next_value()?;
Member

serde::de::IgnoredAny kinda undoes the point of a version tag, no?
Can it either:

  • Store a Version (probably excessive)
  • Reject strings that aren't 1.1 (this is fine)

Code to handle a version mismatch is out of scope, but we do want to be set up for that

Collaborator Author

Implementing a fix where the deserializer will reject anything except "1.1" from the JSON, unless you are referring to checking the syntax of the input strings.
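The intended behavior can be sketched in Python for illustration; the actual fix lives in the Rust serde deserializer and is not shown in this thread. The field name `bigsmiles_version` comes from the hunk above, while `load_pre_tokenizer_config`, `SUPPORTED_VERSION`, and the `"BigSmirk"` type value are hypothetical.

```python
import json

# Assumption: "1.1" is the only accepted version string, per the review comment.
SUPPORTED_VERSION = "1.1"


def load_pre_tokenizer_config(raw):
    """Parse a JSON config and reject any version other than SUPPORTED_VERSION."""
    config = json.loads(raw)
    version = config.get("bigsmiles_version")
    if version != SUPPORTED_VERSION:
        raise ValueError(f"unsupported bigsmiles_version: {version!r}")
    return config


ok = load_pre_tokenizer_config('{"type": "BigSmirk", "bigsmiles_version": "1.1"}')
print(ok["bigsmiles_version"])  # 1.1

# A mismatched, missing, or non-string version is rejected rather than ignored.
for bad in ('{"bigsmiles_version": "1.0"}', '{"type": "BigSmirk"}', '{"bigsmiles_version": 1.1}'):
    try:
        load_pre_tokenizer_config(bad)
    except ValueError as err:
        print(err)
```

Rejecting at load time keeps the version tag meaningful without adding any mismatch-handling code, which the review marks as out of scope.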

Comment thread python/smirk/__init__.py
anoushka2000 and others added 28 commits May 6, 2026 14:34
Commit messages from dependabot[bot] (each signed off by dependabot[bot] <support@github.com>):

  • Bumps rdkit from 2024.9.5 to 2026.3.1 (direct:development, semver-major)
  • Bumps actions/upload-artifact from 5 to 7 (direct:production, semver-major)
  • Bumps actions/attest-build-provenance from 3 to 4 (direct:production, semver-major)
  • Bumps actions/upload-pages-artifact from 4 to 5 (direct:production, semver-major)
  • Bumps actions/checkout from 3 to 6 (direct:production, semver-major)
  • Bumps actions/deploy-pages from 4 to 5 (direct:production, semver-major)
  • Bumps furo from 2024.8.6 to 2025.12.19 (direct:development, semver-major)
  • Updates the requirements on datasets to permit the latest version, 4.5.0 (commits 3.3.0...4.5.0, direct:development)
  • Updates the requirements on transformers to permit the latest version, 4.57.6 (commits v4.48.2...v4.57.6, direct:production)
  • Updates the requirements on pyo3 to permit the latest version, 0.28.3 (commits v0.27.0...v0.28.3, direct:production)
  • Updates the requirements on tokenizers to permit the latest version, 0.23.1 (commits v0.21.0...v0.23.1, direct:production)
  • Bumps actions/download-artifact from 6 to 8 (direct:production, semver-major)
  • Bumps astral-sh/setup-uv from 5 to 7 (direct:production, semver-major)
  • Updates the requirements on myst-nb to permit the latest version, 1.3.0 (commits v1.2.0...v1.3.0, direct:development)
  • Updates the requirements on pre-commit to permit the latest version, 4.3.0 (commits v4.1.0...v4.3.0, direct:development)
  • Bumps actions/setup-python from 3 to 6 (direct:production, semver-major)

@anoushka2000 anoushka2000 requested a review from awadell1 May 13, 2026 18:01

2 participants