When a ligand is provided through LigandInput(smiles=...), the local ESMFold2 input preparation does not preserve the molecular bond graph parsed by RDKit. This can produce a token_bonds matrix with no correct intra-ligand edges and can lead to severely distorted ligand geometries in predicted structures.
The CCD input path for the same molecule produces the expected ligand bond graph.
The behavior appears to originate in:
esm/models/esmfold2/prepare_input.py
tokenize_ligand_smiles() parses the SMILES with RDKit and generates a conformer, but returns only TokenInfo and AtomInfo. The bonds available from mol_no_h.GetBonds() are not propagated.
All SMILES ligands are then assigned:
Later, compute_token_bonds() attempts to recover intra-ligand bonds using:
res_name = tokens[atom_list[0][1]].residue_name
ccd_bonds = get_ligand_ccd_bonds(res_name)
For a SMILES ligand, this queries the CCD entry named LIG, which is unrelated
to the submitted molecule. Its atom names generally do not match the
SMILES-generated canonical atom names, so no correct edges are added.
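The name mismatch described above can be illustrated with a minimal, self-contained sketch. It is not the code in prepare_input.py: the helper match_bonds_by_name, the atom names, and the template bonds are all hypothetical and chosen only to show why a name-based lookup against an unrelated entry yields no edges.

```python
def match_bonds_by_name(atom_names, template_bonds):
    """Map name-based template bonds onto atom indices.

    Bonds whose atom names are not present in atom_names are skipped,
    which is the failure mode illustrated here.
    """
    index_of = {name: i for i, name in enumerate(atom_names)}
    edges = []
    for name_a, name_b in template_bonds:
        if name_a in index_of and name_b in index_of:
            edges.append((index_of[name_a], index_of[name_b]))
    return edges


# Hypothetical atom names for the three heavy atoms of ethanol (SMILES "CCO").
smiles_atom_names = ["C1", "C2", "O1"]

# Hypothetical bonds from an unrelated template with a different naming scheme.
unrelated_template_bonds = [("CA", "CB"), ("CB", "OG")]

# Hypothetical bonds that use the same naming scheme as the ligand atoms.
matching_template_bonds = [("C1", "C2"), ("C2", "O1")]

no_edges = match_bonds_by_name(smiles_atom_names, unrelated_template_bonds)
good_edges = match_bonds_by_name(smiles_atom_names, matching_template_bonds)
print(no_edges)
print(good_edges)
```

With the unrelated template, every bond is skipped because neither atom name resolves to an index, so the ligand ends up with no intra-ligand edges.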
If no CCD bonds are found, the current fallback creates a fully connected
intra-residue graph, which also does not represent the submitted ligand
chemistry.
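To make the difference concrete, the sketch below compares a fully connected intra-residue graph with a graph built from an explicit bond list. It is an illustration under stated assumptions, not the actual compute_token_bonds() logic: the helper names are hypothetical, one token per heavy atom is assumed, and the fallback is assumed to exclude self-edges.

```python
def adjacency_from_bonds(n_atoms, bonds):
    """Build a symmetric 0/1 adjacency matrix from (i, j) index pairs."""
    adj = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        adj[i][j] = 1
        adj[j][i] = 1
    return adj


def fully_connected(n_atoms):
    """Connect every atom to every other atom (assumed: no self-edges)."""
    return [[int(i != j) for j in range(n_atoms)] for i in range(n_atoms)]


def count_edges(adj):
    """Count undirected edges in a symmetric adjacency matrix."""
    return sum(sum(row) for row in adj) // 2


# Benzene heavy atoms: six carbons joined in a ring.
benzene_bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]

true_graph = adjacency_from_bonds(6, benzene_bonds)
fallback_graph = fully_connected(6)

print(count_edges(true_graph))
print(count_edges(fallback_graph))
```

For a six-atom ring, the explicit bond list gives 6 edges while the fully connected graph gives 15, so 9 of the fallback edges correspond to no real bond.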
Environment
esm: 3.3.0
transformers: 4.57.6
Python: 3.12