SMILES ligand bond graph is lost during ESMFold2 input preparation #339

@chenbq18

Description

When a ligand is provided through LigandInput(smiles=...), the local ESMFold2 input preparation does not preserve the molecular bond graph parsed by RDKit. This can produce a token_bonds matrix with no correct intra-ligand edges and can lead to severely distorted ligand geometries in predicted structures.

The CCD input path for the same molecule produces the expected ligand bond graph.

The problem appears to originate in:

esm/models/esmfold2/prepare_input.py

tokenize_ligand_smiles() parses the SMILES with RDKit and generates a conformer, but returns only TokenInfo and AtomInfo. The bonds available from mol_no_h.GetBonds() are not propagated.
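For illustration, here is a minimal sketch of how the bond list could be collected so it can be returned alongside TokenInfo and AtomInfo. The helper names extract_bond_pairs and demo_with_rdkit are hypothetical and not part of esm. Only GetBonds(), GetBeginAtomIdx(), GetEndAtomIdx() and Chem.MolFromSmiles() are RDKit calls. How the pairs would actually be threaded through prepare_input.py is not shown and is left to the maintainers.

```python
from typing import Any, List, Tuple


def extract_bond_pairs(mol: Any) -> List[Tuple[int, int]]:
    """Return sorted (i, j) atom-index pairs with i < j, one per bond.

    `mol` is any object exposing RDKit's Mol.GetBonds() interface,
    e.g. the hydrogen-free molecule (mol_no_h) mentioned in this issue.
    """
    pairs = set()
    for bond in mol.GetBonds():
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)


def demo_with_rdkit(smiles: str = "CCO") -> List[Tuple[int, int]]:
    """Hypothetical demo: parse a SMILES string and list its heavy-atom bonds."""
    # RDKit is imported lazily so the helper above has no hard dependency.
    from rdkit import Chem

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"RDKit could not parse SMILES: {smiles!r}")
    return extract_bond_pairs(mol)
```

The index pairs refer to atom order in the parsed molecule, so they would only be usable downstream if the atom order in AtomInfo matches that order.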

All SMILES ligands are then assigned:

residue_name="LIG"

Later, compute_token_bonds() attempts to recover intra-ligand bonds using:

res_name = tokens[atom_list[0][1]].residue_name
ccd_bonds = get_ligand_ccd_bonds(res_name)

For a SMILES ligand, this queries the CCD entry named LIG, which is unrelated
to the submitted molecule. Its atom names generally do not match the
SMILES-generated canonical atom names, so no correct edges are added.
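The name mismatch can be reproduced in isolation. The sketch below is not the esm implementation: bonds_from_name_lookup is a hypothetical stand-in for a name-based bond lookup, and all atom names are made up for the example.

```python
from typing import List, Sequence, Tuple


def bonds_from_name_lookup(
    atom_names: Sequence[str],
    named_bonds: Sequence[Tuple[str, str]],
) -> List[Tuple[int, int]]:
    """Map name-keyed bonds onto atom indices, skipping unknown names."""
    index_of = {name: i for i, name in enumerate(atom_names)}
    edges = []
    for a, b in named_bonds:
        # A bond is only added when both endpoint names exist in the ligand.
        if a in index_of and b in index_of:
            edges.append((index_of[a], index_of[b]))
    return edges


# Made-up names standing in for an unrelated template entry.
template_bonds = [("CA", "CB"), ("CB", "OG")]
# Made-up names standing in for SMILES-generated atom names.
ligand_atom_names = ["C1", "C2", "O1"]

edges = bonds_from_name_lookup(ligand_atom_names, template_bonds)
print(edges)  # no names overlap, so no edges are recovered
```

If some names did happen to coincide, the lookup would add edges taken from the unrelated entry rather than from the submitted molecule.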

If no CCD bonds are found, the current fallback creates a fully connected
intra-residue graph, which also does not represent the submitted ligand
chemistry.
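To give a sense of how far a fully connected graph is from the real chemistry, here is a small pure-Python comparison for a six-membered ring such as benzene (heavy atoms only). The function names are hypothetical and the adjacency matrix is only a simplified stand-in for token_bonds.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def adjacency(n_atoms: int, bonds: Sequence[Tuple[int, int]]) -> List[List[int]]:
    """Build a symmetric 0/1 adjacency matrix from index pairs."""
    mat = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        mat[i][j] = mat[j][i] = 1
    return mat


def fully_connected_bonds(n_atoms: int) -> List[Tuple[int, int]]:
    """Every unordered atom pair, as in the fallback described above."""
    return list(combinations(range(n_atoms), 2))


n = 6
ring_bonds = [(i, (i + 1) % n) for i in range(n)]  # six ring bonds

true_adj = adjacency(n, ring_bonds)
fallback_adj = adjacency(n, fully_connected_bonds(n))

# Edges present in the fallback graph but absent from the real molecule.
spurious = sum(
    fallback_adj[i][j] - true_adj[i][j] for i, j in combinations(range(n), 2)
)
print(spurious)  # 15 possible pairs minus 6 real bonds = 9
```

The fraction of spurious edges grows with ligand size, since the number of atom pairs grows quadratically while the number of real bonds grows roughly linearly.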

Environment

esm: 3.3.0
transformers: 4.57.6
Python: 3.12
