436 integrate ptms from evidence into alphafold predictions #449

Open
tE3m wants to merge 14 commits into 361-include-ptms-in-existing-cifs from 436-integrate-ptms-from-evidence-into-alphafold-predictions

Collaborator

@tE3m tE3m commented May 29, 2026

Description

fixes #436

Introduces a new step that inserts modified residues into a cif_df wherever they are detected in a supplied psm_df. Requires the outputs of an alphafold-import step.

Changes

  • backend/user_data/external/ptm/*: location of the modified residue structure files
  • backend/protzilla/constants/cif_columns.py: add required columns and a KnownPTM enum, which holds all PTM combinations we currently have structure files for
  • backend/protzilla/constants/peptide_columns.py: refactor column lists to enums to allow reuse
    • backend/protzilla/importing/peptide_import.py: slight adaptations due to change above
  • backend/protzilla/data_analysis/crosslinking_validation.py: refactor the formerly local get_crosslink_positions_in_protein method so it can be reused outside the crosslinking use case, and rename it to reflect the more general use case
  • backend/protzilla/data_integration/cif_ptm.py: add calc method for new step
  • backend/protzilla/importing/alphafold_protein_structure_load.py: move the mapping from y/n/. strings to native booleans to constants file
  • backend/protzilla/methods/importing.py: add a raw cif import for debugging purposes, to be extended by "Integrate PTMs from evidence into conventional protein structure files" #437
  • backend/protzilla/utilities/ptm_helpers.py: helper functions for parsing PTM names from a psm_df
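
As a rough illustration of the "column lists to enums" refactor, a sketch along these lines lets several modules share one definition. The member names mirror the PSM_DF_COLUMNS references quoted in the review; the string values and the REQUIRED_PTM_COLUMNS list are assumptions for illustration, not the contents of peptide_columns.py.

```python
from enum import Enum


class PSM_DF_COLUMNS(str, Enum):
    # Member names follow the review excerpts; the string values are assumed.
    SEQUENCE = "Sequence"
    MODIFICATIONS = "Modifications"
    MODIFIED_SEQUENCE = "Modified sequence"
    PROTEIN_ID = "Protein ID"


# Hypothetical reuse: a step declares the columns it needs by member,
# and plain strings are derived only where a DataFrame expects them.
REQUIRED_PTM_COLUMNS = [
    PSM_DF_COLUMNS.SEQUENCE,
    PSM_DF_COLUMNS.MODIFICATIONS,
    PSM_DF_COLUMNS.MODIFIED_SEQUENCE,
    PSM_DF_COLUMNS.PROTEIN_ID,
]
required_names = [column.value for column in REQUIRED_PTM_COLUMNS]
```

Referring to members instead of repeating string literals means a renamed column only has to change in one place.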

Testing

  1. Import a monomer with (or a multimer containing) a protein ID for which we have PTMs in an evidence file. I used Q16555 because the evidence.txt from the MaxQuant_data in the nextcloud folder is well-formed, unlike the one from our example dataset (see "Allow abbreviated PTM identifiers in evidence" #438).
  2. Import the evidence.txt mentioned above
  3. Connect the relevant inputs to the new step
  4. Observe the result in the visualisations tab, where the non-standard molecules can be highlighted by selecting "non-standard" from the components in the bottom right

PR checklist

Development

  • If necessary, I have updated the documentation (README, docstrings, etc.)
  • If necessary, I have created / updated tests.

Mergeability

  • main-branch has been merged into local branch to resolve conflicts
  • The tests and linter have passed AFTER local merge
  • The backend code has been formatted with black
  • The frontend code has been formatted with pnpm format and checked with pnpm lint

Code review

  • I have self-reviewed my code.
  • At least one other developer reviewed and approved the changes

@tE3m tE3m force-pushed the 436-integrate-ptms-from-evidence-into-alphafold-predictions branch from 7f1e50b to ecd4429 Compare June 1, 2026 11:20
why do we even bother with this
@tE3m tE3m requested review from AnnaPolensky and Elena-kal June 1, 2026 11:43
Collaborator

@AnnaPolensky AnnaPolensky left a comment

Looks good overall. I especially liked your comments as they were very helpful for understanding what you do when you filter and change the dataframes.

But please provide a description on how to test your changes. I tried this:

Image

And got this error:

Image

Maybe I have uploaded a wrong evidence file? I used the one we got first right at the beginning of our project. For the AlphaFold step I used O43242.

Also, black fails.

return matches.iloc[0]


def get_residue_positions_in_protein(
Collaborator

Would it make sense to maybe extract this function to some helper file now that it is used in the new step in data integration as well?

Comment on lines +67 to +76
    # and ignore duplicates between different samples
    [
        PSM_DF_COLUMNS.SEQUENCE,
        PSM_DF_COLUMNS.MODIFICATIONS,
        PSM_DF_COLUMNS.MODIFIED_SEQUENCE,
        PSM_DF_COLUMNS.PROTEIN_ID,
    ]
].drop_duplicates(
    ignore_index=True
)
Collaborator

What kind of duplicates do we ignore here? Do the same modifications of a specific residue appear several times?

Collaborator Author

Yes, this can happen if the same modification is detected in different samples in the evidence.
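
To make the duplicate case concrete: once the sample column is left out, identical peptide/modification rows from different samples collapse into one. The sketch below reproduces that with plain tuples and invented rows; the excerpt itself does this with drop_duplicates(ignore_index=True) on the DataFrame.

```python
# Invented evidence rows: (sample, sequence, modifications, modified sequence, protein ID).
rows = [
    ("sample_1", "AMLEK", "Oxidation (M)", "_AM(Oxidation (M))LEK_", "Q16555"),
    ("sample_2", "AMLEK", "Oxidation (M)", "_AM(Oxidation (M))LEK_", "Q16555"),
    ("sample_2", "AMLEK", "Unmodified", "_AMLEK_", "Q16555"),
]

# Leave out the sample column, then keep the first occurrence of each remaining row.
# dict.fromkeys preserves insertion order, so the first match wins.
without_sample = [row[1:] for row in rows]
unique_rows = list(dict.fromkeys(without_sample))
```

The two oxidised rows differ only in their sample, so a single one survives next to the unmodified row.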

Comment on lines +80 to +90
psm_df["mod_tuple"] = psm_df.apply(
    lambda row: [
        (
            mod,
            residue,
            # caution: position within protein is 1-indexed
            protein_location,
        )
        for mod, mod_locations in extract_mods(
            row[PSM_DF_COLUMNS.MODIFIED_SEQUENCE],
            clean_mod_list_of_numbers(row[PSM_DF_COLUMNS.MODIFICATIONS].split(",")),
Collaborator

nit: might just be me, but at first I was a bit confused about what mod means in this context because I thought you meant modulo. I think modification might be a bit clearer, but we can keep it as it is.

Collaborator Author

I know what you mean. I just find it cumbersome to always write modification, and, much like one might use i instead of index when the scope is limited, it felt right to me here.
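
Since the comment in the excerpt warns that protein positions are 1-indexed, here is a minimal sketch of the index arithmetic involved. It assumes the peptide-local position is also 1-based and that the peptide occurs in the protein; residue_position_in_protein is a hypothetical helper, not the PR's get_residue_positions_in_protein, and the sequences are invented.

```python
def residue_position_in_protein(protein_sequence, peptide_sequence, position_in_peptide):
    """Return the 1-based protein position of a residue given its 1-based peptide position."""
    start = protein_sequence.find(peptide_sequence)  # 0-based offset, -1 if absent
    if start == -1:
        raise ValueError("peptide not found in protein")
    # A 0-based offset plus a 1-based local position yields a 1-based protein position.
    return start + position_in_peptide


protein = "MKTAYIAKQR"
peptide = "AYIAK"
position = residue_position_in_protein(protein, peptide, 2)  # second residue of the peptide
residue = protein[position - 1]  # convert back to 0-based for Python indexing
```

Mixing the two conventions is the usual source of off-by-one errors here, which is why the conversion back to a 0-based index is spelled out.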

Comment on lines +223 to +225
atom_site_df[ATOM_SITE_COLUMNS.LABEL_ALT_ID] = "."
atom_site_df[ATOM_SITE_COLUMNS.OCCUPANCY] = 1.0
atom_site_df[ATOM_SITE_COLUMNS.B_ISO_OR_EQUIV] = pd.NA
Collaborator

I don't know what these values stand for. So just to be sure: we want to hardcode these values, and this is not just a leftover from debugging?

Collaborator Author

The first two are always constant from what I saw in the cif files I've handled, whereas the third is usually a number, but one that we have no data for, so there we write NA.

Replaces the residue at a specific location with a modified residue

:param cif_df: The cif_df to modify
:param index: The 1-based index in the protein of the resiude to change
Collaborator

Typo: residue

Comment on lines +1 to +3
"""
PTM helper functions, taken and adapted from https://github.com/usiGrabber/usiGrabber
"""
Collaborator

Is it enough to state this at the beginning of the file, or should we give credit in each adapted function, since someone who only looks at one specific function might miss these credits?

Collaborator Author

The tool I referenced was developed at our chair, so this is more of a pointer for developers than proper attribution. Nonetheless, licensing stuff would always be stated at the top of the file in a comment in my experience, especially since I slightly changed some stuff

cleaned_mods = []
for mod in mod_list:
    mod = str(mod)
    cleaned_mods.append(mod.lstrip(digits).lstrip(" "))
Collaborator

This first removes leading digits and then leading spaces. If we had something like " 45something" only the spaces would be removed but not the digits. Is that ok/intended? If so, I might make the docstring a bit more specific about the order in which things are removed.

Collaborator Author

The data is taken straight from MQ and processed, so we don't have to deal with user input here. As such, the format is <amount if more than 1> , so while what you describe could be caught, there is no need for it.
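
The ordering question can be checked directly. The sketch below wraps the excerpt's loop in a function (the signature and the return statement are assumed) and runs it on well-formed MaxQuant-style entries and on the hypothetical " 45something" input.

```python
from string import digits


def clean_mod_list_of_numbers(mod_list):
    # Loop body taken from the excerpt; signature and return are assumed.
    cleaned_mods = []
    for mod in mod_list:
        mod = str(mod)
        # Strip a leading count first, then the space that separated it from the name.
        cleaned_mods.append(mod.lstrip(digits).lstrip(" "))
    return cleaned_mods


well_formed = clean_mod_list_of_numbers(["2 Oxidation (M)", "Acetyl (Protein N-term)"])
edge_case = clean_mod_list_of_numbers([" 45something"])
```

The well-formed entries lose their count and separator, while the space-prefixed input keeps its digits because lstrip(digits) stops at the leading space.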

str: The simplified modification name (i.e. Oxidation).
"""
mod_name = str(mod_name)
return mod_name.lstrip(digits).lstrip(" ").split(" ")[0]
Collaborator

Same as above.

Comment on lines +202 to +210
"""
Simplify a modification name by removing leading numbers and spaces.
This function takes a modification name string and removes any leading
numeric characters and spaces to return a cleaner version of the name.
Args:
    mod_name (str): The original modification name (i.e. Oxidation (M)).
Returns:
    str: The simplified modification name (i.e. Oxidation).
"""
Collaborator

The description and example do not fully match.
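
One way to see the mismatch: the return expression from the excerpt also cuts everything after the first space, which the docstring summary does not mention. simplify_mod_name is an assumed name, since the excerpt does not show the function's signature.

```python
from string import digits


def simplify_mod_name(mod_name):
    # Body copied from the excerpt; the function name is assumed.
    mod_name = str(mod_name)
    return mod_name.lstrip(digits).lstrip(" ").split(" ")[0]


with_count = simplify_mod_name("2 Oxidation (M)")
without_count = simplify_mod_name("Oxidation (M)")
```

Both calls drop the "(M)" suffix, so the docstring's example is only reached through the final split, not through removing leading numbers and spaces.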

Collaborator

Do we need to credit the source from where we got these PTMs? (same question for the other cif files as well)

Collaborator Author

tE3m commented Jun 2, 2026

> Looks good overall. I especially liked your comments as they were very helpful for understanding what you do when you filter and change the dataframes.
>
> But please provide a description on how to test your changes. I tried this:
> Image
>
> And got this error:
> Image
>
> Maybe I have uploaded a wrong evidence file? I used the one we got first right at the beginning of our project. For the AlphaFold step I used O43242.
>
> Also, black fails.

Sorry about that. As this PR is still marked as draft, I didn't expect you to get pinged. While writing up the testing, I encountered issues similar to yours (the one you show happens when there are no PTMs for the protein ID in the evidence), some of which only appeared after rebasing onto the current crosslinking state. As those issues are not yet fixed, I hadn't intended for the PR to be reviewed already. Sorry again!

github-actions Bot commented Jun 2, 2026

Coverage report

Columns: File, Statements, Missing, Coverage, Coverage (new stmts), Lines missing

  • backend/protzilla/all_steps.py
  • backend/protzilla/data_analysis/crosslinking_validation.py
  • backend/protzilla/data_integration/cif_ptm.py: lines missing 88, 205-246, 348-396
  • backend/protzilla/importing/alphafold_protein_structure_load.py: lines missing 175
  • backend/protzilla/importing/peptide_import.py: lines missing 56-57
  • backend/protzilla/methods/data_integration.py: lines missing 1008
  • backend/protzilla/methods/importing.py: lines missing 441
  • backend/protzilla/utilities/ptm_helpers.py: lines missing 38, 40, 87, 127-130, 178-179, 211-212
  • Project Total

This report was generated by python-coverage-comment-action

@tE3m tE3m marked this pull request as ready for review June 2, 2026 17:45