436 integrate ptms from evidence into alphafold predictions by tE3m · Pull Request #449 · cschlaffner/PROTzilla

tE3m · 2026-05-29T15:38:03Z

Description

fixes #436

Introduces a new step that inserts modified residues into a cif_df wherever they are detected in a supplied psm_df. Requires the outputs of an AlphaFold import step.

Changes

backend/user_data/external/ptm/*: location of the modified residue structure files
backend/protzilla/constants/cif_columns.py: add required columns and a KnownPTM enum, which holds all PTM combinations we currently have structure files for
backend/protzilla/constants/peptide_columns.py: refactor column lists to enums to allow reuse
- backend/protzilla/importing/peptide_import.py: slight adaptations due to change above
backend/protzilla/data_analysis/crosslinking_validation.py: refactor former local get_crosslink_positions_in_protein method to allow reuse outside of the crosslinking use case, rename to reflect more general use-case
backend/protzilla/data_integration/cif_ptm.py: add calc method for new step
backend/protzilla/importing/alphafold_protein_structure_load.py: move the mapping from y/n/. strings to native booleans to constants file
backend/protzilla/methods/importing.py: add a raw cif import for debugging purposes, to be extended by Integrate PTMs from evidence into conventional protein structure files #437
backend/protzilla/utilities/ptm_helpers.py: helper functions for parsing PTM names from a psm_df

Testing

Import a monomer with (or a multimer containing) a protein ID for which the evidence file contains PTMs. I used Q16555 because the evidence.txt from the MaxQuant_data in the Nextcloud folder is well-formed, unlike the one from our example dataset (see Allow abbreviated PTM identifiers in evidence #438).
Import the evidence.txt mentioned above
Connect the relevant inputs to the new step
Observe the result in the visualisations tab, where the non-standard molecules can be highlighted by selecting "non-standard" from the components in the bottom right

PR checklist

Development

If necessary, I have updated the documentation (README, docstrings, etc.)
If necessary, I have created / updated tests.

Mergeability

main-branch has been merged into local branch to resolve conflicts
The tests and linter have passed AFTER local merge
The backend code has been formatted with black
The frontend code has been formatted with pnpm format and checked with pnpm lint

Code review

I have self-reviewed my code.
At least one other developer reviewed and approved the changes

why do we even bother with this

AnnaPolensky

Looks good overall. I especially liked your comments as they were very helpful for understanding what you do when you filter and change the dataframes.

But please provide a description on how to test your changes. I tried this:

And got this error:

Maybe I have uploaded a wrong evidence file? I used the one we got first right at the beginning of our project. For the AlphaFold step I used O43242.

Also, black fails.

AnnaPolensky · 2026-06-02T07:46:27Z

    return matches.iloc[0]


+def get_residue_positions_in_protein(


Would it make sense to maybe extract this function to some helper file now that it is used in the new step in data integration as well?

AnnaPolensky · 2026-06-02T08:01:12Z

+        # and ignore duplicates between different samples
+        [
+            PSM_DF_COLUMNS.SEQUENCE,
+            PSM_DF_COLUMNS.MODIFICATIONS,
+            PSM_DF_COLUMNS.MODIFIED_SEQUENCE,
+            PSM_DF_COLUMNS.PROTEIN_ID,
+        ]
+    ].drop_duplicates(
+        ignore_index=True
+    )


What kind of duplicates do we ignore here? Does the same modification of a specific residue appear several times?

Yes, this can happen if the same modification is detected in different samples in the evidence
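
To illustrate the reply above, here is a minimal sketch of how the same modification reported in two samples collapses to one row. The column names and rows are hypothetical stand-ins for the PSM_DF_COLUMNS values, which are not shown in this thread.

```python
import pandas as pd

# Hypothetical evidence rows: the same oxidised peptide is reported in samples S1 and S2.
psm_df = pd.DataFrame(
    {
        "Sample": ["S1", "S2", "S1"],
        "Sequence": ["AMK", "AMK", "GSK"],
        "Modifications": ["Oxidation (M)", "Oxidation (M)", "Phospho (STY)"],
        "Modified sequence": [
            "_AM(Oxidation (M))K_",
            "_AM(Oxidation (M))K_",
            "_GS(Phospho (STY))K_",
        ],
        "Protein ID": ["Q16555", "Q16555", "Q16555"],
    }
)

# Dropping the sample column first makes the two AMK rows identical,
# so drop_duplicates keeps only one of them and renumbers the index.
unique_mods = psm_df[
    ["Sequence", "Modifications", "Modified sequence", "Protein ID"]
].drop_duplicates(ignore_index=True)

print(unique_mods)
```

Because the sample column is excluded from the selection, rows that differ only in their sample are treated as duplicates.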

AnnaPolensky · 2026-06-02T08:05:42Z

+    psm_df["mod_tuple"] = psm_df.apply(
+        lambda row: [
+            (
+                mod,
+                residue,
+                # caution: position within protein is 1-indexed
+                protein_location,
+            )
+            for mod, mod_locations in extract_mods(
+                row[PSM_DF_COLUMNS.MODIFIED_SEQUENCE],
+                clean_mod_list_of_numbers(row[PSM_DF_COLUMNS.MODIFICATIONS].split(",")),


nit: might just be me, but at first I was a bit confused about what mod means in this context because I thought you meant modulo. I think modification might be a bit clearer, but we can keep it as it is.

I know what you mean, I just feel that it's cumbersome to always write modification. Just as one might use i instead of index when the scope is limited, the short name felt right to me here.
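
The "1-indexed" caution in the snippet above can be illustrated with a small sketch. The function name, signature and sequences below are hypothetical and are not the PR's get_residue_positions_in_protein; the sketch only shows the off-by-one conversion from a 0-based string offset to a 1-based protein position.

```python
def residue_positions_in_protein(
    protein_seq: str, peptide_seq: str, offsets: list[int]
) -> list[int]:
    """Map 0-based offsets inside a peptide to 1-based positions in the protein.

    Only the first occurrence of the peptide is considered in this sketch.
    """
    start = protein_seq.find(peptide_seq)  # 0-based start, -1 if the peptide is absent
    if start == -1:
        return []
    # +1 converts the 0-based string index into a 1-based residue number
    return [start + offset + 1 for offset in offsets]


# Made-up sequences: the peptide starts at string index 3 of the protein.
positions = residue_positions_in_protein("MKTAYIAKQR", "AYIAK", [1])
print(positions)  # [5], i.e. the Y is the fifth residue of the protein
```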

AnnaPolensky · 2026-06-02T08:13:41Z

+    atom_site_df[ATOM_SITE_COLUMNS.LABEL_ALT_ID] = "."
+    atom_site_df[ATOM_SITE_COLUMNS.OCCUPANCY] = 1.0
+    atom_site_df[ATOM_SITE_COLUMNS.B_ISO_OR_EQUIV] = pd.NA


I don't know what these values stand for. So just to be sure: We want to hardcode these values and this is not just some remains from debugging?

The first two are always constant in the cif files I've handled, whereas the third is usually a number, but one that we have no data for, so we write NA there.

AnnaPolensky · 2026-06-02T08:14:22Z

+    Replaces the residue at a specific location with a modified residue
+
+    :param cif_df: The cif_df to modify
+    :param index: The 1-based index in the protein of the resiude to change


Typo: residue

AnnaPolensky · 2026-06-02T08:56:01Z

+"""
+PTM helper functions, taken and adapted from https://github.com/usiGrabber/usiGrabber
+"""


Is it enough to state this at the beginning of the file, or should we give credit in each adapted function, since someone who only looks at one specific function might miss these credits?

The tool I referenced was developed at our chair, so this is more of a pointer for developers than proper attribution. Nonetheless, licensing stuff would always be stated at the top of the file in a comment in my experience, especially since I slightly changed some stuff

AnnaPolensky · 2026-06-02T09:06:11Z

+    cleaned_mods = []
+    for mod in mod_list:
+        mod = str(mod)
+        cleaned_mods.append(mod.lstrip(digits).lstrip(" "))


This first removes leading digits and then leading spaces. If we had something like " 45something" only the spaces would be removed but not the digits. Is that ok/intended? If so, I might make the docstring a bit more specific about the order in which things are removed.

The data is taken straight from MQ and processed, so we don't have to deal with user input here. As such, the format is <amount if more than 1> followed by the modification name, so while what you describe could be caught, there is no need for it.
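
The ordering discussed in this thread can be checked with a short sketch. The helper name clean_mod is hypothetical; its body repeats the expression from the snippet under review.

```python
from string import digits


def clean_mod(mod: str) -> str:
    # Same expression as in the reviewed loop: strip digits first, then spaces.
    return str(mod).lstrip(digits).lstrip(" ")


print(repr(clean_mod("2 Oxidation (M)")))  # 'Oxidation (M)'
print(repr(clean_mod("Oxidation (M)")))    # 'Oxidation (M)'
print(repr(clean_mod(" 45something")))     # '45something'
```

A leading count followed by a space is removed as expected. For an input that starts with a space, only the space is removed and the digits remain, which is the case raised in the review.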

AnnaPolensky · 2026-06-02T09:07:06Z

+        str: The simplified modification name (i.e. Oxidation).
+    """
+    mod_name = str(mod_name)
+    return mod_name.lstrip(digits).lstrip(" ").split(" ")[0]


Same as above.

AnnaPolensky · 2026-06-02T09:07:47Z

+    """
+    Simplify a modification name by removing leading numbers and spaces.
+    This function takes a modification name string and removes any leading
+    numeric characters and spaces to return a cleaner version of the name.
+    Args:
+        mod_name (str): The original modification name (i.e. Oxidation (M)).
+    Returns:
+        str: The simplified modification name (i.e. Oxidation).
+    """


The description and example do not fully match.
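
One way the mismatch could be resolved is sketched below, assuming the return expression shown in the earlier snippet. The function name simplify_mod_name is hypothetical, and the docstring wording is only a suggestion: it mentions the truncation to the first word, which the current description omits.

```python
from string import digits


def simplify_mod_name(mod_name: str) -> str:
    """
    Reduce a modification name to its first word.

    Leading digits and spaces are removed, then everything after the first
    space (such as the residue specifier in parentheses) is dropped.

    Args:
        mod_name (str): The original name, e.g. "2 Oxidation (M)".
    Returns:
        str: The simplified name, e.g. "Oxidation".
    """
    return str(mod_name).lstrip(digits).lstrip(" ").split(" ")[0]


print(simplify_mod_name("2 Oxidation (M)"))  # Oxidation
print(simplify_mod_name("Phospho (STY)"))    # Phospho
```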

AnnaPolensky · 2026-06-02T09:08:33Z

Do we need to credit the source from where we got these PTMs? (same question for the other cif files as well)

tE3m · 2026-06-02T10:20:40Z

Sorry about that. As this PR is still marked as draft, I didn't expect you to get pinged. While writing up the testing, I encountered issues similar to yours (the one you show happens when there are no PTMs for the protein ID in the evidence), some of which only appeared after rebasing onto the current crosslinking state. As those issues are not yet fixed, I hadn't intended for the PR to be reviewed already. Sorry again!

github-actions · 2026-06-02T12:01:41Z

Coverage report

File	Lines missing
backend/protzilla/all_steps.py	
backend/protzilla/data_analysis/crosslinking_validation.py	
backend/protzilla/data_integration/cif_ptm.py	88, 205-246, 348-396
backend/protzilla/importing/alphafold_protein_structure_load.py	175
backend/protzilla/importing/peptide_import.py	56-57
backend/protzilla/methods/data_integration.py	1008
backend/protzilla/methods/importing.py	441
backend/protzilla/utilities/ptm_helpers.py	38, 40, 87, 127-130, 178-179, 211-212

This report was generated by python-coverage-comment-action

tE3m added 5 commits May 28, 2026 21:52

feat: convert evidence and peptide column constants to enums

85e2cd7

refactor: pull peptide location search within protein up

df6ef13

feat: add PTM cif files

8a04d03

feat: add insertion of PTMs into alphafold structures

f2a5ce2

chore: formatting

ecd4429

tE3m force-pushed the 436-integrate-ptms-from-evidence-into-alphafold-predictions branch from 7f1e50b to ecd4429 Compare June 1, 2026 11:20

chore: fix all steps test

3e1a357

why do we even bother with this

tE3m requested review from AnnaPolensky and Elena-kal June 1, 2026 11:43

AnnaPolensky requested changes Jun 2, 2026

tE3m added 2 commits June 2, 2026 13:28

chore: formatting

04f2d66

feat: add visualization output to PTM insertion

ebe0b2c

tE3m added 6 commits June 2, 2026 18:19

fix: multimer handling

ef981b6

rename step category

19a5cd4

fix: respect native boolean value

69d5c55

fix: return early if no PTMs are found

87af14b

fix: correctly re-index changed id column

41b0228

feat: add tests

0532c0f

tE3m marked this pull request as ready for review June 2, 2026 17:45
