436 integrate ptms from evidence into alphafold predictions #449
why do we even bother with this
Looks good overall. I especially liked your comments as they were very helpful for understanding what you do when you filter and change the dataframes.
But please provide a description on how to test your changes. I tried this:
And got this error:
Maybe I uploaded the wrong evidence file? I used the one we got at the very beginning of our project. For the AlphaFold step I used O43242.
Also, black fails.
    return matches.iloc[0]


def get_residue_positions_in_protein(
Would it make sense to maybe extract this function to some helper file now that it is used in the new step in data integration as well?
    # and ignore duplicates between different samples
    [
        PSM_DF_COLUMNS.SEQUENCE,
        PSM_DF_COLUMNS.MODIFICATIONS,
        PSM_DF_COLUMNS.MODIFIED_SEQUENCE,
        PSM_DF_COLUMNS.PROTEIN_ID,
    ]
].drop_duplicates(
    ignore_index=True
)
What kind of duplicates do we ignore here? Can the same modification of a specific residue appear several times?
Yes, this can happen if the same modification is detected in different samples in the evidence
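As a minimal sketch of the deduplication discussed above: the column names and rows below are hypothetical stand-ins for the PSM_DF_COLUMNS members and for real evidence data. Selecting only the identifying columns drops the sample column, so the same modification reported in two samples collapses into one row.

```python
import pandas as pd

# Hypothetical column names and rows, used for illustration only.
psm_df = pd.DataFrame(
    {
        "sample": ["S1", "S2", "S1"],
        "sequence": ["PEPTMIDE", "PEPTMIDE", "ANOTHER"],
        "modifications": ["Oxidation (M)", "Oxidation (M)", "Unmodified"],
        "modified_sequence": ["_PEPTM(ox)IDE_", "_PEPTM(ox)IDE_", "_ANOTHER_"],
        "protein_id": ["Q16555", "Q16555", "Q16555"],
    }
)

# Keep only the columns that identify a modification, then drop repeats.
# ignore_index=True renumbers the remaining rows from 0.
key_columns = ["sequence", "modifications", "modified_sequence", "protein_id"]
unique_psms = psm_df[key_columns].drop_duplicates(ignore_index=True)

print(unique_psms)
```

The two PEPTMIDE rows differ only in their sample, so a single row remains for them.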
psm_df["mod_tuple"] = psm_df.apply(
    lambda row: [
        (
            mod,
            residue,
            # caution: position within protein is 1-indexed
            protein_location,
        )
        for mod, mod_locations in extract_mods(
            row[PSM_DF_COLUMNS.MODIFIED_SEQUENCE],
            clean_mod_list_of_numbers(row[PSM_DF_COLUMNS.MODIFICATIONS].split(",")),
nit: might just be me, but I was at first a bit confused what mod means in this context because I thought you meant modulo. I think modification might be a bit clearer, but we can keep it like it is
I know what you mean. I just feel that it's cumbersome to always write modification; just as one might use i instead of index when the scope is limited, it felt right to me here.
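The 1-indexed protein position noted in the snippet above can be illustrated with a small sketch. The helper name, its signature, and the sequences are hypothetical and not taken from this PR; the sketch only considers the first occurrence of the peptide in the protein.

```python
def peptide_to_protein_position(
    protein_sequence: str, peptide_sequence: str, position_in_peptide: int
) -> int:
    """Map a 1-indexed position within a peptide to a 1-indexed protein position.

    Hypothetical helper for illustration; only the first occurrence of the
    peptide in the protein is considered.
    """
    start = protein_sequence.find(peptide_sequence)  # 0-indexed, -1 if absent
    if start == -1:
        raise ValueError("peptide not found in protein")
    if not 1 <= position_in_peptide <= len(peptide_sequence):
        raise ValueError("position lies outside the peptide")
    # 0-indexed start plus 1-indexed offset yields a 1-indexed protein position.
    return start + position_in_peptide


protein = "MKTAYIAKQR"
peptide = "AYIAK"
position = peptide_to_protein_position(protein, peptide, 2)
print(position, protein[position - 1])
```

Indexing the Python string with the result requires subtracting 1 again, which is the off-by-one the code comment warns about.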
atom_site_df[ATOM_SITE_COLUMNS.LABEL_ALT_ID] = "."
atom_site_df[ATOM_SITE_COLUMNS.OCCUPANCY] = 1.0
atom_site_df[ATOM_SITE_COLUMNS.B_ISO_OR_EQUIV] = pd.NA
I don't know what these values stand for. So just to be sure: we want to hardcode these values, and they are not just leftovers from debugging?
The first two are always constant in the cif files I've handled, whereas the third is usually a number that we have no data for, so we write NA there.
Replaces the residue at a specific location with a modified residue

:param cif_df: The cif_df to modify
:param index: The 1-based index in the protein of the residue to change
"""
PTM helper functions, taken and adapted from https://github.com/usiGrabber/usiGrabber
"""
Is it enough to state this at the beginning of the file, or should we give credit in each adapted function, since someone who only looks at one specific function might miss these credits?
The tool I referenced was developed at our chair, so this is more of a pointer for developers than proper attribution. Nonetheless, licensing stuff would always be stated at the top of the file in a comment in my experience, especially since I slightly changed some stuff
cleaned_mods = []
for mod in mod_list:
    mod = str(mod)
    cleaned_mods.append(mod.lstrip(digits).lstrip(" "))
This first removes leading digits and then leading spaces. If we had something like " 45something" only the spaces would be removed but not the digits. Is that ok/intended? If so, I might make the docstring a bit more specific about the order in which things are removed.
The data is taken straight from MQ and processed, so we don't have to deal with user input here. As such, the format is <amount if more than 1> , so while what you describe could be caught, there is no need for it.
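The order-of-stripping point raised above can be checked with a short sketch. The function name clean_mod is hypothetical; the body mirrors the expression in the diff, and the input strings are made up for illustration.

```python
from string import digits


def clean_mod(mod) -> str:
    # Mirrors the diff: strip leading digits first, then leading spaces.
    return str(mod).lstrip(digits).lstrip(" ")


# A leading count followed by a space is removed as intended.
with_count = clean_mod("2 Oxidation (M)")
# A name without a count is left unchanged.
without_count = clean_mod("Oxidation (M)")
# A leading space shields the digits: only the space is removed.
space_first = clean_mod(" 45something")

print(repr(with_count), repr(without_count), repr(space_first))
```

The third case is the one described in the review comment; it keeps its digits because the first lstrip call stops at the leading space.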
        str: The simplified modification name (i.e. Oxidation).
    """
    mod_name = str(mod_name)
    return mod_name.lstrip(digits).lstrip(" ").split(" ")[0]
    """
    Simplify a modification name by removing leading numbers and spaces.
    This function takes a modification name string and removes any leading
    numeric characters and spaces to return a cleaner version of the name.
    Args:
        mod_name (str): The original modification name (i.e. Oxidation (M)).
    Returns:
        str: The simplified modification name (i.e. Oxidation).
    """
The description and example do not fully match.
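A small sketch of the mismatch, assuming a hypothetical function name simplify_mod_name around the return expression from the diff: besides removing leading numbers and spaces, the expression also discards everything after the first remaining space, which the docstring description does not mention but its example relies on.

```python
from string import digits


def simplify_mod_name(mod_name) -> str:
    # Same expression as in the diff; the function name is an assumption.
    return str(mod_name).lstrip(digits).lstrip(" ").split(" ")[0]


# The residue suffix "(M)" is dropped by split(" ")[0], not by the stripping.
simplified = simplify_mod_name("Oxidation (M)")
simplified_with_count = simplify_mod_name("2 Oxidation (M)")

print(simplified, simplified_with_count)
```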
Do we need to credit the source from where we got these PTMs? (same question for the other cif files as well)
Description
fixes #436
Introduces a new step to enable insertion of modified residues into a cif_df where they are detected in a supplied psm_df. Requires outputs of an alphafold-import step.

Changes
- backend/user_data/external/ptm/*: location of the modified residue structure files
- backend/protzilla/constants/cif_columns.py: add required columns and a KnownPTM enum, which holds all PTM combinations we currently have structure files for
- backend/protzilla/constants/peptide_columns.py: refactor column lists to enums to allow reuse
- backend/protzilla/importing/peptide_import.py: slight adaptations due to the change above
- backend/protzilla/data_analysis/crosslinking_validation.py: refactor the former local get_crosslink_positions_in_protein method to allow reuse outside of the crosslinking use case, and rename it to reflect the more general use case
- backend/protzilla/data_integration/cif_ptm.py: add calc method for new step
- backend/protzilla/importing/alphafold_protein_structure_load.py: move the mapping from y/n/. strings to native booleans to the constants file
- backend/protzilla/methods/importing.py: add a raw cif import for debugging purposes, to be extended by "Integrate PTMs from evidence into conventional protein structure files" #437
- backend/protzilla/utilities/ptm_helpers.py: helper functions for parsing PTM names from a psm_df

Testing
Q16555 because the evidence.txt from the MaxQuant_data in the nextcloud folder is well-formed, unlike the one from our example dataset (see "Allow abbreviated PTM identifiers in evidence" #438).

PR checklist
Development
Mergeability
- black
- pnpm format and checked with pnpm lint
Code review