Helper Scripts

To use your own PDB files with the MAGPIE GoogleColab or Local-Version, you first need to clean, standardize, and align the input PDB files using the helper scripts MAGPIE_input_prep.py, align_protein_chain.py, and align_small_molecule.py. If you are using the GoogleColab version of MAGPIE, download these scripts from the GitHub repository; if you have downloaded the Local-Version, they are included in the downloaded repository. The helper scripts require Python to be installed on your local machine. For information on how to install Python, please visit https://www.python.org/downloads/.

I. `MAGPIE_input_prep.py`

This script allows the user to input files or directories with options to define target ligands and protein binders by chain, protein sequence (with percent sequence identity), ligand name (for small molecule targets), and search radius around the target. The alignment scripts align the protein-ligand complexes on the target ligand for use with MAGPIE, returning subsets of complexes for small molecule ligands as determined by user-defined all-atom RMSD.

To list the arguments of the MAGPIE_input_prep.py script, run the following command in the terminal:

python MAGPIE_input_prep.py --help

Usage

MAGPIE_input_prep.py [-h] [-S BINDER_SEQS] [-s TARGET_PROTEIN_SEQS] [-F BINDER_SEQ_FA] [-f TARGET_PROTEIN_SEQ_FA] [-L TARGET_SM_3_NAMES] [-I TARGET_SM_INDEX] [-C BINDER_CHAINS] [-c TARGET_PROTEIN_CHAINS] [-l TARGET_SM_CHAINS] [-U {chains,sequences}] [-u {chains,name}] [-d SEQ_IDENTITY] -i INPUT_PATH -o OUTPUT_PATH [-N TARGET_PROTEIN_CHAIN_RENAME] [-n TARGET_SM_CHAIN_RENAME] [-b BINDER_CHAIN_RENAME] [-t] [-T] [-r SM_LIGAND_REFERENCE_PATH] [-B BOND_LENGTH] [-m SEARCH_RADIUS] [-M MESH_SEARCH] [-A {1v1,MSA}] [-a SEQ_TARGET_REF_PDB]

Argument definitions

-h, --help

show this help message and exit
-S BINDER_SEQS, --binder_seqs BINDER_SEQS

The sequence(s) of the binder used for identification.
-s TARGET_PROTEIN_SEQS, --target_protein_seqs TARGET_PROTEIN_SEQS

The sequence(s) of the target protein used for identification.
-F BINDER_SEQ_FA, --binder_seq_fa BINDER_SEQ_FA

The sequence file (fasta format) of the binder used for identification.
-f TARGET_PROTEIN_SEQ_FA, --target_protein_seq_fa TARGET_PROTEIN_SEQ_FA

The sequence file (fasta format) of the target protein used for identification.
-L TARGET_SM_3_NAMES, --target_sm_3_names TARGET_SM_3_NAMES

The three letter code of the small molecule(s).
-I TARGET_SM_INDEX, --target_sm_index TARGET_SM_INDEX

The index of the small molecule(s). Must be used in tandem with chain or name.
-C BINDER_CHAINS, --binder_chains BINDER_CHAINS

The chain(s) of the binder used for identification.
-c TARGET_PROTEIN_CHAINS, --target_protein_chains TARGET_PROTEIN_CHAINS

The chains of the target protein used for identification.
-l TARGET_SM_CHAINS, --target_sm_chains TARGET_SM_CHAINS

The chain(s) of the small molecule used for identification.
-U {chains,sequences}, --search_first_protein {chains,sequences}

If using both chain and sequence searches, which should be used to filter first? Choices: “chains” or “sequences”. Default: “chains”
-u {chains,name}, --search_first_sm {chains,name}

If using both chain and name searches, which should be used to filter first? Choices: “chains” or “name”. Default: “chains”
-n TARGET_SM_CHAIN_RENAME, --target_sm_chain_rename TARGET_SM_CHAIN_RENAME

What the target small molecule output chain should be named. Default: “B”
-N TARGET_PROTEIN_CHAIN_RENAME, --target_protein_chain_rename TARGET_PROTEIN_CHAIN_RENAME

What the target protein output chain should be named. Default: “A”
-b BINDER_CHAIN_RENAME, --binder_chain_rename BINDER_CHAIN_RENAME

What the binder output chain should be named. Default: “C”
-t, --take_first_sm_only

Flag: only take the first instance of the ligand.
-T, --name_sm_atoms_same

Flag: rename all matching ligands with the same atom names. Uses the first file as a reference.
-r SM_LIGAND_REFERENCE_PATH, --sm_ligand_reference_path SM_LIGAND_REFERENCE_PATH

Path of the reference file for renaming the small molecule ligands. Default is the first file found with Python’s list directory function.
-B BOND_LENGTH, --bond_length BOND_LENGTH

Distance that defines a bond between 2 atoms for chemical graphs. Default: 2.1
-m SEARCH_RADIUS, --search_radius SEARCH_RADIUS

Distance that is considered for finding neighboring atoms in the mesh search. Default: 8
-M MESH_SEARCH, --mesh_search MESH_SEARCH

The chains, sequence(s), small molecule name(s), and small molecule index(es) for the mesh filter, separated by semicolons. Example: 'A,B;AWTRWARE,AWAWAWAW;TPA,ATP;1,2'
-A {1v1,MSA}, --seq_target_align {1v1,MSA}

Should we align the target protein in sequence space? Results in PDB numbering via alignment. Do not use it for small molecule ligands. Choices: “1v1”, “MSA”
-a SEQ_TARGET_REF_PDB, --seq_target_ref_pdb SEQ_TARGET_REF_PDB

Reference structure for target protein in seq_target_align.
-d SEQ_IDENTITY, --seq_identity SEQ_IDENTITY

Sequence identity threshold for finding similar chains. Default: 95%

Example

python ~/MAGPIE/MAGPIE_input_prep.py -i <input_directory> -o <output_directory> -L COA -M 'A,B;;COA;'

For more examples, see the MAGPIE supplementary material in Rodriguez et al. 2023.
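The -M/--mesh_search value packs four comma-separated lists into one semicolon-separated string (chains; sequences; small molecule names; small molecule indexes). The sketch below shows how such a string can be read; it is not taken from MAGPIE_input_prep.py, and the names `MeshFilter` and `parse_mesh_search` are hypothetical. It assumes that an empty field, as in 'A,B;;COA;', means no filter of that kind.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MeshFilter:
    """Hypothetical container for the four fields of a -M/--mesh_search string."""
    chains: List[str] = field(default_factory=list)
    sequences: List[str] = field(default_factory=list)
    sm_names: List[str] = field(default_factory=list)
    sm_indexes: List[int] = field(default_factory=list)


def parse_mesh_search(spec: str) -> MeshFilter:
    """Split 'chains;sequences;names;indexes' into four lists (illustrative only)."""
    parts = spec.split(";")
    if len(parts) != 4:
        raise ValueError(f"expected 4 ';'-separated fields, got {len(parts)}")
    # Assumption: an empty field means "no filter of this kind".
    chains, seqs, names, idx = ([x for x in p.split(",") if x] for p in parts)
    return MeshFilter(chains, seqs, names, [int(i) for i in idx])


if __name__ == "__main__":
    print(parse_mesh_search("A,B;;COA;"))
```

Read this way, the example command above filters on chains A and B and on the ligand name COA, with no sequence or index filter.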

II. `align_protein_chain.py`

This script aligns protein chains from different PDB files based on a specified chain identifier. The alignment uses PyMOL’s cealign command, which performs a structure-based, sequence-independent alignment of two objects. The script takes four arguments: the chain to align, an RMSD threshold, an input directory, and an output directory. It iterates over all PDB files in the input directory, loads each file into PyMOL, and aligns the specified chain in the current file to the same chain in the first file. The RMSD of each alignment is written to a CSV file along with the name of the PDB file. If the RMSD is less than or equal to the threshold, the aligned structure is saved as a new PDB file in the output directory.
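The align-log-filter loop described above can be sketched in plain Python. This is not the script's code: the PyMOL cealign call is replaced by a caller-supplied function (`rmsd_to_reference`, a hypothetical name), and the CSV column names are assumptions, so the control flow can be read and run without PyMOL installed.

```python
import csv
from pathlib import Path
from typing import Callable, List


def filter_by_rmsd(
    pdb_files: List[Path],
    rmsd_to_reference: Callable[[Path, Path], float],
    threshold: float,
    csv_path: Path,
) -> List[Path]:
    """Align every file to the first one, log RMSDs, return files within threshold."""
    reference = pdb_files[0]
    kept = []
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["pdb_file", "rmsd"])  # assumed column names
        for pdb in pdb_files:
            # Stand-in for aligning the chosen chain with PyMOL's cealign.
            rmsd = rmsd_to_reference(reference, pdb)
            writer.writerow([pdb.name, rmsd])
            if rmsd <= threshold:
                kept.append(pdb)
    return kept
```

Every file is logged to the CSV regardless of outcome; only files at or below the threshold are returned, mirroring the description that only those structures are saved to the output directory.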

Usage

align_protein_chain.py [-h] -c CHAIN_TO_ALIGN [-T RMSD_THRESHOLD] -i INPUT_PATH -o OUTPUT_PATH

Argument definitions

-h, --help

show this help message and exit
-c CHAIN_TO_ALIGN, --chain_to_align CHAIN_TO_ALIGN

chain identifier for chains to align
-T RMSD_THRESHOLD, --rmsd_threshold RMSD_THRESHOLD

RMSD Threshold for filtering.
-i INPUT_PATH, --input_path INPUT_PATH

path of the input directory
-o OUTPUT_PATH, --output_path OUTPUT_PATH

path of the output directory

Example

python ~/MAGPIE/align_protein_chain.py -c B -T 2.5 -i <input_directory> -o <output_directory>

III. `align_small_molecule.py`

This script aligns small molecules in different PDB files based on a specified chain identifier. The alignment uses PyMOL’s align command, which performs a sequence-dependent alignment of two objects based on their atom types and bond connectivity.

The script iterates over all PDB files in the specified directory, loads each file into PyMOL, and aligns the specified chain in the current file to the same chain in the first file. The RMSD of each alignment is calculated. If the RMSD is less than or equal to the threshold, the aligned structure is saved as a new PDB file in the output directory. If the RMSD is greater than the threshold, a new reference structure is created and the process continues with the next PDB file. The script also creates a directory for each reference structure and saves the aligned structures in the corresponding directory.

If the pair_fit argument is set to True, the script uses PyMOL’s pair_fit command for alignment, which aligns two objects based on a set of atom pairs. This can be useful when the atom names or numbers differ too much between the two structures.

Using pair_fit is recommended only if MAGPIE_input_prep.py has been run on the files first.
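The reference-based grouping described above behaves like a greedy clustering, which the following sketch illustrates. It is not the script's code: the alignment is replaced by a caller-supplied `rmsd` function, and `group_by_reference` is a hypothetical name. It also assumes that each structure is compared against existing references in the order they were created; the script's actual comparison order may differ.

```python
from typing import Callable, Dict, List


def group_by_reference(
    names: List[str],
    rmsd: Callable[[str, str], float],
    threshold: float,
) -> Dict[str, List[str]]:
    """Greedily assign each structure to the first reference within threshold."""
    groups: Dict[str, List[str]] = {}
    for name in names:
        for ref in groups:
            if rmsd(ref, name) <= threshold:
                groups[ref].append(name)
                break
        else:
            # No existing reference is close enough: start a new group,
            # which corresponds to a new reference directory.
            groups[name] = [name]
    return groups
```

Each key of the returned dictionary corresponds to one reference structure and therefore to one output subdirectory. A lower RMSD threshold generally yields more, smaller groups.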

Usage

align_small_molecule.py [-h] -c CHAIN_TO_ALIGN [-T RMSD_THRESHOLD] -i INPUT_PATH -o OUTPUT_PATH [-p {True,False}]

Argument definitions

-h, --help

show this help message and exit
-c CHAIN_TO_ALIGN, --chain_to_align CHAIN_TO_ALIGN

chain identifier for chains to align
-T RMSD_THRESHOLD, --rmsd_threshold RMSD_THRESHOLD

RMSD Threshold Difference
-i INPUT_PATH, --input_path INPUT_PATH

path of the input directory
-o OUTPUT_PATH, --output_path OUTPUT_PATH

path of the output directory
-p {True,False}, --pair_fit {True,False}

use pair_fit instead

Example

python ~/MAGPIE/align_small_molecule.py -c B -T 2.5 -i <input_directory> -o <output_directory> -p True

IV. `MAGPIE_protein_relax.py`

Because most structural models are not pre-optimized for analysis by the Rosetta energy function when they are deposited in the PDB, we provide an optional Python helper script that relaxes the structures before interface energy calculations. The script relies on PyRosetta, an optional MAGPIE dependency. As input, it takes a folder of structural models. As output, it writes relaxed models to an output folder (with the suffix _relaxed.pdb). The user can optionally specify the number of threads for the job. This script, MAGPIE_protein_relax.py, is available in the MAGPIE local GitHub repository. We recommend running it in the MAGPIE Conda environment described in section 11.
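To anticipate which files a run will produce, the input-to-output naming can be sketched as below. This only illustrates the _relaxed.pdb suffix mentioned above; it assumes the suffix replaces the .pdb extension of each input file name, and `relaxed_output_paths` is a hypothetical helper that does not call PyRosetta or perform any relaxation.

```python
from pathlib import Path
from typing import Dict


def relaxed_output_paths(input_dir: Path, output_dir: Path) -> Dict[Path, Path]:
    """Map each input .pdb file to '<stem>_relaxed.pdb' in the output folder."""
    return {
        pdb: output_dir / f"{pdb.stem}_relaxed.pdb"
        for pdb in sorted(input_dir.glob("*.pdb"))
    }
```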

Usage

MAGPIE_protein_relax.py [-h] -i INPUT_PATH -o OUTPUT_PATH -n THREADS

Argument definitions

-h, --help

show this help message and exit
-i INPUT_PATH, --input_path INPUT_PATH

path of the input directory
-o OUTPUT_PATH, --output_path OUTPUT_PATH

path of the output directory
-n THREADS, --threads THREADS

number of threads to use. Default: 1

Example

python MAGPIE_protein_relax.py -i input -o relaxed_outputs -n 1
