Visualization and analysis of proteins in Biopython

Human biology is an incredibly complex science. Even considering that every year we discover more and more secrets of the human body, the answers we receive raise an increasing number of questions. The completion of the Human Genome Project has given many scientists the confidence that genomics will help humankind solve important biological problems. However, the more biological secrets we reveal, the more clearly we understand that other factors influence the use of the organism’s genome. Accordingly, new lines of research were created to solve problems in these interconnected areas, including transcriptomics (the study of mRNA) and proteomics (the study of proteins), in which Python began to be used.

Today, especially for the start of new streams for courses Python for data analysis, Fullstack Python developer and Python for web development in this article we will tell you about another application of this beautiful language – in scientific research.


The Biopython project – a Python toolkit for computational biology and bioinformatics – is used to visualize and analyze DNA and RNA sequences. Protein structure analysis can also be done with the Biopython toolbox! Let’s get to know him in more detail.

Protein Structure Bank (PDB) – a unified database for studying and downloading protein sequences. To work with PDB, a special file format was created, naturally called .pdb. But as scientists had to analyze larger and more complex protein structures, other formats were developed – CIF and mmCIF. The Crystallographic Information File (CIF) has been developed for archiving small molecule crystallographic data. In the framework of such studies, the arrangement of atoms in crystalline solids is studied. Over time, the CIF format began to be used for the analysis of larger molecules (macromolecules, hence the designation mm), received the name mmCIF and eventually replaced the PDB format. [1]

Data visualization using PDB format

Although the currently accepted standard is mmCIF, many systems still support older PDB files.

Consider Phytohemagglutinin-L – a lectin found in some legumes, such as green beans.

Import the required packages:

from Bio.PDB import *
import nglview as nv
import ipywidgets

Now let’s create a PDBParser Biopython instance and use the nglview library to create an interactive visualization. We can pan, scale and rotate the molecule, and even display certain information about the atoms on the screen.

pdb_parser = PDBParser()
structure = pdb_parser.get_structure("PHA-L", "Data/1FAT.pdb")
view = nv.show_biopython(structure)

Data visualization using CIF format

The procedure for CIF files is almost the same, except that you need to use an instance of MMCIF Parser! Here we visualize a larger protein structure. 6EBK, or channel paddle chimera Kv1.2-2.1 in lipid nanodiscs (even difficult to pronounce …).

cif_parser = MMCIFParser()
structure = cif_parser.get_structure("6EBK", "fa/6ebk.cif")
view = nv.show_biopython(structure)

Access to information about protein structure

Heading

The fastest way to access protein structure information is through the header, a metadata dictionary, available in both PDB and CIF format.

mmcif_dict = MMCIFDict.MMCIFDict("fa/1fat.cif")
len(mmcif_dict) # 689

The result is a large vocabulary of protein structure information, including a citation identifying the protein sequence, information about the structure, positions and angles of atoms, and chemical composition. As you can see, the dictionary consists of 689 entries.

Residue sequences

One of the most important pieces of information analyzed is the sequence of protein or polypeptide residues (amino acids). Since proteins can be composed of several polypeptides, we use a structural approach to understand the studied level of organization. From the general structure to individual atoms.

The Structure object in our file is built in the SMCRA architecture (SMSCA) in accordance with the “parent-child” scheme:

  • FROMthe structure consists of models.

  • Mthe dress consists of chains.

  • CA chain consists of residues (amino acids).

  • ABOUTstatok consists of ANDvolumes.

There are many ways to parse structure metadata to obtain sequences of protein residues. Consider three options:

# .get_residues() method in a loop
for model in structure:
    for residue in model.get_residues():
        print(residue)
# .get_residues() method as generator object
residues = structure.get_residues() # returns a generator object
[item for item in residues]
# .unfold_entities - keyword for each level of the SMCRA structure
Selection.unfold_entities(structure, "R") # R is for residues

Creation of polypeptides

Obtaining the above sequence of residues returns the sequence for the entire protein structure, however proteins are often composed of several smaller polypeptides, which may be worth analyzing separately. The Biopython toolbox allows you to do this with polypeptide builders that generate individual polypeptides.

polypeptide_builder = CaPPBuilder()
counter = 1
for polypeptide in polypeptide_builder.build_peptides(structure):
    seq = polypeptide.get_sequence()
    print(f"Sequence: {counter}, Length: {len(seq)}")
    print(seq)
    counter += 1
# Sequence: 1, Length: 36
# SNDIYFNFQRFNETNLILQRDASVSSSGQLRLTNLN
# Sequence: 2, Length: 196
# NGEPRVGSLGRAFYSAPIQIWDNTTGTVASFATSFT...ASKLS
# Sequence: 3, Length: 233
# SNDIYFNFQRFNETNLILQRDASVSSSGQLRLTNLN...ASKLS
# Sequence: 4, Length: 36
# SNDIYFNFQRFNETNLILQRDASVSSSGQLRLTNLN
# Sequence: 5, Length: 196
# NGEPRVGSLGRAFYSAPIQIWDNTTGTVASFATSFT...ASKLS
# Sequence: 6, Length: 35
# SNDIYFNFQRFNETNLILQRDASVSSSGQLRLTNL
# Sequence: 7, Length: 196
# NGEPRVGSLGRAFYSAPIQIWDNTTGTVASFATSFT...ASKLS

Analysis of residue sequences

So now we have sequences of residues for these 7 strings, and such sequences can be analyzed by a variety of methods.

from Bio.SeqUtils.ProtParam import ProteinAnalysis

The only caveat is that calling the .get_sequences () procedure returns a Biopython Seq () object, see previous post on my blogfor a more detailed description of Seq () objects and their functionality – ProteinAnalysis requires a string value.

analyzed_seq = ProteinAnalysis(str(seq))

We are now ready to run the following methods to understand our sequence!

Molecular weight

We can calculate the molecular weight of the polypeptide.

analyzed_seq.molecular_weight()
# 4176.51669

GRAVY

The GRAVY protein returns the GRAVY (total mean hydropathy) value for the entered protein sequences. The GRAVY value is calculated by adding the hydropathy value for each residue and dividing by the sequence length (Kyte and Doolittle; 1982). [2]

A higher value indicates more hydrophobicity. Less is more hydrophilic. We will discuss later how to generate a residue based on the hydrophobicity of the residue.

analyzed_seq.gravy()
# -0.5611

Calculation of the number of amino acids

The amount of amino acids of each type can be easily calculated.

analyzed_seq.count_amino_acids()
# {'A': 1,
 'C': 0,
 'D': 2,
 'E': 1,
 'F': 3,
 'G': 1,
 'H': 0,
 'I': 2,
 'K': 0,
 'L': 5,
 'M': 0,
 'N': 6,
 'P': 0,
 'Q': 3,
 'R': 3,
 'S': 5,
 'T': 2,
 'V': 1,
 'W': 0,
 'Y': 1}

Percentage of amino acids

And also the percentage of each amino acid in the sequence! Insert with code

Secondary structure

A very useful method – .secondary_structure_fraction () – returns the fraction of amino acids that can be found in the three classic secondary structures. These are beta-fold structures, alpha-helices and loops (on which the residues change direction).

analyzed_seq.secondary_structure_fraction() # helix, turn, sheet
# (0.3333333333333333, 0.3333333333333333, 0.19444444444444445)

Protein scales

Protein scales are a way to measure specific residue attributes along the length of a peptide sequence using a sliding window. The scales are graduated with values ​​for each amino acid. Each value is based on different physical and chemical properties such as hydrophobicity, secondary structure trends and surface accessibility. Unlike some units of measure at the chain level, such as the general behavior of molecules, the balance allows for a more detailed understanding of how the smaller sections of the sequence will behave.

from Bio.SeqUtils.ProtParam import ProtParamData

Here are some common scales:

Documentation for some common weights can be found here

As an example, consider the hydrophobicity index (kd). Balances are presented in which each residue has an associated value representing its level of hydrophobicity.

kd = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, 
      "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, 
      "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6, 
      "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2}

Positive values ​​indicate hydrophobicity. Isoleucine (I) and valine (V) are the most hydrophobic, while arginine (R) and lysine (K) are the most hydrophilic. Hydrophobic residues are usually located inside the polypeptide, and hydrophilic residues are outside it, so these scales also give an idea of ​​how such a polypeptide might fold.

To perform analysis based on protein weights, you need to specify the size of the window in which the average value will be calculated. Using the “edge” keyword, you can also measure the importance of adjacent residuals, mainly by determining their importance relative to the window mean.

analysed_seq.protein_scale(window=7, param_dict=ProtParamData.kd)
# [-0.7571428571428572,
 -0.2428571428571429,
 -0.24285714285714288,
 -0.38571428571428573,
 -0.6285714285714287,
 -0.942857142857143,
 -1.842857142857143,
 -1.442857142857143,
 -2.3428571428571425,
 -1.3000000000000003,
 -0.01428571428571433,
 0.1285714285714285,
 0.1285714285714285,
 -0.014285714285714235,
 -0.4142857142857143,
 0.3428571428571428,
 -0.31428571428571417,
 -0.35714285714285715,
 -1.014285714285714,
 -0.6285714285714284,
 -0.10000000000000002,
 0.3428571428571429,
 -0.4142857142857142,
 0.24285714285714285,
 -1.0,
 -0.34285714285714286,
 -0.32857142857142857,
 -0.7142857142857143,
 -0.1142857142857144,
 -0.11428571428571435]

Let’s try together

Let’s combine all of our methods and create a script that iterates through each chain of our structure and runs some standard analysis. Let’s create an empty container, fill it with a dictionary of key information for each sequence. After creating such a nested structure, we can get slices, as in any Python container, of individual records.

# Create empty list for chains
all_seqs = []
counter = 1
# For each polypeptide in the structure, run protein analysis methods and store in dict
for pp in ppb.build_peptides(structure):
    seq_info = {} # create an empty dict
    seq = pp.get_sequence() # get the sequence like above
    analyzed_seq = ProteinAnalysis(str(seq)) # needs to be a str 
    # Specify dict keys and values    
    seq_info['Sequence Number'] = counter # set sequence id
    seq_info['Sequence'] = seq # store BioPython Seq() object
    seq_info['Sequence Length'] = len(seq) # length of seq
    seq_info['Molecular Weight'] = analyzed_seq.molecular_weight()
    seq_info['GRAVY'] = analyzed_seq.gravy() # hydrophobicity 
    seq_info['AA Count'] = analyzed_seq.count_amino_acids() 
    seq_info['AA Percent'] = analyzed_seq.get_amino_acids_percent()
    # tuple of (helix, turn, sheet)
    seq_info['Secondary Structure'] = 
        analyzed_seq.secondary_structure_fraction()
    
    # Update all_seqs list and increase counter
    all_seqs.append(seq_info)
    counter += 1

Selecting the first sequence returns a dictionary with our analyzes and values!

all_seqs[0] # select the first sequence
# {'Sequence Number': 1,
 'Sequence': Seq('SNDIYFNFQRFNETNLILQRDASVSSSGQLRLTNLN'),
 'Sequence Length': 36,
 'Molecular Weight': 4176.52,
 'GRAVY': -0.5611,
 'Amino Acid Count': {'A': 1,
  'C': 0,
  'D': 2,
  'E': 1,
  'F': 3,
  'G': 1,
  'H': 0,
  'I': 2,
  'K': 0,
  'L': 5,
  'M': 0,
  'N': 6,
  'P': 0,
  'Q': 3,
  'R': 3,
  'S': 5,
  'T': 2,
  'V': 1,
  'W': 0,
  'Y': 1},
 'Amino Acid Percent': {'A': 0.027777777777777776,
  'C': 0.0,
  'D': 0.05555555555555555,
  'E': 0.027777777777777776,
  'F': 0.08333333333333333,
  'G': 0.027777777777777776,
  'H': 0.0,
  'I': 0.05555555555555555,
  'K': 0.0,
  'L': 0.1388888888888889,
  'M': 0.0,
  'N': 0.16666666666666666,
  'P': 0.0,
  'Q': 0.08333333333333333,
  'R': 0.08333333333333333,
  'S': 0.1388888888888889,
  'T': 0.05555555555555555,
  'V': 0.027777777777777776,
  'W': 0.0,
  'Y': 0.027777777777777776},
 'Secondary Structure': (0.3333333333333333,
  0.3333333333333333,
  0.19444444444444445)}

Specific values ​​can be easily selected.

all_seqs[0]['Sequence']
# Seq('SNDIYFNFQRFNETNLILQRDASVSSSGQLRLTNLN')
all_seqs[0]['Molecular Weight']
# 4176.52

Conclusion

The Biopython toolkit not only makes it easier to work with DNA sequences, but it can also be used for proteomics purposes for visualization and analysis of proteins. Biopython implements powerful and flexible standard protein analysis methods, the results of which can be used to design customized processes according to your specific needs. Over time, I will no doubt become aware of some impressive new functionality in Biopython, and I will publish many articles on this interesting topic.

As always, all codes and dependencies described in this article can be found in this repositorywhich I will continue to update as I explore the Biopython toolbox. Hope this tutorial helps you start your own bioinformatics projects with Biopython!

And if your needs are far from bioinformatics, it doesn’t matter, because Python is very common in other directions and is just a must-have for data scientists and analysts, for example. For mastering Python, we have three whole courses, Python for data analysis, Fullstack Python developer and Python for web development, each of which has its own specifics, you can choose which one best suits your goals. And the promo code HABR will give you a 50% discount.

find outhow to level up in other specialties or master them from scratch:

Other professions and courses

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *