Introduction: Coronavirus Mutation Modeling - Monte Carlo Simulations

About: Sophmore in high school

This project is about modeling mutations in the spike protein of SARS-CoV-2. New deadly variants of the coronavirus emerge such as the recent OMICRON variant, and new variants are caused by random sporadic mutations of amino acids in the proteins of the SARS-CoV-2 virus. This project will aim to use Monte Carlo simulations to simulate random mutations in the spike protein of SARS-CoV-2, and this type of modeling may help us identify variants before they occur in real life.

Background terms:

Spike protein - This is the protein that composes the crown shaped points in the outside of the virus. When the spike protein of the virus comes in contact with a human cell, they bind to the Angiotensin-Converting Enzyme 2 (ACE-2) receptor on human cell, and then use endocytosis to get inside the cell. The mRNA of the coronavirus is assembled inside the human cell, which will be used to create new coronaviruses. This is how the virus replicates - by hijacking processes of mRNA translation in human cells.

Monte Carlo simulations - Used to model different outcomes in a process that cannot
easily be determined due to the intervention of random variables. The simulation focuses on repeating random samples many times to produce certain outcomes.

Step 1: Identify 3 Important Mutations

Using the above table (from April 2020), 3 of the most frequent mutations were modeled: D614G,G476S, and V483A

Each letter in the sequences is an amino acid. They make up proteins and they are changed by mutations in the mRNA.

What does D614G mean? D is the amino acid letter before the mutation, 614 is the location of the mutation in the spike protein sequence(see second image), and G is the amino acid that was changed (the mutation). This is the same for the other two mutations. With UNIPROT, a 9-letter amino acid sequence for all 4 mutations was determined, and the first 4 letters before and after the point of each mutation were used for the model. The second image shows the whole spike protein sequence of SARS-CoV-2 and the colored sequences are where mutations occurred, and these will be used for modeling.

D614G - VLYQDVNCT (the letter D in this sequence is mutated into G in the D614G mutation)

G476S - IYQAGSTPC

V483A- PCNGVEGFN

Step 2: Mutation Modeling

Above are the code and flowchart for the mutation modeling. I will be explaining the code only for D614G, since the other 2 mutations work the same way but the sequences are changed. Code is also on this replit: https://replit.com/@AdityaKoushik/D614G#main.py

The code is fairly simple and short, and it uses nested "For" loops to perform Monte Carlo Simulations. Modules used were random (to simulate a mutation), matplotlib, and numpy.

Basic overview:

First, I started with the sequence after the mutation. The goal is to see how many random mutations it takes in order for the non mutated sequence to match the target mutation sequence (how many random mutations does it take to change VLYQDVNCT to VLYQGVNCT). Each time an artificial mutation is simulated, 1 cycle passes (one iteration). The nested loop will break once a match is found, and the number of cycles required to find the match is recorded. This whole process is repeated inside another for loop, where each iteration is the number of runs.

To simulate mutations:

First, the original sequence before the mutation is put in a list. There are two random variables involved in simulating mutation - the location where an amino acid needs to be altered, and which amino acid to change it to. First a random number from 0 to 8 would determine the index that would be changed in the list, and then using random.choice, a random amino acid is put in that index. The new list is converted to a string, and using a conditional, I checked whether the new mutated sequence matches the target sequence. If there is a match, the number of iterations of the loop for a match to occur is appended into a list.

Step 3: Graphs Using MATPLOTLIB, and Data Analysis

The mutation D614G was accurately simulated in an average of 7 cycles, while V483A and G476S were simulated in an average of 6 cycles (see other graphs and charts below). Mutation D614G also had a higher arithmetic median and range than the other mutations, with more scattered cycle numbers. Each spike in the MATPLOTLIB graphs represent the number of cycles required to match the target sequence for 1000 iterations. The MATPLOTLIB graphs for each mutation (D614G, V483A, and G476S) show a wide range of values. In some runs, the number of cycles required to match the mutated sequence can be up to 50 cycles (average of all 1000 runs is 6-7 cycles) while others can be achieved in only one cycle.

The chart below shows the mutation frequency for the 3 mutations (L54F can be disregarded, I was just trying something out).

Step 4: G476S and V483A Graphs and Charts

Step 5: Bonus: T-cell Epitopes

After doing the mutations, I also was curious to simulate if our body's defense system can attack the SARS-CoV-2 virus. T cells are part of the immune system and develop from stem cells in the bone marrow. They help protect the body from infection and may help fight cancer. T-cell epitopes are sequences in the spike protein of SARS-CoV-2 that the T-cell can recognize and then kill the virus. I wanted to see if a non-T-cell epitope part of the virus could mutate into a T-cell epitope, so T-cells can attack the virus. The image above is my methodology for both mutation modeling and T-cell modeling.

Step 6: T-cell Epitope Code

From the IEDB database, I found 48 T-cell epitopes for SARS-CoV-2. Next I took the D614G sequence from before and I wanted to see if I could randomly mutate that sequence to match any of the 48 t-cell epitopes. If a match is found, then the T-cell would be able to kill the virus. In order to give the highest chance for a match, I set the loop to iterate 10 million times. I found out that even after all those iterations, I could not mutate the non T-cell epitope sequence into any of the T-cell epitopes.

Program overview:

This program works similarly to the previous program, but this time there is an additional random variable involved - how many mutations occur. Because we don't know how many letters are mutated (unlike before, where it is just a single letter), I had to let a random variable determine that. Using the same process, mutations were randomly simulated in the non t-cell epitope sequence. If the newly mutated sequence matches any of the T-cell epitopes, the program would print the number of iterations required to do so.

Step 7: Discussion/Conclusions

This project found the following: (a) T-cell resistant sequences in the SARS-CoV-2 spike protein rarely mutate into T-cell epitope sequences even after millions of iterations (equivalent viral life cycle), and (b) single point mutations occur much more frequently (6-7 cycles on average to get a match). In the future, I want try to use a different approach (maybe machine learning) to simulate the mutations, and try this again with sequences from the OMICRON variant.

Hour of Code Speed Challenge

Participated in the
Hour of Code Speed Challenge