Abstract
In this report I demonstrate how I have simulated targeted random mutagenesis over several generations of cell divisions (generations). I show how what a theoretical population of mutants would encompass. And prove that my simulation using MATLAB indeed formed random mutations. Successful simulation of mutating colony was demonstrated. For the error rate tested (Rerror= 0.001) the first mutation appears after the 7th generation of cells has been created, which is consistent with statistics. The first few generations of mutants encompasses the vast majority of the overall culture population. It was also determined the speed of the code is ~ 2N where N is the number of generations.
Simulating In Vivo Random Mutagenesis Introduction:
In vivo mutagenesis is a natural phenomenon which occurs in all living organisms. It is when portions of an organism’s genome undergo changes in their base pairs. The most common types of mutations are nonsense, missense, insertion, deletion and frameshift. Here, I will be using MATLAB® to model missense and nonsense mutation as they are most common and most practical. This type of mutation is a swap in one DNA pair with another. The results are a possible change the amino acid made by that gene or a pre-mature stop in the protein synthesis. I will be modeling how a probability of a mutation’s occurrence changes the genotype of a growing population of cells. Modeling mutations is important to understand organism evolution, and cancer development. Lately, targeted in vivo mutations have become a hot topic for screening useful protein type pharmaceuticals and other chemicals.
A program that can model cell mutations can potentially be used to determine the time or total divisions which can lead to cancer or a protein which has useful medicinal purposes. The key mutation is one that significantly changes the properties of a protein. This can be a change in only one base as is in some cancer inducing antigens. Studying how frequently a key mutation occurs can be done by engineering signal proteins to signal once the key mutation has been made. For example, a non-florescent green florescent protein (GFP) can be made active by a key mutation. The key mutation will form the functional fluorescing GPF. It can also show how changes in initial sequences, mutation rates and simultaneously cell sorting can affect the outcome of mutations over generations.
The difficulty modeling mutations over generations of cell divisions is due to the exponential growth of cells. To get an accurate simulations one must follow each cell and its genomic sequence over generations of divisions and mutations. Many assumptions can be made to mitigate the computational expense of the program. However, these may have negative outcomes on the accuracy of the simulation. One must account for mutations happening in each cell, then study the colony as a whole to get the most accurate simulation. Looking at every cell in the culture is very computationally expensive. There can be roughly 106 cells after only 20 generations of cell divisions starting from a single cell. Some assumptions must be made in order to simplify the model and reduce the algorithmic complexity.
Developing code to simulate in vivo cell mutations in a growing culture was more a problem of speed than mathematical rigor. In the next few pages I will provide the story on the evolution of my code and how it has increased in speed by orders of magnitude. I will also provide my methods for solving this problem and show my results. Finally, I will discuss other methods that could be used to solve this problem that I have not attempted.
Theory
First, instead of manipulating and mutating string of A,T,G and Cs we will instead assign A=1, T=2, G=3 and C=4. MATLAB® is much faster at analyzing single digit numbers than strings. Genes with characters using ATGC can be converted before the simulation and again after. For example, a target gene containing a sequence ATGCCGTA would equal 12344321 in MATLAB®.
Second, we will run the simulation starting with only one cell, with one gene of interest. If we started a simulation with a culture of clones we can assume the cell growth is
Where Nf is the final number of cells in the colony. N0 is the initial number of cells, and Ngen is the number of generations of cells. If the starting colony is assumed to all have identical genes, then the number of mutants accumulate for the colony after successive divisions is
Where Nmut is the number of mutant cells, N1 is the number of cells that are mutants when the simulation was ran starting with only one cell.
One assumption that we make is that the DNA has a region with in it most important or interesting (gene of interest (GOI)). For example, perhaps only a 10, 20 or 50 base pair region in the genome may be responsible for your desired protein’s characteristics. This is the case for antibodies variable regions. We can reduce computations by only focusing on those regions. Another assumption is that we can start with a single cell for cultures for which the cells are all clones. This type of program will require use of random number generations, knowledge of statistics, cell growth, and matrix manipulations.
An important concept for genetic mutations is the mutation rate. The mutation rate the decimal percent chance a mutation will occur at any base during DNA replication. A common method for in vitro mutations is to use Error Prone DNA Polymerase (EP DNApol) or EP Taq polymerase in error prone PCR. EP DNApol has roughly a 1% error rate or 0.01. That is, each base has a 1% chance of being different than the parent strand after replication. Error rates in in natural bacteria polymerases can be as a low as 10-9 mistakes per nucleotide. With repair enzymes like in humans, this mutation rate is dropped to 10-10-10-11 . Even with these seemingly low rates of mutations they rapidly build up with cell division. Which is why many antibiotic resistant bacteria have sprouted in recent years.
To be useful for genetic engineering, scientific screening and selection purposes mutation rates of 0.01 to 0.0001 are desired. For this simulation we will be testing mutation and rates in those ranges. Instead of determining whether or not a mutation will happen at each nucleotide I will be determine whether or not a mutation will occur in the targeted stand. This reduces the number of while, if or for loops required. It is also an okay method to use because the likelihood of a mutation happening at any base is very low and the likely hood of a mutation happening on the targeted gene is also low. The probability of a random mutation happening on a gene of length L is given by
rmut is the decimal mutation rate, Pmut is the probability of a mutation on the target genome, and L number of bases in the target genome. Thus, a target genome of 10 bases and 0.001 mutation rate would have approximately one percent chance of obtaining a mutation after one cell division. I will be using random number generator to generate a number between 0 and 1. If Pmut is greater than the random number generated, then a mutation will occur. Another random number generator will choose a whole integer between 1 and the L to determine where the mutation will occur. Finally, another random number generator will choose an integer between 1 and 4 (corresponding to A,T,G and C) to choose the new mutant base to replace the original base at the randomly selected location. A check has been put into place to ensure a base has been chosen that is different from the previous, otherwise it will choose another. The entire process mentioned in this paragraph I call the mutate function.
How the cells divide, propagating the mutations is an important concept in modeling mutations over generations. In my simulation I will assume a negligible amount of cell death. This assumption can be made because I am only running the simulation over a maximum of 20 generations. This correlates to about 10 hours in a yeast culture. In this time we can assume that the first few generations of cells are still alive over a short time period and even if they did die they would encompass a small percentage of the population. Figure (3) shows how cells divide and potentially propagating mutations. The cell on the upper left corner is the generation zero or original gene.
Storage and manipulation of the exponentially increasing size of the cell colony is by far the most challenging aspect of the simulation. A sequence of nucleotide bases 1,2,3 and 4 (A,T,G & C) is stored in a row a matrix of MATLAB® called celli. The previous paragraph mentions how a base is randomly chosen and placed in a random location. I will call the mutation function or mutate(). Each generation undergoes the mutate function and is stored in new rows of a column. The new location for the mutation is shown in the equation below. The number inside the parenthesis is the matrix row. An example of how the mutant library is stored in the matrix is shown below.
Results and Discussion
Since error rates are very low typically there will not actually be mutations in every generation or cell division. Figure(4) shows an example where one mutation is made every generation. Even if a mutation is not made the newly divided clone will still be stored in its allocated matric row. This has drawbacks its drawbacks, as the matrix is now storing redundant information. This creates problems when using a plot function to plot the mutant changes over generations. Simply plotting the celli matrix, which contains the genetic information of every cell including repeats means that you will be plotting potentially millions of data points. Most of these will be the exact same points. I use the unique function in MATLAB to eliminate copied rows, and count the number of copies of the unique rows. After this I plot the unique rows of genes. The results are shown below for various generations of a target genome of L=10 base pairs and a mutation rate of 0.001.
Plotting the unique function reduced the computation time ten times compared to plotting the celli function. Notice how there are no cells with mutations after seven generations. After seven generations mutations begin to occur frequently. This is constant with our prediction that we will see a mutation roughly every 100 cell divisions and that the overall population of cells after eight divisions is 256 cells. A benefit using the unique function is that is allows us to also plot the frequency of repeats. Position 30 in Figure (6) shows the original genome and how it is by far the most common in the culture. The unique genes in less abundance (100-101) appear in later generations, but generally contain a greater number of mutations. It is nice to see the actual relative abundance of unique genes over generations, even if the results are intuitive.
After 20 generations the simulations becomes bogged down by the immense matrix that is created. 20 generations creates a matrix 106 rows with a length equal to the target genome size. It is easy to create a matrix containing 107-108 elements after 20 generations. I have theorized a possible mechanism for vastly reducing the matrix size using a function similar to unique. In the previous simulation after 10 generations there are 1024 cells and only 30 unique sequences. I believe that it is possible to store only unique genes that are created through the mutation process and in another vector store the number of copies of unique genes. In scenario there would be a matrix of 30 unique genes and another vector containing the corresponding sequence. I briefly attempted to try this method but ran into the problem how to store unique genes when they are randomly generated. It involved a complete redesign of my code. I believe this code will be slower for a small number of generations but faster for larger numbers. One potential problem with this new method would be that performing the mutation function would still be required for the amount of times a gene is repeated. For example if a unique gene is repeated 1000 times it would still require 1000 iterations though the mutation function to create that unique gene’s unique mutants.
In conclusion, a culture mutation simulation in combination with flow cytometry has the potential to determine when a protein of desired qualities may be made, if it can be made. It could also show how cultures evolve with selected screening. Modern selected screening methods use flow cytometry sorting such as florescent activated sorting (FACS). Next generation sequencing could confirm the accuracy of the simulation. This can be done by verifying the final cell’s gene variation from the original colony. Developing efficient code to sort through large amounts of DNA sequences may also be a product of this research.
%Mutate2
%cell growth loop
tic %times code
k=1;N=1;i=1;genN=13; %initial conditions
genesize=10; %How many bases in the target gene
originG = randi([1 4],1,genesize); %creates a gene of random bases ATGC (1234) of length Genesize
celli(1,:)=originG; %initalizes the colony matrix first cell to the randomly generated gene
mutation_rate=0.001; %the mutation rate
P=1-(1-mutation_rate).^genesize; %Probability of a mutation
for N = 1:genN
k=1;
for i=(2^(N-1)+1):(2^N)
celli(i,:)=celli(k,:);
willithappen=rand(1,1); %determine if a mutation will happen on this string
while P >= willithappen; %if it does happen choose a random mutaion location on the string
muteloc=randi([1 genesize]);
mutation=randi([1 4]); %pick a random mutation to happen in that location
while celli(k,muteloc)== mutation; %if the mutation is the same as the existing string choose another
mutation=randi([1 4]); %choose another
end;
celli(i,muteloc)=mutation; %place that mutation into the string
willithappen=rand(1,1); %determine if another mutation will happen on the same string, if not end.
end;
k=k+1;
i=i+1;
end
figure
[a,b,c] = unique(celli,'rows');
ucount=accumarray(c,1);
hold on
plot(a')
axis([1 genesize -10 15])
xlabel('distance down DNA strand')
ylabel('Base A=1, T=2, G=3, C=4')
N=N+1
end
originG
[ucells,b,c] = unique(celli,'rows'); %finds unique genes
% The last column lists the counts
B=[ucells accumarray(c,1)];
B= sortrows(B,genesize+1);%sorts the genes by occurance frequency
figure %create new figure
semilogy(B(:,genesize+1)) %creates a semilog plot of the unique gene occurance frequency
xlabel('Unique Targeted Genome')
ylabel('Frequency of occurance in culture after N cell generations')
celli;
toc %end simulation timer